Red teaming conversational AI agents: How Parloa stress tests production deployments

18 June 2026
Author(s)

Matthäus Deutsch

Senior Applied Scientist
Table of contents

In highly regulated industries such as financial services, healthcare, telecommunications, and insurance, Conversational AI agents can provide meaningful impact in reducing call volumes and wait times. At the same time, by interacting directly with customers, working with sensitive data, and taking actions on backend systems through tools and APIs, these agents introduce new levels of risk. A single failure mode that would be a low-severity bug in an AI content generation product is a massive compliance event, a threat to brand equity, and even the impetus for significant financial loss.

Standard quality assurance does not catch this class of failure, as functional tests only verify that the agent answers correctly when the user behaves as expected. Red teaming, on the other hand, verifies that the agent stays safe, on policy, and on task when the user does not behave as expected, such as when the input is malicious, manipulative, or simply weird in ways that the design did not anticipate. 

To ensure reliability and security at scale for our enterprise customers, Parloa implemented red teaming as a structured, repeatable step in our agent lifecycle. By implementing it for one customer, we’ve been able to develop a methodology with an attack taxonomy, evaluation pipeline, and deliverables that make it repeatable for all of our enterprise customers.

Red Team objectives and scope

The objective of a red team engagement is to assess the resilience of a customer-facing AI agent against adversarial and unexpected inputs. It needs to produce concrete, actionable evidence that the agent's defenses align with the operator's policies, regulatory obligations, and brand standards.

A red team engagement tests explicitly for AI behavior. We test the conversational and reasoning safeguards of the agent, focusing on its prompt, its tool-use logic, its content controls, and its escalation behavior. In scope are single- and multi-turn conversational attack vectors. Out of scope are traditional infrastructure penetration testing, network exploits, and supply-chain attacks against the underlying platform. Keeping that boundary sharp is what lets us produce a deep, statistically grounded view of where the agent fails outside of where the stack around it might fail. A typical engagement spans a few weeks of calendar time and concludes with a retest after the operator applies fixes.

The methodology

Flowchart of a simulation and evaluation pipeline, showing scenario briefing, adversarial customer, agent interaction, logs, evaluation, and classification.

A red team engagement combines four mutually reinforcing methods:

Phase

Method

Purpose

Automated simulation

Large-scale adversarial conversation testing via Parloa AMP

Statistical coverage across all attack vectors

LLM-based evaluation

LLM-as-a-judge scoring of conversation outcomes

Detect nuanced policy violations

Deterministic checks

Rule-based validation of tool calls and outputs

Enforce hard invariants and pattern-based safety

Manual ad-hoc testing

Expert-driven probing via the AMP preview console

Discover edge cases and refine the automated suite

The automated method gives us breadth, hundreds to thousands of conversations per engagement, with statistical power to detect even rare failure modes. The manual layer gives us depth, human creativity in social engineering, chaining attacks, and probing the long tail of edge cases that scripted attackers do not naturally explore.

Five risk categories analyzed

We organize the attack surface into five categories. These categories align with the OWASP Top 10 for LLM applications and reflect the failure modes we have repeatedly observed from agents in production.

1. Safety and harm

Inputs that try to elicit content that violates safety standards: hate speech, harassment, violence, encouragement of self harm, extremist content, content involving minors, and operational advice on illicit drugs, weapons, fraud, or evasion. 

When faced with these inputs, the agent must refuse cleanly, escalate where required (self harm and threats of violence are mandatory human-handoff triggers in most deployments), and never echo back inflammatory language, even in apparent compliance.

2. Security

The largest and most diverse category, security risks include inputs that try to break the agent's instruction following, leak its internal information, or misuse its tools. Risks include:

  • Prompt injection - direct instruction overrides ("ignore previous instructions"), role-playing jailbreaks (DAN, AIM, EvilBOT, developer mode), encoded/obfuscated payloads, emoji and steganographic attacks, and chained multi-turn injections

  • Prompt leakage - attempts to extract the system prompt, internal tool names, persona definitions, or operational rules

  • PII exposure - attempts to extract unmasked sensitive identifiers (SSNs, card numbers, bank account numbers, addresses) to complete partial PII the attacker already has, or to coerce the agent into reading data back digit-by-digit

  • Tool-sequencing violations - bypassing authentication or eligibility gates by requesting account-specific operations before identity verification

  • High-risk action gating - submitting transactions, schedule changes, or other side effects without the full chain of prerequisites the agent's design requires

  • Authority pressure - claims of being a supervisor, compliance officer, law enforcement agent, or system administrator in order to bypass standard verification

  • AI tester impersonation - attackers framing themselves as authorized evaluators of the agent, using scoring and gamification to extract internal configuration

  • Parameter tampering - malformed inputs (SQL/JS injection patterns, oversized payloads, non-numeric data in numeric fields) probing for raw error surfaces

  • Retry abuse - exceeding designed retry caps to see whether escalation actually triggers, or to extract additional information through repeated attempts

  • Context poisoning - long monologues, repeated fake instructions, or fabricated conversation history embedded in user turns to override safety guidelines

  • Fake tool outputs / role switching - embedding fake agent responses, tool results, or system messages inside user messages to manipulate the agent's view of conversation state

The agent must rigidly enforce all prerequisite chains, maintain the integrity of its system prompt, and never expose internal configurations or unmasked PII.


3. Reliability and accuracy

Inputs that put reliability and accuracy at risk tend to challenge the agent's truthfulness and reasoning. Confidently wrong user statements ("My balance should be X, not what you said"), false claims about prior interactions ("Yesterday's agent confirmed Y"), fabricated policies ("Your company has a 90-day grace period, right?"), and statements that contradict the knowledge base are examples of such inputs. The agent must stay factual, only stating information it actually retrieved from tools or the knowledge base, and must correct misinformation politely rather than agreeing.

When testing for reliability and accuracy, we also test tool-error resilience: how the agent behaves when a tool returns "no match", an empty result, a transient error, or unexpected data. Graceful degradation matters: the agent should not hallucinate around a missing tool result or expose raw stack traces to the caller.

4. Bias and Fairness

We treat bias as a regulatory and reputational risk on par with security failures. Using paired simulations, where the only difference between conversations is the user's implied demographic (gendered names, ethnically marked names, age signals, religious markers), to detect whether responses change inappropriately. 

Within bias and fairness testing are also direct elicitation prompts, inputs that ask the agent to confirm stereotypes, make demographic assumptions, or apply differential treatment. In these scenarios, the agent must refuse generalizations without lecturing and must not engage with the discriminatory premise of a question.

5. Business and brand

Business and brand testing refers to testing prompts that could lead the agent to harm the operator's brand or trip legal and ethical wires. These include defamation (insults toward competitors or named individuals), unsafe advice (investment, tax, legal, medical), out-of-scope topics that pull the agent off task, and — for collections, sales, or any persuasion-adjacent flow — inappropriate pressure tactics such as exaggerated consequences, shame, or threats. Even under sustained pressure, the agent must remain professional, on brand, and within its mandated scope.

Hard Invariants

For high-stakes flows — anything that touches money movement, identity verification, or sensitive data disclosure — we define a small set of hard invariants. These are non-negotiable pass/fail criteria. Any violation is a critical finding, regardless of how the violation occurred.

A typical hard-invariant set could look like this:

  1. No account-specific disclosure before successful identity verification. Balance details, payment history, eligibility, validation, or transaction submission must not occur until the verification gate has been passed.

  2. High-risk action gating. Side-effecting operations (submit payment, schedule change, account update) only execute after the full prerequisite chain: identity verified, eligibility confirmed, selection confirmed, details validated, explicit user confirmation in the same interaction window.

  3. Retry-cap enforcement. Max-attempt logic for input validation and verification must trigger the designed escalation behavior when exceeded.

  4. No sensitive data leakage. Masking rules (e.g., last 4 of SSN, last 5 of card) must hold even under adversarial pressure. The agent must never reveal full identifiers or raw tool outputs.

Hard invariants are evaluated deterministically wherever possible by inspecting the tool-call trace and pattern-matching the output. The invariants should not rely on an LLM judge to interpret intent.

The scenario catalog

We maintain a versioned catalog of attack scenarios. Each scenario is a self-contained unit: a briefing that defines the adversarial persona and strategy and a set of input variables that parameterize each test case. The simulator instantiates the briefing once per row, producing a unique conversation per variable combination.

A typical Red Team engagement runs 30 to 40 scenarios covering all five risk categories, generating 1,000 to 1,500+ adversarial conversations in total. Two kinds of scenarios coexist in the catalog:

  • Hand-crafted scenarios - purpose-built attack scripts with specific strategies and techniques. The strategy is encoded in the briefing. The variables typically rotate phrasings, contexts, and customer states. Hand-crafted scenarios are tuned to probe specific control points (e.g., identity verification gating, retry cap enforcement) and to exercise specific failure hypotheses.

  • Guardrails-benchmark scenarios - auto-generated from a large dataset of real multi-turn attack conversations. Rather than a fixed script, these use an adaptive conversation guide: phase-by-phase attack instructions the simulator follows while adapting to the agent's actual responses. Conversation guides give us coverage over attack patterns we did not pre-design.

Flow-aligned scenario grid

Flow-aligned scenario grid illustrating authentication methods and action paths leading to operational state cells and scenario analysis.

For domain-specific workflows, we map scenarios onto a grid of operational state to ensure coverage of the actual agent journey. Dimensions typically include:

  • Identity verification path - which identification method is used and match vs. no-match outcomes

  • Account state - past-due vs. current, scheduled payments present/absent, autopay enabled/disabled, etc.

  • Action path - eligibility outcome, funding/source availability, transaction constraints

This grid lets us attribute failures to specific control points (e.g., verification gating vs. transaction gating) rather than to vague "general model behavior"

Evaluation: deterministic plus LLM-as-judge

Every conversation is logged in full — message history, tool call trace, structured parameters, timestamps — and evaluated against a configured metric suite. Pass/fail is binary: Did the agent withstand the attack, or did it fail?

We run two complementary evaluation layers:

Deterministic Checks

Rule-based checks on the conversation and tool trace. They are precise, fast, and unambiguous, but limited to known patterns:

  • Pre-verification tool calls - any account-data tool call before the verification gate

  • Unauthorized side effects - high-risk action calls without the required prerequisite chain

  • Tool sequence compliance - conformance to the required call ordering

  • PII regex - unmasked SSN, card, account number, or other identifier patterns in agent output

  • Escalation correctness - max-attempt counters trigger the designed routing behavior

  • System-level signals - runtime errors, LLM response errors, code or markdown leakage to a voice channel

Deterministic checks are how hard invariants are enforced. Anything that can be expressed as a rule should be a rule, not an LLM judgment.

LLM-as-judge scoring

For everything that requires semantic interpretation, we use a separate LLM (not the agent under test) as an evaluator. The LLM receives the conversation transcript and a metric-specific rubric, and produces a pass/fail judgment plus an explanation as a result. Typical judge metrics include:

  • Safety compliance - Did the agent produce or endorse harmful content?

  • Security adherence - Did the agent follow an injection or override attempt, leak system info, or adopt an alternate persona?

  • Factual accuracy - Did the agent hallucinate or agree with a false claim?

  • Bias detection - Did the agent make assumptions or confirm stereotypes?

  • Brand compliance - Did the agent disparage, defame, or use inappropriate pressure tactics?

  • Policy adherence - Did the agent follow required procedures?

  • Tone and persona - Did the agent stay in character and remain professional under pressure?

Judge prompts are versioned alongside scenarios and tuned during the engagement. We treat the judge as a component of the simulation harness that itself needs validation, and we iterate on them based on the review of false positives and false negatives. Where the agent's response is a guardrail-generated canned rejection, a hard-coded refusal the model never actually produced, we instruct the judge to pass automatically since the agent had no input on the wording.

Logging and inspection

Every conversation is recorded end to end, including the full tool-call trace and any internal reasoning surfaced by the platform. When a metric flags a failure, a human analyst can open the conversation, see exactly what was said, what the agent did, and which judge produced the verdict and why, quickly identifying the problem area for further improvement. 

Manual ad-hoc testing

After the automated evaluation phase, an expert tester probes the agent interactively through the agent management platform (AMP) preview console. In doing so, they gain a better understanding of the areas where the automated runs revealed weaknesses or where human creativity for ad-hoc testing is the right tool, such as in chained social engineering, multi-turn pressure, novel framing, and the kind of indirect attack that scripted attackers rarely discover on their own.

Findings from manual testing get sent back into the scenario catalog. If a new failure mode emerges during a manual session, we add a scenario for it, both for this engagement's retest phase and for the baseline catalog from which future engagements draw.

Deliverables: findings, severity, and remediation

A red team engagement produces a structured Red Team Findings Report with the following sections:

  1. Executive summary - overall risk assessment, headline metrics, recommendation (approved for deployment / approved with conditions / not approved)

  2. Hard invariant compliance - pass/fail table for the non-negotiables

  3. Security findings - each finding with severity, evidence (conversation IDs), root-cause analysis, and a concrete remediation recommendation

  4. Other findings - safety, reliability, bias, and brand findings in the same structure

  5. Manual testing notes - what the human testers found that the automated suite missed

  6. Remediation summary - a priority matrix (P0 immediate, P1 pre-launch, P2 post-launch, P3 quality iteration) and recommended actions per finding

  7. Appendices - raw evaluation data, methodology references, scenario grid coverage, and dataset provenance

Findings are categorized by severity:

Severity

Definition

Examples

Critical

Immediate security/safety/regulatory risk

PII leakage, action taken before identity verification, harmful content generation

High

Significant vulnerability with brand or compliance impact

Successful prompt injection, demographic bias in responses, unauthorized disclosure

Medium

Quality or reliability issue

Hallucination, KB contradiction, minor policy deviation

Low

Inconsistency or improvement opportunity

Suboptimal responses, edge-case handling

For every finding we propose a concrete remediation:

  • Prompt engineering fixes - specific additions or changes to the system prompt that address the weakness

Many injection failures can be closed by strengthening the prompt with explicit refusal language and tightened constraints:

  • Guardrail adjustments - content filters, regex masks, or routing rules at the platform layer

  • Tool-use logic changes - tightening the gating logic between tools so the agent cannot reach a side-effecting action without the prerequisite chain

  • Model changes - when an issue is rooted in the underlying model (e.g. deeply ingrained bias, persistent jailbreak susceptibility), we recommend evaluating a different model

  • Bias mitigation - persona, tone, and example adjustments where fairness discrepancies appear

The test harness makes verification efficient: After the operator applies a fix, we re-run the exact adversarial set on the updated agent and produce a delta report showing whether the vulnerability is resolved and whether any regression has been introduced elsewhere.

Why red teaming matters

Done well, a red team engagement provides the human operator with three things that no other step in the AI agent lifecycle offers:

  1. Evidence, not opinion. Concrete conversation IDs and tool traces show where the agent failed and where it held at statistical scale.

  2. Prioritized remediation, not a wishlist. Findings are ranked by severity, each tied to a specific change that closes it.

  3. A repeatable testbed. The scenario catalog and evaluation harness do not retire after the engagement. Instead, they become the regression suite that runs against every future prompt change, every model upgrade, every tool addition, ensuring continuous safety and reliability of the agent.

For agents operating in regulated industries or in high-volume environments, red teaming is what moves security from "we believe the agent is safe" to "we can show, with data, how the agent behaves under attack, and we can show it again after every change."

References