Agentic systems expand the safety and security surface area beyond the base model. In addition to user prompts, the system’s tool calls, intermediate state, memory, and multi-agent handoffs can all become points where failures emerge and compound. As a result, model-level checks alone often miss how issues appear in end-to-end execution.
To help developers evaluate agentic systems at the workflow level, Lakera contributed red teaming capabilities to NeMo Agent Toolkit. This post summarizes what shipped, where to find it, and what you get out of a run: structured findings, normalized risk scoring, and signals for how risk propagates-or attenuates-across an agent workflow.
At Lakera, we focus on adversarial testing and red teaming of agentic AI systems, which is why we’ve been closely working with frameworks like NVIDIA’s NeMo Agent Toolkit to explore how these systems fail under real-world attack conditions.
What’s included in NeMo Agent Toolkit v1.4
Lakera’s contribution is delivered as part of the NeMo Agent Toolkit Safety & Security example (Retail Agent). The example integrates a systematic red teaming workflow designed to exercise an agent system end-to-end covering user input, tool boundaries, and multi-step execution paths.
At a high level, the red-teaming capabilities supports:
- Tailored, architecture specific threat models
- Systematic attack injection against key agent interfaces and components (including direct and indirect inputs)
- Evaluation at both component boundaries and full workflow execution
- Automated risk report generation with quantified, normalized scores for consistent comparison
- Risk propagation analysis to identify where issues spread, amplify, or get filtered across steps
Why system-level evaluation matters for agents
For agentic workflows, failures rarely live in a single component. A vulnerability introduced in one stage-such as a manipulated external input or an unsafe tool response-can influence downstream reasoning and decisions. This means that testing individual models or tools in isolation can produce a false sense of security, even when the overall workflow remains brittle.
The red teaming workflow is designed to evaluate the agent as a system. It does this by injecting adversarial conditions and measuring outcomes across multi-step execution, producing evidence about where failures originate and how they move through the workflow.
From a red-teaming perspective, this type of agent architecture presents multiple points of failure that are difficult to detect through conventional testing alone.
Demo Example: Try it Yourself
The release includes a sample agent to help explore the red teaming capabilities, included in the NeMo Agent Toolkit repository under: examples/safety_and_security/retail_agent
Agent Red Teaming Output
A red team evaluation run produces a structured risk report that you can use to understand and compare agent behavior across iterations.
Key Concepts:
A scenario defines a specific attack setup, combining an injection payload, a target point in the agent workflow (e.g., user input, indirect data source), and success criteria for evaluation.
Key Metrics:
- Risk Score (0–1): A normalized measure of vulnerability where higher scores indicate successful attacks. Enables consistent comparison across scenarios and over time.
- Attack Success Rate (ASR): The percentage of attempts where an injected attack achieved its intended effect on the agent's behavior.
Report Output:
- Summary: Overall risk score, attack success rate, and run statistics
- Per-scenario breakdown: Results for each attack type with mean, min, and max scores to surface variance in agent behavior
- Grouped views: Results sliced by scenario category, risk taxonomy (e.g., data exfiltration, harmful content), and evaluation point for identify patterns across related attack types
- Score distributions: Visualizations showing whether failures are consistent or intermittent
Practical usage pattern
This tooling is intended to fit into an iterative development loop:
- Define the agent workflow, specifying tools, data sources, and execution paths.
- Run a baseline red team evaluation against the target configuration.
- Review the risk report to identify which scenarios succeeded, where failures originated, and how risk propagated through the workflow.
- Apply mitigations such as guardrails, output validation, prompt hardening, etc.
- Re-run the evaluation and compare normalized scores against the baseline.
- Repeat as needed.
Because the outputs are structured and scored consistently, the workflow can be used to track progress across changes.
Scope and intent
The agent red teaming evaluation tooling is provided as an open-source evaluation capability within the NeMo Agent Toolkit, including a supporting Retail Agent example. It is designed to support architecture-aware testing of agentic systems and to help developers generate empirical evidence about system-level safety and security behavior during execution.
Next steps
- Run the Retail Agent safety and security example to generate a baseline report
- Extend the evaluation set to match your system’s tools, data sources, and threat model
- Use normalized scoring and propagation signals to guide mitigations and validate improvements across development and releases
The NeMo Agent Toolkit v1.4 example provides a starting point for integrating systematic red teaming into agent development workflows.
Sample Report
This report summarizes results from a red team evaluation of the sample Retail Agent included in NeMo Agent Toolkit v1.4. The retail agent is configured without additional defense layers in this evaluation.
The Retail Agent is a single ReAct agent with one tool group, so this evaluation configures a single evaluation point (workflow_output) measuring the final response. For more complex architectures with multi-agent handoffs or chained workflows, users can configure evaluation at multiple workflow boundaries to identify where risks propagate or attenuate across steps.
Interpreting the Sample Report
Per-Scenario Results
Each scenario represents a specific attack setup. Key columns:
- ASR: Attack success rate for this scenario. 100% = consistently exploitable; 0% = fully resisted.
- Mean/Min/Max Score: Shows both central tendency and variance. High variance suggests the agent's behavior is inconsistent under adversarial conditions.
Findings from sample red team evaluation:
Grouped Views
The report slices results by multiple dimensions:
- Scenario Group: Clusters related attacks (e.g., all denial-of-service variants) to assess vulnerability by category
- Tags: Cross-cuts by risk taxonomy (data_exfiltration, harmful_content, reputational_damage)
- Output Filtering Condition: Indicates where in the workflow the evaluation was performed. In this report, only workflow_output (final response) is evaluated.
Score Distribution Charts
Box plots reveal consistency of agent behavior:
- Tight cluster at 0: Robust resistance
- Tight cluster at 1: Consistent vulnerability
- Wide spread: Non-deterministic—same attack sometimes succeeds, sometimes fails
As agentic systems move closer to production, adversarial testing and red teaming become essential to understanding how these systems behave under real-world conditions. Lakera Red is designed to help teams systematically test, evaluate, and harden agentic AI systems built on modern frameworks like NVIDIA NeMo as they scale beyond experimentation.




