AgentAssert
Formal Behavioral Contracts for AI Agents
AgentAssert is the formal behavioral specification and runtime enforcement engine for autonomous AI agents. Define what your agent must and must not do in a YAML contract, then enforce those rules at runtime with mathematical guarantees.
It is the only framework combining all 6 pillars of rigorous agent governance:
- ContractSpec DSL -- YAML-based behavioral specification with 14 operators
- Hard/Soft Constraints -- Formal separation with graduated enforcement and recovery
- Drift Detection -- Jensen-Shannon Divergence for distributional behavioral analysis
- (p, delta, k)-Satisfaction -- Probabilistic compliance guarantees with statistical bounds
- Compositional Safety Proofs -- Formal bounds for multi-agent pipelines
- Mathematical Stability -- Ornstein-Uhlenbeck dynamics with Lyapunov stability proof
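For intuition on the last pillar: an Ornstein-Uhlenbeck process is mean-reverting, which is the core idea behind a Lyapunov-style argument that behavioral drift stays bounded. The toy Euler-Maruyama simulation below is a generic sketch only — the paper's actual dynamics, symbols, and parameters are its own:

```python
import random

def simulate_ou(x0, mu, speed, sigma, dt=0.1, steps=200, seed=7):
    """Euler-Maruyama simulation of dX = speed * (mu - X) dt + sigma dW."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x += speed * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0, 1)
    return x

# With no noise, the state contracts geometrically toward the long-run mean mu,
# which is the mean-reversion property a stability proof leans on.
print(round(simulate_ou(x0=0.2, mu=0.9, speed=0.5, sigma=0.0), 3))  # 0.9
```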
Paper: Bhardwaj, V.P. (2026). AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents. arXiv:2602.22302
```shell
pip install "agentassert-abc[yaml,math]"
```

Requires Python 3.12+. Licensed under the Elastic License 2.0.
Optional extras:
| Extra | What it adds |
|---|---|
| `yaml` | YAML contract parsing (ruamel.yaml) |
| `math` | Drift detection, Theta computation (scipy, numpy) |
| `llm` | Recovery re-prompting (LiteLLM) |
| `otel` | OpenTelemetry metric export |
| `all` | Everything above |
```python
import agentassert_abc as aa
from agentassert_abc.integrations.generic import GenericAdapter

# 1. Load a domain contract (12 included out of the box)
contract = aa.load("contracts/examples/ecommerce-product-recommendation.yaml")

# 2. Create an adapter
adapter = GenericAdapter(contract)

# 3. Monitor agent output on every turn
result = adapter.check({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})
print(f"Hard violations: {result.hard_violations}")
print(f"Soft violations: {result.soft_violations}")

# 4. Raise on critical violations
adapter.check_and_raise({
    "output.pii_detected": False,
    "output.competitor_reference_detected": False,
    "output.sponsored_items_disclosed": True,
    "output.brand_tone_score": 0.85,
    "output.recommendation_relevance_score": 0.9,
})

# 5. Get the session reliability score (Theta)
summary = adapter.session_summary()
print(f"Reliability (Theta): {summary.theta:.3f}")
print(f"Deploy-ready: {summary.theta >= 0.90}")
```

AgentAssert is plug-and-play with the major 2026 agent frameworks.
```python
import agentassert_abc as aa
from langgraph.graph import StateGraph, START, END
from agentassert_abc.exceptions import ContractBreachError
from agentassert_abc.integrations.langgraph import LangGraphAdapter

contract = aa.load("contracts/examples/customer-support.yaml")
adapter = LangGraphAdapter(contract)

builder = StateGraph(State)
builder.add_node("classify", adapter.wrap_node(classify_fn))
builder.add_node("respond", adapter.wrap_node(respond_fn))
builder.add_edge(START, "classify")
builder.add_edge("classify", "respond")
builder.add_edge("respond", END)
graph = builder.compile()

try:
    result = graph.invoke(initial_state)
except ContractBreachError as e:
    print(f"Hard violation blocked: {e}")

print(f"Session Theta: {adapter.session_summary().theta:.3f}")
```

```python
import agentassert_abc as aa
from crewai import Agent, Task, Crew
from agentassert_abc.integrations.crewai import CrewAIAdapter

contract = aa.load("contracts/examples/research-assistant.yaml")
adapter = CrewAIAdapter(contract)

# Guardrail rejects output on hard violations -- CrewAI retries automatically
research_task = Task(
    description="Research AI agent frameworks in 2026",
    expected_output="Cited report on top 5 frameworks",
    agent=researcher,
    guardrail=adapter.guardrail,
    guardrail_max_retries=3,
)
```

```python
import agentassert_abc as aa
from agents import Agent, Runner
from agentassert_abc.integrations.openai_agents import OpenAIAgentsAdapter

contract = aa.load("contracts/examples/healthcare-triage.yaml")
adapter = OpenAIAgentsAdapter(contract)

agent = Agent(
    name="triage-agent",
    instructions="You are a medical triage assistant.",
    output_guardrails=[adapter.output_guardrail],
    output_type=TriageOutput,
)

# Inside an async context:
result = await Runner.run(agent, "I have chest pain", hooks=adapter.run_hooks)
print(f"Theta: {adapter.session_summary().theta:.3f}")
```

AgentAssert ships with AgentContract-Bench, a benchmark suite of 293 scenarios across 12 real-world domains for testing contract enforcement accuracy.
| Domain | Scenarios | Pass Rate | Hard P/R/F1 | Soft P/R/F1 |
|---|---|---|---|---|
| E-Commerce (Product) | 50 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Financial Advisor | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Healthcare Triage | 33 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| MCP Tool Server | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| RAG Agent | 28 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Code Generation | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Customer Support | 23 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (CS) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| E-Commerce (Order) | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Research Assistant | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Retail Shopping | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Telecom Support | 15 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| Total | 293 | 100% | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
```shell
# Run benchmarks locally
python benchmarks/runner.py                      # All 293 scenarios
python benchmarks/runner.py --domain ecommerce   # Single domain
python benchmarks/runner.py --verbose            # Show details
```

We tested AgentAssert against three production LLMs on 10-16 turn e-commerce sessions using the retail-shopping-assistant contract with real Azure AI Foundry endpoints:
| Model | Turns | Hard Violations | Soft Violations | Theta | Mean Drift |
|---|---|---|---|---|---|
| GPT-5.3 (OpenAI) | 16 | 0 | 11 | 0.688 | 0.034 |
| Claude Sonnet 4.6 (Anthropic) | 10 | 4 | 0 | 0.823 | 0.020 |
| Mistral-Large-3 (Mistral) | 10 | 5 | 0 | 0.813 | 0.025 |
Key findings:
- GPT-5.3 achieved zero hard violations but exhibited soft quality drift (response completeness and latency)
- Claude Sonnet 4.6 and Mistral-Large-3 triggered `no-false-availability` hard violations -- fabricating product availability without catalog access
- All three models scored below the 0.90 Theta threshold for autonomous deployment, demonstrating why runtime behavioral contracts are essential
These results are consistent with the findings reported in arXiv:2602.22302. AgentAssert catches violations that traditional guardrails miss because it tracks behavioral drift over entire sessions, not just individual outputs.
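For intuition on the drift signal: Jensen-Shannon divergence compares a session's recent behavior distribution against a baseline, is symmetric, and is zero only when the two match. A self-contained sketch of the computation follows — the bucket names and numbers here are invented for illustration, not taken from the library:

```python
from math import log

def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats between discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, smoothed KL against the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical per-turn behavior distributions over three outcome buckets
# (compliant / soft violation / recovery), early session vs. late session.
baseline = [0.70, 0.20, 0.10]
current = [0.60, 0.25, 0.15]
print(f"JSD drift: {js_divergence(baseline, current):.3f}")  # small value -> low drift
```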
12 production-ready contracts ship with AgentAssert in `contracts/examples/`:
| Contract | Domain | Hard | Soft | Key Checks |
|---|---|---|---|---|
| `ecommerce-product-recommendation` | E-Commerce | 7 | 8 | PII, competitor mentions, sponsored disclosure |
| `ecommerce-order-management` | E-Commerce | 7 | 8 | Payment data, order accuracy, refund policy |
| `ecommerce-customer-service` | E-Commerce | 7 | 8 | Escalation, SLA, customer sentiment |
| `financial-advisor` | Finance | 7 | 8 | Regulatory compliance, risk disclosure, suitability |
| `healthcare-triage` | Healthcare | 9 | 7 | Medical safety, urgency detection, no diagnosis |
| `retail-shopping-assistant` | Retail | 7 | 9 | Availability, pricing accuracy, upsell limits |
| `telecom-customer-support` | Telecom | 7 | 9 | Plan accuracy, billing, cancellation handling |
| `code-generation` | Dev Tools | 7 | 7 | License compliance, security, test coverage |
| `research-assistant` | Research | 6 | 7 | Citation accuracy, source attribution, bias |
| `customer-support` | General | 6 | 5 | Tone, escalation, resolution quality |
| `mcp-tool-server` | MCP (2026) | 6 | 5 | Tool authorization, rate limits, output bounds |
| `rag-agent` | RAG (2026) | 7 | 7 | Hallucination, source grounding, retrieval quality |
Define behavioral contracts in YAML:

```yaml
contractspec: "0.1"
kind: agent
name: my-agent-contract
description: Behavioral contract for my agent
version: "1.0.0"

invariants:
  hard:
    - name: no-pii-leak
      description: Never expose personal information
      check:
        field: output.pii_detected
        equals: false
  soft:
    - name: tone-quality
      description: Maintain professional tone
      check:
        field: output.tone_score
        gte: 0.7
      recovery: fix-tone
      recovery_window: 2

recovery:
  strategies:
    - name: fix-tone
      type: inject_correction
      actions:
        - "Rewrite with professional tone"

satisfaction:
  p: 0.95
  delta: 0.1
  k: 3
```

14 operators are available for checks: equals, not_equals, gt, gte, lt, lte, in, not_in, contains, not_contains, matches, exists, expr, between.
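To make the operator semantics concrete, here is an illustrative evaluator for a subset of those operators against a flat output dict. This is a sketch of the semantics as described, not the library's actual evaluator:

```python
import re

# Illustrative semantics for some ContractSpec check operators (a sketch).
OPERATORS = {
    "equals":     lambda v, arg: v == arg,
    "not_equals": lambda v, arg: v != arg,
    "gt":         lambda v, arg: v > arg,
    "gte":        lambda v, arg: v >= arg,
    "lt":         lambda v, arg: v < arg,
    "lte":        lambda v, arg: v <= arg,
    "in":         lambda v, arg: v in arg,
    "contains":   lambda v, arg: arg in v,
    "matches":    lambda v, arg: re.search(arg, v) is not None,
    "between":    lambda v, arg: arg[0] <= v <= arg[1],
}

def evaluate_check(check, output):
    """Evaluate one check dict (field + operator) against a flat output dict."""
    field = check["field"]
    if "exists" in check:
        return (field in output) == check["exists"]
    if field not in output:
        return False
    value = output[field]
    for op, test in OPERATORS.items():
        if op in check:
            return test(value, check[op])
    raise ValueError(f"no supported operator in {check}")

output = {"output.tone_score": 0.85, "output.pii_detected": False}
print(evaluate_check({"field": "output.tone_score", "gte": 0.7}, output))        # True
print(evaluate_check({"field": "output.pii_detected", "equals": False}, output)) # True
```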
- Identify fields -- Examine your agent's output and list the fields that matter for safety and quality
- Map to flat dict -- AgentAssert uses `output.field_name` keys (e.g., `{"output.safe": True}`)
- Choose constraint type -- Hard for non-negotiable safety (violations halt execution), Soft for quality goals (violations trigger recovery)
- Set satisfaction -- `p` = target compliance rate, `delta` = tolerance, `k` = max violations before alert
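The flat-dict mapping above can be produced with a small helper like the one below. This is a hypothetical helper, not part of the AgentAssert API; the nested-dict handling is an assumption:

```python
def flatten(obj, prefix="output"):
    """Flatten nested agent output into dotted keys like 'output.field_name'."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested sections
        else:
            flat[path] = value
    return flat

raw = {"pii_detected": False, "scores": {"tone": 0.85}}
print(flatten(raw))  # {'output.pii_detected': False, 'output.scores.tone': 0.85}
```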
Certify agents for production with 50-80% fewer test sessions using Sequential Probability Ratio Testing:
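For intuition, the decision rule behind sequential testing is Wald's log-likelihood ratio: each passing session nudges the ratio toward "certify", each failure toward "reject", and testing stops as soon as either threshold is crossed — which is where the session savings come from. A textbook sketch with the same parameters (SPRTCertifier's internals are the library's own):

```python
from math import log

def wald_sprt(results, p0=0.85, p1=0.95, alpha=0.05, beta=0.10):
    """Wald's SPRT over Bernoulli session outcomes: H0 (p = p0) vs H1 (p = p1)."""
    upper = log((1 - beta) / alpha)  # cross above -> accept H1 (certify)
    lower = log(beta / (1 - alpha))  # cross below -> accept H0 (reject)
    llr = 0.0
    for n, passed in enumerate(results, start=1):
        llr += log(p1 / p0) if passed else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(results)

print(wald_sprt([True] * 40))  # ('accept_h1', 26) -- certified well before 40 sessions
```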
```python
from agentassert_abc.certification.sprt import SPRTCertifier, SPRTDecision

certifier = SPRTCertifier(p0=0.85, p1=0.95, alpha=0.05, beta=0.10)

for session_passed in session_results:
    result = certifier.update(session_passed)
    if result.decision != SPRTDecision.CONTINUE:
        print(f"Decision: {result.decision.value} after {result.sessions_used} sessions")
        break
```

Prove safety bounds for multi-agent pipelines:
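Under an assumed independence of per-stage failures, the simplest bound of this kind is just the product of the stage reliabilities; the sketch below reproduces the ballpark of the 0.921 figure, though compose_guarantees may derive a more careful bound:

```python
# Two agents in sequence succeed only if Agent A, the handoff, and Agent B
# all succeed, so under independence the pipeline reliability is the product.
p_a, p_h, p_b = 0.95, 0.99, 0.98  # Agent A, handoff, Agent B reliabilities
naive_bound = p_a * p_h * p_b
print(f"naive product bound: {naive_bound:.3f}")  # naive product bound: 0.922
```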
```python
from agentassert_abc.certification.composition import compose_guarantees

# Agent A (p=0.95) -> Agent B (p=0.98), handoff reliability 0.99
bound = compose_guarantees(p_a=0.95, p_b=0.98, p_h=0.99)
print(f"Pipeline bound: {bound:.3f}")  # p_{A+B} >= 0.921
```

| Dimension | AgentAssert | Guardrails AI | NeMo Guardrails | Microsoft AGT |
|---|---|---|---|---|
| Formal math (Theta, SPRT) | Yes | No | No | No |
| Session drift detection (JSD) | Yes | No | No | No |
| Compositional safety proofs | Yes | No | No | No |
| Hard/Soft constraint separation | Yes | Partial | No | No |
| Recovery re-prompting | Yes | Yes | Yes | No |
| Framework integrations | 10 adapters | 3 | 1 (LangChain) | 2 |
| Statistical certification (SPRT) | Yes | No | No | No |
| Benchmark suite | 293 scenarios | No | No | No |
| Academic paper | arXiv:2602.22302 | No | No | No |
See `examples/` for runnable demos:

| Example | What It Shows |
|---|---|
| `01_basic_monitoring.py` | Simplest usage -- load, monitor, get Theta |
| `02_ecommerce_session.py` | Full e-commerce session from the paper |
| `03_drift_detection.py` | JSD-based behavioral drift over 20 turns |
| `04_sprt_certification.py` | SPRT statistical certification |
| `05_langgraph_middleware.py` | LangGraph StateGraph integration |
| `06_crewai_integration.py` | CrewAI task guardrails |
| `07_composition_pipeline.py` | Multi-agent compositional bounds |
| `08_mcp_tool_monitoring.py` | MCP tool server monitoring |
"AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents"
The theoretical foundations, formal proofs, and experimental validation are presented in the accompanying paper, which covers all 6 pillars of the framework with full mathematical treatment of the Reliability Index, drift dynamics, compositional guarantees, and SPRT certification.
Read the paper on arXiv (cs.AI + cs.SE)
```bibtex
@article{bhardwaj2026agentassert,
  title={AgentAssert: Formal Behavioral Contracts for Autonomous AI Agents},
  author={Bhardwaj, Varun Pratap},
  journal={arXiv preprint arXiv:2602.22302},
  year={2026},
  url={https://arxiv.org/abs/2602.22302}
}
```

Contributions welcome. See CONTRIBUTING.md for setup instructions, coding standards, and submission guidelines.
Elastic License 2.0. See LICENSE for details.
Part of Qualixar -- AI Agent Reliability Engineering
A research initiative by Varun Pratap Bhardwaj
qualixar.com · varunpratap.com · arXiv:2602.22302 · agentassert.com