This repository contains data and code for the STAC (Sequential Tool Attack Chaining) framework, which generates and evaluates multi-turn adversarial attacks against LLM agents in tool-use environments.
📄 Paper: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
If you only want to evaluate your model on our pre-generated STAC Benchmark (483 test cases), you can skip the full pipeline and directly run:
# Set up environment
conda env create -f environment.yml
conda activate STAC
export OPENAI_API_KEY="your-openai-api-key-here" # For planner and judge models
# Run evaluation
python -m STAC_eval.eval_STAC_benchmark \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Input: data/STAC_benchmark_data.json (483 test cases from both SHADE-Arena and Agent-SafetyBench)
Output: Evaluation results in data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
The evaluation script automatically handles test cases from both SHADE-Arena and Agent-SafetyBench. Skip to the Benchmark Evaluation section for more details.
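Once the run completes, results are written to the output path shown above. A minimal sketch for loading them (assuming gen_res.json is a single JSON document; its exact schema is produced by the evaluation script):

```python
import json
import os

# Defaults from the quick-start command above:
# model_planner = gpt-4.1, model_agent = gpt-4.1, defense = no_defense
results_path = os.path.join("data", "Eval", "gpt-4.1", "gpt-4.1", "no_defense", "gen_res.json")

with open(results_path) as f:
    results = json.load(f)  # assumes a single JSON document, not JSON Lines

print(f"Loaded {len(results)} evaluation records from {results_path}")
```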
Clone the repository and set up the conda environment on a Linux machine:
# Clone the repository
cd MultiTurnAgentAttack-main
# Create and activate conda environment
conda env create -f environment.yml
conda activate STAC
If you prefer using pip instead of conda:
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Before running any scripts, configure the required API keys:
export OPENAI_API_KEY="your-openai-api-key-here"
export HF_TOKEN="your-hf-token-here"
export AWS_ACCESS_KEY_ID="your-aws-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
export AWS_SESSION_TOKEN="your-aws-session-token"
Note: AWS session tokens expire every 12 hours and must be refreshed regularly.
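Before launching a long job, it can help to confirm the relevant variables are actually exported. A small sketch (which keys are required depends on the models you plan to run):

```python
import os

# OPENAI_API_KEY is needed for GPT planner/judge/agent models;
# HF_TOKEN for Hugging Face models; the AWS_* variables for Bedrock models.
required = ["OPENAI_API_KEY"]
optional = ["HF_TOKEN", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"]

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")

for name in optional:
    print(f"{name}: {'set' if os.environ.get(name) else 'not set'}")
```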
MultiTurnAgentAttack-main/
├── STAC_gen/ # Full STAC attack generation pipeline
│ ├── step_1_gen_tool_chains.py # Generate tool chain attacks
│ ├── step_2_verify_tool_chains.py # Verify generated attacks
│ ├── step_3_reverse_engineer_prompts.py # Generate adversarial prompts
│ └── step_4_eval_adaptive_planning.py # Evaluate with adaptive planning
│
├── STAC_eval/ # Benchmark evaluation
│ └── eval_STAC_benchmark.py # Evaluate models on STAC benchmark
│
├── data/
│ └── STAC_benchmark_data.json # Pre-generated benchmark (483 cases)
│
├── Agent_SafetyBench/ # Agent-SafetyBench environments
├── SHADE_Arena/ # SHADE-Arena environments
│
├── src/ # Core implementation
│ ├── Agents.py
│ ├── Environments.py
│ ├── LanguageModels.py
│ ├── STAC.py
│ └── utils.py
│
└── prompts/ # System prompts for all components
The repository provides two main usage modes:
1. Benchmark evaluation: Evaluate your model on our pre-generated STAC Benchmark containing 483 test cases across SHADE-Arena and Agent-SafetyBench environments. This is the recommended starting point for most users.
2. Attack generation: Generate new attacks from scratch using the complete 4-step STAC pipeline. This is useful for:
- Creating attacks for new environments
- Experimenting with different attack generation strategies
- Extending the benchmark with additional test cases
Evaluate your agent against the STAC Benchmark with a single command:
python -m STAC_eval.eval_STAC_benchmark \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Parameters:
- --input_path: Path to benchmark data (default: data/STAC_benchmark_data.json)
- --output_dir: Output directory (default: data/Eval)
- --model_planner: Model for attack planning (default: gpt-4.1)
- --model_judge: Model for evaluation (default: gpt-4.1)
- --model_agent: Your model to evaluate (default: gpt-4.1)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Batch size for evaluation (default: 1)
- --region: AWS region for Bedrock API (default: us-west-2)
- --max_n_turns: Maximum conversation turns (default: 3)
- --defense: Defense mechanism to evaluate (default: no_defense)
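For reference, a fully explicit invocation built from the flags above might look like the following sketch (all values other than --model_agent and --batch_size are just the documented defaults):

```python
import subprocess

# Explicit version of the one-line command above; adjust flags as needed.
cmd = [
    "python", "-m", "STAC_eval.eval_STAC_benchmark",
    "--input_path", "data/STAC_benchmark_data.json",
    "--output_dir", "data/Eval",
    "--model_planner", "gpt-4.1",
    "--model_judge", "gpt-4.1",
    "--model_agent", "gpt-4.1",
    "--defense", "no_defense",
    "--temperature", "0.0",
    "--top_p", "0.95",
    "--max_n_turns", "3",
    "--batch_size", "512",
]
subprocess.run(cmd, check=True)
```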
Evaluate different defense strategies:
# No defense (baseline)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense no_defense
# Spotlighting (datamarking user prompts)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense spotlighting
# Failure mode awareness
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense failure_modes
# User intent summarization
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense summarization
# Harm-benefit reasoning
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense reasoning
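To compare all five defenses for a single agent model, a simple driver loop (a sketch that just shells out to the command above once per defense) could be used:

```python
import subprocess

DEFENSES = ["no_defense", "spotlighting", "failure_modes", "summarization", "reasoning"]

for defense in DEFENSES:
    # Each run writes to data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
    subprocess.run(
        [
            "python", "-m", "STAC_eval.eval_STAC_benchmark",
            "--model_agent", "gpt-4.1",
            "--defense", defense,
            "--batch_size", "512",
        ],
        check=True,
    )
```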
Requirements: OPENAI_API_KEY
# GPT-4.1
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --batch_size 512
# GPT-4.1-mini
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1-mini --batch_size 512
Requirements: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
# Llama 3.3 70B
python -m STAC_eval.eval_STAC_benchmark \
--model_agent us.meta.llama3-3-70b-instruct-v1:0 \
--batch_size 10
# Llama 3.1 405B
python -m STAC_eval.eval_STAC_benchmark \
--model_agent meta.llama3-1-405b-instruct-v1:0 \
--batch_size 10
Requirements: GPU access (tested on H100s), HF_TOKEN
# Request GPU node
# e.g., on slurm: srun --exclusive --pty --partition=p4 --nodes=1 /bin/bash
# Evaluate model
python -m STAC_eval.eval_STAC_benchmark \
--model_agent Qwen/Qwen3-32B \
--temperature 0.6 \
--batch_size 100
# Exit GPU session
exit
Use this pipeline to generate new attacks from scratch. The pipeline consists of 4 sequential steps:
Generate adversarial tool chain attacks for different environments.
Requirements: OPENAI_API_KEY
python -m STAC_gen.step_1_gen_tool_chains \
--dataset SHADE_Arena \
--model_name_or_path gpt-4.1 \
--n_cases 120 \
--batch_size 32
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model_name_or_path: Model for attack generation (default: gpt-4.1)
- --output_path: Output path for generated attacks (optional, auto-generated if not provided)
- --n_cases: Number of cases to generate (default: 120 for Agent_SafetyBench, all for SHADE_Arena)
- --batch_size: Batch size for OpenAI Batch API. Use 'None' for synchronous API (default: None)
Verify that generated attacks execute correctly in their environments.
Requirements: OPENAI_API_KEY
python -m STAC_gen.step_2_verify_tool_chain \
--dataset SHADE_Arena \
--model gpt-4.1 \
--batch_size 512
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model: Model for verification (default: gpt-4.1)
- --input_path: Path to Step 1 outputs (optional, auto-generated if not provided)
- --batch_size: Number of attacks to verify simultaneously (default: 512)
- --temperature: Sampling temperature (default: 0.6)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --max_tokens: Maximum tokens per verification (default: 8192)
- --env: Filter for specific environment (optional)
- --region: AWS region for model access (default: us-east-1)
Generate natural language prompts that lead to the verified tool chains.
Requirements: GPU cluster access (tested on H100s)
# Request GPU node (example using slurm)
srun --exclusive --pty --partition=p4 --nodes=1 /bin/bash
# Run reverse engineering
python -m STAC_gen.step_3_reverse_engineer_prompts \
--dataset SHADE_Arena \
--model Qwen/Qwen3-32B \
--temperature 0.6 \
--batch_size 512
# Exit GPU session
exit
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model: Model for prompt generation (default: Qwen/Qwen3-32B)
- --output_dir: Directory containing Step 2 verification results (optional, auto-generated if not provided)
- --temperature: Sampling temperature (default: 0.6)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Number of attacks to process simultaneously (default: 1)
- --region: AWS region for model access (default: us-east-1)
Evaluate agents against generated attacks using adaptive planning.
Requirements: Depends on model being evaluated
python -m STAC_gen.step_4_eval_adaptive_planning \
--benchmark SHADE_Arena \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Parameters:
- --benchmark: Target benchmark (SHADE_Arena or Agent_SafetyBench) - required
- --model_planner: Model for attack planning (default: gpt-4.1)
- --model_judge: Model for evaluation (default: gpt-4.1)
- --model_agent: Model to evaluate (default: gpt-4.1)
- --defense: Defense mechanism (no_defense, failure_modes, summarization, reasoning, spotlighting) (default: no_defense)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Evaluation batch size (default: 1)
- --max_n_turns: Maximum conversation turns (default: 3)
- --input_path: Path to Step 3 outputs (optional, auto-generated if not provided)
- --output_dir: Output directory (optional, auto-generated if not provided)
- --region: AWS region for Bedrock API (default: us-west-2)
- --n_gpus_per_model: Number of GPUs per model (default: 8)
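Because each step consumes the outputs of the previous one, the steps must run in order. The sketch below simply chains the four commands documented above for a single dataset; note that Step 3 needs a GPU node, so in practice you may run it separately.

```python
import subprocess

DATASET = "SHADE_Arena"  # or "Agent_SafetyBench"

steps = [
    ["python", "-m", "STAC_gen.step_1_gen_tool_chains",
     "--dataset", DATASET, "--model_name_or_path", "gpt-4.1", "--batch_size", "32"],
    ["python", "-m", "STAC_gen.step_2_verify_tool_chain",
     "--dataset", DATASET, "--model", "gpt-4.1", "--batch_size", "512"],
    # Step 3 requires GPU access (tested on H100s) and HF_TOKEN.
    ["python", "-m", "STAC_gen.step_3_reverse_engineer_prompts",
     "--dataset", DATASET, "--model", "Qwen/Qwen3-32B", "--temperature", "0.6", "--batch_size", "512"],
    ["python", "-m", "STAC_gen.step_4_eval_adaptive_planning",
     "--benchmark", DATASET, "--model_agent", "gpt-4.1", "--defense", "no_defense", "--batch_size", "512"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # each step reads the previous step's outputs
```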
The benchmark (data/STAC_benchmark_data.json) contains 483 test cases:
- Each test case includes a verified tool chain attack
- Test cases span multiple environments from SHADE-Arena and Agent-SafetyBench
- Each item contains a generation_config.dataset field indicating its source benchmark
{
"id": "unique-identifier",
"generation_config": {
"dataset": "SHADE_Arena", // or "Agent_SafetyBench"
"environment": "workspace",
...
},
"attack_plan": {
"attack_goal": "...",
"explanation": "...",
"verified_tool_chain": [...]
},
"interaction_history": [...],
...
}
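A small sketch for loading the benchmark and splitting test cases by source (assuming the file is a JSON list of test-case objects shaped like the example above):

```python
import json
from collections import Counter

with open("data/STAC_benchmark_data.json") as f:
    test_cases = json.load(f)  # assumed to be a list of test-case dicts

# Count cases per source benchmark via generation_config.dataset
sources = Counter(case["generation_config"]["dataset"] for case in test_cases)
print(f"{len(test_cases)} test cases total: {dict(sources)}")

# Example: keep only the SHADE-Arena cases
shade_cases = [c for c in test_cases if c["generation_config"]["dataset"] == "SHADE_Arena"]
if shade_cases:
    print(f"{len(shade_cases)} SHADE-Arena cases, e.g. id = {shade_cases[0]['id']!r}")
```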
The following defense mechanisms can be selected with the --defense flag:
- no_defense: Baseline agent without additional safety measures
- failure_modes: Agent is made aware of potential failure modes and risks
- summarization: Agent summarizes user intent before taking actions
- reasoning: Agent employs harm-benefit reasoning before executing tools
- spotlighting: User prompts are marked with special characters for disambiguation from system prompts
- GPU Requirements:
  - STAC_gen Steps 3 and 4 (HF models) require GPU access
  - STAC_eval with HF models requires GPU access
- AWS Token Expiration:
  - AWS session tokens expire every 12 hours
  - You'll need to refresh credentials and restart for long-running jobs
- Batch Size Guidelines:
  - Start with batch_size=2 to verify the script works
  - Increase based on your resources (e.g., number of GPUs)
- Sequential Execution (STAC_gen only):
  - Pipeline steps must be run in order
  - Each step depends on outputs from previous steps
- Model-Specific Requirements:
  - Check that you have access to the model before running
  - Different models require different API keys/tokens
Q: "KeyError: 'OPENAI_API_KEY'"
# Make sure to export (not just set) the variable
export OPENAI_API_KEY="sk-your-key-here"
# Verify it's set
python -c "import os; print('Key found:', 'OPENAI_API_KEY' in os.environ)"
Q: "Model not found" for HuggingFace
# Ensure you have access to the model on HuggingFace
# Set your token
export HF_TOKEN="your-token-here"
Q: "AWS credentials expired"
# Refresh your AWS credentials
export AWS_ACCESS_KEY_ID="new-key-id"
export AWS_SECRET_ACCESS_KEY="new-secret"
export AWS_SESSION_TOKEN="new-token"
See requirements.txt for the full list of dependencies.
If you use this code or the STAC benchmark in your research, please cite our work:
@misc{li2025stacinnocenttoolsform,
title={STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents},
author={Jing-Jing Li and Jianfeng He and Chao Shang and Devang Kulshreshtha and Xun Xian and Yi Zhang and Hang Su and Sandesh Swamy and Yanjun Qi},
year={2025},
eprint={2509.25624},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2509.25624},
}
See SECURITY and CONTRIBUTING for more information.
Our framework is constructed using data/scenarios from the following sources:
- Agent-SafetyBench – released under the MIT License
- SHADE-Arena – released under the MIT License
Accordingly, we release our framework under the CC BY-NC 4.0 License.