CKA-Agent: Bypassing LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

🔥 Latest Results on Frontier Models (Dec 2025)

CKA-Agent achieves consistently high attack success rates against the latest frontier models, including GPT-5.2, Gemini-3.0-Pro, and Claude-Haiku-4.5, with Full Success rates between 0.881 and 0.969 across HarmBench and StrongREJECT. The results are summarized below:

Model	HarmBench FS ↑	HarmBench PS ↑	HarmBench V ↓	HarmBench R ↓	StrongREJECT FS ↑	StrongREJECT PS ↑	StrongREJECT V ↓	StrongREJECT R ↓
🟢 GPT-5.2	0.889	0.079	0.024	0.008	0.932	0.056	0.006	0.006
🟣 Gemini-3.0-Pro	0.881	0.087	0.000	0.032	0.951	0.037	0.006	0.006
🟠 Claude-Haiku-4.5	0.960	0.024	0.008	0.008	0.969	0.025	0.006	0.000

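Metrics: FS = Full Success, PS = Partial Success, V = Vacuous, R = Refusal. Each value is the fraction of test prompts in that category; ↑ marks metrics where higher indicates a stronger attack, ↓ where lower does. Results collected in December 2025.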
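As a quick sanity check on the table, the four rates in each row appear to partition the test set, so they should sum to 1. The sketch below verifies that for the reported numbers. Treating the categories as mutually exclusive and exhaustive is an assumption drawn from the row sums, not something this README states, and the names `RESULTS` and `rows_not_summing_to_one` are illustrative only.

```python
import math

# Rates copied from the results table above, in the order (FS, PS, V, R).
RESULTS = {
    ("GPT-5.2", "HarmBench"): (0.889, 0.079, 0.024, 0.008),
    ("GPT-5.2", "StrongREJECT"): (0.932, 0.056, 0.006, 0.006),
    ("Gemini-3.0-Pro", "HarmBench"): (0.881, 0.087, 0.000, 0.032),
    ("Gemini-3.0-Pro", "StrongREJECT"): (0.951, 0.037, 0.006, 0.006),
    ("Claude-Haiku-4.5", "HarmBench"): (0.960, 0.024, 0.008, 0.008),
    ("Claude-Haiku-4.5", "StrongREJECT"): (0.969, 0.025, 0.006, 0.000),
}


def rows_not_summing_to_one(results, tol=1e-3):
    """Return the (model, benchmark) keys whose four rates do not sum to 1.

    Assumes the four categories are mutually exclusive and exhaustive.
    """
    return [
        key
        for key, rates in results.items()
        if not math.isclose(sum(rates), 1.0, abs_tol=tol)
    ]


bad_rows = rows_not_summing_to_one(RESULTS)
print("inconsistent rows:", bad_rows)
```

An empty list means every row is internally consistent at three-decimal precision.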

Overview

This repository contains the official implementation of CKA-Agent, a novel approach to bypassing the guardrails of commercial large language models (LLMs) through harmless prompt weaving and adaptive tree search techniques.

Environment Setup

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Create the environment

uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
uv pip install accelerate fastchat nltk pandas google-genai "httpx[socks]" anthropic

Experiment Configuration

Configure your experiments by modifying the config/config.yml file. You can control the following aspects:

Test Dataset: Choose from available datasets like harmbench_cka or strongreject_cka.
Target Models: Select black-box or white-box models such as gpt-oss-120b or gemini-2.5-xxx.
Jailbreak Methods: Enable and configure various implemented baseline methods.
Evaluations: Define evaluation metrics and judge models like gemini-2.5-flash.
Defense Methods: Apply different defense mechanisms as needed.

For detailed configuration instructions and examples, please refer to the configuration README.

Running Experiments

The run_experiment.sh script executes main.py to run the entire experiment pipeline (jailbreak and evaluation) by default.

./run_experiment.sh

To run a specific phase, edit the run_experiment.sh script or pass the --phase argument directly to main.py. The available phases are:

full: Runs the entire pipeline (default).
jailbreak: Runs only the jailbreak methods.
judge: Runs only the evaluation on existing results.
resume: Resumes an interrupted experiment.

Example (running only the jailbreak phase):

python main.py --phase jailbreak
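For readers unfamiliar with this kind of phase flag, the following is a minimal sketch of how a `--phase` option with the four documented values and a `full` default could be parsed. It assumes `argparse` and is not taken from the repository; the actual parsing in main.py may differ, and `PHASES` and `build_parser` are hypothetical names.

```python
import argparse

# Phase names as listed above; "full" is the documented default.
PHASES = ("full", "jailbreak", "judge", "resume")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a single --phase option (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Select which pipeline phase to run.")
    parser.add_argument(
        "--phase",
        choices=PHASES,
        default="full",
        help="Pipeline phase to run (default: full).",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--phase", "judge"])
print(args.phase)
```

Restricting the option with `choices` makes a mistyped phase name fail immediately with a usage message instead of being silently ignored.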

Cite

If you find this repository useful for your research, please consider citing the following paper:

@misc{wei2025trojan,
      title={The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search}, 
      author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Pin-Yu Chen and Olgica Milenkovic and Pan Li},
      year={2025},
      eprint={2512.01353},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2512.01353}, 
}
