This repository contains the official implementation of CKA-Agent, a novel approach to bypassing the guardrails of commercial large language models (LLMs) through harmless prompt weaving and adaptive tree search techniques.
Install uv:

    curl -LsSf https://astral.sh/uv/install.sh | sh

Create the environment and install the dependencies:

    uv venv --python 3.12
    source .venv/bin/activate
    uv pip install vllm --torch-backend=auto
    uv pip install accelerate fastchat nltk pandas google-genai "httpx[socks]" anthropic

Configure your experiments by modifying the config/config.yml file. You can control the following aspects:

- Test Dataset: Choose from available datasets like harmbench_cka or strongreject_cka.
- Target Models: Select black-box or white-box models such as gpt-oss-120b or gemini-2.5-xxx.
- Jailbreak Methods: Enable and configure various implemented baseline methods.
- Evaluations: Define evaluation metrics and judge models like gemini-2.5-flash.
- Defense Methods: Apply different defense mechanisms as needed.

For detailed configuration instructions and examples, please refer to the configuration README.
The run_experiment.sh script executes main.py to run the entire experiment pipeline (jailbreak and evaluation) by default.

    ./run_experiment.sh

You can modify the run_experiment.sh script or directly pass arguments to main.py to run specific phases:

- full: Runs the entire pipeline (default).
- jailbreak: Runs only the jailbreak methods.
- judge: Runs only the evaluation on existing results.
- resume: Resumes an interrupted experiment.

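As a rough illustration of how the four phase names relate to each other, here is a minimal Python sketch of a phase selector. It is not the repository's main.py: `PHASE_STEPS`, `build_parser`, and `steps_for_phase` are hypothetical names, and the mapping of each phase to steps (in particular treating resume as continuing both steps) is an assumption based only on the descriptions above.

```python
import argparse

# Phase names are taken from the README; the step lists are assumptions.
PHASE_STEPS = {
    "full": ["jailbreak", "judge"],    # entire pipeline (default)
    "jailbreak": ["jailbreak"],        # attack step only
    "judge": ["judge"],                # evaluation on existing results only
    "resume": ["jailbreak", "judge"],  # assumed: continue whatever remains
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts only the documented phase names."""
    parser = argparse.ArgumentParser(description="Hypothetical phase selector")
    parser.add_argument("--phase", choices=sorted(PHASE_STEPS), default="full")
    return parser


def steps_for_phase(argv: list[str]) -> list[str]:
    """Return the ordered steps implied by the --phase argument."""
    args = build_parser().parse_args(argv)
    return PHASE_STEPS[args.phase]
```

Restricting `--phase` with `choices` means a misspelled phase fails immediately with a usage message instead of silently falling back to the full pipeline.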
Example (running only the jailbreak phase):

    python main.py --phase jailbreak

If you find this repository useful for your research, please consider citing the following paper:

@misc{wei2025wolfsheepsclothingbypassing,
title={A Wolf in Sheep's Clothing: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search},
author={Rongzhe Wei and Peizhi Niu and Xinjie Shen and Tony Tu and Yifan Li and Ruihan Wu and Eli Chien and Olgica Milenkovic and Pan Li},
year={2025},
eprint={2512.01353},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2512.01353},
}