Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal
Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients.
Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5× reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability.
Parametric ablations reveal:
- A monotonic dose–response relationship between L0 sparsity and attack success rate
- A layer-dependent defense–utility tradeoff, where intermediate layers balance robustness and clean performance
These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
All experiments use the HarmBench evaluation framework:
- HarmBench prompts: 218 harmful behavior prompts (harmbench_behaviors_text_all.csv)
- Optimizer targets: per-prompt target strings (harmbench_targets_text.json)
- Black-box benchmarks: Salad-Data, Prompt Injections Benchmark, SafeEval (1,500 jailbreak prompts)
- Random baselines: 5 categories of random text suffixes (alphanumeric, mixed case, lowercase, numbers, unicode)
Pre-computed adversarial suffixes and SAE features will be released on HuggingFace.
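As a quick orientation, here is a minimal sketch of loading the two HarmBench files above; the column and key names follow the public HarmBench release and should be checked against your local copy:

```python
import json
import pandas as pd

# 218 harmful behavior prompts; "Behavior" / "BehaviorID" are the column names
# used in the public HarmBench release (verify against your local file).
behaviors = pd.read_csv("data/harmbench_behaviors_text_all.csv")

# Per-behavior optimizer target strings, keyed by behavior ID in the public release.
with open("data/harmbench_targets_text.json") as f:
    targets = json.load(f)

row = behaviors.iloc[0]
print(row["Behavior"], "->", targets[row["BehaviorID"]])
```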
sparse-jailbreak/
├── configs/
│   ├── models.yaml                    # 41 model+SAE configurations
│   ├── attack.yaml                    # GCG + spectral monitoring configs
│   └── feature_extraction/            # Per-model SAE feature extraction configs
│       ├── gemma_9b.yml
│       ├── llama_8b.yml
│       └── mistral_7b.yml
├── scripts/
│   ├── run_gcg.py                     # GCG suffix generation (RQ1, RQ2, RQ3)
│   ├── run_beast.py                   # BEAST suffix generation (RQ1)
│   ├── run_eval.py                    # Suffix evaluation / generation
│   ├── run_feature_extraction.py      # SAE feature extraction (RQ4)
│   ├── run_jaccard_analysis.py        # Jaccard similarity analysis (RQ4)
│   └── run_all_feature_extraction.sh  # Batch runner for all extraction jobs
├── src/
│   ├── __init__.py
│   ├── models.py                      # All SAE intervention model wrappers
│   ├── utils.py                       # Shared utilities
│   └── gcg_spectral.py                # Spectral gradient monitoring (RQ4)
├── suffixes/                          # Pre-computed adversarial suffixes
├── random_suffixes/                   # Random baseline suffixes
├── envs/                              # Conda environment files
├── requirements.txt
└── README.md
All SAE wrappers inherit from `torch.nn.Module` and hook the SAE into `forward()`, ensuring gradients flow through the SAE encode–decode during GCG optimization; a minimal sketch of this hook pattern follows the table below. Six SAE types are supported:
| `sae_type` | Source | Models | Notes |
|---|---|---|---|
| `none` | – | All | Bare HF model (baseline) |
| `goodfire` | Goodfire SAEs | LLaMA-3.1-8B, LLaMA-3.3-70B | Linear enc + ReLU + linear dec |
| `mistral_res` | JoshEngels | Mistral-7B | Norm-scaling (constant=64) |
| `gemma_scope` | Gemma Scope via sae_lens | Gemma-2-2B/9B/27B, Qwen2.5-7B | `SAE.from_pretrained()` |
| `andyrdt_layer` | andyrdt via sae_lens | LLaMA-3.1-8B (layer ablation) | `trainer_1` at each layer |
| `andyrdt_sparsity` | andyrdt via sae_lens + hot-swap | LLaMA-3.1-8B (sparsity ablation) | Hot-swaps `ae.pt` for k=32/128/256 |
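The wrappers themselves live in `src/models.py`; the sketch below is an illustrative reduction of the shared pattern, not the repo's exact code (the class name, layer path, and hook details are assumptions). A forward hook on one decoder layer replaces the residual stream with the SAE reconstruction, and since nothing detaches the tensor, GCG gradients flow through the encode–decode:

```python
import torch

class SAEInterventionModel(torch.nn.Module):
    """Illustrative sketch of an SAE intervention wrapper (not the repo's exact
    implementation): routes the residual stream at one layer through a pretrained
    SAE without blocking gradients, so white-box attacks backpropagate through it."""

    def __init__(self, model, sae, layer_idx):
        super().__init__()
        self.model, self.sae, self.layer_idx = model, sae, layer_idx
        # Hook the output of one decoder layer; the attribute path below matches
        # LLaMA/Mistral/Qwen/Gemma-style HF models and may differ elsewhere.
        block = self.model.model.layers[layer_idx]
        block.register_forward_hook(self._sae_hook)

    def _sae_hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Sparse encode -> decode; kept differentiable on purpose.
        recon = self.sae.decode(self.sae.encode(hidden))
        if isinstance(output, tuple):
            return (recon,) + output[1:]
        return recon

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
```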
| Config | Model | SAE | Layer | Width |
|---|---|---|---|---|
| `gemma_2b_sae` | Gemma-2-2B | Gemma Scope | 12 | 16K |
| `gemma_9b_sae` | Gemma-2-9B | Gemma Scope | 19 | 16K |
| `gemma_27b_sae` | Gemma-2-27B | Gemma Scope | 34 | 131K |
| `llama_8b_sae` | LLaMA-3.1-8B-Instruct | Goodfire | 19 | 65K |
| `llama_70b_sae` | LLaMA-3.3-70B-Instruct | Goodfire | 50 | 65K |
| `mistral_7b_sae` | Mistral-7B | JoshEngels | 16 | 65K |
| `qwen_7b_sae` | Qwen2.5-7B-Instruct | andyrdt | 19 | 131K |
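For the Gemma Scope rows, loading goes through sae_lens's `SAE.from_pretrained()` (as noted in the previous table); a minimal sketch, with placeholder `release`/`sae_id` strings rather than the repo's exact values (see `configs/models.yaml` for those):

```python
import torch
from sae_lens import SAE

# Illustrative Gemma Scope load for Gemma-2-9B; the release/sae_id below are
# placeholders, not necessarily the identifiers used in the paper's configs.
sae = SAE.from_pretrained(
    release="gemma-scope-9b-pt-res-canonical",
    sae_id="layer_19/width_16k/canonical",
    device="cpu",
)
# Some sae_lens versions return (sae, cfg_dict, sparsity) instead of the SAE alone.
if isinstance(sae, tuple):
    sae = sae[0]

x = torch.randn(1, 8, sae.cfg.d_in)  # dummy residual-stream activations
recon = sae.decode(sae.encode(x))    # sparse encode -> decode
```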
Layer placement (Tables 5–6, 13):
gemma_9b_sae_layer{5,10,20,30,35}
llama_8b_sae_layer{3,7,11,15,19,23,27}
qwen_7b_sae_layer{3,7,11,15,19,23,27}
Sparsity (Tables 3–4):
gemma_9b_sae_l20_w{16k,65k,131k} # Gemma width sweep
llama_8b_sae_l19_k{32,64,128,256} # LLaMA top-k sweep
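The `llama_8b_sae_l19_k*` configs reuse one wrapper and hot-swap the SAE's `ae.pt` weights per top-k value; a rough sketch of that idea, where the checkpoint path and state-dict layout are assumptions rather than the repo's actual format:

```python
import torch

def hot_swap_sae_weights(sae, checkpoint_path):
    """Load a different top-k SAE checkpoint (e.g., k=32/128/256) into the same
    wrapper without rebuilding the model. Path and key layout are illustrative."""
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    sae.load_state_dict(state_dict)
    return sae

# Hypothetical usage: swap in the k=128 dictionary before re-running GCG.
# sae = hot_swap_sae_weights(sae, "checkpoints/llama_8b_l19_k128/ae.pt")
```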
# SAE-augmented model
python scripts/run_gcg.py \
--model_config llama_8b_sae \
--harmbench_dir /path/to/data/ \
--output_dir results/suffixes/llama_8b_sae/ \
--batch_path batches/batch_1_of_12.pkl --batch_id 1

# Baseline
python scripts/run_gcg.py \
--model_config llama_8b_base \
--harmbench_dir /path/to/data/ \
--output_dir results/suffixes/llama_8b_base/ --batch_id 1

# BEAST attack (RQ1)
python scripts/run_beast.py \
--model_config gemma_2b_sae \
--harmbench_dir /path/to/data/ \
--output_dir results/beast/gemma_2b_sae/ --batch_id 1

# Layer-placement ablation
for layer in 3 7 11 15 19 23 27; do
python scripts/run_gcg.py \
--model_config llama_8b_sae_layer${layer} \
--harmbench_dir /path/to/data/ \
--output_dir results/layer_ablation/llama_8b_layer${layer}/
done
# Sparsity ablation
for k in 32 64 128 256; do
python scripts/run_gcg.py \
--model_config llama_8b_sae_l19_k${k} \
--harmbench_dir /path/to/data/ \
--output_dir results/sparsity/llama_8b_k${k}/
done
# GCG with spectral gradient monitoring (RQ4)
python scripts/run_gcg.py \
--model_config llama_8b_sae \
--spectral_config spectral_full \
--harmbench_dir /path/to/data/ \
--output_dir results/spectral/llama_8b_sae/ --batch_id 1
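`src/gcg_spectral.py` records spectral statistics of the GCG gradient during optimization; purely as an illustration of the idea (not the repo's implementation), one such statistic is the singular-value spectrum of the one-hot token-gradient matrix:

```python
import torch

def gradient_spectrum(one_hot_grad: torch.Tensor) -> torch.Tensor:
    """Singular-value spectrum of a GCG token-gradient matrix.

    one_hot_grad: (suffix_len, vocab_size) gradient of the attack loss with
    respect to the one-hot suffix tokens. Illustrative only; gcg_spectral.py
    may track different quantities.
    """
    return torch.linalg.svdvals(one_hot_grad.float())

# Example with a dummy gradient: the normalized leading singular values show
# how concentrated the attack's useful directions are.
svals = gradient_spectrum(torch.randn(20, 32000))
print(svals[:5] / svals.sum())
```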
# Evaluate suffixes / generate completions
python scripts/run_eval.py \
--model_config llama_8b_sae \
--suffix_file suffixes/sampled_lightweight_suffixes.pkl \
--harmbench_dir /path/to/data/ \
--output_dir results/generation/

# Extract features (adversarial)
python scripts/run_feature_extraction.py \
--config configs/feature_extraction/gemma_9b.yml

# Extract features (random baselines)
python scripts/run_feature_extraction.py \
--config configs/feature_extraction/gemma_9b.yml \
--pickle_file random_suffixes/text_lowercase_suffixes.pkl \
--output_dir results/features/gemma_9b_text_lowercase
# Or run all 18 jobs at once
bash scripts/run_all_feature_extraction.sh --parallel
# Compute Jaccard similarity
python scripts/run_jaccard_analysis.py \
--feature_dir results/features/gemma_9b/ \
--output_dir results/jaccard/ --top_k 100
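The Jaccard analysis compares which SAE features fire for adversarial versus random suffixes; a minimal sketch of the top-k set comparison (the input format here is an assumption, see `run_jaccard_analysis.py` for the actual files produced by `run_feature_extraction.py`):

```python
def jaccard_top_k(features_a, features_b, top_k=100):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the top-k SAE feature sets
    activated by two suffix types (e.g., adversarial vs. random baselines).

    features_a / features_b: dicts mapping feature index -> aggregate activation.
    This dict format is illustrative only.
    """
    top_a = set(sorted(features_a, key=features_a.get, reverse=True)[:top_k])
    top_b = set(sorted(features_b, key=features_b.get, reverse=True)[:top_k])
    return len(top_a & top_b) / len(top_a | top_b)

# Example: identical feature rankings give similarity 1.0
acts = {i: float(i) for i in range(500)}
print(jaccard_top_k(acts, acts, top_k=100))  # 1.0
```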
Conda environment files are provided in envs/:

# Primary environment (GCG, evaluation, feature extraction)
conda env create -f envs/sae_lens.yml

# SAE steering / intervention experiments
conda env create -f envs/saeSteer.yml

Or install from requirements:
conda create -n sae_robustness python=3.10 -y
conda activate sae_robustness
pip install -r requirements.txt

| Package | Version | Purpose |
|---|---|---|
| `torch` | ≥ 2.1.0 | Core framework |
| `transformers` | ≥ 4.40.0 | Model loading |
| `nanogcg` | ≥ 0.2.0 | GCG attack |
| `sae-lens` | ≥ 4.0.0 | Gemma Scope / andyrdt SAEs |
| `safetensors` | ≥ 0.4.0 | Mistral SAE weights |
| Model | VRAM (approx) |
|---|---|
| Gemma-2-2B + SAE | ~12 GB |
| Gemma-2-9B + SAE | ~40 GB (fp32) |
| LLaMA-3.1-8B + SAE | ~20 GB (bf16) |
| Mistral-7B + SAE | ~20 GB (bf16) |
| Qwen2.5-7B + SAE | ~35 GB (fp32) |
| Gemma-2-27B + SAE | ~60 GB (bf16) |
| LLaMA-3.3-70B + SAE | ~150 GB (bf16) |
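The bf16 rows assume half-precision loading; a minimal sketch with Hugging Face transformers (the model ID and device map are illustrative, and the repo's wrappers handle this via `configs/models.yaml`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative bf16 load corresponding to the ~20 GB LLaMA-3.1-8B row above.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```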
@article{saiyed2026sparseshield,
title = {Towards Understanding the Robustness of Sparse Autoencoders},
author = {Saiyed, Ahson and Sadiekh, Sabrina and Agarwal, Chirag},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026},
}