Skip to content

feat(analyzer): detect anti-refusal statements (jailbreak preamble)#65

Open
ankushchadha wants to merge 1 commit into
NVIDIA:mainfrom
ankushchadha:feat/anti-refusal-analyzer
Open

feat(analyzer): detect anti-refusal statements (jailbreak preamble)#65
ankushchadha wants to merge 1 commit into
NVIDIA:mainfrom
ankushchadha:feat/anti-refusal-analyzer

Conversation

@ankushchadha

Copy link
Copy Markdown

Closes #64.

What this adds

A new static pattern analyzer, static_patterns_anti_refusal, that detects anti-refusal statements in skills — instructions that suppress the agent's ability to refuse, hedge, or apply safety constraints. This is a jailbreak technique distinct from generic instruction-override (P1): rather than injecting a new task, it neutralizes the model's refusal behavior so subsequent harmful requests succeed.

Three rule families, all HIGH severity:

ID Pattern Examples
AR1 Refusal Suppression "never refuse", "always comply", "you cannot say no"
AR2 Disclaimer Suppression "no warnings/disclaimers", "do not moralize"
AR3 Safety Policy Nullification "you have no restrictions", "ignore your guidelines", "do anything now"

How it fits the pipeline

Follows the existing static_patterns_* convention exactly — analyze() + node() in one module, registered in nodes/analyzers/__init__.py. No graph.py change is needed since edges are built by looping over ANALYZER_NODE_IDS. Findings are deduped per (file, line, rule) and get a confidence penalty when matched inside a code/doc example (reusing the existing is_code_example helper) to reduce false positives.

Relationship to existing rules

Additive, not a replacement. AR1 (refusal suppression) and AR2 (disclaimer suppression) are not covered by any existing rule. AR3 overlaps P1 on some safety-bypass phrasings (e.g. "bypass safety restrictions", "unrestricted mode") — for those, both P1 and AR3 fire and both are retained (dedup is per-rule, not across rules). AR3 also catches framings P1 misses, such as "ignore your guidelines" (P1 requires no possessive) and the declarative "you have no restrictions".

Known limitations (documented as xfail tests)

  • Over-detection: "always comply" in narrative prose, and "warnings" used as a technical term (logs/output) can false-positive. The optional Stage-2 LLM meta-analyzer is the intended backstop.
  • Under-detection: a regex engine won't catch leetspeak, zero-width splits, homoglyphs, or synonym rephrasings.

Both classes are pinned as strict xfail tests so they're visible to reviewers and tracked for a future revision (normalization / semantic variant) rather than silently unknown.

Changes

  • New nodes/analyzers/static_patterns_anti_refusal.py
  • New PatternCategory.ANTI_REFUSAL + AR1–AR3 explanations, names, remediations, and category mapping in pattern_defaults.py
  • Registered the node in analyzers/__init__.py (static analyzers 12 → 13)
  • Unit tests + node test + documented-limitation xfail tests in tests/nodes/analyzers/test_static_patterns_anti_refusal.py
  • Updated registry test expectations and DEVELOPMENT.md counts
  • README: new Anti-Refusal pattern table; counts 64 → 67 patterns / 16 → 17 categories

Testing

  • make test: 610 passed, 11 skipped, 6 xfailed
  • make lint clean; ruff format --check reports all files already formatted
  • DCO sign-off in place

Motivation / reference

Anti-refusal instructions are an empirically demonstrated boundary-defeat mechanism: in controlled experiments across multiple models, an anti-refusal instruction in the system prompt caused an agent to abandon deployer-configured operational boundaries, and removing it eliminated the effect — A. Chadha, When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems, Zenodo preprint (v3, 2026), https://doi.org/10.5281/zenodo.20404651 (see Corrigendum No. 2). Disclosure: I am the author of that preprint.

Add a static pattern analyzer that flags anti-refusal statements in
skills: instructions that suppress the agent's ability to refuse, hedge,
or apply safety constraints. This is a jailbreak technique distinct from
generic instruction-override (P1) -- rather than injecting a new task it
neutralizes the model's refusal behavior so later harmful requests
succeed.

Three rule families:
  AR1 Refusal Suppression        -- "never refuse", "always comply"
  AR2 Disclaimer Suppression     -- "no warnings", "do not moralize"
  AR3 Safety Policy Nullification -- "you have no restrictions",
      "ignore your guidelines", "do anything now"

Findings are HIGH severity, deduped per (file, line, rule), with a
code/doc-example confidence penalty to reduce false positives.

- New analyzer nodes/analyzers/static_patterns_anti_refusal.py
- New PatternCategory ANTI_REFUSAL + AR1-AR3 explanations, names,
  remediations and category mapping in pattern_defaults.py
- Registered node in analyzers/__init__.py (static analyzers 12 -> 13)
- Unit tests + node test in test_static_patterns_anti_refusal.py
- Documented known limitations as xfail tests (2 false positives,
  4 regex-evasion gaps) tracked for a future revision; the optional
  Stage-2 LLM meta-analyzer is the backstop for residual false positives
- README: new Anti-Refusal pattern table; counts 64->67 / 16->17
- Updated registry test expectations and DEVELOPMENT.md counts

make test: 610 passed, 11 skipped, 6 xfailed. ruff check clean.

Signed-off-by: Ankush Chadha <ankushchadha@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(analyzer): detect anti-refusal statements (jailbreak preamble)

1 participant