feat(analyzer): detect anti-refusal statements (jailbreak preamble) by ankushchadha · Pull Request #65 · NVIDIA/SkillSpector

ankushchadha · 2026-06-15T06:41:26Z

Closes #64.

What this adds

A new static pattern analyzer, static_patterns_anti_refusal, that detects anti-refusal statements in skills — instructions that suppress the agent's ability to refuse, hedge, or apply safety constraints. This is a jailbreak technique distinct from generic instruction-override (P1): rather than injecting a new task, it neutralizes the model's refusal behavior so subsequent harmful requests succeed.

Three rule families, all HIGH severity:

| ID | Pattern | Examples |
| --- | --- | --- |
| AR1 | Refusal Suppression | "never refuse", "always comply", "you cannot say no" |
| AR2 | Disclaimer Suppression | "no warnings/disclaimers", "do not moralize" |
| AR3 | Safety Policy Nullification | "you have no restrictions", "ignore your guidelines", "do anything now" |
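
For readers who want a feel for what the three families look like as regexes, here is a minimal sketch. The patterns are hypothetical and built only from the example phrases in the table; the analyzer's actual regexes are not reproduced in this description and are likely broader.

```python
import re

# Hypothetical patterns for illustration only, derived from the example
# phrases in the table. They are NOT the analyzer's real regexes.
ANTI_REFUSAL_RULES = {
    "AR1": re.compile(
        r"\b(?:never\s+refuse|always\s+comply|you\s+(?:cannot|can't)\s+say\s+no)\b",
        re.IGNORECASE,
    ),
    "AR2": re.compile(
        r"\b(?:no\s+(?:warnings|disclaimers)|do\s+not\s+moralize)\b",
        re.IGNORECASE,
    ),
    "AR3": re.compile(
        r"\b(?:you\s+have\s+no\s+restrictions|ignore\s+your\s+guidelines|do\s+anything\s+now)\b",
        re.IGNORECASE,
    ),
}


def match_rules(line: str) -> list[str]:
    """Return the IDs of every rule family whose pattern matches the line."""
    return [
        rule_id
        for rule_id, pattern in ANTI_REFUSAL_RULES.items()
        if pattern.search(line)
    ]
```

A single line can trigger more than one family, for example a sentence that both suppresses disclaimers and nullifies safety policy.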

How it fits the pipeline

Follows the existing static_patterns_* convention exactly — analyze() + node() in one module, registered in nodes/analyzers/__init__.py. No graph.py change is needed since edges are built by looping over ANALYZER_NODE_IDS. Findings are deduped per (file, line, rule) and get a confidence penalty when matched inside a code/doc example (reusing the existing is_code_example helper) to reduce false positives.
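
The dedup and confidence-penalty behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the module's code: the `Finding` dataclass, the `looks_like_example` stand-in for `is_code_example`, and the penalty value of 0.3 are all hypothetical.

```python
from dataclasses import dataclass, replace

# Assumed value for illustration; the real penalty is not stated here.
EXAMPLE_CONFIDENCE_PENALTY = 0.3


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; the project's real type may differ."""
    file: str
    line: int
    rule_id: str
    confidence: float


def looks_like_example(text: str) -> bool:
    """Hypothetical stand-in for the project's is_code_example helper."""
    stripped = text.strip().lower()
    return stripped.startswith(">") or stripped.startswith("example:")


def postprocess(raw: list[tuple[Finding, str]]) -> list[Finding]:
    """Keep one finding per (file, line, rule); penalize example-like matches."""
    seen: set[tuple[str, int, str]] = set()
    results: list[Finding] = []
    for finding, source_line in raw:
        key = (finding.file, finding.line, finding.rule_id)
        if key in seen:
            continue  # this rule was already reported on this line
        seen.add(key)
        if looks_like_example(source_line):
            lowered = max(0.0, finding.confidence - EXAMPLE_CONFIDENCE_PENALTY)
            finding = replace(finding, confidence=round(lowered, 2))
        results.append(finding)
    return results
```

Because the key includes the rule ID, two different rules matching the same line both survive, while a repeated match of the same rule on that line is dropped.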

Relationship to existing rules

Additive, not a replacement. AR1 (refusal suppression) and AR2 (disclaimer suppression) are not covered by any existing rule. AR3 overlaps P1 on some safety-bypass phrasings (e.g. "bypass safety restrictions", "unrestricted mode"); for those, both P1 and AR3 fire and both findings are retained, because dedup is per-rule, not across rules. AR3 also catches framings P1 misses, such as "ignore your guidelines" (P1 only matches the phrase without a possessive) and the declarative "you have no restrictions".
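
To make the overlap and the possessive gap concrete, the sketch below uses two deliberately simplified, hypothetical patterns. Neither `P1_LIKE` nor `AR3_LIKE` is the project's actual regex; they only reproduce the behavior described in this section.

```python
import re

# Hypothetical simplifications written only to illustrate the overlap and
# the possessive gap; neither is the project's actual P1 or AR3 regex.
P1_LIKE = re.compile(
    r"\bbypass\s+safety\s+restrictions\b"
    r"|\bunrestricted\s+mode\b"
    r"|\bignore\s+(?:all\s+)?guidelines\b",  # no possessive allowed here
    re.IGNORECASE,
)
AR3_LIKE = re.compile(
    r"\bbypass\s+safety\s+restrictions\b"
    r"|\bunrestricted\s+mode\b"
    r"|\bignore\s+your\s+guidelines\b"
    r"|\byou\s+have\s+no\s+restrictions\b",
    re.IGNORECASE,
)


def rules_fired(text: str) -> list[str]:
    """Report which of the two sketch rules match; both are kept if both fire."""
    fired = []
    if P1_LIKE.search(text):
        fired.append("P1")
    if AR3_LIKE.search(text):
        fired.append("AR3")
    return fired
```

With these sketch patterns, "bypass safety restrictions" is reported by both rules, while "ignore your guidelines" and "you have no restrictions" are reported by the AR3-like rule only.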

Known limitations (documented as `xfail` tests)

Over-detection: "always comply" in narrative prose and "warnings" used as a technical term (logs/output) can both trigger false positives. The optional Stage-2 LLM meta-analyzer is the intended backstop.
Under-detection: a regex engine won't catch leetspeak, zero-width splits, homoglyphs, or synonym rephrasings.

Both classes are pinned as strict xfail tests so they're visible to reviewers and tracked for a future revision (normalization / semantic variant) rather than silently unknown.
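
As a rough idea of what a normalization pass in a future revision might look like, here is a sketch using only the standard library. It is not part of this PR. The leetspeak table, the zero-width character list, and the single `PATTERN` are assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed character sets for illustration; a real pass would need a
# reviewed list. Mapping a code point to None deletes it in str.translate.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Hypothetical single-phrase pattern standing in for an AR1 regex.
PATTERN = re.compile(r"\bnever\s+refuse\b", re.IGNORECASE)


def normalize(text: str) -> str:
    """Return a matching-only copy: NFKC fold, drop zero-width chars, undo leetspeak."""
    text = unicodedata.normalize("NFKC", text)  # folds fullwidth and similar forms
    text = text.translate(ZERO_WIDTH)           # removes zero-width splitters
    return text.translate(LEET)                 # maps common digit/symbol swaps
```

The normalized string should be used for matching only, since the leetspeak mapping also rewrites legitimate digits. NFKC does not fold cross-script homoglyphs such as Cyrillic "е", and no character-level pass addresses synonym rephrasings, so those gaps would remain.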

Changes

New nodes/analyzers/static_patterns_anti_refusal.py
New PatternCategory.ANTI_REFUSAL + AR1–AR3 explanations, names, remediations, and category mapping in pattern_defaults.py
Registered the node in analyzers/__init__.py (static analyzers 12 → 13)
Unit tests + node test + documented-limitation xfail tests in tests/nodes/analyzers/test_static_patterns_anti_refusal.py
Updated registry test expectations and DEVELOPMENT.md counts
README: new Anti-Refusal pattern table; counts 64 → 67 patterns / 16 → 17 categories
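
The reason registration alone is enough, with no graph.py change, can be pictured with the sketch below. Only the name `ANALYZER_NODE_IDS` comes from this PR; the fan-out/fan-in shape, the `build_edges` helper, and the other node IDs are hypothetical.

```python
# Hypothetical subset of the registry; only the anti-refusal ID and the
# ANALYZER_NODE_IDS name come from this PR.
ANALYZER_NODE_IDS = [
    "static_patterns_example_rule",  # hypothetical existing analyzer
    "static_patterns_anti_refusal",  # added by this PR
]


def build_edges(source: str, sink: str, analyzer_ids: list[str]) -> list[tuple[str, str]]:
    """Wire every registered analyzer between a source and a sink node.

    Assumed topology: the real graph.py may connect nodes differently.
    """
    edges = []
    for node_id in analyzer_ids:
        edges.append((source, node_id))  # fan out to the analyzer
        edges.append((node_id, sink))    # fan in to the aggregator
    return edges
```

Under this assumption, appending an ID to the registry list is the only step needed for the new analyzer to receive edges.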

Testing

make test: 610 passed, 11 skipped, 6 xfailed
make lint clean; ruff format --check reports all files already formatted
DCO sign-off in place

Motivation / reference

Anti-refusal instructions are an empirically demonstrated boundary-defeat mechanism: in controlled experiments across multiple models, an anti-refusal instruction in the system prompt caused an agent to abandon deployer-configured operational boundaries, and removing it eliminated the effect — A. Chadha, When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems, Zenodo preprint (v3, 2026), https://doi.org/10.5281/zenodo.20404651 (see Corrigendum No. 2). Disclosure: I am the author of that preprint.

Add a static pattern analyzer that flags anti-refusal statements in skills: instructions that suppress the agent's ability to refuse, hedge, or apply safety constraints. This is a jailbreak technique distinct from generic instruction-override (P1) -- rather than injecting a new task it neutralizes the model's refusal behavior so later harmful requests succeed.

Three rule families:

AR1 Refusal Suppression -- "never refuse", "always comply"
AR2 Disclaimer Suppression -- "no warnings", "do not moralize"
AR3 Safety Policy Nullification -- "you have no restrictions", "ignore your guidelines", "do anything now"

Findings are HIGH severity, deduped per (file, line, rule), with a code/doc-example confidence penalty to reduce false positives.

- New analyzer nodes/analyzers/static_patterns_anti_refusal.py
- New PatternCategory ANTI_REFUSAL + AR1-AR3 explanations, names, remediations and category mapping in pattern_defaults.py
- Registered node in analyzers/__init__.py (static analyzers 12 -> 13)
- Unit tests + node test in test_static_patterns_anti_refusal.py
- Documented known limitations as xfail tests (2 false positives, 4 regex-evasion gaps) tracked for a future revision; the optional Stage-2 LLM meta-analyzer is the backstop for residual false positives
- README: new Anti-Refusal pattern table; counts 64->67 / 16->17
- Updated registry test expectations and DEVELOPMENT.md counts

make test: 610 passed, 11 skipped, 6 xfailed. ruff check clean.

Signed-off-by: Ankush Chadha <ankushchadha@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ankushchadha wants to merge 1 commit into NVIDIA:main from ankushchadha:feat/anti-refusal-analyzer
