Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 10 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ SkillSpector helps you answer: **"Is this skill safe to install?"**
## Features

- **Multi-format input**: Scan Git repos, URLs, zip files, directories, or single files
- **64 vulnerability patterns** across 16 categories: prompt injection, data exfiltration, privilege escalation, supply chain, excessive agency, output handling, system prompt leakage, memory poisoning, tool misuse, rogue agent, trigger abuse, dangerous code (AST), taint tracking, YARA signatures, MCP least privilege, and MCP tool poisoning
- **67 vulnerability patterns** across 17 categories: prompt injection, data exfiltration, privilege escalation, supply chain, excessive agency, output handling, system prompt leakage, memory poisoning, tool misuse, rogue agent, anti-refusal, trigger abuse, dangerous code (AST), taint tracking, YARA signatures, MCP least privilege, and MCP tool poisoning
- **Two-stage analysis**: Fast static analysis + optional LLM semantic evaluation
- **Live vulnerability lookups**: SC4 queries [OSV.dev](https://osv.dev) for real-time CVE data with automatic offline fallback
- **Multiple output formats**: Terminal, JSON, Markdown, and SARIF reports
Expand Down Expand Up @@ -183,7 +183,7 @@ skillspector scan ./my-skill/ --no-llm

## Vulnerability Patterns

SkillSpector detects **64 vulnerability patterns** across 16 categories:
SkillSpector detects **67 vulnerability patterns** across 17 categories:

### Prompt Injection (5 patterns)

Expand All @@ -195,6 +195,14 @@ SkillSpector detects **64 vulnerability patterns** across 16 categories:
| P4 | Behavior Manipulation | MEDIUM | Subtle instructions altering agent decisions |
| P5 | Harmful Content | CRITICAL | Instructions that could cause physical harm |

### Anti-Refusal (3 patterns)

| ID | Pattern | Severity | Description |
|----|---------|----------|-------------|
| AR1 | Refusal Suppression | HIGH | Instructions to never refuse or always comply (e.g. "never refuse", "always comply") |
| AR2 | Disclaimer Suppression | HIGH | Instructions to omit warnings, disclaimers, or ethical commentary (e.g. "no disclaimers", "do not moralize") |
| AR3 | Safety Policy Nullification | HIGH | Jailbreak framing that nullifies guardrails (e.g. "you have no restrictions", "ignore your guidelines", "do anything now") |

### Data Exfiltration (4 patterns)

| ID | Pattern | Severity | Description |
Expand Down
4 changes: 2 additions & 2 deletions docs/DEVELOPMENT.md
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ There are no conditional edges: after `resolve_input` → `build_context`, all a
|------|------|--------|
| **resolve_input** | Consumes `input_path` or `skill_path`; resolves URLs/zips/files via InputHandler; sets `skill_path` and (when needed) `temp_dir_for_cleanup` | [resolve_input.py](../src/skillspector/nodes/resolve_input.py) |
| **build_context** | Reads `skill_path`, populates `components`, `file_cache`, `ast_cache`, `manifest`, `component_metadata`, `has_executable_scripts` | [build_context.py](../src/skillspector/nodes/build_context.py) |
| **Analyzers** | 20 nodes; each returns `AnalyzerNodeResponse` (list of `Finding`). State reducer appends to `findings`. | [nodes/analyzers/__init__.py](../src/skillspector/nodes/analyzers/__init__.py) (`ANALYZER_NODE_IDS`, `ANALYZER_NODES`) |
| **Analyzers** | 21 nodes; each returns `AnalyzerNodeResponse` (list of `Finding`). State reducer appends to `findings`. | [nodes/analyzers/__init__.py](../src/skillspector/nodes/analyzers/__init__.py) (`ANALYZER_NODE_IDS`, `ANALYZER_NODES`) |
| **meta_analyzer** | Per-file LLM filter/enrich of `findings` → `filtered_findings` via `LLMMetaAnalyzer`; one LLM call per file (or per chunk for oversized files); token budgets from `constants.py`; falls back when `use_llm` is False | [meta_analyzer.py](../src/skillspector/nodes/meta_analyzer.py), [llm_analyzer_base.py](../src/skillspector/nodes/llm_analyzer_base.py) |
| **report** | Builds SARIF 2.1.0, computes `risk_score`, `risk_severity`, `risk_recommendation`; writes `report_body` from `output_format` (terminal/json/markdown/sarif) | [report.py](../src/skillspector/nodes/report.py) |

Expand Down Expand Up @@ -156,7 +156,7 @@ There are no conditional edges: after `resolve_input` → `build_context`, all a
| `pattern_defaults.py` | Shared pattern metadata (category, explanation, remediation) |
| `static_yara.py` | YARA-based static analyzer |
| `osv_client.py` | OSV.dev API client for live vulnerability lookups (SC4); batch queries with caching and fallback |
| `static_patterns_*.py` | 11 pattern-based analyzers (prompt_injection, data_exfiltration, etc.) |
| `static_patterns_*.py` | 12 pattern-based analyzers (prompt_injection, data_exfiltration, anti_refusal, etc.) |
| `behavioral_ast.py` | AST-based behavioral analyzer (AST1–AST8): detects exec, eval, subprocess, os.system, compile, dynamic import/getattr, and dangerous execution chains |
| `behavioral_taint_tracking.py` | Taint-tracking behavioral analyzer (stub) |
| `mcp_least_privilege.py`, `mcp_tool_poisoning.py`, `mcp_rug_pull.py` | MCP analyzer stubs |
Expand Down
5 changes: 5 additions & 0 deletions src/skillspector/nodes/analyzers/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,9 @@
from skillspector.nodes.analyzers.semantic_security_discovery import (
node as semantic_security_discovery_node,
)
from skillspector.nodes.analyzers.static_patterns_anti_refusal import (
node as static_patterns_anti_refusal_node,
)
from skillspector.nodes.analyzers.static_patterns_data_exfiltration import (
node as static_patterns_data_exfiltration_node,
)
Expand Down Expand Up @@ -80,6 +83,7 @@
"static_patterns_memory_poisoning",
"static_patterns_tool_misuse",
"static_patterns_rogue_agent",
"static_patterns_anti_refusal",
"static_yara",
"behavioral_ast",
"behavioral_taint_tracking",
Expand All @@ -103,6 +107,7 @@
"static_patterns_memory_poisoning": static_patterns_memory_poisoning_node,
"static_patterns_tool_misuse": static_patterns_tool_misuse_node,
"static_patterns_rogue_agent": static_patterns_rogue_agent_node,
"static_patterns_anti_refusal": static_patterns_anti_refusal_node,
"static_yara": static_yara_node,
"behavioral_ast": behavioral_ast_node,
"behavioral_taint_tracking": behavioral_taint_tracking_node,
Expand Down
17 changes: 17 additions & 0 deletions src/skillspector/nodes/analyzers/pattern_defaults.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ class PatternCategory(StrEnum):
YARA_MATCH = "YARA Match"
MCP_LEAST_PRIVILEGE = "MCP Least Privilege"
MCP_TOOL_POISONING = "MCP Tool Poisoning"
ANTI_REFUSAL = "Anti-Refusal"


# Pattern-specific explanations (why the finding is dangerous)
Expand Down Expand Up @@ -119,6 +120,10 @@ class PatternCategory(StrEnum):
"TP2": "Unicode deception detected in skill identifiers or descriptions. Homoglyphs, RTL overrides, or invisible characters can make malicious content appear benign.",
"TP3": "Instruction injection patterns found in parameter descriptions or default values. Parameter metadata is read by LLMs and can override intended behavior.",
"TP4": "Skill description does not match actual code behavior. The declared purpose diverges from what the code actually does, indicating possible deception.",
# Anti-Refusal Statements (jailbreak)
"AR1": "Skill instructs the agent to never refuse or to always comply. Suppressing the agent's ability to decline removes a core safety control and enables downstream harmful requests to succeed.",
"AR2": "Skill instructs the agent to omit warnings, disclaimers, or ethical commentary. Stripping safety caveats hides risk from the user and is a common jailbreak preamble.",
"AR3": "Skill attempts to nullify the agent's safety policies or restrictions ('you have no restrictions', 'ignore your guidelines', 'do anything now'). This is a direct jailbreak that disables guardrails.",
}

# Rule ID -> category (for report output)
Expand Down Expand Up @@ -182,6 +187,10 @@ class PatternCategory(StrEnum):
"TP2": PatternCategory.MCP_TOOL_POISONING.value,
"TP3": PatternCategory.MCP_TOOL_POISONING.value,
"TP4": PatternCategory.MCP_TOOL_POISONING.value,
# Anti-Refusal Statements (jailbreak)
"AR1": PatternCategory.ANTI_REFUSAL.value,
"AR2": PatternCategory.ANTI_REFUSAL.value,
"AR3": PatternCategory.ANTI_REFUSAL.value,
}

# Rule ID -> pattern display name (for report output)
Expand Down Expand Up @@ -245,6 +254,10 @@ class PatternCategory(StrEnum):
"TP2": "Unicode Deception",
"TP3": "Parameter Description Injection",
"TP4": "Description-Behavior Mismatch",
# Anti-Refusal Statements (jailbreak)
"AR1": "Refusal Suppression",
"AR2": "Disclaimer Suppression",
"AR3": "Safety Policy Nullification",
}

# Pattern-specific remediations (how to fix the issue)
Expand Down Expand Up @@ -326,6 +339,10 @@ class PatternCategory(StrEnum):
"TP2": "Replace non-ASCII characters in identifiers with ASCII equivalents. Remove RTL override and invisible formatting characters.",
"TP3": "Remove injection patterns, system tokens, and suspicious content from parameter descriptions and default values.",
"TP4": "Update the skill description to accurately reflect all capabilities, or remove undeclared functionality.",
# Anti-Refusal Statements (jailbreak)
"AR1": "Remove any instruction telling the agent to never refuse or always comply. The agent must retain the ability to decline unsafe, out-of-scope, or harmful requests.",
"AR2": "Remove instructions that suppress warnings, disclaimers, or ethical commentary. Let the agent surface safety-relevant caveats to the user.",
"AR3": "Remove jailbreak framing that nullifies safety policies or restrictions. Skill content must not instruct the agent to ignore its guidelines or operate without guardrails.",
}


Expand Down
172 changes: 172 additions & 0 deletions src/skillspector/nodes/analyzers/static_patterns_anti_refusal.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,172 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Static patterns: anti-refusal statements (AR1-AR3). Node and analyze() in one module.

Anti-refusal statements are instructions embedded in a skill that try to suppress the
agent's ability to refuse, hedge, or apply safety constraints. They are a jailbreak
technique distinct from generic instruction-override: rather than injecting a new task,
they neutralize the model's *refusal behavior* so subsequent harmful requests succeed.

Three families are detected:
AR1 - Refusal suppression: "never refuse", "always comply", "you cannot say no".
AR2 - Disclaimer/caveat suppression: "no warnings", "do not moralize", "skip the ethics".
AR3 - Safety/policy nullification: "you have no restrictions", "ignore your guidelines",
"not bound by any policy", "do anything now".
"""

from __future__ import annotations

import re
import sys

from skillspector.logging_config import get_logger
from skillspector.models import AnalyzerFinding, Location, Severity
from skillspector.state import AnalyzerNodeResponse, SkillspectorState

from . import static_runner
from .common import get_context, get_line_number, is_code_example
from .pattern_defaults import PatternCategory

logger = get_logger(__name__)

ANALYZER_ID = "static_patterns_anti_refusal"

# AR1: Refusal suppression - instruct the agent to never refuse / always comply.
AR1_PATTERNS = [
(r"\b(?:never|do\s+not|don'?t)\s+(?:ever\s+)?(?:refuse|decline)\b", 0.85),
(r"\byou\s+(?:must|will|should|can)\s*(?:not|never)\s+(?:refuse|decline|say\s+no)\b", 0.9),
(r"\byou\s+(?:cannot|can'?t|can\s+not)\s+(?:refuse|decline|say\s+no)\b", 0.85),
(r"\bnever\s+say\s+(?:that\s+)?(?:you\s+)?(?:can'?t|cannot|won'?t|no)\b", 0.85),
(r"\balways\s+(?:comply|obey|answer|respond|agree)\b", 0.7),
(
r"\b(?:you\s+)?must\s+(?:always\s+)?(?:answer|comply\s+with|respond\s+to|fulfill)\s+"
r"(?:every|all|any|each)\b",
0.8,
),
(r"\brespond\s+to\s+(?:every|all|any)\s+(?:request|question|prompt|query|command)\b", 0.6),
(r"\bno\s+request\s+is\s+(?:off[- ]limits|forbidden|too)\b", 0.85),
]

# AR2: Disclaimer / caveat / warning suppression.
AR2_PATTERNS = [
(
r"\b(?:do\s+not|don'?t|never|no)\s+(?:include|add|give|provide|use|append)?\s*"
r"(?:any\s+)?(?:warnings?|disclaimers?|caveats?)\b",
0.8,
),
(
r"\bwithout\s+(?:any\s+)?(?:warnings?|disclaimers?|caveats?|moralizing|moralising|"
r"judg(?:e?ment|ing))\b",
0.8,
),
(r"\b(?:do\s+not|don'?t|never)\s+(?:lecture|moralize|moralise|preach|judge|warn)\b", 0.85),
(r"\b(?:skip|omit|leave\s+out|drop)\s+(?:the\s+)?(?:disclaimer|warning|ethics|safety)\b", 0.8),
(
r"\b(?:no|without)\s+(?:ethical|moral)\s+(?:judg(?:e?ment)|considerations?|commentary|"
r"objections?)\b",
0.8,
),
(r"\b(?:do\s+not|don'?t)\s+(?:apologize|apologise|say\s+sorry)\b", 0.55),
]

# AR3: Safety / policy / restriction nullification (jailbreak framing).
AR3_PATTERNS = [
(
r"\byou\s+have\s+no\s+(?:restrictions?|limitations?|filters?|rules?|guidelines?|"
r"boundaries|limits)\b",
0.9,
),
(
r"\b(?:you\s+are|act\s+as|pretend\s+(?:to\s+be|you'?re))[^.\n]{0,30}"
r"(?:unrestricted|unfiltered|uncensored|no\s+filter)\b",
0.85,
),
(
r"\b(?:ignore|disregard|forget|bypass|override)\s+(?:your\s+)?(?:content\s+)?"
r"(?:policy|policies|guidelines?|safety|restrictions?|filters?)\b",
0.9,
),
(
r"\bnot\s+bound\s+by\s+(?:any\s+)?(?:rules?|policy|policies|guidelines?|restrictions?|"
r"openai|anthropic)\b",
0.9,
),
(r"\bdo\s+anything\s+now\b", 0.8),
(r"\b(?:developer|debug|god|sudo|jailbreak)\s+mode\s+(?:enabled|on|activated|engaged)\b", 0.75),
(r"\bno\s+(?:content\s+)?(?:policy|policies|filters?|restrictions?)\s+appl(?:y|ies)\b", 0.85),
(
r"\b(?:free\s+from|without)\s+(?:any\s+)?(?:safety\s+)?(?:guardrails?|constraints?|"
r"safeguards?)\b",
0.8,
),
]

_RULES = [("AR1", AR1_PATTERNS), ("AR2", AR2_PATTERNS), ("AR3", AR3_PATTERNS)]

# Confidence penalty applied when the match appears inside a code/doc example, and the
# minimum confidence required to emit a finding after the penalty.
_EXAMPLE_PENALTY = 0.4
_MIN_CONFIDENCE = 0.5


def analyze(content: str, file_path: str, file_type: str) -> list[AnalyzerFinding]:
"""Analyze content for anti-refusal statements (AR1-AR3)."""
findings: list[AnalyzerFinding] = []
tag = [PatternCategory.ANTI_REFUSAL.value]

for rule_id, patterns in _RULES:
for pattern, base_confidence in patterns:
for match in re.finditer(pattern, content, re.IGNORECASE | re.MULTILINE):
context = get_context(content, match.start(), context_lines=3)
confidence = base_confidence
if is_code_example(context):
confidence -= _EXAMPLE_PENALTY
if confidence < _MIN_CONFIDENCE:
continue
findings.append(
AnalyzerFinding(
rule_id=rule_id,
message="Anti-Refusal Statement",
severity=Severity.HIGH,
location=Location(
file=file_path,
start_line=get_line_number(content, match.start()),
),
confidence=round(confidence, 2),
tags=tag,
context=context,
matched_text=match.group(0)[:200],
)
)
return _deduplicate_findings(findings)


def _deduplicate_findings(findings: list[AnalyzerFinding]) -> list[AnalyzerFinding]:
"""Keep the highest-confidence finding per (file, line, rule_id)."""
best: dict[tuple[str, int, str], AnalyzerFinding] = {}
for f in findings:
key = (f.location.file, f.location.start_line, f.rule_id)
existing = best.get(key)
if existing is None or f.confidence > existing.confidence:
best[key] = f
return list(best.values())


def node(state: SkillspectorState) -> AnalyzerNodeResponse:
"""Run anti_refusal patterns and return findings."""
findings = static_runner.run_static_patterns(state, [sys.modules[__name__]])
logger.info("%s: %d findings", ANALYZER_ID, len(findings))
return {"findings": findings}
3 changes: 2 additions & 1 deletion tests/nodes/analyzers/test_registry.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
from skillspector.nodes.analyzers import ANALYZER_NODE_IDS, ANALYZER_NODES

# Expected analyzer node IDs per SADD spec workflow reference table.
# Order: static (12), behavioral (2), mcp (3), semantic (3).
# Order: static (13), behavioral (2), mcp (3), semantic (3).
EXPECTED_ANALYZER_NODE_IDS: list[str] = [
"static_patterns_prompt_injection",
"static_patterns_data_exfiltration",
Expand All @@ -33,6 +33,7 @@
"static_patterns_memory_poisoning",
"static_patterns_tool_misuse",
"static_patterns_rogue_agent",
"static_patterns_anti_refusal",
"static_yara",
"behavioral_ast",
"behavioral_taint_tracking",
Expand Down
Loading