Skip to content

Caching ast#1

Open
daniplatform wants to merge 36 commits into
mainfrom
caching-AST
Open

Caching ast#1
daniplatform wants to merge 36 commits into
mainfrom
caching-AST

Conversation

@daniplatform
Copy link
Copy Markdown
Owner

No description provided.

ParzivalHack and others added 30 commits April 13, 2026 19:47
…upply-chain flag, from working as intended :)
Updated the CodeQL workflow to change the cron schedule and modify build mode for Rust.
…ivalHack#28) (ParzivalHack#47)

# GOAL:

fix issue ParzivalHack#28, refactor message error during AST file parsing and add a
new param to enable python SyntaxWarning

## Changes

- CLI & Wizard Mode: Added a new parameter (flag) to enable/disable
SyntaxWarning reporting. This allows users to decide if they want to
treat syntax warnings as blocking issues or ignore them during scans.
- Refactoring get_python_file_asts:
- Improved the logic that captures and reports errors during AST
generation.
- Standardized error messages to make them more descriptive when a file
fails to parse.
- Integrated the new enable_syntax_warnings logic within the core
file-walking loop.
- Created a new test suite test_get_asts.py to verify:
  - Default behavior (warnings ignored).
  - Behavior when warnings are enabled (treated as errors/exceptions).
  - Handling of valid, invalid, and encoding-error files.

---------

Co-authored-by: Tommaso Bona <piergeolo@gmail.com>
Removed redundant description of the setup_cron.sh script.
…nces and metrics, directly in the terminal, inside a ASCII table :)
…alHack#49)

## Summary

Introduces two new mechanisms for controlling rule behavior globally,
then applies them to fix a class of false positives that surface when
scanning mature Python codebases.

### Problem

Scanning a large Django-based project produced 864 findings with `-s
HIGH`. Manual verification showed that **~720 were false positives**:

- **453** from `PY515`/`SHELL645`/`SHELL670` — flagging every
`re.compile()` call as "code compilation"
- **101** from `PY102` — the taint engine firing on generic callable
invocations
- **78** from `PY511`/`JSON612` — `json.loads()` flagged as dangerous
deserialization
- **dozens more** from `AUTH711`, `CSRF747`, `SESS744`, `PATH813`, etc.
— firing on test fixtures and well-known safe patterns

Without filtering, scanning also revealed ~9,000 additional Low-severity
findings from rules that flag every Python built-in (`isinstance`,
`super`, `len`, f-strings, `getattr`, etc.).

### Solution

#### 1. `[defaults]` section in rules TOML

```toml
[defaults]
exclude_file_patterns = ["*tests*", "*fixtures*", "*testdata*", "*conftest*"]
disabled_rule_ids     = ["ISINSTANCE855", "LEN1101", "SUPER1128", ...]
```

- `exclude_file_patterns` — glob list applied to **all rules**;
eliminates the need to repeat `exclude_file_pattern` on every individual
rule
- `disabled_rule_ids` — completely disables noisy rules without deleting
their definitions (easy to re-enable per project)

#### 2. Per-rule `exclude_pattern`

Regex matched against the flagged line; suppresses the finding if it
matches. Fixes rules that are correct in general but have well-known
safe variants:

```toml
[[rule]]
id = "PY107"
ast_match = "Call(func.value.id=yaml, func.attr=load)"
# Safe when Loader=SafeLoader is explicitly passed
exclude_pattern = "Loader\\s*=\\s*(yaml\\.)?(Safe|Base)Loader"
```

### Results on Django 6.1-alpha

| | Before | After |
|---|---|---|
| Total findings (all severities) | 864 (-s HIGH) | **383** |
| `re.compile()` false positives | 453 | **0** |
| `yaml.load(SafeLoader)` false positives | 2 | **0** |
| Test fixture credential false positives | 9 | **0** |
| Session fixation false positives | 1 | **0** |
| Python built-in noise rules | ~9,000 (hidden by -s HIGH) | **0** |
| `pickle.loads()` true positives | 57 → 8 (test files excluded) | **✓
present** |
| `exec()` true positives | 3 | **✓ 3** |

### Files changed

- `src/pyspector/_rust_core/src/rules.rs` — `Defaults` struct,
`Rule::is_file_excluded()`, `exclude_pattern` field
- `src/pyspector/_rust_core/src/analysis/mod.rs` — apply
`disabled_rule_ids` before scan
- `src/pyspector/_rust_core/src/analysis/ast_analysis.rs` — use
`is_file_excluded()` and `exclude_pattern`
- `src/pyspector/_rust_core/src/analysis/config_analysis.rs` — same
- `src/pyspector/rules/built-in-rules.toml` — `[defaults]` section +
per-rule fixes
- `tests/unit/test_false_positive_reductions.py` — 26 tests (each fix
has a suppression test + true-positive retention test)

## Test plan

- [x] `python -m pytest tests/unit/test_false_positive_reductions.py` —
26 tests pass
- [x] Scan a Django project: `pyspector scan /path/to/django` — confirm
`re.compile()`, `yaml.load(SafeLoader)`, and test-fixture findings are
gone
- [x] Confirm `pickle.loads()` and `exec()` findings still present in
production code

Co-authored-by: satoridev01 <info@satori.ci>
…arzivalHack#51)

This is a follow up on minimizing false positives, increasing true
positives and making PySpector taint analysis faster and better for
multiple repositories to speed up and reduce the sample to noise ratio.

## Rules

142 rules were deleted, 2 were disabled, 28 were added and 41 were
modified. There is a total of 127 rules.

### New Rules

| Rule | What it detects | Severity | Confirmed TP |
|---|---|---|---|
| `SSTI001` | `render_template_string(user_input)`,
`env.from_string(tainted)` | Critical | pygoat |
| `ORM001` | SQLAlchemy `text(f"SELECT...{var}")` | Critical | — |
| `ORM002` | Django `raw()`, `order_by(tainted)`, `extra(tainted)`
(CVE-2021-35042) | Critical | django |
| `DESER725` | `jsonpickle.decode()` | Critical | — |
| `DESER726` | `dill.loads()` | Critical | — |
| `DESER_JOBLIB001` | `joblib.load()` — ML model deserialization via
pickle | Critical | sklearn ×11 |
| `DESER_NUMPY001` | `numpy.load(allow_pickle=True)` | Critical |
tensorflow ×1 |
| `DESER_TORCH001` | `torch.load()` without `weights_only=True` |
Critical | — |
| `TLS001` | `requests.get(url, verify=False)`, `ssl=False` | High |
stock |
| `SSH001` | Paramiko `AutoAddPolicy()` — SSH MITM | High | — |
| `JWT001` | `jwt.decode(options={"verify_signature": False})` | High |
pygoat |
| `ZIPSLIP001` | `extractall()` without path validation | High | cpython
×4, ansible ×2 |
| `XXE001` | `lxml.etree.parse()` without `resolve_entities=False` |
High | — |
| `FLASK001` | `app.run(debug=True)` | Critical | pygoat, ivpa |
| `OPEN_REDIRECT001` | `redirect(tainted_url)`,
`HttpResponseRedirect(tainted)` | High | — |
| `PLAIN_PWD001` | `Model.objects.create(password=tainted)` — plaintext
DB storage | Critical | pygoat, ivpa |
| `DJANGO_DEBUG001` | `DEBUG = True` in settings (Django and Flask) |
Critical | pygoat ×2, flask |
| `ENV_URL001` | `os.environ.get("*_URL")` as HTTP endpoint — SSRF (AST
rule) | High | semgrep ×2 |
| `COOKIE_FILE001` | Env var used as cookie jar file path | High | — |
| `ENV_GIT_URL001` | CI env var URL → `git fetch` — CI token
exfiltration (AST rule) | High | semgrep ×1 |
| `RUAMEL_UNSAFE001` | `YAML(typ="unsafe")` | Critical | — |
| `SQL_CONCAT001` | `"SELECT..." + user_var` — SQL via string
concatenation | High | pygoat ×5, ivpa ×1 |
| `HARDCODED_PWD001` | `PASSWORD = 'literal'` at module level | High |
ivpa |
| `SHELL_BYPASS001` | `subprocess.run(["bash", "-c", user_cmd])` — shell
bypass | High | — |
| `PY306_CACHE` | `pickle.loads()` in cache backends — cache poisoning →
RCE | Critical | django ×6 |
| `G101B` | Uppercase secret constants (`SECRET_KEY`, `API_KEY` ≥ 16
chars) | High | pygoat ×3 |
| `DESER724` | `types.FunctionType()` from deserialized bytecode —
arbitrary code execution | Critical | — |
| `SANDBOX307` | `object.__subclasses__()` traversal — Python sandbox
escape | Critical | — |
| `SANDBOX308` | `__init__.__globals__` access — Python sandbox escape
via global namespace | Critical | — |

### Modified Rules

| Rule | Change | Impact |
|---|---|---|
| `ADMIN795` | `exclude_pattern` — reduced FPs on test credentials | |
| `BACKUP801` | Pattern requires word char before extension (`\w\.bak`);
excludes `.rst/.md` | Eliminated 7 FPs in cpython docs |
| `CRYPTO708` | Extended to `random.choices()`, `random.sample()`,
`random.randrange()` | Catches API key generation with weak PRNG |
| `DELATTR834` | Converted from AST pattern to taint sink (`delattr(obj,
tainted_attr)`) | |
| `DESER723` | `description`, `remediation` — clarified marshal.loads
risk | |
| `FORMAT864` | Converted from AST pattern to taint sink
(`.format(tainted)`) | |
| `G101` | `exclude_pattern` — added test/fixture exclusions | |
| `G103` | Excludes `def` lines (API param defaults) and chained
assignments | Eliminated 4 FPs in ftplib, netrc |
| `GETATTR828` | `exclude_file_pattern =
"*serializer*,*schema*,*/pandas/core/*,*/pandas/io/*"` | Eliminated 22
FPs in pandas, 9 in django |
| `GLOBALS843` | Removed subscript match — only exec/eval with globals()
| Eliminated FPs from module attribute registration |
| `HASH807` | Activated with broader context exclusions (was disabled) |
|
| `HTTPS789` | `exclude_file_pattern` — excluded test files | |
| `IMPORT825` | `exclude_pattern`, `remediation` — reduced test
discovery FPs | |
| `LOG741` | `description`, `severity`, `remediation`, `pattern` —
narrowed to log injection | |
| `OAUTH774` | `exclude_pattern` — reduced FPs on OAuth callbacks | |
| `OPEN1149` | Converted from AST pattern to taint sink; severity and
confidence updated | |
| `OPEN_REDIRECT001` | `exclude_file_pattern` for Django contrib/views
(relative + absolute paths) | Eliminated 15 FPs in django framework code
|
| `ORM001` | Word boundary on `text` keyword; same migration exclusions
| Eliminated 29 FPs in django (`gettext(...)`) |
| `PATH813` | `exclude_pattern` — reduced FPs on safe path joins | |
| `PERM650` | Converted from regex pattern to taint sink for SQL
injection | |
| `PY002` | `exclude_file_pattern = "*/cache/backends/*"` | Cache
backends covered by PY306_CACHE — prevents double-reporting |
| `PY101` | `exclude_file_pattern =
"*/migrations/*,*/alembic/*,*/backends/*"` | Eliminated 69 FPs in django
(ORM DDL infrastructure) |
| `PY103` | Converted from AST pattern to taint sink
(`os.system(tainted)`) | |
| `PY105` | Converted from regex to taint sink (`mark_safe(tainted)`) |
|
| `PY106` | `ast_match` — tightened subprocess shell=True detection | |
| `PY107`/`PY302` | `file_content_exclude = "from ruamel.yaml\|import
ruamel"` — new per-file content exclusion mechanism | Eliminated 14 FPs
in semgrep (all ruamel.yaml safe usage) |
| `PY201` | Extended exclude for MD5 checksum contexts | Eliminated TF
and pandas checksum FPs |
| `PY202` | `exclude_pattern` — excluded SHA1 in non-crypto contexts | |
| `PY507` | Converted from regex to taint sink (`.exec(tainted)`) | |
| `RAND810` | Converted from AST pattern to taint sink
(`random.seed(tainted)`) | |
| `REGEX870` | `description`, `pattern`, `exclude_pattern`,
`remediation` — ReDoS narrowed | |
| `SEC501` | Excludes quoted references, definitions, and method calls
on `.exec` | Eliminated docstring FPs across all repos |
| `SER522` | Converted from regex pattern to taint sink | |
| `SETATTR831` | Converted from AST pattern to taint sink (`setattr(obj,
tainted_attr, val)`) | |
| `SHELL631` | Converted from regex pattern to taint sink for SQL
injection | |
| `SHELL675` | Converted from regex pattern to taint sink for SQL
interpolation | |
| `SHELL689` | Converted from regex pattern to taint sink for subprocess
| |
| `SQL586` | Converted from regex pattern to taint sink for SQL
formatting | |
| `SQL693` | Converted from regex pattern to taint sink for SQL execute
| |
| `SYMLINK816` | `description`, `pattern`, `remediation` — symlink
traversal clarified | |
| `TIMING759` | Excludes null-check patterns | Eliminated timing oracle
FPs from presence checks |
| `TLS001` | Extended exclude for internal array operations | Eliminated
6 pandas internal FPs |
| `TOKEN771` | `description`, `confidence`, `exclude_pattern`,
`remediation` — JWT expiry check refined | |
| `ZIPSLIP001` | Added safe-filter exclusion; excludes regex string
accessor | Eliminated satori-cli (Python 3.12 safe filter) + pandas ×4
FPs |

### Taint Engine Changes (`taint_analysis.rs`)

| Area | What changed |
|---|---|
| Sources | The engine now recognizes more entry points as
attacker-controlled: HTTP handler parameters detected via route
decorators, HTTP client responses, and file contents loaded via
deserialization functions. File contents are always treated as
potentially attacker-controlled even when the file path itself was
chosen by the operator — this is what enables supply-chain detection. |
| Origin model | Not all external input is equally dangerous. Data
coming from CLI arguments, environment variables, or deployment
configuration is operator-supplied and treated as trusted. Data coming
from HTTP requests or deserialized file contents is attacker-controlled.
Sinks only fire on the latter, which eliminates an entire class of false
positives on CLI tools without touching any individual rule. |
| Propagation | Taint now follows data across function call boundaries,
through class attributes, and through control flow constructs like
loops, context managers, and exception handlers. Previously, taint was
lost as soon as data crossed a function boundary or was stored in an
object. |
| Sanitizers | The engine recognizes functions that clean data —
database query escaping clears SQL taint, HTML escaping clears HTML
taint. Partially sanitized data (e.g., HTML-escaped but not
shell-escaped) does not get promoted to fully clean. |
| Performance | Analysis skips functions that have no tainted data
flowing through them, runs the convergence loop and final pass in
parallel, and caches control flow graphs between iterations. Combined
with the call graph improvements, this reduced scan time on large repos
by 2–5×. |

### New Infrastructure

| Feature | Description |
|---|---|
| `file_content_exclude` field on Rule | Per-file content regex checked
ONCE before any analysis — prevents rule from firing on files that
import a specific library |
| Comma-separated `exclude_file_pattern` | Was treated as a single
literal pattern; now split on comma — fixed all multi-pattern exclusions
that were silently not working |
| `vulnerable_keyword` on TaintSinkRule | Sink only fires for a specific
named kwarg (e.g., `create(password=tainted)`) — prevents positional arg
FPs |
| CLI vs HTTP taint origin | `@app.command()` / Click / Typer parameters
→ `TaintOrigin::OperatorConfig`; HTTP request parameters →
`TaintOrigin::HttpRequest`. Operator-supplied paths are not injection
vectors. FILE_DESERIALIZER results always produce `HttpRequest`
regardless of file path origin, preserving supply-chain detection. |
| `sys.argv` / `os.environ` → `OperatorConfig` | `sys.argv[n]` and
`os.environ.get()` now produce `TaintOrigin::OperatorConfig`. Eliminates
PY305 FPs on stdlib tools (timeit, pdb, runpy) and Django management
shell. |
| Duplicate rule consolidation | 8 groups of rules shared identical
patterns — each location was firing 2-5× for the same vulnerability.
Duplicates deleted; one canonical rule per pattern remains. |

---

## Taint Input Model

### How input sources are classified

| Origin | `TaintOrigin` | Is attacker-controlled? | Example |
|---|---|---|---|
| HTTP request parameters | `HttpRequest` | Yes |
`request.POST.get("q")`, `request.args["id"]`, FastAPI path/query params
|
| HTTP request headers/cookies | `HttpRequest` | Yes |
`request.COOKIES["session"]`, `request.headers["X-Token"]` |
| File contents (deserializers) | `HttpRequest` | Yes — supply chain |
`json.load(f)`, `yaml.load(f)`, `pickle.load(f)`, `toml.load(f)` — even
if `f` came from a CLI-specified path |
| CLI arguments | `OperatorConfig` | No — operator-trusted |
`@app.command()` params (Typer), `@click.argument()`, `@click.option()`,
`sys.argv[n]` |
| Environment variables | `OperatorConfig` | No — in web app threat
model | `os.environ.get("DB_URL")` — set by deployment operator |
| Environment variables (CI) | — (AST rule only) | Yes — in
CI/supply-chain threat model | `os.environ.get("SEMGREP_URL")` — in
GitHub Actions, env vars can be set by PR authors via workflow triggers;
`ENV_URL001` / `ENV_GIT_URL001` catch this regardless of taint origin |
| Hardcoded literals | `DeveloperDefined` | No | String constants,
integer literals |

---

## Benchmark Scans: Previous vs Current

| Repo | Files | Funcs OLD | Funcs NEW | Time OLD | Time NEW | Findings
OLD | Findings NEW | TP est. NEW | FP est. NEW | S/N NEW |
|---|---|---|---|---|---|---|---|---|---|---|
| django/django | 2,876 | 26,964 ⚠️ | 7,137 | N/A¹ | 99s | N/A | 68 |
~22 | ~46 | 32% |
| pallets/flask | 78 | 1,139 | 315 | 4.6s | 1.2s | 27 | 7 | 5 | 2 | 71%
|
| pandas-dev/pandas | 537 | 7,934 | 7,171 | 549s | 64s | 412 | 15 | 8 |
7 | 53% |
| scikit-learn/scikit-learn | 743 | 3,811 | 3,725 | 152s | 29s | 135 |
41 | ~37 | ~4 | 90% |
| psf/requests | 37 | 623 | 227 | 3.1s | 0.6s | 11 | 5 | 3 | 2 | 60% |
| parzivalhack/pyspector | 19 | 145 | 109 | 3.2s | 3.9s | 23 | 4 | 4 | 0
| 100% |
| satorici/satori-cli | 42 | 190 | 190 | 3.9s | 0.9s | 29 | 3 | 2 | 1 |
67% |
| fastapi/fastapi | 1,109 | 4,376 | 875 | 32.2s | 3.9s | 69 | 0 | 0 | 0
| — |
| adeyosemanputra/pygoat | 80 | 173 | 173 | 2.8s | 0.75s | 116 | 72 | 68
| 4 | 94% |
| mukxl/Intentionally-Vulnerable-Python-Application | 1 | 6 | 6 | 0.2s |
0.28s | 8 | 7 | 7 | 0 | 100% |
| ansible/ansible | 1,772 | 9,504 ⚠️ | 4,416 | N/A¹ | 28s | N/A | 124 |
~55 | ~69 | 44% |
| python/cpython | 1,424 | — ⚠️ | 14,599 | N/A¹ | 150s | N/A | 274 | ~60
| ~214 | 22% |
| tensorflow/tensorflow | 2,266 | — ⚠️ | 16,974 | N/A¹ | 134s | N/A | 29
| ~18 | ~11 | 62% |
| semgrep/semgrep | 706 | 2,040 | 1,342 | 37.0s | 17s | 139 | 11 | 7 | 4
| 64% |

> ¹ The previous version has no test file exclusion — call graphs of
9,504–26,964 functions cause OOM/timeout. New branch excludes test
files, reducing function counts by 50–70%.

### True positives by repo (to be analyzed)

| Repo | Confirmed TPs | Key findings |
|---|---|---|
| adeyosemanputra/pygoat | ~68 | CSRF×25, timing×7, pickle×4, eval×3,
FLASK001, PLAIN_PWD001, DJANGO_DEBUG001×2 |
| mukxl/Intentionally-Vulnerable-Python-Application | 7 | PY002
(pickle), HARDCODED_PWD001, timing, FLASK001 |
| django/django | ~22 | PY306_CACHE×6 (cache poisoning→RCE), ORM002×3,
PY106 |
| scikit-learn/scikit-learn | ~37 | DESER_JOBLIB001×11, pickle×4,
ZIPSLIP001×1, HASH807×1 |
| ansible/ansible | ~55 | SHELL602×7, ZIPSLIP001×1, PY305×3
(strategy/collection loader), PY002×4 |
| semgrep/semgrep | 7 | ENV_URL001×2 (SEMGREP_URL SSRF),
ENV_GIT_URL001×1 (CI token theft), OPEN1149×1, HASH807×4 (SHA-256 for
token hashing) |
| python/cpython | ~60 | DESER723×3 (marshal/zipimport), ZIPSLIP001×2,
SSRF_001×2, PY002 (IDLE RPC), IMPORT825 (logging config) |
| tensorflow/tensorflow | ~18 | LOG741×7 (log injection), DESER723/724
(bytecode), DESER_NUMPY001×1, HASH807×3 |
| psf/requests | 3 | TIMING759×2 (password `==` in auth), G405 |
| pallets/flask | 5 | exec() in from_pyfile(), SHA1, DJANGO_DEBUG001 |
| satorici/satori-cli | 2 | SSRF_001×2 (API response URL used in HTTP
client) |
| parzivalhack/pyspector | 4 | Supply-chain: PATH813 + OPEN1149×2
(aipocgen.py, json.load config), HASH807×1 |

---

## Signal-to-noise ratio

### New

```
mukxl/Intentionally-Vulnerable-Python-Application  ████████████████████ 100% — all vulns caught
parzivalhack/pyspector                             ████████████████████ 100% — 4 supply-chain TPs
adeyosemanputra/pygoat                             ███████████████████░  94% — ground truth
scikit-learn/scikit-learn                          █████████████████░░░  90% — DESER_JOBLIB001 ×11
pallets/flask                                      ██████████████░░░░░░  71% — exec() intentional
satorici/satori-cli                                █████████████░░░░░░░  67% — SSRF TPs confirmed
semgrep/semgrep                                    █████████████░░░░░░░  64% — CI security + HASH807
tensorflow/tensorflow                              ████████████░░░░░░░░  62% — log injection + bytecode
psf/requests                                       ████████████░░░░░░░░  60% — timing oracle in auth
fastapi/fastapi                                    ████████████████████  n/a — zero findings (true negatives)
pandas-dev/pandas                                  ██████░░░░░░░░░░░░░░  53% — GETATTR828 delegation
ansible/ansible                                    █████████░░░░░░░░░░░  44% — automation attack surface
django/django                                      ██████░░░░░░░░░░░░░░  32% — ORM infrastructure FPs
python/cpython                                     █████░░░░░░░░░░░░░░░  22% — interpreter by design
```

### Old (repos that completed)

```
mukxl/Intentionally-Vulnerable-Python-Application  ████████████████████ 100% (same)
adeyosemanputra/pygoat                             █████████████░░░░░░░  ~67% (INPUT1143, XSS517 FPs)
scikit-learn/scikit-learn                          ████░░░░░░░░░░░░░░░░  ~22% (CENTER927, CRYPTO708 noise)
pandas-dev/pandas                                  ██░░░░░░░░░░░░░░░░░░   ~2% (412 findings, ~8 real)
fastapi/fastapi                                    ░░░░░░░░░░░░░░░░░░░░   ~0% (69 findings, 0 real)
semgrep/semgrep                                    ██░░░░░░░░░░░░░░░░░░   ~3% (139 findings, ~4 real)
satorici/satori-cli                                ███░░░░░░░░░░░░░░░░░   ~7% (29 findings, ~2 real)
psf/requests                                       █████░░░░░░░░░░░░░░░  ~27% (11 findings, 3 real)
```

---

## Files changed

- `src/pyspector/rules/built-in-rules.toml` — 142 rules deleted, 25
added, 41 modified, 2 activated; net: 269 → 127 rules
- `src/pyspector/_rust_core/src/analysis/taint_analysis.rs` — taint
engine: CLI vs HTTP origin, sys.argv/os.environ → OperatorConfig, dead
function removed
- `src/pyspector/_rust_core/src/graph/call_graph_builder.rs` — O(1) call
resolution, test/docs file exclusion
- `src/pyspector/_rust_core/src/analysis/ast_analysis.rs` — per-file
exclusion pre-filter, unused import removed
- `src/pyspector/_rust_core/src/analysis/mod.rs` — phase timing,
parallel scanning
- `src/pyspector/_rust_core/src/rules.rs` — `file_content_exclude`,
`vulnerable_keyword`, comma-split patterns
- `src/pyspector/cli.py` — per-phase timing instrumentation
- `src/pyspector/reporting.py` — severity serialization fixed (was
uppercasing "HIGH", now preserves "High")
- `src/pyspector/triage.py` — unused import removed
- `tests/unit/` — 168 tests, all passing (including previously broken
reporting_test.py)

## Tests changed

```
168 passed, 0 failed  (was: 116 on main; reporting_test.py had 2 pre-existing failures now fixed)
+52 new tests covering new rules, engine changes, taint origins, and deduplication
```

Co-authored-by: satoridev01 <info@satori.ci>
Co-authored-by: Tommaso Bona <piergeolo@gmail.com>
satoridev01 and others added 6 commits May 27, 2026 11:27
…ass (ParzivalHack#54)

## Summary

This PR adds an explicit `--debug` flag and makes the default output
focused on findings, warnings and errors.

**[`--debug`]**: identical to the old default output

### Changes

- **`pyspector scan --debug`**: new flag. Default output keeps the
banner (name, version, credits, joke), findings, warnings and errors.
`--debug` re-enables the previous verbose output, including `println!`
lines from the Rust core (silenced by redirecting fd 1 to `/dev/null`
around `run_scan` / `scan_supply_chain`).
- **Dynamic version in the banner**: the displayed version is now read
via `importlib.metadata.version("pyspector")` instead of being hardcoded
in `cli.py`, so it stays in sync with `setup.cfg` on every bump.
- **Honor `exclude` in the Python pre-pass**: `get_python_file_asts` now
filters `path.glob("**/*.py")` against the config's `exclude` list.
Previously only the Rust core respected it, so the walker still entered
`.venv/` / `node_modules/` and emitted a `Info: Skipped` per file inside
them. New `_is_path_excluded` helper matches patterns against relative
path, absolute path and individual path components.
- **Default excludes**: added `node_modules`, `bower_components`,
`vendor` to `DEFAULT_CONFIG.exclude` in `config.py` alongside the
existing `.venv` / `__pycache__` / `build` / `dist` / `*.egg-info` /
`venv` entries.

### Before / After

Scanning a project containing a `.venv/`:

**Before** (truncated — actually printed ~150 `Info: Skipped` lines from
setuptools/pbr tests inside .venv):

```
Info: Skipped .venv/lib/python3.13/site-packages/pbr/tests/test_version.py (test file or fixture)
Info: Skipped .venv/lib/python3.13/site-packages/pbr/tests/test_packaging.py (test file or fixture)
...
[*] Starting PySpector scan on '.'
[*] Successfully parsed 73 Python files in 0.41s
[*] Disabled 2 rules via [defaults].disabled_rule_ids
[*] Starting analysis with 125 rules
[*] Pattern/config scan: 0.00s → 3 issues
[*] AST analysis: 0.00s → 1 issues
[*] Building call graph from 73/73 files ...
... (~25 more lines)
[+] Rule ID: SHELL602
    ...
```

**After** (default):

```
  o__ __o   ... (banner)
Version: 0.1.9
Made with <3 by github.com/ParzivalHack

💡 I'd tell you a joke about NAT but I would have to translate.


============================================================
  HIGH (4 issues)
============================================================

[+] Rule ID: SHELL602
    ...
```

Co-authored-by: satoridev01 <info@satori.ci>
…Hack#55)

## Summary

Adds detectors for 24 common credential formats (AWS, GitHub, GitLab,
Slack, Stripe, Google, OpenAI, Anthropic/Claude, SendGrid, PostHog, NPM,
PyPI, Discord, Telegram, DigitalOcean, Doppler, Cloudflare, Heroku,
HubSpot, Fastly, plus DB-connection-string and basic-auth-URL detectors)
and significantly reduces false-positive noise from the existing `G101`
/ `G101B` / `G102` / `G103` / `G104` / `AI404` rules by extending their
`exclude_pattern` and `exclude_file_pattern` lists. Validated against a
763-repo corpus side-by-side with TruffleHog.

### New
| Rule | Provider | Format |
|---|---|---|
| G110 | AWS | `(AKIA|ASIA|AIDA|AROA|AGPA|ANPA|ANVA|ASCA)[0-9A-Z]{16}` |
| G111 | GitHub | `(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9]{36}` and
`github_pat_[A-Za-z0-9_]{82}` |
| G112 | GitLab | `glpat-[A-Za-z0-9_-]{20}` |
| G113 | Slack token | `xox[abprso]-[A-Za-z0-9-]{10,}` |
| G114 | Slack webhook |
`https://hooks.slack.com/services/T<id>/B<id>/<token>` |
| G115 | Stripe | `(sk|rk)_(live|test)_[A-Za-z0-9]{24,}` |
| G116 | Google | `AIza[A-Za-z0-9_-]{35}` |
| G117 | OpenAI | `sk-[A-Za-z0-9]{48}` and
`sk-(proj|svcacct|admin|None)-[A-Za-z0-9_-]{20,}` |
| G118 | Anthropic / Claude |
`sk-ant-(api|admin|sid)\d{2}-[A-Za-z0-9_-]{80,110}` |
| G119 | SendGrid | `SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}` |
| G120 | PostHog | `phc_[A-Za-z0-9]{40}` |
| G121 | Database URL with creds |
`(postgres(ql)?|mysql|mongodb(\+srv)?|mariadb|redis|rediss|amqp|amqps|mssql|oracle)://user:pass@…`
|
| G122 | JWT in code | `eyJ…\.eyJ…\.[A-Za-z0-9_-]+` (3-part) |
| G123 | Basic-auth URL | `https?://user:pass@host…` (password forbidden
to contain `/` — eliminates JS-stack-trace FPs) |
| G124 | NPM | `npm_[A-Za-z0-9]{36}` |
| G125 | PyPI | `pypi-AgEIcHlwaS5vcmc[A-Za-z0-9_-]{50,}` |
| G126 | Discord bot | `[MN][A-Za-z0-9]{23}\.[\w-]{6}\.[\w-]{27}` |
| G127 | Telegram bot | `\d{8,10}:[A-Za-z0-9_-]{35}` |
| G128 | DigitalOcean | `(dop|doo|dor)_v1_[a-f0-9]{64}` |
| G129 | Doppler |
`dp\.(pt|st|ct|scim|audit|prov|sa)\.[A-Za-z0-9_-]{30,}` |
| G130 | Cloudflare | OCA Key: `v1\.0-[a-f0-9]{32}-[a-f0-9]{146}` +
40-char tokens near "cloudflare" keyword |
| G131 | Heroku | UUID near "heroku" keyword (legacy format) |
| G132 | HubSpot | `pat-(na1|na2|na3|eu1)-<UUID>` private app + legacy
UUID near "hubspot" |
| G133 | Fastly | 32-char token near "fastly" keyword |

### Changed

- **`AI404`** (Hugging Face): pattern tightened to require at least 16
consecutive alphanumeric chars after `hf_`. Eliminates placeholder FPs
like `hf_token`, `hf_X`, `hf_xxx_your_token`, `hf_....`. Doctest lines
(`>>>` / `...`) excluded.

- **`G104`** (JWT secret): pattern now requires `≥16` non-quote chars in
the value (previously `.+` matched literal field-name values like
`"kb_jwt"`). `exclude_pattern` added: `your_`,
`change-(me|in-production)`, `default-secret`, `do-not-share`, `demo-`,
`never-(hardcode|use)`.

- **`G101`** (broad password/secret): `exclude_pattern` extended to
suppress:
- common placeholder values: `your_`, `insert_`, `example_`,
`placeholder`, `change-me`, `replace-me`, `todo`, `fake`, `dummy`,
`sample`, `demo`, `server_api_key`, `api_key_secret`, `my_password`,
`root_password`
  - values ending in `_here` / containing `*_HERE`
  - all-uppercase placeholder-name strings like `"YOUR_OPENAI_API_KEY"`
- lines starting with `print(`, `click.echo(`, `sys.stderr.`
(instructional output)
  - doctest lines (`>>>` / `...`)

- **`G101B`** (uppercase const secret): same placeholder /
instructional-line / doctest exclusions.

- **`G102`** (private key block): added `exclude_file_pattern =
"*.md,*.rst,*.html,*.txt,*.adoc,*.tex,*.ipynb"`. Documentation /
walkthrough / knowledge-base content showing `-----BEGIN … PRIVATE
KEY-----` as an example was a 100% FP source in our corpus.

- **`G103`** (blank password): `exclude_pattern` adds
`^\s*[A-Z][A-Z0-9_]+\s*=` (Django/Flask uppercase config defaults like
`EMAIL_HOST_PASSWORD = ""` are intentionally overrideable from env).
`exclude_file_pattern` adds `*settings*.py,*config*.py`.

- **`G117`**, **`G113`**: explicit `-your-`, `-here\b`, `-replace-`
substring excludes catch patterns like `xoxb-your-slack-bot-token` and
`sk-svcacct-your-embedding-key-here`. `exclude_file_pattern` adds
`*.env.example,*.env.template,*.env.sample,*.env.dist,env.example`.

### G121 / G121L — production vs dev-default split

`G121` (Critical / High) now excludes connection strings whose host is
one of the well-known local/docker-compose names (`localhost`,
`127.0.0.1`, `0.0.0.0`, `::1`, `host.docker.internal`, `db`, `database`,
`postgres(ql)`, `mysql`, `mariadb`, `mongo(db)`, `redis`, `rabbitmq`,
`broker`, `kafka`, `memcached`, `amqp`). Host tokens are matched only
when followed by a URL-component terminator (`:`, `/`, `?`, `#`, quote,
whitespace), so substrings like `db.prod.example.com` still hit G121 —
only standalone host tokens like `@db:5432` get downgraded.

**`G121L`** (new, Low / Low) covers the dev-default class: same
connection-string shape, but only when the host is one of those
local/container names. This converts the dominant remaining G121 FP
class —
`postgresql://guaardvark:guaardvark@localhost:5432/guaardvark`-style
local-dev defaults — into a separate, low-priority signal that an
analyst can choose to ignore or batch-review, without dropping the
finding entirely (it is still a literal hardcoded credential).

`[defaults].exclude_pattern_placeholder` now declares the
placeholder/dummy-secret regex
(`(?i)EXAMPLE|FAKE|PLACEHOLDER|SAMPLE|x{10,}|0{10,}|1{10,}|abcdefghij|1234567890abcdef|AbCdEfGhIjKlMnOp|f3a8b2c1`)
in one place. Each rule's `exclude_pattern` references it via the
sentinel `__SHARED_PLACEHOLDERS__`, which `get_default_rules()` (in
`pyspector/config.py`) string-substitutes before handing the TOML text
to the Rust core. Adding a new placeholder shape is now a one-line edit
rather than touching 15 rule blocks. The Rust core needs no changes —
substitution happens in the existing Python rule-loading path. Existing
rule TOMLs without the sentinel continue to work unchanged.

### G122 unscoping

`G122` previously had `file_pattern = "*.py"`. JWTs leak into `.yaml`,
`.json`, `.sh`, `.tf`, and CI configs at least as often as into Python
files. Removing the restriction adds new TP coverage without measurable
FP impact (2 new hits in the validation corpus, both edge cases in
`.drawio` and `.json` files containing image-URL JWTs).

### Shared FP fixes triggered by the validation corpus

- **`G121`** / **`G123`** now suppress f-string and shell interpolation
in the credential portion: `{var}`, `{self.x}`, `${VAR}`, `$(VAR)`,
`$VAR`, `<placeholder>`, `{{ var }}` (Jinja/Helm).
- **`G121`** ignores `re.match()` / `re.compile()` / `re.search()`
patterns that happen to describe a connection-string shape.
- **`G123`** pattern now forbids `/` in the password segment,
eliminating the dominant JS-stack-trace FP class
(`http://localhost:5173/node_modules/.vite/deps/@react.js?…:759:3) @
http://…`). `*.log` added to `exclude_file_pattern`.
- **`G121`** / **`G123`** add
`*.env.example,*.env.template,*.tpl,*.j2,*.jinja,*.template,*cookiecutter*`
to `exclude_file_pattern`.
- **`G114`** placeholder filter now suppresses Slack webhook URLs with
`T00000000/B00000000/XXXX…` template values.
- **`G110`** suppresses `AKIAIOSFOLQUICKSTART` (well-known lakefs
quickstart documented credential).

## Validation

Comparison with TruffleHog (v3.95.3) on 763 repos that originally
flagged any "Hardcoded" finding.** Both tools scanned the same shallow
clones; PySpector with the new rules, TruffleHog with
`--no-verification` for fair format-vs-format comparison.

| Tool | Findings | Heuristic-TP | Heuristic-FP | Precision |
|---|---:|---:|---:|---:|
| **PySpector v2 (this PR)** | 1,135 | 884 | 251 | **78%** |
| TruffleHog 3.95.3 (no-verify) | 5,814 | 462 | 5,352 | **8%** |


Comparison with PySpector vs Modified PySpector

| Metric                  | Original | This PR  | Change       |
|-------------------------|---------:|---------:|-------------:|
| Total findings          |    2,295 |    1,135 | **−51%**     |
| Heuristic-TP            |    1,242 |      884 | −29%         |
| Heuristic-FP            |    1,053 |      251 | **−76%**     |
| Precision               |   54.1%  |   77.9%  | **+23.8pp**  |
| TP : FP ratio           |     1.18 |     3.52 | **~3× better** |

Per-rule breakdown (500 OK-cloned repo subset)

| Rule | Original | This PR | Notes |

|-------|---------:|--------:|------------------------------------------------------------------------------|
| G101 | 1,743 | 702 | Tightened exclude — kills ~60% of placeholder /
instructional-output FPs |
| G101B | 359 | 0 | Largely subsumed by the new format-specific rules;
placeholders also filtered |
| G102 | 138 | 100 | `.md`/`.rst` doc-extension excludes drop ~30
walkthrough FPs |
| G103 | 48 | 34 | `UPPER_CASE = ""` config-default and `*settings*.py`
excludes |
| G104 | 2 | 2 | Same hits |
| AI404 | 5 | 1 | Tightened to require ≥16 alnum chars after `hf_` |
| G110 | — | 1 | NEW — AWS access key (`AKIA…`) |
| G115 | — | 2 | NEW — Stripe live/test keys |
| G116 | — | 187 | NEW — Google API key (`AIza…`) |
| G117 | — | 22 | NEW — OpenAI `sk-…` / `sk-proj-…` |
| G121 | — | 49 | NEW — DB connection string with embedded credentials |
| G122 | — | 26 | NEW — three-part JWT in code (now non-Python files
too) |
| G123 | — | 8 | NEW — basic-auth URL |
| G127 | — | 1 | NEW — Telegram bot token |
| G110–G127 total | 0 | 296 | NEW provider coverage absent from the
original ruleset |
| **TOTAL** | 2,295 | 1,135 | |
core: per-rule CWE field + CWE-aware cross-rule dedup

Adds a `cwe` field on each rule. When two rules report findings at the
same (file, line) and share the same CWE (e.g. DESER_TORCH001 + AI202
both flagging one torch.load line under CWE-502), the engine collapses
them: the finding whose rule declares the higher severity wins, with
rule_id lex order as stable tiebreaker on equal severity. CWE itself
does not set severity — each rule's severity comes from its own TOML
field. Distinct CWEs at the same line stay distinct, so
`os.system(eval(user_input))` correctly reports both CWE-78 and CWE-94.

Rust core
- rules.rs / issues.rs: new optional `cwe: Option<String>`, carried from
Rule → Issue and exposed to Python via pyo3
 - analysis/{config,ast,taint}_analysis.rs: pass it through Issue::new
 - analysis/mod.rs: 2-stage dedup
     stage 1 = existing fingerprint dedup (same rule, exact match)
stage 2 = CWE-aware merge by (file, line, cwe), highest severity wins.
Rules without a CWE skip stage 2.

cli.py
- file_path passed to Rust is now `py_file.resolve()` (absolute,
canonical) so AST-rule and pattern-rule findings agree on the same path
string and stage-2 dedup actually triggers.

reporting.py
 - JSON output gains a top-level `cwe` field on each issue
- SARIF output emits `external/cwe/cwe-N` in each rule's
`properties.tags` — standard SARIF taxon, parses cleanly in GitHub Code
Scanning and DefectDojo

setup.py
- RustExtension declares `debug=False` so `pip install -e .` produces
release-mode binaries; previously editable installs ran ~3× slower.

Rules — all 179 [[rule]] blocks now declare a CWE (built-in-rules.toml +
built-in-rules-ai.toml). Mapping summary:

CWE-78 command injection PROC819, SHELL602/689, PY102/103/106, AI503,
...
CWE-22 path traversal PATH813, OPEN1149, AI502, ZIPSLIP001, FILE526, ...
CWE-94 code/template injection PY001/305/500, SEC501, SSTI001,
SANDBOX307/308, AI101/102/103/105/106/107, ...
CWE-502 insecure deserialization DESER*, PY002/107/204/301/302/306,
YAML001, AI201/202/203/204/205, RUAMEL_UNSAFE001, ...
CWE-89 SQL injection PY101, SQL586/693, ORM001/002, AI104/504, ...
CWE-918 SSRF SSRF_001, NET705, AI501, ENV_URL001, ...
CWE-295 TLS / cert verification TLS001, SSL531, SSH001, G405, NET705
  CWE-327 weak crypto                 PY201/202/203/205, HASH807
  CWE-338 weak PRNG                   CRYPTO708, RAND810
CWE-798 hardcoded credentials G101/101B/102/104/110..133, AI002/404,
AUTH711, ADMIN795, CFG001, ...
  CWE-352 CSRF                        G404, CSRF747, OAUTH774
CWE-489 active debug code G401/403, FLASK001, FLASK_DEBUG001,
DJANGO_DEBUG001, DEBUG798
  CWE-79  XSS                         PY105
  CWE-611 XXE                         PY303, XXE001
  CWE-942 CORS                        CORS780
  CWE-601 open redirect               OPEN_REDIRECT001
  CWE-1004 sensitive cookie attr      COOKIE792, COOKIE_FILE001
  CWE-319 cleartext transmission      HTTPS789, AI403
CWE-200 info disclosure INFO738, BACKUP801, FILE528, AI402, AI405
  CWE-117 log injection               LOG741
  CWE-208 timing attack               TIMING759
  CWE-1333 ReDoS                      REGEX870
  (full list in the rule TOMLs themselves)

New AST rules
 - YAML001        yaml.load() without SafeLoader   (CWE-502, Critical)
 - FLASK_DEBUG001 .run(debug=True) on Flask/FastAPI (CWE-489, High)

AI202 hardened
 - pattern tightened to `torch\.load\s*\(`
- exclude_pattern now matches DESER_TORCH001's: skip lines with
`weights_only=True`
- now redundant with DESER_TORCH001 (both CWE-502) → stage-2 dedup
collapses them to one Critical finding per torch.load line

Test on Ghy0501/MCITlib (4,743 .py / 27,568 functions):

                                  this branch     main (post-ParzivalHack#55)
  wall clock                          593s              606s
  total findings                     1,740             3,103
  unique (file, line, CWE) groups    1,740             1,918
  duplicate groups (≥2 rules)            0             1,185
  excess duplicate findings              0             1,185
  heuristic-TP                       1,684             3,047
  heuristic-FP                          56                56

Dedup is reflected directly: branch produces 0 duplicate groups where
main produces 1,185 (i.e. 1,185 places where 2+ rules describe the same
vulnerability at the same line). FP count is identical (56) since FPs
are pattern-shape artifacts that don't depend on dedup. The remaining
178-finding gap (1,918 unique vs 1,740) is AI202 no longer flagging
torch.load(..., weights_only=True). Wall clock −13s is within noise.
## Summary
- add AI206 to flag Hugging Face `from_pretrained(...,
trust_remote_code=True)` model loading
- add targeted tests for the rule metadata, matcher behavior, and
scanner integration when the Rust core is available

Closes ParzivalHack#18

## Validation
- `/tmp/pyspector-pdlc043-venv/bin/python -m pytest
tests/unit/test_ai_rules.py -q` -> 2 passed, 2 skipped locally because
the PySpector Rust core is not available in this runner
- `/tmp/pyspector-pdlc043-venv/bin/python -m py_compile
tests/unit/test_ai_rules.py`
- TOML parse check for `src/pyspector/rules/built-in-rules-ai.toml`
- direct AST matcher shape check for `trust_remote_code=True` vs `False`
- `git diff --check`
Three-level (L1 in-memory mtime / L2 disk content-hash / L3 chunk-aware)
incremental AST cache that skips the pure-Python json.dumps(AstEncoder)
bottleneck across runs and on partial file changes.

- src/pyspector/ast_cache.py: cache implementation (JSON+base64 persistence,
  no pickle/code-exec on load)
- src/pyspector/_ast_encode.py: shared AST->JSON encoder (single source of
  truth, eliminates encoder drift between cli.py and the cache)
- src/pyspector/cli.py: wire the cache into get_python_file_asts
- tests/unit/ast_cache_test.py: unit tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants