Caching ast#1
Open
daniplatform wants to merge 36 commits into
Open
Conversation
…upply-chain flag, from working as intended :)
Updated the CodeQL workflow to change the cron schedule and modify build mode for Rust.
…s + align local readme.md with remote
…ivalHack#28) (ParzivalHack#47) # GOAL: fix issue ParzivalHack#28, refactor message error during AST file parsing and add a new param to enable python SyntaxWarning ## Changes - CLI & Wizard Mode: Added a new parameter (flag) to enable/disable SyntaxWarning reporting. This allows users to decide if they want to treat syntax warnings as blocking issues or ignore them during scans. - Refactoring get_python_file_asts: - Improved the logic that captures and reports errors during AST generation. - Standardized error messages to make them more descriptive when a file fails to parse. - Integrated the new enable_syntax_warnings logic within the core file-walking loop. - Created a new test suite test_get_asts.py to verify: - Default behavior (warnings ignored). - Behavior when warnings are enabled (treated as errors/exceptions). - Handling of valid, invalid, and encoding-error files. --------- Co-authored-by: Tommaso Bona <piergeolo@gmail.com>
Removed redundant description of the setup_cron.sh script.
…nces and metrics, directly in the terminal, inside a ASCII table :)
…alHack#49) ## Summary Introduces two new mechanisms for controlling rule behavior globally, then applies them to fix a class of false positives that surface when scanning mature Python codebases. ### Problem Scanning a large Django-based project produced 864 findings with `-s HIGH`. Manual verification showed that **~720 were false positives**: - **453** from `PY515`/`SHELL645`/`SHELL670` — flagging every `re.compile()` call as "code compilation" - **101** from `PY102` — the taint engine firing on generic callable invocations - **78** from `PY511`/`JSON612` — `json.loads()` flagged as dangerous deserialization - **dozens more** from `AUTH711`, `CSRF747`, `SESS744`, `PATH813`, etc. — firing on test fixtures and well-known safe patterns Without filtering, scanning also revealed ~9,000 additional Low-severity findings from rules that flag every Python built-in (`isinstance`, `super`, `len`, f-strings, `getattr`, etc.). ### Solution #### 1. `[defaults]` section in rules TOML ```toml [defaults] exclude_file_patterns = ["*tests*", "*fixtures*", "*testdata*", "*conftest*"] disabled_rule_ids = ["ISINSTANCE855", "LEN1101", "SUPER1128", ...] ``` - `exclude_file_patterns` — glob list applied to **all rules**; eliminates the need to repeat `exclude_file_pattern` on every individual rule - `disabled_rule_ids` — completely disables noisy rules without deleting their definitions (easy to re-enable per project) #### 2. Per-rule `exclude_pattern` Regex matched against the flagged line; suppresses the finding if it matches. Fixes rules that are correct in general but have well-known safe variants: ```toml [[rule]] id = "PY107" ast_match = "Call(func.value.id=yaml, func.attr=load)" # Safe when Loader=SafeLoader is explicitly passed exclude_pattern = "Loader\\s*=\\s*(yaml\\.)?(Safe|Base)Loader" ``` ### Results on Django 6.1-alpha | | Before | After | |---|---|---| | Total findings (all severities) | 864 (-s HIGH) | **383** | | `re.compile()` false positives | 453 | **0** | | `yaml.load(SafeLoader)` false positives | 2 | **0** | | Test fixture credential false positives | 9 | **0** | | Session fixation false positives | 1 | **0** | | Python built-in noise rules | ~9,000 (hidden by -s HIGH) | **0** | | `pickle.loads()` true positives | 57 → 8 (test files excluded) | **✓ present** | | `exec()` true positives | 3 | **✓ 3** | ### Files changed - `src/pyspector/_rust_core/src/rules.rs` — `Defaults` struct, `Rule::is_file_excluded()`, `exclude_pattern` field - `src/pyspector/_rust_core/src/analysis/mod.rs` — apply `disabled_rule_ids` before scan - `src/pyspector/_rust_core/src/analysis/ast_analysis.rs` — use `is_file_excluded()` and `exclude_pattern` - `src/pyspector/_rust_core/src/analysis/config_analysis.rs` — same - `src/pyspector/rules/built-in-rules.toml` — `[defaults]` section + per-rule fixes - `tests/unit/test_false_positive_reductions.py` — 26 tests (each fix has a suppression test + true-positive retention test) ## Test plan - [x] `python -m pytest tests/unit/test_false_positive_reductions.py` — 26 tests pass - [x] Scan a Django project: `pyspector scan /path/to/django` — confirm `re.compile()`, `yaml.load(SafeLoader)`, and test-fixture findings are gone - [x] Confirm `pickle.loads()` and `exec()` findings still present in production code Co-authored-by: satoridev01 <info@satori.ci>
…arzivalHack#51) This is a follow up on minimizing false positives, increasing true positives and making PySpector taint analysis faster and better for multiple repositories to speed up and reduce the sample to noise ratio. ## Rules 142 rules were deleted, 2 were disabled, 28 were added and 41 were modified. There is a total of 127 rules. ### New Rules | Rule | What it detects | Severity | Confirmed TP | |---|---|---|---| | `SSTI001` | `render_template_string(user_input)`, `env.from_string(tainted)` | Critical | pygoat | | `ORM001` | SQLAlchemy `text(f"SELECT...{var}")` | Critical | — | | `ORM002` | Django `raw()`, `order_by(tainted)`, `extra(tainted)` (CVE-2021-35042) | Critical | django | | `DESER725` | `jsonpickle.decode()` | Critical | — | | `DESER726` | `dill.loads()` | Critical | — | | `DESER_JOBLIB001` | `joblib.load()` — ML model deserialization via pickle | Critical | sklearn ×11 | | `DESER_NUMPY001` | `numpy.load(allow_pickle=True)` | Critical | tensorflow ×1 | | `DESER_TORCH001` | `torch.load()` without `weights_only=True` | Critical | — | | `TLS001` | `requests.get(url, verify=False)`, `ssl=False` | High | stock | | `SSH001` | Paramiko `AutoAddPolicy()` — SSH MITM | High | — | | `JWT001` | `jwt.decode(options={"verify_signature": False})` | High | pygoat | | `ZIPSLIP001` | `extractall()` without path validation | High | cpython ×4, ansible ×2 | | `XXE001` | `lxml.etree.parse()` without `resolve_entities=False` | High | — | | `FLASK001` | `app.run(debug=True)` | Critical | pygoat, ivpa | | `OPEN_REDIRECT001` | `redirect(tainted_url)`, `HttpResponseRedirect(tainted)` | High | — | | `PLAIN_PWD001` | `Model.objects.create(password=tainted)` — plaintext DB storage | Critical | pygoat, ivpa | | `DJANGO_DEBUG001` | `DEBUG = True` in settings (Django and Flask) | Critical | pygoat ×2, flask | | `ENV_URL001` | `os.environ.get("*_URL")` as HTTP endpoint — SSRF (AST rule) | High | semgrep ×2 | | `COOKIE_FILE001` | Env var used as cookie jar file path | High | — | | `ENV_GIT_URL001` | CI env var URL → `git fetch` — CI token exfiltration (AST rule) | High | semgrep ×1 | | `RUAMEL_UNSAFE001` | `YAML(typ="unsafe")` | Critical | — | | `SQL_CONCAT001` | `"SELECT..." + user_var` — SQL via string concatenation | High | pygoat ×5, ivpa ×1 | | `HARDCODED_PWD001` | `PASSWORD = 'literal'` at module level | High | ivpa | | `SHELL_BYPASS001` | `subprocess.run(["bash", "-c", user_cmd])` — shell bypass | High | — | | `PY306_CACHE` | `pickle.loads()` in cache backends — cache poisoning → RCE | Critical | django ×6 | | `G101B` | Uppercase secret constants (`SECRET_KEY`, `API_KEY` ≥ 16 chars) | High | pygoat ×3 | | `DESER724` | `types.FunctionType()` from deserialized bytecode — arbitrary code execution | Critical | — | | `SANDBOX307` | `object.__subclasses__()` traversal — Python sandbox escape | Critical | — | | `SANDBOX308` | `__init__.__globals__` access — Python sandbox escape via global namespace | Critical | — | ### Modified Rules | Rule | Change | Impact | |---|---|---| | `ADMIN795` | `exclude_pattern` — reduced FPs on test credentials | | | `BACKUP801` | Pattern requires word char before extension (`\w\.bak`); excludes `.rst/.md` | Eliminated 7 FPs in cpython docs | | `CRYPTO708` | Extended to `random.choices()`, `random.sample()`, `random.randrange()` | Catches API key generation with weak PRNG | | `DELATTR834` | Converted from AST pattern to taint sink (`delattr(obj, tainted_attr)`) | | | `DESER723` | `description`, `remediation` — clarified marshal.loads risk | | | `FORMAT864` | Converted from AST pattern to taint sink (`.format(tainted)`) | | | `G101` | `exclude_pattern` — added test/fixture exclusions | | | `G103` | Excludes `def` lines (API param defaults) and chained assignments | Eliminated 4 FPs in ftplib, netrc | | `GETATTR828` | `exclude_file_pattern = "*serializer*,*schema*,*/pandas/core/*,*/pandas/io/*"` | Eliminated 22 FPs in pandas, 9 in django | | `GLOBALS843` | Removed subscript match — only exec/eval with globals() | Eliminated FPs from module attribute registration | | `HASH807` | Activated with broader context exclusions (was disabled) | | | `HTTPS789` | `exclude_file_pattern` — excluded test files | | | `IMPORT825` | `exclude_pattern`, `remediation` — reduced test discovery FPs | | | `LOG741` | `description`, `severity`, `remediation`, `pattern` — narrowed to log injection | | | `OAUTH774` | `exclude_pattern` — reduced FPs on OAuth callbacks | | | `OPEN1149` | Converted from AST pattern to taint sink; severity and confidence updated | | | `OPEN_REDIRECT001` | `exclude_file_pattern` for Django contrib/views (relative + absolute paths) | Eliminated 15 FPs in django framework code | | `ORM001` | Word boundary on `text` keyword; same migration exclusions | Eliminated 29 FPs in django (`gettext(...)`) | | `PATH813` | `exclude_pattern` — reduced FPs on safe path joins | | | `PERM650` | Converted from regex pattern to taint sink for SQL injection | | | `PY002` | `exclude_file_pattern = "*/cache/backends/*"` | Cache backends covered by PY306_CACHE — prevents double-reporting | | `PY101` | `exclude_file_pattern = "*/migrations/*,*/alembic/*,*/backends/*"` | Eliminated 69 FPs in django (ORM DDL infrastructure) | | `PY103` | Converted from AST pattern to taint sink (`os.system(tainted)`) | | | `PY105` | Converted from regex to taint sink (`mark_safe(tainted)`) | | | `PY106` | `ast_match` — tightened subprocess shell=True detection | | | `PY107`/`PY302` | `file_content_exclude = "from ruamel.yaml\|import ruamel"` — new per-file content exclusion mechanism | Eliminated 14 FPs in semgrep (all ruamel.yaml safe usage) | | `PY201` | Extended exclude for MD5 checksum contexts | Eliminated TF and pandas checksum FPs | | `PY202` | `exclude_pattern` — excluded SHA1 in non-crypto contexts | | | `PY507` | Converted from regex to taint sink (`.exec(tainted)`) | | | `RAND810` | Converted from AST pattern to taint sink (`random.seed(tainted)`) | | | `REGEX870` | `description`, `pattern`, `exclude_pattern`, `remediation` — ReDoS narrowed | | | `SEC501` | Excludes quoted references, definitions, and method calls on `.exec` | Eliminated docstring FPs across all repos | | `SER522` | Converted from regex pattern to taint sink | | | `SETATTR831` | Converted from AST pattern to taint sink (`setattr(obj, tainted_attr, val)`) | | | `SHELL631` | Converted from regex pattern to taint sink for SQL injection | | | `SHELL675` | Converted from regex pattern to taint sink for SQL interpolation | | | `SHELL689` | Converted from regex pattern to taint sink for subprocess | | | `SQL586` | Converted from regex pattern to taint sink for SQL formatting | | | `SQL693` | Converted from regex pattern to taint sink for SQL execute | | | `SYMLINK816` | `description`, `pattern`, `remediation` — symlink traversal clarified | | | `TIMING759` | Excludes null-check patterns | Eliminated timing oracle FPs from presence checks | | `TLS001` | Extended exclude for internal array operations | Eliminated 6 pandas internal FPs | | `TOKEN771` | `description`, `confidence`, `exclude_pattern`, `remediation` — JWT expiry check refined | | | `ZIPSLIP001` | Added safe-filter exclusion; excludes regex string accessor | Eliminated satori-cli (Python 3.12 safe filter) + pandas ×4 FPs | ### Taint Engine Changes (`taint_analysis.rs`) | Area | What changed | |---|---| | Sources | The engine now recognizes more entry points as attacker-controlled: HTTP handler parameters detected via route decorators, HTTP client responses, and file contents loaded via deserialization functions. File contents are always treated as potentially attacker-controlled even when the file path itself was chosen by the operator — this is what enables supply-chain detection. | | Origin model | Not all external input is equally dangerous. Data coming from CLI arguments, environment variables, or deployment configuration is operator-supplied and treated as trusted. Data coming from HTTP requests or deserialized file contents is attacker-controlled. Sinks only fire on the latter, which eliminates an entire class of false positives on CLI tools without touching any individual rule. | | Propagation | Taint now follows data across function call boundaries, through class attributes, and through control flow constructs like loops, context managers, and exception handlers. Previously, taint was lost as soon as data crossed a function boundary or was stored in an object. | | Sanitizers | The engine recognizes functions that clean data — database query escaping clears SQL taint, HTML escaping clears HTML taint. Partially sanitized data (e.g., HTML-escaped but not shell-escaped) does not get promoted to fully clean. | | Performance | Analysis skips functions that have no tainted data flowing through them, runs the convergence loop and final pass in parallel, and caches control flow graphs between iterations. Combined with the call graph improvements, this reduced scan time on large repos by 2–5×. | ### New Infrastructure | Feature | Description | |---|---| | `file_content_exclude` field on Rule | Per-file content regex checked ONCE before any analysis — prevents rule from firing on files that import a specific library | | Comma-separated `exclude_file_pattern` | Was treated as a single literal pattern; now split on comma — fixed all multi-pattern exclusions that were silently not working | | `vulnerable_keyword` on TaintSinkRule | Sink only fires for a specific named kwarg (e.g., `create(password=tainted)`) — prevents positional arg FPs | | CLI vs HTTP taint origin | `@app.command()` / Click / Typer parameters → `TaintOrigin::OperatorConfig`; HTTP request parameters → `TaintOrigin::HttpRequest`. Operator-supplied paths are not injection vectors. FILE_DESERIALIZER results always produce `HttpRequest` regardless of file path origin, preserving supply-chain detection. | | `sys.argv` / `os.environ` → `OperatorConfig` | `sys.argv[n]` and `os.environ.get()` now produce `TaintOrigin::OperatorConfig`. Eliminates PY305 FPs on stdlib tools (timeit, pdb, runpy) and Django management shell. | | Duplicate rule consolidation | 8 groups of rules shared identical patterns — each location was firing 2-5× for the same vulnerability. Duplicates deleted; one canonical rule per pattern remains. | --- ## Taint Input Model ### How input sources are classified | Origin | `TaintOrigin` | Is attacker-controlled? | Example | |---|---|---|---| | HTTP request parameters | `HttpRequest` | Yes | `request.POST.get("q")`, `request.args["id"]`, FastAPI path/query params | | HTTP request headers/cookies | `HttpRequest` | Yes | `request.COOKIES["session"]`, `request.headers["X-Token"]` | | File contents (deserializers) | `HttpRequest` | Yes — supply chain | `json.load(f)`, `yaml.load(f)`, `pickle.load(f)`, `toml.load(f)` — even if `f` came from a CLI-specified path | | CLI arguments | `OperatorConfig` | No — operator-trusted | `@app.command()` params (Typer), `@click.argument()`, `@click.option()`, `sys.argv[n]` | | Environment variables | `OperatorConfig` | No — in web app threat model | `os.environ.get("DB_URL")` — set by deployment operator | | Environment variables (CI) | — (AST rule only) | Yes — in CI/supply-chain threat model | `os.environ.get("SEMGREP_URL")` — in GitHub Actions, env vars can be set by PR authors via workflow triggers; `ENV_URL001` / `ENV_GIT_URL001` catch this regardless of taint origin | | Hardcoded literals | `DeveloperDefined` | No | String constants, integer literals | --- ## Benchmark Scans: Previous vs Current | Repo | Files | Funcs OLD | Funcs NEW | Time OLD | Time NEW | Findings OLD | Findings NEW | TP est. NEW | FP est. NEW | S/N NEW | |---|---|---|---|---|---|---|---|---|---|---| | django/django | 2,876 | 26,964⚠️ | 7,137 | N/A¹ | 99s | N/A | 68 | ~22 | ~46 | 32% | | pallets/flask | 78 | 1,139 | 315 | 4.6s | 1.2s | 27 | 7 | 5 | 2 | 71% | | pandas-dev/pandas | 537 | 7,934 | 7,171 | 549s | 64s | 412 | 15 | 8 | 7 | 53% | | scikit-learn/scikit-learn | 743 | 3,811 | 3,725 | 152s | 29s | 135 | 41 | ~37 | ~4 | 90% | | psf/requests | 37 | 623 | 227 | 3.1s | 0.6s | 11 | 5 | 3 | 2 | 60% | | parzivalhack/pyspector | 19 | 145 | 109 | 3.2s | 3.9s | 23 | 4 | 4 | 0 | 100% | | satorici/satori-cli | 42 | 190 | 190 | 3.9s | 0.9s | 29 | 3 | 2 | 1 | 67% | | fastapi/fastapi | 1,109 | 4,376 | 875 | 32.2s | 3.9s | 69 | 0 | 0 | 0 | — | | adeyosemanputra/pygoat | 80 | 173 | 173 | 2.8s | 0.75s | 116 | 72 | 68 | 4 | 94% | | mukxl/Intentionally-Vulnerable-Python-Application | 1 | 6 | 6 | 0.2s | 0.28s | 8 | 7 | 7 | 0 | 100% | | ansible/ansible | 1,772 | 9,504⚠️ | 4,416 | N/A¹ | 28s | N/A | 124 | ~55 | ~69 | 44% | | python/cpython | 1,424 | —⚠️ | 14,599 | N/A¹ | 150s | N/A | 274 | ~60 | ~214 | 22% | | tensorflow/tensorflow | 2,266 | —⚠️ | 16,974 | N/A¹ | 134s | N/A | 29 | ~18 | ~11 | 62% | | semgrep/semgrep | 706 | 2,040 | 1,342 | 37.0s | 17s | 139 | 11 | 7 | 4 | 64% | > ¹ The previous version has no test file exclusion — call graphs of 9,504–26,964 functions cause OOM/timeout. New branch excludes test files, reducing function counts by 50–70%. ### True positives by repo (to be analyzed) | Repo | Confirmed TPs | Key findings | |---|---|---| | adeyosemanputra/pygoat | ~68 | CSRF×25, timing×7, pickle×4, eval×3, FLASK001, PLAIN_PWD001, DJANGO_DEBUG001×2 | | mukxl/Intentionally-Vulnerable-Python-Application | 7 | PY002 (pickle), HARDCODED_PWD001, timing, FLASK001 | | django/django | ~22 | PY306_CACHE×6 (cache poisoning→RCE), ORM002×3, PY106 | | scikit-learn/scikit-learn | ~37 | DESER_JOBLIB001×11, pickle×4, ZIPSLIP001×1, HASH807×1 | | ansible/ansible | ~55 | SHELL602×7, ZIPSLIP001×1, PY305×3 (strategy/collection loader), PY002×4 | | semgrep/semgrep | 7 | ENV_URL001×2 (SEMGREP_URL SSRF), ENV_GIT_URL001×1 (CI token theft), OPEN1149×1, HASH807×4 (SHA-256 for token hashing) | | python/cpython | ~60 | DESER723×3 (marshal/zipimport), ZIPSLIP001×2, SSRF_001×2, PY002 (IDLE RPC), IMPORT825 (logging config) | | tensorflow/tensorflow | ~18 | LOG741×7 (log injection), DESER723/724 (bytecode), DESER_NUMPY001×1, HASH807×3 | | psf/requests | 3 | TIMING759×2 (password `==` in auth), G405 | | pallets/flask | 5 | exec() in from_pyfile(), SHA1, DJANGO_DEBUG001 | | satorici/satori-cli | 2 | SSRF_001×2 (API response URL used in HTTP client) | | parzivalhack/pyspector | 4 | Supply-chain: PATH813 + OPEN1149×2 (aipocgen.py, json.load config), HASH807×1 | --- ## Signal-to-noise ratio ### New ``` mukxl/Intentionally-Vulnerable-Python-Application ████████████████████ 100% — all vulns caught parzivalhack/pyspector ████████████████████ 100% — 4 supply-chain TPs adeyosemanputra/pygoat ███████████████████░ 94% — ground truth scikit-learn/scikit-learn █████████████████░░░ 90% — DESER_JOBLIB001 ×11 pallets/flask ██████████████░░░░░░ 71% — exec() intentional satorici/satori-cli █████████████░░░░░░░ 67% — SSRF TPs confirmed semgrep/semgrep █████████████░░░░░░░ 64% — CI security + HASH807 tensorflow/tensorflow ████████████░░░░░░░░ 62% — log injection + bytecode psf/requests ████████████░░░░░░░░ 60% — timing oracle in auth fastapi/fastapi ████████████████████ n/a — zero findings (true negatives) pandas-dev/pandas ██████░░░░░░░░░░░░░░ 53% — GETATTR828 delegation ansible/ansible █████████░░░░░░░░░░░ 44% — automation attack surface django/django ██████░░░░░░░░░░░░░░ 32% — ORM infrastructure FPs python/cpython █████░░░░░░░░░░░░░░░ 22% — interpreter by design ``` ### Old (repos that completed) ``` mukxl/Intentionally-Vulnerable-Python-Application ████████████████████ 100% (same) adeyosemanputra/pygoat █████████████░░░░░░░ ~67% (INPUT1143, XSS517 FPs) scikit-learn/scikit-learn ████░░░░░░░░░░░░░░░░ ~22% (CENTER927, CRYPTO708 noise) pandas-dev/pandas ██░░░░░░░░░░░░░░░░░░ ~2% (412 findings, ~8 real) fastapi/fastapi ░░░░░░░░░░░░░░░░░░░░ ~0% (69 findings, 0 real) semgrep/semgrep ██░░░░░░░░░░░░░░░░░░ ~3% (139 findings, ~4 real) satorici/satori-cli ███░░░░░░░░░░░░░░░░░ ~7% (29 findings, ~2 real) psf/requests █████░░░░░░░░░░░░░░░ ~27% (11 findings, 3 real) ``` --- ## Files changed - `src/pyspector/rules/built-in-rules.toml` — 142 rules deleted, 25 added, 41 modified, 2 activated; net: 269 → 127 rules - `src/pyspector/_rust_core/src/analysis/taint_analysis.rs` — taint engine: CLI vs HTTP origin, sys.argv/os.environ → OperatorConfig, dead function removed - `src/pyspector/_rust_core/src/graph/call_graph_builder.rs` — O(1) call resolution, test/docs file exclusion - `src/pyspector/_rust_core/src/analysis/ast_analysis.rs` — per-file exclusion pre-filter, unused import removed - `src/pyspector/_rust_core/src/analysis/mod.rs` — phase timing, parallel scanning - `src/pyspector/_rust_core/src/rules.rs` — `file_content_exclude`, `vulnerable_keyword`, comma-split patterns - `src/pyspector/cli.py` — per-phase timing instrumentation - `src/pyspector/reporting.py` — severity serialization fixed (was uppercasing "HIGH", now preserves "High") - `src/pyspector/triage.py` — unused import removed - `tests/unit/` — 168 tests, all passing (including previously broken reporting_test.py) ## Tests changed ``` 168 passed, 0 failed (was: 116 on main; reporting_test.py had 2 pre-existing failures now fixed) +52 new tests covering new rules, engine changes, taint origins, and deduplication ``` Co-authored-by: satoridev01 <info@satori.ci> Co-authored-by: Tommaso Bona <piergeolo@gmail.com>
…ass (ParzivalHack#54) ## Summary This PR adds an explicit `--debug` flag and makes the default output focused on findings, warnings and errors. **[`--debug`]**: identical to the old default output ### Changes - **`pyspector scan --debug`**: new flag. Default output keeps the banner (name, version, credits, joke), findings, warnings and errors. `--debug` re-enables the previous verbose output, including `println!` lines from the Rust core (silenced by redirecting fd 1 to `/dev/null` around `run_scan` / `scan_supply_chain`). - **Dynamic version in the banner**: the displayed version is now read via `importlib.metadata.version("pyspector")` instead of being hardcoded in `cli.py`, so it stays in sync with `setup.cfg` on every bump. - **Honor `exclude` in the Python pre-pass**: `get_python_file_asts` now filters `path.glob("**/*.py")` against the config's `exclude` list. Previously only the Rust core respected it, so the walker still entered `.venv/` / `node_modules/` and emitted a `Info: Skipped` per file inside them. New `_is_path_excluded` helper matches patterns against relative path, absolute path and individual path components. - **Default excludes**: added `node_modules`, `bower_components`, `vendor` to `DEFAULT_CONFIG.exclude` in `config.py` alongside the existing `.venv` / `__pycache__` / `build` / `dist` / `*.egg-info` / `venv` entries. ### Before / After Scanning a project containing a `.venv/`: **Before** (truncated — actually printed ~150 `Info: Skipped` lines from setuptools/pbr tests inside .venv): ``` Info: Skipped .venv/lib/python3.13/site-packages/pbr/tests/test_version.py (test file or fixture) Info: Skipped .venv/lib/python3.13/site-packages/pbr/tests/test_packaging.py (test file or fixture) ... [*] Starting PySpector scan on '.' [*] Successfully parsed 73 Python files in 0.41s [*] Disabled 2 rules via [defaults].disabled_rule_ids [*] Starting analysis with 125 rules [*] Pattern/config scan: 0.00s → 3 issues [*] AST analysis: 0.00s → 1 issues [*] Building call graph from 73/73 files ... ... (~25 more lines) [+] Rule ID: SHELL602 ... ``` **After** (default): ``` o__ __o ... (banner) Version: 0.1.9 Made with <3 by github.com/ParzivalHack 💡 I'd tell you a joke about NAT but I would have to translate. ============================================================ HIGH (4 issues) ============================================================ [+] Rule ID: SHELL602 ... ``` Co-authored-by: satoridev01 <info@satori.ci>
…Hack#55) ## Summary Adds detectors for 24 common credential formats (AWS, GitHub, GitLab, Slack, Stripe, Google, OpenAI, Anthropic/Claude, SendGrid, PostHog, NPM, PyPI, Discord, Telegram, DigitalOcean, Doppler, Cloudflare, Heroku, HubSpot, Fastly, plus DB-connection-string and basic-auth-URL detectors) and significantly reduces false-positive noise from the existing `G101` / `G101B` / `G102` / `G103` / `G104` / `AI404` rules by extending their `exclude_pattern` and `exclude_file_pattern` lists. Validated against a 763-repo corpus side-by-side with TruffleHog. ### New | Rule | Provider | Format | |---|---|---| | G110 | AWS | `(AKIA|ASIA|AIDA|AROA|AGPA|ANPA|ANVA|ASCA)[0-9A-Z]{16}` | | G111 | GitHub | `(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9]{36}` and `github_pat_[A-Za-z0-9_]{82}` | | G112 | GitLab | `glpat-[A-Za-z0-9_-]{20}` | | G113 | Slack token | `xox[abprso]-[A-Za-z0-9-]{10,}` | | G114 | Slack webhook | `https://hooks.slack.com/services/T<id>/B<id>/<token>` | | G115 | Stripe | `(sk|rk)_(live|test)_[A-Za-z0-9]{24,}` | | G116 | Google | `AIza[A-Za-z0-9_-]{35}` | | G117 | OpenAI | `sk-[A-Za-z0-9]{48}` and `sk-(proj|svcacct|admin|None)-[A-Za-z0-9_-]{20,}` | | G118 | Anthropic / Claude | `sk-ant-(api|admin|sid)\d{2}-[A-Za-z0-9_-]{80,110}` | | G119 | SendGrid | `SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}` | | G120 | PostHog | `phc_[A-Za-z0-9]{40}` | | G121 | Database URL with creds | `(postgres(ql)?|mysql|mongodb(\+srv)?|mariadb|redis|rediss|amqp|amqps|mssql|oracle)://user:pass@…` | | G122 | JWT in code | `eyJ…\.eyJ…\.[A-Za-z0-9_-]+` (3-part) | | G123 | Basic-auth URL | `https?://user:pass@host…` (password forbidden to contain `/` — eliminates JS-stack-trace FPs) | | G124 | NPM | `npm_[A-Za-z0-9]{36}` | | G125 | PyPI | `pypi-AgEIcHlwaS5vcmc[A-Za-z0-9_-]{50,}` | | G126 | Discord bot | `[MN][A-Za-z0-9]{23}\.[\w-]{6}\.[\w-]{27}` | | G127 | Telegram bot | `\d{8,10}:[A-Za-z0-9_-]{35}` | | G128 | DigitalOcean | `(dop|doo|dor)_v1_[a-f0-9]{64}` | | G129 | Doppler | `dp\.(pt|st|ct|scim|audit|prov|sa)\.[A-Za-z0-9_-]{30,}` | | G130 | Cloudflare | OCA Key: `v1\.0-[a-f0-9]{32}-[a-f0-9]{146}` + 40-char tokens near "cloudflare" keyword | | G131 | Heroku | UUID near "heroku" keyword (legacy format) | | G132 | HubSpot | `pat-(na1|na2|na3|eu1)-<UUID>` private app + legacy UUID near "hubspot" | | G133 | Fastly | 32-char token near "fastly" keyword | ### Changed - **`AI404`** (Hugging Face): pattern tightened to require at least 16 consecutive alphanumeric chars after `hf_`. Eliminates placeholder FPs like `hf_token`, `hf_X`, `hf_xxx_your_token`, `hf_....`. Doctest lines (`>>>` / `...`) excluded. - **`G104`** (JWT secret): pattern now requires `≥16` non-quote chars in the value (previously `.+` matched literal field-name values like `"kb_jwt"`). `exclude_pattern` added: `your_`, `change-(me|in-production)`, `default-secret`, `do-not-share`, `demo-`, `never-(hardcode|use)`. - **`G101`** (broad password/secret): `exclude_pattern` extended to suppress: - common placeholder values: `your_`, `insert_`, `example_`, `placeholder`, `change-me`, `replace-me`, `todo`, `fake`, `dummy`, `sample`, `demo`, `server_api_key`, `api_key_secret`, `my_password`, `root_password` - values ending in `_here` / containing `*_HERE` - all-uppercase placeholder-name strings like `"YOUR_OPENAI_API_KEY"` - lines starting with `print(`, `click.echo(`, `sys.stderr.` (instructional output) - doctest lines (`>>>` / `...`) - **`G101B`** (uppercase const secret): same placeholder / instructional-line / doctest exclusions. - **`G102`** (private key block): added `exclude_file_pattern = "*.md,*.rst,*.html,*.txt,*.adoc,*.tex,*.ipynb"`. Documentation / walkthrough / knowledge-base content showing `-----BEGIN … PRIVATE KEY-----` as an example was a 100% FP source in our corpus. - **`G103`** (blank password): `exclude_pattern` adds `^\s*[A-Z][A-Z0-9_]+\s*=` (Django/Flask uppercase config defaults like `EMAIL_HOST_PASSWORD = ""` are intentionally overrideable from env). `exclude_file_pattern` adds `*settings*.py,*config*.py`. - **`G117`**, **`G113`**: explicit `-your-`, `-here\b`, `-replace-` substring excludes catch patterns like `xoxb-your-slack-bot-token` and `sk-svcacct-your-embedding-key-here`. `exclude_file_pattern` adds `*.env.example,*.env.template,*.env.sample,*.env.dist,env.example`. ### G121 / G121L — production vs dev-default split `G121` (Critical / High) now excludes connection strings whose host is one of the well-known local/docker-compose names (`localhost`, `127.0.0.1`, `0.0.0.0`, `::1`, `host.docker.internal`, `db`, `database`, `postgres(ql)`, `mysql`, `mariadb`, `mongo(db)`, `redis`, `rabbitmq`, `broker`, `kafka`, `memcached`, `amqp`). Host tokens are matched only when followed by a URL-component terminator (`:`, `/`, `?`, `#`, quote, whitespace), so substrings like `db.prod.example.com` still hit G121 — only standalone host tokens like `@db:5432` get downgraded. **`G121L`** (new, Low / Low) covers the dev-default class: same connection-string shape, but only when the host is one of those local/container names. This converts the dominant remaining G121 FP class — `postgresql://guaardvark:guaardvark@localhost:5432/guaardvark`-style local-dev defaults — into a separate, low-priority signal that an analyst can choose to ignore or batch-review, without dropping the finding entirely (it is still a literal hardcoded credential). `[defaults].exclude_pattern_placeholder` now declares the placeholder/dummy-secret regex (`(?i)EXAMPLE|FAKE|PLACEHOLDER|SAMPLE|x{10,}|0{10,}|1{10,}|abcdefghij|1234567890abcdef|AbCdEfGhIjKlMnOp|f3a8b2c1`) in one place. Each rule's `exclude_pattern` references it via the sentinel `__SHARED_PLACEHOLDERS__`, which `get_default_rules()` (in `pyspector/config.py`) string-substitutes before handing the TOML text to the Rust core. Adding a new placeholder shape is now a one-line edit rather than touching 15 rule blocks. The Rust core needs no changes — substitution happens in the existing Python rule-loading path. Existing rule TOMLs without the sentinel continue to work unchanged. ### G122 unscoping `G122` previously had `file_pattern = "*.py"`. JWTs leak into `.yaml`, `.json`, `.sh`, `.tf`, and CI configs at least as often as into Python files. Removing the restriction adds new TP coverage without measurable FP impact (2 new hits in the validation corpus, both edge cases in `.drawio` and `.json` files containing image-URL JWTs). ### Shared FP fixes triggered by the validation corpus - **`G121`** / **`G123`** now suppress f-string and shell interpolation in the credential portion: `{var}`, `{self.x}`, `${VAR}`, `$(VAR)`, `$VAR`, `<placeholder>`, `{{ var }}` (Jinja/Helm). - **`G121`** ignores `re.match()` / `re.compile()` / `re.search()` patterns that happen to describe a connection-string shape. - **`G123`** pattern now forbids `/` in the password segment, eliminating the dominant JS-stack-trace FP class (`http://localhost:5173/node_modules/.vite/deps/@react.js?…:759:3) @ http://…`). `*.log` added to `exclude_file_pattern`. - **`G121`** / **`G123`** add `*.env.example,*.env.template,*.tpl,*.j2,*.jinja,*.template,*cookiecutter*` to `exclude_file_pattern`. - **`G114`** placeholder filter now suppresses Slack webhook URLs with `T00000000/B00000000/XXXX…` template values. - **`G110`** suppresses `AKIAIOSFOLQUICKSTART` (well-known lakefs quickstart documented credential). ## Validation Comparison with TruffleHog (v3.95.3) on 763 repos that originally flagged any "Hardcoded" finding.** Both tools scanned the same shallow clones; PySpector with the new rules, TruffleHog with `--no-verification` for fair format-vs-format comparison. | Tool | Findings | Heuristic-TP | Heuristic-FP | Precision | |---|---:|---:|---:|---:| | **PySpector v2 (this PR)** | 1,135 | 884 | 251 | **78%** | | TruffleHog 3.95.3 (no-verify) | 5,814 | 462 | 5,352 | **8%** | Comparison with PySpector vs Modified PySpector | Metric | Original | This PR | Change | |-------------------------|---------:|---------:|-------------:| | Total findings | 2,295 | 1,135 | **−51%** | | Heuristic-TP | 1,242 | 884 | −29% | | Heuristic-FP | 1,053 | 251 | **−76%** | | Precision | 54.1% | 77.9% | **+23.8pp** | | TP : FP ratio | 1.18 | 3.52 | **~3× better** | Per-rule breakdown (500 OK-cloned repo subset) | Rule | Original | This PR | Notes | |-------|---------:|--------:|------------------------------------------------------------------------------| | G101 | 1,743 | 702 | Tightened exclude — kills ~60% of placeholder / instructional-output FPs | | G101B | 359 | 0 | Largely subsumed by the new format-specific rules; placeholders also filtered | | G102 | 138 | 100 | `.md`/`.rst` doc-extension excludes drop ~30 walkthrough FPs | | G103 | 48 | 34 | `UPPER_CASE = ""` config-default and `*settings*.py` excludes | | G104 | 2 | 2 | Same hits | | AI404 | 5 | 1 | Tightened to require ≥16 alnum chars after `hf_` | | G110 | — | 1 | NEW — AWS access key (`AKIA…`) | | G115 | — | 2 | NEW — Stripe live/test keys | | G116 | — | 187 | NEW — Google API key (`AIza…`) | | G117 | — | 22 | NEW — OpenAI `sk-…` / `sk-proj-…` | | G121 | — | 49 | NEW — DB connection string with embedded credentials | | G122 | — | 26 | NEW — three-part JWT in code (now non-Python files too) | | G123 | — | 8 | NEW — basic-auth URL | | G127 | — | 1 | NEW — Telegram bot token | | G110–G127 total | 0 | 296 | NEW provider coverage absent from the original ruleset | | **TOTAL** | 2,295 | 1,135 | |
core: per-rule CWE field + CWE-aware cross-rule dedup
Adds a `cwe` field on each rule. When two rules report findings at the
same (file, line) and share the same CWE (e.g. DESER_TORCH001 + AI202
both flagging one torch.load line under CWE-502), the engine collapses
them: the finding whose rule declares the higher severity wins, with
rule_id lex order as stable tiebreaker on equal severity. CWE itself
does not set severity — each rule's severity comes from its own TOML
field. Distinct CWEs at the same line stay distinct, so
`os.system(eval(user_input))` correctly reports both CWE-78 and CWE-94.
Rust core
- rules.rs / issues.rs: new optional `cwe: Option<String>`, carried from
Rule → Issue and exposed to Python via pyo3
- analysis/{config,ast,taint}_analysis.rs: pass it through Issue::new
- analysis/mod.rs: 2-stage dedup
stage 1 = existing fingerprint dedup (same rule, exact match)
stage 2 = CWE-aware merge by (file, line, cwe), highest severity wins.
Rules without a CWE skip stage 2.
cli.py
- file_path passed to Rust is now `py_file.resolve()` (absolute,
canonical) so AST-rule and pattern-rule findings agree on the same path
string and stage-2 dedup actually triggers.
reporting.py
- JSON output gains a top-level `cwe` field on each issue
- SARIF output emits `external/cwe/cwe-N` in each rule's
`properties.tags` — standard SARIF taxon, parses cleanly in GitHub Code
Scanning and DefectDojo
setup.py
- RustExtension declares `debug=False` so `pip install -e .` produces
release-mode binaries; previously editable installs ran ~3× slower.
Rules — all 179 [[rule]] blocks now declare a CWE (built-in-rules.toml +
built-in-rules-ai.toml). Mapping summary:
CWE-78 command injection PROC819, SHELL602/689, PY102/103/106, AI503,
...
CWE-22 path traversal PATH813, OPEN1149, AI502, ZIPSLIP001, FILE526, ...
CWE-94 code/template injection PY001/305/500, SEC501, SSTI001,
SANDBOX307/308, AI101/102/103/105/106/107, ...
CWE-502 insecure deserialization DESER*, PY002/107/204/301/302/306,
YAML001, AI201/202/203/204/205, RUAMEL_UNSAFE001, ...
CWE-89 SQL injection PY101, SQL586/693, ORM001/002, AI104/504, ...
CWE-918 SSRF SSRF_001, NET705, AI501, ENV_URL001, ...
CWE-295 TLS / cert verification TLS001, SSL531, SSH001, G405, NET705
CWE-327 weak crypto PY201/202/203/205, HASH807
CWE-338 weak PRNG CRYPTO708, RAND810
CWE-798 hardcoded credentials G101/101B/102/104/110..133, AI002/404,
AUTH711, ADMIN795, CFG001, ...
CWE-352 CSRF G404, CSRF747, OAUTH774
CWE-489 active debug code G401/403, FLASK001, FLASK_DEBUG001,
DJANGO_DEBUG001, DEBUG798
CWE-79 XSS PY105
CWE-611 XXE PY303, XXE001
CWE-942 CORS CORS780
CWE-601 open redirect OPEN_REDIRECT001
CWE-1004 sensitive cookie attr COOKIE792, COOKIE_FILE001
CWE-319 cleartext transmission HTTPS789, AI403
CWE-200 info disclosure INFO738, BACKUP801, FILE528, AI402, AI405
CWE-117 log injection LOG741
CWE-208 timing attack TIMING759
CWE-1333 ReDoS REGEX870
(full list in the rule TOMLs themselves)
New AST rules
- YAML001 yaml.load() without SafeLoader (CWE-502, Critical)
- FLASK_DEBUG001 .run(debug=True) on Flask/FastAPI (CWE-489, High)
AI202 hardened
- pattern tightened to `torch\.load\s*\(`
- exclude_pattern now matches DESER_TORCH001's: skip lines with
`weights_only=True`
- now redundant with DESER_TORCH001 (both CWE-502) → stage-2 dedup
collapses them to one Critical finding per torch.load line
Test on Ghy0501/MCITlib (4,743 .py / 27,568 functions):
this branch main (post-ParzivalHack#55)
wall clock 593s 606s
total findings 1,740 3,103
unique (file, line, CWE) groups 1,740 1,918
duplicate groups (≥2 rules) 0 1,185
excess duplicate findings 0 1,185
heuristic-TP 1,684 3,047
heuristic-FP 56 56
Dedup is reflected directly: branch produces 0 duplicate groups where
main produces 1,185 (i.e. 1,185 places where 2+ rules describe the same
vulnerability at the same line). FP count is identical (56) since FPs
are pattern-shape artifacts that don't depend on dedup. The remaining
178-finding gap (1,918 unique vs 1,740) is AI202 no longer flagging
torch.load(..., weights_only=True). Wall clock −13s is within noise.
## Summary - add AI206 to flag Hugging Face `from_pretrained(..., trust_remote_code=True)` model loading - add targeted tests for the rule metadata, matcher behavior, and scanner integration when the Rust core is available Closes ParzivalHack#18 ## Validation - `/tmp/pyspector-pdlc043-venv/bin/python -m pytest tests/unit/test_ai_rules.py -q` -> 2 passed, 2 skipped locally because the PySpector Rust core is not available in this runner - `/tmp/pyspector-pdlc043-venv/bin/python -m py_compile tests/unit/test_ai_rules.py` - TOML parse check for `src/pyspector/rules/built-in-rules-ai.toml` - direct AST matcher shape check for `trust_remote_code=True` vs `False` - `git diff --check`
Three-level (L1 in-memory mtime / L2 disk content-hash / L3 chunk-aware) incremental AST cache that skips the pure-Python json.dumps(AstEncoder) bottleneck across runs and on partial file changes. - src/pyspector/ast_cache.py: cache implementation (JSON+base64 persistence, no pickle/code-exec on load) - src/pyspector/_ast_encode.py: shared AST->JSON encoder (single source of truth, eliminates encoder drift between cli.py and the cache) - src/pyspector/cli.py: wire the cache into get_python_file_asts - tests/unit/ast_cache_test.py: unit tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.