Reads the documents behind an insurance claim and recommends a payout — with a line-by-line audit trail behind every dollar.
ClaimWright takes the document set behind a security-deposit insurance claim (lease, tenant ledger, deposit-waiver addendum, move-out itemization, repair invoices) and recommends how much to pay — capped at the policy's max benefit — or declines, and shows exactly why. Claude Opus 4.8 does the reading; a pure-Python deterministic engine does the math. On a held-out test split it lands within $250 of the human adjudicator on 91% of claims, with a median error of $0, for about $0.33 a claim.
The React workspace: a claim's verdict, confidence, and "why this number" provenance (payout → basis → citation → source document), and the batch history with per-run cost. Shown with sample (synthetic) claim data — no real claimant information.
This is the whole idea. The model does the judgment that needs reading comprehension — what each charge is, what the ledger balance is, whether the claim is eligible — and returns it as structured data through a forced tool call. A deterministic Python engine then does the arithmetic, the cap, and the exclusions. So a payout can never be a number the model invented — every figure traces back to engine code and a document line.
documents ─► Opus 4.8 extraction ─► deterministic engine ─► payout + audit trail
(hybrid read) (forced tool_use, (anchor on balance, (kept/dropped lines,
structured facts) cap, eligibility gate) every $ to a source)
Measured against the human-adjudicated amounts on a held-out test split (the
Status and Approved Benefit Amount columns are used only for scoring, never
as model input — see the leakage boundary).
| stage | sample | MAE | median error | within $250 |
|---|---|---|---|---|
| line-item sum (first design) | 40 test | $909 | $589 | 43% |
| balance-anchored | 40 test | $121 | $0 | 85% |
| final (+ guardrails + hybrid reading) | 25 random test | $62 | $0 | 91% |
Median error is $0 — the system lands on the human number for most claims. It
never declines a claim the human paid; the residual error is mostly source-data
ambiguity, not misreads. python -m scripts.rescore replays stored extractions
through the engine, so a rulebook change can be re-scored against the human
decisions with no model calls.
The calibration workbench: held-out accuracy, predicted-vs-human scatter, error distribution, and the rulebook-by-rulebook improvement history — each run re-scored by replaying stored reads, no API calls. Shown with sample data.
- Model reads, engine decides. Forced-tool extraction pulls charges, ledger balance, and eligibility; a pure function computes the payout — nothing is hallucinated, every dollar traces to code and a document line.
- Anchor on the balance, not the line items. The money on most claims is unpaid rent stated as a single ledger balance, which summing repair lines misses. Anchoring on it (capped at max benefit) cut MAE from $909 to $121.
- Hybrid document reading. Each PDF page is routed by text density: ~75% read
free and lossless with pure-Python
pdfminer.six, true scans go to vision. No poppler or tesseract binaries, so the same code runs everywhere — including the packaged desktop app. Cost: ~$0.33 a claim. - Calibration with zero API calls. Extractions are stored and the engine is a pure function, so a candidate rulebook (JSON, not code) re-scores against the human decisions by replaying stored reads.
- Structural leakage guard. The human-answer columns are blocked from the
model by an explicit allow-list projection (
core/csv_data.py) — not a convention, a boundary that holds across the DB round-trip. - Multi-user with per-tenant isolation. Each user gets a separate SQLite workspace; runs can be snapshotted and shared read-only into a space, copy-on-share so a viewer never touches the originator's live data.
- Built-in white-hat security pass. The repo can run an authorized hunt against itself via autohack: a Claude session traces user input to sinks, then a second model tries to disprove each finding.
The CSV carries the human adjudicators' decisions. Those must never influence the
system's own decision. core/csv_data.py builds the model-input object from an
explicit allow-list of columns, so a forbidden column cannot reach the model
or the engine — and that holds across the database round-trip. The forbidden
columns are read only by the eval/scoring code.
Python 3.13 · Anthropic SDK (Opus 4.8, forced tool_use, prompt caching) ·
Pydantic v2 · Django-Ninja · SQLite · pure-Python document extraction
(pdfminer.six + python-docx, no external binaries) · pytest. React + Vite
frontend, served same-origin by the backend. Deployed on Railway; also shipped as
a macOS desktop app (PyInstaller). No LLM framework — direct SDK calls, structured
output validated at the boundary.
core/ framework-agnostic adjudication (imports no web framework)
csv_data.py CSV load + the allow-list leakage guard
drive/ keyless per-claim document fetch
ingest/ PDF/image -> Claude content blocks
llm/ Opus extraction (forced tool_use) + prompts
adjudication/ schema, calibratable rulebook, deterministic engine
eval/ metrics (MAE, decline confusion) + train/test split
pipeline.py one claim end-to-end as a fail-soft state machine
store.py / tenancy.py / auth_store.py per-tenant persistence + auth
api_django/ Django-Ninja app over core (HTTP API + SPA serving)
web/ React SPA (Run / Claim detail / Review / Past runs / Shared)
config/rulebook.json the active, calibratable policy
Requires Python 3.13 and an ANTHROPIC_API_KEY.
python -m venv .venv && source .venv/bin/activate
pip install -e .
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
python -m pytest -q # tests
python manage.py runserver # API + SPA at http://localhost:8000
cd web && npm install && npm run dev # frontend dev server (http://localhost:5173)
python -m scripts.eval_run # adjudicate + score 25 test claims vs ground truthThe frontend runs against bundled synthetic fixtures with VITE_USE_MOCKS=1, so
the whole UI works with no backend and no API key.


