Skip to content

JoshKappler/claim-wright

Repository files navigation

ClaimWright

Reads the documents behind an insurance claim and recommends a payout — with a line-by-line audit trail behind every dollar.

ClaimWright takes the document set behind a security-deposit insurance claim (lease, tenant ledger, deposit-waiver addendum, move-out itemization, repair invoices) and recommends how much to pay — capped at the policy's max benefit — or declines, and shows exactly why. Claude Opus 4.8 does the reading; a pure-Python deterministic engine does the math. On a held-out test split it lands within $250 of the human adjudicator on 91% of claims, with a median error of $0, for about $0.33 a claim.

Claim detail — the recommended payout and the audit trail behind it

Past runs — every adjudication batch, traceable, with per-run cost

The React workspace: a claim's verdict, confidence, and "why this number" provenance (payout → basis → citation → source document), and the batch history with per-run cost. Shown with sample (synthetic) claim data — no real claimant information.

The split: the model reads, the engine decides

This is the whole idea. The model does the judgment that needs reading comprehension — what each charge is, what the ledger balance is, whether the claim is eligible — and returns it as structured data through a forced tool call. A deterministic Python engine then does the arithmetic, the cap, and the exclusions. So a payout can never be a number the model invented — every figure traces back to engine code and a document line.

documents ─► Opus 4.8 extraction ─► deterministic engine ─► payout + audit trail
 (hybrid read)  (forced tool_use,      (anchor on balance,    (kept/dropped lines,
                 structured facts)      cap, eligibility gate)  every $ to a source)

Accuracy

Measured against the human-adjudicated amounts on a held-out test split (the Status and Approved Benefit Amount columns are used only for scoring, never as model input — see the leakage boundary).

stage sample MAE median error within $250
line-item sum (first design) 40 test $909 $589 43%
balance-anchored 40 test $121 $0 85%
final (+ guardrails + hybrid reading) 25 random test $62 $0 91%

Median error is $0 — the system lands on the human number for most claims. It never declines a claim the human paid; the residual error is mostly source-data ambiguity, not misreads. python -m scripts.rescore replays stored extractions through the engine, so a rulebook change can be re-scored against the human decisions with no model calls.

Calibration workbench — accuracy, scatter, and the rulebook iteration history

The calibration workbench: held-out accuracy, predicted-vs-human scatter, error distribution, and the rulebook-by-rulebook improvement history — each run re-scored by replaying stored reads, no API calls. Shown with sample data.

Highlights

  • Model reads, engine decides. Forced-tool extraction pulls charges, ledger balance, and eligibility; a pure function computes the payout — nothing is hallucinated, every dollar traces to code and a document line.
  • Anchor on the balance, not the line items. The money on most claims is unpaid rent stated as a single ledger balance, which summing repair lines misses. Anchoring on it (capped at max benefit) cut MAE from $909 to $121.
  • Hybrid document reading. Each PDF page is routed by text density: ~75% read free and lossless with pure-Python pdfminer.six, true scans go to vision. No poppler or tesseract binaries, so the same code runs everywhere — including the packaged desktop app. Cost: ~$0.33 a claim.
  • Calibration with zero API calls. Extractions are stored and the engine is a pure function, so a candidate rulebook (JSON, not code) re-scores against the human decisions by replaying stored reads.
  • Structural leakage guard. The human-answer columns are blocked from the model by an explicit allow-list projection (core/csv_data.py) — not a convention, a boundary that holds across the DB round-trip.
  • Multi-user with per-tenant isolation. Each user gets a separate SQLite workspace; runs can be snapshotted and shared read-only into a space, copy-on-share so a viewer never touches the originator's live data.
  • Built-in white-hat security pass. The repo can run an authorized hunt against itself via autohack: a Claude session traces user input to sinks, then a second model tries to disprove each finding.

The leakage boundary

The CSV carries the human adjudicators' decisions. Those must never influence the system's own decision. core/csv_data.py builds the model-input object from an explicit allow-list of columns, so a forbidden column cannot reach the model or the engine — and that holds across the database round-trip. The forbidden columns are read only by the eval/scoring code.

Stack

Python 3.13 · Anthropic SDK (Opus 4.8, forced tool_use, prompt caching) · Pydantic v2 · Django-Ninja · SQLite · pure-Python document extraction (pdfminer.six + python-docx, no external binaries) · pytest. React + Vite frontend, served same-origin by the backend. Deployed on Railway; also shipped as a macOS desktop app (PyInstaller). No LLM framework — direct SDK calls, structured output validated at the boundary.

Project layout

core/              framework-agnostic adjudication (imports no web framework)
  csv_data.py        CSV load + the allow-list leakage guard
  drive/             keyless per-claim document fetch
  ingest/            PDF/image -> Claude content blocks
  llm/               Opus extraction (forced tool_use) + prompts
  adjudication/      schema, calibratable rulebook, deterministic engine
  eval/              metrics (MAE, decline confusion) + train/test split
  pipeline.py        one claim end-to-end as a fail-soft state machine
  store.py / tenancy.py / auth_store.py   per-tenant persistence + auth
api_django/        Django-Ninja app over core (HTTP API + SPA serving)
web/               React SPA (Run / Claim detail / Review / Past runs / Shared)
config/rulebook.json   the active, calibratable policy

Running it

Requires Python 3.13 and an ANTHROPIC_API_KEY.

python -m venv .venv && source .venv/bin/activate
pip install -e .
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env

python -m pytest -q                 # tests
python manage.py runserver          # API + SPA at http://localhost:8000
cd web && npm install && npm run dev # frontend dev server (http://localhost:5173)

python -m scripts.eval_run          # adjudicate + score 25 test claims vs ground truth

The frontend runs against bundled synthetic fixtures with VITE_USE_MOCKS=1, so the whole UI works with no backend and no API key.

About

Security-deposit insurance claim adjudication — Claude reads the documents, a deterministic engine decides the payout (sanitized portfolio build).

Resources

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors