ClaimWright

Reads the documents behind an insurance claim and recommends a payout — with a line-by-line audit trail behind every dollar.

ClaimWright takes the document set behind a security-deposit insurance claim (lease, tenant ledger, deposit-waiver addendum, move-out itemization, repair invoices) and recommends how much to pay — capped at the policy's max benefit — or declines, and shows exactly why. Claude Opus 4.8 does the reading; a pure-Python deterministic engine does the math. On a held-out test split it lands within $250 of the human adjudicator on 91% of claims, with a median error of $0, for about $0.33 a claim.

_{The React workspace: a claim's verdict, confidence, and "why this number"
provenance (payout → basis → citation → source document), and the batch history
with per-run cost. Shown with sample (synthetic) claim data — no real claimant
information.}

The split: the model reads, the engine decides

This is the whole idea. The model does the judgment that needs reading comprehension — what each charge is, what the ledger balance is, whether the claim is eligible — and returns it as structured data through a forced tool call. A deterministic Python engine then does the arithmetic, the cap, and the exclusions. So a payout can never be a number the model invented — every figure traces back to engine code and a document line.

documents ─► Opus 4.8 extraction ─► deterministic engine ─► payout + audit trail
 (hybrid read)  (forced tool_use,      (anchor on balance,    (kept/dropped lines,
                 structured facts)      cap, eligibility gate)  every $ to a source)

Accuracy

Measured against the human-adjudicated amounts on a held-out test split (the Status and Approved Benefit Amount columns are used only for scoring, never as model input — see the leakage boundary).

stage	sample	MAE	median error	within $250
line-item sum (first design)	40 test	$909	$589	43%
balance-anchored	40 test	$121	$0	85%
final (+ guardrails + hybrid reading)	25 random test	$62	$0	91%

Median error is $0 — the system lands on the human number for most claims. It never declines a claim the human paid; the residual error is mostly source-data ambiguity, not misreads. python -m scripts.rescore replays stored extractions through the engine, so a rulebook change can be re-scored against the human decisions with no model calls.

_{The calibration workbench: held-out accuracy, predicted-vs-human scatter,
error distribution, and the rulebook-by-rulebook improvement history — each run
re-scored by replaying stored reads, no API calls. Shown with sample data.}

Highlights

Model reads, engine decides. Forced-tool extraction pulls charges, ledger balance, and eligibility; a pure function computes the payout — nothing is hallucinated, every dollar traces to code and a document line.
Anchor on the balance, not the line items. The money on most claims is unpaid rent stated as a single ledger balance, which summing repair lines misses. Anchoring on it (capped at max benefit) cut MAE from $909 to $121.
Hybrid document reading. Each PDF page is routed by text density: ~75% read free and lossless with pure-Python pdfminer.six, true scans go to vision. No poppler or tesseract binaries, so the same code runs everywhere — including the packaged desktop app. Cost: ~$0.33 a claim.
Calibration with zero API calls. Extractions are stored and the engine is a pure function, so a candidate rulebook (JSON, not code) re-scores against the human decisions by replaying stored reads.
Structural leakage guard. The human-answer columns are blocked from the model by an explicit allow-list projection (core/csv_data.py) — not a convention, a boundary that holds across the DB round-trip.
Multi-user with per-tenant isolation. Each user gets a separate SQLite workspace; runs can be snapshotted and shared read-only into a space, copy-on-share so a viewer never touches the originator's live data.
Built-in white-hat security pass. The repo can run an authorized hunt against itself via autohack: a Claude session traces user input to sinks, then a second model tries to disprove each finding.

The leakage boundary

The CSV carries the human adjudicators' decisions. Those must never influence the system's own decision. core/csv_data.py builds the model-input object from an explicit allow-list of columns, so a forbidden column cannot reach the model or the engine — and that holds across the database round-trip. The forbidden columns are read only by the eval/scoring code.

Stack

Python 3.13 · Anthropic SDK (Opus 4.8, forced tool_use, prompt caching) · Pydantic v2 · Django-Ninja · SQLite · pure-Python document extraction (pdfminer.six + python-docx, no external binaries) · pytest. React + Vite frontend, served same-origin by the backend. Deployed on Railway; also shipped as a macOS desktop app (PyInstaller). No LLM framework — direct SDK calls, structured output validated at the boundary.

Project layout

core/              framework-agnostic adjudication (imports no web framework)
  csv_data.py        CSV load + the allow-list leakage guard
  drive/             keyless per-claim document fetch
  ingest/            PDF/image -> Claude content blocks
  llm/               Opus extraction (forced tool_use) + prompts
  adjudication/      schema, calibratable rulebook, deterministic engine
  eval/              metrics (MAE, decline confusion) + train/test split
  pipeline.py        one claim end-to-end as a fail-soft state machine
  store.py / tenancy.py / auth_store.py   per-tenant persistence + auth
api_django/        Django-Ninja app over core (HTTP API + SPA serving)
web/               React SPA (Run / Claim detail / Review / Past runs / Shared)
config/rulebook.json   the active, calibratable policy

Running it

Requires Python 3.13 and an ANTHROPIC_API_KEY.

python -m venv .venv && source .venv/bin/activate
pip install -e .
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env

python -m pytest -q                 # tests
python manage.py runserver          # API + SPA at http://localhost:8000
cd web && npm install && npm run dev # frontend dev server (http://localhost:5173)

python -m scripts.eval_run          # adjudicate + score 25 test claims vs ground truth

The frontend runs against bundled synthetic fixtures with VITE_USE_MOCKS=1, so the whole UI works with no backend and no API key.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
api_django		api_django
config		config
core		core
docs		docs
scripts		scripts
tests		tests
web		web
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
README.md		README.md
SECURITY.md		SECURITY.md
claimwright.spec		claimwright.spec
launcher.py		launcher.py
manage.py		manage.py
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ClaimWright

The split: the model reads, the engine decides

Accuracy

Highlights

The leakage boundary

Stack

Project layout

Running it

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ClaimWright

The split: the model reads, the engine decides

Accuracy

Highlights

The leakage boundary

Stack

Project layout

Running it

About

Resources

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages