Agentic contract extraction from PDF with retrieval, field-level specialists, verifier loops, and auditable post-extraction risk orchestration.
This project is built as a portfolio-grade AI engineering system: not just "one prompt", but a measurable multi-agent pipeline with explicit evidence, confidence, arbitration, and ablations.
Contract extraction pipelines fail in different ways:
- one-shot prompts miss clause-level details
- retrieval alone reduces context cost but can still mis-extract fields
- high-stakes outputs need verification and deterministic guardrails
ContractFlow addresses this with staged agentic execution and evaluation-first development.
- Retriever agent
  - Chunks contracts by page/section heading and returns top-k evidence chunks.
- Field agents
  - One agent per schema field.
  - Each returns value + evidence snippets + confidence + issues.
- Verifier/Judge agent
  - Decides `accept`, `revise`, or `unknown`.
  - Can trigger targeted retrieval + repair passes.
- Post-extraction risk orchestrator (v2)
  - Deterministic policy score first, then optional risk-review agent and judge arbitration.
  - Full factor trace persisted in `_meta.retrieval.risk`.
```mermaid
flowchart LR
    A[PDF] --> B[Text + OCR]
    B --> C[Chunk by page/heading]
    C --> D[Retriever]
    D --> E[Global baseline]
    D --> F[Field agents]
    F --> G[Candidate select]
    G --> H[Verifier/Judge<br/>accept, revise, unknown]
    H -->|revise| D
    H --> I[Normalize + validate]
    I --> J[Risk Orchestrator V2<br/>rules, review, judge]
    J --> K[JSON + audit]
```
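The `revise` edge in the diagram amounts to a bounded repair loop. A minimal control-flow sketch, where `retrieve`, `extract`, and `judge` stand in for the real agents (these names and signatures are assumptions, not project APIs):

```python
# Illustrative revise loop: try once, then repair up to max_repairs times
# using targeted retrieval hinted by the reported issues.
def orchestrate(field, retrieve, extract, judge, max_repairs=2):
    result = extract(field, retrieve(field))
    for _ in range(max_repairs):
        if judge(result) != "revise":
            break
        # targeted retrieval + repair pass
        result = extract(field, retrieve(field, hint=result.get("issues")))
    return result, judge(result)
```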
Benchmark date: February 14, 2026
Canonical artifact: `data/benchmarks/gold_ablation_presentation.json`
| Mode | Exact Accuracy | Partial Accuracy | Exact CI95 | Avg Total Tokens / Doc | Delta Exact vs Naive |
|---|---|---|---|---|---|
| naive | 0.8500 | 0.9333 | 0.8333..0.8833 | 12,692 | +0.0000 |
| retrieval | 0.8000 | 0.8333 | 0.7667..0.8333 | 12,609.8 | -0.0500 |
| field_agents | 0.7833 | 0.8500 | 0.7000..0.8667 | 29,178 | -0.0667 |
| orchestrated (tuned) | 0.9167 | 0.9167 | 0.8667..0.9667 | 26,178.8 | +0.0667 |
Notes:
- Gold set currently contains 5 labeled CUAD contracts (`.gold.json`).
- This benchmark includes derived fields (`risk_level`, `risk_explanation`) for all modes.
- Orchestrated uses a tuned low-token profile (see optimization table below).
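The Exact CI95 column can be reproduced in spirit with a percentile bootstrap over per-field correctness flags. This is a sketch of the general technique, not the benchmark script's actual code:

```python
import random

# Percentile bootstrap CI95: resample the 0/1 correctness flags with
# replacement, record each resample's mean, take the 2.5/97.5 percentiles.
def bootstrap_ci95(flags: list[int], n_samples: int = 1000, seed: int = 0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(flags, k=len(flags))) / len(flags)
        for _ in range(n_samples)
    )
    return means[int(0.025 * n_samples)], means[int(0.975 * n_samples) - 1]
```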
```mermaid
xychart-beta
    title "Exact Acc (gold-5)"
    x-axis ["N", "R", "F", "O"]
    y-axis "acc" 0 --> 1
    bar [0.85, 0.8, 0.7833, 0.9167]
```

N=naive, R=retrieval, F=field_agents, O=orchestrated

```mermaid
xychart-beta
    title "Avg Tokens/Doc (gold-5)"
    x-axis ["N", "R", "F", "O"]
    y-axis "tokens" 0 --> 32000
    bar [12692, 12609.8, 29178, 26178.8]
```

N=naive, R=retrieval, F=field_agents, O=orchestrated
| Orchestrated Profile | Exact | Partial | Avg Tokens / Doc |
|---|---|---|---|
| default | 0.8500 | 0.8500 | 53,863.6 |
| tuned | 0.9167 | 0.9167 | 26,178.8 |
Tuned settings used in this benchmark: `top_k=2`, `max_chunk_chars=800`, `chunk_max_chars=1300`, `max_repairs=2`, `disable_verifier=true`, `risk_review_top_k=2`.
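One way to read these settings is as a profile overlaid on defaults. The sketch below assumes a dict-based config whose keys mirror the CLI flags; the `DEFAULTS` values are hypothetical, and only `LOW_COST` comes from the benchmark above:

```python
# Hypothetical default values, for illustration only.
DEFAULTS = {
    "top_k": 4,
    "max_chunk_chars": 1200,
    "chunk_max_chars": 2000,
    "max_repairs": 2,
    "disable_verifier": False,
    "risk_review_top_k": 3,
}

# The tuned low-token profile from the table above.
LOW_COST = {
    "top_k": 2,
    "max_chunk_chars": 800,
    "chunk_max_chars": 1300,
    "max_repairs": 2,
    "disable_verifier": True,
    "risk_review_top_k": 2,
}

def resolve_profile(overrides: dict) -> dict:
    """Later values win, so a profile only lists what it changes."""
    return {**DEFAULTS, **overrides}
```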
- Strong fields: `doc_type` 1.00, `party_a_name` 1.00, `governing_law` 1.00, `liability_cap` 1.00, `risk_level` 1.00, `risk_explanation` 1.00
- Current bottlenecks: `termination_notice_days` 0.60, `party_b_name` 0.80, `effective_date` 0.80, `term_length` 0.80
Implemented in `contractflow/core/risk_engine.py` and the post-extraction orchestration stage in `contractflow/core/extractor.py`, with policy in `docs/risk_policy.json`.
- 3 output classes: `low`, `medium`, `high`
- `risk_level` and `risk_explanation` are derived fields (not directly extracted by the schema prompt)
- weighted factors: liability, governing law region, transfer posture, term, termination, non-solicit
- dedicated liability-cap parser supports:
  - uncapped / none-specified posture
  - month-window normalization (`<N> months fees`)
  - fixed monetary caps (`<CUR> <amount>`)
- uncertainty-aware scoring from evidence/confidence coverage
- hard-trigger floors for high-risk combinations
- optional risk-review agent on triggered uncertainty/conflict cases
- optional LLM judge arbitration after deterministic scoring
- balanced risk benchmark available in `data/risk_gold/risk_gold_v1.json` (5/5/5 low-medium-high)
- normal behavior on uncertainty: missing values remain `unknown`, not auto-promoted to `uncapped` or `outside`
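The three cap postures above can be illustrated with a toy parser. The real one in `contractflow/core/liability.py` is richer; these regexes are assumptions for illustration only:

```python
import re

# Toy liability-cap canonicalizer: uncapped posture, month-window
# normalization ("<N> months fees"), and fixed monetary caps.
UNCAPPED = re.compile(r"\b(no\s+limit|unlimited|uncapped)\b", re.I)
MONTHS = re.compile(r"(\d+)\s+months(?:'|\u2019)?\s+fees", re.I)
MONEY = re.compile(r"(USD|EUR|GBP)\s*([\d,]+(?:\.\d+)?)", re.I)

def parse_liability_cap(text: str) -> dict:
    if UNCAPPED.search(text):
        return {"posture": "uncapped"}
    if m := MONTHS.search(text):
        return {"posture": "month_window", "months": int(m.group(1))}
    if m := MONEY.search(text):
        amount = float(m.group(2).replace(",", ""))
        return {"posture": "fixed_cap", "currency": m.group(1).upper(), "amount": amount}
    return {"posture": "unknown"}  # never auto-promoted to uncapped
```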
- `contractflow/core/`
  - `pdf_utils.py`: PDF text extraction + OCR fallback
  - `chunking.py`: chunking, BM25/embeddings/hybrid retrieval
  - `extractor.py`: naive/retrieval/field_agents/orchestrated pipelines
  - `extractor_validation.py`: deterministic normalization/coercion rules
  - `liability.py`: liability cap clause parser + canonicalization
  - `risk_engine.py`: policy-driven risk scoring + judge arbitration
- `contractflow/schemas/contract_schema.json`
- `contractflow/ui/`
  - `app.py`: FastAPI service for upload, extraction, and risk explainability
  - `templates/index.html`: OpenAI-style minimal UI
  - `static/`: UI CSS + JS
- `scripts/`
  - `baseline_extract.py`, `bulk_extract.py`, `inspect_chunks.py`
  - `run_ui.py`: local web UI launcher
  - `evaluate.py`, `evaluate_risk.py`, `evaluate_risk_gold.py`, `ablation_eval.py`
  - `calibration_curves.py`: field/risk confidence calibration reports
  - `retrieval_diagnostics.py`, `bootstrap_labels.py`, `build_cuad_pdfs.py`
- `docs/`: `domain.md`, `agentic_roadmap.md`, `risk_policy.json`
- `data/`: `raw_pdfs/`, `labels/`, `risk_gold/`, `preds_ablations/`, `benchmarks/`
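For the hybrid retrieval backend, a common fusion is min-max normalizing BM25 and embedding scores per chunk and blending them. A sketch under that assumption (the real `chunking.py` may weight or normalize differently):

```python
# Blend two per-chunk score dicts after min-max normalization;
# alpha controls the BM25 vs embedding trade-off.
def hybrid_scores(bm25: dict[str, float], emb: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    def norm(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on flat scores
        return {k: (v - lo) / span for k, v in scores.items()}
    b, e = norm(bm25), norm(emb)
    return {k: alpha * b.get(k, 0.0) + (1 - alpha) * e.get(k, 0.0) for k in set(b) | set(e)}
```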
```
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

Set API key:

```
set OPENAI_API_KEY=your_key_here
```

Optional OCR dependencies (for scanned PDFs): Poppler + Tesseract.
```
# Naive
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf

# Retrieval context
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --retrieval

# Field agents
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --field-agents

# Orchestrated with verifier/judge
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated

# Orchestrated low-cost tuned profile
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --orchestrated-profile low_cost

# Orchestrated with risk-review disabled (rules + judge only)
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --disable-risk-review

# Override risk-review model and retrieval depth
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --risk-review-model gpt-5.2 --risk-review-top-k 5
```

Launch the web UI:

```
python scripts/run_ui.py --host 127.0.0.1 --port 8000 --reload
```

Open http://127.0.0.1:8000 and:
- upload a PDF
- choose mode (`naive`, `retrieval`, `field_agents`, `orchestrated`)
- choose retrieval backend (`bm25`, `embeddings`, `hybrid`)
- run extraction and inspect:
  - extracted fields
  - explainable risk summary (drivers, protectors, triggers, uncertainty)
  - orchestration trace
If `uvicorn` is missing in your venv, reinstall dependencies:

```
pip install -r requirements.txt
```

```
# Baseline 3 modes on gold labels (include derived fields)
python scripts/ablation_eval.py --labels-dir data/labels --label-suffix .gold.json --modes naive,retrieval,field_agents --preds-root data/preds_ablations_gold_baseline3_incl --overwrite --include-derived --bootstrap-samples 1000 --out data/benchmarks/gold_ablation_baseline3_include_derived.json

# Tuned orchestrated mode on gold labels
python scripts/ablation_eval.py --labels-dir data/labels --label-suffix .gold.json --modes orchestrated --preds-root data/preds_ablations_gold_orch_tuned_incl --overwrite --orchestrated-profile low_cost --include-derived --bootstrap-samples 1000 --out data/benchmarks/gold_ablation_orchestrated_tuned_include_derived.json

# Optional: evaluate one prediction directory directly
python scripts/evaluate.py --labels-dir data/labels --preds-dir data/preds_ablations_gold_orch_tuned_incl/orchestrated --label-suffix .gold.json --include-derived --bootstrap-samples 1000 --out data/benchmarks/eval_gold_orchestrated_tuned_include_derived.json

# Risk-only balanced benchmark (rules-first risk engine quality)
python scripts/evaluate_risk_gold.py --dataset data/risk_gold/risk_gold_v1.json --out data/benchmarks/risk_gold_v1_eval.json

# Confidence calibration (field confidence + risk confidence)
python scripts/calibration_curves.py --preds-dir data/preds_ablations_gold_orch_tuned_incl/orchestrated --labels-dir data/labels --label-suffix .gold.json --bins 10 --out data/benchmarks/calibration_gold_orchestrated_tuned.json --csv-dir data/benchmarks/calibration_gold_orchestrated_tuned_csv
```
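Calibration reports of this kind reduce to binning predictions by confidence and comparing average confidence to accuracy per bin. A generic sketch of the technique (an assumption about the report's shape, not `calibration_curves.py` itself):

```python
import statistics

# Bin (confidence, correct) pairs into `bins` equal-width buckets and
# report (avg_confidence, accuracy, count) for each non-empty bucket.
def calibration_bins(pairs, bins=10):
    out = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        # top bucket also catches confidence == 1.0
        hits = [(c, ok) for c, ok in pairs if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if hits:
            out.append((
                statistics.mean(c for c, _ in hits),
                statistics.mean(float(ok) for _, ok in hits),
                len(hits),
            ))
    return out
```

A well-calibrated model has per-bin accuracy close to per-bin average confidence.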
```
# compile + tests + balanced risk-gold check
python scripts/smoke_check.py

# optional: skip tests when iterating quickly
python scripts/smoke_check.py --skip-tests
```

- Expand extraction gold set from 5 docs to 20-30 docs (strongest hiring signal for generalization).
- Add calibration-driven confidence thresholds (per field + risk) instead of fixed constants.
- Add a clause-specific joint agent for `termination_notice_days` and remedy windows (current weakest field).
- Add downloadable run reports (JSON + evidence table + risk factor trace) from the UI.