Agentic contract extraction from PDF with retrieval, field-level specialists, verifier loops, and auditable post-extraction risk orchestration.
This project is built as a portfolio-grade AI engineering system: not just "one prompt", but a measurable multi-agent pipeline with explicit evidence, confidence, arbitration, and ablations.
Contract extraction pipelines fail in different ways:
- one-shot prompts miss clause-level details
- retrieval alone reduces context cost but can still mis-extract fields
- high-stakes outputs need verification and deterministic guardrails
ContractFlow addresses this with staged agentic execution and evaluation-first development.
- Retriever agent
  - Chunks contracts by page/section heading and returns top-k evidence chunks.
- Field agents
  - One agent per schema field.
  - Each returns value + evidence snippets + confidence + issues.
- Verifier/Judge agent
  - Decides `accept`, `revise`, or `unknown`.
  - Can trigger targeted retrieval + repair passes.
- Post-extraction risk orchestrator (v2)
  - Deterministic policy score first, then optional risk-review agent and judge arbitration.
  - Full factor trace persisted in `_meta.retrieval.risk`.
```mermaid
flowchart LR
    A[PDF] --> B[Text + OCR]
    B --> C[Chunk by page/heading]
    C --> D[Retriever]
    D --> E[Global baseline]
    D --> F[Field agents]
    F --> G[Candidate select]
    G --> H[Verifier/Judge<br/>accept, revise, unknown]
    H -->|revise| D
    H --> I[Normalize + validate]
    I --> J[Risk Orchestrator V2<br/>rules, review, judge]
    J --> K[JSON + audit]
```
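The `revise` edge in the diagram amounts to a bounded repair loop. A minimal control-flow sketch, where `retrieve`, `extract`, and `judge` stand in for the real agents (these names and signatures are assumptions, not project APIs):

```python
# Illustrative revise loop: try once, then repair up to max_repairs times
# using targeted retrieval hinted by the reported issues.
def orchestrate(field, retrieve, extract, judge, max_repairs=2):
    result = extract(field, retrieve(field))
    for _ in range(max_repairs):
        if judge(result) != "revise":
            break
        # targeted retrieval + repair pass
        result = extract(field, retrieve(field, hint=result.get("issues")))
    return result, judge(result)
```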
Benchmark date: February 14, 2026
Canonical artifact: `data/benchmarks/gold_ablation_presentation.json`
| Mode | Exact Accuracy | Partial Accuracy | Exact CI95 | Avg Total Tokens / Doc | Delta Exact vs Naive |
|---|---|---|---|---|---|
| naive | 0.8500 | 0.9333 | 0.8333..0.8833 | 12,692 | +0.0000 |
| retrieval | 0.8000 | 0.8333 | 0.7667..0.8333 | 12,609.8 | -0.0500 |
| field_agents | 0.7833 | 0.8500 | 0.7000..0.8667 | 29,178 | -0.0667 |
| orchestrated (tuned) | 0.9167 | 0.9167 | 0.8667..0.9667 | 26,178.8 | +0.0667 |
Notes:
- Gold set currently contains 5 labeled CUAD contracts (`.gold.json`).
- This benchmark includes derived fields (`risk_level`, `risk_explanation`) for all modes.
- Orchestrated uses a tuned low-token profile (see optimization table below).
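The Exact CI95 column can be reproduced in spirit with a percentile bootstrap over per-field correctness flags. This is a sketch of the general technique, not the benchmark script's actual code:

```python
import random

# Percentile bootstrap CI95: resample the 0/1 correctness flags with
# replacement, record each resample's mean, take the 2.5/97.5 percentiles.
def bootstrap_ci95(flags: list[int], n_samples: int = 1000, seed: int = 0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(flags, k=len(flags))) / len(flags)
        for _ in range(n_samples)
    )
    return means[int(0.025 * n_samples)], means[int(0.975 * n_samples) - 1]
```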
```mermaid
xychart-beta
    title "Exact Acc (gold-5)"
    x-axis ["N", "R", "F", "O"]
    y-axis "acc" 0 --> 1
    bar [0.85, 0.8, 0.7833, 0.9167]
```

N=naive, R=retrieval, F=field_agents, O=orchestrated

```mermaid
xychart-beta
    title "Avg Tokens/Doc (gold-5)"
    x-axis ["N", "R", "F", "O"]
    y-axis "tokens" 0 --> 32000
    bar [12692, 12609.8, 29178, 26178.8]
```

N=naive, R=retrieval, F=field_agents, O=orchestrated
| Orchestrated Profile | Exact | Partial | Avg Tokens / Doc |
|---|---|---|---|
| default | 0.8500 | 0.8500 | 53,863.6 |
| tuned | 0.9167 | 0.9167 | 26,178.8 |
Tuned settings used in this benchmark: `top_k=2`, `max_chunk_chars=800`, `chunk_max_chars=1300`, `max_repairs=2`, `disable_verifier=true`, `risk_review_top_k=2`.
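One way to read these settings is as a profile overlaid on defaults. The sketch below assumes a dict-based config whose keys mirror the CLI flags; the `DEFAULTS` values are hypothetical, and only `LOW_COST` comes from the benchmark above:

```python
# Hypothetical default values, for illustration only.
DEFAULTS = {
    "top_k": 4,
    "max_chunk_chars": 1200,
    "chunk_max_chars": 2000,
    "max_repairs": 2,
    "disable_verifier": False,
    "risk_review_top_k": 3,
}

# The tuned low-token profile from the table above.
LOW_COST = {
    "top_k": 2,
    "max_chunk_chars": 800,
    "chunk_max_chars": 1300,
    "max_repairs": 2,
    "disable_verifier": True,
    "risk_review_top_k": 2,
}

def resolve_profile(overrides: dict) -> dict:
    """Later values win, so a profile only lists what it changes."""
    return {**DEFAULTS, **overrides}
```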
- Strong fields: `doc_type` 1.00, `party_a_name` 1.00, `governing_law` 1.00, `liability_cap` 1.00, `risk_level` 1.00, `risk_explanation` 1.00
- Current bottlenecks: `termination_notice_days` 0.60, `party_b_name` 0.80, `effective_date` 0.80, `term_length` 0.80
Implemented in `contractflow/core/risk_engine.py` and the post-extraction orchestration stage in `contractflow/core/extractor.py`, with policy in `docs/risk_policy.json`.
- 3 output classes: `low`, `medium`, `high`
- `risk_level` and `risk_explanation` are derived fields (not directly extracted by the schema prompt)
- weighted factors: liability, governing law region, transfer posture, term, termination, non-solicit
- dedicated liability-cap parser supports:
  - uncapped / none-specified posture
  - month-window normalization (`<N> months fees`)
  - fixed monetary caps (`<CUR> <amount>`)
- uncertainty-aware scoring from evidence/confidence coverage
- hard-trigger floors for high-risk combinations
- optional risk-review agent on triggered uncertainty/conflict cases
- optional LLM judge arbitration after deterministic scoring
- balanced risk benchmark available in `data/risk_gold/risk_gold_v1.json` (5/5/5 low-medium-high)
- normal behavior on uncertainty: missing values remain `unknown`, not auto-promoted to `uncapped` or `outside`
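The three cap postures above can be illustrated with a toy parser. The real one in `contractflow/core/liability.py` is richer; these regexes are assumptions for illustration only:

```python
import re

# Toy liability-cap canonicalizer: uncapped posture, month-window
# normalization ("<N> months fees"), and fixed monetary caps.
UNCAPPED = re.compile(r"\b(no\s+limit|unlimited|uncapped)\b", re.I)
MONTHS = re.compile(r"(\d+)\s+months(?:'|\u2019)?\s+fees", re.I)
MONEY = re.compile(r"(USD|EUR|GBP)\s*([\d,]+(?:\.\d+)?)", re.I)

def parse_liability_cap(text: str) -> dict:
    if UNCAPPED.search(text):
        return {"posture": "uncapped"}
    if m := MONTHS.search(text):
        return {"posture": "month_window", "months": int(m.group(1))}
    if m := MONEY.search(text):
        amount = float(m.group(2).replace(",", ""))
        return {"posture": "fixed_cap", "currency": m.group(1).upper(), "amount": amount}
    return {"posture": "unknown"}  # never auto-promoted to uncapped
```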
- `contractflow/core/`
  - `pdf_utils.py`: PDF text extraction + OCR fallback
  - `chunking.py`: chunking, BM25/embeddings/hybrid retrieval
  - `extractor.py`: naive/retrieval/field_agents/orchestrated pipelines
  - `extractor_validation.py`: deterministic normalization/coercion rules
  - `liability.py`: liability cap clause parser + canonicalization
  - `risk_engine.py`: policy-driven risk scoring + judge arbitration
- `contractflow/schemas/contract_schema.json`
- `contractflow/ui/`
  - `app.py`: FastAPI service for upload, extraction, and risk explainability
  - `templates/index.html`: OpenAI-style minimal UI
  - `static/`: UI CSS + JS
- `scripts/`
  - `baseline_extract.py`, `bulk_extract.py`, `inspect_chunks.py`
  - `run_ui.py`: local web UI launcher
  - `evaluate.py`, `evaluate_risk.py`, `evaluate_risk_gold.py`, `ablation_eval.py`
  - `calibration_curves.py`: field/risk confidence calibration reports
  - `retrieval_diagnostics.py`, `bootstrap_labels.py`, `build_cuad_pdfs.py`
- `docs/`: `domain.md`, `agentic_roadmap.md`, `risk_policy.json`
- `data/`: `raw_pdfs/`, `labels/`, `risk_gold/`, `preds_ablations/`, `benchmarks/`
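For the hybrid retrieval backend, a common fusion is min-max normalizing BM25 and embedding scores per chunk and blending them. A sketch under that assumption (the real `chunking.py` may weight or normalize differently):

```python
# Blend two per-chunk score dicts after min-max normalization;
# alpha controls the BM25 vs embedding trade-off.
def hybrid_scores(bm25: dict[str, float], emb: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    def norm(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on flat scores
        return {k: (v - lo) / span for k, v in scores.items()}
    b, e = norm(bm25), norm(emb)
    return {k: alpha * b.get(k, 0.0) + (1 - alpha) * e.get(k, 0.0) for k in set(b) | set(e)}
```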
```
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

Set API key:

```
set OPENAI_API_KEY=your_key_here
```

Optional OCR dependencies (for scanned PDFs): Poppler + Tesseract.
```
# Naive
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf

# Retrieval context
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --retrieval

# Field agents
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --field-agents

# Orchestrated with verifier/judge
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated

# Orchestrated low-cost tuned profile
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --orchestrated-profile low_cost

# Orchestrated with risk-review disabled (rules + judge only)
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --disable-risk-review

# Override risk-review model and retrieval depth
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --risk-review-model gpt-5.2 --risk-review-top-k 5
```

Launch the web UI:

```
python scripts/run_ui.py --host 127.0.0.1 --port 8000 --reload
```

Open http://127.0.0.1:8000 and:
- upload a PDF
- choose mode (`naive`, `retrieval`, `field_agents`, `orchestrated`)
- choose retrieval backend (`bm25`, `embeddings`, `hybrid`)
- run extraction and inspect:
  - extracted fields
  - explainable risk summary (drivers, protectors, triggers, uncertainty)
  - orchestration trace
If `uvicorn` is missing in your venv, reinstall dependencies:

```
pip install -r requirements.txt
```

```
# Baseline 3 modes on gold labels (include derived fields)
python scripts/ablation_eval.py --labels-dir data/labels --label-suffix .gold.json --modes naive,retrieval,field_agents --preds-root data/preds_ablations_gold_baseline3_incl --overwrite --include-derived --bootstrap-samples 1000 --out data/benchmarks/gold_ablation_baseline3_include_derived.json

# Tuned orchestrated mode on gold labels
python scripts/ablation_eval.py --labels-dir data/labels --label-suffix .gold.json --modes orchestrated --preds-root data/preds_ablations_gold_orch_tuned_incl --overwrite --orchestrated-profile low_cost --include-derived --bootstrap-samples 1000 --out data/benchmarks/gold_ablation_orchestrated_tuned_include_derived.json

# Optional: evaluate one prediction directory directly
python scripts/evaluate.py --labels-dir data/labels --preds-dir data/preds_ablations_gold_orch_tuned_incl/orchestrated --label-suffix .gold.json --include-derived --bootstrap-samples 1000 --out data/benchmarks/eval_gold_orchestrated_tuned_include_derived.json

# Risk-only balanced benchmark (rules-first risk engine quality)
python scripts/evaluate_risk_gold.py --dataset data/risk_gold/risk_gold_v1.json --out data/benchmarks/risk_gold_v1_eval.json

# Confidence calibration (field confidence + risk confidence)
python scripts/calibration_curves.py --preds-dir data/preds_ablations_gold_orch_tuned_incl/orchestrated --labels-dir data/labels --label-suffix .gold.json --bins 10 --out data/benchmarks/calibration_gold_orchestrated_tuned.json --csv-dir data/benchmarks/calibration_gold_orchestrated_tuned_csv
```
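Calibration reports of this kind reduce to binning predictions by confidence and comparing average confidence to accuracy per bin. A generic sketch of the technique (an assumption about the report's shape, not `calibration_curves.py` itself):

```python
import statistics

# Bin (confidence, correct) pairs into `bins` equal-width buckets and
# report (avg_confidence, accuracy, count) for each non-empty bucket.
def calibration_bins(pairs, bins=10):
    out = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        # top bucket also catches confidence == 1.0
        hits = [(c, ok) for c, ok in pairs if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if hits:
            out.append((
                statistics.mean(c for c, _ in hits),
                statistics.mean(float(ok) for _, ok in hits),
                len(hits),
            ))
    return out
```

A well-calibrated model has per-bin accuracy close to per-bin average confidence.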
```
# compile + tests + balanced risk-gold check
python scripts/smoke_check.py

# optional: skip tests when iterating quickly
python scripts/smoke_check.py --skip-tests
```

- Expand extraction gold set from 5 docs to 20-30 docs (strongest hiring signal for generalization).
- Add calibration-driven confidence thresholds (per field + risk) instead of fixed constants.
- Add a clause-specific joint agent for `termination_notice_days` and remedy windows (current weakest field).
- Add downloadable run reports (JSON + evidence table + risk factor trace) from the UI.