ContractFlow

Agentic contract extraction from PDFs with retrieval, field-level specialists, verifier loops, and auditable post-extraction risk orchestration.

This project is built as a portfolio-grade AI engineering system: not just "one prompt", but a measurable multi-agent pipeline with explicit evidence, confidence, arbitration, and ablations.

Why This Project

Contract extraction pipelines fail in different ways:

  • one-shot prompts miss clause-level details
  • retrieval alone can reduce context cost but still mis-extract fields
  • high-stakes outputs need verification and deterministic guardrails

ContractFlow addresses this with staged agentic execution and evaluation-first development.

What Makes It Agentic

  1. Retriever agent
     • Chunks contracts by page/section heading and returns top-k evidence chunks.
  2. Field agents
     • One agent per schema field.
     • Each returns value + evidence snippets + confidence + issues.
  3. Verifier/Judge agent
     • Decides accept, revise, or unknown.
     • Can trigger targeted retrieval + repair passes.
  4. Post-extraction risk orchestrator (v2)
     • Deterministic policy score first, then optional risk-review agent and judge arbitration.
     • Full factor trace persisted in _meta.retrieval.risk.
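The staged handoff above can be sketched as one loop per schema field. This is a hypothetical illustration, assuming callable retriever/agent/verifier stand-ins passed in as arguments; it is not ContractFlow's actual API:

```python
def extract_field(field, doc, retrieve, agent, verify, max_repairs=2):
    """Sketch of the retrieve -> field agent -> verifier loop.

    Names and signatures are illustrative only. `retrieve`, `agent`,
    and `verify` are caller-supplied stand-ins for the real agents.
    """
    chunks = retrieve(doc, field)                    # top-k evidence chunks
    for _ in range(max_repairs + 1):
        candidate = agent(field, chunks)             # value + evidence + confidence
        verdict = verify(field, candidate, chunks)   # "accept" | "revise" | "unknown"
        if verdict == "accept":
            return candidate
        if verdict == "revise":
            # Targeted re-retrieval before the next repair pass.
            chunks = retrieve(doc, field + " " + str(candidate.get("value")))
        else:  # "unknown": stop burning tokens on this field
            break
    return {"value": None, "confidence": 0.0, "issues": ["unresolved"]}
```

A revise verdict re-retrieves with a narrower query before the next pass, matching the `H -->|revise| D` edge in the architecture diagram below.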

Architecture

```mermaid
flowchart LR
    A[PDF] --> B[Text + OCR]
    B --> C[Chunk by page/heading]
    C --> D[Retriever]
    D --> E[Global baseline]
    D --> F[Field agents]
    F --> G[Candidate select]
    G --> H[Verifier/Judge<br/>accept, revise, unknown]
    H -->|revise| D
    H --> I[Normalize + validate]
    I --> J[Risk Orchestrator V2<br/>rules, review, judge]
    J --> K[JSON + audit]
```

Benchmark Snapshot (Gold Labels, 5 CUAD Docs)

Benchmark date: February 14, 2026
Canonical artifact: data/benchmarks/gold_ablation_presentation.json

| Mode | Exact Accuracy | Partial Accuracy | Exact CI95 | Avg Total Tokens / Doc | Delta Exact vs Naive |
| --- | --- | --- | --- | --- | --- |
| naive | 0.8500 | 0.9333 | 0.8333..0.8833 | 12,692 | +0.0000 |
| retrieval | 0.8000 | 0.8333 | 0.7667..0.8333 | 12,609.8 | -0.0500 |
| field_agents | 0.7833 | 0.8500 | 0.7000..0.8667 | 29,178 | -0.0667 |
| orchestrated (tuned) | 0.9167 | 0.9167 | 0.8667..0.9667 | 26,178.8 | +0.0667 |

Notes:

  • Gold set currently contains 5 labeled CUAD contracts (.gold.json).
  • This benchmark includes derived fields (risk_level, risk_explanation) for all modes.
  • Orchestrated uses a tuned low-token profile (see optimization table below).

Accuracy Diagram

```mermaid
xychart-beta
    title "Exact Acc (gold-5)"
    x-axis ["N", "R", "F", "O"]
    y-axis "acc" 0 --> 1
    bar [0.85, 0.8, 0.7833, 0.9167]
```

N=naive, R=retrieval, F=field_agents, O=orchestrated

Token Usage Diagram

```mermaid
xychart-beta
    title "Avg Tokens/Doc (gold-5)"
    x-axis ["N", "R", "F", "O"]
    y-axis "tokens" 0 --> 32000
    bar [12692, 12609.8, 29178, 26178.8]
```

N=naive, R=retrieval, F=field_agents, O=orchestrated

Orchestrated Token Optimization (Gold-5)

| Orchestrated Profile | Exact | Partial | Avg Tokens / Doc |
| --- | --- | --- | --- |
| default | 0.8500 | 0.8500 | 53,863.6 |
| tuned | 0.9167 | 0.9167 | 26,178.8 |

Tuned settings used in this benchmark:

  • top_k=2
  • max_chunk_chars=800
  • chunk_max_chars=1300
  • max_repairs=2
  • disable_verifier=true
  • risk_review_top_k=2
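For readability, the tuned profile can be written out as a single mapping. This is a hypothetical representation; the key names mirror the settings list above, and the comments are one reading of what each knob controls, not the real config schema:

```python
# Hypothetical sketch of the tuned low-token profile; not ContractFlow's
# actual configuration object.
LOW_COST_PROFILE = {
    "top_k": 2,               # evidence chunks returned per field query
    "max_chunk_chars": 800,   # cap on retrieved chunk text fed to agents
    "chunk_max_chars": 1300,  # cap on chunk size at indexing time
    "max_repairs": 2,         # verifier-triggered repair passes per field
    "disable_verifier": True, # skip the verifier stage entirely
    "risk_review_top_k": 2,   # evidence depth for the risk-review agent
}
```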

Field-Level Signal (Orchestrated, Gold-5 Exact Accuracy)

  • Strong fields:
    • doc_type: 1.00
    • party_a_name: 1.00
    • governing_law: 1.00
    • liability_cap: 1.00
    • risk_level: 1.00
    • risk_explanation: 1.00
  • Current bottlenecks:
    • termination_notice_days: 0.60
    • party_b_name: 0.80
    • effective_date: 0.80
    • term_length: 0.80

Risk Engine V2

Implemented in contractflow/core/risk_engine.py and the post-extraction orchestration stage in contractflow/core/extractor.py, with policy in docs/risk_policy.json.

  • 3 output classes: low, medium, high
  • risk_level and risk_explanation are derived fields (not directly extracted by the schema prompt)
  • weighted factors: liability, governing law region, transfer posture, term, termination, non-solicit
  • dedicated liability-cap parser supports:
    • uncapped / none-specified posture
    • month-window normalization (<N> months fees)
    • fixed monetary caps (<CUR> <amount>)
  • uncertainty-aware scoring from evidence/confidence coverage
  • hard-trigger floors for high-risk combinations
  • optional risk-review agent on triggered uncertainty/conflict cases
  • optional LLM judge arbitration after deterministic scoring
  • balanced risk benchmark available in data/risk_gold/risk_gold_v1.json (5/5/5 low-medium-high)
  • default behavior on uncertainty:
    • missing values remain unknown
    • not auto-promoted to uncapped or outside

Repository Layout

  • contractflow/core/
    • pdf_utils.py: PDF text extraction + OCR fallback
    • chunking.py: chunking, BM25/embeddings/hybrid retrieval
    • extractor.py: naive/retrieval/field_agents/orchestrated pipelines
    • extractor_validation.py: deterministic normalization/coercion rules
    • liability.py: liability cap clause parser + canonicalization
    • risk_engine.py: policy-driven risk scoring + judge arbitration
  • contractflow/schemas/
    • contract_schema.json
  • contractflow/ui/
    • app.py: FastAPI service for upload, extraction, and risk explainability
    • templates/index.html: OpenAI-style minimal UI
    • static/: UI CSS + JS
  • scripts/
    • baseline_extract.py, bulk_extract.py, inspect_chunks.py
    • run_ui.py: local web UI launcher
    • evaluate.py, evaluate_risk.py, evaluate_risk_gold.py, ablation_eval.py
    • calibration_curves.py: field/risk confidence calibration reports
    • retrieval_diagnostics.py, bootstrap_labels.py, build_cuad_pdfs.py
  • docs/
    • domain.md, agentic_roadmap.md, risk_policy.json
  • data/
    • raw_pdfs/, labels/, risk_gold/, preds_ablations/, benchmarks/

Quickstart

1) Install

```
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

Set API key:

```
set OPENAI_API_KEY=your_key_here
```

Optional OCR dependencies (for scanned PDFs): Poppler + Tesseract.

2) Run One Document

```
# Naive
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf

# Retrieval context
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --retrieval

# Field agents
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --field-agents

# Orchestrated with verifier/judge
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated

# Orchestrated low-cost tuned profile
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --orchestrated-profile low_cost

# Orchestrated with risk-review disabled (rules + judge only)
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --disable-risk-review

# Override risk-review model and retrieval depth
python scripts/baseline_extract.py data/raw_pdfs/nda_harvard.pdf --orchestrated --risk-review-model gpt-5.2 --risk-review-top-k 5
```

2b) Run The Web UI

```
python scripts/run_ui.py --host 127.0.0.1 --port 8000 --reload
```

Open http://127.0.0.1:8000 and:

  • upload a PDF
  • choose mode (naive, retrieval, field_agents, orchestrated)
  • choose retrieval backend (bm25, embeddings, hybrid)
  • run extraction and inspect:
    • extracted fields
    • explainable risk summary (drivers, protectors, triggers, uncertainty)
    • orchestration trace

If uvicorn is missing in your venv, reinstall dependencies:

```
pip install -r requirements.txt
```

3) Reproduce Evaluation

```
# Baseline 3 modes on gold labels (include derived fields)
python scripts/ablation_eval.py --labels-dir data/labels --label-suffix .gold.json --modes naive,retrieval,field_agents --preds-root data/preds_ablations_gold_baseline3_incl --overwrite --include-derived --bootstrap-samples 1000 --out data/benchmarks/gold_ablation_baseline3_include_derived.json

# Tuned orchestrated mode on gold labels
python scripts/ablation_eval.py --labels-dir data/labels --label-suffix .gold.json --modes orchestrated --preds-root data/preds_ablations_gold_orch_tuned_incl --overwrite --orchestrated-profile low_cost --include-derived --bootstrap-samples 1000 --out data/benchmarks/gold_ablation_orchestrated_tuned_include_derived.json

# Optional: evaluate one prediction directory directly
python scripts/evaluate.py --labels-dir data/labels --preds-dir data/preds_ablations_gold_orch_tuned_incl/orchestrated --label-suffix .gold.json --include-derived --bootstrap-samples 1000 --out data/benchmarks/eval_gold_orchestrated_tuned_include_derived.json

# Risk-only balanced benchmark (rules-first risk engine quality)
python scripts/evaluate_risk_gold.py --dataset data/risk_gold/risk_gold_v1.json --out data/benchmarks/risk_gold_v1_eval.json

# Confidence calibration (field confidence + risk confidence)
python scripts/calibration_curves.py --preds-dir data/preds_ablations_gold_orch_tuned_incl/orchestrated --labels-dir data/labels --label-suffix .gold.json --bins 10 --out data/benchmarks/calibration_gold_orchestrated_tuned.json --csv-dir data/benchmarks/calibration_gold_orchestrated_tuned_csv
```

4) Run Pre-Push Smoke Check

```
# compile + tests + balanced risk-gold check
python scripts/smoke_check.py

# optional: skip tests when iterating quickly
python scripts/smoke_check.py --skip-tests
```

Next High-Impact Improvements

  1. Expand extraction gold set from 5 docs to 20-30 docs (strongest hiring signal for generalization).
  2. Add calibration-driven confidence thresholds (per field + risk) instead of fixed constants.
  3. Add a clause-specific joint agent for termination_notice_days and remedy windows (current weakest field).
  4. Add downloadable run reports (JSON + evidence table + risk factor trace) from the UI.
