DocSage is an agentic document-intelligence system: ingest financial and business documents, persist structured rows (transactions) in a database, build a semantic index for question answering, and expose both through a FastAPI backend and a Next.js UI. The “brain” for generation is Google Gemini; retrieval uses local FAISS + sentence-transformers; multi-step reasoning is orchestrated with LangGraph.
- What DocSage does
- Repository map
- Quick start (local)
- Configuration
- Backend architecture
- Concept glossary — IDP, OCR, RAG, FAISS, LangGraph, HyDE, SQL grounding, Gemini, …
- LangGraph chat pipeline
- Document ingestion and data flow
- HTTP API surface
- Frontend (web)
- Dependencies, scripts, and containers
- Deployment
- Limitations and extension points
- Upload documents (PDF, images, etc.) via the API or UI.
- Parse them through an IDP-style pipeline: text extraction (PDF and/or OCR), heuristic classification, and structured field extraction.
- Store `Document` rows plus derived `Transaction` rows in PostgreSQL or SQLite.
- Index in FAISS for RAG: per document, an `extraction_summary` chunk (from structured `extracted_data`) when present, plus chunked `raw_text`—each chunk carries `chunk_type`, `chunk_index`, and document metadata for grounding in API responses.
- Answer questions via `POST /api/v1/chat/insights`: a LangGraph workflow can route between fast analytic paths, SQL over `documents` + `transactions` (with JOIN on `document_id`), and full retrieve → rerank → grade → synthesize agentic RAG with HyDE-style query rewriting. Optional `history` on the request enables multi-turn chat (the last 20 turns are used server-side); synthesis prompts insist on filename / `document_id` citations when evidence exists.
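For example, a minimal Python client (endpoint and field names as documented in this README; the question, history, and port are illustrative):

```python
# Minimal sketch of calling the chat endpoint; payload fields follow the
# QueryRequest/QueryResponse schemas described later in this README.
import requests

payload = {
    "query": "How much did we spend with Acme Corp last quarter?",
    "use_rag": True,
    "use_sql": True,
    # Optional multi-turn context; the server keeps the last 20 non-empty turns.
    "history": [
        {"role": "user", "content": "Which vendors appear most often?"},
        {"role": "assistant", "content": "Acme Corp appears on 14 documents..."},
    ],
}

resp = requests.post("http://localhost:8000/api/v1/chat/insights", json=payload, timeout=120)
resp.raise_for_status()
data = resp.json()
print(data["answer"])
for src in data.get("sources", []):
    print(f'- {src["filename"]} (doc {src["document_id"]}, chunk {src["chunk_index"]})')
```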
```mermaid
flowchart LR
    subgraph client [Client]
        Browser["Browser (Next.js)"]
    end
    subgraph api [FastAPI api]
        Routes[Routers]
        Agent["LangGraph agent"]
        RAG["RAGService / FAISS"]
        DB[("SQLAlchemy DB")]
        LLM["Gemini API"]
    end
    Browser -->|HTTP JSON| Routes
    Routes --> Agent
    Agent --> RAG
    Agent --> DB
    Agent --> LLM
    Routes --> DB
```
| Path | Role |
|---|---|
| `api/` | Python package `app`: FastAPI entry `api/app/main.py`, routers, services, agents, models, `api/requirements.txt`, `api/Dockerfile`. |
| `api/app/routers/` | HTTP route modules (analytics, documents, chat, anomalies, compare, receipts, export, insights report). |
| `api/app/services/` | Business logic: RAG, LLM wrapper, IDP pipeline, SQL tools, extraction-summary helper (`extraction_summary.py`), insights, anomalies, export, etc. |
| `api/app/agents/` | LangGraph graph compiler, nodes, state, `langgraph_runner.py`. |
| `api/app/vectorstore/` | `faiss_store.py`: FAISS + embeddings. |
| `api/scripts/` | Operational scripts (seed, ingest, embeddings, diagnostics)—not started automatically. |
| `web/` | Next.js 14 App Router UI; entry layout `web/src/app/layout.tsx`, pages under `web/src/app/`. |
| `web/src/lib/api.ts` | Typed fetch helpers against `/api/v1`. |
| `scripts/run-dev.sh` | Local dev: optional Postgres, API + web. |
| `docker-compose.postgres.yml` | Postgres-only Compose (used by the dev script). |
| `docker-compose.yml` | Postgres + API image. |
| `docker-compose.prod.yml` | Production-oriented Compose example. |
| `railway.toml` | Railway: Dockerfile build, uvicorn start, `/health`. |
| `.env.example` | Template for `api/.env` (copy into `api/`). |
| `data/` (under `api/` at runtime) | Uploads and indexes: `raw_docs`, `embeddings`, `processed`—paths come from `config.py` (`RAW_DOCS_PATH`, `FAISS_*`, etc.) relative to the API working directory. Do not commit large `api/data/` trees; keep them local or on a volume. |
Prerequisites: Node.js, Python 3.11+, optional Docker for Postgres, Google AI Studio API key for LLM features.
From the repository root:
`./scripts/run-dev.sh`
- Uses `docker-compose.postgres.yml` to start Postgres when Docker is available. If Docker is not running, the API and web still start; use `USE_SQLITE=true` in `api/.env` for a file DB, or start Docker and rerun.
- Copies `.env.example` → `api/.env` once if missing; ensures `web/.env.local` has `NEXT_PUBLIC_API_URL`.
- Recreates `api/.venv` if broken; installs Python and npm deps; runs uvicorn and `npm run dev`.
Ports: `API_PORT` (default 8000), `WEB_PORT` (default 3000). UI: http://localhost:3000; OpenAPI: http://localhost:8000/docs.
Busy Postgres port: `POSTGRES_PORT=5433 ./scripts/run-dev.sh`, and set `POSTGRES_PORT=5433` in `api/.env`.
Skip Docker: `./scripts/run-dev.sh --no-docker`.
On API startup, `api/app/main.py` runs a lifespan hook that calls `init_db()`: SQLAlchemy `create_all` for models in `api/app/models.py`. This creates missing tables (e.g. `documents`, `transactions`) but does not run Alembic-style migrations for schema changes.
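A minimal sketch of that startup pattern, assuming standard FastAPI + SQLAlchemy wiring rather than the repo's exact code:

```python
# Sketch of a lifespan-based startup (assumed shape, not copied from
# api/app/main.py): create missing tables once at boot.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase


class Base(DeclarativeBase):
    pass


engine = create_engine("sqlite:///./docsage.db")


def init_db() -> None:
    # create_all only adds missing tables; it never alters existing columns.
    Base.metadata.create_all(bind=engine)


@asynccontextmanager
async def lifespan(app: FastAPI):
    init_db()
    yield  # shutdown logic would go after the yield


app = FastAPI(lifespan=lifespan)
```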
API (`api/app/config.py` + `api/.env`)
Settings load from the environment and an optional `api/.env`. Unknown keys are ignored (`extra="ignore"`) so stale variables do not crash startup.
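A minimal sketch of that loading behavior with pydantic-settings (field names abbreviated; see `api/app/config.py` for the real set):

```python
# Sketch of the settings pattern described above, assuming pydantic-settings.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        extra="ignore",  # unknown/stale env keys are skipped, not fatal
    )

    use_sqlite: bool = False
    google_api_key: str = ""
    embedding_model: str = "all-MiniLM-L6-v2"


settings = Settings()
```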
| Variable | Purpose |
|---|---|
| `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `POSTGRES_HOST`, `POSTGRES_PORT` | PostgreSQL connection when `USE_SQLITE` is false. |
| `USE_SQLITE` | If true, `DATABASE_URL` is SQLite (`sqlite:///./docsage.db`). |
| `API_HOST`, `API_PORT` | Uvicorn bind (used when running `app.main` as `__main__`). |
| `GOOGLE_API_KEY` | Gemini API key; alias `GEMINI_API_KEY`. |
| `GOOGLE_AI_MODEL` | Primary `generateContent` model id. |
| `GOOGLE_AI_MODEL_FALLBACKS` | Comma-separated fallback model ids (overload / transient errors / unknown primary). |
| `FAISS_INDEX_PATH`, `FAISS_DOCUMENTS_PATH` | Paths to the FAISS index file and the pickle sidecar for chunk metadata. |
| `EMBEDDING_MODEL` | sentence-transformers model name (default `all-MiniLM-L6-v2`, 384-d vectors). |
| `RAW_DOCS_PATH`, `PROCESSED_PATH` | Upload and processed file roots. |
| `CORS_ORIGINS` | Comma-separated browser origins, or `*` (dev only; avoid in production). |
| `MAX_UPLOAD_MB` | Upload size cap for document uploads. |
| `DEBUG`, `LOG_LEVEL` | App logging / debug flags. |
| Variable | Purpose |
|---|---|
| `NEXT_PUBLIC_API_URL` | Origin of the FastAPI server (no trailing slash), e.g. `http://127.0.0.1:8000`. Used by `web/src/lib/api.ts`. Must match the host you use in the browser (localhost vs 127.0.0.1) to avoid CORS/preflight issues. |
| `NEXT_PUBLIC_GOOGLE_OAUTH_ENABLED` | Optional. Set to true to show “Continue with Google” before the client fetches `GET /api/v1/auth/config`; otherwise the UI reads that endpoint and only shows the button when the API has OAuth credentials. |
| `NEXT_PUBLIC_SITE_URL` | Optional canonical site URL for metadata. If unset on Vercel, `VERCEL_URL` is used in `web/src/app/layout.tsx` for `metadataBase`. |
Docker Compose: pass `GOOGLE_API_KEY`, `GOOGLE_AI_MODEL`, `GOOGLE_AI_MODEL_FALLBACKS` into the `api` service (see `docker-compose.yml`, `docker-compose.prod.yml`).
`api/app/main.py` constructs the app with:
- CORS from `settings.cors_origins_list`.
- `APIRouter` subtree mounted at `/api/v1` including analytics, anomalies, documents, compare, receipts, export, insights report, chat.
- Legacy `POST /chat/insights` delegating to the same handler as v1 chat.
- SQLAlchemy 2.x engine + `SessionLocal` in `api/app/db.py`.
- Models in `api/app/models.py`:
  - `Document`: filename, path, type, raw text, JSON `extracted_data`, timestamps.
  - `Transaction`: `document_id`, `date`, `amount`, `vendor`, `category`, `description`, JSON `metadata` (ORM attribute `meta_data` to avoid reserved-name issues), confidence / correction flags.
  - `DocumentCorrection`: audit of user corrections.
Routers use `Depends(get_db)` for request-scoped sessions.
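A sketch of the usual shape of that dependency (assumed, not copied from `api/app/db.py`):

```python
# Standard request-scoped session dependency for FastAPI + SQLAlchemy.
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

engine = create_engine("sqlite:///./docsage.db")
SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False)


def get_db():
    db: Session = SessionLocal()
    try:
        yield db  # FastAPI injects this session into the route handler
    finally:
        db.close()  # session is always closed, even on exceptions
```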
`api/app/schemas.py` defines Pydantic models for HTTP I/O. Notable:
- `QueryRequest`: `query`, `use_rag`, `use_sql`, optional `history` (`List[ChatMessage]` with `role` and `content`). The chat router keeps the last 20 turns with non-empty content. `use_rag` / `use_sql` bias LangGraph routing (e.g. the SQL-only path when RAG is off and keywords suggest aggregation).
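A sketch of that request schema as described (defaults assumed):

```python
# Sketch of QueryRequest/ChatMessage per the description above;
# default values are assumptions, not taken from api/app/schemas.py.
from typing import List, Literal, Optional

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: Literal["user", "assistant"]
    content: str


class QueryRequest(BaseModel):
    query: str
    use_rag: bool = True
    use_sql: bool = True
    history: Optional[List[ChatMessage]] = None  # server keeps last 20 non-empty turns
```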
Each item explains what the concept is, then how DocSage applies it (with file references).
IDP is the class of systems that turn messy documents (PDFs, scans) into structured, machine-usable data—classification, key-value extraction, validation—not just raw text.
DocSage: `api/app/services/idp_pipeline.py` implements extraction and heuristic classification; the upload path in `api/app/routers/documents.py` calls `parse_document` then persists rows. This is “IDP-inspired”: rules + LLM/heuristics rather than a full enterprise IDP product.
OCR recovers text from pixels (photos, scanned pages).
DocSage: pytesseract with Pillow-compatible inputs in `extract_text_with_ocr` (`idp_pipeline.py`). The Docker image installs `tesseract-ocr` (`api/Dockerfile`) so containers can OCR without extra host setup.
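A sketch of that OCR step (requires the `tesseract` binary on the host or in Docker; the grayscale preprocessing is illustrative):

```python
# Sketch in the spirit of extract_text_with_ocr; the function name and
# preprocessing here are assumptions, not the repo's exact code.
from PIL import Image
import pytesseract


def ocr_image(path: str) -> str:
    with Image.open(path) as img:
        # Grayscale often helps Tesseract on noisy photos and scans.
        return pytesseract.image_to_string(img.convert("L"))
```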
Digital PDFs often expose a text layer; extraction without OCR is faster and more accurate.
DocSage: pdfplumber in `extract_text_from_pdf` walks pages and concatenates `extract_text()` output (`idp_pipeline.py`). Image-only PDFs may still need rasterization + OCR (pipeline-dependent).
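A sketch of text-layer extraction in that spirit (assumed shape, not the repo's exact function):

```python
# Concatenate per-page text from the PDF's text layer, skipping empty pages.
import pdfplumber


def extract_pdf_text(path: str) -> str:
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                pages.append(text)
    return "\n\n".join(pages)
```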
Heuristics use keywords and patterns to guess document type and pull amounts, dates, and vendors without a dedicated ML model per field.
DocSage: `classify_document`, `extract_amounts`, and related helpers in `idp_pipeline.py`; extracted JSON is stored on `Document.extracted_data`, and `Transaction` rows are synthesized via `extract_transactions_from_document` (`api/scripts/ingest_docs.py`), used from the documents router.
An embedding is a dense vector representing text (or other modalities) in a space where semantic similarity ≈ vector proximity.
DocSage: `SentenceTransformer` in `api/app/vectorstore/faiss_store.py` encodes strings; the default model `all-MiniLM-L6-v2` produces 384-dimensional vectors (`EMBEDDING_MODEL` in config). `api/scripts/build_embeddings.py` walks all `Document` rows and, for each, emits (1) a single `extraction_summary` string via `extraction_to_index_text` (`extraction_summary.py`) when `extracted_data` is present, and (2) sliding-window chunks over `raw_text` when present. Metadata on each vector row includes `chunk_type` (`extraction_summary` vs `raw_text`), `chunk_index`, `total_chunks` (for raw splits), and document id/filename—used downstream for rerank sources and UI citations.
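A sketch of that chunk-and-embed scheme (window size, overlap, and exact metadata keys are illustrative; see `api/scripts/build_embeddings.py` for the real values):

```python
# Sliding-window chunking over raw_text plus per-chunk metadata rows.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-d vectors


def chunk_text(text: str, size: int = 800, overlap: int = 200):
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text), 1), step)]


def rows_for_document(doc_id: int, filename: str, raw_text: str):
    chunks = chunk_text(raw_text)
    vectors = model.encode(chunks)
    for i, (chunk, vec) in enumerate(zip(chunks, vectors)):
        yield vec, {
            "document_id": doc_id,
            "filename": filename,
            "chunk_type": "raw_text",
            "chunk_index": i,
            "total_chunks": len(chunks),
            "text": chunk,
        }
```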
A vector store indexes vectors for nearest-neighbor search (which chunks are closest to the query embedding).
DocSage: FAISS `IndexFlatL2`—exact L2 search over all vectors (simple, no training). The index and a parallel pickle list of metadata are saved to `FAISS_INDEX_PATH` / `FAISS_DOCUMENTS_PATH` (relative to the API cwd, typically `api/data/embeddings/` in local dev). `RAGService` loads the index if the files exist and exposes `search(query, k)`.
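A self-contained sketch of the flat index plus pickle sidecar (paths, sample texts, and metadata layout are assumptions):

```python
# Build, persist, and query an exact-L2 FAISS index with a metadata sidecar.
import pickle

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Invoice from Acme Corp for $1,200", "Receipt: coffee, $4.50"]
meta = [{"document_id": 1, "chunk_index": 0}, {"document_id": 2, "chunk_index": 0}]

vectors = np.asarray(model.encode(texts), dtype="float32")
index = faiss.IndexFlatL2(vectors.shape[1])  # exact search, no training step
index.add(vectors)

faiss.write_index(index, "faiss.index")
with open("faiss_documents.pkl", "wb") as f:
    pickle.dump(meta, f)  # row i in the index corresponds to meta[i]

query = np.asarray(model.encode(["how much was the Acme invoice?"]), dtype="float32")
distances, ids = index.search(query, 2)
print([meta[i] for i in ids[0]])
```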
RAG grounds LLM answers in retrieved passages from a corpus instead of parametric memory alone, reducing hallucination on factual questions about your documents.
DocSage: `RAGService` wraps the FAISS store; LangGraph nodes call `search` with a rewritten query after HyDE and optional reranking (`api/app/agents/langgraph/nodes.py`). `node_rerank` / `node_synthesize` attach `sources`: `document_id`, `filename`, `chunk_index`, `chunk_type`, `score` for citation-style UX; final synthesis instructs the model to cite the filename and document id when evidence exists and to avoid inventing facts when context and SQL are empty.
LangGraph models an agent as a state machine: nodes (functions) update state; edges (conditional or fixed) choose the next step. “Agentic” here means multiple LLM and tool steps with branching, not a single prompt.
DocSage: `build_agent_graph` compiles a `StateGraph` over `AgentState` (`api/app/agents/langgraph/state.py`), including optional `history` for multi-turn prompts. `run_agent_pipeline` invokes the compiled graph with `graph.invoke`, then shapes the response for the REST API.
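A compressed sketch of that pattern: node names follow this README, but the state fields, toy grader, and loop cap are illustrative, not the repo's logic.

```python
# Minimal StateGraph with a conditional retry edge (grade -> hyde).
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class AgentState(TypedDict, total=False):
    query: str
    grade_pass: bool
    retrieval_iteration: int
    answer: str


def node_hyde(state: AgentState) -> AgentState:
    return {"retrieval_iteration": state.get("retrieval_iteration", 0) + 1}


def node_grade(state: AgentState) -> AgentState:
    return {"grade_pass": state.get("retrieval_iteration", 0) >= 2}  # toy grader


def node_synthesize(state: AgentState) -> AgentState:
    return {"answer": f"answered: {state['query']}"}


MAX_RETRIEVAL_LOOPS = 3


def _grade_next(state: AgentState) -> str:
    # Retry HyDE only while the grade fails and the loop cap is not reached.
    if not state.get("grade_pass") and state.get("retrieval_iteration", 0) < MAX_RETRIEVAL_LOOPS:
        return "hyde"
    return "synthesize"


graph = StateGraph(AgentState)
graph.add_node("hyde", node_hyde)
graph.add_node("grade", node_grade)
graph.add_node("synthesize", node_synthesize)
graph.add_edge(START, "hyde")
graph.add_edge("hyde", "grade")
graph.add_conditional_edges("grade", _grade_next, {"hyde": "hyde", "synthesize": "synthesize"})
graph.add_edge("synthesize", END)

compiled = graph.compile()
print(compiled.invoke({"query": "total spend last month?"}))
```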
LangChain packages in `api/requirements.txt` (`langchain-core`, `langchain-community`) support the broader ecosystem; application code under `app/` imports LangGraph directly rather than high-level LangChain chains.
HyDE asks the LLM to draft a fake answer or passage that would answer the question; that text is embedded and used to retrieve real chunks. It often improves recall when the raw user question is short or mismatched to chunk wording.
DocSage: `node_hyde_rewrite` in `nodes.py` calls `call_llm` to produce a hypothetical block; retrieval uses the rewritten text (see also `refinement_hint` on failed grades). HyDE, grading, SQL generation, and synthesis prompts also receive a compact conversation block from `history` when the client sends prior turns.
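A sketch of the HyDE step; `call_llm` stands in for the repo's Gemini wrapper, and the prompt wording is illustrative:

```python
# HyDE: draft a hypothetical answer passage, then embed that text for
# retrieval instead of the raw question.
def hyde_rewrite(question: str, call_llm) -> str:
    prompt = (
        "Write a short, factual passage that would answer this question, "
        "as if quoted from a financial document:\n\n" + question
    )
    hypothetical = call_llm(prompt)
    # Embedding the document-style hypothetical tends to land closer to
    # real chunks than embedding a short user question would.
    return hypothetical
```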
Two-stage retrieval can mean: (1) cheap bi-encoder over many candidates, then (2) cross-encoder scoring query–passage pairs for a top subset. A grader decides if context is good enough or triggers another retrieval loop.
DocSage: constants `RETRIEVE_K`, `RERANK_POOL`, `RERANK_KEEP`, `MAX_RETRIEVAL_LOOPS` in `nodes.py`. `cross-encoder/ms-marco-MiniLM-L-6-v2` scores pairs. `node_grade` sets `grade_pass`; conditional edges in `graph.py` send failures back to `hyde` until the cap, then proceed to `optional_sql` → `synthesize`.
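A sketch of the cross-encoder stage (the candidate dict shape and keep constant are illustrative):

```python
# Second-stage rerank: FAISS supplies a candidate pool; the cross-encoder
# rescores (query, passage) pairs and keeps the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

RERANK_KEEP = 5


def rerank(query: str, candidates: list[dict]) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # higher = more relevant
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [{**c, "score": float(s)} for c, s in ranked[:RERANK_KEEP]]
```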
Grounding here means the LLM sees the real table schema and sample rows before emitting read-only SQL executed against your DB.
DocSage: `SQLTools` introspects `transactions` for low-level helpers and exposes `get_multitable_sql_llm_context()`—combined `documents` + `transactions` schemas, truncated `raw_text` previews, and shrunk `extracted_data` in document samples (SQLite vs Postgres aware). `_generate_sql` in `nodes.py` includes that context, the user question, and a short conversation block from `history`, and explains that `transactions.document_id` references `documents.id` (JOIN allowed). `node_sql_only` and `node_optional_sql` merge SQL results into answers; optional SQL is keyword-gated (including document/invoice-style terms).
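A sketch of that grounding pattern; the prompt wording and the SELECT-only guardrail are illustrative, and `call_llm` / `schema_context` stand in for the repo's helpers:

```python
# Schema-grounded, read-only SQL generation and execution.
from sqlalchemy import text


def answer_with_sql(question: str, schema_context: str, call_llm, db):
    prompt = (
        "You write exactly one read-only SELECT statement.\n"
        f"{schema_context}\n"
        "Note: transactions.document_id references documents.id (JOINs allowed).\n"
        f"Question: {question}\nSQL:"
    )
    sql = call_llm(prompt).strip().rstrip(";")
    if not sql.lower().lstrip().startswith("select"):
        raise ValueError("refusing non-SELECT statement")  # basic guardrail
    return sql, db.execute(text(sql)).fetchall()
```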
Some questions match precomputed aggregates faster than full RAG.
DocSage: `node_route` checks keyword hints for vendor/category breakdowns and routes to `node_metrics_fast`, which calls `InsightsService` (`api/app/services/insights.py`) and returns without vector search (`nodes.py`).
Anomaly detection flags unusual rows (duplicates, outliers, date oddities).
DocSage: `api/app/services/anomaly_detection.py`; exposed via `api/app/routers/anomalies.py`.
Gemini is accessed through the REST `generateContent` endpoint (v1beta); this repo does not require a proprietary SDK.
DocSage: `api/app/services/llm_service.py` builds the request with an optional `systemInstruction`, walks a model chain (primary + `GOOGLE_AI_MODEL_FALLBACKS`), retries 429 / 5xx with backoff and optional `Retry-After`, skips to the next model on 400 / 404 (e.g. a deprecated model id), and returns assistant text or structured error strings.
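A sketch of that call pattern against the public v1beta REST API (retry counts and error handling simplified relative to `llm_service.py`):

```python
# generateContent with a fallback model chain and exponential backoff.
import time

import requests

API = "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"


def call_gemini(prompt: str, api_key: str, models: list[str]) -> str:
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    for model in models:  # primary first, then GOOGLE_AI_MODEL_FALLBACKS
        for attempt in range(3):
            r = requests.post(API.format(model=model), params={"key": api_key}, json=body, timeout=60)
            if r.status_code in (400, 404):
                break  # bad or deprecated model id: skip to the next model
            if r.status_code == 429 or r.status_code >= 500:
                time.sleep(float(r.headers.get("Retry-After", 2 ** attempt)))
                continue  # transient overload: retry with backoff
            r.raise_for_status()
            return r.json()["candidates"][0]["content"]["parts"][0]["text"]
    return "LLM error: all models failed"
```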
Mirrors `api/app/agents/langgraph/graph.py`.
```mermaid
flowchart TD
    entry[route_entry]
    route[node_route]
    metrics[node_metrics_fast]
    sqlOnly[node_sql_only]
    hyde[node_hyde_rewrite]
    retrieve[node_retrieve]
    rerank[node_rerank]
    grade[node_grade]
    sqlOpt[node_optional_sql]
    synth[node_synthesize]
    endNode[END]
    entry --> route
    route -->|metrics_fast| metrics
    route -->|sql_only| sqlOnly
    route -->|agentic_rag| hyde
    hyde --> retrieve
    retrieve --> rerank
    rerank --> grade
    grade -->|retry_HyDE_loop| hyde
    grade -->|pass_or_cap| sqlOpt
    sqlOpt --> synth
    metrics --> endNode
    sqlOnly --> endNode
    synth --> endNode
```
The `grade → hyde` edge is conditional: it fires only when `grade_pass` is false and `retrieval_iteration` is below `MAX_RETRIEVAL_LOOPS` (`_grade_next`).
- Lazily constructs a singleton `RAGService` (loads FAISS if index files exist).
- `run_chat` maps `request.history` to the graph (last 20 non-empty turns).
- `run_chat` calls `run_agent_pipeline(query, rag, use_rag=..., use_sql=..., history=...)`.
- Returns `QueryResponse`: `answer`, `sources` (list of dicts with `document_id`, `filename`, `chunk_index`, `chunk_type`, `score` when RAG ran), `sql_query`, `steps`, `tool_calls` (derived from steps for UI convenience).
When RAG is “skipped” in spirit: `use_rag=False` with SQL-biased routing yields `sql_only`. `use_rag=False` also clears `sources` in the runner output. `use_sql=False` disables the optional SQL-augmentation node path in the graph state.
Several features call `call_llm` directly: insights report generation (`api/app/services/insights_generator.py`), categorization (`api/app/services/categorization.py`), and the HyDE / synthesize / SQL prompt nodes. The graph orchestrates interactive chat; batch/report flows may be linear.
- `POST /api/v1/documents` with multipart file (`documents.py`).
- File bytes saved under `RAW_DOCS_PATH`.
- `parse_document(path)` runs the IDP pipeline (`idp_pipeline.py`).
- `Document` inserted; `extract_transactions_from_document` yields dicts → `Transaction` rows committed.
- FAISS is not automatically rebuilt on every upload. After new imports, extraction changes, or IDP tweaks, refresh vectors so `extraction_summary` and `raw_text` chunks stay in sync—for example: `cd api && ./.venv/bin/python scripts/build_embeddings.py`
You can also use `RAGService.build_index` / `add_documents` (`rag.py`) in custom ops; `add_documents` currently rebuilds the full index for simplicity.
All v1 routes are prefixed with `/api/v1` unless noted.
| Tag | Method | Path | Purpose |
|---|---|---|---|
| analytics | GET | `/analytics/summary` | Dashboard KPIs: counts, spend, averages. |
| | GET | `/analytics/time-series` | Time-bucketed series for charts. |
| | GET | `/analytics/vendor-stats` | Top vendors by spend. |
| | GET | `/analytics/category-breakdown` | Spend by category. |
| | GET | `/analytics/spending-forecast` | Simple forward-looking projection. |
| | GET | `/analytics/monthly-spend` | Spend for a given year/month. |
| anomalies | GET | `/anomalies` | Rule-based anomaly list. |
| chat | POST | `/chat/insights` | Agentic RAG + SQL pipeline. Body: `query` (required), `use_rag`, `use_sql`, optional `history` (array of `{ role, content }`; the server uses the last 20 non-empty turns). |
| compare | GET | `/documents/{document_id}/similar` | Similar documents (e.g. shared vendor / join logic in the service). |
| | POST | `/documents/compare` | Pairwise diff / compare (`CompareBody`). |
| documents | GET | `/documents` | List documents (filters, pagination). |
| | POST | `/documents` | Upload + parse + persist transactions. |
| | GET | `/documents/{id}` | Metadata. |
| | GET | `/documents/{id}/detail` | Rich detail payload. |
| | GET | `/documents/{id}/confidence` | Extraction confidence signals. |
| | GET | `/documents/{id}/preview` | Preview / annotated stream where implemented. |
| | PATCH | `/documents/{id}` | Update extracted JSON (`DocumentUpdateBody`). |
| exports | GET | `/exports/excel` | Download Excel export blob. |
| | GET | `/exports/summary` | Text/markdown summary for export UX. |
| insights-report | POST | `/insights/generate-report` | LLM-generated narrative report. |
| receipt-matching | GET | `/receipt-matching/unmatched` | Queue of unmatched receipts. |
| | POST | `/receipt-matching/{receipt_doc_id}/match` | Link receipt to candidate transaction. |
Legacy (no `/api/v1` prefix): `POST /chat/insights` — same body/response as v1 chat.
Interactive docs: `/docs` (Swagger UI).
- Framework: Next.js 14 App Router — file-based routes under `web/src/app/`.
- Data fetching: `@tanstack/react-query` on the dashboard and other data screens; `fetch` wrappers in `web/src/lib/api.ts` target `${NEXT_PUBLIC_API_URL}/api/v1`.
- Chat: `chat/page.tsx` sends the last 20 user/assistant turns as `history` with each `POST /chat/insights` request so follow-up questions keep context server-side.
- Theming: `next-themes` with Tailwind `darkMode: "class"`; landing vs app shell split in `web/src/components/ShellLayout.tsx` (`/` uses the marketing layout; other routes use the `AppShell` sidebar).
- Motion: Framer Motion primitives (`web/src/components/motion/`).
- Icons / branding: logos in `web/public/` (`logo.png`, `logo-dark.png`); favicons `favicon.ico`, `icon.png`, `apple-icon.png` with `metadata.icons` in `layout.tsx`.
Pages (examples): dashboard, chat, documents, insights, anomalies, compare, export, receipt-matching; the marketing home composes `LandingStory`.
Python (`api/requirements.txt`) — grouped by role
| Group | Examples |
|---|---|
| HTTP | `fastapi`, `uvicorn`, `python-multipart`, `pydantic`, `pydantic-settings` |
| DB | `sqlalchemy`, `psycopg2-binary` |
| Agents | `langgraph`, `langchain-core`, `langchain-community` |
| Vectors / ML | `faiss-cpu`, `sentence-transformers`, `numpy`, `pandas` |
| Documents | `pdfplumber`, `pytesseract`, `Pillow`, `opencv-python`, `openpyxl` |
| LLM HTTP | `requests` |
Operational scripts (`api/scripts/`)
Not invoked by default: `seed_db.py`, `ingest_docs.py`, `build_embeddings.py`, `migrate_database.py`, `add_documents_from_folder.py`, `diagnose_and_fix_transactions.py`, `download_huggingface_dataset.py`, `preload_kaggle_invoices.py`, etc. Use them manually for migrations, backfills, demo data, and embedding rebuilds.
`api/Dockerfile` COPYs from the repo root: `docker build -f api/Dockerfile .` installs the Tesseract system packages and Python deps, copies `api/app` and `api/scripts`, creates the `data/` subtrees, and runs uvicorn on port 8000.
Compose files wire Postgres + env; see the repository-root YAMLs.
`railway.toml`: Dockerfile builder, `uvicorn app.main:app --host 0.0.0.0 --port $PORT`, health check `/health`, restart policy.
- Frontend: deploy the `web/` subdirectory (e.g. Vercel). Set `NEXT_PUBLIC_API_URL` to your API’s public origin. Set `NEXT_PUBLIC_SITE_URL` or rely on `VERCEL_URL` for metadata (see layout).
- Backend: container host (Fly, Railway, Cloud Run, etc.) using `api/Dockerfile` with the root build context. Inject `GOOGLE_API_KEY`, the DB URL, and `CORS_ORIGINS` matching the exact browser origin(s) in production.
- Persistence: mount a volume (or an object-storage strategy) for `api/data/raw_docs`, `api/data/embeddings`, and the SQLite file if used—ephemeral disks lose indexes and uploads on restart.
DocSage uses JWT-based authentication with email/password registration and optional Google OAuth.
- User model in `api/app/models.py`: email (unique), hashed password (nullable for OAuth-only), optional `oauth_provider` / `oauth_sub`.
- Auth router at `/api/v1/auth/` (`api/app/routers/auth.py`); a usage sketch follows this list:
  - `POST /register` — email + password; returns JWT.
  - `POST /login` — email + password; returns JWT.
  - `GET /me` — current user from token.
  - `GET /google` — redirects to the Google consent screen.
  - `GET /google/callback` — exchanges the code, upserts the user, redirects to the frontend with the token in the URL hash.
- Dependency `get_current_user` in `api/app/deps.py` protects all non-auth routes.
- Tenant isolation: every query in documents, transactions, analytics, anomalies, compare, receipts, export, reports, and chat is filtered by `user_id`.
- Per-user upload paths: files are stored under `data/raw_docs/user_{id}/`.
- Per-user RAG: FAISS indexes live at `data/embeddings/user_{id}/`.
- Chat sessions API at `/api/v1/chat/sessions` (CRUD scoped to the current user).
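An illustrative end-to-end flow against those endpoints (URL, credentials, and the `access_token` response field name are assumptions):

```python
# Register (or log in) to obtain a JWT, then call protected routes.
import requests

BASE = "http://localhost:8000/api/v1"
creds = {"email": "me@example.com", "password": "s3cret!"}

r = requests.post(f"{BASE}/auth/register", json=creds)
if not r.ok:  # account may already exist; fall back to login
    r = requests.post(f"{BASE}/auth/login", json=creds)
token = r.json()["access_token"]  # assumed field name for the JWT

# All non-auth routes expect the bearer token and filter rows by user_id.
headers = {"Authorization": f"Bearer {token}"}
print(requests.get(f"{BASE}/auth/me", headers=headers).json())
print(requests.get(f"{BASE}/documents", headers=headers).json())
```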
- `AuthProvider` in `web/src/contexts/auth.tsx`: stores the JWT in `localStorage`; exposes `login`, `register`, `logout`, `setTokenFromOAuth`.
- Bearer token added to all API requests via `web/src/lib/api.ts`.
- Route protection in `ShellLayout`: unauthenticated users are redirected to `/login`; public routes: `/`, `/login`, `/register`, `/auth/callback`.
- Login/Register pages: Google sign-in is shown only when the API reports OAuth is enabled (`GET /auth/config`) or `NEXT_PUBLIC_GOOGLE_OAUTH_ENABLED=true`.
- Chat sessions sync to the server API when authenticated and fall back to localStorage when offline.
| Variable | Purpose |
|---|---|
| `JWT_SECRET` | Secret for HS256 token signing (change in production). |
| `JWT_EXPIRE_MINUTES` | Token validity (default 7 days). |
| `GOOGLE_OAUTH_CLIENT_ID` | Google Cloud console client ID (optional). |
| `GOOGLE_OAUTH_CLIENT_SECRET` | Matching secret. |
| `GOOGLE_OAUTH_REDIRECT_URI` | Must match the console; default `http://localhost:8000/api/v1/auth/google/callback`. |
| `FRONTEND_URL` | Where the OAuth callback redirects with the token hash fragment. |
Public `GET /api/v1/auth/config` (no auth): returns `{ "google_oauth_enabled": boolean }` so the web UI can hide “Continue with Google” when OAuth is not configured on the server.
Run `python scripts/migrate_database.py` from `api/` to add `user_id` columns to existing tables and create the `users` / `chat_sessions` tables. Existing rows without a `user_id` are hidden from authenticated queries until backfilled.
| Area | Limitation | Possible extension |
|---|---|---|
| FAISS | `IndexFlatL2` is linear; slow at very large N | IVF / HNSW, or a managed vector DB (Pinecone, pgvector, …). |
| Index updates | `add_documents` rebuilds the whole index | Incremental add, background jobs, versioning. |
| Schema | `create_all` only; no Alembic in tree | Migrations for production schema evolution. |
| Chat | `history` trimmed to 20 turns with content (`chat.py`) | Configurable cap, thread storage, or a rolling summary. |
| OCR | Host must have Tesseract unless using the Docker image | Cloud OCR APIs, better layout models. |
| Compare routes | Mounted at `/documents/...` alongside document CRUD | Ensure route ordering in OpenAPI matches FastAPI resolution for edge IDs. |
| Secrets | Never commit a real `GOOGLE_API_KEY`; rotate if leaked | Secret manager; `.env` gitignored (already). |
This README is the authoritative high-level map of the codebase; for line-level behavior, follow the links into `api/app/` and `web/src/`.