DocuMind is a local-first retrieval-augmented generation (RAG) system for technical and research document libraries. Documents are ingested, chunked, embedded into ChromaDB (cosine space), and queried through a FastAPI surface. Answers are grounded on retrieved passages, returned with structured citations, and shaped by mode-specific generation policies. Default LLM and embedding inference run via Ollama on operator-controlled hardware.
This document specifies architecture, control flows, configuration, and operational behavior sufficient for engineering review, extension, and production hardening.
- System overview
- Design principles
- Repository layout
- Runtime architecture
- Data lifecycle: ingest → index
- Retrieval and generation pipeline
- Query modes
- FLARE-inspired active retrieval
- HTTP API
- Configuration
- Security middleware
- Observability and reliability
- Deployment
- Bundled corpus and scripts
- Testing
- Known limitations and extension points
- Portfolio artifacts
- References
| Layer | Responsibility |
|---|---|
| Presentation | Next.js 15 dashboard (web/) for operator workflows; optional Streamlit (frontend/app.py) calling the same REST API. |
| Application | FastAPI application (app/main.py): routing, middleware, dependency injection, lifespan-managed singletons. |
| Domain services | Document parsing and chunking (app/services/document_service.py, app/utils/chunker.py); vector persistence (app/services/embedding_service.py); RAG orchestration (app/services/rag_service.py). |
| Model I/O | Ollama client (app/utils/ollama_client.py): chat completions and per-text embeddings over HTTP. |
| Persistence | Chroma persistent client on disk (CHROMA_PERSIST_DIR); collection metadata uses cosine distance (hnsw:space: cosine). |
Ports (convention): API 8001, Next.js dev 3002, Ollama 11434.
- Grounding first — Final user-facing answers for LLM-backed modes are conditioned only on retrieved chunk text; prompts explicitly forbid inventing papers, metrics, or datasets absent from context.
- Explicit provenance — Responses include `SourceCitation` objects (document id, title, section, page hint, chunk index, distance, preview).
- Dependency-aware serving — Liveness vs readiness split so orchestrators can distinguish "process up" from "dependencies usable".
- Configurable retrieval policy — Top‑k, distance cutoff, keyword rerank weight, fallback when strict filtering returns nothing, and diversity caps are all environment-tunable.
- Single-tenant baseline — One shared library index per deployment; ACLs per document are not implemented in-tree (see §16).
| Path | Role |
|---|---|
| `app/main.py` | FastAPI app, lifespan, global exception handler, middleware, router includes. |
| `app/config.py` | pydantic-settings `Settings`; single cached `get_settings()`. |
| `app/logging_config.py` | Optional JSON logging layout. |
| `app/routers/ingest.py` | Multipart ingest, delete by `doc_id`. |
| `app/routers/papers.py` | List / get / delete paper metadata from index. |
| `app/routers/query.py` | RAG query and collection stats. |
| `app/routers/arxiv.py` | arXiv PDF fetch by id. |
| `app/services/document_service.py` | File type detection, text extraction, delegation to chunker. |
| `app/services/embedding_service.py` | Chroma add/query/delete; Ollama embeddings. |
| `app/services/rag_service.py` | Retrieval, rerank, diversity, mode prompts, FLARE branch, Ollama chat. |
| `app/utils/chunker.py` | `RecursiveCharacterTextSplitter`; section heuristics in metadata. |
| `app/utils/ollama_client.py` | Retry-wrapped HTTP to Ollama `/api/chat` and `/api/embeddings`. |
| `app/models/` | Pydantic request/response models shared by routers. |
| `data/sample_docs/` | Bundled UTF-8 corpus (see §14). |
| `tests/` | API and unit tests; `tests/conftest.py` uses dependency overrides and fake embedding/RAG for isolation. |
| `evaluation/` | Optional regression fixtures for pipeline shape. |
| `scripts/` | Corpus generators, portfolio PDF, arXiv bulk helpers. |
| `web/` | Next.js operator UI. |
| `Dockerfile` / `docker-compose.yml` | Container image (Python 3.11-slim, non-root) and Compose stack with Chroma volume + healthcheck. |
```mermaid
flowchart TB
    subgraph clients [Clients]
        N[Next.js]
        S[Streamlit]
    end
    subgraph api [DocuMind API]
        F[FastAPI]
        L[Lifespan: services + seed]
    end
    subgraph svc [Services]
        D[DocumentService]
        E[ChromaEmbeddingService]
        R[RAGService]
    end
    subgraph ext [External]
        O[Ollama]
        C[(ChromaDB)]
    end
    N --> F
    S --> F
    F --> L
    L --> D
    L --> E
    L --> R
    D --> E
    R --> E
    R --> O
    E --> O
    E --> C
```
Lifespan (app/main.py): On startup, constructs OllamaClient, ChromaEmbeddingService, DocumentService, RAGService. Runs seed_sample_docs() when Ollama is healthy: compares SAMPLE_CORPUS_VERSION marker on disk to settings; on mismatch, deletes sample_* vectors, rewrites marker, then ingests each data/sample_docs/*.txt as sample_<stem>.
Routers mount under /api/v1 except health routes at root.
- Input: `POST /api/v1/ingest` (multipart file) or `POST /api/v1/fetch-arxiv` (JSON `arxiv_id`).
- Validation: File size cap `MAX_FILE_SIZE_MB`; MIME/type checks in ingest router / document service.
- Extraction: PyPDF2 for PDF, python-docx for DOCX, raw decode for `.txt`.
- Metadata: Heuristic title, authors, year, optional arXiv id from leading text when parseable.
- Chunking: `DocumentChunker` uses LangChain `RecursiveCharacterTextSplitter` with `CHUNK_SIZE` and `CHUNK_OVERLAP`. Each `langchain_core.documents.Document` carries metadata: `doc_id`, `filename`, `section` (heuristic), `chunk_index`, `page_number` when known, etc.
- Indexing: `ChromaEmbeddingService.add_documents` embeds each chunk via Ollama `EMBEDDING_MODEL`, writes to Chroma with stable ids `{doc_id}_{i}` and metadata including `doc_id` for deletion and listing.
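The chunk-to-record step above can be sketched in a few lines. This is a deliberately naive stand-in for `RecursiveCharacterTextSplitter` (fixed-width character windows, no separator awareness), and the helper names are illustrative — only the stable `{doc_id}_{i}` id scheme and the metadata shape mirror the real pipeline:

```python
# Naive sketch of chunk -> stable-id records; the real splitter is LangChain's
# RecursiveCharacterTextSplitter, which respects separators and boundaries.

def naive_split(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-width character windows with overlap (illustrative only)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

def build_records(doc_id: str, text: str) -> list[dict]:
    """Stable ids '{doc_id}_{i}' let deletion and listing key off doc_id."""
    return [
        {"id": f"{doc_id}_{i}", "content": chunk,
         "metadata": {"doc_id": doc_id, "chunk_index": i}}
        for i, chunk in enumerate(naive_split(text))
    ]

records = build_records("sample_transformers", "x" * 2000)
# ids: sample_transformers_0, sample_transformers_1, sample_transformers_2
```

Because every chunk id embeds its `doc_id` prefix and metadata carries `doc_id`, a document delete reduces to a single metadata-filtered delete against the collection.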
DELETE /api/v1/papers/{doc_id} and DELETE /api/v1/ingest/{doc_id} call embedding_service.delete_document. If no chunks exist for that doc_id, the service returns false and the API responds 404 — empty delete is not silently successful.
Chroma collection is created with metadata={"hnsw:space": "cosine"}. Query results expose distance per hit; the RAG layer sorts ascending (lower distance = closer match) and keeps rows with distance < RELEVANCE_THRESHOLD before optional fallback (threshold is a tunable cutoff on this distance scale for your embedding model and corpus).
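A minimal sketch of the sort-and-cut semantics on that distance scale (the threshold value and hit rows here are illustrative; tune `RELEVANCE_THRESHOLD` for your embedding model and corpus):

```python
# Cosine *distance* from Chroma: 0.0 = identical direction, larger = less similar.
RELEVANCE_THRESHOLD = 0.6  # illustrative value, not the project default

hits = [
    {"content": "attention is all you need", "distance": 0.21},
    {"content": "unrelated cooking blog",    "distance": 0.83},
    {"content": "transformer architectures", "distance": 0.34},
]

ranked = sorted(hits, key=lambda h: h["distance"])           # ascending: closest first
kept = [h for h in ranked if h["distance"] < RELEVANCE_THRESHOLD]
# kept -> the two transformer-related rows, closest first
```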
All logic below is implemented in app/services/rag_service.py unless noted.
For a user top_k and query_mode, the service expands the vector search n_results before reranking (e.g. up to 64 for general / compare, up to 56 for other modes). This widens the candidate pool so rerank and diversity filters have material to work with.
- `embedding_service.search(embed_query, retrieve_k, section_filter)` returns rows `{content, metadata, distance}`.
- Keyword rerank: Rows are sorted by `distance − KEYWORD_RERANK_WEIGHT × keyword_overlap_score(rerank_query, content)`, so lexical overlap with the user question can reorder within a distance band.
- Threshold filter: Keep rows with `distance < RELEVANCE_THRESHOLD`.
- Fallback: If nothing passes and `ENABLE_FALLBACK_RETRIEVAL` is true, take the top `FALLBACK_TOP_N` by rerank order and mark internally (the answer may append a disclosure line).
- Diversity: `_select_diverse_sources` prefers at most one strong chunk per `doc_id` before filling remaining slots, reducing single-document context monopolization.
- Context slot cap: Depends on `query_mode` (e.g. up to 24 chunks for `general` / `compare`).
- `datasets` mode: Does not call the LLM for the main body. It scans retrieved chunk text for known dataset hints and patterns, and emits a structured Markdown inventory. FLARE is skipped.
- Other modes: Builds a single context block from selected chunks, applies the mode's system prompt from `SYSTEM_PROMPTS`, calls `OllamaClient.chat` with mode-dependent temperature, and returns a Markdown answer plus a `SourceCitation` list.
- Confidence: Derived from mean chunk distance (clamped); exposed as a scalar for UI.
| `query_mode` | Behavior |
|---|---|
| `general` | Broad grounded synthesis; higher temperature than `methodology`. |
| `compare` | Cross-paper comparison framing; large retrieval budget; table-oriented prompt. |
| `methodology` | Implementation-focused extraction; moderate temperature. |
| `datasets` | Deterministic dataset / benchmark surfacing from chunk text. |
| `reproduce` | Reproducibility checklist style; structured sections in prompt. |
Optional `section_filter` restricts the Chroma `where` clause on the metadata `section` field (`abstract`, `introduction`, `methodology`, `experiments`, `results`, `conclusion`).
Full FLARE (Jiang et al., arXiv:2305.06983) uses token-level confidence to trigger mid-generation retrieval. Ollama’s chat API used here does not expose per-token logprobs.
Implementation: When `use_flare` (request) or `FLARE_ACTIVE_RETRIEVAL` (settings) is true and mode ≠ `datasets`:
- Run the standard first-pass retrieval → context selection.
- Build a truncated mini-context (bounded by `FLARE_DRAFT_MAX_CONTEXT_CHARS`) from selected chunks.
- Call the LLM once with `FLARE_DRAFT_SYSTEM` to produce a 2–4 sentence forward-looking draft; unsupported facts must appear as `???` or explicit excerpt-level hedges.
- If `flare_triggers_follow_up(draft)` is true, run a second `search` with a composite query (user question + draft excerpt, capped length).
- Merge reranked lists by chunk identity, keeping the better (lower) distance per chunk; re-run threshold, fallback, and diversity on the merged set.
- Final synthesis uses merged chunks. Response fields `flare_enabled` and `flare_followup_retrieval` record what occurred.
| Method | Path | Body / params | Notes |
|---|---|---|---|
| GET | `/health` | — | Ollama availability + collection stats. |
| GET | `/health/live` | — | Process liveness. |
| GET | `/health/ready` | — | 503 if dependencies not ready. |
| POST | `/api/v1/ingest` | multipart/form-data `file` | Returns ingest stats JSON. |
| DELETE | `/api/v1/ingest/{doc_id}` | — | 404 if no chunks. |
| POST | `/api/v1/fetch-arxiv` | `{ "arxiv_id": "..." }` | Downloads PDF, ingests. |
| POST | `/api/v1/query` | `QueryRequest` JSON | See `app/models/request_models.py`. |
| GET | `/api/v1/papers` | — | Library cards. |
| GET | `/api/v1/papers/{doc_id}` | — | One document. |
| DELETE | `/api/v1/papers/{doc_id}` | — | 404 if no chunks. |
| GET | `/api/v1/collection/stats` | — | Aggregate counts. |
OpenAPI: /docs, /redoc, /openapi.json unless DISABLE_OPENAPI=true.
Authentication: When API_KEY is non-empty, all /api/v1/* routes (except CORS preflight) require header X-API-Key matching the setting; mismatch → 401.
All keys are listed in .env.example. Grouped reference:
| Group | Variables | Purpose |
|---|---|---|
| Models | `OLLAMA_BASE_URL`, `LLM_MODEL`, `EMBEDDING_MODEL` | Inference endpoints and model tags. |
| Vector store | `CHROMA_PERSIST_DIR`, `CHROMA_COLLECTION_NAME` | On-disk path and logical collection. |
| Chunking | `CHUNK_SIZE`, `CHUNK_OVERLAP` | Text splitter parameters; affects chunk count and context granularity. |
| Retrieval defaults | `TOP_K_RESULTS`, `RELEVANCE_THRESHOLD`, `ENABLE_FALLBACK_RETRIEVAL`, `FALLBACK_TOP_N`, `KEYWORD_RERANK_WEIGHT` | Global defaults; per-request `top_k` overrides for query. |
| Ingest | `MAX_FILE_SIZE_MB`, `ARXIV_BASE_URL` | Upload cap and arXiv PDF export host. |
| Sample corpus | `SAMPLE_CORPUS_VERSION` | Bump to purge and re-seed all `sample_*` docs on startup. |
| Network | `CORS_ORIGINS`, `CORS_ALLOW_ALL`, `TRUSTED_HOSTS` | Browser and Host-header policy. |
| App | `APP_ENV`, `DISABLE_OPENAPI` | Environment label; docs toggle. |
| Security / transport | `API_KEY`, `ENABLE_RESPONSE_GZIP` | Optional API key gate; gzip responses. |
| Logging | `LOG_LEVEL`, `LOG_JSON` | Verbosity and JSON log lines. |
| FLARE | `FLARE_ACTIVE_RETRIEVAL`, `FLARE_DRAFT_MAX_CONTEXT_CHARS` | Global FLARE default and draft context budget. |
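The in-tree `Settings` uses pydantic-settings; a dependency-free sketch of the same cached-singleton pattern follows (the field set and some defaults are abbreviated or illustrative):

```python
import os
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class Settings:
    """Dependency-free stand-in for the pydantic-settings Settings class."""
    ollama_base_url: str = "http://localhost:11434"
    chunk_size: int = 800
    relevance_threshold: float = 0.6   # illustrative default

@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Single cached instance, mirroring app/config.py's get_settings()."""
    return Settings(
        ollama_base_url=os.environ.get("OLLAMA_BASE_URL", Settings.ollama_base_url),
        chunk_size=int(os.environ.get("CHUNK_SIZE", Settings.chunk_size)),
        relevance_threshold=float(
            os.environ.get("RELEVANCE_THRESHOLD", Settings.relevance_threshold)
        ),
    )
```

Caching one instance means every router and service sees the same configuration, and tests can clear the cache to inject overrides.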
Applied in app/main.py (order matters for FastAPI / Starlette):
- CORS — `CORSMiddleware` with explicit origins, or wildcard when `CORS_ALLOW_ALL` (dev-only).
- Trusted hosts — Optional `TrustedHostMiddleware` when `TRUSTED_HOSTS` is set.
- Gzip — `GZipMiddleware` when `ENABLE_RESPONSE_GZIP` and payload exceeds minimum size.
- Per-request — `X-Request-ID` assignment, optional API key gate, default security headers (`X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy`; `Permissions-Policy` in production `APP_ENV`).
- Errors — `HTTPException` and `RequestValidationError` return structured JSON; uncaught exceptions return 500 with `request_id` in body.
- Request correlation — Every response carries `X-Request-ID`; access logs include `request_id`, method, path, status, `duration_ms`.
- Structured logs — `LOG_JSON=true` for log platforms.
- Healthchecks — Docker Compose defines an HTTP probe against `/health/live` (see `docker-compose.yml`). Prefer `/health/ready` for LB routing when Ollama and Chroma must be live.
| Target | Command / notes |
|---|---|
| Docker Compose | docker compose up --build — publishes 8001, mounts Chroma volume chroma_data, read-only ./data. Set OLLAMA_BASE_URL to reachable Ollama (default host.docker.internal:11434 on Docker Desktop). |
| Bare metal / VM | uvicorn app.main:app --host 0.0.0.0 --port 8001 (add --proxy-headers behind TLS terminator per your platform). |
| Windows dev | .\start_documind.ps1 (Ollama, API, Next); uses .venv\Scripts\python.exe when present. First boot can sit in corpus ingest for a long time before /health responds; the script waits up to 180 minutes (-MaxApiWaitMinutes). -SkipModelPull speeds repeat boots. .\stop_documind.ps1 clears ports 3002, 8001, 11434 — confirm Ollama shutdown is intended. |
Backup: Copy CHROMA_PERSIST_DIR regularly; it is the authoritative index. Source PDFs/DOCX should remain in object storage or VCS-independent archives if they are not all under data/.
- `data/sample_docs/` — Approximately 460 UTF-8 technical briefs: curated landmark-style summaries plus 400 deterministic synthetic papers `sample_corpus_p7_*.txt` generated by `scripts/generate_production_corpus.py`. Expect on the order of 5k–10k chunks at default `CHUNK_SIZE=800` after full ingest.
- Regeneration: `python scripts/generate_production_corpus.py --count 500 --force`, then bump `SAMPLE_CORPUS_VERSION` again so startup purges old `sample_*` rows.
- Hand-authored expansion: `scripts/materialize_institutional_corpus.py` adds named institutional-style briefs (skips existing filenames).
- arXiv bulk: `scripts/bulk_ingest_arxiv.py` + `data/arxiv_seed_list.txt`.
Run `pytest -q`. `tests/conftest.py` overrides FastAPI dependencies with fake embedding/RAG services so unit tests do not require Ollama or Chroma.
Query regression suite: tests/test_rag_query_suite.py runs 20 parameterized cases (tests/query_eval_cases.py) against real RAGService with a ranking fake vector layer (tests/ranking_fake_embedding.py) and a deterministic Ollama stub — metrics cover status, has_answer, source counts, answer substrings, and wall time. Live library smoke: python scripts/run_query_eval.py --base-url http://127.0.0.1:8001 (optional --csv report.csv; skips empty-corpus cases with --skip-empty-corpus-cases).
CI: On push and pull request to main or master, GitHub Actions (.github/workflows/ci.yml) runs on Python 3.11 and 3.12: ruff check (syntax / undefined-name rules), then pytest. No Ollama or Chroma in CI. Pytest and Ruff defaults: pyproject.toml. Dependabot for Actions: .github/dependabot.yml.
Not implemented in this repository (non-exhaustive):
- Per-user or per-tenant ACL on chunks or documents.
- SSO / OIDC for the API or UI.
- OCR pipeline for low-quality scanned PDFs beyond basic text extraction.
- Hosted managed vector SaaS swap (Pinecone, Weaviate, etc.) — would replace `ChromaEmbeddingService` while preserving router contracts.
- Token-level FLARE — requires a host that exposes logprobs or an alternative uncertainty model.
Natural extensions: swap Ollama for OpenAI/Azure OpenAI behind the same RAGService boundary; add golden-set eval CI; wire /health/ready to load balancers.
Under portfolio/: client project catalog HTML, portfolio brief HTML, optional PDF generation (scripts/portfolio_requirements.txt, scripts/generate_portfolio_pdf.py), dashboard screenshot portfolio/screenshots/documind-dashboard.png.
Regenerating the screenshot (recommended): A bare playwright screenshot of the home page misses indexed doc counts and synthesis. Use the bundled driver after API + Next are up and sample ingest has progressed:
```powershell
.\.venv\Scripts\pip install -r scripts\screenshot_requirements.txt
.\.venv\Scripts\playwright install chromium
.\scripts\capture_dashboard.ps1   # waits for /health/live, Gold demo scenario (compare, Top K 24, FLARE), synthesis text, tall-viewport PNG
# Smaller corpus / faster index gate: .\scripts\capture_dashboard.ps1 -MinDocs 40
# Custom API / wait cap: .\scripts\capture_dashboard.ps1 -ApiBase "http://127.0.0.1:8001" -MaxLivenessWaitMinutes 240
```
Or directly: `.\.venv\Scripts\python scripts\capture_dashboard_playwright.py --help` — waits for ≥120 chars in `.prose-answer`, scrolls synthesis into view, writes a 1680×3200 `portfolio/screenshots/documind-dashboard.png` (default `--viewport-width 1680`; use `--viewport-width 1440` if needed), then a 1000×750 `portfolio/screenshots/documind-upwork-catalog-1000x750.png` (default stack infographic tile; `--plain-catalog-thumb` top-crops the dashboard). Thumb only: `.\.venv\Scripts\python scripts\capture_dashboard_playwright.py --thumb-only`. Standalone tile: `python scripts/catalog_thumb_art.py --out portfolio/screenshots/documind-upwork-catalog-1000x750.png`. Avoid `--full-page` for portfolio assets.
- Jiang et al., Active Retrieval Augmented Generation (FLARE), arXiv:2305.06983.
- FastAPI, Pydantic v2, ChromaDB, LangChain text splitters, Ollama HTTP API.
Python 3.11+ (Dockerfile pins 3.11-slim), FastAPI, Uvicorn, Pydantic Settings, ChromaDB, langchain_core + langchain-text-splitters, Ollama, Next.js 15, React 18, TypeScript, pytest, optional Streamlit.