cdtalley/DocuMind

DocuMind — Technical Reference

DocuMind is a local-first retrieval-augmented generation (RAG) system for technical and research document libraries. Documents are ingested, chunked, embedded into ChromaDB (cosine space), and queried through a FastAPI surface. Answers are grounded on retrieved passages, returned with structured citations, and shaped by mode-specific generation policies. Default LLM and embedding inference run via Ollama on operator-controlled hardware.

This document specifies architecture, control flows, configuration, and operational behavior sufficient for engineering review, extension, and production hardening.


Table of contents

  1. System overview
  2. Design principles
  3. Repository layout
  4. Runtime architecture
  5. Data lifecycle: ingest → index
  6. Retrieval and generation pipeline
  7. Query modes
  8. FLARE-inspired active retrieval
  9. HTTP API
  10. Configuration
  11. Security middleware
  12. Observability and reliability
  13. Deployment
  14. Bundled corpus and scripts
  15. Testing
  16. Known limitations and extension points
  17. Portfolio artifacts
  18. References

1. System overview

| Layer | Responsibility |
| --- | --- |
| Presentation | Next.js 15 dashboard (web/) for operator workflows; optional Streamlit (frontend/app.py) calling the same REST API. |
| Application | FastAPI application (app/main.py): routing, middleware, dependency injection, lifespan-managed singletons. |
| Domain services | Document parsing and chunking (app/services/document_service.py, app/utils/chunker.py); vector persistence (app/services/embedding_service.py); RAG orchestration (app/services/rag_service.py). |
| Model I/O | Ollama client (app/utils/ollama_client.py): chat completions and per-text embeddings over HTTP. |
| Persistence | Chroma persistent client on disk (CHROMA_PERSIST_DIR); collection metadata uses cosine distance (hnsw:space: cosine). |

Ports (convention): API 8001, Next.js dev 3002, Ollama 11434.


2. Design principles

  1. Grounding first — Final user-facing answers for LLM-backed modes are conditioned only on retrieved chunk text; prompts explicitly forbid inventing papers, metrics, or datasets absent from context.
  2. Explicit provenance — Responses include SourceCitation objects (document id, title, section, page hint, chunk index, distance, preview).
  3. Dependency-aware serving — Liveness vs readiness split so orchestrators can distinguish “process up” from “dependencies usable”.
  4. Configurable retrieval policy — Top‑k, distance cutoff, keyword rerank weight, fallback when strict filtering returns nothing, and diversity caps are all environment-tunable.
  5. Single-tenant baseline — One shared library index per deployment; ACLs per document are not implemented in-tree (see §16).

3. Repository layout

| Path | Role |
| --- | --- |
| app/main.py | FastAPI app, lifespan, global exception handler, middleware, router includes. |
| app/config.py | pydantic-settings Settings; single cached get_settings(). |
| app/logging_config.py | Optional JSON logging layout. |
| app/routers/ingest.py | Multipart ingest, delete by doc_id. |
| app/routers/papers.py | List / get / delete paper metadata from index. |
| app/routers/query.py | RAG query and collection stats. |
| app/routers/arxiv.py | arXiv PDF fetch by id. |
| app/services/document_service.py | File type detection, text extraction, delegation to chunker. |
| app/services/embedding_service.py | Chroma add/query/delete; Ollama embeddings. |
| app/services/rag_service.py | Retrieval, rerank, diversity, mode prompts, FLARE branch, Ollama chat. |
| app/utils/chunker.py | RecursiveCharacterTextSplitter; section heuristics in metadata. |
| app/utils/ollama_client.py | Retry-wrapped HTTP to Ollama /api/chat and /api/embeddings. |
| app/models/ | Pydantic request/response models shared by routers. |
| data/sample_docs/ | Bundled UTF-8 corpus (see §14). |
| tests/ | API and unit tests; tests/conftest.py uses dependency overrides and fake embedding/RAG services for isolation. |
| evaluation/ | Optional regression fixtures for pipeline shape. |
| scripts/ | Corpus generators, portfolio PDF, arXiv bulk helpers. |
| web/ | Next.js operator UI. |
| Dockerfile / docker-compose.yml | Container image (Python 3.11-slim, non-root) and Compose stack with Chroma volume + healthcheck. |

4. Runtime architecture

```mermaid
flowchart TB
  subgraph clients [Clients]
    N[Next.js]
    S[Streamlit]
  end
  subgraph api [DocuMind API]
    F[FastAPI]
    L[Lifespan: services + seed]
  end
  subgraph svc [Services]
    D[DocumentService]
    E[ChromaEmbeddingService]
    R[RAGService]
  end
  subgraph ext [External]
    O[Ollama]
    C[(ChromaDB)]
  end
  N --> F
  S --> F
  F --> L
  L --> D
  L --> E
  L --> R
  D --> E
  R --> E
  R --> O
  E --> O
  E --> C
```

Lifespan (app/main.py): On startup, constructs OllamaClient, ChromaEmbeddingService, DocumentService, RAGService. Runs seed_sample_docs() when Ollama is healthy: compares SAMPLE_CORPUS_VERSION marker on disk to settings; on mismatch, deletes sample_* vectors, rewrites marker, then ingests each data/sample_docs/*.txt as sample_<stem>.

Routers mount under /api/v1 except health routes at root.


5. Data lifecycle: ingest → index

5.1 Ingestion

  • Input: POST /api/v1/ingest (multipart file) or POST /api/v1/fetch-arxiv (JSON arxiv_id).
  • Validation: File size cap MAX_FILE_SIZE_MB; MIME/type checks in ingest router / document service.
  • Extraction: PyPDF2 for PDF, python-docx for DOCX, raw decode for .txt.
  • Metadata: Heuristic title, authors, year, optional arXiv id from leading text when parseable.
  • Chunking: DocumentChunker uses LangChain RecursiveCharacterTextSplitter with CHUNK_SIZE and CHUNK_OVERLAP. Each langchain_core.documents.Document carries metadata: doc_id, filename, section (heuristic), chunk_index, page_number when known, etc.
  • Indexing: ChromaEmbeddingService.add_documents embeds each chunk via Ollama EMBEDDING_MODEL, writes to Chroma with stable ids {doc_id}_{i} and metadata including doc_id for deletion and listing.
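The chunk-and-index bookkeeping above can be sketched in a few lines. This is an illustrative stand-in only: the real splitter is LangChain's RecursiveCharacterTextSplitter with section heuristics, not a fixed character window, and real embeddings come from Ollama:

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 150) -> list[str]:
    """Sliding-window chunking: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share the overlap."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

def chunk_records(doc_id: str, text: str, **kw) -> list[dict]:
    """Stable ids {doc_id}_{i} plus doc_id metadata, mirroring what Chroma
    stores so that deletion and listing can filter by doc_id."""
    return [
        {"id": f"{doc_id}_{i}", "content": c,
         "metadata": {"doc_id": doc_id, "chunk_index": i}}
        for i, c in enumerate(chunk_text(text, **kw))
    ]
```

The stable id scheme is what makes DELETE by doc_id possible without a separate document table.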

5.2 Deletion semantics

DELETE /api/v1/papers/{doc_id} and DELETE /api/v1/ingest/{doc_id} call embedding_service.delete_document. If no chunks exist for that doc_id, the service returns false and the API responds 404 — empty delete is not silently successful.

5.3 Vector space

Chroma collection is created with metadata={"hnsw:space": "cosine"}. Query results expose distance per hit; the RAG layer sorts ascending (lower distance = closer match) and keeps rows with distance < RELEVANCE_THRESHOLD before optional fallback (threshold is a tunable cutoff on this distance scale for your embedding model and corpus).
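A minimal sketch of that distance scale and cutoff, assuming Chroma's usual cosine convention of distance = 1 − cosine similarity (0 = same direction, 1 = orthogonal):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as 1 - cosine_similarity; lower means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def filter_by_threshold(hits: list[dict], threshold: float) -> list[dict]:
    """Keep hits strictly under the cutoff, sorted closest-first."""
    return sorted((h for h in hits if h["distance"] < threshold),
                  key=lambda h: h["distance"])
```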


6. Retrieval and generation pipeline

All logic below is implemented in app/services/rag_service.py unless noted.

6.1 Retrieval budget

For a user top_k and query_mode, the service expands the vector search n_results before reranking (e.g. up to 64 for general / compare, up to 56 for other modes). This widens the candidate pool so rerank and diversity filters have material to work with.

6.2 Vector search and rerank

  1. embedding_service.search(embed_query, retrieve_k, section_filter) returns rows {content, metadata, distance}.
  2. Keyword rerank: Rows are sorted by
    distance − KEYWORD_RERANK_WEIGHT × keyword_overlap_score(rerank_query, content)
    so lexical overlap with the user question can reorder within a distance band.
  3. Threshold filter: Keep rows with distance < RELEVANCE_THRESHOLD.
  4. Fallback: If nothing passes and ENABLE_FALLBACK_RETRIEVAL is true, take the top FALLBACK_TOP_N by rerank order and mark internally (answer may append a disclosure line).
  5. Diversity: _select_diverse_sources prefers at most one strong chunk per doc_id before filling remaining slots, reducing single-document context monopolization.
  6. Context slot cap: Depends on query_mode (e.g. up to 24 chunks for general / compare).
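Steps 2, 3, and 5 can be sketched as plain functions. The scoring details are illustrative: the name keyword_overlap_score and the one-strong-chunk-per-doc_id policy come from the source, but the exact formulas in rag_service.py may differ (the in-tree diversity helper is _select_diverse_sources):

```python
def keyword_overlap_score(query: str, content: str) -> float:
    """Fraction of query terms that appear in the chunk (illustrative scoring)."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    body = content.lower()
    return sum(1 for t in terms if t in body) / len(terms)

def rerank(rows: list[dict], query: str, weight: float) -> list[dict]:
    """Lower is better: vector distance discounted by lexical overlap, so
    keyword hits can reorder rows within a distance band."""
    return sorted(rows, key=lambda r: r["distance"] - weight * keyword_overlap_score(query, r["content"]))

def select_diverse(rows: list[dict], max_slots: int) -> list[dict]:
    """At most one chunk per doc_id first, then fill remaining slots in order."""
    picked, seen = [], set()
    for r in rows:
        if r["metadata"]["doc_id"] not in seen and len(picked) < max_slots:
            picked.append(r)
            seen.add(r["metadata"]["doc_id"])
    for r in rows:
        if len(picked) >= max_slots:
            break
        if r not in picked:
            picked.append(r)
    return picked
```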

6.3 Generation

  • datasets mode: Does not call the LLM for the main body. It scans retrieved chunk text for known dataset hints and patterns, emits a structured Markdown inventory. FLARE is skipped.
  • Other modes: Builds a single context block from selected chunks, applies the mode’s system prompt from SYSTEM_PROMPTS, calls OllamaClient.chat with mode-dependent temperature, returns Markdown answer plus SourceCitation list.
  • Confidence: Derived from mean chunk distance (clamped); exposed as a scalar for UI.
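The confidence scalar can be illustrated with a linear mapping from mean distance. The clamp is from the source; the 1 − mean shape is an assumption for illustration, not the exact formula in rag_service.py:

```python
def confidence_from_distances(distances: list[float]) -> float:
    """Map mean cosine distance to [0, 1]; lower mean distance yields higher
    confidence. No sources retrieved means zero confidence."""
    if not distances:
        return 0.0
    mean = sum(distances) / len(distances)
    return max(0.0, min(1.0, 1.0 - mean))
```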

7. Query modes

query_mode Behavior
general Broad grounded synthesis; higher temperature than methodology.
compare Cross-paper comparison framing; large retrieval budget; table-oriented prompt.
methodology Implementation-focused extraction; moderate temperature.
datasets Deterministic dataset / benchmark surfacing from chunk text.
reproduce Reproducibility checklist style; structured sections in prompt.

Optional section_filter restricts Chroma where clause on metadata section (abstract, introduction, methodology, experiments, results, conclusion).


8. FLARE-inspired active retrieval

Full FLARE (Jiang et al., arXiv:2305.06983) uses token-level confidence to trigger mid-generation retrieval. Ollama’s chat API used here does not expose per-token logprobs.

Implementation: When use_flare (request) or FLARE_ACTIVE_RETRIEVAL (settings) is true and mode ≠ datasets:

  1. Run the standard first-pass retrieval → context selection.
  2. Build a truncated mini-context (bounded by FLARE_DRAFT_MAX_CONTEXT_CHARS) from selected chunks.
  3. Call the LLM once with FLARE_DRAFT_SYSTEM to produce a 2–4 sentence forward-looking draft; unsupported facts must appear as ??? or explicit excerpt-level hedges.
  4. If flare_triggers_follow_up(draft) is true, run a second search with a composite query (user question + draft excerpt, capped length).
  5. Merge reranked lists by chunk identity, keeping the better (lower) distance per chunk; re-run threshold, fallback, and diversity on the merged set.
  6. Final synthesis uses merged chunks. Response fields flare_enabled and flare_followup_retrieval record what occurred.
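Step 5's merge can be sketched as a keep-the-lower-distance union keyed on chunk id (a minimal sketch; the real code then re-runs threshold, fallback, and diversity on the result):

```python
def merge_by_chunk_id(first: list[dict], second: list[dict]) -> list[dict]:
    """Union of both retrieval passes. When a chunk appears in both, keep the
    row with the lower (better) distance; return closest-first."""
    best: dict[str, dict] = {}
    for row in first + second:
        cur = best.get(row["id"])
        if cur is None or row["distance"] < cur["distance"]:
            best[row["id"]] = row
    return sorted(best.values(), key=lambda r: r["distance"])
```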

9. HTTP API

| Method | Path | Body / params | Notes |
| --- | --- | --- | --- |
| GET | /health | | Ollama availability + collection stats. |
| GET | /health/live | | Process liveness. |
| GET | /health/ready | | 503 if dependencies not ready. |
| POST | /api/v1/ingest | multipart/form-data file | Returns ingest stats JSON. |
| DELETE | /api/v1/ingest/{doc_id} | | 404 if no chunks. |
| POST | /api/v1/fetch-arxiv | { "arxiv_id": "..." } | Downloads PDF, ingests. |
| POST | /api/v1/query | QueryRequest JSON | See app/models/request_models.py. |
| GET | /api/v1/papers | | Library cards. |
| GET | /api/v1/papers/{doc_id} | | One document. |
| DELETE | /api/v1/papers/{doc_id} | | 404 if no chunks. |
| GET | /api/v1/collection/stats | | Aggregate counts. |

OpenAPI: /docs, /redoc, /openapi.json unless DISABLE_OPENAPI=true.

Authentication: When API_KEY is non-empty, all /api/v1/* routes (except CORS preflight) require header X-API-Key matching the setting; mismatch → 401.


10. Configuration

All keys are listed in .env.example. Grouped reference:

| Group | Variables | Purpose |
| --- | --- | --- |
| Models | OLLAMA_BASE_URL, LLM_MODEL, EMBEDDING_MODEL | Inference endpoints and model tags. |
| Vector store | CHROMA_PERSIST_DIR, CHROMA_COLLECTION_NAME | On-disk path and logical collection. |
| Chunking | CHUNK_SIZE, CHUNK_OVERLAP | Text splitter parameters; affects chunk count and context granularity. |
| Retrieval defaults | TOP_K_RESULTS, RELEVANCE_THRESHOLD, ENABLE_FALLBACK_RETRIEVAL, FALLBACK_TOP_N, KEYWORD_RERANK_WEIGHT | Global defaults; per-request top_k overrides for query. |
| Ingest | MAX_FILE_SIZE_MB, ARXIV_BASE_URL | Upload cap and arXiv PDF export host. |
| Sample corpus | SAMPLE_CORPUS_VERSION | Bump to purge and re-seed all sample_* docs on startup. |
| Network | CORS_ORIGINS, CORS_ALLOW_ALL, TRUSTED_HOSTS | Browser and Host-header policy. |
| App | APP_ENV, DISABLE_OPENAPI | Environment label; docs toggle. |
| Security / transport | API_KEY, ENABLE_RESPONSE_GZIP | Optional API key gate; gzip responses. |
| Logging | LOG_LEVEL, LOG_JSON | Verbosity and JSON log lines. |
| FLARE | FLARE_ACTIVE_RETRIEVAL, FLARE_DRAFT_MAX_CONTEXT_CHARS | Global FLARE default and draft context budget. |
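For illustration, the retrieval-defaults group can be modeled with a stdlib stand-in for the pydantic-settings model. Variable names match .env.example; the class, loader, and default values here are placeholders, not the repository's:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalSettings:
    """Illustrative subset of the settings surface (real code uses
    pydantic-settings with a single cached get_settings())."""
    top_k_results: int
    relevance_threshold: float
    enable_fallback_retrieval: bool

def load_retrieval_settings(env=os.environ) -> RetrievalSettings:
    # Defaults below are placeholders for the sketch, not the in-tree values.
    return RetrievalSettings(
        top_k_results=int(env.get("TOP_K_RESULTS", "5")),
        relevance_threshold=float(env.get("RELEVANCE_THRESHOLD", "0.8")),
        enable_fallback_retrieval=env.get("ENABLE_FALLBACK_RETRIEVAL", "true").lower()
        in ("1", "true", "yes"),
    )
```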

11. Security middleware

Applied in app/main.py (order matters for FastAPI / Starlette):

  • CORS — CORSMiddleware with explicit origins, or wildcard when CORS_ALLOW_ALL (dev-only).
  • Trusted hosts — Optional TrustedHostMiddleware when TRUSTED_HOSTS is set.
  • Gzip — GZipMiddleware when ENABLE_RESPONSE_GZIP and payload exceeds minimum size.
  • Per-request — X-Request-ID assignment, optional API key gate, default security headers (X-Content-Type-Options, X-Frame-Options, Referrer-Policy; Permissions-Policy in production APP_ENV).
  • Errors — HTTPException and RequestValidationError return structured JSON; uncaught exceptions return 500 with request_id in body.

12. Observability and reliability

  • Request correlation — Every response carries X-Request-ID; access logs include request_id, method, path, status, duration_ms.
  • Structured logs — LOG_JSON=true for log platforms.
  • Healthchecks — Docker Compose defines an HTTP probe against /health/live (see docker-compose.yml). Prefer /health/ready for LB routing when Ollama and Chroma must be live.

13. Deployment

| Target | Command / notes |
| --- | --- |
| Docker Compose | docker compose up --build publishes 8001, mounts Chroma volume chroma_data, and mounts ./data read-only. Set OLLAMA_BASE_URL to a reachable Ollama (default host.docker.internal:11434 on Docker Desktop). |
| Bare metal / VM | uvicorn app.main:app --host 0.0.0.0 --port 8001 (add --proxy-headers behind a TLS terminator per your platform). |
| Windows dev | .\start_documind.ps1 (Ollama, API, Next.js); uses .venv\Scripts\python.exe when present. First boot can sit in corpus ingest for a long time before /health responds; the script waits up to 180 minutes (-MaxApiWaitMinutes). -SkipModelPull speeds repeat boots. .\stop_documind.ps1 clears ports 3002, 8001, 11434; confirm Ollama shutdown is intended. |

Backup: Copy CHROMA_PERSIST_DIR regularly; it is the authoritative index. Source PDFs/DOCX should remain in object storage or VCS-independent archives if they are not all under data/.


14. Bundled corpus and scripts

  • data/sample_docs/ — Approximately 460 UTF-8 technical briefs: curated landmark-style summaries plus 400 deterministic synthetic papers sample_corpus_p7_*.txt generated by scripts/generate_production_corpus.py. Expect on the order of 5k–10k chunks at default CHUNK_SIZE=800 after full ingest.
  • Regeneration: python scripts/generate_production_corpus.py --count 500 --force then bump SAMPLE_CORPUS_VERSION again so startup purges old sample_* rows.
  • Hand-authored expansion: scripts/materialize_institutional_corpus.py adds named institutional-style briefs (skips existing filenames).
  • arXiv bulk: scripts/bulk_ingest_arxiv.py + data/arxiv_seed_list.txt.

15. Testing

```shell
pytest -q
```

tests/conftest.py overrides FastAPI dependencies with fake embedding/RAG services so unit tests do not require Ollama or Chroma.

  • Query regression suite: tests/test_rag_query_suite.py runs 20 parameterized cases (tests/query_eval_cases.py) against the real RAGService with a ranking fake vector layer (tests/ranking_fake_embedding.py) and a deterministic Ollama stub; metrics cover status, has_answer, source counts, answer substrings, and wall time.
  • Live library smoke: python scripts/run_query_eval.py --base-url http://127.0.0.1:8001 (optional --csv report.csv; --skip-empty-corpus-cases skips cases that assume a populated corpus).

CI: On push and pull request to main or master, GitHub Actions (.github/workflows/ci.yml) runs on Python 3.11 and 3.12: ruff check (syntax / undefined-name rules), then pytest. No Ollama or Chroma in CI. Pytest and Ruff defaults: pyproject.toml. Dependabot for Actions: .github/dependabot.yml.


16. Known limitations and extension points

Not implemented in this repository (non-exhaustive):

  • Per-user or per-tenant ACL on chunks or documents.
  • SSO / OIDC for the API or UI.
  • OCR pipeline for low-quality scanned PDFs beyond basic text extraction.
  • Hosted managed vector SaaS swap (Pinecone, Weaviate, etc.) — would replace ChromaEmbeddingService while preserving router contracts.
  • Token-level FLARE — requires a host that exposes logprobs or an alternative uncertainty model.

Natural extensions: swap Ollama for OpenAI/Azure OpenAI behind the same RAGService boundary; add golden-set eval CI; wire /health/ready to load balancers.


17. Portfolio artifacts

Under portfolio/: client project catalog HTML, portfolio brief HTML, optional PDF generation (scripts/portfolio_requirements.txt, scripts/generate_portfolio_pdf.py), dashboard screenshot portfolio/screenshots/documind-dashboard.png.

Regenerating the screenshot (recommended): A bare playwright screenshot of the home page misses indexed doc counts and synthesis. Use the bundled driver after API + Next are up and sample ingest has progressed:

```powershell
.\.venv\Scripts\pip install -r scripts\screenshot_requirements.txt
.\.venv\Scripts\playwright install chromium
.\scripts\capture_dashboard.ps1   # waits for /health/live, runs the Gold demo scenario (compare, Top K 24, FLARE), waits for synthesis text, writes a tall-viewport PNG
# Smaller corpus / faster index gate:  .\scripts\capture_dashboard.ps1 -MinDocs 40
# Custom API / wait cap:               .\scripts\capture_dashboard.ps1 -ApiBase "http://127.0.0.1:8001" -MaxLivenessWaitMinutes 240
```

Or directly: .\.venv\Scripts\python scripts\capture_dashboard_playwright.py --help. The script waits for ≥120 chars in .prose-answer, scrolls the synthesis into view, writes a 1680×3200 portfolio/screenshots/documind-dashboard.png (default --viewport-width 1680; use --viewport-width 1440 if needed), then a 1000×750 portfolio/screenshots/documind-upwork-catalog-1000x750.png (default stack-infographic tile; --plain-catalog-thumb top-crops the dashboard instead).

  • Thumb only: .\.venv\Scripts\python scripts\capture_dashboard_playwright.py --thumb-only.
  • Standalone tile: python scripts/catalog_thumb_art.py --out portfolio/screenshots/documind-upwork-catalog-1000x750.png.
  • Avoid --full-page for portfolio assets.


18. References

  • Jiang et al., Active Retrieval Augmented Generation (FLARE), arXiv:2305.06983.
  • FastAPI, Pydantic v2, ChromaDB, LangChain text splitters, Ollama HTTP API.

Stack summary

Python 3.11+ (Dockerfile pins 3.11-slim), FastAPI, Uvicorn, Pydantic Settings, ChromaDB, langchain_core + langchain-text-splitters, Ollama, Next.js 15, React 18, TypeScript, pytest, optional Streamlit.
