YAILA is a MERN-based AI learning platform for document chat, summaries, flashcards, quizzes, knowledge graphs, and learning roadmaps.
This repo now includes a production-ready large-document ingestion and retrieval upgrade:
- page-batched PDF parsing instead of full-document extraction first
- resumable ingestion with checkpoints
- batched embeddings and batched chunk writes
- optional Endee vector database integration for semantic search
- Mongo-backed fallback vector search for backward compatibility
- metadata-aware retrieval with document/page/section traceability
- a retrieval API for semantic search and RAG context inspection
```mermaid
flowchart LR
    A["Uploaded PDF/Image"] --> B["Document Queue"]
    B --> C["Streaming Parser / OCR"]
    C --> D["Chunk Session"]
    D --> E["Embedding Batches"]
    E --> F["Chunk Store (Mongo)"]
    E --> G["Vector Store (Mongo or Endee)"]
    F --> H["Knowledge Graph / Roadmap"]
    F --> I["Summary / Quiz / Flashcards"]
    G --> J["Semantic Retrieval"]
    F --> J
    J --> K["Chat / RAG Context Assembly"]
```
The repository has two top-level apps: `backend/` and `frontend/`.
YAILA keeps the Node/Express app structure. Endee is vendored under backend/vendor/endee and integrated as an optional HTTP vector backend behind a clean adapter layer in the backend.
```bash
cd backend
npm install
cp .env.example .env
npm run dev
```

Core env knobs live in `backend/.env.example`.
Important settings:
- `VECTOR_STORE_PROVIDER=mongo|endee`
- `INGESTION_PAGE_BATCH_SIZE`
- `INGESTION_CHUNK_BATCH_SIZE`
- `EMBEDDING_BATCH_SIZE`
- `INGESTION_CHECKPOINT_ENABLED`
- `INGESTION_USE_AI_CHUNK_SUMMARIES`
- `RETRIEVAL_TOP_K`
- `RETRIEVAL_CONTEXT_RADIUS`
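As a rough sketch of how these knobs might be read at startup (the variable names match the list above, but the default values and the `loadIngestionConfig` helper are illustrative, not the repo's actual code):

```javascript
// Illustrative config loader for the ingestion/retrieval env knobs.
// Fallback values here are examples only, not YAILA's real defaults.
function loadIngestionConfig(env = process.env) {
  const num = (key, fallback) => {
    const v = Number.parseInt(env[key] ?? '', 10);
    return Number.isNaN(v) ? fallback : v;
  };
  return {
    vectorStoreProvider: env.VECTOR_STORE_PROVIDER === 'endee' ? 'endee' : 'mongo',
    pageBatchSize: num('INGESTION_PAGE_BATCH_SIZE', 25),
    chunkBatchSize: num('INGESTION_CHUNK_BATCH_SIZE', 200),
    embeddingBatchSize: num('EMBEDDING_BATCH_SIZE', 64),
    checkpointEnabled: env.INGESTION_CHECKPOINT_ENABLED !== 'false',
    useAiChunkSummaries: env.INGESTION_USE_AI_CHUNK_SUMMARIES === 'true',
    retrievalTopK: num('RETRIEVAL_TOP_K', 4),
    retrievalContextRadius: num('RETRIEVAL_CONTEXT_RADIUS', 1),
  };
}

const cfg = loadIngestionConfig({ VECTOR_STORE_PROVIDER: 'endee', EMBEDDING_BATCH_SIZE: '32' });
console.log(cfg.vectorStoreProvider, cfg.embeddingBatchSize); // endee 32
```

Parsing with explicit fallbacks keeps local startup working even when `.env` omits most knobs.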
Default behavior stays backward compatible with Mongo vector search:
```env
VECTOR_STORE_PROVIDER=mongo
```

To use Endee:

```env
VECTOR_STORE_PROVIDER=endee
ENDEE_BASE_URL=http://localhost:8080
ENDEE_AUTH_TOKEN=
ENDEE_INDEX_NAME=document-chunks
ENDEE_SPACE_TYPE=cosine
ENDEE_PRECISION=int16
```

Endee source is vendored under `backend/vendor/endee`.
Use Endee’s own build/run docs there if you want the external vector DB path. If Endee is unavailable, YAILA still keeps Mongo chunk records and can fall back to Mongo vector retrieval.
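The fallback behavior can be sketched as follows; the adapter objects and their `search` signature are hypothetical stand-ins for the real vector-store modules:

```javascript
// Hypothetical sketch: query Endee first, fall back to Mongo vector
// search when Endee is unreachable. Adapter shapes are illustrative.
async function searchWithFallback(endeeStore, mongoStore, queryVector, topK) {
  try {
    return await endeeStore.search(queryVector, topK);
  } catch (err) {
    // Endee down or misconfigured: degrade to the Mongo path
    // rather than failing the chat request outright.
    console.warn('Endee unavailable, using Mongo vector search:', err.message);
    return mongoStore.search(queryVector, topK);
  }
}

// Usage with stub stores:
const failingEndee = { search: async () => { throw new Error('ECONNREFUSED'); } };
const mongoStub = { search: async () => [{ chunkId: 'c1', score: 0.91 }] };
searchWithFallback(failingEndee, mongoStub, [0.1, 0.2], 4)
  .then((hits) => console.log(hits[0].chunkId)); // c1
```

Because chunk records always live in Mongo, the fallback only changes which index answers the similarity query, not what text is retrievable.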
```bash
cd frontend
npm install
npm run dev
```

Frontend env:

```env
VITE_API_URL=http://localhost:5001/api
```

For PDFs in the 1000–2000 page range, the backend now:
- Reads page counts first.
- Parses pages in batches instead of loading the whole PDF into memory.
- Removes repeated boilerplate where detectable.
- Builds chunks incrementally with deterministic chunk indexes.
- Embeds chunks in batches.
- Writes chunks in bulk to Mongo.
- Indexes vectors in bulk to Mongo or Endee.
- Saves progress checkpoints so failed jobs can resume.
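The steps above can be sketched as a single batch-plus-checkpoint loop. The dependency objects and checkpoint shape below are illustrative, not the repo's actual interfaces:

```javascript
// Illustrative resumable ingestion loop: parse pages in batches,
// embed and write chunks in bulk, and checkpoint after each batch
// so a failed job can resume from the last completed page.
async function ingestDocument(doc, deps, batchSize = 25) {
  const { parsePages, embedBatch, writeChunks, checkpoints } = deps;
  const saved = await checkpoints.load(doc.id);
  let page = saved ? saved.nextPage : 0; // resume if a checkpoint exists

  while (page < doc.pageCount) {
    const end = Math.min(page + batchSize, doc.pageCount);
    const chunks = await parsePages(doc.id, page, end);          // page-batched parse
    const vectors = await embedBatch(chunks.map((c) => c.text)); // batched embeddings
    await writeChunks(chunks, vectors);                          // bulk write
    page = end;
    await checkpoints.save(doc.id, { nextPage: page });          // resume point
  }
  await checkpoints.clear(doc.id);
  return page; // total pages processed
}
```

Checkpointing after the bulk write (not before) means a crash can at worst re-ingest one batch, never skip one.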
Checkpoint state is stored in:
backend/models/DocumentIngestionCheckpoint.js
Semantic retrieval is unified in:
backend/services/retrievalService.js
Vector backends:
- `backend/services/vectorStores/mongoVectorStore.js`
- `backend/services/vectorStores/endeeVectorStore.js`
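Both backends sit behind one adapter contract selected by `VECTOR_STORE_PROVIDER`. A minimal factory sketch (method names and stub implementations are invented for illustration, not the repo's exact API):

```javascript
// Illustrative adapter contract shared by both vector backends:
// each provider exposes the same upsert/search surface, so callers
// never branch on which backend is active.
function createVectorStore(provider, impls) {
  const store = impls[provider];
  if (!store) {
    throw new Error(`Unknown VECTOR_STORE_PROVIDER: ${provider}`);
  }
  return store;
}

// Stubs standing in for mongoVectorStore / endeeVectorStore:
const impls = {
  mongo: { name: 'mongo', upsert: async () => {}, search: async () => [] },
  endee: { name: 'endee', upsert: async () => {}, search: async () => [] },
};

console.log(createVectorStore('endee', impls).name); // endee
```

Failing fast on an unknown provider surfaces env typos at boot rather than as empty search results later.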
The chat flow uses hybrid retrieval:
- semantic candidates
- lexical candidates
- rerank
- near-duplicate filtering
- optional adjacent context expansion
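The merge-and-dedupe part of that flow can be sketched like this; the scoring weights and field names are invented for illustration:

```javascript
// Illustrative hybrid merge: union semantic and lexical candidates,
// combine scores for chunks found by both, then drop near-duplicates.
function mergeCandidates(semantic, lexical, topK = 4) {
  const byId = new Map();
  for (const hit of semantic) {
    byId.set(hit.chunkId, { ...hit });
  }
  for (const hit of lexical) {
    const prev = byId.get(hit.chunkId);
    if (prev) {
      prev.score += 0.5 * hit.score; // boost chunks both retrievers agree on
    } else {
      byId.set(hit.chunkId, { ...hit, score: 0.5 * hit.score });
    }
  }
  const seenTexts = new Set();
  return [...byId.values()]
    .sort((a, b) => b.score - a.score)
    .filter((hit) => { // crude near-duplicate filter on normalized text
      const key = hit.text.toLowerCase().replace(/\s+/g, ' ').trim();
      if (seenTexts.has(key)) return false;
      seenTexts.add(key);
      return true;
    })
    .slice(0, topK);
}
```

A real reranker would score query/chunk pairs with a model; the weighted-sum step here is a stand-in for that stage.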
You can inspect the semantic retrieval path directly:
POST /api/ai/retrieve

Request body:

```json
{
  "query": "Explain biconditional logic",
  "documentIds": ["<document-id>"],
  "topK": 4
}
```

To run the backend tests:

```bash
cd backend
npm test
```

Current tests cover:
- chunking behavior
- resumable chunk-session state
- local embedding fallback
- Endee adapter indexing/search hydration
- retrieval merging and de-duplication
Run the synthetic large-document benchmark:
```bash
cd backend
npm run benchmark:ingestion
```

Optional knobs:

```bash
BENCHMARK_PAGE_COUNT=1500 BENCHMARK_PARAGRAPHS_PER_PAGE=6 npm run benchmark:ingestion
```

Key files in this upgrade:

- `backend/services/documentIngestionService.js`
- `backend/services/chunkingService.js`
- `backend/utils/pdfParser.js`
- `backend/services/retrievalService.js`
- `backend/services/vectorStores/vectorStoreFactory.js`
- `backend/controllers/aiController.js`
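The benchmark's synthetic corpus can be approximated like this (the generator below is an illustrative stand-in for the repo's script, showing how the two knobs shape the input):

```javascript
// Illustrative stand-in for the synthetic benchmark corpus:
// build BENCHMARK_PAGE_COUNT pages with a fixed number of paragraphs each.
function buildSyntheticDocument(pageCount, paragraphsPerPage) {
  const pages = [];
  for (let p = 0; p < pageCount; p += 1) {
    const paragraphs = [];
    for (let i = 0; i < paragraphsPerPage; i += 1) {
      paragraphs.push(`Page ${p + 1}, paragraph ${i + 1}: synthetic benchmark text.`);
    }
    pages.push({ pageNumber: p + 1, text: paragraphs.join('\n\n') });
  }
  return pages;
}

const doc = buildSyntheticDocument(1500, 6);
console.log(doc.length, doc[0].pageNumber); // 1500 1
```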
- Mongo remains the default vector path for easy local startup.
- Endee is integrated cleanly but optional.
- OCR stays opt-in and is not used by default for PDFs with an existing text layer.
- The ingestion path preserves document/page/section metadata for downstream study features.