YAILA is a MERN-based AI learning platform for document chat, summaries, flashcards, quizzes, knowledge graphs, and learning roadmaps.
This repo now includes a production-ready large-document ingestion and retrieval upgrade:
- page-batched PDF parsing instead of full-document extraction first
- resumable ingestion with checkpoints
- batched embeddings and batched chunk writes
- optional Endee vector database integration for semantic search
- Mongo-backed fallback vector search for backward compatibility
- metadata-aware retrieval with document/page/section traceability
- a retrieval API for semantic search and RAG context inspection
```mermaid
flowchart LR
    A["Uploaded PDF/Image"] --> B["Document Queue"]
    B --> C["Streaming Parser / OCR"]
    C --> D["Chunk Session"]
    D --> E["Embedding Batches"]
    E --> F["Chunk Store (Mongo)"]
    E --> G["Vector Store (Mongo or Endee)"]
    F --> H["Knowledge Graph / Roadmap"]
    F --> I["Summary / Quiz / Flashcards"]
    G --> J["Semantic Retrieval"]
    F --> J
    J --> K["Chat / RAG Context Assembly"]
```
The repository has two top-level apps: `backend/` and `frontend/`.
YAILA keeps the Node/Express app structure. Endee is vendored under backend/vendor/endee and integrated as an optional HTTP vector backend behind a clean adapter layer in the backend.
```bash
cd backend
npm install
cp .env.example .env
npm run dev
```

Core env knobs live in `backend/.env.example`.
Important settings:
- `VECTOR_STORE_PROVIDER=mongo|endee`
- `INGESTION_PAGE_BATCH_SIZE`
- `INGESTION_CHUNK_BATCH_SIZE`
- `EMBEDDING_BATCH_SIZE`
- `INGESTION_CHECKPOINT_ENABLED`
- `INGESTION_USE_AI_CHUNK_SUMMARIES`
- `RETRIEVAL_TOP_K`
- `RETRIEVAL_CONTEXT_RADIUS`
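As a rough sketch of how these knobs might be read at startup (the variable names match the list above, but the default values and the `loadIngestionConfig` helper are illustrative, not the repo's actual code):

```javascript
// Illustrative config loader for the ingestion/retrieval env knobs.
// Fallback values here are examples only, not YAILA's real defaults.
function loadIngestionConfig(env = process.env) {
  const num = (key, fallback) => {
    const v = Number.parseInt(env[key] ?? '', 10);
    return Number.isNaN(v) ? fallback : v;
  };
  return {
    vectorStoreProvider: env.VECTOR_STORE_PROVIDER === 'endee' ? 'endee' : 'mongo',
    pageBatchSize: num('INGESTION_PAGE_BATCH_SIZE', 25),
    chunkBatchSize: num('INGESTION_CHUNK_BATCH_SIZE', 200),
    embeddingBatchSize: num('EMBEDDING_BATCH_SIZE', 64),
    checkpointEnabled: env.INGESTION_CHECKPOINT_ENABLED !== 'false',
    useAiChunkSummaries: env.INGESTION_USE_AI_CHUNK_SUMMARIES === 'true',
    retrievalTopK: num('RETRIEVAL_TOP_K', 4),
    retrievalContextRadius: num('RETRIEVAL_CONTEXT_RADIUS', 1),
  };
}

const cfg = loadIngestionConfig({ VECTOR_STORE_PROVIDER: 'endee', EMBEDDING_BATCH_SIZE: '32' });
console.log(cfg.vectorStoreProvider, cfg.embeddingBatchSize); // endee 32
```

Parsing with explicit fallbacks keeps local startup working even when `.env` omits most knobs.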
Default behavior stays backward compatible with Mongo vector search:
```env
VECTOR_STORE_PROVIDER=mongo
```

To use Endee:

```env
VECTOR_STORE_PROVIDER=endee
ENDEE_BASE_URL=http://localhost:8080
ENDEE_AUTH_TOKEN=
ENDEE_INDEX_NAME=document-chunks
ENDEE_SPACE_TYPE=cosine
ENDEE_PRECISION=int16
```

Endee source is vendored under `backend/vendor/endee`.
Use Endee’s own build/run docs there if you want the external vector DB path. If Endee is unavailable, YAILA still keeps Mongo chunk records and can fall back to Mongo vector retrieval.
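The fallback behavior can be sketched as follows; the adapter objects and their `search` signature are hypothetical stand-ins for the real vector-store modules:

```javascript
// Hypothetical sketch: query Endee first, fall back to Mongo vector
// search when Endee is unreachable. Adapter shapes are illustrative.
async function searchWithFallback(endeeStore, mongoStore, queryVector, topK) {
  try {
    return await endeeStore.search(queryVector, topK);
  } catch (err) {
    // Endee down or misconfigured: degrade to the Mongo path
    // rather than failing the chat request outright.
    console.warn('Endee unavailable, using Mongo vector search:', err.message);
    return mongoStore.search(queryVector, topK);
  }
}

// Usage with stub stores:
const failingEndee = { search: async () => { throw new Error('ECONNREFUSED'); } };
const mongoStub = { search: async () => [{ chunkId: 'c1', score: 0.91 }] };
searchWithFallback(failingEndee, mongoStub, [0.1, 0.2], 4)
  .then((hits) => console.log(hits[0].chunkId)); // c1
```

Because chunk records always live in Mongo, the fallback only changes which index answers the similarity query, not what text is retrievable.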
```bash
cd frontend
npm install
npm run dev
```

Frontend env:

```env
VITE_API_URL=http://localhost:5001/api
```

For PDFs in the 1000–2000 page range, the backend now:
- Reads page counts first.
- Parses pages in batches instead of loading the whole PDF into memory.
- Removes repeated boilerplate where detectable.
- Builds chunks incrementally with deterministic chunk indexes.
- Embeds chunks in batches.
- Writes chunks in bulk to Mongo.
- Indexes vectors in bulk to Mongo or Endee.
- Saves progress checkpoints so failed jobs can resume.
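The steps above can be sketched as a single batch-plus-checkpoint loop. The dependency objects and checkpoint shape below are illustrative, not the repo's actual interfaces:

```javascript
// Illustrative resumable ingestion loop: parse pages in batches,
// embed and write chunks in bulk, and checkpoint after each batch
// so a failed job can resume from the last completed page.
async function ingestDocument(doc, deps, batchSize = 25) {
  const { parsePages, embedBatch, writeChunks, checkpoints } = deps;
  const saved = await checkpoints.load(doc.id);
  let page = saved ? saved.nextPage : 0; // resume if a checkpoint exists

  while (page < doc.pageCount) {
    const end = Math.min(page + batchSize, doc.pageCount);
    const chunks = await parsePages(doc.id, page, end);          // page-batched parse
    const vectors = await embedBatch(chunks.map((c) => c.text)); // batched embeddings
    await writeChunks(chunks, vectors);                          // bulk write
    page = end;
    await checkpoints.save(doc.id, { nextPage: page });          // resume point
  }
  await checkpoints.clear(doc.id);
  return page; // total pages processed
}
```

Checkpointing after the bulk write (not before) means a crash can at worst re-ingest one batch, never skip one.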
Checkpoint state is stored in:
backend/models/DocumentIngestionCheckpoint.js
Semantic retrieval is unified in:
backend/services/retrievalService.js
Vector backends:
- `backend/services/vectorStores/mongoVectorStore.js`
- `backend/services/vectorStores/endeeVectorStore.js`
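Both backends sit behind one adapter contract selected by `VECTOR_STORE_PROVIDER`. A minimal factory sketch (method names and stub implementations are invented for illustration, not the repo's exact API):

```javascript
// Illustrative adapter contract shared by both vector backends:
// each provider exposes the same upsert/search surface, so callers
// never branch on which backend is active.
function createVectorStore(provider, impls) {
  const store = impls[provider];
  if (!store) {
    throw new Error(`Unknown VECTOR_STORE_PROVIDER: ${provider}`);
  }
  return store;
}

// Stubs standing in for mongoVectorStore / endeeVectorStore:
const impls = {
  mongo: { name: 'mongo', upsert: async () => {}, search: async () => [] },
  endee: { name: 'endee', upsert: async () => {}, search: async () => [] },
};

console.log(createVectorStore('endee', impls).name); // endee
```

Failing fast on an unknown provider surfaces env typos at boot rather than as empty search results later.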
The chat flow uses hybrid retrieval:
- semantic candidates
- lexical candidates
- rerank
- near-duplicate filtering
- optional adjacent context expansion
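The merge-and-dedupe part of that flow can be sketched like this; the scoring weights and field names are invented for illustration:

```javascript
// Illustrative hybrid merge: union semantic and lexical candidates,
// combine scores for chunks found by both, then drop near-duplicates.
function mergeCandidates(semantic, lexical, topK = 4) {
  const byId = new Map();
  for (const hit of semantic) {
    byId.set(hit.chunkId, { ...hit });
  }
  for (const hit of lexical) {
    const prev = byId.get(hit.chunkId);
    if (prev) {
      prev.score += 0.5 * hit.score; // boost chunks both retrievers agree on
    } else {
      byId.set(hit.chunkId, { ...hit, score: 0.5 * hit.score });
    }
  }
  const seenTexts = new Set();
  return [...byId.values()]
    .sort((a, b) => b.score - a.score)
    .filter((hit) => { // crude near-duplicate filter on normalized text
      const key = hit.text.toLowerCase().replace(/\s+/g, ' ').trim();
      if (seenTexts.has(key)) return false;
      seenTexts.add(key);
      return true;
    })
    .slice(0, topK);
}
```

A real reranker would score query/chunk pairs with a model; the weighted-sum step here is a stand-in for that stage.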
You can inspect the semantic retrieval path directly:
POST /api/ai/retrieve

Request body:

```json
{
  "query": "Explain biconditional logic",
  "documentIds": ["<document-id>"],
  "topK": 4
}
```

To run the backend tests:

```bash
cd backend
npm test
```

Current tests cover:
- chunking behavior
- resumable chunk-session state
- local embedding fallback
- Endee adapter indexing/search hydration
- retrieval merging and de-duplication
Run the synthetic large-document benchmark:
```bash
cd backend
npm run benchmark:ingestion
```

Optional knobs:

```bash
BENCHMARK_PAGE_COUNT=1500 BENCHMARK_PARAGRAPHS_PER_PAGE=6 npm run benchmark:ingestion
```

Key files in this upgrade:

- `backend/services/documentIngestionService.js`
- `backend/services/chunkingService.js`
- `backend/utils/pdfParser.js`
- `backend/services/retrievalService.js`
- `backend/services/vectorStores/vectorStoreFactory.js`
- `backend/controllers/aiController.js`
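The benchmark's synthetic corpus can be approximated like this (the generator below is an illustrative stand-in for the repo's script, showing how the two knobs shape the input):

```javascript
// Illustrative stand-in for the synthetic benchmark corpus:
// build BENCHMARK_PAGE_COUNT pages with a fixed number of paragraphs each.
function buildSyntheticDocument(pageCount, paragraphsPerPage) {
  const pages = [];
  for (let p = 0; p < pageCount; p += 1) {
    const paragraphs = [];
    for (let i = 0; i < paragraphsPerPage; i += 1) {
      paragraphs.push(`Page ${p + 1}, paragraph ${i + 1}: synthetic benchmark text.`);
    }
    pages.push({ pageNumber: p + 1, text: paragraphs.join('\n\n') });
  }
  return pages;
}

const doc = buildSyntheticDocument(1500, 6);
console.log(doc.length, doc[0].pageNumber); // 1500 1
```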
- Mongo remains the default vector path for easy local startup.
- Endee is integrated cleanly but optional.
- OCR stays opt-in and is not used by default for PDFs with an existing text layer.
- The ingestion path preserves document/page/section metadata for downstream study features.