# TurboQuant · Zero new tables · RAG platform (release track v0.4.3)
Using RAG in the browser? Start with USER_FEATURES_GUIDE.md (collections, search, memory — user-oriented). This document covers architecture, REST, MCP, and limits for builders and automations.
- Overview
- Architecture
- Layer-by-Layer Description
- Data Model — 8DNA
- Startup Sequence
- REST API Reference
- MCP Tool Catalogue
- Subscription Tier Matrix
- Developer Flows
- AI Builder Integration
- Known Limitations
## Overview

The RAG Platform Service gives every AI in the AgentStack ecosystem a shared, semantically searchable memory. It is accessible in four ways:
| Client | Entry Point | Auth |
|---|---|---|
| HTTP clients, frontend | `POST /api/rag/...` | `require_authentication` |
| Web dashboard | `/dashboard/:projectId?module=rag` — collections, ingest, search, session memory (`RAGDashboardWidget`) | same session / project as app |
| Any AI agent | `agentstack.execute` MCP (`rag.*` actions) | project-scoped session |
| AI Builder internally | `RAGEngine.get_context_for_prompt()` | `project_id` from stage |
Key properties:

- Zero new tables — uses the existing `data_projects_project` and `data_projects_backup` tables with `entity_type` tags.
- TurboQuant compression — ~8× memory reduction on embeddings with near-zero accuracy loss (FWHT + outlier-aware scalar quantization + QJL residual).
- Hybrid search — BM25 sparse + dense TQ inner product, fused with Reciprocal Rank Fusion (RRF) and optionally re-ranked by MMR.
- Subscription-gated — Free / Starter / Pro / Enterprise limits applied at the engine level, not just the API.
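As a rough sense of the ~8× figure: a 1536-dimensional float32 embedding occupies 6 144 bytes, while a 4-bit main channel needs 768 bytes. The arithmetic below is a back-of-the-envelope sketch only — it ignores the outlier channels and QJL residual overhead:

```python
dim = 1536                     # e.g. text-embedding-3-small
raw_bytes = dim * 4            # float32: 4 bytes per dimension
tq_bits = 4                    # main-channel precision
tq_bytes = dim * tq_bits // 8  # 4-bit codes packed into bytes

print(raw_bytes, tq_bytes, raw_bytes / tq_bytes)  # 6144 768 8.0
```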
## Architecture

```text
┌─────────────────────────────────────────────────────────────────────┐
│                              CLIENTS                                │
│   REST /api/rag/*     MCP agentstack.execute     AI Builder Stage   │
└──────────┬──────────────────┬────────────────────────┬──────────────┘
           │                  │                        │
           ▼                  ▼                        ▼
┌──────────────────────────────────────────────────────────────────────┐
│                             API LAYER                                │
│  rag_endpoints.py        tools_rag.py          RAGContextSelector    │
│  (Auth + Ownership)      (11 @mcp_tool)        (Context Optimizer v2)│
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         RAGEngine (singleton)                        │
│ ingest()  search()  memory_add/get/search()  get_context_for_prompt()│
└──┬──────────────────────────────────────────────────────────────────┬┘
   │ Core Layer                                     Persistence Layer │
   ▼                                                                  ▼
┌──────────────────────────────────┐  ┌────────────────────────────────┐
│ TurboQuantizer      (L0)         │  │ CollectionManager              │
│ TextChunker         (L1)         │  │   → data_projects_project      │
│ EmbedderService     (L1)         │  │ DocumentRepository             │
│ TQVectorStore       (L2)         │  │   → data_projects_backup       │
│ MemoryStore         (L4)         │  │ MemoryRepository               │
│                                  │  │   → data_projects_backup       │
└──────────────────────────────────┘  └────────────────────────────────┘
                │
NeuralCacheEngine ──────────────┘
  L1: MemoryTurn deque
  L2: VectorCacheEntry (TQ-compressed embeddings)
```
## Layer-by-Layer Description

### TurboQuantizer

Implements Google's TurboQuant compression in pure NumPy:

- Fast Walsh–Hadamard Transform — randomised rotation of the vector space (O(n log n), pure NumPy, no GPU required).
- Outlier-aware scalar quantisation — detects the top `outlier_ratio` dimensions by absolute magnitude; these get dedicated 2-bit channels while the rest are quantised to `bits` precision.
- QJL 1-bit residual — a Johnson–Lindenstrauss sketched residual vector preserves the unbiased inner-product estimate for ANN search.
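The rotation step is small enough to sketch. The following is a generic textbook FWHT (Sylvester ordering, orthonormal scaling), shown for intuition only — it is not the platform's actual implementation:

```python
import numpy as np

def fwht(x):
    """Iterative fast Walsh-Hadamard transform, O(n log n).

    Assumes len(x) is a power of two. Returns the orthonormally
    scaled transform, so applying fwht twice recovers the input.
    """
    x = np.asarray(x, dtype=np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sum half
            x[i + h:i + 2 * h] = a - b  # butterfly: difference half
        h *= 2
    return x / np.sqrt(n)

print(fwht([1, 0, 0, 0]))  # → [0.5 0.5 0.5 0.5]
```

In TurboQuant the transform is combined with a random sign flip per dimension so that quantisation error spreads evenly across coordinates.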
API:

```python
tq = TurboQuantizer(bits=4, outlier_ratio=0.05)
codebook = tq.compress(vectors)         # (n, dim) float32 → QuantizationCodebook
approx = tq.decompress(codebook)        # QuantizationCodebook → (n, dim) float32
scores = tq.inner_product(query, cb)    # (n,) float32 relevance scores
raw = TurboQuantizer.serialize(cb)      # bytes
cb2 = TurboQuantizer.deserialize(raw)
```

### TextChunker

Splits documents into overlapping chunks:
- Default: 512 tokens / 64 token overlap.
- Sentence-boundary aware (never cuts mid-sentence).
- Each chunk carries a `content_hash` (SHA-256[:16]) for progressive indexing.
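A minimal sketch of the hash scheme described above, assuming the hash is taken over the chunk's UTF-8 text (the real chunker may include metadata in the digest):

```python
import hashlib

def content_hash(text: str) -> str:
    # First 16 hex chars of the SHA-256 digest — the SHA-256[:16]
    # scheme described above (illustrative helper, not the real API).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

print(content_hash("abc"))  # → ba7816bf8f01cfea
```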
### EmbedderService

Provider-agnostic embedding wrapper:

| Provider | Model | Dimensions |
|---|---|---|
| `openai` | `text-embedding-3-small` | 1536 |
| `gemini` | `text-embedding-004` | 768 |
| `mock` | deterministic hash-based | 128 |

Cache strategy: tries `VectorCacheEntry` (TQ-compressed, ~8× smaller) in
`NeuralCacheEngine.memory_pool`; falls back to `cache.set()` with a plain list.
Cache TTL is 24 h — embeddings are deterministic per model.
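For illustration, a deterministic hash-based mock embedder could look like the following. This is a hypothetical sketch — the function name and seeding scheme are assumptions; only the 128-dimension figure comes from the table above:

```python
import hashlib
import numpy as np

def mock_embed(text: str, dim: int = 128) -> np.ndarray:
    # Seed a PRNG from the text's SHA-256 so the same text always
    # maps to the same unit vector: deterministic, no API calls.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)
```

Determinism is what makes the 24 h cache TTL safe: re-embedding the same text with the same model always yields the same vector.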
### TQVectorStore

In-memory vector index per collection:

- Dense search — TQ inner product (compressed dot product).
- Sparse search — BM25 via `rank_bm25`.
- Hybrid — RRF merges dense + sparse ranked lists.
- MMR — Maximal Marginal Relevance for result diversity.
- Serialization — `to_base64()` / `from_base64()` for optional JSONB storage.
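RRF itself is only a few lines. A minimal, generic sketch using the common k = 60 constant (the store's actual fusion parameters may differ):

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    # Reciprocal Rank Fusion: score(doc) = Σ over lists of 1 / (k + rank).
    # Inputs are ranked lists of doc ids, best first.
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"]))  # → ['b', 'c', 'a', 'd']
```

Documents appearing in both lists ("b", "c") outrank documents that top only one list — which is the point of the fusion.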
### Persistence Layer

8DNA CRUD without new tables:

| Class | Table | entity_type | Hierarchy |
|---|---|---|---|
| `CollectionManager` | `data_projects_project` | `rag_collection` | project → collection |
| `DocumentRepository` | `data_projects_backup` | `rag_document` | collection → chunk |
| `MemoryRepository` | `data_projects_backup` | `rag_memory` | project → turn |

Progressive indexing: `save_chunk()` computes `content_hash`; unchanged chunks
are skipped — re-ingesting a document only processes modified sections.
### MemoryStore

Two-tier conversation memory:

| Tier | Store | Capacity | Retrieval |
|---|---|---|---|
| Hot | `NeuralCacheEngine` `rag:mem:{session_id}` | Last 20 turns | O(1) |
| Cold | In-process `TQVectorStore` per session | Up to 1 000 turns | TQ semantic search |
| Persist | 8DNA `data_projects_backup` | Unlimited | On cold start |

After a cold start, `search()` calls `_hydrate_cold_from_dna()`, which
decompresses stored TQ bytes to reconstruct float32 vectors — semantic search
works correctly across server restarts.
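The hot tier is essentially a bounded per-session ring buffer. A minimal sketch — class and method names are illustrative, not the real `MemoryStore` API:

```python
from collections import deque

class HotMemory:
    """Keeps the last N turns per session with O(1) append and read."""

    def __init__(self, max_turns: int = 20):
        self.max_turns = max_turns
        self.sessions = {}  # session_id → deque of turn dicts

    def add(self, session_id, role, content):
        # deque(maxlen=N) silently evicts the oldest turn when full.
        turns = self.sessions.setdefault(session_id, deque(maxlen=self.max_turns))
        turns.append({"role": role, "content": content})

    def recent(self, session_id, limit=10):
        return list(self.sessions.get(session_id, []))[-limit:]
```

Turns evicted from the hot ring are still reachable through the cold tier and the 8DNA persist tier described above.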
### RAGEngine

Central singleton orchestrating all layers.

```python
engine = get_rag_engine()
await engine.initialize(embedding_provider="openai", api_key="sk-...")
```

Public methods:
| Method | Description |
|---|---|
| `ingest(collection_id, text, metadata, project_id)` | Chunk → embed → compress → store. Returns `{chunks_added, chunks_skipped}`. |
| `search(collection_id, query, top_k, hybrid, mmr, ...)` | Semantic search with optional contextual compression. |
| `memory_add(session_id, role, content, project_id)` | Add a conversation turn to memory. |
| `memory_get(session_id, limit)` | Get recent turns. |
| `memory_search(session_id, query, top_k)` | Semantic search over past turns. |
| `get_context_for_prompt(task, project_id, token_budget)` | Used by AI Builder — returns ranked snippets within a token budget. |
## Data Model — 8DNA

Collection row (`data_projects_project`):

```json
{
  "entity_type": "rag_collection",
  "name": "my-kb",
  "description": "Product documentation",
  "config": {
    "embedding_provider": "openai",
    "tq_bits": 4
  },
  "stats": {
    "doc_count": 42,
    "chunk_count": 317,
    "last_indexed_at": "2026-03-27T10:00:00Z"
  }
}
```

`project_id` on the row → multi-tenant isolation.
Document chunk row (`data_projects_backup`):

```json
{
  "entity_type": "rag_document",
  "parent_uuid": "<collection_uuid>",
  "doc_id": "readme.md__chunk_0",
  "source_doc_id": "readme.md",
  "text": "AgentStack is...",
  "chunk_index": 0,
  "vector_compressed": "<base64 TurboQuant bytes>",
  "content_hash": "a3f2b1c4d5e6f7a8",
  "metadata": { "source": "readme.md", "type": "docs" }
}
```

Memory turn row (`data_projects_backup`):

```json
{
  "entity_type": "rag_memory",
  "session_id": "sess_abc123",
  "role": "user",
  "content": "How do I add a payment?",
  "vector_compressed": "<base64 TurboQuant bytes>",
  "content_hash": "b2c3d4e5f6a7b8c9"
}
```

## Startup Sequence

```text
app lifespan (core_app.py)
│
├── get_rag_engine()                            # singleton created
├── get_namespaced_cache("rag_embeddings")      # L2 cache namespace
├── await engine.initialize(provider, api_key, cache)
│     ├── EmbedderService(...)
│     └── MemoryStore(tq, cache, persist=True)
│
├── asyncio.create_task(_index_platform_reference_kb())
│     └── ingest curated platform markdown → collection "system:philosophy" (legacy id)
│
└── (on first GET /mcp/discovery)
      └── asyncio.ensure_future(_index_mcp_tools_rag())
            └── ingest each MCP tool description → collection "system:mcp_tools"
```
## REST API Reference

All routes are mounted under `/api/rag/` (no version prefix).
Authentication: `require_authentication` (project-scoped session cookie/token).
Create a knowledge base collection.
Request:

```json
{
  "name": "product-docs",
  "description": "Product documentation",
  "embedding_provider": "openai"
}
```

Response:

```json
{
  "success": true,
  "collection": {
    "uuid": "550e8400-...",
    "project_id": 42,
    "name": "product-docs",
    "config": { "embedding_provider": "openai", "tq_bits": 4 },
    "stats": { "doc_count": 0, "chunk_count": 0, "last_indexed_at": null }
  }
}
```

List all collections for the authenticated project.
Delete collection and all its document chunks (cascade). Ownership verified.
Add a document (auto-chunked, embedded, TQ-compressed, persisted).
Request:

```json
{
  "content": "Full text of the document...",
  "source_doc_id": "readme.md",
  "metadata": { "type": "docs", "version": "1.0" }
}
```

Response:

```json
{
  "success": true,
  "chunks_added": 3,
  "chunks_skipped": 0,
  "collection_id": "550e8400-...",
  "source_doc_id": "readme.md"
}
```

List chunks in a collection.
Remove all chunks for a source document.
Semantic search.
Request:

```json
{
  "query": "How do I configure payments?",
  "top_k": 5,
  "hybrid": true,
  "mmr": false,
  "filters": { "type": "docs" }
}
```

Response:

```json
{
  "success": true,
  "query": "How do I configure payments?",
  "results": [
    {
      "doc_id": "readme.md__chunk_2",
      "text": "To configure payments, add the STRIPE_KEY...",
      "score": 0.92,
      "metadata": { "type": "docs" },
      "source_doc_id": "readme.md",
      "chunk_index": 2
    }
  ]
}
```

Store a conversation turn.
Request:

```json
{ "role": "user", "content": "How do I add a payment?" }
```

Response:

```json
{ "success": true, "session_id": "sess_abc", "turn_index": 4, "role": "user" }
```

Get recent turns (hot store first, then DNA).

Semantic search over past turns.

Request:

```json
{ "query": "payment configuration questions", "top_k": 3 }
```

## MCP Tool Catalogue

All tools are invoked via `agentstack.execute` with `action: "rag.<name>"`.
| Action | Description | Key Parameters |
|---|---|---|
| `rag.collection_create` | Create a knowledge base | `name`, `project_id`, `embedding_provider` |
| `rag.collection_list` | List collections | `project_id` |
| `rag.collection_delete` | Delete collection + cascade | `collection_id`, `project_id` |
| `rag.document_add` | Ingest a document | `collection_id`, `content`, `source_doc_id`, `metadata` |
| `rag.document_list` | List chunks | `collection_id`, `limit` |
| `rag.document_delete` | Remove a document | `collection_id`, `doc_id` |
| `rag.search` | Semantic search | `collection_id`, `query`, `top_k`, `hybrid`, `mmr` |
| `rag.memory_add` | Add a conversation turn | `session_id`, `role`, `content`, `project_id` |
| `rag.memory_get` | Get recent turns | `session_id`, `limit` |
| `rag.memory_search` | Semantic memory search | `session_id`, `query`, `top_k` |
Example MCP call:

```json
{
  "action": "rag.search",
  "collection_id": "550e8400-...",
  "query": "stripe webhook setup",
  "top_k": 3,
  "hybrid": true
}
```

## Subscription Tier Matrix

| Feature | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Collections | 0 | 1 | 10 | Unlimited |
| Chunks per collection | — | 100 | 10 000 | Unlimited |
| Memory turns per session | 50 | 500 | Unlimited | Unlimited |
| Hybrid search (BM25 + dense) | No | Yes | Yes | Yes |
| MMR re-ranking | No | No | Yes | Yes |
| Contextual compression | No | No | No | Yes |
Limits are enforced in `RAGEngine`:

- `ingest()` checks `collections == 0` (the free tier blocks ingestion) and the `chunks_per_collection` cap.
- `memory_add()` checks the `memory_turns` cap per session.
- `search()` downgrades the `hybrid` / `mmr` / `compress_results` flags when they are not allowed by the tier.

The tier is resolved via `_get_tier(project_id)`, which calls
`shared.subscription.get_project_tier` if available and otherwise defaults to
`"starter"` (a conservative fallback).
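The downgrade logic amounts to AND-ing the requested flags with a per-tier allowance. A hypothetical sketch consistent with the tier matrix above — `TIER_FEATURES` and `apply_tier` are illustrative names, not the real implementation:

```python
# Allowance table mirroring the Subscription Tier Matrix above.
TIER_FEATURES = {
    "free":       {"hybrid": False, "mmr": False, "compress_results": False},
    "starter":    {"hybrid": True,  "mmr": False, "compress_results": False},
    "pro":        {"hybrid": True,  "mmr": True,  "compress_results": False},
    "enterprise": {"hybrid": True,  "mmr": True,  "compress_results": True},
}

def apply_tier(tier: str, **flags) -> dict:
    # Unknown tiers fall back to "starter", matching the engine's
    # conservative default. Requested flags are silently downgraded.
    allowed = TIER_FEATURES.get(tier, TIER_FEATURES["starter"])
    return {name: value and allowed[name] for name, value in flags.items()}

print(apply_tier("starter", hybrid=True, mmr=True, compress_results=True))
# → {'hybrid': True, 'mmr': False, 'compress_results': False}
```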
## Developer Flows

```python
engine = get_rag_engine()
await engine.initialize(embedding_provider="openai", api_key=OPENAI_KEY)

# Ingest
result = await engine.ingest(
    collection_id="my-project:docs",
    text=open("README.md").read(),
    metadata={"source": "README.md"},
    project_id=42,
    source_doc_id="README.md",
)
# {"chunks_added": 4, "chunks_skipped": 0, ...}

# Search
hits = await engine.search(
    collection_id="my-project:docs",
    query="how to configure authentication",
    top_k=5,
    hybrid=True,
    project_id=42,
)
# [{"doc_id": "...", "text": "...", "score": 0.87, ...}, ...]
```

```python
# Add turns
await engine.memory_add("sess_xyz", "user", "What's the weather?", project_id=42)
await engine.memory_add("sess_xyz", "assistant", "It's sunny today.", project_id=42)

# Retrieve recent
turns = await engine.memory_get("sess_xyz", limit=10)

# Semantic search
relevant = await engine.memory_search("sess_xyz", "weather forecast", top_k=3)
```

Re-calling `ingest()` with the same `source_doc_id` automatically:
- Removes old in-memory chunks for that `source_doc_id`.
- Re-embeds only changed chunks (different `content_hash`).
- Updates the 8DNA backup rows for changed chunks.
- Skips unchanged chunks entirely.
## AI Builder Integration

The `UnderstandingStage` injects RAG context automatically before every LLM
call (both `handle()` and `handle_streaming()`):

```text
user message → RAGEngine.get_context_for_prompt(task=message, project_id=..., token_budget=1500)
             → top-5 snippets from "project:{id}:code_index"
             → prepended as "## Relevant Context (RAG)" to system_prompt
             → LLM call with enriched context
```
For the `RAGContextSelector` (Context Optimizer v2), snippets are injected as
patterns into `SDKContext` and scored — `optimize_context()` keeps the
highest-relevance patterns within the token budget.

`RAGContextSelector` wraps the existing `ContextSelector` and augments it:

```text
base selector ───────────────────┐
                                 ├─ parallel ──► merged SDKContext
RAGEngine.get_context_for_prompt ┘   (RAG patterns scored by relevance)
```
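The budget-keeping step can be pictured as greedy packing by relevance. A generic sketch, with whitespace tokenisation standing in for the real token counter:

```python
def pack_snippets(snippets, token_budget):
    """Greedy budget-constrained packing.

    snippets: list of (score, text) pairs.
    Takes snippets in descending relevance order, skipping any that
    would overflow the budget. Token cost approximated by word count.
    """
    chosen, used = [], 0
    for score, text in sorted(snippets, reverse=True):
        cost = len(text.split())
        if used + cost <= token_budget:
            chosen.append(text)
            used += cost
    return chosen

print(pack_snippets([(0.9, "one two"), (0.5, "three four five"), (0.8, "six")], 3))
# → ['one two', 'six']
```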
Usage:

```python
from ai_builder.sdk_context.rag_context_selector import RAGContextSelector
from ai_builder.sdk_context import ContextSelector

selector = RAGContextSelector(ContextSelector(...))
context = await selector.select_context(
    task_description="add Stripe payment integration",
    project_id=42,
    max_tokens=4000,
)
```

Two system collections are auto-indexed at startup:
| Collection ID | Source | Purpose |
|---|---|---|
| `system:philosophy` | Curated platform markdown (internal sources) | Semantic search over bundled reference text (legacy collection id) |
| `system:mcp_tools` | All `@mcp_tool` descriptions | Semantic tool discovery |
## Known Limitations

- Cold start re-embedding — when `vector_compressed` is missing from a stored chunk (legacy data), `_lazy_load_collection` re-embeds the text. This costs API calls proportional to the number of affected chunks.
- `_get_tier` integration stub — when `shared.subscription` is not importable, the engine defaults to `"starter"` limits. Production deployments should ensure `shared.subscription.get_project_tier` is wired to the real subscription table.
- Semantic eviction not active — `VectorCacheEntry` has the design for semantic L2 eviction (removing entries most similar to each other), but `NeuralCacheEngine.memory_pool` still uses plain LRU eviction. This is a future enhancement.
- `GET /mcp/discovery` double registration — `routes.py` defines two `GET /discovery` handlers on the same router. The first registered wins in Starlette; the thin `mcp_discovery()` handler may take precedence over the full `discovery()` that fires background RAG indexing. Verify in production and consolidate if needed.
- In-memory vector store is not distributed — `TQVectorStore` is per-process and in-memory. In a multi-worker deployment (e.g. `gunicorn -w 4`), each worker has its own store. Searches work correctly (lazy load from 8DNA on each worker), but memory usage is multiplied by the worker count.
- Max MCP result payload — MCP tools return full chunk texts. For large documents with many chunks, `rag.document_list` may return a payload exceeding MCP tool output limits. Use the `limit` parameter to paginate.
Generated 2026-03-27 — see CHANGELOG.md v0.4.3 for the full change list.