docs(design): update architecture summary to v1.1 (knowledge layer, hybrid RAG, compliance)

vinod0m · vinod0m · commit 75de21a65bc3 · 2025-10-08T00:47:55.000+02:00
diff --git a/doc/design/integration_architecture_summary.md b/doc/design/integration_architecture_summary.md
@@ -1,11 +1,28 @@
 # DeepAgent + DocumentAgent Integration - Architecture Summary
 
+**Version:** 1.1 (Aligned with design addendum)  
+**Enhancement Summary:** Adds context-aware tagging + confirmation loop, Postgres + pgvector knowledge persistence, hybrid (vector + lexical) retrieval, compliance & cross-standard reasoning flows, and extended deployment phases (5–8).  
+For full details, see `deepagent_document_tools_integration.md`, Section 14.
+
 ## Quick Reference
 
 ### Core Concept
 
 **DeepAgent uses DocumentAgent family as LangChain tools**, enabling conversational AI to perform document processing tasks automatically.
 
+New (v1.1) Knowledge Layer Additions:
+
+```text
+┌──────────────────────────────────────────────────────────────┐
+│                 KNOWLEDGE LAYER (NEW v1.1)                   │
+│  • Tagging & Confirmation (auto + user override)             │
+│  • High-Accuracy Requirements Pipeline (>99% precision)      │
+│  • Postgres + pgvector persistence                           │
+│  • Hybrid Retrieval (vector + lexical + metadata)            │
+│  • Compliance & Standards Graph Reasoning                    │
+└──────────────────────────────────────────────────────────────┘
+```
+
 ```text
 ┌─────────────────────────────────────────────────────────────────┐
 │                         USER QUERY                              │
@@ -138,15 +155,17 @@
 │ • Auto document type detection      │
 │ • Adaptive prompt selection         │
 │ • Domain-specific processing        │
-│   - Requirements docs               │
+│   - Requirements docs (hi-accuracy) │
 │   - Technical specs                 │
 │   - Legal contracts                 │
 │   - Business documents              │
+│ • Persistence + embedding hooks     │
 ├─────────────────────────────────────┤
 │ WRAPS: TagAwareDocumentAgent        │
-│ • DocumentTagger                    │
+│ • DocumentTagger (heuristic + LLM)  │
 │ • PromptSelector                    │
 │ • Dynamic strategy selection        │
+│ • ExternalKnowledgeStore client     │
 ├─────────────────────────────────────┤
 │ INPUT:                              │
 │ • file_path: str                    │
@@ -200,7 +219,7 @@ User Query Analysis
                         │
                         ▼
             SmartDocumentProcessingTool
-            (auto-detects and adapts)
+              (auto-detects, adapts, persists, embeds)
 ```
 
 ### Fallback Chain
@@ -263,7 +282,8 @@ DocumentAgent processes:
   3. Chunk markdown (8000 chars/chunk)
   4. LLM structures each chunk
   5. Merge results
-  6. Apply quality enhancements
+  6. Apply quality enhancements (multi-pass if hi-accuracy enabled)
+  7. (v1.1) Persist structured requirements + embeddings (if configured)
   │
   ▼
 Returns: {
@@ -360,9 +380,8 @@ DeepAgent: → SmartDocumentProcessingTool
 
 TURN 2:
 USER: "How many database requirements?"
-DeepAgent: → Recalls previous extraction results from session
-           → Filters requirements by "database" keyword
-           → Returns count + list
+DeepAgent: → If the session cache is warm, answers from memory; otherwise queries hybrid retrieval restricted to `requirements_spec` plus a keyword filter
+           → Returns count + list (with persistent IDs)
 
 TURN 3:
 USER: "Export those to JSON"
@@ -403,6 +422,10 @@ DeepAgent: → Formats previously filtered results as JSON
 │    • Per-requirement confidence                │
 │    • Quality flags                             │
 ├────────────────────────────────────────────────┤
+│ 6. Multi-Pass Refinement (v1.1)                │
+│    • Second pass on ambiguous items            │
+│    • Atomic splitting & duplicate resolution   │
+├────────────────────────────────────────────────┤
 │ TOTAL ACCURACY: 99-100% (exceeds ≥98% target) │
 └────────────────────────────────────────────────┘
 ```
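
The multi-pass refinement step added in the hunk above could be sketched roughly as follows. This is only an illustration under stated assumptions: the requirement shape (`text`, `confidence`), the `second_pass` callable, and the `ambiguity_threshold` value are hypothetical and are not taken from the design document.

```python
from typing import Callable

# Hypothetical requirement shape: {"text": str, "confidence": float}
Requirement = dict


def normalize(text: str) -> str:
    """Collapse case and whitespace so near-identical texts compare equal."""
    return " ".join(text.lower().split())


def multi_pass_refine(
    requirements: list[Requirement],
    second_pass: Callable[[Requirement], list[Requirement]],
    ambiguity_threshold: float = 0.7,  # assumed cutoff, not from the design
) -> list[Requirement]:
    """Re-process low-confidence items, then resolve duplicates."""
    refined: list[Requirement] = []
    for req in requirements:
        if req["confidence"] < ambiguity_threshold:
            # Ambiguous item: the second pass may split it into
            # several atomic requirements.
            refined.extend(second_pass(req))
        else:
            refined.append(req)

    # Duplicate resolution: keep the highest-confidence copy per normalized text.
    best: dict[str, Requirement] = {}
    for req in refined:
        key = normalize(req["text"])
        if key not in best or req["confidence"] > best[key]["confidence"]:
            best[key] = req
    return list(best.values())
```

In a real pipeline `second_pass` would presumably call the LLM again with a stricter prompt; here it is left as a parameter so the control flow stays visible and testable.
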
@@ -466,6 +489,15 @@ User → [Status updates...] → Result
 
 ## Security Measures
 
+### New (v1.1) Persistence & Data Handling Considerations
+
+| Aspect | Control |
+|--------|---------|
+| PII in embeddings | Optional redaction pre-embedding (regex + classifier) |
+| Audit of overrides | Tag override table with user + timestamp |
+| Data minimization | Store only atomic requirement text + derived metadata |
+| Encryption | Recommend TLS for external Postgres API + at-rest encryption |
+
 ### Path Validation
 
 ```text
@@ -538,6 +570,23 @@ document_tools:
     auto_detect: true
     confidence_threshold: 0.7
 
+knowledge_layer:
+  enabled: true
+  persistence:
+    external_store: true
+    batch_size: 100
+  embeddings:
+    model: "text-embedding-3-large"
+    dimension: 1536  # text-embedding-3-large defaults to 3072; 1536 requires requesting shortened embeddings
+    store_vectors: true
+  hybrid_retrieval:
+    enabled: true
+    vector_weight: 0.6
+    lexical_weight: 0.4
+  compliance:
+    enable_gap_analysis: true
+    enable_suggestions: true
+
 output:
   default_format: "summary"
   include_confidence_scores: true
@@ -610,6 +659,34 @@ Week 7-8: Phase 4 - Production
 │ ✓ Monitoring + alerts              │
 │ ✓ Production deployment            │
 └────────────────────────────────────┘
+
+Week 9-10: Phase 5 - Knowledge Layer
+┌────────────────────────────────────┐
+│ ✓ Tagging confirmation loop        │
+│ ✓ Requirements persistence         │
+│ ✓ Embedding generation pipeline    │
+└────────────────────────────────────┘
+
+Week 11-12: Phase 6 - Hybrid Retrieval
+┌────────────────────────────────────┐
+│ ✓ Hybrid (vector + lexical) API    │
+│ ✓ Scoring fusion + re-ranking      │
+│ ✓ Initial retrieval benchmarks     │
+└────────────────────────────────────┘
+
+Week 13-14: Phase 7 - Compliance Reasoning
+┌────────────────────────────────────┐
+│ ✓ Standards graph construction     │
+│ ✓ Gap analysis engine              │
+│ ✓ Cross-standard Q&A               │
+└────────────────────────────────────┘
+
+Week 15: Phase 8 - Optimization & QA
+┌────────────────────────────────────┐
+│ ✓ Performance tuning (latency p95) │
+│ ✓ Continuous eval harness          │
+│ ✓ Risk & drift monitoring          │
+└────────────────────────────────────┘
 ```
 
 ---
@@ -634,7 +711,9 @@ Week 7-8: Phase 4 - Production
 
 ✅ **Provider Agnostic**: Works with any LLM provider
 
-✅ **Quality Preserved**: All DocumentAgent features maintained
+✅ **Quality Preserved**: All DocumentAgent features maintained, plus enhanced multi-pass refinement  
+✅ **Persistent Knowledge**: Structured + vectorized artifacts for future reasoning  
+✅ **Hybrid Intelligence**: Combines semantic, lexical, and metadata signals
 
 ---
 
@@ -644,10 +723,15 @@ Week 7-8: Phase 4 - Production
 2. **Graceful Degradation**: Basic features always available
 3. **Quality First**: 99-100% accuracy maintained
 4. **Security by Default**: Path validation, resource limits, sanitization
-5. **Performance Optimized**: Caching, async, streaming
-6. **User-Centric**: Natural language interface, helpful errors
+5. **Performance Optimized**: Caching, async, streaming, fused retrieval
+6. **User-Centric**: Natural language interface, helpful errors, confirmation loops
+7. **Observability & Evaluation**: Metrics for tagging, extraction, retrieval, compliance
 
 ---
 
 **For complete implementation details, see:**  
-`doc/design/deepagent_document_tools_integration.md`
+`doc/design/deepagent_document_tools_integration.md` (Sections 14.x for v1.1 enhancements)
+
+---
+
+*This architecture summary is synchronized with the v1.1 design addendum.*
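
The `hybrid_retrieval` weights in the configuration hunk (`vector_weight: 0.6`, `lexical_weight: 0.4`) suggest a weighted score fusion. A minimal sketch of one way this could work is shown below. The min-max normalization step and the function names are assumptions for illustration; the commit does not specify how the two score scales are reconciled.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    vector_scores: dict[str, float],
    lexical_scores: dict[str, float],
    vector_weight: float = 0.6,   # mirrors knowledge_layer.hybrid_retrieval
    lexical_weight: float = 0.4,
) -> list[tuple[str, float]]:
    """Return (id, fused_score) pairs, best first."""
    v = min_max(vector_scores)
    l = min_max(lexical_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0)
        + lexical_weight * l.get(doc_id, 0.0)
        for doc_id in v.keys() | l.keys()
    }
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical requirement IDs and raw scores.
    vector_hits = {"REQ-1": 0.9, "REQ-2": 0.5, "REQ-3": 0.1}
    lexical_hits = {"REQ-2": 12.0, "REQ-4": 3.0}
    for doc_id, score in fuse(vector_hits, lexical_hits):
        print(doc_id, round(score, 3))
```

With these sample inputs, `REQ-2` ranks first because it is found by both retrievers, even though `REQ-1` has the higher vector score on its own. An item missing from one retriever simply contributes zero for that signal, which is a design choice worth revisiting if rank-based fusion turns out to be more robust.
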