SoftwareDevLabs
diff --git a/.codacy/codacy.yaml b/.codacy/codacy.yaml
Lines changed: 15 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 121 additions & 42 deletions
diff --git a/plan/README.md b/plan/README.md
Lines changed: 27 additions & 0 deletions
diff --git a/plan/plan.yaml b/plan/plan.yaml
Lines changed: 62 additions & 0 deletions
diff --git a/specs/README.md b/specs/README.md
Lines changed: 19 additions & 21 deletions
@@ -0,0 +1,15 @@
+runtimes:
+    - dart@3.7.2
+    - go@1.22.3
+    - java@17.0.10
+    - node@22.2.0
+    - python@3.11.11
+tools:
+    - dartanalyzer@3.7.2
+    - eslint@8.57.0
+    - lizard@1.17.31
+    - pmd@7.11.0
+    - pylint@3.3.6
+    - revive@1.7.0
+    - semgrep@1.78.0
+    - trivy@0.66.0
@@ -293,3 +293,7 @@ profiles.json
 # Coverage reports
 .coverage
 coverage.xml
+
+
+# Ignore VS Code AI rules
+.github/instructions/codacy.instructions.md
@@ -1,6 +1,4 @@
-
-
-# Welcome to the unstructuredDataHandler Repo
+# Welcome to the Unstructured Data RAG Platform Repo
 
 <details>
   <summary><strong>Table of Contents</strong></summary>
@@ -18,9 +16,21 @@
 
 </details>
 
-<br />
+---
+This repository contains the source code for the **Unstructured Data RAG Platform**,  
+a Python-based framework for handling unstructured documents in **safety-critical** and **cybersecurity** software development lifecycles.  
+
+It ingests documents (PDF, Word, PlantUML, Draw.io, Mermaid), converts them to structured JSON, embeds them with **PGVector**, stores raw artifacts in **MinIO**, and enables **agentic RAG** using **LangChain DeepAgent**.
+
+## Unstructured Data RAG Platform Overview
+
+The Unstructured Data RAG Platform is a Python-based core project for the software development life cycle (SDLC) that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.
+
+- **Primary Language**: Python 3.10-3.12
+- **Secondary Languages**: TypeScript (for Azure pipelines), Shell scripts
+- **Project Type**: AI/ML library and tooling for SDLC workflows
+
 
-This repository contains the source code for the unstructuredDataHandler project, a Python-based framework for building AI-powered software development life cycle tools.
 
 Related repositories include:
 
@@ -31,63 +41,120 @@ Related repositories include:
 The plan for the unstructuredDataHandler [is described here](./doc/roadmap-20xx.md) and
 will be updated as the project proceeds.
 
-## Installing and running Windows Terminal
+The project specification, plan, and task breakdown are defined in YAML files:
 
-> [!NOTE]
-> This section is a placeholder and may not be relevant to this project.
+- [Specification](./specs/spec.yaml)  
+- [Plan](./plan/plan.yaml)  
+- [Tasks](./task/task.yaml)
 
-## unstructuredDataHandler Overview
+---
 
-unstructuredDataHandler is a Python-based Software Development Life Cycle core project that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.
+## Overview
 
-## Resources
+The Unstructured Data RAG Platform provides:
 
-> [!NOTE]
-> This section is a placeholder. Please add relevant links.
+- Ingestion and parsing for priority formats: PDF, Word, PlantUML, Draw.io, Mermaid.
+- Unified JSON schema for structured content.
+- Storage in **Postgres + PGVector** for semantic search.
+- Raw object storage in **MinIO**, bi-directionally linked with Postgres entries.
+- Context-preserving chunking for embeddings.
+- Agentic workflows with **LangChain DeepAgent**.
+- **LLM-as-judge** to validate consistency and traceability.
+- A **React frontend** for manual review, labeling, and editing.
+- Multi-LLM support (OpenAI API, Anthropic, local LLaMA2).
+- Structured logging and error capture for debugging and compliance.
 
-* [Link 1](add link)
-* [Link 2](add link)
+---
 
-## FAQ
+## Architecture
 
-> [!NOTE]
-> This section is a placeholder. Please add frequently asked questions.
+```
+                ┌───────────────┐
+                │   Frontend    │
+                │ (React/Next)  │
+                └───────▲───────┘
+                        │
+                        ▼
+                 ┌──────────────┐
+                 │   Backend    │
+                 │  (FastAPI)   │
+                 └──────▲───────┘
+                        │
+    ┌───────────────────┼───────────────────┐
+    ▼                   ▼                   ▼
+ [Parsers]         [Postgres+PGVector]    [MinIO]
+ (PDF/Word/        (JSON + embeddings)    (raw files,
+ PlantUML/Drawio)                          binaries, images)
+
+         ┌───────────────────────────────────┐
+         │       LangChain DeepAgent         │
+         │  Retrieval + Generation + Judge   │
+         └───────────────────────────────────┘
+```
 
-### Q1
-### Q2
+---
 
-## Documentation
+## 🚀 Modules
 
-All project documentation is located at [softwaremodule-docs](./doc/). If you would like
-to contribute to the documentation, please submit a pull request on the [unstructuredDataHandler
-Documentation][docs-repo] repository.
+### Agents
+
+The `agents` module provides the core components for creating AI agents. It includes a flexible `FlexibleAgent` (formerly `SDLCFlexibleAgent`) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a `MockAgent` for testing and CI.
+
+The `agents` module integrates **LangChain DeepAgent**. It handles retrieval from PGVector, answer generation, and LLM-as-judge evaluations. It supports multiple LLM providers (OpenAI, Anthropic, and local LLaMA2 via Ollama).
+
+### Parsers
+
+The `parsers` module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
+
+The `parsers` module processes multiple formats:
+- **PDF & Word** via [Unstructured.io](https://github.com/Unstructured-IO/unstructured)  
+- **PlantUML, Draw.io, Mermaid** via custom parsers  
+Extracted text, metadata, and relationships are normalized into JSON, stored in Postgres, and linked to raw files in MinIO.
 
 ---
 
-## 🔧 Key Components
+## Resources
 
-```
+* [LangChain DeepAgents](https://python.langchain.com/docs/deep-dive/deep_agents/)
+* [pgvector Extension](https://github.com/pgvector/pgvector)
+* [Unstructured.io Parsers](https://github.com/Unstructured-IO/unstructured)
+* [MinIO Object Storage](https://min.io)
 
-📁 config/ → YAML config for models, prompts, logging
-📁 data/ → Prompts, embeddings, and other dynamic content
-📁 examples/ → Minimal scripts to test key features
-📁 notebooks/ → Quick experiments and prototyping
-📁 tests/ → Unit, integration, and end-to-end tests
-📁 src/ → The core engine — all logic lives here (./src/README.md)
+---
+
+## FAQ
+
+**Q1:** Why not store raw docs directly in Postgres?  
+**A1:** To separate structured vs. unstructured storage. Raw files live in MinIO; Postgres stores structured JSON + embeddings with bi-directional links.  
+
+**Q2:** Can I use my own LLM?  
+**A2:** Yes. The platform supports OpenAI, Anthropic, local models (via Ollama), and self-hosted LLaMA2.  
 
-```
 ---
 
-## 🚀 Modules
+## Documentation
 
-### Agents
+All project documentation is located at [softwaremodule-docs](./doc/). Architecture, schema, and developer guides are maintained here. 
 
-The `agents` module provides the core components for creating AI agents. It includes a flexible `FlexibleAgent` (formerly `SDLCFlexibleAgent`) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a `MockAgent` for testing and CI.
+If you would like to contribute to the documentation, please submit a pull request on the [unstructuredDataHandler Documentation][docs-repo] repository.
 
-### Parsers
+---
 
-The `parsers` module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
+## 🔧 Key Components
 
+```
+📁 specs/        → Project specification
+📁 plan/         → High-level plan  
+📁 task/         → Task breakdown
+📁 config/       → YAML config for models, prompts, logging
+📁 data/         → Prompts, embeddings, and other dynamic content
+📁 examples/     → Minimal scripts to test key features
+📁 notebooks/    → Quick experiments and prototyping
+📁 src/          → The core engine — all logic lives here (./src/README.md)
+📁 tests/        → Unit, integration, and end-to-end tests
+📁 docs/         → Architecture, schema, guides
+📁 deployments/  → Docker, Kubernetes, Helm, monitoring
+```
 ---
 
 ## ⚡ Best Practices
@@ -100,7 +167,12 @@ The `parsers` module is a powerful utility for parsing various diagram-as-code f
 - Handle errors with custom exceptions  
 - Use notebooks for rapid testing and iteration  
 - Monitor API usage and set rate limits  
-- Keep code and docs in sync  
+- Keep code and docs in sync
+- Normalize all parsed content into JSON schema.  
+- Chunk documents with context preservation.  
+- Monitor agents via LangSmith.  
+- Store raw files in MinIO only, not in Git.  
+- Run CI/CD pipelines for linting, testing, and type checks.
 
 ---
 
@@ -125,12 +197,20 @@ The `parsers` module is a powerful utility for parsing various diagram-as-code f
 - Track with version control  
 - Keep datasets fresh  
 - Keep documentation updated
-- Monitor API usage and limits 
+- Monitor API usage and limits
+- Keep LLM API usage monitored.  
+- Keep specs/plans/tasks updated in version control.  
+- Test parsers with representative docs.  
+- Use notebooks for quick experiments.  
+- Run type checks (`mypy`) and linting (`ruff`) before PRs.
 
 ---
 
 ## 📁 Core Files
 
+- `specs/spec.yaml` – System specification  
+- `plan/plan.yaml` – Roadmap & phases  
+- `task/task.yaml` – Task breakdown with dependencies  
 - `requirements.txt` – Core package dependencies for the project.
 - `requirements-dev.txt` - Dependencies for development and testing.
 - `requirements-docs.txt` - Dependencies for generating documentation.
@@ -207,8 +287,7 @@ Note: The mypy exclusion for `src/llm/router.py` avoids a duplicate module confl
 
 We are excited to work with the community to build and enhance this project.
 
-***BEFORE you start work on a feature/fix***, please read & follow our [Contributor's Guide](./CONTRIBUTING.md) to
-help avoid any wasted or duplicate effort.
+***BEFORE you start work on a feature/fix***, please read & follow our [Contributor's Guide](./CONTRIBUTING.md) to help avoid any wasted or duplicate effort.
 
 ### Developer setup: pre-commit hooks (optional but recommended)
 
 
@@ -0,0 +1,27 @@
+# Project Plan
+
+This directory contains the high-level project plan in YAML format.
+
+## Files
+
+- `plan.yaml` - Project phases, deliverables, and acceptance criteria
+
+## Plan Structure
+
+The plan defines:
+
+- Development phases with clear boundaries
+- Key deliverables for each phase
+- Acceptance criteria for phase completion
+- Dependencies between phases
+
+## Usage
+
+The plan serves as:
+
+- High-level roadmap for development
+- Milestone tracking and planning
+- Resource allocation guidance
+- Progress measurement framework
+
+The plan should be updated as the project progresses and requirements evolve.
@@ -0,0 +1,62 @@
+plan:
+  phases:
+    - name: "Phase 1 - Environment & Infrastructure"
+      deliverables:
+        - Project repo with CI/CD pipeline.
+        - Dockerized base environment.
+        - Postgres with PGVector extension configured.
+        - MinIO bucket with SDK integration.
+      acceptance_criteria:
+        - Developer can run `docker-compose up` and access Postgres + MinIO.
+        - PGVector table created and test embedding stored/retrieved.
+        - MinIO file upload and retrieval tested with sample PDF.
+
+    - name: "Phase 2 - Document Ingestion & Normalization"
+      deliverables:
+        - PDF and Word ingestion via Unstructured.io.
+        - PlantUML, Draw.io, Mermaid parsing utilities.
+        - Unified JSON schema for structured content + metadata.
+      acceptance_criteria:
+        - Upload of sample PDF produces valid JSON in DB.
+        - PlantUML text parses into structured JSON.
+        - JSON schema validated against 5+ document samples.
+
+    - name: "Phase 3 - Storage & Embedding"
+      deliverables:
+        - JSON record storage in Postgres (JSONB columns).
+        - Raw object upload to MinIO with object ID in Postgres.
+        - Embedding pipeline with advanced chunking strategy.
+      acceptance_criteria:
+        - Each chunk has vector embedding stored in PGVector.
+        - JSON ↔ MinIO bi-directional reference resolvable by ID.
+        - Document-level embeddings preserved for long-context.
+
+    - name: "Phase 4 - Agentic RAG & LLM Evaluation"
+      deliverables:
+        - LangChain DeepAgent pipeline for retrieval + answer generation.
+        - LLM-as-judge consistency/traceability checks.
+        - Error & warning logging database.
+      acceptance_criteria:
+        - User query retrieves top-3 relevant chunks correctly.
+        - LLM-as-judge detects at least 80% of schema violations in tests.
+        - Errors automatically logged with timestamp + doc reference.
+
+    - name: "Phase 5 - Frontend Development"
+      deliverables:
+        - React-based UI container for labeling & review.
+        - Label storage API and editing workflows.
+        - Traceability view linking JSON records to raw files.
+      acceptance_criteria:
+        - User can log in, review parsed document, add/edit labels.
+        - UI displays linked raw document (from MinIO).
+        - Edits propagate back to Postgres.
+
+    - name: "Phase 6 - Deployment & Scaling"
+      deliverables:
+        - Kubernetes manifests and Helm chart.
+        - Multi-LLM provider integration.
+        - Monitoring + observability (LangSmith, Prometheus/Grafana).
+      acceptance_criteria:
+        - Cluster runs frontend + backend containers in separate pods.
+        - Queries work with OpenAI + local LLaMA2 backend.
+        - Metrics & logs visible in monitoring dashboard.
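The `plan.yaml` layout above (a top-level `plan` key holding a list of `phases`, each with `name`, `deliverables`, and `acceptance_criteria`) lends itself to a small structural check in CI. The following is only a minimal sketch and not part of this change: `validate_plan` and `REQUIRED_LIST_KEYS` are hypothetical names, and the function assumes the file has already been loaded into a plain dictionary (for example with PyYAML's `yaml.safe_load`, which is an assumption about tooling rather than something this PR adds).

```python
from typing import Any

# Keys that every phase in plan.yaml is expected to carry as non-empty lists.
REQUIRED_LIST_KEYS = ("deliverables", "acceptance_criteria")


def validate_plan(doc: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the structure looks sound."""
    plan = doc.get("plan")
    phases = plan.get("phases") if isinstance(plan, dict) else None
    if not isinstance(phases, list) or not phases:
        return ["plan.phases must be a non-empty list"]

    problems: list[str] = []
    for index, phase in enumerate(phases, start=1):
        if not isinstance(phase, dict):
            problems.append(f"phase {index}: expected a mapping")
            continue

        name = phase.get("name")
        if not isinstance(name, str) or not name.strip():
            problems.append(f"phase {index}: missing or empty 'name'")
            label = f"phase {index}"
        else:
            label = name

        # Each list-valued key must be present and contain at least one entry.
        for key in REQUIRED_LIST_KEYS:
            value = phase.get(key)
            if not isinstance(value, list) or not value:
                problems.append(f"{label}: '{key}' must be a non-empty list")
    return problems


if __name__ == "__main__":
    # Hand-written sample mirroring the shape of plan.yaml (illustrative only).
    sample = {
        "plan": {
            "phases": [
                {
                    "name": "Phase 1 - Environment & Infrastructure",
                    "deliverables": ["Dockerized base environment."],
                    "acceptance_criteria": ["PGVector table created."],
                },
                {"name": "Phase 2 - Document Ingestion & Normalization"},
            ]
        }
    }
    for problem in validate_plan(sample):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a CI job report every malformed phase in a single run.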
@@ -1,29 +1,27 @@
-# SpecKit Integration
+# Project Specifications
 
-This repo uses [speckit](https://github.com/github/spec-kit) for spec-based development.
+This directory contains the project specification files in YAML format.
 
-## Getting Started
+## Files
 
-- Install dev dependencies:
-  
-  ```bash
-  pip install -r requirements-dev.txt
-  
-  ```
+- `spec.yaml` - Main project specification including goals, stakeholders, and success metrics
 
-- Example spec file: `specs/example_spec.yaml`
-- Run a spec:
-  
-  ```bash
-  speckit run specs/example_spec.yaml
-  ```
+## Specification Format
 
-  
-## Workflow
+The specification follows a structured YAML format covering:
 
-- Write specs in YAML under `specs/`
-- Use `speckit run <specfile>` to execute and validate workflows
-- Integrate with CI for spec-driven checks
+- Project goals and non-goals
+- Stakeholder identification
+- Success metrics and acceptance criteria
+- Technical requirements and constraints
 
+## Usage
 
-See [speckit docs](https://github.com/github/spec-kit) for advanced usage.
+These specification files serve as the authoritative source for:
+
+- Project requirements and scope
+- Stakeholder expectations
+- Success criteria and metrics
+- Technical constraints and decisions
+
+The specifications are referenced throughout the project documentation and should be kept up-to-date as the project evolves.