- Removed all spec-kit references from README.md and specs/README.md
- Fixed directory references to match actual structure (plan/ vs plans/, task/ vs tasks/)
- Updated README.md to reference correct YAML file names (spec.yaml vs specification.yaml)
- Added comprehensive README files for plan/ and task/ directories
- Added new specification, plan, and task YAML files with proper structure
- Added .codacy/codacy.yaml for code quality configuration
- Cleaned up specs/example_spec.yaml file
- Updated .gitignore to exclude codacy AI rules file
- All tests passing and repository structure is now consistent
# Welcome to the Unstructured Data RAG Platform Repo

 <details>
 <summary><strong>Table of Contents</strong></summary>
@@ -18,9 +16,21 @@

 </details>

-<br />
+---
+This repository contains the source code for the **Unstructured Data RAG Platform**,
+a Python-based framework for handling unstructured documents in **safety-critical** and **cybersecurity** software development lifecycles.
+
+It ingests documents (PDF, Word, PlantUML, Draw.io, Mermaid), converts them to structured JSON, embeds them with **PGVector**, stores raw artifacts in **MinIO**, and enables **agentic RAG** using **LangChain DeepAgent**.
+
+## Unstructured Data RAG Platform Overview
+
+Unstructured Data RAG Platform is a Python-based Software Development Life Cycle core project that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.
+- **Project Type**: AI/ML library and tooling for SDLC workflows
+

-This repository contains the source code for the unstructuredDataHandler project, a Python-based framework for building AI-powered software development life cycle tools.

 Related repositories include:

@@ -31,63 +41,120 @@ Related repositories include:
 The plan for the unstructuredDataHandler [is described here](./doc/roadmap-20xx.md) and
 will be updated as the project proceeds.

-## Installing and running Windows Terminal
+The project specification, plan, and task breakdown are defined in YAML files:

-> [!NOTE]
-> This section is a placeholder and may not be relevant to this project.
+- [Specification](./specs/spec.yaml)
+- [Plan](./plan/plan.yaml)
+- [Tasks](./task/task.yaml)

-## unstructuredDataHandler Overview
+---

-unstructuredDataHandler is a Python-based Software Development Life Cycle core project that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.
+## Overview

-## Resources
+The Unstructured Data RAG Platform provides:

-> [!NOTE]
-> This section is a placeholder. Please add relevant links.
+- Ingestion and parsing for priority formats: PDF, Word, PlantUML, Draw.io, Mermaid.
+- Unified JSON schema for structured content.
+- Storage in **Postgres + PGVector** for semantic search.
+- Raw object storage in **MinIO**, bi-directionally linked with Postgres entries.
+- Context-preserving chunking for embeddings.
+- Agentic workflows with **LangChain DeepAgent**.
+- **LLM-as-judge** to validate consistency and traceability.
+- A **React frontend** for manual review, labeling, and editing.
+- Multi-LLM support (OpenAI API, Anthropic, local LLaMA2).
+- Structured logging and error capture for debugging and compliance.

-* [Link 1](add link)
-* [Link 2](add link)
+---

-## FAQ
+## Architecture

-> [!NOTE]
-> This section is a placeholder. Please add frequently asked questions.
+```
+                ┌───────────────┐
+                │   Frontend    │
+                │ (React/Next)  │
+                └───────▲───────┘
+                        │
+                        ▼
+                 ┌──────────────┐
+                 │   Backend    │
+                 │  (FastAPI)   │
+                 └──────▲───────┘
+                        │
+    ┌───────────────────┼───────────────────┐
+    ▼                   ▼                   ▼
+[Parsers]      [Postgres+PGVector]       [MinIO]
+(PDF/Word/     (JSON + embeddings)     (raw files,
+PlantUML/Drawio)                        binaries, images)
+
+      ┌───────────────────────────────────┐
+      │        LangChain DeepAgent        │
+      │  Retrieval + Generation + Judge   │
+      └───────────────────────────────────┘
+```

-### Q1
-### Q2
+---

-## Documentation
+## 🚀 Modules

-All project documentation is located at [softwaremodule-docs](./doc/). If you would like
-to contribute to the documentation, please submit a pull request on the [unstructuredDataHandler
-Documentation][docs-repo] repository.
+### Agents
+
+The `agents` module provides the core components for creating AI agents. It includes a flexible `FlexibleAgent` (formerly `SDLCFlexibleAgent`) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a `MockAgent` for testing and CI.
+
+The `agents` module integrates **LangChain DeepAgent**. It handles retrieval from PGVector, answer generation, and LLM-as-judge evaluations. Supports multiple LLM providers (OpenAI, Anthropic, LLaMA2 local via Ollama).
+
+### Parsers
+
+The `parsers` module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
+
+The `parsers` module processes multiple formats:
+- **PDF & Word** via [Unstructured.io](https://github.com/Unstructured-IO/unstructured)
+- **PlantUML, Draw.io, Mermaid** via custom parsers
+Extracted text, metadata, and relationships are normalized into JSON, stored in Postgres, and linked to raw files in MinIO.
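
The base-parser abstraction and JSON normalization described for the `parsers` module could look roughly like the sketch below. This is not the repository's code: `ParsedDiagram`, `BaseParser`, `MermaidEdgeParser`, and `has_cycle` are hypothetical names, the JSON field names are assumptions, and the parser only understands simple `A --> B` edges rather than the full Mermaid grammar.

```python
import json
import re
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass, field


@dataclass
class ParsedDiagram:
    """Hypothetical normalized record; field names are illustrative only."""
    source_format: str
    elements: list[str] = field(default_factory=list)
    relationships: list[dict[str, str]] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to the kind of JSON document that could be stored in Postgres.
        return json.dumps(asdict(self), sort_keys=True)


class BaseParser(ABC):
    """Assumed base abstraction: each format implements parse()."""

    @abstractmethod
    def parse(self, text: str) -> ParsedDiagram: ...


class MermaidEdgeParser(BaseParser):
    """Toy parser: recognizes only lines of the form `A --> B`."""

    _EDGE = re.compile(r"^\s*(\w+)\s*-->\s*(\w+)\s*$")

    def parse(self, text: str) -> ParsedDiagram:
        diagram = ParsedDiagram(source_format="mermaid")
        for line in text.splitlines():
            match = self._EDGE.match(line)
            if not match:
                continue  # skip headers such as `graph TD` and anything unsupported
            source, target = match.groups()
            for name in (source, target):
                if name not in diagram.elements:
                    diagram.elements.append(name)
            diagram.relationships.append({"from": source, "to": target})
        return diagram


def has_cycle(diagram: ParsedDiagram) -> bool:
    """Depth-first search for a circular dependency among relationships."""
    graph: dict[str, list[str]] = {}
    for rel in diagram.relationships:
        graph.setdefault(rel["from"], []).append(rel["to"])
    visiting: set[str] = set()
    done: set[str] = set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node that is still on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))
```

Calling `MermaidEdgeParser().parse(text).to_json()` yields one JSON document per diagram; a new format would add another `BaseParser` subclass without touching the callers.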
-📁 config/ → YAML config for models, prompts, logging
-📁 data/ → Prompts, embeddings, and other dynamic content
-📁 examples/ → Minimal scripts to test key features
-📁 notebooks/ → Quick experiments and prototyping
-📁 tests/ → Unit, integration, and end-to-end tests
-📁 src/ → The core engine: all logic lives here (./src/README.md)
+---
+
+## FAQ
+
+**Q1:** Why not store raw docs directly in Postgres?
+**A1:** To separate structured vs. unstructured storage. Raw files live in MinIO; Postgres stores structured JSON + embeddings with bi-directional links.
+
+**Q2:** Can I use my own LLM?
+**A2:** Yes. The platform supports OpenAI, Anthropic, local models (via Ollama), or self-hosted LLaMA2.
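
The bi-directional link between raw files and structured rows mentioned in A1 can be illustrated with a small in-memory sketch. It deliberately uses plain dictionaries as stand-ins for the MinIO bucket and the Postgres table, so no real client calls are shown; `LinkedStore`, `RawObject`, `DocumentRow`, and the `raw/<id>/<filename>` key layout are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RawObject:
    object_key: str   # key the raw file would have in object storage
    document_id: str  # back-reference to the structured row
    payload: bytes


@dataclass(frozen=True)
class DocumentRow:
    document_id: str
    object_key: str   # forward reference to the raw file
    content: dict     # structured JSON extracted from the file


class LinkedStore:
    """In-memory stand-in: `objects` plays the bucket, `rows` plays the table."""

    def __init__(self) -> None:
        self.objects: dict[str, RawObject] = {}
        self.rows: dict[str, DocumentRow] = {}

    def ingest(self, filename: str, payload: bytes, content: dict) -> DocumentRow:
        # Derive a stable id from the file bytes (an assumption, not the project's scheme).
        document_id = hashlib.sha256(payload).hexdigest()[:16]
        object_key = f"raw/{document_id}/{filename}"
        self.objects[object_key] = RawObject(object_key, document_id, payload)
        self.rows[document_id] = DocumentRow(document_id, object_key, content)
        return self.rows[document_id]

    def raw_for(self, document_id: str) -> bytes:
        # Structured row -> raw file.
        return self.objects[self.rows[document_id].object_key].payload

    def row_for(self, object_key: str) -> DocumentRow:
        # Raw file -> structured row.
        return self.rows[self.objects[object_key].document_id]
```

Because each side stores the other's identifier, a reviewer can start from a search hit and fetch the original file, or start from a file and find its structured record.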

-```
 ---

-## 🚀 Modules
+## Documentation

-### Agents
+All project documentation is located at [softwaremodule-docs](./doc/). Architecture, schema, and developer guides are maintained here.

-The `agents` module provides the core components for creating AI agents. It includes a flexible `FlexibleAgent` (formerly `SDLCFlexibleAgent`) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a `MockAgent` for testing and CI.
+If you would like to contribute to the documentation, please submit a pull request on the [unstructuredDataHandler Documentation][docs-repo] repository.

-### Parsers
+---

-The `parsers` module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
+## 🔧 Key Components

+```
+📁 specs/ → Project specification
+📁 plan/ → High-level plan
+📁 task/ → Task breakdown
+📁 config/ → YAML config for models, prompts, logging
+📁 data/ → Prompts, embeddings, and other dynamic content
+📁 examples/ → Minimal scripts to test key features
+📁 notebooks/ → Quick experiments and prototyping
+📁 src/ → The core engine: all logic lives here (./src/README.md)
+📁 tests/ → Unit, integration, and end-to-end tests
@@ -100,7 +167,12 @@ The `parsers` module is a powerful utility for parsing various diagram-as-code f
 - Handle errors with custom exceptions
 - Use notebooks for rapid testing and iteration
 - Monitor API usage and set rate limits
-- Keep code and docs in sync
+- Keep code and docs in sync
+- Normalize all parsed content into JSON schema.
+- Chunk documents with context preservation.
+- Monitor agents via LangSmith.
+- Store only raw files in MinIO, not Git.
+- Run CI/CD pipelines for linting, testing, and type checks.

 ---

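
The README does not say how "chunk documents with context preservation" is implemented. One common reading is a sliding window whose consecutive chunks share some words, so text near a boundary appears with context on both sides. The sketch below shows that idea under stated assumptions: `chunk_words` is a hypothetical helper, and the word-based `size` and `overlap` defaults are arbitrary.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words
    with the previous window. Illustrative only; not the project's chunker."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

A larger `overlap` keeps more shared context per chunk at the cost of embedding more duplicated text, so the value is usually tuned per document type.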
@@ -125,12 +197,20 @@ The `parsers` module is a powerful utility for parsing various diagram-as-code f
 - Track with version control
 - Keep datasets fresh
 - Keep documentation updated
-- Monitor API usage and limits
+- Monitor API usage and limits
+- Keep LLM API usage monitored.
+- Keep specs/plans/tasks updated in version control.
+- Test parsers with representative docs.
+- Use notebooks for quick experiments.
+- Run type checks (`mypy`) and linting (`ruff`) before PRs.

 ---

 ## 📁 Core Files

+- `specs/spec.yaml` – System specification
+- `plan/plan.yaml` – Roadmap & phases
+- `task/task.yaml` – Task breakdown with dependencies
 - `requirements.txt` – Core package dependencies for the project.
 - `requirements-dev.txt` - Dependencies for development and testing.
 - `requirements-docs.txt` - Dependencies for generating documentation.
@@ -207,8 +287,7 @@ Note: The mypy exclusion for `src/llm/router.py` avoids a duplicate module confl

 We are excited to work with the community to build and enhance this project.

-***BEFORE you start work on a feature/fix***, please read & follow our [Contributor's Guide](./CONTRIBUTING.md) to
-help avoid any wasted or duplicate effort.
+***BEFORE you start work on a feature/fix***, please read & follow our [Contributor's Guide](./CONTRIBUTING.md) to help avoid any wasted or duplicate effort.

 ### Developer setup: pre-commit hooks (optional but recommended)