Commit 5b78a64

Fix consistency issues: remove spec-kit references, correct directory structure references, and add proper documentation
- Removed all spec-kit references from README.md and specs/README.md
- Fixed directory references to match actual structure (plan/ vs plans/, task/ vs tasks/)
- Updated README.md to reference correct YAML file names (spec.yaml vs specification.yaml)
- Added comprehensive README files for plan/ and task/ directories
- Added new specification, plan, and task YAML files with proper structure
- Added .codacy/codacy.yaml for code quality configuration
- Cleaned up specs/example_spec.yaml file
- Updated .gitignore to exclude codacy AI rules file
- All tests passing and repository structure is now consistent
1 parent de98236 commit 5b78a64

10 files changed: +437 -70 lines changed


.codacy/codacy.yaml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+runtimes:
+    - dart@3.7.2
+    - go@1.22.3
+    - java@17.0.10
+    - node@22.2.0
+    - python@3.11.11
+tools:
+    - dartanalyzer@3.7.2
+    - eslint@8.57.0
+    - lizard@1.17.31
+    - pmd@7.11.0
+    - pylint@3.3.6
+    - revive@1.7.0
+    - semgrep@1.78.0
+    - trivy@0.66.0
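
Each entry in `.codacy/codacy.yaml` pins a runtime or tool as `name@version`. The following is a minimal sketch of how such pins could be split and checked in a script. It is not Codacy's own parser; `Pin` and `parse_pin` are hypothetical helper names, and the list is copied from the `runtimes` section above.

```python
from typing import NamedTuple


class Pin(NamedTuple):
    """A single `name@version` entry (hypothetical helper type)."""
    name: str
    version: str


def parse_pin(entry: str) -> Pin:
    """Split 'name@version' on the last '@' and reject malformed entries."""
    name, sep, version = entry.strip().rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name@version', got {entry!r}")
    return Pin(name, version)


# Runtime pins copied from the codacy.yaml added in this commit.
RUNTIMES = [
    "dart@3.7.2",
    "go@1.22.3",
    "java@17.0.10",
    "node@22.2.0",
    "python@3.11.11",
]

# Map each runtime name to its pinned version.
runtime_versions = {pin.name: pin.version for pin in map(parse_pin, RUNTIMES)}
```

A check like `runtime_versions["python"]` could then be compared against the Python range stated in README.md before running analysis locally.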

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -293,3 +293,7 @@ profiles.json
 # Coverage reports
 .coverage
 coverage.xml
+
+
+#Ignore vscode AI rules
+.github/instructions/codacy.instructions.md

README.md

Lines changed: 121 additions & 42 deletions
@@ -1,6 +1,4 @@
-
-
-# Welcome to the unstructuredDataHandler Repo
+# Welcome to the Unstructured Data RAG Platform Repo
 
 <details>
 <summary><strong>Table of Contents</strong></summary>
@@ -18,9 +16,21 @@
 
 </details>
 
-<br />
+---
+This repository contains the source code for the **Unstructured Data RAG Platform**,
+a Python-based framework for handling unstructured documents in **safety-critical** and **cybersecurity** software development lifecycles.
+
+It ingests documents (PDF, Word, PlantUML, Draw.io, Mermaid), converts them to structured JSON, embeds them with **PGVector**, stores raw artifacts in **MinIO**, and enables **agentic RAG** using **LangChain DeepAgent**.
+
+## Unstructured Data RAG Platform Overview
+
+Unstructured Data RAG Platform is a Python-based Software Development Life Cycle core project that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.
+
+- **Primary Language**: Python 3.10-3.12
+- **Secondary Languages**: TypeScript (for Azure pipelines), Shell scripts
+- **Project Type**: AI/ML library and tooling for SDLC workflows
+
 
-This repository contains the source code for the unstructuredDataHandler project, a Python-based framework for building AI-powered software development life cycle tools.
 
 Related repositories include:
 
@@ -31,63 +41,120 @@ Related repositories include:
 The plan for the unstructuredDataHandler [is described here](./doc/roadmap-20xx.md) and
 will be updated as the project proceeds.
 
-## Installing and running Windows Terminal
+The project specification, plan, and task breakdown are defined in YAML files:
 
-> [!NOTE]
-> This section is a placeholder and may not be relevant to this project.
+- [Specification](./specs/spec.yaml)
+- [Plan](./plan/plan.yaml)
+- [Tasks](./task/task.yaml)
 
-## unstructuredDataHandler Overview
+---
 
-unstructuredDataHandler is a Python-based Software Development Life Cycle core project that provides AI/ML capabilities for software development workflows. The repository contains modules for LLM clients, intelligent agents, memory management, prompt engineering, document retrieval, skill execution, and various utilities. It combines a Python core with TypeScript for Azure DevOps pipeline configurations.
+## Overview
 
-## Resources
+The Unstructured Data RAG Platform provides:
 
-> [!NOTE]
-> This section is a placeholder. Please add relevant links.
+- Ingestion and parsing for priority formats: PDF, Word, PlantUML, Draw.io, Mermaid.
+- Unified JSON schema for structured content.
+- Storage in **Postgres + PGVector** for semantic search.
+- Raw object storage in **MinIO**, bi-directionally linked with Postgres entries.
+- Context-preserving chunking for embeddings.
+- Agentic workflows with **LangChain DeepAgent**.
+- **LLM-as-judge** to validate consistency and traceability.
+- A **React frontend** for manual review, labeling, and editing.
+- Multi-LLM support (OpenAI API, Anthropic, local LLaMA2).
+- Structured logging and error capture for debugging and compliance.
 
-* [Link 1](add link)
-* [Link 2](add link)
+---
 
-## FAQ
+## Architecture
 
-> [!NOTE]
-> This section is a placeholder. Please add frequently asked questions.
+```
+┌───────────────┐
+│ Frontend │
+│ (React/Next) │
+└───────▲───────┘
+
+
+┌──────────────┐
+│ Backend │
+│ (FastAPI) │
+└──────▲───────┘
+
+┌───────────────────┼───────────────────┐
+▼ ▼ ▼
+[Parsers] [Postgres+PGVector] [MinIO]
+(PDF/Word/ (JSON + embeddings) (raw files,
+PlantUML/Drawio) binaries, images)
+
+┌───────────────────────────────────┐
+│ LangChain DeepAgent │
+│ Retrieval + Generation + Judge │
+└───────────────────────────────────┘
+```
 
-### Q1
-### Q2
+---
 
-## Documentation
+## 🚀 Modules
 
-All project documentation is located at [softwaremodule-docs](./doc/). If you would like
-to contribute to the documentation, please submit a pull request on the [unstructuredDataHandler
-Documentation][docs-repo] repository.
+### Agents
+
+The `agents` module provides the core components for creating AI agents. It includes a flexible `FlexibleAgent` (formerly `SDLCFlexibleAgent`) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a `MockAgent` for testing and CI.
+
+The `agents` module integrates **LangChain DeepAgent**. It handles retrieval from PGVector, answer generation, and LLM-as-judge evaluations. Supports multiple LLM providers (OpenAI, Anthropic, LLaMA2 local via Ollama).
+
+### Parsers
+
+The `parsers` module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
+
+The `parsers` module processes multiple formats:
+- **PDF & Word** via [Unstructured.io](https://github.com/Unstructured-IO/unstructured)
+- **PlantUML, Draw.io, Mermaid** via custom parsers
+Extracted text, metadata, and relationships are normalized into JSON, stored in Postgres, and linked to raw files in MinIO.
 
 ---
 
-## 🔧 Key Components
+## Resources
 
-```
+* [LangChain DeepAgents](https://python.langchain.com/docs/deep-dive/deep_agents/)
+* [pgvector Extension](https://github.com/pgvector/pgvector)
+* [Unstructured.io Parsers](https://github.com/Unstructured-IO/unstructured)
+* [MinIO Object Storage](https://min.io)
 
-📁 config/ → YAML config for models, prompts, logging
-📁 data/ → Prompts, embeddings, and other dynamic content
-📁 examples/ → Minimal scripts to test key features
-📁 notebooks/ → Quick experiments and prototyping
-📁 tests/ → Unit, integration, and end-to-end tests
-📁 src/ → The core engine — all logic lives here (./src/README.md)
+---
+
+## FAQ
+
+**Q1:** Why not store raw docs directly in Postgres?
+**A1:** To separate structured vs. unstructured storage. Raw files live in MinIO; Postgres stores structured JSON + embeddings with bi-directional links.
+
+**Q2:** Can I use my own LLM?
+**A2:** Yes. The platform supports OpenAI, Anthropic, local models (via Ollama), or self-hosted LLaMA2.
 
-```
 ---
 
-## 🚀 Modules
+## Documentation
 
-### Agents
+All project documentation is located at [softwaremodule-docs](./doc/). Architecture, schema, and developer guides are maintained here.
 
-The `agents` module provides the core components for creating AI agents. It includes a flexible `FlexibleAgent` (formerly `SDLCFlexibleAgent`) that can be configured to use different LLM providers (like OpenAI, Gemini, and Ollama) and a set of tools. The module is designed to be extensible, allowing for the creation of custom agents with specialized skills. Key components include a planner and an executor (currently placeholders for future development) and a `MockAgent` for testing and CI.
+If you would like to contribute to the documentation, please submit a pull request on the [unstructuredDataHandler Documentation][docs-repo] repository.
 
-### Parsers
+---
 
-The `parsers` module is a powerful utility for parsing various diagram-as-code formats, including PlantUML, Mermaid, and DrawIO. It extracts structured information from diagram files, such as elements, relationships, and metadata, and stores it in a local SQLite database. This allows for complex querying, analysis, and export of diagram data. The module is built on a base parser abstraction, making it easy to extend with new diagram formats. It also includes a suite of utility functions for working with the diagram database, such as exporting to JSON/CSV, finding orphaned elements, and detecting circular dependencies.
+## 🔧 Key Components
 
+```
+📁 specs/ → Project specification
+📁 plan/ → High-level plan
+📁 task/ → Task breakdown
+📁 config/ → YAML config for models, prompts, logging
+📁 data/ → Prompts, embeddings, and other dynamic content
+📁 examples/ → Minimal scripts to test key features
+📁 notebooks/ → Quick experiments and prototyping
+📁 src/ → The core engine — all logic lives here (./src/README.md)
+📁 tests/ → Unit, integration, and end-to-end tests
+📁 docs/ → Architecture, schema, guides
+📁 deployments/ → Docker, Kubernetes, Helm, monitoring
+```
 ---
 
 ## ⚡ Best Practices
@@ -100,7 +167,12 @@ The `parsers` module is a powerful utility for parsing various diagram-as-code f
 - Handle errors with custom exceptions
 - Use notebooks for rapid testing and iteration
 - Monitor API usage and set rate limits
-- Keep code and docs in sync
+- Keep code and docs in sync
+- Normalize all parsed content into JSON schema.
+- Chunk documents with context preservation.
+- Monitor agents via LangSmith.
+- Store only raw files in MinIO, not Git.
+- Run CI/CD pipelines for linting, testing, and type checks.
 
 ---
 
@@ -125,12 +197,20 @@ The `parsers` module is a powerful utility for parsing various diagram-as-code f
 - Track with version control
 - Keep datasets fresh
 - Keep documentation updated
-- Monitor API usage and limits
+- Monitor API usage and limits
+- Keep LLM API usage monitored.
+- Keep specs/plans/tasks updated in version control.
+- Test parsers with representative docs.
+- Use notebooks for quick experiments.
+- Run type checks (`mypy`) and linting (`ruff`) before PRs.
 
 ---
 
 ## 📁 Core Files
 
+- `specs/spec.yaml` – System specification
+- `plan/plan.yaml` – Roadmap & phases
+- `task/task.yaml` – Task breakdown with dependencies
 - `requirements.txt` – Core package dependencies for the project.
 - `requirements-dev.txt` - Dependencies for development and testing.
 - `requirements-docs.txt` - Dependencies for generating documentation.
@@ -207,8 +287,7 @@ Note: The mypy exclusion for `src/llm/router.py` avoids a duplicate module confl
 
 We are excited to work with the community to build and enhance this project.
 
-***BEFORE you start work on a feature/fix***, please read & follow our [Contributor's Guide](./CONTRIBUTING.md) to
-help avoid any wasted or duplicate effort.
+***BEFORE you start work on a feature/fix***, please read & follow our [Contributor's Guide](./CONTRIBUTING.md) to help avoid any wasted or duplicate effort.
 
 ### Developer setup: pre-commit hooks (optional but recommended)
 
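
The README changes in this commit describe raw files in MinIO that are "bi-directionally linked" with structured records in Postgres. The following is a minimal in-memory sketch of what that linkage could look like, assuming each record stores its object key and each object carries its record ID as metadata. `LinkedStore`, `RawObject`, `DocumentRecord`, and the key layout are all hypothetical; plain dictionaries stand in for Postgres and MinIO, and nothing here reflects the repository's actual schema.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class RawObject:
    """Stand-in for an object in MinIO (hypothetical shape)."""
    object_key: str
    data: bytes
    metadata: dict = field(default_factory=dict)


@dataclass
class DocumentRecord:
    """Stand-in for a structured JSON row in Postgres (hypothetical shape)."""
    record_id: str
    content: dict
    object_key: str


class LinkedStore:
    """Keeps both directions of the link: record -> object and object -> record."""

    def __init__(self) -> None:
        self.objects: dict[str, RawObject] = {}
        self.records: dict[str, DocumentRecord] = {}

    def ingest(self, filename: str, data: bytes, content: dict) -> DocumentRecord:
        record_id = str(uuid.uuid4())
        # Assumed key layout: content-hash prefix plus the original file name.
        object_key = f"{hashlib.sha256(data).hexdigest()[:16]}/{filename}"
        self.objects[object_key] = RawObject(object_key, data, {"record_id": record_id})
        record = DocumentRecord(record_id, content, object_key)
        self.records[record_id] = record
        return record

    def raw_for(self, record_id: str) -> RawObject:
        """Follow the link from a structured record to its raw file."""
        return self.objects[self.records[record_id].object_key]

    def record_for(self, object_key: str) -> DocumentRecord:
        """Follow the link from a raw file back to its structured record."""
        return self.records[self.objects[object_key].metadata["record_id"]]


store = LinkedStore()
record = store.ingest("sample.pdf", b"%PDF-1.7", {"title": "Sample"})
```

Storing the link on both sides means either store can be queried first, at the cost of keeping the two references consistent on update or delete.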

plan/README.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Project Plan
+
+This directory contains the high-level project plan in YAML format.
+
+## Files
+
+- `plan.yaml` - Project phases, deliverables, and acceptance criteria
+
+## Plan Structure
+
+The plan defines:
+
+- Development phases with clear boundaries
+- Key deliverables for each phase
+- Acceptance criteria for phase completion
+- Dependencies between phases
+
+## Usage
+
+The plan serves as:
+
+- High-level roadmap for development
+- Milestone tracking and planning
+- Resource allocation guidance
+- Progress measurement framework
+
+The plan should be updated as the project progresses and requirements evolve.

plan/plan.yaml

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+plan:
+  phases:
+    - name: "Phase 1 - Environment & Infrastructure"
+      deliverables:
+        - Project repo with CI/CD pipeline.
+        - Dockerized base environment.
+        - Postgres with PGVector extension configured.
+        - MinIO bucket with SDK integration.
+      acceptance_criteria:
+        - Developer can run `docker-compose up` and access Postgres + MinIO.
+        - PGVector table created and test embedding stored/retrieved.
+        - MinIO file upload and retrieval tested with sample PDF.
+
+    - name: "Phase 2 - Document Ingestion & Normalization"
+      deliverables:
+        - PDF and Word ingestion via Unstructured.io.
+        - PlantUML, Draw.io, Mermaid parsing utilities.
+        - Unified JSON schema for structured content + metadata.
+      acceptance_criteria:
+        - Upload of sample PDF produces valid JSON in DB.
+        - PlantUML text parses into structured JSON.
+        - JSON schema validated against 5+ document samples.
+
+    - name: "Phase 3 - Storage & Embedding"
+      deliverables:
+        - JSON record storage in Postgres (JSONB columns).
+        - Raw object upload to MinIO with object ID in Postgres.
+        - Embedding pipeline with advanced chunking strategy.
+      acceptance_criteria:
+        - Each chunk has vector embedding stored in PGVector.
+        - JSON ↔ MinIO bi-directional reference resolvable by ID.
+        - Document-level embeddings preserved for long-context.
+
+    - name: "Phase 4 - Agentic RAG & LLM Evaluation"
+      deliverables:
+        - LangChain DeepAgent pipeline for retrieval + answer generation.
+        - LLM-as-judge consistency/traceability checks.
+        - Error & warning logging database.
+      acceptance_criteria:
+        - User query retrieves top-3 relevant chunks correctly.
+        - LLM-as-judge detects at least 80% of schema violations in tests.
+        - Errors automatically logged with timestamp + doc reference.
+
+    - name: "Phase 5 - Frontend Development"
+      deliverables:
+        - React-based UI container for labeling & review.
+        - Label storage API and editing workflows.
+        - Traceability view linking JSON records to raw files.
+      acceptance_criteria:
+        - User can log in, review parsed document, add/edit labels.
+        - UI displays linked raw document (from MinIO).
+        - Edits propagate back to Postgres.
+
+    - name: "Phase 6 - Deployment & Scaling"
+      deliverables:
+        - Kubernetes manifests and Helm chart.
+        - Multi-LLM provider integration.
+        - Monitoring + observability (LangSmith, Prometheus/Grafana).
+      acceptance_criteria:
+        - Cluster runs frontend + backend containers in separate pods.
+        - Queries work with OpenAI + local LLaMA2 backend.
+        - Metrics & logs visible in monitoring dashboard.
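
Every phase in `plan/plan.yaml` carries a `name`, a `deliverables` list, and an `acceptance_criteria` list. The following is a minimal sketch of a structural check over that shape, assuming the file has already been loaded into a Python dictionary by a YAML parser of your choice. `validate_plan` and `EXAMPLE_PLAN` are hypothetical names, and the example data is an abbreviated copy of Phase 1 rather than the full plan.

```python
REQUIRED_KEYS = ("name", "deliverables", "acceptance_criteria")
LIST_KEYS = ("deliverables", "acceptance_criteria")


def validate_plan(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape is fine."""
    phases = (doc.get("plan") or {}).get("phases")
    if not isinstance(phases, list) or not phases:
        return ["plan.phases must be a non-empty list"]

    problems = []
    for index, phase in enumerate(phases, start=1):
        if not isinstance(phase, dict):
            problems.append(f"phase {index}: expected a mapping")
            continue
        label = phase.get("name", f"phase {index}")
        # Every phase needs all three keys.
        for key in REQUIRED_KEYS:
            if key not in phase:
                problems.append(f"{label}: missing '{key}'")
        # Present list-valued keys must hold at least one entry.
        for key in LIST_KEYS:
            if key in phase and not (isinstance(phase[key], list) and phase[key]):
                problems.append(f"{label}: '{key}' must be a non-empty list")
    return problems


# Abbreviated copy of Phase 1, written as the dictionary a YAML loader would produce.
EXAMPLE_PLAN = {
    "plan": {
        "phases": [
            {
                "name": "Phase 1 - Environment & Infrastructure",
                "deliverables": ["Dockerized base environment."],
                "acceptance_criteria": [
                    "MinIO file upload and retrieval tested with sample PDF."
                ],
            }
        ]
    }
}

problems = validate_plan(EXAMPLE_PLAN)
```

Returning a list of problems instead of raising on the first one lets a CI step report every malformed phase in a single run.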

specs/README.md

Lines changed: 19 additions & 21 deletions
@@ -1,29 +1,27 @@
-# SpecKit Integration
+# Project Specifications
 
-This repo uses [speckit](https://github.com/github/spec-kit) for spec-based development.
+This directory contains the project specification files in YAML format.
 
-## Getting Started
+## Files
 
-- Install dev dependencies:
-
-```bash
-pip install -r requirements-dev.txt
-
-```
+- `spec.yaml` - Main project specification including goals, stakeholders, and success metrics
 
-- Example spec file: `specs/example_spec.yaml`
-- Run a spec:
-
-```bash
-speckit run specs/example_spec.yaml
-```
+## Specification Format
 
-
-## Workflow
+The specification follows a structured YAML format covering:
 
-- Write specs in YAML under `specs/`
-- Use `speckit run <specfile>` to execute and validate workflows
-- Integrate with CI for spec-driven checks
+- Project goals and non-goals
+- Stakeholder identification
+- Success metrics and acceptance criteria
+- Technical requirements and constraints
 
+## Usage
 
-See [speckit docs](https://github.com/github/spec-kit) for advanced usage.
+These specification files serve as the authoritative source for:
+
+- Project requirements and scope
+- Stakeholder expectations
+- Success criteria and metrics
+- Technical constraints and decisions
+
+The specifications are referenced throughout the project documentation and should be kept up-to-date as the project evolves.
