Distill time into meaning.
Chrona is Primitive #1 of the Semantic Data Fabric: a CPU-optimized temporal semantic compressor that reduces structured data by 99.98%+ while preserving queryable fidelity. Built for LLM context optimization, RAG pipelines, and AI-native observability.
We are building AI systems on infrastructure from the 1970s.
- LLMs pay per token, even for redundant information.
- RAG retrieves noise, not concepts.
- Logs are unreadable at scale.
- Traditional compression saves bits, but not meaning.
Chrona changes the abstraction layer.
Instead of compressing bytes, we compress semantic events. Instead of storing time-series data, we store temporal meaning.
Dataset: HDFS Production Logs (1.8 GB, ~11M lines)
Hardware: Consumer CPU (No GPU)
Status: Validated on 100k lines | Full 11M run in progress
| Metric | Input | Output | Reduction |
|---|---|---|---|
| Log Lines | 100,000 | 14 | ~7,143x |
| Semantic Concepts | Unknown | 14 Unique Events | 99.98% |
| Processing Speed | - | 63 lines/sec | CPU Only |
| LLM Token Cost | ~$2.00 | ~$0.0003 | ~6600x Savings |
From 100k lines of logs → 14 semantic entries.
Errors stay separated from INFO logs. Temporal context is preserved. Meaning survives.
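The token-cost row in the table can be sanity-checked with back-of-envelope arithmetic. The tokens-per-line count and per-1k-token price below are illustrative assumptions chosen to land near the table's figures, not measured values from the benchmark:

```python
# Rough LLM cost comparison: raw logs vs. Chrona's compressed output.
# tokens_per_line and price_per_1k_tokens are assumed, not measured.
lines_in, lines_out = 100_000, 14
tokens_per_line = 12          # assumed average tokens per log line
price_per_1k_tokens = 0.0017  # assumed input price in USD per 1k tokens

cost_raw = lines_in * tokens_per_line / 1000 * price_per_1k_tokens
cost_compressed = lines_out * tokens_per_line / 1000 * price_per_1k_tokens

print(f"raw: ${cost_raw:.2f}, compressed: ${cost_compressed:.4f}, "
      f"savings: {cost_raw / cost_compressed:.0f}x")
```

Because both costs scale with the same per-line assumptions, the savings ratio reduces to the line-count ratio, which is why the dollar savings track the compression factor so closely.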
Chrona isn't heuristic summarization. It's information theory applied to semantic space.
We minimize information about raw input X while maximizing information about relevance Y:
ℒ = I(X; T) - β · I(T; Y)
Where T is the compressed semantic representation. This ensures we discard redundancy while preserving task utility.
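Spelling out the trade-off that the weight β controls, as a restatement of the objective above (the limiting cases are standard Information Bottleneck behavior, not claims specific to this repo):

```latex
\min_{p(t \mid x)} \; \mathcal{L} \;=\; I(X;T) \;-\; \beta \, I(T;Y)
% \beta \to 0:      the compression term dominates; T discards nearly everything
% \beta \to \infty: the relevance term dominates; T keeps all information about Y
```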
Instead of bit-error rate, we minimize semantic distance:
D_semantic(x, x̂) = 1 - cos(E(x), E(x̂))
This allows Chrona to merge "INFO: Block blk_123 received" and "INFO: Block blk_456 received" into a single semantic event, something ZIP cannot do.
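As a concrete illustration, the distance above is one minus the cosine similarity of embedding vectors. A minimal sketch with hand-made 3-dimensional toy embeddings (in Chrona the vectors would come from the sentence-transformer; the vector values below are invented for the example):

```python
import math

def semantic_distance(u, v):
    """D_semantic = 1 - cos(E(x), E(x_hat)), given embedding vectors u, v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-d embeddings: the two "block received" events point in nearly
# the same direction; the error event points elsewhere.
e_blk_123 = [0.9, 0.1, 0.05]
e_blk_456 = [0.88, 0.12, 0.06]
e_error   = [0.1, 0.9, 0.3]

print(semantic_distance(e_blk_123, e_blk_456))  # small -> merge
print(semantic_distance(e_blk_123, e_error))    # large -> keep separate
```

With real embeddings the same pattern holds: near-duplicate templates collapse into one event, while ERROR lines stay far from INFO lines in embedding space.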
Chrona is Primitive #1 of a larger vision: a new abstraction layer where AI systems store, query, and reason over meaningβnot bytes.
| Primitive | Status | Purpose |
|---|---|---|
| #1: Temporal Semantic Compressor | Validated (Chrona) | Distill streaming data into semantic entries |
| #2: Structural Code Compressor | In Progress | Compress Git repos for AI code intelligence |
| #3: Semantic Query Engine | Planned | Natural language queries over compressed data |
| #4: Distributed Semantic Gossip | Research | Edge nodes share meaning, not bandwidth |
Install

```shell
pip install -r requirements.txt
```
Run on Your Data
```shell
# Process first 100k lines (validation mode)
python chrona_v1.py

# Process entire file (production mode)
# Edit script: MAX_LINES = None
```
Configure
Edit these variables in chrona_v1.py:
```python
FILE_PATH = Path(__file__).parent / "data" / "your_logs.jsonl"
MAX_LINES = 100_000  # Set to None for full file
THRESHOLD = 0.85     # Higher = stricter merging (0.75-0.95 typical)
```
How It Works
1. Stream Input: Line-by-line ingestion (memory efficient).
2. Exact-Match Cache: MD5 hash check skips embedding for duplicates (10x speedup).
3. Semantic Embedding: Batched CPU encoding using sentence transformers.
4. Temporal Merging: Cosine similarity comparison merges events across time.
5. Output: Compressed JSON with summaries, counts, and temporal metadata.
[Raw Logs] → [Hash Cache] → [Embedder] → [Semantic Memory] → [Compressed JSON]
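The steps above can be sketched end-to-end. This is a toy reconstruction, not the real pipeline: `embed` here is a bag-of-words stand-in for the batched sentence-transformer encoder, and the function names are invented for illustration; only the control flow (hash cache → embed → threshold merge → JSON) mirrors the description:

```python
import hashlib, json, math, re

THRESHOLD = 0.85  # same knob as THRESHOLD in chrona_v1.py

def embed(text):
    """Stand-in embedder: bag-of-words unit vector (the real pipeline
    uses a sentence-transformer; this only illustrates the data flow)."""
    counts = {}
    for tok in re.findall(r"[a-z]+", text.lower()):
        counts[tok] = counts.get(tok, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cos(u, v):
    return sum(u[t] * v.get(t, 0.0) for t in u)

def compress(lines):
    cache, memory = {}, []  # md5 -> event index; events with counts
    for line in lines:
        h = hashlib.md5(line.encode()).hexdigest()
        if h in cache:                    # exact duplicate: bump the count
            memory[cache[h]]["count"] += 1
            continue
        e = embed(line)
        for i, ev in enumerate(memory):  # temporal merge by cosine similarity
            if cos(e, ev["embedding"]) >= THRESHOLD:
                ev["count"] += 1
                cache[h] = i
                break
        else:                            # no match: new semantic event
            cache[h] = len(memory)
            memory.append({"summary": line, "embedding": e, "count": 1})
    return [{"summary": ev["summary"], "count": ev["count"]} for ev in memory]

logs = [
    "INFO: Block blk_123 received",
    "INFO: Block blk_456 received",
    "INFO: Block blk_123 received",
    "ERROR: Replica lost on node 7",
]
print(json.dumps(compress(logs), indent=2))
```

Four lines collapse to two semantic events: the near-duplicate INFO lines merge (one via the hash cache, one via cosine similarity), while the ERROR line stays separate.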
Use Cases

| Use Case | Benefit |
|---|---|
| LLM Context Optimization | Reduce token costs by 100-1000x without losing signal |
| RAG Pipeline Preprocessing | Retrieve semantic summaries instead of noisy chunks |
| Observability at Scale | Human-readable summaries of massive log volumes |
| AI Agent Memory | Effectively unbounded context via compressed semantic history |
Roadmap

| Quarter | Milestone | Success Metric |
|---|---|---|
| Q1 2026 | Primitive #1 validated on 1.8 GB HDFS logs | 3000x+ compression, <5% fidelity loss |
| Q2 2026 | Primitive #2 MVP: Git repo compression | Answer accuracy >95% vs. raw code |
| Q3 2026 | Primitive #3: Semantic Query Engine | Natural language → compressed data → answer |
| Q4 2026 | Primitive #4: Distributed Gossip Protocol | Multi-node semantic consensus |
Contributing: Build the Fabric
This isn't just a codebase; it's a category definition. Contributions welcome at any layer.
Good First Contributions:
Benchmark Chrona on new datasets (Kubernetes logs, CloudTrail, chat histories)
Prototype Primitive #2 (code compression with AST parsing)
Write a position paper on semantic vs. byte-centric infrastructure
To Contribute:
1. Fork the repo
2. Create a feature branch (`git checkout -b feat/code-compressor`)
3. Commit your changes (`git commit -m 'Add AST-aware parsing'`)
4. Push to your branch (`git push origin feat/code-compressor`)
5. Open a Pull Request
License
MIT License: use freely in commercial and non-commercial projects.
Acknowledgments
Information Bottleneck Principle: Tishby et al. (1999), the theoretical foundation
LogHub: HDFS log dataset for validation
Sentence Transformers: Fast, high-quality CPU embeddings