- Google Open Knowledge Format (OKF): Wiki pages follow the Google OKF specification for knowledge sharing.
- Entity Pages: People, orgs, places, and products as dedicated wiki pages, auto-extracted and kept in sync.
OpenKB (Open Knowledge Base) is an open-source system (in CLI) that compiles raw documents into a structured, interlinked wiki-style knowledge base using LLMs, powered by PageIndex's vectorless, reasoning-based retrieval for long documents.
The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist, contradictions are flagged, and synthesis reflects everything consumed.
OpenKB has two layers: a wiki foundation that compiles and maintains your knowledge, and generators (query / chat / Skill Factory) that turn it into useful output. See Usage for the full command list.
- Broad format support: PDF, Word, Markdown, PowerPoint, HTML, Excel, CSV, text, URLs, and more.
- Scales to long documents: Long and complex documents are handled via PageIndex tree indexing, enabling accurate, vectorless, context-aware retrieval.
- Native multi-modality: Retrieves and understands figures, tables, and images, not just text.
- Compiled wiki: The LLM compiles your documents into summaries, concept pages, entity pages, and cross-links, all kept in sync.
- Query & chat: One-off questions or multi-turn conversations over your wiki, with persisted sessions to resume.
- Skill Factory: Distills redistributable agent skills from your wiki.
- OKF-ready: Wiki pages follow the Google OKF specification for knowledge sharing.
- Obsidian-compatible: The wiki is plain
.mdfiles with cross-links. Opens in Obsidian for graph view.
pip install openkbOther install options:
-
Latest from GitHub:
pip install git+https://github.com/VectifyAI/OpenKB.git
-
Install from source (editable, for development):
git clone https://github.com/VectifyAI/OpenKB.git cd OpenKB pip install -e .
# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb
# 2. Initialize the knowledge base
openkb init
# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/ # Add a whole directory
openkb add https://arxiv.org/pdf/2509.11420 # Or fetch from a URL
# 4. Ask a question
openkb query "What are the main findings?"
# 5. Or chat interactively
openkb chat
# (Optional) Turn the wiki into other outputs
openkb skill new my-expert "Reason like an expert on <your-topic>" # a portable agent skill
openkb visualize # an interactive knowledge graph
openkb deck new my-deck "An intro deck on <your-topic>" # slides — a single-file HTML deckOpenKB supports multiple LLM providers (OpenAI, Claude, Gemini, etc.) via LiteLLM (pinned to a safe version).
Set your model during openkb init or in .openkb/config.yaml using the provider/model LiteLLM format (e.g. anthropic/claude-sonnet-4-6). OpenAI models can omit the prefix (e.g. gpt-5.4).
Create a .env file with your LLM API key:
LLM_API_KEY=your_llm_api_keyraw/ You drop files here
│
├─ Short docs ──→ markitdown ──→ LLM reads full text
│ │
├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
│ │
│ ▼
│ Wiki Compilation (using LLM)
│ │
▼ ▼
wiki/ │ ← the foundation
├── index.md Knowledge base overview
├── log.md Operations timeline
├── AGENTS.md Wiki schema (LLM instructions)
├── sources/ Full-text conversions
├── summaries/ Per-document summaries
├── concepts/ Cross-document synthesis
├── entities/ Specific named things (people, orgs, places, products)
├── explorations/ Saved query results
└── reports/ Lint reports
│
┌──────────────────────┼──────────────────────┐
▼ ▼ ▼
query / chat Skill Factory (future)
(LLM answers from (redistributable ppt / podcast /
the wiki) agent skills) report / …
| Short documents | Long documents (PDF ≥ 20 pages) | |
|---|---|---|
| Convert | markitdown → Markdown | PageIndex → tree index + summaries |
| Images | Extracted inline (pymupdf) | Extracted by PageIndex |
| LLM reads | Full text | Document trees |
| Result | summary + concepts | summary + concepts |
Short documents are read in full by the LLM. Long PDFs are processed by PageIndex into a hierarchical tree index. The LLM reads the tree instead of the full text, enabling accurate and scalable retrieval for long documents.
When you add a document, the LLM:
- Generates a summary page
- Reads existing concept and entity pages
- Creates or updates concepts with cross-document synthesis
- Creates or updates entity pages (people, orgs, places, products)
- Updates the index and log
A single source might touch 10--15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
OpenKB commands fall into two layers: the wiki foundation (compile + manage your knowledge) and generators (turn that wiki into useful output). Each links to a concrete walkthrough — a real artifact OpenKB generated from one sample paper (browse them all in examples/).
| Command | Description |
|---|---|
openkb init |
Initialize a new knowledge base (interactive) |
openkb add <file_or_dir_or_URL> |
Add files, directories, or URLs and compile to wiki (URL content type is auto-detected) |
openkb list |
List indexed documents and concepts |
openkb status |
Show knowledge base stats |
openkb watch |
Watch raw/ and auto-compile new files |
openkb lint |
Run structural and knowledge health checks |
More wiki commands:
| Command | Description |
|---|---|
openkb remove <doc> |
Remove a document and clean up its wiki pages, images, registry, and PageIndex state (--dry-run to preview, --keep-raw / --keep-empty to retain artifacts) |
openkb recompile [<doc>] [--all] |
Re-run the compile pipeline on already-indexed docs without re-indexing. Regenerates summaries and rewrites concept pages; manual edits are overwritten (--dry-run to preview, --refresh-schema to also update wiki/AGENTS.md) |
openkb feedback ["msg"] |
File feedback by opening a prefilled GitHub issue (--type bug/feature/question to tag it) |
→ Example: the everyday loop walked through end to end — examples/commands/.
A "generator" reads from the compiled wiki and produces something usable: an answer, a conversation, a skill folder. The wiki is the substrate; generators are the surfaces.
| Command | Output | Example |
|---|---|---|
openkb query "question" |
A grounded answer with citations (--save to persist to wiki/explorations/) |
query & save |
openkb chat |
Interactive multi-turn session over the wiki (--resume, --list, --delete to manage sessions) |
chat |
openkb visualize |
A self-contained interactive knowledge graph at output/visualize/graph.html — 3D, mind-map, and radial views |
visualize |
openkb skill new <skill-name> "<intent>" |
Distill a redistributable agent skill from your wiki (see Skill Factory below) | skills |
openkb deck new <name> "<intent>" |
Generate a single-file HTML slide deck (--skill picks a theme, --critique runs a quality pass) |
slides |
More skill commands:
| Command | Output |
|---|---|
openkb skill validate [name] |
Validate compiled skills (auto-runs after skill new) |
openkb skill eval <name> |
Check the skill triggers on the right prompts |
openkb skill history <name> / openkb skill rollback <name> |
Version history + rollback for skills |
The flagship generator: openkb skill new distills a portable agent skill from your wiki that Claude Code, Codex, and Gemini can install and load natively. Drop in a book's worth of papers; out comes a specialist other agents can call on. → A real generated skill, plus install / share / eval / rollback, is walked through in examples/skills/.
openkb init writes .openkb/config.yaml:
model: gpt-5.4 # LLM model (any LiteLLM-supported provider)
language: en # Wiki output language
pageindex_threshold: 20 # PDF pages threshold for PageIndexThe full settings reference — entity_types, OAuth providers (chatgpt/*, github_copilot/*), and LiteLLM tuning (timeouts for slow local runtimes like Ollama / LM Studio, drop_params, GitHub Copilot headers, install notes) — is in examples/configuration/.
Long-document retrieval is a known challenge for LLMs. PageIndex solves this with vectorless, reasoning-based retrieval, by building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval.
PageIndex runs locally by default using the open-source version, with no external dependencies required.
Cloud Support (Optional):
For large or complex PDFs, PageIndex Cloud can be used to access additional capabilities, including:
- OCR support for scanned PDFs (via hosted VLM models)
- Faster structure generation
- Scalable indexing for large documents
Set PAGEINDEX_API_KEY in your .env to enable cloud features:
PAGEINDEX_API_KEY=your_pageindex_api_key
→ Example: local vs. cloud indexing, and importing a cloud-indexed doc — examples/pageindex-cloud/.
The wiki/AGENTS.md file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.
The LLM reads AGENTS.md from disk at runtime, so your edits take effect immediately.
The wiki is a directory of Markdown files with [[wikilinks]]. Obsidian renders it natively.
- Open
wiki/as an Obsidian vault - Browse summaries, concepts, and explorations
- Use graph view to see knowledge connections
- Use Obsidian Web Clipper to add web articles to
raw/
OpenKB ships a SKILL.md so any agent can read your compiled wiki. No extra runtime, no MCP setup, just install the skill once.
Claude Code:
/plugin marketplace add VectifyAI/OpenKB
/plugin install openkb@vectify
OpenAI Codex CLI:
(no marketplace command yet; manual symlink)
git clone https://github.com/VectifyAI/OpenKB.git ~/openkb-src
mkdir -p ~/.agents/skills
ln -s ~/openkb-src/skills/openkb ~/.agents/skills/openkbGemini CLI:
gemini skills install https://github.com/VectifyAI/OpenKB.git --path skills/openkb --consentThe skill is read-only. It won't run openkb add, remove, or lint --fix without you asking. See skills/openkb/SKILL.md for the full instruction set.
| Karpathy's workflow | OpenKB | |
|---|---|---|
| Short documents | LLM reads directly | markitdown → LLM reads |
| Long documents | Context limits, context rot | PageIndex tree index |
| Input sources | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md, URLs |
| Wiki compilation | LLM agent | LLM agent (same) |
| Entity extraction | Manual | Automatic (people, orgs, places, products) |
| Q&A | Query over wiki | Wiki + PageIndex retrieval |
| Output | Wiki only | Wiki + Skill Factory + agent CLI integration |
- PageIndex — Vectorless, reasoning-based document indexing and retrieval
- markitdown — Universal file-to-markdown conversion
- OpenAI Agents SDK — Agent framework (supports non-OpenAI models via LiteLLM)
- LiteLLM — Multi-provider LLM gateway
- Click — CLI framework
- watchdog — Filesystem monitoring
- Extend long document handling to non-PDF formats
- Scale to large document collections with nested folder support
- Hierarchical concept (topic) indexing for massive knowledge bases
- Database-backed storage engine
- Web UI for browsing and managing wikis
Contributions are welcome! Submit a pull request or open an issue for bugs and feature requests. For larger changes, consider opening an issue first to discuss the approach.
Apache 2.0. See LICENSE.
Other open-source projects from the PageIndex ecosystem:
- PageIndex: Vectorless, reasoning-based RAG framework for long documents
- ChatIndex: Tree indexing and retrieval for long conversational histories and memory
- ConDB: A KV-cache native context database for tree-based retrieval at scale
- PageIndex MCP: MCP server for PageIndex
If you find OpenKB useful, please give us a star 🌟 — and check out PageIndex too!