Skip to content

VectifyAI/OpenKB

Repository files navigation

OpenKB (by PageIndex)

VectifyAI%2FOpenKB | Trendshift

OpenKB: Open LLM Knowledge Base

Scale to long documents  •  Reasoning-based retrieval  •  Native multi-modality  •  No Vector DB

📢 Recent Updates

  • Google Open Knowledge Format (OKF): Wiki pages follow the Google OKF specification for knowledge sharing.
  • Entity Pages: People, orgs, places, and products as dedicated wiki pages, auto-extracted and kept in sync.

📑 What is OpenKB

OpenKB (Open Knowledge Base) is an open-source system (in CLI) that compiles raw documents into a structured, interlinked wiki-style knowledge base using LLMs, powered by PageIndex's vectorless, reasoning-based retrieval for long documents.

The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.

Why not traditional RAG?

Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist, contradictions are flagged, and synthesis reflects everything consumed.

OpenKB has two layers: a wiki foundation that compiles and maintains your knowledge, and generators (query / chat / Skill Factory) that turn it into useful output. See Usage for the full command list.

Features

  • Broad format support: PDF, Word, Markdown, PowerPoint, HTML, Excel, CSV, text, URLs, and more.
  • Scales to long documents: Long and complex documents are handled via PageIndex tree indexing, enabling accurate, vectorless, context-aware retrieval.
  • Native multi-modality: Retrieves and understands figures, tables, and images, not just text.
  • Compiled wiki: The LLM compiles your documents into summaries, concept pages, entity pages, and cross-links, all kept in sync.
  • Query & chat: One-off questions or multi-turn conversations over your wiki, with persisted sessions to resume.
  • Skill Factory: Distills redistributable agent skills from your wiki.
  • OKF-ready: Wiki pages follow the Google OKF specification for knowledge sharing.
  • Obsidian-compatible: The wiki is plain .md files with cross-links. Opens in Obsidian for graph view.

🚀 Getting Started

Install

pip install openkb
Other install options:
  • Latest from GitHub:

    pip install git+https://github.com/VectifyAI/OpenKB.git
  • Install from source (editable, for development):

    git clone https://github.com/VectifyAI/OpenKB.git
    cd OpenKB
    pip install -e .

Quick Start

# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb

# 2. Initialize the knowledge base
openkb init

# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/                            # Add a whole directory
openkb add https://arxiv.org/pdf/2509.11420     # Or fetch from a URL

# 4. Ask a question
openkb query "What are the main findings?"

# 5. Or chat interactively
openkb chat

# (Optional) Turn the wiki into other outputs
openkb skill new my-expert "Reason like an expert on <your-topic>"   # a portable agent skill
openkb visualize                                                     # an interactive knowledge graph
openkb deck new my-deck "An intro deck on <your-topic>"              # slides — a single-file HTML deck

Set up your LLM

OpenKB supports multiple LLM providers (OpenAI, Claude, Gemini, etc.) via LiteLLM (pinned to a safe version).

Set your model during openkb init or in .openkb/config.yaml using the provider/model LiteLLM format (e.g. anthropic/claude-sonnet-4-6). OpenAI models can omit the prefix (e.g. gpt-5.4).

Create a .env file with your LLM API key:

LLM_API_KEY=your_llm_api_key

🧩 How OpenKB Works

Architecture

raw/                              You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                         Wiki Compilation (using LLM)
 │                                     │
 ▼                                     ▼
wiki/                                  │            ← the foundation
 ├── index.md            Knowledge base overview
 ├── log.md              Operations timeline
 ├── AGENTS.md           Wiki schema (LLM instructions)
 ├── sources/            Full-text conversions
 ├── summaries/          Per-document summaries
 ├── concepts/           Cross-document synthesis
 ├── entities/           Specific named things (people, orgs, places, products)
 ├── explorations/       Saved query results
 └── reports/            Lint reports
                                       │
                ┌──────────────────────┼──────────────────────┐
                ▼                      ▼                      ▼
            query / chat         Skill Factory          (future)
          (LLM answers from    (redistributable       ppt / podcast /
            the wiki)           agent skills)           report / …

Short vs Long Document Handling

Short documents Long documents (PDF ≥ 20 pages)
Convert markitdown → Markdown PageIndex → tree index + summaries
Images Extracted inline (pymupdf) Extracted by PageIndex
LLM reads Full text Document trees
Result summary + concepts summary + concepts

Short documents are read in full by the LLM. Long PDFs are processed by PageIndex into a hierarchical tree index. The LLM reads the tree instead of the full text, enabling accurate and scalable retrieval for long documents.

Knowledge Compilation

When you add a document, the LLM:

  1. Generates a summary page
  2. Reads existing concept and entity pages
  3. Creates or updates concepts with cross-document synthesis
  4. Creates or updates entity pages (people, orgs, places, products)
  5. Updates the index and log

A single source might touch 10--15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.

⚙️ Usage

OpenKB commands fall into two layers: the wiki foundation (compile + manage your knowledge) and generators (turn that wiki into useful output). Each links to a concrete walkthrough — a real artifact OpenKB generated from one sample paper (browse them all in examples/).

Layer 1: 🧱 Wiki Foundation — compile and maintain

Command Description
openkb init Initialize a new knowledge base (interactive)
openkb add <file_or_dir_or_URL> Add files, directories, or URLs and compile to wiki (URL content type is auto-detected)
openkb list List indexed documents and concepts
openkb status Show knowledge base stats
openkb watch Watch raw/ and auto-compile new files
openkb lint Run structural and knowledge health checks
More wiki commands:
Command Description
openkb remove <doc> Remove a document and clean up its wiki pages, images, registry, and PageIndex state (--dry-run to preview, --keep-raw / --keep-empty to retain artifacts)
openkb recompile [<doc>] [--all] Re-run the compile pipeline on already-indexed docs without re-indexing. Regenerates summaries and rewrites concept pages; manual edits are overwritten (--dry-run to preview, --refresh-schema to also update wiki/AGENTS.md)
openkb feedback ["msg"] File feedback by opening a prefilled GitHub issue (--type bug/feature/question to tag it)

Example: the everyday loop walked through end to end — examples/commands/.

Layer 2: 💡 Generators — turn the wiki into output

A "generator" reads from the compiled wiki and produces something usable: an answer, a conversation, a skill folder. The wiki is the substrate; generators are the surfaces.

Command Output Example
openkb query "question" A grounded answer with citations (--save to persist to wiki/explorations/) query & save
openkb chat Interactive multi-turn session over the wiki (--resume, --list, --delete to manage sessions) chat
openkb visualize A self-contained interactive knowledge graph at output/visualize/graph.html — 3D, mind-map, and radial views visualize
openkb skill new <skill-name> "<intent>" Distill a redistributable agent skill from your wiki (see Skill Factory below) skills
openkb deck new <name> "<intent>" Generate a single-file HTML slide deck (--skill picks a theme, --critique runs a quality pass) slides
More skill commands:
Command Output
openkb skill validate [name] Validate compiled skills (auto-runs after skill new)
openkb skill eval <name> Check the skill triggers on the right prompts
openkb skill history <name> / openkb skill rollback <name> Version history + rollback for skills

🛠 Skill Factory — drop in a book; out comes a digital expert.

The flagship generator: openkb skill new distills a portable agent skill from your wiki that Claude Code, Codex, and Gemini can install and load natively. Drop in a book's worth of papers; out comes a specialist other agents can call on. → A real generated skill, plus install / share / eval / rollback, is walked through in examples/skills/.

🔧 Configuration

Settings

openkb init writes .openkb/config.yaml:

model: gpt-5.4                   # LLM model (any LiteLLM-supported provider)
language: en                     # Wiki output language
pageindex_threshold: 20          # PDF pages threshold for PageIndex

The full settings reference — entity_types, OAuth providers (chatgpt/*, github_copilot/*), and LiteLLM tuning (timeouts for slow local runtimes like Ollama / LM Studio, drop_params, GitHub Copilot headers, install notes) — is in examples/configuration/.

PageIndex Setup

Long-document retrieval is a known challenge for LLMs. PageIndex solves this with vectorless, reasoning-based retrieval, by building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval.

PageIndex runs locally by default using the open-source version, with no external dependencies required.

Cloud Support (Optional):

For large or complex PDFs, PageIndex Cloud can be used to access additional capabilities, including:

  • OCR support for scanned PDFs (via hosted VLM models)
  • Faster structure generation
  • Scalable indexing for large documents

Set PAGEINDEX_API_KEY in your .env to enable cloud features:

PAGEINDEX_API_KEY=your_pageindex_api_key

Example: local vs. cloud indexing, and importing a cloud-indexed doc — examples/pageindex-cloud/.

AGENTS.md

The wiki/AGENTS.md file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.

The LLM reads AGENTS.md from disk at runtime, so your edits take effect immediately.

🔌 Integrations

Using with Obsidian

The wiki is a directory of Markdown files with [[wikilinks]]. Obsidian renders it natively.

  1. Open wiki/ as an Obsidian vault
  2. Browse summaries, concepts, and explorations
  3. Use graph view to see knowledge connections
  4. Use Obsidian Web Clipper to add web articles to raw/

Using with Claude Code / Codex / Gemini CLI

OpenKB ships a SKILL.md so any agent can read your compiled wiki. No extra runtime, no MCP setup, just install the skill once.

Claude Code:
/plugin marketplace add VectifyAI/OpenKB
/plugin install openkb@vectify
OpenAI Codex CLI:

(no marketplace command yet; manual symlink)

git clone https://github.com/VectifyAI/OpenKB.git ~/openkb-src
mkdir -p ~/.agents/skills
ln -s ~/openkb-src/skills/openkb ~/.agents/skills/openkb
Gemini CLI:
gemini skills install https://github.com/VectifyAI/OpenKB.git --path skills/openkb --consent

The skill is read-only. It won't run openkb add, remove, or lint --fix without you asking. See skills/openkb/SKILL.md for the full instruction set.

🧭 Learn More

Compared to Karpathy's Approach

Karpathy's workflow OpenKB
Short documents LLM reads directly markitdown → LLM reads
Long documents Context limits, context rot PageIndex tree index
Input sources Web clipper → .md PDF, Word, PPT, Excel, HTML, text, CSV, .md, URLs
Wiki compilation LLM agent LLM agent (same)
Entity extraction Manual Automatic (people, orgs, places, products)
Q&A Query over wiki Wiki + PageIndex retrieval
Output Wiki only Wiki + Skill Factory + agent CLI integration

The Stack

  • PageIndex — Vectorless, reasoning-based document indexing and retrieval
  • markitdown — Universal file-to-markdown conversion
  • OpenAI Agents SDK — Agent framework (supports non-OpenAI models via LiteLLM)
  • LiteLLM — Multi-provider LLM gateway
  • Click — CLI framework
  • watchdog — Filesystem monitoring

Roadmap

  • Extend long document handling to non-PDF formats
  • Scale to large document collections with nested folder support
  • Hierarchical concept (topic) indexing for massive knowledge bases
  • Database-backed storage engine
  • Web UI for browsing and managing wikis

Contributing

Contributions are welcome! Submit a pull request or open an issue for bugs and feature requests. For larger changes, consider opening an issue first to discuss the approach.

License

Apache 2.0. See LICENSE.

🌐 Open-Source Ecosystem

Other open-source projects from the PageIndex ecosystem:

  • PageIndex: Vectorless, reasoning-based RAG framework for long documents
  • ChatIndex: Tree indexing and retrieval for long conversational histories and memory
  • ConDB: A KV-cache native context database for tree-based retrieval at scale
  • PageIndex MCP: MCP server for PageIndex

Support Us

If you find OpenKB useful, please give us a star 🌟 — and check out PageIndex too!

TwitterLinkedInContact Us

Packages

 
 
 

Contributors