+ Pillar 3 · Spec 2a (the Knowledge Graph) · 2026-06-16 · Status: Approved (design) — pending plan
+ Surface: the deep-dive pipeline + a new pure concept module + the local graph store. No user-facing UI in 2a.
+
+
+ Spec 2 decomposition:
+
2a — Concept substrate ← this (persist atoms, index concepts, link repos by shared concepts)
+
2b — Concept connections in the UI (corkboard / ego-graph + mastery annotation)
+
2c — Anchor-to-library explanation (explain a new repo against repos you know)
+
+
+
+
The gap
+
RepoLens already has a persistent graph (the nodes/edges stores) shown as the Corkboard and the per-repo Connections ego-graph — but repos are linked only by relationships (ALTERNATIVE_TO, SYNERGIZES_WITH, COMPARED_TO, COMBINES), never by shared concepts. The deep dive produces rich per-repo atoms ({id,name,kind,purpose,files}), uses them for the blueprint, then discards them. So nothing answers "which repos touch event-sourcing?" That substrate is the missing foundation for "plug pieces together, think in new ways."
+
+
Goals
+
+
Persist each repo's deep-dive atoms locally (already produced — no new call for the atoms).
+
Link repos by shared concepts via a pluggable matcher: embeddings where the provider supports them, lexical fallback otherwise.
+
Expose a clean link/index API (deriveConceptLinks, conceptIndex) for 2b and 2c.
+
+
+
Non-goals
+
+
No user-facing UI in 2a (that's 2b) — data + API, validated by tests.
+
No hosted backend (embeddings via BYO provider; vectors in IDB; cosine local JS).
+
No changes to existing relationship edges / Corkboard / Connections rendering.
+
+
+
+ Two departures to call out.
+
+
2a touches background.js and adds a new AI call (the embeddings request) — a deliberate departure from the Mastery Loop's "no background/AI changes," the direct consequence of choosing embeddings. Still no hosted backend: the call goes to the BYO provider, vectors cache in IDB, cosine is local.
+
2a ships nothing the user sees — the visible payoff (concept links on the Corkboard) is 2b. Output here = persisted data + the link API, validated by unit tests. Build order within 2a: lexical substrate first (the fallback), then embeddings layered on.
+
+
+
+
Components
+
+
+ ① Persist atoms — new concepts IDB store (v6→v7)
+
On deep-dive completion, persist atoms (best-effort — never fails the scan), keyed by raw repoId (mirrors decisions/mastery):
cosineSimilarity(a, b) → number in [-1, 1] (0 for zero / length-mismatched vectors).
+
Matchers behind one interface: lexical (concept→repos index over normalized keys + light token-overlap merge via store/search.js) and embedding (links repos with a cross-repo atom pair at cosine ≥ threshold, tunable ~0.82).
+
deriveConceptLinks(records, { matcher }) → [{ a, b, shared:[labels], score }]. Per-repo hybrid: embeddings only when both repos have vectors, else lexical for that pair. shared = shared normalized keys (lexical) or matched atom names (embedding).
+
conceptIndex(records) → { conceptLabel: repoId[] } over the lexical concepts (named concept → repos; what 2b/2c read). Lexical-only — embeddings link repo pairs without discrete labels, so this takes no matcher.
+
+
+
+
+ ③ Embeddings path — provider-gated
+
+
Capability:providers.js gains per-provider embeddings metadata (OpenAI → text-embedding-3-small, Google → text-embedding-004, most OpenAI-compatible → their /embeddings; none for Anthropic) + providerSupportsEmbeddings(id).
+
Call: new callEmbeddings(texts) in background.js (OpenAI-compatible /embeddings, BYO-key, same hard timeout as callAI) → number[][]. Errors degrade to null vectors (→ lexical), never breaking the deep dive.
+
When: on deep-dive completion, if the provider supports embeddings, embed each atom's name + purpose and cache vectors; else vectors: null.
deriveConceptLinks links repos by shared concepts with labels; hybrid uses embeddings only when both repos have vectors, else lexical.
+
concepts.js fully unit-tested; persistence via fake-indexeddb; callEmbeddings mocked incl. error→null.
+
No hosted backend; embeddings is the only new provider call; existing scan/lens contracts unchanged.
+
All existing tests pass + new; eslint . 0 errors; HTML gate passes.
+
+
+
+ Resolved decisions: slice = 2a substrate; matcher = hybrid (embeddings + lexical, per-repo); build lexical-first then embeddings; atoms → new concepts IDB store (v7), CRUD-only; embeddings via an OpenAI-compatible /embeddings call in background.js, provider-gated, vectors + cosine local; no UI in 2a.
+
+
+
+
+
+
+
diff --git a/docs/superpowers/specs/2026-06-16-concept-substrate-design.md b/docs/superpowers/specs/2026-06-16-concept-substrate-design.md
new file mode 100644
index 0000000..ff6fbbf
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-16-concept-substrate-design.md
@@ -0,0 +1,111 @@
+# Concept Substrate (Pillar 3, Spec 2a)
+
+- **Date:** 2026-06-16
+- **Status:** Approved (design) — pending implementation plan
+- **Surface:** the deep-dive pipeline (persist + embed atoms) + a new pure concept module + the local graph store. **No user-facing UI** in 2a.
+- **Part of:** **Pillar 3 — The Knowledge Game**, Spec 2 (the Knowledge Graph). Spec 1 (the Mastery Loop) shipped (PR #37).
+ - **2a — Concept substrate** ← *this* (persist atoms, index concepts, link repos by shared concepts)
+ - **2b — Concept connections in the UI** (surface shared-concept links on the corkboard / ego-graph; annotate nodes with mastery)
+ - **2c — Anchor-to-library explanation** (output tab: explain a new repo against repos you know)
+
+## Problem / the gap
+
+RepoLens already has a persistent graph (the `nodes`/`edges` IDB stores) and surfaces it as the **Corkboard** (a library-wide canvas of repos + ideas joined by relationship edges: `ALTERNATIVE_TO`, `SYNERGIZES_WITH`, `COMPARED_TO`, `COMBINES`) and the per-repo **Connections** ego-graph. But repos are linked only by those *relationships* — **never by shared concepts**.
+
+The deep dive produces rich per-repo semantic **atoms** (`{id, name, kind, purpose, files}`) via `parseAtoms` (`deepdive.js`), uses them for the blueprint, then **discards them** — they are never persisted or compared across repos. So there is no data answering "which repos touch event-sourcing?" This substrate is the missing foundation for the "plug pieces together, think in new ways" vision, and for annotating the graph with the just-shipped mastery signal.
+
+## Goals
+
+- **Persist** each repo's deep-dive atoms locally (they're already produced — no new call for the atoms themselves).
+- **Link repos by shared concepts** via a pure, pluggable matcher: **embeddings** when the provider supports them, **lexical** fallback otherwise.
+- Expose a clean **link/index API** (`deriveConceptLinks`, `conceptIndex`) that 2b (visualize) and 2c (anchor) consume.
+
+## Non-goals
+
+- **No user-facing UI** in 2a — that is 2b. 2a is data + API, validated by tests.
+- No hosted backend (embeddings go to the user's BYO provider; vectors are cached in IDB; cosine is local JS).
+- No changes to the existing relationship edges / Corkboard / Connections rendering (2b territory).
+
+## Decisions (resolved in brainstorming)
+
+- First slice = **2a (substrate)** — the genuinely missing capability and the foundation.
+- Matcher = **hybrid**: embeddings where the configured provider exposes an embeddings endpoint; lexical (normalized-tag + token-overlap) fallback otherwise. Per-repo fallback (a repo with vectors uses embeddings; without, lexical).
+- **Build order within 2a:** the lexical substrate first (the fallback path, works everywhere, no AI/background changes), then the embeddings path layered on.
+
+## Two explicit departures to call out
+
+1. **2a touches `background.js` and adds a new AI call** (the embeddings request) — a deliberate departure from the Mastery Loop's "no background/AI changes." It is the direct consequence of choosing embeddings. There is still **no hosted backend**: the `/embeddings` call goes to the user's configured provider (BYO-key), vectors are cached in IDB, and all matching/cosine is local JS.
+2. **2a ships nothing the user sees.** The visible payoff (concept links on the Corkboard) is 2b. This spec is the substrate; its "output" is persisted data + the link API, validated by unit tests.
+
+## Components
+
+### ① Persist atoms — new `concepts` IDB store
+
+On deep-dive completion (the deep-dive runner in `background.js`), persist the atoms (already produced) to a new `concepts` store (additive: `store/idb.js` DB_VERSION 6→7, add `'concepts'` to `STORES`). Best-effort write — a ledger write must never fail a scan (mirrors `appendScanSnapshot`).
+
+Record, keyed by raw repoId (mirrors `decisions`/`mastery`):
+```
+concepts[repoId] = {
+ repoId,
+ atoms: [{ id, name, kind, purpose, files }], // from parseAtoms
+ vectors: number[][] | null, // per-atom embedding (aligned to atoms), null when none
+ embedModel: string | null, // model that produced vectors, e.g. 'text-embedding-3-small'
+ computedAt: string, // ISO
+}
+```
+`store.js` CRUD (no compute in the store): `getConcepts(repoId)`, `getAllConcepts()` (→ `{repoId: record}` map), `setConcepts(repoId, record)`.
+
+### ② Pure concept logic — `concepts.js` (no DOM / network / AI)
+
+Fully unit-testable. Exports:
+- `normalizeConcept(atom)` → a canonical lexical key (lowercase, strip punctuation, drop stopwords). Used by the lexical matcher.
+- `cosineSimilarity(vecA, vecB)` → number in [-1, 1] (0 for zero/length-mismatched vectors).
+- **Matchers** (pure functions over the records map), behind one interface:
+ - `lexicalMatcher` — builds a concept→repos index from normalized atom keys (with a light token-overlap merge of near-duplicate keys via `store/search.js`), then links repos sharing ≥1 concept.
+ - `embeddingMatcher(threshold)` — links repos that have a cross-repo atom pair with `cosineSimilarity >= threshold` (default tunable, e.g. 0.82).
+- `deriveConceptLinks(records, { matcher })` → `[{ a: repoId, b: repoId, shared: string[], score: number }]`. Per-repo hybrid selection: use the embedding matcher between two repos only when **both** have `vectors`; otherwise fall back to lexical for that pair. `shared` is the linking labels — **lexical**: the shared normalized concept keys; **embedding**: the names of the matched atom pair(s).
+- `conceptIndex(records)` → `{ conceptLabel: repoId[] }` over the **lexical** normalized concepts (named concept → repos; what 2b/2c read to show "N repos touch X"). Note: this index is lexical-only because embeddings link repo *pairs* without producing discrete concept labels — so `conceptIndex` takes no matcher, while `deriveConceptLinks` is the hybrid repo-linker.
+
+### ③ Embeddings path — provider-gated (the sharp matcher)
+
+- **Capability:** extend the provider registry (`providers.js`) with embeddings support per provider — `{ embeddingsEndpoint, embeddingsModel }` for those that have one (OpenAI → `text-embedding-3-small`; Google → `text-embedding-004`; most OpenAI-compatible → their `/embeddings`); none for Anthropic and any without an endpoint. A helper `providerSupportsEmbeddings(id)`.
+- **Call:** new `callEmbeddings(texts)` in `background.js` — an OpenAI-compatible `/embeddings` POST using the configured provider's key + model, wrapped in the same hard timeout as `callAI`. Returns `number[][]` aligned to `texts`. Errors degrade to `null` vectors (→ lexical fallback), never breaking the deep dive.
+- **When:** on deep-dive completion, if `providerSupportsEmbeddings(configured)`, embed each atom's `name + ' — ' + purpose` and store the vectors; else store `vectors: null`.
+
+## Architecture (files)
+
+- **Create** `concepts.js` — pure model (normalize, cosine, matchers, `deriveConceptLinks`, `conceptIndex`). One responsibility: concept indexing/linking.
+- **Create** `tests/concepts.test.js`.
+- **Modify** `store/idb.js` — v6→v7, add `'concepts'` store.
+- **Modify** `store.js` — `getConcepts` / `getAllConcepts` / `setConcepts` (CRUD) + a store test (`tests/store-concepts.test.js`).
+- **Modify** `providers.js` — embeddings capability metadata + `providerSupportsEmbeddings`.
+- **Modify** `background.js` — persist atoms on deep-dive completion; `callEmbeddings`; embed-on-deep-dive (provider-gated, best-effort).
+
+## Testing
+
+- **vitest (pure):** `concepts.js` — `cosineSimilarity` (orthogonal=0, identical=1, length-mismatch=0), `normalizeConcept`, `lexicalMatcher` (shared-key linking + near-dup merge), `embeddingMatcher` (links above threshold using hand-built vectors), `deriveConceptLinks` hybrid selection (both-have-vectors → embedding; else lexical), `conceptIndex` shape.
+- **vitest + fake-indexeddb:** concepts CRUD round-trip; v7 store creation.
+- **Provider/AI:** `callEmbeddings` tested with a mocked `global.fetch` (mirrors `tests/fetcher.test.js`), incl. the error→null path; `providerSupportsEmbeddings` table tests.
+- Existing suite green; `eslint .` 0 errors; HTML gate passes; `node --check` on touched files.
+
+## Constraints
+
+- Zero-build, zero-dep, vanilla ES modules. Embeddings via the BYO-key provider; vectors in IDB; cosine local — **no hosted backend**.
+- `background.js`/AI changes are scoped to the embeddings path only; the existing scan/deep-dive/lens contracts are untouched.
+- Mono Ink etc. apply once 2b adds UI (none here).
+
+## Acceptance criteria
+
+- [ ] After a deep dive, the repo's atoms are persisted to the `concepts` store (best-effort; failure never breaks the scan).
+- [ ] With an embeddings-capable provider configured, atom vectors are computed + cached; with Anthropic (or any without an endpoint), `vectors` is `null` and the lexical matcher is used.
+- [ ] `deriveConceptLinks` links repos by shared concepts and reports the shared concept labels; hybrid selection uses embeddings only when both repos have vectors, else lexical.
+- [ ] `concepts.js` is fully unit-tested (cosine, normalize, both matchers, hybrid selection, index); concepts persistence tested with fake-indexeddb; `callEmbeddings` tested against a mocked fetch incl. the error→null fallback.
+- [ ] No hosted backend added; the embeddings call is the only new provider call; existing scan/lens contracts unchanged.
+- [ ] All existing tests pass + new tests; `eslint .` 0 errors; HTML gate passes.
+
+## Resolved decisions
+
+- Slice = 2a substrate; matcher = hybrid (embeddings + lexical fallback, per-repo); build lexical-first then embeddings.
+- Atoms persisted to a new `concepts` IDB store (v7), keyed by raw repoId, CRUD-only store API.
+- Embeddings via an OpenAI-compatible `/embeddings` call in `background.js`, provider-gated; vectors local; cosine local.
+- No UI in 2a (deferred to 2b).
From c330dd1f05e17d2d64d31877f26af3e797c93d90 Mon Sep 17 00:00:00 2001
From: ares <285551516+New1Direction@users.noreply.github.com>
Date: Tue, 16 Jun 2026 22:06:59 -0700
Subject: [PATCH 02/20] docs(plan): Concept Substrate implementation plan
(Pillar 3, Spec 2a)
---
.../plans/2026-06-16-concept-substrate.html | 110 ++++
.../plans/2026-06-16-concept-substrate.md | 596 ++++++++++++++++++
2 files changed, 706 insertions(+)
create mode 100644 docs/superpowers/plans/2026-06-16-concept-substrate.html
create mode 100644 docs/superpowers/plans/2026-06-16-concept-substrate.md
diff --git a/docs/superpowers/plans/2026-06-16-concept-substrate.html b/docs/superpowers/plans/2026-06-16-concept-substrate.html
new file mode 100644
index 0000000..9c55da0
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-16-concept-substrate.html
@@ -0,0 +1,110 @@
+
+
+
+
+
+ Concept Substrate — Implementation Plan
+
+
+
+
+
+
+
Concept Substrate
+
+ Implementation plan · Pillar 3 / Spec 2a · 2026-06-16 · branch feat/concept-substrate
+ Overview for review. Full step-by-step code (TDD steps, commands, commits) lives in the companion .md.
+
+
+
+
+ Goal. Persist deep-dive atoms per repo and link repos by shared concepts — lexical everywhere, embeddings where the provider supports them — exposing a link/index API for the later UI (2b) and anchor explanations (2c). No user-visible surface in 2a; no hosted backend.
+
+
+
Build order (two phases, one plan)
+
+
Phase 1 (Tasks 1–3) — lexical substrate.concepts.js (incl. cosine/embedding math as pure fns), the concepts store, atom persistence (vectors: null). No AI / provider-call changes. Ships a working lexical concept graph + the hybrid's fallback path.
+ v1 embeddings scope: OpenAI-protocol compat providers (OpenAI is in COMPAT_PROVIDERS; plus any tagged with an embeddings model). The 5 first-class providers (Anthropic/Google/OpenRouter/xAI/Nous) use a separate call path and fall back to lexical in v1 — Google embeddings is a deliberate follow-up (the spec mentioned it; narrowed to stay bounded).
+
Task 3 · Persist atoms on deep-dive completion modify
+
background.js (runDeepDive, ~L806)
+
Best-effort setConcepts(repoId, {atoms, vectors:null, …}) before status:'done'. Never fails the dive.
+
+
+
+
+ Phase 2 — Embeddings path
+
+
Task 4 · Provider embeddings capability modify
+
providers.js, tests/concepts-embeddings.test.js
+
embeddingsModel on OpenAI's entry + providerSupportsEmbeddings, compatEmbeddingsEndpoint (derive /embeddings from the chat URL), embeddingsModelFor, pickEmbeddingsProvider.
Pure embeddingsBody/parseEmbeddings in providers.js (testable); callEmbeddings wraps fetchWithTimeout (errors → null → lexical); runDeepDive embeds atom text when a provider supports it.
+
+
+
+
+ Phase 3 — Verification
+
+
Task 6 · Full pass
+
vitest run (prior + 3 new test files) · eslint . 0 errors · check:html · node --check · manual note (OpenAI key → vectors cached; Anthropic → null → lexical; no user surface yet).
+
+
+
+
Spec coverage
+
+
Persist atoms to a new store: Tasks 2, 3. Pure concept/cosine/matcher logic: Task 1.
Supports GitHub, GitLab, npm, and PyPI. Repos are scanned sequentially to respect rate limits. Each scan uses one AI call and saves automatically to your library.
+ Supports GitHub, GitLab, npm, and PyPI. Repos are scanned sequentially to respect rate limits. Each
+ scan uses one AI call and saves automatically to your library.
+