
[BUG] bm project info reports missing embeddings after successful reindex #670

@bengreeno

Bug Description

After a successful bm reindex, bm project info still reports missing embeddings and recommends another embeddings reindex. In my case, this appears to be caused by stale rows in the derived search tables rather than by current entities that actually lack embeddings.

Steps To Reproduce

  1. Install Basic Memory version 0.20.2 in Docker with semantic search enabled.
  2. Use a project with fastembed configured as the embedding provider.
  3. Run:
    bm reindex
  4. After it completes successfully, run:
    bm project info main
  5. Observe that bm project info still reports:
    • Indexed 1148/1172
    • Reindex recommended
    • 24 entities missing embeddings — run: bm reindex --embeddings

Expected Behavior

After a successful bm reindex, I would expect bm project info to stop recommending another embeddings reindex unless there are actually current entities missing embeddings.

Actual Behavior

bm reindex completes successfully:

Project: main
  Rebuilding full-text search index...
  ✓ Full-text search index rebuilt
  Building vector embeddings...
    Embedding entities... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
  ✓ Embeddings complete: 1140 entities embedded, 0 skipped, 0 errors

Reindex complete!

But immediately afterward, bm project info main still shows:
[Image: screenshot of the bm project info main output]

I inspected the database and found:

  • 0 current entities missing chunks
  • stale entity_ids remain in search_index
  • stale entity_ids remain in search_vector_chunks

In my case:

  • stale search_index entity IDs: 32
  • stale search_vector_chunks entity IDs: 8

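The reported gap of 24 (1172 - 1148) equals the difference between the two stale counts (32 - 8 = 24). Subtracting the stale IDs from each side of "Indexed 1148/1172" gives 1140/1140, which matches the 1140 entities that bm reindex reported as embedded. So the gap appears to come from stale derived-table rows rather than from live notes that still need embedding.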
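
A minimal sketch of how a check like this could be run against a SQLite database. The table names entity, search_index, and search_vector_chunks and the entity_id column come from this report. Everything else is an assumption made for illustration: the entity.id key column, the helper names, and the toy schema and data (plain tables, which may not match Basic Memory's real schema).

```python
import sqlite3

# Derived tables named in this report; the name is interpolated into SQL,
# so only these two values are accepted.
DERIVED_TABLES = {"search_index", "search_vector_chunks"}


def count_stale_entity_ids(conn: sqlite3.Connection, derived_table: str) -> int:
    """Count distinct entity_ids in a derived table with no matching entity row.

    Assumes (hypothetically) that the canonical table is `entity` keyed by `id`.
    """
    if derived_table not in DERIVED_TABLES:
        raise ValueError(f"unexpected table: {derived_table}")
    query = f"""
        SELECT COUNT(DISTINCT d.entity_id)
        FROM {derived_table} AS d
        LEFT JOIN entity AS e ON e.id = d.entity_id
        WHERE e.id IS NULL
    """
    return conn.execute(query).fetchone()[0]


def build_demo_db() -> sqlite3.Connection:
    """Build a toy in-memory database; schema and rows are illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entity (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE search_index (entity_id INTEGER, type TEXT);
        CREATE TABLE search_vector_chunks (entity_id INTEGER, chunk_text TEXT);
        """
    )
    conn.executemany("INSERT INTO entity VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
    # IDs 901 and 902 have no entity row, so they count as stale.
    conn.executemany(
        "INSERT INTO search_index VALUES (?, ?)",
        [(1, "entity"), (2, "entity"), (3, "entity"), (901, "relation"), (902, "entity")],
    )
    conn.executemany(
        "INSERT INTO search_vector_chunks VALUES (?, ?)",
        [(1, "chunk"), (1, "chunk"), (2, "chunk"), (902, "chunk")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for table in sorted(DERIVED_TABLES):
        print(table, count_stale_entity_ids(demo, table))
```

The LEFT JOIN with `e.id IS NULL` keeps only derived rows that have no canonical counterpart, and COUNT(DISTINCT ...) avoids counting an entity once per chunk.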

Environment

  • OS: macOS (host) with Dockerized Basic Memory container
  • Python version: Python 3.13.12 in the container
  • Basic Memory version: 0.20.2
  • Installation method: Docker
  • Claude Desktop version (if applicable): N/A

Additional Context

Example stale derived rows I found:

  • entity_id = 1130
    • title: Ops Home
    • file path: ops/index.md
    • type: relation
  • entity_id = 1246
    • title: Dialog without buttons
    • file path: conversations/chatgpt-20240716-Dialog_without_buttons.md
    • type: entity

These rows still exist in derived search tables but no longer correspond to current rows in the canonical entity table.

As a result, bm project info appears to overstate the number of missing embeddings after a successful reindex.

Possible Solution

bm project info may be calculating embedding coverage from derived tables without excluding stale rows whose entity_id no longer exists in the canonical entity table.

Possible fixes:

  • when computing embedding coverage, only count entity IDs that still exist in entity
  • or ensure bm reindex also cleans up stale rows in search_index / search_vector_chunks so the post-reindex stats remain consistent
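
A rough sketch of both options, written against a toy SQLite database. This is not Basic Memory's actual code: I do not know how bm project info computes its numbers, so the "naive" count below is only a guess at the kind of calculation that would produce this symptom. The entity.id column, the function names, and the demo schema are all assumptions.

```python
import sqlite3


def embedding_coverage(conn: sqlite3.Connection, only_live: bool = True) -> tuple[int, int]:
    """Return (entity IDs with vector chunks, entity IDs in the search index).

    With only_live=True, IDs that no longer exist in `entity` are ignored
    (option 1). With only_live=False, every ID in the derived tables counts,
    which is a hypothetical naive calculation.
    """
    live_filter = "WHERE entity_id IN (SELECT id FROM entity)" if only_live else ""
    embedded = conn.execute(
        f"SELECT COUNT(DISTINCT entity_id) FROM search_vector_chunks {live_filter}"
    ).fetchone()[0]
    indexed = conn.execute(
        f"SELECT COUNT(DISTINCT entity_id) FROM search_index {live_filter}"
    ).fetchone()[0]
    return embedded, indexed


def purge_stale_rows(conn: sqlite3.Connection) -> int:
    """Delete derived rows whose entity_id is missing from `entity` (option 2).

    Returns the total number of rows deleted across both derived tables.
    """
    deleted = 0
    for table in ("search_index", "search_vector_chunks"):
        cursor = conn.execute(
            f"DELETE FROM {table} WHERE entity_id NOT IN (SELECT id FROM entity)"
        )
        deleted += cursor.rowcount
    conn.commit()
    return deleted


def build_demo_db() -> sqlite3.Connection:
    """Toy in-memory database; schema and rows are illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entity (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE search_index (entity_id INTEGER, type TEXT);
        CREATE TABLE search_vector_chunks (entity_id INTEGER, chunk_text TEXT);
        """
    )
    conn.executemany("INSERT INTO entity VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
    # Every live entity is indexed and embedded; 901 and 902 are stale leftovers.
    conn.executemany(
        "INSERT INTO search_index VALUES (?, ?)",
        [(1, "entity"), (2, "entity"), (3, "entity"), (901, "relation"), (902, "entity")],
    )
    conn.executemany(
        "INSERT INTO search_vector_chunks VALUES (?, ?)",
        [(1, "chunk"), (1, "chunk"), (2, "chunk"), (3, "chunk"), (901, "chunk")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    print("naive:", embedding_coverage(demo, only_live=False))
    print("live only:", embedding_coverage(demo))
    print("rows purged:", purge_stale_rows(demo))
    print("naive after purge:", embedding_coverage(demo, only_live=False))
```

In the toy data the naive count shows a gap even though every live entity has chunks. Filtering to live IDs closes the gap without touching the data, while purging makes even the naive count consistent. The first option is read-only; the second also reclaims the stale rows.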
