fix: hashing to avoid doc with duplicate name to collide #30

Open

saccharin98 wants to merge 1 commit into VectifyAI:dev from saccharin98:dev

Conversation

@saccharin98
Collaborator

No description provided.

@saccharin98
Collaborator Author

Code review

No issues found. Checked for bugs (race conditions, data corruption, logic errors, edge cases) — no significant problems introduced by this PR.

Notes:

  • The two potential concerns (parallel indexing race, raw/wiki name mismatch when file is already in raw/) were evaluated and determined to be false positives: the hash suffix prevents collisions in the normal code path, and lint.py's registry lookup correctly handles the naming indirection.
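The hash-suffix scheme the review refers to can be sketched as follows. This is an illustrative sketch only: the digest algorithm, suffix length, and function name are assumptions, not necessarily what OpenKB uses.

```python
import hashlib


def hashed_doc_name(stem: str, content: bytes, length: int = 8) -> str:
    """Append a short content digest to a document stem, so two files
    that share a name but differ in content get distinct doc names."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    return f"{stem}-{digest}"
```

Because the suffix depends on file content rather than the file name, two distinct documents both named `report.pdf` no longer collide in the registry or in the wiki output paths.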

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@rejojer
Member

rejojer commented Apr 28, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@saccharin98
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@KylinMountain
Collaborator

Code review

Found 1 issue:

  1. Pre-existing registry entries (created by versions before this PR) have no doc_name, path, raw_path, or source_path fields, so neither get_by_path nor remove_by_doc_name can match them. When such a file is re-ingested with changed content, the upgrade path silently leaks orphans: a new hashed doc_name is generated, remove_by_doc_name matches nothing (the old entry has no doc_name to compare against), and the new entry is added alongside the stale one. The old wiki/sources/<stem>.md and wiki/summaries/<stem>.md files are likewise left in place next to the new <stem>-<hash>.md outputs. No migration code is included.

remove_by_doc_name only matches entries whose metadata["doc_name"] == doc_name:

OpenKB/openkb/state.py (lines 55 to 67 in 870c947):

```python
def remove_by_doc_name(self, doc_name: str) -> None:
    """Remove older content-hash entries for the same document."""
    stale_hashes = [
        file_hash
        for file_hash, metadata in self._data.items()
        if metadata.get("doc_name") == doc_name
    ]
    if not stale_hashes:
        return
    for file_hash in stale_hashes:
        del self._data[file_hash]
    self._persist()
```
get_by_path only matches entries with the new path fields:

OpenKB/openkb/state.py (lines 34 to 44 in 870c947):

```python
def get_by_path(self, path: str) -> dict | None:
    """Return metadata registered for a source, raw, or wiki path."""
    for metadata in self._data.values():
        if path in {
            metadata.get("path"),
            metadata.get("raw_path"),
            metadata.get("source_path"),
        }:
            return metadata
    return None
```

The cleanup call in add_single_file therefore no-ops for any pre-PR entry:

OpenKB/openkb/cli.py (lines 208 to 223 in 870c947):

```python
# Register hash only after successful compilation
if result.file_hash:
    doc_type = "long_pdf" if result.is_long_doc else file_path.suffix.lstrip(".")
    metadata = {
        "name": file_path.name,
        "doc_name": doc_name,
        "type": doc_type,
        "path": _registry_path(file_path, kb_dir),
    }
    if result.raw_path is not None:
        metadata["raw_path"] = _registry_path(result.raw_path, kb_dir)
    if result.source_path is not None:
        metadata["source_path"] = _registry_path(result.source_path, kb_dir)
    registry.remove_by_doc_name(doc_name)
    registry.add(result.file_hash, metadata)
```

A one-time migration (backfill path/doc_name for existing entries based on name) or a fallback match by name in remove_by_doc_name would close this gap.




3 participants