fix: hashing to avoid doc with duplicate name to collide #30

Open

saccharin98 wants to merge 1 commit into VectifyAI:dev from saccharin98:dev

Conversation

@saccharin98
Collaborator

No description provided.

@saccharin98
Collaborator Author

Code review

No issues found. Checked for bugs (race conditions, data corruption, logic errors, edge cases) — no significant problems introduced by this PR.

Notes:

  • The two potential concerns (parallel indexing race, raw/wiki name mismatch when file is already in raw/) were evaluated and determined to be false positives: the hash suffix prevents collisions in the normal code path, and lint.py's registry lookup correctly handles the naming indirection.
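The hash-suffix scheme the review refers to can be sketched as follows. This is an illustrative sketch only: the digest algorithm, suffix length, and function name are assumptions, not necessarily what OpenKB uses.

```python
import hashlib


def hashed_doc_name(stem: str, content: bytes, length: int = 8) -> str:
    """Append a short content digest to a document stem, so two files
    that share a name but differ in content get distinct doc names."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    return f"{stem}-{digest}"
```

Because the suffix depends on file content rather than the file name, two distinct documents both named `report.pdf` no longer collide in the registry or in the wiki output paths.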

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@rejojer
Member

rejojer commented Apr 28, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@saccharin98
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@KylinMountain
Collaborator

Code review

Found 1 issue:

  1. Pre-existing registry entries (created by versions before this PR) have no doc_name, path, raw_path, or source_path fields, so neither get_by_path nor remove_by_doc_name can match them. When such a file is re-ingested with changed content, the upgrade path silently leaks orphans: a new hashed doc_name is generated, remove_by_doc_name matches nothing (the old entry has no doc_name to compare against), and the new entry is added alongside the stale one. The old wiki/sources/<stem>.md and wiki/summaries/<stem>.md files are likewise left in place next to the new <stem>-<hash>.md outputs. No migration code is included.

remove_by_doc_name only matches entries whose metadata["doc_name"] == doc_name:

OpenKB/openkb/state.py (lines 55 to 67 in 870c947):

```python
def remove_by_doc_name(self, doc_name: str) -> None:
    """Remove older content-hash entries for the same document."""
    stale_hashes = [
        file_hash
        for file_hash, metadata in self._data.items()
        if metadata.get("doc_name") == doc_name
    ]
    if not stale_hashes:
        return
    for file_hash in stale_hashes:
        del self._data[file_hash]
    self._persist()
```
get_by_path only matches entries with the new path fields:

OpenKB/openkb/state.py (lines 34 to 44 in 870c947):

```python
def get_by_path(self, path: str) -> dict | None:
    """Return metadata registered for a source, raw, or wiki path."""
    for metadata in self._data.values():
        if path in {
            metadata.get("path"),
            metadata.get("raw_path"),
            metadata.get("source_path"),
        }:
            return metadata
    return None
```

The cleanup call in add_single_file therefore no-ops for any pre-PR entry:

OpenKB/openkb/cli.py (lines 208 to 223 in 870c947):

```python
# Register hash only after successful compilation
if result.file_hash:
    doc_type = "long_pdf" if result.is_long_doc else file_path.suffix.lstrip(".")
    metadata = {
        "name": file_path.name,
        "doc_name": doc_name,
        "type": doc_type,
        "path": _registry_path(file_path, kb_dir),
    }
    if result.raw_path is not None:
        metadata["raw_path"] = _registry_path(result.raw_path, kb_dir)
    if result.source_path is not None:
        metadata["source_path"] = _registry_path(result.source_path, kb_dir)
    registry.remove_by_doc_name(doc_name)
    registry.add(result.file_hash, metadata)
```

A one-time migration (backfill path/doc_name for existing entries based on name) or a fallback match by name in remove_by_doc_name would close this gap.




3 participants