fix: deduplicate entities by (title, type) to preserve same-name different-type entities #2339

Open
octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-1718-entity-dedup-by-title-and-type

@octo-patch

Fixes #1718

Problem

finalize_entities tracked only the entity title in its deduplication set, so two entities with the same title but different types (e.g. "Python" as LANGUAGE and "Python" as CONCEPT) were incorrectly collapsed into one. The second entity was silently dropped, causing ~10% data loss in graphs with polymorphic entity names.

Solution

Change the dedup key from title (a str) to (title, type) (a tuple[str, str]), matching the approach already used in finalize_relationships which deduplicates by (source, target).

Before:

seen_titles: set[str] = set()
...
if not title or title in seen_titles:
    continue
seen_titles.add(title)

After:

seen_entities: set[tuple[str, str]] = set()
...
entity_type = row.get("type", "")
if not title or (title, entity_type) in seen_entities:
    continue
seen_entities.add((title, entity_type))
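For reviewers who want to try the new key in isolation, here is a minimal, self-contained sketch of the same logic. The function name `dedup_by_title_and_type` and the assumption that rows are plain dicts with `title` and `type` keys are illustrative only; this is not the actual `finalize_entities` body.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def dedup_by_title_and_type(
    rows: Iterable[dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Hypothetical helper: yield the first row seen for each (title, type)."""
    seen_entities: set[tuple[str, str]] = set()
    for row in rows:
        title = row.get("title", "")
        entity_type = row.get("type", "")
        # Skip rows without a title and exact (title, type) repeats.
        if not title or (title, entity_type) in seen_entities:
            continue
        seen_entities.add((title, entity_type))
        yield row


rows = [
    {"title": "Python", "type": "LANGUAGE"},
    {"title": "Python", "type": "CONCEPT"},   # same title, different type: kept
    {"title": "Python", "type": "LANGUAGE"},  # exact repeat: dropped
    {"title": "", "type": "CONCEPT"},         # empty title: dropped
    {"title": "Rust", "type": "LANGUAGE"},
]
kept = list(dedup_by_title_and_type(rows))
```

With a title-only key, the second row would have been dropped as well, which is the behavior reported in #1718.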

Testing

  • Updated the existing test_deduplicates_by_title docstring to clarify it covers same (title, type) duplicates.
  • Added test_preserves_same_title_different_type which directly exercises the bug scenario: three entities where two share the title "Python" but have types LANGUAGE and CONCEPT; all three must survive finalization.
  • All 32 existing tests continue to pass.
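The two test scenarios described above can be sketched roughly as follows. `finalize_sketch` is a hypothetical stand-in written for this example, not the real `finalize_entities`, and the "Rust" row is an assumed third entity; the actual tests in the PR may be structured differently.

```python
def finalize_sketch(rows):
    """Hypothetical stand-in: keep the first row per (title, type)."""
    seen = set()
    out = []
    for row in rows:
        key = (row.get("title", ""), row.get("type", ""))
        # Drop rows with an empty title or an already-seen (title, type).
        if not key[0] or key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out


def test_preserves_same_title_different_type():
    rows = [
        {"title": "Python", "type": "LANGUAGE"},
        {"title": "Python", "type": "CONCEPT"},
        {"title": "Rust", "type": "LANGUAGE"},
    ]
    result = finalize_sketch(rows)
    # All three entities must survive: no (title, type) pair repeats.
    assert len(result) == 3
    assert {(r["title"], r["type"]) for r in result} == {
        ("Python", "LANGUAGE"),
        ("Python", "CONCEPT"),
        ("Rust", "LANGUAGE"),
    }


def test_deduplicates_same_title_and_type():
    rows = [
        {"title": "Python", "type": "LANGUAGE"},
        {"title": "Python", "type": "LANGUAGE"},
    ]
    # Exact (title, type) duplicates still collapse to one row.
    assert len(finalize_sketch(rows)) == 1


test_preserves_same_title_different_type()
test_deduplicates_same_title_and_type()
```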

…erent-type entities (fixes microsoft#1718)

Previously, finalize_entities only tracked seen titles, causing entities
with the same title but different type to be incorrectly deduplicated.
Change the dedup key to the (title, type) compound tuple so that e.g.
"Python" as a LANGUAGE and "Python" as a CONCEPT are both preserved.

Co-Authored-By: Octopus <liyuan851277048@icloud.com>
@octo-patch octo-patch requested a review from a team as a code owner April 28, 2026 02:55

Development

Successfully merging this pull request may close these issues.

[Fatal Bug]: Incorrect deduplication of entities with same title but different type

1 participant