feat: One Dataset, One Result – Smart Deduplication in Knowledge Space Search ✔ #70

Closed
Areeba-Tahir-18 wants to merge 3 commits into INCF:main from Areeba-Tahir-18:feature/dataset-deduplication

Areeba-Tahir-18 commented Mar 6, 2026

Summary

This PR solves issue #68 and introduces a robust deduplication mechanism for the Knowledge Space search tool to ensure cleaner, more accurate search results.

Problem #68

Search results were showing duplicate datasets due to:

  • Aggregation from multiple datasources
  • Metadata variations (titles, descriptions, authors, capitalization)
  • Different URLs pointing to the same resource

Solution

This PR implements:

  1. Canonical dataset identity – uses datasource_id + dataset_id to uniquely identify datasets.
  2. URL normalization – removes query params and fragments to match identical datasets with different URLs.
  3. Title normalization – lowercases titles, removes punctuation, and collapses extra spaces.
  4. Fuzzy title matching – detects highly similar titles (threshold 0.93) to remove duplicates.
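
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the dict-shaped records, and the `title` and `url` keys are assumptions made for the example. Only `datasource_id`, `dataset_id`, and the 0.93 threshold come from the description, and `difflib.SequenceMatcher` is swapped in as one possible similarity measure.

```python
import string
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

FUZZY_THRESHOLD = 0.93  # threshold stated in the PR description

# Translation table that deletes ASCII punctuation (assumed definition of "punctuation").
_PUNCT = str.maketrans("", "", string.punctuation)


def canonical_key(record):
    """Step 1: (datasource_id, dataset_id) when both are present, else None."""
    source, dataset = record.get("datasource_id"), record.get("dataset_id")
    return (source, dataset) if source and dataset else None


def normalize_url(url):
    """Step 2: drop the query string and the fragment, keep everything else."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def normalize_title(title):
    """Step 3: lowercase, remove punctuation, collapse runs of whitespace."""
    return " ".join(title.lower().translate(_PUNCT).split())


def titles_match(a, b, threshold=FUZZY_THRESHOLD):
    """Step 4: compare normalized titles with a similarity ratio in [0, 1]."""
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold


def deduplicate(records):
    """Keep the first record of each duplicate group, preserving input order."""
    seen_keys, seen_urls, kept = set(), set(), []
    for record in records:
        key = canonical_key(record)
        url = normalize_url(record["url"]) if record.get("url") else None
        title = record.get("title", "")

        if key is not None and key in seen_keys:
            continue  # same canonical identity
        if url is not None and url in seen_urls:
            continue  # same resource behind a different URL
        if title and any(titles_match(title, k.get("title", "")) for k in kept):
            continue  # near-identical title

        kept.append(record)
        if key is not None:
            seen_keys.add(key)
        if url is not None:
            seen_urls.add(url)
    return kept
```

One caveat with this sketch: matching on title alone can merge genuinely different datasets whose long titles differ by only a character or two (for example, numbered series), so a real implementation might restrict the fuzzy check to records that share a datasource or author list.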

Real-World Impact

  • Faster searches – users find the dataset they need quickly without wading through duplicates.
  • Clear results – each dataset appears only once, making results easy to understand.
  • Consistent data – datasets from different sources are shown cleanly in one place.
  • Better user experience
  • Reduced redundancy in dataset listings – users are not confused by duplicate entries.
  • Better data reliability

Example

Previously, “Anesthesia EEG Dataset” appeared 3 times from DANDI. Now, only a single clean entry is returned.

Areeba-Tahir-18 (Author) commented Mar 7, 2026

Successfully merging this pull request may close these issues. #68

@QuantumByte-01 (Collaborator)

Closing this PR. The diff accidentally includes the entire Google Cloud SDK directory (1.4M+ file additions); it appears the SDK was installed inside the repo directory and everything got staged. The actual deduplication logic (~160 lines) is a valid idea, but it's completely buried under hundreds of thousands of unrelated files. Please re-open a clean PR with only the relevant code changes.

@Areeba-Tahir-18 (Author)

Thanks for the review! I see how the diff got bloated with the Google Cloud SDK files. I’ll clean up the PR to include only the deduplication logic and reopen a fresh PR for easier review. I appreciate the guidance! @QuantumByte-01
