testing: Unit tests for dataset deduplication feature ✔️ #91

Open
Areeba-Tahir-18 wants to merge 2 commits into INCF:main from Areeba-Tahir-18:final

@Areeba-Tahir-18

Greetings, INCF team!

Summary

This PR adds a dedicated unit testing suite for the dataset deduplication feature #87 in the Knowledge Space Agent project. The tests ensure that the deduplication logic works correctly across multiple edge cases and scenarios.

What’s included:

A new test file: backend/deduplication_testing.py

10 test scenarios, covering:

1. Basic deduplication by _id
2. URL variations deduplication
3. Title normalization (punctuation, spaces, case)
4. Fuzzy title matching
5. Handling multiple duplicates
6. Empty datasets
7. Datasets with different _id but same normalized title
8. Datasets with same _id but different capitalization
9. Ensuring unique datasets remain
10. Large datasets for performance testing

Uses pytest for clear, repeatable, and automated testing
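
A few of the scenarios above could be exercised along the following lines. This is only a minimal sketch and not the contents of backend/deduplication_testing.py: `normalize_title` and `deduplicate` are hypothetical stand-ins for the #87 logic, and the `title` field and the 0.9 similarity threshold are assumptions. URL handling is left out because this description does not say how URLs are compared.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def deduplicate(datasets, threshold=0.9):
    """Hypothetical stand-in for the #87 logic (assumption, not the real code).

    Keeps the first record for each case-insensitive _id and drops later
    records whose normalized title is near-identical to one already kept.
    """
    kept, seen_ids, seen_titles = [], set(), []
    for record in datasets:
        record_id = str(record.get("_id", "")).lower()
        title = normalize_title(record.get("title", ""))
        if record_id and record_id in seen_ids:
            continue
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= threshold
            for other in seen_titles
        ):
            continue
        if record_id:
            seen_ids.add(record_id)
        if title:
            seen_titles.append(title)
        kept.append(record)
    return kept


# pytest collects plain functions whose names start with "test_".
def test_basic_deduplication_by_id():
    data = [{"_id": "ds1", "title": "Mouse Brain Atlas"}] * 2
    assert len(deduplicate(data)) == 1


def test_title_normalization():
    data = [
        {"_id": "ds1", "title": "Mouse Brain Atlas"},
        {"_id": "ds2", "title": "MOUSE  brain-atlas!"},
    ]
    assert len(deduplicate(data)) == 1


def test_fuzzy_title_matching():
    data = [
        {"_id": "ds1", "title": "Mouse Brain Atlas"},
        {"_id": "ds2", "title": "Mouse Brain Atlass"},  # one-character typo
    ]
    assert len(deduplicate(data)) == 1


def test_same_id_different_capitalization():
    data = [
        {"_id": "DS1", "title": "Mouse Brain Atlas"},
        {"_id": "ds1", "title": "Human Cortex Recordings"},
    ]
    assert len(deduplicate(data)) == 1


def test_empty_dataset():
    assert deduplicate([]) == []


def test_unique_datasets_remain():
    data = [
        {"_id": "ds1", "title": "Mouse Brain Atlas"},
        {"_id": "ds2", "title": "Human Cortex Recordings"},
    ]
    assert deduplicate(data) == data
```

Saved as a file, the sketch runs with `pytest <file>`. Each test builds its own input, so one failing scenario does not hide the others.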

Real impact:

  • Ensures that deduplication works reliably under various scenarios
  • Improves confidence in data quality and search results
  • Demonstrates attention to quality, edge cases, and maintainable code

Proof Of Work

Screenshot (1003) Screenshot (1004)

@Areeba-Tahir-18
Author

Hello @QuantumByte-01 and @visakhmr, I'd appreciate it if you could review my unit testing suite for #87 and let me know if any changes are needed. Thank you!

Would love your feedback 🙌

1 participant