Match the generic "reference" term as a token, not a bare substring#47
Merged
Conversation
ReferenceURIs.is_reference_uri matched the bare substring "reference", so any URI that merely contained those letters was classified as an authoritative reference field — e.g. `user_preference`, `dereference`, `preferenceOrder`. A mis-detected reference slot can then trigger spurious reference fetches and "could not fetch reference" errors on unrelated fields. Match the generic word "reference"/"references" only as a whole token, split on camelCase boundaries and separators, while keeping the specific Dublin Core / legacy URIs (dcterms:source, authoritative_reference, ...) as substring matches. This preserves real detections — including the camelCase `myReferenceField` case the existing tests pin — while rejecting the false positives. Also complete two under-specified mocks in test_title_validation that omitted `slot_uri` (their sibling title mock already set it to None); real slots always carry slot_uri as str|None, never an auto-MagicMock. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
|
Claude finished @cmungall's task —— View job PR Code Review ✅I've completed a comprehensive review of this pull request. This is a well-implemented fix for a subtle but important bug in reference URI detection. Todo List:
Review SummaryThis PR successfully fixes a false positive bug in reference URI detection while maintaining backward compatibility and following good software engineering practices. ✅ StrengthsExcellent Problem Analysis & Solution Design
High-Quality Implementation (
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
ReferenceURIs.is_reference_uriincluded the bare substring"reference"in its match patterns, so any URI that merely contained those letters was classified as an authoritative-reference field:A slot mis-detected as a reference then gets its value passed to
_extract_reference_idand fetched, which can surface spuriousCould not fetch referenceerrors on completely unrelated fields.Note this is intentionally not a plain word-boundary fix: an existing test (
test_field_detection.py) pinshttp://example.org/myReferenceField -> True, i.e. the camelCaseReferencetoken should match. A\breference\bregex (or simply dropping the substring) would break that.Fix
Match the generic word
reference/referencesas a whole token — splitting the URI on camelCase boundaries and separators (_ - : / #) — while keeping the specific, unambiguous Dublin Core / legacy URIs (dcterms:source,authoritative_reference, …) as substring matches.Result:
dcterms:referenceslinkml:authoritative_referencehttp://example.org/myReferenceFieldex:cross_referencetest:user_preferenceex:dereferenceisReferencedBy-style URIs (tokenreferenced, notreference) are also no longer matched, which is the intended behaviour.Scope: limited to reference detection, as the bare-substring style for excerpt/title is far less collision-prone.
Tests
TestReferenceURITokenBoundaries(test-first; the false-positive cases failed before the fix) covering rejected non-reference terms and preserved genuine references.test_title_validation.pythat omittedslot_uri(their sibling title mock already set it toNone). Real slots always carryslot_uriasstr | None; the old pure-substring path only tolerated the strayMagicMockby accident. No assertions changed.Full non-integration suite: 493 passed. Module doctests,
mypy, andruffclean.🤖 Generated with Claude Code