[codex] Add configurable literal bracket patterns #44
Pull request overview
Adds an opt-in configuration to preserve bracketed literal source text (e.g., scientific notation like [2Fe-2S]) during supporting-text validation, while keeping the existing default behavior of stripping all [...] segments for backward compatibility.
Changes:
- Introduces literal_bracket_patterns on ReferenceValidationConfig and wires it into SupportingTextValidator._split_query().
- Updates _split_query() to selectively preserve bracketed content when configured patterns match, and normalizes whitespace in returned parts.
- Adds/updates tests plus README + conceptual docs to cover and document the new configuration.
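The selective-preservation behaviour described in the changes list can be sketched roughly as follows. This is a minimal illustration, not the project's code: `split_query` is a hypothetical standalone stand-in for `SupportingTextValidator._split_query()`, and it assumes that queries are split on `...`, that patterns are applied with `re.search` to the text inside the brackets, and that a preserved segment keeps its brackets.

```python
import re
from typing import Iterable, List

# Matches one non-nested bracketed segment and captures its content.
BRACKET_RE = re.compile(r"\[([^\[\]]*)\]")


def split_query(text: str, literal_bracket_patterns: Iterable[str] = ()) -> List[str]:
    """Hypothetical stand-in for SupportingTextValidator._split_query()."""
    compiled = [re.compile(p) for p in literal_bracket_patterns]

    def keep_or_strip(match: re.Match) -> str:
        content = match.group(1)
        # Assumption: a segment is preserved if any configured pattern
        # is found anywhere in its content.
        if any(p.search(content) for p in compiled):
            return match.group(0)
        # Default: treat the segment as an editorial insertion and drop it.
        return " "

    parts = []
    for part in text.split("..."):
        cleaned = BRACKET_RE.sub(keep_or_strip, part)
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            parts.append(cleaned)
    return parts


print(split_query("protein functions ... in cells"))
print(split_query("protein [important] functions"))
print(split_query("the [2Fe-2S] cluster [sic] binds", [r"\d"]))
```

With no patterns configured, every bracketed segment is stripped, which mirrors the backward-compatible default. With `[r"\d"]`, only segments containing a digit survive.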
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/linkml_reference_validator/models.py | Adds literal_bracket_patterns config field with description + example. |
| src/linkml_reference_validator/validation/supporting_text_validator.py | Compiles configured patterns and conditionally preserves matching bracketed content in _split_query(). |
| tests/test_models.py | Verifies config default/custom values for the new field. |
| tests/test_validation_config.py | Verifies YAML config loading for literal_bracket_patterns. |
| tests/test_supporting_text_validator.py | Adds behavioral tests for default stripping vs configured preservation in split + matching. |
| README.md | Documents the new YAML option and intended usage. |
| docs/concepts/editorial-conventions.md | Adds conceptual documentation for literal brackets and configuration. |
['protein functions', 'in cells']
>>> validator._split_query("protein [important] functions")
['protein functions']
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
In the _split_query() docstring example, literal_bracket_patterns=[r"\\d"] compiles a pattern that matches a literal \d sequence, not digits. This example likely intends r"\d" so the shown output is reproducible.
Suggested change:
- >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
+ >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
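The difference between the two patterns in the suggestion can be checked with the standard `re` module alone. The snippet below is only an illustration of how the regex engine treats each pattern string; the variable names are made up for the example, and the sample text is the content of the `[2Fe-2S]` bracket mentioned in the overview.

```python
import re

content = "2Fe-2S"  # text inside the brackets of "[2Fe-2S]"

# Regex \\d: an escaped backslash followed by the letter "d".
double_escaped = re.compile(r"\\d")
# Regex \d: any digit.
single_escaped = re.compile(r"\d")

print(double_escaped.search(content))          # None
print(single_escaped.search(content).group())  # 2
print(double_escaped.search(r"\d").group())    # \d
```

This shows the pattern as the regex engine receives it. How many backslashes must be typed in the docstring source depends on whether that docstring is itself a raw string, which is not visible in this excerpt.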
Claude finished @cmungall's task

Code Review: Configurable Literal Bracket Patterns

I've completed a comprehensive review of the PR that adds configurable literal bracket patterns. Here's my detailed analysis:

Todo List:

🎯 Overall Assessment
This is a well-designed feature that solves a legitimate problem while maintaining backward compatibility. The implementation is clean, well-tested, and follows the project's conventions.

✅ Strengths
1. Code Quality & Best Practices
2. Test Coverage
3. Documentation

🔍 Issues Identified
1. Critical Bug in Docstring (Copilot Finding Confirmed)
File:
# Current (incorrect):
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
# Should be:
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])
The double-escaped regex matches a literal \d sequence rather than digits.

2. Minor Documentation Ambiguity (Copilot Finding)
File:
The phrase "preserved verbatim during validation" could be misleading since the text still undergoes normalization (lowercasing, punctuation removal, whitespace collapse). Consider rewording to "preserved (not stripped) before normalization and validation" for clarity.

🚀 Performance Considerations
Positive Aspects
Minor Optimizations

🔒 Security Assessment
No security concerns identified. The implementation:

🧪 Test Quality Analysis
Strong Test Coverage
Follows Repository Standards
Minor Test Enhancement Opportunities

🏗️ Architecture & Design
Excellent Design Choices
Implementation Quality

🔧 Recommended Actions
Must Fix
Should Consider
Nice to Have

📊 Test Verification Needed
While I couldn't run the tests due to permission restrictions, the test suite appears comprehensive. I recommend running:
just test # Full test suite
# or specifically:
uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py -v

🎉 Summary
This is a high-quality implementation that:
The only critical issue is the docstring regex example, which should be fixed before merging. Otherwise, this is production-ready code that adds significant value to the project.

Recommendation: Approve with minor fix - Fix the docstring regex pattern and this will be ready to merge.
The docs recommended `^\S` to "keep tight brackets like [poly(A)+], strip [ editorial ]", but `^\S` matches any bracket content that does not start with a space, including ordinary editorial notes such as `[important]`. Following the example would silently preserve those notes.

Use a pattern targeted at the notation instead (`[()+]` for parens/charges), and add a caveat that broad patterns also preserve editorial notes.

Add a test pinning the documented behaviour: `[()+]` keeps `[poly(A)+]` while `[important]` is still stripped.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
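The contrast between the broad and the targeted pattern in this commit message can be reproduced with plain `re`. This is a small sketch, assuming patterns are applied with `re.search` to the text inside the brackets (consistent with `[()+]` keeping `[poly(A)+]`); the `preserved` helper is hypothetical and exists only for the example.

```python
import re

# Bracket contents taken from the commit message examples.
bracket_contents = ["poly(A)+", "important", " editorial "]

broad = re.compile(r"^\S")        # anything not starting with whitespace
targeted = re.compile(r"[()+]")   # contains a parenthesis or a plus sign


def preserved(pattern, contents):
    """Return the contents that the given pattern would cause to be kept."""
    return [c for c in contents if pattern.search(c)]


print(preserved(broad, bracket_contents))     # ['poly(A)+', 'important']
print(preserved(targeted, bracket_contents))  # ['poly(A)+']
```

The broad pattern keeps the editorial note `important` along with the notation, while the targeted pattern keeps only the notation.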
Took over to address the review. Pushed
Full non-integration suite (450), doctests, |
Summary
- Add literal_bracket_patterns to ReferenceValidationConfig
- Preserve bracketed content in _split_query() when configured patterns match bracket contents

Why
SupportingTextValidator._split_query() stripped every [...] segment unconditionally. That works for editorial insertions, but it breaks literal scientific bracket notation like [2Fe-2S], [30S], [Ca2+], and similar corpus-specific patterns.

This change keeps the existing default behavior for backward compatibility and lets projects opt into preserving literal brackets with explicit regex patterns.

Impact
Projects that do nothing keep the current strip-all behavior. Projects with literal bracketed source text can now configure preservation rules in YAML or Python config.

Validation
- uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py
- just test