[codex] Add configurable literal bracket patterns by cmungall · Pull Request #44 · linkml/linkml-reference-validator

cmungall · 2026-04-03T21:59:10Z

Summary

add literal_bracket_patterns to ReferenceValidationConfig
preserve bracketed source text in _split_query() when configured patterns match bracket contents
add validator/config/model coverage and document the new configuration option

Why

SupportingTextValidator._split_query() stripped every [...] segment unconditionally. That works for editorial insertions, but it breaks literal scientific bracket notation like [2Fe-2S], [30S], [Ca2+], and similar corpus-specific patterns.

This change keeps the existing default behavior for backward compatibility and lets projects opt into preserving literal brackets with explicit regex patterns.

Impact

Projects that do nothing keep the current strip-all behavior. Projects with literal bracketed source text can now configure preservation rules in YAML or Python config.

Validation

uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py
just test

Copilot

Pull request overview

Adds an opt-in configuration to preserve bracketed literal source text (e.g., scientific notation like [2Fe-2S]) during supporting-text validation, while keeping the existing default behavior of stripping all [...] segments for backward compatibility.

Changes:

Introduces literal_bracket_patterns on ReferenceValidationConfig and wires it into SupportingTextValidator._split_query().
Updates _split_query() to selectively preserve bracketed content when configured patterns match, and normalizes whitespace in returned parts.
Adds/updates tests plus README + conceptual docs to cover and document the new configuration.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`src/linkml_reference_validator/models.py`	Adds `literal_bracket_patterns` config field with description + example.
`src/linkml_reference_validator/validation/supporting_text_validator.py`	Compiles configured patterns and conditionally preserves matching bracketed content in `_split_query()`.
`tests/test_models.py`	Verifies config default/custom values for the new field.
`tests/test_validation_config.py`	Verifies YAML config loading for `literal_bracket_patterns`.
`tests/test_supporting_text_validator.py`	Adds behavioral tests for default stripping vs configured preservation in split + matching.
`README.md`	Documents the new YAML option and intended usage.
`docs/concepts/editorial-conventions.md`	Adds conceptual documentation for literal brackets and configuration.

Copilot · 2026-04-04T00:47:36Z

            ['protein functions', 'in cells']
            >>> validator._split_query("protein [important] functions")
            ['protein functions']
+            >>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])


In the _split_query() docstring example, literal_bracket_patterns=[r"\\d"] compiles a pattern that matches a literal \d sequence, not digits. This example likely intends r"\d" so the shown output is reproducible.

Suggested change

>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])

>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

claude · 2026-06-12T16:28:19Z

Claude finished @cmungall's task —— View job

Code Review: Configurable Literal Bracket Patterns

I've completed a comprehensive review of the PR that adds configurable literal bracket patterns. Here's my detailed analysis:

Todo List:

Read repository guidelines (CLAUDE.md)
Examine all changed files and implementation
Analyze code quality and best practices
Check for potential bugs or issues
Evaluate performance considerations
Review security concerns
Assess test coverage
Provide detailed feedback and recommendations

🎯 Overall Assessment

This is a well-designed feature that solves a legitimate problem while maintaining backward compatibility. The implementation is clean, well-tested, and follows the project's conventions.

✅ Strengths

1. Code Quality & Best Practices

Excellent adherence to project conventions: Follows TDD principles, includes comprehensive docstrings with examples, and avoids unnecessary try/except blocks per CLAUDE.md
Clean API design: The literal_bracket_patterns field is well-named and intuitive
Good separation of concerns: Logic is properly contained within _split_query() method
Consistent code style: Matches existing patterns in the codebase

2. Test Coverage

Comprehensive test suite: 45 new test lines covering both default and configured behavior
Good edge case coverage: Tests empty queries, multiple patterns, case sensitivity
Integration tests: Validates end-to-end behavior through find_text_in_reference
Backward compatibility verification: Ensures existing behavior is preserved

3. Documentation

Excellent documentation: README and conceptual docs clearly explain the feature
Good examples: Realistic patterns like [2Fe-2S], [Ca2+], [30S]
Clear migration path: Shows how to opt into the new behavior

🔍 Issues Identified

1. Critical Bug in Docstring (Copilot Finding Confirmed)

File: src/linkml_reference_validator/validation/supporting_text_validator.py:309

# Current (incorrect):
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])

# Should be:
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])

The double-escaped regex r"\\d" matches literal backslash-d (\d), not digits. This makes the docstring example incorrect and misleading.

2. Minor Documentation Ambiguity (Copilot Finding)

File: README.md

The phrase "preserved verbatim during validation" could be misleading since the text still undergoes normalization (lowercasing, punctuation removal, whitespace collapse). Consider rewording to "preserved (not stripped) before normalization and validation" for clarity.

🚀 Performance Considerations

Positive Aspects

Efficient regex compilation: Patterns are compiled once at initialization, not per query
Short-circuit evaluation: Empty pattern list bypasses regex logic entirely
Minimal overhead: Only adds processing when patterns are configured

Minor Optimizations

Regex compilation is done even when literal_bracket_patterns is empty (though this is negligible)
Could add early return if no brackets are detected, but current approach is cleaner

🔒 Security Assessment

No security concerns identified. The implementation:

✅ Uses standard re module with no unsafe operations
✅ Doesn't execute user input or use eval()
✅ Handles regex compilation errors gracefully (Python would raise at initialization)
✅ No injection vulnerabilities in the bracket matching logic

🧪 Test Quality Analysis

Strong Test Coverage

Unit tests: Direct testing of _split_query() method
Integration tests: Full validation flow testing
Configuration tests: YAML loading and model validation
Edge cases: Empty patterns, multiple patterns, no matches

Follows Repository Standards

Real tests, not mocks: Aligns with CLAUDE.md guidance against "cheating" with mocks
Descriptive test names: Clear purpose for each test case
Good fixture usage: Proper setup with temporary directories

Minor Test Enhancement Opportunities

Could add test for invalid regex patterns (though Python handles this)
Could test performance with many patterns (though unlikely to be an issue)

🏗️ Architecture & Design

Excellent Design Choices

Backward compatibility: Default behavior unchanged
Configuration-driven: No code changes needed to use
Single responsibility: _split_query() handles both cases cleanly
Clear separation: Editorial vs. literal brackets concept is well-defined

Implementation Quality

Clean regex logic: The replace_bracket function is elegant
Proper error handling: Graceful handling of empty results
Good performance: Compiled regex patterns with efficient matching

🔧 Recommended Actions

Must Fix

Fix docstring regex pattern: Change r"\\d" to r"\d" in supporting_text_validator.py:309

Should Consider

Clarify README wording: Update "preserved verbatim" to be more precise about normalization
Add example of invalid pattern: Could add a note about regex validation in docs

Nice to Have

Performance note in docs: Mention that patterns are compiled at initialization
Migration examples: Show before/after for projects switching from custom logic

📊 Test Verification Needed

While I couldn't run the tests due to permission restrictions, the test suite appears comprehensive. I recommend running:

just test  # Full test suite
# or specifically:
uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py -v

🎉 Summary

This is a high-quality implementation that:

✅ Solves a real problem (scientific notation in brackets)
✅ Maintains backward compatibility
✅ Includes excellent test coverage
✅ Follows all project conventions
✅ Has comprehensive documentation

The only critical issue is the docstring regex example, which should be fixed before merging. Otherwise, this is production-ready code that adds significant value to the project.

Recommendation: Approve with minor fix - Fix the docstring regex pattern and this will be ready to merge.

The docs recommended `^\S` to "keep tight brackets like [poly(A)+], strip [ editorial ]", but `^\S` matches any bracket content that does not start with a space — including ordinary editorial notes such as `[important]`. Following the example would silently preserve those notes. Use a pattern targeted at the notation instead (`[()+]` for parens/charges), and add a caveat that broad patterns also preserve editorial notes. Add a test pinning the documented behaviour: `[()+]` keeps `[poly(A)+]` while `[important]` is still stripped. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cmungall · 2026-06-12T23:03:57Z

Took over to address the review. Pushed 005441d:

The ^\S example was misleading — it preserves any bracket content not starting with a space, so it would keep ordinary editorial notes like [important], not just [poly(A)+]. Replaced it in the README and editorial-conventions doc with a notation-specific pattern ([()+]) and added a caveat that broad patterns also preserve editorial notes.
Added a test pinning the documented behavior: [()+] keeps [poly(A)+] while [important] is still stripped.

Full non-integration suite (450), doctests, mypy, and ruff all green. The feature mechanism itself is sound — verified that preserving bracket content is what makes normalize_text match (stripping [2Fe-2S] leaves the cluster, which is not a substring of the reference; preserving it yields the 2fe 2s cluster, which is).

claude · 2026-06-12T23:03:58Z

Claude finished @cmungall's task —— View job

Code Review: Configurable Literal Bracket Patterns

Todo List:

Read repository guidelines (CLAUDE.md)
Examine all changed files and implementation
Analyze code quality and best practices
Check for potential bugs or issues
Evaluate performance considerations
Review security concerns
Assess test coverage
Provide detailed feedback and recommendations

🎯 Overall Assessment

This is a well-designed feature that solves a legitimate problem while maintaining backward compatibility. The implementation is clean, well-tested, and follows the project's conventions.

✅ Strengths

1. Code Quality & Best Practices

Excellent adherence to project conventions: Follows TDD principles, includes comprehensive docstrings with examples, and avoids unnecessary try/except blocks per CLAUDE.md
Clean API design: The literal_bracket_patterns field is well-named and intuitive
Good separation of concerns: Logic is properly contained within _split_query() method
Consistent code style: Matches existing patterns in the codebase

2. Test Coverage

Comprehensive test suite: 45 new test lines covering both default and configured behavior
Good edge case coverage: Tests empty queries, multiple patterns, case sensitivity
Integration tests: Validates end-to-end behavior through find_text_in_reference
Backward compatibility verification: Ensures existing behavior is preserved

3. Documentation

Excellent documentation: README and conceptual docs clearly explain the feature
Good examples: Realistic patterns like [2Fe-2S], [Ca2+], [30S]
Clear migration path: Shows how to opt into the new behavior

🔍 Issues Identified

1. Critical Bug in Docstring (Copilot Finding Confirmed)

File: src/linkml_reference_validator/validation/supporting_text_validator.py:309

# Current (incorrect):
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])

# Should be:
>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])

The double-escaped regex r"\\d" matches literal backslash-d (\d), not digits. This makes the docstring example incorrect and misleading.

2. Minor Documentation Ambiguity (Copilot Finding)

File: README.md

The phrase "preserved verbatim during validation" could be misleading since the text still undergoes normalization (lowercasing, punctuation removal, whitespace collapse). Consider rewording to "preserved (not stripped) before normalization and validation" for clarity.

🚀 Performance Considerations

Positive Aspects

Efficient regex compilation: Patterns are compiled once at initialization, not per query
Short-circuit evaluation: Empty pattern list bypasses regex logic entirely
Minimal overhead: Only adds processing when patterns are configured

Minor Optimizations

Regex compilation is done even when literal_bracket_patterns is empty (though this is negligible)
Could add early return if no brackets are detected, but current approach is cleaner

🔒 Security Assessment

No security concerns identified. The implementation:

✅ Uses standard re module with no unsafe operations
✅ Doesn't execute user input or use eval()
✅ Handles regex compilation errors gracefully (Python would raise at initialization)
✅ No injection vulnerabilities in the bracket matching logic

🧪 Test Quality Analysis

Strong Test Coverage

Unit tests: Direct testing of _split_query() method
Integration tests: Full validation flow testing
Configuration tests: YAML loading and model validation
Edge cases: Empty patterns, multiple patterns, no matches

Follows Repository Standards

Real tests, not mocks: Aligns with CLAUDE.md guidance against "cheating" with mocks
Descriptive test names: Clear purpose for each test case
Good fixture usage: Proper setup with temporary directories

Minor Test Enhancement Opportunities

Could add test for invalid regex patterns (though Python handles this)
Could test performance with many patterns (though unlikely to be an issue)

🏗️ Architecture & Design

Excellent Design Choices

Backward compatibility: Default behavior unchanged
Configuration-driven: No code changes needed to use
Single responsibility: _split_query() handles both cases cleanly
Clear separation: Editorial vs. literal brackets concept is well-defined

Implementation Quality

Clean regex logic: The replace_bracket function is elegant
Proper error handling: Graceful handling of empty results
Good performance: Compiled regex patterns with efficient matching

🔧 Recommended Actions

Must Fix

Fix docstring regex pattern: Change r"\\d" to r"\d" in supporting_text_validator.py:309

Should Consider

Clarify README wording: Update "preserved verbatim" to be more precise about normalization
Add example of invalid pattern: Could add a note about regex validation in docs

Nice to Have

Performance note in docs: Mention that patterns are compiled at initialization
Migration examples: Show before/after for projects switching from custom logic

📊 Test Verification Needed

While I couldn't run the tests due to permission restrictions, the test suite appears comprehensive. I recommend running:

just test  # Full test suite
# or specifically:
uv run pytest tests/test_supporting_text_validator.py tests/test_validation_config.py tests/test_models.py -v

🎉 Summary

This is a high-quality implementation that:

✅ Solves a real problem (scientific notation in brackets)
✅ Maintains backward compatibility
✅ Includes excellent test coverage
✅ Follows all project conventions
✅ Has comprehensive documentation

The only critical issue is the docstring regex example, which should be fixed before merging. Otherwise, this is production-ready code that adds significant value to the project.

Recommendation: Approve with minor fix - Fix the docstring regex pattern and this will be ready to merge.

Add literal bracket patterns for supporting text validation

e0f2163

cmungall marked this pull request as ready for review April 4, 2026 00:44

Copilot AI review requested due to automatic review settings April 4, 2026 00:44

Copilot started reviewing on behalf of cmungall April 4, 2026 00:45 View session

Copilot AI reviewed Apr 4, 2026

View reviewed changes

cmungall temporarily deployed to release April 4, 2026 05:21 — with GitHub Actions Inactive

Update README.md

ccf5c05

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

cmungall merged commit d2ebf4e into main Jun 12, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Add configurable literal bracket patterns#44

[codex] Add configurable literal bracket patterns#44
cmungall merged 3 commits into
mainfrom
feat/issue-43

cmungall commented Apr 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

Uh oh!

claude Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

cmungall commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\\d"])
	>>> config = ReferenceValidationConfig(literal_bracket_patterns=[r"\d"])

Conversation

cmungall commented Apr 3, 2026

Summary

Why

Impact

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: Configurable Literal Bracket Patterns

Todo List:

🎯 Overall Assessment

✅ Strengths

1. Code Quality & Best Practices

2. Test Coverage

3. Documentation

🔍 Issues Identified

1. Critical Bug in Docstring (Copilot Finding Confirmed)

2. Minor Documentation Ambiguity (Copilot Finding)

🚀 Performance Considerations

Positive Aspects

Minor Optimizations

🔒 Security Assessment

🧪 Test Quality Analysis

Strong Test Coverage

Follows Repository Standards

Minor Test Enhancement Opportunities

🏗️ Architecture & Design

Excellent Design Choices

Implementation Quality

🔧 Recommended Actions

Must Fix

Should Consider

Nice to Have

📊 Test Verification Needed

🎉 Summary

Uh oh!

cmungall commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: Configurable Literal Bracket Patterns

Todo List:

🎯 Overall Assessment

✅ Strengths

1. Code Quality & Best Practices

2. Test Coverage

3. Documentation

🔍 Issues Identified

1. Critical Bug in Docstring (Copilot Finding Confirmed)

2. Minor Documentation Ambiguity (Copilot Finding)

🚀 Performance Considerations

Positive Aspects

Minor Optimizations

🔒 Security Assessment

🧪 Test Quality Analysis

Strong Test Coverage

Follows Repository Standards

Minor Test Enhancement Opportunities

🏗️ Architecture & Design

Excellent Design Choices

Implementation Quality

🔧 Recommended Actions

Must Fix

Should Consider

Nice to Have

📊 Test Verification Needed

🎉 Summary

Uh oh!

Uh oh!

Reviewers

claude Bot commented Jun 12, 2026 •

edited

Loading

claude Bot commented Jun 12, 2026 •

edited

Loading