Validate XML chars in name and sbmlName setters#1677
Merged
Conversation
This was referenced May 5, 2026
6d90ec5 to
788e115
Compare
Centralizes XML 1.0 character validation rules, plus project policy hard-rejecting U+FFFD (almost always charset corruption). Two modes: name (forbids whitespace) and attribute-content (allows TAB/LF/CR). Motivated by two stored BioModels (311226221, 311875206) whose cached VCML contained C0 control chars in reaction-name attributes and could no longer be parsed. The helper itself is a defensive primitive; this commit adds only the helper + tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire XmlChars.requireValidAttributeContent into TokenMangler.checkNameProperty
(used by BioModel, MathModel, Simulation, SimulationContext, Structure name
vetos) and into ReactionStep.vetoableChange("name"). Allows whitespace but
rejects C0 control chars and U+FFFD, so that bad chars from external
sources can't reach the VCML attribute payload via setName.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add SpeciesContext.fixAndValidateSbmlName as the single chokepoint for all 9 setSbmlName methods (SpeciesContext, Structure, ReactionStep, BioModel, Model.GlobalParameter, AssignmentRule, RateRule, BioEvent, plus existing fixSbmlName fix-only path). The helper rejects C0 control chars and U+FFFD via XmlChars.requireValidAttributeContent, throwing PropertyVetoException so existing UI catch blocks render a friendly error instead of a stack trace. This closes the SBML import → setSbmlName → cached VCML attribute path that produced the un-parseable cached XML observed in BioModels 311226221 and 311875206. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Simulation Name (and matching Version Name) contained two U+FFFD replacement characters where 'µ' (micron) used to be — a charset corruption baked into the fixture: "Figure 2 is 4� micrometer radius". The new TokenMangler validation correctly rejects U+FFFD as invalid attribute content, breaking SEDMLExporterSBMLTest test case 91 on this branch. The pre-existing test only worked because TokenMangler.fixTokenStrict silently mangled U+FFFD into '_' before SBML export. Replacing the bad char with the actual 'µ' unicode letter doesn't help — fixTokenStrict preserves Unicode letters, so the resulting SBML SId becomes invalid. Substituting plain ASCII 'u' is round-trip-safe and self-explanatory: "Figure 2 is 4u micrometer radius". Verified via mvn -pl vcell-core -Dtest=SEDMLExporterSBMLTest test restricted to this single fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
788e115 to
a0a9105
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
XmlCharshelper invcell-utilenforces XML 1.0 char rules plus a project policy banningU+FFFD(the replacement char, almost always evidence of upstream charset corruption); separate "name" and "attribute-content" modes for whitespace handling.TokenMangler.checkNamePropertyandReactionStep.vetoableChange("name")reject control chars andU+FFFDin entity names; existing whitespace-bearing biomodel/application/simulation names continue to work.SpeciesContext.fixAndValidateSbmlNameis the single chokepoint for all 9setSbmlNamemethods (SpeciesContext,Structure,ReactionStep,BioModel,Model.GlobalParameter,AssignmentRule,RateRule,BioEvent); throwsPropertyVetoExceptionon bad chars so existing UI catch sites still surface a friendly dialog.Why
Two stored BioModels (311226221, 311875206) had cached VCML containing C0 control chars (
0x13,0x1C) andU+FFFDinside reactionName/SbmlNameattributes; SAX rejected the cached XML on load. This PR closes the write-side leak so future imports/edits cannot serialize bad chars in the first place. PR #1676 (charset hygiene) addresses the upstream cause; this PR is the defense-in-depth follow-up.Test plan
mvn -pl vcell-util -Dtest=XmlCharsTest test— 17/17 pass, including real-world bad patterns observed in the corrupt biomodels (U+FFFD + 0x1C, lone0x13).mvn -pl vcell-core -Dgroups=Fast test— 418 run, 0 failures, 0 new errors (1 pre-existing unrelatedVCellDataTest.test_3Dfailure due to missingvtkmodulesPython module).🤖 Generated with Claude Code