Use UTF-8 explicitly when reading SBML/SED-ML XML #1676
Merged
This was referenced May 5, 2026
Six call sites in the SBML and SED-ML import paths were reading XML input via Charset.defaultCharset() (or naked .getBytes() / no-charset InputStreamReader), which uses the JVM platform default encoding. On a non-UTF-8 default JVM (e.g. Cp1252 on legacy Windows), a UTF-8 SBML file with non-ASCII chars (Greek letters, en-dashes, etc.) gets mojibake'd before the XML parser sees it.

XML 1.0 says: in absence of an <?xml encoding=...?> declaration the parser MUST treat input as UTF-8 (or detect via BOM). We were using the platform's preferred charset instead.

This is a likely contributor to the corrupted reaction-name characters observed in BioModels 311226221 and 311875206 (where '14-3-3' appears as '14<U+FFFD><FS>3<U+FFFD><FS>3' or '14<DC3>3<DC3>3' in the cached VCML).

Fixed sites:
- SBMLImporter.readSbmlDocument(File): line-based read of SBML file
- SedMLImporter: in-memory SBML String -> InputStream for re-import
- jlibsedml.Libsedml.readDocument(InputStream, encoding): null-encoding fallback was platform default
- jlibsedml.XpathGeneratorHelper: model String -> bytes -> XML parse
- vcell.sybil.util.xml.DOMUtil.parse(String): String -> bytes -> XML parse
- jlibsedml.modelsupport.KisaoTermParser: bundled .obo resource read

The render-namespace string-munge workaround in SBMLImporter (lines 1936-1948) is unchanged in this commit. Whether it is still required against current jsbml is a separate question, gated on a corpus-driven test using the sys-bio/temp-biomodels SBML/OMEX archive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
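The failure mode described in this commit message can be reproduced with the JDK alone. The following is a minimal sketch, not code from this PR: the class name CharsetMojibakeDemo and its decode helper are hypothetical, and windows-1252 stands in for the Cp1252 platform default mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not exist in VCell or jlibsedml.
final class CharsetMojibakeDemo {

    // Decodes raw bytes with a given charset, which is what a Reader or
    // new String(bytes) does implicitly with the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "14", EN DASH, "3", EN DASH, "3", space, GREEK SMALL LETTER MU
        String original = "14\u20133\u20133 \u03bc";

        // The bytes as they would sit in a UTF-8 encoded SBML file.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8 decoding preserves the text on any platform.
        String good = decode(utf8Bytes, StandardCharsets.UTF_8);

        // Decoding as Cp1252 (simulating a legacy Windows default)
        // turns each multi-byte sequence into several wrong characters.
        String bad = decode(utf8Bytes, Charset.forName("windows-1252"));

        System.out.println(good.equals(original)); // true
        System.out.println(bad.equals(original));  // false
    }
}
```

The same reasoning applies in the other direction: a String turned into bytes with a bare getBytes() is encoded with the platform default, so the String -> bytes -> XML parse sites need getBytes(StandardCharsets.UTF_8) to hand the parser what it expects.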
Imports a minimal SBML L2V4 document containing U+2013 EN DASH and U+03BC GREEK SMALL LETTER MU in reaction and species name attributes, and asserts the resulting BioModel preserves the chars byte-for-byte in getSbmlName(). Tagged Fast.

Exercises both the File path (readSbmlDocument(File), which used to read with Charset.defaultCharset()) and the InputStream path. On a UTF-8-default JVM both paths look equivalent, but the test documents expected behavior and catches regressions if either path is changed back to platform-default decoding.

A complementary CI job that forks a Cp1252-default JVM is the follow-up that demonstrates the fix matters cross-platform.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
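The shape of such a round-trip check can be illustrated without VCell on the classpath. The sketch below is not SBMLImportCharsetTest: it is a JDK-only analogue that assumes a toy document and uses the standard DOM parser in place of SBMLImporter, with a hypothetical class name Utf8RoundTripSketch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical JDK-only analogue of the File and InputStream import paths.
final class Utf8RoundTripSketch {

    // EN DASH (U+2013) and GREEK SMALL LETTER MU (U+03BC) in a name attribute.
    static final String NAME = "14\u20133\u20133 \u03bc";

    // No encoding declaration, so the parser must assume UTF-8.
    static final String XML =
            "<?xml version=\"1.0\"?><model><reaction name=\"" + NAME + "\"/></model>";

    static String reactionName(Document doc) {
        Element reaction = (Element) doc.getElementsByTagName("reaction").item(0);
        return reaction.getAttribute("name");
    }

    // File path: write UTF-8 bytes to disk and let the parser read the bytes.
    static String viaFile() throws Exception {
        Path tmp = Files.createTempFile("utf8-roundtrip", ".xml");
        try {
            Files.write(tmp, XML.getBytes(StandardCharsets.UTF_8));
            return reactionName(DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(tmp.toFile()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    // InputStream path: hand the parser UTF-8 bytes from memory.
    static String viaStream() throws Exception {
        try (InputStream in = new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8))) {
            return reactionName(DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in));
        }
    }
}
```

Both helpers give the parser raw bytes rather than pre-decoded text, so the result should not depend on the JVM default charset. A real regression test would assert that each path returns NAME unchanged.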
b782383 to a1b6e34
CodeQL flagged the existing DOMUtil.parse(String) call site for CWE-611 (XXE) when the prior commit changed the line. Caller is the Pathway Commons HTTP error response parser, which has no legitimate need for DTDs or external entity expansion. Apply the OWASP-recommended JAXP hardening features at builder init time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
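For readers unfamiliar with that hardening, here is a minimal sketch of what it typically looks like on a JAXP DocumentBuilderFactory. It is not the actual DOMUtil code: the class HardenedDomParser and its methods are hypothetical, and the exact set of features applied in the commit may differ from the ones shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical helper; illustrates the pattern, not the real DOMUtil.
final class HardenedDomParser {

    static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Reject any DOCTYPE outright; this alone blocks most XXE vectors.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Defense in depth in case DOCTYPEs are ever re-enabled.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder();
    }

    // String -> UTF-8 bytes -> parse, with the encoding made explicit.
    static Document parse(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return newHardenedBuilder().parse(new ByteArrayInputStream(bytes));
    }
}
```

With these settings, a plain error document parses as before, while input that carries a DOCTYPE is rejected with a SAXException instead of being expanded.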
Summary
Six call sites in the SBML and SED-ML import paths were reading XML input via Charset.defaultCharset() (or naked .getBytes() / no-charset InputStreamReader), using the JVM platform default encoding. On a non-UTF-8 default JVM (e.g. Cp1252 on legacy Windows), a UTF-8 SBML file with non-ASCII chars (Greek letters, en-dashes, etc.) gets mojibake'd before the XML parser sees it.

XML 1.0 says: in absence of an <?xml encoding=…?> declaration the parser MUST treat input as UTF-8 (or detect via BOM). We were using the platform's preferred charset instead.

This is a likely contributor to the corrupted reaction-name characters observed in BioModels 311226221 and 311875206 (where 14-3-3 appears as 14<U+FFFD><FS>3<U+FFFD><FS>3 or 14<DC3>3<DC3>3 in the cached VCML).

Files changed
- SBMLImporter.readSbmlDocument(File): line-based read of SBML file
- SedMLImporter.java: in-memory SBML String → InputStream for re-import
- jlibsedml.Libsedml.readDocument(InputStream, encoding): null-encoding fallback was platform default
- jlibsedml.XpathGeneratorHelper: model String → bytes → XML parse
- vcell.sybil.util.xml.DOMUtil.parse(String): String → bytes → XML parse
- jlibsedml.modelsupport.KisaoTermParser: bundled .obo resource read

Plus a new Fast unit test (SBMLImportCharsetTest) that imports a minimal UTF-8 SBML L2V4 doc with U+2013 EN DASH and U+03BC GREEK MU in reaction/species name attributes and asserts they round-trip into BioModel.getSbmlName() byte-for-byte. Exercises both the File and InputStream import paths.

Out of scope (follow-ups)
- The render-namespace string-munge workaround in SBMLImporter (lines 1936–1948) is unchanged. Whether it's still required against current jsbml is a separate question, gated on a corpus-driven test using the sys-bio/temp-biomodels SBML/OMEX archive.
- A CI job that forks a JVM with -Dfile.encoding=Cp1252 to demonstrate the fix matters cross-platform.
- SEDMLExporter.java:196 writes VCML with bUseUTF8=false for the OMEX-archived file.

Test plan
- mvn -pl vcell-core -am compile test-compile -DskipTests: clean
- mvn -pl vcell-core test -Dtest='SBMLImportCharsetTest': 2/2 pass
- mvn -pl vcell-core test -Dgroups='Fast' -Dtest='*SBML*Test*,*Sedml*Test*,*Sbml*Test*,*SEDML*Test*': 13/13 pass, 0 failures, 1 skipped

🤖 Generated with Claude Code
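The null-encoding fallback listed for Libsedml.readDocument(InputStream, encoding) can be pictured as follows. This is an illustrative sketch under assumptions, not the jlibsedml implementation: the class EncodingFallbackSketch and the helpers resolve and readAll are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing a UTF-8 fallback for a nullable encoding name.
final class EncodingFallbackSketch {

    // A null encoding resolves to UTF-8 rather than Charset.defaultCharset().
    static Charset resolve(String encoding) {
        return encoding == null ? StandardCharsets.UTF_8 : Charset.forName(encoding);
    }

    // Reads the whole stream as text using the resolved charset.
    static String readAll(InputStream in, String encoding) throws IOException {
        try (Reader reader = new InputStreamReader(in, resolve(encoding))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        }
    }
}
```

Callers that know the encoding can still pass it by name; only the unspecified case changes, which keeps the behavior identical on UTF-8-default JVMs.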