Skip to content

Use UTF-8 explicitly when reading SBML/SED-ML XML#1676

Merged
jcschaff merged 3 commits intomasterfrom
charset-hygiene-sbml-sedml-import
May 5, 2026
Merged

Use UTF-8 explicitly when reading SBML/SED-ML XML#1676
jcschaff merged 3 commits intomasterfrom
charset-hygiene-sbml-sedml-import

Conversation

@jcschaff
Copy link
Copy Markdown
Member

@jcschaff jcschaff commented May 5, 2026

Summary

Six call sites in the SBML and SED-ML import paths were reading XML input via Charset.defaultCharset() (or naked .getBytes() / no-charset InputStreamReader), using the JVM platform default encoding. On a non-UTF-8 default JVM (e.g. Cp1252 on legacy Windows), a UTF-8 SBML file with non-ASCII chars (Greek letters, en-dashes, etc.) gets mojibake'd before the XML parser sees it.

XML 1.0 says: in absence of an <?xml encoding=…?> declaration the parser MUST treat input as UTF-8 (or detect via BOM). We were using the platform's preferred charset instead.

This is a likely contributor to the corrupted reaction-name characters observed in BioModels 311226221 and 311875206 (where 14-3-3 appears as 14<U+FFFD><FS>3<U+FFFD><FS>3 or 14<DC3>3<DC3>3 in the cached VCML).

Files changed

  • SBMLImporter.readSbmlDocument(File) — line-based read of SBML file
  • SedMLImporter.java — in-memory SBML String → InputStream for re-import
  • jlibsedml.Libsedml.readDocument(InputStream, encoding) — null-encoding fallback was platform default
  • jlibsedml.XpathGeneratorHelper — model String → bytes → XML parse
  • vcell.sybil.util.xml.DOMUtil.parse(String) — String → bytes → XML parse
  • jlibsedml.modelsupport.KisaoTermParser — bundled .obo resource read

Plus a new Fast unit test (SBMLImportCharsetTest) that imports a minimal UTF-8 SBML L2V4 doc with U+2013 EN DASH and U+03BC GREEK MU in reaction/species name attributes and asserts they round-trip into BioModel.getSbmlName() byte-for-byte. Exercises both the File and InputStream import paths.

Out of scope (follow-ups)

  • The render-namespace string-munge workaround in SBMLImporter (lines 1936–1948) is unchanged. Whether it's still required against current jsbml is a separate question, gated on a corpus-driven test using the sys-bio/temp-biomodels SBML/OMEX archive.
  • A CI job that forks a JVM with -Dfile.encoding=Cp1252 to demonstrate the fix matters cross-platform.
  • Symmetric output-side hygiene (e.g. SEDMLExporter.java:196 writes VCML with bUseUTF8=false for the OMEX-archived file).
  • Input-validation guards in setters and serializer (planned PR 2).

Test plan

  • mvn -pl vcell-core -am compile test-compile -DskipTests clean
  • mvn -pl vcell-core test -Dtest='SBMLImportCharsetTest' — 2/2 pass
  • mvn -pl vcell-core test -Dgroups='Fast' -Dtest='*SBML*Test*,*Sedml*Test*,*Sbml*Test*,*SEDML*Test*' — 13/13 pass, 0 failures, 1 skipped

🤖 Generated with Claude Code

Comment thread vcell-core/src/main/java/org/vcell/sybil/util/xml/DOMUtil.java Fixed
jcschaff and others added 2 commits May 5, 2026 08:43
Six call sites in the SBML and SED-ML import paths were reading XML
input via Charset.defaultCharset() (or naked .getBytes() / no-charset
InputStreamReader), which uses the JVM platform default encoding.
On a non-UTF-8 default JVM (e.g. Cp1252 on legacy Windows), a UTF-8
SBML file with non-ASCII chars (Greek letters, en-dashes, etc.) gets
mojibake'd before the XML parser sees it.

XML 1.0 says: in absence of an <?xml encoding=...?> declaration the
parser MUST treat input as UTF-8 (or detect via BOM). We were using
the platform's preferred charset instead.

This is a likely contributor to the corrupted reaction-name characters
observed in BioModels 311226221 and 311875206 (where '14-3-3' appears
as '14<U+FFFD><FS>3<U+FFFD><FS>3' or '14<DC3>3<DC3>3' in the cached
VCML).

Fixed sites:
- SBMLImporter.readSbmlDocument(File): line-based read of SBML file
- SedMLImporter: in-memory SBML String -> InputStream for re-import
- jlibsedml.Libsedml.readDocument(InputStream, encoding): null-encoding
  fallback was platform default
- jlibsedml.XpathGeneratorHelper: model String -> bytes -> XML parse
- vcell.sybil.util.xml.DOMUtil.parse(String): String -> bytes -> XML parse
- jlibsedml.modelsupport.KisaoTermParser: bundled .obo resource read

The render-namespace string-munge workaround in SBMLImporter (lines
1936-1948) is unchanged in this commit. Whether it is still required
against current jsbml is a separate question, gated on a corpus-driven
test using the sys-bio/temp-biomodels SBML/OMEX archive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Imports a minimal SBML L2V4 document containing U+2013 EN DASH and
U+03BC GREEK SMALL LETTER MU in reaction and species name attributes,
and asserts the resulting BioModel preserves the chars byte-for-byte
in getSbmlName().

Tagged Fast. Exercises both the File path (readSbmlDocument(File),
which used to read with Charset.defaultCharset()) and the InputStream
path. On a UTF-8-default JVM both paths look equivalent, but the test
documents expected behavior and catches regressions if either path
is changed back to platform-default decoding.

A complementary CI job that forks a Cp1252-default JVM is the
follow-up that demonstrates the fix matters cross-platform.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff force-pushed the charset-hygiene-sbml-sedml-import branch from b782383 to a1b6e34 Compare May 5, 2026 12:45
CodeQL flagged the existing DOMUtil.parse(String) call site for CWE-611
(XXE) when the prior commit changed the line. Caller is the Pathway
Commons HTTP error response parser, which has no legitimate need for
DTDs or external entity expansion. Apply the OWASP-recommended JAXP
hardening features at builder init time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff merged commit 464cb68 into master May 5, 2026
13 checks passed
@jcschaff jcschaff deleted the charset-hygiene-sbml-sedml-import branch May 5, 2026 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants