
Fix Oracle CLOB read: charset-aware character stream #1679

Merged
jcschaff merged 1 commit into master from
fix-clob-charset-read
May 5, 2026

jcschaff commented May 5, 2026

Summary

  • DbDriver.getLOB Oracle branch was reading every CLOB through clob.getAsciiStream() into a byte[], then converting that byte buffer to a String via the JVM's platform-default charset. Both steps are lossy:
    • getAsciiStream coerces each character into a single byte (truncating to the low 8 bits); a stored – (en-dash, U+2013) becomes 0x13, a control character that is invalid in XML 1.0. A stored μ (U+00B5) becomes 0xB5, which the new String(byte[]) step then decodes differently depending on the platform charset.
    • new String(byte[]) uses Charset.defaultCharset() (Cp1252 on legacy Windows, UTF-8 on modern macOS/Linux), so the decoded result was platform-dependent for any non-ASCII byte that survived step 1.
  • Replaced with clob.getCharacterStream() reading into a char[] sized by clob.length() (chars, per JDBC spec). One-method change, Oracle branch only; Postgres branch already used rs.getString and was correct.
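
A minimal sketch of the character-stream read described above. This is not the actual DbDriver.getLOB code: the class name ClobReadSketch, the method names, and the read loop are assumptions for illustration, and the roundTrip helper uses the JDK's in-memory SerialClob in place of an Oracle CLOB so the example runs without a database.

```java
import java.io.IOException;
import java.io.Reader;
import java.sql.Clob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialClob;

// Hypothetical illustration class; not part of the VCell code base.
final class ClobReadSketch {

    /** Reads the whole CLOB as characters, so no byte-level charset decoding is involved. */
    static String readClob(Clob clob) throws SQLException, IOException {
        long length = clob.length(); // length in characters, not bytes
        if (length > Integer.MAX_VALUE) {
            throw new SQLException("CLOB too large for a single char[]: " + length);
        }
        char[] buffer = new char[(int) length];
        int filled = 0;
        try (Reader reader = clob.getCharacterStream()) {
            // read() may return fewer chars than requested, so loop until full or end of stream
            while (filled < buffer.length) {
                int n = reader.read(buffer, filled, buffer.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return new String(buffer, 0, filled);
    }

    /** Demo helper (assumption): round-trips a string through an in-memory SerialClob. */
    static String roundTrip(String text) {
        try {
            return readClob(new SerialClob(text.toCharArray()));
        } catch (SQLException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling ClobReadSketch.roundTrip("A\u2013B") returns the string unchanged, en-dash included, because the characters never pass through a byte[] or a default charset.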

Why

The two biomodels (311226221 / shiVcell, 311875206 / mblinov) that have been failing to load in VCellClientMain from the database both contain en-dash characters in reaction names. The stored CLOBs are honest — a direct clob.getCharacterStream() scan of all 116k biomodel + mathmodel rows finds zero invalid XML chars. The corruption was being injected on every read by the buggy getLOB, and SAX then rejected the result on the client. PR #1676 (charset hygiene on import paths) and PR #1677 (input validation on setSbmlName/setName) were both addressing the consequences; this PR fixes the actual cause.
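
The kind of check behind "zero invalid XML chars" can be sketched as follows. This is an assumed illustration, not the scan that was actually run: the class and method names are hypothetical, and the ranges follow the Char production of the XML 1.0 specification.

```java
// Hypothetical illustration class; not part of the VCell code base.
final class XmlCharScan {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first invalid character, or -1 if the text is clean. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return -1;
    }
}
```

An en-dash (U+2013) passes this check, while the 0x13 and 0x1C values produced by the lossy read do not, which matches the SAX rejections seen on the client.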

Test plan

  • Build succeeds: mvn -pl vcell-server,vcell-admin -am -DskipTests install
  • Local repro:
    • Before: serverDocumentManager.getBioModelXML(qh, mblinov, 311875206, false) → SAX "invalid XML char Unicode 0x13 in attribute Name"
    • After: same call → XMLToBioModel succeeds, updateAll(true) succeeds, returned XML preserves the original 12 en-dashes (0xE2 0x80 0x93) and contains zero invalid bytes
  • Same for 311226221 (shiVcell).
  • Run SEDML/SBML integration tests on this branch to surface anything that was being masked by lossy ASCII reads (other tables read via getLOB: kinetics, geometry curves, math descriptions, analysis tasks).
  • After merge: previously-broken BioModels load in the GUI without DB repair.

Notes

🤖 Generated with Claude Code

DbDriver.getLOB's Oracle branch was calling clob.getAsciiStream().read(byte[])
and then constructing a String via the JVM's platform-default charset. Both
steps lose information for any non-ASCII content:

  - getAsciiStream coerces each char into a single byte; for U+2013 (en-dash)
    that low byte is 0x13, an invalid XML 1.0 character. The same mechanism
    has produced 0x1C control chars and apparent U+FFFD sequences in
    BioModel CLOBs read by the GUI client (e.g. biomodels 311226221 and
    311875206 from users shiVcell / mblinov).
  - new String(byte[]) decodes via Charset.defaultCharset(), so the same
    bytes decode as Cp1252 on Windows but as UTF-8 on Linux/macOS.
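
The two lossy steps can be reproduced without a database. The sketch below only simulates the truncation as this message describes it (keeping the low 8 bits of each char); it does not call the Oracle driver. The class and method names are hypothetical, and ISO-8859-1 stands in for a single-byte platform charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; simulates the behavior described above.
final class LossyReadDemo {

    /** Step 1 (simulated): keep only the low 8 bits of each UTF-16 char. */
    static byte[] lowBytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) text.charAt(i); // narrowing cast drops the high byte
        }
        return out;
    }

    /** Step 2: decode the bytes with an explicit charset standing in for the platform default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }
}
```

With this sketch, lowBytes("\u2013") yields the single byte 0x13. The byte 0xB5 left over from U+00B5 decodes back to U+00B5 under ISO-8859-1 (Cp1252 maps 0xB5 the same way) but to U+FFFD under UTF-8, since a lone 0xB5 is not a valid UTF-8 sequence.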

Replace both with clob.getCharacterStream() and a char[] buffer sized by
clob.length() (which is in characters per the JDBC spec). Net effect: every
caller of getLOB on Oracle now sees the actual stored Unicode, and CLOBs
with multi-byte UTF-8 sequences (en-dashes, μ, etc.) load correctly.

Verified locally against the two failing biomodels: previously SAX rejected
"Unicode 0x13 in attribute Name" / "0x1C in SbmlName" during
ServerDocumentManager.getBioModelXML; with the fix, both rehydrate, build
full BioModel objects, and pass updateAll(true). The stored CLOBs were never
corrupt; the read path was.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff merged commit e4d0421 into master May 5, 2026
13 checks passed
jcschaff deleted the fix-clob-charset-read branch May 5, 2026 12:36