Fix Oracle CLOB read: charset-aware character stream by jcschaff · Pull Request #1679 · virtualcell/vcell

jcschaff · 2026-05-05T12:20:57Z

Summary

The Oracle branch of DbDriver.getLOB was reading every CLOB through clob.getAsciiStream() into a byte[], then converting that byte buffer to a String via the JVM's platform-default charset. Both steps are lossy:
- getAsciiStream coerces each character into a single byte (truncating to the low 8 bits); a stored – (en-dash, U+2013) becomes 0x13, an invalid XML 1.0 control character. A stored μ (U+00B5) becomes 0xB5, which is then mis-decoded in the new String(byte[]) step.
- new String(byte[]) uses Charset.defaultCharset() (Cp1252 on legacy Windows, UTF-8 on modern macOS/Linux), so the decoded result was platform-dependent for any non-ASCII byte that survived step 1.
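To make the two lossy steps concrete, here is a minimal sketch that mimics them without touching the Oracle driver. It assumes the truncation behaves like a plain narrowing cast of each UTF-16 code unit, as described above; the class and method names (LossyReadDemo, truncateToLowBytes, hex) are hypothetical and not part of VCell.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyReadDemo {
    /** Assumption: mimics the described truncation by keeping only the low 8 bits of each char. */
    static byte[] truncateToLowBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i); // narrowing cast drops the high byte
        }
        return out;
    }

    /** Formats each UTF-16 code unit as U+XXXX for inspection. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // En-dash (U+2013) followed by U+00B5.
        byte[] bytes = truncateToLowBytes("\u2013\u00B5");

        // Step 2 of the old path, with the charset made explicit instead of platform-default.
        String asCp1252 = new String(bytes, Charset.forName("windows-1252"));
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(hex(asCp1252)); // U+0013 U+00B5
        System.out.println(hex(asUtf8));   // U+0013 U+FFFD
    }
}
```

In both decodings the en-dash has already collapsed to the control character U+0013; the second character only survives when the decoding charset happens to be a single-byte one.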
Replaced with clob.getCharacterStream() reading into a char[] sized by clob.length() (chars, per JDBC spec). One-method change, Oracle branch only; Postgres branch already used rs.getString and was correct.
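The shape of the replacement read can be sketched roughly as follows. This is not the actual DbDriver.getLOB code: ClobReadSketch, readClob and roundTrip are hypothetical names, and the JDK's SerialClob stands in for an Oracle CLOB so the sketch runs without a database.

```java
import java.io.IOException;
import java.io.Reader;
import java.sql.Clob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialClob;

class ClobReadSketch {
    /** Reads the whole CLOB as characters; no byte[] and no charset decoding involved. */
    static String readClob(Clob clob) throws SQLException, IOException {
        long length = clob.length(); // number of characters, per the JDBC Clob contract
        if (length > Integer.MAX_VALUE) {
            throw new IOException("CLOB too large for a single char[]: " + length);
        }
        char[] buffer = new char[(int) length];
        try (Reader reader = clob.getCharacterStream()) {
            int filled = 0;
            // A single read() may return fewer chars than requested, so loop until full or EOF.
            while (filled < buffer.length) {
                int n = reader.read(buffer, filled, buffer.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return new String(buffer, 0, filled);
        }
    }

    /** Demo helper (assumption: SerialClob as an in-memory stand-in for a database CLOB). */
    static String roundTrip(String text) {
        try {
            return readClob(new SerialClob(text.toCharArray()));
        } catch (SQLException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String stored = "k\u2013on [\u00B5M]";
        System.out.println(stored.equals(roundTrip(stored))); // true
    }
}
```

The loop around read() is a defensive choice: the Reader contract allows short reads, so a single call is not guaranteed to fill the buffer.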

Why

The two biomodels that have been failing to load from the database in VCellClientMain (311226221 / shiVcell and 311875206 / mblinov) both contain en-dash characters in reaction names. The stored CLOBs are intact: a direct clob.getCharacterStream() scan of all 116k biomodel + mathmodel rows finds zero invalid XML chars. The corruption was injected on every read by the buggy getLOB, and SAX then rejected the result on the client. PR #1676 (charset hygiene on import paths) and PR #1677 (input validation on setSbmlName/setName) both addressed the consequences; this PR fixes the actual cause.
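For reference, a scan for invalid XML chars like the one mentioned above could look roughly like this. It is only a sketch built on the Char production of the XML 1.0 specification; XmlCharScan and its methods are hypothetical names, and this is not the scan actually run against the database nor the admin CLI from PR #1678.

```java
class XmlCharScan {
    /** XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the index of the first invalid char, or -1 if the text is clean. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // an unpaired surrogate comes back as itself and is rejected
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("k\u2013on")); // -1: a real en-dash is valid XML
        System.out.println(firstInvalidIndex("k\u0013on")); // 1: the truncated en-dash is not
    }
}
```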

Test plan

Build succeeds: mvn -pl vcell-server,vcell-admin -am -DskipTests install
Local repro:
- Before: serverDocumentManager.getBioModelXML(qh, mblinov, 311875206, false) → SAX "invalid XML char Unicode 0x13 in attribute Name"
- After: same call → XMLToBioModel succeeds, updateAll(true) succeeds, returned XML preserves the original 12 en-dashes (0xE2 0x80 0x93) and contains zero invalid bytes
Same for 311226221 (shiVcell).
Run SEDML/SBML integration tests on this branch to surface anything that was being masked by lossy ASCII reads (other tables read via getLOB: kinetics, geometry curves, math descriptions, analysis tasks).
After merge: previously-broken BioModels load in the GUI without DB repair.
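The en-dash check in the local repro above can be approximated with a small helper. This is a sketch only; EnDashCheck and its methods are hypothetical and the actual verification may have been done differently.

```java
import java.nio.charset.StandardCharsets;

class EnDashCheck {
    /** Counts en-dash (U+2013) characters in the returned XML string. */
    static int countEnDashes(String xml) {
        int count = 0;
        for (int i = 0; i < xml.length(); i++) {
            if (xml.charAt(i) == '\u2013') {
                count++;
            }
        }
        return count;
    }

    /** Renders the UTF-8 encoding of a string as space-separated hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u2013"));                 // 0xE2 0x80 0x93
        System.out.println(countEnDashes("a\u2013b c\u2013d")); // 2
    }
}
```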

Notes

All other getLOB callers (MathDescTable, GeomDbDriver, SimulationContextDbDriver, generic clob_text) get the same fix automatically — any non-ASCII char they were silently mangling is now preserved.
Independent of PRs #1676 (Use UTF-8 explicitly when reading SBML/SED-ML XML), #1677 (Validate XML chars in name and sbmlName setters), and #1678 (Add scan-xml-control-chars admin CLI). Those remain valuable as defense-in-depth (catching corruption at write time) but are no longer load-bearing for this particular family of failures.

🤖 Generated with Claude Code

DbDriver.getLOB's Oracle branch was calling clob.getAsciiStream().read(byte[]) and then constructing a String via the JVM's platform-default charset. Both steps lose information for any non-ASCII content:
- getAsciiStream coerces each char into a single byte; for U+2013 (en-dash) that low byte is 0x13, an invalid XML 1.0 character. The same mechanism has produced 0x1C control chars and apparent U+FFFD sequences in BioModel CLOBs read by the GUI client (e.g. biomodels 311226221 and 311875206 from users shiVcell / mblinov).
- new String(byte[]) decodes via Charset.defaultCharset(), giving Cp1252-on-Windows vs. UTF-8-on-Linux/macOS round-tripping.

Replace both with clob.getCharacterStream() and a char[] buffer sized by clob.length() (which is in characters per the JDBC spec).

Net effect: every caller of getLOB on Oracle now sees the actual stored Unicode, and CLOBs with multi-byte UTF-8 sequences (en-dashes, μ, etc.) load correctly.

Verified locally against the two failing biomodels: previously SAX rejected "Unicode 0x13 in attribute Name" / "0x1C in SbmlName" during ServerDocumentManager.getBioModelXML; with the fix, both rehydrate, build full BioModel objects, and pass updateAll(true). The stored CLOBs were never corrupt; the read path was.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jcschaff merged commit e4d0421 into master May 5, 2026
13 checks passed

jcschaff deleted the fix-clob-charset-read branch May 5, 2026 12:36

jcschaff merged 1 commit into master from fix-clob-charset-read
