Add scan-xml-control-chars admin CLI#1678

Merged
jcschaff merged 1 commit into master from xml-control-char-scanner
May 5, 2026


@jcschaff jcschaff commented May 5, 2026

Summary

  • New `vcell-su scan-xml-control-chars` command (read-only) walks every row in `vc_biomodelxml` and `vc_mathmodelxml`, joining `vc_biomodel`/`vc_mathmodel` and `vc_userinfo` for the model id and owner. Reports each row whose CLOB contains any codepoint rejected by `XmlChars.isValidXml10Char` (C0 controls except TAB/LF/CR, unpaired surrogates, non-character codepoints, and `U+FFFD` per project policy).
  • Output is TSV: `kind, model_id, userid, offset, cp_hex, snippet`. Defaults to first-bad-codepoint per row; `--all-occurrences` enumerates every hit.
  • Hardened against the failure modes of the previous scratch implementation: `PrintWriter` is autoFlush so progress survives interrupts; status goes to stderr (no log4j2 dependency); CLOBs above `--max-clob-mb` (default 64) are skipped with a noted reason; per-batch progress every 500 rows.
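To make the rejection rule in the first bullet concrete, here is a minimal Java sketch of a per-codepoint check and a first-bad-offset scan. It is not the project's code: `XmlCharScanSketch`, `isAllowed`, `firstRejectedOffset`, and `cpHex` are hypothetical names standing in for `XmlChars.isValidXml10Char` and the scanner loop, the `U+XXXX` rendering of `cp_hex` is a guess, and reporting the offset as a UTF-16 char index is an assumption. The categories rejected are the ones listed above.

```java
final class XmlCharScanSketch {

    /** Hypothetical stand-in for XmlChars.isValidXml10Char, built from the PR description. */
    static boolean isAllowed(int cp) {
        if (cp == 0x09 || cp == 0x0A || cp == 0x0D) return true;  // TAB, LF, CR are kept
        if (cp < 0x20) return false;                               // remaining C0 controls
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;            // surrogate seen on its own
        if (cp >= 0xFDD0 && cp <= 0xFDEF) return false;            // non-character block
        if ((cp & 0xFFFE) == 0xFFFE) return false;                 // last two codepoints of each plane
        if (cp == 0xFFFD) return false;                            // replacement char, per project policy
        return cp <= 0x10FFFF;
    }

    /** Returns the UTF-16 index of the first rejected codepoint, or -1 if the text is clean. */
    static int firstRejectedOffset(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            // codePointAt returns the surrogate itself when it is not part of a valid pair
            int cp = Character.codePointAt(text, i);
            if (!isAllowed(cp)) return i;
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Assumed rendering for the cp_hex column; the real format is not stated in the PR. */
    static String cpHex(int cp) {
        return String.format("U+%04X", cp);
    }

    public static void main(String[] args) {
        String sample = "<name>bad" + (char) 0x01 + "value</name>";
        int offset = firstRejectedOffset(sample);
        System.out.println(offset + "\t" + cpHex(sample.codePointAt(offset)));
    }
}
```

Iterating by codepoint rather than by `char` is what lets a valid surrogate pair pass while a lone surrogate is flagged.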

Why

PR #1676 fixed the import-side charset hygiene that was producing the corrupt CLOBs and PR #1677 added input-validation guards so future writes can't reach this state. This PR is the third leg: tell us how widespread the existing corruption is, so we can decide repair-in-place vs. mark-broken without speculation.

Test plan

  • `mvn -pl vcell-admin -am compile` succeeds.
  • `vcell-su scan-xml-control-chars --help` prints expected options.
  • Run against vcell-dev and confirm the output TSV identifies BioModels 311226221 and 311875206 (already known to be corrupt).
  • Spot-check at least one mathmodel hit, if any, to confirm both queries work.

Notes

🤖 Generated with Claude Code

@jcschaff jcschaff force-pushed the xml-control-char-scanner branch from 94069fd to dd7e452 Compare May 5, 2026 11:37
@jcschaff jcschaff force-pushed the xmlchars-validation branch from 6d90ec5 to 788e115 Compare May 5, 2026 12:38
@jcschaff jcschaff force-pushed the xml-control-char-scanner branch from dd7e452 to f3ad9e9 Compare May 5, 2026 12:38
@jcschaff jcschaff force-pushed the xmlchars-validation branch from 788e115 to a0a9105 Compare May 5, 2026 14:06
Read-only scanner that walks vc_biomodelxml.bmxml and vc_mathmodelxml.mmxml,
reports every row with codepoints rejected by XmlChars (C0 controls,
unpaired surrogates, non-character codepoints, U+FFFD per project policy).
Streams CLOBs through a Reader, caps in-memory size (--max-clob-mb), and
uses an autoFlush PrintWriter so output is durable even if the long-running
scan is interrupted. Output is TSV with kind/model_id/userid/offset/cp_hex/
snippet so corrupted models can be triaged for repair.
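
As an illustration of the streaming and durability points in the commit message, the sketch below reads a CLOB-like `Reader` into memory only up to a cap and writes results through an autoFlush `PrintWriter`. It is a sketch under stated assumptions, not the merged implementation: `ClobStreamSketch` and `readCapped` are hypothetical names, the cap is counted in UTF-16 chars rather than megabytes, and a `StringReader` stands in for the reader a JDBC `Clob.getCharacterStream()` call would return.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ClobStreamSketch {

    /**
     * Reads the whole stream, or returns null as soon as it would exceed maxChars.
     * Assumption: the cap is measured in chars; the real --max-clob-mb accounting may differ.
     */
    static String readCapped(Reader reader, long maxChars) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                if (sb.length() + (long) n > maxChars) return null;  // over the cap: caller skips the row
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // autoFlush = true flushes on every println, so rows already written survive an interrupt
        PrintWriter out = new PrintWriter(System.out, true);
        PrintWriter status = new PrintWriter(System.err, true);  // status kept off the TSV stream

        String text = readCapped(new StringReader("<vcml/>"), 64L * 1024 * 1024);
        if (text == null) {
            status.println("skipped: CLOB exceeds cap");
        } else {
            out.println("biomodel\t" + text.length());
        }
    }
}
```

Note that `PrintWriter` autoFlush applies to `println`, `printf`, and `format`, not to plain `print` or `write`, so each TSV row should be emitted with one of those calls.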

Motivated by the two failing biomodels (311226221, 311875206); knowing the
full scope of corruption in the prod DB is a prerequisite for choosing
between repair-in-place and mark-broken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff force-pushed the xml-control-char-scanner branch from f3ad9e9 to 98e0736 Compare May 5, 2026 14:08
Base automatically changed from xmlchars-validation to master May 5, 2026 15:06
@jcschaff jcschaff merged commit 28763ec into master May 5, 2026
9 checks passed
@jcschaff jcschaff deleted the xml-control-char-scanner branch May 5, 2026 15:07