Bulk-scan single-line string bodies by tfoutrein · Pull Request #491 · python-poetry/tomlkit

tfoutrein · 2026-06-05T17:22:30Z

Stacked on #489 and #490 — the first two commits here are those PRs (index-based Source + bulk run scanning). Best reviewed/merged after them; GitHub will collapse this to just the string-scan commit once they land. Happy to rebase.

What

Parsing a single-line string appended its body one character at a time (value += current; inc()). For long string values (the bulk of most config / lockfile content) this loop dominates parse time.

This scans the run of ordinary characters up to the next delimiter, backslash or control character in a single pass (Source.advance_until) and appends the whole slice at once; the stop character is then handled by the existing branch on the next iteration. Multiline strings keep the per-character loop (CRLF handling).

The stop-set is exactly the control characters the per-character loop rejects, so InvalidControlChar / escape / delimiter handling is unchanged, and a mid-string EOF raises UnexpectedEofError just as the per-char inc(exception=...) did.
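The scan described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `scan_basic_string`, `STOP`, `SIMPLE_ESCAPES` and the two exception classes are hypothetical stand-ins, only a handful of escapes are handled, and the stop-set assumes the TOML rule that control characters other than tab are not allowed in a basic string.

```python
# Hypothetical sketch; names and structure are illustrative, not tomlkit's code.


class UnexpectedEof(Exception):
    """Stand-in for the parser's end-of-input error."""


class InvalidControlChar(Exception):
    """Stand-in for the parser's control-character error."""


# Control characters assumed to be rejected inside a single-line string:
# 0x00-0x1F except tab, plus DEL (0x7F).
CONTROL = frozenset(chr(c) for c in range(0x20) if c != 0x09) | {"\x7f"}
# An ordinary run ends at the delimiter, a backslash, or a control character.
STOP = CONTROL | {'"', "\\"}
# Only a few escapes are supported in this sketch.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def scan_basic_string(text, idx):
    """Parse a basic string body starting just after the opening quote.

    Returns (decoded value, index just past the closing quote).
    """
    parts = []
    n = len(text)
    while True:
        # Bulk step: find the end of the ordinary run, append it as one slice.
        end = idx
        while end < n and text[end] not in STOP:
            end += 1
        if end > idx:
            parts.append(text[idx:end])
        idx = end
        if idx >= n:
            raise UnexpectedEof("string not terminated")
        ch = text[idx]
        # The stop character goes through the same branches a
        # per-character loop would use.
        if ch == '"':
            return "".join(parts), idx + 1
        if ch == "\\":
            if idx + 1 >= n:
                raise UnexpectedEof("escape not terminated")
            esc = text[idx + 1]
            if esc not in SIMPLE_ESCAPES:
                raise ValueError(f"unsupported escape in this sketch: \\{esc}")
            parts.append(SIMPLE_ESCAPES[esc])
            idx += 2
            continue
        raise InvalidControlChar(f"control character {ord(ch):#04x} in string")
```

Because every character that needs special handling is in the stop-set, the bulk step can only ever skip characters that the per-character loop would have appended unchanged.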

Benchmarks

Parsing speedup across document shapes (median, interleaved A/B vs master, includes #489+#490):

| document | speedup |
| --- | --- |
| large flat, single-line string values (~90 KB) | 4.9× |
| pyproject.toml | 2.1× |
| poetry.lock-like (~64 KB) | 2.1× |
| typical mixed (~4 KB) | 1.5× |

No regression on any shape (multiline-heavy and nested docs are unchanged).
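For readers who want to reproduce this kind of measurement, an interleaved A/B comparison with medians could look roughly like the sketch below. The harness used for the numbers above is not shown in this PR, so `interleaved_speedup`, its parameters and the round count are all assumptions.

```python
# Hypothetical benchmark sketch; not the harness used for the numbers above.
import statistics
import time


def interleaved_speedup(baseline, candidate, payload, rounds=15):
    """Return median(baseline time) / median(candidate time) for one payload."""
    base_times, cand_times = [], []
    for _ in range(rounds):
        # Alternate the two callables within each round so slow drift
        # (CPU frequency, cache state, background load) hits both samples.
        for fn, bucket in ((baseline, base_times), (candidate, cand_times)):
            start = time.perf_counter()
            fn(payload)
            bucket.append(time.perf_counter() - start)
    return statistics.median(base_times) / statistics.median(cand_times)
```

A value above 1.0 means the candidate was faster on that payload. Using the median rather than the mean keeps a single slow outlier run from skewing the ratio.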

Tests

Full suite passes (972 tests, incl. the toml-test conformance submodule). On top of that, a 4135-input adversarial differential (random escapes valid+invalid, every control byte 0x00–0x1F+DEL in basic & literal, unicode/astral/combining, the other-quote char inside strings, truncated/malformed inputs for error parity) is byte-identical in output and exception type to the per-character loop. No public API or behaviour change.
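The differential check can be pictured along these lines. This is a minimal sketch under assumptions, not the script used for this PR: `outcome`, `differential` and `control_char_inputs` are hypothetical names, and only the control-byte family of inputs is generated here.

```python
# Hypothetical differential-testing sketch; not the script used for this PR.


def outcome(parse, text):
    """Run parse(text) and reduce the result to a comparable pair."""
    try:
        return ("ok", parse(text))
    except Exception as exc:  # broad on purpose: the exception type is compared
        return ("error", type(exc).__name__)


def differential(reference, candidate, inputs):
    """Return the inputs on which the two parsers disagree."""
    return [t for t in inputs if outcome(reference, t) != outcome(candidate, t)]


def control_char_inputs():
    """One basic and one literal string per control byte 0x00-0x1F plus DEL."""
    codes = [*range(0x20), 0x7F]
    basic = ['key = "a' + chr(c) + 'b"\n' for c in codes]
    literal = ["key = 'a" + chr(c) + "b'\n" for c in codes]
    return basic + literal
```

Passing the per-character parser as `reference` and the bulk-scan parser as `candidate` should give an empty list if the two agree on both output and exception type for every input.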

`Source.__init__` built `iter([(i, TOMLChar(c)) for i, c in enumerate(self)])`, allocating one tuple and one TOMLChar per character of the whole input up front. Track an integer index into the underlying string instead: `inc()` bumps the index and reads `self[idx]`, and state save/restore snapshots the index rather than copying an iterator. Construction is O(1) and per-character work is deferred to the read. No behaviour change (full suite incl. the toml-test conformance submodule passes); ~1.07-1.14x faster parsing across document sizes.
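As a rough picture of the index-based approach, the sketch below tracks a position in the string and snapshots it as a plain integer. `IndexSource` is a hypothetical, simplified stand-in: it wraps the string rather than subclassing it, uses an empty string as its end-of-input sentinel, and does not build `TOMLChar` objects.

```python
# Hypothetical sketch; a simplified stand-in, not tomlkit's Source class.


class IndexSource:
    """Character source that tracks an integer index into the input string."""

    EOF = ""  # end-of-input sentinel (an assumption of this sketch)

    def __init__(self, text):
        # O(1) setup: no per-character tuples or wrapper objects are built.
        self._text = text
        self._idx = 0

    @property
    def current(self):
        # Per-character work happens lazily, at read time.
        if self._idx < len(self._text):
            return self._text[self._idx]
        return self.EOF

    def inc(self):
        """Advance by one character; return False once the end is reached."""
        if self._idx >= len(self._text):
            return False
        self._idx += 1
        return self._idx < len(self._text)

    def save(self):
        # A snapshot is just the index, not a copy of an iterator.
        return self._idx

    def restore(self, snapshot):
        self._idx = snapshot
```

Backtracking then costs one integer assignment, regardless of how much input remains.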

The parser advanced one character at a time through runs of whitespace, bare-key and number characters, paying a `Source.inc()` call (attribute lookups + a `TOMLChar` build + bounds check) for every character. Add `Source.advance_while(charset)` / `advance_until(stopset)`, which scan the underlying string in a single pass and update the index and current character only once, and use them for the leading-whitespace, bare-key and number/date runs. Same value contract as the `while ... and self.inc()` loops they replace. No behaviour change (full suite incl. the toml-test conformance submodule passes; round-trip output byte-identical on a varied corpus). ~1.05-1.32x faster parsing depending on shape (e.g. ~1.26x on a poetry.lock-like file).
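A minimal sketch of the two helpers, assuming a source object that holds the text and an index. `BulkSource`, the `WS` and `BARE_KEY` sets, and the choice to return the consumed slice are assumptions made for illustration; the commit message does not spell out the real return contract.

```python
# Hypothetical sketch; not tomlkit's implementation of these methods.
import string


class BulkSource:
    def __init__(self, text):
        self.text = text
        self.idx = 0

    def advance_while(self, charset):
        """Consume the longest run of characters in charset and return it."""
        text, start = self.text, self.idx
        end, n = start, len(text)
        while end < n and text[end] in charset:
            end += 1
        self.idx = end  # single state update for the whole run
        return text[start:end]

    def advance_until(self, stopset):
        """Consume characters up to (not including) the first one in stopset."""
        text, start = self.text, self.idx
        end, n = start, len(text)
        while end < n and text[end] not in stopset:
            end += 1
        self.idx = end  # single state update for the whole run
        return text[start:end]


WS = frozenset(" \t")
BARE_KEY = frozenset(string.ascii_letters + string.digits + "-_")

src = BulkSource("  server-name = 1")
src.advance_while(WS)               # skips the leading whitespace
key = src.advance_while(BARE_KEY)   # "server-name"
```

The inner loops touch only local variables, so the per-character cost no longer includes a method call or an object build; the source state is written once per run.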

Parsing a single-line string appended its body one character at a time (`value += current; inc()`). For long string values this dominates. Scan the run of ordinary characters up to the next delimiter, backslash or control character in a single pass (`Source.advance_until`) and append the whole slice at once; the stop character is then handled by the existing branch on the next iteration. Multiline strings keep the per-character loop (CRLF handling). The stop-set is exactly the control characters the per-character loop rejects, so InvalidControlChar / escape / delimiter handling is unchanged. No behaviour change (972 tests incl. the toml-test conformance submodule; plus a 4135-input adversarial differential — output and error-type byte-identical to the per-char loop). Up to ~5x faster parsing on string-heavy single-line documents.

tfoutrein added 2 commits June 5, 2026 16:19

tfoutrein force-pushed the perf/string-scan branch from 35d3514 to 421172e Compare June 5, 2026 17:24

tfoutrein force-pushed the perf/string-scan branch from 421172e to e08a656 Compare June 5, 2026 17:39

This was referenced Jun 6, 2026

Speed up parsing by interning TOMLChar instances #488

Open

Remove the internal TOMLChar wrapper #492

Open

tfoutrein wants to merge 3 commits into python-poetry:master from AstekGroup:perf/string-scan

1 participant