[diffs] Byte arena parsing experiments#835
Draft
clemg wants to merge 32 commits into
* Remove history coalesce
* Fix selction/crate not updated when do "redo" command
* Remove visualColumns.ts
* Move editor ts files
* Refactor textarea buffer
* Rename `EditSnippet` type to `TextareaSnapshot`
* Remove `Editor` component, introduce the `Editor` class for `File` component
* Update demo
* Update editor constants to set text and background color to transparent
* Rewrite rerender logic
* Format
* Remove dead code
* Fix caret postion on empty line
* Improve `renderSelectionRange` performance by using cached DOM elements
* Support range selection in textarea
* Improve rerender performance
* Use piece table data sturcture for the text document
* refactor
* Add public `setSelection` method for the `Editor` class
* Add `FileContentsWithLineOffsets` interface and update related components to support line offsets and line count. Refactor file handling to utilize computed line offsets for rendering and iteration.
* Add `updateRenderCacheAt` method to `FileRenderer` and `File` classes for improved rendering. Refactor theme handling in `Editor` to utilize a dedicated method for color map retrieval.
* Refactor file iteration logic by removing `iterateOverFile` utility and replacing it with direct loops in `VirtualizedFile` and `FileRenderer` components. Update line offset computation to exclude trailing newlines in multi-line files while maintaining correct line counts. Enhance tests to validate line counting behavior.
* Remove EOF field
* Remove text length fields from HistoryEntry and related test cases in EditHistory
* Rename class `EditHistory` to `EditStack`
* Refactor EditStack and PieceTable to use a unified text slice interface.
* Refactor PieceTable and TextDocument to improve line offset handling and remove unnecessary EOL trimming logic.
* Refactor `Editor` to utilize new dirty line resolution logic, enhancing performance and accuracy in line tracking.
* Fix multi-cursor textarea sync
* Refactor Editor rendering logic for improved performance and reduce direct DOM manipulation.
* Add grammer cache
* Enhance line position caching in Editor for improved performance and accuracy.
* Refactor indentation handling in Editor and remove unused utility function for improved clarity and performance.
* Fix testing types
* Improve performance of the `getCharacterX` method
* Improve caching mechanism for enhanced performance.
* Add maxEntries feature to EditStack for managing undo history size
* Refactor
* Enhance PieceTable and TextDocument to trim line endings in getLineText method, improving text handling consistency. Update related tests for accuracy.
* Refactor
* Add `BackgroundTokenzier` class
* Improve performance
* Fix hightlight bug
* Add `--diffs-bg-caret` css property
* Fix input
* Fix selection range rendering
* Fix prebuildStateStackCache funciton
* Update `TOKENIZE_MAX_LINE_LENGTH` to 10,000
* Add `DiffsEditor` interface
* Fix `lineAnnotations` argument on `triggerEdit` invoke
* Refactor editor edit method to accept onChange callback directly and update demo to log file changes
* Clean up
* typo
* Refactor BackgroundTokenizer to use message-based scheduling.
* Refactor editor focus handling by removing redundant event listeners and updating CSS selectors for caret visibility.
* Refactor
* Fix `toTextareaSelectionDirection` function
* Refactor
* Update `DiffsEditor` types
* Add line annotation handling
* Add documentation for `hasVisibleLineAnnotation` function.
* Get rid of enum
* Clean up
* Refactor
* Update editor CSS
* Support text wrap
* Clean up
* Fix line y/wrap cache
* Fix line cache
* Copies leading indentation onto the new line after Enter
* Focus textare after undo/redo
* Move multi-selection functions to editorSelection module
* Add support for handling leading indentation deletion in applyTextChangeToSelections
* Fix selection glitch bug
* Add extendSelection command
* Fix `focusTextare` function
* Fix `resolveTextareaChange` function
* Remove unnecessary target check in mouseup event listener in Editor class
* Fix textarea selction direction
* Fix selection bg color for safair
* Clean up
* Fix shift select
* Refactor
* Refactor
* Fix shift select delay
* Coalesce edit stack entries for simple typing or backspace operations.
* Add Support forward-delete coalescing for edit history
* docs: add docs for editStack module
* Refctor
* Fix 'documentStart' and 'documentEnd' commands
* Rewrite selection handle logic
* Fix shouldCoalesceEditStackEntry function
* Update demo
* Add `removeEditor` for File component
* Add react api
* Clean up
* Update demo app
* Refactor useFileInstance to remove redundant editor cleanup logic
* Fix `computeLineOffsets` function
* typo
* Update editor style
* Fix `getOrCreateLineOffSets` method
* Refactor line count and annotation handling in File component; remove hasVisibleLineAnnotation utility
* Fix lines deletion crocss virtul viewport
* Remove `normalizeSelectionsForDocument` function
* Fix `edit` function
* Add editor sub-module
* Use `contenteditable` model
* Fix line wrap
* Fix wrap line
* Fix selection on mobile
* Update editor style
* Fix resize handling
* Add editor overlay layer
* Cleanup
* Add `DiffsEditableComponent` types
* Fix `VirtualizedFile` component
* Update `DiffsEditableComponent` type
* Add editor demo
* Fix slection rendering
* Update editor demo app
* Fix VirualizedFile component
* Update editor demo app
* Fix some selection bugs
* Update demo app
* Refactor findNextNonOverlappingSubstring method into PieceTable and TextDocument
* Refactor
* feat: Implement line jump
* Fix selection rendering when scrolling
* Improve tokenzier performance
* feat: simple search pannel
* Update editor demo app
* Fix jump
* Update search UI
* Add lag radar
* Fix virtualizer
* Fix render range after typing
* Fix editor tokenzier cache
* Fix search input focus
* Update log rader position
* Improve piece table performance
* Refactor
* Add lag radar
* Fix line count for empty documents
* Fix offscreen lines flush
* Introduce gutter width tracking
* refactor
* Refactor
* fix import
* Add 'expandSelectionDocStart' and 'expandSelectionDocEnd' commands
* Fix buffer height
* Add matches text for the search pancel
* Disable preious/next icon when no matches
* Update style.css
* feat: Support `quiteEdit` action
* Update edtior demo app
* Refactor
* Update demo app
* Update demo app
* Fix girdRow when render quick edit UI
* Move testing files
* Clean up
* Add searchPanel.ts
* Fix expandCollapsedSelectionToWord to match when the cursor is immediately touching one of the word's boundaries
* clean up
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Fix typo
* Clean up searchPanel and quickEdit when swith file
* Rebase to beta-1.2
* Fix selection after clean up quick edit widget
* Fix virtual buffer
* Fix `updateWindowSelection` method of Editor class
* Fix render range when typing new line at the end of the file
* Fix buffer when adding large lines
* [editor] Support 'deleteHardLineForward' input
* Add `insertTranspose` input
* Move `change` handler to options
* Update css
* Merge beta-1.2 changes
* Fix emply line rendering
* Add search settings UI
* Merge branch 'main' into editor
* Support FileDiff component
* Update `DiffsEditableComponent` interface
* Fix `getSelectionAnchor` function
* Fix text measurement for emoji
* Increase delay for diff rendering in FileDiff component
* Update types
* Add unit testings for text measue functions
* Clean up dirty render cache
* Fix `lineAnnotations` re-rendering
* Disable gutter utility when editing
* Add global css
* Fix scrollToLine method
* Refactor selection handling in Editor class to initialize selections properly and streamline rendering logic
* Fix diffs components
* Allow to create selection from gutter interaction
* Fix focus
* Fix browser compatibility
* Support dual themes
* Fix selection bugs
* refactor
* Add `Metrics` class
* Clean up
* Fix wrap selection rendering on safari
* Add `QuickEditContext` types
* Fix caret scroll margin when search panel is on
* Refactor search panel widget
* Fix selection position
* Update react components
* Update search panel CSS
* Fix quick edit
* Add editor docs
* Fix react hooks for editor
* Update editor demo component
* Update Quick Edit docs
* Update `diffStyle` and `expandUnchanged` options when editing
* demo: remove editor route
* Update docs
* Update docs
* Update examples
* Reset selection when 'Esc' key pressed
* Fix selection focus
* Add 'enable edit' shortcut('e')
* Handle the arrow key events to scroll to the cursor position manually
* Merge of overlapping selections
* Handle cursor moving events
* Fix scroll margin top
* Add debug logging option to Editor class
* Fix selection bugs
* Fix selection renering for unified `FileDiff`
* Reset ignore selection change flag on mouse up event
* Clean up
* fix bun.lock
* Add editor theme style
* Refactor
* Fix react types
* Fix last line index calculation
* Update condition for marking DOM dirty in VirtualizedFile component
* Throw if someone is trying to edit with no editor instance
* Update `mergeFileDiffOptions` function
* `lineOffsets` -> `lines`
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…d for the `DiffsEditor` (pierrecomputer#766) * feat(editor): add pause and resume functionality for background tokenization * pref(editor): Introduce `postponeBackgroundTokenizeToNextFrame` method for the `DiffsEditor` * fix * Add debug option for the tokenzier * Update types * Refactor * typo * Reduce requestAnimationFrame calls
* chore: empty commit for beta branch * Homepage FileDiff editor demo * Style kbd elements, add beta badge to docs content, redo table for keyboards, few edits * docs(editor): add MultiFileDiff React example Document editing with MultiFileDiff alongside File and FileDiff in the React integration tabs. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(editor): use parseDiffFromFile in FileDiff React example Align the editor FileDiff tab with the pre-parsed fileDiff prop API. Co-authored-by: Cursor <cursoragent@cursor.com> * Update editor react examples * docs(editor): document worker pool usage with useTokenTransformer Add a Worker Pool section with tabbed vanilla/React examples, and enable useTokenTransformer on the docs site worker pool so editing works off-thread. Co-authored-by: Cursor <cursoragent@cursor.com> * format * Remove toolbar, put reset into header metadata * add link * little copy editing * Update homepage example to include file and diff * redo reset --------- Co-authored-by: Amadeus Demarzi <amadeusdemarzi@gmail.com> Co-authored-by: Je Xia <i@jex.me> Co-authored-by: Cursor <cursoragent@cursor.com>
* Rounded selection boundaries * Search panel refactor/redesign * Introduce `postponeBackgroundTokenizeToNextFrame` method for the `DiffsEditor`
…`deleteWordBackward` (pierrecomputer#789)
Removing BETA badge from Virtualization
* [diffs/editor] refactor editor API * Refactor * refactor * fix * Refactor * refactor * refactor * Add blur method to Editor class for improved focus management * Update docs * Refactor
* Includes editor code refactor * Find/Replace functionality
fix(diffs): Handle IME composition input Fixes the following issue: 1. Open an editable diffs editor and place the caret in content. 2. Start a CJK IME or accented/dead-key composition. 3. Type a preview candidate, then commit it or press Esc to cancel. Previously, each preview beforeinput was prevented and warned as an unknown input type, so the browser could not show native preview text and canceled composition text could still be inserted on compositionend. Let insertCompositionText beforeinput events stay with the browser. Track composition updates and only commit non-canceled compositionend data into the editor model. Add regression coverage for preview, commit, warning noise, and canceled composition text. Also guard the selectionchange handler against browsers and embedded WebViews that lack Selection.getComposedRanges, which previously threw out of the listener on every selection change and left the editable surface unusable.
* [diffs/editor] Support editing unified diffs and refresh diff view on edit Allow the editor to edit unified diffs by syncing the render view against the FileDiff metadata (additions) instead of a separate addition file, and mark deletion/annotation lines non-editable in unified mode. Replace the post-edit `updateLineType` path with `refreshDiffView`/`fastRefreshDiffView` so the diff view re-renders correctly after edits. * Fix home live editor reset when toggling mode Include `mode` in the editable surface memo deps so switching modes recreates the surface instead of reusing a stale instance. --------- Co-authored-by: Mark Otto <markdotto@gmail.com>
* New homepage editor Co-authored-by: Cursor <cursoragent@cursor.com> * revamp * feat(diffs/editor): color marker hover popover by severity Tag the marker hover popover with its severity so the popover renders as a solid chip filled in the severity color with white text, mirroring the underline. Drops the inset surface ring on severity popovers so the fill reads as one solid color. * fix(diffs/editor): pin dual-theme surface tokenizer to themeType Only observe the document/system color scheme when the surface follows it (themeType: 'system'). A surface pinned to an explicit dark/light theme must re-tokenize after an edit with the same theme the SSR markup used; otherwise edited tokens fall back to the default foreground. * chore(demo): adapt marker message to severity-filled popover The marker hover popover is now filled with the severity color, so let the icon and description inherit the popover's white text instead of painting their own color (which would vanish against the fill). * refactor(docs): drive shortcut tables from ShortcutKeys Extract the platform-modifier logic into usePlatformModifier and add a ShortcutKeys component that renders a shortcut from plain string arrays, so a serializable shortcuts table can drive rendering without inlining JSX per row. Move the tree a11y shortcut table onto this model. * style(docs): default AUI editor demo to split diff Switch the homepage editor demo from unified to split diff style. * feat(docs): add Edit feature page Add a dedicated /edit feature page showcasing edit mode: live editing, selection actions, lint markers, find-in-file, undo history, and keyboard shortcuts, plus a reference section. Wire it into the header, mobile menu, and footer navigation, and link to it from the homepage agent demo. Rename the LiveEditor toggle from file/diff to a read-only Review vs editable Edit of the same File surface. 
* Redesign selection action icon a smidge * improve edit hero and content * vibe update docs * copy and layout updates * fixes mb * Fix AUI demo and more * fucking sick * better * better again --------- Co-authored-by: Cursor <cursoragent@cursor.com>
…#825) Render marker popup in overlay element
`additionLines`/`deletionLines` change from `string[]` to `DiffLines`: a plain
data object holding a file's lines as one UTF-8 byte arena plus an offset table,
decoded on demand via `lineAt` / `joinLines`. On a huge diff (linux v6..v7,
~22.8M lines across ~77k files) this avoids tens of millions of tiny `String`
objects, so the V8 heap drops ~33% on that compare and the parser is faster: it
no longer encode+decode-detaches every line, it encodes once on seal and decodes
only the visible (virtualized) lines.
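As a rough illustration of the arena layout described above, here is a minimal sketch. `ArenaLines` and `buildArena` are hypothetical names, the offset table is fixed to `Uint32Array` for brevity, and the bodies of `lineAt` / `joinLines` are assumptions about how such helpers could work, not the PR's actual code.

```typescript
// Hypothetical shape; the real DiffLines type in the PR may differ.
interface ArenaLines {
  length: number; // line count, kept as a plain field
  bytes: Uint8Array; // every line, UTF-8 encoded, stored back to back
  offsets: Uint32Array; // offsets[i]..offsets[i + 1] delimits line i
}

const arenaEncoder = new TextEncoder();
// ignoreBOM: true keeps a leading U+FEFF in the decoded text instead of stripping it.
const arenaDecoder = new TextDecoder('utf-8', { ignoreBOM: true });

// Pack an array of line strings into one byte arena plus an offset table.
function buildArena(lines: string[]): ArenaLines {
  const encoded = lines.map((line) => arenaEncoder.encode(line));
  const total = encoded.reduce((sum, chunk) => sum + chunk.length, 0);
  const bytes = new Uint8Array(total);
  const offsets = new Uint32Array(lines.length + 1);
  let cursor = 0;
  encoded.forEach((chunk, i) => {
    offsets[i] = cursor;
    bytes.set(chunk, cursor);
    cursor += chunk.length;
  });
  offsets[lines.length] = cursor;
  return { length: lines.length, bytes, offsets };
}

// Decode a single line on demand; no string exists for it until this call.
function lineAt(lines: ArenaLines, index: number): string {
  return arenaDecoder.decode(
    lines.bytes.subarray(lines.offsets[index], lines.offsets[index + 1]),
  );
}

// Decode the whole side in one pass, the arena counterpart of x.join('').
function joinLines(lines: ArenaLines): string {
  return arenaDecoder.decode(lines.bytes);
}
```

With this layout a side costs two typed arrays regardless of line count, and only the lines that are actually read become `String` objects.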
It is plain data on purpose, so it survives structured clone (the highlight
worker), `structuredClone`, and IndexedDB without a revive step (no class, no
prototype to drop). `.length` stays a field, so the many `.length` consumers are
unchanged; only content reads migrate (`x[i]` -> `lineAt(x, i)`,
`x.join('')` -> `joinLines(x)`). Per-file offsets use the smallest int width that
fits the file. A file with a lone surrogate keeps exact strings as a fallback,
and merge-conflict diffs keep plain strings (no encode) so their parse stays at
parity. The parsed model is byte-identical to before (snapshot + content-hash).
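Two of the decisions above, picking the offset width and detecting a lone surrogate, can be sketched as follows. Both function names are hypothetical, and the thresholds are an assumption about what "smallest int width that fits" could mean. Newer runtimes also offer `String.prototype.isWellFormed()`, which answers the surrogate question directly.

```typescript
type OffsetTable = Uint8Array | Uint16Array | Uint32Array;

// Pick the narrowest unsigned typed array able to hold the largest offset,
// which is the total byte length of the side.
function allocateOffsets(lineCount: number, totalBytes: number): OffsetTable {
  if (totalBytes <= 0xff) return new Uint8Array(lineCount + 1);
  if (totalBytes <= 0xffff) return new Uint16Array(lineCount + 1);
  return new Uint32Array(lineCount + 1);
}

// A lone surrogate cannot round-trip through UTF-8 (the encoder replaces it
// with U+FFFD), so such a file would have to stay on plain strings.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: valid only when a low surrogate follows.
      const next = text.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    // Low surrogate that was not consumed as part of a pair.
    if (unit >= 0xdc00 && unit <= 0xdfff) return true;
  }
  return false;
}
```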
The editor's realtime-update path edits addition lines in place
(DiffHunksRenderer.updateRenderCache / updateDiffHunks), so diffLines also
exposes mutableLines (decode an arena side into the plain form once, on the
first edit) and joinLineRange (read a hunk's line range as one contiguous byte
slice for the partial reparse). Whole-document changes reassign the side with
plainLines(splitFileContents(...)), keeping the editor on plain strings while
it is editing.
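The edit path described above could look roughly like this. The tagged union, the `holder` wrapper, and both signatures are assumptions made for illustration; only the names `mutableLines` and `joinLineRange` come from the commit message.

```typescript
// Hypothetical two-form representation of one side of a file.
interface ArenaSide { kind: 'arena'; length: number; bytes: Uint8Array; offsets: Uint32Array }
interface PlainSide { kind: 'plain'; length: number; lines: string[] }
type Side = ArenaSide | PlainSide;

const rangeDecoder = new TextDecoder('utf-8', { ignoreBOM: true });

// Read lines [start, end) as one string. On an arena side this is a single
// contiguous byte slice, decoded in one call.
function joinLineRange(side: Side, start: number, end: number): string {
  if (side.kind === 'plain') return side.lines.slice(start, end).join('');
  return rangeDecoder.decode(side.bytes.subarray(side.offsets[start], side.offsets[end]));
}

// Return an editable string[] for the side, converting an arena side to the
// plain form on first use so later calls reuse the same array.
function mutableLines(holder: { side: Side }): string[] {
  const side = holder.side;
  if (side.kind === 'plain') return side.lines;
  const lines: string[] = [];
  for (let i = 0; i < side.length; i++) {
    lines.push(rangeDecoder.decode(side.bytes.subarray(side.offsets[i], side.offsets[i + 1])));
  }
  holder.side = { kind: 'plain', length: lines.length, lines };
  return lines;
}
```

The one-time conversion means read-only files never pay for per-line strings, while an edited side behaves like the old `string[]` from the first edit onward.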
Adds diffLines.test.ts (arena round-trip, multibyte, emoji-keeps-arena, lone-surrogate fallback, BOM, offset-width, plainLines, joinLines, isWellFormed) and a withPlainLines snapshot converter so the existing parsed-model snapshots assert byte-identical line content. The audited tests' direct string[] reads move to the DiffLines helpers (lineAt / linesToArray / plainLines), and the updateDiffHunks fixtures edit lines through the plain form like the editor does.
processFileBytes parses a single file's diff straight from its UTF-8 bytes: only the file header and the @@ hunk header lines are decoded into JS strings, every content line's bytes are copied verbatim into the per-side arenas and decoded on demand by lineAt, so a parse allocates no per-line strings and no per-line garbage. The per-side byte arena is filled by a small SideBuilder (appendLine / sealSide) that sits in diffLines.ts next to finishLines, so every way of building a DiffLines lives in one module, while parsePatchFiles keeps only the byte scanners that walk the patch structure. It is the only hunk-content parser: processFile encodes its string once up front and hands the bytes over, and the full-file path (parseDiffFromFile, merge conflicts) rides the same loop: the patch bytes drive the hunk structure while the sides keep coming from the caller's contents strings through finishLines, like before. The previous per-line string loop and its helpers go away. Two byte-only behaviors to know about: invalid UTF-8 stays as-is in the arena and decodes to U+FFFD on read (the same text a stream decode produces), and a patch string holding a lone surrogate (which the up-front encode would corrupt) gets its exact line strings rebuilt from the original text into the plain-string form, so the surrogate-preservation behavior is unchanged. The parsed model is hash-identical to the previous parser on linux v6.0..v7.0 (76,872 files, 22.8M lines, every line compared).
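A minimal sketch of the byte-copy idea, under stated assumptions: `SideBuilderSketch` and `splitHunkBody` are hypothetical, the leading `+`/`-`/space marker is dropped before copying (the PR may store lines differently), and `\ No newline at end of file` lines are simply skipped.

```typescript
interface SealedSide { length: number; bytes: Uint8Array; offsets: Uint32Array }

// Hypothetical builder: grows one byte buffer and records line boundaries.
class SideBuilderSketch {
  private bytes: Uint8Array = new Uint8Array(64);
  private used = 0;
  private readonly bounds: number[] = [0];

  // Copy source[start, end) verbatim; nothing is decoded here.
  appendLine(source: Uint8Array, start: number, end: number): void {
    const size = end - start;
    if (this.used + size > this.bytes.length) {
      const grown = new Uint8Array(Math.max(this.bytes.length * 2, this.used + size));
      grown.set(this.bytes.subarray(0, this.used));
      this.bytes = grown;
    }
    this.bytes.set(source.subarray(start, end), this.used);
    this.used += size;
    this.bounds.push(this.used);
  }

  // Trim the buffer to its used size and freeze the offsets into a typed array.
  seal(): SealedSide {
    return {
      length: this.bounds.length - 1,
      bytes: this.bytes.slice(0, this.used),
      offsets: Uint32Array.from(this.bounds),
    };
  }
}

const LF = 0x0a;
const PLUS = 0x2b;
const MINUS = 0x2d;
const SPACE = 0x20;

// Route each hunk line to its side(s) by looking only at the first byte.
function splitHunkBody(body: Uint8Array): { deletions: SealedSide; additions: SealedSide } {
  const deletions = new SideBuilderSketch();
  const additions = new SideBuilderSketch();
  let lineStart = 0;
  while (lineStart < body.length) {
    const newline = body.indexOf(LF, lineStart);
    const lineEnd = newline === -1 ? body.length : newline + 1;
    const marker = body[lineStart];
    if (marker === MINUS) {
      deletions.appendLine(body, lineStart + 1, lineEnd);
    } else if (marker === PLUS) {
      additions.appendLine(body, lineStart + 1, lineEnd);
    } else if (marker === SPACE) {
      // Context lines belong to both sides.
      deletions.appendLine(body, lineStart + 1, lineEnd);
      additions.appendLine(body, lineStart + 1, lineEnd);
    }
    lineStart = lineEnd;
  }
  return { deletions: deletions.seal(), additions: additions.seal() };
}
```

Because bytes are copied untouched, an invalid sequence such as a stray `0xFF` stays in the arena and only turns into U+FFFD when a `TextDecoder` reads that line.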
streamGitPatchFiles moves into the package as a byte-level splitter: the patch stream buffers as bytes and each file is handed over as a Uint8Array slice at its diff --git boundary, ready for processFileBytes, so the bulk of the patch is never decoded into JS strings. Format-patch commit metadata splits on the From <hash> boundary like the string splitter did, a stream-leading BOM is stripped like a whole-stream TextDecoder would, and a boundary-less patch falls back to one decoded string for parsePatchFiles. The byte scanners and ASCII byte codes shared by the splitter and the file parser (matchesAscii, lineEndExclusive, a generic findNextLineStartingWith, isBlankLine, hasNonWhitespace, and the byte constants) now live in one byteScan module, defined once and kept out of the package's public exports. diffshub's viewer feeds the slices straight to processFileBytes; its own string splitter goes away, and it imports COMMIT_HASH_METADATA_PATTERN from the package instead of keeping a second copy.
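The boundary splitting can be sketched like this. `PatchSplitterSketch` is hypothetical, the bodies of `matchesAscii` and `findNextLineStartingWith` are guesses at what helpers with those names might do, and the sketch assumes the stream starts at a `diff --git` line (format-patch metadata, BOM stripping, and the boundary-less fallback are left out). It also rescans the pending buffer on every chunk, which a real implementation would likely avoid.

```typescript
// True when the bytes at `at` spell out the given ASCII literal.
function matchesAscii(bytes: Uint8Array, at: number, literal: string): boolean {
  if (at + literal.length > bytes.length) return false;
  for (let i = 0; i < literal.length; i++) {
    if (bytes[at + i] !== literal.charCodeAt(i)) return false;
  }
  return true;
}

// `from` must be a line start. Returns the index of the next line that
// begins with `literal`, or -1 when none is fully buffered yet.
function findNextLineStartingWith(bytes: Uint8Array, from: number, literal: string): number {
  let lineStart = from;
  while (lineStart < bytes.length) {
    if (matchesAscii(bytes, lineStart, literal)) return lineStart;
    const newline = bytes.indexOf(0x0a, lineStart);
    if (newline === -1) return -1;
    lineStart = newline + 1;
  }
  return -1;
}

class PatchSplitterSketch {
  private pending: Uint8Array = new Uint8Array(0);

  // Feed one chunk; returns every file slice completed by it.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.pending.length + chunk.length);
    merged.set(this.pending);
    merged.set(chunk, this.pending.length);
    const files: Uint8Array[] = [];
    let fileStart = 0;
    for (;;) {
      // Skip the current file's own header line before looking for the next one.
      const firstNewline = merged.indexOf(0x0a, fileStart);
      if (firstNewline === -1) break;
      const next = findNextLineStartingWith(merged, firstNewline + 1, 'diff --git ');
      if (next === -1) break;
      files.push(merged.slice(fileStart, next));
      fileStart = next;
    }
    this.pending = merged.slice(fileStart);
    return files;
  }

  // Call once the stream ends to get the last file, if any.
  flush(): Uint8Array | null {
    const rest = this.pending;
    this.pending = new Uint8Array(0);
    return rest.length > 0 ? rest : null;
  }
}
```

A marker cut in half by a chunk boundary is handled by keeping the unfinished tail in `pending`, so the result does not depend on chunk size.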
Chunk-size sweeps over the file boundary splitting (including one-byte chunks), per-file model parity between the streamed bytes and the string entry, format-patch metadata attachment, stream-leading vs in-file BOM, the unified-diff fallback, and the invalid-UTF-8 read-back.
@clemg is attempting to deploy a commit to the Pierre Computer Company Team on Vercel. A member of the Team first needs to authorize it.
Description
This PR brings the two proposed experiments from #760 together in one cleaner change on top of beta-1.3.
The goal is to store each file's diff lines as a per-file UTF-8 byte arena (just contiguous bytes in a UTF-8 byte array) and to parse the patch straight into that arena as it streams, instead of building a separate JS string for every line.
The approach in this PR produces parsed output that is byte-identical to today's string approach.
So what's in there:
* …`string[]`. Each side of a file (before/after) packs all its lines into one UTF-8 `Uint8Array` (bytes) plus a small `Int32Array` (offsets) of per-line offsets, so a line is just `decode(bytes[offsets[i]..offsets[i+1]])`
* …`string[]` to a custom type, the `DiffLines` data object. So reads are `lineAt(lines, i)` and `joinLines(lines)` instead of `lines[i]` and `lines.join('')`
* …`diff --git`/`@@` headers to build the structure. This makes parsing noticeably faster and steadier (see benchmarks below)
* …`TextDecoder`). A nice side effect on huge diffs (on my hardware, Air M2): fast-scrolling to a far-away file no longer flashes blank rows for a beat before the text paints. I'm fairly sure that's because the whole tab is so much lighter now, not the decode itself
* …`beta-1.3` from yesterday, so I think I took everything you've already merged about the parsing into account
Note: I left a bunch of `// Adapted from ...` comments here and there, for functions and stuff I just adapted from before. I don't think they're worth keeping, but they might help for a review.
I also did some opinionated stuff about how I organized my functions and chose to represent data, please let me know if things should be changed!
Results:
The after-GC renderer footprint (measured once a stream has been settled for long enough), in MB, mean of 3 runs, measured on my VPS:
For the linux comparison, the peak memory usage while streaming also drops from 4259 MB (avg) to 2162 MB (avg), roughly half. The parsing time (in CPU time used) drops as well, from 107s to 42.5s (about 2.5x faster).
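As a quick check of the ratios quoted above, using only the averages stated in this description (the variable names are illustrative):

```typescript
// Averages quoted in the PR description for the linux compare.
const peakBefore = 4259;
const peakAfter = 2162;
const cpuSecondsBefore = 107;
const cpuSecondsAfter = 42.5;

// Fraction of peak memory saved, and how many times faster parsing became.
const peakReduction = 1 - peakAfter / peakBefore;
const parseSpeedup = cpuSecondsBefore / cpuSecondsAfter;

console.log(`peak memory: -${(peakReduction * 100).toFixed(1)}%`); // peak memory: -49.2%
console.log(`parse time: ${parseSpeedup.toFixed(2)}x faster`); // parse time: 2.52x faster
```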
Motivation & Context
As I said in #775, I'm getting OOM errors on diffshub with linux v6...v7's compare. This fixes it for me (and should for a bunch of different hardware), and also makes the parsing/memory usage more stable.
The results of my testing are clear: for instance, on linux's compare it saves about 50% of memory usage, drastically reduces the peak memory usage, and makes the parsing ~2.5x faster. I (my clanker) made this page, diffshub-bench.clemg.fr, that lets you start benchmarks on any PR or compare on 3 versions of diffshub: diffshub's main, the beta branch and this proposed branch. It runs on my slow as fuck VPS, but it allows comparing a lot of tests remotely on the same hardware, without setup. It also has some already-run tests for you to check if you want.
From my testing, I haven't found any issue with rendering the CodeView component, the docs, the editor or anything else that already exists. You should obviously try it for yourself because you know this much better than I do.
Since you (Amadeus) mentioned that it was easier for you to test everything in one go, this PR supersedes the other one (which I should close?).
Anyway, happy to discuss the whys and the hows of this, and to hear suggestions if you have any.
Type of changes
You must have first discussed with the dev team and they should be aware
that this PR is being opened
Checklist
* …contributing guidelines
* …`bun run lint`)
* …`bun run format`)
* …`bun run diffs:test`)
How was AI used in generating this PR
I used AI to write all the tests (and reviewed them), parts of the commit messages and to rebase on top of beta-1.3 from my old experiment branch to this new one (a lot changed since)
Related issues
See: #760