
Fix regression by reverting back to characters offset rather than byt…#447

Open
victorgveloso wants to merge 1 commit into GumTreeDiff:main from victorgveloso:fix/charOffset

Conversation

@victorgveloso

Contributor

Fix byte offset regression in AST diff calculation introduced in #444

Description

This Pull Request addresses a major character/byte offset regression introduced in PR #444 (which originally solved the cross-platform line ending alignment issue #439). While the original PR successfully resolved the CRLF/LF line-drift on Windows by transitioning to a raw-read approach and UTF-8 byte calculations, the internal string-splitting logic broke down immediately after the beta7 release.

The Problem

In PR #444, file contents were switched to a raw read to prevent host-OS line ending normalization. However, the subsequent processing scattered inconsistent .getBytes(StandardCharsets.UTF_8) evaluations across multiple files (specifically around line 127 of the updated tracking engine).
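To illustrate why a raw read matters here, the following minimal sketch contrasts decoding the file's bytes directly with rebuilding the text line by line. The class and method names (`RawReadDemo`, `decodeRaw`, `decodeLineByLine`) are hypothetical and are not taken from the GumTree code base; only standard JDK calls are used.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of GumTree.
final class RawReadDemo {

    /** Decodes the bytes as UTF-8 and keeps every '\r' and '\n' exactly as stored. */
    static String decodeRaw(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Reads line by line and re-joins with '\n', which silently drops any '\r'. */
    static String decodeLineByLine(byte[] bytes) {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        return reader.lines().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        byte[] crlfFile = "a\r\nb".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeRaw(crlfFile).length());        // 4
        System.out.println(decodeLineByLine(crlfFile).length()); // 3
    }
}
```

Any offset computed against the line-by-line text is one position short for every CRLF that precedes it, which is the kind of line drift the raw read avoids.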

When parsing files with complex token arrangements or structural variations, splitting string lines internally with .split("\n", -1) while evaluating uneven raw byte chunks caused an off-by-one or off-by-many byte alignment drift. This completely broke AST token matching in downstream tools like RefactoringMiner, leading to mangled, structurally misaligned visual diff outputs across multiple language test suites (Python, TypeScript, JavaScript, Kotlin).
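The drift described above can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the code from this PR: it splits a string with `.split("\n", -1)` and computes each line's start offset twice, once in Java `char` units and once in UTF-8 bytes. The class name `OffsetDriftDemo` and both method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it only shows how the two offset units diverge.
final class OffsetDriftDemo {

    /** Start of each line measured in UTF-16 code units (Java chars). */
    static int[] lineStartsInChars(String text) {
        String[] lines = text.split("\n", -1);
        int[] starts = new int[lines.length];
        int offset = 0;
        for (int i = 0; i < lines.length; i++) {
            starts[i] = offset;
            offset += lines[i].length() + 1; // +1 for the '\n' consumed by split
        }
        return starts;
    }

    /** Start of each line measured in UTF-8 bytes. */
    static int[] lineStartsInBytes(String text) {
        String[] lines = text.split("\n", -1);
        int[] starts = new int[lines.length];
        int offset = 0;
        for (int i = 0; i < lines.length; i++) {
            starts[i] = offset;
            offset += lines[i].getBytes(StandardCharsets.UTF_8).length + 1; // '\n' is 1 byte
        }
        return starts;
    }

    public static void main(String[] args) {
        // One non-ASCII character (U+00E9, two bytes in UTF-8) and a CRLF on the first line.
        String source = "s = \"\u00e9\"\r\nprint(s)\n";
        System.out.println(Arrays.toString(lineStartsInChars(source))); // [0, 9, 18]
        System.out.println(Arrays.toString(lineStartsInBytes(source))); // [0, 10, 19]
    }
}
```

Every line after the first non-ASCII character starts at a different position depending on the unit, so mixing the two units in one pipeline shifts all later token positions.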

The Fix

  • Unified Byte Offset Calculations: Harmonized the scattered .getBytes(StandardCharsets.UTF_8) calls across the core file-processing pipeline to ensure absolute consistency with Tree-sitter's internal byte-offset index mapping.

  • Preserved Boundary Tokens: Fixed the parsing loop boundaries so that line index boundaries and line separators are calculated uniformly, preventing arbitrary index drifts during AST generation.

  • Restored AST Alignments: Confirmed via local cross-testing that tokens map accurately to their respective source files without visual fragmentation.
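One way to keep offset handling consistent is to route every conversion through a single helper instead of calling `.getBytes(StandardCharsets.UTF_8)` ad hoc. The sketch below shows what such a helper might look like; `Utf8Offsets` and its methods are hypothetical names, it assumes well-formed text without lone surrogates, and it is not the implementation in this PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical single point of truth for byte/char offset conversion.
final class Utf8Offsets {

    /** Number of bytes UTF-8 uses for one Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Converts a UTF-8 byte offset (as a byte-based parser reports it) to a Java char offset. */
    static int byteToCharOffset(String text, int byteOffset) {
        int bytes = 0;
        int chars = 0;
        while (chars < text.length() && bytes < byteOffset) {
            int codePoint = text.codePointAt(chars);
            bytes += utf8Length(codePoint);
            chars += Character.charCount(codePoint);
        }
        if (bytes != byteOffset) {
            // Offset is past the end of the text or falls inside a multi-byte character.
            throw new IllegalArgumentException("Invalid byte offset: " + byteOffset);
        }
        return chars;
    }

    /** Converts a Java char offset to the corresponding UTF-8 byte offset. */
    static int charToByteOffset(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u00e9 = 1\nx";
        System.out.println(byteToCharOffset(text, 7)); // 6
        System.out.println(charToByteOffset(text, 6)); // 7
    }
}
```

With a helper like this, the unit of each offset is decided in one place, and an offset that lands inside a multi-byte character fails loudly instead of producing a shifted mapping.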

Related Issues & Changes

Visual Verification

Tested on commit abhigyanpatwari/GitNexus@1c8ae5e

Before Fix (Broken Mappings after beta7)
[Screenshot: diff output before the fix]
❌ Scattered mappings, misaligned blocks, missing token associations.

After Fix (Aligned Token Offsets)
[Screenshot: diff output after the fix]
✅ Clean structural match alignments, fully unified tracking indices.

@tsantalis


@jrfaller
Jean-Rémy, we found a bug in the previous PR, which this PR fixes.
I personally ran tests to confirm that this one works well.

We are very sorry for this :(
The beta7 Maven release is buggy. A new release is needed.
