Fix regression by reverting back to characters offset rather than byt… by victorgveloso · Pull Request #447 · GumTreeDiff/gumtree

victorgveloso · 2026-06-21T21:40:53Z

Fix byte offset regression in AST diff calculation introduced in #444

Description

This pull request fixes a major character/byte offset regression introduced in PR #444, which originally solved the cross-platform line-ending alignment issue #439. PR #444 resolved the CRLF/LF line drift on Windows by switching to a raw read and UTF-8 byte calculations, but its internal string-splitting logic turned out to be faulty, which surfaced immediately after the beta7 release.

The Problem

In PR #444, file contents were switched to a raw read to prevent host-OS line ending normalization. However, the subsequent processing scattered inconsistent .getBytes(StandardCharsets.UTF_8) evaluations across multiple files (specifically around line 127 of the updated tracking engine).

When parsing files with complex token arrangements or structural variations, splitting lines internally with .split("\n", -1) while evaluating uneven raw byte chunks caused byte offsets to drift by one or more positions. This broke AST token matching in downstream tools such as RefactoringMiner, producing mangled, structurally misaligned visual diffs across multiple language test suites (Python, TypeScript, JavaScript, Kotlin).
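To illustrate the kind of drift described above, here is a minimal, self-contained sketch. It is not code from this PR: the class `OffsetDrift` and the helper `utf8OffsetOf` are hypothetical names, and the sample source string is invented. It shows that a char index and a UTF-8 byte offset diverge as soon as a non-ASCII character appears, and that `.split("\n", -1)` leaves the `\r` of a CRLF pair attached to the line.

```java
import java.nio.charset.StandardCharsets;

public class OffsetDrift {
    // Hypothetical helper: number of UTF-8 bytes that precede the given char index.
    // Assumes charIndex does not fall in the middle of a surrogate pair.
    static int utf8OffsetOf(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Invented sample: one e-acute (2 bytes in UTF-8) and a CRLF line ending.
        String source = "s = \"\u00e9\"\r\nprint(s)\n";

        int charIndex = source.indexOf("print");
        int byteIndex = utf8OffsetOf(source, charIndex);
        System.out.println("char offset: " + charIndex); // 9
        System.out.println("byte offset: " + byteIndex); // 10

        // Splitting on "\n" alone keeps the "\r" at the end of the first line,
        // so per-line lengths must account for it explicitly.
        String[] lines = source.split("\n", -1);
        System.out.println("first line ends with CR: " + lines[0].endsWith("\r")); // true
        System.out.println("line count: " + lines.length); // 3
    }
}
```

Any code that mixes the two offset kinds, or that counts a line separator as one unit in one place and two in another, would accumulate this kind of difference line by line.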

The Fix

Unified Byte Offset Calculations: Harmonized the scattered .getBytes(StandardCharsets.UTF_8) calls across the core file-processing pipeline to ensure absolute consistency with Tree-sitter's internal byte-offset index mapping.
Preserved Boundary Tokens: Fixed the parsing loop boundaries so that line index boundaries and line separators are calculated uniformly, preventing arbitrary index drifts during AST generation.
Restored AST Alignments: Confirmed via local cross-testing that tokens map accurately to their respective source files without visual fragmentation.
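
One way to keep the two offset kinds consistent is to convert in a single place. The sketch below is an assumption-laden illustration, not the change made in this PR: `ByteToCharOffset`, `utf8Length`, and `charIndexOf` are hypothetical names. It maps a UTF-8 byte offset (the unit the description attributes to Tree-sitter) to a Java char index by walking code points, assuming the text is well-formed UTF-16.

```java
public class ByteToCharOffset {
    // Number of bytes UTF-8 uses to encode one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Hypothetical helper: maps a UTF-8 byte offset to a char index in text.
    // Throws if the offset is out of range or falls inside an encoded character.
    static int charIndexOf(String text, int byteOffset) {
        int bytes = 0;
        int i = 0;
        while (i < text.length() && bytes < byteOffset) {
            int cp = text.codePointAt(i);
            bytes += utf8Length(cp);
            i += Character.charCount(cp); // 2 chars for supplementary code points
        }
        if (bytes != byteOffset) {
            throw new IllegalArgumentException("invalid byte offset: " + byteOffset);
        }
        return i;
    }

    public static void main(String[] args) {
        String source = "s = \"\u00e9\"\r\nprint(s)\n";
        // Byte 10 is where "print" starts; the matching char index is 9.
        System.out.println(charIndexOf(source, 10));
    }
}
```

Because the walk never splits the text into lines, CR and LF are counted exactly as they appear in the raw content, which avoids any dependence on how line separators are handled elsewhere.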

Related Issues & Changes

Fixes Regression From: Fix platform-dependent byte offset drift and Unicode alignment in tree-sitter-ng generators #444
Addresses Root Problem In: gen.treesitter-ng:4.0.0-beta6 does not work on Windows #439
Testing: Verified locally via ./gradlew publishToMavenLocal -x test and cross-validated against RefactoringMiner AST diff suites to ensure full regression clearance across both Windows and macOS platforms.

Visual Verification

Tested on commit abhigyanpatwari/GitNexus@1c8ae5e

Before Fix (Broken Mappings after beta7)

❌ Scattered mappings, misaligned blocks, missing token associations.

After Fix (Aligned Token Offsets)

✅ Clean structural match alignments, fully unified tracking indices.


tsantalis · 2026-06-21T22:54:17Z

@jrfaller
Jean-Rémy, we found a bug in the previous PR, which this PR fixes.
I personally ran tests to confirm that this one works well.

We are very sorry for this :(
The beta7 Maven release is buggy. A new release is needed.

Fix regression by reverting back to characters offset rather than bytes offset (commit a5009b5)

victorgveloso temporarily deployed to MavenCentral June 21, 2026 21:43 — with GitHub Actions Inactive

victorgveloso wants to merge 1 commit into GumTreeDiff:main from victorgveloso:fix/charOffset
