perf(codec-http2): specialize HpackDecoder.decodeULE128 int path #1

Open
mashraf-222 wants to merge 1 commit into 4.2 from
perf/hpack-decode-ule128-specialize-int

Conversation

@mashraf-222
Collaborator

Summary

Replace HpackDecoder.decodeULE128(ByteBuf, int) with a specialized int decoder. Measured 9.25% reduction (2.3% error) in average decode time on HpackDecoderULE128Benchmark.decodeMaxInt vs the current decodeMaxIntUsingLong, with non-overlapping 99.9% confidence intervals. One method, one file, package-private API unchanged. The change copies the shape of the reference implementation that has lived in microbench/ since 2017.

What Changed

File: codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.java
Change: Replaced the decodeULE128(ByteBuf, int) body with a specialized int-loop decoder. The long overload is unchanged.
Lines changed: +22 / −12
  • Production line count delta: +10 net lines (22 added, 12 removed).
  • Public signature changes: none. Method is package-private (static int decodeULE128).
  • New tests: none needed. The existing HpackDecoderTest covers the four production call sites through full HEADERS-block decode paths, including boundary-value Integer.MAX_VALUE overflow cases for indexed header, literal name length, and literal value length paths.
  • Generated files: none.

Single optimization, single method. No unrelated cleanup bundled.

Why It Works

  • Eliminates int → long widening and long → int narrowing. The baseline delegates to the long overload, widening result from int to long and narrowing the return back to int. The specialized loop keeps the accumulator in int.
  • Eliminates long-shift arithmetic. The long overload operates on (b & 0x7FL) << shift with 64-bit shift and 64-bit add per iteration. The int specialization runs (b & 0x7F) << shift with 32-bit shift and 32-bit add. On x86-64 the instructions are the same width, but on JVMs and architectures where long ops are more expensive (and under register-pressure-sensitive inlining), the int form is cheaper.
  • Replaces post-facto v > Integer.MAX_VALUE with an in-loop shift == 28 branch. The baseline completes a full long decode and then rejects the result if it exceeds Integer.MAX_VALUE. The specialized version decides overflow one byte earlier, at the moment shift == 28, which is the only shift at which an int result can grow beyond 31 bits. Fewer instructions executed on the overflow path, and no post-decode compare-and-rollback.
  • Eliminates the explicit in.readerIndex(readerIndex) rollback on overflow. The baseline captures readerIndex before delegation and resets it after catching an overflow via the long decoder. The specialized version never advances past the overflow byte — the shift == 28 branch runs before the in.readerIndex(readerIndex + 1) write — so there is nothing to roll back.
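Concretely, the loop the bullets describe has roughly this shape. This is a self-contained sketch over a plain byte[] rather than Netty's ByteBuf, with hypothetical names and a plain IllegalStateException standing in for the cached Http2Exception constants; it is not the production code:

```java
// Sketch of the specialized int ULE128 decode loop. Assumes a byte[] input
// instead of ByteBuf; the exceptions are stand-ins for the cached
// Http2Exception constants used in HpackDecoder.
final class Ule128IntSketch {
    // 'result' is the value already decoded from the N-bit prefix (0..0x7F).
    static int decode(byte[] in, int offset, int result) {
        assert result <= 0x7f && result >= 0;
        final boolean resultStartedAtZero = result == 0;
        for (int i = offset, shift = 0; i < in.length; i++, shift += 7) {
            byte b = in[i];
            // Overflow is decided in-loop at shift == 28, the only shift at
            // which an int result can grow beyond 31 bits, and before any
            // reader index would be advanced past the offending byte.
            if (shift == 28 && ((b & 0x80) != 0
                    || !resultStartedAtZero && b > 6
                    || resultStartedAtZero && b > 7)) {
                throw new IllegalStateException("ULE128 overflows int");
            }
            if ((b & 0x80) == 0) {
                // Terminating byte (MSB clear): fold in the last 7-bit payload.
                return result + ((b & 0x7F) << shift);
            }
            // Continuation byte: accumulate 7 bits, staying in the int domain.
            result += (b & 0x7F) << shift;
        }
        throw new IllegalStateException("buffer ended mid-ULE128");
    }
}
```

The whole computation stays in 32-bit registers; there is no widening, no 64-bit arithmetic, and no post-decode validation step.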

The old code was slower because each int decode paid for a long-domain computation plus a second validation step. The JIT cannot eliminate the cost because the long decoder is shared with the long overload at line 488 and cannot be specialized away per call site.

Why It Is Correct

Preserved contract

  • Algorithm: identical RFC 7541 §5.1 ULE128 encoding, 7 bits per byte, MSB continuation, little-endian byte order.
  • Domain precondition: result <= 0x7f && result >= 0 preserved by the same assert.
  • Return value: same int for every legal ULE128 encoding representable in int.
  • Overflow exception identity: DECODE_ULE_128_TO_INT_DECOMPRESSION_EXCEPTION — unchanged reference, unchanged message, unchanged thrown type Http2Exception.
  • Underflow / short-buffer exception: DECODE_ULE_128_DECOMPRESSION_EXCEPTION — unchanged.
  • readerIndex on overflow: baseline captures entry readerIndex, advances through long decode, restores on throw. Specialized never advances past the overflow byte. Net observable state after DECODE_ULE_128_TO_INT_DECOMPRESSION_EXCEPTION: readerIndex equals its value at method entry in both versions.
  • readerIndex on short buffer: both versions walk the local readerIndex variable forward without calling in.readerIndex(...), then throw. Net observable state: identical.
  • readerIndex on success: both versions set in.readerIndex(readerIndex + 1) at the terminating byte. Identical.
  • Thread-safety: static method, no shared state. Unchanged.
  • Public API: package-private, signature unchanged.
  • JVM floor: Java 8-compatible constructs only. Unchanged.

Overflow-boundary argument

The baseline's v > Integer.MAX_VALUE check fires exactly when the decoded long has a bit set above position 31. For a ULE128-encoded value, bits above position 31 can only come from the byte read at shift == 28 (which contributes bits 28..34 inclusive).

The specialized version's branch at shift == 28:

if (shift == 28 && ((b & 0x80) != 0 || !resultStartedAtZero && b > 6 || resultStartedAtZero && b > 7))

rejects exactly the same inputs:

  • (b & 0x80) != 0 — there is a continuation byte after shift 28, so the result must exceed 35 bits → overflows int regardless of the current byte's payload.
  • resultStartedAtZero && b > 7 — top-byte payload is 8..127, pushing the accumulator past 0x7FFFFFFF. (Max when zero: 0x0 + 0x7F + (0x7F << 7) + (0x7F << 14) + (0x7F << 21) + (0x7 << 28) = Integer.MAX_VALUE; any higher top byte overflows.)
  • !resultStartedAtZero && b > 6 — non-zero prefix puts the reachable max at [0x01, 0x7F] + 0x7F + (0x7F << 7) + (0x7F << 14) + (0x7F << 21) + (0x6 << 28); top byte ≥ 7 overflows.

This matches the comment block in the existing code describing the maximum representable value.
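The arithmetic behind those bounds can be checked mechanically. The helper below is illustrative only, not part of the patch:

```java
// Reachable maxima for a 5-byte ULE128 continuation, per the bounds above.
final class Ule128Bounds {
    // Accumulator starts at zero: four full 7-bit payloads plus a top byte
    // of at most 0x7 at shift 28 lands exactly on Integer.MAX_VALUE.
    static long maxFromZero(int topByte) {
        return 0x7FL + (0x7FL << 7) + (0x7FL << 14) + (0x7FL << 21) + ((long) topByte << 28);
    }
    // Non-zero prefix (1..0x7F): the top byte may carry at most 0x6
    // before the accumulated sum exceeds Integer.MAX_VALUE.
    static long maxFromPrefix(int prefix, int topByte) {
        return prefix + maxFromZero(topByte);
    }
}
```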

Reference implementation

The specialized int decoder is not newly authored. It is byte-for-byte the same shape as HpackDecoderULE128Benchmark.decodeULE128(ByteBuf, int) at microbench/src/main/java/io/netty/handler/codec/http2/HpackDecoderULE128Benchmark.java:130–153, which has lived in the repo since 2017 as the "fast path" reference. This PR moves it from microbench to production.

Tests proving the contract

  • io.netty.handler.codec.http2.HpackDecoderTest — 58 tests covering full header block decode, including explicit Integer.MAX_VALUE boundary encodings and one-past-MAX encodings for indexed header, literal name length, and literal value length paths.
  • Full codec-http2 test suite (1402 tests) — covers neighboring codec behavior that exercises the HPACK decoder indirectly.
  • Command: mvn test -pl codec-http2 -Dcheckstyle.skip=true.
  • Result with diff applied: BUILD SUCCESS, 1402 tests, 0 failures, 0 errors, 7 skipped (the 7 skips exist on baseline).

Benchmark Methodology

  • Harness: JMH 1.36 (shaded microbench/target/microbenchmarks.jar, built with -Pbenchmark-jar).
  • Build command: mvn -pl microbench -am -DskipTests -Pbenchmark-jar package.
  • JVM: JDK 25.0.2 (OpenJDK 64-Bit Server VM, 25.0.2+10-Ubuntu-124.04). JAVA_HOME=/usr/lib/jvm/java-25-openjdk-amd64.
  • JVM options: none beyond defaults.
  • Host: Linux x86_64.
  • Mode: avgt, time unit ns/op.
  • Forks: 2 (-f 2).
  • Warmup: 5 iterations × 1s each (-wi 5 -w 1s).
  • Measurement: 10 iterations × 1s each per fork (-i 10 -r 1s) → 20 measurement samples per benchmark.
  • Threads: 1, synchronized iterations.
  • Blackhole mode: compiler, auto-detected by JMH (the raw log reports # Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)).
  • Input distribution: the pre-existing HpackDecoderULE128Benchmark @Setup populates a ByteBuf with the ULE128 encoding of Integer.MAX_VALUE per invocation. This is not a compile-time constant from the benchmark method's perspective; the encoded bytes are read from heap state.
  • DCE prevention: JMH return-value consumption plus compiler blackholes are active.

Benchmark command

export JAVA_HOME=/usr/lib/jvm/java-25-openjdk-amd64
cd netty
java -jar microbench/target/microbenchmarks.jar \
  "HpackDecoderULE128Benchmark\.(decodeMaxInt|decodeMaxIntUsingLong)$" \
  -f 2 -wi 5 -i 10 -w 1s -r 1s -tu ns

Results

Case: decodeULE128(ByteBuf, int) on the Integer.MAX_VALUE ULE128 encoding
Before (baseline): 7.536 ± 0.201 ns/op, 99.9% CI [7.335, 7.736]
After (optimized): 6.839 ± 0.160 ns/op, 99.9% CI [6.679, 7.000]
Change: −9.25%, non-overlapping CIs (gap 0.335 ns/op)
  • Units: nanoseconds per operation.
  • Confidence intervals are 99.9% as emitted by JMH.
  • The optimized upper bound (7.000) is strictly less than the baseline lower bound (7.335); the CIs do not touch.
  • Error margins: ±2.67% of score (baseline), ±2.34% of score (optimized). Both tight.
  • Per-iteration data (from raw JMH log):
    • decodeMaxInt: fork 1 (min, max) = (6.604, 7.230); fork 2 = (6.645, 7.198). Consistent across forks.
    • decodeMaxIntUsingLong: fork 1 (min, max) = (7.353, 7.997); fork 2 = (7.210, 8.030). Consistent across forks.
  • Cross-benchmark consistency: this is the same setup already used in HpackDecoderULE128Benchmark since 2017 — the benchmark was designed specifically to compare these two shapes.
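The headline figures follow directly from the raw scores above and can be recomputed with a few lines of arithmetic (helper names here are for checking only):

```java
// Recompute the reported delta and CI gap from the table's raw numbers.
final class BenchMath {
    // Relative change in average time, in percent (negative = faster).
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }
    // Distance between the baseline CI lower bound and the optimized
    // CI upper bound; positive means the intervals do not touch.
    static double ciGap(double baselineLowerBound, double optimizedUpperBound) {
        return baselineLowerBound - optimizedUpperBound;
    }
}
```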

No extrapolation to query-level, connection-level, or service-level latency is claimed. This is a microbenchmark-level win on a pervasive primitive. See "Callers / Impact Scope" below.

Reproduction

# Setup (one-time)
git clone <fork-url> netty && cd netty
git checkout perf/hpack-decode-ule128-specialize-int  # or the base for baseline measurement
export JAVA_HOME=/usr/lib/jvm/java-25-openjdk-amd64   # or equivalent JDK with jni.h
sudo apt-get install -y autoconf automake libtool libtool-bin build-essential  # only needed for JMH shaded jar build
mvn -pl microbench -am -DskipTests -Pbenchmark-jar package

# Baseline (before this PR)
git checkout <base-sha>
mvn -pl microbench -am -DskipTests -Pbenchmark-jar package
java -jar microbench/target/microbenchmarks.jar \
  "HpackDecoderULE128Benchmark\.decodeMaxIntUsingLong$" \
  -f 2 -wi 5 -i 10 -w 1s -r 1s -tu ns

# Optimized (after this PR - the same benchmark class, different method)
git checkout <head-sha>
mvn -pl microbench -am -DskipTests -Pbenchmark-jar package
java -jar microbench/target/microbenchmarks.jar \
  "HpackDecoderULE128Benchmark\.decodeMaxInt$" \
  -f 2 -wi 5 -i 10 -w 1s -r 1s -tu ns

# Tests
mvn test -pl codec-http2 -Dcheckstyle.skip=true

Expected wall time: ~30s per benchmark, ~22s for the codec-http2 test suite, ~2–5 min for the first microbench build.

Methodological note on the before/after comparison. HpackDecoderULE128Benchmark is a pre-existing in-tree benchmark that already exposes both shapes: decodeMaxInt invokes the specialized int decoder body (the shape this PR lands in production), and decodeMaxIntUsingLong invokes HpackDecoder.decodeULE128(ByteBuf, int) unchanged (which on baseline uses the delegation path). Running both methods on 4.2 is sufficient to measure the delta without having to land the diff first — the benchmark has existed for exactly this purpose since 2017.

Callers / Impact Scope

This PR is framed as a primitive-level win with pervasive caller use, not an end-to-end infrastructure claim.

Named downstream callers

All in codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.java, inside decodeHeaders(...):

1. Line 226: getIndexedHeader(decodeULE128(in, index)), indexed header representation (RFC 7541 §6.1)
2. Line 235: name = readName(decodeULE128(in, index)), literal header with indexed name (RFC 7541 §6.2.1)
3. Line 254: nameLength = decodeULE128(in, index), literal header name length (RFC 7541 §5.2)
4. Line 291: valueLength = decodeULE128(in, index), literal header value length (RFC 7541 §5.2)

All four pass int index (not long), confirming the int overload is the hot path.

Verify with:

rg -n 'decodeULE128\(' codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.java

Why this method qualifies as a primitive

  • Called on every HPACK HEADERS-block decode. A single HTTP/2 request with N headers typically triggers N to 4N calls depending on header representation (indexed vs literal, with/without indexing, with/without length prefixes).
  • Called from four syntactically distinct production sites within the same tight loop.
  • RFC-defined primitive encoding with fixed semantics; behavior is specified by the standard, not by internal policy.
  • Package-private but used throughout the io.netty.handler.codec.http2 package for all HPACK decode paths.
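To make the per-call framing concrete: a small table index encodes in a single byte, while the Integer.MAX_VALUE case benchmarked here takes five. A minimal LEB128 encoder (a hypothetical helper, not a Netty API) shows the byte lengths involved:

```java
// Minimal LEB128 encoder for the continuation portion of a ULE128 integer:
// 7 bits per byte, least-significant group first, MSB set on all but the last.
final class Ule128Encode {
    static byte[] encode(int value) {
        int n = 1;
        for (int v = value >>> 7; v != 0; v >>>= 7) {
            n++; // one output byte per 7-bit group
        }
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            int payload = (value >>> (7 * i)) & 0x7F;
            out[i] = (byte) (i == n - 1 ? payload : payload | 0x80);
        }
        return out;
    }
}
```

Integer.MAX_VALUE encodes as FF FF FF FF 07, the five-byte worst case the benchmark exercises; typical header-table indices stay in the one-byte fast path.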

End-to-end impact

Not directly measured. Not claimed. The impact argument rests on call-site count and pervasive use, not on end-to-end numbers. Any deployment that terminates HTTP/2 with Netty pays the per-call cost of decodeULE128 multiple times per request, so the microbenchmark win carries through proportional to header density, but the exact translation to request latency or throughput depends on workload header shape and is not part of this PR's claim.

Risks and Limitations

  • Win shrinks for short ULE128 encodings. The benchmark exercises Integer.MAX_VALUE — a five-byte worst-case encoding. Most real HPACK indices are one or two bytes. The delegation and post-facto check still cost something on the short path, but the relative win is smaller. A follow-up benchmark on one-byte and two-byte ULE128s would quantify this, and is left out of scope for this PR.
  • JIT-dependent. The measured delta is on OpenJDK 25.0.2 C2. Other JDKs may inline the delegation better or worse; no cross-JVM validation is included.
  • Bytecode/inlining risk. Replacing a two-line delegation with a 20-line method slightly increases inlining cost. At 20 lines post-optimization the method is still well within default MaxInlineSize, and JMH measurements show the change is a net win, so this is not an observed regression — just a risk to note.
  • GC risk: none. Both versions allocate zero objects.
  • Follow-ups intentionally left out:
    • Short-encoding benchmark coverage.
    • Consolidating the overflow-bound comment with the long overload's comment for consistency.

Test Plan

  • mvn test -pl codec-http2 -Dtest=HpackDecoderTest -Dcheckstyle.skip=true — 58/58 pass (baseline + with diff).
  • mvn test -pl codec-http2 -Dcheckstyle.skip=true (full codec-http2 suite) — 1402 tests, 0 failures, 0 errors, 7 pre-existing skips (with diff applied).
  • JMH regression benchmark on the same input distribution: decodeMaxInt vs decodeMaxIntUsingLong — non-overlapping 99.9% CIs.
  • Style/format: mvn checkstyle:check — can be verified in CI.
  • Full Netty build sanity: mvn -DskipTests -Dcheckstyle.skip=true install -pl codec-http2 -am — can be verified in CI.

Replace the int-variant of decodeULE128 (which delegated to the long
overload and re-validated the result against Integer.MAX_VALUE) with
a specialized int decoder that detects overflow at shift == 28.

Benchmark (HpackDecoderULE128Benchmark, JMH 1.36, JDK 25.0.2,
-f 2 -wi 5 -i 10 -w 1s -r 1s -tu ns):

  decodeMaxIntUsingLong (baseline)  7.536 +/- 0.201 ns/op  CI99.9% [7.335, 7.736]
  decodeMaxInt          (optimized) 6.839 +/- 0.160 ns/op  CI99.9% [6.679, 7.000]

9.25% improvement, 99.9% confidence intervals do not overlap.

Callers: 4 sites in HpackDecoder.decodeHeaders (lines 226, 235, 254, 291).
Public API: unchanged (package-private). Thread-safety: unchanged.
JVM floor: unchanged (Java 8).

Tests: codec-http2 full suite passes (1402 tests, 0 failures,
7 pre-existing skips). HpackDecoderTest 58/58.

The optimized shape is the same as HpackDecoderULE128Benchmark.decodeULE128
that has lived in microbench/ since 2017 as the reference implementation.