perf(codec-http2): specialize HpackDecoder.decodeULE128 int path#1
Open
mashraf-222 wants to merge 1 commit into4.2from
Open
perf(codec-http2): specialize HpackDecoder.decodeULE128 int path#1mashraf-222 wants to merge 1 commit into4.2from
mashraf-222 wants to merge 1 commit into4.2from
Conversation
Replace the int-variant of decodeULE128 (which delegated to the long overload and re-validated the result against Integer.MAX_VALUE) with a specialized int decoder that detects overflow at shift == 28. Benchmark (HpackDecoderULE128Benchmark, JMH 1.36, JDK 25.0.2, -f 2 -wi 5 -i 10 -w 1s -r 1s -tu ns): decodeMaxIntUsingLong (baseline) 7.536 +/- 0.201 ns/op CI99.9% [7.335, 7.736] decodeMaxInt (optimized) 6.839 +/- 0.160 ns/op CI99.9% [6.679, 7.000] 9.25% improvement, 99.9% confidence intervals do not overlap. Callers: 4 sites in HpackDecoder.decodeHeaders (lines 226, 235, 254, 291). Public API: unchanged (package-private). Thread-safety: unchanged. JVM floor: unchanged (Java 8). Tests: codec-http2 full suite passes (1402 tests, 0 failures, 7 pre-existing skips). HpackDecoderTest 58/58. The optimized shape is the same as HpackDecoderULE128Benchmark.decodeULE128 that has lived in microbench/ since 2017 as the reference implementation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace
HpackDecoder.decodeULE128(ByteBuf, int)with a specialized int decoder. Measured 9.25% reduction (2.3% error) in average decode time onHpackDecoderULE128Benchmark.decodeMaxIntvs the currentdecodeMaxIntUsingLong, with non-overlapping 99.9% confidence intervals. One method, one file, package-private API unchanged. The change copies the shape of the reference implementation that has lived inmicrobench/since 2017.What Changed
codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.javadecodeULE128(ByteBuf, int)body with a specialized int-loop decoder. Thelongoverload is unchanged.static int decodeULE128).HpackDecoderTestcovers the four production call sites through full HEADERS-block decode paths, including boundary-valueInteger.MAX_VALUEoverflow cases for indexed header, literal name length, and literal value length paths.Single optimization, single method. No unrelated cleanup bundled.
Why It Works
int → longwidening andlong → intnarrowing. The baseline delegates to the long overload, wideningresultfrom int to long and narrowing the return back to int. The specialized loop keeps the accumulator in int.(b & 0x7FL) << shiftwith 64-bit shift and 64-bit add per iteration. The int specialization runs(b & 0x7F) << shiftwith 32-bit shift and 32-bit add. On x86-64 the instructions are the same width, but on JVMs and architectures where long ops are more expensive (and under register-pressure-sensitive inlining), the int form is cheaper.v > Integer.MAX_VALUEwith an in-loopshift == 28branch. The baseline completes a full long decode and then rejects the result if it exceedsInteger.MAX_VALUE. The specialized version decides overflow one byte earlier, at the momentshift == 28, which is the only shift at which an int result can grow beyond 31 bits. Fewer instructions executed on the overflow path, and no post-decode compare-and-rollback.in.readerIndex(readerIndex)rollback on overflow. The baseline capturesreaderIndexbefore delegation and resets it after catching an overflow via the long decoder. The specialized version never advances past the overflow byte — theshift == 28branch runs before thein.readerIndex(readerIndex + 1)write — so there is nothing to roll back.The old code was slower because each int decode paid for a long-domain computation plus a second validation step. The JIT cannot eliminate the cost because the long decoder is shared with the long overload at line 488 and cannot be specialized away per call site.
Why It Is Correct
Preserved contract
result <= 0x7f && result >= 0preserved by the sameassert.DECODE_ULE_128_TO_INT_DECOMPRESSION_EXCEPTION— unchanged reference, unchanged message, unchanged thrown typeHttp2Exception.DECODE_ULE_128_DECOMPRESSION_EXCEPTION— unchanged.readerIndexon overflow: baseline captures entryreaderIndex, advances through long decode, restores on throw. Specialized never advances past the overflow byte. Net observable state afterDECODE_ULE_128_TO_INT_DECOMPRESSION_EXCEPTION:readerIndexequals its value at method entry in both versions.readerIndexon short buffer: both versions walk the localreaderIndexvariable forward without callingin.readerIndex(...), then throw. Net observable state: identical.readerIndexon success: both versions setin.readerIndex(readerIndex + 1)at the terminating byte. Identical.Overflow-boundary argument
The baseline's
v > Integer.MAX_VALUEcheck fires exactly when the decoded long has a bit set above position 31. For a ULE128, bits above position 31 can only come from the byte atshift == 28(bits 28..34 inclusive).The specialized version's branch at
shift == 28:rejects exactly the same inputs:
(b & 0x80) != 0— there is a continuation byte after shift 28, so the result must exceed 35 bits → overflows int regardless of the current byte's payload.resultStartedAtZero && b > 7— top-byte payload is 8..127, pushing the accumulator past0x7FFFFFFF. (Max when zero:0x0 + 0x7F + (0x7F << 7) + (0x7F << 14) + (0x7F << 21) + (0x7 << 28) = Integer.MAX_VALUE; any higher top byte overflows.)!resultStartedAtZero && b > 6— non-zero prefix puts the reachable max at[0x01, 0x7F] + 0x7F + (0x7F << 7) + (0x7F << 14) + (0x7F << 21) + (0x6 << 28); top byte ≥ 7 overflows.This matches the comment block in the existing code describing the maximum representable value.
Reference implementation
The specialized int decoder is not newly authored. It is byte-for-byte the same shape as
HpackDecoderULE128Benchmark.decodeULE128(ByteBuf, int)atmicrobench/src/main/java/io/netty/handler/codec/http2/HpackDecoderULE128Benchmark.java:130–153, which has lived in the repo since 2017 as the "fast path" reference. This PR moves it from microbench to production.Tests proving the contract
io.netty.handler.codec.http2.HpackDecoderTest— 58 tests covering full header block decode, including explicitInteger.MAX_VALUEboundary encodings and one-past-MAX encodings for indexed header, literal name length, and literal value length paths.codec-http2test suite (1402 tests) — covers neighboring codec behavior that exercises the HPACK decoder indirectly.mvn test -pl codec-http2 -Dcheckstyle.skip=true.Benchmark Methodology
microbench/target/microbenchmarks.jar, built with-Pbenchmark-jar).mvn -pl microbench -am -DskipTests -Pbenchmark-jar package.25.0.2+10-Ubuntu-124.04).JAVA_HOME=/usr/lib/jvm/java-25-openjdk-amd64.avgt, time unitns/op.-f 2).-wi 5 -w 1s).-i 10 -r 1s) → 20 measurement samples per benchmark.# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)from the raw log).HpackDecoderULE128Benchmark@Setuppopulates aByteBufwith the ULE128 encoding ofInteger.MAX_VALUEper invocation. This is not a compile-time constant from the benchmark method's perspective; the encoded bytes are read from heap state.Benchmark command
Results
decodeULE128(ByteBuf, int)onInteger.MAX_VALUEULE128decodeMaxInt: fork 1 (min, max) = (6.604, 7.230); fork 2 = (6.645, 7.198). Consistent across forks.decodeMaxIntUsingLong: fork 1 (min, max) = (7.353, 7.997); fork 2 = (7.210, 8.030). Consistent across forks.HpackDecoderULE128Benchmarksince 2017 — the benchmark was designed specifically to compare these two shapes.No extrapolation to query-level, connection-level, or service-level latency is claimed. This is a microbenchmark-level win on a pervasive primitive. See "Callers / Impact Scope" below.
Reproduction
Expected wall time: ~30s per benchmark, ~22s for the codec-http2 test suite, ~2–5 min for the first microbench build.
Methodological note on the before/after comparison.
HpackDecoderULE128Benchmarkis a pre-existing in-tree benchmark that already exposes both shapes:decodeMaxIntinvokes the specialized int decoder body (the shape this PR lands in production), anddecodeMaxIntUsingLonginvokesHpackDecoder.decodeULE128(ByteBuf, int)unchanged (which on baseline uses the delegation path). Running both methods on4.2is sufficient to measure the delta without having to land the diff first — the benchmark has existed for exactly this purpose since 2017.Callers / Impact Scope
This PR is framed as a primitive-level win with pervasive caller use, not an end-to-end infrastructure claim.
Named downstream callers
All in
codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.java, insidedecodeHeaders(...):getIndexedHeader(decodeULE128(in, index))name = readName(decodeULE128(in, index))nameLength = decodeULE128(in, index)valueLength = decodeULE128(in, index)All four pass
int index(notlong), confirming the int overload is the hot path.Verify with:
rg -n 'decodeULE128\(' codec-http2/src/main/java/io/netty/handler/codec/http2/HpackDecoder.javaWhy this method qualifies as a primitive
io.netty.handler.codec.http2package for all HPACK decode paths.End-to-end impact
Not directly measured. Not claimed. The impact argument rests on call-site count and pervasive use, not on end-to-end numbers. Any deployment that terminates HTTP/2 with Netty pays the per-call cost of
decodeULE128multiple times per request, so the microbenchmark win carries through proportional to header density, but the exact translation to request latency or throughput depends on workload header shape and is not part of this PR's claim.Risks and Limitations
Integer.MAX_VALUE— a five-byte worst-case encoding. Most real HPACK indices are one or two bytes. The delegation and post-facto check still cost something on the short path, but the relative win is smaller. A follow-up benchmark on one-byte and two-byte ULE128s would quantify this, and is left out of scope for this PR.MaxInlineSize, and JMH measurements show the change is a net win, so this is not an observed regression — just a risk to note.Test Plan
mvn test -pl codec-http2 -Dtest=HpackDecoderTest -Dcheckstyle.skip=true— 58/58 pass (baseline + with diff).mvn test -pl codec-http2 -Dcheckstyle.skip=true(full codec-http2 suite) — 1402 tests, 0 failures, 0 errors, 7 pre-existing skips (with diff applied).decodeMaxIntvsdecodeMaxIntUsingLong— non-overlapping 99.9% CIs.mvn checkstyle:check— can be verified in CI.mvn -DskipTests -Dcheckstyle.skip=true install -pl codec-http2 -am— can be verified in CI.