Speed up Windows dep resolution by building librdkafka at image-build time#23817
Speed up Windows dep resolution by building librdkafka at image-build time#23817Kyle-Neale wants to merge 11 commits into
Conversation
Add assert_kafka_version_matches() to build_wheels.py, called at the start of the TemporaryDirectory build block. Aborts loudly when the Windows Dockerfile ENV disagrees with the requirements.in pin; no-op when the ENV is absent (Linux/macOS). Covers all four cases with TDD tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
build_wheels.py imports pathspec and dotenv at module level. These ship via runner_dependencies.txt inside the container at image build, but the CI test job only installs host_dependencies.txt + test_dependencies.txt — so any test that imports build_wheels (e.g. test_build_wheels.py added in the previous commit) would fail with ImportError. Pin both packages here using the same versions as runner_dependencies.txt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the perl-based parse of C:\mnt\requirements.in with a direct read of $Env:CONFLUENT_KAFKA_VERSION. The drift assertion in build_wheels.py already guarantees this env matches the confluent-kafka pin in agent_requirements.in, so we lose nothing — and we gain the ability to run this script at image-build time (next task) when C:\mnt is not yet mounted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the librdkafka + vcpkg deps (OpenSSL, curl, zlib, zstd) build out of DD_BUILD_COMMAND, where it ran on every wheel-build (~25 min per run), and into a single Dockerfile RUN that bakes the resulting .dll/.lib/.h files into image layers. The cached image then carries them across wheel-build runs, and the Resolve Dependencies and Build Wheels workflow only rebuilds when image inputs (Dockerfile, build_script.ps1, runner_dependencies.txt) actually change. The three native-dep output directories (C:\include, C:\lib, C:\bin) are created before the script runs, since it writes the build outputs into them. INCLUDE/LIB env and the PATH append for C:\bin remain at runtime so they only affect downstream wheel builds, not the librdkafka build itself. The drift assertion in build_wheels.py (added in a prior commit) guarantees that a bump to the confluent-kafka pin in agent_requirements.in without a matching bump to ENV CONFLUENT_KAFKA_VERSION in this Dockerfile will abort the wheel build with a clear error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback on baking librdkafka into the image: 1. Clean up vcpkg, librdkafka sources, and the tarball intermediate at the end of build_script.ps1. Without this, the build artifacts (~1-2 GB of vcpkg downloads + buildtrees, plus the full librdkafka source tree) persist in the Docker layer permanently, inflating pull/push times and registry cost. Cleanup must run in the same RUN as the build — a later layer cannot shrink bytes already committed to an earlier one. ErrorAction Continue keeps the build successful even if Windows holds a file lock. 2. Add a bump checklist above ENV CONFLUENT_KAFKA_VERSION listing every file a confluent-kafka bump needs to touch: requirements.in, this ENV, the tarball SHA256 in build_script.ps1, and (rarely) the vcpkg desired_commit. The drift assertion already guards (1)+(2); this comment makes (3) and (4) discoverable without grepping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- build_script.ps1: replace Remove-Item -ErrorAction Continue with try/catch
+ Write-Warning. The previous form buffered cleanup errors into the docker
build log error stream where they were trivial to miss; warnings render
prominently and can be grepped from CI logs. Same intent (don't fail the
build for a partial cleanup) but partial failures now leave a signal.
- build_wheels.py: collapse multi-line docstring to a one-liner per AGENTS.md
("internal helpers get concise one-liners; multi-line docstrings only for
important public-interface methods"). Add an explicit missing-file abort
before reading requirements.in, so a misconfigured mount surfaces the same
friendly message as the other failure modes rather than a FileNotFoundError
stack trace. Tighten the version regex from [\d.]+ to \d+(?:\.\d+){2} so it
rejects malformed pins like 2..13.2.
- Dockerfile: drop a comment that restated New-Item, and revert a comment
that was reworded only to satisfy a grep check in the previous commit
(same meaning, no functional change).
- test_build_wheels.py: add test_aborts_when_requirements_file_missing to
lock in the new missing-file abort path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
CI surfaced two issues on PR #23817: 1. mypy --config-file pyproject.toml . failed with import-not-found on `build_wheels` because scripts/ wasn't on mypy_path. Symmetric for pytest — it worked only via a manual sys.path.insert at the top of the test file. Fixed by adding pyproject.toml config: [tool.pytest.ini_options] pythonpath = ["scripts"] [tool.mypy] mypy_path = "scripts" Both pytest and mypy now resolve `build_wheels` cleanly, and the per-test sys.path hack is gone. 2. The four cases that exercised drift/match/no-pin/missing-file all repeated the same `_write_requirements` + `monkeypatch.setenv` + `monkeypatch.setattr` triplet. Collapsed the three env-set cases into a single parametrized test behind a `write_requirements` fixture; left the env-unset and missing-file cases as standalone tests since their setup genuinely differs. 5 cases still covered. Local pytest: 5 passed in 0.05s. Local mypy: clean.
Validation ReportAll 21 validations passed. Show details
|
AAraKKe
left a comment
There was a problem hiding this comment.
Hi @Kyle-Neale, this review was generated with Claude Code and has been looked over and approved before posting. I believe the comments below are valid — feel free to discuss any you disagree with.
Thanks! This is a step in the good direction but we need to be 100% we are not skipping any edge case that can prevent us from building and shipping the proper dependencies. This is a first review where I prompted the agents to make sure they inspect any possible issue that could either avoid building the proper image or break for unrelated reasons. Let's iterate before getting this merged.
Comment legend
praise: no action needed, just celebrate!
note: informational, no action required.
question: seeking clarification or understanding your approach.
nit: minor, non-blocking (style, typo). Feel free to ignore.
suggestion: optional improvement, recommended but not required.
request: a change I believe is necessary before merging.
| abort(f'CONFLUENT_KAFKA_VERSION is set but {requirements} is missing — is the build mount configured?') | ||
| pin = None | ||
| for line in requirements.read_text(encoding='utf-8').splitlines(): | ||
| match = re.match(r'^\s*confluent-kafka==(\d+(?:\.\d+){2})\s*$', line) |
There was a problem hiding this comment.
Request: The pin-match regex ^\s*confluent-kafka==(\d+(?:\.\d+){2})\s*$ is much stricter than the file format it parses, and a benign edit to agent_requirements.in will silently abort Windows builds with a misleading "no pin found" error. Any of these realistic future states fall through to the pin is None branch:
- Environment marker —
confluent-kafka==2.13.2; sys_platform != 'darwin'. Markers are already in use elsewhere inagent_requirements.in(e.g. line 1). - Inline comment —
confluent-kafka==2.13.2 # pinned. - Extras —
confluent-kafka[avro]==2.13.2. - Non-three-part PEP 440 versions —
2.13.2.post1,2.13.2rc1. Librdkafka publishes pre-release tarballs with this shape upstream.
When this happens, the abort message at line 370 says "no confluent-kafka== pin found in requirements.in" — which is wrong, the pin is there. A reviewer bumping kafka and adding a marker would chase a missing-pin error against a file that visibly contains the pin.
I'd loosen the regex to accept the cases that should still be matched (markers, extras, comments, surrounding whitespace), while keeping it strict about the operator (which is what you actually want to reject — a range pin would invalidate the cached binary):
PIN_RE = re.compile(r'^\s*confluent-kafka(?:\[[^\]]+\])?\s*==\s*([^\s;#]+)')
for line in requirements.read_text(encoding='utf-8').splitlines():
m = PIN_RE.match(line.split('#', 1)[0])
if m:
pin = m.group(1)
breakIf you prefer to keep the strict regex, at minimum split the failure paths so the abort message can say "pin found but format not supported" vs "no pin found at all", and add a comment on the regex line stating that markers/extras are intentionally unsupported so a future contributor isn't misled. The macOS/linux variants of the same extraction (.builders/images/macos/extra_build.sh:12, .builders/images/linux-x86_64/build_script.sh:16) use a looser shape — worth aligning.
There was a problem hiding this comment.
Done — loosened to KAFKA_PIN_RE = re.compile(r'^\s*confluent-kafka(?:\[[^\]]+\])?\s*==\s*([^\s;#]+)') with .split('#', 1)[0] stripping inline comments first. Lifted to module scope. Kept == strict so a range pin still falls through to the no-pin abort. New test rows cover env markers, inline comments, extras, surrounding whitespace, plus >= as a negative case.
| return {'compressed': compressed_size, 'uncompressed': uncompressed_size} | ||
|
|
||
|
|
||
| def assert_kafka_version_matches() -> None: |
There was a problem hiding this comment.
Suggestion: The three-file bump checklist (Dockerfile ENV, build_script.ps1 -Hash, agent_requirements.in pin) lives only in a Dockerfile comment at lines 171–179. assert_kafka_version_matches is the enforcement point in Python, but its one-line docstring doesn't point at the checklist — a future contributor reading build_wheels.py alone has no cross-reference for what else must change. Expanding the docstring would make the contract self-documenting on both sides:
def assert_kafka_version_matches() -> None:
"""Abort if CONFLUENT_KAFKA_VERSION env disagrees with the pin in requirements.in.
Env is set only in .builders/images/windows-x86_64/Dockerfile.
Bump checklist when changing the confluent-kafka version:
1. agent_requirements.in confluent-kafka==X.Y.Z
2. Dockerfile ENV CONFLUENT_KAFKA_VERSION this value
3. build_script.ps1 -Hash for the vX.Y.Z tarball
"""There was a problem hiding this comment.
Done — adopted your docstring almost verbatim, plus added item 4 ($desired_commit for vcpkg) so the Python docstring matches the 4-item Dockerfile checklist.
| pin = match.group(1) | ||
| break | ||
| if pin is None: | ||
| abort('CONFLUENT_KAFKA_VERSION is set but no confluent-kafka== pin found in requirements.in.') |
There was a problem hiding this comment.
Suggestion: When pin is None, the abort message omits both the expected value and the inspected path, so a CI log triage reader has to cross-reference the Dockerfile to know what was being looked for. The drift message at lines 372–375 already includes both — this one should match:
abort(
f'CONFLUENT_KAFKA_VERSION ({expected}) is set but no confluent-kafka== pin '
f'found in {requirements}. Did agent_requirements.in change shape?'
)There was a problem hiding this comment.
Done. New message:
CONFLUENT_KAFKA_VERSION ({expected}) is set but no confluent-kafka== pin found in {requirements}. Did agent_requirements.in change shape?
| # https://github.com/pypa/pip/issues/10114#issuecomment-1880125475 | ||
| env_vars['PIP_FIND_LINKS'] = path_to_uri(staged_wheel_dir) | ||
|
|
||
| assert_kafka_version_matches() |
There was a problem hiding this comment.
Suggestion: The call to assert_kafka_version_matches() is unconditional and nothing at the call site signals that this is a Windows-only guard. The behavior is correct — the function no-ops when CONFLUENT_KAFKA_VERSION is unset, and only the Windows Dockerfile sets it — but the asymmetry is invisible to a reader and fragile to future Dockerfile edits. A short comment makes the platform contract a self-documenting trip-wire:
# No-op on Linux/macOS: CONFLUENT_KAFKA_VERSION is set only in the Windows Dockerfile.
assert_kafka_version_matches()There was a problem hiding this comment.
Done — added your comment verbatim above the call. Also moved the call out of the TemporaryDirectory block so it aborts before any of the expensive setup runs.
| } | ||
| Write-Host "Will build librdkafka $kafka_version" | ||
|
|
||
| # Download and unpack the source |
There was a problem hiding this comment.
Question: The runtime lockstep (assert_kafka_version_matches) only checks CONFLUENT_KAFKA_VERSION vs the pin. It does not check the librdkafka tarball -Hash here at line 18 or the vcpkg $desired_commit at line 30. Both are correct inputs to the resulting native binary, but they're protected only by the image-inputs hash (which causes an image rebuild when these strings change) and by Get-RemoteFile's hash-mismatch failure at image-build time. Two operator scenarios where this gap could bite:
- A future bump of
CONFLUENT_KAFKA_VERSION+-Hashto a matching pair, but$desired_commitleft at an older vcpkg baseline. The image rebuilds, the assertion passes, and the resulting librdkafka is linked against whatever vcpkg resolved at the old commit's OpenSSL/curl/zlib/zstd — silently linked against transitive deps the operator didn't intend to ship. - The reverse:
-Hashbumped withoutCONFLUENT_KAFKA_VERSION—Get-RemoteFileerrors loudly at image-build time, which is fine but late.
The Dockerfile checklist (lines 171–179) does enumerate -Hash and $desired_commit as manual bump steps, so the human protocol exists. Is "loud failure at image-build time" (for -Hash) and "trust the checklist" (for $desired_commit) the accepted enforcement model, or is it worth extending the assertion to also compare the -Hash line and the $desired_commit line against an expected sidecar (e.g. a small .deps/windows-native-pins.toml) so all three pieces of state are programmatically linked? I'm asking rather than flagging as Request because vcpkg at a pinned commit is deterministic and the scenarios above are operator-mistake-on-future-bump, not a current bug.
There was a problem hiding this comment.
Sticking with the current enforcement model. Reasoning:
-Hash:Get-RemoteFileaborts at image-build time on mismatch, and image-build is the only place the hash is consumed — a wrong hash can never reach a wheel. Late but loud is the right guarantee.$desired_commit: the image-inputs hash includesbuild_script.ps1, so any change to the commit forces a fresh image build. At a pinned commit, vcpkg is deterministic.- A sidecar TOML would add a third source of truth without removing the human checklist. Net cost > benefit for a 4-line bump.
Did add two short lockstep cross-reference comments in build_script.ps1 — above -Hash and above $desired_commit — pointing at the Dockerfile checklist, so a reader of the build script alone sees the contract.
| return _write | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( |
There was a problem hiding this comment.
Suggestion: The three parametrized rows cover drift / match / no-pin only. None cover the regex corner cases that are the most likely real-world way the lockstep guard fails open — and there's no assertion on the abort message, so the suite can't distinguish "no pin found" from "drift detected" if those branches were ever transposed.
Two changes:
-
Add the corner cases (assertions assume the current strict regex; if the regex is loosened per the Request above, flip the
should_abortvalues accordingly):('confluent-kafka==2.13.2; sys_platform != "darwin"', '2.13.2', True), # env marker ('confluent-kafka==2.13.2 # pin', '2.13.2', True), # inline comment ('confluent-kafka[avro]==2.13.2', '2.13.2', True), # extras ('confluent-kafka>=2.13.2', '2.13.2', True), # non-== operator (' confluent-kafka==2.13.2 ', '2.13.2', False), # surrounding ws
-
Use
pytest.raises(SystemExit, match=...)so the suite proves which abort branch fires:with pytest.raises(SystemExit, match='disagrees'): # drift case with pytest.raises(SystemExit, match='no .*== pin'): # no-pin case with pytest.raises(SystemExit, match='is missing'): # missing-file case (line 50)
Tiny related cleanup while you're in here: test_aborts_when_requirements_file_missing duplicates the fixture's monkeypatch.setattr(build_wheels, 'MOUNT_DIR', tmp_path) setup — fold it into the parametrization or extract a tiny set_mount fixture so the two tests share one definition of "what's the mount."
There was a problem hiding this comment.
Done. Expanded parametrize from 3 to 8 rows (env-marker, inline-comment, extras, surrounding-ws, non-eq-operator added).
One snag with the match= suggestion: since abort() prints to stderr and then sys.exit(1), the SystemExit value is the int 1 — match= matches against the exception value, not the printed text. So the suite still passed against the wrong message. Switched to a _assert_aborts_with(capsys, pattern) helper that asserts on capsys.readouterr().err instead, which gets the same guarantee. Folded the duplicate mount setup into a shared mount_dir fixture.
| build_wheels.assert_kafka_version_matches() | ||
|
|
||
|
|
||
| def test_noop_when_env_unset(write_requirements, monkeypatch): |
There was a problem hiding this comment.
Suggestion: test_noop_when_env_unset calls write_requirements('confluent-kafka==2.13.2') before exercising the early-return path. That masks whether the early return actually short-circuits before touching the filesystem — which is the contract that protects Linux/macOS, where CONFLUENT_KAFKA_VERSION is intentionally never set and MOUNT_DIR / 'requirements.in' may not exist. Tighten so the test proves the no-op is FS-independent:
def test_noop_when_env_unset(monkeypatch, tmp_path):
# MOUNT_DIR points at an empty dir — no requirements.in present
monkeypatch.setattr(build_wheels, 'MOUNT_DIR', tmp_path)
monkeypatch.delenv('CONFLUENT_KAFKA_VERSION', raising=False)
build_wheels.assert_kafka_version_matches() # must not raise, must not require requirements.inThere was a problem hiding this comment.
Done — your version exactly. The test now points MOUNT_DIR at an empty tmp_path (no requirements.in present) and deletes the env var. Proves the early return short-circuits before any FS access.
- Loosen KAFKA_PIN_RE to accept extras, env markers, inline comments, and surrounding whitespace; lift to module scope. - Expand bump checklist docstring to 4 items (include vcpkg $desired_commit) and add matching lockstep comments in build_script.ps1. - Improve no-pin abort message to include the expected version and the inspected path. - Hoist assert_kafka_version_matches() above the TemporaryDirectory block so drift aborts before any expensive setup runs. - Expand parametrized tests from 3 to 8 rows covering env markers, inline comments, extras, surrounding whitespace, and non-== operators. - Verify specific abort messages via capsys instead of bare SystemExit.
What does this PR do?
Moves the Windows librdkafka native build + its vcpkg deps (OpenSSL, curl, zlib, zstd) from wheel-build time to image-build time. The compiled binaries are baked into the builder image and reused across runs.
Dockerfile: pinsCONFLUENT_KAFKA_VERSIONas anENV, adds theRUNthat builds librdkafka, dropsENV DD_BUILD_COMMAND(build_wheels.pyno-ops cleanly when unset).build_script.ps1: reads the version from the env, cleans up build intermediates in the same layer so they don't bloat the image.build_wheels.py: aborts the wheel build ifCONFLUENT_KAFKA_VERSIONdisagrees with theconfluent-kafka==pin inagent_requirements.in. Covers the lockstep risk.pyproject.toml/test_dependencies.txt/tests/test_build_wheels.py: wirescripts/onto pytest+mypy path, pin deps the new tests import, 5 parametrized tests for the drift-check.Motivation
The Windows leg of
Resolve Dependencies and Build Wheelsruns ~90 min, with ~25 min spent rebuilding librdkafka + vcpkg native deps at wheel-build time on every resolution. macOS mitigates a similar cost viaactions/cacheon~/builder_root; Windows had no equivalent.Moving the native compile into image-build time means it runs once per image-inputs hash and the result rides in image layers. On the common path (no image rebuild) we expect ~25 min off the Windows leg.
A note on this PR's CI run
This PR changes Dockerfile inputs, so the image-inputs hash changes and CI has to rebuild the image once — which is where the librdkafka compile now lives. Net effect: this run is ~140 min (slower than the ~90 min baseline) because the cost was shifted, not yet eliminated. Subsequent runs that don't touch image inputs (e.g. a typical
agent_requirements.inbump) will hit the cached image and skip the compile entirely, landing closer to ~65 min.Review checklist (to be filled by reviewers)
qa/skip-qalabel if the PR doesn't need to be tested during QA.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged