Skip to content

Speed up Windows dep resolution by building librdkafka at image-build time#23817

Draft
Kyle-Neale wants to merge 11 commits into
masterfrom
kyle.neale/speed-up-windows-dep-resolution
Draft

Speed up Windows dep resolution by building librdkafka at image-build time#23817
Kyle-Neale wants to merge 11 commits into
masterfrom
kyle.neale/speed-up-windows-dep-resolution

Conversation

@Kyle-Neale
Copy link
Copy Markdown
Contributor

@Kyle-Neale Kyle-Neale commented May 22, 2026

What does this PR do?

Moves the Windows librdkafka native build + its vcpkg deps (OpenSSL, curl, zlib, zstd) from wheel-build time to image-build time. The compiled binaries are baked into the builder image and reused across runs.

  • Dockerfile: pins CONFLUENT_KAFKA_VERSION as an ENV, adds the RUN that builds librdkafka, drops ENV DD_BUILD_COMMAND (build_wheels.py no-ops cleanly when unset).
  • build_script.ps1: reads the version from the env, cleans up build intermediates in the same layer so they don't bloat the image.
  • build_wheels.py: aborts the wheel build if CONFLUENT_KAFKA_VERSION disagrees with the confluent-kafka== pin in agent_requirements.in. Covers the lockstep risk.
  • pyproject.toml / test_dependencies.txt / tests/test_build_wheels.py: wire scripts/ onto pytest+mypy path, pin deps the new tests import, 5 parametrized tests for the drift-check.

Motivation

The Windows leg of Resolve Dependencies and Build Wheels runs ~90 min, with ~25 min spent rebuilding librdkafka + vcpkg native deps at wheel-build time on every resolution. macOS mitigates a similar cost via actions/cache on ~/builder_root; Windows had no equivalent.

Moving the native compile into image-build time means it runs once per image-inputs hash and the result rides in image layers. On the common path (no image rebuild) we expect ~25 min off the Windows leg.

A note on this PR's CI run

This PR changes Dockerfile inputs, so the image-inputs hash changes and CI has to rebuild the image once — which is where the librdkafka compile now lives. Net effect: this run is ~140 min (slower than the ~90 min baseline) because the cost was shifted, not yet eliminated. Subsequent runs that don't touch image inputs (e.g. a typical agent_requirements.in bump) will hit the cached image and skip the compile entirely, landing closer to ~65 min.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Kyle-Neale and others added 7 commits May 21, 2026 17:51
Add assert_kafka_version_matches() to build_wheels.py, called at the start
of the TemporaryDirectory build block. Aborts loudly when the Windows
Dockerfile ENV disagrees with the requirements.in pin; no-op when the ENV
is absent (Linux/macOS). Covers all four cases with TDD tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
build_wheels.py imports pathspec and dotenv at module level. These ship via
runner_dependencies.txt inside the container at image build, but the CI test
job only installs host_dependencies.txt + test_dependencies.txt — so any test
that imports build_wheels (e.g. test_build_wheels.py added in the previous
commit) would fail with ImportError. Pin both packages here using the same
versions as runner_dependencies.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the perl-based parse of C:\mnt\requirements.in with a direct read of
$Env:CONFLUENT_KAFKA_VERSION. The drift assertion in build_wheels.py already
guarantees this env matches the confluent-kafka pin in agent_requirements.in,
so we lose nothing — and we gain the ability to run this script at image-build
time (next task) when C:\mnt is not yet mounted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the librdkafka + vcpkg deps (OpenSSL, curl, zlib, zstd) build out of
DD_BUILD_COMMAND, where it ran on every wheel-build (~25 min per run), and
into a single Dockerfile RUN that bakes the resulting .dll/.lib/.h files into
image layers. The cached image then carries them across wheel-build runs, and
the Resolve Dependencies and Build Wheels workflow only rebuilds when image
inputs (Dockerfile, build_script.ps1, runner_dependencies.txt) actually change.

The three native-dep output directories (C:\include, C:\lib, C:\bin) are
created before the script runs, since it writes the build outputs into them.
INCLUDE/LIB env and the PATH append for C:\bin remain at runtime so they only
affect downstream wheel builds, not the librdkafka build itself.

The drift assertion in build_wheels.py (added in a prior commit) guarantees
that a bump to the confluent-kafka pin in agent_requirements.in without a
matching bump to ENV CONFLUENT_KAFKA_VERSION in this Dockerfile will abort
the wheel build with a clear error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback on baking librdkafka into the image:

1. Clean up vcpkg, librdkafka sources, and the tarball intermediate at the end
   of build_script.ps1. Without this, the build artifacts (~1-2 GB of vcpkg
   downloads + buildtrees, plus the full librdkafka source tree) persist in
   the Docker layer permanently, inflating pull/push times and registry cost.
   Cleanup must run in the same RUN as the build — a later layer cannot
   shrink bytes already committed to an earlier one. ErrorAction Continue
   keeps the build successful even if Windows holds a file lock.

2. Add a bump checklist above ENV CONFLUENT_KAFKA_VERSION listing every file
   a confluent-kafka bump needs to touch: requirements.in, this ENV, the
   tarball SHA256 in build_script.ps1, and (rarely) the vcpkg desired_commit.
   The drift assertion already guards (1)+(2); this comment makes (3) and (4)
   discoverable without grepping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- build_script.ps1: replace Remove-Item -ErrorAction Continue with try/catch
  + Write-Warning. The previous form buffered cleanup errors into the docker
  build log error stream where they were trivial to miss; warnings render
  prominently and can be grepped from CI logs. Same intent (don't fail the
  build for a partial cleanup) but partial failures now leave a signal.

- build_wheels.py: collapse multi-line docstring to a one-liner per AGENTS.md
  ("internal helpers get concise one-liners; multi-line docstrings only for
  important public-interface methods"). Add an explicit missing-file abort
  before reading requirements.in, so a misconfigured mount surfaces the same
  friendly message as the other failure modes rather than a FileNotFoundError
  stack trace. Tighten the version regex from [\d.]+ to \d+(?:\.\d+){2} so it
  rejects malformed pins like 2..13.2.

- Dockerfile: drop a comment that restated New-Item, and revert a comment
  that was reworded only to satisfy a grep check in the previous commit
  (same meaning, no functional change).

- test_build_wheels.py: add test_aborts_when_requirements_file_missing to
  lock in the new missing-file abort path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kyle-Neale Kyle-Neale added the qa/skip-qa Automatically skip this PR for the next QA label May 22, 2026
@datadog-datadog-prod-us1

This comment has been minimized.

Kyle-Neale and others added 2 commits May 22, 2026 13:23
CI surfaced two issues on PR #23817:

1. mypy --config-file pyproject.toml . failed with import-not-found on
   `build_wheels` because scripts/ wasn't on mypy_path. Symmetric for pytest —
   it worked only via a manual sys.path.insert at the top of the test file.

   Fixed by adding pyproject.toml config:
     [tool.pytest.ini_options]  pythonpath = ["scripts"]
     [tool.mypy]                mypy_path = "scripts"

   Both pytest and mypy now resolve `build_wheels` cleanly, and the per-test
   sys.path hack is gone.

2. The four cases that exercised drift/match/no-pin/missing-file all repeated
   the same `_write_requirements` + `monkeypatch.setenv` + `monkeypatch.setattr`
   triplet. Collapsed the three env-set cases into a single parametrized test
   behind a `write_requirements` fixture; left the env-unset and missing-file
   cases as standalone tests since their setup genuinely differs.

5 cases still covered. Local pytest: 5 passed in 0.05s. Local mypy: clean.
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 23, 2026

Validation Report

All 21 validations passed.

Show details
Validation Description Status
agent-reqs Verify check versions match the Agent requirements file
ci Validate CI configuration and Codecov settings
codeowners Validate every integration has a CODEOWNERS entry
config Validate default configuration files against spec.yaml
dep Verify dependency pins are consistent and Agent-compatible
http Validate integrations use the HTTP wrapper correctly
imports Validate check imports do not use deprecated modules
integration-style Validate check code style conventions
jmx-metrics Validate JMX metrics definition files and config
labeler Validate PR labeler config matches integration directories
legacy-signature Validate no integration uses the legacy Agent check signature
license-headers Validate Python files have proper license headers
licenses Validate third-party license attribution list
metadata Validate metadata.csv metric definitions
models Validate configuration data models match spec.yaml
openmetrics Validate OpenMetrics integrations disable the metric limit
package Validate Python package metadata and naming
qa-label Validate the pull request declares whether it needs QA for the next Agent release
readmes Validate README files have required sections
saved-views Validate saved view JSON file structure and fields
version Validate version consistency between package and changelog

View full run

Copy link
Copy Markdown
Contributor

@AAraKKe AAraKKe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Kyle-Neale, this review was generated with Claude Code and has been looked over and approved before posting. I believe the comments below are valid — feel free to discuss any you disagree with.


Thanks! This is a step in the good direction but we need to be 100% we are not skipping any edge case that can prevent us from building and shipping the proper dependencies. This is a first review where I prompted the agents to make sure they inspect any possible issue that could either avoid building the proper image or break for unrelated reasons. Let's iterate before getting this merged.


Comment legend

praise: no action needed, just celebrate!
note: informational, no action required.
question: seeking clarification or understanding your approach.
nit: minor, non-blocking (style, typo). Feel free to ignore.
suggestion: optional improvement, recommended but not required.
request: a change I believe is necessary before merging.

Comment thread .builders/scripts/build_wheels.py Outdated
abort(f'CONFLUENT_KAFKA_VERSION is set but {requirements} is missing — is the build mount configured?')
pin = None
for line in requirements.read_text(encoding='utf-8').splitlines():
match = re.match(r'^\s*confluent-kafka==(\d+(?:\.\d+){2})\s*$', line)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request: The pin-match regex ^\s*confluent-kafka==(\d+(?:\.\d+){2})\s*$ is much stricter than the file format it parses, and a benign edit to agent_requirements.in will silently abort Windows builds with a misleading "no pin found" error. Any of these realistic future states fall through to the pin is None branch:

  • Environment marker — confluent-kafka==2.13.2; sys_platform != 'darwin'. Markers are already in use elsewhere in agent_requirements.in (e.g. line 1).
  • Inline comment — confluent-kafka==2.13.2 # pinned.
  • Extras — confluent-kafka[avro]==2.13.2.
  • Non-three-part PEP 440 versions — 2.13.2.post1, 2.13.2rc1. Librdkafka publishes pre-release tarballs with this shape upstream.

When this happens, the abort message at line 370 says "no confluent-kafka== pin found in requirements.in" — which is wrong, the pin is there. A reviewer bumping kafka and adding a marker would chase a missing-pin error against a file that visibly contains the pin.

I'd loosen the regex to accept the cases that should still be matched (markers, extras, comments, surrounding whitespace), while keeping it strict about the operator (which is what you actually want to reject — a range pin would invalidate the cached binary):

PIN_RE = re.compile(r'^\s*confluent-kafka(?:\[[^\]]+\])?\s*==\s*([^\s;#]+)')
for line in requirements.read_text(encoding='utf-8').splitlines():
    m = PIN_RE.match(line.split('#', 1)[0])
    if m:
        pin = m.group(1)
        break

If you prefer to keep the strict regex, at minimum split the failure paths so the abort message can say "pin found but format not supported" vs "no pin found at all", and add a comment on the regex line stating that markers/extras are intentionally unsupported so a future contributor isn't misled. The macOS/linux variants of the same extraction (.builders/images/macos/extra_build.sh:12, .builders/images/linux-x86_64/build_script.sh:16) use a looser shape — worth aligning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — loosened to KAFKA_PIN_RE = re.compile(r'^\s*confluent-kafka(?:\[[^\]]+\])?\s*==\s*([^\s;#]+)') with .split('#', 1)[0] stripping inline comments first. Lifted to module scope. Kept == strict so a range pin still falls through to the no-pin abort. New test rows cover env markers, inline comments, extras, surrounding whitespace, plus >= as a negative case.

return {'compressed': compressed_size, 'uncompressed': uncompressed_size}


def assert_kafka_version_matches() -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The three-file bump checklist (Dockerfile ENV, build_script.ps1 -Hash, agent_requirements.in pin) lives only in a Dockerfile comment at lines 171–179. assert_kafka_version_matches is the enforcement point in Python, but its one-line docstring doesn't point at the checklist — a future contributor reading build_wheels.py alone has no cross-reference for what else must change. Expanding the docstring would make the contract self-documenting on both sides:

def assert_kafka_version_matches() -> None:
    """Abort if CONFLUENT_KAFKA_VERSION env disagrees with the pin in requirements.in.

    Env is set only in .builders/images/windows-x86_64/Dockerfile.
    Bump checklist when changing the confluent-kafka version:
      1. agent_requirements.in              confluent-kafka==X.Y.Z
      2. Dockerfile  ENV CONFLUENT_KAFKA_VERSION   this value
      3. build_script.ps1                   -Hash for the vX.Y.Z tarball
    """

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — adopted your docstring almost verbatim, plus added item 4 ($desired_commit for vcpkg) so the Python docstring matches the 4-item Dockerfile checklist.

Comment thread .builders/scripts/build_wheels.py Outdated
pin = match.group(1)
break
if pin is None:
abort('CONFLUENT_KAFKA_VERSION is set but no confluent-kafka== pin found in requirements.in.')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: When pin is None, the abort message omits both the expected value and the inspected path, so a CI log triage reader has to cross-reference the Dockerfile to know what was being looked for. The drift message at lines 372–375 already includes both — this one should match:

abort(
    f'CONFLUENT_KAFKA_VERSION ({expected}) is set but no confluent-kafka== pin '
    f'found in {requirements}. Did agent_requirements.in change shape?'
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. New message:

CONFLUENT_KAFKA_VERSION ({expected}) is set but no confluent-kafka== pin found in {requirements}. Did agent_requirements.in change shape?

Comment thread .builders/scripts/build_wheels.py Outdated
# https://github.com/pypa/pip/issues/10114#issuecomment-1880125475
env_vars['PIP_FIND_LINKS'] = path_to_uri(staged_wheel_dir)

assert_kafka_version_matches()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The call to assert_kafka_version_matches() is unconditional and nothing at the call site signals that this is a Windows-only guard. The behavior is correct — the function no-ops when CONFLUENT_KAFKA_VERSION is unset, and only the Windows Dockerfile sets it — but the asymmetry is invisible to a reader and fragile to future Dockerfile edits. A short comment makes the platform contract a self-documenting trip-wire:

# No-op on Linux/macOS: CONFLUENT_KAFKA_VERSION is set only in the Windows Dockerfile.
assert_kafka_version_matches()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added your comment verbatim above the call. Also moved the call out of the TemporaryDirectory block so it aborts before any of the expensive setup runs.

}
Write-Host "Will build librdkafka $kafka_version"

# Download and unpack the source
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: The runtime lockstep (assert_kafka_version_matches) only checks CONFLUENT_KAFKA_VERSION vs the pin. It does not check the librdkafka tarball -Hash here at line 18 or the vcpkg $desired_commit at line 30. Both are correct inputs to the resulting native binary, but they're protected only by the image-inputs hash (which causes an image rebuild when these strings change) and by Get-RemoteFile's hash-mismatch failure at image-build time. Two operator scenarios where this gap could bite:

  1. A future bump of CONFLUENT_KAFKA_VERSION + -Hash to a matching pair, but $desired_commit left at an older vcpkg baseline. The image rebuilds, the assertion passes, and the resulting librdkafka is linked against whatever vcpkg resolved at the old commit's OpenSSL/curl/zlib/zstd — silently linked against transitive deps the operator didn't intend to ship.
  2. The reverse: -Hash bumped without CONFLUENT_KAFKA_VERSIONGet-RemoteFile errors loudly at image-build time, which is fine but late.

The Dockerfile checklist (lines 171–179) does enumerate -Hash and $desired_commit as manual bump steps, so the human protocol exists. Is "loud failure at image-build time" (for -Hash) and "trust the checklist" (for $desired_commit) the accepted enforcement model, or is it worth extending the assertion to also compare the -Hash line and the $desired_commit line against an expected sidecar (e.g. a small .deps/windows-native-pins.toml) so all three pieces of state are programmatically linked? I'm asking rather than flagging as Request because vcpkg at a pinned commit is deterministic and the scenarios above are operator-mistake-on-future-bump, not a current bug.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sticking with the current enforcement model. Reasoning:

  • -Hash: Get-RemoteFile aborts at image-build time on mismatch, and image-build is the only place the hash is consumed — a wrong hash can never reach a wheel. Late but loud is the right guarantee.
  • $desired_commit: the image-inputs hash includes build_script.ps1, so any change to the commit forces a fresh image build. At a pinned commit, vcpkg is deterministic.
  • A sidecar TOML would add a third source of truth without removing the human checklist. Net cost > benefit for a 4-line bump.

Did add two short lockstep cross-reference comments in build_script.ps1 — above -Hash and above $desired_commit — pointing at the Dockerfile checklist, so a reader of the build script alone sees the contract.

return _write


@pytest.mark.parametrize(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The three parametrized rows cover drift / match / no-pin only. None cover the regex corner cases that are the most likely real-world way the lockstep guard fails open — and there's no assertion on the abort message, so the suite can't distinguish "no pin found" from "drift detected" if those branches were ever transposed.

Two changes:

  1. Add the corner cases (assertions assume the current strict regex; if the regex is loosened per the Request above, flip the should_abort values accordingly):

    ('confluent-kafka==2.13.2; sys_platform != "darwin"', '2.13.2', True),  # env marker
    ('confluent-kafka==2.13.2  # pin',                    '2.13.2', True),  # inline comment
    ('confluent-kafka[avro]==2.13.2',                     '2.13.2', True),  # extras
    ('confluent-kafka>=2.13.2',                           '2.13.2', True),  # non-== operator
    ('  confluent-kafka==2.13.2  ',                       '2.13.2', False), # surrounding ws
  2. Use pytest.raises(SystemExit, match=...) so the suite proves which abort branch fires:

    with pytest.raises(SystemExit, match='disagrees'):     # drift case
    with pytest.raises(SystemExit, match='no .*== pin'):   # no-pin case
    with pytest.raises(SystemExit, match='is missing'):    # missing-file case (line 50)

Tiny related cleanup while you're in here: test_aborts_when_requirements_file_missing duplicates the fixture's monkeypatch.setattr(build_wheels, 'MOUNT_DIR', tmp_path) setup — fold it into the parametrization or extract a tiny set_mount fixture so the two tests share one definition of "what's the mount."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Expanded parametrize from 3 to 8 rows (env-marker, inline-comment, extras, surrounding-ws, non-eq-operator added).

One snag with the match= suggestion: since abort() prints to stderr and then sys.exit(1), the SystemExit value is the int 1match= matches against the exception value, not the printed text. So the suite still passed against the wrong message. Switched to a _assert_aborts_with(capsys, pattern) helper that asserts on capsys.readouterr().err instead, which gets the same guarantee. Folded the duplicate mount setup into a shared mount_dir fixture.

Comment thread .builders/tests/test_build_wheels.py Outdated
build_wheels.assert_kafka_version_matches()


def test_noop_when_env_unset(write_requirements, monkeypatch):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: test_noop_when_env_unset calls write_requirements('confluent-kafka==2.13.2') before exercising the early-return path. That masks whether the early return actually short-circuits before touching the filesystem — which is the contract that protects Linux/macOS, where CONFLUENT_KAFKA_VERSION is intentionally never set and MOUNT_DIR / 'requirements.in' may not exist. Tighten so the test proves the no-op is FS-independent:

def test_noop_when_env_unset(monkeypatch, tmp_path):
    # MOUNT_DIR points at an empty dir — no requirements.in present
    monkeypatch.setattr(build_wheels, 'MOUNT_DIR', tmp_path)
    monkeypatch.delenv('CONFLUENT_KAFKA_VERSION', raising=False)
    build_wheels.assert_kafka_version_matches()  # must not raise, must not require requirements.in

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — your version exactly. The test now points MOUNT_DIR at an empty tmp_path (no requirements.in present) and deletes the env var. Proves the early return short-circuits before any FS access.

Kyle-Neale and others added 2 commits May 25, 2026 17:57
- Loosen KAFKA_PIN_RE to accept extras, env markers, inline comments,
  and surrounding whitespace; lift to module scope.
- Expand bump checklist docstring to 4 items (include vcpkg
  $desired_commit) and add matching lockstep comments in build_script.ps1.
- Improve no-pin abort message to include the expected version and the
  inspected path.
- Hoist assert_kafka_version_matches() above the TemporaryDirectory
  block so drift aborts before any expensive setup runs.
- Expand parametrized tests from 3 to 8 rows covering env markers,
  inline comments, extras, surrounding whitespace, and non-== operators.
- Verify specific abort messages via capsys instead of bare SystemExit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa/skip-qa Automatically skip this PR for the next QA

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants