feat: add replay support to runner group and deployment infrastructure #235

Open
JasonXuDeveloper wants to merge 6 commits into Azure:unstable-replay from JasonXuDeveloper:replay/pr6

@JasonXuDeveloper
Contributor

Summary

  • Handler: Replay-aware job building — skip configmap upload for replay mode, use indexed Jobs for runner assignment, custom replay entrypoint script
  • run_replay.sh: Entrypoint script for replay runner pods that downloads the replay profile and invokes kperf runner replay
  • Dockerfile: chmod +x for scripts directory
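
For readers unfamiliar with indexed Jobs: Kubernetes gives every pod of an Indexed Job a distinct completion index through the `JOB_COMPLETION_INDEX` environment variable, which is one way a runner pod can learn which slice of the replay it owns. The sketch below is a minimal illustration of that idea, not the code in this PR; the helper name `runnerIndex` and the `runnerCount` parameter are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// runnerIndex parses the completion index that Kubernetes injects into each
// pod of an Indexed Job and checks it against the expected runner count.
// The helper name and the runnerCount parameter are hypothetical.
func runnerIndex(raw string, runnerCount int) (int, error) {
	if raw == "" {
		return 0, fmt.Errorf("JOB_COMPLETION_INDEX is not set")
	}
	idx, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid completion index %q: %w", raw, err)
	}
	if idx < 0 || idx >= runnerCount {
		return 0, fmt.Errorf("completion index %d out of range [0, %d)", idx, runnerCount)
	}
	return idx, nil
}

func main() {
	// Outside a cluster the variable is usually unset, so only report the error.
	idx, err := runnerIndex(os.Getenv("JOB_COMPLETION_INDEX"), 4)
	if err != nil {
		fmt.Println("no runner index:", err)
		return
	}
	fmt.Println("runner index:", idx)
}
```

Validating the index against the runner count up front means a misconfigured Job fails early instead of silently replaying nothing.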

Test plan

  • go build ./... passes
  • go vet ./... passes
  • go test ./... passes (full test suite)

Part 6 of 6 in the replay feature PR stack. Depends on PR #234.

- Fix "traget" → "target" typo in LoadProfileSpec comment
- Fix "letencies" → "latencies" typo in runner CLI flag description
- Add empty specs validation in loadConfig to prevent index-out-of-range
  panic when config file has no specs
- Preserve nodeAffinity from runnergroup spec when CLI --affinity flag
  is not provided
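
The empty-specs check can be pictured with the following sketch. It is an illustration only: the `config` and `spec` types and the `firstSpec` helper are hypothetical stand-ins, not the actual types or the body of loadConfig.

```go
package main

import (
	"errors"
	"fmt"
)

// spec and config are hypothetical, trimmed-down stand-ins for the real
// configuration types.
type spec struct {
	Rate float64
}

type config struct {
	Specs []spec
}

// firstSpec returns the first spec, or an error when the list is empty.
// Without the length check, cfg.Specs[0] would panic with an
// index-out-of-range error on a config file that has no specs.
func firstSpec(cfg config) (spec, error) {
	if len(cfg.Specs) == 0 {
		return spec{}, errors.New("config must contain at least one spec")
	}
	return cfg.Specs[0], nil
}

func main() {
	if _, err := firstSpec(config{}); err != nil {
		fmt.Println("rejected:", err)
	}
	s, _ := firstSpec(config{Specs: []spec{{Rate: 10}}})
	fmt.Println("first spec rate:", s.Rate)
}
```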

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Add foundation types for the timeseries replay system:
- ReplayRequest, ReplayProfile, ReplayProfileSpec types with validation
- IsReplayMode() method on RunnerGroupSpec for detecting replay configs
- ReplayProfileSpec field in RunnerGroupSpec for distributed mode
- Sample replay profile and runner group config test data

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Extract shared report-building logic into metrics.BuildPercentileLatenciesReport()
to avoid duplication between runner and replay report builders. Refactor
buildRunnerMetricReport() to use the shared utility.

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Core replay data processing (no execution engine yet):
- Loader: YAML/gzip profile loading from file or URL
- Partition: request distribution across runners using object-key
  consistent hashing to preserve per-object ordering
- Builder: HTTP request construction with URL building, verb mapping,
  and URL masking for metrics aggregation
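
To illustrate why hashing on the object key preserves per-object ordering, here is a minimal sketch. It uses a plain FNV hash modulo the runner count, which is a simplification and not a consistent-hash ring (changing the runner count would reshuffle most keys). The `request` type, its fields, and both function names are assumptions for the example, not the PR's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// request is a hypothetical, trimmed-down replay request.
type request struct {
	ObjectKey string // for example "namespace/resource/name"
	Verb      string
}

// runnerFor maps an object key to a runner index with a stable hash, so the
// same key always lands on the same runner. runners must be positive.
func runnerFor(objectKey string, runners int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(objectKey))
	return int(h.Sum32() % uint32(runners))
}

// partition splits requests into one slice per runner. Requests are appended
// in input order, so all requests for one object stay in their original
// relative order on a single runner.
func partition(reqs []request, runners int) [][]request {
	if runners <= 0 {
		return nil
	}
	out := make([][]request, runners)
	for _, r := range reqs {
		i := runnerFor(r.ObjectKey, runners)
		out[i] = append(out[i], r)
	}
	return out
}

func main() {
	reqs := []request{
		{ObjectKey: "default/pods/a", Verb: "create"},
		{ObjectKey: "default/pods/b", Verb: "create"},
		{ObjectKey: "default/pods/a", Verb: "delete"},
	}
	for i, part := range partition(reqs, 2) {
		fmt.Printf("runner %d: %d request(s)\n", i, len(part))
	}
}
```

Because both requests for `default/pods/a` hash to the same runner, the create is always replayed before the delete, regardless of how other objects are distributed.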

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Execution engine for the timeseries replay system:
- Runner: worker pool with time-bucketed scheduling, per-worker metrics,
  pool-first WATCH connection assignment with overflow support
- Scheduler: orchestrator for both local multi-runner and distributed
  single-runner modes with configuration validation and warnings
- CLI: 'kperf replay run' command for local replay execution and
  'kperf runner replay' subcommand for distributed runner pods
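
One plausible reading of time-bucketed scheduling is sketched below: requests are grouped into fixed-width windows by their offset from the start of the replay, and the windows are then dispatched in order. This is an assumption about the general technique, not a description of the runner in this PR; `timedRequest`, `bucketize`, and `sortedBucketIndices` are hypothetical names, and offsets are assumed to be non-negative.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// timedRequest is a hypothetical request tagged with its offset from the
// start of the replay.
type timedRequest struct {
	Offset time.Duration
	Name   string
}

// bucketize groups requests into fixed-width time buckets keyed by bucket
// index (offset divided by width). Input order is kept within each bucket.
func bucketize(reqs []timedRequest, width time.Duration) map[int64][]timedRequest {
	buckets := make(map[int64][]timedRequest)
	if width <= 0 {
		return buckets
	}
	for _, r := range reqs {
		idx := int64(r.Offset / width)
		buckets[idx] = append(buckets[idx], r)
	}
	return buckets
}

// sortedBucketIndices returns the bucket indices in ascending order, which
// is the order a scheduler would dispatch them in.
func sortedBucketIndices(buckets map[int64][]timedRequest) []int64 {
	keys := make([]int64, 0, len(buckets))
	for k := range buckets {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	return keys
}

func main() {
	reqs := []timedRequest{
		{Offset: 100 * time.Millisecond, Name: "list pods"},
		{Offset: 900 * time.Millisecond, Name: "get pod"},
		{Offset: 1500 * time.Millisecond, Name: "watch pods"},
	}
	buckets := bucketize(reqs, time.Second)
	for _, idx := range sortedBucketIndices(buckets) {
		fmt.Printf("bucket %d: %d request(s)\n", idx, len(buckets[idx]))
	}
}
```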

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Distributed replay mode integration:
- Replay-aware job building: skip configmap upload for replay mode,
  use indexed Jobs for runner assignment, custom replay entrypoint script
- run_replay.sh: entrypoint script for replay runner pods that downloads
  the replay profile and invokes kperf runner replay
- Dockerfile: chmod +x for scripts directory

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>