Extended Synthetic Workload Generator for stress testing #24
rbx merged 9 commits into FairRootGroup:master from
Conversation
…arser passthrough

Optional body: Sample duration/nodes/cores with flat|poisson|uniform (same mode as arrivals) using per-attribute params. Add *4 CLI args (arrivals, duration, nodes, cores) in inspect/sanity/train wiring. Add sanity coverage for flat attribute targets and poisson attribute lambdas.
📝 Walkthrough

Adds uniform arrivals mode and per-attribute sampling (flat/poisson/uniform) with jitter and midpoints, introduces additive small/heavy burst injectors with validation and bounding, enforces hard caps, extends CLI with quad-value parsing, and adds deterministic inspection tooling plus a workload-log analyzer for burst suggestions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI
    participant WorkloadGen as WorkloadGenerator
    participant Sampler as AttributeSampler
    participant Burst as BurstInjector
    participant RNG
    participant Output as JobSpecList
    CLI->>WorkloadGen: build WorkloadGenConfig (flat/poisson/uniform, bursts, caps)
    CLI->>RNG: set seed (deterministic mode)
    WorkloadGen->>Sampler: request base_n attribute arrays (durations, nodes, cores)
    Sampler->>RNG: sample per-attribute arrays (mode-specific, clip to min/max)
    Sampler-->>WorkloadGen: return base attribute arrays
    WorkloadGen->>Burst: evaluate burst triggers per-hour (small/heavy prob)
    Burst->>RNG: sample burst sizes and attribute overrides
    Burst-->>WorkloadGen: append burst attributes to arrays, update total_n
    WorkloadGen->>WorkloadGen: enforce hard_cap_jobs (trim arrays)
    WorkloadGen->>Output: construct JobSpec instances for total_n
    Output-->>CLI: return sampled jobs (deterministic digestable list)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
🧹 Nitpick comments (9)
analyze_workload_logs.py (2)
73-84: `zip()` missing `strict=True` parameter.

Although lengths are validated at line 74, passing `strict=True` to `zip` at line 84 provides a safety net against future refactors that might separate the length check from the zip call.

Proposed fix

```diff
- return sum(a * b for a, b in zip(dx, dy)) / (sx * sy)
+ return sum(a * b for a, b in zip(dx, dy, strict=True)) / (sx * sy)
```
225-243: Bare `except Exception` swallows all parse errors silently.

The `consume` inner function catches all exceptions during row parsing (line 232). While this is reasonable for a log analysis tool that needs to be tolerant of malformed data, consider logging the first few exceptions or accumulating distinct error types to aid debugging when `skipped` is unexpectedly high.

train.py (3)
178-243: WorkloadGenConfig construction logic is also duplicated across train.py, test_sanity_env.py, and test_inspect_workloadgen.py.

The pattern of deriving midpoints, defaulting Poisson lambdas, flat targets, and jitters from the CLI args is repeated nearly identically in all three files. Consider extracting a helper like `build_workload_gen_config(args) -> WorkloadGenConfig` to a shared location to keep this logic DRY and ensure consistency when parameters are added or defaults change.
109-121: Renamed `--wg-poisson-lambda` to "Legacy" but it's still the primary default.

Line 109 labels `--wg-poisson-lambda` as "Legacy: arrivals-only poisson lambda" but it remains the active default when `--wg-poisson-lambdas4` is not provided (line 199). The "Legacy" label may confuse users into thinking this option is deprecated. Consider clarifying the help text.
32-76: Extract quad-parsing helpers to a shared module to eliminate duplication.

`_parse_quad_floats`, `_parse_quad_ints`, and `_parse_quad_ranges` are duplicated identically across `train.py`, `test/test_sanity_env.py`, and `test/test_inspect_workloadgen.py` (note: the test_inspect_workloadgen.py version omits the underscore prefix). Move these to a shared location such as `src/cli_utils.py` and import from there in all three files.

test/test_inspect_workloadgen.py (3)
64-72: Naming mismatch: "triplets" actually holds 4-tuples (quadruples).

The function name `digest_jobs_triplets` and the variable `all_jobs_triplets` store 4-element tuples `(hour, duration, nodes, cores_per_node)`, not triplets. The docstring correctly describes 4 fields but the name is misleading.

Proposed rename

```diff
-def digest_jobs_triplets(triplets):
+def digest_jobs_quads(quads):
     """
-    Stable digest to verify determinism.
-
-    We digest (hour_idx, duration, nodes, cores_per_node) so the hash is robust against
-    future refactors that might change how jobs are flattened/stored.
+    Stable digest over (hour_idx, duration, nodes, cores_per_node) quadruples
+    to verify determinism.
     """
-    arr = np.array(triplets, dtype=np.int32)  # (hour, duration, nodes, cores)
+    arr = np.array(quads, dtype=np.int32)
     return hashlib.sha256(arr.tobytes()).hexdigest()
```
224-287: Unused `fig` variable and no-op `os.path.join("", fname)`.

Line 226: `fig` is assigned but never used. Line 285: `os.path.join("", fname)` is equivalent to just `fname`.

Proposed fix

```diff
- fig, axs = plt.subplots(4, 2, figsize=(14, 16), constrained_layout=True)
+ _fig, axs = plt.subplots(4, 2, figsize=(14, 16), constrained_layout=True)
```

```diff
- save_path = os.path.join("", fname)
- plt.savefig(save_path, dpi=250, bbox_inches="tight")
+ plt.savefig(fname, dpi=250, bbox_inches="tight")
```
10-11: `matplotlib` imported unconditionally but only used when `--plot` is passed.

If matplotlib is not installed, the script will fail even when `--plot` is not used. Consider a lazy import inside the `if args.plot:` block.

Proposed fix

```diff
-import matplotlib.pyplot as plt
```

Then inside the `if args.plot:` block:

```python
if args.plot:
    import matplotlib.pyplot as plt
    # ... rest of plotting code
```

src/workloadgen.py (1)
292-297: Minor: `_sample_burst_count` is defined inside `sample()`; consider making it a static helper.

This nested function is re-created on every `sample()` call. Since it only depends on its arguments plus `rng` (which could be passed), it could be a module-level or class-level helper. That said, the overhead is negligible for the typical call frequency.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@analyze_workload_logs.py`:
- Around line 60-70: The fallback in parse_duration_hours assumes time_part
splits into three parts and will crash; update parse_duration_hours to handle
both "HH:MM" and "HH:MM:SS" (and "D-HH:MM" / "D-HH:MM:SS") by validating the
result of time_part.split(":"), using len==2 or len==3, and treating a missing
seconds field as 0 (and validating numeric parts), or wrap the parsing in a
try/except that normalizes hh, mm, ss to integers with sensible defaults before
computing total_seconds (referencing variables day_part, time_part,
total_seconds and the parse_duration_hours function name).
🧹 Nitpick comments (6)
train.py (2)
32-76: Extract shared quad-parsing helpers to a common module.

`_parse_quad_floats`, `_parse_quad_ints`, and `_parse_quad_ranges` are duplicated verbatim in `train.py`, `test/test_sanity_env.py`, and `test/test_inspect_workloadgen.py`. Consider extracting them into a shared utility module (e.g., `src/cli_utils.py`) and importing from there.
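A shared-module version might look like the following. This is a hypothetical sketch, since the actual helper bodies in `train.py` are not shown in this review:

```python
def parse_quad_floats(text: str) -> tuple[float, float, float, float]:
    """Parse "a,b,c,d" into exactly four floats (arrivals, duration, nodes, cores).

    Illustrative reimplementation; the real _parse_quad_floats in train.py
    may differ in naming and error handling.
    """
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 comma-separated values, got {len(parts)}: {text!r}"
        )
    a, b, c, d = (float(p) for p in parts)
    return (a, b, c, d)
```

Importing one copy from a module such as `src/cli_utils.py` keeps the underscore-vs-no-underscore naming drift noted above from recurring.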
178-244: Duplicated config-construction logic across files.

The entire block that derives midpoints, unpacks quad parameters with fallbacks, and builds `WorkloadGenConfig` is nearly identical in `train.py` (lines 178–244), `test/test_sanity_env.py` (lines 305–371), and `test/test_inspect_workloadgen.py` (lines 112–172). A shared factory function (e.g., `build_workload_config(args) -> WorkloadGenConfig`) would eliminate this three-way duplication and reduce the risk of the copies diverging over time.

analyze_workload_logs.py (1)
225-234: Bare `except Exception` silently drops parse errors; consider logging at DEBUG level.

While `skipped` is incremented, for a diagnostic tool it can be frustrating to debug why rows are being skipped without any indication of the actual error. Consider optionally logging the first few exceptions or adding a `--verbose` flag.

test/run_all.py (1)
24-24: `--plot` flag in CI test may leave artifact PNGs in the working directory.

The `test_inspect_workloadgen` run with `--plot` saves a timestamped PNG to the current directory. Consider either omitting `--plot` from the CI run or saving to a temporary/session directory to avoid polluting the repo root.

test/test_inspect_workloadgen.py (1)
224-286: Minor: unused `fig` variable and no-op `os.path.join("", ...)`.

Line 226: `fig` is never used; prefix with `_` to signal intent. Line 285: `os.path.join("", fname)` is equivalent to just `fname`.

Proposed fix

```diff
- fig, axs = plt.subplots(4, 2, figsize=(14, 16), constrained_layout=True)
+ _fig, axs = plt.subplots(4, 2, figsize=(14, 16), constrained_layout=True)
```

```diff
- save_path = os.path.join("", fname)
- plt.savefig(save_path, dpi=250, bbox_inches="tight")
+ plt.savefig(fname, dpi=250, bbox_inches="tight")
```

src/workloadgen.py (1)
252-384: Burst RNG ordering ensures determinism, but `hard_cap_jobs` silently drops burst jobs.

The sampling order (base count → base attrs → small burst coin flip → small burst attrs → heavy burst coin flip → heavy burst attrs) is fixed regardless of outcomes, which correctly preserves determinism.

One subtlety: when `hard_cap_jobs` truncates, it always drops from the end of the array, meaning heavy-burst jobs are dropped first, then small-burst jobs. If this bias is intentional, a brief comment would help future maintainers. If not, a shuffle before truncation would distribute the cut more fairly.
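A deterministic shuffle-then-truncate could look like this sketch. The helper name `enforce_hard_cap` is hypothetical (the real truncation lives inside `sample()`), and it assumes the seeded `numpy.random.Generator` already used for sampling is reused:

```python
import numpy as np

def enforce_hard_cap(durations, nodes, cores, hard_cap, rng):
    """Trim to hard_cap jobs, spreading the cut across base and burst jobs.

    Reusing the seeded Generator keeps the result reproducible; unlike a
    plain [:hard_cap] slice, the dropped jobs are no longer always the most
    recently appended (burst) ones.
    """
    n = len(durations)
    if n <= hard_cap:
        return durations, nodes, cores
    keep = rng.permutation(n)[:hard_cap]  # random subset, fixed for a given seed
    return durations[keep], nodes[keep], cores[keep]
```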
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@data/workload_statistics/analyze_workload_logs.py`:
- Around line 198-216: The early-return for empty input in the rows iteration
(where rows = iter_rows(path) and first = next(rows) is wrapped in a
StopIteration handler) returns a dict missing keys that main() expects; update
that returned dict to include the missing keys referenced by main() —
specifically add hours_observed, small_baseline_jobs_per_hour,
heavy_baseline_jobs_per_hour, small_event_prob, small_volume_prob,
heavy_event_prob, heavy_volume_prob, suggested_wg_burst_small_prob, and
suggested_wg_burst_heavy_prob — and set sensible default values (e.g., 0 or
float("nan") consistent with the other stats) so callers like main() won't get
KeyError when the file is empty.
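One way to satisfy the prompt is a single factory for the empty-input dict. The key names come from the prompt above; the 0/NaN/0.0 defaults are an assumption, not the project's actual choice:

```python
def empty_file_stats() -> dict:
    """Stats dict returned for an empty log file so main() never hits KeyError.

    Default values (0 for counts, NaN for per-hour rates, 0.0 for
    probabilities) are assumed; pick whatever the non-empty path emits
    for "no data".
    """
    return {
        "hours_observed": 0,
        "small_baseline_jobs_per_hour": float("nan"),
        "heavy_baseline_jobs_per_hour": float("nan"),
        "small_event_prob": 0.0,
        "small_volume_prob": 0.0,
        "heavy_event_prob": 0.0,
        "heavy_volume_prob": 0.0,
        "suggested_wg_burst_small_prob": 0.0,
        "suggested_wg_burst_heavy_prob": 0.0,
    }
```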
🧹 Nitpick comments (5)
data/workload_statistics/analyze_workload_logs.py (5)
60-70: Unpacking `time_part.split(":")` assumes exactly `HH:MM:SS`.

If a duration value has an unexpected format (e.g., `"MM:SS"` or `"HH:MM:SS.ms"`), the unpacking on line 68 will raise a `ValueError` with a generic "not enough values to unpack" message. This is caught by the bare `except` in `consume()`, but makes debugging harder. Consider a guard or a more descriptive error.
73-84: Consider `strict=True` on `zip()` for defensive consistency.

The length check on line 74 makes this safe, but `zip(dx, dy, strict=True)` on line 84 would guard against future refactors that might decouple the inputs. Flagged by Ruff B905.
103-104: Redundant `int()` casts on `math.floor` / `math.ceil` results.

`math.floor` and `math.ceil` already return `int` in Python 3. Ruff RUF046.

Proposed fix

```diff
- lo = int(math.floor(pos))
- hi = int(math.ceil(pos))
+ lo = math.floor(pos)
+ hi = math.ceil(pos)
```
160-172: Whitespace-delimited parser silently truncates rows with more fields than headers.

Line 171 maps only `range(len(columns))` fields, so any trailing columns (e.g., a job name with spaces that splits into multiple tokens) are silently dropped. This is fine for the numeric/date columns this tool consumes, but worth a brief comment for future maintainers who might add text-column support.
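The silent truncation described above is a direct consequence of indexing only up to the header count; a minimal illustration (the header and row values here are made up):

```python
# Three headers, six whitespace-split tokens: mapping only
# range(len(columns)) fields drops the trailing "my job name" tokens
# without any error.
columns = ["JobID", "Elapsed", "Nodes"]
fields = "12345 01:02:03 4 my job name".split()
record = {columns[i]: fields[i] for i in range(len(columns))}
```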
225-234: Bare `except Exception` silently swallows all parse errors.

The `skipped` counter is reported in the final output (good), but when investigating why rows are being skipped (e.g., a wrong column mapping or unexpected date format) there's no way to see the actual error. Consider logging at `DEBUG` level or collecting a sample of the first few exceptions for the summary.

Suggested minimal improvement

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 ...
         except Exception:
             skipped += 1
+            if skipped <= 3:
+                logger.debug("Skipping row in %s: %s", path.name, row, exc_info=True)
             return
```
🧹 Nitpick comments (7)
test/test_inspect_workloadgen.py (6)
11-11: `matplotlib.pyplot` imported unconditionally but only used when `--plot` is passed.

In headless CI or environments without matplotlib, this top-level import will fail even when plotting isn't requested. Consider a lazy import inside the `if args.plot:` block.
75-77: Remove unnecessary `int()` wrappers (Ruff RUF046).

`round(...)` already returns `int` in Python 3 when called with a single argument, so the outer `int()` calls are redundant.

Proposed fix

```diff
- default_duration_mid = int(round((min_duration + max_duration) / 2.0))
- default_nodes_mid = int(round((min_nodes + max_nodes) / 2.0))
- default_cores_mid = int(round((min_cores + max_cores) / 2.0))
+ default_duration_mid = round((min_duration + max_duration) / 2.0)
+ default_nodes_mid = round((min_nodes + max_nodes) / 2.0)
+ default_cores_mid = round((min_cores + max_cores) / 2.0)
```
66-128: Significant code duplication with `test/test_sanity_env.py` for `WorkloadGenConfig` construction.

The entire config-building block (range defaults, quad-parameter unpacking, `WorkloadGenConfig(...)` construction) is nearly identical between this file and `make_env_from_args` in `test_sanity_env.py`. Consider extracting a shared helper (e.g., `build_workloadgen_config(args)`) into a common module to keep these in sync and reduce maintenance burden.
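The shape of such a helper might be as follows. This is a hypothetical sketch: the real `WorkloadGenConfig` has many more fields, and the `args` attribute names here are illustrative rather than the project's actual CLI attributes:

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class WorkloadGenConfigSketch:
    # Stand-in for the real WorkloadGenConfig in src/workloadgen.py.
    duration_mid: int
    nodes_mid: int
    cores_mid: int

def build_workloadgen_config(args) -> WorkloadGenConfigSketch:
    # Floor-division midpoints, matching the midpoint fix suggested for
    # test_sanity_env.py elsewhere in this review.
    return WorkloadGenConfigSketch(
        duration_mid=(args.min_duration + args.max_duration) // 2,
        nodes_mid=(args.min_nodes + args.max_nodes) // 2,
        cores_mid=(args.min_cores + args.max_cores) // 2,
    )

# Both train.py and the tests would call this instead of re-deriving
# the defaults inline.
cfg = build_workloadgen_config(
    SimpleNamespace(min_duration=1, max_duration=4,
                    min_nodes=1, max_nodes=8,
                    min_cores=2, max_cores=6)
)
```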
237-241: Plot saved to CWD with no feedback and a no-op `os.path.join("", ...)`.

`os.path.join("", fname)` is just `fname`. Consider accepting an output directory via CLI arg (or defaulting to a well-known location), and printing the save path so the user knows where the file landed.

Proposed fix

```diff
- save_path = os.path.join("", fname)
- plt.savefig(save_path, dpi=250, bbox_inches="tight")
+ save_path = fname
+ plt.savefig(save_path, dpi=250, bbox_inches="tight")
+ print(f"Plot saved to: {save_path}")
```
181-181: Prefix unused `fig` variable with underscore (Ruff RUF059).

Standard convention for unused unpacked variables.

Proposed fix

```diff
- fig, axs = plt.subplots(4, 2, figsize=(14, 16), constrained_layout=True)
+ _fig, axs = plt.subplots(4, 2, figsize=(14, 16), constrained_layout=True)
```
157-166: Duplicate list-comprehension of `durations`, `nodes`, `cpn` (also computed at lines 206-208).

The same triplet unpacking is done twice: once for summarize output and once for plotting. Consider computing them once and reusing.
test/test_sanity_env.py (1)
274-276: Same unnecessary `int()` wrappers as in the inspect utility (Ruff RUF046).

`round()` returns `int` in Python 3; the outer `int()` is redundant.

Proposed fix

```diff
- duration_mid = int(round((min_duration + max_duration) / 2.0))
- nodes_mid = int(round((min_nodes + max_nodes) / 2.0))
- cores_mid = int(round((min_cores + max_cores) / 2.0))
+ duration_mid = round((min_duration + max_duration) / 2.0)
+ nodes_mid = round((min_nodes + max_nodes) / 2.0)
+ cores_mid = round((min_cores + max_cores) / 2.0)
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@test/test_sanity_env.py`:
- Around line 269-271: The test computes midpoints using int(round((a + b) /
2.0)) which uses bankers rounding and diverges from WorkloadGenerator.__init__'s
midpoint formula; change the test's duration_mid, nodes_mid, and cores_mid
calculations to use floor division (a + b) // 2 so they match
WorkloadGenerator.__init__ and avoid the CLI vs direct-construction divergence
when passing explicit values into WorkloadGenConfig.
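A quick check of the divergence the prompt refers to:

```python
# round() uses banker's rounding on .5 ties, while floor division always
# rounds down; for min=1, max=2 the two midpoint formulas disagree.
a, b = 1, 2
round_mid = int(round((a + b) / 2.0))  # round(1.5) ties to the even value 2
floor_mid = (a + b) // 2               # 3 // 2 is 1
```

Switching the test to `(a + b) // 2` makes it agree with the generator for every integer range, including odd sums like this one.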
🧹 Nitpick comments (3)
src/workloadgen.py (2)
292-371: Consider extracting a helper for burst attribute sampling.

The small-burst block (lines 304–334) and heavy-burst block (lines 341–371) are structurally identical: each concatenates `rng.integers(…).astype(np.int32)` for durations, nodes, and cores. A small helper would halve the code and make adding a third burst tier trivial.

♻️ Sketch: extract `_append_burst`

```diff
+ def _append_burst(durations, nodes, cores, rng, n, d_min, d_max, n_min, n_max, c_min, c_max):
+     if n <= 0:
+         return durations, nodes, cores
+     durations = np.concatenate([durations, rng.integers(d_min, d_max + 1, size=n).astype(np.int32)])
+     nodes = np.concatenate([nodes, rng.integers(n_min, n_max + 1, size=n).astype(np.int32)])
+     cores = np.concatenate([cores, rng.integers(c_min, c_max + 1, size=n).astype(np.int32)])
+     return durations, nodes, cores
```
373-379: Hard cap truncation silently discards burst jobs first.

Since base jobs occupy the front of the arrays and burst jobs are appended, `[:hard_cap]` always preserves base jobs and trims bursts. This is reasonable behavior for a safety valve, but worth documenting so users aren't surprised when burst injection appears to have no effect under a tight cap.

test/test_sanity_env.py (1)
22-22: Test file imports private helpers from `train.py`.

`_parse_quad_floats`, `_parse_quad_ints`, and `_parse_quad_ranges` are prefixed with `_` (private convention) yet imported by a test. Consider moving these parsers to a shared utility module (e.g., `src/cli_utils.py`) so both `train.py` and test files can import them without reaching into each other's internals.
… spikes

Optional body: Add two independent burst injectors on top of base arrivals:
- burst_small_* for many small jobs
- burst_heavy_* for long, high-resource jobs

Add per-burst probability controls and default ranges. Wire burst probabilities through train/sanity/inspect CLI and add burst behavior tests.

Small Add: Workload_logs.txt, collective stats of different partition logs. Can be used to determine parameters of the generator.

Generator: Add workload_logs script, plus added "burstiness" scale to logs.

Fixup: Moved workload logs into data/workload_statistics. Fixed "fallback crash on formats without exactly three colon-separated parts".
No description provided.