Fleetbench

A small cross-platform benchmark collector for performance-testing fleets (pools of Taskcluster worker hosts that run Firefox perf tests), plus a Python runner that wraps it for use on those hosts.

The collector currently ships two workloads — CPU (prime sieve, single- and multi-threaded) and ADB I/O (timed adb push/pull against an attached Android device) — alongside an inspect mode for host metadata. Both workloads emit the same envelope shape so a single analysis pipeline consumes them.

Fleetbench produces raw per-iteration timings and host metadata as versioned JSON. It does not score hosts, compare across hardware classes, or maintain fleet-wide state — that work belongs to a downstream analysis layer fed from the collected envelope files.

Repo Layout

- collector/: Rust binary (fleetbench). Single-host-aware, emits one JSON object per invocation on stdout. No filesystem opinions.
- runner/: Python package (fleetbench-run). Wraps the collector, self-throttles, writes envelope files to disk.
- docs/
  - fleetbench_design_v2.md: design doc. Start here.
  - analysis_notes.md: guidance for the downstream analysis layer (use median, drop iter 0, etc.).

Status

| Component | Linux | Windows | macOS | Android |
| --- | --- | --- | --- | --- |
| Collector | shipped | binary cross-compiles; env sampling fields are null pending implementation | shipped (env block intentionally null: no `/proc` on Darwin) | shipped (env block populated; same `/proc/stat` + `/proc/loadavg` path as Linux) |
| Runner | shipped | deferred pending CPython availability question | works (dev) | not applicable: Android deploy model is different |

Subcommands

The collector is a single binary (fleetbench) with three peer subcommands:

| Subcommand | What it does | Where it runs | Output section |
| --- | --- | --- | --- |
| `inspect` | Host + CPU metadata only, no workload | Any host | (just host/cpu, no `results`) |
| `cpu` | Prime-sieve workload (1t + MT), optional time-bounded torture mode with per-core frequency sampling | Any host (Linux/Windows/macOS/Android) | `results.prime_sieve_1t` / `results.prime_sieve_mt` (+ `frequency_series` in `--duration` mode) |
| `adb` | Times `adb push` / `adb pull` against an attached Android device; pre-generated random payloads, SHA256-verified per iteration | Linux/macOS host that has `adb` and a phone attached (not the phone itself) | `adb_results.iterations` |

Every invocation emits a single JSON envelope with the same top-level shape (schema_version, host, environment, plus suite-specific *_config, *_env, and *_results siblings). Downstream tools branch on which *_config block is present.
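
As an illustration of branching on the config block, here is a minimal Python sketch of how a downstream tool might classify an envelope. The key names `cpu_config` and `adb_config`, the integer `schema_version`, and the idea that an `inspect --json` envelope carries no `*_config` block are assumptions made for the example; check the design doc for the real schema.

```python
import json


def suite_of(envelope: dict) -> str:
    """Return the suite name implied by the envelope's single *_config key.

    An envelope with no *_config block is reported as "inspect"
    (an assumption of this sketch, not a documented rule).
    """
    suites = sorted(k[: -len("_config")] for k in envelope if k.endswith("_config"))
    if not suites:
        return "inspect"
    if len(suites) > 1:
        raise ValueError(f"ambiguous envelope, multiple config blocks: {suites}")
    return suites[0]


# Hypothetical, heavily trimmed envelopes for illustration only.
cpu_env = json.loads('{"schema_version": 3, "host": {}, "environment": {}, "cpu_config": {}}')
adb_env = json.loads('{"schema_version": 3, "host": {}, "environment": {}, "adb_config": {}}')
print(suite_of(cpu_env), suite_of(adb_env))
```

Raising on more than one config block is a design choice of the sketch: it fails loudly on an unexpected envelope instead of guessing which suite produced it.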

`inspect` (host metadata)

fleetbench inspect           # human-readable
fleetbench inspect --json    # envelope with host/cpu populated, no workload

Useful as a quick "what is this host?" check, and as a smoke test that the binary runs on the target at all before kicking off a workload.

CPU benchmark (`cpu`)

The default fleet workload: a prime-sieve up to prime_limit, run both single-threaded and across all cores. Calibrated for per-iteration timings above the noise floor on slow-x86 fleet hardware.

fleetbench cpu --json                      # --mode normal, all logical CPUs
fleetbench cpu --mode quick --json         # CI / dev cycles
fleetbench cpu --mode long --json          # fast hardware
fleetbench cpu --mode quick --duration 10m --json   # torture / throttle hunt
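
For readers unfamiliar with the workload, the following Python sketch shows what "a prime sieve up to prime_limit" computes: pi(n), the number of primes not exceeding n. It is a plain sieve of Eratosthenes written for clarity; the collector's Rust implementation is not shown in this README and may differ (segmenting, threading, bit packing).

```python
def prime_count(limit: int) -> int:
    """Count primes <= limit with a plain sieve of Eratosthenes."""
    if limit < 2:
        return 0
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p*p.
            is_prime[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
        p += 1
    return sum(is_prime)


print(prime_count(100))  # 25
```

The multi-threaded variant would split the range across cores; that part is omitted here.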

Choosing a mode

normal (pi(10⁸), 5 iterations) targets ~150 ms per iteration on slow-x86 fleet hosts (Xeon E3-class), which is where signal quality matters most. On much faster hardware — M-class Macs, modern workstations — per-iteration timing drops to ~90 ms, which is below the ~100 ms noise floor for tight outlier detection. Use --mode long (pi(10⁹), 3 iterations) on hardware that fast to keep iterations comfortably above the noise floor. Slow phones and old fleet hardware are well-served by normal.

`--duration` (torture/stress mode)

--duration <30s|10m|1h> switches the cpu subcommand into a time-bounded sustained-load run intended for thermal-throttle investigations — not the default fleet cadence. The MT sieve loops until the wall-clock duration elapses; the 1t workload is skipped so all cores stay hot continuously. A background sampler captures per-core CPU frequency at ~1Hz into the envelope as frequency_series, which is the direct signal for thermal throttling (boost-clock samples decaying toward base-clock over the run).
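
A rough Python sketch of how the throttle signal could be read from such a series: compare the median frequency early in the run with the median late in the run. The input here is a flat list of MHz samples (for example one core, or a per-sample mean across cores); the actual layout of frequency_series in the envelope is not described in this README, so the flattening step is left to the reader and the function name is hypothetical.

```python
from statistics import median


def frequency_drop(samples_mhz: list, window: float = 0.1) -> float:
    """Fractional drop from the early-run median to the late-run median.

    `window` is the fraction of samples taken from each end of the run.
    """
    if len(samples_mhz) < 2:
        raise ValueError("need at least two samples")
    n = max(1, int(len(samples_mhz) * window))
    early = median(samples_mhz[:n])
    late = median(samples_mhz[-n:])
    return (early - late) / early


# Made-up 10-minute series at ~1 Hz: boost clock decaying toward base clock.
series = [3700.0] * 60 + [3000.0] * 240 + [2400.0] * 300
print(f"{frequency_drop(series):.2f}")  # 0.35
```

What counts as a meaningful drop depends on the hardware; this sketch reports the ratio and leaves the threshold to the analysis layer.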

How --mode interacts with --duration. This trips people up: in duration mode, --mode picks only the per-iteration size (prime_limit). The preset's iteration count is ignored; total iterations are whatever completes before the deadline. Reading --mode long --duration 10m as "the longest mode" produces fewer, longer iterations (second-scale rather than millisecond-scale), not a denser long run.

| `--mode` (with `--duration`) | per-iteration time on a fast NUC | iterations in 10 min |
| --- | --- | --- |
| `quick` (pi(10⁷)) | ~15 ms | ~40,000 |
| `normal` (pi(10⁸)) | ~150 ms | ~4,000 |
| `long` (pi(10⁹)) | ~1.5 s | ~400 |

For torture runs, --mode quick --duration 10m is the natural pairing — it gives a dense per-iteration time series alongside the 1Hz frequency_series. --mode long still works (run_mt_until guarantees at least one iteration) but iteration-time drift becomes a coarse signal; frequency_series carries the throttle evidence either way.
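
The iteration counts above are simple arithmetic: wall-clock duration divided by per-iteration time. A small Python sketch, assuming the per-iteration figures quoted for a fast NUC; the duration parser is a hypothetical stand-in that accepts only the `30s` / `10m` / `1h` forms and is not the collector's own parser.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse '30s', '10m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def estimated_iterations(duration: str, per_iteration_ms: int) -> int:
    """Iterations that fit in the duration; never fewer than one."""
    return max(1, parse_duration(duration) * 1000 // per_iteration_ms)


for mode, ms in [("quick", 15), ("normal", 150), ("long", 1500)]:
    print(mode, estimated_iterations("10m", ms))
# quick 40000
# normal 4000
# long 400
```

On slower hosts the per-iteration time grows and the counts shrink proportionally, which is why `long` can end up with very few data points.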

For the full workflow — fetching the release binary, running a torture test, and reading the output to decide whether a host is throttling — see docs/detecting_thermal_throttling.md.

ADB I/O benchmark (`adb`)

fleetbench adb times adb push and adb pull against an attached Android device. It runs on the Linux Docker host where adb lives, not on the device itself — the goal is to characterize USB/adb behavior (the path raptor sees when staging APKs and test files), and to debug "why is provisioning slow today?" style problems across vendors (e.g. bitbar vs LambdaTest).

For the background, design rationale, and the original developer test this reproduces, see docs/ADB_TESTING.md.

fleetbench adb --json                                  # all defaults
fleetbench adb --serial <id> --json                    # multi-device host
fleetbench adb --sizes 25B,1M --iterations 25B=50,1M=20 --json
fleetbench adb --remote-path /sdcard/Download --json   # reproduce raptor's path

Operational model:

One invocation, one device. Contention is observed by running many invocations concurrently at the Taskcluster layer — that matches how real tests behave. There is no in-collector --parallel mode.
Target selection. With one device attached, no flag is needed. With multiple, pass --serial; otherwise the run fails with multiple_devices.
Remote path. Defaults to /data/local/tmp/ to avoid the FUSE layer on /sdcard for a cleaner USB/adb signal. Use --remote-path /sdcard/Download when the goal is to reproduce raptor's path exactly.
Payloads. For each size, N unique random files are generated up front (xorshift64 fill) so the kernel page cache can't quietly accelerate later iterations. Pre-generation happens before the timed section.
Verification. Push is checked via adb shell sha256sum; pull is checked by hashing the file locally. A failed hash sets sha256_ok = false on that iteration and exits non-zero (exit 2, correctness failure).
Sizes & iterations. Defaults emphasize the 25-byte point (where vendor variance shows up — that workload is dominated by command/setup overhead, not bytes on the wire), then progressively larger transfers:

| size | default iterations | what it measures |
| --- | --- | --- |
| 25B | 200 | adb command/setup latency (no real bytes on wire) |
| 1M | 100 | small-transfer steady state |
| 10M | 30 | mid-transfer steady state |
| 100M | 10 | bulk-transfer USB throughput ceiling |

Override iterations per size via --iterations 25B=50,1M=20,....
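
To make the override format concrete, here is a Python sketch that merges a `--iterations` style string over the default table. It assumes that sizes not named in the string keep their defaults and that unknown sizes are an error; the collector's actual parsing rules (and its interaction with `--sizes`) may differ.

```python
DEFAULT_ITERATIONS = {"25B": 200, "1M": 100, "10M": 30, "100M": 10}


def parse_iterations(spec: str, defaults: dict = DEFAULT_ITERATIONS) -> dict:
    """Merge 'SIZE=N,SIZE=N' overrides over the default iteration plan."""
    plan = dict(defaults)
    for item in filter(None, spec.split(",")):
        size, sep, count = item.partition("=")
        if not sep or size not in plan or not count.isdigit():
            raise ValueError(f"bad --iterations entry: {item!r}")
        plan[size] = int(count)
    return plan


print(parse_iterations("25B=50,1M=20"))
print(sum(parse_iterations("").values()))  # 340 iterations with all defaults
```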

A full default run does 340 iterations, each a timed push and a timed pull (~680 timed transfers), and takes 10-30 minutes on a real device (longer on slow USB hubs). For a quick smoke test:

fleetbench adb --iterations 25B=5,1M=2,10M=2,100M=1 --json

Output. Per-iteration timings are emitted raw — no median/IQR/summary. The distribution is the signal; the mean often is not. (In a 100-retrigger bitbar-vs-LT comparison, LT's mean was lower but its distribution width was 4-5× wider; that's the kind of thing this subcommand surfaces.)
Env capture. adb --version is recorded in adb_env, and on Linux hosts the full lsusb -t topology is captured for hub-path correlation across concurrent invocations.
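
Because the collector emits raw timings only, any summary happens downstream. The Python sketch below shows the kind of comparison the Output note describes: two sets of timings where the one with the lower mean also has the much wider spread. The numbers are made up for illustration and are not the bitbar/LT measurements.

```python
from statistics import mean, quantiles


def summarize(timings_ms):
    """Mean, median and interquartile range of raw per-iteration timings."""
    q1, q2, q3 = quantiles(timings_ms, n=4)
    return {"mean": mean(timings_ms), "median": q2, "iqr": q3 - q1}


# Made-up timings (ms) for two hypothetical vendors; not measured data.
tight = [40, 41, 42, 43, 44, 45, 46]
wide = [20, 25, 30, 40, 50, 60, 65]

for name, data in (("tight", tight), ("wide", wide)):
    print(name, summarize(data))
```

Ranking the two by mean alone would favor the wide set, which is the failure mode raw per-iteration output is meant to avoid.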

Verified end-to-end

cpu:

Linux: smoke-tested on real fleet hosts (Xeon E3-1585L v5).
macOS: dev box (Apple Silicon M4 Pro); pi(10⁹) 1t in ~840 ms, mt in ~118 ms across 14 cores.
Android: Pixel 10 Pro via adb push. See docs/analysis_notes.md for Android-specific behavior the analysis layer needs to know about (governor ramp, big.LITTLE + thermal throttling, non-zero idle load averages).

adb:

macOS + real phone: dev box (Apple Silicon M4 Pro) with a Pixel 10 Pro over USB; 21/21 iterations passed SHA256 verification across 25B / 1M / 10M / 100M. 25B transfers ran ~25-46 ms (pure adb command/setup overhead), 100M transfers hit ~34 MB/s push and ~39 MB/s pull (pull consistently faster — known adb asymmetry).
Linux + real phone: validation on a bitbar/LT-style Docker host is an environmental check, not a separate code path. The Linux-only env capture (/proc/stat, /proc/loadavg, lsusb -t) is the same code that ships in cpu and is already exercised by that command's Linux fleet runs.

Caveats

cpu.frequency_mhz is null on macOS — Apple Silicon doesn't expose a single meaningful peak frequency and sysinfo's value is unreliable, so we deliberately drop it rather than emit a misleading number.
cpu.brand is null on Android (sysinfo doesn't parse the SoC name from /proc/cpuinfo on ARM); if the SoC name is needed, the workaround is to parse /proc/cpuinfo directly.
adb_env.lsusb_topology is only captured on Linux hosts (no lsusb on macOS/Windows).

Build

Collector (Rust)

cd collector
cargo build --release                  # native build for dev
./build                                # build all four (linux + windows + mac + android)
./build --platform linux               # just the linux musl binary
./build --platform windows             # just the windows .exe
./build --platform mac                 # just the mac host-arch binary
./build --platform android             # aarch64 Android (requires NDK)

./build produces:

target/x86_64-unknown-linux-musl/release/fleetbench (~1.1 MB, static, runs on any modern Linux including Ubuntu 18.04)
target/x86_64-pc-windows-gnu/release/fleetbench.exe (~1.0 MB)
target/<host-arch>-apple-darwin/release/fleetbench (~1.1 MB)
target/aarch64-linux-android/release/fleetbench

Identifying a binary

Every binary embeds version + git SHA as a tagged sentinel string. Three ways to read it, in order of effort:

# 1. From any machine (Mac, Linux), even for a Windows .exe:
strings -a fleetbench[.exe] | grep FLEETBENCH_BUILD
# FLEETBENCH_BUILD=0.1.0+3eb69d100e10
# (suffix "-dirty" appears if the build had uncommitted tracked changes)

# 2. Run the binary itself:
fleetbench --version
# fleetbench 0.1.0 (3eb69d100e10)

# 3. Look at any envelope it produced — collector_git_sha is in the JSON.

When sharing a build, paste the FLEETBENCH_BUILD=... line so the recipient can confirm they're running what you sent.
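
Where `strings` is not available, the same lookup can be done with a few lines of Python. This sketch searches the raw bytes for the sentinel; the character set accepted after `FLEETBENCH_BUILD=` is inferred from the sample value above (`0.1.0+3eb69d100e10`, optional `-dirty`) and is an assumption, as is the function name.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed value alphabet: digits, letters, '.', '+', '-'.
SENTINEL = re.compile(rb"FLEETBENCH_BUILD=([0-9A-Za-z.+\-]+)")


def read_build_id(binary: Path) -> Optional[str]:
    """Return the embedded build id, or None if no sentinel is found."""
    match = SENTINEL.search(binary.read_bytes())
    return match.group(1).decode("ascii") if match else None


# Usage (path is a placeholder):
#   print(read_build_id(Path("fleetbench")))
```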

Linux and Windows builds cross-compile via cargo-zigbuild; the Mac build uses the native Apple toolchain; the Android build uses cargo-ndk.

Tooling: brew install zig, cargo install cargo-zigbuild cargo-ndk, and the rustup targets:

rustup target add x86_64-unknown-linux-musl x86_64-pc-windows-gnu \
                  aarch64-apple-darwin aarch64-linux-android

Android additionally needs the NDK. With Homebrew:

brew install --cask android-ndk
export ANDROID_NDK_HOME="$(brew --prefix)/share/android-ndk"

Add the export to your shell rc so it persists. Android Studio's SDK Manager also works; in that case ANDROID_NDK_HOME points at the SDK's ndk/<version>/ directory instead.

Runner (Python)

cd runner
uv sync                          # creates .venv, installs deps including pytest
uv run pytest -q                 # 98 tests
uv run fleetbench-run --help

Smoke Test

collector/smoke builds the binary, scps it to a target host, runs a sequence of validation checks, and prints a per-run timing table plus aggregate iter-0/iter-1+ distributions.

cd collector
./smoke <linux-host> --runs 5 --mode normal
./smoke <windows-host> --platform windows --runs 3 --mode normal

The smoke does:

cargo zigbuild for the target platform.
scp the binary to the host's home dir.
gwhc --json activity check (Linux only; skipped silently elsewhere).
inspect for host/CPU metadata.
N runs of cpu --json with full schema validation per envelope.
Negative test: --threads 0 --json must produce a failure envelope and exit 1.

If gwhc reports a non-IDLE state, smoke exits 0 with a summary rather than running benchmarks against a contaminated baseline.
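
The iter-0 versus iter-1+ split that smoke reports can be reproduced in a few lines of Python. This is a sketch that takes each run as a plain list of per-iteration timings in milliseconds; pulling those lists out of the envelope is left out because the exact field names are not given in this README.

```python
from statistics import median


def split_iterations(runs):
    """Separate first-iteration timings from all later iterations."""
    iter0 = [run[0] for run in runs if run]
    rest = [t for run in runs for t in run[1:]]
    return iter0, rest


# Made-up timings (ms) for three runs of five iterations each.
runs = [
    [182.0, 151.0, 150.0, 152.0, 149.0],
    [175.0, 150.0, 151.0, 150.0, 153.0],
    [190.0, 149.0, 150.0, 151.0, 150.0],
]
iter0, rest = split_iterations(runs)
print("iter 0 median:", median(iter0))
print("iter 1+ median:", median(rest))
```

Keeping iteration 0 separate, instead of discarding it, makes warm-up cost visible while the median of the remaining iterations stays uncontaminated.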

Android (manual; adb-based)

./smoke does not yet wire Android. Use adb directly:

cd collector
./build --platform android
adb push target/aarch64-linux-android/release/fleetbench /data/local/tmp/fleetbench
adb shell chmod 755 /data/local/tmp/fleetbench
adb shell /data/local/tmp/fleetbench inspect
adb shell /data/local/tmp/fleetbench cpu --mode quick --json

/data/local/tmp/ is the standard "anyone can push and execute" path on Android. The collector emits the same v3 envelope as on Linux, with host.os_family = "android" and a populated environment block from the same /proc/stat + /proc/loadavg reads. adb shell exit codes are historically unreliable; trust the JSON's status field, not $?.
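
A Python sketch of "trust the JSON, not the exit code" when driving the collector over adb. It deliberately ignores the return code and reads the envelope's status field instead. The example value `"ok"` is made up; this README does not list the status values the collector emits.

```python
import json
import subprocess


def envelope_status(stdout: str) -> str:
    """Extract the status field from the collector's JSON output."""
    try:
        envelope = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError("collector produced no parseable envelope") from exc
    if not isinstance(envelope, dict) or not isinstance(envelope.get("status"), str):
        raise RuntimeError("envelope has no status field")
    return envelope["status"]


def run_on_device(args):
    """Run the pushed binary via adb shell and return its status string."""
    cmd = ["adb", "shell", "/data/local/tmp/fleetbench", *args]
    # check=False on purpose: the adb shell exit code is not trusted.
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return envelope_status(proc.stdout)


print(envelope_status('{"status": "ok"}'))  # "ok" is a placeholder value
```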

Operational Model (Runner)

Invoked by the worker-startup wrapper before the Taskcluster worker boots. Self-throttles based on the newest envelope timestamp in the results directory (--min-interval, default 24h). Pre-flights the host via gwhc on Linux and skips runs against non-IDLE hosts. Writes one envelope file per run, success or failure, via .partial + atomic rename. See the design doc for the full contract.

fleetbench-run \
  --results-dir /var/lib/fleetbench \
  --mode normal \
  --collector-binary /usr/local/bin/fleetbench \
  --min-interval 24h
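
The two mechanisms named above, throttling on the newest envelope and writing via `.partial` plus atomic rename, can be sketched in Python as follows. This is not the runner's code: it assumes envelopes are `*.json` files and uses file modification time as the "envelope timestamp", whereas the real runner may read the timestamp from the filename or the JSON. Function names are hypothetical.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional


def should_run(results_dir: Path, min_interval_s: float, now: Optional[float] = None) -> bool:
    """True if no envelope exists or the newest one is old enough."""
    now = time.time() if now is None else now
    newest = max((p.stat().st_mtime for p in results_dir.glob("*.json")), default=None)
    return newest is None or now - newest >= min_interval_s


def write_envelope(results_dir: Path, name: str, envelope: dict) -> Path:
    """Write to NAME.partial, flush to disk, then rename into place."""
    final = results_dir / name
    partial = final.with_name(final.name + ".partial")
    with open(partial, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(partial, final)  # atomic when both paths share a filesystem
    return final
```

Readers of the results directory then never see a half-written envelope: a file either has its final name and full contents, or still carries the `.partial` suffix and can be ignored.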

Alternative: Taskcluster jobs (not yet built)

A possible companion model is to run the collector inside dedicated Taskcluster jobs targeted at specific worker pools, with a small controller tool that enqueues the jobs, records their IDs, polls for completion, and pulls the envelope artifacts back. Useful for targeted sweeps ("benchmark every gecko_t_linux_talos host now, before/after this kernel change") rather than continuous drift detection.

Tradeoffs noted so far (none of this is committed work):

Queue contention. Benchmark jobs compete with real test traffic for worker time; on a busy queue, hourly or even daily fleet sweeps could end up waiting behind production work. The boot-throttle model sidesteps this by slipping into a window where the worker is not taking tasks.
Per-job overhead. TC task scheduling, image pull, and log shipping for what's a ~5 second benchmark is wasteful compared to direct invocation.
Visibility cost. Every benchmark becomes a TC entity that shows up in task dashboards.

A TC-driven invocation does not require a new runner — the existing fleetbench-run would just need a taskcluster value added to its --trigger enum and invocation from inside the task. Filing as a real beads task is deferred until someone needs the controlled-sweep capability.

Distribution

Binaries are intended to ship via GitHub releases, tagged per version. This is the primary distribution channel because:

Any Taskcluster task on any worker (including bitbar Android phones where Mozilla does not own the host OS layer) can fetch a release asset directly.
Releases are immutable per tag, so cross-version benchmark comparisons reference a stable build.
TC's fetches mechanism caches external URLs automatically.

Release asset naming follows a templatable convention so task definitions can be written once and parameterized by version:

fleetbench-<version>-linux-x86_64
fleetbench-<version>-windows-x86_64.exe
fleetbench-<version>-macos-aarch64
fleetbench-<version>-android-aarch64
SHA256SUMS

A SHA256SUMS file alongside the binaries enables fetch-time integrity verification (sha256sum -c) and lets TC fetches pin a hash per asset.
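
For environments without `sha256sum`, the same check is short in Python. This sketch assumes SHA256SUMS uses the conventional `sha256sum` line format (hex digest, whitespace, optional `*`, file name) and that the asset is looked up by its file name; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_sums(text: str) -> dict:
    """Map file name -> lowercase hex digest from sha256sum-style lines."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, _, name = line.partition(" ")
        sums[name.strip().lstrip("*")] = digest.lower()
    return sums


def verify(asset: Path, sums_file: Path) -> bool:
    """True if the asset's SHA256 matches its entry in the sums file."""
    expected = parse_sums(sums_file.read_text()).get(asset.name)
    actual = hashlib.sha256(asset.read_bytes()).hexdigest()
    return expected == actual


# Usage (paths are placeholders):
#   verify(Path("fleetbench-<version>-linux-x86_64"), Path("SHA256SUMS"))
```

An asset that is missing from the sums file fails verification instead of passing silently, since `expected` is then None.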

Releases are built and published automatically by .github/workflows/release.yml on any v* tag push. The latest release is at releases/latest. For local development builds outside the release pipeline, use ./build as documented above.

Example TC task payload

A Taskcluster task can fetch and run the collector directly from a release. Sketch for an Android worker (the motivating case — bitbar phones where Mozilla does not own the host OS layer):

payload:
  maxRunTime: 600
  mounts:
    - file: fleetbench
      content:
        url: https://github.com/<owner>/fleetbench/releases/download/v0.2.0/fleetbench-v0.2.0-android-aarch64
        sha256: "<pinned-hash-from-SHA256SUMS>"
  command:
    - - /bin/sh
      - -c
      - "chmod 755 fleetbench && ./fleetbench cpu --mode quick --json > result.json"
  artifacts:
    - name: public/result.json
      type: file
      path: result.json

The same pattern applies on Linux and Windows TC workers — just swap the release asset URL for the matching platform. A downstream controller tool (see "Alternative: Taskcluster jobs" above) would enqueue these tasks, collect the public/result.json artifacts, and drop them into the same flat results/ layout the runner uses.

Issue Tracking

Tasks live in .beads/ via beads_rust; see AGENTS.md for workflow conventions.
