Minimal host roster + remote exec + one-shot bootstrap. Single hexa file, zero deps, state at ~/.pool/pool.json.

    # 1. Install hexa-lang (gives you `hexa` + `hx` package manager)
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/dancinlab/hexa-lang/main/install.sh)"
    # 2. Install pool
    hx install pool

`hx` wires the pool shim into `~/.hx/bin/` (must be on `PATH`).


    pool add <name> <user@host> [--sudo]           # register a host (ssh-probes `uname -s` for OS)
    pool rm <name>                                 # remove from roster
    pool list [--json] [--fast]                    # roster + live cpu/ram/gpu/disk per host (--fast skips probe; --json embeds usage)
    pool on <name> <cmd...>                        # ssh + exec on host
    pool on all <cmd...> [--jobs N] [--timeout S]  # parallel fanout to every shared host (skips [ded]; generic, zero domain knowledge)
    pool status                                    # live reachability probe (each host)
    pool refresh                                   # re-probe os for every host
    pool init                                      # bootstrap every shared host with all features (skips [ded])
    pool clean <name>|all [--dry-run] [--deep]     # run disk-cleanup now (safe tier); --deep adds model caches + docker
    pool clean on|off <name>                       # include / exclude a host in disk-cleanup
    pool desc <name> [text...]                     # set a host description (shown in `pool list`)

`pool init` runs every bootstrap feature on every shared host in order (skipping any host with `"shared": false`). Adding a feature is one `labels.push(...)` / `scripts.push(...)` pair in `cmd_init` inside `bin/pool.hexa`.

| feature | linux | macos |
|---|---|---|
| tailscale (headless) | `curl tailscale.com/install.sh \| sudo sh` | `brew install tailscale` + `brew services start tailscale` (CLI, not the GUI cask; avoids menubar conflicts) |
| `/etc/cron.daily/disk-cleanup` | the two-tier `pool clean` script, logged to `/var/log/pool-cleanup.log`. Safe tier daily (journalctl vacuum · apt autoremove+clean · snap disabled-revs · /var/{log,crash,tmp} · /tmp · ~/.cache/{uv,pip,puppeteer,mathlib,thumbnails}); deep tier when disk ≥ 85% (~/.cache/{huggingface,torch,npm,…} · docker system prune). Uniform 3-day mtime retention. | skipped (newsyslog + systemd-tmpfiles equivalents builtin) |
| no-sleep | skipped (Linux servers don't display-sleep) | pmset -a displaysleep=0 sleep=0 disksleep=0 + screensaver idleTime=0. FileVault Macs sever sshd-reachable interfaces on display lock, breaking long-running headless builds |
| persistent-journal | mkdir /var/log/journal + restart journald (post-mortem logs survive reboot) | skipped (macOS uses unified log, persistent by default) |
| panic-recovery | kernel.panic=10 + kernel.panic_on_oom=0 via /etc/sysctl.d/60-pool-recovery.conf; the kernel auto-reboots 10 s after a panic instead of hanging | skipped (Darwin kernel auto-restarts on panic) |
| earlyoom | apt install earlyoom + enable; the userspace OOM-killer fires at <10% free RAM+swap, converting kernel hangs into a clean per-process SIGKILL | skipped (macOS has Jetsam built in) |
| swapfile | 32 GB /swapfile-pool + /etc/fstab entry + vm.swappiness=10 (only if existing swap < 30 GB) | skipped (macOS manages dynamic swap automatically) |
| watchdog | softdog + systemd RuntimeWatchdogSec=30s (auto-reset on kernel hang) | skipped (no equivalent userspace API) |

tailscale up still needs interactive auth — pool init only installs.
The OOM-resilience block (persistent-journal · panic-recovery · earlyoom · swapfile · watchdog) was added after a host-OOM-on-concurrent-heavy-workload incident: a 17 GB ckpt eval + a 17 GB HF upload dispatched to the same 30 GB host in parallel saturated RAM, the kernel OOM-killer hung, and the host needed a physical reboot. Each feature is one independent line of defense — together they convert that failure mode into "earlyoom kills the lower-priority job; if it still escalates, swap absorbs the spike; if the kernel still hangs, the watchdog auto-resets within 30 s; persistent journal preserves the evidence; panic-recovery reboots in 10 s instead of forever." The structural dispatcher fix (a pool reservation/commitment ledger) is a separate roadmap item.
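
The arithmetic behind the incident is 17 GB + 17 GB = 34 GB of demand on a 30 GB host. As a purely illustrative sketch of the kind of admission check a reservation ledger could perform, consider the following Python. The ledger is a roadmap item, so `can_admit` and its parameters are hypothetical and not part of pool today.

```python
def can_admit(host_ram_gb, reserved_gb, request_gb):
    """Return True if the new job fits in RAM not yet promised to other jobs.

    Hypothetical helper: pool has no reservation ledger yet.
    """
    return sum(reserved_gb) + request_gb <= host_ram_gb


HOST_RAM_GB = 30  # the 30 GB host from the incident

# First 17 GB job: nothing reserved yet, 17 <= 30.
first_job_ok = can_admit(HOST_RAM_GB, [], 17)
# Second 17 GB job while the first is running: 17 + 17 = 34 > 30.
second_job_ok = can_admit(HOST_RAM_GB, [17], 17)

print(first_job_ok, second_job_ok)  # True False
```

A check of this shape would have queued or rejected the second job instead of letting both saturate RAM.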
pool clean <name>|all runs disk-cleanup immediately — no waiting for the daily cron — and reports freed space (df before/after):
| tier | when | targets |
|---|---|---|
| safe | always | journalctl vacuum · apt-get autoremove/clean · snap disabled revisions · /var/{log,crash,tmp} · /tmp · ~/.cache/{uv,pip,puppeteer,mathlib,thumbnails} |
| deep | `--deep`, or disk ≥ 85% | model caches ~/.cache/{huggingface,torch,torch_extensions,npm,yarn,go-build} · docker system prune |

--dry-run counts what each tier would remove and deletes nothing. Uniform 3-day mtime retention — atime is unreliable on noatime / relatime mounts.
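
The tier selection and the mtime-based retention rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual cleanup script: the function names `tiers_to_run` and `stale_files` are hypothetical, and only the 3-day retention and the 85% threshold come from the text above.

```python
import os
import tempfile
import time

RETENTION_DAYS = 3        # uniform mtime retention from the text
DEEP_THRESHOLD_PCT = 85   # deep tier threshold from the text


def tiers_to_run(deep_flag, disk_used_pct):
    """Safe tier always runs; deep tier needs --deep or disk >= 85%."""
    tiers = ["safe"]
    if deep_flag or disk_used_pct >= DEEP_THRESHOLD_PCT:
        tiers.append("deep")
    return tiers


def stale_files(root, now=None, days=RETENTION_DAYS):
    """Dry-run listing: files under root with mtime older than `days`.

    Nothing is deleted; the caller only gets the candidate paths.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)


# Demo on a throwaway directory: one fresh file, one aged 5 days.
with tempfile.TemporaryDirectory() as tmp:
    fresh = os.path.join(tmp, "fresh.log")
    old = os.path.join(tmp, "old.log")
    for p in (fresh, old):
        open(p, "w").close()
    five_days_ago = time.time() - 5 * 86400
    os.utime(old, (five_days_ago, five_days_ago))
    demo_stale = [os.path.basename(p) for p in stale_files(tmp)]

print(tiers_to_run(False, 60))  # ['safe']
print(tiers_to_run(False, 85))  # ['safe', 'deep']
print(demo_stale)               # ['old.log']
```

Comparing against `st_mtime` rather than `st_atime` mirrors the reasoning above: access times are not reliably updated on noatime / relatime mounts.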
`pool clean off <name>` excludes a host: `pool clean` skips it and `pool init` installs no cron hook there. This is intended for dedicated, non-shared machines and corresponds to the `clean: false` roster key.
~/.pool/pool.json:

    {
      "hosts": [
        {"name": "ubu-1", "ssh": "ubu-1", "sudo": true, "os": "linux"},
        {"name": "ubu-2", "ssh": "ubu-2", "sudo": true, "os": "linux"},
        {"name": "mini", "ssh": "mini", "sudo": true, "os": "macos"},
        {"name": "pi5-akida", "ssh": "ubuntu@192.168.50.155", "sudo": true, "os": "linux",
         "shared": false, "clean": false, "description": "anima — dedicated Akida neuromorphic host"}
      ]
    }

Override the path with `POOL_STATE=<path>`. Optional per-host keys:

| key | default | effect |
|---|---|---|
| `description` | (none) | free text · shown in `pool list` |
| `shared` | `true` | when `false`: `[ded]` marker in `pool list`; excluded from `pool on all` / `pool init` / `pool clean all`. Explicit `pool on <name>` / `pool clean <name>` still work. |
| `clean` | `true` | when `false`: excluded from `pool clean` and the disk-cleanup feature in `pool init` (orthogonal to `shared`: a shared host can opt out of disk-cleanup only) |

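
The `shared` and `clean` defaults can be modelled in a few lines of Python. This sketch is not pool's actual implementation (that lives in `bin/pool.hexa`); the function names are hypothetical, and the `mini` entry is given `"clean": false` here purely to illustrate a shared host that opts out of disk-cleanup only.

```python
import json


def load_roster(text):
    """Parse pool.json text and fill in the documented per-host defaults."""
    hosts = json.loads(text)["hosts"]
    for host in hosts:
        host.setdefault("shared", True)  # default: shared
        host.setdefault("clean", True)   # default: included in disk-cleanup
    return hosts


def fanout_targets(hosts):
    """Hosts reached by `pool on all` / `pool init`: shared hosts only."""
    return [h["name"] for h in hosts if h["shared"]]


def clean_all_targets(hosts):
    """Hosts reached by `pool clean all`: shared and not opted out."""
    return [h["name"] for h in hosts if h["shared"] and h["clean"]]


# Illustrative roster; the "clean": false on mini is an assumption for the demo.
ROSTER = """
{"hosts": [
  {"name": "ubu-1", "ssh": "ubu-1", "sudo": true, "os": "linux"},
  {"name": "mini", "ssh": "mini", "sudo": true, "os": "macos", "clean": false},
  {"name": "pi5-akida", "ssh": "ubuntu@192.168.50.155", "sudo": true, "os": "linux",
   "shared": false, "clean": false}
]}
"""

hosts = load_roster(ROSTER)
print(fanout_targets(hosts))     # ['ubu-1', 'mini']
print(clean_all_targets(hosts))  # ['ubu-1']
```

Keeping the two keys independent is what lets `mini` stay in the fan-out while being skipped by cleanup.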
pool list probes every host concurrently (8-wide async fan-out, ~8s probe timeout) and renders load · ram · gpu · disk inline. Shared hosts get a blank flag column; dedicated (shared: false) hosts get a [ded] marker:

                                          load  ram    gpu  disk      sudo  desc
    mini              mini         macos  0.42  8/16G  -    100/512G  sudo
    ubu-1             ubu-1        linux  0.01  3/30G  0%   546/915G  sudo
    ubu-2             ubu-2        linux  4.02  4/30G  0%   721/915G  sudo  shared CI runner
    pi5-akida  [ded]  ubuntu@192…         -     -      -    -         sudo  anima — dedicated Akida host

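
A bounded concurrent probe like the one `pool list` describes can be sketched in Python as follows. This is an illustration under assumptions, not pool's code: `ssh_probe`, `probe_all`, and the `uptime` remote command are hypothetical, and only the 8-wide fan-out and the roughly 8 s timeout come from the text.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

FANOUT_WIDTH = 8     # "8-wide" from the text
PROBE_TIMEOUT_S = 8  # "~8s probe timeout" from the text


def ssh_probe(target):
    """Hypothetical probe: run `uptime` over ssh; None on any failure."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", target, "LC_ALL=C uptime"],
            capture_output=True, text=True, timeout=PROBE_TIMEOUT_S,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def probe_all(targets, probe=ssh_probe):
    """Probe every target with at most FANOUT_WIDTH in flight.

    executor.map yields results in input order, so the dict keeps roster order.
    """
    with ThreadPoolExecutor(max_workers=FANOUT_WIDTH) as executor:
        return dict(zip(targets, executor.map(probe, targets)))


def fake_probe(target):
    """Stand-in probe so the demo runs without any ssh access."""
    return None if target == "down-host" else "ok"


demo = probe_all(["ubu-1", "down-host", "mini"], probe=fake_probe)
print(demo)  # {'ubu-1': 'ok', 'down-host': None, 'mini': 'ok'}
```

A `None` result would correspond to a row rendered with `-` placeholders; one slow or unreachable host delays the listing by at most the timeout rather than blocking it.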
--fast skips the probe for a quick roster-only view (no ssh). pool list --json embeds the probe result as a per-host usage object:

    {"name": "ubu-1", "ssh": "ubu-1", "os": "linux",
     "usage": {"load": "0.01", "ram": "3/30G", "gpu": "0%", "disk": "546/915G"}}

Probing is locale-independent (`LC_ALL=C` is forced on the remote). On Linux, `free -g` output is parsed by row/column position; macOS uses `vm_stat` + `sysctl hw.memsize`. GPU is the `nvidia-smi` utilization when present, `-` otherwise.
See TODO.md (dispatcher-side reservation ledger · future verbs).
MIT.