Skip to content

dancinlab/pool

Repository files navigation

pool

Minimal host roster + remote exec + one-shot bootstrap. Single hexa file, zero deps, state at ~/.pool/pool.json.

Install

# 1. Install hexa-lang (gives you `hexa` + `hx` package manager)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/dancinlab/hexa-lang/main/install.sh)"

# 2. Install pool
hx install pool

hx wires the pool shim into ~/.hx/bin/ (must be on PATH).

Verbs

pool add <name> <user@host> [--sudo]   # register a host (ssh-probes `uname -s` for OS)
pool rm  <name>                         # remove from roster
pool list [--json] [--fast]             # roster + live cpu/ram/gpu/disk per host (--fast skips probe; --json embeds usage)
pool on  <name> <cmd...>                # ssh + exec on host
pool on  all <cmd...> [--jobs N] [--timeout S]  # parallel fanout to every shared host (skips [ded]; generic, zero domain knowledge)
pool status                             # live reachability probe (each host)
pool refresh                            # re-probe os for every host
pool init                               # bootstrap every shared host with all features (skips [ded])
pool clean <name>|all [--dry-run] [--deep]  # run disk-cleanup now — safe tier; --deep adds model caches + docker
pool clean on|off <name>                # exclude / include a host in disk-cleanup
pool desc <name> [text...]              # set a host description (shown in `pool list`)

pool init features

pool init runs every bootstrap feature on every shared host in order (skipping any host with "shared": false). Adding a feature is one labels.push(...) / scripts.push(...) pair in cmd_init inside bin/pool.hexa.

feature linux macos
tailscale (headless) curl tailscale.com/install.sh | sudo sh brew install tailscale + brew services start tailscale (CLI, not the GUI cask — avoids menubar conflicts)
/etc/cron.daily/disk-cleanup the two-tier pool clean script, logged to /var/log/pool-cleanup.log. Safe tier daily (journalctl vacuum · apt autoremove+clean · snap disabled-revs · /var/{log,crash,tmp} · /tmp · ~/.cache/{uv,pip,puppeteer,mathlib,thumbnails}); deep tier when disk ≥ 85% (~/.cache/{huggingface,torch,npm,…} · docker system prune). Uniform 3-day mtime retention. skipped (newsyslog + systemd-tmpfiles equivalents builtin)
no-sleep skipped (Linux servers don't display-sleep) pmset -a displaysleep=0 sleep=0 disksleep=0 + screensaver idleTime=0 — FileVault Macs sever sshd-reachable interfaces on display lock, breaking long-running headless builds
persistent-journal mkdir /var/log/journal + restart journald (post-mortem logs survive reboot) skipped (macOS uses unified log, persistent by default)
panic-recovery kernel.panic=10 + kernel.panic_on_oom=0 via /etc/sysctl.d/60-pool-recovery.conf — kernel auto-reboots 10s after panic instead of hanging skipped (Darwin kernel auto-restarts on panic)
earlyoom apt install earlyoom + enable — userspace OOM-killer fires at <10% free RAM+swap, converting kernel hangs into clean per-process SIGKILL skipped (macOS has Jetsam built in)
swapfile 32 GB /swapfile-pool + /etc/fstab entry + vm.swappiness=10 — only if existing swap < 30 GB skipped (macOS manages dynamic swap automatically)
watchdog softdog + systemd RuntimeWatchdogSec=30s (auto-reset on kernel hang) skipped (no equivalent userspace API)

tailscale up still needs interactive auth — pool init only installs.

The OOM-resilience block (persistent-journal · panic-recovery · earlyoom · swapfile · watchdog) was added after a host-OOM-on-concurrent-heavy-workload incident: a 17 GB ckpt eval + a 17 GB HF upload dispatched to the same 30 GB host in parallel saturated RAM, the kernel OOM-killer hung, and the host needed a physical reboot. Each feature is one independent line of defense — together they convert that failure mode into "earlyoom kills the lower-priority job; if it still escalates, swap absorbs the spike; if the kernel still hangs, the watchdog auto-resets within 30 s; persistent journal preserves the evidence; panic-recovery reboots in 10 s instead of forever." The structural dispatcher fix (a pool reservation/commitment ledger) is a separate roadmap item.

pool clean

pool clean <name>|all runs disk-cleanup immediately — no waiting for the daily cron — and reports freed space (df before/after):

tier when targets
safe always journalctl vacuum · apt-get autoremove/clean · snap disabled revisions · /var/{log,crash,tmp} · /tmp · ~/.cache/{uv,pip,puppeteer,mathlib,thumbnails}
deep --deep, or disk ≥ 85% model caches ~/.cache/{huggingface,torch,torch_extensions,npm,yarn,go-build} · docker system prune

--dry-run counts what each tier would remove and deletes nothing. Uniform 3-day mtime retention — atime is unreliable on noatime / relatime mounts.

pool clean off <name> excludes a host: pool clean skips it and pool init installs no cron hook there. For dedicated, non-shared machines (the clean: false roster key).

State

~/.pool/pool.json:

{
  "hosts": [
    {"name": "ubu-1",     "ssh": "ubu-1",                   "sudo": true, "os": "linux"},
    {"name": "ubu-2",     "ssh": "ubu-2",                   "sudo": true, "os": "linux"},
    {"name": "mini",      "ssh": "mini",                    "sudo": true, "os": "macos"},
    {"name": "pi5-akida", "ssh": "ubuntu@192.168.50.155",   "sudo": true, "os": "linux",
     "shared": false, "clean": false, "description": "anima — dedicated Akida neuromorphic host"}
  ]
}

Override path with POOL_STATE=<path>. Optional per-host keys:

key default effect
description free text · shown in pool list
shared true when false: [ded] marker in pool list; excluded from pool on all / pool init / pool clean all. Explicit pool on <name> / pool clean <name> still works.
clean true when false: excluded from pool clean and the disk-cleanup feature in pool init (orthogonal to shared — a shared host can opt out of disk-cleanup only)

pool list

pool list probes every host concurrently (8-wide async fan-out, ~8s probe timeout) and renders load · ram · gpu · disk inline. Shared hosts get a blank flag column; dedicated (shared: false) hosts get a [ded] marker:

                                  load  ram    gpu  disk      sudo  desc
mini              mini    macos   0.42  8/16G  -    100/512G  sudo
ubu-1             ubu-1   linux   0.01  3/30G  0%   546/915G  sudo
ubu-2             ubu-2   linux   4.02  4/30G  0%   721/915G  sudo  shared CI runner
pi5-akida  [ded]  ubuntu@192…     -     -      -    -         sudo  anima — dedicated Akida host

--fast skips the probe for a quick roster-only view (no ssh). pool list --json embeds the probe result as a per-host usage object:

{"name": "ubu-1", "ssh": "ubu-1", "os": "linux",
 "usage": {"load": "0.01", "ram": "3/30G", "gpu": "0%", "disk": "546/915G"}}

Locale-independent (LC_ALL=C forced on remote). Linux free -g parsed by row/column position. macOS uses vm_stat + sysctl hw.memsize. GPU = nvidia-smi utilization when present, - otherwise.

Roadmap

See TODO.md (dispatcher-side reservation ledger · future verbs).

License

MIT.

About

minimal host roster + remote exec (6 verbs · hx install pool)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors