fix: replace tar injection with bind-mount COW workspace (CLI-28)#87
Open
markovejnovic wants to merge 18 commits into
Open
fix: replace tar injection with bind-mount COW workspace (CLI-28)#87markovejnovic wants to merge 18 commits into
markovejnovic wants to merge 18 commits into
Conversation
Child steps that boot from a cached parent snapshot now receive fresh source files via tar injection, preventing stale workspace data from leaking across builds.
cp -cR can fail with nonzero exit (cross-APFS-volume) but or_else only caught spawn failures. Now falls back to cp -R on either case.
Add optional workspace_dir column to the snapshots table with an idempotent schema migration. Change eviction return type from Vec<SnapshotId> to Vec<(SnapshotId, Option<String>)> so callers can clean up workspace directories alongside backend snapshots. New methods: put_with_workspace() and get_with_workspace(). Updated vm.rs eviction loop to remove workspace dirs on eviction.
…to cache Replace the tar-upload inject() mechanism with Docker bind mounts for workspace directories. The workspace is now mounted at container creation time via HostConfig.binds, eliminating the tar_directory() helper and UploadToContainerOptions usage entirely. Key changes: - Add WorkspaceMount type (host_path + guest_path) to types.rs - Add workspace_dir field to ExecutionResult for downstream consumption - Add workspace_cache_dir to VmConfig for COW cache persistence - Update VmBackend::create/restore to accept Optional<&WorkspaceMount> - Remove Vm::inject() from the trait and all implementations - Delete tar_directory() and remove tar crate dependency - Persist workspace via cow_copy after snapshot on cache miss - Store workspace path in registry via put_with_workspace on cache store - Return cached workspace path from registry on cache hit - Clean up evicted workspace directories alongside snapshots - Update VmRunner to construct WorkspaceMount instead of inject path
- Add parent_workspace_dir to StepContext, workspace_dir to StepResult - VmRunner: COW copy parent workspace for child steps, extract fresh for root - Scheduler: propagate workspace_dir through StepOutcome to children
- Format all Rust files - Collapse nested if-let to tuple pattern (clippy) - Replace map().unwrap_or() with map_or() (clippy) - Add #[doc] Errors section to cow_copy (clippy pedantic)
- Skip workspace persistence for CachingPolicy::None (fixes parallel uncached steps racing on shared "ephemeral" dir) - Runner keeps step tempdir alive via TempDir::keep() for uncached steps so children can still COW-copy from parent workspace - Single get_with_workspace() call on cache hit (was double lookup) - invalidate() now returns workspace_dir for cleanup (was leaking orphaned workspace dirs on disk) - Format test_cache_predicate.py with ruff
- Registry: unify put/put_with_workspace into single put(key, snap, ws); fold eviction into single lock scope to fix TOCTOU race (C3+C4) - Ephemeral snapshots: UUID-based labels to avoid parallel collision (C1); scheduler cleanup removes Docker images and temp dirs after build (C2) - Cache hits: peek_cache() in VmRunner skips expensive COW copy (I3); validate workspace_dir still exists on disk before trusting cache (I5) - Blocking I/O: wrap cow_copy, remove_dir_all, create_dir_all in spawn_blocking to avoid starving the tokio runtime (I2) - workspace.rs: clear partial state between failed APFS clone and fallback copy (I6); use -p flag to preserve permissions (I7); include src/dst paths in error messages - Python DSL: validate predicate callback is callable; wrap exceptions and reject None returns (I8)
…ots only Rearchitect workspace flow to kill the stale-source bug class (CLI-28): cached snapshots now carry ONLY system state (docker commit); workspace state is strictly run-scoped and rebases to the current run's fresh source at every cache-hit boundary. A cache-hit step contributes its snapshot for container lineage but no workspace dir; its children COW from the run's fresh source extract instead of a stale persisted copy. Core changes: - registry: workspace_dir always written NULL (legacy rows reaped); put() runs upsert+eviction in one transaction with DELETE..RETURNING; invalidate_if() CAS closes the stale-entry/fresh-insert race; contains_snapshot() backs all guarded image removals - vm: HmVm::execute takes a CancellationToken — cooperative cancel with awaited teardown (destroy/chown reclaim strictly happens-before the runner drops the workspace tempdir); evictions deferred to end-of-run (cleanup_deferred_evictions) so in-flight readers never lose their tag; put-failure demotes the snapshot to ephemeral instead of leaking it; gc_orphaned_snapshots sweeps aged rowless tags at startup - scheduler: per-step workspace refcounting frees a parent's kept dir the moment its last BuildsIn consumer finishes (temp footprint now tracks the live DAG frontier); per-step timeout fires a child token and AWAITS teardown instead of dropping the future; ephemeral snapshot cleanup is guarded by contains_snapshot so a concurrently re-registered tag survives - runner: once-per-run shared source extract (Arc<OnceCell<TempDir>>); cow_copy call sites moved into spawn_blocking; cache-hit path does zero workspace prep - workspace: single-attempt cp with captured stderr (cp -c self-falls back to copyfile(2); the manual retry masked real errors) - docker: per-tag GC removal (untag-safe for multi-tag images); exec quiesce on cancelled commands before chown reclaim - DSL: forever() docs state the hit-boundary semantics (workspace writes are not replayed across runs); predicate keygen cross-language tests Semantic change: on a cache hit, files a step wrote into /workspace are no longer visible to downstream steps across runs — downstream sees the current run's source tree. Write build outputs to system paths or use on_change() when workspace outputs must survive hits. Acceptance: deep_cache_chain_workspace passes from warm and cold cache, asserting [a]/[b] cache hits AND fresh marker content (the CLI-28 contract). local_fork_cache + local_parallelism pass. Test fixtures modernized from the obsolete def build()/.run() DSL form.
…workspace-broken # Conflicts: # crates/hm-exec/src/local/backend.rs # crates/hm-exec/src/local/runner/vm.rs # crates/hm-exec/src/local/scheduler.rs # crates/hm-vm/src/docker.rs # crates/hm-vm/src/registry.rs # crates/hm-vm/src/vm.rs
…workspace-broken # Conflicts: # crates/hm-exec/src/local/scheduler.rs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes CLI-28 (#85): steps downstream of cached ancestors saw stale source files, because workspace state was baked into cached layers at first-run time and never refreshed.
Architecture: run-local workspace lineage. Cached snapshots (
docker commit) carry system state only — bind-mounted/workspaceis excluded from commits by Docker design, and the snapshot registry now stores no workspace paths. Workspace state is strictly run-scoped: each run extracts the source archive once, every root step COW-copies from that fresh extract, children COW from their parent's run-local output dir, and a cache-hit boundary rebases to fresh source — the hit contributes its snapshot for container lineage but no workspace, so children start from the current run's source tree.On a cache hit, files a step wrote into
/workspaceare not replayed across runs — downstream steps see the current run's source tree instead. Write build outputs to system paths (e.g./usr/local) or useon_change()when workspace outputs must survive hits. Documented onforever()in both DSLs.Reliability work (from 4 adversarial review rounds)
HmVm::executetakes aCancellationToken; teardown (container destroy + bind-mount ownership reclaim) is awaited before the future resolves — no future-drop leaks of root-owned workspace dirs. Per-step timeouts await teardown too (exit 124 only after reclaim).contains_snapshotso a re-registered tag is never destroyed.BuildsInconsumer finishes; temp footprint tracks the live DAG frontier, with an idempotent end-of-run sweep as backstop.registry.putfailure demotes the snapshot to ephemeral (run finishes correctly, image reclaimed later) instead of orphaning it; startup GC sweeps aged rowlessharmont-ephemeral/*tags (24h age floor protects concurrent runs).DELETE..RETURNING, deterministic tie-break),invalidate_ifCAS closes the stale-entry/fresh-insert race,busy_timeoutfor cross-process access.cow_copy: single-attemptcpwith captured stderr (Apple'scp -cself-falls-back tocopyfile(2)); all call sites inspawn_blocking.hm.predicate(fn)(Python) /predicate(value)(TS) with cross-language keygen tests.Known limitations (accepted, documented)
invalidate_ifCAS is vacuous on the Docker backend (snapshot id == tag); worst case degrades to re-execution, never wrong data.local_e2e.rsfixtures are missing on main (pre-existing); the executable cache-hit gates are the#[ignore]d Docker tests.Test plan
cargo test— all suites green (200 tests)cargo clippy --all-targets— zero warnings;cargo fmt --checkcleancargo test --test deep_cache_chain_workspace -- --ignored— passes from warm AND cold cache; asserts[a]/[b]cache hits and freshdeep-v2marker (the CLI-28 contract)local_fork_cache,local_parallelism(ignored, Docker) — passharmont-ephemeralimages leaked after runs; stale-invalidation path exercised viadocker rmiof cached images