feat(distributed): resumable file uploads via HTTP Content-Range #10109
Merged
Large model GGUFs (multi-GB) transferred between master and worker over
flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used
to restart from byte 0 on every transport error. This change adds standard
HTTP Range/resume semantics to the worker's PUT /v1/files/<key> endpoint
and teaches the master-side HTTPFileStager to consult the worker for the
last accepted offset and resume from there.
Server side (file_transfer_server.go):
- PUT now honors Content-Range: bytes <start>-<end>/<total>. The handler
validates that <start> matches the current on-disk size; mismatches
return 416 with the actual size in X-File-Size.
- Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete")
with the new size, so the client can keep going.
- An optional X-Content-SHA256 request header binds an upload to a target
hash; cross-attempt drift returns 409. On the final chunk the server
re-computes SHA-256 and returns 400 if it doesn't match.
- HEAD now advertises Accept-Ranges: bytes and Content-Length, and exposes
X-Target-SHA256 for in-progress files (so clients can resume only when
the partial bytes belong to the file they want to upload).
- Legacy PUTs with no Content-Range keep the original truncate-create
semantics — zero behavior change on the happy path.
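
The server-side offset check can be sketched as follows. This is a minimal illustration, not the code in file_transfer_server.go: `contentRange`, `parseContentRange` and `statusForChunk` are hypothetical names, and the 200 returned for the final chunk is an assumption, since the description above only specifies the 308 and 416 cases.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// contentRange holds the fields of "bytes <start>-<end>/<total>".
// Hypothetical type, introduced only for this sketch.
type contentRange struct {
	Start, End, Total int64
}

// parseContentRange parses a Content-Range request header value.
func parseContentRange(h string) (contentRange, error) {
	var cr contentRange
	if !strings.HasPrefix(h, "bytes ") {
		return cr, fmt.Errorf("unsupported Content-Range %q", h)
	}
	span, total, ok := strings.Cut(strings.TrimPrefix(h, "bytes "), "/")
	if !ok {
		return cr, fmt.Errorf("missing total in %q", h)
	}
	first, last, ok := strings.Cut(span, "-")
	if !ok {
		return cr, fmt.Errorf("missing range in %q", h)
	}
	var err error
	if cr.Start, err = strconv.ParseInt(first, 10, 64); err != nil {
		return contentRange{}, err
	}
	if cr.End, err = strconv.ParseInt(last, 10, 64); err != nil {
		return contentRange{}, err
	}
	if cr.Total, err = strconv.ParseInt(total, 10, 64); err != nil {
		return contentRange{}, err
	}
	if cr.Start < 0 || cr.End < cr.Start || cr.End >= cr.Total {
		return contentRange{}, fmt.Errorf("inconsistent Content-Range %q", h)
	}
	return cr, nil
}

// statusForChunk maps a parsed range and the current on-disk size to the
// response status described above.
func statusForChunk(cr contentRange, onDisk int64) int {
	if cr.Start != onDisk {
		// Offset mismatch: the client must re-learn the real size.
		return http.StatusRequestedRangeNotSatisfiable // 416
	}
	if cr.End+1 < cr.Total {
		// More bytes expected: "Resume Incomplete".
		return http.StatusPermanentRedirect // 308
	}
	// Final chunk accepted (assumed success code; hash check not shown).
	return http.StatusOK
}

func main() {
	cr, err := parseContentRange("bytes 0-1048575/5368709120")
	if err != nil {
		fmt.Println("bad header:", err)
		return
	}
	fmt.Println(statusForChunk(cr, 0))
}
```

Keeping the decision in a pure function makes the 416 and 308 paths easy to cover without touching the filesystem.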
Client side (file_stager_http.go):
- Pre-PUT HEAD probe reads X-File-Size + X-Target-SHA256 to determine the
resume offset.
- doUpload seeks to that offset and sends Content-Range + X-Content-SHA256.
- Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an
outer time budget with exponential backoff (1s -> 30s cap), so a 5GB
upload over a flaky link can outlast many short disconnects.
- 308 and 416 responses are treated as transient: the next iteration
re-HEADs to learn the correct offset.
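
A rough sketch of the budgeted retry loop described above. The names `retryWithBudget`, `nextBackoff` and `errDropped` are hypothetical, and the budget is a parameter because its value is not stated here; the real loop in file_stager_http.go also re-HEADs before each attempt, which this sketch leaves to the `attempt` callback.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDropped stands in for any transient transport error (hypothetical).
var errDropped = errors.New("connection dropped")

// nextBackoff doubles the current delay, capped at limit.
func nextBackoff(cur, limit time.Duration) time.Duration {
	next := cur * 2
	if next > limit {
		return limit
	}
	return next
}

// retryWithBudget calls attempt until it succeeds or until sleeping for
// the next delay would overrun the overall time budget.
func retryWithBudget(budget, initial, limit time.Duration, attempt func() error) error {
	deadline := time.Now().Add(budget)
	delay := initial
	for {
		err := attempt()
		if err == nil {
			return nil
		}
		if time.Now().Add(delay).After(deadline) {
			return fmt.Errorf("upload budget exhausted: %w", err)
		}
		time.Sleep(delay)
		delay = nextBackoff(delay, limit)
	}
}

func main() {
	// Print the delay schedule for a 1s start and a 30s cap.
	d := time.Second
	for i := 0; i < 7; i++ {
		fmt.Println(d)
		d = nextBackoff(d, 30*time.Second)
	}
}
```

With a 1s start and a 30s cap the delays run 1s, 2s, 4s, 8s, 16s, 30s, 30s, so the number of attempts is bounded by elapsed time rather than by a fixed count.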
Tests:
- Two-chunk Content-Range round-trip produces the correct file + sidecar.
- 416 on a Content-Range/file-size mismatch.
- 409 on X-Content-SHA256 drift between chunks.
- 400 on final-hash mismatch.
- HEAD on a partial upload exposes X-Target-SHA256 (not a misleading
hash-of-partial-bytes via X-Content-SHA256).
- Pre-existing finished file with a different hash is transparently
overwritten when a new PUT starts at byte 0.
- End-to-end resume: EnsureRemote against a worker that already holds a
partial file transfers only the remainder.
- Mid-stream connection drop on attempt #1 is recovered by attempt #2
resuming from the partial offset.
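
The resume cases these specs exercise reduce to a small client-side decision over the HEAD probe's X-File-Size and X-Target-SHA256 values. The sketch below is illustrative only: `resumeOffset` and `contentRangeFor` are hypothetical helpers, and restarting from byte 0 when the reported size is not smaller than the local file is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// resumeOffset picks the byte offset to resume from, given the raw
// X-File-Size and X-Target-SHA256 header values from a HEAD probe.
func resumeOffset(fileSizeHdr, targetSHA, wantSHA string, localSize int64) int64 {
	// Resume only when the partial bytes belong to the file being sent.
	if targetSHA == "" || targetSHA != wantSHA {
		return 0
	}
	size, err := strconv.ParseInt(fileSizeHdr, 10, 64)
	// Assumption: an unparsable or implausible size means start over.
	if err != nil || size < 0 || size >= localSize {
		return 0
	}
	return size
}

// contentRangeFor builds the Content-Range value for sending everything
// from offset to the end of the file. It assumes offset < total.
func contentRangeFor(offset, total int64) string {
	return fmt.Sprintf("bytes %d-%d/%d", offset, total-1, total)
}

func main() {
	off := resumeOffset("4096", "abc123", "abc123", 10000)
	fmt.Println(off, contentRangeFor(off, 10000))
}
```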
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler approved these changes on May 31, 2026
Summary
Large model GGUFs (multi-GB) transferred between master and worker over flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used to restart from byte 0 on every transport error. This change adds standard HTTP Range/resume semantics to the worker's `PUT /v1/files/<key>` endpoint and teaches the master-side `HTTPFileStager` to consult the worker for the last accepted offset and resume from there.
Server side — `file_transfer_server.go`
Client side — `file_stager_http.go`
Tests
8 new specs in `file_transfer_server_test.go`:
Full nodes suite (267 specs, 147s) green.