feat(distributed): resumable file uploads via HTTP Content-Range#10109

Merged
mudler merged 1 commit into master from feat/file-upload-range-resume on May 31, 2026

Conversation

@localai-bot
Collaborator

Summary

Large model GGUFs (multi-GB) transferred between master and worker over flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used to restart from byte 0 on every transport error. This change adds standard HTTP Range/resume semantics to the worker's `PUT /v1/files/<key>` endpoint and teaches the master-side `HTTPFileStager` to consult the worker for the last accepted offset and resume from there.

Server side — `file_transfer_server.go`

  • PUT honors `Content-Range: bytes <start>-<end>/<total>`. The handler validates that `<start>` matches the current on-disk size; mismatches return 416 with the actual size in `X-File-Size`.
  • Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete") with the new size, so the client can keep going.
  • Optional `X-Content-SHA256` request header binds an upload to a target hash; cross-attempt drift returns 409. On the final chunk the server re-computes SHA-256 and returns 400 if it doesn't match.
  • HEAD advertises `Accept-Ranges: bytes` + `Content-Length` and exposes `X-Target-SHA256` for in-progress files (so clients can resume only when the partial bytes belong to the file they want to upload).
  • Legacy PUTs with no `Content-Range` keep the original truncate-create semantics — zero behavior change on the happy path.

Client side — `file_stager_http.go`

  • Pre-PUT HEAD probe reads `X-File-Size` + `X-Target-SHA256` to determine the resume offset.
  • `doUpload` seeks to that offset and sends `Content-Range` + `X-Content-SHA256`.
  • Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an outer time budget (default 1h, override via `LOCALAI_FILE_TRANSFER_BUDGET`) with exponential backoff (1s → 30s cap), so a 5GB upload over a flaky link can outlast many short disconnects.
  • 308 and 416 responses are treated as transient: the next iteration re-HEADs to learn the correct offset.
  • The retry-budget context is derived from the caller's ctx rather than replacing it, so an upstream routing-model timeout or cancellation still propagates and aborts the loop immediately.

Tests

8 new specs in `file_transfer_server_test.go`:

  • two-chunk `Content-Range` round-trip produces correct file + sidecar
  • 416 on offset mismatch
  • 409 on SHA drift between chunks
  • 400 on final-hash mismatch
  • HEAD on partial upload exposes `X-Target-SHA256` (not a misleading hash-of-partial-bytes)
  • pre-existing finished file with a different hash transparently overwritten on a new PUT starting at byte 0
  • end-to-end resume via `EnsureRemote` against a worker that already holds a partial file
  • mid-stream connection drop on attempt #1 recovered by attempt #2

Full nodes suite (267 specs, 147s) green.

@mudler force-pushed the feat/file-upload-range-resume branch 2 times, most recently from f1a38e4 to db5e371 (May 31, 2026 10:21)
Large model GGUFs (multi-GB) transferred between master and worker over
flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used
to restart from byte 0 on every transport error. This change adds standard
HTTP Range/resume semantics to the worker's PUT /v1/files/<key> endpoint
and teaches the master-side HTTPFileStager to consult the worker for the
last accepted offset and resume from there.

Server side (file_transfer_server.go):
- PUT now honors Content-Range: bytes <start>-<end>/<total>. The handler
  validates that <start> matches the current on-disk size; mismatches
  return 416 with the actual size in X-File-Size.
- Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete")
  with the new size, so the client can keep going.
- An optional X-Content-SHA256 request header binds an upload to a target
  hash; cross-attempt drift returns 409. On the final chunk the server
  re-computes SHA-256 and returns 400 if it doesn't match.
- HEAD now advertises Accept-Ranges: bytes and Content-Length, and exposes
  X-Target-SHA256 for in-progress files (so clients can resume only when
  the partial bytes belong to the file they want to upload).
- Legacy PUTs with no Content-Range keep the original truncate-create
  semantics — zero behavior change on the happy path.

Client side (file_stager_http.go):
- Pre-PUT HEAD probe reads X-File-Size + X-Target-SHA256 to determine the
  resume offset.
- doUpload seeks to that offset and sends Content-Range + X-Content-SHA256.
- Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an
  outer time budget with exponential backoff (1s -> 30s cap), so a 5GB
  upload over a flaky link can outlast many short disconnects.
- 308 and 416 responses are treated as transient: the next iteration
  re-HEADs to learn the correct offset.

Tests:
- Two-chunk Content-Range round-trip produces the correct file + sidecar.
- 416 on a Content-Range/file-size mismatch.
- 409 on X-Content-SHA256 drift between chunks.
- 400 on final-hash mismatch.
- HEAD on a partial upload exposes X-Target-SHA256 (not a misleading
  hash-of-partial-bytes via X-Content-SHA256).
- Pre-existing finished file with a different hash is transparently
  overwritten when a new PUT starts at byte 0.
- End-to-end resume: EnsureRemote against a worker that already holds a
  partial file transfers only the remainder.
- Mid-stream connection drop on attempt #1 is recovered by attempt #2
  resuming from the partial offset.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler force-pushed the feat/file-upload-range-resume branch from db5e371 to 9ff179d (May 31, 2026 10:29)
@mudler enabled auto-merge (squash) (May 31, 2026 10:34)
@mudler merged commit c222161 into master (May 31, 2026)
58 checks passed
@mudler deleted the feat/file-upload-range-resume branch (May 31, 2026 11:02)