
Add mirror command and API for selective package mirroring #40

Open
andrew wants to merge 5 commits into main from mirror-feature

andrew commented Mar 19, 2026

Adds a proxy mirror CLI command and /api/mirror REST endpoints for pre-populating the cache from multiple input sources.

Input modes (narrowest to broadest):

  • Versioned PURLs: proxy mirror pkg:npm/lodash@4.17.21
  • Unversioned PURLs (all versions): proxy mirror pkg:npm/lodash
  • SBOM files: proxy mirror --sbom sbom.cdx.json (CycloneDX JSON/XML, SPDX JSON/tag-value)
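The difference between the first two input modes comes down to whether the PURL carries a version. A minimal sketch of that distinction follows; it is not the parser used in this PR, and the `purl` type and `parsePURL` function are hypothetical names that only handle the simple `pkg:type/name[@version]` shape shown above (no percent-decoding, qualifiers and subpaths are dropped).

```go
package main

import (
	"fmt"
	"strings"
)

// purl is a deliberately small, hypothetical view of a package URL:
// only the parts the mirror input modes need to tell apart.
type purl struct {
	Type    string // ecosystem, e.g. "npm"
	Name    string // package name, including any namespace
	Version string // empty means "all versions"
}

// parsePURL handles only pkg:type/name[@version]. It is a sketch, not a
// spec-complete parser.
func parsePURL(s string) (purl, error) {
	if !strings.HasPrefix(s, "pkg:") {
		return purl{}, fmt.Errorf("not a purl: %q", s)
	}
	rest := strings.TrimPrefix(s, "pkg:")
	// Drop qualifiers (?...) and subpath (#...) for this illustration.
	if i := strings.IndexAny(rest, "?#"); i >= 0 {
		rest = rest[:i]
	}
	typ, path, ok := strings.Cut(rest, "/")
	if !ok || typ == "" || path == "" {
		return purl{}, fmt.Errorf("missing type or name: %q", s)
	}
	p := purl{Type: typ, Name: path}
	// Treat the last "@" as the version separator, if there is one.
	if i := strings.LastIndex(path, "@"); i > 0 {
		p.Name, p.Version = path[:i], path[i+1:]
	}
	return p, nil
}

func main() {
	for _, s := range []string{"pkg:npm/lodash@4.17.21", "pkg:npm/lodash"} {
		p, err := parsePURL(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if p.Version == "" {
			fmt.Printf("%s: mirror all versions of %s/%s\n", s, p.Type, p.Name)
		} else {
			fmt.Printf("%s: mirror %s/%s at version %s\n", s, p.Type, p.Name, p.Version)
		}
	}
}
```

A real implementation would more likely lean on an existing PURL library so that namespaces, qualifiers, and encoding are handled per the specification.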

Architecture:

  • New internal/mirror/ package with Source interface and implementations
  • Reuses existing handler.Proxy.GetOrFetchArtifact() for fetch-and-cache
  • Bounded worker pool via errgroup with --concurrency flag (default 4)
  • --dry-run flag to preview what would be mirrored
  • Async job management for API usage with POST/GET/DELETE endpoints
  • metadata_cache database table for offline metadata serving with upstream ETag and Last-Modified support

API endpoints:

  • POST /api/mirror - start a mirror job (JSON body with purls)
  • GET /api/mirror/{id} - check job status and live progress
  • DELETE /api/mirror/{id} - cancel a running job

Metadata caching:

  • FetchOrCacheMetadata() replaces ProxyMetadata(), returning bytes instead of writing directly to the response
  • ProxyCached() writes the response with ETag/Last-Modified headers and handles client conditional requests (304 Not Modified)
  • Configurable metadata_ttl (default 5m) serves cached metadata directly within the TTL window without contacting upstream
  • When upstream is unreachable, stale cache is served with Warning: 110 - "Response is Stale" per RFC 7234
  • ETag-based conditional revalidation with upstream avoids re-downloading unchanged metadata
  • Config: metadata_ttl (YAML) / PROXY_METADATA_TTL (env), set to "0" to always revalidate

New dependencies:

  • github.com/CycloneDX/cyclonedx-go - CycloneDX SBOM parsing
  • github.com/spdx/tools-golang - SPDX SBOM parsing

Breaking changes:

  • Proxy.ProxyMetadata() has been removed. Handlers now use Proxy.FetchOrCacheMetadata() which returns bytes instead of writing directly to the response, enabling the metadata caching layer. Proxy.ProxyCached() is the new response-writing equivalent.
  • Some upstream error responses changed from 500 to 502 (Bad Gateway), which better reflects the actual failure mode when upstream registries are unreachable or return errors.

Related to #20.


andrew commented Apr 6, 2026

The metadata caching in this PR opens the door for cooldown support on ecosystems that currently need per-version HTTP calls to get timestamps:

Go modules: /@v/{version}.info returns a small JSON object with a Time field. With proxyCached caching these responses, filtering /@v/list becomes feasible. The first request would still fetch .info for each uncached version, but subsequent requests would just read timestamps from the local cache. The N+1 HTTP problem becomes "N+1 once, then 0 extra calls."

Maven: Similar pattern. maven-metadata.xml lists versions but has no timestamps. Individual POM files have Last-Modified headers that could be cached. First pass fetches each POM once, then subsequent metadata filtering reads from cache.

Related PRs adding cooldown to ecosystems that already have timestamps in their metadata:

andrew added 5 commits April 6, 2026 19:32
Add a `proxy mirror` CLI command and `/api/mirror` API endpoints that
pre-populate the cache from various input sources: individual PURLs,
SBOM files (CycloneDX and SPDX), or full registry enumeration.

The mirror reuses the existing handler.Proxy.GetOrFetchArtifact()
pipeline so cached artifacts are identical to those fetched on demand.
A bounded worker pool controls download parallelism.

Metadata caching is opt-in via `cache_metadata: true` in config (or
PROXY_CACHE_METADATA=true). The mirror command always enables it. When
enabled, upstream metadata responses are stored for offline fallback
with ETag-based conditional revalidation.

New internal/mirror package with Source interface, PURLSource,
SBOMSource, RegistrySource, and async JobStore. New metadata_cache
database table for offline metadata serving.

- Wire job contexts to server shutdown context so jobs are canceled on
  server stop instead of running indefinitely
- Defer context cancel in runJob so completed jobs don't leak contexts
- Cap error accumulation in progressTracker to 1000 entries to prevent
  OOM on large mirror operations with many failures
- Add panic recovery in errgroup workers to prevent process crashes
- Use defer for db.Close() in runMirror CLI to ensure cleanup on all
  error paths
- Fix race where runJob could overwrite canceled state set by Cancel()
- Fix Debian ecosystem name inconsistency ("deb" -> "debian")
- Stream metadata responses when caching is disabled to avoid buffering
- Add metadata_cache table to initial schema strings for consistency
- Gate mirror API behind mirror_api config flag (disabled by default)
- Fix goconst lint in metadata_cache_test.go

…stubs

- ProxyCached now stores upstream Last-Modified in the cache and uses it
  (along with ETag) for conditional request handling, returning 304 when
  client validators match. Adds Content-Length to cached responses.

- Handlers calling FetchOrCacheMetadata (pypi, composer, pub, nuget) now
  check for ErrUpstreamNotFound and return 404 instead of 502, matching
  the existing npm and cargo behavior.

- Mirror jobs report live progress via a periodic callback while running,
  so API polls return real counts instead of zeroed progress.

- Registry mirroring removed from CLI flags, API acceptance, README, and
  docs since every enumerator was a stub returning "not yet implemented".

- Added tests for the conditional metadata path (ETag/If-None-Match,
  Last-Modified/If-Modified-Since, 304 responses, header omission).

Cached metadata is now served directly within a configurable TTL window
(default 5m) without contacting upstream, reducing latency and upstream
load. When upstream is unreachable and the cache is past its TTL, stale
content is served with a Warning: 110 header per RFC 7234.

New config: `metadata_ttl` (YAML) / `PROXY_METADATA_TTL` (env).
Set to "0" to always revalidate with upstream.