Add mirror command and API for selective package mirroring #40
The metadata caching in this PR opens the door for cooldown support on ecosystems that currently need per-version HTTP calls to get timestamps: Go modules: Maven: Similar pattern. Related PRs adding cooldown to ecosystems that already have timestamps in their metadata:
Add a `proxy mirror` CLI command and `/api/mirror` API endpoints that pre-populate the cache from various input sources: individual PURLs, SBOM files (CycloneDX and SPDX), or full registry enumeration.

The mirror reuses the existing `handler.Proxy.GetOrFetchArtifact()` pipeline so cached artifacts are identical to those fetched on demand. A bounded worker pool controls download parallelism.

Metadata caching is opt-in via `cache_metadata: true` in config (or `PROXY_CACHE_METADATA=true`). The mirror command always enables it. When enabled, upstream metadata responses are stored for offline fallback with ETag-based conditional revalidation.

New `internal/mirror` package with `Source` interface, `PURLSource`, `SBOMSource`, `RegistrySource`, and async `JobStore`. New `metadata_cache` database table for offline metadata serving.
- Wire job contexts to server shutdown context so jobs are canceled on server stop instead of running indefinitely
- Defer context cancel in runJob so completed jobs don't leak contexts
- Cap error accumulation in progressTracker to 1000 entries to prevent OOM on large mirror operations with many failures
- Add panic recovery in errgroup workers to prevent process crashes
- Use defer for db.Close() in runMirror CLI to ensure cleanup on all error paths
- Fix race where runJob could overwrite canceled state set by Cancel()
- Fix Debian ecosystem name inconsistency ("deb" -> "debian")
- Stream metadata responses when caching is disabled to avoid buffering
- Add metadata_cache table to initial schema strings for consistency
- Gate mirror API behind mirror_api config flag (disabled by default)
- Fix goconst lint in metadata_cache_test.go
…stubs
- ProxyCached now stores upstream Last-Modified in the cache and uses it (along with ETag) for conditional request handling, returning 304 when client validators match. Adds Content-Length to cached responses.
- Handlers calling FetchOrCacheMetadata (pypi, composer, pub, nuget) now check for ErrUpstreamNotFound and return 404 instead of 502, matching the existing npm and cargo behavior.
- Mirror jobs report live progress via a periodic callback while running, so API polls return real counts instead of zeroed progress.
- Registry mirroring removed from CLI flags, API acceptance, README, and docs since every enumerator was a stub returning "not yet implemented".
- Added tests for the conditional metadata path (ETag/If-None-Match, Last-Modified/If-Modified-Since, 304 responses, header omission).
Cached metadata is now served directly within a configurable TTL window (default 5m) without contacting upstream, reducing latency and upstream load. When upstream is unreachable and the cache is past its TTL, stale content is served with a Warning: 110 header per RFC 7234. New config: `metadata_ttl` (YAML) / `PROXY_METADATA_TTL` (env). Set to "0" to always revalidate with upstream.
Adds a `proxy mirror` CLI command and `/api/mirror` REST endpoints for pre-populating the cache from multiple input sources.

Input modes (narrowest to broadest):
- `proxy mirror pkg:npm/lodash@4.17.21`
- `proxy mirror pkg:npm/lodash`
- `proxy mirror --sbom sbom.cdx.json` (CycloneDX JSON/XML, SPDX JSON/tag-value)

Architecture:
- `internal/mirror/` package with `Source` interface and implementations
- `handler.Proxy.GetOrFetchArtifact()` for fetch-and-cache
- `errgroup` with `--concurrency` flag (default 4)
- `--dry-run` flag to preview what would be mirrored
- `metadata_cache` database table for offline metadata serving with upstream ETag and Last-Modified support

API endpoints:
- `POST /api/mirror` - start a mirror job (JSON body with `purls`)
- `GET /api/mirror/{id}` - check job status and live progress
- `DELETE /api/mirror/{id}` - cancel a running job

Metadata caching:
- `FetchOrCacheMetadata()` replaces `ProxyMetadata()`, returning bytes instead of writing directly to the response
- `ProxyCached()` writes the response with ETag/Last-Modified headers and handles client conditional requests (304 Not Modified)
- `metadata_ttl` (default 5m) serves cached metadata directly within the TTL window without contacting upstream
- `Warning: 110 - "Response is Stale"` per RFC 7234
- `metadata_ttl` (YAML) / `PROXY_METADATA_TTL` (env), set to `"0"` to always revalidate

New dependencies:
- `github.com/CycloneDX/cyclonedx-go` - CycloneDX SBOM parsing
- `github.com/spdx/tools-golang` - SPDX SBOM parsing

Breaking changes:
- `Proxy.ProxyMetadata()` has been removed. Handlers now use `Proxy.FetchOrCacheMetadata()` which returns bytes instead of writing directly to the response, enabling the metadata caching layer. `Proxy.ProxyCached()` is the new response-writing equivalent.

Related to #20.