Add mirror command and API for selective package mirroring by andrew · Pull Request #40 · git-pkgs/proxy

andrew · 2026-03-19T19:51:48Z

Adds a proxy mirror CLI command and /api/mirror REST endpoints for pre-populating the cache from multiple input sources.

Input modes (narrowest to broadest):

Versioned PURLs: proxy mirror pkg:npm/lodash@4.17.21
Unversioned PURLs (all versions): proxy mirror pkg:npm/lodash
SBOM files: proxy mirror --sbom sbom.cdx.json (CycloneDX JSON/XML, SPDX JSON/tag-value)
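
The difference between the first two input modes comes down to whether the PURL carries an `@version` suffix. A minimal sketch of that check follows; `splitPURL` is a hypothetical helper written for illustration, not the parser used in this PR, and it ignores qualifiers and subpaths.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPURL is a hypothetical helper (not taken from the PR). It separates
// the optional version from a PURL such as "pkg:npm/lodash@4.17.21".
// Qualifiers ("?...") and subpaths ("#...") are dropped for simplicity.
func splitPURL(purl string) (coordinate, version string, err error) {
	if !strings.HasPrefix(purl, "pkg:") {
		return "", "", fmt.Errorf("not a PURL: %q", purl)
	}
	rest := strings.TrimPrefix(purl, "pkg:")
	if i := strings.IndexAny(rest, "?#"); i >= 0 {
		rest = rest[:i]
	}
	// Only an "@" inside the final path segment (and not at its start, as in
	// an unescaped npm scope) is treated as the version separator.
	at := strings.LastIndex(rest, "@")
	slash := strings.LastIndex(rest, "/")
	if at > slash+1 {
		return rest[:at], rest[at+1:], nil
	}
	return rest, "", nil
}

func main() {
	for _, p := range []string{"pkg:npm/lodash@4.17.21", "pkg:npm/lodash"} {
		coordinate, version, err := splitPURL(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if version == "" {
			fmt.Printf("%s: mirror all versions\n", coordinate)
		} else {
			fmt.Printf("%s: mirror version %s\n", coordinate, version)
		}
	}
}
```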

Architecture:

New internal/mirror/ package with Source interface and implementations
Reuses existing handler.Proxy.GetOrFetchArtifact() for fetch-and-cache
Bounded worker pool via errgroup with --concurrency flag (default 4)
--dry-run flag to preview what would be mirrored
Async job management for API usage with POST/GET/DELETE endpoints
metadata_cache database table for offline metadata serving with upstream ETag and Last-Modified support
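
To illustrate the bounded worker pool and `--dry-run` behaviour listed above, here is a rough sketch. The PR uses errgroup; this version swaps in a buffered-channel semaphore and `sync.WaitGroup` so it runs with the standard library only. `mirrorAll`, `fetchFunc`, and `demo` are hypothetical names, and the fetch callback stands in for `GetOrFetchArtifact()`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// fetchFunc stands in for the real fetch-and-cache call (assumption).
type fetchFunc func(ctx context.Context, purl string) error

// mirrorAll runs fetch for each PURL with at most `concurrency` calls in
// flight. With dryRun set it only prints what would be mirrored.
// It returns the number of successful fetches and the collected errors.
func mirrorAll(ctx context.Context, purls []string, concurrency int, dryRun bool, fetch fetchFunc) (int, []error) {
	if dryRun {
		for _, p := range purls {
			fmt.Println("would mirror", p)
		}
		return 0, nil
	}
	if concurrency < 1 {
		concurrency = 1
	}
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		done int
		errs []error
	)
	sem := make(chan struct{}, concurrency) // one slot per allowed worker
	for _, p := range purls {
		if ctx.Err() != nil {
			break // job canceled: stop scheduling new work
		}
		sem <- struct{}{} // blocks while `concurrency` fetches are running
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			err := fetch(ctx, p)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, fmt.Errorf("%s: %w", p, err))
				return
			}
			done++
		}(p)
	}
	wg.Wait()
	return done, errs
}

// demo mirrors six made-up PURLs with concurrency 2 and reports successes,
// failures, and the peak number of fetches observed running at once.
func demo() (ok, failed, peak int) {
	var mu sync.Mutex
	inFlight := 0
	fetch := func(ctx context.Context, purl string) error {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		mu.Unlock()
		defer func() {
			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
		if purl == "pkg:npm/broken@1.0.0" {
			return errors.New("upstream unavailable")
		}
		return nil
	}
	purls := []string{
		"pkg:npm/a@1.0.0", "pkg:npm/b@1.0.0", "pkg:npm/broken@1.0.0",
		"pkg:npm/c@1.0.0", "pkg:npm/d@1.0.0", "pkg:npm/e@1.0.0",
	}
	n, errs := mirrorAll(context.Background(), purls, 2, false, fetch)
	return n, len(errs), peak
}

func main() {
	ok, failed, peak := demo()
	fmt.Printf("ok=%d failed=%d peak concurrency=%d\n", ok, failed, peak)
}
```

A failed fetch is recorded and the remaining work continues, which seems the more useful behaviour for a large mirror run; whether the PR does the same is not stated here.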

API endpoints:

POST /api/mirror - start a mirror job (JSON body with purls)
GET /api/mirror/{id} - check job status and live progress
DELETE /api/mirror/{id} - cancel a running job
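
For reference, a client could build the three requests roughly as below. This is only a sketch: the description says the POST body is JSON "with purls", so the `purls` field name, the base URL, and the job ID format are all assumptions, and nothing is sent over the network.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// mirrorRequest is an assumed body shape; only "JSON body with purls" is
// stated in the PR description.
type mirrorRequest struct {
	PURLs []string `json:"purls"`
}

// newStartRequest builds POST {base}/api/mirror.
func newStartRequest(base string, purls []string) (*http.Request, error) {
	body, err := json.Marshal(mirrorRequest{PURLs: purls})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, base+"/api/mirror", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// newJobRequest builds GET (status) or DELETE (cancel) {base}/api/mirror/{id}.
func newJobRequest(method, base, id string) (*http.Request, error) {
	return http.NewRequest(method, base+"/api/mirror/"+url.PathEscape(id), nil)
}

func main() {
	base := "http://localhost:8080" // hypothetical proxy address
	start, err := newStartRequest(base, []string{"pkg:npm/lodash@4.17.21"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(start.Method, start.URL)

	for _, method := range []string{http.MethodGet, http.MethodDelete} {
		req, err := newJobRequest(method, base, "example-job-id") // made-up ID
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(req.Method, req.URL)
	}
}
```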

Metadata caching:

FetchOrCacheMetadata() replaces ProxyMetadata(), returning bytes instead of writing directly to the response
ProxyCached() writes the response with ETag/Last-Modified headers and handles client conditional requests (304 Not Modified)
Configurable metadata_ttl (default 5m) serves cached metadata directly within the TTL window without contacting upstream
When upstream is unreachable, stale cache is served with Warning: 110 - "Response is Stale" per RFC 7234
ETag-based conditional revalidation with upstream avoids re-downloading unchanged metadata
Config: metadata_ttl (YAML) / PROXY_METADATA_TTL (env), set to "0" to always revalidate
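
The TTL rules above reduce to a small decision table. The sketch below is one reading of it, not the PR's code: `resolve`, `outcome`, and the constant names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTTL matches the documented metadata_ttl default of 5m.
const defaultTTL = 5 * time.Minute

// staleWarning is the Warning header value quoted in the description.
const staleWarning = `110 - "Response is Stale"`

type outcome string

const (
	fromCache    outcome = "cache"       // within TTL: upstream is not contacted
	fromUpstream outcome = "upstream"    // TTL expired or 0: revalidate with upstream
	staleCache   outcome = "stale-cache" // upstream unreachable: serve cache with a Warning header
	failed       outcome = "error"       // nothing cached and upstream unreachable
)

// resolve picks how a metadata request is answered. A ttl of 0 means
// "always revalidate", as the config notes describe.
func resolve(hasCache bool, age, ttl time.Duration, upstreamReachable bool) outcome {
	if hasCache && ttl > 0 && age < ttl {
		return fromCache
	}
	if upstreamReachable {
		return fromUpstream
	}
	if hasCache {
		return staleCache
	}
	return failed
}

func main() {
	fmt.Println(resolve(true, time.Minute, defaultTTL, true))     // fresh entry
	fmt.Println(resolve(true, 10*time.Minute, defaultTTL, true))  // expired, upstream up
	fmt.Println(resolve(true, 10*time.Minute, defaultTTL, false)) // expired, upstream down
	fmt.Println("Warning:", staleWarning)
}
```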

New dependencies:

github.com/CycloneDX/cyclonedx-go - CycloneDX SBOM parsing
github.com/spdx/tools-golang - SPDX SBOM parsing

Breaking changes:

Proxy.ProxyMetadata() has been removed. Handlers now use Proxy.FetchOrCacheMetadata() which returns bytes instead of writing directly to the response, enabling the metadata caching layer. Proxy.ProxyCached() is the new response-writing equivalent.
Some upstream error responses changed from 500 to 502 (Bad Gateway), which better reflects the actual failure mode when upstream registries are unreachable or return errors.
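
As a rough illustration of the status mapping handlers end up with (combined with the `ErrUpstreamNotFound` check added in a later commit), consider the sketch below. The sentinel is redeclared locally so the example compiles on its own; `statusFor` and the two sample errors are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrUpstreamNotFound is declared locally for this sketch; the PR has its
// own sentinel with this name.
var ErrUpstreamNotFound = errors.New("upstream: not found")

// Sample errors used by the demo below (made up for illustration).
var (
	errWrappedNotFound = fmt.Errorf("fetch metadata for lodash: %w", ErrUpstreamNotFound)
	errUpstreamDown    = errors.New("dial tcp: connection refused")
)

// statusFor maps a metadata fetch error to the HTTP status sent to the client.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrUpstreamNotFound):
		return http.StatusNotFound // the package does not exist upstream
	default:
		return http.StatusBadGateway // upstream unreachable or returned an error
	}
}

func main() {
	for _, err := range []error{nil, errWrappedNotFound, errUpstreamDown} {
		fmt.Printf("%v -> %d\n", err, statusFor(err))
	}
}
```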

Related to #20.

andrew · 2026-04-06T12:08:18Z

The metadata caching in this PR opens the door for cooldown support on ecosystems that currently need per-version HTTP calls to get timestamps:

Go modules: /@v/{version}.info returns a small JSON with a Time field. With ProxyCached() caching these responses, filtering /@v/list becomes feasible. First request would still fetch .info for each uncached version, but subsequent requests just read timestamps from the local cache. The N+1 HTTP problem becomes "N+1 once, then 0 extra calls."

Maven: Similar pattern. maven-metadata.xml lists versions but has no timestamps. Individual POM files have Last-Modified headers that could be cached. First pass fetches each POM once, then subsequent metadata filtering reads from cache.

Related PRs adding cooldown to ecosystems that already have timestamps in their metadata:

Add cooldown support for NuGet #67 (NuGet)
Add cooldown support for Conda #68 (Conda)
Add cooldown support for RubyGems #69 (RubyGems)
Add cooldown support for Hex #70 (Hex)

Add a `proxy mirror` CLI command and `/api/mirror` API endpoints that pre-populate the cache from various input sources: individual PURLs, SBOM files (CycloneDX and SPDX), or full registry enumeration.

The mirror reuses the existing handler.Proxy.GetOrFetchArtifact() pipeline so cached artifacts are identical to those fetched on demand. A bounded worker pool controls download parallelism.

Metadata caching is opt-in via `cache_metadata: true` in config (or PROXY_CACHE_METADATA=true). The mirror command always enables it. When enabled, upstream metadata responses are stored for offline fallback with ETag-based conditional revalidation.

New internal/mirror package with Source interface, PURLSource, SBOMSource, RegistrySource, and async JobStore. New metadata_cache database table for offline metadata serving.

- Wire job contexts to server shutdown context so jobs are canceled on server stop instead of running indefinitely
- Defer context cancel in runJob so completed jobs don't leak contexts
- Cap error accumulation in progressTracker to 1000 entries to prevent OOM on large mirror operations with many failures
- Add panic recovery in errgroup workers to prevent process crashes
- Use defer for db.Close() in runMirror CLI to ensure cleanup on all error paths
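
Two of these hardening items are easy to show in isolation: capping the stored errors and turning a worker panic into an error. The sketch below assumes a tracker shape; only the `progressTracker` name and the 1000-entry cap come from the commit message, while the fields, `recordError`, and `safely` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// maxErrors is the cap quoted in the commit message.
const maxErrors = 1000

// progressTracker counts every failure but keeps at most maxErrors messages,
// so memory stays bounded when a large mirror run fails repeatedly.
type progressTracker struct {
	mu     sync.Mutex
	failed int
	errors []string
}

func (p *progressTracker) recordError(msg string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failed++
	if len(p.errors) < maxErrors {
		p.errors = append(p.errors, msg)
	}
}

// safely runs fn and converts a panic into an ordinary error, so one bad
// package cannot crash the whole process.
func safely(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("worker panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	var p progressTracker
	for i := 0; i < 1500; i++ {
		p.recordError(fmt.Sprintf("package %d failed", i))
	}
	fmt.Println("failures counted:", p.failed, "messages kept:", len(p.errors))

	err := safely(func() error { panic("nil map write") })
	fmt.Println("recovered:", err)
}
```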

- Fix race where runJob could overwrite canceled state set by Cancel()
- Fix Debian ecosystem name inconsistency ("deb" -> "debian")
- Stream metadata responses when caching is disabled to avoid buffering
- Add metadata_cache table to initial schema strings for consistency
- Gate mirror API behind mirror_api config flag (disabled by default)
- Fix goconst lint in metadata_cache_test.go

…stubs

- ProxyCached now stores upstream Last-Modified in the cache and uses it (along with ETag) for conditional request handling, returning 304 when client validators match. Adds Content-Length to cached responses.
- Handlers calling FetchOrCacheMetadata (pypi, composer, pub, nuget) now check for ErrUpstreamNotFound and return 404 instead of 502, matching the existing npm and cargo behavior.
- Mirror jobs report live progress via a periodic callback while running, so API polls return real counts instead of zeroed progress.
- Registry mirroring removed from CLI flags, API acceptance, README, and docs since every enumerator was a stub returning "not yet implemented".
- Added tests for the conditional metadata path (ETag/If-None-Match, Last-Modified/If-Modified-Since, 304 responses, header omission).
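
The conditional path described in the first item can be sketched as follows. This is a simplified illustration, not the PR's `ProxyCached`: `cachedMeta`, `notModified`, `writeCached`, and `statusWith` are hypothetical, and the ETag comparison handles only a single exact value (no lists or weak comparison).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// cachedMeta is an assumed shape for a cached metadata entry.
type cachedMeta struct {
	Body         []byte
	ETag         string
	LastModified time.Time
}

// notModified reports whether the client's validators match the cached entry.
// If-None-Match takes precedence over If-Modified-Since.
func notModified(r *http.Request, m cachedMeta) bool {
	if inm := r.Header.Get("If-None-Match"); inm != "" {
		return m.ETag != "" && (inm == "*" || inm == m.ETag)
	}
	if ims := r.Header.Get("If-Modified-Since"); ims != "" && !m.LastModified.IsZero() {
		t, err := http.ParseTime(ims)
		return err == nil && !m.LastModified.Truncate(time.Second).After(t)
	}
	return false
}

// writeCached sends validators, then either 304 or the full body with
// Content-Length. Validators missing from the cache are omitted.
func writeCached(w http.ResponseWriter, r *http.Request, m cachedMeta) {
	if m.ETag != "" {
		w.Header().Set("ETag", m.ETag)
	}
	if !m.LastModified.IsZero() {
		w.Header().Set("Last-Modified", m.LastModified.UTC().Format(http.TimeFormat))
	}
	if notModified(r, m) {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Header().Set("Content-Length", strconv.Itoa(len(m.Body)))
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write(m.Body)
}

// statusWith serves a fixed cached entry to a request carrying one optional
// header and returns the response status code.
func statusWith(header, value string) int {
	m := cachedMeta{
		Body:         []byte(`{"name":"lodash"}`),
		ETag:         `"abc123"`,
		LastModified: time.Date(2026, 3, 1, 0, 0, 0, 0, time.UTC),
	}
	r := httptest.NewRequest(http.MethodGet, "/npm/lodash", nil)
	if header != "" {
		r.Header.Set(header, value)
	}
	rec := httptest.NewRecorder()
	writeCached(rec, r, m)
	return rec.Code
}

func main() {
	fmt.Println(statusWith("", ""))
	fmt.Println(statusWith("If-None-Match", `"abc123"`))
	fmt.Println(statusWith("If-Modified-Since", "Sun, 01 Mar 2026 00:00:00 GMT"))
}
```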

Cached metadata is now served directly within a configurable TTL window (default 5m) without contacting upstream, reducing latency and upstream load. When upstream is unreachable and the cache is past its TTL, stale content is served with a Warning: 110 header per RFC 7234. New config: `metadata_ttl` (YAML) / `PROXY_METADATA_TTL` (env). Set to "0" to always revalidate with upstream.

andrew force-pushed the mirror-feature branch 4 times, most recently from 17a0bbd to 7bb944b · March 20, 2026 08:40

andrew force-pushed the mirror-feature branch from 1ab992c to 2a05515 · April 1, 2026 14:48

andrew mentioned this pull request Apr 3, 2026

Feature proposal: multiple upstreams, with optional masking #55

Open

andrew force-pushed the mirror-feature branch from 9607cb1 to 23a39c3 · April 6, 2026 12:46

andrew added 5 commits April 6, 2026 19:32

andrew force-pushed the mirror-feature branch from 18b4a39 to 52bc6e8 · April 6, 2026 18:33

andrew wants to merge 5 commits into main from mirror-feature
