Skip to content

Media proxy#483

Merged
dahlia merged 18 commits into
fedify-dev:mainfrom
dahlia:feature/media-proxy
May 14, 2026
Merged

Media proxy#483
dahlia merged 18 commits into
fedify-dev:mainfrom
dahlia:feature/media-proxy

Conversation

@dahlia
Copy link
Copy Markdown
Member

@dahlia dahlia commented May 14, 2026

#481 asks for a Misskey-style media proxy so Hollo doesn't hand visitors the original URLs of remote avatars, headers, attachments, custom emojis, and link-preview images. Without one, restrictive CORS on remote object stores breaks client-side image work, and the visitor's browser ends up making a request to every server its timeline touches.

Configuration

MEDIA_PROXY decides how Hollo serves remote media. It has three states: off (the default and the historical behaviour), proxy (rewrite every remote media URL to a signed /proxy/<sig>/<b64url> path served by Hollo, streaming bytes through without persisting them), and cache (the same rewriting, plus storing the body at proxy/<sha256>.bin with a content-type sidecar at proxy/<sha256>.json in the configured storage backend). It also accepts the Boolean synonyms true/on/1 for proxy and false/off/0 for off, but disk caching must be requested explicitly with cache.

REMOTE_MEDIA_THUMBNAILS (default on) controls whether Hollo downloads incoming attachments to generate a local WebP thumbnail through Sharp. Operators can turn it off when they're already proxying remote media or when storage is tight.

Signing and the proxy route

The signing key is derived from SECRET_KEY via HMAC-SHA256(SECRET_KEY, "media-proxy/v1"), so no additional secret is needed. Each proxy URL embeds an HMAC-SHA256 over the base64url-encoded original URL, truncated to 16 bytes; verification uses crypto.timingSafeEqual and rejects non-canonical base64url.

The route in src/proxy.ts re-runs isSSRFSafeURL on every redirect target through up to three manual hops, caps the body at 32 MiB with streaming reads, requires image/, video/, or audio/ Content-Type with image/svg+xml explicitly blocked (SVG can carry inline scripts that would execute under the Hollo origin), and returns the response with Cache-Control: public, max-age=2592000, immutable plus X-Content-Type-Options: nosniff.

Where URL rewriting happens

Hollo rewrites these URLs in the entity serializers, not when posts are persisted. src/entities/ threads a baseUrl parameter through serializeMedium, serializeEmojis, serializeReaction, and serializePreviewCard, so the Mastodon API runs every remote URL through proxyUrl() before emitting it: account.avatar, media_attachments[].url and .preview_url, account and post emojis[], reaction url/static_url, preview-card image, notification emoji_reaction, and /api/v1/custom_emojis. The server-rendered web UI does the same in src/components/Profile.tsx, src/components/Post.tsx, src/components/AccountList.tsx, and the inline-emoji helper in src/custom-emoji.ts.

Outbound federation is deliberately untouched: src/federation/actor.ts and friends still publish the original remote URLs as icon, image, attachment, and emoji Tag references, because peers shouldn't have their own image rewritten to go through this instance.

Trust boundary

The proxy treats only the configured storage prefix as trusted local media, not every URL on the Hollo origin. The check in src/media-proxy.ts compares each URL's normalized .href against a canonical storage prefix derived from DRIVE_DISK: the FS driver always serves at /assets/ on the storage origin regardless of STORAGE_URL_BASE's path, while the S3 driver trusts STORAGE_URL_BASE itself. Comparing the parsed .href rather than the raw input string blocks dot-segment and percent-encoded path traversal like http://h/assets/../admin/avatar.png. URLs that share the Hollo origin but live outside the storage namespace, including attacker-crafted avatarUrl values that point at /admin/..., get routed through the proxy so they can't ride a logged-in operator's session.

Admin cache cleanup

The admin page at /thumbnail_cleanup picks up a media-proxy section: the statistics table gets a row counting the cache's .bin entries, and a "Clear N entries" button enqueues a cleanup job that deletes both the .bin body and its .json sidecar per entry. The existing cleanup worker handles the deletion; the new payload kind is distinguished by data.kind = "proxy_cache" inside the existing cleanup_thumbnails enum value, so no Drizzle migration is needed. Cleanup item keys are validated against a strict proxy/<sha256>.bin regex before disk.delete() sees them, and the storage listing pages through every disk.listAll page so multi-page S3 buckets get fully enumerated.

Out of scope

Post-content HTML, profile bios, and custom field values can still contain <img src>, <audio src>, <video src>, <source>, and poster attributes pointing at remote servers; rewriting those would need separate handling for srcset, relative URLs, and content fetched via the Open Graph scraper, on top of changes to the sanitiser pipeline in src/xss.ts.

ssrfcheck only validates the URL syntax, not DNS resolution, so a hostname that resolves to a private address can still be reached at fetch time. That's the same limitation every server-side fetch in the codebase has today (see src/federation/post.ts) and should be fixed in the shared fetch layer, not only in this route.

Tests

New tests live in src/media-proxy.test.ts (27 cases covering env-var parsing, signing, URL rewriting, and storage-prefix canonicalisation) and src/proxy.test.ts (10 cases for the route across off/proxy/cache modes, SSRF rejection on redirect, the SVG block, and the cache-hit reuse path). The full suite is at 432 passing.

Closes #481.

dahlia added 10 commits May 14, 2026 17:37
Introduce src/media-proxy.ts as the shared signing/verification core for
the upcoming /proxy route and entity-serializer rewrites.  Two new env
vars are parsed at module load:

  MEDIA_PROXY              off | proxy | cache  (default off)
  REMOTE_MEDIA_THUMBNAILS  on  | off            (default on)

The signing key is derived from SECRET_KEY via
HMAC-SHA256(SECRET_KEY, "media-proxy/v1") so no additional secret is
required.  Each proxied URL is turned into /proxy/<sig>/<b64url>, where
b64url is the URL itself, and sig is an HMAC over that base64url string
truncated to 16 bytes and constant-time-compared on verification.

proxyUrlForMode() short-circuits in three ways before signing: it
returns null for null/undefined input, returns the URL unchanged in off
mode or when the URL's origin matches baseUrl or STORAGE_URL_BASE, and
returns null for non-http(s) schemes so callers fall back to a default
rather than minting a signed wrapper for data:/javascript:/etc.

The base64url decoder additionally rejects the impossible mod-4-of-1
length and re-encodes the result to refuse trailing-bit aliasing, so
near-canonical variants of a signed path can't slip through.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
Mount a new Hono app at /proxy that consumes the URLs minted by the
media-proxy module added in the previous commit.  Behaviour follows the
three MEDIA_PROXY modes:

  off    — the app has no routes, so every /proxy/... path 404s.
  proxy  — verify the signature, run ssrfcheck on the URL and on every
           Location header through up to 3 manual redirects, fetch
           within a 30s abort window, validate Content-Type against an
           image/video/audio allowlist that explicitly excludes SVG,
           cap the body at 32 MiB during streaming, and return the
           bytes with Cache-Control: public, max-age=30d, immutable and
           X-Content-Type-Options: nosniff.
  cache  — same as proxy, but the body and a small JSON metadata
           sidecar are persisted to the flydrive disk under
           proxy/<sha256(url)>.  Subsequent requests for the same URL
           skip the upstream fetch entirely.

A few details worth calling out:

 - The abort timer spans both the connect/headers phase and the body
   read so a stalled upstream cannot tie a request up past the
   timeout.
 - Non-OK upstream responses and disallowed Content-Type both cancel
   the response body before returning to avoid sockets lingering
   until GC.
 - Uint8Array bodies are turned into exactly-sized ArrayBuffers before
   being handed to Hono so we don't leak bytes from a shared Buffer
   pool into the response.
 - DNS-rebinding via ssrfcheck (host string only, not resolved IP) is
   the same limitation as src/federation/post.ts:525-526 and is
   tracked as a follow-up that needs an SSRF-aware fetch connector.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
Wrap every remote URL that leaves Hollo through the Mastodon-compatible
API in proxyUrl(value, baseUrl).  The previous commits added the module
and the /proxy route; this is the wiring that makes API clients actually
see proxied URLs once MEDIA_PROXY is set to proxy or cache.

Touched:

  - account.avatar / .avatar_static / .header / .header_static
  - media_attachments[].url and .preview_url
  - emojis[].url / .static_url (account + post)
  - emoji_reactions[].url / .static_url
  - emoji_reaction (on notifications) plus its emoji_url / emojiURL
    siblings
  - card.image on preview cards

serializeMedium, serializeEmojis, serializeEmoji, serializeReactions,
serializeReaction, and serializePreviewCard all gained a baseUrl
parameter; status.ts threads c.req.url through every call site.

A couple of safety details worth calling out:

  - serializeEmoji returns null when proxyUrl rejects the href (non-
    http(s) scheme).  serializeEmojis filters those, and
    serializeReactions treats a rejected customEmoji as a unicode-only
    reaction so the count is preserved without leaking the raw URL.
    The notifications endpoint similarly skips emoji_url / emojiURL
    when the proxy refuses.
  - When a preview card's image URL is dropped, width and height are
    forced to 0 so clients don't see dimensions on a null image.

Outbound federation (src/federation/actor.ts and friends) is
untouched: it reads avatarUrl / coverUrl / media.url straight from the
DB, which still holds the original remote URLs.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
Thread a baseUrl prop through every JSX component that renders a
remote-origin avatar, header, media thumbnail, preview-card image, or
custom emoji, and wrap each of those URLs in proxyUrl() at render time.
Once the API path was proxying URLs, the admin dashboard and public
profile pages were the only remaining places where an operator's
browser would still talk directly to the source server.

Components updated:

  - Profile, AccountList / AccountItem, Post, PostContent, Medium,
    PreviewCardView, AuthorizationPage
  - renderCustomEmojis() in custom-emoji.ts now takes baseUrl too; if
    proxyUrl rejects an emoji URL (non-http(s) scheme), it falls back
    to the literal :shortcode: instead of emitting an <img> with the
    raw href

Page handlers updated to pass c.req.url through:

  - src/pages/home/index.tsx
  - src/pages/profile/index.tsx and src/pages/profile/profilePost.tsx
  - src/pages/tags/index.tsx
  - src/pages/accounts.tsx
  - src/pages/emojis.tsx (existing emoji table + import-discovery
    table)
  - src/pages/oauth/authorization.tsx via src/oauth.tsx

A couple of safety details:

  - The Medium component used to wrap the thumbnail in an <a href>
    that pointed at medium.url.  When proxyUrl rejects that URL the
    component now renders a plain <div> wrapper rather than fall back
    to the raw href, so a non-http(s) media URL can no longer slip
    into the rendered HTML.
  - Layout's imageUrl prop (OpenGraph / Twitter card) is intentionally
    left as the raw avatarUrl.  That metadata is consumed by external
    link-preview crawlers, and routing those fetches through our own
    /proxy endpoint would push their crawler traffic onto our origin.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
When REMOTE_MEDIA_THUMBNAILS=off the federation ingest path in
persistPost no longer downloads incoming attachments to generate webp
thumbnails.  It still creates the media row, but with thumbnailUrl set
to the remote URL itself and the attachment-declared dimensions —
mirroring the existing fallback shape used when the sharp pipeline
fails.

This addresses the long-standing complaint that the sharp-derived
thumbnails take up a lot of operator storage for posts that may never
actually be displayed.  Combined with MEDIA_PROXY=proxy or cache,
clients still see same-origin URLs at render time; with MEDIA_PROXY=off
they see the original remote URL exactly the same way the existing
catch-branch already did.

REMOTE_MEDIA_THUMBNAILS defaults to on so existing deployments keep
their current behaviour.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
Generalize src/cleanup/processors.ts into a single processCleanupItem()
entry point that the worker now calls.  It inspects item.data.kind: a
new "proxy_cache" payload routes to processProxyCacheDeletion() (which
removes the body and its JSON sidecar), and everything without that
discriminator falls through to the existing processThumbnailDeletion()
so old queued items keep working unchanged.

The .bin key the proxy-cache job carries is validated against the exact
shape writeCached() in src/proxy.ts mints — `proxy/<sha256>.bin`, 64
hex chars — so the per-item key cannot be coaxed into a path-traversal
delete.  listProxyCacheBinKeys() applies the same regex and pages
through every disk.listAll result so S3 buckets with more than one
listing page are fully enumerated.

The admin page (src/pages/thumbnail_cleanup.tsx, deliberately keeping
the existing route URL) gains:

  - a third statistics row, "Media proxy cache entries", populated
    from listProxyCacheBinKeys()
  - a "Clear media proxy cache" section with a button that POSTs to
    /thumbnail_cleanup/proxy_cache/clear, which enqueues one
    cleanupJobs row plus one cleanupJobItems row per cached .bin, in
    batches of 1000

The cleanup_thumbnails enum value is reused for these jobs (the kind
discriminator inside data picks the processor); the plan for this PR
explicitly chose that over a Drizzle migration for a new enum value.

A few details worth calling out:

  - The body delete is required; only its sidecar delete is wrapped in
    a try/log so a partial previous cleanup doesn't fail the item.
  - When the storage listing call itself fails, the statistics row is
    omitted (not shown as 0) and the section disables the clear button
    with "Cache size unavailable" so the UI can't claim "the proxy
    cache is empty" when it actually couldn't be inspected.
  - The proxy section uses its own `proxy-cache-result` query param so
    its post-action messages don't leak into the thumbnail section.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
Cover the two new environment variables introduced by this PR in every
place CLAUDE.md requires:

  - AGENTS.md (CLAUDE.md is a symlink to it) — appended both vars to
    the Optional variables table with their defaults
  - docs/src/content/docs/install/env.mdx and its ja / ko / zh-cn
    translations — added full sections describing the three MEDIA_PROXY
    modes and what protections kick in under proxy and cache (SSRF
    re-check on every redirect target, http(s)-only schemes, 32 MiB
    body cap, image/svg+xml explicitly blocked), and the storage
    layout (body at proxy/<sha256>.bin, content-type sidecar at
    proxy/<sha256>.json)
  - compose.yaml and compose-fs.yaml — added commented-out example
    lines so operators see where to plug the variables in for both the
    S3 and FS deployment templates
  - CHANGES.md — release notes for 0.9.0 under issue fedify-dev#481, plus the
    reference link

Defaults are off / on, so existing deployments are unaffected until an
operator opts in.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
Two leaks the per-commit reviews didn't catch:

  - proxyUrl() decided "this URL is already on us" by comparing
    origins.  That's wrong when STORAGE_URL_BASE is path-scoped under
    a shared origin (path-style S3 / MinIO, a shared CDN, or
    STORAGE_URL_BASE set to the bare Hollo origin), because every
    other URL on that origin slips past the proxy.  And it was also
    too permissive on baseUrl's origin: a federated actor whose
    avatarUrl points at the Hollo server's own URL could ride a
    logged-in operator's session.

    proxyUrl() now uses a STORAGE_URL_BASE prefix match (normalized to
    end in /), drops the baseUrl-origin passthrough entirely, and
    keeps a single anti-recursion guard so an already-proxied URL on
    our own origin isn't double-wrapped.  Default avatars / header
    images and uploaded local-account assets are unaffected: they're
    either fetched via the `?? defaultAvatarUrl` fallback (raw, never
    through proxyUrl) or sit under STORAGE_URL_BASE (prefix match
    catches them).

  - /api/v1/custom_emojis was still emitting `emoji.url` and
    `emoji.static_url` straight from the DB.  Every other API
    response path was already routed through proxyUrl(); this one was
    overlooked.  The endpoint now uses proxyUrl(emoji.url, c.req.url)
    and drops entries whose URL the proxy refuses, matching how
    serializeEmoji() behaves elsewhere.

Tests updated: the old "does not proxy a URL on the same origin as
baseUrl" assertion now covers STORAGE_URL_BASE prefix instead, plus
two new tests assert that same-origin-but-not-storage URLs do get
proxied and that already-proxied URLs are not double-wrapped.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
The previous fix prefix-matched the raw URL string, which lets a URL
like `http://h/assets/../admin/x.png` (or its `%2e%2e` equivalents)
slip past the check while resolving to `/admin/x.png` in the browser
and at fetch time.  Compare against `parsed.href` instead, so dot-
segment normalization happens before the check.

While here, derive the trusted prefix from the active storage driver
rather than from STORAGE_URL_BASE alone:

  - FS mode always serves uploads at `/assets/` on the storage origin,
    regardless of what STORAGE_URL_BASE's path looks like.  Trust only
    that namespace, so a bare-origin STORAGE_URL_BASE (the kind in
    .env.test) doesn't end up trusting the whole Hollo origin.
  - S3 / generic mode keeps using STORAGE_URL_BASE itself, normalized
    to end in `/`, because the driver concatenates that string with
    the object key.

Two new tests cover (a) `/admin/...` on the Hollo origin gets proxied
even though it shares an origin with STORAGE_URL_BASE, and (b)
`/assets/../admin/...` traversal is normalized before the prefix
check and therefore gets proxied too.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
…REMOTE_MEDIA_THUMBNAILS

MEDIA_PROXY previously only accepted the three named modes
(off/proxy/cache).  Allow `true`/`on`/`1` as aliases for
`proxy` and `false`/`off`/`0` as aliases for `off`, matching the
boolean conventions the rest of Hollo's env vars use.  Disk caching is
still opt-in only: it must be requested explicitly with `cache` rather
than via the truthy alias, to avoid silently turning on persistent
storage by flipping a generic Boolean.

REMOTE_MEDIA_THUMBNAILS was already Boolean internally, but its error
message advertised only `on`/`off`.  Updated the message to list every
accepted form (`true`/`false`, `on`/`off`, `1`/`0`) and reuses the
shared TRUTHY/FALSY sets with parseMediaProxyMode.

Both parsers are now exported so they can be unit-tested directly.
Tests cover defaults, named values, truthy/falsy synonyms, case-
insensitivity, and the throw path on unknown values.

Docs in all four languages (en/ja/ko/zh-cn), AGENTS.md, and
CHANGES.md were updated to describe the accepted aliases.

For fedify-dev#481.

Assisted-by: Claude Code:claude-opus-4-7
Assisted-by: Codex:gpt-5.5
@dahlia dahlia added this to the Hollo 0.9 milestone May 14, 2026
@dahlia dahlia self-assigned this May 14, 2026
@dahlia dahlia added the enhancement New feature or request label May 14, 2026
@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

@codex review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a media proxy feature to Hollo, allowing remote media such as avatars, attachments, and emojis to be served through the instance's own origin to improve privacy and bypass CORS issues. It adds two new environment variables, MEDIA_PROXY (with off, proxy, and cache modes) and REMOTE_MEDIA_THUMBNAILS, and implements a new /proxy endpoint with SSRF protection and signature verification. Feedback from the reviewer focuses on security and performance, specifically suggesting improvements to the SSRF check to prevent DNS rebinding, and recommending streaming approaches for listing cache keys and handling media bodies to reduce memory pressure.

Comment thread src/proxy.ts
Comment thread src/cleanup/processors.ts Outdated
Comment thread src/proxy.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a593d71c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/proxy.ts Outdated
Comment thread src/federation/post.ts
dahlia added 3 commits May 14, 2026 20:40
When REMOTE_MEDIA_THUMBNAILS=off the ingest path skips the body
download, which also gave up the Content-Type that the previous GET
response used to backfill mediaType.  Peers whose attachment objects
don't carry mediaType therefore had their attachments dropped
silently.

Issue a HEAD request to the URL in that fallback path and use the
upstream Content-Type if it comes back.  SSRF is already vetted just
above this branch, so the HEAD reuses that decision; on any error or
non-OK response the attachment still skips, matching the previous
behaviour for the genuinely unknowable case.

fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
Audio and video clients seek by issuing Range requests; the previous
proxy stripped that header on the way to upstream, so every byte
range request from a browser ended up downloading the full body.
That broke scrubbing and inflated bandwidth on proxied media.

The route now forwards the client's Range header through every
redirect hop, accepts 206 Partial Content as a success status with
Content-Range and Accept-Ranges passed through, and surfaces 416
Range Not Satisfiable with Content-Range (the upstream body is
discarded so we never serve an attacker-shaped error page under our
origin).

In cache mode, range requests skip both the cache read and the cache
write so that a single cache entry always represents the full
resource.  Three new tests cover the forwarding, the 416 passthrough,
and the cache bypass.

fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
listProxyCacheBinKeys() collected every cache key into an array
before returning, so the admin dashboard and the cleanup-enqueue
endpoint both had to hold the full key list in memory at once.  An
instance with a few million cache entries would have spent hundreds
of megabytes on this list, plus a duplicate copy as job-item objects
during enqueue.

Replace it with iterateProxyCacheBinKeys(), an async generator that
yields one key at a time while paging through the storage listing.
The dashboard's count uses a thin countProxyCacheBinKeys() wrapper.
The /proxy_cache/clear handler now creates the cleanupJobs row with
totalItems=0, streams keys into 1000-row INSERT batches, and updates
totalItems once enumeration completes.  If enumeration or any insert
throws, the placeholder job is deleted so the dashboard doesn't see
a stuck pending job.

fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

@codex review

@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

/gemini review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27fa3a9202

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/pages/thumbnail_cleanup.tsx Outdated
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a media proxy system and an optional toggle for remote media thumbnail generation. The media proxy allows Hollo to re-serve remote content—including avatars, headers, attachments, and emojis—through its own origin, supporting both streaming and on-disk caching modes with built-in SSRF protection and URL signing. Feedback on the implementation highlights potential performance bottlenecks in the admin dashboard when counting or enumerating large numbers of cache entries, which could lead to request timeouts. Additionally, it was suggested to use streaming for proxy responses to reduce memory pressure during high concurrency.

Comment thread src/pages/thumbnail_cleanup.tsx Outdated
Comment thread src/pages/thumbnail_cleanup.tsx Outdated
Comment thread src/proxy.ts
dahlia added 2 commits May 14, 2026 21:05
The previous /proxy_cache/clear handler created a cleanupJobs row
with totalItems=0 and then streamed cache keys into batched item
inserts.  That left a window where the cleanup worker could pick up
the pending job, see no items yet, and finalize it as completed.
Items inserted after that moment ended up parented to a completed
job that the worker no longer polled, so a large cache could be
silently left half-cleared.

The same handler was also doing the full storage walk inline, which
could blow past reverse-proxy timeouts on caches with hundreds of
thousands of entries.

Both problems go away when the worker owns the whole lifecycle.
/proxy_cache/clear now creates one cleanup job inside a transaction,
along with a single "enumerate_proxy_cache" item.  The worker picks
up that item, streams the cache listing into 1000-row batches of
proxy_cache items under the same jobId, and bumps the parent job's
totalItems by exactly the number of entries it queued.  Because the
enumeration item is still in flight while it inserts those items,
the worker's "no unfinished items? finalize." check can't race the
enqueue.

fedify-dev#483 (comment)
fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
Every load of /thumbnail_cleanup was iterating the storage backend
end to end to compute "Media proxy cache entries: N".  On an instance
with a few hundred thousand .bin entries, that's enough listing
calls to exceed a typical reverse-proxy timeout.

countProxyCacheBinKeys now stops once it has seen
PROXY_CACHE_COUNT_CAP (10,000) entries and returns
{ count, truncated: true } so the dashboard knows the value is a
lower bound.  Beyond that point the operator only needs to know
"there's a lot in there"; the worker-driven cleanup queues every
entry regardless of what the dashboard chose to show.

The stats row renders "10,000+" when truncated, and the clear button
likewise says "Clear 10,000+ entries" instead of pretending to know
the exact figure.

fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

@codex review

@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a media proxy for Hollo, enabling remote media to be served via the instance origin with support for streaming and disk caching. It also introduces a toggle for local thumbnail generation to reduce storage overhead. Review feedback highlighted opportunities to improve the proxy's memory efficiency by streaming directly to responses when caching is disabled, the need for better fallback mechanisms when fetching attachment metadata via HEAD requests, and suggested documenting DNS-level SSRF limitations within the proxy implementation.

Comment thread src/proxy.ts
Comment thread src/proxy.ts
Comment thread src/federation/post.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0b31669f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/federation/post.ts
A future maintainer reading fetchWithSSRFAwareRedirects could easily
read the per-hop isSSRFSafeURL call as a complete SSRF defence.  It
isn't: ssrfcheck only inspects the URL string, so a public hostname
that resolves to a private address at fetch time still gets through.

Add a comment block above the function explaining that, and note
that the proper fix lives in a shared SSRF-aware fetch connector
rather than this route.

fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
The previous HEAD probe for missing Content-Type was a single try:
if the upstream rejected HEAD (CDNs often reply 405, even when GET
would succeed) we dropped the attachment.  Fediverse peers that
omit mediaType plus a CDN that dislikes HEAD silently lost their
attachments.

Probe in three steps now, each cheaper than persisting a body:

  1.  HEAD, as before.
  2.  GET with Range: bytes=0-0.  200 (server ignored Range) and 206
      (server honoured it) both surface a usable Content-Type; the
      response body is cancelled either way.
  3.  mime.getType() against the URL's pathname.  Last-resort
      extension inference for the case where the server tells us
      nothing useful.

The attachment is only skipped if all three return nothing.

fedify-dev#483 (comment)
fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

@codex review

@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a media proxy feature to Hollo, allowing remote media like avatars and attachments to be served through the local origin to bypass CORS issues and enhance privacy. The proxy supports streaming and on-disk caching modes, featuring SSRF protection and a security-focused content-type allowlist that excludes SVG files. It also adds a setting to disable local thumbnail generation for remote media to save disk space. The reviewer suggested an optimization in the proxy logic to validate the Content-Length header before downloading the response body, which would prevent unnecessary data transfer for files exceeding the size limit.

Comment thread src/proxy.ts
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Nice work!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The streaming cap in readBoundedBody catches oversized bodies, but
only after we've started reading.  When the upstream is already
telling us the body is 50 MiB before we read a byte, there's no
reason to spend that bandwidth: we'll bail anyway once the cap
trips.

Check Content-Length right after the Content-Type allowlist and
short-circuit to 404 when it exceeds MAX_BYTES.  A missing or
malformed header still falls through to the streaming cap, which
keeps enforcing the limit byte by byte.

New test confirms a 64 MiB advertised body is dropped before the
body read happens.

fedify-dev#483 (comment)

Assisted-by: Claude Code:claude-opus-4-7
@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

@codex review

@dahlia
Copy link
Copy Markdown
Member Author

dahlia commented May 14, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a media proxy for Hollo, allowing remote assets like avatars and attachments to be served via the local origin. It introduces the MEDIA_PROXY variable with off, proxy, and cache modes, alongside REMOTE_MEDIA_THUMBNAILS to toggle local thumbnail processing. The feature includes a signed /proxy endpoint with SSRF protection and optional disk caching. Updates span API serialization, JSX components, and the admin dashboard, which now features a proxy cache cleanup tool. I have no feedback to provide.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dahlia dahlia merged commit 377c9eb into fedify-dev:main May 14, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Add Misskey-like Media Proxy for external media

1 participant