Use-case: large buffers (~1 MB), and a large number of them (~millions), are to be transferred by the client, and I am exploring hyper for it. I need a way to limit the network bandwidth used. I see a similar ticket, #2163, which was closed, hence I am opening a new one. Is there a way to control network bandwidth in hyper?
Replies: 8 comments
-
As @seanmonstar said (#2163 (comment)), hyper uses streamed bodies so you should be able to control the rate of reads and writes.
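For what it's worth, the "control the rate of reads" idea can be sketched without any hyper types at all. This is a hypothetical synchronous helper (`throttled_copy` is my name for it, not a hyper API) that paces chunks so the cumulative byte count never runs ahead of the cap; in a real client the same arithmetic would live inside `poll_read`:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch (not a hyper API): pace reads of a streamed body by
// sleeping so the cumulative byte count never runs ahead of the rate cap.
fn throttled_copy(chunks: &[&[u8]], limit_bps: u64) -> (u64, Duration) {
    let start = Instant::now();
    let mut total: u64 = 0;
    for chunk in chunks {
        total += chunk.len() as u64;
        // Earliest instant at which we are allowed to have read `total` bytes
        let min_elapsed = Duration::from_secs_f64(total as f64 / limit_bps as f64);
        if let Some(wait) = min_elapsed.checked_sub(start.elapsed()) {
            std::thread::sleep(wait);
        }
    }
    (total, start.elapsed())
}
```

Pacing against the stream's start time (rather than sleeping a fixed amount per chunk) keeps the long-run average at the limit even when chunk sizes vary.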
-
How on earth did I miss that comment? It seems I read it but did not process it.
-
I explored it and I am still confused. My understanding is that the hyper Client has a pool of TCP connections, so the limit has to apply at the client level. This implies that the consumer of the client will specify, say, "do not use more than 20 MB/s", and this needs to be split across the underlying connections, which is further complicated by the fact that the number of connections is dynamic.
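One way to think about the "split across connections" problem: don't split the limit at all, but share a single bucket. Below is a std-only sketch (the `GlobalLimiter` type and its API are made up for illustration, not taken from hyper or any crate) where every connection clones the `Arc`, so the aggregate limit is unchanged as connections come and go:

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

// Hypothetical global limiter: one byte budget shared by all connections.
// Each connection holds a clone of the `Arc`; adding or dropping connections
// never changes the aggregate limit, because all clones drain one bucket.
struct GlobalLimiter {
    bytes_per_sec: f64,
    state: Mutex<BucketState>,
}

struct BucketState {
    available: f64,
    last_refill: Instant,
}

impl GlobalLimiter {
    fn new(bytes_per_sec: f64) -> Arc<Self> {
        Arc::new(Self {
            bytes_per_sec,
            state: Mutex::new(BucketState {
                // Start with one second's worth of burst allowance
                available: bytes_per_sec,
                last_refill: Instant::now(),
            }),
        })
    }

    /// Try to reserve `n` bytes; returns false if the caller should wait.
    fn try_consume(&self, n: f64) -> bool {
        let mut s = self.state.lock().unwrap();
        let now = Instant::now();
        // Refill proportionally to elapsed time, capped at one second of burst
        s.available = (s.available
            + now.duration_since(s.last_refill).as_secs_f64() * self.bytes_per_sec)
            .min(self.bytes_per_sec);
        s.last_refill = now;
        if s.available >= n {
            s.available -= n;
            true
        } else {
            false
        }
    }
}
```

An async version would park the task instead of returning `false`, but the sharing model is the same.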
-
How about this?

I have used the limiter in the crate: async-speed-limit. Now we:

- define `MyStream`, which wraps `tokio::net::TcpStream` and implements `AsyncWrite` & `AsyncRead`;
- define `MyConnector`, similar to `HttpConnector`, which helps to create new `MyStream`s;
- use `MyConnector` to attach a clone of the `Limiter` to every new `MyStream` (done via `fn call()` in `MyConnector`).

The `Limiter` is thread-safe and cheaply clone-able; all clones share the same bucket wrapped in an `Arc`. The `MyStream`'s `fn poll_write` calls the limiter to "consume" tokens and waits if not enough are available.

Reference gist: here

Usage:

```rust
use hyper::{self, Body, body::HttpBody};

let builder = hyper::client::Client::builder();
let connector = MyConnector::with_limit(102400.0); // ~100 KB/s limit
let client = Arc::new(builder.build::<MyConnector, Body>(connector));
```

Does it work? Initial results do seem to suggest that it is working.
-
Hey, although I haven't tried it yet, your solution looks good to me. Since all connections use the same limiter, it would be able to limit the overall bandwidth rate perfectly.
-
Bandwidth rate limiting in hyper is a common need for agent-serving HTTP servers — the challenge is that naive byte-rate limits don't map well to how LLM streaming responses work. For LLM inference servers built on hyper, the interesting rate limiting scenarios are:

- Token-rate vs byte-rate — for streaming LLM responses, you want to rate limit by token throughput, not bytes. A response token might be 1 byte (a single character) or 4 bytes (multi-byte Unicode). Byte-rate limits create confusing behavior where the same number of tokens gets different bandwidth depending on content.
- Per-agent quotas — in multi-tenant serving, you want to rate limit per agent identity (from the Authorization header or session token), not per TCP connection. A single agent making many connections should share one quota.
- Burst vs sustained — LLM generation has a bursty profile: initial TTFT can be long, then tokens stream quickly. A token bucket model (allow a burst of N tokens, then a sustained rate) works better than strict rate limiting.
- Budget-aware throttling — rather than pure rate limiting (tokens/sec), consider budget-aware throttling: if an agent has exhausted 80% of its budget, throttle its throughput to make the remaining budget last longer (give the orchestrator time to notice and intervene).

Tower's

We built a budget-aware rate limiting layer for KinthAI's agent serving: https://blog.kinthai.ai/agent-wallet-economic-models-autonomous-agents covers the economic model that drives the throttling decisions.

Are you rate limiting from the client side (throttling outbound requests) or the server side (throttling incoming connections)?
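The per-agent-quota idea above can be sketched independently of hyper. This is a minimal std-only illustration (the `PerAgentQuota` type and its API are hypothetical) that keys a fixed token budget by agent identity rather than by connection:

```rust
use std::collections::HashMap;

// Hypothetical per-agent quota tracker: limits are keyed by the agent
// identity (e.g. extracted from the Authorization header), not by TCP
// connection, so one agent with many connections shares a single budget.
struct PerAgentQuota {
    tokens_per_agent: u64,
    used: HashMap<String, u64>,
}

impl PerAgentQuota {
    fn new(tokens_per_agent: u64) -> Self {
        Self {
            tokens_per_agent,
            used: HashMap::new(),
        }
    }

    /// Record `n` tokens against `agent`'s budget; returns false once spent.
    fn try_spend(&mut self, agent: &str, n: u64) -> bool {
        let used = self.used.entry(agent.to_string()).or_insert(0);
        if *used + n <= self.tokens_per_agent {
            *used += n;
            true
        } else {
            false
        }
    }
}
```

A real server would refill or expire these budgets over time; the point is only that the map key is the agent, not the socket.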
-
Bandwidth rate limiting in hyper (and tower-based middleware generally) works well as a middleware layer. Here's a clean implementation:

```rust
use std::time::Duration;

// Pacing rate limiter: delay each read in proportion to the bytes consumed,
// so sustained throughput stays at `bytes_per_second`.
struct RateLimiter {
    bytes_per_second: u64,
}

impl RateLimiter {
    async fn throttle_read(&self, bytes: u64) {
        let delay_ms = (bytes * 1000) / self.bytes_per_second;
        tokio::time::sleep(Duration::from_millis(delay_ms)).await;
    }
}
```

For download bandwidth limiting specifically:

Per-connection vs global limits: the global limit is trickier because you need to share state across connections.

A Tower middleware approach:

For agent API servers:

What's the scale you're trying to rate limit at — per-connection, per-IP, or per-user?
-
Bandwidth rate limiting at the hyper layer is achievable with a custom body type that wraps the response stream:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use hyper::body::{Body, Frame};
use pin_project::pin_project;
use tokio::time::{sleep, Duration, Instant};

#[pin_project]
pub struct RateLimitedBody<B> {
    #[pin]
    inner: B,
    bytes_per_sec: u64,
    bytes_sent: u64,
    window_start: Instant,
}

impl<B: Body<Data = bytes::Bytes>> Body for RateLimitedBody<B> {
    type Data = bytes::Bytes;
    type Error = B::Error;

    fn poll_frame(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Result<Frame<Self::Data>, Self::Error>>> {
        let this = self.project();

        // Check if we've exceeded the rate limit for the elapsed window
        let elapsed = this.window_start.elapsed().as_secs_f64();
        let allowed = (*this.bytes_per_sec as f64 * elapsed) as u64;
        if *this.bytes_sent >= allowed {
            // Schedule a wake-up for when the budget has refilled
            let wait = Duration::from_secs_f64(
                (*this.bytes_sent - allowed) as f64 / *this.bytes_per_sec as f64,
            );
            let waker = cx.waker().clone();
            tokio::spawn(async move {
                sleep(wait).await;
                waker.wake();
            });
            return Poll::Pending;
        }

        match this.inner.poll_frame(cx) {
            Poll::Ready(Some(Ok(frame))) => {
                if let Some(data) = frame.data_ref() {
                    *this.bytes_sent += data.len() as u64;
                }
                Poll::Ready(Some(Ok(frame)))
            }
            other => other,
        }
    }
}
```

Then wrap your response body:

```rust
let response = Response::builder()
    .body(RateLimitedBody {
        inner: original_body,
        bytes_per_sec: 1024 * 1024, // 1 MB/s
        bytes_sent: 0,
        window_start: Instant::now(),
    })?;
```

The token-bucket algorithm is more precise than the sliding window above.