infermesh

infermesh is for researchers and engineers who need to run large LLM jobs from notebooks, scripts, or local inference stacks without rebuilding the same concurrency and quota-control layer each time.

It sits on top of LiteLLM and focuses on the parts that usually show up once an experiment becomes real work:

concurrent batch generation with ordered results
notebook-safe sync APIs
per-item failure handling for long runs
crash-resilient batches with incremental writes and --resume support
automatic retries with exponential backoff for transient errors
client-side RPM and TPM throttling
typed results with token usage and timing metadata
multi-replica routing for local or clustered inference endpoints

If you only need a handful of one-off requests, use the provider SDK or plain LiteLLM. infermesh earns its keep when throughput control and batch ergonomics matter more than raw minimalism.

Install

Python 3.12+ is required.

python -m pip install infermesh

If you use uv:

uv add infermesh

Contributor setup, editable installs, and clone-based workflows live in CONTRIBUTING.md.

Quick Start

Set the provider key in your environment first:

export OPENAI_API_KEY=sk-...

The core workflow is "run a batch, keep the results you want, inspect the failures, and retry only what broke":

from infermesh import LMClient

prompts = [
    "Summarize section 1 in two bullet points.",
    "Summarize section 2 in two bullet points.",
    "Summarize section 3 in two bullet points.",
]

with LMClient(
    model="openai/gpt-4.1-mini",
    max_parallel_requests=32,
    rpm=500,
    tpm=100_000,
) as client:
    batch = client.generate_batch(prompts)

retry_prompts: list[str] = []

for i, result in enumerate(batch):
    if result is None:
        print(f"FAILED: {prompts[i]}\n  {batch.errors[i]}")
        retry_prompts.append(prompts[i])
    else:
        print(result.output_text)
        if result.token_usage is not None:
            print("tokens:", result.token_usage.total_tokens)

if retry_prompts:
    with LMClient(model="openai/gpt-4.1-mini") as retry_client:
        retry_batch = retry_client.generate_batch(retry_prompts)

One failing request does not abort the whole batch. Failed items are None in batch.results; the exception is in batch.errors[i]. This is deliberate: a single provider error should not wipe out a long experiment.

For large Python batches, set max_parallel_requests explicitly. generate_batch and transcribe_batch both use a bounded in-flight window when it is set; when it is unset, they start one coroutine per item up front, which can cause memory pressure for very large inputs. embed_batch is always micro-batched regardless of max_parallel_requests — pass micro_batch_size to tune chunk size instead.

This code works in Jupyter notebooks without any asyncio setup. The sync API runs a background event loop so you do not have to.
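The background-loop idea can be sketched with the standard library alone. This is an assumption about the general technique, not a copy of infermesh's internals; BackgroundLoop and double are hypothetical names used only for illustration.

```python
import asyncio
import threading


class BackgroundLoop:
    """Run coroutines on a private event loop in a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Schedule the coroutine on the background loop and block until done.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def double(x: int) -> int:
    await asyncio.sleep(0)
    return 2 * x


runner = BackgroundLoop()
value = runner.run(double(21))
runner.close()
print(value)  # 42
```

Because the coroutine runs on a separate loop in another thread, the caller never touches the loop that Jupyter itself is already running.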

For a single one-off request:

with LMClient(model="openai/gpt-4.1-mini") as client:
    result = client.generate("What is the capital of France?")
    print(result.output_text)

The model string follows LiteLLM's provider/model-name format. See the LiteLLM model list for all supported providers:

Provider	Example
OpenAI	`"openai/gpt-4.1-mini"`
Anthropic	`"anthropic/claude-3-5-sonnet-20241022"`
Local vLLM	`"hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct"`

api_base is optional for hosted providers — LiteLLM already knows their endpoints. Set it explicitly for local servers or custom deployments. Keep provider secrets in environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY); local servers that require no auth work without an api_key.

Generate Text

result = client.generate("Say hello in one sentence.")

print(result.output_text)     # generated text
print(result.token_usage)     # prompt / completion / total token counts
print(result.finish_reason)   # "stop", "length", …
print(result.request_id)      # provider-assigned ID for debugging

Create Embeddings

# Single string → EmbeddingResult
result = client.embed("The quick brown fox")
print(result.embedding)          # list[float]

# Multiple strings → processed in resilient micro-batches by default
batch = client.embed_batch(
    ["sentence one", "sentence two", "sentence three"],
    micro_batch_size=32,
)
vectors = [r.embedding for r in batch if r is not None]
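To make the effect of micro_batch_size concrete, here is a minimal sketch of splitting inputs into fixed-size chunks while remembering each chunk's starting index. The helper name micro_batches is hypothetical, and how infermesh chunks internally is not shown in this README.

```python
from collections.abc import Iterator


def micro_batches(texts: list[str], size: int) -> Iterator[tuple[int, list[str]]]:
    """Yield (start_index, chunk) pairs of at most `size` texts each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield start, texts[start : start + size]


chunks = list(micro_batches(["a", "b", "c", "d", "e"], size=2))
print(chunks)  # [(0, ['a', 'b']), (2, ['c', 'd']), (4, ['e'])]
```

Keeping the start index lets a caller place each chunk's vectors back at the right positions, and a failed chunk only affects the items inside it.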

Transcribe Audio

result = client.transcribe("recording.wav")   # path, bytes, or file-like object
print(result.text)
print(result.language)     # detected language code, e.g. "en"
print(result.duration_s)   # audio length in seconds

batch = client.transcribe_batch(["recording-a.wav", "recording-b.wav"])
texts = [r.text if r is not None else None for r in batch]

Audio inputs larger than 25 MB are rejected by default. Pass max_transcription_bytes=None only in trusted environments where the server is expected to accept larger uploads. Disabling the guard means the client may read and send very large audio files in full.

CLI

# Set your key first (or use --env-file .env)
export OPENAI_API_KEY=sk-...

# Generate — single prompt
infermesh generate \
  --model openai/gpt-4.1-mini \
  --api-base https://api.openai.com/v1 \
  --prompt "Hello"

# Generate — from a JSONL file, results to another JSONL file
# Each input line: {"prompt": "..."} or {"messages": [...]} or {"responses_input": "..."}
# Output includes an _index field so interrupted runs can be resumed.
infermesh generate \
  --model openai/gpt-4.1-mini \
  --api-base https://api.openai.com/v1 \
  --input-jsonl prompts.jsonl \
  --output-jsonl results.jsonl

# Resume an interrupted run — skips already-completed rows and appends new ones
infermesh generate \
  --model openai/gpt-4.1-mini \
  --api-base https://api.openai.com/v1 \
  --input-jsonl prompts.jsonl \
  --output-jsonl results.jsonl \
  --resume

# Create embeddings
infermesh embed \
  --model text-embedding-3-small \
  --api-base https://api.openai.com/v1 \
  --text "hello world"

# Transcribe audio
infermesh transcribe \
  --model whisper-1 \
  --api-base https://api.openai.com/v1 \
  recording.wav
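The {"prompt": "..."} input format mentioned in the comments above can be produced from Python with a few lines. This sketch only covers the prompt form; write_prompts is a hypothetical helper, and a temporary directory is used so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def write_prompts(path: Path, prompts: list[str]) -> None:
    """Write one {"prompt": ...} JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for prompt in prompts:
            handle.write(json.dumps({"prompt": prompt}) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "prompts.jsonl"
    write_prompts(target, ["Summarize section 1.", 'Quote: "hi"\nNext line'])
    rows = [json.loads(line) for line in target.read_text(encoding="utf-8").splitlines()]

print(len(rows))  # 2
```

json.dumps escapes embedded quotes and newlines, so each prompt stays on exactly one line even when the prompt text itself spans several.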

Advanced Usage

Crash-resilient batches (on_result)

For long runs, pass on_result to write each result to disk as it arrives. A crash or interruption only loses the requests that were in-flight at that moment — everything already completed is safe on disk.

generate_batch, embed_batch, and transcribe_batch all support the same per-item callback contract.

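import json
from infermesh import LMClient

prompts = ["Summarize section 1.", "Summarize section 2."]  # any list of prompts

with open("results.jsonl", "w") as out, \
     LMClient(model="openai/gpt-4.1-mini", max_parallel_requests=32) as client:

    def save(index: int, result, error) -> None:
        # Called once per item as it completes; error is set when the item failed.
        row = {"index": index}
        if error is not None:
            row["error"] = str(error)
        else:
            row["output_text"] = result.output_text
        out.write(json.dumps(row) + "\n")
        out.flush()  # push the row out immediately so a crash does not lose it

    client.generate_batch(prompts, on_result=save)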

The CLI automates this with --resume — see the CLI section above and the User Guide for the full checkpoint/resume pattern.
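For Python callers, a manual resume step might look like the sketch below. It assumes the row layout written by the save callback above ("index" plus either "output_text" or "error"); completed_indices is a hypothetical helper, and the CLI's own --resume logic may differ.

```python
import json
import tempfile
from pathlib import Path


def completed_indices(path: Path) -> set[int]:
    """Collect the indices of rows that finished without an error."""
    done: set[int] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank line, or a final line truncated by a crash
            if isinstance(row, dict) and "error" not in row and "index" in row:
                done.add(row["index"])
    return done


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.jsonl"
    out.write_text(
        '{"index": 0, "output_text": "ok"}\n'
        '{"index": 1, "error": "timeout"}\n'
        '{"index": 2, "outp',  # simulated partial write
        encoding="utf-8",
    )
    done = completed_indices(out)

prompts = ["p0", "p1", "p2", "p3"]
remaining = [(i, p) for i, p in enumerate(prompts) if i not in done]
print(remaining)  # [(1, 'p1'), (2, 'p2'), (3, 'p3')]
```

When re-running only the remaining prompts, open the output file in append mode and translate the callback's index back to the original position, since the callback index would presumably refer to the position within the shorter retry list.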

Rate limiting

Pass any combination of rpm / tpm / rpd / tpd to activate the built-in rate limiter. The client queues requests automatically and enforces every limit you set simultaneously.

client = LMClient(
    model="openai/gpt-4.1-mini",
    rpm=500,      # requests per minute
    tpm=100_000,  # tokens per minute
)

Find your tier's limits in the provider dashboard: for OpenAI check Settings → Limits; for Anthropic check Console → Settings → Limits.

Use max_request_burst / max_token_burst to allow short bursts above the steady-state rate (token-bucket algorithm). Use default_output_tokens to pre-reserve output tokens for rate-limit accounting when you don't set max_tokens per request.
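The token-bucket idea behind the burst settings can be sketched in a few lines. This is a generic illustration, not infermesh's limiter: the TokenBucket class is hypothetical, and treating the burst value as the bucket capacity is an assumption. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Refill at a steady rate; allow short bursts up to a fixed capacity."""

    def __init__(self, rate_per_minute: float, burst: float) -> None:
        self.rate = rate_per_minute / 60.0  # units added per second
        self.capacity = burst
        self.tokens = burst  # start full, so an initial burst is allowed
        self.updated = 0.0

    def try_acquire(self, cost: float, now: float) -> bool:
        # Refill for the time that has passed, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_minute=60, burst=2)
decisions = [bucket.try_acquire(1, now=0.0) for _ in range(3)]
decisions.append(bucket.try_acquire(1, now=1.0))
print(decisions)  # [True, True, False, True]
```

The same structure works for requests (cost 1 per call) and for tokens (cost equal to the estimated prompt plus reserved output tokens), which is where a setting like default_output_tokens would feed in.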

Provider rate-limit headers (x-ratelimit-*) are read automatically after each response to keep the client's internal counters in sync with the server's view. Use header_bucket_scope to control whether headers are routed to the per-minute or per-day buckets.

CLI flags: --rpm, --tpm, --rpd, --tpd, --max-request-burst, --max-token-burst.

Multi-replica routing (vLLM / SGLang)

When you run multiple inference servers for the same model, pass a deployments dict to spread load across them. model is the logical name the router exposes; each DeploymentConfig.model is the backend string sent to that server.

from infermesh import DeploymentConfig, LMClient

client = LMClient(
    model="llama-3-8b",
    deployments={
        "gpu-0": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://host1:8000/v1",
        ),
        "gpu-1": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://host2:8000/v1",
        ),
        "gpu-2": DeploymentConfig(
            model="hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct",
            api_base="http://host3:8000/v1",
        ),
    },
    routing_strategy="least-busy",   # or "simple-shuffle" (default), "latency-based-routing"
)

result = client.generate("Summarize this paper in one paragraph.")
print(result.metrics.deployment)  # e.g. "gpu-1"

DeploymentConfig is a plain dataclass, so it maps naturally to Hydra / OmegaConf structured config. Deployment keys ("gpu-0" etc.) are free-form labels.
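As an illustration of what a "least-busy" strategy means, the sketch below always picks the replica with the fewest in-flight requests. It is a toy model with a hypothetical LeastBusyPicker class; the actual routing is handled by the library and may weigh other factors.

```python
class LeastBusyPicker:
    """Track in-flight requests per deployment and pick the least loaded one."""

    def __init__(self, names: list[str]) -> None:
        self.in_flight = {name: 0 for name in names}

    def acquire(self) -> str:
        # Ties go to the deployment listed first.
        name = min(self.in_flight, key=self.in_flight.__getitem__)
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        self.in_flight[name] -= 1


picker = LeastBusyPicker(["gpu-0", "gpu-1", "gpu-2"])
first_three = [picker.acquire() for _ in range(3)]
picker.release("gpu-1")  # gpu-1 finishes its request first
next_pick = picker.acquire()
print(first_three, next_pick)  # ['gpu-0', 'gpu-1', 'gpu-2'] gpu-1
```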

CLI — repeated --api-base flags:

infermesh generate \
  --model llama-3-8b \
  --api-base http://host1:8000/v1 \
  --api-base http://host2:8000/v1 \
  --api-base http://host3:8000/v1 \
  --prompt "Hello"

CLI — TOML file for more control:

# deployments.toml
[deployments.gpu-0]
model = "hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct"
api_base = "http://host1:8000/v1"

[deployments.gpu-1]
model = "hosted_vllm/meta-llama/Meta-Llama-3-8B-Instruct"
api_base = "http://host2:8000/v1"

infermesh generate \
  --model llama-3-8b \
  --deployments-toml deployments.toml \
  --prompt "Hello"

Keep API keys out of TOML files — use environment variables or --env-file instead.

Async API

All methods have async counterparts prefixed with a (agenerate, aembed_batch, and so on). The sync methods work in notebooks and scripts by running an event loop in a background thread, so you don't need to manage the event loop yourself.

import asyncio
from infermesh import LMClient

async def main():
    async with LMClient(model="openai/gpt-4.1-mini") as client:
        result = await client.agenerate("Hello")
        batch  = await client.agenerate_batch(["prompt A", "prompt B", "prompt C"])
        emb    = await client.aembed("The quick brown fox")
        embs   = await client.aembed_batch(["text a", "text b"])
        txs    = await client.atranscribe_batch(["a.wav", "b.wav"])

asyncio.run(main())

async with calls close() automatically. For sync code, use with or call client.close() when done.

Structured output

Pass a Pydantic model as response_format and the output is parsed automatically:

from pydantic import BaseModel

class Answer(BaseModel):
    value: int
    confidence: float

result = client.generate(
    "What is 2 + 2? Respond in JSON.",
    response_format=Answer,
)

print(result.output_text)   # raw JSON string
print(result.output_parsed) # Answer(value=4, confidence=0.99)

A plain dict (JSON schema) is also accepted in place of a Pydantic model; the output is returned as a plain Python object. Parse failures are logged as warnings; output_parsed is None if parsing fails.

Automatic retries

By default, LMClient retries transient provider errors up to 3 times with exponential backoff. This covers 429 rate-limit spikes, 503 unavailability, 500 server errors, network failures, and timeouts.

client = LMClient(
    model="openai/gpt-4.1-mini",
    max_retries=3,   # default; set to 0 to disable
)

Backoff formula: min(2 ** attempt, 60) seconds plus up to 0.5 s jitter. If the provider returns a Retry-After header its value is used instead (capped at 60 s). Non-transient errors (BadRequestError, AuthenticationError, etc.) are not retried.

result = client.generate("Hello")
print(result.metrics.retries)   # 0 on first-attempt success

CLI flag: --max-retries.

Timeout and per-request overrides

Set a default timeout for every request at construction time:

client = LMClient(
    model="openai/gpt-4.1-mini",
    timeout=30.0,   # seconds
)

Any LiteLLM keyword argument passed to a generate / embed / transcribe call overrides the default for that request:

result = client.generate("Hello", timeout=5.0, max_tokens=64)

Use default_request_kwargs to set persistent overrides for all requests:

client = LMClient(
    model="openai/gpt-4.1-mini",
    default_request_kwargs={"max_tokens": 256, "temperature": 0.7},
)
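The precedence described above (per-call arguments win over client defaults) behaves like a dictionary merge. The effective_kwargs helper below is hypothetical and only illustrates the rule.

```python
def effective_kwargs(defaults: dict, per_call: dict) -> dict:
    """Per-call values win; defaults fill in everything else."""
    return {**defaults, **per_call}


defaults = {"max_tokens": 256, "temperature": 0.7}
merged = effective_kwargs(defaults, {"max_tokens": 64, "timeout": 5.0})
print(merged)  # {'max_tokens': 64, 'temperature': 0.7, 'timeout': 5.0}
```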

Benchmarking

infermesh bench measures client-side throughput across a concurrency sweep. It is intentionally a client benchmark — it tells you the best max_parallel_requests setting for your workload, not the server's maximum capacity.

infermesh bench generate \
  --model openai/gpt-4.1-mini \
  --api-base https://api.openai.com/v1 \
  --prompt "Write a haiku." \
  --warmup 5 \
  --requests 50 \
  --output-json bench.json

Example output (abridged):

c=1    rps=3.14  p50=0.401s  p95=0.412s  p99=0.420s  svc_p95=0.410s  q_p95=0.001s  err=0/50  elapsed=15.9s
c=2    rps=5.81  p50=0.470s  p95=0.487s  p99=0.501s  svc_p95=0.480s  q_p95=0.002s  err=0/50  elapsed=8.6s
recommended_max_parallel_requests=8

c is the concurrency level. svc_p95 is the P95 of net provider response time (excluding queue wait). q_p95 is the P95 time a request spent in the client queue. High q_p95 relative to svc_p95 means the client is the bottleneck, not the server.
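To show how these columns relate, here is a sketch that derives them from per-request timings. It assumes total latency is service time plus queue time and uses a nearest-rank percentile; infermesh bench may use a different method, and percentile and summarize are hypothetical names.

```python
import math


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(service_ms: list[int], queue_ms: list[int], elapsed_s: float) -> dict[str, float]:
    """Compute throughput and latency columns from per-request timings."""
    total_ms = [s + q for s, q in zip(service_ms, queue_ms)]
    return {
        "rps": round(len(total_ms) / elapsed_s, 2),
        "p50": percentile(total_ms, 50) / 1000,
        "p95": percentile(total_ms, 95) / 1000,
        "svc_p95": percentile(service_ms, 95) / 1000,
        "q_p95": percentile(queue_ms, 95) / 1000,
    }


service = [400 + i for i in range(20)]  # 400..419 ms of provider time
queue = [1] * 20                        # 1 ms spent waiting in the client
stats = summarize(service, queue, elapsed_s=8.0)
print(stats)
# {'rps': 2.5, 'p50': 0.41, 'p95': 0.419, 'svc_p95': 0.418, 'q_p95': 0.001}
```

In this toy data q_p95 is tiny compared with svc_p95, which is the pattern you want: the provider, not the client queue, dominates latency.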

Use --input-jsonl to benchmark with a real prompt distribution. An embedding benchmark is available as infermesh bench embed.

For server-centric metrics (TTFT, TPOT, ITL, request goodput), use a dedicated server benchmark: vLLM · SGLang · AIPerf

Why Not Just Use LiteLLM?

Use LiteLLM directly if provider abstraction is the only missing piece.

infermesh is intentionally narrower:

LiteLLM is the provider abstraction and request layer.
infermesh adds notebook-safe sync APIs and concurrent batch helpers.
infermesh preserves partial failures instead of turning a long run into one giant exception.
infermesh adds client-side throttling and replica routing for experiment workloads.
infermesh returns typed result objects so request metadata is easier to inspect programmatically.

When Not To Use It

You only make a few single requests.
You already have a batching and throttling layer you trust.
You want raw provider payloads with as little abstraction as possible.

More Detail

User Guide for the complete researcher workflow, embeddings, transcription, multimodal inputs, rate limiting, routing, async usage, structured output, and benchmarking
API Reference for method signatures and parameter docs
