Add vLLM backend server with continue_messages and highlights APIs by kcarnold · Pull Request #1 · AIToolsLab/writing-prototypes

kcarnold · 2026-02-13T21:40:00Z

Summary

This PR introduces a new vLLM-backed inference server that provides the same API as the original custom_llm.py but delegates all model inference to a vLLM server via its OpenAI-compatible REST API. Only the tokenizer is loaded locally, eliminating the need for GPU resources on the backend server itself.

Key Changes

New vllm_backend.py: A FastAPI server that:
- Loads only the tokenizer locally (lightweight, no GPU needed)
- Delegates all inference to a vLLM server via HTTP
- Implements two main endpoints: /api/continue_messages and /api/highlights
- Includes helper functions for vLLM's chat completion and completion APIs
- Supports configurable vLLM server URL, model name, and port via CLI args or environment variables
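
A minimal sketch of how CLI arguments and environment variables could be layered for this kind of configuration. The flag names, variable names (`VLLM_URL`, `VLLM_MODEL`, `PORT`), and defaults are assumptions for illustration, not taken from vllm_backend.py.

```python
import argparse
import os


def parse_config(argv=None, env=None):
    """Resolve settings with precedence: CLI flag > environment variable > default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="vLLM-backed API server (sketch)")
    # All flag names, variable names, and defaults below are hypothetical.
    parser.add_argument("--vllm-url", default=env.get("VLLM_URL", "http://localhost:8000"))
    parser.add_argument("--model", default=env.get("VLLM_MODEL", "example-model"))
    parser.add_argument("--port", type=int, default=int(env.get("PORT", "5000")))
    return parser.parse_args(argv)
```

Reading the environment only to compute argparse defaults keeps a single source of truth: an explicit flag always wins, and the environment value is used only when the flag is absent.
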
continue_messages endpoint:
- Takes a list of messages and returns k continuation branches
- Uses vLLM's chat completion API with logprobs to get top-k tokens
- Batches the k branches into a single completions call for efficiency
- Returns the branch token + next greedy token for each continuation
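
The two-call pattern described above could look roughly like the following. This is a sketch under assumptions, not the PR's code: the function names are hypothetical, the prefix is assumed to be the chat-template-rendered text, and the payload and response fields follow the OpenAI-compatible chat completions and completions formats.

```python
def build_branch_request(model, messages, k):
    """First call: request one token plus its top-k alternatives."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": 1,
        "logprobs": True,
        "top_logprobs": k,
    }


def extract_branch_tokens(chat_response):
    """Return the top-k candidate tokens for the first generated position."""
    logprobs = chat_response["choices"][0].get("logprobs") or {}
    content = logprobs.get("content") or []
    if not content:
        return []  # empty logprobs: no branches to explore
    return [alt["token"] for alt in content[0]["top_logprobs"]]


def build_batched_request(model, prefix_text, branch_tokens):
    """Second call: one completions request whose prompt list holds every branch."""
    return {
        "model": model,
        "prompt": [prefix_text + tok for tok in branch_tokens],
        "max_tokens": 1,
        "temperature": 0,  # greedy decoding for the follow-up token
    }


def assemble_continuations(branch_tokens, completions_response):
    """Join each branch token with the greedy token returned for its prompt."""
    # With one completion per prompt, choice["index"] is assumed to match prompt order.
    by_index = {c["index"]: c["text"] for c in completions_response["choices"]}
    return [tok + by_index.get(i, "") for i, tok in enumerate(branch_tokens)]
```

Batching the k prompts into one request means k branches cost one round trip instead of k.
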
highlights endpoint:
- Analyzes tokens in a document to identify which ones differ from the model's top prediction
- Uses vLLM's prompt_logprobs feature to get per-token log probabilities without generating new text
- Returns character-level highlights with token loss, most likely token, and top-k alternatives
- Supports optional prompt and updated_doc parameters
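
As an illustration of the highlight computation, the sketch below post-processes a prompt_logprobs list into character-level highlights. It assumes each entry is either None (first token) or a mapping from token ID to an object with `logprob`, `rank`, and `decoded_token` fields, and that character offsets come from the local tokenizer. The function name and output field names are hypothetical.

```python
def compute_highlights(token_ids, offsets, prompt_logprobs, k=3):
    """Flag tokens that differ from the model's top-ranked prediction."""
    highlights = []
    for tok_id, (start, end), entry in zip(token_ids, offsets, prompt_logprobs):
        if entry is None:
            continue  # the first prompt token has no preceding context
        # JSON object keys arrive as strings; normalize them to ints.
        by_id = {int(key): info for key, info in entry.items()}
        ranked = sorted(by_id.items(), key=lambda item: item[1]["rank"])
        top_id, top_info = ranked[0]
        actual = by_id.get(tok_id)
        if top_id == tok_id or actual is None:
            continue  # token matches the top prediction, or has no logprob entry
        highlights.append({
            "start": start,
            "end": end,
            "token_loss": -actual["logprob"],
            "most_likely_token": top_info["decoded_token"],
            "topk_tokens": [info["decoded_token"] for _, info in ranked[:k]],
        })
    return highlights
```

Because the log probabilities are computed over the prompt itself, no new text has to be generated to score the document.
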
Comprehensive test suite (test_vllm_backend.py):
- 701 lines of tests with mocked vLLM responses and tokenizer
- Tests for response format, API parameter validation, batching behavior
- Tests for edge cases (empty logprobs, empty messages, etc.)
- Verifies correct token ID handling and character offset calculations
- No external dependencies required (mocks HTTP client and tokenizer)
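
A test in this style might look like the sketch below, which swaps the HTTP client for a `unittest.mock.MagicMock` so no vLLM server is needed. The helper `fetch_top_tokens` and the test name are hypothetical stand-ins, not functions from test_vllm_backend.py.

```python
from unittest.mock import MagicMock


def fetch_top_tokens(client, base_url, payload):
    """Hypothetical helper: POST to chat completions and return candidate tokens."""
    response = client.post(f"{base_url}/v1/chat/completions", json=payload)
    response.raise_for_status()
    logprobs = response.json()["choices"][0].get("logprobs") or {}
    content = logprobs.get("content") or []
    return [alt["token"] for alt in content[0]["top_logprobs"]] if content else []


def make_mock_client(json_body):
    """Build a fake HTTP client whose post() returns a canned JSON body."""
    client = MagicMock()
    client.post.return_value.json.return_value = json_body
    return client


def test_empty_logprobs_yield_no_branches():
    client = make_mock_client({"choices": [{"logprobs": None}]})
    assert fetch_top_tokens(client, "http://vllm.invalid", {"model": "m"}) == []
    client.post.assert_called_once()
```

Passing the client in as an argument is what makes the mock easy to inject; the same test would run under pytest or be called directly.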

Notable Implementation Details

- The server uses a workaround for the chat template API: it appends a "." token and then strips it to get the prefix token IDs for the highlights endpoint
- Supports both text and token ID prompts in the completions API for flexibility
- Includes CORS middleware for cross-origin requests
- Handles vLLM server connection failures gracefully by logging a warning
- Requires vLLM to be started with --no-enable-prefix-caching for prompt_logprobs support (a vLLM limitation)
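
One way the connection-failure handling could work is sketched below using only the standard library. The function name, the choice of probing the `/v1/models` route, and the injectable `opener` parameter are assumptions for illustration; the PR's actual check may differ.

```python
import urllib.error
import urllib.request
import warnings


def check_vllm_reachable(base_url, opener=urllib.request.urlopen, timeout=2.0):
    """Return True if the server answers; warn and return False otherwise."""
    try:
        # /v1/models is part of the OpenAI-compatible API; probing it is an assumption here.
        with opener(f"{base_url}/v1/models", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError) as exc:
        warnings.warn(f"vLLM server at {base_url} is not reachable: {exc}")
        return False
```

Warning instead of raising lets the backend start before the vLLM server is up, at the cost of deferring the failure to the first real request.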

https://claude.ai/code/session_01XWL14LMjKYUadgbLhbRDGe

Implements a FastAPI server (vllm_backend.py) that provides the same API surface as custom_llm.py but delegates model inference to a vLLM server. Only the tokenizer is loaded locally; all GPU work is handled by vLLM's optimized serving engine.

Endpoints implemented:
- POST /api/continue_messages: two-call pattern (chat completions for top-k branch tokens, then batched completions for greedy next token per branch)
- GET /api/highlights: uses vLLM's prompt_logprobs extension to get per-token log probabilities without generation

Includes 12 tests with mocked HTTP responses and tokenizer.

https://claude.ai/code/session_01XWL14LMjKYUadgbLhbRDGe

kcarnold wants to merge 1 commit into main from claude/vllm-backend-implementation-ZrZEK
