
Add vLLM backend server with continue_messages and highlights APIs #1

Open
kcarnold wants to merge 1 commit into main from claude/vllm-backend-implementation-ZrZEK


@kcarnold

Contributor

Summary

This PR introduces a new vLLM-backed inference server that provides the same API as the original custom_llm.py but delegates all model inference to a vLLM server via its OpenAI-compatible REST API. Only the tokenizer is loaded locally, eliminating the need for GPU resources on the backend server itself.

Key Changes

  • New vllm_backend.py: A FastAPI server that:

    • Loads only the tokenizer locally (lightweight, no GPU needed)
    • Delegates all inference to a vLLM server via HTTP
    • Implements two main endpoints: /api/continue_messages and /api/highlights
    • Includes helper functions for vLLM's chat completion and completion APIs
    • Supports configurable vLLM server URL, model name, and port via CLI args or environment variables
  • continue_messages endpoint:

    • Takes a list of messages and returns k continuation branches
    • Uses vLLM's chat completion API with logprobs to get top-k tokens
    • Batches the k branches into a single completions call for efficiency
    • Returns the branch token + next greedy token for each continuation
  • highlights endpoint:

    • Analyzes tokens in a document to identify which ones differ from the model's top prediction
    • Uses vLLM's prompt_logprobs feature to get per-token log probabilities without generating new text
    • Returns character-level highlights with token loss, most likely token, and top-k alternatives
    • Supports optional prompt and updated_doc parameters
  • Comprehensive test suite (test_vllm_backend.py):

    • 701 lines of tests with mocked vLLM responses and tokenizer
    • Tests for response format, API parameter validation, batching behavior
    • Tests for edge cases (empty logprobs, empty messages, etc.)
    • Verifies correct token ID handling and character offset calculations
    • No external dependencies required (mocks HTTP client and tokenizer)
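The two-call pattern behind continue_messages can be sketched roughly as follows. This is an illustration rather than the code in vllm_backend.py: the function names are hypothetical, and `prefix_text` is assumed to be the chat-templated prompt rendered with the locally loaded tokenizer. The request fields (`logprobs`, `top_logprobs`, a list-valued `prompt`) are the standard OpenAI-compatible ones.

```python
from typing import Any


def build_branch_request(model: str, messages: list[dict[str, str]], k: int) -> dict[str, Any]:
    """Call 1: request one token plus the top-k alternatives at that position."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": 1,
        "temperature": 0,
        "logprobs": True,
        "top_logprobs": k,
    }


def extract_branch_tokens(chat_response: dict[str, Any]) -> list[str]:
    """Read the top-k candidate tokens from an OpenAI-style chat response."""
    logprobs = chat_response["choices"][0].get("logprobs") or {}
    content = logprobs.get("content") or []
    if not content:
        return []  # empty logprobs: there is nothing to branch on
    return [alt["token"] for alt in content[0]["top_logprobs"]]


def build_batched_request(model: str, prefix_text: str, branch_tokens: list[str]) -> dict[str, Any]:
    """Call 2: a single completions request carrying one prompt per branch."""
    return {
        "model": model,
        "prompt": [prefix_text + tok for tok in branch_tokens],
        "max_tokens": 1,
        "temperature": 0,  # greedy next token
    }


def merge_branches(branch_tokens: list[str], completion_response: dict[str, Any]) -> list[str]:
    """Pair each branch token with its greedy next token, ordered by choice index."""
    choices = sorted(completion_response["choices"], key=lambda c: c["index"])
    return [tok + choice["text"] for tok, choice in zip(branch_tokens, choices)]
```

Sending all k prompts in one request lets vLLM schedule the branches together instead of paying for k separate HTTP round trips, which is presumably the efficiency gain the description refers to.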

Notable Implementation Details

  • The server uses a workaround for the chat template API: it adds a "." token and then strips it to obtain the prefix token IDs for the highlights endpoint
  • Supports both text and token ID prompts in the completions API for flexibility
  • Includes CORS middleware for cross-origin requests
  • Gracefully handles vLLM server connection failures with warnings
  • Requires vLLM to be started with --no-enable-prefix-caching for prompt_logprobs support (vLLM limitation)
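As a rough illustration of how prompt_logprobs can drive the highlights endpoint, the sketch below builds a token-ID completions request and then flags document tokens that differ from the model's top prediction. It is not the PR's code. It assumes that vLLM returns `prompt_logprobs` as a list with `None` for the first token and, for every later position, a mapping from token ID (as a string) to an object with `logprob` and `decoded_token`, and that the actual token is always present in that mapping. The function names and the highlight field names are hypothetical, and character offsets are derived from the decoded token lengths.

```python
from typing import Any, Optional


def build_prompt_logprobs_request(model: str, prompt_token_ids: list[int], k: int) -> dict[str, Any]:
    """Score an existing prompt; the single generated token is ignored."""
    return {
        "model": model,
        "prompt": prompt_token_ids,  # token-ID prompt rather than text
        "max_tokens": 1,
        "prompt_logprobs": k,  # vLLM extension to the completions API
    }


def find_highlights(
    token_ids: list[int],
    token_texts: list[str],
    prompt_logprobs: list[Optional[dict[str, dict[str, Any]]]],
    doc_start: int,
) -> list[dict[str, Any]]:
    """Return a highlight for each document token that is not the top prediction.

    token_ids, token_texts and prompt_logprobs cover the full prompt
    (prefix plus document); doc_start is the index of the first document token.
    """
    highlights = []
    offset = 0  # character offset within the document text
    for i in range(doc_start, len(token_ids)):
        start, end = offset, offset + len(token_texts[i])
        offset = end
        entry = prompt_logprobs[i]
        if entry is None:
            continue  # no distribution is available for this position
        actual = entry.get(str(token_ids[i]))
        ranked = sorted(entry.items(), key=lambda kv: kv[1]["logprob"], reverse=True)
        best_id, best = ranked[0]
        if actual is None or int(best_id) == token_ids[i]:
            continue  # the model agreed with the document here
        highlights.append({
            "start": start,
            "end": end,
            "token": token_texts[i],
            "token_loss": -actual["logprob"],
            "most_likely_token": best["decoded_token"],
            "topk_tokens": [info["decoded_token"] for _, info in ranked],
        })
    return highlights
```

Because every position is scored in one forward pass over the prompt, no text has to be generated to compute the per-token loss.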

https://claude.ai/code/session_01XWL14LMjKYUadgbLhbRDGe

Implements a FastAPI server (vllm_backend.py) that provides the same
API surface as custom_llm.py but delegates model inference to a vLLM
server. Only the tokenizer is loaded locally; all GPU work is handled
by vLLM's optimized serving engine.

Endpoints implemented:
- POST /api/continue_messages: two-call pattern (chat completions for
  top-k branch tokens, then batched completions for greedy next token
  per branch)
- GET /api/highlights: uses vLLM's prompt_logprobs extension to get
  per-token log probabilities without generation

Includes 12 tests with mocked HTTP responses and tokenizer.

https://claude.ai/code/session_01XWL14LMjKYUadgbLhbRDGe

2 participants