llm_runner: add Engine and Session interfaces #20243
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20243
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV.
⏳ No Failures, 4 Pending as of commit e94d710 with merge base d7ca5db.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR introduces a new, model-agnostic C++ contract for the LLM serving path by defining LLMEngine (loaded model/resources + serving capacity) and LLMSession (per-conversation mutable state with reset/prefill/decode operations). This establishes an interface boundary so serving/worker code can depend on a stable API while backends keep ownership of KV/cache/state internally.
Changes:
- Adds new `LLMEngine`/`LLMSession` interfaces plus supporting structs (`SamplingConfig`, `DecodeResult`, `LLMServingCapacity`).
- Exports the new header via the Bazel runner library target.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| extension/llm/runner/llm_session.h | Adds new engine/session interfaces and associated data types for a stable serving contract. |
| extension/llm/runner/targets.bzl | Exports the new header in the runner library’s public headers. |
650b123 to 5201719
@claude Review this PR
Claude finished @mergennachin's task in 1m 16s.
5201719 to 6b74ce3
The LLM serving path needs a stable contract between generic serving code and model-specific execution code. TextLLM, Qwen, Gemma, CUDA, and future backends all differ in how they own weights and mutable state, but the server should not know those details or grow a new Python binding for every model.
This introduces the minimal runner-level split needed for that contract. LLMEngine represents the loaded physical model and its serving capacity; LLMSession represents one logical conversation state and exposes reset/prefill/decode-style operations. That shape lets a worker drive different model implementations through one interface while keeping KV/recurrent/cache ownership inside C++.
This commit is only the interface and build export. It deliberately does not add a concrete adapter or change existing runner behavior, so model migrations and serving can be reviewed as downstream uses of the contract rather than hidden side effects.
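To make the engine/session split described above concrete, here is a minimal, self-contained sketch. It is not the contents of `llm_session.h`: the `Error` enum is a stand-in for `::executorch::runtime::Error`, and the struct fields, `capacity()`, `create_session()`, `position()`, `ToyEngine`, `ToySession`, and `generate()` are hypothetical names chosen for illustration. Only the type names `LLMEngine`, `LLMSession`, `SamplingConfig`, `DecodeResult`, `LLMServingCapacity` and the reset/prefill/decode shape come from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

namespace contract_sketch {

// Stand-in for ::executorch::runtime::Error (assumption: the real type is richer).
enum class Error { Ok, InvalidArgument, NotSupported };

// Hypothetical fields; the PR only names these structs.
struct SamplingConfig { float temperature; };
struct DecodeResult { Error error; int64_t token; };
struct LLMServingCapacity { int32_t max_sessions; int64_t max_context_len; };

// One logical conversation. All mutable state lives behind this interface.
class LLMSession {
 public:
  virtual ~LLMSession() = default;
  virtual Error reset() = 0;
  virtual Error prefill(const std::vector<int64_t>& tokens) = 0;
  virtual DecodeResult decode(const SamplingConfig& cfg) = 0;
  virtual int64_t position() const = 0;
};

// The loaded model plus a description of how much it can serve.
class LLMEngine {
 public:
  virtual ~LLMEngine() = default;
  virtual LLMServingCapacity capacity() const = 0;
  virtual std::unique_ptr<LLMSession> create_session() = 0;
};

// Toy backend: the "model" emits last token + 1 and ignores the sampling config.
class ToySession : public LLMSession {
 public:
  explicit ToySession(int64_t max_len) : max_len_(max_len) {}
  Error reset() override { history_.clear(); return Error::Ok; }
  Error prefill(const std::vector<int64_t>& tokens) override {
    if (tokens.empty() ||
        position() + static_cast<int64_t>(tokens.size()) > max_len_) {
      return Error::InvalidArgument;
    }
    history_.insert(history_.end(), tokens.begin(), tokens.end());
    return Error::Ok;
  }
  DecodeResult decode(const SamplingConfig&) override {
    if (history_.empty() || position() >= max_len_) {
      return DecodeResult{Error::InvalidArgument, -1};
    }
    const int64_t next = history_.back() + 1;
    history_.push_back(next);
    return DecodeResult{Error::Ok, next};
  }
  int64_t position() const override {
    return static_cast<int64_t>(history_.size());
  }

 private:
  int64_t max_len_;
  std::vector<int64_t> history_;  // stands in for KV/recurrent state
};

class ToyEngine : public LLMEngine {
 public:
  explicit ToyEngine(int64_t max_len) : max_len_(max_len) {}
  LLMServingCapacity capacity() const override {
    return LLMServingCapacity{1, max_len_};
  }
  std::unique_ptr<LLMSession> create_session() override {
    return std::unique_ptr<LLMSession>(new ToySession(max_len_));
  }

 private:
  int64_t max_len_;
};

// Generic worker code: it sees only the two interfaces, never the backend.
std::vector<int64_t> generate(LLMEngine& engine,
                              const std::vector<int64_t>& prompt,
                              int max_new_tokens) {
  std::vector<int64_t> out;
  std::unique_ptr<LLMSession> session = engine.create_session();
  if (session->prefill(prompt) != Error::Ok) return out;
  const SamplingConfig cfg{0.0f};
  for (int i = 0; i < max_new_tokens; ++i) {
    const DecodeResult r = session->decode(cfg);
    if (r.error != Error::Ok) break;
    out.push_back(r.token);
  }
  return out;
}

}  // namespace contract_sketch
```

The point of the shape is that `generate()` would compile unchanged against any other backend that implements the two abstract classes, which is the property the commit message asks of the contract.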
6b74ce3 to e94d710
    /// NotSupported for models whose state cannot be safely rewound (for example,
    /// non-KV-cache, sliding-window, or recurrent-state models); callers should
    /// fall back to reset() + full prefill.
    virtual ::executorch::runtime::Error seek(int64_t pos) = 0;
Why do we need a seek function in the top-level llm_session? Is it for speculative decoding?
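For reference, the caller-side fallback that the doc comment on `seek()` describes (try `seek`, and on `NotSupported` do `reset()` plus a full prefill of the retained prefix) could look roughly like the sketch below. Everything here is an assumption for illustration: `Error` is a stand-in for `::executorch::runtime::Error`, and `Session`, `ToySession`, `rewind_to()`, and `tokens_prefilled()` are hypothetical names, not part of the PR. It does not answer whether speculative decoding is the intended use.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace seek_sketch {

// Stand-in for ::executorch::runtime::Error (assumption).
enum class Error { Ok, InvalidArgument, NotSupported };

// Hypothetical minimal session surface: only what the fallback needs.
class Session {
 public:
  virtual ~Session() = default;
  virtual Error reset() = 0;
  virtual Error prefill(const std::vector<int64_t>& tokens) = 0;
  virtual Error seek(int64_t pos) = 0;
  virtual int64_t position() const = 0;
};

// Toy session. `rewindable` selects between a KV-cache-like model (seek works)
// and a recurrent-state-like model (seek returns NotSupported).
class ToySession : public Session {
 public:
  explicit ToySession(bool rewindable)
      : rewindable_(rewindable), pos_(0), prefilled_(0) {}
  Error reset() override { pos_ = 0; return Error::Ok; }
  Error prefill(const std::vector<int64_t>& tokens) override {
    pos_ += static_cast<int64_t>(tokens.size());
    prefilled_ += static_cast<int64_t>(tokens.size());
    return Error::Ok;
  }
  Error seek(int64_t pos) override {
    if (!rewindable_) return Error::NotSupported;
    if (pos < 0 || pos > pos_) return Error::InvalidArgument;
    pos_ = pos;
    return Error::Ok;
  }
  int64_t position() const override { return pos_; }
  // Total tokens ever pushed through prefill; shows the cost of the fallback.
  int64_t tokens_prefilled() const { return prefilled_; }

 private:
  bool rewindable_;
  int64_t pos_;
  int64_t prefilled_;
};

// Rewind `s` to `pos`. `history` must hold every token fed to the session so
// far, because the fallback has to replay the first `pos` of them.
Error rewind_to(Session& s, const std::vector<int64_t>& history, int64_t pos) {
  if (pos < 0 || pos > static_cast<int64_t>(history.size())) {
    return Error::InvalidArgument;
  }
  Error e = s.seek(pos);
  if (e != Error::NotSupported) return e;  // success, or a real error
  // Fallback path: drop all state, then recompute the retained prefix.
  e = s.reset();
  if (e != Error::Ok) return e;
  if (pos == 0) return Error::Ok;
  const std::vector<int64_t> prefix(history.begin(), history.begin() + pos);
  return s.prefill(prefix);
}

}  // namespace seek_sketch
```

With this shape the caller ends up at the same position on both kinds of backend; the difference is cost, since the fallback re-runs prefill over the whole retained prefix while a successful `seek` does not.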
#20001