Skip to content

llm_runner: add Engine and Session interfaces#20243

Open
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-c-llm-engine-session-interface
Open

llm_runner: add Engine and Session interfaces#20243
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-c-llm-engine-session-interface

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

The LLM serving path needs a stable contract between generic serving code and model-specific execution code. TextLLM, Qwen, Gemma, CUDA, and future backends all differ in how they own weights and mutable state, but the server should not know those details or grow a new Python binding for every model.

This introduces the minimal runner-level split needed for that contract. LLMEngine represents the loaded physical model and its serving capacity; LLMSession represents one logical conversation state and exposes reset/prefill/decode-style operations. That shape lets a worker drive different model implementations through one interface while keeping KV/recurrent/cache ownership inside C++.

This commit is only the interface and build export. It deliberately does not add a concrete adapter or change existing runner behavior, so model migrations and serving can be reviewed as downstream uses of the contract rather than hidden side effects.

#20001

Copilot AI review requested due to automatic review settings June 12, 2026 20:13
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20243

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

⏳ No Failures, 4 Pending

As of commit e94d710 with merge base d7ca5db (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new, model-agnostic C++ contract for the LLM serving path by defining LLMEngine (loaded model/resources + serving capacity) and LLMSession (per-conversation mutable state with reset/prefill/decode operations). This establishes an interface boundary so serving/worker code can depend on a stable API while backends keep ownership of KV/cache/state internally.

Changes:

  • Adds new LLMEngine / LLMSession interfaces plus supporting structs (SamplingConfig, DecodeResult, LLMServingCapacity).
  • Exports the new header via the Bazel runner library target.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
extension/llm/runner/llm_session.h Adds new engine/session interfaces and associated data types for a stable serving contract.
extension/llm/runner/targets.bzl Exports the new header in the runner library’s public headers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread extension/llm/runner/llm_session.h Outdated
Comment thread extension/llm/runner/llm_session.h Outdated
Comment thread extension/llm/runner/llm_session.h Outdated
Comment thread extension/llm/runner/llm_session.h
Comment thread extension/llm/runner/llm_session.h
Comment thread extension/llm/runner/llm_session.h Outdated
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review this PR

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 1m 16s —— View job


PR Review: llm_runner: add Engine and Session interfaces

  • Read changed files (llm_session.h, targets.bzl)
  • Analyze interface design & contract consistency
  • Check against Copilot review comments
  • Post review feedback

Reviewed the two changed files. This is a clean, well-scoped interface-only PR — no concrete adapters, no behavior change. The documentation is unusually thorough for a header, and the Engine/Session split (immutable model resources vs. per-conversation mutable state) is a sound contract for the serving path. Overall looks good to merge; comments below are minor.

On the Copilot comments — all already addressed ✅

The Copilot review was generated at an earlier snapshot. The current head of the branch already resolves every one of its six comments:

  • SamplingConfig, DecodeResult, LLMServingCapacity are all marked ET_EXPERIMENTAL (lines 31, 51, 62).
  • The cooperative-stop() terminal-step contract is now fully specified in three consistent places: DecodeResult (lines 47–50: "token_id must be 0, text_piece must be empty, position() must not advance"), decode_one() (lines 108–110), and stop() (lines 129–135).
  • The decode_one() doc no longer contradicts stop() — it explicitly states the short-circuit path.

No action needed there.

Minor observations

  1. DecodeResult has no default member initializers (lines 52–55). The other three structs (SamplingConfig, LLMServingCapacity) initialize every field, but DecodeResult's four members are left uninitialized. An adapter that forgets a field yields a garbage is_eos/is_terminal — a nasty bug since is_terminal drives the decode loop. Recommend defaulting for consistency and safety:

    struct ET_EXPERIMENTAL DecodeResult {
      uint64_t token_id = 0;
      std::string text_piece;
      bool is_eos = false;
      bool is_terminal = false;
    };
  2. Namespace style differs from the sibling header. This file uses C++17 nested namespace executorch::extension::llm { (line 25), while irunner.h in the same directory uses the older namespace executorch { namespace extension { namespace llm { form. Both compile fine; the new form is cleaner. Just flagging the inconsistency within one directory in case the codebase has a convention.

  3. Relationship to GenerationConfig/IRunner is worth a one-line note. SamplingConfig overlaps conceptually with GenerationConfig::temperature from irunner.h, but operates at a different altitude (per-decode-step vs. per-generation). The header comment (lines 9–12) explains the adapter relationship well; a sentence clarifying that this Session contract sits below the existing IRunner/GenerationConfig layer (rather than replacing it) would help future readers who land on both files.

  4. metadata() returning const std::unordered_map<std::string, int64_t>& ties the engine contract to a specific container type. That's consistent with how runner metadata is passed elsewhere in this codebase, so it's reasonable — just noting it constrains adapters to store metadata in exactly that shape.

None of these are blocking. The only one I'd actively recommend acting on is #1 (default-initialize DecodeResult) since uninitialized is_terminal is a real footgun. Fix this →

targets.bzl: the single-line addition correctly exports llm_session.h from the runner_lib target alongside the other public headers. ✓
· branch llm-pr-c-llm-engine-session-interface

Copilot AI review requested due to automatic review settings June 12, 2026 21:12
@mergennachin mergennachin force-pushed the llm-pr-c-llm-engine-session-interface branch from 5201719 to 6b74ce3 Compare June 12, 2026 21:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread extension/llm/runner/llm_session.h
Comment thread extension/llm/runner/targets.bzl
The LLM serving path needs a stable contract between generic serving code and model-specific execution code. TextLLM, Qwen, Gemma, CUDA, and future backends all differ in how they own weights and mutable state, but the server should not know those details or grow a new Python binding for every model.

This introduces the minimal runner-level split needed for that contract. LLMEngine represents the loaded physical model and its serving capacity; LLMSession represents one logical conversation state and exposes reset/prefill/decode-style operations. That shape lets a worker drive different model implementations through one interface while keeping KV/recurrent/cache ownership inside C++.

This commit is only the interface and build export. It deliberately does not add a concrete adapter or change existing runner behavior, so model migrations and serving can be reviewed as downstream uses of the contract rather than hidden side effects.
/// NotSupported for models whose state cannot be safely rewound (for example,
/// non-KV-cache, sliding-window, or recurrent-state models); callers should
/// fall back to reset() + full prefill.
virtual ::executorch::runtime::Error seek(int64_t pos) = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we have to have seek function in the top-level llm_session? For speculative decoding?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants