Cancel llama autocomplete generation when the wrapping Task is cancelled#346
Merged
Conversation
Today, when the suggestion work controller cancels a parent Task (new keystroke, focus change), the Task.detached call inside LlamaRuntimeManager does not inherit cancellation, so core.generate runs its full prediction budget while holding autocompleteLock. The next autocomplete then waits ~100-400ms on Metal behind a result nobody wants. Two changes: 1. core.generate now polls Task.isCancelled between sampleNext calls and breaks early. This matches what summarize already does. 2. generate and summarize in the manager wrap the Task.detached await in withTaskCancellationHandler so an outer cancel actually reaches the detached task. Engine-level cancelSequence is intentionally not called for the autocomplete path: its cancelled flag is one-way, and tripping it would require destroying and recreating the persistent sequence on every cancellation, losing KV cache reuse. The Task.isCancelled poll between samples gives us per-token (~10-15ms) granularity, which is fast enough.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Outer
Task.cancel()was not reachingcore.generate's sampling loop, so stale autocomplete generations ran to the full prediction budget while holdingautocompleteLock. This propagates cancellation throughTask.detachedand adds a per-iteration cancel poll, freeing the lock so the next autocomplete (the one the user actually wants) can start ~100-400 ms sooner on Metal.Validation
Local
xcodebuild testfailed with a Team ID signing mismatch (mapping process and mapped file (non-platform) have different Team IDs) — known local-signing issue per.claude/CLAUDE.md. Will rely on CI to run the test suite.Linked issues
None.
Risk / rollout notes
core.generatenow returns whatever partial text it has accumulated when the wrapping Task is cancelled, matching the existing behavior ofcore.summarize. Callers (LlamaSuggestionEngine.generateSuggestion) already follow the runtime call withtry Task.checkCancellation(), so the partial text is dropped before it can reach the UI.engine.cancelSequenceis deliberately not called for the persistent autocomplete sequence. The native cancellation flag is one-way: tripping it would force destroying and rebuilding the sequence on every cancellation and lose KV cache reuse. Per-iterationTask.isCancelledpolling betweensampleNextcalls gives us ~10-15 ms cancellation granularity, which is fast enough.batched-decode-refactorexists for follow-up work on true batched decode in CotabbyInference. That refactor needs a Phase 0 spike (measure actual Metal throughput forn_seq_max>1batched vs separate contexts on the GGUF models we ship) before any code lands.Greptile Summary
This PR fixes stale autocomplete generations that previously ran to their full prediction budget while holding
autocompleteLock, blocking the next (user-intended) autocomplete. Cancellation is now propagated end-to-end: the manager wraps bothgenerateandsummarizeinwithTaskCancellationHandlerto forward outer-task cancellation to the detached inference task, andLlamaRuntimeCore.generate()gains aTask.isCancelledpoll at the top of each sampling iteration (matching the pre-existing poll insummarize).LlamaRuntimeManager— replaces the bareTask.detached {...}.valuepattern withwithTaskCancellationHandler+ anonCancelblock that callstask.cancel(), then callsTask.checkCancellation()aftertask.valueresolves to surface the cancel asCancellationErrorfor the existingcatchpath.LlamaRuntimeCore.generate()— addsif Task.isCancelled { break }before eachsampleNextcall; on early exit the existingdeferblocks still trim the KV cache and releaseautocompleteLock, so state remains consistent for the next request.Confidence Score: 5/5
Safe to merge. The cancellation path correctly releases
autocompleteLockvia the existingdeferblocks, and thewithTaskCancellationHandlerpattern is the idiomatic Swift approach for forwarding cancellation to detached tasks.Both changed files have well-contained, complementary edits.
LlamaRuntimeCore.generate()gains a cooperative poll that mirrors the pre-existing one insummarize(), and the manager'swithTaskCancellationHandlerwrapper correctly connects outer-task cancellation to the detached task's flag. Thedefer-based lock and KV-trim cleanup runs on all exit paths including the new early-break, so no resource leaks or state corruption are introduced. Thetry Task.checkCancellation()call aftertask.valuemakes the existingcatch is CancellationErrorblock reachable again and ensures callers receive the expectedLlamaRuntimeError.cancelledvocabulary.No files require special attention.
Important Files Changed
Task.isCancelledpoll at the top of the generation sampling loop, enabling cooperative cancellation that releasesautocompleteLockearly instead of running the full prediction budget.generate()andsummarize()to usewithTaskCancellationHandlerso outer-task cancellation is forwarded to the detached inference task;Task.checkCancellation()aftertask.valuecorrectly surfaces the cancelled state asCancellationErrorfor the existing catch path.Sequence Diagram
sequenceDiagram participant OT as Outer Task (caller) participant MGR as LlamaRuntimeManager participant DT as Task.detached participant CORE as LlamaRuntimeCore OT->>MGR: generate(prompt, options) MGR->>DT: "Task.detached { core.generate(...) }" MGR->>MGR: "withTaskCancellationHandler { await task.value }" alt Normal path CORE-->>DT: sampleNext loop completes DT-->>MGR: task.value → full String MGR->>MGR: Task.checkCancellation() (no-op) MGR-->>OT: return full result else Cancellation path OT-xMGR: Task.cancel() (new keystroke / focus change) Note over MGR: onCancel fires → task.cancel() DT->>CORE: propagates cancel flag CORE->>CORE: "if Task.isCancelled { break }" Note over CORE: defer: trimKV, autocompleteLock.unlock() CORE-->>DT: return partial String DT-->>MGR: task.value → partial String MGR->>MGR: Task.checkCancellation() throws CancellationError MGR-->>OT: throw LlamaRuntimeError.cancelled endReviews (2): Last reviewed commit: "Surface cancellation as CancellationErro..." | Re-trigger Greptile