
Cancel llama autocomplete generation when the wrapping Task is cancelled #346

Merged: FuJacob merged 2 commits into main from runtime-cancellation-fix on May 28, 2026
FuJacob (Owner) commented May 28, 2026

Summary

Outer Task.cancel() was not reaching core.generate's sampling loop, so stale autocomplete generations ran to the full prediction budget while holding autocompleteLock. This propagates cancellation through Task.detached and adds a per-iteration cancel poll, freeing the lock so the next autocomplete (the one the user actually wants) can start ~100-400 ms sooner on Metal.

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build
# ** BUILD SUCCEEDED **

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing
# ** TEST BUILD SUCCEEDED **

swiftlint lint --quiet Cotabby/Services/Runtime/LlamaRuntimeCore.swift Cotabby/Services/Runtime/LlamaRuntimeManager.swift
# exit 0

Local xcodebuild test failed with a Team ID signing mismatch (mapping process and mapped file (non-platform) have different Team IDs) — known local-signing issue per .claude/CLAUDE.md. Will rely on CI to run the test suite.

Linked issues

None.

Risk / rollout notes

  • core.generate now returns whatever partial text it has accumulated when the wrapping Task is cancelled, matching the existing behavior of core.summarize. Callers (LlamaSuggestionEngine.generateSuggestion) already follow the runtime call with try Task.checkCancellation(), so the partial text is dropped before it can reach the UI.
  • engine.cancelSequence is deliberately not called for the persistent autocomplete sequence. The native cancellation flag is one-way: tripping it would force destroying and rebuilding the sequence on every cancellation and lose KV cache reuse. Per-iteration Task.isCancelled polling between sampleNext calls gives us ~10-15 ms cancellation granularity, which is fast enough.
  • A stacked branch batched-decode-refactor exists for follow-up work on true batched decode in CotabbyInference. That refactor needs a Phase 0 spike (measure actual Metal throughput for n_seq_max>1 batched vs separate contexts on the GGUF models we ship) before any code lands.
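The first bullet's contract (the runtime returns partial text, the caller discards it) can be sketched roughly as follows. This is an illustration only: `generatePartial` and this `generateSuggestion` are hypothetical stand-ins, not the actual Cotabby functions, and the 10 ms sleep merely imitates the cost of one token.

```swift
// Hypothetical stand-in for core.generate: on cancellation it stops early
// and returns whatever text it has accumulated so far.
func generatePartial(maxTokens: Int) async -> String {
    var text = ""
    for index in 0..<maxTokens {
        if Task.isCancelled { break }  // cooperative poll between "tokens"
        text += "t\(index) "
        // Pretend one token costs ~10 ms; a cancelled sleep throws, which we ignore.
        try? await Task.sleep(nanoseconds: 10_000_000)
    }
    return text
}

// Hypothetical caller, modeled on the description of
// LlamaSuggestionEngine.generateSuggestion in the bullet above.
func generateSuggestion(maxTokens: Int) async throws -> String {
    let text = await generatePartial(maxTokens: maxTokens)
    // If the surrounding Task was cancelled, this throws CancellationError,
    // so the partial `text` never reaches the caller (or the UI).
    try Task.checkCancellation()
    return text
}
```

Under these assumptions, a task that runs `generateSuggestion` and is then cancelled finishes with a `CancellationError` rather than a truncated string, whether the cancel arrives before or during the loop.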

Greptile Summary

This PR fixes stale autocomplete generations that previously ran to their full prediction budget while holding autocompleteLock, blocking the next (user-intended) autocomplete. Cancellation is now propagated end-to-end: the manager wraps both generate and summarize in withTaskCancellationHandler to forward outer-task cancellation to the detached inference task, and LlamaRuntimeCore.generate() gains a Task.isCancelled poll at the top of each sampling iteration (matching the pre-existing poll in summarize).

  • LlamaRuntimeManager — replaces the bare Task.detached {...}.value pattern with withTaskCancellationHandler + an onCancel block that calls task.cancel(), then calls Task.checkCancellation() after task.value resolves to surface the cancel as CancellationError for the existing catch path.
  • LlamaRuntimeCore.generate() — adds if Task.isCancelled { break } before each sampleNext call; on early exit the existing defer blocks still trim the KV cache and release autocompleteLock, so state remains consistent for the next request.
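A minimal sketch of the manager-side pattern described above, assuming a simplified shape: `runDetached`, `sampleLoop`, and `RuntimeError` are hypothetical names introduced for illustration (the real code uses `LlamaRuntimeManager` and `LlamaRuntimeError`, whose exact signatures are not shown in this PR description).

```swift
// Hypothetical stand-in for LlamaRuntimeError.
enum RuntimeError: Error, Equatable { case cancelled }

// Hypothetical stand-in for the sampling loop in core.generate.
func sampleLoop() async -> String {
    var text = ""
    for index in 0..<100 {
        if Task.isCancelled { break }  // poll before each "sample"
        text += "\(index) "
        try? await Task.sleep(nanoseconds: 5_000_000)
    }
    return text
}

// Runs `work` on a detached task and forwards outer cancellation to it.
func runDetached(_ work: @escaping @Sendable () async -> String) async throws -> String {
    let task = Task.detached { await work() }
    do {
        let text = await withTaskCancellationHandler {
            await task.value
        } onCancel: {
            task.cancel()  // a detached task would otherwise never see the cancel
        }
        // Surface the cancel as CancellationError instead of returning partial text.
        try Task.checkCancellation()
        return text
    } catch is CancellationError {
        throw RuntimeError.cancelled
    }
}
```

A caller might use it as `let outer = Task { try await runDetached { await sampleLoop() } }` and later call `outer.cancel()`. In this sketch the loop then stops at its next poll and `outer` fails with `RuntimeError.cancelled`.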

Confidence Score: 5/5

Safe to merge. The cancellation path correctly releases autocompleteLock via the existing defer blocks, and the withTaskCancellationHandler pattern is the idiomatic Swift approach for forwarding cancellation to detached tasks.

Both changed files have well-contained, complementary edits. LlamaRuntimeCore.generate() gains a cooperative poll that mirrors the pre-existing one in summarize(), and the manager's withTaskCancellationHandler wrapper correctly connects outer-task cancellation to the detached task's flag. The defer-based lock and KV-trim cleanup runs on all exit paths, including the new early break, so no resource leaks or state corruption are introduced. The try Task.checkCancellation() call after task.value makes the existing catch is CancellationError block reachable again and ensures callers receive the expected LlamaRuntimeError.cancelled error.
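The claim that `defer` cleanup also covers the early break can be illustrated with a small synchronous sketch. `FakeCore` is hypothetical: a Boolean stands in for autocompleteLock, a counter stands in for the KV trim, and an injected `isCancelled` closure replaces `Task.isCancelled` so that the example is deterministic.

```swift
// Hypothetical stand-in for LlamaRuntimeCore; not the real implementation.
final class FakeCore {
    private(set) var lockHeld = false  // stands in for autocompleteLock
    private(set) var trimCount = 0     // stands in for the KV-cache trim

    func generate(maxTokens: Int, isCancelled: () -> Bool) -> String {
        lockHeld = true
        // defer blocks run in reverse order on every exit from this scope,
        // including after the loop below is left via `break`.
        defer { lockHeld = false }
        defer { trimCount += 1 }

        var text = ""
        for index in 0..<maxTokens {
            if isCancelled() { break }  // early exit on cancellation
            text += "tok\(index) "
        }
        return text
    }
}
```

For example, if `isCancelled` starts returning true on its fourth call, `generate(maxTokens: 10, ...)` returns only the first three tokens, and afterwards `lockHeld` is false and `trimCount` is 1.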

No files require special attention.

Important Files Changed

Filename | Overview
--- | ---
Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds Task.isCancelled poll at the top of the generation sampling loop, enabling cooperative cancellation that releases autocompleteLock early instead of running the full prediction budget.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift | Refactors both generate() and summarize() to use withTaskCancellationHandler so outer-task cancellation is forwarded to the detached inference task; Task.checkCancellation() after task.value correctly surfaces the cancelled state as CancellationError for the existing catch path.

Sequence Diagram

sequenceDiagram
    participant OT as Outer Task (caller)
    participant MGR as LlamaRuntimeManager
    participant DT as Task.detached
    participant CORE as LlamaRuntimeCore

    OT->>MGR: generate(prompt, options)
    MGR->>DT: "Task.detached { core.generate(...) }"
    MGR->>MGR: "withTaskCancellationHandler { await task.value }"

    alt Normal path
        CORE-->>DT: sampleNext loop completes
        DT-->>MGR: task.value → full String
        MGR->>MGR: Task.checkCancellation() (no-op)
        MGR-->>OT: return full result
    else Cancellation path
        OT-xMGR: Task.cancel() (new keystroke / focus change)
        Note over MGR: onCancel fires → task.cancel()
        DT->>CORE: propagates cancel flag
        CORE->>CORE: "if Task.isCancelled { break }"
        Note over CORE: defer: trimKV, autocompleteLock.unlock()
        CORE-->>DT: return partial String
        DT-->>MGR: task.value → partial String
        MGR->>MGR: Task.checkCancellation() throws CancellationError
        MGR-->>OT: throw LlamaRuntimeError.cancelled
    end

Reviews (2): Last reviewed commit: "Surface cancellation as CancellationErro..."

Today, when the suggestion work controller cancels a parent Task (new
keystroke, focus change), the Task.detached call inside
LlamaRuntimeManager does not inherit cancellation, so core.generate runs
its full prediction budget while holding autocompleteLock. The next
autocomplete then waits ~100-400ms on Metal behind a result nobody
wants.
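The root cause described above, that a detached task does not inherit cancellation from the task that awaits it, can be reproduced in isolation. The function name below is hypothetical and the 50 ms sleep is an arbitrary stand-in for inference work.

```swift
// Returns whether the detached task observed a cancellation of its awaiter.
func detachedSeesCancellation() async -> Bool {
    // Task.detached creates an unstructured task with no parent, so
    // cancelling the task that awaits it does not set its cancel flag.
    let inner = Task.detached { () -> Bool in
        try? await Task.sleep(nanoseconds: 50_000_000)
        return Task.isCancelled
    }
    // Awaiting `value` does not forward cancellation either.
    return await inner.value
}
```

If an outer `Task { await detachedSeesCancellation() }` is cancelled right after creation, it still resolves to `false`: the inner task sleeps for its full duration and never observes the cancel, which is the behavior this PR addresses by forwarding the cancel explicitly.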

Two changes:

1. core.generate now polls Task.isCancelled between sampleNext calls and
   breaks early. This matches what summarize already does.

2. generate and summarize in the manager wrap the Task.detached await in
   withTaskCancellationHandler so an outer cancel actually reaches the
   detached task.

Engine-level cancelSequence is intentionally not called for the
autocomplete path: its cancelled flag is one-way, and tripping it would
require destroying and recreating the persistent sequence on every
cancellation, losing KV cache reuse. The Task.isCancelled poll between
samples gives us per-token (~10-15ms) granularity, which is fast enough.
Comment thread Cotabby/Services/Runtime/LlamaRuntimeManager.swift
FuJacob merged commit 67a0bc4 into main on May 28, 2026
4 checks passed
FuJacob deleted the runtime-cancellation-fix branch on May 28, 2026 09:33