Cancel llama autocomplete generation when the wrapping Task is cancelled by FuJacob · Pull Request #346 · FuJacob/cotabby

FuJacob · 2026-05-28T08:59:51Z

Summary

Outer Task.cancel() was not reaching core.generate's sampling loop, so stale autocomplete generations ran to the full prediction budget while holding autocompleteLock. This propagates cancellation through Task.detached and adds a per-iteration cancel poll, freeing the lock so the next autocomplete (the one the user actually wants) can start ~100-400 ms sooner on Metal.
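The underlying problem can be reproduced in isolation. The sketch below is not the project's code: `sampleTokens` and `runDetachedWithoutForwarding` are hypothetical names, and the sleep stands in for a real sampling step. It only illustrates that a task created with `Task.detached` does not inherit cancellation from the task that awaits it.

```swift
import Foundation

/// Hypothetical stand-in for a sampling loop: produces "tokens" until the
/// budget is used up or the *current* task is cancelled.
func sampleTokens(budget: Int) async -> Int {
    var produced = 0
    while produced < budget {
        if Task.isCancelled { break }                    // cooperative cancel poll
        try? await Task.sleep(nanoseconds: 2_000_000)    // pretend to sample one token
        produced += 1
    }
    return produced
}

/// Awaits a detached task without forwarding cancellation to it.
/// Cancelling the caller does not cancel `task`, so the loop above
/// never observes the cancel and runs its whole budget.
func runDetachedWithoutForwarding(budget: Int) async -> Int {
    let task = Task.detached { await sampleTokens(budget: budget) }
    return await task.value
}
```

Even with a cancel poll inside the loop, cancelling the outer task has no effect here, because the poll reads the detached task's own flag. That is why the PR needs both the poll and the forwarding.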

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build
# ** BUILD SUCCEEDED **

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing
# ** TEST BUILD SUCCEEDED **

swiftlint lint --quiet Cotabby/Services/Runtime/LlamaRuntimeCore.swift Cotabby/Services/Runtime/LlamaRuntimeManager.swift
# exit 0

Local xcodebuild test failed with a Team ID signing mismatch (mapping process and mapped file (non-platform) have different Team IDs), which is a known local-signing issue per .claude/CLAUDE.md. Will rely on CI to run the test suite.

Linked issues

None.

Risk / rollout notes

core.generate now returns whatever partial text it has accumulated when the wrapping Task is cancelled, matching the existing behavior of core.summarize. Callers (LlamaSuggestionEngine.generateSuggestion) already follow the runtime call with try Task.checkCancellation(), so the partial text is dropped before it can reach the UI.
engine.cancelSequence is deliberately not called for the persistent autocomplete sequence. The native cancellation flag is one-way: tripping it would force destroying and rebuilding the sequence on every cancellation and lose KV cache reuse. Per-iteration Task.isCancelled polling between sampleNext calls gives us ~10-15 ms cancellation granularity, which is fast enough.
A stacked branch batched-decode-refactor exists for follow-up work on true batched decode in CotabbyInference. That refactor needs a Phase 0 spike (measure actual Metal throughput for n_seq_max>1 batched vs separate contexts on the GGUF models we ship) before any code lands.
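The first two notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the real `LlamaRuntimeCore` or `LlamaSuggestionEngine`: `SketchCore`, `suggestion`, and the `sampleNext` closure are hypothetical, and an `NSLock` plays the role of `autocompleteLock`.

```swift
import Foundation

/// Hypothetical stand-in for the runtime core (not the real LlamaRuntimeCore).
final class SketchCore: @unchecked Sendable {
    private let lock = NSLock() // plays the role of autocompleteLock

    /// Samples up to `budget` tokens, polling for cancellation before each one.
    func generate(budget: Int, sampleNext: (Int) -> String) -> String {
        lock.lock()
        defer { lock.unlock() } // runs on every exit path, including the early break
        var text = ""
        for index in 0..<budget {
            if Task.isCancelled { break } // cooperative cancel poll between samples
            text += sampleNext(index)
        }
        return text // may be partial if the task was cancelled mid-loop
    }
}

/// Hypothetical caller in the spirit of generateSuggestion: the check after
/// the runtime call throws CancellationError, so partial text is never returned.
func suggestion(
    core: SketchCore,
    budget: Int,
    sampleNext: @Sendable (Int) -> String
) async throws -> String {
    let text = core.generate(budget: budget, sampleNext: sampleNext)
    try Task.checkCancellation()
    return text
}
```

Because the lock is released in a `defer`, the early `break` cannot leave it held, and the next request can acquire it as soon as the cancelled one returns.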

Greptile Summary

This PR fixes stale autocomplete generations that previously ran to their full prediction budget while holding autocompleteLock, blocking the next (user-intended) autocomplete. Cancellation is now propagated end-to-end: the manager wraps both generate and summarize in withTaskCancellationHandler to forward outer-task cancellation to the detached inference task, and LlamaRuntimeCore.generate() gains a Task.isCancelled poll at the top of each sampling iteration (matching the pre-existing poll in summarize).

LlamaRuntimeManager — replaces the bare Task.detached {...}.value pattern with withTaskCancellationHandler + an onCancel block that calls task.cancel(), then calls Task.checkCancellation() after task.value resolves to surface the cancel as CancellationError for the existing catch path.
LlamaRuntimeCore.generate() — adds if Task.isCancelled { break } before each sampleNext call; on early exit the existing defer blocks still trim the KV cache and release autocompleteLock, so state remains consistent for the next request.
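The manager-side pattern described above looks roughly like this. It is a sketch, assuming a generic helper named `runDetachedForwardingCancellation` (hypothetical; the PR edits `generate` and `summarize` directly) and a non-throwing work closure for brevity.

```swift
import Foundation

/// Hypothetical helper: runs `work` on a detached task, forwards cancellation
/// of the calling task to it, and surfaces the cancel as CancellationError.
func runDetachedForwardingCancellation<T: Sendable>(
    _ work: @escaping @Sendable () async -> T
) async throws -> T {
    let task = Task.detached { await work() }

    let value = await withTaskCancellationHandler {
        await task.value
    } onCancel: {
        // Runs when the calling task is cancelled; without this the
        // detached task would never see the cancel.
        task.cancel()
    }

    // The detached work may have returned a partial result after the cancel.
    // Throwing here keeps that partial value away from callers.
    try Task.checkCancellation()
    return value
}
```

The `onCancel` closure may run concurrently with the operation, so it should only do thread-safe work; calling `cancel()` on a `Task` handle satisfies that.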

Confidence Score: 5/5

Safe to merge. The cancellation path correctly releases autocompleteLock via the existing defer blocks, and the withTaskCancellationHandler pattern is the idiomatic Swift approach for forwarding cancellation to detached tasks.

Both changed files have well-contained, complementary edits. LlamaRuntimeCore.generate() gains a cooperative poll that mirrors the pre-existing one in summarize(), and the manager's withTaskCancellationHandler wrapper correctly connects outer-task cancellation to the detached task's flag. The defer-based lock and KV-trim cleanup runs on all exit paths, including the new early break, so no resource leaks or state corruption are introduced. The try Task.checkCancellation() call after task.value makes the existing catch is CancellationError block reachable again, so callers receive the LlamaRuntimeError.cancelled error they already expect.

No files require special attention.
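For readers unfamiliar with the catch path mentioned in the review, the mapping from `CancellationError` to a domain error can be sketched like this. The real `LlamaRuntimeError` is not shown in this PR page, so `SketchRuntimeError` and `mapErrors` below are hypothetical stand-ins with an assumed shape.

```swift
/// Hypothetical domain error; the real LlamaRuntimeError may differ.
enum SketchRuntimeError: Error, Equatable {
    case cancelled
    case failed(String)
}

/// Runs `body` and translates CancellationError into the domain's
/// `.cancelled` case, so callers only handle one error vocabulary.
func mapErrors<T: Sendable>(
    _ body: @Sendable () async throws -> T
) async throws -> T {
    do {
        return try await body()
    } catch is CancellationError {
        throw SketchRuntimeError.cancelled
    } catch {
        throw SketchRuntimeError.failed(String(describing: error))
    }
}
```

The `catch is CancellationError` branch only matters if something inside `body` actually throws `CancellationError`, which is what the added `try Task.checkCancellation()` provides.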

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds `Task.isCancelled` poll at the top of the generation sampling loop, enabling cooperative cancellation that releases `autocompleteLock` early instead of running the full prediction budget. |
| Cotabby/Services/Runtime/LlamaRuntimeManager.swift | Refactors both `generate()` and `summarize()` to use `withTaskCancellationHandler` so outer-task cancellation is forwarded to the detached inference task; `Task.checkCancellation()` after `task.value` correctly surfaces the cancelled state as `CancellationError` for the existing catch path. |

Sequence Diagram

sequenceDiagram
    participant OT as Outer Task (caller)
    participant MGR as LlamaRuntimeManager
    participant DT as Task.detached
    participant CORE as LlamaRuntimeCore

    OT->>MGR: generate(prompt, options)
    MGR->>DT: "Task.detached { core.generate(...) }"
    MGR->>MGR: "withTaskCancellationHandler { await task.value }"

    alt Normal path
        CORE-->>DT: sampleNext loop completes
        DT-->>MGR: task.value → full String
        MGR->>MGR: Task.checkCancellation() (no-op)
        MGR-->>OT: return full result
    else Cancellation path
        OT-xMGR: Task.cancel() (new keystroke / focus change)
        Note over MGR: onCancel fires → task.cancel()
        DT->>CORE: propagates cancel flag
        CORE->>CORE: "if Task.isCancelled { break }"
        Note over CORE: defer: trimKV, autocompleteLock.unlock()
        CORE-->>DT: return partial String
        DT-->>MGR: task.value → partial String
        MGR->>MGR: Task.checkCancellation() throws CancellationError
        MGR-->>OT: throw LlamaRuntimeError.cancelled
    end

Today, when the suggestion work controller cancels a parent Task (new keystroke, focus change), the Task.detached call inside LlamaRuntimeManager does not inherit cancellation, so core.generate runs its full prediction budget while holding autocompleteLock. The next autocomplete then waits ~100-400 ms on Metal behind a result nobody wants.

Two changes:

1. core.generate now polls Task.isCancelled between sampleNext calls and breaks early. This matches what summarize already does.
2. generate and summarize in the manager wrap the Task.detached await in withTaskCancellationHandler so an outer cancel actually reaches the detached task.

Engine-level cancelSequence is intentionally not called for the autocomplete path: its cancelled flag is one-way, and tripping it would require destroying and recreating the persistent sequence on every cancellation, losing KV cache reuse. The Task.isCancelled poll between samples gives us per-token (~10-15 ms) granularity, which is fast enough.

greptile-apps Bot reviewed May 28, 2026

Comment thread Cotabby/Services/Runtime/LlamaRuntimeManager.swift

Surface cancellation as CancellationError so the catch path stays reachable

9afae9e

FuJacob merged commit 67a0bc4 into main May 28, 2026
4 checks passed

FuJacob deleted the runtime-cancellation-fix branch May 28, 2026 09:33

FuJacob merged 2 commits into main from runtime-cancellation-fix

FuJacob commented May 28, 2026 • edited by greptile-apps Bot