feat: context compaction strategies for the react loop by yelkurdi · Pull Request #996 · generative-computing/mellea

yelkurdi · 2026-05-01T19:59:06Z

Component PR

Use this template when adding or modifying components in mellea/stdlib/components/.

Description

Link to Issue: Adds feature Compaction #1099.

Implementation Checklist

Protocol Compliance

parts() returns list of constituent parts (Components or CBlocks)
format_for_llm() returns TemplateRepresentation or string
_parse(computed: ModelOutputThunk) parses model output correctly into the specified Component return type

Content Blocks

CBlock used appropriately for text content
ImageBlock used for image content (if applicable)

Integration

Component exported in mellea/stdlib/components/__init__.py or, if you are adding a library of components, from your sub-module

Testing

Tests added to tests/components/
New code has 100% coverage
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

AI coding assistants used

Summary

Adds an optional CompactionStrategy to mellea.stdlib.frameworks.react with three concrete implementations (ClearAll, KeepLastN, LLMSummarize) under a new module mellea.stdlib.compaction. Strategies fire when the running context's token count crosses a configurable threshold, measured from the provider-reported usage on the last ModelOutputThunk.

Empirically on the BCP benchmark with Granite 4.1-8b, llm_summarize cuts inference cost by 23.7% and raises accuracy by 3.5 pp — compaction is a dual win, not a quality/cost trade-off.

Backwards compatible: compaction=None (default) preserves existing react() behavior exactly.

Motivation

Long agentic loops — especially retrieval-heavy ones — pile up tool responses. Each react iteration re-sends the full history to the model, so prompt-token cost grows quadratically and the loop can exhaust the model's context window before reaching a final answer. Compaction trims that history, lowering both the dollar cost of inference and the likelihood of hitting the context / timeout wall.
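To make the quadratic-growth claim concrete, here is a back-of-the-envelope sketch. The token counts (a 1,000-token initial prompt, 2,000 tokens appended per turn) and the helper name are illustrative assumptions, not measurements from the benchmark.

```python
def cumulative_prompt_tokens(turns: int, base: int = 1_000, per_turn: int = 2_000) -> int:
    """Total prompt tokens sent over `turns` iterations when every call
    re-sends the whole history (hypothetical numbers, no compaction)."""
    total = 0
    history = base
    for _ in range(turns):
        total += history     # the full history is the prompt for this call
        history += per_turn  # step output and tool response get appended
    return total


ten_turns = cumulative_prompt_tokens(10)
twenty_turns = cumulative_prompt_tokens(20)
```

In this toy model, doubling the number of turns from 10 to 20 quadruples the total prompt tokens (100,000 vs. 400,000), which is the growth pattern compaction is meant to cut short.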

On the BrowseCompPlus (BCP) benchmark with Granite 4.1-8b (131K context, 830 questions, loop_budget=400, per_question_timeout_s=1800, averaged across 3 runs per strategy):

Compaction	Est. cost @ 80% cache hit	Accuracy	Mean wall-clock per Q
`none`	$1801.0M	15.6%	724 s
`clear_all`	$1499.2M (−16.8%)	18.5% (+2.9 pp)	980 s
`llm_summarize`	$1373.9M (−23.7%)	19.1% (+3.5 pp)	709 s

Without compaction, inference cost on the 830-Q set is ~$1801M under an 80%-cache-hit pricing model, because each react iteration re-sends an ever-larger prompt (history + tool outputs) — the prompt-token total balloons to 5.4B before the 1800 s wall / context limit stops further progress.
llm_summarize is the clear winner: it cuts inference cost by 23.7% ($1801M → $1374M) and raises accuracy from 15.6% to 19.1% (+3.5 pp, +23% relative), at similar mean wall-clock. The summary shrinks the prompt enough that each iteration is actually faster, so both cost and quality improve.
clear_all reduces cost by 16.8% and lifts accuracy by 2.9 pp, but its +35% wall-clock makes it less attractive than llm_summarize when the judge budget allows the extra summarization call.

Hardware / infrastructure for the table above:

Model: ibm-granite/granite-4.1-8b (bf16, native 131K context)
Node: single 8× NVIDIA H100 80 GiB host (IBM LSF cluster, exclusive GPU allocation)
Inference: vLLM 0.19.1, 7 instances × TP=1 (1 GPU each), with --enable-prefix-caching on (Granite-family default). GPU 0 reserved for the BCP local search service (tevatron + BM25 over the BCP corpus).
Agent: Mellea react loop via OpenAIBackend → local vLLM, concurrency=56 (8 × num_vllm), loop_budget=400, per_question_timeout_s=1800. Threshold: --compaction_threshold 50000 tokens (~38% of 131K), --compaction_keep_n 5 where applicable.
Questions: all 830 from the BCP parquet, decrypted at load.
Judge: openai/gpt-oss-120b on the same node after the agent phase teardown. "Correct" = judge verdict matches the reference answer.
3 runs per strategy, independently seeded; numbers in the table are means.

Measurement was done with a separate BCP eval harness (forthcoming PR); the harness data is included here as motivating evidence — the compaction feature itself has no external runtime dependency.

Design

Protocol — `CompactionStrategy`

class CompactionStrategy(abc.ABC):
    def __init__(self, *, threshold: int = 0) -> None: ...
    def should_compact(self, context: ChatContext) -> bool: ...
    async def maybe_compact(
        self, context, *, backend=None, goal=None
    ) -> ChatContext: ...
    @abc.abstractmethod
    async def compact(
        self, context, *, backend=None, goal=None
    ) -> ChatContext: ...

Threshold is compared against total_tokens from the most recent ModelOutputThunk.usage — i.e. the prompt+completion of the last LLM call.
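The gate described above can be sketched roughly as follows. FakeThunk and FakeContext are hypothetical stand-ins for ModelOutputThunk and ChatContext, and the choice of `>=` at the boundary is an assumption; the real module may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeThunk:
    """Hypothetical stand-in for ModelOutputThunk: carries only a usage dict."""
    usage: Optional[dict] = None


@dataclass
class FakeContext:
    """Hypothetical stand-in for ChatContext: an ordered list of items."""
    items: list = field(default_factory=list)


def last_usage_tokens(ctx: FakeContext) -> Optional[int]:
    """Return total_tokens from the most recent thunk that reports usage."""
    for item in reversed(ctx.items):
        if isinstance(item, FakeThunk) and item.usage:
            return item.usage.get("total_tokens")
    return None


def should_compact(ctx: FakeContext, threshold: int) -> bool:
    """threshold=0 disables compaction; missing usage data also means no-op."""
    if threshold <= 0:
        return False
    tokens = last_usage_tokens(ctx)
    # Assumption: the gate fires once usage reaches the threshold (>=).
    return tokens is not None and tokens >= threshold


ctx = FakeContext(["user question", FakeThunk({"total_tokens": 60_000})])
fires = should_compact(ctx, threshold=50_000)
```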

Concrete strategies

ClearAll(threshold) — discard everything after the first ReactInitiator, keeping only the system prefix. Cheapest, most aggressive; model must rebuild context each cycle.
KeepLastN(keep_n, threshold) — retain prefix + the last keep_n body components. Middle-ground; preserves recent tool outputs.
LLMSummarize(keep_n, threshold) — summarize the older body components via an additional LLM call, keep the last keep_n verbatim. Highest-fidelity, most expensive per-fire; empirically the best for Granite 4.1 on BCP (see table).
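As a rough illustration of the two non-LLM strategies, the sketch below applies them to a plain list of strings. The "INITIATOR" marker and the helper names are hypothetical; the real strategies operate on ChatContext components.

```python
from typing import List, Tuple


def split_prefix(items: List[str], initiator: str = "INITIATOR") -> Tuple[List[str], List[str]]:
    """Split into (prefix up to and including the first initiator, body)."""
    if initiator in items:
        cut = items.index(initiator) + 1
        return items[:cut], items[cut:]
    # Assumption: without an initiator nothing is pinned.
    return [], list(items)


def clear_all(items: List[str]) -> List[str]:
    """Keep only the pinned prefix, drop the whole body."""
    prefix, _ = split_prefix(items)
    return prefix


def keep_last_n(items: List[str], keep_n: int) -> List[str]:
    """Keep the pinned prefix plus the last keep_n body items."""
    prefix, body = split_prefix(items)
    # body[-0:] would return the whole body, so guard keep_n == 0 explicitly.
    return prefix + (body[-keep_n:] if keep_n > 0 else [])


history = ["system", "INITIATOR", "tool 1", "tool 2", "tool 3", "tool 4"]
cleared = clear_all(history)
windowed = keep_last_n(history, 2)
```

LLMSummarize follows the same split, except that the dropped part of the body is replaced by a model-written summary instead of being discarded.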

Integration site (react.py, +13 lines)

async def react(
    ..., compaction: CompactionStrategy | None = None,
) -> tuple[ComputedModelOutputThunk[str], ChatContext]:
    ...
    while turn_num < loop_budget or loop_budget == -1:
        step, next_context = await mfuncs.aact(...)
        ...
        if is_final:
            return step, context

        # Compact AFTER the final-answer check so terminal turns skip it.
        if compaction is not None:
            context = await compaction.maybe_compact(
                context, backend=backend, goal=goal
            )

Design decision — compact after the is_final check: a terminal turn has no next iteration to benefit from compaction, and for LLMSummarize this saves a full LLM call per question that would otherwise be wasted. Flagged inline so future refactors don't regress it.
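A toy loop shows the effect of this ordering. CountingCompactor and toy_loop are hypothetical names used only for illustration; the point is that a run ending on turn 3 triggers the compaction hook twice, not three times.

```python
import asyncio


class CountingCompactor:
    """Hypothetical strategy that only records how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    async def maybe_compact(self, context: list) -> list:
        self.calls += 1
        return context


async def toy_loop(final_turn: int, compaction: CountingCompactor) -> int:
    context: list = []
    turn = 0
    while True:
        turn += 1
        context = context + [f"step {turn}"]
        if turn == final_turn:
            return turn  # terminal turn: return before the compaction hook
        context = await compaction.maybe_compact(context)


compactor = CountingCompactor()
turns_taken = asyncio.run(toy_loop(3, compactor))
```

A check of this shape against the real react() could serve as a guard on the hook placement.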

Non-goals

No changes to mellea/core/ — the feature is purely additive under mellea/stdlib/, reusing the existing ChatContext + ModelOutputThunk surfaces.

Test plan

uv run pytest test/stdlib/test_compaction.py — 26 tests, 11 s locally, no external services required

The test file uses DummyThunk with synthetic usage dicts — no real backend needed. Coverage includes:

Each strategy's compact() output shape
Token-count comparison (below, at, above threshold)
threshold=0 disables compaction
Empty-context / no-thunk-with-usage returns False
Prefix preservation (first ReactInitiator never dropped)
LLMSummarize error handling when backend/goal omitted

Pitfalls (flagged here so reviewers know what to watch for)

Backends that don't populate mot.usage silently disable compaction. All mainline Mellea backends set it (OpenAI, HF, Ollama, LiteLLM), and AGENTS.md §5 codifies this as a requirement. If a new backend lands without usage population, its users will see compaction become a no-op with no loud error.
One-turn lag in the token-count measurement. The count reflects the prompt+completion of the LLM call that just completed — tool responses appended since are not yet counted. In practice negligible (a typical tool response is <5K tokens relative to a 50K+ threshold). Becomes visible only if a single tool response is very large (e.g. a raw document dump). Documented in the compaction.py docstring.
_last_usage_tokens returns None before the first model call. should_compact then returns False — no-op, not an error. Matters for strategies with very low thresholds where the first LLM call itself crosses the bar.
LLMSummarize needs backend + goal at compact() time. These are forwarded from the react call site. If a user constructs LLMSummarize and calls compact() directly (outside react), they must pass both. The docstring says so; reviewers may prefer a runtime check.
Strategies that drop all body components can leave a thunk-less context. On the next check, _last_usage_tokens returns None and compaction correctly doesn't re-fire immediately. This is the behavior we want, but it is worth verifying that it hasn't regressed.
Token-count threshold is absolute, not a percentage of max context. A possible enhancement is to express the threshold as a percentage of the model's max context length rather than an absolute token count. This would require backends to reliably report that limit (e.g. 131K for Granite 4.1), after which --compaction_threshold 0.5 would read as "fire at 50% of context".
Compaction firing point is after the is_final check, not before. Easy to accidentally swap in a future refactor. Code comment calls this out; ideally a regression test guards it, but the current tests focus on the strategies themselves rather than react-loop placement. Worth adding if the reviewer flags it.
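For the percentage-threshold idea, one possible shape is a small resolver that turns a fraction into an absolute token count. This is a sketch of a feature that does not exist in the PR; resolve_threshold is a hypothetical helper, and reading "131K" as 131,072 tokens is an assumption.

```python
def resolve_threshold(threshold: float, max_context_tokens: int) -> int:
    """Hypothetical helper: values in (0, 1] are fractions of the context
    window, larger values are absolute token counts, 0 stays 'disabled'."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    if 0 < threshold <= 1:
        return int(threshold * max_context_tokens)
    return int(threshold)


half_window = resolve_threshold(0.5, 131_072)
absolute = resolve_threshold(50_000, 131_072)
```

One design wrinkle: a value of exactly 1 is ambiguous (one token or 100% of the window), so a real implementation might prefer a separate parameter for the fractional form.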

Files

 mellea/stdlib/compaction.py
 mellea/stdlib/frameworks/react.py
 test/stdlib/test_compaction.py
 1 file changed, 2 added

Adds CompactionStrategy abstraction and KeepLastN implementation to mellea/stdlib/compaction.py, wires an optional compaction parameter into the react() loop, and adds full test coverage in test/stdlib/test_compaction.py.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

Switches `CompactionStrategy.threshold` from a component-count trigger to a token-count trigger, read from the most recent `ModelOutputThunk.usage` populated by the backend. This aligns compaction with the real constraint (context size) and sidesteps per-backend tokenizer dependencies by using provider-reported usage; the trade-off is a one-turn lag since usage is recorded at the end of each model call.

Also reorders the react loop so compaction runs after the final-answer check, skipping wasted work (and a wasted LLM call for LLMSummarize) on terminal turns.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

Move the compaction strategies alongside the react framework they serve:
- mellea/stdlib/compaction.py -> mellea/stdlib/frameworks/react_compaction.py
- test/stdlib/test_compaction.py -> test/stdlib/frameworks/test_react_compaction.py

Imports and module docstrings updated accordingly.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

yelkurdi · 2026-05-01T20:00:35Z

please add @ramon-astudillo as an observer

psschwei · 2026-05-04T15:59:39Z

(original PR body)

Files

 mellea/stdlib/frameworks/react_compaction.py
 mellea/stdlib/frameworks/react.py
 test/stdlib/frameworks/test_react_compaction.py
 1 file modified, 2 added

github-actions · 2026-05-04T16:03:11Z

The PR description has been updated. Please fill out the template for your PR to be reviewed.

psschwei · 2026-05-04T16:05:12Z

@yelkurdi I updated the PR body so that the update-pr-body check would pass (and copied your original body into a comment). Now that the check is passing, feel free to re-edit to include your original comments in the appropriate section.

The docstring quality gate (tooling/docs-autogen/audit_coverage.py --quality --threshold 100) requires each documented symbol to have its own Args/Returns sections; inheritance from the abstract parent is not consulted. Six issues were reported against the compact() overrides on ClearAll, KeepLastN, and LLMSummarize.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

ramon-astudillo · 2026-05-19T21:32:10Z

relevant related discussion #1099

yelkurdi · 2026-05-19T21:44:11Z

Will update according to #1099


…react

Replaces the original async ``react_compaction`` strategies (ClearAll, KeepLastN, LLMSummarize) with a generic, sync ``Compactor`` protocol that operates on any ``Context``. ``ReACT`` and ``ChatContext`` are rewired around the new protocol; sample callers, tests, and docs are updated.

Squash of 29 Mellea-side commits from context_compaction_for_react_2; the BCP eval harness commits in that branch are intentionally excluded.

mellea/stdlib/context/ becomes a package
- Compactor protocol: sync ``compact(ctx, *, backend=None) -> Context``
- WindowCompactor(size, pin_predicate)
    keep last-N body components; ``size=0`` clears the body and retains only the pinned prefix
- ThresholdCompactor(inner, threshold)
    token-gated wrapper that reads cumulative context size from the most recent ModelOutputThunk's ``generation.usage`` and forwards to ``inner.compact`` only above the gate
- LLMSummarizeCompactor(keep_n, pin_predicate, prompt_template)
    summarizes old body components via the backend; the (async) backend call is hidden behind a sync ``compact()`` via ``_run_coro_blocking`` so the protocol stays sync
- PinPredicate API: ``pin_nothing``, ``pin_system``, ``pin_system_and_initial_user``; chat compactors compose freely

mellea/stdlib/frameworks/react.py
- ``react()`` gains a ``compactor: Compactor | None = None`` per-turn hook; invoked once after each tool observation
- The old ``react_compaction`` module is removed

mellea/stdlib/components/react.py
- ``pin_react_initiator``: a PinPredicate that pins everything up to and including the first ``ReactInitiator``
- ``react_summary_prompt(goal=None, max_tokens_hint=None)``: factory that returns a research-flavoured summary prompt template (with the {conversation} placeholder LLMSummarizeCompactor expects). Optional ``GOAL: <goal>`` line and optional ``- Be at most ~N tokens`` bullet when callers want goal anchoring or length-cap hints.

mellea/stdlib/context/chat.py
- ``ChatContext()`` defaults to no compactor (full history); pass ``compactor=`` or ``window_size=`` for opt-in compaction. Matches upstream main's window_size=None unbounded semantics.

Test coverage
- test/stdlib/test_compactor.py (~500 LOC): protocol semantics; Window / Threshold / LLMSummarize behaviours; pin-predicate edge cases; ``size=0`` collapse; threshold gate edge cases
- test/stdlib/frameworks/test_react_framework.py (~210 LOC): react() per-turn hook integration + react_summary_prompt (default, goal interpolation, brace escaping, max_tokens_hint bullet ordering, LLMSummarizeCompactor template-validation)
- test/stdlib/test_base_context.py: pin-non-compacting ChatContext in the session-copy operations test (matches new opt-in default)

Net diff: 17 files, +381 / -896 lines (drops the old react_compaction.py and its dedicated test file).

Backwards-compatible default behaviour preserved: bare ``ChatContext()`` retains full history; ``react()`` without ``compactor=`` behaves identically to today; ``LLMSummarizeCompactor`` defaults to a generic conversation-summary prompt unless callers opt in to the research-flavoured variant via ``react_summary_prompt``.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

planetf1

Thanks for this — the Compactor architecture is really well thought through, and the test coverage in test_compactor.py is thorough. A few things to sort out before this lands, mostly around LLMSummarizeCompactor edge cases and a couple of quick doc fixes.

planetf1 · 2026-05-26T11:46:53Z

 def test_context_construction():
    context_construction(SimpleContext)
+    # ChatContext defaults to WindowCompactor(5); a single add stays well under
+    # the window so the linked-list shape is identical to the pre-compaction


[WARNING] This comment says "ChatContext defaults to WindowCompactor(5)" but the actual default is compactor=None — no compaction at all. test_default_has_no_compactor in test_compactor.py asserts this directly. Looks like a leftover from an earlier design iteration.

@planetf1 you are correct, it is a leftover from an earlier design iteration; it will be corrected.

planetf1 · 2026-05-26T11:46:53Z

+  directly between turns, e.g. when compaction is exposed to the model as a
+  tool.
+
+See ``docs/rewrite/`` for full usage examples.


[WARNING] The module docstring points to docs/rewrite/ for usage examples, but that directory only has session_deepdive/ and streaming/ — nothing compaction-related. The actual examples are in docs/examples/context/. Worth fixing so the breadcrumb is useful.

good catch, will be fixed to point to docs/examples/context/

planetf1 · 2026-05-26T11:46:53Z

+        return pool.submit(asyncio.run, coro).result()
+
+
+class LLMSummarizeCompactor:


[WARNING] Worth calling out in the class docstring that this compactor can't be wired into ChatContext(compactor=...) directly. ChatContext.add() calls compact(new) without a backend, so once the body grows past keep_n every add() raises ValueError. The natural instinct after seeing ChatContext(compactor=WindowCompactor(...)) work is to try the same here, and the error message won't point you at the fix. A one-liner — "use via react(compactor=...) only" — would save a lot of confusion.

@planetf1 how about adding a hard block, such as limiting the types of compactors ChatContext can take directly, e.g. (1) BasicCompactor (does not require a backend) and (2) ThresholdCompactor (an algorithmic compactor that wraps LLMSummarizeCompactor)?

Sure, that's better than a docstring.

Great, I made the following changes to create a hard block at construction time:

Added InlineCompactor marker; ChatContext(compactor=...) rejects non-InlineCompactor instances with a TypeError pointing at react(compactor=...), ThresholdCompactor, or manual compact().

WindowCompactor and ThresholdCompactor inherit the marker; LLMSummarizeCompactor does not (would call backend every add()). ThresholdCompactor(LLMSummarizeCompactor(...)) is the recommended wiring — it gates by token usage, so backend calls fire sparsely rather than per-add().

LLMSummarizeCompactor now requires default_backend at construction; compact() falls back to it, with call-time backend= overriding.
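The construction-time guard described above might look roughly like this. The class names mirror the ones in the thread, but the bodies are toy re-implementations written for illustration, not the code in the PR.

```python
class InlineCompactor:
    """Marker base (sketch): safe to run on every ChatContext.add()."""

    def compact(self, ctx, *, backend=None):
        raise NotImplementedError


class WindowCompactor(InlineCompactor):
    """Toy window: keeps the last `size` items of a plain list."""

    def __init__(self, size: int) -> None:
        self.size = size

    def compact(self, ctx, *, backend=None):
        return ctx[-self.size:] if self.size > 0 else []


class LLMSummarizeCompactor:
    """Deliberately NOT an InlineCompactor: it would call a backend per add()."""

    def __init__(self, default_backend) -> None:
        self.default_backend = default_backend

    def compact(self, ctx, *, backend=None):
        backend = backend or self.default_backend  # call-time backend wins
        return ctx  # summarization elided in this sketch


class ChatContext:
    """Toy context that only demonstrates the isinstance guard."""

    def __init__(self, compactor=None) -> None:
        if compactor is not None and not isinstance(compactor, InlineCompactor):
            raise TypeError(
                f"{type(compactor).__name__} cannot be attached directly; "
                "use react(compactor=...), wrap it in ThresholdCompactor, "
                "or call compact() manually."
            )
        self.compactor = compactor


accepted = ChatContext(compactor=WindowCompactor(5))
try:
    ChatContext(compactor=LLMSummarizeCompactor(default_backend=object()))
    rejected = False
except TypeError:
    rejected = True
```

Failing at construction puts the error next to the wiring mistake rather than at the first add() that crosses keep_n.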

planetf1 · 2026-05-26T11:46:53Z

+    from mellea.stdlib.context.compactor import Compactor
+
+
+class ChatContext(Context):


[WARNING] ChatContext(window_size=N) used to keep the full history in as_list() and only apply the window in view_for_generation(). Now as_list() itself is truncated to N items. For the model, nothing changes — but callers inspecting as_list() directly (e.g. session.py:376 uses it for interaction_count) will silently get a capped value. The PR description says "backwards compatible" which is true for the model's view, but not for as_list(). Worth a migration note in the docstring at minimum.

Added better tracking of interaction counts, which required small modifications in session.py:

ChatContext docstring: added a Note flagging that as_list() now reflects post-compaction state (used to keep full history); points callers at out-of-band turn tracking.

MelleaSession: turned ctx into a property whose setter increments _interaction_count; reset() / __init__ bypass via _ctx so lifecycle events don't pollute the count. cleanup() publishes self._interaction_count instead of len(self.ctx.as_list()).

SessionCleanupPayload.interaction_count: docstring rewritten to "turns committed" semantics (was "items in context").
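A minimal sketch of the property-setter counting scheme, assuming a simplified session class. SessionSketch is a hypothetical stand-in for MelleaSession, and whether the real reset() also clears the count is not stated in the thread, so this sketch leaves the count alone.

```python
class SessionSketch:
    """Hypothetical stand-in for MelleaSession, showing only the counter."""

    def __init__(self, ctx: list) -> None:
        self._ctx = ctx  # bypass the setter: construction is not a turn
        self._interaction_count = 0

    @property
    def ctx(self) -> list:
        return self._ctx

    @ctx.setter
    def ctx(self, new_ctx: list) -> None:
        self._ctx = new_ctx
        self._interaction_count += 1  # every committed turn is counted

    def reset(self, ctx: list) -> None:
        self._ctx = ctx  # lifecycle event: bypass the setter as well


session = SessionSketch([])
session.ctx = ["turn 1"]
session.ctx = ["summary of turn 1", "turn 2"]
session.ctx = ["turn 3"]  # compacted: context is shorter, count still grows
count = session._interaction_count
```

The count stays at 3 even though the context holds a single item, which is the property the cleanup payload relies on.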

planetf1 · 2026-05-26T11:46:53Z

+        if len(body) <= self.keep_n:
+            return ctx
+
+        return _run_coro_blocking(self._async_compact(ctx, backend))


[WARNING] If the backend call raises here (rate limit, network error, timeout), the exception propagates through _run_coro_blocking and kills the entire react loop. For a long-running research task, that's quite painful. Since compaction is best-effort by nature, wrapping this in a try/except, logging a warning, and returning ctx unchanged would be much more robust.

Done. Compaction is now wrapped in try/except and is best-effort: when the backend fails, it logs a warning and returns ctx unchanged, so the react loop can continue.
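The best-effort pattern can be sketched as below. best_effort_compact and the logger name are hypothetical; the summarize callable stands in for the backend-backed compaction step.

```python
import logging

logger = logging.getLogger("compaction_sketch")


def best_effort_compact(ctx, summarize):
    """Run `summarize(ctx)`; on any failure keep the original context."""
    try:
        return summarize(ctx)
    except Exception as exc:  # rate limit, network error, timeout, ...
        logger.warning("Compaction failed, keeping context unchanged: %s", exc)
        return ctx


def flaky_backend(ctx):
    raise TimeoutError("simulated backend timeout")


original = ["turn 1", "turn 2"]
after_failure = best_effort_compact(original, flaky_backend)
after_success = best_effort_compact(original, lambda c: c[-1:])
```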

planetf1 · 2026-05-26T11:46:53Z

+)
+
+
+def _run_coro_blocking(coro):  # type: ignore[no-untyped-def]


[WARNING] When called from inside react() (which is async), this blocks the event loop thread for the full duration of the summary LLM call. Nothing else on the loop — telemetry flushers, cancellation signals, other sessions — can make progress during that time. It's documented as "fine for a serial ReACT loop", but worth noting that backends using per-loop resources (e.g. httpx.AsyncClient) may behave unexpectedly. A stronger docstring warning here, and ideally an async variant on the protocol long-term, would be the right direction.

Done, beefed up the warning message as you indicated.
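For readers unfamiliar with the pattern, here is a minimal sketch of running a coroutine to completion from sync code, built around the pool.submit(asyncio.run, coro).result() line shown in the diff. The surrounding structure is an assumption, not the PR's actual helper.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coro_blocking(coro):
    """Run `coro` to completion from sync code and return its result."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here: execute the coroutine on a fresh loop
    # in a worker thread. The calling thread (and therefore its event loop)
    # is blocked until the result arrives.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def answer() -> int:
    await asyncio.sleep(0)
    return 42


async def caller_inside_loop() -> int:
    return run_coro_blocking(answer())


from_sync = run_coro_blocking(answer())
from_async = asyncio.run(caller_inside_loop())
```

Because the coroutine runs on a different loop in the second path, objects bound to the caller's loop cannot be reused inside it, which is the per-loop resource concern raised in the review.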

planetf1 · 2026-05-26T11:46:53Z

+            elif isinstance(c, Message):
+                lines.append(f"{c.role}: {c.content}")
+            elif isinstance(c, ModelOutputThunk):
+                lines.append(f"assistant: {c.value}")


[WARNING] The rendering loop only handles c.content and c.value. Images, attached documents, and tool-call-only thunks (where value is None) are silently dropped — the latter render as "assistant: None". Users running multimodal or tool-heavy sessions will get incomplete summaries without any warning. Guarding with c.value or '' would at least fix the None case, and a docstring note about lossy summarisation for multimodal content would set expectations.

Augmented the block with the following:

Message: append "[N image(s) attached]" / "[M document(s) attached]" markers; bytes not reproduced.

ModelOutputThunk: render "assistant called tools: name({args}), ..." for tool-call-only thunks (value=None); empty thunks (no value, no tool calls) are skipped entirely. Eliminates "assistant: None".

Catch-all else: "<TypeName: content>" or "<TypeName>" instead of the default object repr (e.g. ReactInitiator when not pinned).
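The rendering rules above could be sketched like this. Msg and Thunk are hypothetical stand-ins for Message and ModelOutputThunk, and the exact marker wording in the real code may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Msg:
    """Hypothetical stand-in for Message."""
    role: str
    content: str
    images: list = field(default_factory=list)


@dataclass
class Thunk:
    """Hypothetical stand-in for ModelOutputThunk."""
    value: Optional[str] = None
    tool_calls: dict = field(default_factory=dict)  # tool name -> args


def render(items: list) -> list:
    lines = []
    for c in items:
        if isinstance(c, Msg):
            line = f"{c.role}: {c.content}"
            if c.images:
                # Only a marker is emitted; image bytes are not reproduced.
                line += f" [{len(c.images)} image(s) attached]"
            lines.append(line)
        elif isinstance(c, Thunk):
            if c.value:
                lines.append(f"assistant: {c.value}")
            elif c.tool_calls:
                calls = ", ".join(f"{n}({a})" for n, a in c.tool_calls.items())
                lines.append(f"assistant called tools: {calls}")
            # Thunks with neither value nor tool calls are skipped entirely.
        else:
            lines.append(f"<{type(c).__name__}>")
    return lines


rendered = render([
    Msg("user", "what is in this picture?", images=[b"png-bytes"]),
    Thunk(tool_calls={"search": {"q": "cats"}}),
    Thunk(),
    Thunk(value="It is a cat."),
])
```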

planetf1 · 2026-05-26T11:46:53Z

+                lines.append(str(getattr(c, "content", c)))
+
+        prompt = self.prompt_template.format(conversation="\n".join(lines))
+        result, _ = await mfuncs.aact(


[WARNING] The two equivalent mfuncs.aact() calls in react.py (lines 93 and 123) both pass silence_context_type_warning=True for internal framework calls. Without it here, every compaction fires a context-type-mismatch warning into the logs — potentially quite noisy with ThresholdCompactor triggering regularly in a long session.

added silence_context_type_warning=True parameter

planetf1 · 2026-05-26T11:46:53Z

+            a generic conversation-summary template.
+    """
+
+    def __init__(


[WARNING] There's no model_options parameter here, so the summary call uses the backend's default max_tokens. On many local backends that's 256–512, which will silently truncate a summary of a long conversation. react_summary_prompt(max_tokens_hint=N) adds a text nudge to the prompt but doesn't set the actual API parameter. Adding model_options: dict | None = None to the constructor and forwarding it to mfuncs.aact would let users enforce a real token budget.

model_options added. It is a good idea; I have actually been using it in my separate BCP eval harness.

…ples

- Correct comment in test_base_context.py: ChatContext defaults to compactor=None, not WindowCompactor(5).
- Point compactor.py module docstring to docs/examples/context/ instead of the nonexistent docs/rewrite/.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

…n LLMSummarize

Restrict ChatContext(compactor=...) to compactors that inherit InlineCompactor. Wiring a backend-requiring compactor (e.g. LLMSummarizeCompactor) directly would invoke the backend on every add(); the new isinstance guard rejects that with a TypeError pointing at react(compactor=...), ThresholdCompactor, or manual compact() as alternatives. ThresholdCompactor remains accepted regardless of its inner -- it gates by token usage, so backend calls are sparse rather than per-add.

LLMSummarizeCompactor now takes a required default_backend at construction and falls back to it when compact() is invoked without an explicit backend. A backend kwarg passed to compact() still overrides the default for that call. This makes ThresholdCompactor(LLMSummarizeCompactor(...)) work end-to-end when attached to ChatContext: at trip time, the inner uses its stored default_backend.

InlineCompactor carries the compact() signature (raising NotImplementedError) so it's a usable static type without cast() workarounds. Specialized to ChatContext rather than parameterized over Context -- ThresholdCompactor's prior generic-T signature was unexercised, so the simpler shape applies.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
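As a rough sketch of the guard and the default-backend fallback described in this commit: the classes below are simplified stand-ins that reuse the names from the message but operate on plain lists of strings. The word-count token estimate, the callable backend, and ThresholdCompactor inheriting InlineCompactor are all assumptions for illustration; the PR measures provider-reported usage instead.

```python
class Compactor:
    def compact(self, items, backend=None):
        raise NotImplementedError


class InlineCompactor(Compactor):
    """Marker base: cheap enough to run on every add()."""


class LLMSummarizeCompactor(Compactor):
    def __init__(self, default_backend):
        self.default_backend = default_backend

    def compact(self, items, backend=None):
        # An explicit backend overrides the stored default for this call.
        chosen = backend if backend is not None else self.default_backend
        return [chosen(items)]


class ThresholdCompactor(InlineCompactor):
    def __init__(self, inner, threshold):
        self.inner = inner
        self.threshold = threshold

    def compact(self, items, backend=None):
        # Stand-in for provider-reported token usage: count words.
        used = sum(len(item.split()) for item in items)
        if used < self.threshold:
            return items
        return self.inner.compact(items, backend=backend)


class ChatContext:
    def __init__(self, compactor=None):
        if compactor is not None and not isinstance(compactor, InlineCompactor):
            raise TypeError(
                "compactor must inherit InlineCompactor; wrap it in "
                "ThresholdCompactor or call compact() manually"
            )
        self.compactor = compactor
        self.items = []

    def add(self, item):
        self.items = self.items + [item]
        if self.compactor is not None:
            self.items = self.compactor.compact(self.items)
```

With this shape, the backend is only called once the threshold trips, rather than on every add().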

…antic shift

Reviewer flagged that ChatContext(window_size=N) used to keep full history in as_list() and only window view_for_generation(); the per-turn compactor work made as_list() itself reflect the post-compaction state, which silently undercounted in MelleaSession.cleanup() (interaction_count = len(as_list())).

- ChatContext docstring: add a Note describing the semantic shift and pointing callers at out-of-band turn tracking when full counts matter.
- MelleaSession: turn ctx into a property; the setter increments _interaction_count, and reset() / __init__ bypass via _ctx so lifecycle events don't pollute the count. cleanup() now publishes self._interaction_count, stable under any compaction strategy.
- SessionCleanupPayload: rewrite the interaction_count field doc to match the new semantics ("turns committed" rather than "items in context").

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
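The property-based counter described for MelleaSession might look roughly like the sketch below. `SessionSketch` is a hypothetical stand-in, contexts are plain lists, and cleanup() returning a dict is an assumption; only the setter/bypass pattern is taken from the commit message.

```python
class SessionSketch:  # hypothetical stand-in for MelleaSession
    def __init__(self, ctx):
        self._ctx = ctx  # bypass the setter: construction is not a turn
        self._interaction_count = 0

    @property
    def ctx(self):
        return self._ctx

    @ctx.setter
    def ctx(self, new_ctx):
        # Every committed turn goes through the setter, so the count
        # does not depend on how many items survive compaction.
        self._ctx = new_ctx
        self._interaction_count += 1

    def reset(self, fresh_ctx):
        self._ctx = fresh_ctx  # lifecycle event: bypass the setter

    def cleanup(self):
        return {"interaction_count": self._interaction_count}
```

Counting in the setter keeps the published number at "turns committed" even when a compactor collapses the context to a single summary item.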

Reviewer flagged that an exception from the summarisation backend call (rate limit, network error, timeout) propagates through _run_coro_blocking and kills the entire react loop. For long-running research tasks that's quite painful, especially since compaction is best-effort by nature.

LLMSummarizeCompactor.compact() now wraps _run_coro_blocking in a try/except Exception. On failure it logs a WARNING via MelleaLogger with the exception type and message, then returns ctx unchanged. The next compact() invocation retries; the conversation keeps growing in the meantime. BaseException (KeyboardInterrupt, SystemExit) still propagates so users can interrupt a stuck loop.

Added a regression test with a backend that raises RuntimeError on every call: compact() returns the same ctx, original history is intact, and a warning naming the exception type is logged.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
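The best-effort behaviour described here can be sketched as below. The function name `compact_best_effort`, the `summarise` callable, and the use of the standard `logging` module (in place of MelleaLogger) are assumptions for the example.

```python
import logging

logger = logging.getLogger("compaction_sketch")


def compact_best_effort(ctx, summarise):
    """Return the compacted context, or ctx unchanged if summarising fails."""
    try:
        return summarise(ctx)
    except Exception as exc:
        # Catches rate limits, network errors, timeouts and similar failures.
        # KeyboardInterrupt and SystemExit derive from BaseException,
        # so they are not caught here and still propagate.
        logger.warning(
            "Compaction failed (%s: %s); keeping context unchanged",
            type(exc).__name__,
            exc,
        )
        return ctx
```

Because the original object is returned on failure, the next compaction attempt sees the full, still-growing history and can retry.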

Reviewer flagged that _run_coro_blocking, when invoked from inside an async caller like react(), blocks the entire event loop for the full duration of the wrapped coroutine, not just the calling task. The previous "fine for a serial ReACT loop" wording undersold the implications.

Beefed-up Warning: block now spells out:

- The loop, not just the thread, is stalled: callbacks, telemetry, cancellation signals, other sessions sharing the loop, keepalives are all blocked.
- Backends with per-loop resources (notably httpx.AsyncClient) may behave unexpectedly because the coroutine runs on a fresh loop in a worker thread; documents the typical failure signatures.
- Long-term direction is an async variant on the Compactor protocol so callers can await natively.

Docs-only; no behavior change.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
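To make the stall concrete, here is a small sketch of the general pattern (run the coroutine on a fresh loop in a worker thread and wait for it synchronously). `run_coro_blocking`, `slow_summary`, and `demo` are hypothetical names; the body is an assumption about the shape of such a helper, not the PR's implementation.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coro_blocking(coro):
    """Run coro to completion from sync code, even inside a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop running: run directly
    # A loop is running in this thread: use a fresh loop in a worker thread.
    # .result() blocks the calling thread, and with it the whole event loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def slow_summary():
    await asyncio.sleep(0.2)  # stands in for a backend call
    return "summary"


async def demo():
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start
    before = ticks
    result = run_coro_blocking(slow_summary())
    stalled_ticks = ticks - before  # heartbeat could not run meanwhile
    task.cancel()
    return result, stalled_ticks
```

In `demo()`, the heartbeat task gets no chance to tick while the summary call is in flight, which is the same effect that would hit telemetry, cancellation, and keepalives on a shared loop.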

Reviewer flagged silent drops when rendering the slice fed to the summariser:

- Message: append "[N image(s) attached]" / "[M document(s) attached]" markers; bytes not reproduced.
- ModelOutputThunk: render "assistant called tools: name({args}), ..." for tool-call-only thunks (value=None) and "assistant: <empty>" for empty thunks. Eliminates "assistant: None".
- Catch-all else: "<TypeName: content>" or "<TypeName>" instead of the default object repr (e.g. ReactInitiator when not pinned).

Docstring gains a Note: that summaries are text-only and lossy for multimodal / heavy-tool sessions. Tests cover each branch.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

…ng "<empty>"

A "<empty>" marker for thunks with neither value nor tool_calls tended to leak into the resulting summary verbatim. These turns carry no information worth summarising, so drop them from the rendered slice entirely. Test updated to assert no line is emitted for an empty thunk while neighbouring turns still come through.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

…call

Reviewer flagged that aact's context-type warning could be noisy under ThresholdCompactor-driven repeated compaction. Match react.py's pattern of setting silence_context_type_warning=True on internal framework calls so the warning stays quiet if the context argument is later changed to a non-SimpleContext, and to self-document the intent.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>

Reviewer flagged that the summary call uses the backend's default max_tokens (often 256-512 on local backends), silently truncating long summaries. react_summary_prompt(max_tokens_hint=N) is only a soft prompt-side nudge, not a real API parameter.

Add model_options: dict | None to the constructor and forward it to mfuncs.aact so callers can set a hard token budget (or any other backend option). Default None preserves existing behaviour. Tests cover both forwarded and default paths.

Assisted-by: Claude Code
Signed-off-by: Yousef El-Kurdi <yelkurdi@gmail.com>
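A minimal sketch of the forwarding pattern follows. `fake_aact` is a hypothetical stand-in for mfuncs.aact that truncates by word count, and the "max_tokens" key and the default of 4 are invented for the example; they are not mellea's actual option names or defaults.

```python
import asyncio
from typing import Optional


async def fake_aact(prompt: str, *, model_options: Optional[dict] = None) -> str:
    """Hypothetical stand-in for the backend call; truncates by word count."""
    opts = model_options or {}
    limit = opts.get("max_tokens", 4)  # pretend backend default (assumption)
    return " ".join(prompt.split()[:limit])


class SummarizeSketch:
    def __init__(self, model_options: Optional[dict] = None):
        # None means "use whatever the backend defaults to".
        self.model_options = model_options

    async def summarise(self, prompt: str) -> str:
        # Forward the stored options so the caller's budget is enforced
        # by the backend rather than merely hinted at in the prompt text.
        return await fake_aact(prompt, model_options=self.model_options)
```

Usage under these assumptions: `asyncio.run(SummarizeSketch({"max_tokens": 6}).summarise(text))` returns up to six words, while the default instance is cut at the stand-in backend's default.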

yelkurdi and others added 4 commits April 30, 2026 12:24

Fix mot.generation.usage

ca7bea1

yelkurdi requested a review from a team as a code owner May 1, 2026 19:59

yelkurdi requested review from markstur and nrfulton May 1, 2026 19:59

github-actions Bot added the enhancement New feature or request label May 1, 2026

yelkurdi added 2 commits May 20, 2026 11:28

Merge branch 'generative-computing:main' into context_compaction_for_…

f071fe6

…react

planetf1 requested changes May 26, 2026

View reviewed changes

yelkurdi added 9 commits May 26, 2026 12:54


Conversation

yelkurdi commented May 1, 2026

psschwei commented May 4, 2026

github-actions Bot commented May 4, 2026

psschwei commented May 4, 2026

ramon-astudillo commented May 19, 2026

yelkurdi commented May 19, 2026

github-actions Bot commented May 19, 2026

planetf1 left a comment
yelkurdi commented May 1, 2026
Protocol — `CompactionStrategy`