Checks
Strands Version
1.35.0
Python Version
3.13.9
Operating System
macOS 26.4
Installation Method
other
Steps to Reproduce
```python
import asyncio
import json
from collections.abc import AsyncGenerator
from typing import Any, override

from strands import Agent, tool
from strands.models import Model
from strands.types.content import Messages
from strands.types.streaming import StreamEvent
from strands.types.tools import ToolSpec


class TwoToolModel(Model):
    """Model that emits two parallel tool_use blocks on the first turn, then ends."""

    def __init__(self) -> None:
        self.turn = 0

    @override
    def update_config(self, **model_config: Any) -> None:
        pass

    @override
    def get_config(self) -> Any:
        return {}

    @override
    def structured_output(
        self, output_model: Any, prompt: Messages, system_prompt: str | None = None, **kwargs: Any
    ) -> AsyncGenerator[Any, None]:
        raise NotImplementedError

    @override
    async def stream(
        self,
        messages: Messages,
        tool_specs: list[ToolSpec] | None = None,
        system_prompt: str | None = None,
        **kwargs: Any,
    ) -> AsyncGenerator[StreamEvent, None]:
        self.turn += 1
        yield StreamEvent(messageStart={"role": "assistant"})
        if self.turn == 1:
            for tid, name in [("id-slow", "slow_tool"), ("id-fast", "fast_tool")]:
                yield StreamEvent(
                    contentBlockStart={"start": {"toolUse": {"name": name, "toolUseId": tid}}},
                )
                yield StreamEvent(contentBlockDelta={"delta": {"toolUse": {"input": json.dumps({})}}})
                yield StreamEvent(contentBlockStop={})
            yield StreamEvent(messageStop={"stopReason": "tool_use"})
        else:
            yield StreamEvent(contentBlockStart={"contentBlockIndex": 0, "start": {}})
            yield StreamEvent(contentBlockDelta={"contentBlockIndex": 0, "delta": {"text": "done"}})
            yield StreamEvent(contentBlockStop={"contentBlockIndex": 0})
            yield StreamEvent(messageStop={"stopReason": "end_turn"})


@tool(name="slow_tool", description="sleeps briefly and returns")
async def slow_tool() -> str:
    await asyncio.sleep(0.05)
    return "slow done"


@tool(name="fast_tool", description="returns immediately")
async def fast_tool() -> str:
    return "fast done"


async def main() -> None:
    agent = Agent(model=TwoToolModel(), tools=[slow_tool, fast_tool])
    _ = await agent.invoke_async("call both tools")

    # Find the user message with the tool_result blocks
    tool_result_message = next(
        m for m in agent.messages
        if m.get("role") == "user" and any("toolResult" in b for b in m.get("content", []))
    )
    ids = [b["toolResult"]["toolUseId"] for b in tool_result_message["content"] if "toolResult" in b]
    print(f"tool_result order in next-turn prompt: {ids}")
    # Expected: ['id-slow', 'id-fast'] — matches the assistant's toolUse emission order
    # Actual: ['id-fast', 'id-slow'] — fast_tool finished first, so it was appended first


asyncio.run(main())
```
Expected Behavior
The tool_result blocks in the follow-up user message should appear in the same order as the toolUse blocks in the preceding assistant message. That order is deterministic (it comes from the model's output) and stable across runs, which is a prerequisite for byte-stable prompts.
Actual Behavior
tool_result blocks appear in tool-completion order. With the reproducer above this is deterministically inverted (fast_tool finishes before slow_tool), but in general the ordering is scheduler-dependent and varies run to run when the tools have similar completion times.
Additional Context
Byte-stable prompts are a load-bearing assumption for:
- Anthropic's server-side prompt caching — cache entries are keyed on the exact prompt prefix. A reordering of tool_result blocks in a turn invalidates every cache entry that would otherwise have been reused for the rest of the conversation.
- Client-side request/response caching — any workflow that hashes prompts to deduplicate LLM calls (replay caches used by CI, offline test runs, determinism harnesses) will miss on every run, because the scheduler coin-flip picks a different ordering.
- Reproducible agent trajectories — when cached replays fall through to live LLM calls, the new responses differ, and the agent's decision path forks. We hit this in a test suite where a single concurrent tool_use at turn 10 caused two subsets of otherwise-identical tests to end up on entirely different agent trajectories (16 vs 18 turns, different tool sequences, different final verdicts).
In our case this manifested as "two back-to-back runs of the same test suite, with no code changes, produced different prompt hashes and a new live LLM request against what was supposed to be a fully-cached offline run."
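To make the cache-key sensitivity concrete, here is a stdlib-only sketch. The message shapes mirror the reproducer above, and `prompt_hash` is a hypothetical client-side helper (not a Strands API) standing in for whatever keying a replay cache does:

```python
import hashlib
import json

def prompt_hash(messages: list[dict]) -> str:
    # Hash a canonical JSON serialization of the conversation so far.
    # sort_keys normalizes dict key order, but list order (content blocks) is preserved.
    return hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()

# The same two tool results in the two orderings the scheduler can produce.
run_a = [{"role": "user", "content": [
    {"toolResult": {"toolUseId": "id-slow", "content": [{"text": "slow done"}]}},
    {"toolResult": {"toolUseId": "id-fast", "content": [{"text": "fast done"}]}},
]}]
run_b = [{"role": "user", "content": [
    {"toolResult": {"toolUseId": "id-fast", "content": [{"text": "fast done"}]}},
    {"toolResult": {"toolUseId": "id-slow", "content": [{"text": "slow done"}]}},
]}]

print(prompt_hash(run_a) == prompt_hash(run_b))  # False: every reorder is a cache miss
```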
The bug is in ConcurrentToolExecutor (src/strands/tools/executors/concurrent.py) combined with ToolExecutor._stream_with_trace (src/strands/tools/executors/_executor.py).
ConcurrentToolExecutor._execute launches one asyncio.Task per tool_use, passing the same shared tool_results: list[ToolResult] to every task:
```python
for task_id, tool_use in enumerate(tool_uses):
    tasks.append(
        asyncio.create_task(
            self._task(
                agent,
                tool_use,
                tool_results,  # ← shared list
                ...
            )
        )
    )
```
Each task's _stream_with_trace appends to that shared list when its tool finishes:
```python
yield ToolResultEvent(after_event.result)
tool_results.append(after_event.result)  # ← append order = scheduler completion order
return
```
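The same race can be shown with nothing but the standard library. This is a minimal sketch (names are illustrative, not Strands APIs) of why a shared list appended to from concurrent tasks ends up in completion order rather than launch order:

```python
import asyncio

async def worker(name: str, delay: float, results: list[str]) -> None:
    # Appends when the "tool" finishes, analogous to _stream_with_trace.
    await asyncio.sleep(delay)
    results.append(name)

async def run() -> list[str]:
    results: list[str] = []
    # Tasks are created in request order: slow first, fast second.
    await asyncio.gather(
        asyncio.create_task(worker("id-slow", 0.05, results)),
        asyncio.create_task(worker("id-fast", 0.0, results)),
    )
    return results

order = asyncio.run(run())
print(order)  # ['id-fast', 'id-slow']: completion order, not request order
```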
Then event_loop.py serializes the list in whatever order the scheduler left it:
```python
# src/strands/event_loop/event_loop.py
tool_result_message: Message = {
    "role": "user",
    "content": [{"toolResult": result} for result in tool_results],
}
```
SequentialToolExecutor does not have this problem — it iterates tool_uses in request order and each tool appends to tool_results serially, producing request-order output.
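For what it's worth, concurrency and request-order output are not in tension: the completion-ordered results could be re-sorted into toolUse emission order before the message is built. A hedged sketch with a hypothetical helper, not the library's API:

```python
from typing import Any

def reorder_tool_results(
    tool_uses: list[dict[str, Any]], tool_results: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    # Map each toolUseId to its position in the assistant's toolUse emission order,
    # then sort the completion-ordered results back into that request order.
    request_order = {tu["toolUseId"]: i for i, tu in enumerate(tool_uses)}
    return sorted(tool_results, key=lambda r: request_order[r["toolUseId"]])

tool_uses = [{"toolUseId": "id-slow"}, {"toolUseId": "id-fast"}]
completion_order = [{"toolUseId": "id-fast"}, {"toolUseId": "id-slow"}]
print(reorder_tool_results(tool_uses, completion_order))
# [{'toolUseId': 'id-slow'}, {'toolUseId': 'id-fast'}]
```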
Possible Solution
No response
Related Issues