fix(task-store): retry on transient transport errors instead of dropping prompt#2090

Open
yashrajshuklaaa wants to merge 1 commit into
kagent-dev:mainfrom
yashrajshuklaaa:fix/task-store-transport-retry

Conversation

@yashrajshuklaaa

When the agent → controller HTTP hop raises httpx.TransportError (idle keep-alive connection reset by the Istio/HBONE mesh, controller pod reschedule, etc.), the error previously propagated uncaught out of KAgentTaskStore.get/save, silently dropping the user prompt with no error surfaced and no recovery short of a pod restart.

Fix:

Introduce _request_with_retry() in KAgentTaskStore, which catches TransportError, calls aclose() to flush the stale connection pool, and retries once on a fresh connection. Non-transport HTTP errors (4xx/5xx) are re-raised immediately without retrying. If the transport error persists after the retry, it is re-raised so the caller sees a real error rather than a silent drop.

The fix lives entirely in kagent-core/_task_store.py and covers all four framework adapters (langgraph, adk, openai, crewai) automatically, since they all share KAgentTaskStore.
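
For readers without the diff open, here is a rough sketch of the retry shape described above. It is not the code from this PR. To stay dependency-free it uses a local `TransportError` class as a stand-in for `httpx.TransportError`, and `RetryingStore`, `FlakyClient`, `client_factory`, and `MAX_RETRIES` are hypothetical names. How the real helper obtains its fresh connection after `aclose()` is not shown in this description, so rebuilding the client through a factory is an assumption.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

MAX_RETRIES = 1  # one retry, i.e. at most two attempts in total


class TransportError(Exception):
    """Stand-in for httpx.TransportError (assumption, keeps the sketch stdlib-only)."""


class FlakyClient:
    """Hypothetical fake client: the next `failures_left` requests fail."""

    failures_left = 1  # shared across instances, like a stale pool in a mesh

    def __init__(self):
        self.closed = False

    async def request(self, method, url, **kwargs):
        if FlakyClient.failures_left > 0:
            FlakyClient.failures_left -= 1
            raise TransportError("connection reset by peer")
        return {"method": method, "url": url}

    async def aclose(self):
        self.closed = True


class RetryingStore:
    """Hypothetical illustration only; not the real KAgentTaskStore."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = client_factory()

    async def _request_with_retry(self, method, url, **kwargs):
        for attempt in range(MAX_RETRIES + 1):
            try:
                return await self._client.request(method, url, **kwargs)
            except TransportError as exc:
                if attempt == MAX_RETRIES:
                    raise  # still failing: surface a real error to the caller
                logger.warning("transport error on %s %s, retrying: %r", method, url, exc)
                await self._client.aclose()  # discard the stale connections
                self._client = self._client_factory()  # assumed way to get a fresh one
            # Any other exception (e.g. a 4xx/5xx error) is not caught here,
            # so it propagates immediately without a retry.
```

With one queued failure the call succeeds on the second attempt; with two, the second `TransportError` reaches the caller instead of being swallowed.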

Fixes #2086

Copilot AI review requested due to automatic review settings June 25, 2026 18:17
@github-actions github-actions Bot added the bug Something isn't working label Jun 25, 2026

Copilot AI left a comment

Pull request overview

Adds retry handling in the shared KAgentTaskStore HTTP layer to prevent BYO agents from silently dropping prompts when the agent→controller hop encounters transient httpx.TransportError conditions (e.g., stale keep-alive connections reset by the mesh).

Changes:

  • Introduce _request_with_retry() in KAgentTaskStore to retry once on httpx.TransportError.
  • Route save/get/delete through the new retry helper and document the new error behavior.
  • Add logging for transport retry attempts.

Comment thread python/packages/kagent-core/src/kagent/core/a2a/_task_store.py Outdated
@yashrajshuklaaa yashrajshuklaaa force-pushed the fix/task-store-transport-retry branch from 0175791 to 084d78b Compare June 25, 2026 18:31
…ing prompt

Fixes kagent-dev#2086

Signed-off-by: Yashraj Shukla <shuklayashraj68@gmail.com>

fix: clean up stale docstring in _request_with_retry

Signed-off-by: Yashraj Shukla <shuklayashraj68@gmail.com>
@yashrajshuklaaa yashrajshuklaaa force-pushed the fix/task-store-transport-retry branch from 084d78b to 4a87931 Compare June 26, 2026 08:54
@yashrajshuklaaa


Wanted to share how I actually approached this before anyone reviews.
I traced the failure end to end first. Every incoming prompt hits KAgentTaskStore.get and then KAgentTaskStore.save before the agent graph even starts. Both were making raw httpx calls with nothing catching transport failures. So when Istio resets an idle keep-alive connection, which is just normal mesh behavior, httpx throws TransportError, it propagates uncaught through the a2a handler, and the prompt is gone. No error, no task, no reply. Only a pod restart fixes it, because that forces a fresh TCP connection.
My first instinct was to wrap each method individually in try/except, but that's the same logic copy-pasted three times across save, get, and delete. Instead I pulled it into _request_with_retry and routed everything through it. One place to read, one place to change.
I kept _MAX_RETRIES = 1 on purpose. This isn't about flaky networks; it's specifically about stale sockets from idle connections. One retry gets you a fresh connection. More than that and you're just hiding a controller that's actually down.
Non-transport errors (4xx and 5xx) are never retried and are always re-raised immediately. I didn't want to accidentally swallow real failures.
And since the whole fix is in kagent-core/_task_store.py, all four adapters (langgraph, adk, openai, crewai) pick it up automatically. I didn't need to touch any of them.
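
To make the retry budget concrete, here is a small stdlib-only sketch that counts attempts under the rules above. It is an illustration, not the PR's test suite: `TransportError` and `StatusError` are local stand-ins for the httpx exception types, and `call_with_retry`, `make_sender`, and `attempts_for` are hypothetical helpers.

```python
import asyncio


class TransportError(Exception):
    """Stand-in for httpx.TransportError (assumption)."""


class StatusError(Exception):
    """Stand-in for a 4xx/5xx HTTP error (assumption)."""


async def call_with_retry(send, max_retries=1):
    """Retry only on TransportError; everything else propagates at once."""
    for attempt in range(max_retries + 1):
        try:
            return await send()
        except TransportError:
            if attempt == max_retries:
                raise  # budget exhausted: let the caller see the failure


def make_sender(errors):
    """Return (send, calls): send raises the queued errors in order, then returns 'ok'."""
    calls = []
    queue = list(errors)

    async def send():
        calls.append(1)
        if queue:
            raise queue.pop(0)
        return "ok"

    return send, calls


def attempts_for(errors):
    """Run one call and report (outcome, number of attempts made)."""
    send, calls = make_sender(errors)
    try:
        outcome = asyncio.run(call_with_retry(send))
    except Exception as exc:
        outcome = type(exc).__name__
    return outcome, len(calls)


print(attempts_for([TransportError()]))                    # one stale socket
print(attempts_for([StatusError()]))                       # real HTTP failure
print(attempts_for([TransportError(), TransportError()]))  # controller really down
```

A single stale socket costs two attempts and succeeds, an HTTP status error fails after one attempt, and a controller that is actually down fails after two attempts rather than being retried indefinitely.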

@yashrajshuklaaa


@peterj @EItanya PTAL

Development

Successfully merging this pull request may close these issues.

BYO agent silently drops an incoming prompt when the agent→controller /api/tasks call fails (transient transport error)

2 participants