Why Your AI Agent Crashes: Chat Recovery, Durable Submissions, and Routing Retries
The gap between a working prototype and a production agent is rarely about model quality. It is about state management.
When you build AI agents, you are engineering stateful systems that must survive network churn, browser refreshes, and infrastructure evictions. Traditional implementations often fail silently or lose context when these interruptions occur. That fragility is the primary bottleneck for production AI agents. If the conversation dies when the user’s connection drops, or if a sub-agent restarts from zero after a transient failure, the user experience is broken.
Cloudflare’s approach, particularly through the Agents SDK, shifts the focus from fragile client-side state to robust server-side durability. By using Durable Objects together with three SDK features (chat recovery, durable submissions, and routing retries), we can build agents that recover gracefully from interruptions. This isn’t about making the AI smarter; it’s about making the infrastructure around it reliable.
Chat Recovery: Keeping Conversations Alive
The most common point of failure in web-based AI applications is the client connection. A user might lose Wi-Fi, switch tabs, or refresh the page while the agent is generating a response. In older patterns, this often resulted in a half-finished response or a complete loss of context.
Agents SDK v0.12.4 introduced chat recovery, a feature that keeps server turns running even if the browser or client stream is interrupted. This is critical for maintaining continuity. The server does not abort the work just because the listener disconnected. Instead, it completes the turn and makes the result available for the next connection.
This behavior is controlled by the cancelOnClientAbort option in the @cloudflare/ai-chat package. By default, or when set to false, the server continues processing the turn after the client disconnects. When set to true, the server cancels the turn as soon as the client disconnects. For most production applications, keep the default of false so that the agent’s work is not wasted.
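The difference between the two settings can be illustrated with a small self-contained model. This is a sketch of the behavior described above, not the SDK's implementation: `runTurn`, `TurnResult`, and `disconnectAfter` are hypothetical names invented for the example, and only the `cancelOnClientAbort` flag comes from the package.

```typescript
// Illustrative model only. `runTurn`, `TurnResult`, and `disconnectAfter`
// are hypothetical; only the `cancelOnClientAbort` name is from @cloudflare/ai-chat.
interface TurnOptions {
  cancelOnClientAbort: boolean;
}

interface TurnResult {
  stored: string[];    // chunks persisted on the server
  delivered: string[]; // chunks the client received before it disconnected
  completed: boolean;  // whether the turn ran to the end
}

function runTurn(
  chunks: string[],
  disconnectAfter: number, // the client drops after receiving this many chunks
  options: TurnOptions,
): TurnResult {
  const stored: string[] = [];
  const delivered: string[] = [];
  let produced = 0;
  for (const chunk of chunks) {
    const clientConnected = produced < disconnectAfter;
    if (!clientConnected && options.cancelOnClientAbort) {
      // The listener is gone and the policy says to stop: abandon the turn.
      return { stored, delivered, completed: false };
    }
    stored.push(chunk); // the server keeps working and persisting
    if (clientConnected) delivered.push(chunk);
    produced++;
  }
  return { stored, delivered, completed: true };
}
```

With the flag set to false, `stored` ends up holding the whole response even though `delivered` stops at the disconnect, which is what lets the next connection pick up the finished turn.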
The Role of Durable Objects
This reliability is underpinned by Durable Objects. Every agent instance has its own SQLite database for built-in memory, persisting conversation history and state automatically. This means that even if the Durable Object is evicted from memory or restarted, the agent’s context is preserved.
Recovery is more reliable during Durable Object restarts because the SDK defers user finish hooks until after the agent startup. This ensures that the agent is fully initialized and ready to serve requests before any completion callbacks are triggered. Without this deferral, you might see race conditions where the agent tries to write to a database that isn’t yet ready.
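The deferral idea can be sketched as a small gate that queues hooks until startup has finished. This is an illustration of the pattern only: `StartupGate`, `runWhenReady`, and `markReady` are hypothetical names, and the SDK's internal mechanism may look quite different.

```typescript
// Illustrative pattern only. `StartupGate` and its methods are hypothetical
// names, not part of the Agents SDK.
class StartupGate {
  private ready = false;
  private pending: Array<() => void> = [];

  // Run the hook immediately if startup has finished; otherwise queue it.
  runWhenReady(hook: () => void): void {
    if (this.ready) {
      hook();
    } else {
      this.pending.push(hook);
    }
  }

  // Called once at the end of startup; flushes queued hooks in arrival order.
  markReady(): void {
    this.ready = true;
    const queued = this.pending;
    this.pending = [];
    for (const hook of queued) hook();
  }
}
```

A finish hook registered during a restart then runs only after `markReady()` is called, so it never observes a half-initialized agent.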
Durable Submissions for Server-Driven Turns
While chat recovery handles client-side interruptions, durable submissions handle server-side and long-running task reliability. This is where the submitMessages() API comes into play.
In many agent architectures, you need to trigger actions that take time: searching a database, calling an external API, or running a complex reasoning chain. These are “server-driven turns.” If these tasks fail or are interrupted, you need a way to inspect their status, retry them, or cancel them.
submitMessages() provides idempotent retries, status inspection, and cancellation for these tasks. Idempotency is key here. If the network drops between the submission and the confirmation, the system can safely retry the submission without creating duplicate records or triggering duplicate actions.
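To make the idempotency point concrete, here is a minimal sketch of deduplicating submissions by key. It assumes a caller-supplied idempotency key, which is an assumption of this example; `SubmissionStore` and `submit` are hypothetical names and do not reflect the actual signature of submitMessages().

```typescript
// Illustrative model only. `SubmissionStore`, `submit`, and the idempotency-key
// parameter are hypothetical; they are not the signature of submitMessages().
interface Submission {
  id: string;
  messages: string[];
}

class SubmissionStore {
  private byKey = new Map<string, Submission>();
  private nextId = 1;

  // A retry with the same key returns the original record instead of
  // creating a second one, so the caller can resend safely.
  submit(idempotencyKey: string, messages: string[]): Submission {
    const existing = this.byKey.get(idempotencyKey);
    if (existing !== undefined) return existing;
    const created: Submission = { id: `sub-${this.nextId++}`, messages };
    this.byKey.set(idempotencyKey, created);
    return created;
  }

  get size(): number {
    return this.byKey.size;
  }
}
```

If the confirmation is lost and the client resends with the same key, it gets back the same submission and no second record exists.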
Think Sub-Agents and Partial Output
A specific use case for durable submissions is in Think sub-agents. These are specialized sub-agents that handle reasoning or planning steps. Previously, if a Think sub-agent was interrupted, it would often start over, wasting compute and time.
With the latest updates, interrupted sub-agent turns can now recover partial output instead of starting over. This is achieved through chat recovery fibers. These fibers allow the system to resume the execution from the last known good state, preserving any intermediate results. This is a significant efficiency gain for complex, multi-step reasoning tasks.
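The general idea of resuming from the last known good state can be sketched with a simple checkpoint. This is a generic checkpoint-and-resume illustration, not a description of how chat recovery fibers work internally; `Checkpoint` and `resumeTurn` are hypothetical names.

```typescript
// Generic checkpoint-and-resume sketch. `Checkpoint` and `resumeTurn` are
// hypothetical names; this is not how the SDK's fibers are implemented.
interface Checkpoint {
  completed: string[]; // outputs of steps that finished and were persisted
}

function resumeTurn(
  steps: Array<() => string>,
  checkpoint: Checkpoint,
): string[] {
  // Skip every step whose output is already in the checkpoint.
  for (const step of steps.slice(checkpoint.completed.length)) {
    // If step() throws, nothing is pushed, so earlier outputs stay intact.
    checkpoint.completed.push(step());
  }
  return checkpoint.completed;
}
```

Calling `resumeTurn` again with the same checkpoint after an interruption re-runs only the step that failed and the steps after it.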
Managing cancellation and status inspection is also built-in. You can check the status of a submission at any time and cancel it if the user changes their mind or if the task is no longer relevant. This gives you fine-grained control over long-running processes.
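Status inspection and cancellation amount to a small state machine. The sketch below shows one plausible shape; the status names and the `TrackedSubmission` class are assumptions made for illustration, not the SDK's actual types.

```typescript
// Illustrative state machine. The status names and `TrackedSubmission`
// are assumptions for this example, not the SDK's actual types.
type SubmissionStatus = "pending" | "running" | "completed" | "cancelled";

class TrackedSubmission {
  private status: SubmissionStatus = "pending";

  inspect(): SubmissionStatus {
    return this.status;
  }

  start(): void {
    if (this.status === "pending") this.status = "running";
  }

  complete(): void {
    if (this.status === "running") this.status = "completed";
  }

  // Returns true if the cancellation took effect. Terminal states
  // (completed, cancelled) cannot be cancelled again.
  cancel(): boolean {
    if (this.status === "completed" || this.status === "cancelled") return false;
    this.status = "cancelled";
    return true;
  }
}
```

Making `cancel()` a no-op on terminal states keeps a late cancel request from rewriting the record of work that has already finished.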
Routing Retries and Transient Failures
In a distributed system, transient failures are inevitable. Durable Objects are powerful, but they are not immune to routing issues. When an agent is invoked, the system needs to route the request to the correct Durable Object instance. If this routing fails due to a transient network issue, the request should not fail permanently.
The getAgentByName() function now supports a routingRetry configuration. This allows you to handle transient Durable Object routing failures gracefully. You can configure the maximum number of attempts and the timeouts for these retries.
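For readers who want to see what such a retry policy does under the hood, here is a generic retry helper with a cap on attempts and exponential backoff. It is a stand-in for the concept, not the SDK's code: `withRoutingRetry`, `maxAttempts`, `baseDelayMs`, and `isTransient` are hypothetical names, and the real option names for routingRetry should be taken from the SDK documentation. Per-attempt timeouts are left out to keep the sketch short.

```typescript
// Generic retry sketch. `withRoutingRetry` and its option names are
// hypothetical; they are not the routingRetry configuration shape.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

async function withRoutingRetry<T>(
  route: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  options: RetryOptions,
): Promise<T> {
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await route();
    } catch (err) {
      lastError = err;
      // Give up on permanent errors or when the attempt budget is spent.
      if (!isTransient(err) || attempt === options.maxAttempts) break;
      // Exponential backoff: baseDelayMs, then 2x, 4x, and so on.
      await sleep(options.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

Classifying errors through `isTransient` matters: retrying a permanent failure only delays the error the caller eventually has to handle.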
Why Routing Reliability Matters
This is particularly important for multi-agent systems. If Agent A needs to call Agent B, and the routing fails, Agent A should not crash. It should retry the routing and continue its work. This resilience is essential for building complex, interconnected agent workflows.
Without routing retries, you would need to implement your own retry logic, which is error-prone and complex. The built-in support in the SDK simplifies this significantly. It ensures that your agents can discover and communicate with each other reliably, even in the face of infrastructure instability.
Architecture: Execution vs. Orchestration
To understand why these features matter, it is helpful to look at the broader architecture. The Cloudflare Agents SDK focuses on the execution layer: identity, state, and routing. It does not try to replace orchestration frameworks.
Orchestration frameworks (such as those from OpenAI and other LLM providers) define the agent’s logic and decision-making. Cloudflare provides the infrastructure for that logic to run reliably. This separation of concerns lets you choose the orchestration framework and model that fit the job while relying on Cloudflare for state management and durability.
Infrastructure Primitives
This architecture uses several primitives for reliability:
- Durable Objects: For persistent memory and state.
- SQLite: Built into every agent instance for conversation history.
- WebSockets: For real-time communication.
- Scheduling: For proactive and delayed actions.
By combining these primitives, you can build agents that are not only intelligent but also robust. The SDK handles the complexity of managing state across distributed systems, allowing you to focus on the agent’s behavior.
Building for Production: Practical Takeaways
When moving from prototype to production, you need to make specific decisions about how your agents handle failure. Here is a practical framework for doing so.
When to Use Durable Submissions vs. Standard Chat Streams
Use standard chat streams for simple, short-lived interactions where the user expects immediate feedback. If the connection drops, the user will likely retry, and the state is less critical.
Use durable submissions for long-running tasks, complex reasoning, or any action that has side effects. These tasks need to be idempotent and inspectable. If the user navigates away and comes back, they should be able to see the status of the task and resume if necessary.
Configuring Recovery Policies
You should configure recovery policies based on your user experience goals. For a real-time chat application, you might want to recover the conversation. For a batch processing agent, you might want to log the failure and alert an operator.
Use the cancelOnClientAbort option to control whether client disconnections cancel server-side turns. In most cases, you want to keep the turn running. Use the routingRetry configuration in getAgentByName() to handle transient routing failures. Set the max attempts and timeouts based on your network conditions and the criticality of the task.
Observability and Debugging
Observability is key to maintaining reliability. Use the status inspection features of durable submissions to track the progress of long-running tasks. Log the outcomes of routing retries to identify patterns of failure.
Debugging agent recovery in Cloudflare is simplified by the built-in SQLite database. You can inspect the conversation history and state directly to understand what happened during an interruption. This makes it easier to identify and fix issues in your agent logic.
Conclusion
Building resilient AI agents is not about finding a more robust model. It is about engineering a pipeline that can survive the realities of the web. By using chat recovery, durable submissions, and routing retries, you can build agents that keep working even when things go wrong.
The Cloudflare Agents SDK provides the tools to do this. It separates the execution layer from the orchestration layer, allowing you to build agents that are both intelligent and reliable. As you move to production, focus on these reliability features. They are the difference between an agent that works in a demo and one that works in the real world.