After this lesson you'll be able to configure per-step retry policies, pause a Workflow for minutes or months with step.sleep(), block on an external event with step.waitForEvent() (with a safe timeout), and reason about how a Workflow instance fails permanently.
A Workflow's durability comes from replaying completed steps instead of re-running them — that's the subject of the previous lesson. This lesson is about the other half: what happens when a step doesn't succeed, when the next action depends on time passing, or when the workflow needs to wait on something outside itself entirely, like a human clicking "approve." Getting this wrong doesn't crash loudly — it leaves an instance quietly stuck, sometimes forever.
Every step.do() call can take a config object as its second argument, including a retries policy. If the step's callback throws, Workflows retries it according to this policy before giving up:
```typescript
const apiResult = await step.do(
  "call the payment API",
  {
    retries: {
      limit: 10, // max attempts
      delay: "10 seconds", // base delay between attempts
      backoff: "exponential", // "constant" | "linear" | "exponential"
    },
    timeout: "30 minutes", // max time for a single attempt
  },
  async () => {
    const res = await fetch("https://payments.example.com/charge");
    // Throwing a plain Error hands the failure to the retry policy above.
    if (!res.ok) throw new Error(`charge failed: ${res.status}`);
    return res.json();
  },
);
```
If you omit the config, the defaults are limit: 5, delay: 10 seconds, backoff: exponential, and timeout: 10 minutes. Each retry re-runs only that one step's callback; the steps before it are already durably recorded and are not executed again.
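To get a feel for how long a policy can keep a step waiting, the sketch below estimates the delay before each retry. The formulas are an assumption made for illustration (linear multiplies the base delay by the retry number, exponential doubles it each time); the exact schedule Workflows applies may differ. `retryDelaysMs` is a hypothetical helper, not part of the Workflows API.

```typescript
type Backoff = "constant" | "linear" | "exponential";

// Hypothetical helper: estimated wait (in ms) before each retry.
// ASSUMPTION: linear = base * n, exponential = base * 2^(n - 1), for retry n = 1..limit.
function retryDelaysMs(limit: number, baseDelayMs: number, backoff: Backoff): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= limit; attempt++) {
    if (backoff === "constant") {
      delays.push(baseDelayMs);
    } else if (backoff === "linear") {
      delays.push(baseDelayMs * attempt);
    } else {
      delays.push(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
  return delays;
}

// The default policy described above: limit 5, delay 10 seconds, exponential backoff.
const defaultDelays = retryDelaysMs(5, 10000, "exponential");
const totalWaitMs = defaultDelays.reduce((sum, d) => sum + d, 0);
console.log(defaultDelays, totalWaitMs);
```

Under that assumption the default policy spends a little over five minutes waiting between attempts before the step gives up, which is worth keeping in mind when a downstream timeout depends on it.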
Sometimes a failure is known to be unrecoverable (a 400 Bad Request, a validation error) and retrying is pointless. Throw NonRetryableError to skip straight past the retry policy:
```typescript
import { NonRetryableError } from "cloudflare:workflows";

await step.do("validate order", async () => {
  if (!order.customerId) {
    // No amount of retrying will add the missing field, so fail immediately.
    throw new NonRetryableError("order is missing a customerId");
  }
  // ...
});
```
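For HTTP calls, one way to choose between a plain Error and NonRetryableError is to classify the response status. The split below is an assumption for illustration (408, 429, and 5xx treated as transient, every other 4xx as permanent), and `isRetryableStatus` is a hypothetical helper; adjust it to what the API you call actually documents.

```typescript
// Hypothetical helper: should a failed HTTP response go through the retry policy?
// ASSUMPTION: 408, 429 and 5xx are transient; every other status is treated as permanent.
function isRetryableStatus(statusCode: number): boolean {
  if (statusCode === 408 || statusCode === 429) return true;
  return statusCode >= 500 && statusCode <= 599;
}

// Inside a step callback you would branch on the result, for example:
//   if (!res.ok) {
//     if (isRetryableStatus(res.status)) throw new Error(`charge failed: ${res.status}`);
//     throw new NonRetryableError(`charge rejected: ${res.status}`);
//   }
console.log(isRetryableStatus(503), isRetryableStatus(400));
```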
Workflows can pause for durations that would be absurd to hold a normal request-response connection open for — hours, days, or months — because the instance itself is suspended rather than a process sitting idle burning compute:
```typescript
// relative delay
await step.sleep("cool-off period", "3 days");

// absolute point in time
await step.sleepUntil("wait until renewal", Date.parse("2026-08-01T00:00:00Z"));
```
Accepted units for the relative form range from seconds up to years. While a Workflow is sleeping (or blocked on a step, or waiting for an event — see below), it consumes no CPU time and costs nothing on that dimension; only active step execution is billed.
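When the wake-up time depends on the calendar rather than a fixed duration, the target for step.sleepUntil can be computed first. The sketch below is one possible approach, with `startOfNextMonthUtc` as a hypothetical helper name; it assumes renewals happen at midnight UTC on the first of the month.

```typescript
// Hypothetical helper: midnight UTC on the first day of the month after `now`.
function startOfNextMonthUtc(now: Date): Date {
  // Date.UTC normalizes month index 12 to January of the following year.
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1));
}

const renewalTarget = startOfNextMonthUtc(new Date("2026-07-15T09:30:00Z"));
console.log(renewalTarget.toISOString());

// In a workflow the result would be passed straight through:
//   await step.sleepUntil("wait until renewal", renewalTarget);
```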
Retries and sleeps handle failure and time. waitForEvent handles the third case: the workflow needs input from outside itself — a webhook, a human clicking a button, another system finishing its own job — and doesn't know in advance when that will arrive.
```typescript
const approval = await step.waitForEvent("wait for manager approval", {
  type: "approval-decision",
  timeout: "24 hours", // default is 24 hours if you omit this
});
```
type is a string (letters, digits, -, _, up to 100 chars) that a matching sendEvent call must use to reach this specific waiting step. From outside the workflow — a Worker handling a webhook, for example — you push the event in via the Workflow binding:
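```typescript
// In a separate Worker, e.g. a webhook handler.
export default {
  async fetch(request: Request, env: Env) {
    // Assumption for this sketch: the caller passes the instance ID as a query parameter.
    const instanceId = new URL(request.url).searchParams.get("instanceId");
    if (!instanceId) return new Response("missing instanceId", { status: 400 });

    const instance = await env.APPROVAL_WORKFLOW.get(instanceId);
    await instance.sendEvent({
      type: "approval-decision", // must match the type the waiting step declared
      payload: await request.json(),
    });
    return new Response("ok");
  },
} satisfies ExportedHandler<Env>;
```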
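Because the sender and the waiting step are coupled only by the type string, it can help to check that string against the naming rule before sending. The guard below is a minimal sketch of the rule as stated above (letters, digits, `-`, `_`, up to 100 characters); `isValidEventType` is a hypothetical helper, and rejecting the empty string is an added assumption.

```typescript
// Hypothetical guard mirroring the rule stated above.
// ASSUMPTION: an empty string is rejected as well.
const EVENT_TYPE_PATTERN = /^[A-Za-z0-9_-]{1,100}$/;

function isValidEventType(eventType: string): boolean {
  return EVENT_TYPE_PATTERN.test(eventType);
}

console.log(isValidEventType("approval-decision"));
console.log(isValidEventType("approval decision"));
```

Keeping the type string in one shared constant, imported by both the workflow and the webhook Worker, avoids the quieter failure where a typo means the event never matches.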
Events can arrive before the workflow even reaches the waitForEvent step — Cloudflare buffers them and delivers on arrival at the matching step, so there's no race between "the webhook fires" and "the workflow gets there."
The default timeout for waitForEvent is 24 hours, and the maximum is 365 days. There is no "wait forever" option, by design. If the timeout elapses with no matching event, the step throws and the instance fails unless you catch it. That failure is your signal to run a fallback path (escalate, auto-approve, cancel) instead of leaving the instance stuck.
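One way to make the fallback hard to forget is to convert the thrown error into a value the caller has to branch on. The wrapper below is a sketch of that idea, not part of the Workflows API: `waitOrTimeout`, `WaitOutcome`, and the narrow `EventWaiter` interface are hypothetical names, and the interface only approximates the shape of step.waitForEvent as used in this lesson.

```typescript
// Minimal structural type for the one method this sketch needs.
// ASSUMPTION: it mirrors how step.waitForEvent is called in this lesson.
interface EventWaiter {
  waitForEvent<T>(
    name: string,
    options: { type: string; timeout?: string },
  ): Promise<{ payload: T }>;
}

type WaitOutcome<T> = { kind: "received"; payload: T } | { kind: "timed-out" };

// Hypothetical helper: returns an outcome instead of throwing.
// Note that it treats any error from waitForEvent as a timeout.
async function waitOrTimeout<T>(
  step: EventWaiter,
  name: string,
  eventType: string,
  timeout: string,
): Promise<WaitOutcome<T>> {
  try {
    const received = await step.waitForEvent<T>(name, { type: eventType, timeout });
    return { kind: "received", payload: received.payload };
  } catch {
    return { kind: "timed-out" };
  }
}

// Usage inside run():
//   const outcome = await waitOrTimeout<{ approved: boolean }>(
//     step, "wait for manager approval", "approval-decision", "8 hours");
//   if (outcome.kind === "timed-out") { /* escalate, auto-approve, or cancel */ }
```

The discriminated union means the compiler will not let you read `payload` until the timed-out case has been handled.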
A purchase-order workflow that pauses mid-execution for a manager's approval, but auto-escalates instead of hanging if nobody responds within a business day:
```typescript
import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";
import { NonRetryableError } from "cloudflare:workflows";

// sendApprovalEmail, sendEscalationEmail, markRejected and fulfillOrder are
// application helpers assumed to be defined elsewhere.

type Params = { orderId: string; amount: number; managerEmail: string };

export class PurchaseOrderWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    const { orderId, amount, managerEmail } = event.payload;

    await step.do("validate order", async () => {
      if (amount <= 0) throw new NonRetryableError("invalid order amount");
    });

    await step.do(
      "notify manager",
      { retries: { limit: 3, delay: "30 seconds", backoff: "exponential" } },
      async () => {
        await sendApprovalEmail(managerEmail, orderId, amount);
      },
    );

    let decision: { approved: boolean; by: string };
    try {
      const result = await step.waitForEvent<{ approved: boolean; by: string }>(
        "wait for manager approval",
        { type: "approval-decision", timeout: "8 hours" },
      );
      decision = result.payload;
    } catch (err) {
      // Timed out: escalate instead of leaving the instance stalled forever.
      await step.do("escalate to director", async () => {
        await sendEscalationEmail(orderId, amount);
      });
      decision = { approved: false, by: "timeout-escalation" };
    }

    if (!decision.approved) {
      await step.do("mark order rejected", async () => markRejected(orderId, decision.by));
      return;
    }

    await step.do(
      "charge and fulfill",
      { retries: { limit: 5, delay: "1 minute", backoff: "exponential" } },
      async () => fulfillOrder(orderId),
    );
  }
}
```
The webhook that the approval email's link hits is a separate, ordinary Worker — it doesn't need to know anything about the workflow's internal state, only its instance ID and the event type string:
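```typescript
// src/approval-webhook.ts
type ApprovalRequest = { instanceId: string; approved: boolean; managerEmail: string };

export default {
  async fetch(request: Request, env: Env) {
    // The body is typed here but not validated; a production handler should check it.
    const { instanceId, approved, managerEmail } = await request.json<ApprovalRequest>();
    const instance = await env.PURCHASE_ORDER_WORKFLOW.get(instanceId);
    await instance.sendEvent({
      type: "approval-decision",
      payload: { approved, by: managerEmail },
    });
    return new Response("recorded");
  },
} satisfies ExportedHandler<Env>;
```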
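Since the webhook accepts JSON from outside, it is worth checking the body's shape before forwarding it as an event. The type guard below is a minimal sketch; `ApprovalBody` and `isApprovalBody` are hypothetical names, and the field names are taken from the handler above.

```typescript
type ApprovalBody = { instanceId: string; approved: boolean; managerEmail: string };

// Hypothetical type guard for the webhook's JSON body.
function isApprovalBody(value: unknown): value is ApprovalBody {
  if (typeof value !== "object" || value === null) return false;
  const body = value as Record<string, unknown>;
  return (
    typeof body.instanceId === "string" &&
    typeof body.approved === "boolean" &&
    typeof body.managerEmail === "string"
  );
}

// In the handler, a failed check would typically return a 400 before any
// call to sendEvent:
//   const body: unknown = await request.json();
//   if (!isApprovalBody(body)) return new Response("bad request", { status: 400 });
console.log(isApprovalBody({ instanceId: "abc", approved: true, managerEmail: "m@example.com" }));
```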
Workflows billing has three dimensions: requests (instance creations/invocations), CPU time, and persisted state storage. As of this writing:
| Meter | Free plan | Paid plan (Workers Paid) |
|---|---|---|
| Requests | 100,000/day (shared with Workers requests) | 10 million/month included, then $0.30/million |
| CPU time | 10ms of CPU time per invocation | 30 million CPU-ms/month included, then $0.02/million CPU-ms |
| Stored state | 1 GB | 1 GB-month included, then $0.20/GB-month |
The detail that matters most for this lesson: time spent in step.sleep, blocked inside a retrying step.do, or parked in waitForEvent does not consume CPU time — you're billed for active execution, not wall-clock wait. An approval workflow that sits idle for eight hours costs the same in CPU time as one that gets approved in eight seconds.
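The arithmetic can be sketched as a small estimator. The rates are copied from the Paid plan column of the table above and may be out of date by the time you read this; `cpuOverageUsd` is a hypothetical helper that ignores the request and storage meters.

```typescript
// Rates from the Paid plan column above; verify against the pricing page.
const INCLUDED_CPU_MS = 30000000; // 30 million CPU-ms per month included
const USD_PER_MILLION_CPU_MS = 0.02;

// Hypothetical estimator: monthly CPU overage in USD.
// Only active CPU time is an input; time spent sleeping or waiting is not.
function cpuOverageUsd(activeCpuMsPerInstance: number, instancesPerMonth: number): number {
  const totalCpuMs = activeCpuMsPerInstance * instancesPerMonth;
  const billableCpuMs = Math.max(0, totalCpuMs - INCLUDED_CPU_MS);
  return (billableCpuMs / 1000000) * USD_PER_MILLION_CPU_MS;
}

console.log(cpuOverageUsd(50, 1000000).toFixed(2));
```

With 50 ms of active CPU per instance and one million instances a month, the total is 50 million CPU-ms, which is 20 million over the included amount, or about $0.40. How long each instance spent waiting for approval never enters the calculation.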
A scheduled sequence can be written as step.sleep calls between actions, running as one durable instance instead of a cron job re-deriving state on every tick.

If you don't wrap waitForEvent in a try/catch (or otherwise handle the thrown error), the timeout exception propagates to the top of run() unhandled and the entire instance fails in an errored state with no fallback action taken: no escalation email sent, no order marked rejected, nothing. Worse, teams sometimes set an unrealistically long timeout (30, 60, 90 days) "just to be safe" for a human-approval step. That does prevent the failure, but it means a genuinely abandoned instance sits in a "waiting" state, accruing stored-state charges for months before anyone notices. Always pick a timeout that matches the real-world deadline for the event, and always handle the timeout error with an explicit fallback path.
Cloudflare Docs — Workflows: Events and parameters covers waitForEvent and sendEvent in full; see also Sleeping and retrying for retry/backoff/sleep mechanics and the pricing page for current rates (checked 2026-07-03).
You write a step.waitForEvent() call for a manager approval but don't pass a timeout, and don't wrap the call in try/catch. What actually happens if no matching event ever arrives?

A normal thrown error inside step.do triggers the step's retry policy (by default 5 attempts with exponential backoff) before the workflow gives up. NonRetryableError skips retries entirely and propagates the failure straight to the top-level run() function. That is useful when you already know retrying won't help, as with a validation failure, so you don't waste time and compute on something that will never succeed.