Designing Workflows that fail gracefully

After this lesson you'll be able to configure per-step retry policies, pause a Workflow for minutes or months with sleep(), block on an external event with waitForEvent (with a safe timeout), and reason about how a Workflow instance dies for good.

A Workflow's durability comes from replaying the recorded results of completed steps instead of re-running them; that's the subject of the previous lesson. This lesson is about the other half: what happens when a step doesn't succeed, when the next action depends on time passing, or when the workflow needs to wait on something outside itself entirely, like a human clicking "approve." Getting this wrong doesn't crash loudly. It leaves an instance parked for months, or failed with no fallback action taken.

How it works

Retry policies on step.do

Every step.do() call can take a config object as its second argument, holding a retries policy and a per-attempt timeout. If the step's callback throws, Workflows retries it according to this policy before giving up:

const apiResult = await step.do(
  "call the payment API",
  {
    retries: {
      limit: 10,             // max attempts
      delay: "10 seconds",   // base delay between attempts
      backoff: "exponential" // "constant" | "linear" | "exponential"
    },
    timeout: "30 minutes",   // max time for a single attempt
  },
  async () => {
    const res = await fetch("https://payments.example.com/charge");
    if (!res.ok) throw new Error(`charge failed: ${res.status}`);
    return res.json();
  },
);

If you omit the config, the defaults are limit: 5, delay: 10 seconds, backoff: exponential, timeout: 10 minutes. Each retry attempt re-runs only that one step's callback; every earlier step in the workflow is already durably recorded and is not executed again.
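
To get a feel for how long a step can keep retrying before it gives up, the sketch below computes the wait before each retry for the three backoff modes. The growth formulas are assumptions for illustration (linear multiplies the base delay by the retry number, exponential doubles it each time); the platform's exact schedule may differ, and retryDelaysMs and totalWaitMs are hypothetical helpers, not part of the Workflows API.

```typescript
type Backoff = "constant" | "linear" | "exponential";

// Hypothetical helper: returns the wait (in ms) before each retry.
// ASSUMPTION: linear = base * n, exponential = base * 2^(n-1), for retry n = 1, 2, ...
// The real platform schedule may differ (for example, it may add jitter).
function retryDelaysMs(retries: number, baseMs: number, backoff: Backoff): number[] {
  const delays: number[] = [];
  for (let n = 1; n <= retries; n++) {
    if (backoff === "constant") delays.push(baseMs);
    else if (backoff === "linear") delays.push(baseMs * n);
    else delays.push(baseMs * 2 ** (n - 1));
  }
  return delays;
}

// Total time spent waiting across all retries, ignoring how long each attempt itself runs.
function totalWaitMs(retries: number, baseMs: number, backoff: Backoff): number {
  return retryDelaysMs(retries, baseMs, backoff).reduce((sum, d) => sum + d, 0);
}

console.log(retryDelaysMs(4, 10000, "exponential")); // [10000, 20000, 40000, 80000] under the assumed formula
console.log(totalWaitMs(4, 10000, "exponential"));   // 150000
```

Under these assumptions, an exponential policy reaches multi-minute waits after only a handful of retries, which is worth keeping in mind when choosing a limit for a step that a user is waiting on.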

Sometimes a failure is known to be unrecoverable (a 400 Bad Request, a validation error) and retrying is pointless. Throw NonRetryableError to skip straight past the retry policy:

import { NonRetryableError } from "cloudflare:workflows";

await step.do("validate order", async () => {
  if (!order.customerId) {
    throw new NonRetryableError("order is missing a customerId");
  }
  // ...
});
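
One way to decide which failures deserve NonRetryableError is to classify the HTTP status before throwing. The helper below is a hypothetical sketch: the rule that 408 and 429 are worth retrying while other 4xx responses are not is a common convention assumed here, not something the Workflows runtime enforces.

```typescript
type FailureKind = "retry" | "fail-fast";

// Hypothetical helper: decide whether a failed HTTP response is worth retrying.
// Intended to be called only for non-OK responses.
// ASSUMPTION: 5xx, 408 (request timeout), and 429 (rate limited) are transient;
// every other 4xx means the request itself is wrong and would fail again.
function classifyStatus(status: number): FailureKind {
  if (status === 408 || status === 429) return "retry";
  if (status >= 400 && status < 500) return "fail-fast";
  return "retry";
}

console.log(classifyStatus(400)); // "fail-fast"
console.log(classifyStatus(503)); // "retry"
```

Inside a step callback, a "fail-fast" result is where you would throw NonRetryableError, and a "retry" result is where you would throw a plain Error and let the step's retry policy take over.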

sleep() and sleepUntil()

Workflows can pause for far longer than any request-response connection could stay open (hours, days, or months) because the instance is suspended rather than kept alive as an idle process burning compute:

// relative delay
await step.sleep("cool-off period", "3 days");

// absolute point in time
await step.sleepUntil("wait until renewal", Date.parse("2026-08-01T00:00:00Z"));

Accepted units for the relative form range from seconds up to years. While a Workflow is sleeping (or blocked on a step, or waiting for an event — see below), it consumes no CPU time and costs nothing on that dimension; only active step execution is billed.
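
Because sleepUntil takes an absolute point in time, the timestamp usually has to be derived from business data. A minimal sketch, assuming a hypothetical helper reminderTimestamp that computes the moment a given number of days before an expiry date and returns it as a millisecond timestamp (the same kind of value Date.parse produces in the example above):

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: the millisecond timestamp `daysBefore` days before an ISO-8601 expiry.
function reminderTimestamp(expiryIso: string, daysBefore: number): number {
  const expiryMs = Date.parse(expiryIso);
  if (Number.isNaN(expiryMs)) {
    throw new Error(`not a valid date: ${expiryIso}`);
  }
  return expiryMs - daysBefore * MS_PER_DAY;
}

const wakeAt = reminderTimestamp("2026-08-01T00:00:00Z", 3);
console.log(new Date(wakeAt).toISOString()); // "2026-07-29T00:00:00.000Z"
```

The result could then be passed as the second argument to step.sleepUntil.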

waitForEvent: pausing for something external

Retries and sleeps handle failure and time. waitForEvent handles the third case: the workflow needs input from outside itself — a webhook, a human clicking a button, another system finishing its own job — and doesn't know in advance when that will arrive.

const approval = await step.waitForEvent("wait for manager approval", {
  type: "approval-decision",
  timeout: "24 hours", // default is 24 hours if you omit this
});

type is a string (letters, digits, -, _, up to 100 chars) that a matching sendEvent call must use to reach this specific waiting step. From outside the workflow — a Worker handling a webhook, for example — you push the event in via the Workflow binding:

// in a separate Worker, e.g. a webhook handler
export default {
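  async fetch(request: Request, env: Env) {
    // Assumption for illustration: the caller passes the instance ID as ?instanceId=...
    const instanceId = new URL(request.url).searchParams.get("instanceId");
    if (!instanceId) return new Response("missing instanceId", { status: 400 });
    const instance = await env.APPROVAL_WORKFLOW.get(instanceId);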

    await instance.sendEvent({
      type: "approval-decision",
      payload: await request.json(),
    });

    return new Response("ok");
  },
} satisfies ExportedHandler<Env>;
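
Since a mismatched or malformed type string means the event never reaches the waiting step, it can be worth checking the string before sending. A small sketch of the constraint described above (letters, digits, -, _, up to 100 characters); isValidEventType is a hypothetical helper, and rejecting the empty string is an added assumption:

```typescript
// Mirrors the stated constraint: letters, digits, "-" and "_", at most 100 characters.
// ASSUMPTION: an empty string is rejected too.
const EVENT_TYPE_PATTERN = /^[A-Za-z0-9_-]{1,100}$/;

// Hypothetical helper, not part of the Workflows API.
function isValidEventType(type: string): boolean {
  return EVENT_TYPE_PATTERN.test(type);
}

console.log(isValidEventType("approval-decision")); // true
console.log(isValidEventType("approval decision")); // false (contains a space)
```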

Events can arrive before the workflow even reaches the waitForEvent step — Cloudflare buffers them and delivers on arrival at the matching step, so there's no race between "the webhook fires" and "the workflow gets there."

Timeout is not optional in practice. The default timeout on waitForEvent is 24 hours and the maximum is 365 days; by design, there is no "wait forever" option. If the timeout elapses with no matching event, the step throws, and the instance fails unless you catch the error. That thrown error is your signal to run a fallback path (escalate, auto-approve, cancel) instead of letting the instance fail with nothing done.

Worked example: an approval step with a timeout fallback

A purchase-order workflow that pauses mid-execution for a manager's approval, but auto-escalates instead of hanging if nobody responds within a business day:

import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";
import { NonRetryableError } from "cloudflare:workflows";

type Params = { orderId: string; amount: number; managerEmail: string };

export class PurchaseOrderWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    const { orderId, amount, managerEmail } = event.payload;

    await step.do("validate order", async () => {
      if (amount <= 0) throw new NonRetryableError("invalid order amount");
    });

    await step.do("notify manager", { retries: { limit: 3, delay: "30 seconds", backoff: "exponential" } },
      async () => {
        await sendApprovalEmail(managerEmail, orderId, amount);
      },
    );

    let decision: { approved: boolean; by: string };
    try {
      const result = await step.waitForEvent<{ approved: boolean; by: string }>(
        "wait for manager approval",
        { type: "approval-decision", timeout: "8 hours" },
      );
      decision = result.payload;
    } catch (err) {
      // Most likely a timeout (this catch sees any error the step throws).
      // Escalate instead of letting the instance fail with nothing done.
      await step.do("escalate to director", async () => {
        await sendEscalationEmail(orderId, amount);
      });
      decision = { approved: false, by: "timeout-escalation" };
    }

    if (!decision.approved) {
      await step.do("mark order rejected", async () => markRejected(orderId, decision.by));
      return;
    }

    await step.do("charge and fulfill", { retries: { limit: 5, delay: "1 minute", backoff: "exponential" } },
      async () => fulfillOrder(orderId),
    );
  }
}

The webhook that the approval email's link hits is a separate, ordinary Worker — it doesn't need to know anything about the workflow's internal state, only its instance ID and the event type string:

// src/approval-webhook.ts
export default {
  async fetch(request: Request, env: Env) {
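    // request.json() is untyped, so state the expected shape explicitly.
    const { instanceId, approved, managerEmail } = (await request.json()) as {
      instanceId: string;
      approved: boolean;
      managerEmail: string;
    };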
    const instance = await env.PURCHASE_ORDER_WORKFLOW.get(instanceId);

    await instance.sendEvent({
      type: "approval-decision",
      payload: { approved, by: managerEmail },
    });

    return new Response("recorded");
  },
} satisfies ExportedHandler<Env>;
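
The webhook above trusts whatever JSON arrives. A minimal sketch of validating the body first, assuming the same three fields; parseApprovalBody is a hypothetical helper, not something the platform provides:

```typescript
type ApprovalBody = { instanceId: string; approved: boolean; managerEmail: string };

// Hypothetical helper: returns the typed body, or null if any field is missing or mistyped.
function parseApprovalBody(body: unknown): ApprovalBody | null {
  if (typeof body !== "object" || body === null) return null;
  const b = body as Record<string, unknown>;
  if (typeof b.instanceId !== "string" || b.instanceId.length === 0) return null;
  if (typeof b.approved !== "boolean") return null;
  if (typeof b.managerEmail !== "string") return null;
  return { instanceId: b.instanceId, approved: b.approved, managerEmail: b.managerEmail };
}

console.log(parseApprovalBody({ instanceId: "abc", approved: true, managerEmail: "m@example.com" }));
console.log(parseApprovalBody({ approved: "yes" })); // null
```

In the handler, a null result would map to a 400 response before any call to sendEvent, so a malformed request never delivers a half-formed decision to the waiting step.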

Pricing

Workflows billing has three dimensions: requests (instance creations/invocations), CPU time, and persisted state storage. As of this writing:

Meter	Free plan	Paid plan (Workers Paid)
Requests	100,000/day (shared with Workers requests)	10 million/month included, then $0.30/million
CPU time	10ms of CPU time per invocation	30 million CPU-ms/month included, then $0.02/million CPU-ms
Stored state	1 GB	1 GB-month included, then $0.20/GB-month

The detail that matters most for this lesson: time spent in step.sleep, blocked inside a retrying step.do, or parked in waitForEvent does not consume CPU time — you're billed for active execution, not wall-clock wait. An approval workflow that sits idle for eight hours costs the same in CPU time as one that gets approved in eight seconds.

Pricing changes over time — confirm current numbers on the live Workflows pricing page before estimating a bill (checked 2026-07-03).
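
As a rough illustration of how the paid-plan meters combine, the sketch below computes one month's overage from the figures in the table above. The rates are copied from that table and may be out of date; estimateOverageUsd is a hypothetical helper, and it ignores any plan subscription fee.

```typescript
type Usage = { requests: number; cpuMs: number; storageGbMonths: number };

// Paid-plan figures copied from the table above; confirm them before relying on the result.
const INCLUDED = { requests: 10_000_000, cpuMs: 30_000_000, storageGbMonths: 1 };
const RATE = { perMillionRequests: 0.3, perMillionCpuMs: 0.02, perGbMonth: 0.2 };

// Hypothetical helper: overage in USD for one month, excluding any plan subscription fee.
function estimateOverageUsd(u: Usage): number {
  const over = (used: number, included: number) => Math.max(0, used - included);
  return (
    (over(u.requests, INCLUDED.requests) / 1_000_000) * RATE.perMillionRequests +
    (over(u.cpuMs, INCLUDED.cpuMs) / 1_000_000) * RATE.perMillionCpuMs +
    over(u.storageGbMonths, INCLUDED.storageGbMonths) * RATE.perGbMonth
  );
}

const usd = estimateOverageUsd({ requests: 15_000_000, cpuMs: 50_000_000, storageGbMonths: 3 });
console.log(usd.toFixed(2)); // "2.30"
```

In that example the overage is 5 million requests ($1.50), 20 million CPU-ms ($0.40), and 2 GB-months of storage ($0.40). Note that wait time appears nowhere in the inputs: only requests, active CPU, and stored state are metered.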

Use cases

Approval workflows. Purchase orders, content moderation, refund requests — anything where a human needs to sign off mid-process before the workflow continues, as in the worked example above.
Scheduled multi-day sequences. Onboarding drips, trial-expiry reminders, dunning emails on failed payments — a sequence of step.sleep calls between actions, running as one durable instance instead of a cron job re-deriving state on every tick.
Saga-pattern distributed transactions. A multi-service operation (reserve inventory, charge card, book shipment) where each step can fail independently and needs a compensating action if a later step fails — retries handle transient failures per step, and terminal failure handling drives the rollback/compensation logic.
Waiting on async external systems. Kicking off a long-running third-party job (video transcoding, a batch export) and resuming only when that system's webhook confirms completion, rather than polling.
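
The saga use case is the least obvious of these, so here is a minimal sketch of its control flow: run steps in order, and if one fails for good, undo the completed ones in reverse. Plain async functions stand in for the real work; in a Workflow, each run and compensate would presumably be wrapped in its own step.do so it gets its own retry policy. SagaStep and runSaga are hypothetical names, not part of the Workflows API.

```typescript
type SagaStep = {
  name: string;
  run: () => Promise<void>;
  compensate: () => Promise<void>;
};

// Runs steps in order. If one throws, undoes the already-completed steps in reverse order.
// Returns true if every step succeeded, false if the saga was rolled back.
async function runSaga(steps: SagaStep[], log: string[]): Promise<boolean> {
  const done: SagaStep[] = [];
  for (const s of steps) {
    try {
      await s.run();
      done.push(s);
      log.push(`ok:${s.name}`);
    } catch {
      log.push(`failed:${s.name}`);
      // Compensate the most recently completed step first.
      for (const d of done.reverse()) {
        await d.compensate();
        log.push(`undo:${d.name}`);
      }
      return false;
    }
  }
  return true;
}
```

With steps named reserve, charge, and ship, a failure in ship would undo charge first and reserve second. The failed step itself is not compensated, on the assumption that a step that threw left nothing behind to undo.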

Pitfall: waitForEvent with no timeout plan either kills the instance or parks it for months. The timeout defaults to 24 hours rather than "forever," so it's easy to assume the platform will always give up gracefully. But "giving up" just means the step throws. If you don't wrap waitForEvent in a try/catch (or otherwise handle the thrown error), the exception propagates out of run() unhandled and the entire instance ends in an errored state with no fallback action taken: no escalation email sent, no order marked rejected, nothing. The opposite mistake is setting an unrealistically long timeout (30, 60, 90 days) "just to be safe" for a human-approval step. That does prevent the failure, but a genuinely abandoned instance then sits in a waiting state, accruing stored-state charges for months before anyone notices. Always pick a timeout that matches the real-world deadline for the event, and always handle the timeout error with an explicit fallback path.
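
One way to tie the timeout to a real deadline is to compute it rather than hard-code it. A minimal sketch, assuming the timeout option also accepts a plain number of milliseconds (if it does not, convert the result to a duration string) and assuming a floor of one second; timeoutUntil is a hypothetical helper:

```typescript
const MAX_TIMEOUT_MS = 365 * 24 * 60 * 60 * 1000; // the 365-day ceiling mentioned earlier
const MIN_TIMEOUT_MS = 1000; // ASSUMPTION: a floor of one second

// Hypothetical helper: milliseconds from `nowMs` until `deadlineMs`, clamped to the allowed range.
function timeoutUntil(deadlineMs: number, nowMs: number): number {
  const remaining = deadlineMs - nowMs;
  return Math.min(MAX_TIMEOUT_MS, Math.max(MIN_TIMEOUT_MS, remaining));
}

const now = Date.parse("2026-07-03T09:00:00Z");
const endOfDay = Date.parse("2026-07-03T17:00:00Z");
console.log(timeoutUntil(endOfDay, now)); // 28800000 (8 hours)
```

A deadline that has already passed clamps to the one-second floor, so the wait times out almost immediately and the fallback path runs instead of the instance waiting on an event that can no longer matter.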

Primary source

Cloudflare Docs — Workflows: Events and parameters covers waitForEvent and sendEvent in full; see also Sleeping and retrying for retry/backoff/sleep mechanics and the pricing page for current rates (checked 2026-07-03).

You add a step.waitForEvent() call for a manager approval but don't pass a timeout, and don't wrap the call in try/catch. What actually happens if no matching event ever arrives?

Without scrolling up: what's the difference between what NonRetryableError does versus a normal thrown Error inside a step.do callback?

A normal thrown error inside step.do triggers the step's retry policy (by default 5 attempts with exponential backoff) before the step gives up. NonRetryableError skips the retries entirely and fails the step immediately, so the error surfaces in run() right away. That's useful when you already know retrying won't help, as with a validation failure, because you don't waste time and compute on something that will never succeed.
