
Multi-step jobs that survive failures: Workflows

After this lesson you'll be able to write a multi-step Workflow with per-step retry config, explain why step logic must be idempotent, and know when to reach for Workflows instead of a plain Worker or Queues.

A normal Worker lives for the length of one request: if it crashes halfway through a five-step process, everything it hadn't already persisted is gone, and you're rebuilding the state from scratch. Workflows is Cloudflare's answer to that problem — durable execution for long-running, multi-step processes. You write your process as a sequence of named steps; the platform checkpoints the return value of each completed step, and if the underlying instance crashes, gets rescheduled, or a step throws, Workflows resumes from the last completed step instead of restarting the whole run. A Workflow instance can legitimately run for minutes, hours, or weeks — it's built for that, not just tolerating it.

Under the hood. Each Workflow instance is backed by a Durable Object: that's where the single-threaded, strongly-consistent execution and the checkpointed step state actually live. You don't manage the Durable Object yourself — Workflows gives you a higher-level step API on top of it — but knowing this explains the behavior: one instance processes its steps one at a time, in order, and its state survives restarts the same way a Durable Object's storage does.

How it works

You define a Workflow as a class extending WorkflowEntrypoint, with a run(event, step) method. Inside run, you call step.do("step name", callback) for each unit of work. Three things matter about how steps behave:

The step name is a cache key. Once a step with a given name completes, its return value is persisted. On retry or resume, Workflows doesn't re-run that step — it hands back the cached result and moves on. Names must be deterministic (fixed strings, not `step-${Date.now()}`), or the engine can't tell a resumed step from a new one.
Steps retry automatically. If the callback throws, Workflows retries it according to a per-step retry config (limit, delay, backoff) before giving up and failing the instance.
Idle time is free and doesn't block. step.sleep() / step.sleepUntil() pause a Workflow — even for days — without holding a running instance or billing CPU time while it waits.
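
To make the cache-key idea concrete, here is a minimal in-memory sketch. ToyEngine, runJob, and the step names are hypothetical, and the Map only stands in for persisted checkpoints. It is not how Workflows is implemented, just a model of the resume behavior described above.

```typescript
// Toy model only: a Map stands in for the engine's persisted checkpoints.
class ToyEngine {
  private checkpoints = new Map<string, unknown>();
  public executions: string[] = [];

  // Run `fn` unless a step with this name has already completed.
  step<T>(stepName: string, fn: () => T): T {
    if (this.checkpoints.has(stepName)) {
      return this.checkpoints.get(stepName) as T; // cached: the callback is skipped
    }
    this.executions.push(stepName);
    const result = fn();
    this.checkpoints.set(stepName, result); // checkpoint only after success
    return result;
  }
}

function runJob(engine: ToyEngine, failAtLoad: boolean): number {
  const raw = engine.step("fetch", () => [1, 2, 3]);
  const total = engine.step("sum", () => raw.reduce((a, b) => a + b, 0));
  engine.step("load", () => {
    if (failAtLoad) throw new Error("simulated crash");
    return true;
  });
  return total;
}

const engine = new ToyEngine();
let crashed = false;
try {
  runJob(engine, true); // first run dies in the third step
} catch (_err) {
  crashed = true;
}
const total = runJob(engine, false); // "resume": fetch and sum come from the cache
```

After both runs, engine.executions holds fetch, sum, load, load: the first two callbacks ran once, and only the failed step ran again. If the names had been generated from something like Date.now(), the second run would have found no matching checkpoints and repeated everything.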

The consequence of automatic retries is the single most important rule in this lesson: step callbacks must be idempotent, and the control flow around them must be deterministic. Retries mean a step's code can run more than once for the same logical step: for example, an API call succeeds, but the response is lost before Workflows records the result, so the retry repeats the call. Code outside a step.do() (anything in run() directly) can also re-execute if the engine restarts, so side effects (writes, sends, charges) belong strictly inside steps. A step that isn't naturally idempotent needs to check whether its effect already happened before performing it again (e.g. "has this order already been marked paid?" before charging).
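
The check-before-act rule can be sketched as follows. The order store, chargeCard, and markPaidStep are hypothetical stand-ins for illustration; a real step body would read and write a database or call an external API instead of a Map.

```typescript
// Hypothetical order store; in a real system this would be a database row.
type Order = { id: string; paid: boolean };

const orders = new Map<string, Order>([["order-1", { id: "order-1", paid: false }]]);
let chargesMade = 0;

// Stand-in for a payment call with a real-world side effect.
function chargeCard(_orderId: string): void {
  chargesMade += 1;
}

// Step body: look for evidence of the effect before repeating it.
function markPaidStep(orderId: string): "charged" | "skipped" {
  const order = orders.get(orderId);
  if (!order) throw new Error(`unknown order ${orderId}`);
  if (order.paid) return "skipped"; // effect already happened on an earlier attempt
  chargeCard(orderId);
  order.paid = true;
  return "charged";
}

const firstAttempt = markPaidStep("order-1");
const retriedAttempt = markPaidStep("order-1"); // e.g. the first result was never recorded
```

The second call returns "skipped" and chargesMade stays at 1. Note the limit of this approach: it only protects you if the record of the effect (here, order.paid) is written together with the effect itself. If the charge can succeed while the flag update fails, the check alone is not enough.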

Worked example

A three-step Workflow: call an external API, process the result, write it to R2. Each step has its own retry policy sized to how flaky and how expensive that particular step is.

import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent } from "cloudflare:workers";

type Params = { reportId: string };

export class ReportWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    const { reportId } = event.payload;

    // Step 1: call an external API. Retries handle transient network/5xx errors.
    const raw = await step.do(
      "fetch source data",
      {
        retries: { limit: 5, delay: "10 seconds", backoff: "exponential" },
        timeout: "30 seconds",
      },
      async () => {
        const res = await fetch(`https://api.example.com/reports/${reportId}`);
        if (!res.ok) throw new Error(`upstream ${res.status}`);
        // The response shape is an assumption for this example; typing it here
        // lets later steps use raw.items without falling back to `any`.
        return (await res.json()) as { items: { amount: number }[] };
      }
    );

    // Step 2: pure transformation with no side effects, so it's safe to retry
    // even without a special idempotency check.
    const processed = await step.do("transform result", async () => {
      return { reportId, total: raw.items.reduce((sum, item) => sum + item.amount, 0) };
    });

    // Step 3: write to R2. Check-before-write makes the retry idempotent —
    // if a previous attempt already wrote the object, don't write it again.
    await step.do(
      "persist to R2",
      { retries: { limit: 3, delay: "5 seconds", backoff: "linear" } },
      async () => {
        const key = `reports/${reportId}.json`;
        const existing = await this.env.REPORTS_BUCKET.head(key);
        if (existing) return; // already written by a prior attempt
        await this.env.REPORTS_BUCKET.put(key, JSON.stringify(processed));
      }
    );
  }
}
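
When sizing the limit, delay, and backoff values for a step, it helps to estimate how long an instance could spend waiting between attempts. The sketch below is a rough calculator under assumed definitions: constant keeps the base delay, linear multiplies it by the attempt number, and exponential doubles it each attempt. This lesson does not state the exact schedule Workflows applies, so treat the formulas and function names as illustrative assumptions and check the retry documentation before relying on the numbers.

```typescript
type Backoff = "constant" | "linear" | "exponential";

// Delay before retry number `attempt` (1-based), under the ASSUMED definitions:
// constant = base, linear = base * attempt, exponential = base * 2^(attempt - 1).
function retryDelaySeconds(baseSeconds: number, backoff: Backoff, attempt: number): number {
  switch (backoff) {
    case "constant":
      return baseSeconds;
    case "linear":
      return baseSeconds * attempt;
    case "exponential":
      return baseSeconds * Math.pow(2, attempt - 1);
  }
}

// Total time spent waiting if every retry up to `limit` is used.
function worstCaseWaitSeconds(baseSeconds: number, backoff: Backoff, limit: number): number {
  let totalSeconds = 0;
  for (let attempt = 1; attempt <= limit; attempt++) {
    totalSeconds += retryDelaySeconds(baseSeconds, backoff, attempt);
  }
  return totalSeconds;
}

// Same shapes as the two retry configs in the worked example.
const fetchWait = worstCaseWaitSeconds(10, "exponential", 5); // 10 + 20 + 40 + 80 + 160
const persistWait = worstCaseWaitSeconds(5, "linear", 3); // 5 + 10 + 15
```

Under these assumptions the fetch step could wait about five minutes in total before the instance fails, while the R2 step gives up after roughly half a minute, which fits a flaky upstream API and a usually reliable write.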

Workflow binding config in wrangler.toml:

[[workflows]]
name = "report-workflow"
binding = "REPORT_WORKFLOW"
class_name = "ReportWorkflow"

Kick off an instance from a Worker via the binding:

const instance = await env.REPORT_WORKFLOW.create({ params: { reportId: "abc123" } });
return Response.json({ id: instance.id, status: await instance.status() });

Pricing

Workflows bills on three dimensions: requests, CPU time, and storage. As of this writing:

Dimension	Free	Paid (Workers Paid plan)
Requests	100,000/day (shared with Workers)	10M/month included, then $0.30/million
CPU time	10ms CPU per invocation	30M CPU-ms/month included, then $0.02/million CPU-ms
Storage (state)	1GB/month	1GB/month included, then $0.20/GB-month

Two details worth internalizing: time spent waiting on a fetch response or paused in step.sleep() does not incur CPU time — you're billed for compute, not wall-clock duration. And a Workflow instance's state is retained for 3 days (Free) or 30 days (Paid) by default, which is what the storage line item measures. Pricing changes; confirm current numbers on the live page linked below before quoting them in a proposal.
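
As a worked example of the arithmetic, the sketch below estimates a month's usage-based charges from the Paid column of the table above. The rates are copied from that table and may be out of date. The function name and the sample usage figures are made up for illustration, and the calculation assumes overage is prorated linearly and excludes any plan subscription fee.

```typescript
// Rates copied from the table above (Workers Paid plan); they may be out of date.
const PAID_PLAN = {
  includedRequests: 10_000_000,
  perMillionRequests: 0.30,
  includedCpuMs: 30_000_000,
  perMillionCpuMs: 0.02,
  includedStorageGb: 1,
  perGbMonth: 0.20,
};

type MonthlyUsage = { requests: number; cpuMs: number; storageGb: number };

// Usage-based charges only: each dimension is billed on the overage past its allowance.
function estimateOverageUsd(usage: MonthlyUsage): number {
  const over = (used: number, included: number) => Math.max(0, used - included);
  const requestCost =
    (over(usage.requests, PAID_PLAN.includedRequests) / 1_000_000) * PAID_PLAN.perMillionRequests;
  const cpuCost =
    (over(usage.cpuMs, PAID_PLAN.includedCpuMs) / 1_000_000) * PAID_PLAN.perMillionCpuMs;
  const storageCost =
    over(usage.storageGb, PAID_PLAN.includedStorageGb) * PAID_PLAN.perGbMonth;
  return requestCost + cpuCost + storageCost;
}

// Hypothetical month: 15M requests, 50M CPU-ms, and 2 GB over the allowances.
const estimate = estimateOverageUsd({ requests: 25_000_000, cpuMs: 80_000_000, storageGb: 3 });
console.log(estimate.toFixed(2)); // prints "5.90"
```

The breakdown is 15 × $0.30 for requests, 50 × $0.02 for CPU time, and 2 × $0.20 for storage. Time spent sleeping or awaiting a fetch does not appear anywhere in the inputs, which is the point of the CPU-time billing model.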

Use cases

Order processing pipelines — validate payment, reserve inventory, notify fulfillment, send confirmation — each a separately retryable step instead of one fragile function.
AI agent multi-step tool chains — a chain of LLM calls and tool invocations that can run long, needs to survive a single tool call failing, and shouldn't re-run an expensive earlier step just because a later one timed out.
Data ETL — extract from a source, transform, load into D1/R2/an external system, where each stage has different failure modes and retry needs.
Onboarding sequences — send a welcome email, wait a day (step.sleep), check activation status, send a follow-up — durable, long-lived, and cheap to leave idle.
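
The onboarding sequence is the one use case here that depends on step.sleep, so a sketch may help. To keep it self-contained, StepLike is a minimal stand-in that mirrors the step.do(name, callback) and step.sleep(name, duration) calls rather than importing the real type. OnboardingDeps, onboardUser, and the step names are hypothetical.

```typescript
// Minimal structural stand-in for the parts of the step API used below, so the
// sketch type-checks on its own; in a Workflow you would use the real `step`.
interface StepLike {
  do<T>(name: string, callback: () => Promise<T>): Promise<T>;
  sleep(name: string, duration: string): Promise<void>;
}

// Hypothetical side-effecting helpers, injected so they are easy to fake.
interface OnboardingDeps {
  sendEmail(userId: string, template: "welcome" | "follow-up"): Promise<void>;
  isActivated(userId: string): Promise<boolean>;
}

async function onboardUser(
  step: StepLike,
  deps: OnboardingDeps,
  userId: string
): Promise<"activated" | "followed-up"> {
  await step.do("send welcome email", () => deps.sendEmail(userId, "welcome"));

  // The instance is parked here; nothing runs while it waits.
  await step.sleep("wait before follow-up", "1 day");

  // Branching on a step's result keeps control flow deterministic: on resume,
  // the same cached value produces the same branch.
  const activated = await step.do("check activation", () => deps.isActivated(userId));
  if (activated) return "activated";

  await step.do("send follow-up email", () => deps.sendEmail(userId, "follow-up"));
  return "followed-up";
}
```

Every side effect sits inside a named step, and the only decision in the function depends on a checkpointed value rather than on the clock or a fresh network call. Sending email is not naturally idempotent, so a production version would also want a dedup check or key in sendEmail.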

Pitfall: non-deterministic or non-idempotent logic inside a step body. It's tempting to put something like await charge(customer, amount) directly in a step.do() callback and rely on the retry config to "just handle" failures. But if the charge succeeds and the network drops before Workflows records the step as complete, the retry re-runs the callback — and re-charges the customer. The fix is to make the side effect itself idempotent (pass an idempotency key to the payment API, or check "has this already been charged?" before charging) rather than assuming a step only ever executes once. The same trap applies to logic placed outside step.do() in run() directly: that code can re-execute on engine restart with no caching at all, so anything with a side effect — writes, sends, random IDs used for dedup — needs to live inside a step, not beside one.
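
The idempotency-key fix can be sketched with a fake payment provider. FakePaymentApi and chargeStep are hypothetical and model only the general pattern; a real provider's API and parameter names will differ.

```typescript
// Fake payment provider that deduplicates on an idempotency key.
class FakePaymentApi {
  private chargesByKey = new Map<string, { chargeId: string; amount: number }>();
  public chargeCount = 0;

  charge(idempotencyKey: string, amount: number): { chargeId: string; amount: number } {
    const previous = this.chargesByKey.get(idempotencyKey);
    if (previous) return previous; // same key: return the original result, charge nothing
    this.chargeCount += 1;
    const charge = { chargeId: `ch_${this.chargeCount}`, amount };
    this.chargesByKey.set(idempotencyKey, charge);
    return charge;
  }
}

// Step body. The key is derived from stable inputs, never from Date.now() or
// Math.random(), so every retry of this step presents the same key.
function chargeStep(paymentApi: FakePaymentApi, orderId: string, amountCents: number) {
  return paymentApi.charge(`order:${orderId}:charge`, amountCents);
}

const paymentApi = new FakePaymentApi();
const lostResponse = chargeStep(paymentApi, "o-42", 1999); // succeeds, but pretend the result was never recorded
const retriedCharge = chargeStep(paymentApi, "o-42", 1999); // the retry re-runs the callback
```

Both calls return the same chargeId and chargeCount stays at 1, so the retry is harmless. The deduplication lives on the provider's side, which is what makes this safer than a local check when the failure happens between the charge and your own bookkeeping.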

Primary source

Cloudflare Workflows — Rules of Workflows is the canonical page on determinism and idempotency requirements referenced in this lesson; pair it with the Workflows pricing page for current numbers, since pricing is subject to change.

A step calls a payment API and the callback throws after the charge actually succeeded (e.g. the response was lost). Workflows retries the step. What's the correct way to prevent a duplicate charge?

Without scrolling up: what does a step's name actually do, and what runs underneath a Workflow instance to give it durable, checkpointed state?


A step's name acts as a cache key — once that named step completes, its result is persisted, and a retry or resume reuses the cached result instead of re-running the step. That's why step names must be deterministic (fixed, not built from Date.now() or similar).

Each Workflow instance is backed by a Durable Object, which is where the single-instance execution and persisted step state come from — Workflows is a step-oriented API layered on top of that.

