
Run AI inference at the edge with Workers AI

After this lesson you'll be able to call an open-source model from a Worker with one binding call, stream the output back to a client, and pick a model size that matches your latency budget instead of defaulting to the biggest one available.

Workers AI is serverless GPU inference: you call a model — text generation, image generation, speech-to-text, embeddings, classification, and more — and Cloudflare runs it on GPUs distributed across its network. There's no GPU instance to provision, no model server to deploy or scale, and no idle capacity to pay for. You bind a Worker to AI, call env.AI.run() with a model name and input, and get a result back. The infrastructure — which GPU, in which data center, how the model is loaded and kept warm — is Cloudflare's problem, not yours.

Not your own model. Workers AI runs a curated catalog of 50+ open-source models (Llama, Mistral, Whisper, Stable Diffusion/FLUX variants, BGE embeddings, and others); you don't upload custom weights or fine-tune here. If you need a model outside the catalog, or need to route between Workers AI and other providers with retries, fallback, and caching, that's the job of AI Gateway, which sits in front of those providers (covered in the next lesson).

How it works

Three things to know about the mechanics:

The binding, not a fetch call. You declare an ai binding in your Wrangler config, and Cloudflare injects an AI object into env. Calling env.AI.run(model, input) is a method call on that binding. The request never leaves Cloudflare's network to hit a public API endpoint; it's routed internally to available inference capacity.
Model names are catalog identifiers. Each model is addressed by a string like @cf/meta/llama-3.1-8b-instruct or @cf/baai/bge-m3 — the @cf/ prefix plus vendor and model name. The full catalog, with input/output schemas per model, is at the Workers AI models page; task categories include text generation, text-to-image, automatic speech recognition, text embeddings, image classification, object detection, translation, and summarization.
Streaming is a flag, not a different API. Text-generation models accept stream: true in the input. Instead of a JSON object, env.AI.run() then resolves to a ReadableStream of server-sent events, which you can hand straight to the Response constructor — the tokens flow to the client as the model produces them, instead of the client waiting for the full generation to finish.

Worked example

Binding config:

// wrangler.jsonc
{
  "ai": {
    "binding": "AI"
  }
}

A Worker that accepts a prompt and streams a chat completion back to the caller:

export interface Env {
  AI: Ai;
}

export default {
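  async fetch(request, env): Promise<Response> {
    // Expect a JSON body of the form { "prompt": "..." }.
    const { prompt } = await request.json<{ prompt: string }>();

    // With stream: true, run() resolves to a ReadableStream of server-sent events.
    const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a concise technical assistant." },
        { role: "user", content: prompt },
      ],
      stream: true,
    });

    // Pass the stream straight through so tokens reach the client as they are generated.
    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });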
  },
} satisfies ExportedHandler<Env>;

The client reads it like any other SSE stream — each chunk arrives as soon as the model emits it, so a user sees the reply appear token-by-token instead of staring at a spinner for the whole generation.
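
To make the client side concrete, here is a minimal sketch of pulling the generated text out of the streamed body. It assumes each event is a data: line carrying JSON with a response field, that events are separated by a blank line, and that the stream ends with a data: [DONE] sentinel; check the model's output schema in the catalog before relying on that shape. extractTokens and ParsedChunk are hypothetical names for illustration, not part of any Cloudflare API.

```typescript
// Sketch: extract generated text from a buffer of server-sent events.
// Assumption: events look like `data: {"response":"..."}`, are separated by
// "\n\n", and the stream ends with `data: [DONE]`.
interface ParsedChunk {
  tokens: string[]; // text fragments found in complete events
  rest: string;     // trailing partial event to prepend to the next chunk
  done: boolean;    // true once the [DONE] sentinel has been seen
}

function extractTokens(buffer: string): ParsedChunk {
  const events = buffer.split("\n\n");
  // The last piece may be an incomplete event; keep it for the next call.
  const rest = events.pop() ?? "";
  const tokens: string[] = [];
  let done = false;

  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue; // ignore comments, ids, etc.
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        done = true;
        continue;
      }
      try {
        const parsed = JSON.parse(payload) as { response?: unknown };
        if (typeof parsed.response === "string") tokens.push(parsed.response);
      } catch {
        // Skip payloads that are not valid JSON rather than abort the stream.
      }
    }
  }
  return { tokens, rest, done };
}
```

A caller would decode each network chunk to text, prepend the rest value from the previous call, and append the returned tokens to the UI until done is true.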

The same binding, called without stream, works the same way for embeddings and resolves to a plain JSON object. This is the pattern you'd use to populate a Vectorize index:

const { data } = await env.AI.run("@cf/baai/bge-m3", {
  text: ["Cloudflare Workers run JavaScript at the edge."],
});
// data[0] is a float embedding vector, ready to upsert into Vectorize

Pricing

Workers AI bills in Neurons — a unit that normalizes GPU compute cost across very different model types (a text-generation token and an image-generation step consume GPU differently, so Neurons let Cloudflare price both on one scale). As of this writing:

Item	Amount
Free allocation	10,000 Neurons/day, on both Free and Paid Workers plans, resetting at 00:00 UTC
Beyond the free allocation (Workers Paid plan required)	$0.011 per 1,000 Neurons

Neuron cost per call scales with model size, as you'd expect: roughly 2,457 Neurons per million input tokens on a small model like Llama 3.2 1B, versus roughly 25,600 Neurons per million input tokens on Llama 3.1 8B, and roughly 45,000+ Neurons per million input tokens on a large model like DeepSeek R1 32B. Image, speech, and embedding models are priced per their own input unit (tiles, minutes, or tokens) rather than per generated text token. Treat these figures as illustrative — confirm exact current numbers on the pricing page linked below before quoting them in a proposal.
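
The arithmetic is easy to sketch. The helper below is illustrative only: it uses the rates quoted in this lesson as placeholders, counts input tokens only (output tokens are billed at their own rate and are ignored here), and the function names are hypothetical.

```typescript
// Rates quoted in this lesson; treat them as placeholders, not live prices.
const FREE_NEURONS_PER_DAY = 10000;
const USD_PER_1000_NEURONS = 0.011;

// Neurons consumed by `inputTokens` at a given per-million-token rate.
function neuronsForInputTokens(inputTokens: number, neuronsPerMillion: number): number {
  return (inputTokens / 1000000) * neuronsPerMillion;
}

// Billable cost for one day's usage after the free allocation is subtracted.
function dailyCostUsd(neuronsUsed: number): number {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}

// One million input tokens in a single day, using the illustrative rates above.
// Small model: 2,457 Neurons, which stays inside the free allocation.
const smallModelCost = dailyCostUsd(neuronsForInputTokens(1000000, 2457));
// 8B model: 25,600 Neurons, of which 15,600 are billable.
const midModelCost = dailyCostUsd(neuronsForInputTokens(1000000, 25600));
```

Under these assumptions the same day's traffic is free on the small model and costs roughly 17 cents on the 8B model, before output tokens are counted.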

Use cases

Chatbots and support assistants — a text-generation model behind a Worker, often paired with streaming so replies feel responsive.
Summarization — condensing support tickets, documents, or transcripts, either with a dedicated summarization model or a general instruct model with a summarization prompt.
Content moderation — classification models scoring text or images for policy violations before they're stored or served.
Embeddings generation for Vectorize — running text through a BGE-family embedding model to produce vectors for semantic search, then upserting them into a Vectorize index.

Pitfall: reaching for the biggest model by default. It's tempting to pick the largest, most capable model in the catalog for every task, on the assumption that more parameters means a better result. For latency-sensitive paths — a chatbot reply, an autocomplete suggestion, a moderation check gating a request — that's usually the wrong trade. Larger models take longer to generate each token and cost more Neurons per call, and for narrow tasks (classification, short-form summarization, simple Q&A) a small instruct model often matches a large one on quality while responding several times faster and cheaper. Match model size to task difficulty: benchmark a small model on your actual inputs before assuming you need a large one, and reserve the largest models for tasks that genuinely need the extra reasoning depth.
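
One way to turn that advice into a repeatable decision is to record a few measurements per candidate model and choose the cheapest one that clears your bars. The sketch below assumes you have already benchmarked each model on your own inputs; the Candidate fields and pickModel are hypothetical names, and the selection rule (cheapest model that meets both thresholds) is one reasonable choice rather than a prescribed method.

```typescript
// Hypothetical benchmark record; fill these in from your own measurements.
interface Candidate {
  model: string;            // catalog identifier, e.g. "@cf/meta/llama-3.1-8b-instruct"
  p50LatencyMs: number;     // median latency measured on your actual inputs
  qualityScore: number;     // your own evaluation score, from 0 to 1
  neuronsPerMInput: number; // Neurons per million input tokens, from the pricing page
}

// Return the cheapest candidate that meets both the latency budget and the
// quality bar, or undefined if none qualifies.
function pickModel(
  candidates: Candidate[],
  maxLatencyMs: number,
  minQuality: number,
): Candidate | undefined {
  return candidates
    .filter((c) => c.p50LatencyMs <= maxLatencyMs && c.qualityScore >= minQuality)
    .sort((a, b) => a.neuronsPerMInput - b.neuronsPerMInput)[0];
}
```

If the function returns undefined, either the latency budget or the quality bar is unrealistic for the models tested, which is a more useful signal than silently falling back to the largest model.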

Primary source

Cloudflare Workers AI docs for mechanics and the model catalog, and the Workers AI pricing page for current Neuron rates. Both are the canonical, most current sources, since the model lineup and pricing change over time.

You're building a chatbot where response latency matters a lot, and a small instruct model scores just as well as a large one on your test prompts. What should you do?

Without scrolling up: what does passing stream: true to env.AI.run() actually change about what it returns, and what unit does Workers AI use to price inference?

Answer

With stream: true, a text-generation model's env.AI.run() call resolves to a ReadableStream of server-sent events instead of a JSON object — you can pass that stream directly into new Response() so tokens reach the client as they're generated, rather than all at once at the end.

Pricing is denominated in Neurons, a normalized unit of GPU compute across model types, with 10,000 free per day and $0.011 per 1,000 beyond that on the Paid plan.

Anything above unclear — the binding model, streaming mechanics, or how to think about Neuron cost versus model size? Ask your AI teacher before moving on.
