After this lesson you'll be able to call an open-source model from a Worker with one binding call, stream the output back to a client, and pick a model size that matches your latency budget instead of defaulting to the biggest one available.
Workers AI is serverless GPU inference: you call a model — text generation, image generation, speech-to-text, embeddings, classification, and more — and Cloudflare runs it on GPUs distributed across its network. There's no GPU instance to provision, no model server to deploy or scale, and no idle capacity to pay for. You bind a Worker to AI, call env.AI.run() with a model name and input, and get a result back. The infrastructure — which GPU, in which data center, how the model is loaded and kept warm — is Cloudflare's problem, not yours.
Three things to know about the mechanics:
ai binding in your Wrangler config, and Cloudflare injects an AI object into env. Calling env.AI.run(model, input) is a method call on that binding — the request never leaves Cloudflare's network to hit a public API endpoint, it's routed internally to available inference capacity.@cf/meta/llama-3.1-8b-instruct or @cf/baai/bge-m3 — the @cf/ prefix plus vendor and model name. The full catalog, with input/output schemas per model, is at the Workers AI models page; task categories include text generation, text-to-image, automatic speech recognition, text embeddings, image classification, object detection, translation, and summarization.stream: true in the input. Instead of a JSON object, env.AI.run() then resolves to a ReadableStream of server-sent events, which you can hand straight to the Response constructor — the tokens flow to the client as the model produces them, instead of the client waiting for the full generation to finish.Binding config:
// wrangler.jsonc
{
"ai": {
"binding": "AI"
}
}
A Worker that accepts a prompt and streams a chat completion back to the caller:
export interface Env {
AI: Ai;
}
export default {
async fetch(request, env): Promise<Response> {
const { prompt } = await request.json<{ prompt: string }>();
const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
messages: [
{ role: "system", content: "You are a concise technical assistant." },
{ role: "user", content: prompt },
],
stream: true,
});
return new Response(stream, {
headers: { "content-type": "text/event-stream" },
});
},
} satisfies ExportedHandler<Env>;
The client reads it like any other SSE stream — each chunk arrives as soon as the model emits it, so a user sees the reply appear token-by-token instead of staring at a spinner for the whole generation.
The same binding, called without stream, works identically for embeddings — the pattern you'd use to populate a Vectorize index:
const { data } = await env.AI.run("@cf/baai/bge-m3", {
text: ["Cloudflare Workers run JavaScript at the edge."],
});
// data[0] is a float embedding vector, ready to upsert into Vectorize
Workers AI bills in Neurons — a unit that normalizes GPU compute cost across very different model types (a text-generation token and an image-generation step consume GPU differently, so Neurons let Cloudflare price both on one scale). As of this writing:
| Item | Amount |
|---|---|
| Free allocation | 10,000 Neurons/day, on both Free and Paid Workers plans, resetting at 00:00 UTC |
| Beyond the free allocation (Workers Paid plan required) | $0.011 per 1,000 Neurons |
Neuron cost per call scales with model size, as you'd expect: roughly 2,457 Neurons per million input tokens on a small model like Llama 3.2 1B, versus roughly 25,600 Neurons per million input tokens on Llama 3.1 8B, and roughly 45,000+ Neurons per million input tokens on a large model like DeepSeek R1 32B. Image, speech, and embedding models are priced per their own input unit (tiles, minutes, or tokens) rather than per generated text token. Treat these figures as illustrative — confirm exact current numbers on the pricing page linked below before quoting them in a proposal.
Cloudflare Workers AI docs for mechanics and the model catalog, and the Workers AI pricing page for current Neuron rates — both are the canonical, most-current source since model lineup and pricing change over time.
stream: true to env.AI.run() actually change about what it returns, and what unit does Workers AI use to price inference?With stream: true, a text-generation model's env.AI.run() call resolves to a ReadableStream of server-sent events instead of a JSON object — you can pass that stream directly into new Response() so tokens reach the client as they're generated, rather than all at once at the end.
Pricing is denominated in Neurons, a normalized unit of GPU compute across model types, with 10,000 free per day and $0.011 per 1,000 beyond that on the Paid plan.