After this lesson you'll be able to write a single Worker that embeds a query, retrieves context from Vectorize, calls an LLM through AI Gateway, and returns a grounded answer — and know when that Worker should become a Workflow instead.
The previous lessons covered Workers AI, Vectorize, AI Gateway, and Workflows as separate products. Nobody ships them separately. A real "agent" or RAG (retrieval-augmented generation) endpoint on Cloudflare is one Worker that calls three or four of these bindings in sequence inside a single request. This lesson is the capstone: it wires them together into one coherent pipeline, and it's deliberately code-heavy — the composition is the lesson.
Workers AI runs both the embedding model and the generative model, each invoked through env.AI.run().
Vectorize is the vector database that finds semantically similar chunks of your own content.
AI Gateway sits in front of the LLM call as a proxy — caching, rate limiting, retries, fallback, and analytics — without changing your application logic.
Workflows is what you reach for when the pipeline needs more steps than fit comfortably in one request/response, or when individual steps need independent, durable retries.
A RAG request flows through five stages, all inside one fetch handler:
1. Request arrives with a user query
2. Workers AI embeds the query → env.AI.run(embeddingModel, { text })
3. Vectorize finds similar chunks → env.VECTORIZE.query(vector, { topK })
4. AI Gateway-fronted LLM call → env.AI.run(llmModel, { messages }, { gateway }), using the retrieved chunks as context
5. Response returned to the caller
Steps 2 and 4 both go through env.AI — Workers AI is the compute layer for both the embedding model and the generative model. Step 3 is a different binding, env.VECTORIZE, pointing at an index you populated ahead of time (typically via a separate ingestion Worker or script that chunks your documents, embeds each chunk, and calls insert()). AI Gateway isn't a separate binding at all — it's an option you pass to env.AI.run() that routes the call through a named gateway for logging, caching, and fallback, or a proxy endpoint you point an external SDK at if you're calling OpenAI/Anthropic/etc. instead of a Workers AI model.
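The ingestion side mentioned above (chunk, embed, insert) might look roughly like the sketch below. It is an illustration under stated assumptions, not the lesson's own ingestion code: `chunkText`, `ingestDocument`, the chunk size and overlap, and the two small interfaces are all hypothetical names introduced here, and the interfaces only stand in for the real `Ai` and `VectorizeIndex` bindings so the sketch is self-contained.

```typescript
// Structural stand-ins for the env.AI and env.VECTORIZE bindings (assumption:
// only the two methods used here are modeled).
interface EmbeddingClient {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}
interface VectorStore {
  insert(
    vectors: { id: string; values: number[]; metadata: { text: string } }[]
  ): Promise<unknown>;
}

// Hypothetical chunker: fixed-size character windows with a small overlap.
// Real pipelines often split on headings or sentences instead.
function chunkText(text: string, size = 800, overlap = 100): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

// Embeds every chunk in one call and stores each vector with its source text
// in metadata, so the query-time Worker can hand that text to the LLM.
async function ingestDocument(
  ai: EmbeddingClient,
  index: VectorStore,
  docId: string,
  text: string
): Promise<number> {
  const chunks = chunkText(text);
  if (chunks.length === 0) return 0;
  const { data } = await ai.run("@cf/baai/bge-base-en-v1.5", { text: chunks });
  await index.insert(
    chunks.map((chunk, i) => ({
      id: `${docId}-${i}`,
      values: data[i],
      metadata: { text: chunk },
    }))
  );
  return chunks.length;
}
```

Large documents would likely need the embed and insert calls split into batches, since both services cap how many items one call accepts; check the current limits before relying on a single call.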
The Worker below assumes a Vectorize index already populated with document chunks, each carrying a text field in its metadata so you can hand the original chunk text back to the LLM as context (Vectorize stores vectors and metadata, not your source documents).
export interface Env {
  AI: Ai;
  VECTORIZE: VectorizeIndex;
}

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";
const LLM_MODEL = "@cf/meta/llama-3.1-8b-instruct-fast";
const MIN_RELEVANCE_SCORE = 0.72; // tune against your own index

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { query } = await request.json<{ query: string }>();
    if (!query?.trim()) {
      return Response.json({ error: "query is required" }, { status: 400 });
    }

    // 1. Embed the query with Workers AI.
    const embedding = await env.AI.run(EMBEDDING_MODEL, { text: [query] });
    const [queryVector] = embedding.data;

    // 2. Retrieve nearest chunks from Vectorize.
    const matches = await env.VECTORIZE.query(queryVector, {
      topK: 5,
      returnMetadata: "all",
    });

    // 3. Pitfall guard: zero (or all low-confidence) matches. See callout below.
    const relevant = matches.matches.filter((m) => m.score >= MIN_RELEVANCE_SCORE);
    if (relevant.length === 0) {
      return Response.json({
        answer: "I don't have enough grounded information to answer that confidently.",
        sources: [],
      });
    }

    const context = relevant
      .map((m, i) => `[${i + 1}] ${m.metadata?.text ?? ""}`)
      .join("\n\n");

    // 4. Call the LLM through AI Gateway: same env.AI binding, gateway option added.
    const completion = await env.AI.run(
      LLM_MODEL,
      {
        messages: [
          {
            role: "system",
            content:
              "Answer using only the numbered context below. If the context doesn't " +
              "contain the answer, say so explicitly instead of guessing. Cite sources " +
              "by their [number].",
          },
          { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` },
        ],
      },
      {
        gateway: {
          id: "rag-prod", // named gateway: gives you logs, caching, analytics
          skipCache: false, // identical (query, context) pairs can hit cache
        },
      }
    );

    // 5. Return the answer plus the sources actually used, for citation/debugging.
    return Response.json({
      answer: completion.response,
      sources: relevant.map((m) => ({ id: m.id, score: m.score })),
    });
  },
} satisfies ExportedHandler<Env>;
Bindings in wrangler.toml:
[ai]
binding = "AI"
[[vectorize]]
binding = "VECTORIZE"
index_name = "docs-index"
Everything above runs inside a single request. Total latency is roughly: embed (tens of ms) + Vectorize query (single-digit ms to tens of ms) + LLM generation (hundreds of ms to a few seconds, dominated by token count). For a chat-style RAG endpoint that's usually fine — it's one HTTP round trip end to end.
Pitfall: Vectorize returns the nearest vectors it has, whether or not any of them is actually relevant. Treat matches.matches.length === 0 as a hard floor, and filter on m.score against a relevance threshold you've tuned for your embedding model and index: a match score doesn't mean "relevant," it means "closest of what exists." Return an explicit "I don't know" response (or fall back to a non-RAG answer, clearly labeled as such) rather than silently degrading to ungrounded generation.
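One way to keep that guard testable is to pull it out of the fetch handler into a pure function. The sketch below is a possible shape, not part of the lesson's Worker: `selectContext`, `ScoredMatch`, and `Retrieval` are hypothetical names, and the `>=` comparison assumes a metric where a higher score means a closer match (cosine or dot product); with a Euclidean index the comparison would need to flip.

```typescript
// Minimal shape of a Vectorize match as used here (assumption: only these
// fields matter for the guard).
interface ScoredMatch {
  id: string;
  score: number;
  metadata?: Record<string, unknown>;
}

// Either grounded context plus its sources, or an explicit reason why not.
type Retrieval =
  | { grounded: true; context: string; sources: { id: string; score: number }[] }
  | { grounded: false; reason: "no_matches" | "below_threshold" };

function selectContext(matches: ScoredMatch[], minScore: number): Retrieval {
  // Hard floor: nothing came back at all.
  if (matches.length === 0) return { grounded: false, reason: "no_matches" };

  // Soft floor: matches exist, but "closest of what exists" is not "relevant".
  // Assumes higher score = more similar (cosine / dot product).
  const relevant = matches.filter((m) => m.score >= minScore);
  if (relevant.length === 0) return { grounded: false, reason: "below_threshold" };

  // Number the chunks so the LLM can cite them as [1], [2], ...
  const context = relevant
    .map((m, i) => `[${i + 1}] ${String(m.metadata?.text ?? "")}`)
    .join("\n\n");
  return {
    grounded: true,
    context,
    sources: relevant.map((m) => ({ id: m.id, score: m.score })),
  };
}
```

Returning a reason instead of an empty array lets the caller log how often each case occurs, which is useful data when tuning the threshold.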
The single-Worker pipeline above is the right shape as long as it's a straight line: embed → retrieve → generate → respond, one shot, and if any step fails you're comfortable just failing the request. Move to a Workflow once the agent needs any of the following:
- More steps than fit comfortably in one request/response.
- Individual steps that need independent, durable retries: each round of a multi-round agent loop gets its own step.do(), and a failure in round 3 shouldn't force re-running (and re-billing) rounds 1 and 2.

The pattern in practice: a Workflow's run() method calls env.AI and env.VECTORIZE from inside its step.do() callbacks, exactly as shown above. Workflows doesn't replace this pipeline; it wraps it with checkpointing and retry semantics once a single fetch handler isn't durable enough.
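As a rough illustration of that wrapping, the sketch below runs the three stages as separate named steps. It is written against a small structural interface rather than the real Workflows types so it can run anywhere: `StepLike` stands in for the `step` argument a Workflow's run() receives, and `PipelineDeps`, `runRagSteps`, and the step names are hypothetical. In a real Workflow the callbacks would call env.AI and env.VECTORIZE directly.

```typescript
// Stand-in for the Workflow `step` object (assumption: only do() is modeled).
interface StepLike {
  do<T>(name: string, callback: () => Promise<T>): Promise<T>;
}

// Hypothetical dependencies, injected so the sketch needs no real bindings.
interface PipelineDeps {
  embed(query: string): Promise<number[]>;
  retrieve(vector: number[]): Promise<string[]>;
  generate(query: string, context: string[]): Promise<string>;
}

// Each stage is its own step, so a retry of "generate answer" can reuse the
// checkpointed results of the earlier steps instead of re-running them.
async function runRagSteps(
  step: StepLike,
  deps: PipelineDeps,
  query: string
): Promise<string> {
  const vector = await step.do("embed query", () => deps.embed(query));
  const chunks = await step.do("retrieve context", () => deps.retrieve(vector));
  return step.do("generate answer", () => deps.generate(query, chunks));
}
```

Because step results are persisted between steps, each callback should return plain serializable data (arrays, strings, numbers) rather than binding objects or class instances.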
A single RAG request touches three separate billing dimensions. As of this writing:
| Product | Free tier | Paid (Workers Paid plan) |
|---|---|---|
| Workers AI (Neurons) | 10,000 Neurons/day | $0.011 per 1,000 Neurons beyond the daily free allowance; per-model rates vary (a small model like Llama 3.2 1B costs far less per token than a 70B model) |
| Vectorize | 30M queried + 5M stored vector dimensions/month | 50M queried dimensions/month included, then $0.01/million; 10M stored dimensions included, then $0.05/100M |
| AI Gateway | Free — analytics, caching, rate limiting | Free at the core; persistent logs are capped (100K on Free, 10M/gateway on Paid) and Logpush is a Paid-only add-on at $0.05/million |
AI Gateway itself doesn't add cost to the underlying inference call: it passes through provider and Workers AI pricing with no markup (aside from the 5% fee on Cloudflare's unified billing/credits feature, if you use it). Its main cost lever is a savings one: a cache hit on an identical prompt avoids paying for inference at all. Confirm current numbers on the pricing pages linked below before quoting them; Neuron pricing per model in particular changes as Cloudflare adds models.
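To make the table concrete, here is a small overage calculation using the rates quoted above. Treat it as back-of-the-envelope arithmetic: the constants are a snapshot copied from the table, `overageCost` is a hypothetical helper, and the usage figures are made up for illustration.

```typescript
// Rates copied from the table above; a snapshot, not a source of truth.
const NEURONS_FREE_PER_DAY = 10_000;
const USD_PER_1K_NEURONS = 0.011;
const QUERIED_DIMS_INCLUDED_PER_MONTH = 50_000_000;
const USD_PER_MILLION_QUERIED_DIMS = 0.01;

// Cost of usage beyond an included allowance, billed per `unitSize` units.
function overageCost(
  used: number,
  included: number,
  usdPerUnit: number,
  unitSize: number
): number {
  const billable = Math.max(0, used - included); // nothing owed under the allowance
  return (billable / unitSize) * usdPerUnit;
}

// Hypothetical usage: 25,000 Neurons in one day, 80M queried dimensions in a month.
const neuronCost = overageCost(25_000, NEURONS_FREE_PER_DAY, USD_PER_1K_NEURONS, 1_000);
const vectorizeCost = overageCost(
  80_000_000,
  QUERIED_DIMS_INCLUDED_PER_MONTH,
  USD_PER_MILLION_QUERIED_DIMS,
  1_000_000
);

console.log(`Workers AI overage: $${neuronCost.toFixed(3)}`);
console.log(`Vectorize query overage: $${vectorizeCost.toFixed(2)}`);
```

With these made-up numbers the overage comes to roughly 15 billable units of 1,000 Neurons and 30 billable units of one million queried dimensions, both well under a dollar. A request served from the AI Gateway cache would not add to the Neuron figure at all.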
AI Gateway — Workers AI binding integration documents the exact gateway option shown in this lesson's code. Pair it with the Vectorize getting started guide for index creation/query syntax and the Workers AI pricing page for current Neuron rates, since per-model pricing changes as new models ship.
The gateway-fronted LLM call uses the same env.AI binding as any other Workers AI call (including the embedding step). What routes it through AI Gateway is passing a third argument to env.AI.run(), a { gateway: { id: "..." } } options object, not a separate binding or client.