
Building an AI agent on Workers: tying Workers AI, Vectorize, and AI Gateway together

After this lesson you'll be able to write a single Worker that embeds a query, retrieves context from Vectorize, calls an LLM through AI Gateway, and returns a grounded answer — and know when that Worker should become a Workflow instead.

The previous lessons covered Workers AI, Vectorize, AI Gateway, and Workflows as separate products. Nobody ships them separately. A real "agent" or RAG (retrieval-augmented generation) endpoint on Cloudflare is one Worker that calls three or four of these bindings in sequence inside a single request. This lesson is the capstone: it wires them together into one coherent pipeline, and it's deliberately code-heavy — the composition is the lesson.

The four roles, in one sentence each. Workers AI runs the models (embeddings and the LLM) on Cloudflare's GPUs, invoked via env.AI.run(). Vectorize is the vector database that finds semantically similar chunks of your own content. AI Gateway sits in front of the LLM call as a proxy — caching, rate limiting, retries, fallback, and analytics — without changing your application logic. Workflows is what you reach for when the pipeline needs more steps than fit comfortably in one request/response, or when individual steps need independent, durable retries.

How the pieces compose

A RAG request flows through five stages, all inside one fetch handler:

1. Request arrives with a user query
2. Workers AI embeds the query        → env.AI.run(embeddingModel, { text })
3. Vectorize finds similar chunks     → env.VECTORIZE.query(vector, { topK })
4. AI Gateway-fronted LLM call        → env.AI.run(llmModel, { messages }, { gateway })
   using the retrieved chunks as context
5. Response returned to the caller

Steps 2 and 4 both go through env.AI: Workers AI is the compute layer for both the embedding model and the generative model. Step 3 is a different binding, env.VECTORIZE, pointing at an index you populated ahead of time (typically via a separate ingestion Worker or script that chunks your documents, embeds each chunk, and calls insert()). AI Gateway isn't a separate binding at all. For a Workers AI model, it's an option you pass to env.AI.run() that routes the call through a named gateway for logging, caching, and fallback. If you're calling OpenAI, Anthropic, or another external provider instead, it's a proxy endpoint you point that provider's SDK at.
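The ingestion side mentioned above (chunk, embed, insert) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: chunkText, ingestDocument, EmbedBatch, VectorRecord, and StoreVectors are hypothetical names, the fixed-size character windows are just one simple chunking strategy, and the embedding and storage calls are injected so the sketch type-checks without the Workers type definitions.

```typescript
// Hypothetical minimal shapes so the sketch stands alone. In a Worker,
// `embed` would wrap env.AI.run(...) and `store` would wrap the index write.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { text: string; source: string };
}
type StoreVectors = (vectors: VectorRecord[]) => Promise<unknown>;

// Split text into overlapping fixed-size character windows.
function chunkText(text: string, size = 800, overlap = 100): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    const piece = text.slice(start, start + size).trim();
    if (piece) chunks.push(piece);
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

// Embed chunks in batches and store each with its text in metadata, so the
// query-time Worker can hand the original chunk text back to the LLM.
async function ingestDocument(
  docId: string,
  text: string,
  embed: EmbedBatch,
  store: StoreVectors,
  batchSize = 50,
): Promise<number> {
  const chunks = chunkText(text);
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = await embed(batch);
    if (vectors.length !== batch.length) {
      throw new Error("embedding count does not match chunk count");
    }
    await store(
      batch.map((chunk, j) => ({
        id: `${docId}#${i + j}`, // deterministic id, so re-ingesting overwrites
        values: vectors[j],
        metadata: { text: chunk, source: docId },
      })),
    );
  }
  return chunks.length;
}
```

In a Worker, embed could be `async (texts) => (await env.AI.run(EMBEDDING_MODEL, { text: texts })).data` and store could be `(v) => env.VECTORIZE.upsert(v)`. The deterministic ids are a design choice that makes re-running ingestion idempotent when paired with upsert() rather than insert().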

Worked example: one Worker, full pipeline

This assumes a Vectorize index already populated with document chunks, each with a text field in its metadata so you can hand the original chunk text back to the LLM as context (Vectorize stores vectors and metadata, not your source documents).

export interface Env {
  AI: Ai;
  VECTORIZE: VectorizeIndex;
}

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";
const LLM_MODEL = "@cf/meta/llama-3.1-8b-instruct-fast";
const MIN_RELEVANCE_SCORE = 0.72; // tune against your own index

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
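    // A malformed JSON body should produce a 400, not an unhandled exception.
    const body = await request.json<{ query?: string }>().catch(() => null);
    const query = body?.query;
    if (typeof query !== "string" || !query.trim()) {
      return Response.json({ error: "query is required" }, { status: 400 });
    }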

    // 1. Embed the query with Workers AI.
    const embedding = await env.AI.run(EMBEDDING_MODEL, { text: [query] });
    const [queryVector] = embedding.data;

    // 2. Retrieve nearest chunks from Vectorize.
    const matches = await env.VECTORIZE.query(queryVector, {
      topK: 5,
      returnMetadata: "all",
    });

    // 3. Pitfall guard: zero (or all low-confidence) matches. See callout below.
    //    The >= comparison assumes a cosine index, where a higher score means closer.
    const relevant = matches.matches.filter((m) => m.score >= MIN_RELEVANCE_SCORE);
    if (relevant.length === 0) {
      return Response.json({
        answer: "I don't have enough grounded information to answer that confidently.",
        sources: [],
      });
    }

    const context = relevant
      .map((m, i) => `[${i + 1}] ${m.metadata?.text ?? ""}`)
      .join("\n\n");

    // 4. Call the LLM through AI Gateway — same env.AI binding, gateway option added.
    const completion = await env.AI.run(
      LLM_MODEL,
      {
        messages: [
          {
            role: "system",
            content:
              "Answer using only the numbered context below. If the context doesn't " +
              "contain the answer, say so explicitly instead of guessing. Cite sources " +
              "by their [number].",
          },
          { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` },
        ],
      },
      {
        gateway: {
          id: "rag-prod",       // named gateway — gives you logs, caching, analytics
          skipCache: false,     // identical (query, context) pairs can hit cache
        },
      }
    );

    // 5. Return the answer plus the sources actually used, for citation/debugging.
    return Response.json({
      answer: completion.response,
      sources: relevant.map((m) => ({ id: m.id, score: m.score })),
    });
  },
} satisfies ExportedHandler<Env>;

Bindings in wrangler.toml:

[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "docs-index"

Everything above runs inside a single request. Total latency is roughly: embed (tens of ms) + Vectorize query (single-digit ms to tens of ms) + LLM generation (hundreds of ms to a few seconds, dominated by token count). For a chat-style RAG endpoint that's usually fine — it's one HTTP round trip end to end.
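The latency figures above are rough, so it can be worth measuring each stage on your own deployment. One option is to time every awaited call and expose the result in a Server-Timing response header. A minimal sketch, where timed and serverTimingHeader are hypothetical helper names rather than part of any Cloudflare API:

```typescript
// Run an async stage and record its wall-clock duration under `label`.
async function timed<T>(
  label: string,
  timings: Record<string, number>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Recorded even when the stage throws, so failures still show up.
    timings[label] = Date.now() - start;
  }
}

// Format timings as a Server-Timing header value,
// one "name;dur=milliseconds" entry per stage.
function serverTimingHeader(timings: Record<string, number>): string {
  return Object.entries(timings)
    .map(([name, ms]) => `${name};dur=${ms}`)
    .join(", ");
}
```

In the fetch handler this would wrap each stage, for example `await timed("embed", timings, () => env.AI.run(EMBEDDING_MODEL, { text: [query] }))`, and the final response would carry `serverTimingHeader(timings)` as its Server-Timing header. The Workers runtime only advances its clock across I/O, so these numbers reflect time spent awaiting the binding calls, not CPU time.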

Pitfall: assuming Vectorize always returns useful context. Naive RAG implementations skip straight from "query the index" to "stuff whatever comes back into the prompt." Two failure modes follow. First, Vectorize can return zero matches (an empty index, or a metadata filter that excludes everything). Second, it can return matches that are the nearest neighbors yet not actually relevant: a nearest-neighbor query returns the top-K closest vectors under whatever distance metric the index uses, even when the closest one is semantically unrelated to the query. Feed either result into an LLM with a system prompt like "answer using the context" and it will hallucinate from irrelevant chunks or confidently answer from nothing. The code above handles both with one guard: it filters on m.score against a relevance threshold and treats an empty filtered list as "no grounded answer," which also covers the case where matches.matches is empty to begin with. The threshold has to be tuned for your embedding model and index, and its direction depends on the metric: the >= comparison assumes a similarity-style score such as cosine, where higher means closer, whereas with euclidean distance lower means closer. A match score doesn't mean "relevant," it means "closest of what exists." Return an explicit "I don't know" response (or fall back to a non-RAG answer, clearly labeled as such) rather than silently degrading to ungrounded generation.
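Because the direction of the score depends on the index's distance metric, the relevance guard can be pulled into a small pure function that makes the direction explicit. A sketch under assumptions: selectRelevant, Metric, and ScoredMatch are hypothetical names, and the treatment of each metric (cosine and dot-product as higher-is-closer, euclidean as lower-is-closer) reflects my understanding of Vectorize's scoring and should be checked against the index you actually created.

```typescript
type Metric = "cosine" | "dot-product" | "euclidean";

interface ScoredMatch {
  id: string;
  score: number;
}

// Keep only matches on the "close" side of the threshold for the given metric.
// An empty input yields an empty output, so one length check on the result
// covers both "no matches" and "no relevant matches".
function selectRelevant<M extends ScoredMatch>(
  matches: M[],
  metric: Metric,
  threshold: number,
): M[] {
  const isClose =
    metric === "euclidean"
      ? (score: number) => score <= threshold // distance: smaller is closer
      : (score: number) => score >= threshold; // similarity: larger is closer
  return matches.filter((m) => Number.isFinite(m.score) && isClose(m.score));
}
```

With this in place, the handler's guard becomes `const relevant = selectRelevant(matches.matches, "cosine", MIN_RELEVANCE_SCORE)`, and switching the index to a different metric only means changing one argument and re-tuning the threshold.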

When to reach for Workflows instead

The single-Worker pipeline above is the right shape as long as it's a straight line: embed → retrieve → generate → respond, one shot, and if any step fails you're comfortable just failing the request. Move to a Workflow once the agent needs any of the following:

Multi-step tool calling. An agent that calls the LLM, gets back a tool-use request, executes the tool (a fetch to an external API, a D1 query), feeds the result back to the LLM, and repeats — potentially several rounds. Each round is a natural step.do(), and a failure in round 3 shouldn't force re-running (and re-billing) rounds 1 and 2.
Long-running work. Anything that risks exceeding a single request's practical duration — document ingestion across hundreds of pages, multi-document synthesis, a research-style agent that fans out to several sources — benefits from Workflows' ability to run for minutes or longer without holding an open HTTP connection.
Independent retry policies per step. The embedding call, the Vectorize query, and the LLM call have different failure characteristics and different costs to retry. A Workflow lets you give the flaky external tool call five retries with backoff while giving the cheap Vectorize query a tight, fast-fail policy.
Durability across restarts. If the process gets interrupted after three of five tool calls have already run (and cost money), you want those three cached, not re-executed. That's exactly the checkpointing Workflows provides — see the Workflows lesson for the step-as-cache-key mechanics.
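The multi-round tool-calling case from the list above can be sketched as a bounded loop in which every LLM call and every tool execution is its own named step. All names here are hypothetical: ModelTurn is a simplified stand-in for a model's tool-call output (real formats differ by model), and RunStep is a stand-in for a Workflow's step.do() so the loop can be exercised without the Workflows runtime.

```typescript
// Simplified model output: either a final answer or a request to run a tool.
type ModelTurn =
  | { kind: "answer"; text: string }
  | { kind: "tool"; name: string; args: unknown };

// Stand-in for step.do(name, callback): runs (or replays) one named step.
type RunStep = <T>(name: string, fn: () => Promise<T>) => Promise<T>;

async function runToolLoop(
  runStep: RunStep,
  callModel: (history: string[]) => Promise<ModelTurn>,
  tools: Record<string, (args: unknown) => Promise<string>>,
  question: string,
  maxRounds = 5,
): Promise<string> {
  const history = [`user: ${question}`];
  for (let round = 1; round <= maxRounds; round++) {
    // One step per LLM call: a later failure does not re-run this round.
    const turn = await runStep(`round-${round}-llm`, () => callModel(history));
    if (turn.kind === "answer") return turn.text;

    const tool = tools[turn.name];
    if (!tool) throw new Error(`unknown tool: ${turn.name}`);

    // One step per tool execution, named after the round and the tool.
    const result = await runStep(`round-${round}-tool-${turn.name}`, () =>
      tool(turn.args),
    );
    history.push(`tool ${turn.name}: ${result}`);
  }
  throw new Error(`no final answer after ${maxRounds} rounds`);
}
```

Inside a Workflow, runStep would be something like `(name, fn) => step.do(name, fn)`. The step names are derived only from the round number and tool name, so a replay after an interruption asks for the same steps in the same order and can reuse their stored results. The maxRounds cap is there so a model that never produces a final answer cannot loop indefinitely.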

The pattern in practice: a Workflow's run() method calls env.AI and env.VECTORIZE from inside its step.do() callbacks, exactly as shown above — Workflows doesn't replace this pipeline, it wraps it with checkpointing and retry semantics once a single fetch handler isn't durable enough.
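As a rough illustration of that wrapping, the same embed, retrieve, generate sequence can be expressed as three steps with different retry policies. This is a sketch, not the Workflows API itself: StepLike and StepConfig are hypothetical stand-ins modelled on the step.do(name, config, callback) form, RagDeps is a hypothetical bundle of the three binding calls, and the specific limits, delays, and timeouts are placeholder values to tune.

```typescript
// Stand-ins for the step object a Workflow's run() receives.
interface StepConfig {
  retries?: {
    limit: number;
    delay: string | number;
    backoff?: "constant" | "linear" | "exponential";
  };
  timeout?: string | number;
}
interface StepLike {
  do<T>(name: string, config: StepConfig, fn: () => Promise<T>): Promise<T>;
}

// The three pipeline stages, injected so the sketch is self-contained.
interface RagDeps {
  embed(query: string): Promise<number[]>;
  retrieve(vector: number[]): Promise<string[]>;
  generate(query: string, context: string[]): Promise<string>;
}

async function ragAsSteps(
  step: StepLike,
  deps: RagDeps,
  query: string,
): Promise<string> {
  const vector = await step.do(
    "embed-query",
    { retries: { limit: 3, delay: "1 second", backoff: "exponential" } },
    () => deps.embed(query),
  );

  // Cheap and fast: fail quickly instead of retrying for long.
  const chunks = await step.do(
    "retrieve-context",
    { retries: { limit: 1, delay: "1 second" }, timeout: "10 seconds" },
    () => deps.retrieve(vector),
  );
  if (chunks.length === 0) {
    return "I don't have enough grounded information to answer that confidently.";
  }

  // Slowest and most failure-prone stage: more retries, longer timeout.
  return step.do(
    "generate-answer",
    { retries: { limit: 5, delay: "5 seconds", backoff: "exponential" }, timeout: "2 minutes" },
    () => deps.generate(query, chunks),
  );
}
```

In a real Workflow, the run() method would pass its own step argument and build deps from env.AI and env.VECTORIZE. Because each stage is a separate step, a failure during generation retries only the LLM call and reuses the stored embedding and retrieval results.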

Pricing: three meters, one request

A single RAG request touches three separate billing dimensions. As of this writing:

Product	Free tier	Paid (Workers Paid plan)
Workers AI (Neurons)	10,000 Neurons/day	$0.011 per 1,000 Neurons beyond the daily free allowance; per-model rates vary (a small model like Llama 3.2 1B costs far less per token than a 70B model)
Vectorize	30M queried + 5M stored vector dimensions/month	50M queried dimensions/month included, then $0.01/million; 10M stored dimensions included, then $0.05/100M
AI Gateway	Free — analytics, caching, rate limiting	Free at the core; persistent logs are capped (100K on Free, 10M/gateway on Paid) and Logpush is a Paid-only add-on at $0.05/million

AI Gateway itself doesn't add cost to the underlying inference call: it passes provider and Workers AI pricing through without markup (aside from the 5% fee on Cloudflare's unified billing/credits feature, if you use it). Its main effect on cost is a saving: a cache hit on an identical prompt avoids paying for inference at all. Confirm current numbers on the pricing pages linked below before quoting them; Neuron pricing per model in particular changes as Cloudflare adds models.
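To make the arithmetic behind the table concrete, here is a small estimator that uses only the paid-plan figures quoted above. The function and constant names are hypothetical, the rates are a snapshot that will drift, and the inputs (Neurons used, dimensions queried and stored) are assumed to come from your own usage dashboards.

```typescript
// Rates copied from the table above; verify against current pricing pages.
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1K_NEURONS = 0.011;
const INCLUDED_QUERIED_DIMS = 50_000_000;
const USD_PER_1M_QUERIED_DIMS = 0.01;
const INCLUDED_STORED_DIMS = 10_000_000;
const USD_PER_100M_STORED_DIMS = 0.05;

// Workers AI: only Neurons beyond the daily allowance are billed.
function workersAiDailyOverageUsd(neuronsUsedToday: number): number {
  const billable = Math.max(0, neuronsUsedToday - FREE_NEURONS_PER_DAY);
  return (billable / 1_000) * USD_PER_1K_NEURONS;
}

// Vectorize: queried and stored dimensions are metered separately,
// each with its own included monthly amount.
function vectorizeMonthlyOverageUsd(
  queriedDims: number,
  storedDims: number,
): number {
  const queried =
    (Math.max(0, queriedDims - INCLUDED_QUERIED_DIMS) / 1_000_000) *
    USD_PER_1M_QUERIED_DIMS;
  const stored =
    (Math.max(0, storedDims - INCLUDED_STORED_DIMS) / 100_000_000) *
    USD_PER_100M_STORED_DIMS;
  return queried + stored;
}
```

For example, a day that consumes 15,000 Neurons bills the 5,000 beyond the allowance, and a month with 150M queried and 110M stored dimensions bills 100M queried and 100M stored. Every AI Gateway cache hit removes that request's Neurons from the first calculation entirely.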

Use cases

Internal documentation Q&A — the exact pipeline above, over a Vectorize index of your product docs or runbooks, guarded by the relevance threshold described in the pitfall above.
Customer support triage agent — embed the incoming ticket, retrieve similar past tickets/KB articles, generate a suggested response or classification, with AI Gateway caching repeat questions.
Multi-tool research agent — LLM decides which of several tools (web search, internal API, database lookup) to call next; graduates to a Workflow once tool-calling rounds need durable retries.
Semantic code/content search with generated summaries — Vectorize finds the relevant files or sections, the LLM call (through AI Gateway for caching) turns raw matches into a synthesized answer instead of a raw result list.

Primary source

AI Gateway — Workers AI binding integration documents the exact gateway option shown in this lesson's code. Pair it with the Vectorize getting started guide for index creation/query syntax and the Workers AI pricing page for current Neuron rates, since per-model pricing changes as new models ship.

Your RAG Worker calls env.VECTORIZE.query() and gets back topK=5 matches, but the closest one has a similarity score far below anything meaningful for your index. What's the correct handling?

Without scrolling up: which Cloudflare binding does an LLM call through AI Gateway actually use, and what turns a plain call into a gateway-routed one?


It's the same env.AI binding used for any Workers AI call (including the embedding step). What routes it through AI Gateway is passing a third argument to env.AI.run() — a { gateway: { id: "..." } } options object — not a separate binding or client.

Anything above unclear — the embed/retrieve/generate sequence, the relevance-threshold pitfall, or where the line sits between "one Worker" and "needs a Workflow" — ask your AI teacher before moving on.
