
Put a control plane in front of your LLM calls

After this lesson you'll be able to repoint an existing Anthropic (or OpenAI, or Workers AI) SDK call through an AI Gateway endpoint and get logging, caching, rate limiting, and cost tracking for free, with no change to how you call the model.

AI Gateway is a reverse proxy that sits between your code and whatever LLM provider you're calling. You don't rewrite your integration — you change one string, the baseURL, so requests that used to go straight to api.anthropic.com now go through gateway.ai.cloudflare.com first, then on to Anthropic. The gateway forwards the request and streams back the response, but along the way it logs the request/response, can serve a cached response instead of calling the provider at all, can enforce a rate limit, and can retry or fall back to a different provider if the first one errors or times out. It's the same pattern as a database connection pool or an API gateway in front of microservices, applied to the specific pain points of calling LLMs: unpredictable cost, provider outages, and zero built-in observability.

How it works

Every AI Gateway you create gets an account- and gateway-scoped URL prefix:

https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}

{provider} is a slug like anthropic, openai, workers-ai, google-ai-studio, or azure-openai. The gateway preserves each provider's own request/response schema — it's not translating Anthropic's API into OpenAI's — so your existing SDK code, types, and error handling keep working. The only change is the base URL and, optionally, an extra header if the gateway has authenticated mode turned on.
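As a minimal sketch of the URL scheme above, the base URL can be assembled from its three parts. The helper name gatewayBaseURL, the ProviderSlug type, and the placeholder IDs are assumptions made for illustration; only the URL pattern and the slugs come from this lesson.

```typescript
// Provider slugs named in this lesson; treat the list as illustrative.
type ProviderSlug =
  | "anthropic"
  | "openai"
  | "workers-ai"
  | "google-ai-studio"
  | "azure-openai";

// Hypothetical helper: assembles the account- and gateway-scoped base URL.
function gatewayBaseURL(
  accountId: string,
  gatewayId: string,
  provider: ProviderSlug,
): string {
  // encodeURIComponent guards against stray slashes or spaces in the IDs.
  const account = encodeURIComponent(accountId);
  const gateway = encodeURIComponent(gatewayId);
  return `https://gateway.ai.cloudflare.com/v1/${account}/${gateway}/${provider}`;
}

// Placeholder IDs, not real account values.
const exampleBaseURL = gatewayBaseURL("abc123", "my-gateway", "anthropic");
console.log(exampleBaseURL);
// https://gateway.ai.cloudflare.com/v1/abc123/my-gateway/anthropic
```

The returned string is what you would pass as the SDK's baseURL option.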

Request flow: your code calls the gateway URL with your normal provider API key → Cloudflare's edge receives it → checks the cache (if caching is on and the request matches a cached entry, it returns that entry immediately without touching the provider) → otherwise forwards the request to the real provider with your API key attached → logs the request metadata, token counts, and cost → streams the response back to you, storing it in the cache for next time if caching applies.
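The flow above can be pictured with a toy in-memory model. This is only a sketch to make the ordering concrete: the ToyGateway class, the fake provider, and the use of the raw body as the cache key are all hypothetical simplifications, not how Cloudflare implements the gateway.

```typescript
// Toy model of the request flow. Everything here is hypothetical and
// in-memory; a real gateway runs at Cloudflare's edge and calls a real provider.
type FakeProvider = (body: string) => string;

interface ToyLogEntry {
  body: string;
  cached: boolean;
}

class ToyGateway {
  private cache = new Map<string, string>();
  private provider: FakeProvider;
  private cachingOn: boolean;
  readonly logs: ToyLogEntry[] = [];

  constructor(provider: FakeProvider, cachingOn: boolean) {
    this.provider = provider;
    this.cachingOn = cachingOn;
  }

  handle(body: string): string {
    // 1. Check the cache first; a hit never reaches the provider.
    if (this.cachingOn) {
      const hit = this.cache.get(body);
      if (hit !== undefined) {
        this.logs.push({ body, cached: true });
        return hit;
      }
    }
    // 2. Cache miss (or caching off): forward to the provider.
    const response = this.provider(body);
    // 3. Log the request, then store the response if caching applies.
    this.logs.push({ body, cached: false });
    if (this.cachingOn) {
      this.cache.set(body, response);
    }
    // 4. Return the response to the caller.
    return response;
  }
}

// Count provider calls so the effect of a cache hit is visible.
let providerCalls = 0;
const fakeProvider: FakeProvider = (body) => {
  providerCalls += 1;
  return `echo:${body}`;
};

const toyGateway = new ToyGateway(fakeProvider, true);
const firstReply = toyGateway.handle("hello");
const secondReply = toyGateway.handle("hello");
console.log(firstReply === secondReply, providerCalls); // true 1
```

The second identical request is answered from the cache, so the provider is called only once while both requests still appear in the log.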

Two auth layers, don't confuse them. Your provider API key (Anthropic, OpenAI, etc.) still goes in the request exactly where that provider expects it — AI Gateway doesn't replace it or see your provider account, it just relays it. A separate, optional Cloudflare API token (sent as cf-aig-authorization: Bearer {token}) controls who can call your gateway endpoint at all. You can run a gateway unauthenticated (anyone with the URL can use it, provided they supply their own valid provider key) or require the Cloudflare token — worth locking down before you put a gateway URL in client-side code.
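To keep the two credentials straight, here is a rough sketch of the headers a raw Anthropic request through the gateway would carry. The helper name gatewayRequestHeaders and the placeholder key values are assumptions for illustration; x-api-key and anthropic-version are Anthropic's own headers, and cf-aig-authorization is the gateway header named above.

```typescript
// Hypothetical helper showing where each credential travels.
function gatewayRequestHeaders(
  providerKey: string,
  gatewayToken?: string,
): Record<string, string> {
  const result: Record<string, string> = {
    // Layer 1: the provider's own auth, relayed unchanged by the gateway.
    "x-api-key": providerKey,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
  };
  if (gatewayToken !== undefined) {
    // Layer 2: only needed when the gateway has authenticated mode turned on.
    result["cf-aig-authorization"] = `Bearer ${gatewayToken}`;
  }
  return result;
}

// Placeholder credentials, not real keys.
const openGatewayHeaders = gatewayRequestHeaders("sk-ant-placeholder");
const lockedGatewayHeaders = gatewayRequestHeaders("sk-ant-placeholder", "cf-token-placeholder");

console.log("cf-aig-authorization" in openGatewayHeaders); // false
console.log(lockedGatewayHeaders["cf-aig-authorization"]); // Bearer cf-token-placeholder
```

When you use the Anthropic SDK, it sets the provider headers for you, so only the cf-aig-authorization entry needs to be supplied through defaultHeaders.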

What the gateway adds

Analytics & logging — every request's provider, model, token counts, latency, and estimated cost, viewable in the dashboard without instrumenting your own code.
Caching — identical requests can be served from Cloudflare's cache instead of hitting the provider, cutting both latency and spend.
Rate limiting — cap requests per gateway to protect your provider quota or budget from a runaway loop or a traffic spike.
Retries and fallback — via the separate Universal endpoint, define a priority list of providers/models; if one errors or times out, the gateway automatically tries the next.
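For the retries-and-fallback item, the sketch below builds a priority list of the kind the Universal endpoint accepts. It assumes the payload is a JSON array of steps with provider, endpoint, headers, and query fields, posted to the gateway URL without a provider slug; check that shape against the current Cloudflare docs before relying on it. The FallbackStep type, the function names, and the OpenAI model name are illustrative choices, not part of this lesson.

```typescript
// Assumed shape of one step in a Universal endpoint request.
interface FallbackStep {
  provider: string;
  endpoint: string;
  headers: Record<string, string>;
  query: Record<string, unknown>;
}

// Hypothetical helper: the Universal endpoint URL has no {provider} segment.
function universalEndpointURL(accountId: string, gatewayId: string): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}`;
}

// Hypothetical helper: OpenAI first, Anthropic as the fallback.
function buildFallbackChain(
  openaiKey: string,
  anthropicKey: string,
  prompt: string,
): FallbackStep[] {
  const messages = [{ role: "user", content: prompt }];
  return [
    {
      provider: "openai",
      endpoint: "chat/completions",
      headers: {
        Authorization: `Bearer ${openaiKey}`,
        "Content-Type": "application/json",
      },
      // Each step keeps its provider's own request schema.
      query: { model: "gpt-4o-mini", messages },
    },
    {
      provider: "anthropic",
      endpoint: "v1/messages",
      headers: {
        "x-api-key": anthropicKey,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json",
      },
      query: { model: "claude-sonnet-4-5", max_tokens: 1024, messages },
    },
  ];
}

// Placeholder keys and IDs; the array would be sent as the POST body.
const fallbackChain = buildFallbackChain("sk-openai-placeholder", "sk-ant-placeholder", "Hello");
console.log(universalEndpointURL("abc123", "my-gateway"));
console.log(fallbackChain.map((step) => step.provider).join(" -> ")); // openai -> anthropic
```

Order in the array is the priority order: the gateway tries the first step and moves to the next only if that one errors or times out.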

Pricing

AI Gateway's core features — dashboard analytics, caching, and rate limiting — are free on every plan; you only pay the underlying provider (Anthropic, OpenAI, etc.) for the tokens you actually use. Cloudflare's own charges are limited to a few adjacent pieces:

Item	Free plan	Workers Paid plan
Persistent request logs	100,000 logs across all gateways	10,000,000 logs per gateway
Logpush export of gateway logs	Not available	$0.05 per million requests beyond the included 10M/month
Guardrails (content moderation)	Uses Workers AI under the hood	Billed by Workers AI token consumption
Unified Billing (pay providers via Cloudflare)	—	5% fee on credit purchases

Cloudflare notes it may add premium features later. Treat the numbers above as a snapshot — confirm current figures on the live pricing page linked below before budgeting.

Use cases

Cost observability across a team — instead of every developer's Anthropic/OpenAI dashboard showing a blended number, route all calls through one gateway (or one per environment) and see spend broken out by model, endpoint, or time window in one place.
Multi-provider fallback — configure a Universal endpoint that tries OpenAI first and falls back to Anthropic (or a different model) on error or timeout, so a single provider outage doesn't take your feature down.
Cutting LLM spend with caching — for prompts that repeat verbatim (a fixed system prompt plus a common set of user questions, a docs chatbot answering FAQs), caching returns the prior response instantly and for free instead of re-billing the provider every time.
Protecting quota from bugs — a rate limit on the gateway catches a retry loop or bad deploy before it burns through your provider's rate limit or your monthly budget.

Worked example

Create a gateway once in the dashboard (AI > AI Gateway > Create Gateway), note your account ID and the gateway name you chose, then repoint the Anthropic SDK's baseURL:

// Before: calling Anthropic directly
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// After: same call, routed through AI Gateway
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY, // still your real Anthropic key
  baseURL: `https://gateway.ai.cloudflare.com/v1/${process.env.CF_ACCOUNT_ID}/my-gateway/anthropic`,
  defaultHeaders: {
    // Only needed if the gateway requires Cloudflare auth
    "cf-aig-authorization": `Bearer ${process.env.CF_AIG_TOKEN}`,
  },
});

const message = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize this changelog in two sentences." }],
});

console.log(message.content);

No change to messages.create(), its response shape, or your error handling. Open the AI Gateway dashboard and this call now shows up as a logged request with its model, token counts, latency, and cost. To bypass the cache for a request you know must be fresh, add a header per-request:

const message = await anthropic.messages.create(
  {
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: "What time is it in Tokyo right now?" }],
  },
  { headers: { "cf-aig-skip-cache": "true" } }
);

Pitfall: the cache key includes more than the prompt text. AI Gateway builds its cache key by hashing the provider, endpoint, model, your provider auth header, and the full request body together. That means two users with different API keys never collide on cache. It also means the cache is exact-match: change the system prompt, the temperature, or even whitespace in the message body, and it's a cache miss, not a stale hit.

The actual staleness trap is the opposite of what you'd expect. If you enable a default cache_ttl gateway-wide (say, one day) for a prompt pattern that legitimately needs a fresh answer every call (a "what's the latest status" query, anything time-sensitive, anything that should reflect data that changed since the last call), you'll keep getting the first response back for the TTL window. Set cf-aig-cache-ttl per-request (or send cf-aig-skip-cache to bypass the cache entirely) for any prompt whose answer can legitimately change between identical-looking requests; don't rely on the gateway-wide default being right for every route through it.
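The exact-match behavior of the cache key can be illustrated with a toy function. The five inputs come from this lesson; the hash function, the way the parts are joined, and every name below are assumptions made for the sketch, not Cloudflare's actual implementation.

```typescript
// Illustration only: a stand-in for the gateway's cache-key construction.
interface CacheKeyParts {
  provider: string;
  endpoint: string;
  model: string;
  authHeader: string;
  body: string; // the full request body, byte for byte
}

// Simple 32-bit string hash (h = 31 * h + c), used purely as a placeholder.
function toyHash(input: string): string {
  let h = 0;
  for (let i = 0; i < input.length; i++) {
    h = ((h << 5) - h + input.charCodeAt(i)) | 0;
  }
  return (h >>> 0).toString(16);
}

// Hypothetical key function: all five parts feed one hash.
function toyCacheKey(parts: CacheKeyParts): string {
  return toyHash(
    [parts.provider, parts.endpoint, parts.model, parts.authHeader, parts.body].join("\n"),
  );
}

const userA: CacheKeyParts = {
  provider: "anthropic",
  endpoint: "v1/messages",
  model: "claude-sonnet-4-5",
  authHeader: "sk-user-a", // placeholder, not a real key
  body: JSON.stringify({ temperature: 0.2, prompt: "What's the latest status?" }),
};

const keyNow = toyCacheKey(userA);
const keyLater = toyCacheKey({ ...userA }); // same request sent again later
const keyOtherUser = toyCacheKey({ ...userA, authHeader: "sk-user-b" });
const keyOtherTemp = toyCacheKey({
  ...userA,
  body: JSON.stringify({ temperature: 0.3, prompt: "What's the latest status?" }),
});

console.log(keyNow === keyLater); // true: identical request, same cache entry
console.log(keyNow === keyOtherUser); // false: different credential
console.log(keyNow === keyOtherTemp); // false: body changed
```

Nothing in the key depends on when the request is sent, which is why an identical time-sensitive request keeps hitting the same entry until the TTL expires.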

Primary source

Cloudflare Docs — AI Gateway for the overview and supported providers; the caching configuration page for cache-key construction and the cf-aig-* headers; the pricing page for current costs and limits.

You turn on a gateway-wide default cache TTL of 24 hours. A route that asks "what's the current inventory count for SKU-4471?" starts returning the same number all day even though inventory changed. What's the actual cause?

Without scrolling up: what five things get hashed together to form an AI Gateway cache key, and why does that mean two different users' identical prompts don't share a cache entry?

Answer

The cache key is a hash of the provider, the endpoint, the model, the provider auth header (your API key/bearer token), and the full request body. Because the auth header is part of the key, two users sending the byte-identical prompt but authenticating with different API keys produce different cache keys — so caching is scoped per credential, not globally shared across everyone hitting the gateway.

Anything above unclear — how the Universal endpoint's fallback array is structured, what Guardrails actually screens for, or how logs interact with Logpush — ask your AI teacher before moving on.
