After this lesson you'll be able to explain why multi-party WebRTC needs an SFU, set up a two-party call using Cloudflare Realtime's sessions/tracks API, and recognize when Realtime is overkill for what you're actually building.
Cloudflare Realtime (the product formerly called Calls) is WebRTC media infrastructure: it gives you a globally-distributed SFU (Selective Forwarding Unit) so you can build live audio/video features — video chat, telehealth visits, live audio rooms — without running your own media servers. You still write the client-side WebRTC code (getting camera/mic access, building the peer connection); Realtime is the thing your browser's WebRTC connection talks to on the other end, and it takes care of routing media between participants at Cloudflare's network edge instead of you provisioning and scaling media servers yourself.
Plain WebRTC is peer-to-peer: two browsers negotiate a direct connection and exchange media. That works well for a 1:1 call, but it scales poorly as participants are added. In a naive peer-to-peer mesh, every participant has to open a direct connection to every other participant and upload their own audio/video once per other participant, so a 6-person call means each browser is uploading 5 copies of its own stream simultaneously. Upload bandwidth on consumer connections can't keep up, and the CPU cost of encoding the same stream several times per participant adds up fast.
An SFU fixes this by sitting in the middle: each participant uploads their media once, to the SFU, and the SFU forwards (selectively — hence the name) each stream to whichever other participants need it. Upload cost per participant stays constant regardless of call size; the SFU absorbs the fan-out. This is the standard architecture behind Zoom, Google Meet, and most production video products — Realtime is Cloudflare operating that SFU tier for you, distributed across its edge network so participants connect to a nearby location instead of one central server.
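The upload arithmetic behind mesh versus SFU can be sketched in a few lines. The function names here are illustrative only (they are not part of WebRTC or any Cloudflare API), and the sketch counts streams rather than bytes, ignoring bitrate differences and simulcast.

```javascript
// Hypothetical helpers for counting upload streams; not part of any WebRTC API.

// Full mesh: each participant sends one copy of its stream to every other participant.
function meshUploadsPerParticipant(participants) {
  return Math.max(participants - 1, 0);
}

// SFU: each participant uploads a single copy, and the SFU handles the fan-out.
function sfuUploadsPerParticipant(participants) {
  return participants > 0 ? 1 : 0;
}

for (const n of [2, 4, 6]) {
  console.log(
    `${n} participants: mesh=${meshUploadsPerParticipant(n)} uploads each, ` +
      `sfu=${sfuUploadsPerParticipant(n)} upload each`,
  );
}
```

Each participant still downloads the other participants' streams in both designs; only the upload side changes.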
The basic flow for any Realtime SFU app is the same regardless of participant count:
1. Each participant's browser creates an RTCPeerConnection, and your backend calls the Realtime API to create a corresponding session on Cloudflare's side, exchanging SDP (Session Description Protocol) offers/answers to establish that leg of the connection.
2. Local tracks (for example, the camera and microphone tracks from getUserMedia) are pushed into that session. Cloudflare returns a track ID for each pushed track; your job is to store and distribute that ID (e.g. broadcast it to other participants via your own signaling channel, commonly a Durable Object over WebSockets).

Signaling (who tells whom about which track IDs) is intentionally not Cloudflare's job. You route that over your own channel, which is why Realtime pairs naturally with a Durable Object per call: it holds the WebSocket connections to each participant and relays "here's a new track ID" messages between them.
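As a rough illustration of that relay role, here is a minimal in-memory sketch. Everything in it is hypothetical: the `CallSignaling` class, its method names, and the message shapes are invented for this example. In a Worker, this state would live inside a Durable Object and each `send` callback would wrap a WebSocket's `send()`.

```javascript
// Hypothetical in-memory relay; names and message shapes are assumptions.
class CallSignaling {
  constructor() {
    this.participants = new Map(); // participantId -> send(message) callback
    this.tracks = []; // announced { participantId, sessionId, trackName }
  }

  // Register a participant and replay every track announced before they joined.
  join(participantId, send) {
    this.participants.set(participantId, send);
    for (const track of this.tracks) {
      if (track.participantId !== participantId) {
        send({ type: "track-added", ...track });
      }
    }
  }

  // Record a newly pushed track and tell everyone except its publisher.
  announceTrack(participantId, sessionId, trackName) {
    const track = { participantId, sessionId, trackName };
    this.tracks.push(track);
    for (const [id, send] of this.participants) {
      if (id !== participantId) send({ type: "track-added", ...track });
    }
  }

  // Drop a participant and their tracks, then notify everyone who remains.
  leave(participantId) {
    this.participants.delete(participantId);
    this.tracks = this.tracks.filter((t) => t.participantId !== participantId);
    for (const send of this.participants.values()) {
      send({ type: "participant-left", participantId });
    }
  }
}

// Usage: Alice publishes before Bob joins, so Bob gets the track replayed on join.
const room = new CallSignaling();
const inboxAlice = [];
const inboxBob = [];
room.join("alice", (message) => inboxAlice.push(message));
room.announceTrack("alice", "session-a", "video");
room.join("bob", (message) => inboxBob.push(message));
console.log(inboxBob);
```

Replaying earlier announcements on join matters because the SFU has no room concept of its own: a late joiner only learns which tracks exist if your signaling layer tells them.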
This sketch shows the shape of a two-party call using the sessions/tracks model. In a real app the SDP exchange and track IDs travel over your own signaling channel (e.g. a WebSocket to a Durable Object); here it's flattened to show the sequence of Realtime API calls a backend makes on behalf of each participant.
```javascript
// `env` holds the Worker's bindings; this sketch assumes it runs where `env` is in scope.
const APP_ID = env.REALTIME_APP_ID;
const APP_TOKEN = env.REALTIME_APP_TOKEN; // App Secret, kept server-side
const BASE = `https://rtc.live.cloudflare.com/v1/apps/${APP_ID}`;

// POST a JSON body to the Realtime API and return the parsed JSON response.
async function callRealtime(path, body) {
  const res = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${APP_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Realtime API ${path} failed: ${res.status}`);
  return res.json();
}

// 1. Create a session for participant A and push their local track.
// `offerSdpFromBrowserA` is the SDP offer generated by A's RTCPeerConnection in the browser.
const sessionA = await callRealtime("/sessions/new", {});
const pushA = await callRealtime(`/sessions/${sessionA.sessionId}/tracks/new`, {
  sessionDescription: { type: "offer", sdp: offerSdpFromBrowserA },
  tracks: [{ location: "local", trackName: "video", mid: "0" }],
});
// pushA.sessionDescription is the SDP answer you hand back to A's browser.
// sessionA.sessionId plus pushA.tracks[0].trackName identifies the track, and
// is what you distribute to other participants.

// 2. Create a session for participant B, then pull A's track into it.
const sessionB = await callRealtime("/sessions/new", {});
const pullB = await callRealtime(`/sessions/${sessionB.sessionId}/tracks/new`, {
  tracks: [
    {
      location: "remote",
      sessionId: sessionA.sessionId,
      trackName: pushA.tracks[0].trackName,
    },
  ],
});
// pullB.sessionDescription is the SDP that B's browser applies via
// setRemoteDescription. If the response sets requiresImmediateRenegotiation,
// that SDP is an offer: B's browser answers it, and your backend sends the
// answer to the session's renegotiate endpoint. After that, B's
// RTCPeerConnection starts receiving A's media, forwarded by the SFU.
```
For a two-way call, repeat both steps in the other direction (push B's track into B's session, then pull it into A's session). Notice that neither participant ever connects to the other directly: both connections terminate at Cloudflare's SFU, which is what lets this same pattern scale to more participants by pulling one track into many sessions instead of rewiring peer connections.
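To make "pulling one track into many sessions" concrete, the sketch below builds one pull request per subscriber for a single published track. `buildPullRequests` is a hypothetical helper and the session IDs are made up; the request body mirrors the shape used in the sketch above, and each `{ path, body }` pair could be handed to a helper like `callRealtime`.

```javascript
// Hypothetical helper: one pull request per subscriber for a single published track.
function buildPullRequests(publisher, subscriberSessionIds) {
  return subscriberSessionIds
    // A publisher never pulls its own track.
    .filter((sessionId) => sessionId !== publisher.sessionId)
    .map((sessionId) => ({
      path: `/sessions/${sessionId}/tracks/new`,
      body: {
        tracks: [
          {
            location: "remote",
            sessionId: publisher.sessionId,
            trackName: publisher.trackName,
          },
        ],
      },
    }));
}

// Usage with made-up session IDs: A publishes, B and C subscribe.
const pullRequests = buildPullRequests(
  { sessionId: "session-a", trackName: "video" },
  ["session-a", "session-b", "session-c"],
);
console.log(pullRequests.map((request) => request.path));
```

The publisher's side does not change as subscribers are added: the track is pushed once, and only the number of pull requests grows.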
As of this writing, Cloudflare Realtime's SFU and TURN services are billed on data egress, not per participant-minute:
| Tier | Rate |
|---|---|
| Free | First 1,000 GB/month of egress (SFU + TURN combined) |
| Paid | $0.05 per GB of egress beyond the free tier |
Only traffic Cloudflare forwards out to clients counts — media pushed into the SFU is free even if nobody ever pulls it. If you use TURN relay alongside the SFU for the same call, that traffic isn't double-billed. Cloudflare also offers RealtimeKit, a separate higher-level product (prebuilt meeting UI, recording, etc.) billed per-minute rather than per-GB — don't confuse the two when reading pricing pages. Confirm current numbers on the live pricing page linked below before quoting them, since pricing can change.
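A back-of-the-envelope estimator shows how egress billing plays out for a given call shape. This is a sketch under several assumptions: every participant pulls every other participant's stream, each stream runs at a constant bitrate, no TURN relay is involved, and a gigabyte is counted as 1,000 megabytes. The function names are hypothetical, and the rates are simply the ones quoted in the table above.

```javascript
const FREE_EGRESS_GB = 1000; // free tier quoted in the table above
const RATE_PER_GB_USD = 0.05; // paid rate quoted in the table above

// Egress for one call, assuming everyone receives everyone else's stream.
function callEgressGb(participants, minutes, streamMbps) {
  const forwardedStreams = participants * (participants - 1);
  const megabits = forwardedStreams * streamMbps * minutes * 60;
  return megabits / 8 / 1000; // megabits -> megabytes -> gigabytes (decimal, assumed)
}

// Monthly bill for a total egress figure, after subtracting the free tier.
function monthlyCostUsd(totalEgressGb) {
  return Math.max(totalEgressGb - FREE_EGRESS_GB, 0) * RATE_PER_GB_USD;
}

// Example: 500 one-hour calls per month, 4 participants each, 1 Mbps per stream.
const perCallGb = callEgressGb(4, 60, 1);
const monthlyGb = perCallGb * 500;
console.log(`per call: ${perCallGb.toFixed(1)} GB, per month: ${monthlyGb.toFixed(0)} GB`);
console.log(`estimated bill: $${monthlyCostUsd(monthlyGb).toFixed(2)}`);
```

Under those assumptions each call forwards about 5.4 GB, so 500 calls come to roughly 2,700 GB and a bill of around $85 after the free tier. Note that egress grows with the square of the participant count when everyone sees everyone.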
Cloudflare Realtime docs cover the SFU's sessions/tracks model; pair with the Realtime pricing page for current egress rates, since pricing is subject to change.
The two primitives are sessions (roughly, one participant's PeerConnection to the SFU) and tracks (individual audio/video/data MediaStreamTracks pushed into or pulled from a session, identified by track ID). Realtime deliberately has no "room" concept — presence, who's in a call, and who can see/hear whom is state you build yourself, typically by distributing track IDs over your own signaling channel (e.g. a Durable Object relaying messages over WebSockets).