Build live audio/video apps with Realtime

After this lesson you'll be able to explain why multi-party WebRTC needs an SFU, set up a two-party call using Cloudflare Realtime's sessions/tracks API, and recognize when Realtime is overkill for what you're actually building.

Cloudflare Realtime (the product formerly called Calls) is WebRTC media infrastructure: it gives you a globally distributed SFU (Selective Forwarding Unit) so you can build live audio/video features such as video chat, telehealth visits, and live audio rooms without running your own media servers. You still write the client-side WebRTC code (getting camera/mic access, building the peer connection). Realtime is what your browser's WebRTC connection talks to on the other end: it routes media between participants at Cloudflare's network edge, so you don't have to provision and scale media servers yourself.

Why you need an SFU

Plain WebRTC is peer-to-peer: two browsers negotiate a direct connection and exchange media. That works for a 1:1 call, but it scales poorly as participants are added. In a naive peer-to-peer mesh, every participant has to open a direct connection to every other participant and upload their own audio/video once per other participant, so in a 6-person call each browser uploads 5 copies of its own stream simultaneously. Consumer upload bandwidth is typically the first thing to run out, and the CPU cost of encoding the stream separately for each peer adds up fast.

An SFU fixes this by sitting in the middle: each participant uploads their media once, to the SFU, and the SFU forwards (selectively — hence the name) each stream to whichever other participants need it. Upload cost per participant stays constant regardless of call size; the SFU absorbs the fan-out. This is the standard architecture behind Zoom, Google Meet, and most production video products — Realtime is Cloudflare operating that SFU tier for you, distributed across its edge network so participants connect to a nearby location instead of one central server.
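To make the mesh-versus-SFU arithmetic concrete, here is a small sketch that counts the streams each browser handles. The function name and the "everyone sends one stream and watches everyone else" assumption are illustrative, not part of any Realtime API.

```javascript
// Count connections and media streams handled by ONE browser in a call.
// Assumes every participant sends one stream and watches everyone else.
function streamsPerParticipant(participants, topology) {
  if (!Number.isInteger(participants) || participants < 2) {
    throw new RangeError("a call needs at least 2 participants");
  }
  const others = participants - 1;
  if (topology === "mesh") {
    // One direct connection per peer, and one upload of your stream on each.
    return { connections: others, uploads: others, downloads: others };
  }
  if (topology === "sfu") {
    // A single connection to the SFU: upload once, download one stream per peer.
    return { connections: 1, uploads: 1, downloads: others };
  }
  throw new Error(`unknown topology: ${topology}`);
}

for (const n of [2, 6, 12]) {
  const mesh = streamsPerParticipant(n, "mesh");
  const sfu = streamsPerParticipant(n, "sfu");
  console.log(`${n} people: mesh uploads=${mesh.uploads}, sfu uploads=${sfu.uploads}`);
}
```

Downloads are the same in both topologies; what the SFU removes is the per-peer upload and the per-peer connection.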

Sessions and tracks, not "rooms." Realtime's SFU API deliberately doesn't give you a "room" abstraction. Instead it exposes two primitives: a session (roughly, one participant's WebRTC PeerConnection to the SFU) and a track (one audio/video/data MediaStreamTrack flowing through that session). You push tracks into a session and pull tracks from other sessions by track ID. Any concept of a "room" — who's in a call, who can see whom — is state you build yourself (e.g. in a Durable Object), using track IDs as the thing you distribute between participants. This is more work upfront than a "join room" SDK call, but it means Realtime doesn't constrain your app's model of presence, permissions, or call topology.
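As a rough illustration of the "room" state you would own, the sketch below tracks which session has published which tracks and answers "what should this participant pull?". `RoomState` and its methods are hypothetical names invented for this example; in practice this bookkeeping might live inside a Durable Object.

```javascript
// Hypothetical app-level room state. Realtime has no such object; this is
// bookkeeping your own backend would hold.
class RoomState {
  constructor() {
    // sessionId -> array of track names that session has pushed
    this.published = new Map();
  }

  join(sessionId) {
    if (!this.published.has(sessionId)) this.published.set(sessionId, []);
  }

  leave(sessionId) {
    this.published.delete(sessionId);
  }

  publish(sessionId, trackName) {
    this.join(sessionId);
    const names = this.published.get(sessionId);
    if (!names.includes(trackName)) names.push(trackName);
  }

  // Every track a participant should pull: all tracks pushed by everyone else.
  tracksToPull(sessionId) {
    const result = [];
    for (const [ownerId, names] of this.published) {
      if (ownerId === sessionId) continue;
      for (const trackName of names) result.push({ sessionId: ownerId, trackName });
    }
    return result;
  }
}

const room = new RoomState();
room.publish("session-a", "video");
room.publish("session-a", "audio");
room.join("session-b");
console.log(room.tracksToPull("session-b"));
```

Because the policy lives in `tracksToPull`, rules such as "listeners only pull audio" or "only moderators are visible" become ordinary filters in your own code.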

How it works

The basic flow for any Realtime SFU app is the same regardless of participant count:

1. Create a Realtime App in the dashboard (or via API) to get an App ID and App Secret. These authenticate your backend's calls to the Realtime API.
2. Create a session per participant. Each participant's browser builds a local RTCPeerConnection, and your backend calls the Realtime API to create a corresponding session on Cloudflare's side, exchanging SDP (Session Description Protocol) offers/answers to establish that leg of the connection.
3. Push local tracks. Once a participant's session exists, their local audio/video tracks (from getUserMedia) are pushed into that session. Cloudflare confirms each pushed track in its response; together with the session ID, the track's name is the track ID you store and distribute (e.g. by broadcasting it to other participants via your own signaling channel, commonly a Durable Object over WebSockets).
4. Pull remote tracks. To let participant B see participant A, you tell B's session to pull A's track by ID. The SFU then forwards that media into B's connection. Repeat pairwise (or fan out to many) and you have a multi-party call: the SFU does the forwarding work, not the browsers.
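The push step can be sketched from the browser's side as a helper that turns a peer connection's transceivers into the request body used later in this lesson. `buildPushRequest` is a hypothetical helper, and using the MediaStreamTrack's `id` as the track name is an assumption made for uniqueness, not a requirement stated here.

```javascript
// Build the body for a "push local tracks" request from transceiver-like
// objects (the shape returned by RTCPeerConnection.getTransceivers()).
function buildPushRequest(offerSdp, transceivers) {
  const tracks = transceivers
    // Only transceivers that have a negotiated mid and are sending a track.
    .filter((t) => t.mid !== null && t.sender && t.sender.track)
    .map((t) => ({
      location: "local",
      mid: t.mid,
      trackName: t.sender.track.id, // assumption: track id doubles as name
    }));
  return { sessionDescription: { type: "offer", sdp: offerSdp }, tracks };
}

// In the browser (not run here), the inputs would come from:
//   const pc = new RTCPeerConnection();
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
//   stream.getTracks().forEach((track) => pc.addTransceiver(track, { direction: "sendonly" }));
//   await pc.setLocalDescription(await pc.createOffer());
//   const body = buildPushRequest(pc.localDescription.sdp, pc.getTransceivers());

// Stand-in objects so the helper can be exercised outside a browser.
const fakeTransceivers = [
  { mid: "0", sender: { track: { id: "cam-1", kind: "video" } } },
  { mid: "1", sender: { track: { id: "mic-1", kind: "audio" } } },
  { mid: null, sender: { track: null } }, // unused transceiver, skipped
];
console.log(JSON.stringify(buildPushRequest("v=0...", fakeTransceivers), null, 2));
```

The browser sends this body to your backend, which forwards it to the Realtime API with the app credentials; the credentials themselves never reach the client.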

Signaling (who tells whom about which track IDs) is intentionally not Cloudflare's job — you route that over your own channel, which is why Realtime pairs naturally with a Durable Object per call: it holds the WebSocket connections to each participant and relays "here's a new track ID" messages between them.
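A minimal sketch of that relay, independent of any particular runtime: `connections` maps a participant ID to anything with a `send(string)` method, which is the shape of a WebSocket. The function name and the message format are assumptions made up for this example.

```javascript
// Tell every participant except the publisher that a new track exists.
// Returns how many participants were notified.
function announceTrack(connections, fromId, track) {
  const message = JSON.stringify({ type: "track-published", from: fromId, track });
  let delivered = 0;
  for (const [participantId, socket] of connections) {
    if (participantId === fromId) continue; // don't echo to the publisher
    socket.send(message);
    delivered += 1;
  }
  return delivered;
}

// Stand-in sockets that record what they were sent.
function fakeSocket() {
  const sent = [];
  return { sent, send: (text) => sent.push(text) };
}

const connections = new Map([
  ["alice", fakeSocket()],
  ["bob", fakeSocket()],
  ["carol", fakeSocket()],
]);
const notified = announceTrack(connections, "alice", {
  sessionId: "session-a",
  trackName: "video",
});
console.log(`notified ${notified} participants`);
```

On receipt, each client would ask your backend to pull the announced track into its own session.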

Worked example: minimal two-party call

This sketch shows the shape of a two-party call using the sessions/tracks model. In a real app the SDP exchange and track IDs travel over your own signaling channel (e.g. a WebSocket to a Durable Object); here it's flattened to show the sequence of Realtime API calls a backend makes on behalf of each participant.

const APP_ID = env.REALTIME_APP_ID;
const APP_TOKEN = env.REALTIME_APP_TOKEN; // App Secret, kept server-side
const BASE = `https://rtc.live.cloudflare.com/v1/apps/${APP_ID}`;

async function callRealtime(path, body) {
  const res = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${APP_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Realtime API ${path} failed: ${res.status}`);
  return res.json();
}

// 1. Create a session for participant A and push their local track.
//    `offerSdpFromBrowserA` is the SDP offer generated by A's
//    RTCPeerConnection in the browser and sent to your backend.
const sessionA = await callRealtime("/sessions/new", {});
const pushA = await callRealtime(`/sessions/${sessionA.sessionId}/tracks/new`, {
  sessionDescription: { type: "offer", sdp: offerSdpFromBrowserA },
  tracks: [{ location: "local", trackName: "video", mid: "0" }],
});
// pushA.sessionDescription is the SDP answer you hand back to A's browser.
// sessionA.sessionId + pushA.tracks[0].trackName identifies the track and
// is what you distribute to other participants.

// 2. Create a session for participant B, then pull A's track into it.
const sessionB = await callRealtime("/sessions/new", {});
const pullB = await callRealtime(`/sessions/${sessionB.sessionId}/tracks/new`, {
  tracks: [
    {
      location: "remote",
      sessionId: sessionA.sessionId,
      trackName: pushA.tracks[0].trackName,
    },
  ],
});
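// pullB.sessionDescription is SDP for B's browser to apply via
// setRemoteDescription. When the SFU generates it (as here, where the
// request carried no offer) it is an offer, flagged by
// pullB.requiresImmediateRenegotiation: B's browser creates an answer and
// your backend sends it back (PUT /sessions/{id}/renegotiate). After that,
// B's RTCPeerConnection starts receiving A's media, forwarded by the SFU.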

For a two-way call, repeat both steps in the other direction (push B's track into B's session, then pull it into A's session). Notice that neither participant ever connects to the other directly: both connections terminate at Cloudflare's SFU, which is what lets this same pattern scale to more participants by pulling one track into many sessions instead of rewiring peer connections.
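The "pull one track into many sessions" idea can be sketched as a helper that produces one pull request per subscriber, reusing the request shape from the example above. `buildPullRequests` is a hypothetical name; the paths and bodies simply mirror the calls already shown.

```javascript
// Given one published track and a list of subscriber session ids, build
// the { path, body } pairs a backend would send to fan the track out.
function buildPullRequests(publisher, subscriberSessionIds) {
  return subscriberSessionIds
    .filter((id) => id !== publisher.sessionId) // never pull your own track
    .map((id) => ({
      path: `/sessions/${id}/tracks/new`,
      body: {
        tracks: [
          {
            location: "remote",
            sessionId: publisher.sessionId,
            trackName: publisher.trackName,
          },
        ],
      },
    }));
}

const requests = buildPullRequests(
  { sessionId: "session-a", trackName: "video" },
  ["session-a", "session-b", "session-c"],
);
for (const { path } of requests) console.log(path);
// With the helper from the example above, each one would be sent as:
//   await callRealtime(path, body);
```

The publisher's upload is untouched by any of this; adding a viewer is one more request against the viewer's own session.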

Pricing

As of this writing, Cloudflare Realtime's SFU and TURN services are billed on data egress, not per participant-minute:

Tier	Rate
Free	First 1,000 GB/month of egress (SFU + TURN combined)
Paid	$0.05 per GB of egress beyond the free tier

Only traffic Cloudflare forwards out to clients counts — media pushed into the SFU is free even if nobody ever pulls it. If you use TURN relay alongside the SFU for the same call, that traffic isn't double-billed. Cloudflare also offers RealtimeKit, a separate higher-level product (prebuilt meeting UI, recording, etc.) billed per-minute rather than per-GB — don't confuse the two when reading pricing pages. Confirm current numbers on the live pricing page linked below before quoting them, since pricing can change.
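For a back-of-the-envelope estimate, the sketch below applies the rates quoted above to a call where every participant pulls every other participant's stream. The bitrate you pass in, the "everyone pulls everyone" topology, and treating 1 GB as 10^9 bytes are all assumptions for illustration; real usage depends on codecs, resolution, and how many tracks each session actually pulls.

```javascript
// Rates as quoted in this lesson; confirm against the live pricing page.
const FREE_EGRESS_GB = 1000;
const PRICE_PER_GB_USD = 0.05;

// Egress for one call where each participant pulls one stream from each
// other participant. Assumes 1 GB = 1e9 bytes and a constant bitrate.
function estimateEgressGB(participants, bitrateKbps, minutes) {
  const forwardedStreams = participants * (participants - 1);
  const bytesPerStream = ((bitrateKbps * 1000) / 8) * (minutes * 60);
  return (forwardedStreams * bytesPerStream) / 1e9;
}

// Monthly bill for a given total egress, after the free allowance.
function estimateMonthlyCostUSD(egressGB) {
  const billableGB = Math.max(0, egressGB - FREE_EGRESS_GB);
  return billableGB * PRICE_PER_GB_USD;
}

const perCallGB = estimateEgressGB(6, 1000, 60); // 6 people, ~1 Mbps, 1 hour
const monthlyGB = perCallGB * 200; // e.g. 200 such calls a month
console.log(`per call: ${perCallGB.toFixed(2)} GB`);
console.log(`monthly cost: $${estimateMonthlyCostUSD(monthlyGB).toFixed(2)}`);
```

Note that egress grows with the number of forwarded streams, roughly quadratically in a call where everyone watches everyone, so pulling fewer or lower-bitrate tracks is the main cost lever.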

Use cases

Video chat inside an app — a "call a support agent" or "video chat with a match" feature bolted onto a product where video isn't the whole product.
Telehealth — 1:1 or small-group video visits where you need low-latency audio/video without building or hosting your own SFU.
Live audio rooms — Clubhouse/Twitter-Spaces-style many-listener audio, using the SFU's track fan-out (push once, pull into many sessions) rather than a broadcast-specific product.
Custom multi-party conferencing — when you need control over call topology (who can hear/see whom) that an off-the-shelf "rooms" SDK doesn't give you.

Pitfall: reaching for Realtime for simple 1:1 data or file transfer. Realtime solves real-time audio/video media routing — SDP negotiation, jitter buffers, codec handling, network traversal. If what you actually need is "send messages or files between two connected clients" with no camera/microphone involved, standing up a WebRTC session (sessions, tracks, SDP offer/answer, ICE) is a lot of moving parts for a problem a WebSocket through a Durable Object solves in a fraction of the code, with none of WebRTC's negotiation complexity and no per-GB media egress billing. Reach for Realtime specifically when you need actual live audio/video; reach for Durable Object WebSockets (or Queues, for async transfer) when you just need two clients to exchange data.

Primary source

Cloudflare Realtime docs cover the SFU's sessions/tracks model; pair with the Realtime pricing page for current egress rates, since pricing is subject to change.

Why does a multi-party WebRTC call need an SFU instead of a full peer-to-peer mesh?

Without scrolling up: what are the two core primitives in Cloudflare Realtime's SFU API, and what concept does the API deliberately not provide that you have to build yourself?

The two primitives are sessions (roughly, one participant's PeerConnection to the SFU) and tracks (individual audio/video/data MediaStreamTracks pushed into or pulled from a session, identified by track ID). Realtime deliberately has no "room" concept — presence, who's in a call, and who can see/hear whom is state you build yourself, typically by distributing track IDs over your own signaling channel (e.g. a Durable Object relaying messages over WebSockets).

Anything above unclear — the SFU vs. mesh tradeoff, the sessions/tracks model, or where signaling fits in — ask your AI teacher before moving on.
