Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc. are going, though.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

This is the opposite of the feedback I get. Users want instant responses; any delay in generating responses or handling interruptions kills the magic. You also don't want to send faster than real time: if the user interrupts the model, you've wasted a bunch of bandwidth sending 3 minutes of audio of which only 10 seconds got played.

> TTS is faster than real-time

https://research.nvidia.com/labs/adlr/personaplex/ The latest/aspirational voice AI is moving away from what the author describes. Audio is trickled in/out in 20ms frames.

> We really hope the user’s source IP/port never changes, because we broke that functionality.

That is supported: when a new IP comes in for the same ufrag, it's handled.

> It takes a minimum of 8* round trips (RTT)

That's wrong: https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/

> I’d just stream audio over WebSockets

You lose stuff like AEC. You also push complexity onto clients. The simplicity of WebRTC (createOffer -> setRemoteDescription, sketched below) is what lets people onboard easily; lots of developers struggled with the Realtime API over WebSockets (lots of code, and having to do everything by hand).

----

I think if I had my choice I would keep the Offer/Answer model and do QUIC instead of DTLS+SCTP. Maybe RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
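For anyone who hasn't seen it, here's roughly what that createOffer -> setRemoteDescription onboarding looks like from a browser. A minimal sketch, not any vendor's actual API: the endpoint URL and bearer token are hypothetical stand-ins for whatever signaling server answers the offer.

```typescript
// Minimal WebRTC client sketch: one HTTP round trip for signaling,
// then the browser handles ICE, DTLS, SRTP, AEC, and jitter buffering.
// The URL and token below are hypothetical placeholders.
async function connectVoice(token: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Send microphone audio; echo cancellation etc. come for free.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  for (const track of mic.getTracks()) pc.addTrack(track, mic);

  // Play whatever audio the server sends back.
  pc.ontrack = (event) => {
    const player = new Audio();
    player.srcObject = event.streams[0];
    void player.play();
  };

  // The entire handshake from the app's point of view.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch("https://example.com/v1/realtime", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return pc;
}
```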
This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means quickly acclimating to SDP, TURN/STUN, ICE candidates, offers, peer-to-peer protocols, and the complex handshake that gets implemented from scratch each time. I can't imagine rewriting the whole trenchcoat of protocols and unintended "best practices".
This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.
Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
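To make that concrete, here is a deliberately naive jitter buffer of the kind you end up writing once audio frames arrive over a plain WebSocket. A sketch under assumptions (20ms frames at 48kHz, a simple sequence number); real systems add packet loss concealment, adaptive depth, and clock-drift correction on top.

```typescript
// Naive jitter buffer: the state WebRTC manages for you reappears in
// your own code the moment you stream raw frames over a WebSocket.
// Frame layout (seq + 20ms of 48kHz samples) is an assumption.
interface Frame {
  seq: number;
  samples: Float32Array; // 960 samples = 20ms at 48kHz
}

class JitterBuffer {
  private frames = new Map<number, Frame>();
  private nextSeq = 0;

  push(frame: Frame): void {
    if (frame.seq < this.nextSeq) return; // arrived too late, drop it
    this.frames.set(frame.seq, frame);
  }

  // Called every 20ms by the playout clock.
  pop(frameSamples = 960): Float32Array {
    const frame = this.frames.get(this.nextSeq);
    this.frames.delete(this.nextSeq);
    this.nextSeq++;
    // Missing or late packet: conceal with silence. Real implementations
    // do PLC, reordering windows, and adaptive buffer depth here.
    return frame ? frame.samples : new Float32Array(frameSamples);
  }
}
```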
You might have noticed that the author started the blog post explaining themselves:
> Like 6 years ago I wrote a WebRTC SFU at Twitch. Originally we used Pion (Go) just like OpenAI, but forked after benchmarking revealed that it was too slow. I ended up rewriting every protocol, because of course I did!
>
> Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust. Because of course I did! You’re probably noticing a trend.
>
> Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s. And some de-facto standards that are technically drafts (ex. TWCC, REMB). Not a fun fact when you have to implement them all.
>
> You should consider me a Certified WebRTC Expert. Which is why I never, never want to use WebRTC again.
I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?
It’s 2026 and teleconferencing is still such a shit show. There’s billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I’ve never not seen teleconferencing be a ham handed mess.
> WebRTC is designed to degrade and drop my prompt during poor network conditions
If you want real time, that's what you're going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
Isn’t the point that OpenAI’s use case does not require realtime?
When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.
Yep. Maybe there's some additional configuration I'm missing to mitigate the delay, but clients don't seem to want to deal with the latency of STT -> Prompt -> TTS. They'll happily suffer occasional quality issues if the conversation feels "real".
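The pacing tradeoff both sides of this thread are circling looks something like the sketch below: the model may generate audio faster than real time, but a server that releases it only slightly ahead of the playout clock wastes little bandwidth when the user interrupts. `generatedFrames`, `send`, and `interrupted` are hypothetical stand-ins.

```typescript
// Server-side pacing sketch: stay a small cushion ahead of real time
// instead of blasting minutes of audio that may never be played.
const FRAME_MS = 20; // one audio frame per 20ms
const LEAD_MS = 500; // cushion ahead of the playout clock

async function pacedSend(
  generatedFrames: AsyncIterable<Uint8Array>, // hypothetical model output
  send: (frame: Uint8Array) => void,          // hypothetical transport
  interrupted: () => boolean,                 // hypothetical barge-in flag
): Promise<void> {
  const start = Date.now();
  let sentMs = 0;
  for await (const frame of generatedFrames) {
    if (interrupted()) return; // stop early; little bandwidth wasted
    const aheadMs = sentMs - (Date.now() - start);
    if (aheadMs > LEAD_MS) {
      // Too far ahead of playout; wait until we're back inside the budget.
      await new Promise((r) => setTimeout(r, aheadMs - LEAD_MS));
    }
    send(frame);
    sentMs += FRAME_MS;
  }
}
```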
I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating one: simply backgrounding the app (the web version too) and resuming it causes the response, or the conversation with an assigned ID, to disappear. Haha.
Interesting read, albeit over my head. I spent half of yesterday comparing Gemini Live (WebSockets) vs gpt-realtime-2, and while GPT is super good and seemingly more robust, Gemini connects faster.
This misses a few key things but hits on many others.

WebRTC is a bad protocol, without a doubt. I do like WebSockets as an easy alternative, but you need to reinvent decent portions of WebRTC as a result.

I like the idea of MoQ, but it's not widely used. Probably worth experimenting with, especially as video enters the chat.
> and then a GPU pretends to talk to you via text-to-speech
OpenAI is speech-to-speech; there is no TTS in voice mode.
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
Signalling can be done long ahead of time (sketched after this comment), though I don't see it mentioned in the OpenAI blog. I also saw some new WebRTC extensions that should reduce setup time further.
Ultimately, though, it comes down to:
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift toward lower-latency S2S models, like the new voice API models that OpenAI announced.

To be fair, the new models were released the day after this MoQ blog post was published.
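A sketch of the pre-warming idea from the comment above: do the offer/answer (and the ICE/DTLS setup it triggers) at page load, then attach the microphone later with replaceTrack, which needs no renegotiation. The signaling endpoint is a hypothetical placeholder.

```typescript
// Pre-warm the connection long before the user hits "talk".
let pending: Promise<RTCPeerConnection> | null = null;

function prewarm(): void {
  pending ??= (async () => {
    const pc = new RTCPeerConnection();
    // Negotiate an audio m-line up front, with no track attached yet.
    pc.addTransceiver("audio", { direction: "sendrecv" });
    pc.ontrack = (event) => {
      const player = new Audio();
      player.srcObject = event.streams[0];
      void player.play();
    };
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    // Hypothetical endpoint that returns an SDP answer.
    const resp = await fetch("https://example.com/signal", {
      method: "POST",
      headers: { "Content-Type": "application/sdp" },
      body: offer.sdp,
    });
    await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
    return pc; // ICE and DTLS complete in the background from here
  })();
}

async function startTalking(): Promise<void> {
  prewarm(); // no-op if already warmed
  const pc = await pending!;
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  // replaceTrack swaps media onto the already-negotiated sender
  // without another offer/answer round trip.
  await pc.getTransceivers()[0].sender.replaceTrack(mic.getAudioTracks()[0]);
}
```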
You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.
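In code, the client-server case really is that small; a sketch with Node's dgram module, with the server host and port as placeholders (browsers can't open raw UDP sockets, which is where QUIC/WebTransport enters the picture):

```typescript
// Plain UDP to a server with a known public address: no STUN, no TURN,
// no ICE. Host and port below are hypothetical.
import dgram from "node:dgram";

const socket = dgram.createSocket("udp4");

socket.on("message", (msg, rinfo) => {
  // The server replies to whatever source address it observed.
  console.log(`got ${msg.length} bytes from ${rinfo.address}:${rinfo.port}`);
});

// The client knows the server's address; just send.
socket.send(Buffer.from("audio frame"), 5004, "media.example.com");
```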
I hope it's getting better with education and more libraries. It's also amazing how easily Codex etc… can burn through it now.
You run into issues around AudioContext and resumption etc... it's a PITA to have to handle all those corner cases :(
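For anyone who hasn't fought this: a sketch of the usual workaround, under the assumption that playback goes through a single AudioContext. Contexts start (or get pushed) into "suspended", especially on mobile and in backgrounded tabs, so you resume on a user gesture and again on foreground.

```typescript
// Keep an AudioContext running across autoplay policies and tab
// backgrounding; this is the corner-case handling lamented above.
const ctx = new AudioContext();

async function ensureRunning(): Promise<void> {
  if (ctx.state === "suspended") {
    try {
      await ctx.resume();
    } catch {
      // Autoplay policy: resume() may only succeed inside a user gesture.
    }
  }
}

// Resume on the first user gesture (autoplay policies require one)...
document.addEventListener("pointerdown", () => void ensureRunning(), { once: true });

// ...and again whenever the tab returns to the foreground.
document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "visible") void ensureRunning();
});

// Some platforms also suspend the context themselves; react to that too.
ctx.onstatechange = () => void ensureRunning();
```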
Had a nice chuckle.
Cloudflare doesn't support WebTransport well.