Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc. are going, though.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

This is the opposite of the feedback I get. Users want instant responses; any delay in generating responses or handling interruptions kills the magic. You also don't want to send faster than real time: if the user interrupts the model, you've wasted a bunch of bandwidth sending 3 minutes of audio of which only 10 seconds got played.

> TTS is faster than real-time

https://research.nvidia.com/labs/adlr/personaplex/ The latest/aspirational voice AI is moving away from what the author describes. Audio is trickled in/out in 20ms frames.

> We really hope the user’s source IP/port never changes, because we broke that functionality.

That is supported: when a new IP comes in for the same ufrag, it's handled.

> It takes a minimum of 8* round trips (RTT)

That's wrong: https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/

> I’d just stream audio over WebSockets

You lose stuff like AEC. You also push complexity onto clients. The simplicity of WebRTC (createOffer -> setRemoteDescription, sketched below) is what lets people onboard easily; lots of developers struggled with the Realtime API over WebSockets (lots of code, and having to do everything by hand).

----

I think if I had my choice I would keep the Offer/Answer model and do QUIC instead of DTLS+SCTP. Maybe RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
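For anyone who hasn't seen it, here's roughly what that createOffer -> setRemoteDescription onboarding looks like from a browser. A minimal sketch, not any vendor's actual API: the endpoint URL and bearer token are hypothetical stand-ins for whatever signaling server answers the offer.

```typescript
// Minimal WebRTC client sketch: one HTTP round trip for signaling,
// then the browser handles ICE, DTLS, SRTP, AEC, and jitter buffering.
// The URL and token below are hypothetical placeholders.
async function connectVoice(token: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Send microphone audio; echo cancellation etc. come for free.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  for (const track of mic.getTracks()) pc.addTrack(track, mic);

  // Play whatever audio the server sends back.
  pc.ontrack = (event) => {
    const player = new Audio();
    player.srcObject = event.streams[0];
    void player.play();
  };

  // The entire handshake from the app's point of view.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch("https://example.com/v1/realtime", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return pc;
}
```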
This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means quickly acclimating to SDP, TURN/STUN, ICE candidates, offers, peer-to-peer protocols, and the complex handshake that gets implemented from scratch each time. I can't imagine rewriting the whole trenchcoat of protocols and unintended "best practices".
This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.
Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
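To make that concrete, here is a deliberately naive jitter buffer of the kind you end up writing once audio frames arrive over a plain WebSocket. A sketch under assumptions (20ms frames at 48kHz, a simple sequence number); real systems add packet loss concealment, adaptive depth, and clock-drift correction on top.

```typescript
// Naive jitter buffer: the state WebRTC manages for you reappears in
// your own code the moment you stream raw frames over a WebSocket.
// Frame layout (seq + 20ms of 48kHz samples) is an assumption.
interface Frame {
  seq: number;
  samples: Float32Array; // 960 samples = 20ms at 48kHz
}

class JitterBuffer {
  private frames = new Map<number, Frame>();
  private nextSeq = 0;

  push(frame: Frame): void {
    if (frame.seq < this.nextSeq) return; // arrived too late, drop it
    this.frames.set(frame.seq, frame);
  }

  // Called every 20ms by the playout clock.
  pop(frameSamples = 960): Float32Array {
    const frame = this.frames.get(this.nextSeq);
    this.frames.delete(this.nextSeq);
    this.nextSeq++;
    // Missing or late packet: conceal with silence. Real implementations
    // do PLC, reordering windows, and adaptive buffer depth here.
    return frame ? frame.samples : new Float32Array(frameSamples);
  }
}
```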
You might have noticed that the author started the blog post explaining themselves:
> Like 6 years ago I wrote a WebRTC SFU at Twitch. Originally we used Pion (Go) just like OpenAI, but forked after benchmarking revealed that it was too slow. I ended up rewriting every protocol, because of course I did!
>
> Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust. Because of course I did! You’re probably noticing a trend.
>
> Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s. And some de-facto standards that are technically drafts (ex. TWCC, REMB). Not a fun fact when you have to implement them all.
>
> You should consider me a Certified WebRTC Expert. Which is why I never, never want to use WebRTC again.
I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?
It’s 2026 and teleconferencing is still such a shit show. There’s billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I’ve never not seen teleconferencing be a ham handed mess.
> WebRTC is designed to degrade and drop my prompt during poor network conditions
If you want real time, that's what you're going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
Isn’t the point that OpenAI’s use case does not require realtime?
When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.
Yep. Maybe there's some additional configuration I'm missing to mitigate the delay, but clients don't seem to want to deal with the latency of STT -> Prompt -> TTS. They'll happily suffer occasional quality issues if the conversation feels "real".
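The pacing tradeoff both sides of this thread are circling looks something like the sketch below: the model may generate audio faster than real time, but a server that releases it only slightly ahead of the playout clock wastes little bandwidth when the user interrupts. `generatedFrames`, `send`, and `interrupted` are hypothetical stand-ins.

```typescript
// Server-side pacing sketch: stay a small cushion ahead of real time
// instead of blasting minutes of audio that may never be played.
const FRAME_MS = 20; // one audio frame per 20ms
const LEAD_MS = 500; // cushion ahead of the playout clock

async function pacedSend(
  generatedFrames: AsyncIterable<Uint8Array>, // hypothetical model output
  send: (frame: Uint8Array) => void,          // hypothetical transport
  interrupted: () => boolean,                 // hypothetical barge-in flag
): Promise<void> {
  const start = Date.now();
  let sentMs = 0;
  for await (const frame of generatedFrames) {
    if (interrupted()) return; // stop early; little bandwidth wasted
    const aheadMs = sentMs - (Date.now() - start);
    if (aheadMs > LEAD_MS) {
      // Too far ahead of playout; wait until we're back inside the budget.
      await new Promise((r) => setTimeout(r, aheadMs - LEAD_MS));
    }
    send(frame);
    sentMs += FRAME_MS;
  }
}
```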
I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating one: simply backgrounding the app (the web version too) and resuming it causes the response, or the conversation with an assigned ID, to disappear. Haha.
Interesting read, albeit over my head. I spent half of yesterday comparing Gemini Live (WebSockets) vs gpt-realtime-2, and while GPT is super good and seemingly more robust, Gemini connects faster.
This misses a few key things but hits on many others.

WebRTC is a bad protocol, without a doubt. I do like WebSockets as an easy alternative, but you need to reinvent decent portions of WebRTC as a result.

I like the idea of MoQ, but it's not widely used. Probably worth experimenting with, especially as video enters the chat.
> and then a GPU pretends to talk to you via text-to-speech
OpenAI is speech-to-speech; there is no TTS in voice mode.
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
Signalling can be done long ahead of time (sketched after this comment), though I don't see it mentioned in the OpenAI blog. I also saw some new WebRTC extensions that should reduce setup time further.
Ultimately, though, it comes down to:
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift toward lower-latency S2S models, like the new voice API models that OpenAI announced.

To be fair, the new models were released the day after this MoQ blog post was published.
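A sketch of the pre-warming idea from the comment above: do the offer/answer (and the ICE/DTLS setup it triggers) at page load, then attach the microphone later with replaceTrack, which needs no renegotiation. The signaling endpoint is a hypothetical placeholder.

```typescript
// Pre-warm the connection long before the user hits "talk".
let pending: Promise<RTCPeerConnection> | null = null;

function prewarm(): void {
  pending ??= (async () => {
    const pc = new RTCPeerConnection();
    // Negotiate an audio m-line up front, with no track attached yet.
    pc.addTransceiver("audio", { direction: "sendrecv" });
    pc.ontrack = (event) => {
      const player = new Audio();
      player.srcObject = event.streams[0];
      void player.play();
    };
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    // Hypothetical endpoint that returns an SDP answer.
    const resp = await fetch("https://example.com/signal", {
      method: "POST",
      headers: { "Content-Type": "application/sdp" },
      body: offer.sdp,
    });
    await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
    return pc; // ICE and DTLS complete in the background from here
  })();
}

async function startTalking(): Promise<void> {
  prewarm(); // no-op if already warmed
  const pc = await pending!;
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  // replaceTrack swaps media onto the already-negotiated sender
  // without another offer/answer round trip.
  await pc.getTransceivers()[0].sender.replaceTrack(mic.getAudioTracks()[0]);
}
```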
You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.
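In code, the client-server case really is that small; a sketch with Node's dgram module, with the server host and port as placeholders (browsers can't open raw UDP sockets, which is where QUIC/WebTransport enters the picture):

```typescript
// Plain UDP to a server with a known public address: no STUN, no TURN,
// no ICE. Host and port below are hypothetical.
import dgram from "node:dgram";

const socket = dgram.createSocket("udp4");

socket.on("message", (msg, rinfo) => {
  // The server replies to whatever source address it observed.
  console.log(`got ${msg.length} bytes from ${rinfo.address}:${rinfo.port}`);
});

// The client knows the server's address; just send.
socket.send(Buffer.from("audio frame"), 5004, "media.example.com");
```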
I hope it's getting better with education and more libraries. It's also amazing how easily Codex etc… can burn through it now.
You run into issues around AudioContext and resumption etc... it's a PITA to have to handle all those corner cases :(
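For anyone who hasn't fought this: a sketch of the usual workaround, under the assumption that playback goes through a single AudioContext. Contexts start (or get pushed) into "suspended", especially on mobile and in backgrounded tabs, so you resume on a user gesture and again on foreground.

```typescript
// Keep an AudioContext running across autoplay policies and tab
// backgrounding; this is the corner-case handling lamented above.
const ctx = new AudioContext();

async function ensureRunning(): Promise<void> {
  if (ctx.state === "suspended") {
    try {
      await ctx.resume();
    } catch {
      // Autoplay policy: resume() may only succeed inside a user gesture.
    }
  }
}

// Resume on the first user gesture (autoplay policies require one)...
document.addEventListener("pointerdown", () => void ensureRunning(), { once: true });

// ...and again whenever the tab returns to the foreground.
document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "visible") void ensureRunning();
});

// Some platforms also suspend the context themselves; react to that too.
ctx.onstatechange = () => void ensureRunning();
```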
Had a nice chuckle.
Cloudflare doesn't support WebTransport well.