I'm working on a WebRTC application where exactly 2 musicians collaborate on a live performance and stream the combined audio to a third party. Since it's not possible for both musicians to hear each other with perfect synchronization, my approach is:
- Musician A is the host, and performs however they see fit
- Musician B is the guest, who hears the host's audio, then performs in-time with what they hear from the remote stream
- Using the Web Audio API, A and B's audio streams are merged, and the merged audio is sent on a new stream to listener C (a simplified sketch of this merge follows the diagram below)
```
A ----> B    (host streams to guest over WebRTC)
 \     /
  \   /
   v v
    C        ("host" and "guest" streams merged using Web Audio API)
```
I believe getting perfect synchronization of the audio for C should be possible (as in, it doesn't violate the laws of physics). For the purposes of this application, "perfect synchronization" means that listener C should hear what B heard at time T concurrently with what B played at time T.
I've tried two approaches to this, neither successful:
Approach 1: B merges the audio. Since the performance already sounds in sync to B, I thought their merged stream might be in sync as well. However, the output still contains a delay, which I'm guessing comes from the time that elapses between B's local MediaStream receiving data and that data finishing processing in the merged stream.
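This is the kind of latency I'm talking about on B's side; a rough diagnostic sketch (with `pcFromA` as a placeholder for the connection receiving A's audio):

```typescript
// Sketch of the latencies I suspect are adding up on B's side
// (pcFromA is a placeholder for the connection receiving A's audio).
async function logSuspectedLatencies(
  audioCtx: AudioContext,
  pcFromA: RTCPeerConnection
): Promise<void> {
  // Delay between the audio graph and the audio hardware.
  console.log("baseLatency (s):", audioCtx.baseLatency);
  // outputLatency is not available in every browser.
  console.log("outputLatency (s):", audioCtx.outputLatency);

  const stats = await pcFromA.getStats();
  stats.forEach((report) => {
    if (
      report.type === "inbound-rtp" &&
      report.kind === "audio" &&
      report.jitterBufferEmittedCount > 0
    ) {
      // jitterBufferDelay is cumulative, so divide by the emitted count
      // to get an average delay per emitted sample.
      console.log(
        "avg jitter buffer delay (s):",
        report.jitterBufferDelay / report.jitterBufferEmittedCount
      );
    }
  });
}
```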
Approach 2: A merges the audio. Here, host A receives peer B's audio and tries to compensate for the time difference between the two streams by passing A's local audio through a DelayNode before merging. I use the WebRTC Statistics API and try values like the STUN round-trip time, the jitter buffer delay, and the MediaStream latency estimate, but no combination seems to give the right delay offset.
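A simplified sketch of what I mean (placeholder names `pcFromB`, `localStream`, and `remoteStream`; the offset shown is just one of the stat combinations I've tried):

```typescript
// Sketch of approach 2 on A's side: delay A's local audio by an estimate
// built from WebRTC stats, then mix it with B's incoming audio.
async function estimateOffsetSeconds(pcFromB: RTCPeerConnection): Promise<number> {
  const stats = await pcFromB.getStats();
  let rtt = 0;
  let avgJitterBufferDelay = 0;

  stats.forEach((report) => {
    if (report.type === "candidate-pair" && report.nominated && report.currentRoundTripTime) {
      rtt = report.currentRoundTripTime;
    }
    if (report.type === "inbound-rtp" && report.kind === "audio" && report.jitterBufferEmittedCount > 0) {
      avgJitterBufferDelay = report.jitterBufferDelay / report.jitterBufferEmittedCount;
    }
  });

  // One of the combinations I've tried; swapping in RTT / 2 or adding
  // AudioContext latencies gives different (still imperfect) results.
  return rtt + avgJitterBufferDelay;
}

function buildDelayedMix(
  audioCtx: AudioContext,
  localStream: MediaStream,   // A's own microphone
  remoteStream: MediaStream,  // B's audio received over WebRTC
  offsetSeconds: number
): MediaStreamAudioDestinationNode {
  // Delay A's local audio before mixing so it lines up with B's delayed audio.
  const delay = audioCtx.createDelay(5.0); // allow up to 5 s of delay
  delay.delayTime.value = offsetSeconds;

  const mix = audioCtx.createMediaStreamDestination();
  audioCtx.createMediaStreamSource(localStream).connect(delay).connect(mix);
  audioCtx.createMediaStreamSource(remoteStream).connect(mix);
  return mix;
}
```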
Is there a known method to synchronize audio this way with WebRTC? Is it a matter of getting the right WebRTC Statistics, or is my approach totally off?