I'm working on a WebRTC application where exactly 2 musicians collaborate on a live performance and stream the combined audio to a third party. Since it's not possible for both musicians to hear each other with perfect synchronization, my approach is:
- Musician A is the host, and performs however they see fit
- Musician B is the guest, who hears the host's audio, then performs in-time with what they hear from the remote stream
- Using the Web Audio API, A and B's audio streams are merged, and the merged audio is sent on a new stream to listener C (a simplified sketch of this merge follows the diagram below)
```
A ----> B    (host streams to guest over WebRTC)
 \     /
  \   /
   v v
    C        ("host" and "guest" streams merged using Web Audio API)
```
I believe getting perfect synchronization of the audio for C should be possible (as in, it doesn't violate the laws of physics). For the purposes of this application, "perfect synchronization" means that listener C should hear what B heard at time T concurrently with what B played at time T.
I've tried two approaches to this, neither successful:
Approach 1: B merges the audio. Since the performance already sounds in sync to B, I thought their merged stream might be in sync as well. However, the output still contains a delay, which I'm guessing comes from the time that elapses between B's local MediaStream receiving data and that data finishing processing in the merged stream.
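This is the kind of latency I'm talking about on B's side; a rough diagnostic sketch (with `pcFromA` as a placeholder for the connection receiving A's audio):

```typescript
// Sketch of the latencies I suspect are adding up on B's side
// (pcFromA is a placeholder for the connection receiving A's audio).
async function logSuspectedLatencies(
  audioCtx: AudioContext,
  pcFromA: RTCPeerConnection
): Promise<void> {
  // Delay between the audio graph and the audio hardware.
  console.log("baseLatency (s):", audioCtx.baseLatency);
  // outputLatency is not available in every browser.
  console.log("outputLatency (s):", audioCtx.outputLatency);

  const stats = await pcFromA.getStats();
  stats.forEach((report) => {
    if (
      report.type === "inbound-rtp" &&
      report.kind === "audio" &&
      report.jitterBufferEmittedCount > 0
    ) {
      // jitterBufferDelay is cumulative, so divide by the emitted count
      // to get an average delay per emitted sample.
      console.log(
        "avg jitter buffer delay (s):",
        report.jitterBufferDelay / report.jitterBufferEmittedCount
      );
    }
  });
}
```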
Approach 2: A merges the audio. Here, host A receives peer B's audio and tries to compensate for the time difference between the two streams by passing A's local audio through a DelayNode before merging. I use the WebRTC Statistics API and try values like the STUN round-trip time, the jitter buffer delay, and the MediaStream latency estimate, but no combination seems to give the right delay offset.
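A simplified sketch of what I mean (placeholder names `pcFromB`, `localStream`, and `remoteStream`; the offset shown is just one of the stat combinations I've tried):

```typescript
// Sketch of approach 2 on A's side: delay A's local audio by an estimate
// built from WebRTC stats, then mix it with B's incoming audio.
async function estimateOffsetSeconds(pcFromB: RTCPeerConnection): Promise<number> {
  const stats = await pcFromB.getStats();
  let rtt = 0;
  let avgJitterBufferDelay = 0;

  stats.forEach((report) => {
    if (report.type === "candidate-pair" && report.nominated && report.currentRoundTripTime) {
      rtt = report.currentRoundTripTime;
    }
    if (report.type === "inbound-rtp" && report.kind === "audio" && report.jitterBufferEmittedCount > 0) {
      avgJitterBufferDelay = report.jitterBufferDelay / report.jitterBufferEmittedCount;
    }
  });

  // One of the combinations I've tried; swapping in RTT / 2 or adding
  // AudioContext latencies gives different (still imperfect) results.
  return rtt + avgJitterBufferDelay;
}

function buildDelayedMix(
  audioCtx: AudioContext,
  localStream: MediaStream,   // A's own microphone
  remoteStream: MediaStream,  // B's audio received over WebRTC
  offsetSeconds: number
): MediaStreamAudioDestinationNode {
  // Delay A's local audio before mixing so it lines up with B's delayed audio.
  const delay = audioCtx.createDelay(5.0); // allow up to 5 s of delay
  delay.delayTime.value = offsetSeconds;

  const mix = audioCtx.createMediaStreamDestination();
  audioCtx.createMediaStreamSource(localStream).connect(delay).connect(mix);
  audioCtx.createMediaStreamSource(remoteStream).connect(mix);
  return mix;
}
```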
Is there a known method to synchronize audio this way with WebRTC? Is it a matter of getting the right WebRTC Statistics, or is my approach totally off?