Putting the vocal track in the center means adding it to both channels at the same volume. You can do it like this:
sox -M stereo.wav vocal.wav result.wav remix -m 1,3 2,3
Here, -M (or --combine=merge) tells SoX to merge all channels of all input files. The stereo channels from stereo.wav become channels 1 and 2, and the mono channel from vocal.wav becomes channel 3. The remix effect then allows mixing them in different ways. It gives more control over the process than the standard combine methods.
Here, 1,3 describes the first output channel as the sum (mix) of channels 1 and 3, i.e. the original left music channel and the vocal track. Accordingly, 2,3 for the second output channel means the sum of the right music channel and the vocal track.
Clipping may occur, or the vocal track may be too loud or too soft compared to the background music. If that happens, it can be corrected by adding channel modifiers like p-5 (reduce volume by 5 dB) to the channels in question, here the two music channels:
remix -m 1p-5,3 2p-5,3
If the relative volume is OK but clipping occurs, one of the automatic scaling options might also be sufficient to remedy it (remix -a 1,3 2,3 or remix -p 1,3 2,3).
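To get a feeling for what the two automatic options do, here is a small sketch that prints the gain each input channel would receive. It assumes, going by my reading of the SoX manual, that -a scales each of the n channels mixed into one output channel by 1/n and -p scales them by 1/sqrt(n). The function name auto_gain is made up for this illustration and is not part of SoX.

```shell
# auto_gain MODE N (hypothetical helper, not part of SoX)
# Print the per-channel gain assumed for remix -a or remix -p when
# N input channels are mixed into a single output channel.
auto_gain() {
    case $1 in
        -a) # automatic: assumed scaling of 1/n
            LC_ALL=C awk -v n="$2" 'BEGIN { printf "%.3f\n", 1 / n }' ;;
        -p) # power-preserving: assumed scaling of 1/sqrt(n)
            LC_ALL=C awk -v n="$2" 'BEGIN { printf "%.3f\n", 1 / sqrt(n) }' ;;
        *)  echo "usage: auto_gain -a|-p N" >&2
            return 1 ;;
    esac
}

auto_gain -a 2   # 0.500
auto_gain -p 2   # 0.707
```

Under that assumption, remix -a 1,3 2,3 should behave roughly like the manual form remix -m 1v0.5,3v0.5 2v0.5,3v0.5, which is why it cannot clip but also makes the result quieter.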
This works for a known number of input files where you know which is which. For dealing with any number of mono/stereo input files automatically, some scripting will be required that tells mono and stereo files apart and constructs appropriate SoX calls.
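As a rough starting point for such a script, the following sketch builds the remix arguments from a list of channel counts. It assumes the files are combined with -M in the same order, that only mono and stereo inputs occur, and that every mono file should end up in the center. The function name build_remix_spec is my own invention, not something SoX provides.

```shell
# build_remix_spec COUNT... (hypothetical helper, not part of SoX)
# COUNT is the channel count of each input file, in sox -M order
# (1 = mono, 2 = stereo). Prints the two out-specs for remix.
build_remix_spec() {
    left= right= next=1
    for count in "$@"; do
        case $count in
            1)  # mono: the same channel goes to both sides (center)
                left="${left:+$left,}$next"
                right="${right:+$right,}$next"
                next=$((next + 1)) ;;
            2)  # stereo: first channel to the left, second to the right
                left="${left:+$left,}$next"
                right="${right:+$right,}$((next + 1))"
                next=$((next + 2)) ;;
            *)  echo "unsupported channel count: $count" >&2
                return 1 ;;
        esac
    done
    printf '%s %s\n' "$left" "$right"
}

build_remix_spec 2 1   # 1,3 2,3
```

The channel counts can be obtained with soxi -c file.wav, so a call might look like sox -M "$@" result.wav remix -m $(build_remix_spec $counts), with $counts collected from soxi beforehand. The volume caveats above still apply, since summing more files makes clipping more likely.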