
I have been given the task of mixing raw data from audio files. I am struggling to get a clean sound from the mixed data: I keep getting distortion or white noise.

Let's say I have two byte arrays of data from two AudioInputStreams. Each AudioInputStream streams a byte array from a given audio file. I can already play back a single audio file using SourceDataLine's write method. I want to play two audio files simultaneously, so I am aware that I need to perform some sort of PCM addition.

Can anyone recommend whether this addition should be done with float values or byte values? Also, when it comes to adding 3, 4 or more audio files, I am guessing my problem will be even harder! Do I need to divide by a certain amount to avoid overflow? Let's say I am adding two 16-bit audio files (min -32,768, max 32,767).

I admit, I have had some advice on this before but can't seem to get it working! I have code showing what I have tried, but I don't have it with me!

Any advice would be great.

Thanks

Also, my main issue is: what should the size of my mixed array be? Should it be the size of the largest audio file to mix? - Ivaan
Hello Ivaan! You do need to be careful when summing multiple signals, as any sum that goes beyond the min/max threshold will add noise/distortion. Do you need to solve this as a real-time problem, or can it be pre-computed (non-real-time)? For non-real-time, have you also tried normalising the audio? You can sum with either byte or float values. I would recommend converting values to float at the start, in the range -1 to 1, to keep things simple/understandable. I believe for equal-power summing you should multiply the summed signal by (1/sqrt(2))^(n-1), for n signals. - fdcpp
Hi there, thanks for the response! For now, I have been trying to implement non-real-time. I am aware that I need to clip my values to the min and max representation of an n-bit number. I have indeed been trying addition with normalized float values from -1 to 1. So if I want to add two byte arrays' worth of data into one byte array, I will sum all index positions and apply the factor you mentioned, (1/sqrt(2))^(n-1) for n signals, to the results. Can you confirm why this works and where you found this calculation? - Ivaan
My problem is that I want to play an output track as opposed to single audio files. However, I don't know how big the output data array should be (obviously the audio files being added are of different sizes). - Ivaan
The power equations are a better match than linear values for modeling perceived loudness. If needed, the power transforms can be applied to the inputs prior to their addition. But I think the OP would be best advised to get the simpler linear addition of signals working first. - Phil Freihofner

1 Answer


First off, I question whether you are actually working with fully decoded PCM data values. Directly adding bytes only makes sense if the sound was recorded at 8-bit resolution, which is done less and less. These days, audio is more commonly recorded as 16-bit values, or more. I think there are some situations that don't require as much frequency content, but with current systems the CPU savings aren't as critical, so people opt to keep at least "CD Quality" (16-bit resolution, stereo, 44100 fps).

So step one: make sure that you are properly converting the byte streams to valid PCM. For example, with 16-bit encoding, the two bytes of each sample have to be combined in the correct order (it may be either big-endian or little-endian), and the resulting value used.
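As a rough illustration of that conversion, here is a minimal sketch that assumes 16-bit signed PCM. The class name PcmDecoder and the method toSamples are made up for this example; the bigEndian flag would normally come from AudioFormat.isBigEndian() on the stream's format.

```java
final class PcmDecoder {
    private PcmDecoder() {}

    /**
     * Converts 16-bit signed PCM bytes into one short per sample.
     * Assumes the bytes hold whole samples; a trailing odd byte is ignored.
     */
    static short[] toSamples(byte[] bytes, boolean bigEndian) {
        short[] samples = new short[bytes.length / 2];
        for (int i = 0; i < samples.length; i++) {
            byte first = bytes[2 * i];
            byte second = bytes[2 * i + 1];
            // The high byte keeps its sign; the low byte is masked so it is
            // treated as unsigned (0..255) before the two are combined.
            int high = bigEndian ? first : second;
            int low = (bigEndian ? second : first) & 0xFF;
            samples[i] = (short) ((high << 8) | low);
        }
        return samples;
    }
}
```

Forgetting the & 0xFF mask on the low byte, or using the wrong byte order, is a common way to end up with output that sounds like white noise.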

Once that is properly handled, it is usually sufficient to simply add the values and maybe impose a min and max filter to ensure the signal doesn't go beyond the defined range. I can think of two reasons why this works: (a) audio is usually recorded at a low enough volume that summing will not cause overflow, (b) the signals are random enough, with both positive and negative values, that moments where all the contributors line up in either the positive or negative direction are rare and short-lived.

Using a min and max will "clip" the signals, and can introduce some audible distortion, but it is a much less horrible sound than overflow! If your sources are routinely hitting the min and max, you can simply multiply a volume factor (within the range 0 to 1) to one or more of the contributing signals as a whole, to bring the audio values down.
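The add-then-clip approach with per-source volume factors might look like the sketch below. PcmMixer and mix are hypothetical names. Sizing the output to the longer input and treating the shorter one as silence once it ends is an assumption made for this example, not the only possible choice.

```java
final class PcmMixer {
    private PcmMixer() {}

    /**
     * Adds two decoded 16-bit sample arrays, applying a volume factor
     * (expected in the range 0 to 1) to each, and clips the sum to the
     * 16-bit range. Assumption: the output is as long as the longer
     * input, and the shorter input counts as silence after it ends.
     */
    static short[] mix(short[] a, short[] b, double gainA, double gainB) {
        short[] out = new short[Math.max(a.length, b.length)];
        for (int i = 0; i < out.length; i++) {
            double sampleA = i < a.length ? a[i] * gainA : 0.0;
            double sampleB = i < b.length ? b[i] * gainB : 0.0;
            long sum = Math.round(sampleA + sampleB);
            // Clip rather than let the cast wrap around: wrap-around
            // (overflow) sounds far worse than clipping.
            long clipped = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, sum));
            out[i] = (short) clipped;
        }
        return out;
    }
}
```

For example, mix(a, b, 0.5, 0.5) halves both sources before adding them, so two full-scale signals can no longer exceed the 16-bit range.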

For 16-bit data, it works to perform operations directly on the signed integers that result from appending the two bytes together (-32768 to 32767). But it is a more common practice to "normalize" the values, i.e., convert the 16-bit integers to floats ranging from -1 to 1, perform operations at that level, and then convert back to integers in the range -32768 to 32767 and break those integers into byte pairs.
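The normalize, operate, and convert-back round trip could be sketched as follows. FloatPcm and its methods are illustrative names. Scaling by 32768 in both directions is one convention among several, chosen here because it lets every 16-bit value survive the round trip unchanged.

```java
final class FloatPcm {
    private static final float SCALE = 32768f;

    private FloatPcm() {}

    /** Maps 16-bit samples to floats in the range -1 (inclusive) to 1 (exclusive). */
    static float[] normalize(short[] samples) {
        float[] out = new float[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = samples[i] / SCALE;
        }
        return out;
    }

    /** Maps floats back to 16-bit samples, clipping anything outside the range. */
    static short[] denormalize(float[] values) {
        short[] out = new short[values.length];
        for (int i = 0; i < values.length; i++) {
            int scaled = Math.round(values[i] * SCALE);
            out[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, scaled));
        }
        return out;
    }

    /** Breaks each 16-bit sample into a byte pair in the requested byte order. */
    static byte[] toBytes(short[] samples, boolean bigEndian) {
        byte[] out = new byte[samples.length * 2];
        for (int i = 0; i < samples.length; i++) {
            byte high = (byte) (samples[i] >> 8);
            byte low = (byte) samples[i];
            out[2 * i] = bigEndian ? high : low;
            out[2 * i + 1] = bigEndian ? low : high;
        }
        return out;
    }
}
```

The byte array returned by toBytes is in the form that SourceDataLine's write method expects, provided the line was opened with a matching 16-bit format and byte order.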

There is a free book on digital signal processing that is well worth reading: Steven W. Smith's "The Scientist and Engineer's Guide to Digital Signal Processing." It will give much more detail and background.