2 votes

I'm trying to write some basic sound-editing programs in Java, but I've been having a huge amount of trouble with the 16-bit WAVE file format.

When I asked Java how many samples it thought my sound file had, it gave me a number twice as big as I expected. When I told Java to generate a sine wave of 80000 one-byte samples, it played for 1 second instead of 2 (even though the sample rate was about 40000 samples per second).

After some more searching, I realized that the "frame size" of my file was 2, that a "sample" was actually 2 bytes instead of one, and that this was called a 16-bit audio file. As an experiment, I wrote my sound file to an array of bytes, set every other byte to 0, and played back the result. When I kept only the odd bytes, the sound file played back with a tiny bit of static noise. When I kept only the even ones, that static noise played back on its own without the sound file. This makes me think that the even bytes contain the exact inverse of the static in the odd bytes, which contain the actual sound to be played. When played back together, the even bytes silence the static in the odd bytes, which increases the sound's fidelity.

This website has a pretty good explanation of the basics of 16-bit sound encodings. However, it's not quite good enough for me to go ahead and start editing the file byte by byte. How can I do byte-by-byte editing of a 16-bit (or larger) sound file while still preserving its higher fidelity? What's the formula for encoding sound with 16 bits per sample instead of just 8?

What type are you using to hold your samples? Sounds very much as if you're treating them as 8-bit, yet reading and writing 16-bit samples. – marko
I've been reading each audio file byte by byte and storing the file in an array of bytes (byte[]). So yes, I've been treating each sample as 8-bit instead of 16-bit. – NcAdams

1 Answer

0 votes

How can I do byte-by-byte editing of a 16-bit (or larger) sound file...?

That question does not make any sense. When you say "byte-by-byte editing", you really should be saying "sample-by-sample". In this case, every sample is 16 bits (or two bytes), and it does not make sense to split the samples apart. That would be like trying to edit only the top halves of each letter in a text editor.

A single channel of a digital audio stream is a sequence of numbers (a.k.a. samples). Each sample is a representation of the pressure exerted on a microphone diaphragm by the sound wave at some instant in time. In an eight-bit sound file there are only 256 possible values, whereas in a 16-bit sound file there are 65536 possible values. A 16-bit file has much greater resolution.
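Here is a minimal sketch of that idea in Java, assuming the common case for WAV files of signed, little-endian PCM (check your stream's AudioFormat to confirm; the class and method names are just for illustration):

    public class SampleCodec {

        // Combine little-endian byte pairs into signed 16-bit samples.
        static short[] bytesToSamples(byte[] bytes) {
            short[] samples = new short[bytes.length / 2];
            for (int i = 0; i < samples.length; i++) {
                int lo = bytes[2 * i] & 0xFF; // low-order byte, masked to 0..255
                int hi = bytes[2 * i + 1];    // high-order byte, carries the sign
                samples[i] = (short) ((hi << 8) | lo);
            }
            return samples;
        }

        // Split signed 16-bit samples back into little-endian byte pairs.
        static byte[] samplesToBytes(short[] samples) {
            byte[] bytes = new byte[samples.length * 2];
            for (int i = 0; i < samples.length; i++) {
                bytes[2 * i]     = (byte) (samples[i] & 0xFF);        // low byte
                bytes[2 * i + 1] = (byte) ((samples[i] >> 8) & 0xFF); // high byte
            }
            return bytes;
        }
    }

Once the data is in a short[], an "edit" is just arithmetic on those values; for example, dividing every sample by 2 halves the volume without introducing the static you heard.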

This makes me think that the even bytes contain the exact inverse of the static in the odd bytes, which contain the actual sound to be played.

There's a kernel of truth to that. The definition of "noise" in signal processing is the difference between what you hear and what you wanted to hear. When you zeroed out all of the odd-numbered bytes, you were stomping on the low-order halves of each sample. By changing the samples, you were introducing something you didn't want to hear (i.e., noise). When you zeroed out the even-numbered bytes, you killed all of the high-order bits and therefore most of the signal. What remained in the low-order bytes was the exact inverse of the noise that you had introduced in your first experiment. (Your ears can't tell the difference between a given sound wave and the inverse of the same sound wave.)
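A quick worked example with an arbitrary, made-up sample value shows why:

    short sample   = 0x1234;                    // 4660
    short highOnly = (short) (sample & 0xFF00); // 4608: low byte zeroed (your first experiment)
    short lowOnly  = (short) (sample & 0x00FF); //   52: high byte zeroed (your second experiment)
    int noise = highOnly - sample;              //  -52: the error added by the first experiment
    // lowOnly == -noise: the residue is the exact inverse of the injected noise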

There is no absolute mapping between sample values and pressure, but there are a couple of things you should know:

1) Are the samples signed or are they unsigned? Every sample has a value that must lie between some minimum and some maximum. If the (16-bit) samples are signed, then the minimum value is -32768 (0x8000), the maximum is 32767 (0x7FFF), and 0 is right in the middle. If the samples are unsigned, then the minimum is 0 and the maximum is 65535 (0xFFFF). Get it wrong, and you will know immediately, because all you will hear is massive noise. (A decoding sketch that handles both cases follows this list.)

2) Are the samples linear? The sample values are always proportional to something. If they are directly proportional to the sound pressure level, that's called "linear encoding." But they may be proportional to the logarithm of the sound pressure, or to some other function of it. Non-linear encodings (e.g., µ-law and A-law) are almost always 8-bit, and they are usually only encountered in specialized applications like telephony. If you are dealing with 16-bit or larger samples, then they are almost certainly linear.
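Here is a sketch of one way to decode a single 16-bit sample while respecting both the signedness and the byte order, using the AudioFormat class from javax.sound.sampled (assuming you obtained the format from your AudioInputStream; only 16-bit PCM is handled here):

    import javax.sound.sampled.AudioFormat;

    public class PcmDecode {
        // Decode one 16-bit sample from a byte pair, honoring the stream's format
        // (the format would come from your AudioInputStream's getFormat()).
        static int decodeSample(byte a, byte b, AudioFormat format) {
            byte hi = format.isBigEndian() ? a : b; // WAV files are normally little-endian
            byte lo = format.isBigEndian() ? b : a;
            if (format.getEncoding().equals(AudioFormat.Encoding.PCM_UNSIGNED)) {
                return ((hi & 0xFF) << 8) | (lo & 0xFF); // 0..65535, no sign extension
            }
            return (hi << 8) | (lo & 0xFF);              // -32768..32767, hi carries the sign
        }
    }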