9 votes

I am writing an app to manipulate audio, and the first thing I need to do is convert a file (WAV, MP3, etc.) to raw data, with the samples represented as floats.

I use ffmpeg in cmd:

ffmpeg -i test.wav -f s16le -acodec pcm_s16le output.dat

How are samples represented in the output.dat file? I know one sample takes two bytes in s16, and that with two channels the samples are interleaved as L1 R1 L2 R2 ... But does this file contain any frame structure, or are all the bytes in the .dat file sample values? Converting test.wav by two methods does not give identical file sizes: one method uses libav with the example code from the ffmpeg website, the other runs ffmpeg.exe directly on the command line as shown above, and the former gives a slightly smaller file. I am also confused because I have seen someone say that PCM uses a frame representation (2048 samples per frame).

I do not actually need any code, but I hope someone can explain the raw PCM format in detail.

Thanks a lot


2 Answers

7 votes

Starting with a stereo wav file with a bit depth of 16 bits at a 44,100 Hz sample rate, you have a standard CD-quality audio file ... issue this on the command line to display such stats for a file

ffprobe Cesária_Évora.wav

typical output

  Duration: 00:00:21.51, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 2 channels, s16, 1411 kb/s

to create a PCM file from the wav issue

ffmpeg -i Cesária_Évora.wav -f s16le -acodec pcm_s16le cesaria.dat

Be aware that a wav file is simply a 44-byte header followed by the payload, which is the raw audio curve in PCM format. This PCM file is strictly L1 R1 L2 R2, nothing more, nothing less. Any notion of frames is an abstraction of how we parse the data; no bits are dedicated to implementing a frame (there are no start/end markers). To write code that manipulates PCM data, keep in mind your bit depth as well as whether your file uses a little endian or big endian byte layout. Whenever your file has a bit depth of 8 bits you can safely ignore endianness, since you will never need to shift bytes. However, since the file above has a bit depth of 16 bits, each point of the audio curve is represented by a single 16-bit number per channel (stereo is two channels, mono is one channel).
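If you want to verify that header-plus-payload picture yourself, here is a minimal Python sketch (using the filenames from this answer and assuming a plain 16-bit PCM wav). The standard wave module parses the RIFF header for you, which also works when a wav carries extra metadata chunks and its header is therefore longer than 44 bytes, and it hands back only the interleaved payload, which should be byte-for-byte what ffmpeg wrote to cesaria.dat:

    import wave

    # read the wav: the module skips the header and returns only the sample payload
    with wave.open("Cesária_Évora.wav", "rb") as w:
        print(w.getnchannels(), w.getsampwidth(), w.getframerate())  # 2 channels, 2 bytes, 44100
        payload = w.readframes(w.getnframes())   # interleaved L1 R1 L2 R2 ... bytes

    # read the raw dump produced by the ffmpeg command above
    with open("cesaria.dat", "rb") as f:
        dump = f.read()

    print(len(payload), len(dump), payload == dump)   # sizes and bytes should match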

When reading such a file, this 16-bit number is stored across two bytes. If the file is little endian, then as you read the bytes, the left-most byte (the first one encountered in your loop as you iterate across the file) is the least significant byte, followed by the next more significant byte. So at the sample level the stream is

L1 R1 L2 R2 

below we indicate the stereo representation of two 16 bit points on the audio curve

Llittle1 Lbig1 Rlittle1 Rbig1 Llittle2 Lbig2 Rlittle2 Rbig2

That is the layout of the individual bytes used to store those two points; note the line above shows 8 bytes. Similarly, if we had a bit depth of 24 bits, one point on the audio curve (one sample per channel, stereo shown) would be stored as

Llittle1 Lbigger1 Lbiggest1 Rlittle1 Rbigger1 Rbiggest1  
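If it helps to see that assembly spelled out, here is a small sketch (not part of the original answer; the function name is only for illustration) that builds one signed 24-bit little endian sample from its three bytes:

    def sample_24le(b0, b1, b2):
        # b0 = littlest byte, b2 = biggest byte of one channel's 24 bit sample
        value = b0 | (b1 << 8) | (b2 << 16)
        if value & 0x800000:        # sign bit set -> negative two's-complement value
            value -= 1 << 24
        return value

    print(sample_24le(0xFF, 0xFF, 0xFF))   # prints -1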

So, conceptually, when reading a little endian file with a bit depth of 16 bits, here is how you parse the PCM for one channel for one point on the raw audio curve:

Llittle1 Lbig1

now to generate a single value L1 you conceptually do this

L1 = (Lbig1 << 8) | Llittle1     # shift the big byte 8 bits to the left, then interpret the 16 bit result as a signed value
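In practice you can let a standard library do that byte assembly for you. Here is a minimal Python sketch (assuming the cesaria.dat file generated above: 16 bit, little endian, stereo) that reads the dump, de-interleaves it, and scales each sample to the float range the original question asks about:

    import sys
    from array import array

    samples = array("h")                       # signed 16 bit integers
    with open("cesaria.dat", "rb") as f:
        samples.frombytes(f.read())
    if sys.byteorder == "big":                 # array uses native byte order; the file is little endian
        samples.byteswap()

    # de-interleave L1 R1 L2 R2 ... and scale to floats in [-1.0, 1.0)
    left  = [s / 32768.0 for s in samples[0::2]]
    right = [s / 32768.0 for s in samples[1::2]]
    print(len(left), len(right), left[:4])

Dividing by 32768 maps the full signed 16-bit range onto floats, which is the representation the question was after.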

Not sure if this is the level of abstraction you were looking for, however it is a stepping stone to nailing digital audio.

The super helpful tool Audacity permits you to import a raw audio file in PCM format such as the cesaria.dat we generated above ... Audacity -> File -> Import -> Raw Data -> choose cesaria.dat.


3 votes

-f s16le produces a raw sample dump with no header/trailer or any metadata. So it is simply L1 R1 C1 L2 R2 C2 ... where L, R, C represent 3 channels.
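To make that interleaving concrete, here is a short sketch (the filename and channel count are only examples; set CHANNELS to however many channels your source actually has) that splits such a dump back into per-channel sample lists:

    import struct

    CHANNELS = 3                                  # 3 for the L R C example above, 2 for stereo
    with open("output.dat", "rb") as f:           # a raw s16le dump as produced by ffmpeg
        data = f.read()

    # "<h" = little endian signed 16 bit; samples simply alternate across channels
    flat = [s for (s,) in struct.iter_unpack("<h", data)]
    channels = [flat[c::CHANNELS] for c in range(CHANNELS)]   # channels[0]=L, [1]=R, [2]=C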

When ffmpeg reads such a file, it will read and frame 1024 samples from each channel at a time, unless sampling rate/25 is less than 1024, in which case, it will read and packetize those many samples e.g. for a stream of 16000 Hz, sampling rate/25 = 640, which is less than 1024. So, ffmpeg will packetize 640x2 = 1280 samples for such a stereo stream.