I'm recording sound and encoding it to MP3 with the FFmpeg libraries, then decoding the MP3 data right away and playing the decoded data, but the playback sounds badly delayed. Here is the code. The first parameter of encode() accepts the raw PCM data, len = 44100.

encode parameters:

cntx_->channels = 1;
cntx_->sample_rate = 44100;
cntx_->sample_fmt = 6; // == AV_SAMPLE_FMT_S16P, planar signed 16-bit
cntx_->channel_layout =  AV_CH_LAYOUT_MONO;
cntx_->bit_rate = 8000;
err_ = avcodec_open2(cntx_, codec_, NULL);

vector<unsigned char>       encode(unsigned char* encode_data, unsigned int len)
{
    vector<unsigned char> ret;
    AVPacket avpkt;
    av_init_packet(&avpkt);
    avpkt.data = NULL;                      // let the encoder allocate the packet
    avpkt.size = 0;

    unsigned int len_encoded = 0;
    int data_left = len / 2;                // S16 samples = bytes / 2
    int miss_c = 0, i = 0;
    while (data_left > 0)
    {
        // Note: a final partial frame (sz < frame_size) is only valid if the
        // encoder advertises CODEC_CAP_SMALL_LAST_FRAME; otherwise pad it.
        int sz = data_left > cntx_->frame_size ? cntx_->frame_size : data_left;
        mp3_frame_->nb_samples = sz;
        mp3_frame_->format = cntx_->sample_fmt;
        mp3_frame_->channel_layout = cntx_->channel_layout;

        int needed_size = av_samples_get_buffer_size(NULL, 1,
            mp3_frame_->nb_samples, cntx_->sample_fmt, 1);

        int r = avcodec_fill_audio_frame(mp3_frame_, 1, cntx_->sample_fmt,
            encode_data + len_encoded, needed_size, 0);

        int got_packet = 0;
        r = avcodec_encode_audio2(cntx_, &avpkt, mp3_frame_, &got_packet);
        if (r < 0)
            break;                          // encoder error is in the return value
        if (got_packet) {
            i++;
            ret.insert(ret.end(), avpkt.data, avpkt.data + avpkt.size);
        } else {
            miss_c++;                       // encoder buffered the samples for later
        }
        len_encoded += needed_size;
        data_left -= sz;
        av_free_packet(&avpkt);
    }
    return ret;
}
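The encoder holding samples back is expected behaviour; when the input stream ends, the delayed frames can be drained by feeding the encoder NULL frames until it reports no more packets. This is a sketch against the same cntx_ / ret names used in encode() above (flushing this way assumes the encoder sets CODEC_CAP_DELAY, which libmp3lame does):

```cpp
// Sketch: drain the frames the encoder is still buffering after the last
// PCM block has been submitted.
AVPacket avpkt;
av_init_packet(&avpkt);
avpkt.data = NULL;   // let the encoder allocate the packet
avpkt.size = 0;

int got_packet = 0;
do {
    // A NULL frame tells avcodec_encode_audio2() to flush delayed data.
    if (avcodec_encode_audio2(cntx_, &avpkt, NULL, &got_packet) < 0)
        break;
    if (got_packet) {
        ret.insert(ret.end(), avpkt.data, avpkt.data + avpkt.size);
        av_free_packet(&avpkt);
    }
} while (got_packet);
```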

std::vector<unsigned char>  decode(unsigned char* data, unsigned int len)
{
    std::vector<unsigned char> ret;

    AVPacket avpkt;
    av_init_packet(&avpkt);
    avpkt.data = data;
    avpkt.size = len;

    AVFrame* pframe = av_frame_alloc();
    while (avpkt.size > 0) {
        av_frame_unref(pframe);
        int got_frame = 0;
        int used = avcodec_decode_audio4(cntx_, pframe, &got_frame, &avpkt);
        if (used < 0)
            break;                          // decode errors come via the return value
        if (got_frame) {
            // linesize[0] can include alignment padding, which adds noise;
            // copy only the real sample bytes.
            int data_size = av_samples_get_buffer_size(NULL, 1,
                pframe->nb_samples, (AVSampleFormat)pframe->format, 1);
            ret.insert(ret.end(), pframe->data[0], pframe->data[0] + data_size);
        }
        avpkt.data += used;
        avpkt.size -= used;
        avpkt.dts = avpkt.pts = AV_NOPTS_VALUE;
    }
    av_frame_free(&pframe);
    return ret;
}

Suppose it's the 100th call to encode(data, len): that "frame" only appears around the 150th or later decode call, and the latency is not acceptable. It seems the LAME encoder keeps sample data for later use, which is not what I want. I don't know what is going wrong. Thank you for any information.

Today I debugged the code again; here are some details:

Encode: each PCM block is 23040 bytes, ten times the MP3 frame size (2304 bytes), yet each call to encode() outputs only 9 frames. Decoding that output yields 20736 bytes; one frame (2304 bytes) is lost, and the sound is noisy.

If MP3 or MP2 encoding is not suitable for real-time voice transfer, which encoder should I choose?

1 Answer

Suppose it's the 100th call to encode(data, len): that "frame" only appears around the 150th or later decode call, and the latency is not acceptable.

Understand how the codec works and adjust your expectations accordingly.

MP3 is a lossy codec. It works by converting your time-domain PCM data to the frequency domain. This conversion alone takes time, because frequency components do not exist at any single instant; they can only exist over a period of time. At a simple level, the encoder then uses a handful of algorithms to decide what spectral information to keep and what to throw away. Each MP3 frame is hundreds of samples long: 576 samples is as low as you can typically go, and twice that (1152) is the usual number.

That sets the minimum time needed to create a frame. On top of it, MP3 uses what is called a bit reservoir: if a complex passage requires more bandwidth, the encoder borrows unused bandwidth from neighboring frames. Facilitating this requires buffering many frames.
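FFmpeg's libmp3lame wrapper exposes a private option for this; a hedged sketch of turning the reservoir off via the options dictionary passed to avcodec_open2() (the "reservoir" option name is specific to the libmp3lame encoder, so verify it exists in your build, e.g. with `ffmpeg -h encoder=libmp3lame`):

```cpp
// Sketch: disable the bit reservoir before opening the encoder.
// Reduces cross-frame dependence at the cost of quality.
AVDictionary* opts = NULL;
av_dict_set(&opts, "reservoir", "0", 0);
err_ = avcodec_open2(cntx_, codec_, &opts);
av_dict_free(&opts);
```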

On top of all the codec work, FFmpeg itself has buffering (for detection of input and what not), and there are buffers in your pipes to and from FFmpeg. I would imagine the codec itself may also employ general buffering on the input and output.

Finally, you're decoding the stream and playing it back, which means most of the same kinds of buffers used for encoding are now used for decoding. And we haven't even talked about the several hundred milliseconds of latency involved in getting the audio data through a sound card and out the analog path to your speaker.

You have an unrealistic expectation. While it is possible to tweak some things to reduce latency (such as disabling the bit reservoir), doing so yields a poorer-quality stream and still won't be truly low-latency.