
I have a strange problem in my C/C++ FFmpeg transcoder, which takes an input MP4 (varying input codecs) and produces an output MP4 (x264 baseline & AAC LC @ 44100 sample rate with libfdk_aac):

The resulting MP4 has fine video (x264), and the audio (AAC LC) sounds fine as well, but it only plays until exactly the halfway point of the video.

The audio is not slowed down, not stretched and doesn't stutter. It just stops right in the middle of the video.

One hint may be that the input file has a sample rate of 22050, and 22050/44100 is 0.5, but I really don't get why this would make the sound just stop after half the time. I'd expect such an error to lead to the sound playing at the wrong speed. Everything works just fine if I don't try to enforce 44100 and instead just use the incoming sample_rate.

Another guess would be that the pts calculation doesn't work. But the audio sounds just fine (until it stops) and I do exactly the same for the video part, where it works flawlessly. "Exactly", as in the same code, but "audio"-variables replaced with "video"-variables.

FFmpeg reports no errors during the whole process. I also flush the decoders/encoders/interleaved writing after all the packet reading from the input is done. It works well for the video, so I doubt there is much wrong with my general approach.

Here are the relevant parts of my code (stripped of error handling & other class stuff):

AudioCodecContext Setup

outContext->_audioCodec = avcodec_find_encoder(outContext->_audioTargetCodecID);
outContext->_audioStream = 
        avformat_new_stream(outContext->_formatContext, outContext->_audioCodec);
outContext->_audioCodecContext = outContext->_audioStream->codec;
outContext->_audioCodecContext->channels = 2;
outContext->_audioCodecContext->channel_layout = av_get_default_channel_layout(2);
outContext->_audioCodecContext->sample_rate = 44100;
outContext->_audioCodecContext->sample_fmt = outContext->_audioCodec->sample_fmts[0];
outContext->_audioCodecContext->bit_rate = 128000;
outContext->_audioCodecContext->strict_std_compliance = FF_COMPLIANCE_EXPERIMENTAL;
outContext->_audioCodecContext->time_base = 
        (AVRational){1, outContext->_audioCodecContext->sample_rate};
outContext->_audioStream->time_base = (AVRational){1, outContext->_audioCodecContext->sample_rate};
int retVal = avcodec_open2(outContext->_audioCodecContext, outContext->_audioCodec, NULL);

Resampler Setup

outContext->_audioResamplerContext = 
        swr_alloc_set_opts( NULL, outContext->_audioCodecContext->channel_layout,
                            outContext->_audioCodecContext->sample_fmt,
                            outContext->_audioCodecContext->sample_rate,
                            _inputContext._audioCodecContext->channel_layout,
                            _inputContext._audioCodecContext->sample_fmt,
                            _inputContext._audioCodecContext->sample_rate,
                            0, NULL);
int retVal = swr_init(outContext->_audioResamplerContext);

Decoding

decodedBytes = avcodec_decode_audio4(   _inputContext._audioCodecContext, 
                                        _inputContext._audioTempFrame, 
                                        &p_gotAudioFrame, &_inputContext._currentPacket);

Converting (only if decoding produced a frame, of course)

int retVal = swr_convert(   outContext->_audioResamplerContext, 
                            outContext->_audioConvertedFrame->data, 
                            outContext->_audioConvertedFrame->nb_samples, 
                            (const uint8_t**)_inputContext._audioTempFrame->data, 
                            _inputContext._audioTempFrame->nb_samples);

Encoding (only if decoding produced a frame, of course)

outContext->_audioConvertedFrame->pts = 
        av_frame_get_best_effort_timestamp(_inputContext._audioTempFrame);

// Init the new packet
av_init_packet(&outContext->_audioPacket);
outContext->_audioPacket.data = NULL;
outContext->_audioPacket.size = 0;

// Encode
int retVal = avcodec_encode_audio2( outContext->_audioCodecContext, 
                                    &outContext->_audioPacket, 
                                    outContext->_audioConvertedFrame,
                                    &p_gotPacket);


// Set pts/dts time stamps for writing interleaved
av_packet_rescale_ts(   &outContext->_audioPacket, 
                        outContext->_audioCodecContext->time_base,
                        outContext->_audioStream->time_base);
outContext->_audioPacket.stream_index = outContext->_audioStream->index;

Writing (only if encoding produced a packet, of course)

int retVal = av_interleaved_write_frame(outContext->_formatContext, &outContext->_audioPacket);

I am quite out of ideas about what would cause such a behaviour.


1 Answer


So, I finally managed to figure things out myself.

The problem was indeed the difference in sample_rate. You'd assume that a single call to swr_convert(), invoked the way I did, would give you all the samples you need for converting the audio frame. Of course, that would be too easy.

Instead, you may need to call swr_convert() multiple times per frame and buffer its output. Then you grab a single frame's worth of samples from the buffer, and that is what you encode.

Here is my new convertAudioFrame function:

// Calculate number of output samples
int numOutputSamples = av_rescale_rnd(  
    swr_get_delay(outContext->_audioResamplerContext, _inputContext._audioCodecContext->sample_rate) 
    + _inputContext._audioTempFrame->nb_samples, 
    outContext->_audioCodecContext->sample_rate, 
    _inputContext._audioCodecContext->sample_rate, 
    AV_ROUND_UP);
if (numOutputSamples == 0) 
{
    return;
}

uint8_t* tempSamples;
av_samples_alloc(   &tempSamples, NULL, 
                    outContext->_audioCodecContext->channels, numOutputSamples,
                    outContext->_audioCodecContext->sample_fmt, 0);

int retVal = swr_convert(   outContext->_audioResamplerContext, 
                            &tempSamples, 
                            numOutputSamples, 
                            (const uint8_t**)_inputContext._audioTempFrame->data, 
                            _inputContext._audioTempFrame->nb_samples);

// Write to audio fifo
if (retVal > 0)
{
    retVal = av_audio_fifo_write(outContext->_audioFifo, (void**)&tempSamples, retVal);
}
av_freep(&tempSamples);

// Get a frame from audio fifo
int samplesAvailable = av_audio_fifo_size(outContext->_audioFifo);
if (samplesAvailable > 0)
{
    retVal = av_audio_fifo_read(outContext->_audioFifo, 
                                (void**)outContext->_audioConvertedFrame->data,
                                outContext->_audioCodecContext->frame_size);

    // We got a frame, so also set its pts
    if (retVal > 0)
    {
        p_gotConvertedFrame = 1;

        if (_inputContext._audioTempFrame->pts != AV_NOPTS_VALUE)
        {
            outContext->_audioConvertedFrame->pts = _inputContext._audioTempFrame->pts;
        }
        else if (_inputContext._audioTempFrame->pkt_pts != AV_NOPTS_VALUE)
        {
            outContext->_audioConvertedFrame->pts = _inputContext._audioTempFrame->pkt_pts;
        }
    }
}

I basically call this function until there are no more frames left in the audio FIFO buffer.

So the audio was only half as long because I encoded only as many frames as I decoded, when I actually needed to encode twice as many frames due to the doubled sample_rate.