MediaCodec - How to concatenate two mp4 files' audio streams into a single unified format and mux them back

Question

I've succeeded in concatenating the video streams of multiple video files using MediaCodec, with one MediaExtractor and one decoder MediaCodec per file. Now my question is about concatenating those videos' audio streams.

Using a modified ExtractDecodeEditEncodeMux test, I tried the same approach for the audio streams that I used for the video streams, making sure that the final audio encoder has a single preset format:

private void audioExtractorLoop(MediaExtractor localAudioExtractor, MediaCodec destinationAudioDecoder, ByteBuffer[] dstAudioDecoderInputBuffers)
{
    //Audio Extractor code begin
    boolean localAudioExtractorIsOriginal = (localAudioExtractor == audioExtractor);
    boolean localDone = localAudioExtractorIsOriginal ? audioExtractorDone : audioExtractorAppendDone;
    Log.i("local_audio_extractor", localAudioExtractorIsOriginal+" "+localDone);

    while (mCopyAudio && !localDone && (encoderOutputAudioFormat == null || muxing)) {
        int decoderInputBufferIndex = destinationAudioDecoder.dequeueInputBuffer(TIMEOUT_USEC);
        if (decoderInputBufferIndex == MediaCodec.INFO_TRY_AGAIN_LATER) {
            if (VERBOSE)
                Log.d(TAG, "no audio decoder input buffer");
            break;
        }
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: returned input buffer: "
                    + decoderInputBufferIndex);
        }
        ByteBuffer decoderInputBuffer = dstAudioDecoderInputBuffers[decoderInputBufferIndex];
        int size = localAudioExtractor.readSampleData(decoderInputBuffer, 0);
        long presentationTime = localAudioExtractor.getSampleTime();
        if(localAudioExtractorIsOriginal)currentFrameTimestamp = presentationTime;
        if (VERBOSE) {
            Log.d(TAG, "audio extractor: returned buffer of size "
                    + size);
            Log.d(TAG, "audio extractor: returned buffer for time "
                    + presentationTime);
        }
        if (size >= 0) {
            destinationAudioDecoder.queueInputBuffer(decoderInputBufferIndex, 0,
                    size, presentationTime,
                    localAudioExtractor.getSampleFlags());
        }
        localDone = !localAudioExtractor.advance();
        if (localDone) {
            if (VERBOSE)
                Log.d(TAG, "audio extractor: EOS");
            if(localAudioExtractorIsOriginal) {
                initAudioExtractorFinalTimestamp = currentFrameTimestamp;
                audioExtractorDone = true;
            }
            destinationAudioDecoder.queueInputBuffer(decoderInputBufferIndex, 0,
                    0, 0, MediaCodec.BUFFER_FLAG_END_OF_STREAM);
        }
        audioExtractedFrameCount++;
        break;
    }
    //Audio Extractor code end
}

private void localizedAudioDecoderLoop(MediaCodec localAudioDecoder)
{
    boolean localAudioDecoderIsOriginal = (localAudioDecoder == audioDecoder);
    boolean localDone = localAudioDecoderIsOriginal ? audioDecoderDone : audioDecoderAppendDone;

    Log.i("local_audio_decoder", localAudioDecoderIsOriginal+"");
    ByteBuffer[] localDecoderOutByteBufArray = localAudioDecoderIsOriginal ? audioDecoderOutputBuffers : audioDecoderAppendOutputBuffers;
    MediaCodec.BufferInfo localDecoderBufInfo = localAudioDecoderIsOriginal ? audioDecoderOutputBufferInfo : audioDecoderAppendOutputBufferInfo;
    while (mCopyAudio && !localDone && pendingAudioDecoderOutputBufferIndex == -1 && (encoderOutputAudioFormat == null || muxing)) {
        int decoderOutputBufferIndex = localAudioDecoder.dequeueOutputBuffer(localDecoderBufInfo, TIMEOUT_USEC);
        if(!localAudioDecoderIsOriginal)localDecoderBufInfo.presentationTimeUs += initAudioExtractorFinalTimestamp+33333;
        //Log.i("decoder_out_buf_info", audioDecoderOutputBufferInfo.size + " " + audioDecoderOutputBufferInfo.offset);
        if (decoderOutputBufferIndex == MediaCodec.INFO_TRY_AGAIN_LATER) {
            if (VERBOSE)
                Log.d(TAG, "no audio decoder output buffer");
            break;
        }
        if (decoderOutputBufferIndex == MediaCodec.INFO_OUTPUT_BUFFERS_CHANGED) {
            if (VERBOSE)
                Log.d(TAG, "audio decoder: output buffers changed");
            //audioDecoderOutputBuffers = audioDecoder.getOutputBuffers();
            // Refresh from the decoder that was passed in, not always the first one.
            localDecoderOutByteBufArray = localAudioDecoder.getOutputBuffers();
            break;
        }
        if (decoderOutputBufferIndex == MediaCodec.INFO_OUTPUT_FORMAT_CHANGED) {
            decoderOutputAudioFormat = localAudioDecoder.getOutputFormat();
            decoderOutputChannelNum = decoderOutputAudioFormat.getInteger(MediaFormat.KEY_CHANNEL_COUNT);
            decoderOutputAudioSampleRate = decoderOutputAudioFormat.getInteger(MediaFormat.KEY_SAMPLE_RATE);
            if (VERBOSE) {
                Log.d(TAG, "audio decoder: output format changed: "
                        + decoderOutputAudioFormat);
            }
            break;
        }
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: returned output buffer: "
                    + decoderOutputBufferIndex);
        }
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: returned buffer of size "
                    + localDecoderBufInfo.size);
        }
        ByteBuffer decoderOutputBuffer = localDecoderOutByteBufArray[decoderOutputBufferIndex];
        if ((localDecoderBufInfo.flags & MediaCodec.BUFFER_FLAG_CODEC_CONFIG) != 0) {
            if (VERBOSE)
                Log.d(TAG, "audio decoder: codec config buffer");
            localAudioDecoder.releaseOutputBuffer(decoderOutputBufferIndex,
                    false);
            break;
        }
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: returned buffer for time "
                    + localDecoderBufInfo.presentationTimeUs);
        }
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: output buffer is now pending: "
                    + pendingAudioDecoderOutputBufferIndex);
        }
        pendingAudioDecoderOutputBufferIndex = decoderOutputBufferIndex;
        audioDecodedFrameCount++;
        break;
    }

    while (mCopyAudio && pendingAudioDecoderOutputBufferIndex != -1) {
        if (VERBOSE) {
            Log.d(TAG,
                    "audio decoder: attempting to process pending buffer: "
                            + pendingAudioDecoderOutputBufferIndex);
        }
        int encoderInputBufferIndex = audioEncoder
                .dequeueInputBuffer(TIMEOUT_USEC);
        if (encoderInputBufferIndex == MediaCodec.INFO_TRY_AGAIN_LATER) {
            if (VERBOSE)
                Log.d(TAG, "no audio encoder input buffer");
            break;
        }
        if (VERBOSE) {
            Log.d(TAG, "audio encoder: returned input buffer: "
                    + encoderInputBufferIndex);
        }
        ByteBuffer encoderInputBuffer = audioEncoderInputBuffers[encoderInputBufferIndex];
        int size = localDecoderBufInfo.size;
        long presentationTime = localDecoderBufInfo.presentationTimeUs;
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: processing pending buffer: "
                    + pendingAudioDecoderOutputBufferIndex);
        }
        if (VERBOSE) {
            Log.d(TAG, "audio decoder: pending buffer of size " + size);
            Log.d(TAG, "audio decoder: pending buffer for time "
                    + presentationTime);
        }
        if (size >= 0) {
            ByteBuffer decoderOutputBuffer = localDecoderOutByteBufArray[pendingAudioDecoderOutputBufferIndex]
                    .duplicate();

            byte[] testBufferContents = new byte[size];
            //int bufferSize = (extractorInputChannelNum == 1 && decoderOutputChannelNum == 2) ? size / 2 : size;
            float samplingFactor = (decoderOutputChannelNum/extractorInputChannelNum) * (decoderOutputAudioSampleRate / extractorAudioSampleRate);
            int bufferSize = size / (int)samplingFactor;
            Log.i("sampling_factor", samplingFactor+" "+bufferSize);

            if (decoderOutputBuffer.remaining() < size) {
                for (int i = decoderOutputBuffer.remaining(); i < size; i++) {
                    testBufferContents[i] = 0;  // pad with extra 0s to make a full frame.
                }
                decoderOutputBuffer.get(testBufferContents, 0, decoderOutputBuffer.remaining());
            } else {
                decoderOutputBuffer.get(testBufferContents, 0, size);
            }

            //WARNING: This works for 11025-22050-44100 or 8000-16000-24000-48000
            //What about in-between?
            //BTW, the size of the bytebuffer may be less than 4096 depending on the sampling factor
            //(Now that I think about it I should've realized this back when I decoded the video result from the encoding - 2048 bytes decoded)
            if (((int)samplingFactor) > 1) {
                Log.i("s2m_conversion", "Stereo to Mono and/or downsampling");
                byte[] finalByteBufferContent = new byte[size / 2];

                for (int i = 0; i < bufferSize; i+=2) {
                    if((i+1)*((int)samplingFactor) > testBufferContents.length)
                    {
                        finalByteBufferContent[i] = 0;
                        finalByteBufferContent[i+1] = 0;
                    }
                    else
                    {
                        finalByteBufferContent[i] = testBufferContents[i*((int)samplingFactor)];
                        finalByteBufferContent[i+1] = testBufferContents[i*((int)samplingFactor) + 1];
                    }
                }

                decoderOutputBuffer = ByteBuffer.wrap(finalByteBufferContent);
            }

            decoderOutputBuffer.position(localDecoderBufInfo.offset);
            decoderOutputBuffer.limit(localDecoderBufInfo.offset + bufferSize);
            //decoderOutputBuffer.limit(audioDecoderOutputBufferInfo.offset + size);
            encoderInputBuffer.position(0);

            Log.d(TAG, "hans, audioDecoderOutputBufferInfo:" + localDecoderBufInfo.offset);
            Log.d(TAG, "hans, decoderOutputBuffer:" + decoderOutputBuffer.remaining());
            Log.d(TAG, "hans, encoderinputbuffer:" + encoderInputBuffer.remaining());
            encoderInputBuffer.put(decoderOutputBuffer);

            audioEncoder.queueInputBuffer(encoderInputBufferIndex, 0, bufferSize, presentationTime, localDecoderBufInfo.flags);
            //audioEncoder.queueInputBuffer(encoderInputBufferIndex, 0, size, presentationTime, audioDecoderOutputBufferInfo.flags);
        }
        // Release on the decoder that produced the buffer, not always the first one.
        localAudioDecoder.releaseOutputBuffer(
                pendingAudioDecoderOutputBufferIndex, false);
        pendingAudioDecoderOutputBufferIndex = -1;
        if ((localDecoderBufInfo.flags & MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0) {
            if (VERBOSE)
                Log.d(TAG, "audio decoder: EOS");
            if(localDecoderBufInfo == audioDecoderOutputBufferInfo){audioDecoderDone = true;}
            else{audioDecoderAppendDone = true;}
        }
        break;
    }
}

I pass the MediaExtractor and decoder MediaCodec for the first audio stream into these functions and loop until they reach EOS; then I swap them for the MediaExtractor and decoder MediaCodec of the second audio stream.
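
The offset applied to the second stream's timestamps in the code above (the first stream's last timestamp plus a fixed 33333 µs) can be isolated into a small helper. This is a minimal sketch, not part of the original code: TimestampShifter and its method names are hypothetical, and using one AAC frame (1024 samples) as the gap is an assumption that should be checked against the actual stream format.

```java
/** Hypothetical helper: rebases the timestamps of an appended stream. */
final class TimestampShifter {
    private final long offsetUs;

    /**
     * @param lastTimestampUs presentation time of the first stream's last sample
     * @param lastDurationUs  assumed duration of that last sample
     */
    TimestampShifter(long lastTimestampUs, long lastDurationUs) {
        this.offsetUs = lastTimestampUs + lastDurationUs;
    }

    /** Returns the timestamp moved past the end of the first stream. */
    long shift(long presentationTimeUs) {
        return presentationTimeUs + offsetUs;
    }

    /** Duration in microseconds of a frame holding samplesPerFrame samples. */
    static long frameDurationUs(int samplesPerFrame, int sampleRate) {
        return samplesPerFrame * 1_000_000L / sampleRate;
    }
}
```

With 1024 samples per frame at 44100 Hz, frameDurationUs gives 23219 µs, which would replace the 33333 µs video-frame gap for audio.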

This code works fine for the first audio stream, but after the swap I get the following errors in the log:

10-11 15:14:59.941 3067-22024/? E/SEC_AAC_DEC: saacd_decode() failed ret_val: -3, Indata 0x 11 90 00 00, length : 683
10-11 15:14:59.941 3067-22024/? E/SEC_AAC_DEC: ASI 0x 11, 90 00 00
10-11 15:14:59.951 29907-22020/com.picmix.mobile E/ACodec: OMXCodec::onEvent, OMX_ErrorStreamCorrupt
10-11 15:14:59.951 29907-22020/com.picmix.mobile W/AHierarchicalStateMachine: Warning message AMessage(what = 'omxI') = {
                                                                            int32_t type = 0
                                                                            int32_t event = 1
                                                                            int32_t data1 = -2147479541
                                                                            int32_t data2 = 0
                                                                          } unhandled in root state.

I thought the decoders would just end up decoding all audio streams to audio/raw type with 44100 Hz sample rate and 2 channels, so the encoder can just take the data and encode to a final format.
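
That assumption may not hold: a decoder typically emits PCM at the sample rate and channel count reported by getOutputFormat(), which follow the source stream, so inputs that differ need converting before they reach a single encoder. As a hedged illustration only, the sketch below averages 16-bit stereo frames into mono rather than skipping bytes. PcmDownmix is a hypothetical name, the input is assumed to be interleaved 16-bit PCM in the given byte order, and no resampling is done.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: downmixes interleaved 16-bit stereo PCM to mono. */
final class PcmDownmix {
    /** Averages each (left, right) pair of samples into one mono sample. */
    static byte[] stereoToMono(byte[] stereo, ByteOrder order) {
        ByteBuffer in = ByteBuffer.wrap(stereo).order(order);
        int frames = stereo.length / 4; // 2 channels * 2 bytes per sample
        ByteBuffer out = ByteBuffer.allocate(frames * 2).order(order);
        for (int i = 0; i < frames; i++) {
            int left = in.getShort();
            int right = in.getShort();
            out.putShort((short) ((left + right) / 2));
        }
        return out.array();
    }
}
```

On a device the byte order would presumably be ByteOrder.nativeOrder(); trailing bytes that do not form a full stereo frame are ignored.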

What extra considerations do I have to take into account for audio, and how can I prevent the audio stream from being corrupted when I swap Extractor-Decoder pairs?

EDIT:

I added these lines to check the contents of the extracted samples in MediaExtractor:

ByteBuffer decoderInputBuffer = dstAudioDecoderInputBuffers[decoderInputBufferIndex];
int size = localAudioExtractor.readSampleData(decoderInputBuffer, 0);
long presentationTime = localAudioExtractor.getSampleTime();
//new lines begin
byte[] debugBytes = new byte[decoderInputBuffer.remaining()];
decoderInputBuffer.duplicate().get(debugBytes);
Log.i(TAG, "DEBUG - extracted frame: "+ audioExtractedFrameCount +" | bytebuffer contents: "+new String(debugBytes));
//new lines end

The line decoderInputBuffer.duplicate().get(debugBytes); throws IllegalStateException: buffer is inaccessible.

Does this mean I set up the extractor wrong?

EDIT 2:

Looking into it further, this only happens with the appending audio extractor, not with the first audio extractor.

Gensoukyou1337 · Accepted Answer · 2016-10-13T10:52:15

Turns out it was something completely stupid. Earlier in the code when I was setting up the decoder buffers, I did this:

audioDecoderInputBuffers = audioDecoder.getInputBuffers();
audioDecoderOutputBuffers = audioDecoder.getOutputBuffers();
audioDecoderAppendInputBuffers = audioDecoder.getInputBuffers();
audioDecoderAppendOutputBuffers = audioDecoder.getOutputBuffers();

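All four arrays were fetched from the same decoder instance, audioDecoder. The two "append" arrays should have come from the decoder created for the second audio stream. As written, buffer indices dequeued from that second decoder were used to index the first decoder's buffers, which is consistent with the corrupt-stream and "buffer is inaccessible" errors above.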
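
One way to guard against this kind of mix-up is to fetch the buffer arrays in the same place the decoder is stored, so they cannot be taken from a different instance. The sketch below is only an illustration: CodecBuffers is a hypothetical class, and it is generic over the codec type so that it compiles without the Android SDK.

```java
import java.nio.ByteBuffer;
import java.util.function.Function;

/** Hypothetical wrapper: keeps one codec together with the buffers fetched from it. */
final class CodecBuffers<C> {
    final C codec;
    final ByteBuffer[] inputBuffers;
    final ByteBuffer[] outputBuffers;

    /** Both arrays are read from the codec passed in here and from no other. */
    CodecBuffers(C codec,
                 Function<C, ByteBuffer[]> inputs,
                 Function<C, ByteBuffer[]> outputs) {
        this.codec = codec;
        this.inputBuffers = inputs.apply(codec);
        this.outputBuffers = outputs.apply(codec);
    }
}
```

On Android this would be constructed as new CodecBuffers<>(audioDecoder, MediaCodec::getInputBuffers, MediaCodec::getOutputBuffers), and once more for the second decoder. That decoder's variable name is not shown in the question, so something like audioDecoderAppend is assumed.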
