
To decode an H.264 stream with the Windows Media Foundation (WMF) H.264 decoder MFT, the workflow is currently something like this:

IMFSample* sample; // created earlier, e.g. with MFCreateSample()
// IMFSample times and durations are expressed in 100-nanosecond units.
sample->SetSampleTime(time_in_hns);
sample->SetSampleDuration(duration_in_hns);
sample->AddBuffer(buffer);

// Feed the IMFSample to the decoder.
mDecoder->ProcessInput(0, sample, 0);

// Get output from the decoder.
/* create outputsample that will receive the decoded content */ { ... }
MFT_OUTPUT_DATA_BUFFER output = {0};
output.pSample = outputsample;
DWORD status = 0;
HRESULT hr = mDecoder->ProcessOutput(0, 1, &output, &status);
if (output.pEvents) {
  // We must release this, as per the IMFTransform::ProcessOutput()
  // MSDN documentation.
  output.pEvents->Release();
  output.pEvents = nullptr;
}

if (hr == MF_E_TRANSFORM_STREAM_CHANGE) {
  // Output type change, probably a geometric aperture change.
  // Reconfigure the decoder's output media type (pick one of the types
  // offered by GetOutputAvailableType() and pass it to SetOutputType()),
  // then call ProcessOutput() again.
} else if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT) {
  // Not enough input to produce output.
} else if (!output.pSample) {
  return S_OK;
} else {
  // Process the output sample.
}
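The stream-change branch above has to renegotiate the output type and then retry the output call. Below is a minimal, portable sketch of just that retry logic; the `Hr` enum and the two callbacks are stand-ins for the real HRESULT codes and the `ProcessOutput()` / `GetOutputAvailableType()`+`SetOutputType()` calls, so it builds without the Windows SDK:

```cpp
#include <functional>

// Hr models the HRESULTs that matter for this control flow.
enum class Hr { Ok, StreamChange, NeedMoreInput, Failed };

// tryOutput stands in for IMFTransform::ProcessOutput; renegotiate stands in
// for the GetOutputAvailableType()/SetOutputType() sequence. Returns Ok once
// an output sample was produced, NeedMoreInput if the decoder wants more data.
inline Hr GetOutput(const std::function<Hr()>& tryOutput,
                    const std::function<bool()>& renegotiate) {
  for (;;) {
    Hr hr = tryOutput();
    if (hr != Hr::StreamChange) return hr;
    if (!renegotiate()) return Hr::Failed;  // couldn't set a new output type
  }
}
```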

When we have fed all data to the MFT decoder, we must drain it:

mDecoder->ProcessMessage(MFT_MESSAGE_COMMAND_DRAIN, 0);
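Note that the drain message by itself doesn't deliver the frames: after MFT_MESSAGE_COMMAND_DRAIN you still have to call ProcessOutput() repeatedly until it returns MF_E_TRANSFORM_NEED_MORE_INPUT. A minimal, portable sketch of that loop (the `Hr` enum and the callbacks are stand-ins for the real HRESULTs and IMFTransform calls, so it builds without the Windows SDK):

```cpp
#include <functional>

enum class Hr { Ok, NeedMoreInput, Failed };

// processOutput stands in for IMFTransform::ProcessOutput after the drain
// message; onFrame receives each drained sample. Returns the number of
// frames retrieved; NeedMoreInput signals the decoder is fully drained.
inline int DrainAll(const std::function<Hr()>& processOutput,
                    const std::function<void()>& onFrame) {
  int frames = 0;
  for (;;) {
    Hr hr = processOutput();
    if (hr == Hr::NeedMoreInput) break;  // fully drained
    if (hr != Hr::Ok) break;             // real code should surface the error
    onFrame();
    ++frames;
  }
  return frames;
}
```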

One quirk of the WMF H.264 decoder is that it typically will not output anything until it has been fed over 30 compressed H.264 frames, regardless of the size of the H.264 sliding window. Its latency is very high...

I'm encountering a very troublesome issue. With a video made only of keyframes, containing just 15 frames, each 2 s long, and whose first frame has a non-zero presentation time (the stream comes from live content, so the first frame's timestamp is typically epoch-based), nothing comes out of the decoder without draining it, as it hasn't received enough frames.

However, once the decoder is drained, the decoded frames do come out. The problem is that the MFT decoder has set all durations to only 33.6 ms, and the presentation time of the first output sample is always 0: the original durations and presentation times have been lost.

If you provide over 30 frames to the H.264 decoder, then both the durations and presentation times are valid...

I haven't yet found a way to get the WMF decoder to output samples with the proper values. It appears that if you have to drain the decoder before it has output any samples by itself, the timing information is completely broken...
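One workaround worth trying, since the MFT discards the timing you set on input: cache each input sample's time and duration yourself and re-stamp the output samples, ignoring what the decoder wrote. `TimestampFifo` below is a hypothetical helper, not a WMF API, and the sketch assumes the decoder emits one output per input, in order, which holds for an all-keyframe stream like this one:

```cpp
#include <cstdint>
#include <deque>
#include <utility>

// Caches (presentation time, duration) pairs, in 100-ns units, as each
// input sample is fed to the decoder, and hands them back in the same
// order as output samples are produced.
class TimestampFifo {
 public:
  void PushInput(int64_t pts, int64_t duration) {
    pending_.emplace_back(pts, duration);
  }
  // Returns the timing to stamp onto the next output sample,
  // e.g. via IMFSample::SetSampleTime() / SetSampleDuration().
  bool PopOutput(int64_t* pts, int64_t* duration) {
    if (pending_.empty()) return false;
    *pts = pending_.front().first;
    *duration = pending_.front().second;
    pending_.pop_front();
    return true;
  }

 private:
  std::deque<std::pair<int64_t, int64_t>> pending_;
};
```

With B-frames, decode order differs from presentation order, so the cached times would need to be sorted by presentation time rather than used strictly FIFO.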

Has anyone experienced such problems? How did you get around it?

Thank you in advance

Edit: a sample of the video is available at http://people.mozilla.org/~jyavenard/mediatest/fragmented/1301869.mp4 Playing this video with Firefox causes it to play extremely quickly due to the problems described above.

You should try CODECAPI_AVLowLatencyMode [1, 2]. - Roman R.
Low latency mode is only available from Windows 8 onward. It's also incompatible with anything containing B-frames, and I got plenty of driver crashes when it was enabled... Low latency would help here, though I'd still have the same problem if I had < 8 frames. - jyavenard
I am afraid everything you mentioned is behavior by design. Output frames are delivered asynchronously (especially in DXVA mode), the Windows 8+ low latency mode only slightly improves things, and timings change when the H.264 bitstream embeds its own timing data. Driver crashes need driver updates, of course. A software-only decoder might have more predictable behavior, I suppose. - Roman R.
So the MFT returning the wrong timestamp and the wrong duration is by design? I find that hard to believe :) - jyavenard
Not necessarily wrong if you think of the timing information in the H.264 bitstream. It might just be preferred over your timestamps. - Roman R.

1 Answer


I'm not sure that your workflow is correct. I think you should do something like this:

do
{
    ...
    hr = mDecoder->ProcessInput(0, sample, 0);
    if(FAILED(hr))
      break;
    ...
    hr = mDecoder->ProcessOutput(0, 1, &output, &status);
    if(FAILED(hr) && hr != MF_E_TRANSFORM_NEED_MORE_INPUT)
      break;
}
while(hr == MF_E_TRANSFORM_NEED_MORE_INPUT);

if(SUCCEEDED(hr))
{
    // You have a valid decoded frame here
}

The idea is to keep calling ProcessInput/ProcessOutput while ProcessOutput returns MF_E_TRANSFORM_NEED_MORE_INPUT, which means the decoder needs more input. I think that with this loop you won't need to drain the decoder.