DirectShow Audio/Video PTS Clocking Calculation

Question

Greetings,

I have written a DirectShow source filter that takes the AVC video frames and AAC access units from an ATSC-153 broadcast; it runs on a WinCE/ARM video processor. The output pins (two of them, one for video and one for audio) are connected to the appropriate decoders and renderers. Currently, I take the PTS from the appropriate RTP headers, pass it to the source filter, and convert it to the DirectShow clock. The video PTS runs at 90 kHz; the audio PTS rate varies, and my current test stream has the audio ticking at 55.2 kHz.

What follows are the convert_to_dshow_timestamp() and FillBuffer() routines. When I print the converted timestamps as the video and audio samples are retrieved by the filter, the two are within 100-200 ms of each other. That would not be too bad, something to work with. However, the video trails the audio by 2-3 seconds.

/* Routine to convert a clock rate to directshow clock rate */
static unsigned long long convert_to_dshow_timestamp(
                                                unsigned long long ts,
                                                unsigned long rate
                                                    )
{
long double hz;
long double multi;
long double tmp;

if (rate == 0)
{
    return 0;
}

hz = (long double) 1.0 / rate;
multi = hz / 1e-7;

tmp = ((long double) ts * multi) + 0.5;
return (unsigned long long) tmp;

}

/* Source filter FillBuffer() routine */
HRESULT OutputPin::FillBuffer(IMediaSample *pSamp)
{
BYTE *pData;
DWORD dataSize;
pipeStream stream;
BOOL retVal;
DWORD returnBytes;
HRESULT hr;
DWORD discont;
REFERENCE_TIME ts;
REFERENCE_TIME df;
unsigned long long difPts;
unsigned long long difTimeRef;

pSamp->GetPointer(&pData);
dataSize = pSamp->GetSize();

ZeroMemory(pData, dataSize);

stream.lBuf = pData;
stream.dataSize = dataSize;

/* Pin type 1 is H.264 AVC video frames */
if (m_iPinType == 1)
{
    retVal = DeviceIoControl(
                                ghMHTune,
                                IOCTL_MHTUNE_RVIDEO_STREAM,
                                NULL,
                                0,
                                &stream,
                                sizeof(pipeStream),
                                &returnBytes,
                                NULL
                            );
    if (retVal == TRUE)
    {
        /* Get the data */
        /* Check for the first of the stream, if so, set the start time */
        pSamp->SetActualDataLength(returnBytes);
        hr = S_OK;
        if (returnBytes > 0)
        {
            /* The discontinuity is set in upper layers, when an RTP
             * sequence number has been lost.
             */
            discont = stream.discont;

            /* Check for another break in stream time */
            if (
                m_PrevTimeRef &&
                ((m_PrevTimeRef > (stream.timeRef + 90000 * 10)) ||
                ((m_PrevTimeRef + 90000 * 10) < stream.timeRef))
               )
            {
                dbg_log(TEXT("MY:DISC HERE\n"));
                if (m_StartStream > 0)
                {
                    discont = 1;
                }
            }

            /* If the stream has not started yet, or there is a
             * discontinuity, then reset the stream time.
             */
            if ((m_StartStream == 0) || (discont != 0))
            {
                sys_time = timeGetTime() - m_ClockStartTime;
                m_OtherSide->sys_time = sys_time;

                /* For video, the clockRate is 90 kHz */
                m_RefGap = (sys_time * (stream.clockRate / 1000)) +
                                                    (stream.clockRate / 2);

                /* timeRef is the PTS for the frame from the RTP header */
                m_TimeGap = stream.timeRef;
                m_StartStream = 1;
                difTimeRef = 1;
                m_PrevPTS = 0;
                m_PrevSysTime = timeGetTime();
                dbg_log(
                        TEXT("MY:StartStream %lld: %lld: %lld\n"),
                        sys_time,
                        m_RefGap,
                        m_TimeGap
                       );
            }
            else
            {
                m_StartStream++;
            }

            difTimeRef = stream.timeRef - m_PrevTimeRef;
            m_PrevTimeRef = stream.timeRef;

            /* Difference in 90 kHz clocking */
            ts = stream.timeRef - m_TimeGap + m_RefGap;
            ts = convert_to_dshow_timestamp(ts, stream.clockRate);

            if (discont != 0)
            {
                dbg_log(TEXT("MY:VDISC TRUE\n"));
                pSamp->SetDiscontinuity(TRUE);
            }
            else
            {
                pSamp->SetDiscontinuity(FALSE);
                pSamp->SetSyncPoint(TRUE);
            }

            difPts = ts - m_PrevPTS;

            df = ts + 1;
            m_PrevPTS = ts;
            dbg_log(
                    TEXT("MY:T %lld: %lld = %lld: %d: %lld\n"),
                    ts,
                    m_OtherSide->m_PrevPTS,
                    stream.timeRef,
                    (timeGetTime() - m_PrevSysTime),
                    difPts
                   );

            pSamp->SetTime(&ts, &df);
            m_PrevSysTime = timeGetTime();
        }
        else
        {
            Sleep(10);
        }
    }
    else
    {
        dbg_log(TEXT("MY:  Fill FAIL\n"));
        hr = E_FAIL;
    }
}
else if (m_iPinType == 2)
{
    /* Pin Type 2 is audio AAC Access units, with ADTS headers */
    retVal = DeviceIoControl(
                                ghMHTune,
                                IOCTL_MHTUNE_RAUDIO_STREAM,
                                NULL,
                                0,
                                &stream,
                                sizeof(pipeStream),
                                &returnBytes,
                                NULL
                            );

    if (retVal == TRUE)
    {
        /* Get the data */
        /* Check for the first of the stream, if so, set the start time */
        hr = S_OK;
        if (returnBytes > 0)
        {
            discont = stream.discont;
            if ((m_StartStream == 0) || (discont != 0))
            {
                sys_time = timeGetTime() - m_ClockStartTime;
                m_RefGap = (sys_time * (stream.clockRate / 1000)) +
                                                    (stream.clockRate / 2);

                /* Mark the first PTS from stream.  This PTS is from the
                 * RTP header, and is usually clocked differently than the
                 * video clock.
                 */
                m_TimeGap = stream.timeRef;
                m_StartStream = 1;
                difTimeRef = 1;
                m_PrevPTS = 0;
                m_PrevSysTime = timeGetTime();
                dbg_log(
                        TEXT("MY:AStartStream %lld: %lld: %lld\n"),
                        sys_time,
                        m_RefGap,
                        m_TimeGap
                       );
            }

            /* Let the video side stream in first before letting audio
             * start to flow.
             */
            if (m_OtherSide->m_StartStream < 32)
            {
                pSamp->SetActualDataLength(0);
                Sleep(10);
                return hr;
            }
            else
            {
                pSamp->SetActualDataLength(returnBytes);
            }

            difTimeRef = stream.timeRef - m_PrevTimeRef;
            m_PrevTimeRef = stream.timeRef;

            if (discont != 0)
            {
                dbg_log(TEXT("MY:ADISC TRUE\n"));
                pSamp->SetDiscontinuity(TRUE);
            }
            else
            {
                pSamp->SetDiscontinuity(FALSE);
                pSamp->SetSyncPoint(TRUE);
            }

            /* Difference in audio PTS clock, TESTING AT 55.2 kHz */
            ts = stream.timeRef - m_TimeGap + m_RefGap;
            ts = convert_to_dshow_timestamp(ts, stream.clockRate);

            difPts = ts - m_PrevPTS;

            df = ts + 1;
            m_PrevPTS = ts;
            dbg_log(
                    TEXT("MY:AT %lld = %lld: %d: %lld\n"),
                    ts,
                    stream.timeRef,
                    (timeGetTime() - m_PrevSysTime),
                    difPts
                   );

            pSamp->SetTime(&ts, &df);
            m_PrevSysTime = timeGetTime();
        }
        else
        {
            pSamp->SetActualDataLength(0);
            Sleep(10);
        }
    }
}
return hr;

} /* End of code */
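
For comparison, the tick conversion can also be done in integer arithmetic. The following is only a sketch: the names `ticks_to_reftime` and `ms_to_ticks` are hypothetical and not part of the filter above. It assumes the target unit is the 100 ns tick of REFERENCE_TIME, which is what the `1e-7` in `convert_to_dshow_timestamp()` implies. The second helper multiplies before dividing, which matters if `clockRate` is an integer type (the listing does not show its declaration): at 55200 Hz, `clockRate / 1000` would truncate to 55.

```cpp
#include <cstdint>

// Number of 100 ns units in one second (the REFERENCE_TIME resolution).
static const std::uint64_t kRefTimePerSecond = 10000000ULL;

// Convert `ticks` of a clock running at `rate` Hz into 100 ns units,
// rounding the fractional part to the nearest unit. Splitting into whole
// seconds and a remainder keeps the intermediate product small.
static std::uint64_t ticks_to_reftime(std::uint64_t ticks, std::uint32_t rate)
{
    if (rate == 0)
    {
        return 0;
    }
    const std::uint64_t whole = ticks / rate;
    const std::uint64_t rem = ticks % rate;
    return whole * kRefTimePerSecond +
           (rem * kRefTimePerSecond + rate / 2) / rate;
}

// Convert milliseconds into ticks of a clock running at `rate` Hz.
// Multiplying first avoids truncating rates that are not multiples of 1000.
static std::uint64_t ms_to_ticks(std::uint64_t ms, std::uint32_t rate)
{
    return (ms * rate) / 1000;
}
```

With these helpers, one second of ticks at either 90000 Hz or 55200 Hz maps to 10,000,000 units, and 5000 ms at 55200 Hz maps to 276000 ticks, whereas `5000 * (55200 / 1000)` in integer arithmetic gives 275000.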

I have tried adjusting the video PTS by simply adding (90000 * 10) to see whether the video would run far ahead of the audio, but it does not. Video still trails the audio by 2 seconds or more. I really don't understand why this would not work. Each video frame should present 10 seconds ahead. Would this not be correct?

The main question is, basically: are the algorithms sound? They seem to work okay when running the video and audio independently.

The source filter is not a push filter; I am not sure whether this makes a difference. I am not having issues with the decoders getting out of sync with the input from the broadcast.

Many thanks.

Does the trailing happen over time, or does it immediately start with the 2-3 second delay? Also, please reformat the code to be more readable. — BeemerGuy
Sorry about the code format. I'll try and do better next time. — davroslyrad

davroslyrad · Accepted Answer · 2010-12-01T15:11:00

Actually I figured out the problem, of which there were two.

The first was a bad workaround for the H.264 SPS frame. When the decoder started, it would discard every frame until it found the SPS. The stream was encoded at 15 frames per second, so this threw off the timing: the decoder could consume up to a second's worth of video in less than 10 ms. Every frame presented after that was considered late, and the decoder would try to fast-forward through the frames to catch up. Being a live source, it would then run out of frames again. The workaround, placed in the code ahead of mine, made sure there was a buffer of at least 32 frames, which at 15 frames per second is about 2 seconds.

The second problem is the real root cause. I was using the PTSs from the RTP header as the time reference. While this works for audio or video individually, there is no guarantee that the video RTP PTS matches the corresponding audio RTP PTS, and typically it does not. Hence the use of the RTCP NTP time according to the following formula, as per the spec:

PTS = RTCP_SR_NTP_timestamp + (RTP_timestamp - RTCP_SR_RTP_timestamp) / media_clock_rate

This allows me to match the actual video PTS to the corresponding audio PTS.
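
The formula above could be sketched roughly as follows. This is not the code from my filter: the `SenderReport` struct and the function names are hypothetical, and it assumes the result is wanted in 100 ns units on the sender's NTP timeline. The NTP timestamp in a Sender Report is a 32-bit seconds field plus a 32-bit binary fraction, and RTP timestamps are 32 bits wide, so the difference is taken modulo 2^32 and read as signed to cope with wraparound and with packets stamped slightly before the report.

```cpp
#include <cstdint>

// Hypothetical holder for the timing fields of the latest RTCP Sender Report.
struct SenderReport
{
    std::uint32_t ntp_seconds;    // NTP timestamp, integer seconds
    std::uint32_t ntp_fraction;   // NTP timestamp, fraction in units of 2^-32 s
    std::uint32_t rtp_timestamp;  // RTP timestamp for the same instant
};

static const std::int64_t kRefTimePerSecond = 10000000LL;  // 100 ns units

// Convert a 64-bit NTP timestamp (seconds + binary fraction) to 100 ns units.
static std::int64_t ntp_to_reftime(std::uint32_t seconds, std::uint32_t fraction)
{
    const std::uint64_t frac_units =
        (static_cast<std::uint64_t>(fraction) * 10000000ULL) >> 32;
    return static_cast<std::int64_t>(seconds) * kRefTimePerSecond +
           static_cast<std::int64_t>(frac_units);
}

// PTS = SR_NTP + (RTP_timestamp - SR_RTP_timestamp) / media_clock_rate,
// expressed in 100 ns units.
static std::int64_t rtp_to_wallclock(const SenderReport &sr,
                                     std::uint32_t rtp_ts,
                                     std::uint32_t clock_rate)
{
    if (clock_rate == 0)
    {
        return 0;
    }
    // Unsigned subtraction wraps modulo 2^32; reading it as signed gives a
    // small positive or negative offset from the report.
    const std::int32_t delta =
        static_cast<std::int32_t>(rtp_ts - sr.rtp_timestamp);
    return ntp_to_reftime(sr.ntp_seconds, sr.ntp_fraction) +
           static_cast<std::int64_t>(delta) * kRefTimePerSecond /
               static_cast<std::int64_t>(clock_rate);
}
```

Each stream uses its own Sender Report and clock rate, so the video and audio results land on the same timeline. Subtracting one common start value from both would give stream-relative times suitable for SetTime().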
