3
votes

Background: performing benchmarking/comparison across GPGPU platforms.

Problem: Device synchronization when dispatching a DirectX 11 Compute Shader.

Looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) so I can make a fair comparison of how my algorithm performs.

The CUDA and OpenCL functions are clearer about blocking/non-blocking behaviour. DirectCompute, however, is tied more closely to the graphics pipeline (which I am still learning and am very unfamiliar with), so I have trouble finding out whether a Dispatch call is blocking, or whether previously issued memory allocations/transfers have finished.
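
For reference, the CUDA version of the timing loop looks roughly like this (a simplified sketch; the actual kernel launch is elided just like in the code below), where cudaDeviceSynchronize() is what makes the host wait for the GPU before the timer is stopped:

// Setup
...
for (...) {
    startTimer();
    ...
    // Launch my algorithm
    ...
    cudaDeviceSynchronize();   // block the host until all queued GPU work has finished
    times[i] = stopTimer();
}
// Release
...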

Code DX_1:

// Setup
...
for (...) {
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
}
// Release
...

Code DX_2:

for (...) {
    // Setup
    ...
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
    // Release
    ...
}

Results (average times of 2^2 to 2^11 elements):

DX_1  DX_2   CUDA
1.6   205.5  24.8
1.8   133.4  24.8
29.1  186.5  25.6
18.6  175.0  25.6
11.4  187.5  26.6
85.2  127.7  26.3
166.4 151.1  28.1
98.2  149.5  35.2
26.8  203.5  31.6 

Notice: these timings were taken on a desktop GPU with a display attached, so some erratic timings are expected. The times are not supposed to include host-to-device buffer transfers.

Notice 2: These are very short sequences (4 - 2048 elements); the interesting tests are performed on problem sizes of up to 2^26 elements.

2 Answers

1
votes

My new solution is to avoid synchronization with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (Event Record vs. QPC) and the difference is small, a seemingly constant overhead.

CUDA Event  Host QPC
4.6         30.0
4.8         30.0
5.0         31.0
5.2         32.0
5.6         34.0
6.1         34.0
6.9         31.0
8.3         47.0
9.2         34.0
12.0        39.0
16.7        46.0
20.5        55.0
32.1        69.0
48.5        111.0
86.0        134.0
182.4       237.0
419.0       473.0

In case my question brings someone here hoping to find out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.

Code Examples

CUDA

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;   
cudaEventRecord(start);
... 
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);        
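// Clean up the timing events afterwards.
cudaEventDestroy(start);
cudaEventDestroy(stop);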

OpenCL

cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
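// Note: this assumes the command queue was created with the
// CL_QUEUE_PROFILING_ENABLE property; otherwise clGetEventProfilingInfo()
// will not return valid timestamps.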
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
... 
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
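// Profiling timestamps are in nanoseconds; convert the difference to milliseconds.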
timeInMS = (double)(end - start)*(double)(1e-06);    

DirectCompute

Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:

ID3D11Device*               device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
    D3D11_QUERY_DESC desc;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    desc.MiscFlags = 0;
    device->CreateQuery(&desc, &disjoint_query);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &q_start);
    device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
... 
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
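// If q_freq.Disjoint is TRUE, the GPU's timestamp counter became discontinuous
// during the measurement and the result below should be discarded.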
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;

C/C++/OpenMP

#include <windows.h>

static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;

static void __inline startTimer()
{
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
}

static double __inline stopTimer()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    return (double)ElapsedMicroseconds.QuadPart;
}

My code examples are taken out of context; I tried to do some clean-up, but errors might be present.

0
votes

If you're interested in how long a particular Draw or Dispatch is taking on the GPU, then you should take a look at DirectX 11's timestamp queries. You can query the GPU's clock frequency and its current clock value before and after some GPU work, and figure out how long that work took in wall-clock time.
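
In outline, the approach looks something like this (a minimal sketch, assuming an already-created device and context, with error handling omitted and the Dispatch standing in for whatever GPU work you want to time):

D3D11_QUERY_DESC desc = {};
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query* disjoint = nullptr;
device->CreateQuery(&desc, &disjoint);

desc.Query = D3D11_QUERY_TIMESTAMP;
ID3D11Query* tsBegin = nullptr;
ID3D11Query* tsEnd = nullptr;
device->CreateQuery(&desc, &tsBegin);
device->CreateQuery(&desc, &tsEnd);

context->Begin(disjoint);
context->End(tsBegin);                      // GPU timestamp before the work
context->Dispatch(number_of_groups, 1, 1);  // the GPU work being measured
context->End(tsEnd);                        // GPU timestamp after the work
context->End(disjoint);

// Spin until the query results are available, then convert ticks to milliseconds.
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
UINT64 begin = 0, end = 0;
while (context->GetData(disjoint, &disjointData, sizeof(disjointData), 0) != S_OK) {}
while (context->GetData(tsBegin, &begin, sizeof(begin), 0) != S_OK) {}
while (context->GetData(tsEnd, &end, sizeof(end), 0) != S_OK) {}
double ms = 0.0;
if (!disjointData.Disjoint)
    ms = double(end - begin) / double(disjointData.Frequency) * 1000.0;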

This is probably a good primer / example on how to do it:

https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/