I am trying to measure the execution time of a GPU kernel and compare it with the CPU. I wrote a simple_add function that adds the elements of two short int vectors, processing two 16-bit values packed into each 32-bit word. The kernel code is:
kernel void simple_add(global const int* A, global const uint* B, global int* C)
{
// Each 32-bit word packs two 16-bit values.
// Add the high and low 16-bit halves separately so no carry crosses between them.
int AA=A[get_global_id(0)];
int BB=B[get_global_id(0)];
int AH=0xFFFF0000 & AA;
int AL=0x0000FFFF & AA;
int BH=0xFFFF0000 & BB;
int BL=0x0000FFFF & BB;
int CL=(AL+BL)&0x0000FFFF;
int CH=(AH+BH)&0xFFFF0000;
C[get_global_id(0)]=CH|CL;
}
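For reference, the same packed 16-bit addition can be written on the host. This is only a minimal sketch: the names add_packed16 and add_packed16_vec are hypothetical and are not the AddImagesCPU function used in the measurements, and it uses unsigned 32-bit words so that the wrap-around in the high half is well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side counterpart of the kernel body: each 32-bit word
// holds two 16-bit values, and the two halves are added independently.
inline std::uint32_t add_packed16(std::uint32_t a, std::uint32_t b)
{
    // Low halves: the carry out of bit 15 is masked off.
    const std::uint32_t low = ((a & 0x0000FFFFu) + (b & 0x0000FFFFu)) & 0x0000FFFFu;
    // High halves: unsigned arithmetic wraps modulo 2^32, so the carry is dropped.
    const std::uint32_t high = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return high | low;
}

// Applies add_packed16 element by element; all three vectors must have the same size.
inline void add_packed16_vec(const std::vector<std::uint32_t>& a,
                             const std::vector<std::uint32_t>& b,
                             std::vector<std::uint32_t>& c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = add_packed16(a[i], b[i]);
}
```

A host loop like this does very little arithmetic per word read and written, which is worth keeping in mind when comparing it with the kernel.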
I also wrote a CPU version of this function, ran each version 100 times, and measured the total execution time:
clock_t before_GPU = clock();
for(int i=0;i<100;i++)
{
queue.enqueueNDRangeKernel(kernel_add,1,
cl::NDRange((size_t)(NumberOfAllElements/4)),cl::NDRange(64));
queue.finish();
}
clock_t after_GPU = clock();
clock_t before_CPU = clock();
for(int i=0;i<100;i++)
AddImagesCPU(A,B,C);
clock_t after_CPU = clock();
After calling the whole measurement function 10 times, the results were as follows:
CPU time: 1359
GPU time: 1372
----------------
CPU time: 1336
GPU time: 1269
----------------
CPU time: 1436
GPU time: 1255
----------------
CPU time: 1304
GPU time: 1266
----------------
CPU time: 1305
GPU time: 1252
----------------
CPU time: 1313
GPU time: 1255
----------------
CPU time: 1313
GPU time: 1253
----------------
CPU time: 1384
GPU time: 1254
----------------
CPU time: 1300
GPU time: 1254
----------------
CPU time: 1322
GPU time: 1254
----------------
I expected the GPU to be much faster than the CPU, but it was not, and I can't understand why. Is there a problem in my code? Here are my GPU properties:
-----------------------------------------------------
------------- Selected Platform Properties-------------:
NAME: AMD Accelerated Parallel Processing
EXTENSION: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
VENDOR: Advanced Micro Devices, Inc.
VERSION: OpenCL 1.2 AMD-APP (937.2)
PROFILE: FULL_PROFILE
-----------------------------------------------------
------------- Selected Device Properties-------------:
NAME : ATI RV730
TYPE : 4
VENDOR : Advanced Micro Devices, Inc.
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.0 AMD-APP (937.2)
EXTENSIONS : cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
MAX_COMPUTE_UNITS : 8
MAX_WORK_GROUP_SIZE : 128
OPENCL_C_VERSION : OpenCL C 1.0
DRIVER_VERSION: CAL 1.4.1734
==========================================================
For comparison, these are my CPU specifications:
------------- CPU Properties-------------:
NAME : Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
TYPE : 2
VENDOR : GenuineIntel
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.2 AMD-APP (937.2)
MAX_COMPUTE_UNITS : 4
MAX_WORK_GROUP_SIZE : 1024
OPENCL_C_VERSION : OpenCL C 1.2
DRIVER_VERSION: 2.0 (sse2,avx)
==========================================================
I also measured the wall clock time using QueryPerformanceCounter, and here are the results:
CPU time: 1304449.6 micro-sec
GPU time: 1401740.82 micro-sec
----------------------
CPU time: 1620076.55 micro-sec
GPU time: 1310317.64 micro-sec
----------------------
CPU time: 1468520.44 micro-sec
GPU time: 1317153.63 micro-sec
----------------------
CPU time: 1304367.29 micro-sec
GPU time: 1251865.14 micro-sec
----------------------
CPU time: 1301589.17 micro-sec
GPU time: 1252889.4 micro-sec
----------------------
CPU time: 1294750.21 micro-sec
GPU time: 1257017.41 micro-sec
----------------------
CPU time: 1297506.93 micro-sec
GPU time: 1252768.9 micro-sec
----------------------
CPU time: 1293511.29 micro-sec
GPU time: 1252019.88 micro-sec
----------------------
CPU time: 1320753.54 micro-sec
GPU time: 1248895.73 micro-sec
----------------------
CPU time: 1296486.95 micro-sec
GPU time: 1255207.91 micro-sec
----------------------
I then tried OpenCL event profiling to measure the kernel execution time:
queue.enqueueNDRangeKernel(kernel_add,1,
cl::NDRange((size_t)(NumberOfAllElements/4)),
cl::NDRange(64),NULL,&ev);
ev.wait();
queue.finish();
time_start=ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
time_end=ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
The results for a single execution were more or less the same:
CPU time: 13335.1815 micro-sec
GPU time: 11865.111 micro-sec
----------------------
CPU time: 13884.0235 micro-sec
GPU time: 11663.889 micro-sec
----------------------
CPU time: 19724.7296 micro-sec
GPU time: 14548.222 micro-sec
----------------------
CPU time: 19945.3199 micro-sec
GPU time: 15331.111 micro-sec
----------------------
CPU time: 17973.5055 micro-sec
GPU time: 11641.444 micro-sec
----------------------
CPU time: 12652.6683 micro-sec
GPU time: 11632 micro-sec
----------------------
CPU time: 18875.292 micro-sec
GPU time: 14783.111 micro-sec
----------------------
CPU time: 32782.033 micro-sec
GPU time: 11650.444 micro-sec
----------------------
CPU time: 20462.2257 micro-sec
GPU time: 11647.778 micro-sec
----------------------
CPU time: 14529.6618 micro-sec
GPU time: 11860.112 micro-sec
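The values returned for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are device timestamps in nanoseconds, and they are only available when the command queue was created with CL_QUEUE_PROFILING_ENABLE. A minimal sketch of the conversion to microseconds, assuming the two timestamps have already been read as 64-bit integers (the helper name profiling_elapsed_us is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of OpenCL profiling timestamps (nanoseconds) into an
// elapsed time in microseconds. The name and signature are illustrative only.
inline double profiling_elapsed_us(std::uint64_t start_ns, std::uint64_t end_ns)
{
    assert(end_ns >= start_ns);  // END should never precede START for one event
    return static_cast<double>(end_ns - start_ns) / 1000.0;
}
```

This interval covers only the kernel execution on the device; any buffer transfers would have to be measured through their own events.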
clock() measures CPU time instead of wall clock time. It won't count GPU running time. The time you measured is probably taken by OpenCL API calls. Try clock_gettime() in C or std::chrono::steady_clock in C++ instead. You didn't mention the unit of your "CPU time". If it's the raw output of the clock() function (which must be divided by CLOCKS_PER_SEC to get the number of seconds), 1200 is really a short period. – cuihao
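The std::chrono::steady_clock suggestion could look roughly like the following. This is a sketch, not the code used for the numbers above: wall_time_us is a hypothetical helper that runs a callable a given number of times and returns the elapsed wall clock time in microseconds.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `repeats` times and returns the elapsed wall clock time in
// microseconds, measured with the monotonic std::chrono::steady_clock.
template <typename F>
double wall_time_us(F&& work, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

For the GPU measurement, the callable would enqueue the kernel and then call queue.finish(), so that the interval ends only after the device has completed the work.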