# GPU Metrics
A comparison of execution times for array reduction algorithms in CUDA.
This implementation compares CPU array summation against GPU array summation using four approaches (minimal kernel sketches follow the list):

- Naive: naive approach using the `atomicAdd` function.
- Binary Tree: reduces the array by adding pairs of elements in each iteration until a single output remains.
- Non-divergent: the binary tree approach, but with thread addressing adjusted to reduce warp divergence.
- Sequential Addressing: an improvement on the non-divergent solution that compacts the threads' memory accesses to reduce bank conflicts.
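
The exact kernels live in `src/main.cu`; the sketch below is only meant to make the first two descriptions concrete. The kernel names, the element type (`int`), and the launch details are assumptions, not taken from the repository.

```cuda
// Minimal sketch of the first two approaches, not the repository's exact code.

// Naive: every thread folds its element into a single accumulator with
// atomicAdd. *output must be zero-initialized before launch.
__global__ void naiveReduce(const int *input, int *output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(output, input[i]);  // serializes on the shared accumulator
    }
}

// Binary tree (interleaved addressing): pairs of elements are summed each
// iteration until one value per block remains. Assumes a power-of-two block
// size and a dynamic shared memory allocation of blockDim.x * sizeof(int).
__global__ void binaryTreeReduce(const int *input, int *blockSums, int n) {
    extern __shared__ int partial[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    // The stride doubles each step; only threads whose index is a multiple of
    // 2 * stride do any work, which causes the warp divergence discussed in
    // the analysis below.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        blockSums[blockIdx.x] = partial[0];
    }
}
```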
## Setting Up
Working with CUDA requires the NVIDIA CUDA compiler (`nvcc`), which is part of the NVIDIA CUDA Toolkit. The latest version can be downloaded from the NVIDIA website.

As this example uses the `atomicAdd` function, which is only supported from compute capability 2.0 and above, make sure your system has a GPU with compute capability 2.0 or higher.
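
If you are unsure of your GPU's compute capability, a small standalone check like the one below (not part of this repository) can print it via `cudaGetDeviceProperties`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability of every visible CUDA device.
int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Compile it the same way as the main program, e.g. `nvcc query.cu -o query` (the file name is just an example).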
## Compilation
Navigate to the `src` folder and run the following command in a terminal/shell.
Linux:

```
nvcc main.cu -o main
```

Windows:

```
nvcc main.cu -o main.exe
```
Make sure you're using an architecture supported by your device. The tests below were compiled with the following command:

```
nvcc main.cu -o main.exe -gencode arch=compute_50,code=sm_50
```
## Metrics
The tests were conducted on a laptop with the following specifications:
- Intel Core i5-7200U running at 2.5 GHz
- 12 GB RAM
- NVIDIA GeForce 940MX
The table below compares the algorithms. All times are in milliseconds.
n | cpu | naive_gpu | bin_gpu | nondiv_gpu | seqaddr_gpu |
---|---|---|---|---|---|
1024 | 0.002656 | 0.062240 | 0.059392 | 0.055360 | 0.054688 |
2048 | 0.002368 | 0.065376 | 0.139008 | 0.090272 | 0.054432 |
4096 | 0.002432 | 0.105056 | 0.148544 | 0.064352 | 0.061248 |
8192 | 0.002336 | 0.077472 | 0.064896 | 0.059520 | 0.055968 |
16384 | 0.002464 | 0.098208 | 0.131392 | 0.070208 | 0.065792 |
32768 | 0.002432 | 0.081408 | 0.123808 | 0.101344 | 0.101312 |
65536 | 0.002592 | 0.107520 | 0.175456 | 0.178848 | 0.095712 |
131072 | 0.002464 | 0.100000 | 0.280480 | 0.195552 | 0.182496 |
262144 | 0.002496 | 0.132544 | 0.478144 | 0.309344 | 0.279840 |
524288 | 0.002432 | 0.187072 | 0.849248 | 0.526400 | 0.454880 |
1048576 | 0.002432 | 0.288864 | 1.619520 | 0.937248 | 0.835520 |
2097152 | 0.002528 | 0.448800 | 3.177376 | 1.841824 | 1.566624 |
## Analysis
The results show a massive difference between CPU and GPU execution times: the CPU runs in the order of microseconds, while the GPU times vary with array size but follow the same pattern across most test cases. This suggests that the overhead of transferring data between host and device exceeds the kernel execution time itself, which costs performance.
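
One way to verify that claim, which is not done in the original measurements, is to time the host-to-device copy and the kernel separately with CUDA events. The snippet below is a sketch that reuses the hypothetical `naiveReduce` kernel from the earlier sketch; the array size and launch configuration are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumes the naiveReduce kernel sketched earlier is defined in this file.

int main() {
    const int n = 1 << 20;
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int *h_input = new int[n];
    for (int i = 0; i < n; ++i) h_input[i] = 1;

    int *d_input, *d_output;
    cudaMalloc(&d_input, n * sizeof(int));
    cudaMalloc(&d_output, sizeof(int));
    cudaMemset(d_output, 0, sizeof(int));

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start);
    cudaMemcpy(d_input, h_input, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(afterCopy);
    naiveReduce<<<blocks, threadsPerBlock>>>(d_input, d_output, n);
    cudaEventRecord(afterKernel);
    cudaEventSynchronize(afterKernel);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, afterCopy);          // host-to-device copy
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);  // kernel only
    printf("copy: %f ms, kernel: %f ms\n", copyMs, kernelMs);

    cudaFree(d_input);
    cudaFree(d_output);
    delete[] h_input;
    return 0;
}
```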
The binary tree implementation consistently takes about twice the execution time of the naive, non-divergent, and sequential addressing implementations. This is irregular, as the naive implementation should be slower than the others: `atomicAdd` adds overhead because each addition must complete before the next one, or the surrounding loop described in the source code, can proceed. In this implementation, however, instead of adding one element at a time, each thread adds two elements, or as many as specified by `numberOfInputs` in the main function.
The likely cause is warp divergence: in the binary tree kernel, only even-indexed threads do work (the odd-indexed threads stay idle), which hurts performance more for larger arrays since more cache lines are required. The non-divergent solution removes this divergence but still causes bank conflicts. The most efficient of the three is the sequential addressing approach, as it solves both warp divergence and bank conflicts; the difference in indexing is sketched below.
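
For reference, this is roughly how the inner loops of the non-divergent and sequential addressing kernels differ. It is a sketch based on the standard CUDA reduction optimizations rather than the repository's exact code; `partial` is a shared memory array holding one element per thread.

```cuda
// Non-divergent (strided index): consecutive thread IDs do the work, so warps
// stay converged, but the locations they touch are 2 * stride apart, which
// can cause shared memory bank conflicts.
for (int stride = 1; stride < blockDim.x; stride *= 2) {
    int index = 2 * stride * threadIdx.x;
    if (index < blockDim.x) {
        partial[index] += partial[index + stride];
    }
    __syncthreads();
}

// Sequential addressing: the stride starts large and halves each step, so the
// active threads read and write contiguous shared memory locations, avoiding
// both divergence within active warps and bank conflicts.
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
        partial[threadIdx.x] += partial[threadIdx.x + stride];
    }
    __syncthreads();
}
```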
## Footer

Analysis performed by Arung Agamani Budi Putera - 13518005