# GPU Metrics

A comparison of execution times between array reduction algorithms in CUDA.

This implementation compares CPU array summation with GPU array summation using four approaches (a sketch of the sequential addressing kernel follows the list):

- Naive: a naive approach using the `atomicAdd` function.
- Binary tree: reduces the array by adding pairs of elements on every iteration until a single output remains.
- Non-divergent: the binary tree approach, but with thread addressing adjusted to reduce warp divergence.
- Sequential addressing: an improvement on the non-divergent solution that compacts the threads' memory accesses to reduce bank conflicts.
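As a reference point for the rest of this document, here is a minimal sketch of a sequential-addressing block reduction. It is illustrative only: the kernel and variable names (`reduceSeqAddr`, `BLOCK_SIZE`, `sdata`) are assumptions for this example and not necessarily those used in `main.cu`.

```
#define BLOCK_SIZE 256

// Sequential-addressing reduction: each block sums up to BLOCK_SIZE
// elements into one partial result.
__global__ void reduceSeqAddr(const float *in, float *out, int n) {
    __shared__ float sdata[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (0 if out of range).
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Active threads always touch a contiguous range of shared memory,
    // so whole warps are either fully active or fully idle and the
    // accesses are free of bank conflicts.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) {
        out[blockIdx.x] = sdata[0];
    }
}
```

The partial sums written by each block still have to be combined, either by a second kernel launch or on the host.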

## Setting Up

Working with CUDA requires the NVIDIA CUDA compiler (nvcc), which is part of the NVIDIA CUDA Toolkit. The latest version can be downloaded from the NVIDIA website.

As this example uses the `atomicAdd` function, which is only supported from compute capability 2.0 onwards, make sure your system has a GPU with compute capability 2.0 or above.
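For reference, the naive approach boils down to a kernel along these lines (a sketch assuming `float` inputs, since that is the `atomicAdd` overload that needs compute capability 2.0; the names are illustrative, not necessarily those in `main.cu`):

```
// Naive reduction: every thread adds its element directly into a single
// global accumulator. Floating-point atomicAdd on global memory is the
// operation that requires compute capability 2.0 or above.
__global__ void reduceNaive(const float *in, float *sum, int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, in[i]);
    }
}
```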

## Compilation

Navigate to the `src` folder and run the following command in a terminal/shell.

### Linux

```
nvcc main.cu -o main
```

### Windows

```
nvcc main.cu -o main.exe
```

Make sure you're using the appropriate architecture supported by your device. The tests below were compiled using the following command:

```
nvcc main.cu -o main.exe -gencode arch=compute_50,code=sm_50
```
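If you are unsure which architecture to target, the compute capability can be queried with a small standalone helper like the one below (not part of this repository; the `deviceQuery` sample shipped with the CUDA Toolkit reports the same information):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // e.g. compute capability 5.0 -> -gencode arch=compute_50,code=sm_50
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```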

## Metrics

These tests were conducted on a laptop with the following specifications:

- Intel Core i5-7200U running at 2.5 GHz,
- 12 GB RAM, and
- NVIDIA GeForce 940MX.

The table below compares the algorithms. All times are in milliseconds.

| n | cpu | naive_gpu | bin_gpu | nondiv_gpu | seqaddr_gpu |
|---------|----------|----------|----------|----------|----------|
| 1024    | 0.002656 | 0.062240 | 0.059392 | 0.055360 | 0.054688 |
| 2048    | 0.002368 | 0.065376 | 0.139008 | 0.090272 | 0.054432 |
| 4096    | 0.002432 | 0.105056 | 0.148544 | 0.064352 | 0.061248 |
| 8192    | 0.002336 | 0.077472 | 0.064896 | 0.059520 | 0.055968 |
| 16384   | 0.002464 | 0.098208 | 0.131392 | 0.070208 | 0.065792 |
| 32768   | 0.002432 | 0.081408 | 0.123808 | 0.101344 | 0.101312 |
| 65536   | 0.002592 | 0.107520 | 0.175456 | 0.178848 | 0.095712 |
| 131072  | 0.002464 | 0.100000 | 0.280480 | 0.195552 | 0.182496 |
| 262144  | 0.002496 | 0.132544 | 0.478144 | 0.309344 | 0.279840 |
| 524288  | 0.002432 | 0.187072 | 0.849248 | 0.526400 | 0.454880 |
| 1048576 | 0.002432 | 0.288864 | 1.619520 | 0.937248 | 0.835520 |
| 2097152 | 0.002528 | 0.448800 | 3.177376 | 1.841824 | 1.566624 |

*Figure: gpu metrics plot of the execution times above.*

## Analysis

The results show a massive difference between CPU and GPU execution times: the CPU runs in the order of microseconds, while the GPU times vary with array size but maintain the same pattern for most of the test cases. This indicates that the overhead of transferring data between host and device surpasses the actual execution time, costing performance.
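The split between transfer time and kernel time can be checked with CUDA events, as in the sketch below. This is illustrative only: `h_in`, `d_in`, `d_out`, `reduceSeqAddr`, and `BLOCK_SIZE` follow the earlier sketch and are not necessarily the names used in `main.cu`.

```
#include <cstdio>
#include <cuda_runtime.h>

// Times one host-to-device copy and one kernel launch separately.
// reduceSeqAddr and BLOCK_SIZE come from the sketch above.
static void timeCopyAndKernel(const float *h_in, float *d_in, float *d_out, int n) {
    cudaEvent_t start, stop;
    float msCopy = 0.0f, msKernel = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Host-to-device transfer time.
    cudaEventRecord(start);
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msCopy, start, stop);

    // Kernel execution time alone.
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    cudaEventRecord(start);
    reduceSeqAddr<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msKernel, start, stop);

    printf("copy: %f ms, kernel: %f ms\n", msCopy, msKernel);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```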

The binary tree implementation consistently takes about twice the execution time of the naive, non-divergent, and sequential addressing versions. This is irregular, as the naive implementation would be expected to be slower than the other cases: `atomicAdd` should add more overhead, since each addition must finish before the process (or the loop, as described in the source code) can continue. In this implementation, however, instead of adding one element at a time, each thread adds two elements, or as many as specified by `numberOfInputs` in the main function.

The likely cause is warp divergence: in the binary tree kernel only even-indexed threads do work (odd-indexed threads stay idle), which hurts performance for larger arrays because more cache lines are required. The non-divergent solution removes this divergence but still causes bank conflicts. The most efficient of the three is the sequential addressing approach, as it resolves both the warp divergence and the bank conflicts.
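The inner loops of the first two tree-based variants look roughly like the following (illustrative only, reusing `tid` and `sdata` from the earlier sketch; the sequential addressing loop was shown above):

```
// Binary tree, interleaved addressing: only threads whose index is a
// multiple of 2*s stay active, so each warp mixes active and idle
// threads (warp divergence).
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// Non-divergent: the active threads form a contiguous range, so whole
// warps are either active or idle, but the strided shared-memory index
// (2 * s * tid) still causes bank conflicts.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    unsigned int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}
```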


Analysis performed by
Arung Agamani Budi Putera - 13518005