# GPU Metrics
A comparison of execution times for array reduction algorithms in CUDA.
This implementation compares CPU array summation against GPU array summation using four approaches (minimal kernel sketches follow the list):

- Naive: naive approach using the `atomicAdd` function.
- Binary Tree: reduces the array by adding pairs of elements in each iteration until a single output remains.
- Non-divergent: the binary tree approach, but with thread addressing adjusted to reduce warp divergence.
- Sequential Addressing: an improvement on the non-divergent solution that compacts the threads' memory accesses to reduce bank conflicts.
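
The exact kernels live in `src/main.cu`; the sketch below is only meant to make the first two descriptions concrete. The kernel names, the element type (`int`), and the launch details are assumptions, not taken from the repository.

```cuda
// Minimal sketch of the first two approaches, not the repository's exact code.

// Naive: every thread folds its element into a single accumulator with
// atomicAdd. *output must be zero-initialized before launch.
__global__ void naiveReduce(const int *input, int *output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(output, input[i]);  // serializes on the shared accumulator
    }
}

// Binary tree (interleaved addressing): pairs of elements are summed each
// iteration until one value per block remains. Assumes a power-of-two block
// size and a dynamic shared memory allocation of blockDim.x * sizeof(int).
__global__ void binaryTreeReduce(const int *input, int *blockSums, int n) {
    extern __shared__ int partial[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    // The stride doubles each step; only threads whose index is a multiple of
    // 2 * stride do any work, which causes the warp divergence discussed in
    // the analysis below.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        blockSums[blockIdx.x] = partial[0];
    }
}
```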
## Setting Up
Working with CUDA requires the NVIDIA CUDA compiler (`nvcc`), which is part of the NVIDIA CUDA Toolkit. The latest version can be downloaded from the NVIDIA website.

As this example uses the `atomicAdd` function, which is only supported from compute capability 2.0 and above, make sure your system has a GPU with compute capability 2.0 or higher.
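
If you are unsure of your GPU's compute capability, a small standalone check like the one below (not part of this repository) can print it via `cudaGetDeviceProperties`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability of every visible CUDA device.
int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Compile it the same way as the main program, e.g. `nvcc query.cu -o query` (the file name is just an example).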
## Compilation
Navigate to the `src` folder and run the following command in a terminal/shell.
Linux:

```
nvcc main.cu -o main
```

Windows:

```
nvcc main.cu -o main.exe
```
Make sure you're using an architecture supported by your device. The tests below were compiled with the following command:

```
nvcc main.cu -o main.exe -gencode arch=compute_50,code=sm_50
```
## Metrics
The tests were conducted on a laptop with the following specifications:
- Intel Core i5-7200U running at 2.5 GHz
- 12 GB RAM
- NVIDIA GeForce 940MX
The table below compares the algorithms. All times are in milliseconds.
n | cpu | naive_gpu | bin_gpu | nondiv_gpu | seqaddr_gpu |
---|---|---|---|---|---|
1024 | 0.002656 | 0.062240 | 0.059392 | 0.055360 | 0.054688 |
2048 | 0.002368 | 0.065376 | 0.139008 | 0.090272 | 0.054432 |
4096 | 0.002432 | 0.105056 | 0.148544 | 0.064352 | 0.061248 |
8192 | 0.002336 | 0.077472 | 0.064896 | 0.059520 | 0.055968 |
16384 | 0.002464 | 0.098208 | 0.131392 | 0.070208 | 0.065792 |
32768 | 0.002432 | 0.081408 | 0.123808 | 0.101344 | 0.101312 |
65536 | 0.002592 | 0.107520 | 0.175456 | 0.178848 | 0.095712 |
131072 | 0.002464 | 0.100000 | 0.280480 | 0.195552 | 0.182496 |
262144 | 0.002496 | 0.132544 | 0.478144 | 0.309344 | 0.279840 |
524288 | 0.002432 | 0.187072 | 0.849248 | 0.526400 | 0.454880 |
1048576 | 0.002432 | 0.288864 | 1.619520 | 0.937248 | 0.835520 |
2097152 | 0.002528 | 0.448800 | 3.177376 | 1.841824 | 1.566624 |
## Analysis
The results show a massive difference between CPU and GPU execution times: the CPU runs in the order of microseconds, while the GPU times vary with array size but follow the same pattern across most test cases. This suggests that the overhead of transferring data between host and device exceeds the kernel execution time itself, which costs performance.
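
One way to verify that claim, which is not done in the original measurements, is to time the host-to-device copy and the kernel separately with CUDA events. The snippet below is a sketch that reuses the hypothetical `naiveReduce` kernel from the earlier sketch; the array size and launch configuration are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumes the naiveReduce kernel sketched earlier is defined in this file.

int main() {
    const int n = 1 << 20;
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    int *h_input = new int[n];
    for (int i = 0; i < n; ++i) h_input[i] = 1;

    int *d_input, *d_output;
    cudaMalloc(&d_input, n * sizeof(int));
    cudaMalloc(&d_output, sizeof(int));
    cudaMemset(d_output, 0, sizeof(int));

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start);
    cudaMemcpy(d_input, h_input, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(afterCopy);
    naiveReduce<<<blocks, threadsPerBlock>>>(d_input, d_output, n);
    cudaEventRecord(afterKernel);
    cudaEventSynchronize(afterKernel);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, afterCopy);          // host-to-device copy
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);  // kernel only
    printf("copy: %f ms, kernel: %f ms\n", copyMs, kernelMs);

    cudaFree(d_input);
    cudaFree(d_output);
    delete[] h_input;
    return 0;
}
```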
The binary tree implementation consistently takes about twice the execution time of the naive, non-divergent, and sequential addressing implementations. This is irregular, as the naive implementation should be slower than the others: `atomicAdd` adds overhead because each addition must complete before the next one, or the surrounding loop described in the source code, can proceed. In this implementation, however, instead of adding one element at a time, each thread adds two elements, or as many as specified by `numberOfInputs` in the main function.
The likely cause is warp divergence: in the binary tree kernel, only even-indexed threads do work (the odd-indexed threads stay idle), which hurts performance more for larger arrays since more cache lines are required. The non-divergent solution removes this divergence but still causes bank conflicts. The most efficient of the three is the sequential addressing approach, as it solves both warp divergence and bank conflicts; the difference in indexing is sketched below.
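
For reference, this is roughly how the inner loops of the non-divergent and sequential addressing kernels differ. It is a sketch based on the standard CUDA reduction optimizations rather than the repository's exact code; `partial` is a shared memory array holding one element per thread.

```cuda
// Non-divergent (strided index): consecutive thread IDs do the work, so warps
// stay converged, but the locations they touch are 2 * stride apart, which
// can cause shared memory bank conflicts.
for (int stride = 1; stride < blockDim.x; stride *= 2) {
    int index = 2 * stride * threadIdx.x;
    if (index < blockDim.x) {
        partial[index] += partial[index + stride];
    }
    __syncthreads();
}

// Sequential addressing: the stride starts large and halves each step, so the
// active threads read and write contiguous shared memory locations, avoiding
// both divergence within active warps and bank conflicts.
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
        partial[threadIdx.x] += partial[threadIdx.x + stride];
    }
    __syncthreads();
}
```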
## Footer

Analysis performed by Arung Agamani Budi Putera - 13518005