diff --git a/README.md b/README.md
index 996af0fd4bafc3c4297c04e8a86b0e4790ef5c1d..d7bf7cefd1ca57bb18e4f8d4c8664c8ecf7061c1 100644
--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ Below is the table comparison between the algorithms. Time in milliseconds
 
 The results show a massive difference between CPU and GPU execution time: the CPU runs in the order of microseconds, while the GPU results vary with array size but maintain the same pattern for most of the test cases. This shows that the overhead of transferring data between host and device surpasses the kernel execution time, costing performance.
 
-The binary implementation consistently giving twice the execution time compared to naive, nondivergent, and sequential addressing. This is irregular as naive implementation should be slower than the other cases. Atomic add should add more overhead as it requires the addition to finish before continuing the process, or the loop as described in the source code.
+The binary implementation consistently gives about twice the execution time of the naive, non-divergent, and sequential-addressing versions. This is irregular, as the naive implementation should be slower than the other cases. Atomic add should add more overhead, since it requires each addition to finish before the process (or the loop, as described in the source code) can continue. In this implementation, however, instead of adding one element at a time, each thread adds two elements, as specified by `numberOfInputs` in the main function.
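+
+As a rough illustration of that pattern, here is a minimal sketch (not the repository's kernel; the kernel name and the `elemsPerThread` parameter are illustrative stand-ins for the role `numberOfInputs` plays in `main`):
+
+```cuda
+// Illustrative sketch: each thread sums a small, fixed number of consecutive
+// inputs, then the per-thread partial sums are combined with atomicAdd.
+__global__ void reducePerThread(const float *in, float *out, int n, int elemsPerThread)
+{
+    int base = (blockIdx.x * blockDim.x + threadIdx.x) * elemsPerThread;
+
+    float partial = 0.0f;
+    for (int i = 0; i < elemsPerThread; ++i) {
+        if (base + i < n)
+            partial += in[base + i];
+    }
+
+    // The atomic update serializes the final accumulation (the overhead noted
+    // above) but keeps concurrent additions to *out correct.
+    atomicAdd(out, partial);
+}
+```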
 
 A possible cause is warp divergence: only even-indexed threads do work (odd threads stay idle), which hurts performance on larger arrays as more cache lines are required. The non-divergent version solves this, but still causes shared-memory bank conflicts. The most efficient of the three is the sequential-addressing approach, as it avoids both warp divergence and bank conflicts.
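+
+To make the addressing difference concrete, the sketch below (illustrative only, not the repository kernels) shows where the three variants differ; only the stride loop changes:
+
+```cuda
+// Illustrative shared-memory reduction. blockDim.x is assumed to be a power of
+// two; launch with blockDim.x * sizeof(float) dynamic shared memory.
+__global__ void reduceSeqAddr(const float *in, float *out, int n)
+{
+    extern __shared__ float sdata[];
+    unsigned int tid = threadIdx.x;
+    unsigned int idx = blockIdx.x * blockDim.x + tid;
+
+    sdata[tid] = (idx < n) ? in[idx] : 0.0f;
+    __syncthreads();
+
+    // Divergent variant:     for (s = 1; s < blockDim.x; s *= 2)
+    //                            if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
+    //   -> only every (2*s)-th thread works, so warps diverge.
+    // Non-divergent variant: same loop, but i = 2 * s * tid;
+    //                            if (i + s < blockDim.x) sdata[i] += sdata[i + s];
+    //   -> active threads stay contiguous, yet the strided access hits bank conflicts.
+    // Sequential addressing (below): stride halves each step, accesses stay contiguous.
+    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
+        if (tid < s)
+            sdata[tid] += sdata[tid + s];
+        __syncthreads();
+    }
+
+    if (tid == 0)
+        atomicAdd(out, sdata[0]);  // combine block partials (illustrative choice)
+}
+```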
 
diff --git a/comparison.md b/comparison.md
new file mode 100644
index 0000000000000000000000000000000000000000..befaa2ad7e500053d0d2c92968ec464d45922fa5
--- /dev/null
+++ b/comparison.md
@@ -0,0 +1,50 @@
+# Comparison with another implementation's metrics
+
+This analysis compares the results of the current GPU implementation with someone else's; I chose Naufal's results as the benchmark. All times below are in milliseconds.
+
+## My Performance Metrics
+
+| n       | cpu      | naive_gpu | bin_gpu  | nondiv_gpu | seqaddr_gpu |
+| ------- | -------- | --------- | -------- | ---------- | ----------- |
+| 1024    | 0.002656 | 0.062240  | 0.059392 | 0.055360   | 0.054688    |
+| 2048    | 0.002368 | 0.065376  | 0.139008 | 0.090272   | 0.054432    |
+| 4096    | 0.002432 | 0.105056  | 0.148544 | 0.064352   | 0.061248    |
+| 8192    | 0.002336 | 0.077472  | 0.064896 | 0.059520   | 0.055968    |
+| 16384   | 0.002464 | 0.098208  | 0.131392 | 0.070208   | 0.065792    |
+| 32768   | 0.002432 | 0.081408  | 0.123808 | 0.101344   | 0.101312    |
+| 65536   | 0.002592 | 0.107520  | 0.175456 | 0.178848   | 0.095712    |
+| 131072  | 0.002464 | 0.100000  | 0.280480 | 0.195552   | 0.182496    |
+| 262144  | 0.002496 | 0.132544  | 0.478144 | 0.309344   | 0.279840    |
+| 524288  | 0.002432 | 0.187072  | 0.849248 | 0.526400   | 0.454880    |
+| 1048576 | 0.002432 | 0.288864  | 1.619520 | 0.937248   | 0.835520    |
+
+## Naufal's Performance Metrics
+
+|    size |      cpu |    naive |    btree | nondivergent |  seqaddr |
+| ------: | -------: | -------: | -------: | -----------: | -------: |
+|    1024 | 0.002048 | 0.012288 | 0.013952 |     0.011936 | 0.012192 |
+|    2048 | 0.002048 | 0.011232 | 0.013664 |     0.012704 | 0.012736 |
+|    4096 | 0.003072 | 0.012288 | 0.015232 |     0.012256 | 0.012576 |
+|    8192 | 0.002048 | 0.017408 | 0.016448 |     0.015264 | 0.013888 |
+|   16384 | 0.002048 | 0.022528 | 0.019936 |     0.016352 | 0.015808 |
+|   32768 | 0.006048 | 0.237376 | 0.026112 |     0.021376 | 0.020448 |
+|   65536 | 0.017696 | 0.261248 | 0.039360 |     0.031040 | 0.029408 |
+|  131072 | 0.041824 | 0.314208 | 0.059968 |     0.051264 | 0.046048 |
+|  262144 | 0.084832 | 0.417248 | 0.185280 |     0.103136 | 0.099392 |
+|  524288 | 0.252544 | 0.646112 | 0.219264 |     0.191264 | 0.185664 |
+| 1048576 | 0.768096 | 1.017792 | 0.426848 |     0.367040 | 0.361504 |
+
+## Main Differences
+
+### Naive Approach
+
+Comparing the results, Naufal's naive approach behaves as expected: it takes relatively longer than his other GPU approaches. Our codes have a similar structure and differ only in the number of elements processed by each thread: two elements per thread in mine versus four in Naufal's (see the launch sketch after the next subsection).
+
+### Kernel Launch
+
+On kernel launch, my kernels are launched with a block size of 256 threads, and Naufal's with 512 threads. This suggests that the smaller thread count per block delivers better performance here: on 2^20 elements (1048576), my naive code runs about 3.5 times faster than Naufal's (0.288864 ms vs 1.017792 ms).
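+
+A minimal sketch of how the two launch configurations differ (illustrative names and structure, not either repository's actual code; only `blockSize` and `elemsPerThread` change between the two setups):
+
+```cuda
+#include <cuda_runtime.h>
+
+// Illustrative naive kernel: each thread accumulates elemsPerThread consecutive
+// inputs and contributes its partial sum with one atomicAdd.
+__global__ void naiveReduce(const float *in, float *out, int n, int elemsPerThread)
+{
+    int base = (blockIdx.x * blockDim.x + threadIdx.x) * elemsPerThread;
+    float partial = 0.0f;
+    for (int i = 0; i < elemsPerThread; ++i)
+        if (base + i < n)
+            partial += in[base + i];
+    atomicAdd(out, partial);
+}
+
+// Host-side launch: the grid size follows from the block size and the number of
+// elements each thread handles, so the two setups differ only in two constants.
+// d_out is assumed to be zero-initialised before the call.
+void launchNaive(const float *d_in, float *d_out, int n)
+{
+    const int blockSize      = 256;  // 512 in Naufal's setup
+    const int elemsPerThread = 2;    // 4 in Naufal's setup
+    const int elemsPerBlock  = blockSize * elemsPerThread;
+    const int gridSize       = (n + elemsPerBlock - 1) / elemsPerBlock;
+
+    naiveReduce<<<gridSize, blockSize>>>(d_in, d_out, n, elemsPerThread);
+}
+```
+
+Under these assumed launch formulas, n = 1048576 would give 2048 blocks of 256 threads in my setup versus 512 blocks of 512 threads in Naufal's.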
+
+### Binary, Non-divergent, and Sequential Addressing
+
+While my code performs better on the naive implementation, the remaining kernels run significantly slower than Naufal's, roughly two to four times slower on the larger sizes. A brief look at Naufal's code structure shows that, in those implementations, each thread also processes more than one element, while in my code each thread deals with only one element. This gives Naufal's code a greater benefit from locality and cache lines, delivering better performance.  
+Another difference is the use of `atomicAdd` across the GPU implementations, which keeps the accumulation thread-safe at the cost of some thread blocking. Even though `atomicAdd` can degrade performance, the metrics show that the gain from handling multiple elements per thread outweighs that cost.
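+
+For reference, below is a sketch of the multi-element load pattern described above (illustrative only, not Naufal's actual code): each thread sums several global elements while filling shared memory, and only then does the in-block tree reduction run.
+
+```cuda
+// Illustrative sketch of loading several inputs per thread before the
+// sequential-addressing tree reduction. The per-thread element count is an
+// assumption; Naufal's kernels may use a different value.
+#define ELEMS_PER_THREAD 4
+
+__global__ void reduceMultiLoad(const float *in, float *out, int n)
+{
+    extern __shared__ float sdata[];  // launch with blockDim.x * sizeof(float)
+    unsigned int tid  = threadIdx.x;
+    unsigned int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + tid;
+
+    // Load phase: instead of a single "sdata[tid] = in[idx]", each thread sums
+    // ELEMS_PER_THREAD elements spaced blockDim.x apart, so neighbouring threads
+    // still read neighbouring addresses.
+    float partial = 0.0f;
+    for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
+        unsigned int idx = base + i * blockDim.x;
+        if (idx < n)
+            partial += in[idx];
+    }
+    sdata[tid] = partial;
+    __syncthreads();
+
+    // Sequential-addressing tree reduction over the shared tile.
+    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
+        if (tid < s)
+            sdata[tid] += sdata[tid + s];
+        __syncthreads();
+    }
+
+    if (tid == 0)
+        atomicAdd(out, sdata[0]);  // one atomic update per block
+}
+```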