UDACITY教程 Intro to Parallel Programming
- Basics on GPU, CUDA, Memory Model
- Parallel Algorithms(Reduce, Scan, Histogram, Sort)
- Optimize Parallel GPU Programs
- Others(Library, OpenACC, Dynamic parallelism)
1. Reduce
Input:
1) set of elements
2) reduction operator: binary(2 to 1), associative(- is not, + is)
Serial Implementation: work/step complexity O(n)
Parallel Impplementation: O(log(n)), 并行归并
Example
2. Scan
统计该元素之前所有位置的op结果
Parallel it: 1) logn steps and nlogn works 2) 2logn steps and 2n work
3. Histogram
-
效率较低: Need to use atomic operation(Hash计数问题)
-
优化: Reduced-based algorithm: Each thread has its own bin
4. Problem 3
Image Tone-mapping:
solution
5. Use Scan in Sparse Matrix
Use Scan to compute the Address of the density-array.
-
Sparse Matrix: - CSR representation
-
Solution-1 - thread/element by Segmented Scan:
- Use ROWPTR to generate segmented value array
- Thread per element: map (VALUE[n] * X[COLUMN[n]])
- Backwards inclusive segmented sum scan
- Use ROWPTR to gather sparse output into dense vector
-
Solution-2 - thread/row cuda code:(similar # of elements/row, 3X faster)
-
Solution-3 - Hybrid:
6. Sort
- ODD_EVEN Sort(O(n), 类似冒泡排序)
- Merge Sort
- Sorting Networks: 比较的顺序很tricky!
- Quick Sort
- Radix Sort: GPU容易实现,且效率高
7. Problem 4
Red Eye Removal:
solution
近期评论