GPUs contain hundreds to thousands of cores, but, unlike the independent cores of a CPU, small work groups of GPU cores must execute the same instructions simultaneously, albeit on different data. In this respect, GPU-based parallelization may be thought of as SIMD on a massive scale, leading Nvidia to coin the term SIMT (single instruction, multiple threads) (Lindholm et al. 2008). In this setup, communication between threads within the same work group happens extremely quickly via shared on-chip memory, and scheduling a massive number of threads hides the latencies arising from off-chip memory transactions, because the many tasks are dynamically and simultaneously loaded and off-loaded. In part, this is because of the GPU's massively parallel architecture; in part, it is because contemporary general-purpose GPUs pair small memory caches with high memory bandwidth. Together, these properties make GPUs ideal tools for executing a massive number of short-lived, cooperative threads.
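To make the SIMT model concrete, the following minimal OpenCL C kernel (an illustrative sketch, not code from this article; the name `scale` and its arguments are hypothetical) shows how every thread executes the identical function body while its global ID selects a different data element:

```c
// Minimal OpenCL C kernel: every thread runs this same function body,
// differing only in the data element selected by its global ID (SIMT).
// Kernel name and arguments are illustrative, not from the article.
__kernel void scale(__global const float *x,
                    __global float *y,
                    const float alpha,
                    const int n) {
    int i = get_global_id(0);   // unique index of this thread
    if (i < n) {                // guard against padding threads
        y[i] = alpha * x[i];    // same instruction, different data
    }
}
```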
The likelihood evaluation first involves N independent transformation-reductions, one to obtain each λn. We generate T = N × B threads on the GPU and use work groups of B threads to compute each of the N λns. Each thread uses a while-loop across indices n′ to compute N/B of the λnn′ terms and keeps a running partial sum. After the B threads obtain their partial sums, they work together in a final binary reduction to obtain λn. The binary reduction is fast, with O(log₂ B) complexity, and represents an additional speedup beyond massive parallelization. Once all N λns are computed, their summation proceeds in exactly the same manner. The GPU uses massive parallelization to offset the cost of the rate-limiting floating-point evaluation of exp(·). High memory bandwidth allows for fast transfer to and from each work group, and, in turn, each work group shares its own fast-access memory that facilitates rapid communication between member threads. We use the Open Computing Language (OpenCL) to write our GPU code. In OpenCL, we write functions called kernels, which the runtime assigns to work groups for separate, parallel execution. To evaluate the likelihood, we write one kernel for the work groups that compute the λns and one kernel for those that sum the N λns. These details culminate in Algorithm 3. Algorithm 5 is similar to Algorithm 3 and computes the vector of self-excitatory probabilities πn.
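To fix ideas, here is a hypothetical OpenCL C sketch of the first kernel under stated assumptions: the work-group size B is a compile-time constant matching the host-side launch, the λnn′ contribution is stubbed with a simple exponential decay in the inter-event time (the article's actual triggering function and parameters are not reproduced here), and the kernel name and arguments are our own. One work group computes one λn: each thread strides across indices n′ accumulating a partial sum, and the group then combines its B partial sums via a binary reduction in local memory.

```c
// Hypothetical sketch of the per-event transformation-reduction kernel.
// Work group n computes lambda_n; names and the stubbed contribution
// are illustrative assumptions, not the article's actual code.
#define B 128  // work-group size; must match the host-side local size

__kernel void lambda_kernel(__global const float *t,   // event times
                            __global float *lambda,    // output: one per event
                            const int N) {
    const int n   = get_group_id(0);   // this group computes lambda_n
    const int tid = get_local_id(0);   // thread index within the group

    __local float partial[B];          // on-chip memory shared by the group

    // Stride loop: thread tid handles n' = tid, tid + B, tid + 2B, ...
    // so each thread accumulates roughly N/B of the lambda_{nn'} terms.
    float sum = 0.0f;
    for (int np = tid; np < N; np += B) {
        // Placeholder contribution: exponential decay in the inter-event
        // time; exp(.) is the rate-limiting floating-point operation.
        float dt = t[n] - t[np];
        sum += (dt > 0.0f) ? exp(-dt) : 0.0f;
    }
    partial[tid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Binary reduction: O(log2 B) steps combine the B partial sums.
    for (int stride = B / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (tid == 0) {
        lambda[n] = partial[0];        // group's result: lambda_n
    }
}
```

On the host side, one would enqueue this kernel with global size N × B and local size B, so that work group n produces λn; the summation kernel would then reuse the same binary-reduction pattern on the resulting length-N vector.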