GPUs contain hundreds to thousands of cores, but, unlike the independent cores of a CPU, small work groups of GPU cores must execute the same instructions simultaneously, albeit on different data. In this respect, GPU-based parallelization may be thought of as SIMD on a massive scale, leading Nvidia to coin the term SIMT (single instruction, multiple threads) (Lindholm et al. 2008). In this setup, communication between threads within the same work group happens extremely quickly via shared on-chip memory, and scheduling a massive number of threads hides the latencies arising from off-chip memory transactions, because the many tasks are dynamically and simultaneously loaded and off-loaded. In part, this is because of the GPU's massively parallel architecture; in part, it is because contemporary general-purpose GPUs pair small memory caches with high memory bandwidth. Together, these properties make GPUs ideal tools for executing a massive number of short-lived, cooperative threads.
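To make the SIMT model concrete, the following minimal OpenCL C kernel (an illustrative sketch, not code from this article; the name `scale` and its arguments are hypothetical) shows how every thread executes the identical function body while its global ID selects a different data element:

```c
// Minimal OpenCL C kernel: every thread runs this same function body,
// differing only in the data element selected by its global ID (SIMT).
// Kernel name and arguments are illustrative, not from the article.
__kernel void scale(__global const float *x,
                    __global float *y,
                    const float alpha,
                    const int n) {
    int i = get_global_id(0);   // unique index of this thread
    if (i < n) {                // guard against padding threads
        y[i] = alpha * x[i];    // same instruction, different data
    }
}
```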
The likelihood evaluation first involves N independent transformation-reductions, one to obtain each λn. We generate T = N × B threads on the GPU and use work groups of B threads to compute each of the N λns. Each thread uses a while-loop across indices n′ to compute N/B of the λnn′ terms and keeps a running partial sum. After the B threads obtain their partial sums, they work together in a final binary reduction to obtain λn. The binary reduction is fast, with O(log₂ B) complexity, and represents an additional speedup beyond massive parallelization. Once all N λns are computed, their summation proceeds in exactly the same manner. The GPU uses massive parallelization to offset the cost of the rate-limiting floating-point evaluation of exp(·). High memory bandwidth allows for fast transfer to and from each work group, and, in turn, each work group shares its own fast-access memory that facilitates rapid communication between member threads. We use the Open Computing Language (OpenCL) to write our GPU code. In OpenCL, we write functions called kernels, which the runtime assigns to work groups for separate, parallel execution. To evaluate the likelihood, we write one kernel for the work groups that compute the λns and one kernel for those that sum the N λns. These details culminate in Algorithm 3. Algorithm 5 is similar to Algorithm 3 and computes the vector of self-excitatory probabilities πn.
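To fix ideas, here is a hypothetical OpenCL C sketch of the first kernel under stated assumptions: the work-group size B is a compile-time constant matching the host-side launch, the λnn′ contribution is stubbed with a simple exponential decay in the inter-event time (the article's actual triggering function and parameters are not reproduced here), and the kernel name and arguments are our own. One work group computes one λn: each thread strides across indices n′ accumulating a partial sum, and the group then combines its B partial sums via a binary reduction in local memory.

```c
// Hypothetical sketch of the per-event transformation-reduction kernel.
// Work group n computes lambda_n; names and the stubbed contribution
// are illustrative assumptions, not the article's actual code.
#define B 128  // work-group size; must match the host-side local size

__kernel void lambda_kernel(__global const float *t,   // event times
                            __global float *lambda,    // output: one per event
                            const int N) {
    const int n   = get_group_id(0);   // this group computes lambda_n
    const int tid = get_local_id(0);   // thread index within the group

    __local float partial[B];          // on-chip memory shared by the group

    // Stride loop: thread tid handles n' = tid, tid + B, tid + 2B, ...
    // so each thread accumulates roughly N/B of the lambda_{nn'} terms.
    float sum = 0.0f;
    for (int np = tid; np < N; np += B) {
        // Placeholder contribution: exponential decay in the inter-event
        // time; exp(.) is the rate-limiting floating-point operation.
        float dt = t[n] - t[np];
        sum += (dt > 0.0f) ? exp(-dt) : 0.0f;
    }
    partial[tid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Binary reduction: O(log2 B) steps combine the B partial sums.
    for (int stride = B / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (tid == 0) {
        lambda[n] = partial[0];        // group's result: lambda_n
    }
}
```

On the host side, one would enqueue this kernel with global size N × B and local size B, so that work group n produces λn; the summation kernel would then reuse the same binary-reduction pattern on the resulting length-N vector.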