The advantage of GPUs over CPUs for independent data processing stems from the larger number of transistors devoted to data processing rather than to data caching and flow control. This permits an efficient implementation of the Single-Instruction, Multiple-Thread (SIMT) processing model.47 In our algorithm, the single procedure is the calculation of the number of steric clashes for a set of ISC conformations. These calculations can be performed identically and independently for each set of conformations. Thus, GPU data processing can significantly accelerate the conformational search.
We implemented the proposed method for GPUs with the Compute Unified Device Architecture (CUDA) from NVIDIA, using the C++/CUDA C programming languages. Calculations were carried out on two Linux workstations (an HP laptop with a GeForce GTX 950M GPU, GCC 4.9.2; and a Dell desktop with a Quadro K4000 GPU, GCC 4.4.7; both with CUDA Compilation Tools 7.5). In the CUDA heterogeneous programming model, the host (CPU) running a C/C++ program passes data and flow control to a physically separate device (GPU), which operates as a coprocessor with its own memory.47 The time spent on host/device data transfers may severely limit the speed-up of the algorithm. To mitigate this problem, we used pre-computed representations of the rotameric states of each ISC (Fig. 3). Each Cartesian point (data structure Point3DCart) stores the coordinates of an ISC atom in some rotameric state. The initial conformations of the ISC atoms (as in the input PDB file) are stored in the data structures ExtractedResidue (one per interface residue) as arrays of Point3DCart. To generate another rotameric state, the positions of the ISC atoms beyond Cβ (e.g., Tyr in Fig. 3c) are changed, and the new set of Cartesian coordinates for that state is stored in a separate ExtractedResidue structure (Fig. 3c,d). After all rotameric states have been generated for each ISC (except Ala and Gly), the corresponding ExtractedResidue structures are grouped into a single data structure LibraryExtractedResidues (Fig. 3d), which is transferred to the device memory only once. Each GPU thread then uses a unique combination of ExtractedResidue structures, determined by a combination of indices of the ISC rotameric states.
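The relation between the three data structures and the per-thread combination of rotamer indices can be illustrated with a minimal host-side C++ sketch. It is not the authors' implementation: the text does not state how a thread derives its combination, so the mixed-radix decoding below is an assumption, as are the fields offsets and counts and the helpers addSideChain, combinationCount and decodeCombination. A device version would also need flat arrays in place of std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cartesian coordinates of one atom (three 4-byte floats, 12 bytes).
struct Point3DCart {
    float x, y, z;
};

// One rotameric state of one interface side chain.
struct ExtractedResidue {
    std::vector<Point3DCart> atoms;
};

// All rotameric states of all interface side chains, stored contiguously.
// Hypothetical layout: offsets[i] is the position of the first rotamer of
// side chain i in `states`, counts[i] is its number of rotamers, Nrot(i).
struct LibraryExtractedResidues {
    std::vector<ExtractedResidue> states;
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> counts;
};

// Append all rotameric states of one side chain to the library.
void addSideChain(LibraryExtractedResidues& lib,
                  const std::vector<ExtractedResidue>& rotamers) {
    assert(!rotamers.empty());
    lib.offsets.push_back(lib.states.size());
    lib.counts.push_back(rotamers.size());
    lib.states.insert(lib.states.end(), rotamers.begin(), rotamers.end());
}

// Number of distinct combinations: the product of Nrot(i) over side chains.
std::uint64_t combinationCount(const LibraryExtractedResidues& lib) {
    std::uint64_t n = 1;
    for (std::size_t c : lib.counts) n *= c;
    return n;
}

// Assumed mixed-radix decoding: map a linear (thread) index to one position
// in `states` per side chain, so that every index gives a unique combination.
std::vector<std::size_t> decodeCombination(const LibraryExtractedResidues& lib,
                                           std::uint64_t linear) {
    assert(linear < combinationCount(lib));
    std::vector<std::size_t> picked(lib.counts.size());
    for (std::size_t i = 0; i < lib.counts.size(); ++i) {
        picked[i] = lib.offsets[i] + static_cast<std::size_t>(linear % lib.counts[i]);
        linear /= lib.counts[i];
    }
    return picked;
}
```

With this layout, only the library and one integer per thread are needed to reconstruct any combination, so no per-combination coordinates have to be copied to the device.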
Fig. 3. (a) Interface of a model of the complex Ecotin Y69F, D70P bound to D102N Trypsin (1ezu) from the Dockground unbound benchmark set 3.0; (b) zoom-in on the interface fragment with the Tyr217 residue (chain C); (c) mapping of three rotameric states of the Tyr217 side chain onto the data structures; (d) data structures associated with the atoms (Point3DCart), separate rotameric states of a side chain (ExtractedResidue) and storage of sets of these mappings (LibraryExtractedResidues). Data structure LibraryExtractedResidues is a composition of all structures ExtractedResidue. The total number of elements in the LibraryExtractedResidues is $\sum_i N_{rot}(i)$, where $N_{rot}(i)$ is the number of rotamers for the interface side chain i.
Such optimized data storage requires only a small amount of data to be transferred to the GPU memory. For example, for 10 ISCs with an average of 20 atoms and 5 rotameric states per side chain (5^10 = 9,765,625 distinct configurations in total), the LibraryExtractedResidues data structure occupies 10 × 5 × 20 × 12 Bytes = 12 kB (in actual calculations the size varies from 8.4 kB to 240 kB). The brute-force approach in this case would require 9,765,625 × 10 × 20 × 12 Bytes > 23 GB, an amount of memory that rapidly approaches the current limits of CUDA-enabled GPUs. Splitting this conformational space, with separate scoring of each part, would require multiple host/device data transfers.
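The arithmetic behind these two estimates can be checked with a short C++ sketch. It assumes that each atom position occupies 12 bytes (three 4-byte floats, consistent with the 12 Bytes used above); the function names libraryBytes and bruteForceBytes are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: one atom position is three 4-byte floats, i.e. 12 bytes.
constexpr std::uint64_t kBytesPerAtom = 12;

// Library size: sum over side chains of Nrot(i) * atoms(i) * 12 bytes.
std::uint64_t libraryBytes(const std::vector<std::uint64_t>& nRot,
                           const std::vector<std::uint64_t>& nAtoms) {
    assert(nRot.size() == nAtoms.size());
    std::uint64_t bytes = 0;
    for (std::size_t i = 0; i < nRot.size(); ++i) {
        bytes += nRot[i] * nAtoms[i] * kBytesPerAtom;
    }
    return bytes;
}

// Brute-force size: every combination stored explicitly, i.e.
// (product of Nrot(i)) * (sum of atoms(i)) * 12 bytes.
std::uint64_t bruteForceBytes(const std::vector<std::uint64_t>& nRot,
                              const std::vector<std::uint64_t>& nAtoms) {
    assert(nRot.size() == nAtoms.size());
    std::uint64_t combinations = 1;
    std::uint64_t atomsPerCombination = 0;
    for (std::size_t i = 0; i < nRot.size(); ++i) {
        combinations *= nRot[i];
        atomsPerCombination += nAtoms[i];
    }
    return combinations * atomsPerCombination * kBytesPerAtom;
}
```

For 10 side chains with 5 rotamers and 20 atoms each, libraryBytes gives 12,000 bytes and bruteForceBytes gives 23,437,500,000 bytes, matching the 12 kB and > 23 GB figures in the text.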
Static conformations of the near-interface residues are transferred to the GPU memory as an array of Point3DCart structures and participate in the scoring of conformations. The typical size of this array is 18 kB.
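How the static atoms might enter the scoring can be sketched as a count of close contacts between the atoms of one rotameric state and the static array. This is only an illustration: the clash criterion is not given in this section, so the simple distance threshold and the parameter cutoff are assumptions, and countClashes is a hypothetical name. The sketch is host-side C++; on the GPU each thread would run an equivalent loop for its own combination.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cartesian coordinates of one atom.
struct Point3DCart {
    float x, y, z;
};

// Count atom pairs (one mobile, one static) closer than `cutoff`.
// `cutoff` is a hypothetical parameter: the actual clash criterion
// is not specified in the text.
std::size_t countClashes(const std::vector<Point3DCart>& mobileAtoms,
                         const std::vector<Point3DCart>& staticAtoms,
                         float cutoff) {
    assert(cutoff > 0.0f);
    const float cutoffSquared = cutoff * cutoff;  // compare squared distances, no sqrt
    std::size_t clashes = 0;
    for (const Point3DCart& a : mobileAtoms) {
        for (const Point3DCart& b : staticAtoms) {
            const float dx = a.x - b.x;
            const float dy = a.y - b.y;
            const float dz = a.z - b.z;
            if (dx * dx + dy * dy + dz * dz < cutoffSquared) {
                ++clashes;
            }
        }
    }
    return clashes;
}
```

Because the static array is read-only and shared by all combinations, it needs to be copied to the device only once, like the rotamer library.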