To view the other types of publications on this topic, follow the link: GPU1.

Journal articles on the topic "GPU1"

Create a reference in APA, MLA, Chicago, Harvard, and other citation styles

Browse the top 50 journal articles for research on the topic "GPU1".

Next to every entry in the bibliography there is an "Add to bibliography" option. Use it, and the bibliographic reference for the chosen work will be formatted automatically in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scientific publication as a PDF and read its abstract online, where these are available in the item's metadata.

Browse journal articles across a wide range of disciplines and compile your bibliography correctly.

1

Nakada, Yuji, and Yoshifumi Itoh. "Pseudomonas aeruginosa PAO1 genes for 3-guanidinopropionate and 4-guanidinobutyrate utilization may be derived from a common ancestor". Microbiology 151, no. 12 (December 1, 2005): 4055–62. http://dx.doi.org/10.1099/mic.0.28258-0.

Abstract:
Pseudomonas aeruginosa PAO1 utilizes 3-guanidinopropionate (3-GP) and 4-guanidinobutyrate (4-GB), which differ in one methylene group only, via distinct enzymes: guanidinopropionase (EC 3.5.3.17; the gpuA product) and guanidinobutyrase (EC 3.5.3.7; the gbuA product). The authors cloned and characterized the contiguous gpuPAR genes (in that order) responsible for 3-GP utilization, and compared the deduced sequences of their putative protein products, and the potential regulatory mechanisms of gpuPA, with those of the corresponding gbu genes encoding the 4-GB catabolic system. GpuA and GpuR have similarity to GbuA (49 % identity) and GbuR (a transcription activator of gbuA; 37 % identity), respectively. GpuP resembles PA1418 (58 % identity), which is a putative membrane protein encoded by a potential gene downstream of gbuA. These features of the GpuR and GpuP sequences, and the impaired growth of gpuR and gpuP knockout mutants on 3-GP, support the notion that GpuR and GpuP direct the 3-GP-inducible expression of gpuA, and the uptake of 3-GP, respectively. Northern blots of mRNA from 3-GP-induced PAO1 cells revealed three transcripts of gpuA, gpuP, and gpuP and gpuA together, suggesting that gpuP and gpuA each have a 3-GP-responsive promoter, and that some transcription from the gpuP promoter is terminated after gpuP, or proceeds into gpuA. Knockout of gpuR abolished 3-GP-dependent synthesis of the transcripts, confirming that GpuR activates transcription from these promoters, with 3-GP as a specific co-inducer. The sequence conservation between the three functional pairs of the Gpu and Gbu proteins, and the absence of gpuAPR in closely related species, imply that the triad gpu genes have co-ordinately evolved from origins common to the gbu counterparts, to establish an independent catabolic system of 3-GP in P. aeruginosa.
2

Guo, Sen, San Feng Chen, and Yong Sheng Liang. "Global Shared Memory Design for Multi-GPU Graphics Cards on Personal Supercomputer". Applied Mechanics and Materials 263-266 (December 2012): 1236–41. http://dx.doi.org/10.4028/www.scientific.net/amm.263-266.1236.

Abstract:
When programming CUDA or OpenCL on multi-GPU systems, programmers usually expect the GPUs on the same system to communicate quickly with each other. For instance, they hope a device memory copy from GPU1's memory to GPU2's memory can be done inside the graphics card, without employing the relatively slow PCIe bus. In this paper, we propose adding a multi-channel memory to the multi-GPU board that is used only for transferring data between different GPUs. This multi-channel memory should have multiple interfaces: one common interface shared by the different GPUs, connected to an FPGA arbitration circuit, and several other interfaces connected independently to the frame buffers of the dedicated GPUs. To distinguish it from the shared memory of a streaming multiprocessor, we call this memory Global Shared Memory. We analyze the expected performance improvement with this global shared memory for the case of accelerating algebraic reconstruction for computed tomography on multiple GPUs.
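For context, the baseline the authors aim to improve on is the standard CUDA path for moving data between two GPUs, which is staged over PCIe unless peer access is available. A minimal sketch of that conventional copy (the device IDs and 64 MiB payload are illustrative, not from the paper):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;             // 64 MiB payload (illustrative)
    float *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);                          // "GPU1" in the paper's terms
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);                          // "GPU2"
    cudaMalloc(&dst, bytes);

    // If the driver exposes peer access, the copy can bypass host memory;
    // otherwise cudaMemcpyPeer is staged through the PCIe bus via the host.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    cudaMemcpyPeer(dst, 1, src, 0, bytes);     // device 0 -> device 1 copy
    cudaDeviceSynchronize();
    printf("peer access: %s\n", canAccess ? "direct" : "staged via PCIe/host");

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```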
3

Palmer, Daniel A., Jill K. Thompson, Lie Li, Ashton Prat, and Ping Wang. "Gib2, A Novel Gβ-like/RACK1 Homolog, Functions as a Gβ Subunit in cAMP Signaling and Is Essential in Cryptococcus neoformans". Journal of Biological Chemistry 281, no. 43 (September 1, 2006): 32596–605. http://dx.doi.org/10.1074/jbc.m602768200.

Abstract:
Canonical G proteins are heterotrimeric, consisting of α, β, and γ subunits. Despite multiple Gα subunits functioning in fungi, only a single Gβ subunit per species has been identified, suggesting that non-conventional G protein signaling exists in this diverse group of eukaryotic organisms. Using the Gα subunit Gpa1 that functions in cAMP signaling as bait in a two-hybrid screen, we have identified a novel Gβ-like/RACK1 protein homolog, Gib2, from the human pathogenic fungus Cryptococcus neoformans. Gib2 contains a seven WD-40 repeat motif and is predicted to form a seven-bladed β propeller structure characteristic of β transducins. Gib2 is also shown to interact, respectively, with two Gγ subunit homologs, Gpg1 and Gpg2, similar to the conventional Gβ subunit Gpb1. In contrast to Gpb1 whose overexpression promotes mating response, overproduction of Gib2 suppresses defects of gpa1 mutation in both melanization and capsule formation, the phenotypes regulated by cAMP signaling and associated with virulence. Furthermore, depletion of Gib2 by antisense suppression results in a severe growth defect, suggesting that Gib2 is essential. Finally, Gib2 is shown to also physically interact with a downstream target of Gpa1-cAMP signaling, Smg1, and the protein kinase C homolog Pkc1, indicating that Gib2 is also a multifunctional RACK1-like protein.
4

Harashima, Toshiaki, and Joseph Heitman. "Gα Subunit Gpa2 Recruits Kelch Repeat Subunits That Inhibit Receptor-G Protein Coupling during cAMP-induced Dimorphic Transitions in Saccharomyces cerevisiae". Molecular Biology of the Cell 16, no. 10 (October 2005): 4557–71. http://dx.doi.org/10.1091/mbc.e05-05-0403.

Abstract:
All eukaryotic cells sense extracellular stimuli and activate intracellular signaling cascades via G protein-coupled receptors (GPCR) and associated heterotrimeric G proteins. The Saccharomyces cerevisiae GPCR Gpr1 and associated Gα subunit Gpa2 sense extracellular carbon sources (including glucose) to govern filamentous growth. In contrast to conventional Gα subunits, Gpa2 forms an atypical G protein complex with the kelch repeat Gβ mimic proteins Gpb1 and Gpb2. Gpb1/2 negatively regulate cAMP signaling by inhibiting Gpa2 and an as yet unidentified target. Here we show that Gpa2 requires lipid modifications of its N-terminus for membrane localization but association with the Gpr1 receptor or Gpb1/2 subunits is dispensable for membrane targeting. Instead, Gpa2 promotes membrane localization of its associated Gβ mimic subunit Gpb2. We also show that the Gpa2 N-terminus binds both to Gpb2 and to the C-terminal tail of the Gpr1 receptor and that Gpb1/2 binding interferes with Gpr1 receptor coupling to Gpa2. Our studies invoke novel mechanisms involving GPCR-G protein modules that may be conserved in multicellular eukaryotes.
5

Lai, Jianqi, Hua Li, Zhengyu Tian, and Ye Zhang. "A Multi-GPU Parallel Algorithm in Hypersonic Flow Computations". Mathematical Problems in Engineering 2019 (March 17, 2019): 1–15. http://dx.doi.org/10.1155/2019/2053156.

Abstract:
Computational fluid dynamics (CFD) plays an important role in the optimal design of aircraft and the analysis of complex flow mechanisms in the aerospace domain. The graphics processing unit (GPU) has a strong floating-point operation capability and a high memory bandwidth in data parallelism, which brings great opportunities for CFD. A cell-centred finite volume method is applied to solve the three-dimensional compressible Navier–Stokes equations on structured meshes, with an upwind AUSM+UP numerical scheme for space discretization and a four-stage Runge–Kutta method for time discretization. Compute unified device architecture (CUDA) is used as the parallel computing platform and programming model for GPUs, which reduces the complexity of programming. The main purpose of this paper is to design an extremely efficient multi-GPU parallel algorithm based on MPI+CUDA to study hypersonic flow characteristics. Solutions of hypersonic flow over an aerospace plane model are provided at different Mach numbers, and the agreement between numerical computations and experimental measurements is favourable. Acceleration performance of the parallel platform is studied with one, two, and four GPUs. For the single-GPU implementation, the speedup reaches 63 for the coarsest mesh and 78 for the finest mesh. GPUs are better suited for compute-intensive tasks than traditional CPUs. For multi-GPU parallelization, the speedup with four GPUs reaches 77 for the coarsest mesh and 147 for the finest mesh, far greater than the acceleration achieved by one or two GPUs. The multi-GPU parallel algorithm is therefore a promising approach for hypersonic flow computations.
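The abstract does not include code; as a rough illustration of the MPI+CUDA pattern it describes (one MPI rank driving one GPU, with a boundary exchange between steps), here is a minimal sketch. The kernel body, buffer sizes, and the four-GPUs-per-node assumption are placeholders, not the authors' solver:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical flow-field update standing in for the real flux computation.
__global__ void update_cells(float* q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] += 1.0f;                  // placeholder physics
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    cudaSetDevice(rank % 4);                  // one rank per GPU (4 GPUs/node assumed)

    const int n = 1 << 20, halo = 1024;
    float *q, *h_halo;
    cudaMalloc(&q, n * sizeof(float));
    cudaMallocHost(&h_halo, halo * sizeof(float));   // pinned staging buffer

    for (int step = 0; step < 100; ++step) {
        update_cells<<<(n + 255) / 256, 256>>>(q, n);
        cudaMemcpy(h_halo, q, halo * sizeof(float), cudaMemcpyDeviceToHost);
        // Pairwise boundary exchange between neighbouring ranks.
        int partner = rank ^ 1;
        if (partner < nranks)
            MPI_Sendrecv_replace(h_halo, halo, MPI_FLOAT, partner, 0,
                                 partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(q, h_halo, halo * sizeof(float), cudaMemcpyHostToDevice);
    }

    cudaFree(q); cudaFreeHost(h_halo);
    MPI_Finalize();
    return 0;
}
```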
6

Wang, Ping, John R. Perfect, and Joseph Heitman. "The G-Protein β Subunit GPB1 Is Required for Mating and Haploid Fruiting in Cryptococcus neoformans". Molecular and Cellular Biology 20, no. 1 (January 1, 2000): 352–62. http://dx.doi.org/10.1128/mcb.20.1.352-362.2000.

Abstract:
Cryptococcus neoformans is an opportunistic fungal pathogen with a defined sexual cycle. The gene encoding a heterotrimeric G-protein β subunit, GPB1, was cloned and disrupted. gpb1 mutant strains are sterile, indicating a role for this gene in mating. GPB1 plays an active role in mediating responses to pheromones in early mating steps (conjugation tube formation and cell fusion) and signals via a mitogen-activated protein (MAP) kinase cascade in both MATα and MATa cells. The functions of GPB1 are distinct from those of the Gα protein GPA1, which functions in a nutrient-sensing cyclic AMP (cAMP) pathway required for mating, virulence factor induction, and virulence. gpb1 mutant strains are also defective in monokaryotic fruiting in response to nitrogen starvation. We show that MATa cells stimulate monokaryotic fruiting of MATα cells, possibly in response to mating pheromone, which may serve to disperse cells and spores to locate mating partners. In summary, the Gβ subunit GPB1 and the Gα subunit GPA1 function in distinct signaling pathways: one (GPB1) senses pheromones and regulates mating and haploid fruiting via a MAP kinase cascade, and the other (GPA1) senses nutrients and regulates mating, virulence factors, and pathogenicity via a cAMP cascade.
7

Zhou, Chao, and Tao Zhang. "High Performance Graph Data Imputation on Multiple GPUs". Future Internet 13, no. 2 (January 31, 2021): 36. http://dx.doi.org/10.3390/fi13020036.

Abstract:
In real applications, massive data with graph structures are often incomplete due to various restrictions, so graph data imputation algorithms have been widely used in the fields of social networks, sensor networks, and MRI to solve the graph data completion problem. To keep the data relevant, the data structure is represented by a graph-tensor, in which each matrix is the vertex value of a weighted graph. The convolutional imputation algorithm has been proposed to solve the low-rank graph-tensor completion problem in which some data matrices are entirely unobserved. However, this data imputation algorithm has limited application scope because it is compute-intensive and performs poorly on CPUs. In this paper, we propose a scheme to run the convolutional imputation algorithm with higher performance on GPUs (graphics processing units) by exploiting the many-core CUDA architecture. We propose optimization strategies to achieve coalesced memory access for graph Fourier transform (GFT) computation and to improve the utilization of GPU SM resources for singular value decomposition (SVD) computation. Furthermore, we design a scheme that extends the GPU-optimized implementation to multiple GPUs for large-scale computing. Experimental results show that the GPU implementation is both fast and accurate. On synthetic data of varying sizes, the GPU-optimized implementation running on a single Quadro RTX6000 GPU achieves up to 60.50× speedup over the GPU-baseline implementation. The multi-GPU implementation achieves up to 1.81× speedup on two GPUs versus the GPU-optimized implementation on a single GPU. On the ego-Facebook dataset, the GPU-optimized implementation achieves up to 77.88× speedup over the GPU-baseline implementation. Meanwhile, the GPU and CPU implementations achieve similarly low recovery errors.
8

MITTAL, SPARSH. "A SURVEY OF TECHNIQUES FOR MANAGING AND LEVERAGING CACHES IN GPUs". Journal of Circuits, Systems and Computers 23, no. 08 (June 18, 2014): 1430002. http://dx.doi.org/10.1142/s0218126614300025.

Abstract:
Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of the GPU and the rise of CPU–GPU heterogeneous computing, demand effective cache management to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey architectural and system-level techniques proposed for managing and leveraging GPU caches, and we discuss the importance and challenges of cache management in GPUs. The aim of this paper is to give readers insight into cache management techniques for GPUs and to motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.
9

Oden, Lena, and Holger Fröning. "InfiniBand Verbs on GPU: a case study of controlling an InfiniBand network device from the GPU". International Journal of High Performance Computing Applications 31, no. 4 (June 25, 2015): 274–84. http://dx.doi.org/10.1177/1094342015588142.

Abstract:
Due to their massive parallelism and high performance per Watt, GPUs have gained high popularity in high-performance computing and are a strong candidate for future exascale systems. But communication and data transfer in GPU-accelerated systems remain a challenging problem. Since the GPU normally is not able to control a network device, a hybrid-programming model is preferred whereby the GPU is used for calculation and the CPU handles the communication. As a result, communication between distributed GPUs suffers from unnecessary overhead, introduced by switching control flow from GPUs to CPUs and vice versa. Furthermore, often a designated CPU thread is required to control GPU-related communication. In this work, we modify user space libraries and device drivers of GPUs and the InfiniBand network device in a way to enable the GPU to control an InfiniBand network device to independently source and sink communication requests without any involvement of the CPU. Our results show that complex networking protocols such as InfiniBand Verbs are better handled by CPUs, since overhead of work request generation cannot be parallelized and is not suitable for the highly parallel programming model of GPUs. The massive number of instructions and accesses to host memory that is required to source and sink a communication request on the GPU slows down the performance. Only through a massive reduction in the complexity of the InfiniBand protocol can some performance improvements be achieved.
10

Gaurav, and Steven F. Wojtkiewicz. "Use of GPU Computing for Uncertainty Quantification in Computational Mechanics: A Case Study". Scientific Programming 19, no. 4 (2011): 199–212. http://dx.doi.org/10.1155/2011/730213.

Abstract:
Graphics processing units (GPUs) are rapidly emerging as a more economical and highly competitive alternative to CPU-based parallel computing. As the degree of software control of GPUs has increased, many researchers have explored their use in non-gaming applications. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing alternatives in single-instruction multiple-data (SIMD) strategies. This study explores the use of GPUs for uncertainty quantification in computational mechanics. Five types of analysis procedures that are frequently utilized for uncertainty quantification of mechanical and dynamical systems have been considered and their GPU implementations have been developed. The numerical examples presented in this study show that considerable gains in computational efficiency can be obtained for these procedures. It is expected that the GPU implementations presented in this study will serve as initial bases for further developments in the use of GPUs in the field of uncertainty quantification and will (i) aid the understanding of the performance constraints on the relevant GPU kernels and (ii) provide some guidance regarding the computational and the data structures to be utilized in these novel GPU implementations.
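The five analysis procedures are not spelled out in the abstract; the following is a generic sketch of the SIMD-style Monte Carlo sampling that underlies such uncertainty quantification kernels, using cuRAND. The response function g(x) = x² is a toy stand-in for a mechanical system:

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

// Each thread draws one random input realization and evaluates the model.
__global__ void mc_samples(float* out, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState s;
    curand_init(seed, i, 0, &s);      // independent random stream per thread
    float x = curand_normal(&s);      // standard normal input sample
    out[i] = x * x;                   // hypothetical response function g(x)
}

int main() {
    const int n = 1 << 20;
    float* d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    mc_samples<<<(n + 255) / 256, 256>>>(d_out, n, 1234ULL);

    // Reduce on the host for brevity; a real code would reduce on the GPU.
    float* h = new float[n];
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    double mean = 0;
    for (int i = 0; i < n; ++i) mean += h[i];
    printf("E[g(X)] ~= %f (expected 1.0)\n", mean / n);
    delete[] h;
    cudaFree(d_out);
    return 0;
}
```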
11

Fortin, Pierre, and Maxime Touche. "Dual tree traversal on integrated GPUs for astrophysical N-body simulations". International Journal of High Performance Computing Applications 33, no. 5 (April 15, 2019): 960–72. http://dx.doi.org/10.1177/1094342019840806.

Abstract:
In astrophysical N-body simulations, O(N) fast multipole methods (FMMs) with dual tree traversal (DTT) on multi-core CPUs are faster than O(N log N) CPU tree-codes but can still be outperformed by GPU ones. In this article, we aim at combining the best algorithm, namely FMM with DTT, with the most powerful hardware currently available, namely GPUs. In the astrophysical context, which requires low accuracies and non-uniform particle distributions, we show that such a combination can be achieved thanks to a hybrid CPU-GPU algorithm on integrated GPUs: while the DTT is performed on the CPU cores, the far- and near-field computations are all performed on the GPU cores. We show how to efficiently expose the interactions resulting from the DTT to the GPU cores, how to deploy both the far- and near-field computations on the GPU, and how to overlap the parallel DTT on the CPU with GPU computations. Based on the falcON code and using OpenCL on AMD Accelerated Processing Units and on Intel integrated GPUs, this first heterogeneous deployment of DTT for FMM outperforms standard multi-core CPUs and matches GPU and high-end CPU performance, hence being more cost- and power-efficient.
12

Chen, Dong, Hua You Su, Wen Mei, Li Xuan Wang, and Chun Yuan Zhang. "Scalable Parallel Motion Estimation on Muti-GPU System". Applied Mechanics and Materials 347-350 (August 2013): 3708–14. http://dx.doi.org/10.4028/www.scientific.net/amm.347-350.3708.

Abstract:
With NVIDIA's parallel computing architecture CUDA, using GPUs to speed up compute-intensive applications has become a research focus in recent years. In this paper, we propose a scalable method for multi-GPU systems to accelerate the motion estimation algorithm, which is the most time-consuming process in video encoding. Based on an analysis of data dependency and the multi-GPU architecture, a parallel computing model and a communication model are designed. We tested our parallel algorithm, analyzed its performance with 10 standard video sequences in different resolutions using 4 NVIDIA GTX460 GPUs, and calculated the overall speedup. Our results show a speedup of 36.1 times using 1 GPU and more than 120 times using 4 GPUs on 1920x1080 sequences. Further, our parallel algorithm demonstrates the potential of nearly linear speedup with the number of GPUs in the system.
13

Esseissah, Mohamed S., Ashraf Bhery, Sameh S. Daoud, and Hatem M. Bahig. "Three Strategies for Improving Shortest Vector Enumeration Using GPUs". Scientific Programming 2021 (January 5, 2021): 1–13. http://dx.doi.org/10.1155/2021/8852497.

Abstract:
Hard lattice problems are assumed to be among the most promising problems for building cryptosystems that remain secure against quantum computing. The shortest vector problem (SVP) is one of the most famous lattice problems. In this paper, we present three improvements to GPU-based parallel algorithms for solving the SVP using classical and pruned enumeration. Two improvements concern preprocessing: we use a combination of randomization and the Gaussian heuristic to obtain a better basis that leads more rapidly to a shortest vector, and we choose the level at which data exchange between CPU and GPU is optimized. In the third improvement, we improve the GPU-based implementation by generating some points on the GPU rather than on the CPU. We used NVIDIA GeForce GTX 1060 6G GPUs and achieved a significant improvement upon Hermans's implementation: our improvements speed up the pruned enumeration by a factor of almost 2.5 using a single GPU. Additionally, we provide a multi-GPU implementation using two GPUs. The results show that our enumeration algorithm is scalable, since the speedups achieved with two GPUs are faster than Hermans's implementation by a factor of almost 5. The improvements also provide a high speedup for classical enumeration: on a challenge of dimension 60, our improvements with two GPUs are almost twice as fast as Correia's parallel implementation on a dual-socket machine with 16 physical cores and simultaneous multithreading.
14

Kurniawan, Kwek Benny, and YB Dwi Setianto. "CPU AND GPU PERFORMANCE ANALYSIS ON 2D MATRIX OPERATION". Proxies : Jurnal Informatika 2, no. 1 (March 4, 2021): 1. http://dx.doi.org/10.24167/proxies.v2i1.3194.

Abstract:
The GPU, or graphics processing unit, can be used on many platforms. In general, GPUs are used for rendering graphics, but they have become general-purpose parallel processors with support for easily accessible programming interfaces and industry-standard languages such as C, Python, and Fortran. In this study, the authors compare the CPU and the GPU on a set of matrix calculations. To do so, they measured processing unit usage, memory usage, and computing time for matrix calculations while varying matrix sizes and dimensions. The results show that the asynchronous GPU version is faster than the sequential one. Furthermore, the GPU thread configuration needs to be tuned to achieve an efficient GPU load.
15

DEHNE, FRANK, and HAMIDREZA ZABOLI. "DETERMINISTIC SAMPLE SORT FOR GPUS". Parallel Processing Letters 22, no. 03 (July 8, 2012): 1250008. http://dx.doi.org/10.1142/s0129626412500089.

Abstract:
We demonstrate that parallel deterministic sample sort for many-core GPUs (GPU BUCKET SORT) is not only considerably faster than the best comparison-based sorting algorithm for GPUs (THRUST MERGE [Satish et al., Proc. IPDPS 2009]) but also as fast as randomized sample sort for GPUs (GPU SAMPLE SORT [Leischner et al., Proc. IPDPS 2010]). However, deterministic sample sort has the advantage that bucket sizes are guaranteed, and therefore its running time does not show the input-data-dependent fluctuations that can occur with randomized sample sort.
16

He, Guixia, and Jiaquan Gao. "A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs". Mathematical Problems in Engineering 2016 (2016): 1–12. http://dx.doi.org/10.1155/2016/8471283.

Abstract:
Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computation. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs), for example CSR-scalar and CSR-vector, usually performs poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR fully outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not the communication between GPUs is considered, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
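For context, the CSR-scalar baseline that PCSR improves on assigns one thread per row, which is what produces the irregular, rarely coalesced loads the abstract mentions. A minimal sketch of that baseline (not the paper's PCSR kernels):

```cuda
#include <cuda_runtime.h>

// CSR-scalar SpMV: one thread per row. Threads in a warp read row_ptr/cols
// at unrelated offsets, so global loads are poorly coalesced -- the weakness
// that motivates PCSR's fully coalesced two-kernel design.
__global__ void spmv_csr_scalar(int nrows, const int* __restrict__ row_ptr,
                                const int* __restrict__ cols,
                                const float* __restrict__ vals,
                                const float* __restrict__ x,
                                float* __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    float sum = 0.0f;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += vals[j] * x[cols[j]];
    y[row] = sum;
}
```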
17

Lin, Yu-Shiang, Chun-Yuan Lin, Hsiao-Chieh Chi, and Yeh-Ching Chung. "Multiple Sequence Alignments with Regular Expression Constraints on a Cloud Service System". International Journal of Grid and High Performance Computing 5, no. 3 (July 2013): 55–64. http://dx.doi.org/10.4018/jghpc.2013070105.

Abstract:
Multiple sequence alignment with constraints is a major concern in computational biology. Constrained sequence alignment incorporates the domain knowledge of biologists into sequence alignments so that user-specified residues or segments are aligned together in the alignment results. A series of constrained multiple sequence alignment tools have been developed in the literature over the past decade. GPU-REMuSiC, which uses graphics processing units (GPUs) with CUDA, is the most advanced method supporting regular expression constraints and can achieve a speedup of 29x in overall computation time based on the experimental results. However, the execution environment of GPU-REMuSiC must first be constructed, which is a hurdle for biologists. Therefore, this work presents an intuitive, user-friendly interface to GPU-REMuSiC for a potential cloud server with GPUs, called Cloud GPU-REMuSiC. Implementing the user interface via a network allows the input data to be transmitted to a remote server without complex, cumbersome setup on the local host. Finally, the alignment results can be obtained from the remote cloud server with GPUs. Cloud GPU-REMuSiC is highly promising as an online application that is accessible without time or location constraints.
18

Venstad, Jon Marius. "Industry-scale finite-difference elastic wave modeling on graphics processing units using the out-of-core technique". GEOPHYSICS 81, no. 2 (March 1, 2016): T35–T43. http://dx.doi.org/10.1190/geo2015-0267.1.

Abstract:
The difference in computational power between the few- and multicore architectures represented by central processing units (CPUs) and graphics processing units (GPUs) is significant today, and this difference is likely to increase in the years ahead. GPUs are, therefore, ever more popular for applications in computational physics, such as wave modeling. Finite-difference methods are popular for wave modeling and are well suited for the GPU architecture, but developing an efficient and capable GPU implementation is hindered by the limited size of the GPU memory. I revealed how the out-of-core technique can be used to circumvent the memory limit on the GPU, increasing the available memory to that of the CPU (the main memory) instead, with no significant computational overhead. This approach has several advantages over a parallel scheme in terms of applicability, flexibility, and hardware requirements. Choices in the numerical scheme — the numerical differentiators in particular — also greatly affect computational efficiency. These factors are considered explicitly for GPU implementations of wave modeling because GPUs are special purpose with a visible architecture.
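A minimal sketch of the general out-of-core idea, streaming a domain larger than GPU memory through a small double-buffered device buffer. This is a generic illustration, not Venstad's scheme; it assumes the host array was allocated with cudaMallocHost so the copies are truly asynchronous, and it ignores the halo handling a real stencil sweep needs at chunk boundaries:

```cuda
#include <cuda_runtime.h>

__global__ void stencil_chunk(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)  // toy 3-point smoothing stencil
        buf[i] = 0.5f * buf[i] + 0.25f * (buf[i - 1] + buf[i + 1]);
}

// Stream the domain through two device buffers, overlapping the transfer of
// chunk k+1 with compute on chunk k by alternating between two streams.
void out_of_core_sweep(float* host_domain, size_t total, size_t chunk) {
    float* d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (size_t off = 0, k = 0; off < total; off += chunk, ++k) {
        int b = k & 1;                       // ping-pong between the buffers
        size_t n = (off + chunk <= total) ? chunk : total - off;
        cudaMemcpyAsync(d_buf[b], host_domain + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        stencil_chunk<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], (int)n);
        cudaMemcpyAsync(host_domain + off, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(s[b]);
        cudaStreamDestroy(s[b]);
        cudaFree(d_buf[b]);
    }
}
```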
19

Abdelaal, Yara M., M. Fayez, Samy Ghoniemy, Ehab Abozinadah, and H. M. Faheem. "Performance Tuning Techniques for Face Detection Algorithms on GPGPU". International Journal of Innovative Technology and Exploring Engineering 10, no. 2 (December 10, 2020): 103–8. http://dx.doi.org/10.35940/ijitee.b8234.1210220.

Abstract:
Face detection algorithms vary in speed and performance on GPUs. Different algorithms can achieve different speeds on different GPUs, and these differences are not governed by linear or near-linear approximations. This is due to many factors, such as register file size, GPU occupancy rate, memory speed, and the speed of the double-precision processors. This paper studies the most common face detection algorithms, LBP and Haar-like features, and the bottlenecks associated with deploying both algorithms on different GPU architectures. The study focuses on the bottlenecks and the techniques to resolve them based on the specifications of the different GPUs.
20

Aubert, Dominique. "Numerical Cosmology powered by GPUs". Proceedings of the International Astronomical Union 6, S270 (May 2010): 397–400. http://dx.doi.org/10.1017/s1743921311000706.

Abstract:
Graphics Processing Units (GPUs) offer a new way to accelerate numerical calculations by means of on-board massive parallelisation. We discuss two examples of GPU implementations relevant for cosmological simulations, an N-body particle-mesh solver and a radiative transfer code. The latter has also been ported to multi-GPU clusters. The range of acceleration achieved here (x30-x80) offers bright prospects for large-scale simulations driven by GPUs.
21

Tasoulas, Zois-Gerasimos, and Iraklis Anagnostopoulos. "Improving GPU Performance with a Power-Aware Streaming Multiprocessor Allocation Methodology". Electronics 8, no. 12 (December 1, 2019): 1451. http://dx.doi.org/10.3390/electronics8121451.

Abstract:
Graphics processing units (GPUs) are extensively used as accelerators across multiple application domains, ranging from general purpose applications to neural networks, and cryptocurrency mining. The initial utilization paradigm for GPUs was one application accessing all the resources of the GPU. In recent years, time sharing is broadly used among applications of a GPU, nevertheless, spatial sharing is not fully explored. When concurrent applications share the computational resources of a GPU, performance can be improved by eliminating idle resources. Additionally, the incorporation of GPUs in embedded and mobile devices increases the demand for power efficient computation due to battery limitations. In this article, we present an allocation methodology for streaming multiprocessors (SMs). The presented methodology works for two concurrent applications on a GPU and determines an allocation scheme that will provide power efficient application execution, combined with improved GPU performance. Experimental results show that the developed methodology yields higher throughput while achieving improved power efficiency, compared to other SM power-aware and performance-aware policies. If the presented methodology is adopted, it will lead to higher performance of applications that are concurrently executing on a GPU. This will lead to a faster and more efficient acceleration of execution, even for devices with restrained energy sources.
22

Lai, Jianqi, Hang Yu, Zhengyu Tian, and Hua Li. "Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters". Scientific Programming 2020 (September 25, 2020): 1–15. http://dx.doi.org/10.1155/2020/8862123.

Abstract:
Graphics processing units (GPUs) have a strong floating-point capability and a high memory bandwidth in data parallelism and have been widely used in high-performance computing (HPC). Compute unified device architecture (CUDA) is used as a parallel computing platform and programming model for the GPU to reduce the complexity of programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm combining the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+UP upwind scheme and the three-step Runge–Kutta method are used for spatial and time discretization, respectively. The turbulent solution is solved by the k–ω SST two-equation model. The CPU only manages the execution of the GPU and communication, while the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a nonblocking communication method that fully overlaps GPU computing, CPU–CPU communication, and CPU–GPU data transfer by creating two CUDA streams. Furthermore, the one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm on the compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and that the parallel algorithm has good scalability.
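A compressed sketch of the two-stream overlap described in the abstract, with the interior update running while the boundary halo is exchanged over MPI. Kernel and variable names are illustrative, not from the paper, and the field layout (halo cells at the front of the array) is an assumption:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder interior update; it skips the first halo_n cells so it cannot
// race with the halo transfer running concurrently on the other stream.
__global__ void interior_kernel(float* f, int n, int halo_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= halo_n && i < n) f[i] *= 0.99f;
}

// One time step: the halo moves on stream `comm` while the interior update
// runs on stream `comp`, and the MPI exchange proceeds during the kernel.
void step(float* d_field, int n, int halo_n, float* h_halo, float* h_recv,
          int nbr, cudaStream_t comp, cudaStream_t comm) {
    cudaMemcpyAsync(h_halo, d_field, halo_n * sizeof(float),
                    cudaMemcpyDeviceToHost, comm);          // halo off the GPU
    interior_kernel<<<(n + 255) / 256, 256, 0, comp>>>(d_field, n, halo_n);
    cudaStreamSynchronize(comm);                            // halo ready for MPI
    MPI_Sendrecv(h_halo, halo_n, MPI_FLOAT, nbr, 0,
                 h_recv, halo_n, MPI_FLOAT, nbr, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_field, h_recv, halo_n * sizeof(float),
                    cudaMemcpyHostToDevice, comm);          // received halo back
    cudaStreamSynchronize(comp);
    cudaStreamSynchronize(comm);
}
```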
23

AHMED, IFTIKHAR, RICK SIOW MONG GOH, ENG HUAT KHOO, KIM HUAT LEE, SIAW KIAN ZHONG, ERPING LI, and TERENCE HUNG. "IMPLEMENTATION OF THE LORENTZ–DRUDE MODEL INCORPORATED FDTD METHOD ON MULTIPLE GPUs FOR PLASMONICS APPLICATIONS". International Journal of Computational Methods 11, no. 04 (August 2014): 1350063. http://dx.doi.org/10.1142/s0219876213500631.

Abstract:
Maxwell's equations incorporating the Lorentz–Drude model are simulated using the three-dimensional finite difference time domain (FDTD) method, and the method is parallelized on multiple graphics processing units (GPUs) for plasmonics applications. The compute unified device architecture (CUDA) is used for GPU parallelization. The Lorentz–Drude (LD) model is used to simulate the dispersive nature of materials in the plasmonics domain, and the auxiliary differential equation (ADE) approach is used to make it consistent with the time domain Maxwell equations. Different aspects of multiple GPUs for the FDTD method are presented, such as a comparison of different numbers of GPUs, the transfer time between them, and synchronous and asynchronous passing. It is shown that by using multiple GPUs in parallel, a significant reduction in simulation time can be achieved compared to a single GPU.
24

Masek, Jan, Radim Burget, Lukas Povoda, and Malay Kishore Dutta. "Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL". International Journal of Advances in Telecommunications, Electrotechnics, Signals and Systems 5, no. 2 (June 10, 2016): 101. http://dx.doi.org/10.11601/ijates.v5i2.142.

Abstract:
Using modern graphics processing units (GPUs) has become very useful for computing complex and time-consuming processes, as GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-nearest neighbor (k-NN) algorithm and compares the performance of the two implementations, each of which is suitable for a different number of attributes. The proposed CUDA algorithm achieves an acceleration of up to 880x in comparison with a single-threaded CPU version. The common k-NN was modified to be faster when a lower number of k neighbors is set. The performance of the algorithm was verified with two dual-GPU NVIDIA GeForce GTX 690 cards and an Intel Core i7 3770 CPU at 4.1 GHz. The speedup was measured for one, two, three, and four GPUs. We performed several tests with data sets containing up to 4 million elements with various numbers of attributes.
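As a rough illustration of the distance stage that dominates brute-force k-NN on GPUs (the selection of the k smallest distances is omitted, and all names and dimensions are illustrative, not the paper's implementation):

```cuda
#include <cuda_runtime.h>

// Brute-force k-NN building block: one thread per reference point computes
// its squared Euclidean distance to a single query; the k smallest distances
// are then selected in a separate step (not shown).
__global__ void knn_distances(const float* __restrict__ refs, int n, int dim,
                              const float* __restrict__ query,
                              float* __restrict__ dist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float d = 0.0f;
    for (int k = 0; k < dim; ++k) {
        float diff = refs[i * dim + k] - query[k];
        d += diff * diff;
    }
    dist[i] = d;
}
```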
25

Traynor, Daniel, and Terry Froy. "Provision and use of GPU resources for distributed workloads via the Grid". EPJ Web of Conferences 245 (2020): 03002. http://dx.doi.org/10.1051/epjconf/202024503002.

Abstract:
The Queen Mary University of London WLCG Tier-2 Grid site has been providing GPU resources on the Grid since 2016. GPUs are an important modern tool to assist in data analysis. They have historically been used to accelerate computationally expensive but parallelisable workloads using frameworks such as OpenCL and CUDA. More recently, however, their power in accelerating machine learning, using libraries such as TensorFlow and Caffe, has come to the fore, and the demand for GPU resources has increased. Significant effort is being spent in high energy physics to investigate and use machine learning to enhance the analysis of data. GPUs may also provide part of the solution to the compute challenge of the High Luminosity LHC. The motivation for providing GPU resources via the Grid is presented. The installation and configuration of the SLURM batch system together with Compute Elements (CREAM and ARC) for use with GPUs is shown. Real-world use cases are presented, and the successes and issues discovered are discussed.
26

Krasznahorkay, Attila, Charles Leggett, Alaettin Serhan Mete, Scott Snyder, and Vakho Tsulaia. "GPU Usage in ATLAS Reconstruction and Analysis". EPJ Web of Conferences 245 (2020): 05006. http://dx.doi.org/10.1051/epjconf/202024505006.

Abstract:
With graphical processing units (GPUs) and other kinds of accelerators becoming ever more accessible, and High Performance Computing centres around the world using them ever more, ATLAS has to find the best way of making use of such accelerators in much of its computing. Tests with GPUs, mainly with CUDA, have been performed in the past in the experiment. At that time the conclusion was that it was not advantageous for the ATLAS offline and trigger software to invest time and money in GPUs. However, as the usage of accelerators has become cheaper and simpler in recent years, their re-evaluation in ATLAS's offline software is warranted. We show new results of using GPU-accelerated calculations in ATLAS's offline software environment using the ATLAS offline/analysis (xAOD) Event Data Model. We compare the performance and flexibility of a couple of the available GPU programming methods and show how different memory management setups affect our ability to offload different types of calculations to a GPU efficiently.
27

Jiang, Ronglin, Shugang Jiang, Yu Zhang, Ying Xu, Lei Xu, and Dandan Zhang. "GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform". International Journal of Antennas and Propagation 2014 (2014): 1–8. http://dx.doi.org/10.1155/2014/321081.

Abstract:
This paper introduces a finite difference time domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, with parallelization via the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both central processing unit (CPU) and graphics processing unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure-GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure-CPU, pure-GPU, and CPU + GPU tests. Relative to pure-CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to pure-GPU calculations for the same problems, the CPU + GPU calculations show a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small; however, this code can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure-GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those from MoM. Results show good agreement between them.
28

INO, FUMIHIKO, YUKI KOTANI, YUMA MUNEKAWA, and KENICHI HAGIHARA. "HARNESSING THE POWER OF IDLE GPUS FOR ACCELERATION OF BIOLOGICAL SEQUENCE ALIGNMENT". Parallel Processing Letters 19, no. 04 (December 2009): 513–33. http://dx.doi.org/10.1142/s0129626409000390.

Abstract:
This paper presents a parallel system capable of accelerating biological sequence alignment on a graphics processing unit (GPU) grid. The GPU grid in this paper is a desktop grid system that utilizes idle GPUs and CPUs in the office and home. Our parallel implementation employs a master-worker paradigm to accelerate an OpenGL-based algorithm that runs on a single GPU. We integrate this implementation into a screensaver-based grid system that detects idle resources on which the alignment code can run. We also show some experimental results comparing our implementation with three different implementations running on a single GPU, a single CPU, or multiple CPUs. As a result, we find that a single non-dedicated GPU can provide almost the same throughput as two dedicated CPUs in our laboratory environment, where GPU-equipped machines are ordinarily used to develop GPU applications. In a dedicated environment, the GPU-accelerated code achieves five times higher throughput than the CPU-based code. Furthermore, a linear speedup of 30.7X is observed on a 32-node cluster of dedicated GPUs. We also implement a compute unified device architecture (CUDA) based algorithm to demonstrate further acceleration.
29

JIANG, CHAO, HENG HE, PENGCHENG LI, and QINGMING LUO. "GRAPHICS PROCESSING UNIT CLUSTER ACCELERATED MONTE CARLO SIMULATION OF PHOTON TRANSPORT IN MULTI-LAYERED TISSUES". Journal of Innovative Optical Health Sciences 05, no. 02 (April 2012): 1250004. http://dx.doi.org/10.1142/s1793545812500046.

Abstract:
We present a graphics processing unit (GPU) cluster-based Monte Carlo simulation of photon transport in multi-layered tissues. The cluster is composed of multiple computing nodes in a local area network, where each node is a personal computer equipped with one or several GPUs for parallel computing. In this study, the MPI (Message Passing Interface), OpenMP (Open Multi-Processing), and CUDA (Compute Unified Device Architecture) technologies are employed to develop the program. We demonstrate that this design runs roughly N times faster than a single GPU when the GPUs within the cluster are of the same type, where N is the total number of GPUs within the cluster.
30

Verdesca, Marlo, Jaeson Munro, Michael Hoffman, Maria Bauer, and Dinesh Manocha. "Using Graphics Processor Units to Accelerate OneSAF: A Case Study in Technology Transition". Journal of Defense Modeling and Simulation: Applications, Methodology, Technology 3, no. 3 (July 2006): 177–87. http://dx.doi.org/10.1177/154851290600300305.

Abstract:
Ongoing research aims to accelerate the runtime processing speed of the One Semi-Automated Forces (OneSAF) Computer Generated Forces (CGF) simulation by converting and migrating some of the core algorithms from the host central processing unit (CPU) to an onboard auxiliary graphics processing unit (GPU). In this research, the GPU chip is regarded as a surrogate stream processor, and appropriate algorithms are designed to map to the GPU architecture. Processing speed gains are realized both through the computational capabilities of the GPU and through offloading of the host CPU. Technology transfer into the OneSAF baseline is a key requirement of this research. The OneSAF development program focuses on the same issues of scalability and runtime performance that will be directly affected by the use of GPUs. As program architects marshal conventional approaches for resolving these challenges, the introduction of GPU-based solutions is being realized. This paper examines the challenges, planned approaches, and benchmarked results for using GPUs to accelerate the OneSAF simulation.
31

Kang, Jihun, and Heonchang Yu. "GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments". Symmetry 13, no. 3 (March 20, 2021): 508. http://dx.doi.org/10.3390/sym13030508.

Abstract:
In remote procedure call (RPC)-based graphics processing unit (GPU) virtualization environments, GPU tasks requested by multiple user virtual machines (VMs) are delivered to the VM owning the GPU and are processed in a multi-process form. However, because the thread executing the computation on a general GPU cannot arbitrarily stop the task or trigger context switching, GPU monopolization may be prolonged by a long-running general-purpose computing on graphics processing units (GPGPU) task. Furthermore, when scheduling tasks on the GPU, the time for which each user VM uses the GPU is not considered. Thus, in cloud environments that must provide fair use of computing resources, equal use of GPUs between user VMs cannot be guaranteed. We propose a GPGPU task scheduling scheme based on thread division processing that supports even GPU use by multiple VMs processing GPGPU tasks in an RPC-based GPU virtualization environment. Our method divides the threads of a GPGPU task into several groups and controls the execution time of each thread group to prevent a specific GPGPU task from monopolizing the GPU for a long time. The efficiency of the proposed technique is verified through an experiment in an environment where multiple VMs simultaneously perform GPGPU tasks.
32

Xu, S., X. Huang, Y. Zhang, H. Fu, L. Y. Oey, F. Xu, and G. Yang. "gpuPOM: a GPU-based Princeton Ocean Model". Geoscientific Model Development Discussions 7, no. 6 (November 17, 2014): 7651–91. http://dx.doi.org/10.5194/gmdd-7-7651-2014.

Abstract:
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration work for climate models ports only certain hot spots and can achieve only limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point and design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We apply several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access on a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the central processing unit (CPU) and the GPU. Our experimental results indicate that the performance of gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores while reducing energy consumption by a factor of 6.8.
33

Lin, Chun-Yuan, Wei Sheng Lee, and Chuan Yi Tang. "Parallel Shellsort Algorithm for Many-Core GPUs with CUDA". International Journal of Grid and High Performance Computing 4, no. 2 (April 2012): 1–16. http://dx.doi.org/10.4018/jghpc.2012040101.

Abstract:
Sorting is a classic algorithmic problem, and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP radix sort is the most efficient sort on GPUs, and GPU sample sort is the best comparison-based sort. Although the implementations of these algorithms are efficient, they either need extra space for data rearrangement or rely on atomic operations for acceleration. Sorting applications usually deal with large amounts of data, so memory utilization is an important consideration. Furthermore, sorting algorithms on GPUs without atomic operation support can suffer performance degradation or fail to work. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, CUDA shellsort is nearly twice as fast as GPU quicksort and 37% faster than Thrust mergesort under a uniform distribution. Moreover, its performance matches GPU sample sort up to 32 million data elements while needing only constant space. CUDA shellsort is also robust over various data distributions and could be suitable for other many-core architectures.
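The core idea of a parallel shellsort maps naturally onto CUDA: for a gap h, the h interleaved chains are disjoint, so one thread can insertion-sort each chain in place without synchronization or extra space. A minimal sketch of that idea (not the paper's tuned implementation):

```cuda
#include <cuda_runtime.h>

// One shellsort pass with gap h: thread t insertion-sorts the chain
// a[t], a[t+h], a[t+2h], ...  Chains are disjoint, so no synchronization
// or auxiliary array is needed -- the in-place property the paper highlights.
__global__ void shell_pass(float* a, int n, int h) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= h) return;
    for (int i = t + h; i < n; i += h) {
        float v = a[i];
        int j = i - h;
        while (j >= t && a[j] > v) { a[j + h] = a[j]; j -= h; }
        a[j + h] = v;
    }
}

// Host driver: halve the gap toward 1. The h = 1 pass degenerates to a
// single-thread insertion sort, so a production code (as in CUDA shellsort)
// switches strategy for small gaps -- this sketch only shows the core idea.
void shellsort_gpu(float* d_a, int n) {
    for (int h = n / 2; h >= 1; h /= 2)
        shell_pass<<<(h + 255) / 256, 256>>>(d_a, n, h);
    cudaDeviceSynchronize();
}
```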
34

Dressler, Sven, and Daniel N. Wilke. "PyBONDEM-GPU: A discrete element bonded particle Python research framework – Development and examples". EPJ Web of Conferences 249 (2021): 14009. http://dx.doi.org/10.1051/epjconf/202124914009.

Abstract:
Discrete element modelling (DEM) is widely used to simulate granular systems, nowadays routinely on graphics processing units (GPUs). GPUs are inherently designed for parallel computation, and recent advances in architecture, compiler design, and language development allow general-purpose computation to be performed on multiple GPUs. Application of DEM to bonded particle systems is much less common, with a number of open research questions remaining. This study outlines a bonded-particle research DEM framework, PyBONDEM-GPU, written in Python. The framework leverages the parallel nature of GPUs for computational speed-up and the rapid prototyping flexibility of Python. Python is faster and easier to learn than classical compiled languages, making computational simulation development accessible to undergraduate and graduate engineers. PyBONDEM-GPU leverages the Numba-CUDA module to compile Python syntax for execution on GPUs. The framework enables research on fibre pull-out from fibre-matrix embeddings. Bonds are simulated between all interacting particles. The performance of PyBONDEM-GPU was compared against Python CPU implementations of PyBONDEM using the Numpy and Numba-CPU Python modules: PyBONDEM-GPU was found to be 1000 times faster than the Numpy implementation and 4 times faster than the Numba-CPU implementation at resolving forces and integrating the equations of motion.
35

Rtal, Youness, and Abdelkader Hadjoudja. "Comparative study of the implementation of the Lagrange interpolation algorithm on GPU and CPU using CUDA to compute the density of a material at different temperatures". SHS Web of Conferences 119 (2021): 07002. http://dx.doi.org/10.1051/shsconf/202111907002.

Abstract:
Graphics processing units (GPUs) are microprocessors on graphics cards that are dedicated to displaying and manipulating graphics data; they are found in all modern graphics cards. Within a few years, these microprocessors have become potent tools for massively parallel computing. Such processors are practical instruments in several fields, such as image processing, video and audio encoding and decoding, and the solution of physical systems with one or more unknowns. Their advantages: faster processing and lower energy consumption than the central processing unit (CPU). In this paper, we define and implement the Lagrange polynomial interpolation method on GPU and CPU to calculate the sodium density at different temperatures Ti, using the NVIDIA CUDA C parallel programming model, which can increase computational performance by harnessing the power of the GPU. The objective of this study is to compare the performance of the Lagrange interpolation method on CPU and GPU processors and to assess the efficiency of GPUs for parallel computing.
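A minimal CUDA sketch of the computation the paper parallelizes: one thread per query temperature, each evaluating the Lagrange polynomial over tabulated nodes. The node count and table contents are illustrative, not the paper's data; the constant tables would be filled from the host with cudaMemcpyToSymbol:

```cuda
#include <cuda_runtime.h>

#define K 8                   // number of interpolation nodes (illustrative)

__constant__ float c_T[K];    // tabulated temperatures (interpolation nodes)
__constant__ float c_rho[K];  // tabulated densities at those temperatures

// Each thread evaluates rho(t) = sum_k rho_k * L_k(t), where
// L_k(t) = prod_{j != k} (t - T_j) / (T_k - T_j).
__global__ void lagrange_eval(const float* __restrict__ query,
                              float* __restrict__ result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float t = query[i], sum = 0.0f;
    for (int k = 0; k < K; ++k) {
        float basis = 1.0f;
        for (int j = 0; j < K; ++j)
            if (j != k) basis *= (t - c_T[j]) / (c_T[k] - c_T[j]);
        sum += basis * c_rho[k];
    }
    result[i] = sum;
}
```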
36

Dong, Jiankuo, Fangyu Zheng, Wuqiong Pan, Jingqiang Lin, Jiwu Jing, and Yuan Zhao. "Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration". Security and Communication Networks 2017 (2017): 1–15. http://dx.doi.org/10.1155/2017/3508786.

Abstract:
Implementations of asymmetric cryptographic algorithms (e.g., RSA and elliptic curve cryptography) on graphics processing units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs to offer high performance. However, the great potential cryptographic computing power of GPUs, especially through the more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical usage of the proposed algorithm, a new method is introduced to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published at Eurocrypt 2009). The RSA-4096 decryption surpasses the existing fastest integer-based result by 23%.
37

Li, Zheng, Shuhong Wu, Jinchao Xu, and Chensong Zhang. "Toward Cost-Effective Reservoir Simulation Solvers on GPUs". Advances in Applied Mathematics and Mechanics 8, no. 6 (September 19, 2016): 971–91. http://dx.doi.org/10.4208/aamm.2015.m1138.

Abstract:
In this paper, we focus on the graphics processing unit (GPU) and discuss how its architecture affects the choice of algorithm and implementation for fully-implicit petroleum reservoir simulation. In order to obtain satisfactory performance on new many-core architectures such as GPUs, simulator developers must know a great deal about the specific hardware and spend a lot of time fine-tuning the code. Porting a large petroleum reservoir simulator to emerging hardware architectures is expensive and risky. We analyze major components of an in-house reservoir simulator and investigate how to port them to GPUs in a cost-effective way. Preliminary numerical experiments show that our GPU-based simulator is robust and effective. More importantly, these numerical results clearly identify the main bottlenecks to obtaining ideal speedup on GPUs and possibly other many-core architectures.
APA, Harvard, Vancouver, ISO and other citation styles
38

Kondratyuk, Nikolay, Vsevolod Nikolskiy, Daniil Pavlov and Vladimir Stegailov. "GPU-accelerated molecular dynamics: State-of-art software performance and porting from Nvidia CUDA to AMD HIP". International Journal of High Performance Computing Applications 35, No. 4 (19.04.2021): 312–24. http://dx.doi.org/10.1177/10943420211008288.

The full content of the source
Annotation:
Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As a rule, the efficiency of such calculations rests on an interplay of software and hardware that is nowadays moving toward hybrid GPU-based technologies. Several well-developed open-source MD codes focused on GPUs differ both in their data management capabilities and in performance. In this work, we analyze the performance of the LAMMPS, GROMACS and OpenMM MD packages with different GPU backends on Nvidia Volta and AMD Vega20 GPUs. We consider the efficiency of solving two identical MD models (generic for materials science and biomolecular studies) using different software and hardware combinations. We describe our experience in porting the CUDA backend of LAMMPS to ROCm HIP, which shows considerable benefits for AMD GPUs compared to the OpenCL backend.
APA, Harvard, Vancouver, ISO and other citation styles
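As the annotation notes, a CUDA-to-HIP port is largely mechanical. The generic sketch below (not LAMMPS code) marks in comments the only lines that change: the kernel body and the triple-chevron launch syntax stay the same, and the runtime API keeps its shape with a different prefix.

```cuda
#include <cuda_runtime.h>   // HIP port: #include <hip/hip_runtime.h>

// SAXPY kernel: identical source under CUDA and HIP.
__global__ void axpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run(int n, float a, const float *hx, float *hy) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));                 // HIP: hipMalloc
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);  // hipMemcpy
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);
    axpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);       // same launch syntax in HIP
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);                         // HIP: hipFree
}
```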
39

Semenenko, Julija, Aliaksei Kolesau, Vadimas Starikovičius, Artūras Mackūnas and Dmitrij Šešok. "COMPARISON OF GPU AND CPU EFFICIENCY WHILE SOLVING HEAT CONDUCTION PROBLEMS". Mokslas - Lietuvos ateitis 12 (24.11.2020): 1–5. http://dx.doi.org/10.3846/mla.2020.13500.

The full content of the source
Annotation:
An overview of GPU usage in solving different engineering problems, a comparison between CPU and GPU computations, and an overview of the heat conduction problem are provided in this paper. The Jacobi iterative algorithm was implemented using Python, the TensorFlow GPU library and NVIDIA CUDA technology. Numerical experiments were conducted with 6 CPUs and 4 GPUs. The fastest GPU used completed the calculations 19 times faster than the slowest CPU. On average, the GPU was 9 to 11 times faster than the CPU. Significant relative speed-up in GPU calculations starts when the matrix contains at least 400² floating-point numbers.
APA, Harvard, Vancouver, ISO and other citation styles
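For context, the Jacobi update at the heart of such heat-conduction solvers is only a few lines on a GPU. Here is a minimal CUDA sketch (the paper itself used Python with TensorFlow; all names here are hypothetical): each thread replaces one interior grid point with the average of its four neighbours.

```cuda
// One Jacobi sweep for the 2-D steady-state heat equation on an
// nx-by-ny grid stored row-major. Boundary points are left untouched.
__global__ void jacobiStep(const float *u, float *u_new, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1)
        u_new[j * nx + i] = 0.25f * (u[j * nx + i - 1] + u[j * nx + i + 1] +
                                     u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
}

// Host loop (sketch): ping-pong the two buffers until convergence.
//   dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
//   for (int it = 0; it < maxIter; ++it) {
//       jacobiStep<<<grid, block>>>(d_u, d_u_new, nx, ny);
//       std::swap(d_u, d_u_new);
//   }
```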
40

Lin, Chun-Yuan, Jin Ye, Che-Lun Hung, Chung-Hung Wang, Min Su and Jianjun Tan. "Constructing a Bioinformatics Platform with Web and Mobile Services Based on NVIDIA Jetson TK1". International Journal of Grid and High Performance Computing 7, No. 4 (October 2015): 57–73. http://dx.doi.org/10.4018/ijghpc.2015100105.

The full content of the source
Annotation:
Current high-end graphics processing units (abbreviated as GPUs), such as the NVIDIA Tesla, Fermi and Kepler series cards, which contain up to a thousand cores per chip, are widely used in high performance computing. These GPU cards (called desktop GPUs) must be installed in personal computers or servers with desktop CPUs; moreover, the cost and power consumption of constructing a high performance computing platform with such desktop CPUs and GPUs are high. NVIDIA released the Tegra K1 board, called Jetson TK1, which contains four ARM Cortex-A15 CPU cores and 192 CUDA cores (a Kepler GPU); it is an embedded board with the advantages of low cost, low power consumption and high applicability for embedded applications. NVIDIA Jetson TK1 has thus become a new research direction. Hence, in this paper, a bioinformatics platform was constructed based on NVIDIA Jetson TK1. The ClustalWtk and MCCtk tools, for sequence alignment and compound comparison, respectively, were designed on this platform. Moreover, web and mobile services for these two tools, with user-friendly interfaces, were also provided. The experimental results showed that the cost-performance ratio of NVIDIA Jetson TK1 is higher than that of an Intel XEON E5-2650 CPU and an NVIDIA Tesla K20m GPU card.
APA, Harvard, Vancouver, ISO and other citation styles
41

Wu, Rongteng, and Xiaohong Xie. "A Heterogeneous Parallel LU Factorization Algorithm Based on a Basic Column Block Uniform Allocation Strategy". Mathematical Problems in Engineering 2019 (25.02.2019): 1–12. http://dx.doi.org/10.1155/2019/3720450.

The full content of the source
Annotation:
Most supercomputers are shipped with both CPUs and GPUs. With the powerful parallel computing capability of GPUs, heterogeneous computing architectures produce new challenges for system software development and application design. Because of the significantly different architectures and programming models of CPUs and GPUs, conventional optimization techniques for CPUs may not work well in a heterogeneous multi-CPU and multi-GPU system. We present a heterogeneous parallel LU factorization algorithm for heterogeneous architectures. According to the different performances of the processors in the system, any given matrix is partitioned into basic column blocks of different sizes. A static task allocation strategy is then used to distribute the basic column blocks to the corresponding processors uniformly. Idle time is minimized by optimizing the sizes and the number of basic column blocks. A right-looking look-ahead technique is also used in systems configured with one CPU core per GPU to decrease the wait time. Experiments are conducted to test the synchronization and load balancing, communication cost, and scalability of the heterogeneous parallel LU factorization in different systems, and to compare it with a related matrix algebra algorithm on a heterogeneous system configured with multiple GPUs and CPUs.
APA, Harvard, Vancouver, ISO and other citation styles
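The allocation idea in the annotation, dealing basic column blocks out to processors in proportion to their measured speed, can be sketched as below. This is a hedged reconstruction with hypothetical names, not the authors' algorithm; the paper's exact strategy may differ in detail.

```cuda
#include <vector>

// speeds[p]: relative throughput of processor p (a CPU core or a GPU).
// Returns owner[b] = processor assigned to basic column block b. Each new
// block goes to the processor furthest behind its speed-proportional
// share, which keeps the distribution uniform over time.
std::vector<int> allocateBlocks(const std::vector<double>& speeds, int nBlocks) {
    int nProc = static_cast<int>(speeds.size());
    std::vector<int> owner(nBlocks);
    std::vector<double> assigned(nProc, 0.0);  // blocks given to each processor
    for (int b = 0; b < nBlocks; ++b) {
        int best = 0;
        for (int p = 1; p < nProc; ++p)        // smallest normalized load wins
            if (assigned[p] / speeds[p] < assigned[best] / speeds[best])
                best = p;
        owner[b] = best;
        assigned[best] += 1.0;
    }
    return owner;
}
```

With speeds {1.0, 4.0} (one CPU core, one GPU roughly four times faster), the GPU ends up owning about four fifths of the blocks, which is the load-balancing effect the annotation describes.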
42

Rui, Ran, Hao Li and Yi-Cheng Tu. "Efficient join algorithms for large database tables in a multi-GPU environment". Proceedings of the VLDB Endowment 14, No. 4 (December 2020): 708–20. http://dx.doi.org/10.14778/3436905.3436927.

The full content of the source
Annotation:
Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs as a general-purpose parallel computing platform are very promising for processing relational joins. However, join algorithms often need to handle very large input data, an issue that was not sufficiently addressed in existing work. Besides, as more and more desktop and workstation platforms support multi-GPU environments, the combined computing capability of multiple GPUs can easily reach that of a computing cluster, so it is worth exploring how join processing would benefit from the adoption of multiple GPUs. We identify the low rate and complex patterns of data transfer among the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome these challenges, we propose three distinctive designs of multi-GPU join algorithms, namely, the nested loop, global sort-merge and hybrid joins, for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate the high scalability of our algorithms over data size and the significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance than existing join algorithms, with speedups of up to 25X and 2.8X over the best known code developed for multi-core CPUs and GPUs, respectively.
APA, Harvard, Vancouver, ISO and other citation styles
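A minimal skeleton of the multi-GPU dispatch pattern such join algorithms build on might look like the following. This is a generic nested-loop variant with integer keys, not the paper's code; a real implementation would overlap transfers and kernels with streams rather than processing devices synchronously.

```cuda
#include <cuda_runtime.h>

// Toy nested-loop join: count equality matches of each outer key
// against the inner table.
__global__ void nestedLoopJoin(const int *outer, int nOuter,
                               const int *inner, int nInner,
                               unsigned int *matchCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nOuter) return;
    for (int j = 0; j < nInner; ++j)
        if (outer[i] == inner[j]) atomicAdd(matchCount, 1u);
}

// Split the outer table into one chunk per device; replicate the inner
// table on every device; accumulate the match counts on the host.
unsigned int multiGpuJoin(const int *outer, int nOuter,
                          const int *inner, int nInner) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    int chunk = (nOuter + nDev - 1) / nDev;
    unsigned int total = 0;
    for (int d = 0; d < nDev; ++d) {
        int lo = d * chunk;
        int n  = (lo + chunk <= nOuter) ? chunk : nOuter - lo;
        if (n <= 0) break;
        cudaSetDevice(d);                    // subsequent calls target GPU d
        int *dOuter, *dInner; unsigned int *dCount;
        cudaMalloc(&dOuter, n * sizeof(int));
        cudaMalloc(&dInner, nInner * sizeof(int));
        cudaMalloc(&dCount, sizeof(unsigned int));
        cudaMemset(dCount, 0, sizeof(unsigned int));
        cudaMemcpy(dOuter, outer + lo, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dInner, inner, nInner * sizeof(int), cudaMemcpyHostToDevice);
        nestedLoopJoin<<<(n + 255) / 256, 256>>>(dOuter, n, dInner, nInner, dCount);
        unsigned int c = 0;                  // implicit sync on the copy back
        cudaMemcpy(&c, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        total += c;
        cudaFree(dOuter); cudaFree(dInner); cudaFree(dCount);
    }
    return total;
}
```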
43

Wilton, Richard, and Alexander S. Szalay. "Arioc: High-concurrency short-read alignment on multiple GPUs". PLOS Computational Biology 16, No. 11 (09.11.2020): e1008383. http://dx.doi.org/10.1371/journal.pcbi.1008383.

The full content of the source
Annotation:
In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture we obtained a further 1.8x-2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150nt paired-end reads) in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.
APA, Harvard, Vancouver, ISO and other citation styles
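The peer-to-peer memory access the annotation mentions is enabled through standard CUDA runtime calls; once enabled, a kernel on one GPU can dereference a buffer that lives in another GPU's memory over NVLink or PCIe without staging through the host. A short generic sketch (not Arioc's source):

```cuda
#include <cuda_runtime.h>

// Enable peer access between every pair of devices that supports it.
// After this, a pointer from cudaMalloc on device b can be passed to a
// kernel launched on device a, and cudaMemcpyPeer copies directly
// between device memories.
void enablePeerAccess(int nDev) {
    for (int a = 0; a < nDev; ++a) {
        cudaSetDevice(a);
        for (int b = 0; b < nDev; ++b) {
            if (a == b) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, a, b);
            if (can) cudaDeviceEnablePeerAccess(b, 0);  // flags must be 0
        }
    }
}
```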
44

Gao, Jiaquan, Yuanshen Zhou and Kesong Wu. "A Novel Multi-GPU Parallel Optimization Model for The Sparse Matrix-Vector Multiplication". Parallel Processing Letters 26, No. 04 (December 2016): 1640001. http://dx.doi.org/10.1142/s0129626416400016.

The full content of the source
Annotation:
Accelerating sparse matrix-vector multiplication (SpMV) on graphics processing units (GPUs) has attracted considerable attention recently. We observe that on a specific multiple-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization model. Our model involves two stages. In the first stage, a simple rule is defined to divide any given matrix among multiple GPUs, and a performance model, which is independent of the problems and dependent on the resources of the devices, is proposed to accurately predict the execution time of SpMV kernels. Using these models, in the second stage we construct an optimal multi-GPU parallel SpMV algorithm that is automatically and rapidly generated for the platform and for any problem. Given that our model for SpMV is general, it is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.
APA, Harvard, Vancouver, ISO and other citation styles
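For reference, one of the standard storage-format kernels such a model chooses among is the scalar CSR SpMV, with one thread computing one row's dot product. A textbook CUDA sketch (not the paper's generated code):

```cuda
// y = A * x for a matrix in Compressed Sparse Row form:
//   rowPtr[r] .. rowPtr[r+1]-1 index the nonzeros of row r in
//   vals[] (values) and colIdx[] (column indices).
__global__ void spmvCsr(int nRows, const int *rowPtr, const int *colIdx,
                        const double *vals, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        double dot = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            dot += vals[k] * x[colIdx[k]];
        y[row] = dot;
    }
}

// Launch sketch: spmvCsr<<<(nRows + 255) / 256, 256>>>(...);
// This layout suits matrices with short, similar-length rows; formats
// such as ELL or HYB can win on other sparsity patterns, which is
// exactly the per-block choice the annotation's model automates.
```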
45

Sakai, Kohei, Takuma Iwazaki, Eiki Yamashita, Atsushi Nakagawa, Fumiya Sakuraba, Atsushi Enomoto, Minoru Inagaki and Shigeki Takeda. "Observation of unexpected molecular binding activity for Mu phage tail fibre chaperones". Journal of Biochemistry 166, No. 6 (30.08.2019): 529–35. http://dx.doi.org/10.1093/jb/mvz068.

The full content of the source
Annotation:
In the history of viral research, one of the important biological features of bacteriophage Mu is its ability to expand its host range. For extending the host range, the Mu phage encodes two alternate tail fibre genes. Classical amber mutation experiments and genome sequence analysis of Mu phage suggested that the gene products (gp) of gene S (gpS = gp49) and gene S' (gpS' = gp52) are tail fibres, and that the gene products of gene U (gpU = gp50) and gene U' (gpU' = gp51) serve in tail fibre assembly or as tail fibre chaperones. Depending on the gene orientation, a pair of genes, 49-50 or 52-51, is expressed to produce different tail fibres that enable Mu phage to recognize different host cell surfaces. Since several fibrous proteins, including some phage tail fibres, employ a specific chaperone to facilitate folding and prevent aggregation, we expected that gp50 and gp51 would be specific chaperones for gp49 and gp52, respectively. However, heterologous overexpression of gp49 or gp52 (the tail fibre subunits) together with gp51 or gp50, respectively, was also effective in producing soluble Mu tail fibres. Moreover, we successfully purified non-native gp49-gp51 and gp52-gp50 complexes. These facts showed that gp50 and gp51 are interchangeable, each being functional for both gp49 and gp52.
APA, Harvard, Vancouver, ISO and other citation styles
46

Cabodi, G., A. Garbo, C. Loiacono, S. Quer and G. Francini. "Efficient Complex High-Precision Computations on GPUs without Precision Loss". Journal of Circuits, Systems and Computers 26, No. 12 (August 2017): 1750187. http://dx.doi.org/10.1142/s0218126617501870.

The full content of the source
Annotation:
General-purpose computing on graphics processing units is the utilization of a graphics processing unit (GPU) to perform computation in applications traditionally handled by the central processing unit. Many attempts have been made to implement well-known algorithms on embedded and mobile GPUs. Unfortunately, these applications are computationally complex and often require high precision arithmetic, whereas embedded and mobile GPUs are designed specifically for graphics, and thus are very restrictive in terms of input/output, precision, programming style and primitives available. This paper studies how to implement efficient and accurate high-precision algorithms on embedded GPUs adopting the OpenGL ES language. We discuss the problems arising during the design phase, and we detail our implementation choices, focusing on the SIFT and ALP key-point detectors. We transform standard, i.e., single (or double) precision floating-point computations, to reduced-precision GPU arithmetic without precision loss. We develop a desktop framework to simulate Gaussian Scale Space transforms on all possible target embedded GPU platforms, and with all possible range and precision arithmetic. We illustrate how to re-engineer standard Gaussian Scale Space computations to mobile multi-core parallel GPUs using the OpenGL ES language. We present experiments on a large set of standard images, proving how efficiency and accuracy can be maintained on different target platforms. To sum up, we present a complete framework to minimize future programming effort, i.e., to easily check, on different embedded platforms, the accuracy and performance of complex algorithms requiring high-precision computations.
APA, Harvard, Vancouver, ISO and other citation styles
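One widely used way to obtain extra precision on GPUs restricted to single precision, in the spirit of the annotation above, is float-float arithmetic: a value is carried as an unevaluated sum hi + lo of two floats, and addition uses Knuth's two-sum to capture the rounding error exactly. The sketch below is illustrative CUDA-style C, not the paper's OpenGL ES shader code, and shows the fast variant of the addition.

```cuda
// A float-float value: the exact quantity is hi + lo, with |lo| small
// relative to hi, giving roughly twice the significand bits of a float.
struct ff { float hi, lo; };

__host__ __device__ ff ffAdd(ff a, ff b) {
    float s = a.hi + b.hi;                    // leading part of the sum
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  // exact rounding error of s (two-sum)
    e += a.lo + b.lo;                         // fold in the low-order words
    float hi = s + e;                         // renormalize: |lo| <= ulp(hi)/2
    ff r; r.hi = hi; r.lo = e - (hi - s);
    return r;
}
```

The same decomposition underlies float-float multiplication and the Gaussian Scale Space reengineering the annotation describes; compilers must not be allowed to reassociate these expressions, or the error terms cancel away.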
47

Blyth, Simon. "Opticks: GPU Optical Photon Simulation for Particle Physics using NVIDIA® OptiX™". EPJ Web of Conferences 214 (2019): 02027. http://dx.doi.org/10.1051/epjconf/201921402027.

The full content of the source
Annotation:
Opticks is an open source project that integrates the NVIDIA OptiX GPU ray tracing engine with Geant4 toolkit based simulations. Massive parallelism brings drastic performance improvements with optical photon simulation speedup expected to exceed 1000 times Geant4 with workstation GPUs. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented as CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. OptiX handles the creation and application of a choice of acceleration structures such as boundary volume hierarchies and the transparent use of multiple GPUs. A major recent advance is the implementation of GPU ray tracing of complex constructive solid geometry shapes, enabling automated translation of Geant4 geometries to the GPU without approximation. Using common initial photons and random number sequences allows the Opticks and Geant4 simulations to be run point-by-point aligned. Aligned running has reached near perfect equivalence with test geometries.
APA, Harvard, Vancouver, ISO and other citation styles
48

Schive, Hsi-Yu, Ui-Han Zhang and Tzihong Chiueh. "Directionally unsplit hydrodynamic schemes with hybrid MPI/OpenMP/GPU parallelization in AMR". International Journal of High Performance Computing Applications 26, No. 4 (17.11.2011): 367–77. http://dx.doi.org/10.1177/1094342011428146.

The full content of the source
Annotation:
We present the implementation and performance of a class of directionally unsplit Riemann-solver-based hydrodynamic schemes on graphics processing units (GPUs). These schemes, including the MUSCL-Hancock method, a variant of the MUSCL-Hancock method, and the corner-transport-upwind method, are embedded into the adaptive-mesh-refinement (AMR) code GAMER. Furthermore, a hybrid MPI/OpenMP model is investigated, which enables the full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance. Performance benchmarks are conducted on the Dirac GPU cluster at NERSC/LBNL using up to 32 Tesla C2050 GPUs. A single GPU achieves speed-ups of 101 (25) and 84 (22) for uniform-mesh and AMR simulations, respectively, as compared with the performance using one (four) CPU core(s), and the excellent performance persists in multi-GPU tests. In addition, we make a direct comparison between GAMER and the widely adopted CPU code Athena in adiabatic hydrodynamic tests and demonstrate that, with the same accuracy, GAMER is able to achieve two orders of magnitude performance speed-up.
APA, Harvard, Vancouver, ISO and other citation styles
49

Abdi, Daniel S., Lucas C. Wilcox, Timothy C. Warburton and Francis X. Giraldo. "A GPU-accelerated continuous and discontinuous Galerkin non-hydrostatic atmospheric model". International Journal of High Performance Computing Applications 33, No. 1 (01.02.2017): 81–109. http://dx.doi.org/10.1177/1094342017694427.

The full content of the source
Annotation:
We present a Graphics Processing Unit (GPU)-accelerated nodal discontinuous Galerkin method for the solution of the three-dimensional Euler equations that govern the motion and thermodynamic state of the atmosphere. Acceleration of the dynamical core of atmospheric models plays an important practical role in not only getting daily forecasts faster, but also in obtaining more accurate (high resolution) results within a given simulation time limit. We use algorithms suitable for the single instruction multiple thread architecture of GPUs to accelerate our model by two orders of magnitude relative to one core of a CPU. Tests on one node of the Titan supercomputer show a speedup of up to 15 times using the K20X GPU as compared to that on the 16-core AMD Opteron CPU. The scalability of the multi-GPU implementation is tested using 16,384 GPUs, which resulted in a weak scaling efficiency of about 90%. Finally, the accuracy and performance of our GPU implementation is verified using several benchmark problems representative of different scales of atmospheric dynamics.
APA, Harvard, Vancouver, ISO and other citation styles
50

Xu, S., X. Huang, L. Y. Oey, F. Xu, H. Fu, Y. Zhang and G. Yang. "POM.gpu-v1.0: a GPU-based Princeton Ocean Model". Geoscientific Model Development 8, No. 9 (09.09.2015): 2815–27. http://dx.doi.org/10.5194/gmd-8-2815-2015.

The full content of the source
Annotation:
Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
APA, Harvard, Vancouver, ISO and other citation styles