Journal articles on the topic "GPU-CPU"

To see other types of publications on this topic, follow the link: GPU-CPU.

Create a correct reference in APA, MLA, Chicago, Harvard, and several other styles.

Consult the top 50 journal articles for your research on the topic "GPU-CPU".

Next to each source in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever this information is included in the metadata.

Browse journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Zhu, Ziyu, Xiaochun Tang, and Quan Zhao. "A unified schedule policy of distributed machine learning framework for CPU-GPU cluster." Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University 39, no. 3 (June 2021): 529–38. http://dx.doi.org/10.1051/jnwpu/20213930529.

Abstract:
With the widespread use of GPU hardware, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve the efficiency of their algorithms. However, existing distributed machine learning scheduling frameworks consider task scheduling either only on CPU resources or only on GPU resources, and even those that account for the difference between CPU and GPU resources find it difficult to improve the resource usage of the entire system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to schedule the tasks within a job efficiently. In this paper, we propose a CPU-GPU hybrid cluster scheduling framework in detail. First, according to the different characteristics of CPU and GPU computing power, the data is divided into fragments of different sizes to fit the CPU and GPU computing resources. Second, the paper introduces the task scheduling method for the CPU-GPU hybrid setting. Finally, the proposed method is verified experimentally. In our evaluation with K-Means, the CPU-GPU hybrid computing framework increases the performance of K-Means by about 1.5 times, and as the number of GPUs increases, the performance of K-Means improves significantly.
2

Cui, Pengjie, Haotian Liu, Bo Tang, and Ye Yuan. "CGgraph: An Ultra-Fast Graph Processing System on Modern Commodity CPU-GPU Co-processor." Proceedings of the VLDB Endowment 17, no. 6 (February 2024): 1405–17. http://dx.doi.org/10.14778/3648160.3648179.

Abstract:
In recent years, many CPU-GPU heterogeneous graph processing systems have been developed in both academia and industry to facilitate large-scale graph processing in various applications, e.g., social networks and biological networks. However, the performance of existing systems can be significantly improved by addressing two prevailing challenges: GPU memory over-subscription and efficient CPU-GPU cooperative processing. In this work, we propose CGgraph, an ultra-fast CPU-GPU graph processing system that addresses these challenges. In particular, CGgraph overcomes GPU memory over-subscription by extracting a subgraph that only needs to be loaded into GPU memory once, but whose vertices and edges can be used in multiple iterations during the graph processing procedure. To support efficient CPU-GPU co-processing, we design a cooperative processing scheme that balances the workloads between CPU and GPU through on-demand task allocation. To evaluate the efficiency of CGgraph, we conduct extensive experiments, comparing it with 7 state-of-the-art systems using 4 well-known graph algorithms on 6 real-world graphs. Our prototype system CGgraph outperforms all existing systems, delivering up to an order of magnitude improvement. Moreover, CGgraph on a modern commodity machine with a CPU-GPU co-processor yields superior (or at the very least, comparable) performance compared to existing systems on a high-end CPU-GPU server.
3

Lee, Taekhee, and Young J. Kim. "Massively parallel motion planning algorithms under uncertainty using POMDP." International Journal of Robotics Research 35, no. 8 (August 21, 2015): 928–42. http://dx.doi.org/10.1177/0278364915594856.

Abstract:
We present new parallel algorithms that solve continuous-state partially observable Markov decision process (POMDP) problems using the GPU (gPOMDP) and a hybrid of the GPU and CPU (hPOMDP). We choose the Monte Carlo value iteration (MCVI) method as our base algorithm and parallelize it using the multi-level parallel formulation of MCVI. For each parallel level, we propose efficient algorithms to utilize the massive data parallelism available on modern GPUs. Our GPU-based method uses two workload distribution techniques, compute/data interleaving and workload balancing, to obtain the maximum parallel performance at the highest level. We also present a CPU-GPU hybrid method that takes advantage of both CPU and GPU parallelism in order to solve highly complex POMDP planning problems. The CPU is responsible for data preparation, while the GPU performs the Monte Carlo simulations; these operations are performed concurrently using the compute/data overlap technique between the CPU and GPU. To the best of the authors' knowledge, our algorithms are the first parallel algorithms that efficiently execute POMDP in a massively parallel fashion utilizing the GPU or a hybrid of the GPU and CPU. Our algorithms outperform the existing CPU-based algorithm by a factor of 75–99 depending on the chosen benchmark.
4

Yogatama, Bobbi W., Weiwei Gong, and Xiangyao Yu. "Orchestrating data placement and query execution in heterogeneous CPU-GPU DBMS." Proceedings of the VLDB Endowment 15, no. 11 (July 2022): 2491–503. http://dx.doi.org/10.14778/3551793.3551809.

Abstract:
There has been growing interest in using GPUs to accelerate data analytics due to their massive parallelism and high memory bandwidth. The main constraint on using GPUs for data analytics is the limited capacity of GPU memory. Heterogeneous CPU-GPU query execution is a compelling approach to mitigating the limited GPU memory capacity and PCIe bandwidth. However, the design space of heterogeneous CPU-GPU query execution has not been fully explored. We aim to improve the state of the art in CPU-GPU data analytics engines by optimizing data placement and heterogeneous query execution. First, we introduce a semantic-aware fine-grained caching policy that takes into account various aspects of the workload, such as query semantics, data correlation, and query frequency, when determining data placement between CPU and GPU. Second, we introduce a heterogeneous query executor that can fully exploit data on both the CPU and GPU and coordinate query execution at a fine granularity. We integrate both solutions in Mordred, our novel hybrid CPU-GPU data analytics engine. Evaluation on the Star Schema Benchmark shows that the semantic-aware caching policy can outperform the best traditional caching policy by up to 3x. Compared to existing GPU DBMSs, Mordred can outperform them by an order of magnitude.
5

Power, Jason, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood. "gem5-gpu: A Heterogeneous CPU-GPU Simulator." IEEE Computer Architecture Letters 14, no. 1 (January 1, 2015): 34–36. http://dx.doi.org/10.1109/lca.2014.2299539.
6

Raju, K., and Niranjan N. Chiplunkar. "Performance Enhancement of CUDA Applications by Overlapping Data Transfer and Kernel Execution." Applied Computer Science 17, no. 3 (September 30, 2021): 5–18. http://dx.doi.org/10.35784/acs-2021-17.

Abstract:
The CPU-GPU combination is a widely used heterogeneous computing system in which the CPU and GPU have different address spaces. Since the GPU cannot directly access CPU memory, the input data must be available in GPU memory before a GPU function is invoked; on completion of the GPU function, the results of the computation are transferred back to CPU memory. CPU-GPU data transfer happens through the PCI-Express bus, whose bandwidth is much lower than that of GPU memory, so the speed at which data can be transferred is limited by the PCI-E bandwidth. Hence, the PCI-E bus acts as a performance bottleneck. In this paper, two approaches for minimizing the data transfer overhead are discussed: performing the data transfer while the GPU function is being executed, and reducing the amount of data to be transferred to the GPU. The effect of these approaches on the execution time of a set of CUDA applications is evaluated using CUDA streams. The results of our experiments show that the execution time of the applications can be reduced with the proposed approaches.
7

Liu, Gaogao, Wenbo Yang, Peng Li, Guodong Qin, Jingjing Cai, Youming Wang, Shuai Wang, Ning Yue, and Dongjie Huang. "MIMO Radar Parallel Simulation System Based on CPU/GPU Architecture." Sensors 22, no. 1 (January 5, 2022): 396. http://dx.doi.org/10.3390/s22010396.

Abstract:
The data volume and computational load of MIMO radar are huge; very high-speed computation is necessary for real-time processing. In this paper, we study the time-division MIMO radar signal processing flow and propose an improved MIMO radar signal processing algorithm that raises the processing speed relative to previous algorithms. On this basis, a parallel simulation system for MIMO radar based on the CPU/GPU architecture is proposed. The outer layer of the framework uses coarse-grained acceleration with OpenMP on the CPU, and the inner layer of fine-grained data processing is accelerated on the GPU. Its performance is significantly faster than serial computation, and satisfactory acceleration has been achieved in the CPU/GPU architecture simulation. The experimental results show that the MIMO radar parallel simulation system with CPU/GPU architecture greatly improves computing power compared with the CPU-based method. Compared with the serial sequential CPU method, GPU simulation achieves a speedup of 130 times. In addition, the MIMO radar signal processing parallel simulation system based on the CPU/GPU architecture delivers a performance improvement of 13% compared to the GPU-only method.
8

Zou, Yong Ning, Jue Wang, and Jian Wei Li. "Cutting Display of Industrial CT Volume Data Based on GPU." Advanced Materials Research 271-273 (July 2011): 1096–102. http://dx.doi.org/10.4028/www.scientific.net/amr.271-273.1096.

Abstract:
The rapid development of graphics processing units (GPUs) in recent years in terms of performance and programmability has attracted the attention of those seeking to leverage alternative architectures for better performance than commodity CPUs can provide. This paper presents a new algorithm for the cutting display of computed tomography volume data on the GPU. We first introduce the programming model of the GPU and outline the implementation of techniques for the oblique-plane cutting display of volume data on both the CPU and GPU. We compare the approaches and present performance results for both. The results show that the cutting display image generated by the GPU algorithm is clear, and the frame rate on the GPU is 2–9 times that on the CPU.
9

Jiang, Ronglin, Shugang Jiang, Yu Zhang, Ying Xu, Lei Xu, and Dandan Zhang. "GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform." International Journal of Antennas and Propagation 2014 (2014): 1–8. http://dx.doi.org/10.1155/2014/321081.

Abstract:
This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, with parallelization via the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both central processing unit (CPU) and graphics processing unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure-GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure-CPU, pure-GPU, and CPU + GPU tests. Relative to the pure-CPU calculations for the same problems, the speedup ratio achieved by the CPU + GPU calculations is around 14. Compared to the pure-GPU calculations for the same problems, the CPU + GPU calculations show a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small; however, this code can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure-GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those of MoM. The results show good agreement between them.
10

Semenenko, Julija, Aliaksei Kolesau, Vadimas Starikovičius, Artūras Mackūnas, and Dmitrij Šešok. "Comparison of GPU and CPU Efficiency While Solving Heat Conduction Problems." Mokslas – Lietuvos ateitis 12 (November 24, 2020): 1–5. http://dx.doi.org/10.3846/mla.2020.13500.

Abstract:
This paper provides an overview of GPU usage for solving different engineering problems, a comparison between CPU and GPU computations, and an overview of the heat conduction problem. The Jacobi iterative algorithm was implemented using Python, the TensorFlow GPU library, and NVIDIA CUDA technology. Numerical experiments were conducted with 6 CPUs and 4 GPUs. The fastest GPU used completed the calculations 19 times faster than the slowest CPU. On average, the GPU was 9 to 11 times faster than the CPU. A significant relative speed-up in GPU calculations begins when the matrix contains at least 400² floating-point numbers.
11

Hu, Peng, Zixiong Zhao, Aofei Ji, Wei Li, Zhiguo He, Qifeng Liu, Youwei Li, and Zhixian Cao. "A GPU-Accelerated and LTS-Based Finite Volume Shallow Water Model." Water 14, no. 6 (March 15, 2022): 922. http://dx.doi.org/10.3390/w14060922.

Abstract:
This paper presents a GPU (graphics processing unit)-accelerated and LTS (local time step)-based finite volume shallow water model (SWM). The model's performance is compared against five other model versions (single-CPU versions with/without LTS, multi-CPU versions with/without LTS, and a GPU version) by simulating three flow scenarios: an idealized dam-break flow, an experimental dam-break flow, and a field-scale scenario of tidal flows. Satisfactory agreement between the simulation results and the available measured data/reference solutions (water level, flow velocity) indicates that all six SWM versions can simulate these challenging shallow water flows well. Inter-comparisons of the computational efficiency of the six SWM versions indicate the following. First, GPU acceleration is much more efficient than multi-core CPU parallel computing: the speed-up of the GPU versions can be as high as one hundred, whereas those of the multi-core CPU versions are only 2–3. Second, implementing the LTS brings considerable further reduction: the additional maximum speed-ups can be as high as 10 for the single-core CPU/multi-core CPU versions, and as high as five for the GPU versions. Third, the GPU + LTS version is computationally the most efficient in most cases; the multi-core CPU + LTS version may run as fast as a GPU version for scenarios over some intermediate number of cells.
12

Gyurjyan, Vardan, and Sebastian Mancilla. "Heterogeneous data-processing optimization with CLARA's adaptive workflow orchestrator." EPJ Web of Conferences 245 (2020): 05020. http://dx.doi.org/10.1051/epjconf/202024505020.

Abstract:
The hardware landscape used in HEP and NP is changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with its own characteristics. To achieve maximum performance in data processing, the main challenge is to place the right computation on the right hardware. In this paper, we discuss CLAS12 charged-particle tracking workflow orchestration that allows us to utilize both CPU and GPU to improve performance. The tracking algorithm was decomposed into micro-services that are deployed on CPU and GPU processing units, where the best features of both are intelligently combined to achieve maximum performance. In this heterogeneous environment, CLARA aims to match the requirements of each micro-service to the strengths of the CPU or GPU architecture. A predefined execution of a micro-service on a CPU or a GPU may not be the optimal solution, due to the streaming data-quantum size and the data-quantum transfer latency between CPU and GPU. The CLARA workflow orchestrator is therefore designed to dynamically assign micro-service execution to a CPU or a GPU, based on online benchmark results analyzed over a period of real-time data processing.
13

Agibalov, Oleg, and Nikolay Ventsov. "On the issue of fuzzy timing estimations of the algorithms running at GPU and CPU architectures." E3S Web of Conferences 135 (2019): 01082. http://dx.doi.org/10.1051/e3sconf/201913501082.

Abstract:
We consider the task of comparing fuzzy estimates of the execution parameters of genetic algorithms implemented on GPU (graphics processing unit) and CPU (central processing unit) architectures. The fuzzy estimates are calculated from the averaged dependences of the genetic algorithms' running time on the GPU and CPU architectures on the number of individuals in the populations processed by the algorithm. The analysis of these averaged dependences showed that a genetic algorithm can process 10,000 chromosomes on the GPU architecture or 5,000 chromosomes on the CPU architecture in approximately 2,500 ms. The following holds for the cases under consideration: "genetic algorithms (GA) are performed in approximately 2,500 ms (on average)," and the α-sections of the fuzzy sets, with α = 0.5, correspond to the intervals [2,000; 2,399] for the GA executed on the GPU architecture and [1,400; 1,799] for the GA executed on the CPU architecture. Thereby, it can be said that in this case the actual execution time of the algorithm on the GPU architecture deviates to a lesser extent from the average value than on the CPU.
14

Fortin, Pierre, and Maxime Touche. "Dual tree traversal on integrated GPUs for astrophysical N-body simulations." International Journal of High Performance Computing Applications 33, no. 5 (April 15, 2019): 960–72. http://dx.doi.org/10.1177/1094342019840806.

Abstract:
In astrophysical N-body simulations, O(N) fast multipole methods (FMMs) with dual tree traversal (DTT) on multi-core CPUs are faster than O(N log N) CPU tree-codes but can still be outperformed by GPU tree-codes. In this article, we aim at combining the best algorithm, namely FMM with DTT, with the most powerful hardware currently available, namely GPUs. In the astrophysical context, which requires low accuracies and non-uniform particle distributions, we show that such a combination can be achieved thanks to a hybrid CPU-GPU algorithm on integrated GPUs: while the DTT is performed on the CPU cores, the far- and near-field computations are all performed on the GPU cores. We show how to efficiently expose the interactions resulting from the DTT to the GPU cores, how to deploy both the far- and near-field computations on GPU, and how to overlap the parallel DTT on the CPU with GPU computations. Based on the falcON code and using OpenCL on AMD Accelerated Processing Units and on Intel integrated GPUs, this first heterogeneous deployment of DTT for FMM outperforms standard multi-core CPUs and matches GPU and high-end CPU performance, hence being more cost- and power-efficient.
15

Cao, Wei, Zheng Hua Wang, and Chuan Fu Xu. "An Out-of-Core Method for CFD Simulation in Heterogeneous Environment." Advanced Materials Research 753-755 (August 2013): 2912–15. http://dx.doi.org/10.4028/www.scientific.net/amr.753-755.2912.

Abstract:
In recent years, the highly parallel graphics processing unit (GPU) has been rapidly gaining maturity as a powerful engine for high-performance computing. However, in most computational fluid dynamics (CFD) simulations, the computational capacity of the CPU is ignored. In this paper, we propose a hybrid parallel programming model that utilizes the computational capacity of both the CPU and GPU. Considering the memory sizes of the CPU and GPU, we also propose an out-of-core method to increase the simulation scale on a single node. The experimental results show that the programming model utilizes the computational capacity of both CPU and GPU efficiently, and that the out-of-core method increases the simulation scale achievable on a single node.
16

Tang, Wenjie, Wentong Cai, Yiping Yao, Xiao Song, and Feng Zhu. "An alternative approach for collaborative simulation execution on a CPU+GPU hybrid system." SIMULATION 96, no. 3 (November 14, 2019): 347–61. http://dx.doi.org/10.1177/0037549719885178.

Abstract:
In the past few years, the graphics processing unit (GPU) has been widely used to accelerate time-consuming models in simulations. Since both model computation and simulation management are the main factors that affect the performance of large-scale simulations, accelerating only model computation limits the potential speedup. Moreover, the models that can be effectively accelerated by a GPU may be insufficient, especially for simulations with many lightweight models. Traditionally, the parallel discrete event simulation (PDES) method is used to solve this class of simulation, but most PDES simulators utilize only the central processing unit (CPU), even though GPUs are now commonly available. Hence, we propose an alternative approach for collaborative simulation execution on a CPU+GPU hybrid system, in which the GPU supports both simulation management and model computation, as the CPU does. A concurrency-oriented scheduling algorithm is proposed to enable cooperation between the CPU and the GPU, so that multiple computation and communication resources can be efficiently utilized. In addition, the GPU functions have been carefully designed to suit the algorithm. The combination of these efforts allows the proposed approach to achieve significant speedup compared to traditional PDES on a CPU.
17

Hadi, N. A., S. A. Halim, N. S. M. Lazim, and N. Alias. "Performance of CPU GPU Parallel Architecture on Segmentation and Geometrical Features Extraction of Malaysian Herb Leaves." Malaysian Journal of Mathematical Sciences 16, no. 2 (April 29, 2022): 363–77. http://dx.doi.org/10.47836/mjms.16.2.12.

Abstract:
Image recognition, comprising segmentation of image boundaries, geometrical feature extraction, and classification, is used in the development of image databases. The ultimate challenge in this task is that it is computationally expensive. This paper presents a CPU-GPU architecture for the image segmentation and feature extraction of 125 images of Malaysian herb leaves. Two GPUs and three kernels are utilized in the CPU-GPU platform using MATLAB. Each herb image has pixel dimensions of 1616 × 1080. The segmentation process uses the Sobel operator, which is then used to extract the boundary points. Finally, seven geometrical features are extracted for each image. Both processes are first executed on the CPU alone before being brought onto the CPU-GPU platform to accelerate the computation. The results show that the developed CPU-GPU platform accelerates the computation by a factor of 4.13. However, the efficiency declines, which suggests that processor utilization must be improved in the future to balance the load distribution.
18

Chen, Lin, Deshi Ye, and Guochuan Zhang. "Online Scheduling of Mixed CPU-GPU Jobs." International Journal of Foundations of Computer Science 25, no. 06 (September 2014): 745–61. http://dx.doi.org/10.1142/s0129054114500312.

Abstract:
We consider the online scheduling problem in a CPU-GPU cluster. In this problem there are two sets of processors, the CPU processors and the GPU processors. Each job has two distinct processing times, one for the CPU processor and the other for the GPU processor. Once a job is released, a decision should be made immediately about which processor it should be assigned to. The goal is to minimize the makespan, i.e., the largest completion time among all the processors. Such a problem can be seen as an intermediate model between the scheduling problems on identical machines and on unrelated machines. We provide a 3.85-competitive online algorithm for this problem and show that no online algorithm exists with a competitive ratio strictly less than 2. We also consider two special cases of this problem: the balanced case, where the number of CPU processors equals that of GPU processors, and the one-sided case, where there is only one CPU or GPU processor. For the balanced case, we first provide a simple 3-competitive algorithm, and then derive a better algorithm with a competitive ratio of 2.732. For the one-sided case, a 3-competitive algorithm is given.
19

Tao, Yu-Bo, Hai Lin, and Hu Jun Bao. "From CPU to GPU: GPU-Based Electromagnetic Computing (GPUECO)." Progress In Electromagnetics Research 81 (2008): 1–19. http://dx.doi.org/10.2528/pier07121302.
20

Liu, Zhi Yuan, and Xue Zhang Zhao. "Research and Implementation of Image Rotation Based on CUDA." Advanced Materials Research 216 (March 2011): 708–12. http://dx.doi.org/10.4028/www.scientific.net/amr.216.708.

Abstract:
GPU technology releases the CPU from burdensome graphics computing tasks. NVIDIA, the main GPU producer, has added CUDA technology to its new GPU models, which greatly enhances GPU functionality and offers a clear advantage in complex matrix computation. General algorithms for image rotation and the structure of CUDA are introduced in this paper. An example of rotating an image with HALCON, implemented both with CPU instruction extensions and with CUDA technology, demonstrates the advantage of CUDA by comparing the two results.
21

Ma, Haifeng. "Development of a CPU-GPU heterogeneous platform based on a nonlinear parallel algorithm." Nonlinear Engineering 11, no. 1 (January 1, 2022): 215–22. http://dx.doi.org/10.1515/nleng-2022-0027.

Abstract:
In order to obtain a refined model analysis software platform that balances computational accuracy and computational efficiency, a CPU-GPU heterogeneous platform based on a nonlinear parallel algorithm is developed. A modular design method is adopted to construct the architecture of the structural nonlinear analysis software and to clarify the basic analysis steps of nonlinear finite element problems, so as to determine the structure of the software system, divide it into modules, and clarify the function, interface, and call relationships of each module. The results show that when the number of model layers is 10, the GPU computation time is 210.5 s versus 1073.2 s on the CPU, so the GPU is significantly faster, with a speedup ratio of 5.1. For all the models, the GPU calculation time is much less than that of the CPU, and as the number of model degrees of freedom increases, the acceleration effect of the GPU becomes more pronounced. Therefore, the CPU-GPU heterogeneous platform can more accurately describe the nonlinear behavior of shear walls in complex stress states, while remaining computationally efficient.
22

Yoo, Seohwan, Sunjun Hwang, Hayeon Park, Jin Choi, and Chang-Gun Lee. "Hardware Interrupt-Aware CPU/GPU Scheduling on Heterogeneous Multicore and GPU System." KIISE Transactions on Computing Practices 29, no. 1 (January 31, 2023): 10–14. http://dx.doi.org/10.5626/ktcp.2022.29.1.10.
23

Woźniak, Jarosław. "Wykorzystanie CPU i GPU do obliczeń w Matlabie." Journal of Computer Sciences Institute 10 (March 30, 2019): 32–35. http://dx.doi.org/10.35784/jcsi.191.

Abstract:
The article presents selected solutions that use CPUs and graphics processors (GPUs) for computations in the Matlab environment. Various methods of performing computations on both the CPU and the GPU were compared. The differences, disadvantages, advantages, and consequences of using the selected computation methods are pointed out.
24

Wang, Qihan, Zhen Peng, Bin Ren, Jie Chen, and Robert G. Edwards. "MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation." ACM Transactions on Architecture and Code Optimization 19, no. 2 (June 30, 2022): 1–26. http://dx.doi.org/10.1145/3506705.

Abstract:
The many-body correlation function is a fundamental computation kernel in modern physics computing applications, e.g., hadron contractions in lattice quantum chromodynamics (QCD). This kernel is both computation- and memory-intensive, involving a series of tensor contractions, and thus usually runs on accelerators like GPUs. Existing optimizations of many-body correlation mainly focus on individual tensor contractions (e.g., the cuBLAS library and others). In contrast, this work discovers a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes many-body correlation's memory management by exploiting a set of memory allocation and communication redundancy elimination opportunities: first, GPU memory allocation redundancy: the intermediate output frequently occurs as input in subsequent calculations; second, CPU-GPU communication redundancy: although all tensors are allocated on both CPU and GPU, many of them are used (and reused) on the GPU side only, and thus many CPU/GPU communications (like those in existing Unified Memory designs) are unnecessary; third, GPU oversubscription: limited GPU memory size causes oversubscription issues, and existing memory management usually results in the eviction of data that is about to be reused, thus incurring extra CPU/GPU memory communications. Targeting these memory optimization opportunities, this article proposes MemHC, an optimized systematic GPU memory management framework that aims to accelerate the calculation of many-body correlation functions through a series of new memory reduction designs. These designs involve optimizations for GPU memory allocation, CPU/GPU memory movement, and GPU memory oversubscription, respectively. More specifically, first, MemHC employs duplication-aware management and lazy release of GPU memories to the corresponding host manager for better data reusability. Second, it implements data reorganization and on-demand synchronization to eliminate redundant (or unnecessary) data transfers. Third, MemHC exploits an optimized Least Recently Used (LRU) eviction policy, called Pre-Protected LRU, to reduce evictions and increase memory hits. Additionally, MemHC is portable across various platforms, including NVIDIA and AMD GPUs. The evaluation demonstrates that MemHC outperforms unified memory management by 2.18× to 10.73×. The proposed Pre-Protected LRU policy outperforms the original LRU policy by up to 1.36×.
25

Borcovas, Evaldas, and Gintautas Daunys. "CPU and GPU (CUDA) Template Matching Comparison / CPU ir GPU (CUDA) palyginimas vykdant šablonų atitikties algoritmą." Mokslas – Lietuvos ateitis 6, no. 2 (April 24, 2014): 129–33. http://dx.doi.org/10.3846/mla.2014.16.

Abstract:
Image processing, computer vision, and other complicated optical information processing algorithms require large resources, and it is often desired to execute them in real time. It is hard to fulfill such requirements with a single CPU. The CUDA technology proposed by NVidia enables the programmer to use the GPU resources in the computer. The current research was made with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I), an NVidia GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II), and an NVidia GeForce GTX 560 CUDA-compatible graphics card (GPU II). The additional libraries OpenCV 2.1 and the CUDA-compatible OpenCV 2.4.0 were used for the testing. The main tests were made with the standard function MatchTemplate from the OpenCV libraries. The algorithm uses a main image and a template, and the influence of these factors was tested: the main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the information obtained from the research, GPU computing using the hardware mentioned earlier is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and GPU is not significantly different. The choice of the template size influences calculation on the CPU. The difference in computing time between the GPUs can be explained by the number of cores they have; in our case, the faster GPU had 16 times more cores, and its computations ran correspondingly about 16 times faster.
26

Paul, Indrani, Vignesh Ravi, Srilatha Manne, Manish Arora, and Sudhakar Yalamanchili. "Coordinated Energy Management in Heterogeneous Processors." Scientific Programming 22, no. 2 (2014): 93–108. http://dx.doi.org/10.1155/2014/210762.

Abstract:
This paper examines energy management in a heterogeneous processor consisting of an integrated CPU-GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need to coordinate energy management across distinct core types, a new and less understood problem. We examine the intra-node CPU-GPU frequency sensitivity of HPC applications on tightly coupled CPU-GPU architectures as a first step in understanding power and performance optimization for heterogeneous multi-node HPC systems. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU-GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves the measured average energy-delay-squared (ED²) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
27

Campeanu, Gabriel, and Mehrdad Saadatmand. "A Two-Layer Component-Based Allocation for Embedded Systems with GPUs." Designs 3, no. 1 (January 19, 2019): 6. http://dx.doi.org/10.3390/designs3010006.

Abstract:
Component-based development is a software engineering paradigm that can facilitate the construction of embedded systems and tackle their complexity. Modern embedded systems have increasingly demanding requirements. One way to cope with such a versatile and growing set of requirements is to employ heterogeneous processing power, i.e., CPU-GPU architectures. The new CPU-GPU embedded boards deliver increased performance but also introduce additional complexity and challenges. In this work, we address the component-to-hardware allocation for CPU-GPU embedded systems. The allocation for such systems is much more complex due to the increased amount of GPU-related information. For example, while in traditional embedded systems the allocation mechanism may consider only the CPU memory usage of components to find an appropriate allocation scheme, in heterogeneous systems the GPU memory usage also needs to be taken into account in the allocation process. This paper aims at decreasing the complexity of component-to-hardware allocation by introducing a two-layer component-based architecture for heterogeneous embedded systems. The detailed CPU-GPU information of the system is abstracted at a high layer by compacting connected components into single units that behave as regular components. The allocator, based on the compacted information received from the high-level layer, computes feasible allocation schemes with decreased complexity. In the last part of the paper, the two-layer allocation method is evaluated using an existing embedded system demonstrator, namely an underwater robot.
28

Handa, Pooja, Meenu Kalra, and Rajesh Sachdeva. "A Survey on Green Computing using GPU in Image Processing." International Journal of Computers & Technology 14, no. 10 (June 28, 2015): 6135–41. http://dx.doi.org/10.24297/ijct.v14i10.1834.

Abstract:
Green computing is the practice of reducing the power consumed by a computer and thereby reducing carbon emissions. The total power consumed by a computer (excluding the monitor) at full computational load is equal to the sum of the power consumed by the GPU in its idle state and by the CPU at its full load. Recently, there has been tremendous interest in accelerating general computing applications using a graphics processing unit (GPU). The GPU now provides computing power not only for the fast processing of graphics applications but also for general, computationally complex, data-intensive applications. On the other hand, power and energy consumption are becoming important design criteria, so software designs have to consider power and energy consumption together with performance. The GPU can therefore take over 100% of the CPU's work while the CPU stays near idle, and the power consumed by the GPU will be low. Moreover, when the GPU is doing all the work, the CPU remains at a load below its idle load; hence the total power consumed will be equal to the power consumed by the CPU at a load below its idle load plus the power consumed by the GPU.
29

Ding, Li, Zhaomiao Dong, Huagang He, and Qibin Zheng. "A Hybrid GPU and CPU Parallel Computing Method to Accelerate Millimeter-Wave Imaging." Electronics 12, no. 4 (February 7, 2023): 840. http://dx.doi.org/10.3390/electronics12040840.

Abstract:
The range migration algorithm (RMA), based on the Fourier transform, is widely applied in millimeter-wave (MMW) close-range imaging because of its low operation count and small approximation error. However, its interpolation stage is inefficient due to the intensive logic control involved, which limits its speed on a graphics processing unit (GPU) platform. Therefore, in this paper, we present an acceleration optimization method based on hybrid GPU and central processing unit (CPU) parallel computation for implementing the RMA. The proposed method exploits the strong logic-control capability of the CPU to assist the GPU in processing the logic controls of the interpolation stage. The common positions of the wavenumber-domain components to be interpolated are calculated by the CPU and stored in constant memory for broadcast at any time. This avoids the repetitive computation incurred by a GPU-only scheme. The GPU is then responsible for the remaining matrix-related steps and outputs the needed wavenumber-domain values. The imaging experiments verify the acceleration efficiency of the proposed method and demonstrate that its speedup ratio is more than 15 times that of the CPU-only method and more than 2 times that of the GPU-only method.
30

Garba, Michael T., and Horacio González-Vélez. "Asymptotic Peak Utilisation in Heterogeneous Parallel CPU/GPU Pipelines: A Decentralised Queue Monitoring Strategy." Parallel Processing Letters 22, no. 02 (May 16, 2012): 1240008. http://dx.doi.org/10.1142/s0129626412400087.

Abstract:
Widespread heterogeneous parallelism is unavoidable given the emergence of general-purpose computing on graphics processing units (GPGPU). The characteristics of a graphics processing unit (GPU), including significant memory transfer latency and complex performance behavior, demand new approaches to ensuring that all available computational resources are efficiently utilized. This paper considers the simple case of a divisible workload based on widely used numerical linear algebra routines and the challenges that prevent efficient use of all the resources available to a naive SPMD application using the GPU as an accelerator. We suggest a queue monitoring strategy that facilitates resource usage with a view to balancing CPU/GPU utilization for applications that fit the pipeline-parallel architectural pattern on heterogeneous multicore/multi-node CPU and GPU systems. We propose a stochastic allocation technique that may serve as a foundation for heuristic approaches to balancing CPU/GPU workloads.
31

Chen, Yong, Hai Jin, Han Jiang, Dechao Xu, Ran Zheng, and Haocheng Liu. "Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems." Mobile Information Systems 2017 (2017): 1–10. http://dx.doi.org/10.1155/2017/1897476.

Abstract:
Static state security analysis (SSSA) is one of the most important computations for checking whether a power system is in a normal and secure operating state. Due to the intensive computations involved, it is challenging to satisfy real-time requirements with CPU-based concurrent methods. A sensitivity-analysis-based method using the graphics processing unit (GPU) is proposed for power systems; it can reduce calculation time by 40% compared to execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In the load flow analysis, a multifrontal method for sparse LU factorization is explored on the GPU through dynamic frontal task scheduling between the CPU and GPU. The varying matrix operations during the sensitivity analysis on the GPU are highly optimized in this study. The results of the performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations achieves a significant reduction in computation time.
32

Ngo, Long Thanh, Dzung Dinh Nguyen, Long The Pham, and Cuong Manh Luong. "Speedup of Interval Type 2 Fuzzy Logic Systems Based on GPU for Robot Navigation." Advances in Fuzzy Systems 2012 (2012): 1–11. http://dx.doi.org/10.1155/2012/698062.

Abstract:
As the number of rules and the sample rate of type-2 fuzzy logic systems (T2FLSs) increase, the speed of calculation becomes a problem. The T2FLS has a large amount of inherent algorithmic parallelism that modern CPU architectures do not exploit. Many rules and algorithms in the T2FLS can be sped up on a graphics processing unit (GPU) as long as the majority of computations at the various stages and components do not depend on each other. This paper demonstrates how to implement interval type-2 fuzzy logic systems (IT2-FLSs) on the GPU, with experiments on the obstacle avoidance behavior of robot navigation. GPU-based calculation is a high-performance solution that also frees up the CPU. The experimental results show that the GPU is many times faster than the CPU.
33

Echeverribar, Isabel, Mario Morales-Hernández, Pilar Brufau, and Pilar García-Navarro. "Analysis of the performance of a hybrid CPU/GPU 1D2D coupled model for real flood cases." Journal of Hydroinformatics 22, no. 5 (July 2, 2020): 1198–216. http://dx.doi.org/10.2166/hydro.2020.032.

Abstract:
Coupled 1D2D models emerged as an efficient solution for a two-dimensional (2D) representation of the floodplain combined with a fast one-dimensional (1D) schematization of the main channel. At the same time, high-performance computing (HPC) has appeared as an efficient tool for model acceleration. In this work, a previously validated 1D2D central processing unit (CPU) model is combined with an HPC technique for fast and accurate flood simulation. Given the speed of 1D schemes, a hybrid CPU/GPU model is presented that runs the 1D main channel on the CPU and accelerates the 2D floodplain with a graphics processing unit (GPU). Since the data transfer between sub-domains and devices (CPU/GPU) may be the main potential drawback of this architecture, the test cases are selected to allow a careful timing analysis. The results reveal that the speed-up depends on the 2D mesh, the event to be solved, and the 1D discretization of the main channel. Additionally, special attention must be paid to the computation of the time step size shared between the sub-models. In spite of the hybrid CPU/GPU implementation, high speed-ups are achieved in some cases.
34

Min, Seung Won, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, and Wen-mei Hwu. "Large graph convolutional network training with GPU-oriented data communication architecture." Proceedings of the VLDB Endowment 14, no. 11 (July 2021): 2087–100. http://dx.doi.org/10.14778/3476249.3476264.

Abstract:
Graph convolutional networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training a GCN requires the minibatch generator to traverse graphs and sample the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU: the CPU needs to (1) read sparse features from memory, (2) write the features into memory in a dense format, and (3) transfer the features from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, in which GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help. By removing the CPU gathering stage, our method significantly reduces the consumption of host resources and the data access latency. We further present two important techniques to achieve high host-memory access efficiency from the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness on several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65–92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.
35

Lee, Chien Yu, H. S. Lin, and H. T. Yau. "Using Graphic Hardware to Accelerate Pocketing Tool-Path Generation." Applied Mechanics and Materials 311 (February 2013): 135–40. http://dx.doi.org/10.4028/www.scientific.net/amm.311.135.

Abstract:
In this paper, we propose a new approach to accelerating pocketing tool-path generation by using graphics hardware (graphics processing units, GPUs). The intersections among tool-path elements can be eliminated more efficiently using GPU-based Voronoi diagrams. According to our experimental results, the GPU-based computation was seven to eight times faster than the CPU-based computation, while the difference in tool-path geometry between the CPU-based and GPU-based methods was insignificant. Therefore, the GPU-based method can be used to accelerate the computation while the precision required for tool-path generation in pocket machining is preserved.
36

Abramowicz, Kamil, and Przemysław Borczuk. "Comparative analysis of the performance of Unity and Unreal Engine game engines in 3D games." Journal of Computer Sciences Institute 30 (March 20, 2024): 53–60. http://dx.doi.org/10.35784/jcsi.5473.

Abstract:
The article compares the performance of the Unity and Unreal Engine game engines based on tests conducted on two nearly identical games. The research focused on frames per second, CPU usage, RAM usage, and GPU memory usage. The results showed that Unity achieved a better average frame rate, while Unreal Engine required more RAM and GPU resources. Analyzing the CPU load values revealed that on the first system Unity demanded less CPU usage; on the second system, however, Unreal Engine used over 10 percentage points less CPU. The conclusions of the research partially confirm the hypothesis that Unity requires fewer computer resources, although in some cases Unreal Engine may demand fewer CPU resources.
37

Wasiljew, A., and K. Murawski. "A new CUDA-based GPU implementation of the two-dimensional Athena code." Bulletin of the Polish Academy of Sciences: Technical Sciences 61, no. 1 (March 1, 2013): 239–50. http://dx.doi.org/10.2478/bpasts-2013-0023.

Abstract:
We present a new version of the Athena code, which solves the magnetohydrodynamic equations in two-dimensional space. This new implementation, which we have named Athena-GPU, uses the CUDA architecture to allow code execution on a graphics processing unit (GPU). The Athena-GPU code is an unofficial, modified version of the Athena code, which was originally designed for central processing unit (CPU) architectures. We perform numerical tests based on the original Athena-CPU code and its GPU counterpart to make a performance analysis, which includes execution time, precision differences, and accuracy. We narrowed our tests and analysis to double-precision floating-point operations and two-dimensional test cases. Our comparison shows that the results are similar for both versions of the code, which confirms the correctness of our CUDA-based implementation. Our tests reveal that the Athena-GPU code can be 2 to 15 times faster than the Athena-CPU code, depending on the test case, the size of the problem, and the hardware configuration.
38

Preto, Bruno, Fernando Birra, Adriano Lopes, and Pedro Medeiros. "Object Identification in Binary Tomographic Images Using GPGPUs." International Journal of Creative Interfaces and Computer Graphics 4, no. 2 (July 2013): 40–56. http://dx.doi.org/10.4018/ijcicg.2013070103.

Abstract:
The authors present a hybrid OpenCL CPU/GPU algorithm for identifying connected structures inside black-and-white 3D scientific data. This algorithm exploits parallelism at both the CPU and GPGPU levels, but the work is predominantly done on GPUs. The underlying context of this work is the structural characterization of composite materials via tomography; the algorithm allows us to later infer the location and morphology of objects inside composite materials. Moreover, execution times are very low, allowing large data sets to be processed within acceptable running times. Intermediate solutions are computed independently over a partition of the spatial domain, following the data-parallelism paradigm, and then integrated at both the GPU and CPU levels using parallel multi-cores. The authors consistently exploit parallelism both at the CPU level, by allowing the CPU stage to run in multiple concurrent threads, and at the GPU level, with massive parallelism and concurrent data transfers and kernel executions.
39

Wang, Dong-dong, and Lei Zhuang. "CPU-GPU parallel computed fire simulation." Journal of Computer Applications 29, no. 6 (August 5, 2009): 1702–6. http://dx.doi.org/10.3724/sp.j.1087.2009.01702.
40

Wang, Zhenning, Long Zheng, Quan Chen, and Minyi Guo. "CPU+GPU scheduling with asymptotic profiling." Parallel Computing 40, no. 2 (February 2014): 107–15. http://dx.doi.org/10.1016/j.parco.2013.11.003.
41

Ikuyajolu, Olawale James, Luke Van Roekel, Steven R. Brus, Erin E. Thomas, Yi Deng et Sarat Sreepathi. « Porting the WAVEWATCH III (v6.07) wave action source terms to GPU ». Geoscientific Model Development 16, no 4 (3 mars 2023) : 1445–58. http://dx.doi.org/10.5194/gmd-16-1445-2023.

Texte intégral
Résumé :
Abstract. Surface gravity waves play a critical role in several processes, including mixing, coastal inundation, and surface fluxes. Despite the growing literature on the importance of ocean surface waves, wind–wave processes have traditionally been excluded from Earth system models (ESMs) due to the high computational costs of running spectral wave models. The development of the Next Generation Ocean Model for the DOE’s (Department of Energy) E3SM (Energy Exascale Earth System Model) Project partly focuses on the inclusion of a wave model, WAVEWATCH III (WW3), into E3SM. WW3, which was originally developed for operational wave forecasting, needs to be computationally less expensive before it can be integrated into ESMs. To accomplish this, we take advantage of heterogeneous architectures at DOE leadership computing facilities and the increasing computing power of general-purpose graphics processing units (GPUs). This paper identifies the wave action source terms, W3SRCEMD, as the most computationally intensive module in WW3 and then accelerates them via GPU. Our experiments on two computing platforms, Kodiak (P100 GPU and Intel(R) Xeon(R) central processing unit, CPU, E5-2695 v4) and Summit (V100 GPU and IBM POWER9 CPU) show respective average speedups of 2× and 4× when mapping one Message Passing Interface (MPI) per GPU. An average speedup of 1.4× was achieved using all 42 CPU cores and 6 GPUs on a Summit node (with 7 MPI ranks per GPU). However, the GPU speedup over the 42 CPU cores remains relatively unchanged (∼ 1.3×) even when using 4 MPI ranks per GPU (24 ranks in total) and 3 MPI ranks per GPU (18 ranks in total). This corresponds to a 35 %–40 % decrease in both simulation time and usage of resources. Due to too many local scalars and arrays in the W3SRCEMD subroutine and the huge WW3 memory requirement, GPU performance is currently limited by the data transfer bandwidth between the CPU and the GPU. Ideally, OpenACC routine directives could be used to further improve performance. However, W3SRCEMD would require significant code refactoring to make this possible. We also discuss how the trade-off between the occupancy, register, and latency affects the GPU performance of WW3.
42

Kurniawan, Kwek Benny, and YB Dwi Setianto. "CPU AND GPU PERFORMANCE ANALYSIS ON 2D MATRIX OPERATION". Proxies: Jurnal Informatika 2, no. 1 (March 4, 2021): 1. http://dx.doi.org/10.24167/proxies.v2i1.3194.

Abstract:
GPUs (Graphics Processing Units) are available on many platforms. Traditionally, GPUs were used for rendering graphics, but they are now general-purpose parallel processors with easily accessible programming interfaces and support for industry-standard languages such as C, Python, and Fortran. In this study, the authors compare the CPU and the GPU on a set of matrix calculations. To do so, they measure processing-unit load, memory usage, and computing time while varying matrix sizes and dimensions. The tests show that asynchronous GPU execution is faster than sequential execution. Furthermore, the number of GPU threads needs to be tuned to use the GPU efficiently.
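As an illustration of the kind of experiment described, here is a minimal CUDA sketch of a 2D matrix operation with a tunable thread-block shape; the kernel and sizes are illustrative, not the study's actual benchmark code.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void matAdd(const float* a, const float* b, float* c, int n) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n) c[y * n + x] = a[y * n + x] + b[y * n + x];
}

int main() {
    int n = 1024;  size_t bytes = size_t(n) * n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes); cudaMallocManaged(&b, bytes); cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.f; b[i] = 2.f; }
    dim3 block(16, 16);                      // the thread-block shape is the tuning knob
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    matAdd<<<grid, block>>>(a, b, c, n);     // kernel launch is asynchronous w.r.t. the CPU
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
}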
43

Климонов, И. А., В. Д. Корнеев, and В. М. Свешников. "Parallelization technologies for solving three-dimensional boundary value problems on quasi-structured grids using the CPU+GPU hybrid computing environment". Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), no. 1 (March 29, 2016): 65–71. http://dx.doi.org/10.26089/nummet.v17r107.

Abstract:
When parallelizing the solution of three-dimensional boundary value problems on quasi-structured grids by decomposing the computational domain into non-overlapping conjugated subdomains, the most time-consuming computational procedure is the solution of the boundary value subproblems in the subdomains. Using parallelepipedal quasi-structured grids makes it possible to apply rapidly convergent alternating direction methods for this purpose. The parallelization of the iterative process across subdomains is performed on CPUs using MPI, while this paper proposes using graphics processing units (GPUs) to solve the subproblems. Experimental results on using GPUs to solve the subproblems by the Peaceman-Rachford method are presented, along with experimental estimates of the speedup achieved in the CPU+GPU hybrid computing environment compared to CPU-only computations.
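A minimal CUDA sketch of the per-line tridiagonal solve that an alternating direction sweep reduces to, assuming one thread per grid line and the Thomas algorithm (the paper's own implementation details are not shown here):

// One thread solves one tridiagonal system (one grid line); an ADI sweep
// launches this once per spatial direction. a, b, c are the sub-, main-,
// and super-diagonals; d holds the right-hand side and, on exit, the solution.
__global__ void thomas_lines(double* a, double* b, double* c, double* d,
                             int nlines, int n) {
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    if (line >= nlines) return;
    double* aa = a + line * n; double* bb = b + line * n;
    double* cc = c + line * n; double* dd = d + line * n;
    for (int i = 1; i < n; ++i) {            // forward elimination
        double m = aa[i] / bb[i - 1];
        bb[i] -= m * cc[i - 1];
        dd[i] -= m * dd[i - 1];
    }
    dd[n - 1] /= bb[n - 1];                  // back substitution
    for (int i = n - 2; i >= 0; --i)
        dd[i] = (dd[i] - cc[i] * dd[i + 1]) / bb[i];
}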
44

Hasif Azman, Ahmad, Syed Abdul Mutalib Al Junid, Abdul Hadi Abdul Razak, Mohd Faizul Md Idros, Abdul Karimi Halim, and Fairul Nazmie Osman. "Performance Evaluation of SW Algorithm on NVIDIA GeForce GTX TITAN X Graphic Processing Unit (GPU)". Indonesian Journal of Electrical Engineering and Computer Science 12, no. 2 (November 1, 2018): 670. http://dx.doi.org/10.11591/ijeecs.v12.i2.pp670-676.

Abstract:
The demand for high-performance, sensitive alignment tools has grown as bioinformatics studies have demonstrated the value of Deoxyribonucleic Acid (DNA) analysis in molecular biology. This paper therefore reports a performance evaluation of a parallel Smith-Waterman algorithm implementation on the NVIDIA GeForce GTX Titan X Graphics Processing Unit (GPU) against a Central Processing Unit (CPU) implementation running on an Intel® Core™ i5-4440S CPU at 2.80 GHz. Both designs were developed in the C programming language and targeted at their respective platforms; the GPU code was developed and compiled with the NVIDIA Compute Unified Device Architecture (CUDA). The results clearly show that the GPU-based implementation outperforms the CPU-based one, indicating that GPU-based DNA sequence alignment accelerates the computational process of sequence alignment.
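A minimal CUDA sketch of the standard anti-diagonal parallelization of Smith-Waterman scoring, with illustrative match/mismatch/gap scores (the paper's actual kernel and scoring parameters are not given here). Cells on one anti-diagonal of the dynamic-programming matrix are independent, so each launch scores one diagonal in parallel:

// H is an (m+1) x (n+1) score matrix initialized to zero; q and s are the
// query and subject sequences. One launch fills all cells with i + j == diag.
__global__ void sw_diagonal(const char* q, const char* s, int* H,
                            int m, int n, int diag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // 1-based row
    int j = diag - i;                                   // column on this diagonal
    if (i > m || j < 1 || j > n) return;
    int match = (q[i - 1] == s[j - 1]) ? 2 : -1;        // illustrative scores
    int gap = -1;
    int h = H[(i - 1) * (n + 1) + (j - 1)] + match;
    h = max(h, H[(i - 1) * (n + 1) + j] + gap);
    h = max(h, H[i * (n + 1) + (j - 1)] + gap);
    H[i * (n + 1) + j] = max(h, 0);                     // local alignment floor
}
// Host side: for (int d = 2; d <= m + n; ++d) sw_diagonal<<<blocks, threads>>>(q, s, H, m, n, d);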
45

Gustavo Araujo Alvaro Coelho, Atila Saraiva Quintela Soares, João Henrique Speglich, and Marcelo Oliveira da Silva. "Enhancing DEVITO GPU Allocator Using Unified Memory by NVIDIA". JOURNAL OF BIOENGINEERING, TECHNOLOGIES AND HEALTH 6, Suppl1 (February 9, 2023): 14–16. http://dx.doi.org/10.34178/jbth.v6isuppl1.267.

Abstract:
DEVITO is a framework whose objective is to implement optimized stencil computations. It can execute on both the CPU and the GPU; data must therefore be managed so that, for GPU executions, it is present in GPU memory at execution time. Natively, DEVITO transfers data via OpenACC pragmas every time the operator is executed. This approach degrades performance when the operator is executed repeatedly. To prevent redundant copies and alleviate this bottleneck, an allocator based on unified memory was implemented, which makes manual data transfer between CPU and GPU unnecessary and significantly reduces data-transfer time in GPU applications.
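A minimal sketch of the unified memory idea, assuming a simple allocate/free interface (uvm_alloc and uvm_free are illustrative names, not DEVITO's actual allocator API):

#include <cuda_runtime.h>

// Unified memory: one pointer valid on both CPU and GPU; pages migrate on
// demand, so no explicit copies are issued between repeated operator runs.
void* uvm_alloc(size_t bytes) {
    void* p = nullptr;
    cudaMallocManaged(&p, bytes);
    return p;
}

void uvm_free(void* p) { cudaFree(p); }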
46

Fang, Juan, Mengxuan Wang, and Zelin Wei. "A memory scheduling strategy for eliminating memory access interference in heterogeneous system". Journal of Supercomputing 76, no. 4 (January 10, 2020): 3129–54. http://dx.doi.org/10.1007/s11227-019-03135-7.

Abstract:
Multiple CPUs and GPUs are integrated on the same chip and share memory, so memory access requests from different cores interfere with one another. Memory requests from the GPU seriously interfere with CPU memory access performance; requests from multiple CPUs are also intertwined when accessing memory, which greatly degrades their performance; and differences in access latency between GPU cores increase the average memory access latency. To solve these problems in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy that improves system performance. The strategy first creates separate memory request queues based on the request source when the memory controller receives a request, isolating CPU requests from GPU requests and thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is applied: requests are dynamically mapped to different bank sets according to the memory characteristics of each application, eliminating memory request interference between CPU applications without affecting bank-level parallelism. Finally, for the GPU request queue, a criticality metric is introduced to measure the differences in memory access latency between cores; on top of the first-ready, first-come-first-served policy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.
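As a toy host-side C++ model of the first step of this strategy (queue separation plus a first-ready pick), under the simplifying assumptions of a single bank and strict CPU-over-GPU priority, one might write:

#include <deque>

struct Req { bool from_gpu; int row; long arrival; };

struct Controller {
    std::deque<Req> cpu_q, gpu_q;   // requests split by source at arrival
    int open_row = -1;              // currently open DRAM row

    void enqueue(const Req& r) { (r.from_gpu ? gpu_q : cpu_q).push_back(r); }

    // First-ready: serve a request hitting the open row if any, else the oldest.
    bool pick(std::deque<Req>& q, Req& out) {
        if (q.empty()) return false;
        for (auto it = q.begin(); it != q.end(); ++it)
            if (it->row == open_row) { out = *it; q.erase(it); return true; }
        out = q.front(); q.pop_front();
        return true;
    }

    bool next(Req& out) {           // CPU queue has priority over the GPU queue
        if (pick(cpu_q, out) || pick(gpu_q, out)) { open_row = out.row; return true; }
        return false;
    }
};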
47

Gan, Xin Biao, Li Shen, Zhi Ying Wang, Xin Lai, and Qi Zhu. "Parallelizing Network Coding Using CUDA". Advanced Materials Research 186 (January 2011): 484–88. http://dx.doi.org/10.4028/www.scientific.net/amr.186.484.

Abstract:
Network coding has emerged as a promising technique for improving network throughput and bandwidth. However, its high computational complexity keeps its practicality a challenge. At the same time, GPU-accelerated applications have been confined to the GPU acting as a coprocessor consuming datasets transferred from the CPU. Therefore, an aggressive parallel network coding scheme is customized for the GPU using CUDA (Compute Unified Device Architecture): datasets are partitioned to exploit both thread-level and data-level parallelism, and CPU-GPU collaboration with texture-cache decoding is introduced so that the GPU can act not only as a data consumer but also as a data producer. Random linear network coding is parallelized on a CUDA-enabled GPU to validate the proposed techniques. Experimental results demonstrate that the proposed techniques effectively parallelize network coding on a GPU-accelerated system.
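A minimal CUDA sketch of the coding kernel idea, simplified to GF(2) so that a linear combination reduces to XOR (real random linear network coding typically works over GF(2^8), and this is not the authors' kernel):

// Each thread produces one byte of the coded packet by XORing the bytes at
// the same position of every source packet selected by the coefficient vector.
__global__ void encode_gf2(const unsigned char* src, const unsigned char* coeff,
                           unsigned char* coded, int npkts, int plen) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // byte index in the packet
    if (b >= plen) return;
    unsigned char acc = 0;
    for (int p = 0; p < npkts; ++p)
        if (coeff[p]) acc ^= src[p * plen + b];      // XOR = addition in GF(2)
    coded[b] = acc;
}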
48

Nascimento, Ernandes, Elisan Magalhães, Arthur Azevedo, Luiz E. S. Paes, and Ariel Oliveira. "An Implementation of LASER Beam Welding Simulation on Graphics Processing Unit Using CUDA". Computation 12, no. 4 (April 17, 2024): 83. http://dx.doi.org/10.3390/computation12040083.

Abstract:
The maximum number of parallel threads in traditional CFD solutions is limited by the Central Processing Unit (CPU) capacity, which is lower than that of a modern Graphics Processing Unit (GPU). In this context, the GPU allows simultaneous processing of many parallel threads with double-precision floating-point formatting. The present study evaluates the advantages and drawbacks of implementing LASER Beam Welding (LBW) simulations on the CUDA platform. The performance of the developed code was compared to that of three top-rated commercial codes executed on the CPU. The unsteady three-dimensional heat conduction Partial Differential Equation (PDE) was discretized in space and time using the Finite Volume Method (FVM). The Volumetric Thermal Capacitor (VTC) approach was employed to model melting-solidification. The GPU solutions were computed using an in-house CUDA-C code, running on a Gigabyte Nvidia GeForce RTX™ 3090 video card and an MSI 4090 video card (both made in Hsinchu, Taiwan), each with 24 GB of memory. The commercial solutions were executed on an Intel® Core™ i9-12900KF CPU (made in Hillsboro, Oregon, United States of America) with a 3.6 GHz base clock and 16 cores. The results demonstrated that GPU and CPU processing achieve similar precision, but the GPU solution exhibited significantly faster speeds and greater power efficiency, resulting in speed-ups ranging from 75.6 to 1351.2 times compared to the CPU solutions. The in-house code also demonstrated optimized memory usage, with an average of 3.86 times less RAM utilization. Therefore, adopting parallelized algorithms run on the GPU can reduce CFD computational costs compared to traditional codes while maintaining high accuracy.
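A minimal CUDA sketch of a double-precision explicit heat conduction update on a 3D grid, illustrating the kind of stencil such a solver parallelizes (the VTC melting-solidification treatment is omitted, and the kernel name is illustrative):

// One thread updates one interior cell of the temperature field.
// r = alpha * dt / dx^2 for a uniform grid; boundaries are left untouched.
__global__ void heat3d(const double* T, double* Tn,
                       int nx, int ny, int nz, double r) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;
    int id = (k * ny + j) * nx + i;
    Tn[id] = T[id] + r * (T[id - 1] + T[id + 1]
                        + T[id - nx] + T[id + nx]
                        + T[id - nx * ny] + T[id + nx * ny]
                        - 6.0 * T[id]);
}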
49

Chen, Xiang, and Decheng Wan. "Numerical Simulation of Three-Dimensional Violent Free Surface Flows by GPU-Based MPS Method". International Journal of Computational Methods 16, no. 04 (May 13, 2019): 1843012. http://dx.doi.org/10.1142/s0219876218430120.

Abstract:
The Moving Particle Semi-implicit (MPS) method has been widely used in computational fluid dynamics in recent years. However, the inefficiency of the MPS method limits its large-scale three-dimensional (3D) applications. To overcome this disadvantage, a novel acceleration technique, graphics processing unit (GPU) parallel computing, is applied to MPS. Based on a modified MPS method and GPU techniques, an in-house solver, MPSGPU-SJTU, has been developed using the Compute Unified Device Architecture (CUDA) language. In this paper, 3D dam break and sloshing, two typical violent flows with large deformation and nonlinear fragmentation of the free surface, are simulated by the MPSGPU-SJTU solver. In the dam-break case, the GPU results for the fluid field, water front, wave height, and impact pressure are compared with CPU calculations, experimental research, and Smoothed Particle Hydrodynamics (SPH) and Boundary Element Method (BEM) simulations; for the sloshing flow, the fluid field and impact pressure are compared among GPU, CPU, and experiment. These comparisons verify the accuracy of the GPU solver. Moreover, the computation time of every part of each calculation step is compared between the GPU and CPU solvers. The results show that computational efficiency is improved dramatically by the GPU acceleration technique.
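A minimal CUDA sketch of a brute-force MPS particle number-density evaluation, assuming the standard weight function w(r) = re/r - 1 for r < re (a production solver such as the one described would use cell-linked neighbor lists rather than this O(N²) loop):

// Each thread sums the MPS kernel weight over all neighbors of one particle.
__global__ void number_density(const double3* pos, double* nd, int n, double re) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sum = 0.0;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        double dx = pos[j].x - pos[i].x;
        double dy = pos[j].y - pos[i].y;
        double dz = pos[j].z - pos[i].z;
        double r = sqrt(dx * dx + dy * dy + dz * dz);
        if (r > 0.0 && r < re) sum += re / r - 1.0;
    }
    nd[i] = sum;
}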
50

Huang, M., J. Mielikainen, B. Huang, H. Chen, H. L. A. Huang, and M. D. Goldberg. "Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme". Geoscientific Model Development 8, no. 9 (September 30, 2015): 2977–90. http://dx.doi.org/10.5194/gmd-8-2977-2015.

Abstract:
The planetary boundary layer (PBL) is the lowest part of the atmosphere; its character is directly affected by contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above, and thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the Weather Research and Forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massively parallel computing architecture while maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core when two K40 GPUs are applied.
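A minimal CUDA sketch of the column-per-thread mapping that column physics schemes such as PBL parameterizations commonly use on GPUs; the vertical update shown is purely illustrative, not the YSU equations:

// PBL physics is independent between horizontal grid points, so each thread
// handles one (i, j) column and keeps the vertical loop inside the thread.
__global__ void pbl_columns(float* theta, int ncols, int nlev) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncols) return;
    for (int k = 1; k < nlev; ++k) {
        int id = col * nlev + k;
        // placeholder vertical mixing step, standing in for the real scheme
        theta[id] += 0.5f * (theta[id - 1] - theta[id]);
    }
}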
