Academic literature on the topic 'GPU-CPU'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'GPU-CPU.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "GPU-CPU"

1

Zhu, Ziyu, Xiaochun Tang, and Quan Zhao. "A unified schedule policy of distributed machine learning framework for CPU-GPU cluster." Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University 39, no. 3 (June 2021): 529–38. http://dx.doi.org/10.1051/jnwpu/20213930529.

Abstract:
With the widespread use of GPU hardware, more and more distributed machine learning applications use CPU-GPU hybrid cluster resources to improve algorithm efficiency. However, existing distributed machine learning scheduling frameworks consider task scheduling either on CPU resources or on GPU resources alone; even those that account for the difference between CPU and GPU resources struggle to improve the resource usage of the entire system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to schedule a job's tasks efficiently. In the full paper, we propose a CPU-GPU hybrid cluster scheduling framework in detail. First, according to the different characteristics of CPU and GPU computing power, the data is divided into fragments of different sizes to match the CPU and GPU computing resources. Second, the paper introduces the task scheduling method for the CPU-GPU hybrid cluster. Finally, the proposed method is verified at the end of the paper. In our verification with K-Means, the CPU-GPU hybrid computing framework increases the performance of K-Means by about 1.5 times, and the performance improves significantly as the number of GPUs increases.
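
The fragment-sizing idea generalizes: give each device a share of the data proportional to its measured throughput. A minimal host-side sketch in CUDA C++, assuming per-device throughputs come from a short calibration run (all numbers below are purely illustrative, not from the paper):

    // Split n data points between CPU and GPU in proportion to measured throughput.
    #include <cstddef>
    #include <cstdio>

    struct Split { size_t cpu_items, gpu_items; };

    Split proportional_split(size_t n, double cpu_rate, double gpu_rate) {
        // Round the GPU share to the nearest item; the CPU takes the remainder.
        size_t gpu_items = (size_t)(n * (gpu_rate / (cpu_rate + gpu_rate)) + 0.5);
        return { n - gpu_items, gpu_items };
    }

    int main() {
        // Hypothetical calibration: the GPU processes 12x more points per second.
        Split s = proportional_split(1000000, 1.0e6, 1.2e7);
        std::printf("CPU: %zu items, GPU: %zu items\n", s.cpu_items, s.gpu_items);
        return 0;
    }
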
2

Cui, Pengjie, Haotian Liu, Bo Tang, and Ye Yuan. "CGgraph: An Ultra-Fast Graph Processing System on Modern Commodity CPU-GPU Co-processor." Proceedings of the VLDB Endowment 17, no. 6 (February 2024): 1405–17. http://dx.doi.org/10.14778/3648160.3648179.

Abstract:
In recent years, many CPU-GPU heterogeneous graph processing systems have been developed in both academia and industry to facilitate large-scale graph processing in various applications, e.g., social networks and biological networks. However, the performance of existing systems can be significantly improved by addressing two prevailing challenges: GPU memory over-subscription and efficient CPU-GPU cooperative processing. In this work, we propose CGgraph, an ultra-fast CPU-GPU graph processing system to address these challenges. In particular, CGgraph overcomes GPU memory over-subscription by extracting a subgraph which only needs to be loaded into GPU memory once, but whose vertices and edges can be used in multiple iterations during the graph processing procedure. To support efficient CPU-GPU co-processing, we design a CPU-GPU cooperative processing scheme, which balances the workloads between CPU and GPU by on-demand task allocation. To evaluate the efficiency of CGgraph, we conduct extensive experiments, comparing it with 7 state-of-the-art systems using 4 well-known graph algorithms on 6 real-world graphs. Our prototype system CGgraph outperforms all existing systems, delivering up to an order of magnitude improvement. Moreover, CGgraph on a modern commodity machine with a CPU-GPU co-processor yields superior (or at the very least, comparable) performance compared to existing systems on a high-end CPU-GPU server.
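
On-demand task allocation between CPU and GPU can be pictured as both sides pulling chunks from one shared work pointer. A hedged CUDA C++ sketch, not CGgraph's actual code: the doubling kernel stands in for real graph kernels, chunk sizes are illustrative, and the input is assumed to be mirrored on host and device beforehand.

    #include <algorithm>
    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cuda_runtime.h>

    // Stand-in for per-vertex/per-edge work; a real system runs graph kernels here.
    __global__ void process_chunk_gpu(const int* in, int* out, size_t begin, size_t end) {
        size_t i = begin + blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < end) out[i] = in[i] * 2;
    }

    void cooperate(const int* h_in, int* h_out, const int* d_in, int* d_out, size_t n) {
        std::atomic<size_t> next{0};                      // shared work pointer
        const size_t cpu_chunk = 1 << 14, gpu_chunk = 1 << 20;

        std::thread gpu([&] {                             // GPU feeder grabs large chunks
            for (size_t b; (b = next.fetch_add(gpu_chunk)) < n; ) {
                size_t e = std::min(b + gpu_chunk, n);
                process_chunk_gpu<<<(unsigned)((e - b + 255) / 256), 256>>>(d_in, d_out, b, e);
            }
            cudaDeviceSynchronize();
        });
        std::vector<std::thread> cpus;
        for (int t = 0; t < 3; ++t)                       // CPU workers grab small chunks
            cpus.emplace_back([&] {
                for (size_t b; (b = next.fetch_add(cpu_chunk)) < n; )
                    for (size_t i = b, e = std::min(b + cpu_chunk, n); i < e; ++i)
                        h_out[i] = h_in[i] * 2;
            });
        gpu.join();
        for (auto& t : cpus) t.join();
        // Results now live partly in h_out and partly in d_out; the GPU-owned
        // ranges must be copied back and merged before use.
    }
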
3

Lee, Taekhee, and Young J. Kim. "Massively parallel motion planning algorithms under uncertainty using POMDP." International Journal of Robotics Research 35, no. 8 (August 21, 2015): 928–42. http://dx.doi.org/10.1177/0278364915594856.

Abstract:
We present new parallel algorithms that solve continuous-state partially observable Markov decision process (POMDP) problems using the GPU (gPOMDP) and a hybrid of the GPU and CPU (hPOMDP). We choose the Monte Carlo value iteration (MCVI) method as our base algorithm and parallelize it using the multi-level parallel formulation of MCVI. For each parallel level, we propose efficient algorithms to utilize the massive data parallelism available on modern GPUs. Our GPU-based method uses two workload distribution techniques, compute/data interleaving and workload balancing, in order to obtain the maximum parallel performance at the highest level. We also present a CPU-GPU hybrid method that takes advantage of both CPU and GPU parallelism in order to solve highly complex POMDP planning problems: the CPU is responsible for data preparation, while the GPU performs Monte Carlo simulations, and these operations run concurrently using the compute/data overlap technique between the CPU and GPU. To the best of the authors' knowledge, our algorithms are the first parallel algorithms that efficiently execute POMDP in a massively parallel fashion utilizing the GPU or a hybrid of the GPU and CPU. Our algorithms outperform the existing CPU-based algorithm by a factor of 75–99, depending on the chosen benchmark.
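
The compute/data overlap technique mentioned here is conventionally built on pinned host buffers, cudaMemcpyAsync, and a stream, with the CPU filling one buffer while the GPU consumes the other. A minimal double-buffering sketch under those assumptions; simulate and prepare_batch are hypothetical placeholders for the Monte Carlo step and the CPU-side data preparation:

    #include <cuda_runtime.h>

    __global__ void simulate(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 0.5f;              // placeholder for one Monte Carlo step
    }

    void prepare_batch(float* buf, int n, int batch) { // hypothetical CPU-side preparation
        for (int i = 0; i < n; ++i) buf[i] = (float)(batch + i);
    }

    int main() {
        const int n = 1 << 20, batches = 8;
        float *h[2], *d_in, *d_out;
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMallocHost(&h[0], n * sizeof(float));      // pinned, so copies can be async
        cudaMallocHost(&h[1], n * sizeof(float));
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        prepare_batch(h[0], n, 0);
        for (int b = 0; b < batches; ++b) {
            cudaMemcpyAsync(d_in, h[b % 2], n * sizeof(float), cudaMemcpyHostToDevice, s);
            simulate<<<(n + 255) / 256, 256, 0, s>>>(d_in, d_out, n);
            if (b + 1 < batches)
                prepare_batch(h[(b + 1) % 2], n, b + 1); // CPU prepares while GPU computes
            cudaStreamSynchronize(s);                    // batch b consumed; buffer reusable
        }
        cudaFree(d_out); cudaFree(d_in);
        cudaFreeHost(h[1]); cudaFreeHost(h[0]);
        cudaStreamDestroy(s);
        return 0;
    }
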
4

Yogatama, Bobbi W., Weiwei Gong, and Xiangyao Yu. "Orchestrating data placement and query execution in heterogeneous CPU-GPU DBMS." Proceedings of the VLDB Endowment 15, no. 11 (July 2022): 2491–503. http://dx.doi.org/10.14778/3551793.3551809.

Abstract:
There has been growing interest in using GPUs to accelerate data analytics due to their massive parallelism and high memory bandwidth. The main constraint on using GPUs for data analytics is the limited capacity of GPU memory. Heterogeneous CPU-GPU query execution is a compelling approach to mitigating the limited GPU memory capacity and PCIe bandwidth. However, the design space of heterogeneous CPU-GPU query execution has not been fully explored. We aim to improve on state-of-the-art CPU-GPU data analytics engines by optimizing data placement and heterogeneous query execution. First, we introduce a semantic-aware fine-grained caching policy which takes into account various aspects of the workload, such as query semantics, data correlation, and query frequency, when determining data placement between CPU and GPU. Second, we introduce a heterogeneous query executor which can fully exploit data on both CPU and GPU and coordinate query execution at a fine granularity. We integrate both solutions in Mordred, our novel hybrid CPU-GPU data analytics engine. Evaluation on the Star Schema Benchmark shows that the semantic-aware caching policy outperforms the best traditional caching policy by up to 3x, and that Mordred outperforms existing GPU DBMSs by an order of magnitude.
5

Power, Jason, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood. "gem5-gpu: A Heterogeneous CPU-GPU Simulator." IEEE Computer Architecture Letters 14, no. 1 (January 1, 2015): 34–36. http://dx.doi.org/10.1109/lca.2014.2299539.

6

Raju, K., and Niranjan N. Chiplunkar. "Performance Enhancement of CUDA Applications by Overlapping Data Transfer and Kernel Execution." Applied Computer Science 17, no. 3 (September 30, 2021): 5–18. http://dx.doi.org/10.35784/acs-2021-17.

Abstract:
The CPU-GPU combination is a widely used heterogeneous computing system in which the CPU and GPU have separate address spaces. Since the GPU cannot directly access CPU memory, the input data must be available in GPU memory before a GPU function is invoked, and on completion of the GPU function the results are transferred back to CPU memory. This CPU-GPU data transfer happens over the PCI-Express (PCIe) bus, whose bandwidth is much lower than that of GPU memory; the transfer speed is therefore limited by the PCIe bandwidth, and PCIe acts as a performance bottleneck. In this paper two approaches to minimizing the data transfer overhead are discussed: performing the data transfer while the GPU function is executing, and reducing the amount of data to be transferred to the GPU. The effect of these approaches on the execution time of a set of CUDA applications is evaluated using CUDA streams. The results of our experiments show that the execution time of applications can be reduced with the proposed approaches.
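
The first approach, overlapping transfer with kernel execution, is typically realized by splitting the input into chunks and pipelining them through multiple CUDA streams, so that the copy of one chunk overlaps the kernel of another. A sketch under the usual assumptions (pinned host buffer, size divisible by the chunk count; scale is a placeholder kernel):

    #include <cuda_runtime.h>

    __global__ void scale(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void process(float* h, float* d, int n) {      // h must be pinned host memory
        const int chunk = n / 4;                   // assumes n is a multiple of 4
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
        for (int c = 0; c * chunk < n; ++c) {
            cudaStream_t st = s[c % 2];
            int off = c * chunk;
            // While this copy runs, the previous chunk's kernel executes in the other stream.
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
            scale<<<(chunk + 255) / 256, 256, 0, st>>>(d + off, chunk);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    }
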
7

Liu, Gaogao, Wenbo Yang, Peng Li, Guodong Qin, Jingjing Cai, Youming Wang, Shuai Wang, Ning Yue, and Dongjie Huang. "MIMO Radar Parallel Simulation System Based on CPU/GPU Architecture." Sensors 22, no. 1 (January 5, 2022): 396. http://dx.doi.org/10.3390/s22010396.

Abstract:
The data volume and computational load of MIMO radar are huge, so very high-speed computation is necessary for real-time processing. In this paper, we study the time-division MIMO radar signal processing flow and propose an improved MIMO radar signal processing algorithm that raises processing speed relative to previous algorithms. On this basis, a parallel simulation system for MIMO radar based on a CPU/GPU architecture is proposed: the outer layer of the framework uses coarse-grained OpenMP parallelism for acceleration on the CPU, while the inner layer of fine-grained data processing is accelerated on the GPU. Its performance is significantly faster than serial computation, and satisfactory acceleration is achieved in the CPU/GPU architecture simulation. The experimental results show that the MIMO radar parallel simulation system with CPU/GPU architecture greatly improves on the computing power of the CPU-based method: compared with the serial CPU method, the GPU simulation achieves a speedup of 130 times, and the CPU/GPU-based parallel simulation system delivers a further 13% performance improvement over the GPU-only method.
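
The two-layer structure described here, coarse-grained OpenMP outside and fine-grained GPU work inside, can be sketched as one CPU thread per channel, each driving its own CUDA stream. In this illustration pulse_compress is a hypothetical stand-in for the per-channel signal processing chain; a file like this would be compiled with, e.g., nvcc -Xcompiler -fopenmp:

    #include <cuda_runtime.h>
    #include <omp.h>

    __global__ void pulse_compress(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 0.5f;   // stand-in for matched filtering, Doppler processing, etc.
    }

    void process_channels(float** d_chan, int channels, int n) {
        #pragma omp parallel for   // coarse-grained: one CPU thread per group of channels
        for (int c = 0; c < channels; ++c) {
            cudaStream_t s;
            cudaStreamCreate(&s);  // fine-grained: each channel runs on its own stream
            pulse_compress<<<(n + 255) / 256, 256, 0, s>>>(d_chan[c], n);
            cudaStreamSynchronize(s);
            cudaStreamDestroy(s);
        }
    }
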
8

Zou, Yong Ning, Jue Wang, and Jian Wei Li. "Cutting Display of Industrial CT Volume Data Based on GPU." Advanced Materials Research 271-273 (July 2011): 1096–102. http://dx.doi.org/10.4028/www.scientific.net/amr.271-273.1096.

Abstract:
The rapid development of Graphics Processing Units (GPUs) in recent years in terms of performance and programmability has attracted the attention of those seeking to leverage alternative architectures for better performance than commodity CPUs can provide. This paper presents a new algorithm for cutting display of computed tomography volume data on the GPU. We first introduce the programming model of the GPU and outline the implementation of techniques for oblique-plane cutting display of volume data on both the CPU and GPU. We compare the approaches and present performance results for both. The results show that the cutting display image generated by the GPU algorithm is clear, and the frame rate on the GPU is 2-9 times that on the CPU.
9

Jiang, Ronglin, Shugang Jiang, Yu Zhang, Ying Xu, Lei Xu, and Dandan Zhang. "GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform." International Journal of Antennas and Propagation 2014 (2014): 1–8. http://dx.doi.org/10.1155/2014/321081.

Abstract:
This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, parallelized with the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached than with a traditional pure-GPU code. In our experiments, 64 NVIDIA Tesla K20m GPUs and 64 Intel Xeon E5-2670 CPUs are used to carry out the pure-CPU, pure-GPU, and CPU + GPU tests. Relative to pure-CPU calculations of the same problems, the speedup achieved by CPU + GPU calculations is around 14. Compared to pure-GPU calculations of the same problems, the CPU + GPU calculations show a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small; this code, however, can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure-GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those of MoM. Results show good agreement between them.
10

Semenenko, Julija, Aliaksei Kolesau, Vadimas Starikovičius, Artūras Mackūnas, and Dmitrij Šešok. "Comparison of GPU and CPU Efficiency While Solving Heat Conduction Problems." Mokslas - Lietuvos ateitis 12 (November 24, 2020): 1–5. http://dx.doi.org/10.3846/mla.2020.13500.

Abstract:
This paper provides an overview of GPU usage for solving different engineering problems, a comparison between CPU and GPU computations, and an overview of the heat conduction problem. The Jacobi iterative algorithm was implemented using Python, the TensorFlow GPU library, and NVIDIA CUDA technology. Numerical experiments were conducted with 6 CPUs and 4 GPUs. The fastest GPU tested completed the calculations 19 times faster than the slowest CPU; on average, the GPUs were 9 to 11 times faster than the CPUs. Significant relative speed-up in GPU calculations starts when the matrix contains at least 400² floating-point numbers.
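
For context, the update such heat-conduction solvers iterate is the standard five-point Jacobi step applied to each interior grid point (standard form, not quoted from the paper):

    u_{i,j}^{(k+1)} = ( u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)} ) / 4

Each new value is the average of the four neighbors from the previous iteration, which is why the method parallelizes so naturally: every grid point can be updated independently.
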

Dissertations / Theses on the topic "GPU-CPU"

1

Fang, Zhuowen. "Java GPU vs CPU Hashing Performance." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-33994.

Abstract:
In recent years, public interest in blockchain technology has been growing since it was introduced in 2008, primarily because of its ability to create an immutable ledger for storing information that never will or can be changed. As an expanding chain structure, the act of nodes adding blocks to the chain is called mining, which is regulated by a consensus mechanism. In the most widely used consensus mechanism, Proof of Work, this process is based on computationally heavy guessing of block hashes. Thanks to the development of hardware technology, several prominent ways of performing this guessing exist today: using the general-purpose central processing unit (CPU), using the more specialized graphics processing unit (GPU), or using dedicated hardware. This thesis studied the working principles of blockchain and implemented the crucial hash function used in the Proof of Work consensus mechanism and other blockchain structures in the popular programming language Java on various platforms. The CPU implementation is done with Java's built-in functions, and for the GPU I used JOCL, OpenCL's Java binding. This project gives a quantified measurement of hash rate on different devices and determines that all the GPUs tested have an advantage over the CPUs in performance and memory consumption. Java's built-in function is easier to use, but both implementations are platform-independent: the same code can easily be executed on different platforms. Furthermore, based on the measurements, I explored the underlying principles in depth, proposed future work, and analyzed the implementations' application value, considering implementation difficulties, performance, and the future possibilities of blockchain.
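
Hash-rate measurements of this kind boil down to timing a kernel in which each thread hashes one candidate nonce. A toy CUDA sketch, with FNV-1a standing in for the real proof-of-work hash (the thesis itself uses Java with JOCL and a cryptographic hash; everything here is illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ unsigned int fnv1a(unsigned int x) {
        unsigned int h = 2166136261u;          // FNV offset basis
        for (int b = 0; b < 4; ++b) {
            h ^= (x >> (8 * b)) & 0xFFu;       // fold in one byte of the nonce
            h *= 16777619u;                    // FNV prime
        }
        return h;
    }

    __global__ void hash_range(unsigned int* out, unsigned int n) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = fnv1a(i);          // each thread guesses one nonce
    }

    int main() {
        const unsigned int n = 1u << 24;
        unsigned int* d;
        cudaMalloc(&d, n * sizeof(unsigned int));
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        hash_range<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0;
        cudaEventElapsedTime(&ms, t0, t1);
        std::printf("%.1f Mhash/s\n", n / ms / 1000.0f);  // n hashes in ms milliseconds
        cudaFree(d);
        return 0;
    }
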
2

Dollinger, Jean-François. "A framework for efficient execution on GPU and CPU+GPU systems." Thesis, Strasbourg, 2015. http://www.theses.fr/2015STRAD019/document.

Abstract:
Technological limitations faced by semiconductor manufacturers in the early 2000s ended the rapid growth in the performance of sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first consists in jointly using the CPU and GPU to execute a code; to achieve high performance, it is necessary to consider load balance, in particular by predicting execution times. The runtime uses the profiling results, and a scheduler computes execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the target code are executed simultaneously on CPU and GPU, and the winner of the competition notifies the other instance of its completion, causing the latter to stop.
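
The competition technique can be sketched as launching the GPU instance asynchronously and letting the CPU instance poll for its completion, stopping early if the GPU wins. A simplified CUDA sketch, not the thesis's actual runtime: gpu_version and the CPU loop are placeholders, and a full implementation would also signal a cooperative abort to the GPU side when the CPU wins.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void gpu_version(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i * 0.5f;                      // placeholder workload
    }

    int main() {
        const int n = 1 << 22;
        float* d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        float* h_out = new float[n];
        cudaEvent_t done;
        cudaEventCreate(&done);

        gpu_version<<<(n + 255) / 256, 256>>>(d_out, n);   // GPU instance starts
        cudaEventRecord(done);

        bool cpu_won = true;
        for (int i = 0; i < n; ++i) {                      // CPU instance runs concurrently
            h_out[i] = i * 0.5f;
            if ((i & 0xFFFF) == 0 && cudaEventQuery(done) == cudaSuccess) {
                cpu_won = false;                           // GPU finished first: stop early
                break;
            }
        }
        std::printf("%s instance won the race\n", cpu_won ? "CPU" : "GPU");
        delete[] h_out;
        cudaFree(d_out);
        return 0;
    }
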
3

Gjermundsen, Aleksander. "CPU and GPU Co-processing for Sound." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2010. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-11794.

Abstract:
When using voice communications, one problematic phenomenon that can occur is participants hearing an echo of their own voice. Acoustic echo cancellation (AEC) is used to remove this echo, but it can be computationally demanding. The recent OpenCL standard allows high-level programs to be run on multi-core CPUs as well as Graphics Processing Units (GPUs) and custom accelerators. This opens up new possibilities for offloading computations, which is especially important for real-time applications. Although many algorithms for image and video processing have been studied on the GPU, audio processing algorithms have not been researched as thoroughly, perhaps because these algorithms are not viewed as being as computationally heavy, and thus as suitable for GPU offloading, as, for instance, dense linear algebra. This thesis studies the AEC filter from Speex, an open-source library for speech compression and audio preprocessing. We translate the original code into an optimized OpenCL program that can run on both CPUs and GPUs. Since the overhead of the OpenCL vendor implementations dominates running times, our results show that the existing reference implementation is faster for single-channel input/output, due to its simplicity and low computational intensity. However, by increasing the number of channels processed by the filter and the length of the echo tail, a speed-up of up to 5x on CPU+GPU over CPU-only was achieved. Although these cases may not be the most common, the techniques developed in this thesis are expected to become increasingly important as GPUs and CPUs become more integrated, especially on embedded devices; this makes latencies less of an issue and strengthens the value of our results. An outline of future work in this area is also included.
4

Carlos, Eduardo Telles. "Hybrid Frustum Culling Using CPU and GPU." Pontifícia Universidade Católica do Rio de Janeiro, 2009. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=31453@1.

Abstract:
The determination of visibility is a classical problem in computer graphics. Several algorithms have been developed to enable the visualization of ever larger and more detailed models. Among these algorithms, frustum culling plays an important role: it removes objects that are not visible to the observer. Besides being very common in applications, this algorithm has been improved over the years in order to further accelerate its execution. Although treated as a well-solved problem in computer graphics, some points can still be enhanced and new forms of culling developed. Massive models, in particular, require high-performance algorithms, since the amount of computation increases considerably. This work evaluates the frustum culling algorithm and its optimizations, aiming to obtain the best possible CPU implementation, and analyzes the influence of each of its steps on massive models. Based on this analysis, new GPU (Graphics Processing Unit) based frustum culling techniques are developed and compared with the CPU-only results. As a result, a hybrid form of frustum culling is proposed that seeks to combine the best of CPU and GPU processing.
5

Farooqui, Naila. "Runtime specialization for heterogeneous CPU-GPU platforms." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54915.

Abstract:
Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drives the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
6

Smith, Michael Shawn. "Performance Analysis of Hybrid CPU/GPU Environments." PDXScholar, 2010. https://pdxscholar.library.pdx.edu/open_access_etds/300.

Abstract:
We present two metrics to assist the performance analyst to gain a unified view of application performance in a hybrid environment: GPU Computation Percentage and GPU Load Balance. We analyze the metrics using a matrix multiplication benchmark suite and a real scientific application. We also extend an experiment management system to support GPU performance data and to calculate and store our GPU Computation Percentage and GPU Load Balance metrics.
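
The thesis defines these metrics precisely; a natural formulation of the first, assuming t_GPU and t_CPU denote the time an application spends computing on each device, would be:

    GPU Computation Percentage = t_GPU / (t_GPU + t_CPU) × 100

A value near 100% indicates the hybrid application is dominated by GPU work; values near 50% suggest the two devices share the load.
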
7

Wong, Henry Ting-Hei. "Architectures and limits of GPU-CPU heterogeneous systems." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/2529.

Abstract:
As we continue to be able to put an increasing number of transistors on a single chip, the perpetual question of what would be the best processor to build with those transistors remains open. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). It presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrates speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism, in the context of a heterogeneous system, that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and to the latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor, while optimal power efficiency requires a higher-performance parallel processor, than today's GPUs.
8

Gummadi, Deepthi. "Improving GPU performance by regrouping CPU-memory data." Thesis, Wichita State University, 2014. http://hdl.handle.net/10057/10959.

Abstract:
To enable fast, effective analysis of large complex systems, high-performance computing is essential. The NVIDIA Compute Unified Device Architecture (CUDA)-assisted central processing unit (CPU) / graphics processing unit (GPU) computing platform has proven its potential for high-performance computing. In CPU/GPU computing, original data and instructions are copied from CPU main memory to GPU global memory. Inside the GPU, it is beneficial to keep data in shared memory (shared only by the threads of one block) rather than in global memory (shared by all threads). However, shared memory is much smaller than global memory (for a Fermi Tesla C2075, total shared memory per block is 48 KB while total global memory is 6 GB). In this work, we introduce a CPU-memory to GPU-global-memory mapping technique to improve GPU and overall system performance by increasing the effectiveness of GPU shared memory. We use NVIDIA 448-core Fermi and 2496-core Kepler GPU cards in this study. Experimental results from solving Laplace's equation for 512x512 matrices using a Fermi GPU card show that the proposed CPU-to-GPU memory mapping technique helps decrease the overall execution time by more than 75%.
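
The payoff of such mapping comes from staging data through shared memory. A generic illustration of the idea for a Laplace/Jacobi sweep (a common textbook pattern, not the thesis's actual mapping technique): each block copies its tile plus a one-cell halo from global into shared memory, then updates interior points from the fast on-chip copy.

    #include <cuda_runtime.h>

    #define TILE 16

    __global__ void jacobi_shared(const float* in, float* out, int nx, int ny) {
        __shared__ float t[TILE + 2][TILE + 2];
        int x = blockIdx.x * TILE + threadIdx.x;   // global coordinates of this point
        int y = blockIdx.y * TILE + threadIdx.y;
        int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

        if (x < nx && y < ny) {
            t[ly][lx] = in[y * nx + x];
            // Edge threads also load the halo cells their block needs.
            if (threadIdx.x == 0 && x > 0)             t[ly][0]        = in[y * nx + x - 1];
            if (threadIdx.x == TILE - 1 && x < nx - 1) t[ly][TILE + 1] = in[y * nx + x + 1];
            if (threadIdx.y == 0 && y > 0)             t[0][lx]        = in[(y - 1) * nx + x];
            if (threadIdx.y == TILE - 1 && y < ny - 1) t[TILE + 1][lx] = in[(y + 1) * nx + x];
        }
        __syncthreads();

        // Interior points read their four neighbors from shared memory, not global.
        if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
            out[y * nx + x] = 0.25f * (t[ly][lx - 1] + t[ly][lx + 1] +
                                       t[ly - 1][lx] + t[ly + 1][lx]);
    }
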
Thesis (M.S.)--Wichita State University, College of Engineering, Dept. of Electrical Engineering and Computer Science
9

Chen, Wei. "Dynamic Workload Division in GPU-CPU Heterogeneous Systems." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1364250106.

10

Ben, Romdhanne Bilel. "Simulation des réseaux à grande échelle sur les architectures de calculs hétérogènes." Thesis, Paris, ENST, 2013. http://www.theses.fr/2013ENST0088/document.

Abstract:
Simulation is a primary step in the evaluation process of modern networked systems. The scalability and efficiency of such tools, given the increasing complexity of emerging networks, is key to obtaining valuable results. Discrete event simulation is recognized as the most scalable model and copes with both parallel and distributed architectures; nevertheless, recent hardware provides new heterogeneous computing resources that can be exploited in parallel. The main scope of this thesis is to provide new mechanisms and optimizations that enable efficient and scalable parallel simulation on heterogeneous computing nodes comprising multicore CPUs and GPUs. To address efficiency, we propose to describe events that differ only in their data as a single entry, reducing the event management cost. At run time, the proposed hybrid scheduler dispatches and injects events onto the most appropriate computing target, based on the event descriptor and the current load obtained through a feedback mechanism, so that the hardware usage rate is maximized. Results show a significant gain of 100 times compared to traditional CPU-based approaches. To increase the scalability of the system, we propose a new simulation model, denoted general-purpose coordinator-master-worker (GP-CMW), to jointly address the challenges of distributed and parallel simulation at different levels. The performance of a distributed simulation that relies on the GP-CMW architecture tends toward the maximal theoretical efficiency in a homogeneous deployment; the scalability of this simulation model is validated on the largest European GPU-based supercomputer.

Books on the topic "GPU-CPU"

1

Piccoli, María Fabiana. Computación de alto desempeño en GPU. Editorial de la Universidad Nacional de La Plata (EDULP), 2011. http://dx.doi.org/10.35537/10915/18404.

Abstract:
This book is the result of research on the characteristics of the GPU and its adoption as a massively parallel architecture for general-purpose applications. Its aim is to become a useful tool for guiding the first steps of those starting out in high-performance GPU computing, and it seeks to summarize the state of the art with respect to the proposed bibliography. The objective is not only to describe the many-core architecture of the GPU and the CUDA programming tool, but also to lead the reader toward developing programs with good performance. The book is structured as follows. Chapter 1 details the basic and general concepts of high-performance computing used throughout the text. Chapter 2 describes the characteristics of the GPU architecture and its historical evolution, in both cases drawing comparisons with the CPU, and finally details the evolution of the GPU into a co-processor for developing general-purpose applications. Chapter 3 presents the basic outlines of the programming model associated with CUDA, which provides an interface for CPU-GPU communication and thread administration, and describes the characteristics of the associated SIMT execution model. Chapter 4 analyzes the general and basic properties of the GPU memory hierarchy, describing the properties of each memory, how it is used, and its advantages and disadvantages. Chapter 5 analyzes the different aspects to consider when implementing applications with good performance: programming GPUs with CUDA is not a mere transcription of sequential code into parallel code, and various aspects must be taken into account to use the architecture efficiently and program it well. Finally, three appendices are included: the first describes the basic CUDA qualifiers, types, and functions; the second details some simple tools from the cutil.h library for controlling CUDA programming; and the last describes the CUDA compute capabilities of the various existing GPUs, listing the actual models that have them.

Book chapters on the topic "GPU-CPU"

1

Ou, Zhixin, Juan Chen, Yuyang Sun, Tao Xu, Guodong Jiang, Zhengyuan Tan, and Xinxin Qi. "AOA: Adaptive Overclocking Algorithm on CPU-GPU Heterogeneous Platforms." In Algorithms and Architectures for Parallel Processing, 253–72. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-22677-9_14.

Abstract:
Although GPUs have been used to accelerate various convolutional neural network algorithms with good performance, the demand for performance improvement is still continuously increasing. CPU/GPU overclocking technology brings opportunities for further performance improvement on CPU-GPU heterogeneous platforms. However, CPU/GPU overclocking inevitably increases the power of the CPU/GPU, which is not conducive to energy conservation, energy-efficiency optimization, or even system stability. How to effectively constrain the total energy to remain roughly unchanged during CPU/GPU overclocking is a key issue in designing adaptive overclocking algorithms. There are two key factors in solving this issue. First, the dynamic power upper bound must be set to reflect the real-time behavior characteristics of the program, so that the algorithm can better meet the constraint of unchanged total energy; second, instead of overclocking independently on the CPU and GPU sides, CPU-GPU coordinated overclocking must be considered to adapt to real-time load balance, for higher performance improvement and better energy constraints. This paper proposes an Adaptive Overclocking Algorithm (AOA) on CPU-GPU heterogeneous platforms that achieves performance improvement while the total energy remains roughly unchanged. AOA uses a function F_k to describe the variable power upper bound and introduces a load imbalance factor W to realize CPU-GPU coordinated overclocking. Verified on several types of convolutional neural network algorithms on two CPU-GPU heterogeneous platforms (Intel Xeon E5-2660 & NVIDIA Tesla K80; Intel Core i9-10920X & NVIDIA GeForce RTX 2080Ti), AOA achieves an average of 10.7% performance improvement and 4.4% energy savings. To verify the effectiveness of AOA, we compare it with other methods including automatic boost, the highest overclocking, and static optimal overclocking.
2

Stuart, Jeff A., Michael Cox, and John D. Owens. "GPU-to-CPU Callbacks." In Euro-Par 2010 Parallel Processing Workshops, 365–72. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. http://dx.doi.org/10.1007/978-3-642-21878-1_45.

3

Wille, Mario, Tobias Weinzierl, Gonzalo Brito Gadeschi, and Michael Bader. "Efficient GPU Offloading with OpenMP for a Hyperbolic Finite Volume Solver on Dynamically Adaptive Meshes." In Lecture Notes in Computer Science, 65–85. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-32041-5_4.

Abstract:
We identify and show how to overcome an OpenMP bottleneck in the administration of GPU memory. It arises for a wave equation solver on dynamically adaptive block-structured Cartesian meshes, which keeps all CPU threads busy and allows all of them to offload sets of patches to the GPU. Our studies show that multithreaded, concurrent, non-deterministic access to the GPU leads to performance breakdowns, since the GPU memory bookkeeping as offered through OpenMP's clause, i.e., the allocation and freeing, becomes another runtime challenge besides expensive data transfer and actual computation. We therefore propose to retain the memory management responsibility on the host: a caching mechanism acquires memory on the accelerator for all CPU threads, keeps hold of this memory, and hands it out to the offloading threads upon demand. We show that this user-managed, CPU-based memory administration helps us to overcome the GPU memory bookkeeping bottleneck and speeds up the time-to-solution of Finite Volume kernels by more than an order of magnitude.
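
The host-side caching mechanism can be pictured as a thread-safe pool of device buffers that are allocated once and recycled instead of freed. The paper does this for OpenMP offloading; the sketch below uses the CUDA runtime as a stand-in and assumes one fixed patch size:

    #include <cstddef>
    #include <mutex>
    #include <vector>
    #include <cuda_runtime.h>

    class DeviceBufferCache {
        std::mutex m_;
        std::vector<void*> free_;   // recycled buffers, all of one fixed patch size
        size_t bytes_;
    public:
        explicit DeviceBufferCache(size_t bytes) : bytes_(bytes) {}

        void* acquire() {           // called by any CPU thread before it offloads
            std::lock_guard<std::mutex> g(m_);
            if (!free_.empty()) {
                void* p = free_.back();
                free_.pop_back();
                return p;           // cache hit: no device-side allocation
            }
            void* p = nullptr;
            cudaMalloc(&p, bytes_); // only a cache miss pays for allocation
            return p;
        }

        void release(void* p) {     // return the buffer to the pool instead of freeing
            std::lock_guard<std::mutex> g(m_);
            free_.push_back(p);
        }
    };
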
4

Reinders, James, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, and Xinmin Tian. "Programming for GPUs." In Data Parallel C++, 353–85. Berkeley, CA: Apress, 2020. http://dx.doi.org/10.1007/978-1-4842-5574-2_15.

Abstract:
Over the last few decades, Graphics Processing Units (GPUs) have evolved from specialized hardware devices capable of drawing images on a screen to general-purpose devices capable of executing complex parallel kernels. Nowadays, nearly every computer includes a GPU alongside a traditional CPU, and many programs may be accelerated by offloading part of a parallel algorithm from the CPU to the GPU.
5

Shi, Lin, Hao Chen, and Ting Li. "Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems." In Communications in Computer and Information Science, 470–81. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014. http://dx.doi.org/10.1007/978-3-642-53962-6_42.

6

Li, Jie, George Michelogiannakis, Brandon Cook, Dulanya Cooray, and Yong Chen. "Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter." In Lecture Notes in Computer Science, 297–316. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-32041-5_16.

Abstract:
Resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources primarily on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can lead to HPC resources being not fully utilized. In this study, we analyze the resource usage and application behavior of NERSC's Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are commonly not fully utilized, especially for GPU-enabled jobs. Also, around 64% of both CPU and GPU-enabled jobs used 50% or less of the available host memory capacity. Additionally, about 50% of GPU-enabled jobs used up to 25% of the GPU memory, and the memory capacity was not fully utilized in some ways for all jobs. While our study comes early in Perlmutter's lifetime, and thus policies and application workloads may change, it provides valuable insights on performance characterization and application behavior, and motivates systems with more fine-grained resource allocation.
7

Li, Jianqing, Hongli Li, Jing Li, Jianmin Chen, Kai Liu, Zheng Chen, and Li Liu. "Distributed Heterogeneous Parallel Computing Framework Based on Component Flow." In Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications, 437–45. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-2456-9_45.

Abstract:
A single processor has limited computing performance, slow running speed, and low efficiency, and is far from able to complete complex computing tasks, while distributed computing can solve such huge computational problems well. This paper therefore carries out a series of studies on heterogeneous computing clusters based on CPU+GPU, covering a component flow model, an efficient task scheduling strategy for multi-core multiprocessors, and a real-time heterogeneous computing framework, and realizes a distributed heterogeneous parallel computing framework based on component flow. The results show that the CPU+GPU heterogeneous parallel computing framework based on component flow can make full use of computing resources, realizes task parallelism and load balancing automatically through multiple instances of components, and has good portability and reusability.
8

Krol, Dawid, Jason Harris, and Dawid Zydek. "Hybrid GPU/CPU Approach to Multiphysics Simulation." In Progress in Systems Engineering, 893–99. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-08422-0_130.

9

Sao, Piyush, Richard Vuduc, and Xiaoye Sherry Li. "A Distributed CPU-GPU Sparse Direct Solver." In Lecture Notes in Computer Science, 487–98. Cham: Springer International Publishing, 2014. http://dx.doi.org/10.1007/978-3-319-09873-9_41.

10

Chen, Lin, Deshi Ye, and Guochuan Zhang. "Online Scheduling on a CPU-GPU Cluster." In Lecture Notes in Computer Science, 1–9. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-38236-9_1.


Conference papers on the topic "GPU-CPU"

1

Elis, Bengisu, Olga Pearce, David Boehme, Jason Burmark, and Martin Schulz. "Non-Blocking GPU-CPU Notifications to Enable More GPU-CPU Parallelism." In HPCAsia 2024: International Conference on High Performance Computing in Asia-Pacific Region. New York, NY, USA: ACM, 2024. http://dx.doi.org/10.1145/3635035.3635036.

2

Yang, Yi, Ping Xiang, Mike Mantor, and Huiyang Zhou. "CPU-assisted GPGPU on fused CPU-GPU architectures." In 2012 IEEE 18th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2012. http://dx.doi.org/10.1109/hpca.2012.6168948.

3

Rai, Siddharth, and Mainak Chaudhuri. "Improving CPU Performance Through Dynamic GPU Access Throttling in CPU-GPU Heterogeneous Processors." In 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW). IEEE, 2017. http://dx.doi.org/10.1109/ipdpsw.2017.37.

4

Chadwick, Jools, Francois Taiani, and Jonathan Beecham. "From CPU to GP-GPU." In the 10th International Workshop. New York, New York, USA: ACM Press, 2012. http://dx.doi.org/10.1145/2405136.2405142.

5

Wang, Xin, and Wei Zhang. "A Sample-Based Dynamic CPU and GPU LLC Bypassing Method for Heterogeneous CPU-GPU Architectures." In 2017 IEEE Trustcom/BigDataSE/ICESS. IEEE, 2017. http://dx.doi.org/10.1109/trustcom/bigdatase/icess.2017.309.

6

Raju, K., Niranjan N. Chiplunkar, and Kavoor Rajanikanth. "A CPU-GPU Cooperative Sorting Approach." In 2019 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, 2019. http://dx.doi.org/10.1109/i-pact44901.2019.8960106.

7

Xu, Yan, Gary Tan, Xiaosong Li, and Xiao Song. "Mesoscopic traffic simulation on CPU/GPU." In the 2nd ACM SIGSIM/PADS conference. New York, New York, USA: ACM Press, 2014. http://dx.doi.org/10.1145/2601381.2601396.

8

Kerr, Andrew, Gregory Diamos, and Sudhakar Yalamanchili. "Modeling GPU-CPU workloads and systems." In the 3rd Workshop. New York, New York, USA: ACM Press, 2010. http://dx.doi.org/10.1145/1735688.1735696.

9

Kang, SeungGu, Hong Jun Choi, Cheol Hong Kim, Sung Woo Chung, DongSeop Kwon, and Joong Chae Na. "Exploration of CPU/GPU co-execution." In the 2011 ACM Symposium. New York, New York, USA: ACM Press, 2011. http://dx.doi.org/10.1145/2103380.2103388.

10

Aciu, Razvan-Mihai, and Horia Ciocarlie. "Algorithm for Cooperative CPU-GPU Computing." In 2013 15th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). IEEE, 2013. http://dx.doi.org/10.1109/synasc.2013.53.


Reports on the topic "GPU-CPU"

1

Samfass, Philipp. Porting AMG2013 to Heterogeneous CPU+GPU Nodes. Office of Scientific and Technical Information (OSTI), January 2017. http://dx.doi.org/10.2172/1343001.

2

Smith, Michael. Performance Analysis of Hybrid CPU/GPU Environments. Portland State University Library, January 2000. http://dx.doi.org/10.15760/etd.300.

3

Rudin, Sven. VASP calculations on Chicoma: CPU vs. GPU. Office of Scientific and Technical Information (OSTI), March 2023. http://dx.doi.org/10.2172/1962769.

4

Owens, John. A Programming Framework for Scientific Applications on CPU-GPU Systems. Office of Scientific and Technical Information (OSTI), March 2013. http://dx.doi.org/10.2172/1069280.

5

Pietarila Graham, Anna, Daniel Holladay, Jonah Miller, and Jeffrey Peterson. Spiner-EOSPAC Comparison: performance and accuracy on Power9 CPU and GPU. Office of Scientific and Technical Information (OSTI), March 2022. http://dx.doi.org/10.2172/1859858.

6

Kurzak, Jakub, Piotr Luszczek, Mathieu Faverge, and Jack Dongarra. LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System. Office of Scientific and Technical Information (OSTI), March 2012. http://dx.doi.org/10.2172/1173291.

7

Snider, Dale M. DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program. Office of Scientific and Technical Information (OSTI), February 2011. http://dx.doi.org/10.2172/1009440.

8

Ananthan, Shreyas, Alan Williams, James Overfelt, Johnathan Vo, Philip Sakievich, Timothy Smith, Jonathan Hu, et al. Demonstration and performance testing of extreme-resolution simulations with static meshes on Summit (CPU & GPU) for a parked-turbine configuration and an actuator-line (mid-fidelity model) wind farm configuration. Office of Scientific and Technical Information (OSTI), October 2020. http://dx.doi.org/10.2172/1706223.
