Academic literature on the topic 'CPU-GPU Partitioning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'CPU-GPU Partitioning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of each publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "CPU-GPU Partitioning"

1

Benatia, Akrem, Weixing Ji, Yizhuo Wang, and Feng Shi. "Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms." International Journal of High Performance Computing Applications 34, no. 1 (November 14, 2019): 66–80. http://dx.doi.org/10.1177/1094342019886628.

Abstract:
The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most of the existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned across the available processing units. The partitioning problem is made more challenging by the existence of many sparse formats whose performance depends on both the sparsity of the input matrix and the hardware used. Thus, the best performance depends not only on how the input sparse matrix is partitioned but also on which sparse format is used for each partition. To address this challenge, we propose in this article a new CPU-GPU heterogeneous method for computing the SpMV kernel that combines different sparse formats to achieve better performance and better utilization of CPU-GPU heterogeneous platforms. The proposed solution horizontally partitions the input matrix into multiple block-rows and predicts their best sparse formats using machine learning-based performance models. A mapping algorithm is then used to assign the block-rows to the CPU and GPU(s) available in the system. Our experimental results using real-world large unstructured sparse matrices on two different machines show a noticeable performance improvement.
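The three-stage pipeline this abstract describes (horizontal block-row partitioning, per-block format prediction, CPU/GPU mapping) can be summarized in a short sketch. The sketch below is illustrative only: `predict_best_format` is a stub standing in for the paper's machine-learning performance models, and the greedy nnz-budget mapping and `gpu_share` parameter are assumptions, not the paper's actual algorithm.

```python
def partition_spmv(row_nnz, n_blocks, gpu_share):
    """Split a sparse matrix into block-rows and map them to CPU/GPU.

    row_nnz: nonzero count of each row, in row order.
    gpu_share: fraction of nonzeros to place on the GPU (assumed to
    come from offline calibration; the paper derives it differently).
    """
    total = sum(row_nnz)
    target = total / n_blocks

    # 1. Horizontal partitioning into block-rows of roughly equal nnz.
    blocks, start, acc = [], 0, 0
    for i, nnz in enumerate(row_nnz):
        acc += nnz
        if acc >= target and len(blocks) < n_blocks - 1:
            blocks.append((start, i + 1, acc))  # (first row, last row + 1, nnz)
            start, acc = i + 1, 0
    blocks.append((start, len(row_nnz), acc))

    # 2. Per-block format prediction (stub for the ML performance model,
    #    which would pick CSR/ELL/COO/... from sparsity statistics).
    def predict_best_format(block):
        return "CSR"

    # 3. Greedy mapping: fill the GPU's nnz budget first, rest to the CPU.
    cpu, gpu, gpu_nnz = [], [], 0
    for blk in blocks:
        fmt = predict_best_format(blk)
        if gpu_nnz + blk[2] <= gpu_share * total:
            gpu.append((blk, fmt))
            gpu_nnz += blk[2]
        else:
            cpu.append((blk, fmt))
    return cpu, gpu

cpu_blocks, gpu_blocks = partition_spmv([5, 1, 8, 2, 9, 3, 7, 4],
                                        n_blocks=4, gpu_share=0.6)
print(cpu_blocks, gpu_blocks)
```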
2

Narayana, Divyaprabha Kabbal, and Sudarshan Tekal Subramanyam Babu. "Optimal task partitioning to minimize failure in heterogeneous computational platform." International Journal of Electrical and Computer Engineering (IJECE) 15, no. 1 (February 1, 2025): 1079. http://dx.doi.org/10.11591/ijece.v15i1.pp1079-1088.

Abstract:
The increased energy consumption of heterogeneous cloud platforms raises carbon emissions and reduces system reliability, making workload scheduling an extremely challenging process. The dynamic voltage-frequency scaling (DVFS) technique provides an efficient mechanism for improving the energy efficiency of a cloud platform; however, employing DVFS reduces reliability and increases the failure rate of resource scheduling. Most current workload scheduling methods have failed to optimize energy and reliability together on a central processing unit-graphics processing unit (CPU-GPU) heterogeneous computing platform. As a result, reducing energy consumption and task failure are the prime issues this work aims to address. This work introduces task failure minimization (TFM) through optimal task partitioning (OTP) for workload scheduling on the CPU-GPU cloud computational platform. TFM-OTP introduces a task partitioning model for the CPU-GPU pair; then, it provides a DVFS-based energy consumption model. Finally, the energy-load optimization problem is defined, and the optimal resource allocation design is presented. Experiments are conducted on two standard workloads, namely the SIPHT and CyberShake workloads. The results show that the proposed TFM-OTP model reduces energy consumption by 30.35%, reduces makespan by 70.78%, and reduces task-failure energy overhead by 83.7% in comparison with the energy-minimized scheduling (EMS) approach.
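As a rough illustration of the kind of DVFS-based energy model the abstract refers to, the sketch below uses the textbook dynamic-power relation P = C_eff * V^2 * f; the constants and the example values are invented placeholders, not the paper's model.

```python
def task_energy(cycles, freq_hz, voltage, c_eff=1e-9, p_static=0.5):
    """Energy of one task under DVFS, using the textbook relation
    P_dynamic = C_eff * V^2 * f and exec_time = cycles / f.
    c_eff and p_static are illustrative constants, not measured values.
    """
    exec_time = cycles / freq_hz
    p_dynamic = c_eff * voltage ** 2 * freq_hz
    return (p_dynamic + p_static) * exec_time

# Scaling f and V down saves energy but stretches execution time,
# which raises the exposure to failures: the tension TFM-OTP balances.
print(task_energy(2e9, 2.0e9, 1.1))  # high frequency: fast but power-hungry
print(task_energy(2e9, 1.0e9, 0.8))  # low frequency: slower, cheaper
```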
3

Yang, Huijing, and Tingwen Yu. "Two novel cache management mechanisms on CPU-GPU heterogeneous processors." Research Briefs on Information and Communication Technology Evolution 7 (June 15, 2021): 1–8. http://dx.doi.org/10.56801/rebicte.v7i.113.

Abstract:
Heterogeneous multicore processors that take full advantage of CPUs and GPUs within the same chip raise an emerging challenge of sharing a series of on-chip resources, particularly Last-Level Cache (LLC) resources. Since GPU cores have good parallelism and memory latency tolerance, the majority of the LLC space is utilized by GPU applications. Under current cache management policies, the LLC share of CPU applications can be remarkably decreased due to the existence of GPU workloads, seriously affecting overall performance. To alleviate the unfair contention between CPUs and GPUs for cache capacity, we propose two novel cache supervision mechanisms: a static cache partitioning scheme based on an adaptive replacement policy (SARP) and a dynamic cache partitioning scheme based on GPU miss awareness (DGMA). The SARP scheme first uses cache partitioning to split the cache ways between CPUs and GPUs and then uses an adaptive cache replacement policy depending on the type of the requested message. The DGMA scheme monitors the GPU's cache performance metrics at run time and sets an appropriate threshold to dynamically change the cache ratio of the shared LLC between various kernels. Experimental results show that the SARP mechanism can further increase CPU performance, by up to 32.6% and by 8.4% on average. The DGMA scheme improves CPU performance under the premise of ensuring that GPU performance is not affected, achieving a maximum increase of 18.1% and an average increase of 7.7%.
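A minimal sketch of a DGMA-style feedback loop, assuming a way-partitioned LLC adjusted once per monitoring interval; the threshold value, step size, and adjustment direction are all invented for illustration and are not the paper's tuned policy.

```python
def adjust_llc_ways(gpu_miss_rate, cpu_ways, gpu_ways, threshold=0.4):
    """One monitoring-interval step of threshold-driven way partitioning.

    If the GPU is missing badly in its current share, grant it one more
    way; otherwise return a way to the CPU. Bounds keep at least one
    way per side. All parameters here are illustrative choices.
    """
    if gpu_miss_rate > threshold and cpu_ways > 1:
        cpu_ways, gpu_ways = cpu_ways - 1, gpu_ways + 1
    elif gpu_miss_rate < threshold and gpu_ways > 1:
        cpu_ways, gpu_ways = cpu_ways + 1, gpu_ways - 1
    return cpu_ways, gpu_ways

cpu_ways, gpu_ways = 8, 8                   # 16-way shared LLC
for miss_rate in [0.62, 0.55, 0.31, 0.18]:  # sampled each interval
    cpu_ways, gpu_ways = adjust_llc_ways(miss_rate, cpu_ways, gpu_ways)
print(cpu_ways, gpu_ways)
```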
4

Fang, Juan, Mengxuan Wang, and Zelin Wei. "A memory scheduling strategy for eliminating memory access interference in heterogeneous system." Journal of Supercomputing 76, no. 4 (January 10, 2020): 3129–54. http://dx.doi.org/10.1007/s11227-019-03135-7.

Abstract:
Multiple CPUs and GPUs are integrated on the same chip to share memory, and access requests from different cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance. Requests from multiple CPUs are intertwined when accessing memory, and performance is greatly affected. The difference in access latency between GPU cores increases the average latency of memory accesses. In order to solve the problems encountered in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy, which improves system performance. The step-by-step memory scheduling strategy first creates a new memory request queue based on the request source and isolates CPU requests from GPU requests when the memory controller receives a memory request, thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is implemented, which dynamically maps applications to different bank sets according to their memory access characteristics and eliminates memory request interference among multiple CPU applications without affecting bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between cores. Based on the first-ready first-come-first-served strategy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.
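The three-step structure of the scheduler (source-based queue isolation, bank partitioning for CPU requests, criticality-aware ordering for GPU requests) can be sketched as below. The class name, the static bank stripe, and the criticality score are simplified stand-ins, not the paper's implementation.

```python
from collections import deque

class StepByStepScheduler:
    """Illustrative sketch of the three-step scheduling structure.

    Step 1: separate per-source queues isolate CPU from GPU traffic.
    Step 2: CPU applications are mapped to disjoint bank sets.
    Step 3: GPU requests are served in criticality order.
    """

    def __init__(self, n_banks=8, n_cpu_apps=4):
        self.cpu_queue = deque()
        self.gpu_queue = []
        self.banks_per_app = n_banks // n_cpu_apps

    def enqueue(self, source, app_id, addr, criticality=0.0):
        if source == "cpu":
            # Simplified bank partitioning: each app gets its own stripe
            # of banks, so apps cannot thrash each other's row buffers.
            bank = app_id * self.banks_per_app + addr % self.banks_per_app
            self.cpu_queue.append((app_id, addr, bank))
        else:
            self.gpu_queue.append((criticality, addr))

    def next_cpu(self):
        return self.cpu_queue.popleft() if self.cpu_queue else None

    def next_gpu(self):
        # Criticality-aware: serve the core with the worst latency first.
        if not self.gpu_queue:
            return None
        self.gpu_queue.sort(reverse=True)
        return self.gpu_queue.pop(0)

sched = StepByStepScheduler()
sched.enqueue("cpu", app_id=1, addr=0x2A)
sched.enqueue("gpu", app_id=0, addr=0x100, criticality=0.9)
print(sched.next_cpu(), sched.next_gpu())
```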
5

Merrill, Duane, and Andrew Grimshaw. "High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing." Parallel Processing Letters 21, no. 02 (June 2011): 245–72. http://dx.doi.org/10.1142/s0129626411000187.

Abstract:
The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
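For readers unfamiliar with how a radix-sort pass reduces to the prefix-scan "runtime" the abstract describes, here is a minimal CPU-side sketch of one histogram/scan/scatter pass over d-bit digits. It is deliberately sequential Python, purely to show the structure that the paper parallelizes (one scan per partitioning bin); it is not the authors' GPU code.

```python
def radix_pass(keys, shift, bits=4):
    """One radix-sort pass: histogram, exclusive prefix scan, scatter."""
    n_bins = 1 << bits
    mask = n_bins - 1
    # 1. Histogram the current digit of each key (one bin per value).
    counts = [0] * n_bins
    for k in keys:
        counts[(k >> shift) & mask] += 1
    # 2. Exclusive prefix scan turns counts into scatter offsets
    #    (this is the step the paper's parallel multi-scan performs).
    offsets, running = [0] * n_bins, 0
    for b in range(n_bins):
        offsets[b], running = running, running + counts[b]
    # 3. Stable scatter into the output array.
    out = [0] * len(keys)
    for k in keys:
        d = (k >> shift) & mask
        out[offsets[d]] = k
        offsets[d] += 1
    return out

def radix_sort(keys, key_bits=32, bits=4):
    for shift in range(0, key_bits, bits):
        keys = radix_pass(keys, shift, bits)
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```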
6

Vilches, Antonio, Rafael Asenjo, Angeles Navarro, Francisco Corbera, Rubén Gran, and María Garzarán. "Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips." Procedia Computer Science 51 (2015): 140–49. http://dx.doi.org/10.1016/j.procs.2015.05.213.

7

Sung, Hanul, Hyeonsang Eom, and HeonYoung Yeom. "The Need of Cache Partitioning on Shared Cache of Integrated Graphics Processor between CPU and GPU." KIISE Transactions on Computing Practices 20, no. 9 (September 15, 2014): 507–12. http://dx.doi.org/10.5626/ktcp.2014.20.9.507.

8

Wang, Shunjiang, Baoming Pu, Ming Li, Weichun Ge, Qianwei Liu, and Yujie Pei. "State Estimation Based on Ensemble DA–DSVM in Power System." International Journal of Software Engineering and Knowledge Engineering 29, no. 05 (May 2019): 653–69. http://dx.doi.org/10.1142/s0218194019400023.

Abstract:
This paper investigates the state estimation problem of power systems. A novel, fast, and accurate state estimation algorithm based on the one-dimensional denoising autoencoder and deep support vector machine (1D DA–DSVM) is presented to solve this problem. In addition, to further reduce the computation burden, a partitioning method is presented to divide the power system into several sub-networks, and the proposed algorithm can be applied to each sub-network. A hybrid computing architecture of Central Processing Unit (CPU) and Graphics Processing Unit (GPU) is employed in the overall state estimation, in which the GPU is used to estimate each sub-network and the CPU is used to integrate all the calculation results and output the state estimate. Simulation results show that the proposed method can effectively improve the accuracy and computational efficiency of the state estimation of power systems.
9

Barreiros, Willian, Alba C. M. A. Melo, Jun Kong, Renato Ferreira, Tahsin M. Kurc, Joel H. Saltz, and George Teodoro. "Efficient microscopy image analysis on CPU-GPU systems with cost-aware irregular data partitioning." Journal of Parallel and Distributed Computing 164 (June 2022): 40–54. http://dx.doi.org/10.1016/j.jpdc.2022.02.004.

10

Singh, Amit Kumar, Alok Prakash, Karunakar Reddy Basireddy, Geoff V. Merrett, and Bashir M. Al-Hashimi. "Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs." ACM Transactions on Embedded Computing Systems 16, no. 5s (October 10, 2017): 1–22. http://dx.doi.org/10.1145/3126548.


Dissertations / Theses on the topic "CPU-GPU Partitioning"

1

Öhberg, Tomas. "Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU." Thesis, Linköpings universitet, Programvara och system, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-149605.

Abstract:
The trend in computer architectures has for several years been heterogeneous systems consisting of a regular CPU and at least one additional, specialized processing unit, such as a GPU. The different characteristics of the processing units and the requirement of multiple tools and programming languages make programming such systems a challenging task. Although there exist tools for programming each processing unit, utilizing the full potential of a heterogeneous computer still requires specialized implementations involving multiple frameworks and hand-tuning of parameters. To fully exploit the performance of heterogeneous systems for a single computation, hybrid execution is needed, i.e. execution where the workload is distributed between multiple heterogeneous processing units working simultaneously on the computation. This thesis presents the implementation of a new hybrid execution backend in the algorithmic skeleton framework SkePU. The skeleton framework already gives programmers a user-friendly interface to algorithmic templates, executable on different hardware using OpenMP, CUDA and OpenCL. With this extension it is now also possible to divide the computational work of the skeletons between multiple processing units, such as between a CPU and a GPU. The results show an improvement in execution time with the hybrid execution implementation for all skeletons in SkePU. It is also shown that the new implementation results in a lower and more predictable execution time compared to a dynamic scheduling approach based on an earlier implementation of hybrid execution in SkePU.
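The core idea of a hybrid backend, statically splitting one skeleton invocation across CPU and GPU according to a tuned ratio, can be illustrated with the sketch below. The worker stubs, the `cpu_ratio` parameter, and the naive timing-based tuner are assumptions in the spirit of the thesis, not SkePU's actual API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_cpu(chunk):
    return [x * x for x in chunk]   # stand-in for the CPU backend

def run_gpu(chunk):
    return [x * x for x in chunk]   # stand-in for the GPU backend

def hybrid_map(data, cpu_ratio):
    """Split one 'map' skeleton call between two workers by ratio."""
    split = int(len(data) * cpu_ratio)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_future = pool.submit(run_cpu, data[:split])
        gpu_future = pool.submit(run_gpu, data[split:])
        return cpu_future.result() + gpu_future.result()

def tune_ratio(data, candidates=(0.2, 0.4, 0.6, 0.8)):
    """Naive auto-tuning: time each candidate split and keep the best."""
    best, best_t = candidates[0], float("inf")
    for ratio in candidates:
        t0 = time.perf_counter()
        hybrid_map(data, ratio)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = ratio, elapsed
    return best

data = list(range(100_000))
print(tune_ratio(data))
```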
2

Thomas, Béatrice. "Adéquation Algorithme Architecture pour la gestion des réseaux électriques." Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASG104.

Abstract:
The growth of distributed energy resources raises the challenge of scaling up network management algorithms. This difficulty may be overcome in operating conditions with the help of a rich literature that frequently calls upon the distribution of computations. However, the issue persists during the preliminary simulations that validate performance, operational safety, and infrastructure sizing. A hardware-software co-design approach is conducted here for a peer-to-peer market to address this scaling issue while computing simulations on a single machine. With the increasing number of distributed agents, the impact on the grid can no longer be neglected; thus, this work focuses on an endogenous market. The mapping between several algorithms and different partitioning models on central and graphics processing units (CPU-GPU) has been conducted. The complexity and performance of these algorithms have been analyzed on CPU and GPU. The implementations have shown that the GPU is more numerically unstable than the CPU. Nevertheless, when precision is not critical, the GPU gives a substantial speedup. Markets without grid constraints run 98% faster on the GPU. Even with grid constraints, the GPU is 1000 times faster under the DC hypothesis and ten times faster for an AC endogenous market on a radial grid. This dimension-dependent acceleration increases with the grid size and the number of agents.
3

Li, Cheng-Hsuan [李承軒]. "Weighted LLC Latency-Based Run-Time Cache Partitioning for Heterogeneous CPU-GPU Architecture." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/33311478280299879988.

Abstract:
Master's thesis, National Taiwan University, Graduate Institute of Computer Science and Information Engineering, 2014 (ROC academic year 102).
Integrating the CPU and GPU on the same chip has become the development trend for microprocessor design. In integrated CPU-GPU architectures, utilizing the shared last-level cache (LLC) is a critical design issue due to the pressure on shared resources and the different characteristics of CPU and GPU applications. Because of the latency-hiding capability provided by the GPU and the huge discrepancy in concurrent executing threads between the CPU and GPU, LLC partitioning can no longer be achieved by simply minimizing the overall cache misses as in homogeneous CPUs. The state-of-the-art cache partitioning mechanism distinguishes cache-insensitive GPU applications from cache-sensitive ones and optimizes only the cache misses of CPU applications when the GPU is cache-insensitive. However, optimizing only the cache hit rate of CPU applications generates more cache misses from the GPU and leads to longer queuing delays in the underlying DRAM system. In terms of memory access latency, the loss due to longer queuing delay may outweigh the benefit from a higher cache hit ratio. Therefore, we find that even though the performance of a GPU application may not be sensitive to cache resources, CPU applications' cache hit rate is not the only factor that should be considered in partitioning the LLC. Cache miss penalty, i.e., off-chip latency, is also an important factor in designing an LLC partitioning mechanism for integrated CPU-GPU architectures. In this paper, we propose weighted LLC latency-based run-time cache partitioning for integrated CPU-GPU architectures. In order to correlate cache partitioning with overall performance more accurately, we develop a mechanism to predict off-chip latency based on the total number of cache misses, and a GPU cache-sensitivity monitor, which quantitatively profiles the GPU's performance sensitivity to memory access latency. The experimental results show that the proposed mechanism improves overall throughput by 9.7% over TLP-aware cache partitioning (TAP), 6.2% over Utility-based Cache Partitioning (UCP), and 10.9% over LRU on 30 heterogeneous workloads.
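The thesis's key argument, that the partition should minimize a latency-weighted cost rather than maximize CPU hit rate alone, can be written down directly. The sketch below scores each candidate way split with an invented cost model: hit/miss latencies, a queuing term that grows with combined miss traffic, and a GPU sensitivity weight. All constants and curves are illustrative stand-ins for the quantities the thesis profiles at run time.

```python
def weighted_latency(cpu_ways, cpu_curve, gpu_curve, total_ways=16,
                     hit_lat=20, miss_lat=120,
                     queue_per_miss=0.004, gpu_weight=0.3):
    """Illustrative cost of one LLC way split (lower is better).

    cpu_curve / gpu_curve: hit rate as a function of allocated ways.
    Queuing delay grows with combined miss traffic, the effect that
    hit-rate-only partitioning ignores.
    """
    gpu_ways = total_ways - cpu_ways
    cpu_hr, gpu_hr = cpu_curve[cpu_ways], gpu_curve[gpu_ways]
    misses = (1 - cpu_hr) + (1 - gpu_hr)        # normalized miss traffic
    queue = queue_per_miss * misses * miss_lat  # contention penalty
    cpu_cost = cpu_hr * hit_lat + (1 - cpu_hr) * (miss_lat + queue)
    gpu_cost = gpu_hr * hit_lat + (1 - gpu_hr) * (miss_lat + queue)
    return cpu_cost + gpu_weight * gpu_cost     # GPU is latency-tolerant

# Toy concave hit-rate curves indexed by way count 0..16:
cpu_curve = [min(0.9, 0.12 * w) for w in range(17)]
gpu_curve = [min(0.5, 0.05 * w) for w in range(17)]
best = min(range(1, 16),
           key=lambda w: weighted_latency(w, cpu_curve, gpu_curve))
print("best CPU ways:", best)
```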
4

Mishra, Ashirbad. "Efficient betweenness Centrality Computations on Hybrid CPU-GPU Systems." Thesis, 2016. http://etd.iisc.ac.in/handle/2005/2718.

Abstract:
Analysis of networks is quite interesting, because they can be interpreted for several purposes. Various features require different metrics to measure and interpret them. Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Betweenness Centrality (BC) is one such metric that plays a key role in many real-world applications. BC is an important graph analytics application for large-scale graphs. However, it is one of the most computationally intensive kernels to execute, and measuring centrality in billion-scale graphs is quite challenging. While there are several existing efforts towards parallelizing BC algorithms on multi-core CPUs and many-core GPUs, in this work we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into two partitions, one each for the CPU and GPU. Our method performs BC computations for the graph on both the CPU and GPU resources simultaneously, resulting in a very small number of CPU-GPU synchronizations and hence less time spent on communication. The BC algorithm consists of two phases, the forward phase and the backward phase. In the forward phase, we initially find the paths that are needed by either partition, after which each partition is executed on each processor in an asynchronous manner. We initially compute border matrices for each partition, which store the relative distances between each pair of border vertices in a partition. The matrices are used in the forward-phase calculations for all the sources. In this way, our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We present a proof of correctness and bounds on the number of iterations for each source. We also perform a novel hybrid and asynchronous backward phase, in which each partition communicates with the other only when there is a path that crosses the partition, hence performing minimal CPU-GPU synchronizations. We use a variety of implementations for our work, like node-based and edge-based parallelism, which includes data-driven and topology-based techniques. In the implementation we show that our method also works using a variable partitioning technique, which partitions the graph into unequal parts accounting for the processing power of each processor. Our implementations achieve an almost equal percentage of utilization on both processors due to this technique. For large-scale graphs, the size of the border matrix also becomes large; hence, to accommodate the matrix we present various techniques. The techniques use the properties inherent in the shortest-path problem for reduction. We mention the drawbacks of performing shortest-path computations at a large scale and also provide various solutions. Evaluations using a large number of graphs with different characteristics show that our hybrid approach without variable partitioning and border matrix reduction gives a 67% improvement in performance, and 64-98.5% less CPU-GPU communication, than the state-of-the-art hybrid algorithm based on the popular Bulk Synchronous Parallel (BSP) approach implemented in TOTEM. This shows our algorithm's strength, which reduces the need for larger synchronizations. Implementing variable partitioning, border matrix reduction, and backward-phase optimizations on our hybrid algorithm provides up to 10x speedup.
We compare our optimized implementation with CPU and GPU standalone codes based on our forward-phase and backward-phase kernels, and show around 2-8x speedup over the CPU-only code; our implementation can also accommodate large graphs that cannot be accommodated in the GPU-only code. We also show that our method's performance is competitive with state-of-the-art multi-core CPU implementations and performs 40-52% better than GPU implementations on large graphs. We show the drawbacks of CPU-only and GPU-only implementations and try to motivate the reader about the challenges that graph algorithms face in large-scale computing, suggesting that a hybrid or distributed approach is a better way of overcoming the hurdles.
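As a small illustration of the first step the abstract describes, splitting the vertex set into a CPU partition and a GPU partition and identifying the border vertices whose pairwise distances populate the border matrix, here is a sketch. The degree-based split heuristic and the BFS-based border matrix are simplifications assumed for illustration, not the thesis's exact method.

```python
from collections import deque

def split_graph(adj, gpu_fraction=0.5):
    """Assign high-degree vertices to the GPU partition (a common
    heuristic for hybrid processing; the thesis also supports
    variable partitioning by processor speed)."""
    by_degree = sorted(adj, key=lambda v: -len(adj[v]))
    cut = int(len(by_degree) * gpu_fraction)
    gpu, cpu = set(by_degree[:cut]), set(by_degree[cut:])
    # Border vertices: endpoints of edges that cross the partition.
    border = {u for u in adj for v in adj[u] if (u in gpu) != (v in gpu)}
    return cpu, gpu, border

def border_matrix(adj, part, border):
    """Unweighted shortest distances between border vertices of one
    partition, computed by BFS restricted to that partition."""
    local = {v for v in border if v in part}
    dist = {}
    for s in local:
        d, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w in part and w not in d:
                    d[w] = d[u] + 1
                    q.append(w)
        dist[s] = {t: d.get(t) for t in local}  # None if unreachable locally
    return dist

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
cpu, gpu, border = split_graph(adj)
print(border_matrix(adj, cpu, border))
```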

Book chapters on the topic "CPU-GPU Partitioning"

1

Clarke, David, Aleksandar Ilic, Alexey Lastovetsky, and Leonel Sousa. "Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters." In Euro-Par 2012 Parallel Processing, 489–501. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-32820-6_49.

2

Saba, Issa, Eishi Arima, Dai Liu, and Martin Schulz. "Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning." In Architecture of Computing Systems, 51–67. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-21867-5_4.

3

Fei, Xiongwei, Kenli Li, Wangdong Yang, and Keqin Li. "CPU-GPU Computing." In Innovative Research and Applications in Next-Generation High Performance Computing, 159–93. IGI Global, 2016. http://dx.doi.org/10.4018/978-1-5225-0287-6.ch007.

Abstract:
Heterogeneous and hybrid computing has been heavily studied in the field of parallel and distributed computing in recent years. It can work on a single computer or in a group of computers connected by a high-speed network. The former is the topic of this chapter. Its key points are how to cooperatively use devices that differ in performance and architecture to satisfy various computing requirements, and how to make the whole program achieve the best performance possible when executed. CPUs and GPUs have fundamentally different design philosophies, but combining their characteristics can yield better performance in many applications. However, it is still a challenge to optimize them. This chapter focuses on the main optimization strategies, including "partitioning and load-balancing", "data access", "communication", and "synchronization and asynchronization". Furthermore, two applications will be introduced as examples of using these strategies.
4

"Topology-Aware Load-Balance Schemes for Heterogeneous Graph Processing." In Advances in Computer and Electrical Engineering, 113–43. IGI Global, 2018. http://dx.doi.org/10.4018/978-1-5225-3799-1.ch005.

Abstract:
Inspired by the insights presented in Chapters 2, 3, and 4, in this chapter the authors present the KCMAX (K-Core MAX) and the KCML (K-Core Multi-Level) frameworks: novel k-core-based graph partitioning approaches that produce unbalanced partitions of complex networks that are suitable for heterogeneous parallel processing. Then they use KCMAX and KCML to explore the configuration space for accelerating BFSs on large complex networks in the context of TOTEM, a BSP heterogeneous GPU + CPU HPC platform. They study the feasibility of the heterogeneous computing approach by systematically studying different graph partitioning strategies, including the KCMAX and KCML algorithms, while processing synthetic and real-world complex networks.
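A compact sketch of the k-core peeling that underlies KCMAX/KCML-style partitioning: vertices in the dense k-core go to the GPU-friendly partition, the sparse periphery to the CPU. The two-way assignment rule here is an assumption for illustration; the chapter's frameworks are considerably more elaborate.

```python
def core_numbers(adj):
    """Core number of every vertex by repeatedly peeling the
    minimum-degree vertex (kept O(V^2) for clarity; the standard
    bucketed algorithm runs in O(V + E))."""
    deg = {v: len(adj[v]) for v in adj}
    removed, core = set(), {}
    while len(removed) < len(adj):
        v = min((u for u in adj if u not in removed), key=lambda u: deg[u])
        core[v] = deg[v]
        removed.add(v)
        for w in adj[v]:
            if w not in removed and deg[w] > deg[v]:
                deg[w] -= 1
    return core

def kcore_partition(adj, k):
    """Send the dense k-core to the GPU partition, the rest to the CPU."""
    core = core_numbers(adj)
    gpu = {v for v, c in core.items() if c >= k}
    return set(adj) - gpu, gpu

# A 4-clique (dense core) with one pendant vertex (sparse periphery):
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
print(kcore_partition(adj, k=3))
```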

Conference papers on the topic "CPU-GPU Partitioning"

1

Goodarzi, Bahareh, Martin Burtscher, and Dhrubajyoti Goswami. "Parallel Graph Partitioning on a CPU-GPU Architecture." In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2016. http://dx.doi.org/10.1109/ipdpsw.2016.16.

2

Cho, Younghyun, Florian Negele, Seohong Park, Bernhard Egger, and Thomas R. Gross. "On-the-fly workload partitioning for integrated CPU/GPU architectures." In PACT '18: International conference on Parallel Architectures and Compilation Techniques. New York, NY, USA: ACM, 2018. http://dx.doi.org/10.1145/3243176.3243210.

3

Kim, Dae Hee, Rakesh Nagi, and Deming Chen. "Thanos: High-Performance CPU-GPU Based Balanced Graph Partitioning Using Cross-Decomposition." In 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2020. http://dx.doi.org/10.1109/asp-dac47756.2020.9045588.

4

Wang, Xin, and Wei Zhang. "Cache locking vs. partitioning for real-time computing on integrated CPU-GPU processors." In 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC). IEEE, 2016. http://dx.doi.org/10.1109/pccc.2016.7820644.

5

Fang, Juan, Shijian Liu, and Xibei Zhang. "Research on Cache Partitioning and Adaptive Replacement Policy for CPU-GPU Heterogeneous Processors." In 2017 16th International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES). IEEE, 2017. http://dx.doi.org/10.1109/dcabes.2017.12.

6

Wachter, Eduardo Weber, Geoff V. Merrett, Bashir M. Al-Hashimi, and Amit Kumar Singh. "Reliable mapping and partitioning of performance-constrained openCL applications on CPU-GPU MPSoCs." In ESWEEK'17: THIRTEENTH EMBEDDED SYSTEM WEEK. New York, NY, USA: ACM, 2017. http://dx.doi.org/10.1145/3139315.3157088.

7

Xiao, Chunhua, Wei Ran, Fangzhu Lin, and Lin Zhang. "Dynamic Fine-Grained Workload Partitioning for Irregular Applications on Discrete CPU-GPU Systems." In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE, 2021. http://dx.doi.org/10.1109/ispa-bdcloud-socialcom-sustaincom52081.2021.00148.

8

Magalhães, W. F., H. M. Gomes, L. B. Marinho, G. S. Aguiar, and P. Silveira. "Investigating Mobile Edge-Cloud Trade-Offs of Object Detection with YOLO." In VII Symposium on Knowledge Discovery, Mining and Learning. Sociedade Brasileira de Computação - SBC, 2019. http://dx.doi.org/10.5753/kdmile.2019.8788.

Abstract:
With the advent of smart IoT applications empowered with AI, together with the democratization of mobile devices, moving the computation from cloud to edge is a natural trend in both academia and industry. A major challenge in this direction is enabling the deployment of Deep Neural Networks (DNNs), which usually demand lots of computational resources (i.e. memory, disk, CPU/GPU, and power), on resource-limited edge devices. Among the possible strategies to tackle this challenge are: (i) running the entire DNN on the edge device (sometimes not feasible), (ii) distributing the computation between edge and cloud, or (iii) running the entire DNN on the cloud. All these strategies involve trade-offs in terms of latency, communication, and financial costs. In this article we investigate such trade-offs in a real-world scenario involving object detection from video surveillance feeds. We conduct several experiments on two different versions of YOLO (You Only Look Once), a state-of-the-art DNN designed for fast and accurate object detection and location. Our experimental setup for DNN model partitioning includes a Raspberry Pi 3 B+ and a cloud server equipped with a GPU. Experiments using different network bandwidths are performed. Our results provide useful insights about the aforementioned trade-offs.
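The latency trade-off the paper measures empirically can be framed with a simple first-order model: run inference on the edge, or pay an upload cost to run it on a faster cloud GPU. All constants below are invented placeholders; the paper derives real numbers from a Raspberry Pi 3 B+ and a GPU server over varying bandwidths.

```python
def best_placement(frame_bytes, bandwidth_bps,
                   t_edge_infer=0.9, t_cloud_infer=0.03, rtt=0.05):
    """Pick edge vs. cloud inference from a first-order latency model.

    Edge:  local inference time only.
    Cloud: network round trip + upload time + server inference time.
    All times in seconds; all constants are illustrative.
    """
    t_edge = t_edge_infer
    t_cloud = rtt + frame_bytes * 8 / bandwidth_bps + t_cloud_infer
    return ("edge", t_edge) if t_edge <= t_cloud else ("cloud", t_cloud)

# A 100 KB frame over a fast link vs. a slow uplink:
print(best_placement(100_000, 20e6))   # fast link: cloud wins
print(best_placement(100_000, 0.5e6))  # slow link: edge wins
```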
9

Negrut, Dan, Toby Heyn, Andrew Seidl, Dan Melanz, David Gorsich, and David Lamb. "ENABLING COMPUTATIONAL DYNAMICS IN DISTRIBUTED COMPUTING ENVIRONMENTS USING A HETEROGENEOUS COMPUTING TEMPLATE." In 2024 NDIA Michigan Chapter Ground Vehicle Systems Engineering and Technology Symposium. 2101 Wilson Blvd, Suite 700, Arlington, VA 22201, United States: National Defense Industrial Association, 2024. http://dx.doi.org/10.4271/2024-01-3314.

Abstract:
This paper describes a software infrastructure made up of tools and libraries designed to assist developers in implementing computational dynamics applications running on heterogeneous and distributed computing environments. Together, these tools and libraries compose a so called Heterogeneous Computing Template (HCT). The underlying theme of the solution approach embraced by HCT is that of partitioning the domain of interest into a number of sub-domains that are each managed by a separate core/accelerator (CPU/GPU) pair. The five components at the core of HCT, which ultimately enable the distributed/heterogeneous computing approach to large-scale dynamical system simulation, are as follows: (a) a method for the geometric domain decomposition; (b) methods for proximity computation or collision detection; (c) support for moving data within the heterogeneous hardware ecosystem to mirror the migration of simulation elements from subdomain to subdomain; (d) parallel numerical methods for solving the specific dynamics problem of interest; and (e) tools for performing visualization and post-processing in a distributed manner.
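Component (a) of the HCT, the geometric decomposition of the simulation domain into sub-domains each owned by a core/accelerator pair, reduces to a spatial binning plus an ownership map; element migration, component (c), then shows up as a change of owning sub-domain between steps. The uniform 1-D slab decomposition below is a deliberately simple stand-in for the paper's decomposition method.

```python
def slab_decompose(positions, n_subdomains, x_min, x_max):
    """Assign each body to a 1-D slab sub-domain by its x coordinate.

    Returns a list mapping body index -> sub-domain id; each id is
    assumed to be serviced by one CPU core / GPU pair.
    """
    width = (x_max - x_min) / n_subdomains
    owners = []
    for x, _, _ in positions:
        sd = min(int((x - x_min) / width), n_subdomains - 1)
        owners.append(sd)
    return owners

def migrations(old_owners, new_owners):
    """Bodies whose owner changed must be shipped between devices."""
    return [i for i, (a, b) in enumerate(zip(old_owners, new_owners))
            if a != b]

pos0 = [(0.1, 0, 0), (2.4, 0, 0), (3.9, 0, 0)]
pos1 = [(0.9, 0, 0), (3.1, 0, 0), (3.8, 0, 0)]  # after one time step
own0 = slab_decompose(pos0, 4, 0.0, 4.0)
own1 = slab_decompose(pos1, 4, 0.0, 4.0)
print(own0, own1, migrations(own0, own1))
```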
10

Heyn, Toby, Andrew Seidl, Hammad Mazhar, David Lamb, Alessandro Tasora, and Dan Negrut. "Enabling Computational Dynamics in Distributed Computing Environments Using a Heterogeneous Computing Template." In ASME 2011 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. ASMEDC, 2011. http://dx.doi.org/10.1115/detc2011-48347.

Abstract:
This paper describes a software infrastructure made up of tools and libraries designed to assist developers in implementing computational dynamics applications running on heterogeneous and distributed computing environments. Together, these tools and libraries compose a so called Heterogeneous Computing Template (HCT). The heterogeneous and distributed computing hardware infrastructure is assumed herein to be made up of a combination of CPUs and GPUs. The computational dynamics applications targeted to execute on such a hardware topology include many-body dynamics, smoothed-particle hydrodynamics (SPH) fluid simulation, and fluid-solid interaction analysis. The underlying theme of the solution approach embraced by HCT is that of partitioning the domain of interest into a number of sub-domains that are each managed by a separate core/accelerator (CPU/GPU) pair. Five components at the core of HCT enable the envisioned distributed computing approach to large-scale dynamical system simulation: (a) a method for the geometric domain decomposition and mapping onto heterogeneous hardware; (b) methods for proximity computation or collision detection; (c) support for moving data among the corresponding hardware as elements move from subdomain to subdomain; (d) numerical methods for solving the specific dynamics problem of interest; and (e) tools for performing visualization and post-processing in a distributed manner. In this contribution the components (a) and (c) of the HCT are demonstrated via the example of the Discrete Element Method (DEM) for rigid body dynamics with friction and contact. The collision detection task required in frictional-contact dynamics, i.e., task (b) above, is discussed separately and in the context of GPU computing. This task is shown to benefit from a two-order-of-magnitude gain in efficiency when compared to traditional sequential implementations. Note: Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not imply its endorsement, recommendation, or favoring by the US Army. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Army, and shall not be used for advertising or product endorsement purposes.