To view the other types of publications on this topic, follow the link: CPU-GPU Partitioning.

Journal articles on the topic "CPU-GPU Partitioning"

Cite a source in APA, MLA, Chicago, Harvard, and other citation styles

Select a source type:

Consult the top 34 journal articles for your research on the topic "CPU-GPU Partitioning."

Next to every work in the bibliography there is an "Add to bibliography" option. Use it, and the bibliographic reference for the chosen work will be formatted automatically in the citation style you need (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the scholarly publication as a PDF and read its online annotation, provided the relevant parameters are present in the metadata.

Browse journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Benatia, Akrem, Weixing Ji, Yizhuo Wang, and Feng Shi. "Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms." International Journal of High Performance Computing Applications 34, no. 1 (November 14, 2019): 66–80. http://dx.doi.org/10.1177/1094342019886628.

Full text of the source
Annotation:
The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most of the existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned among the different available processing units. The partitioning problem is made more challenging by the existence of many sparse formats whose performance depends both on the sparsity of the input matrix and on the hardware used. Thus, the best performance depends not only on how the input sparse matrix is partitioned but also on which sparse format is used for each partition. To address this challenge, we propose in this article a new CPU-GPU heterogeneous method for computing the SpMV kernel that combines different sparse formats to achieve better performance and better utilization of CPU-GPU heterogeneous platforms. The proposed solution horizontally partitions the input matrix into multiple block-rows and predicts their best sparse formats using machine learning-based performance models. A mapping algorithm is then used to assign the block-rows to the CPU and GPU(s) available in the system. Our experimental results using real-world large unstructured sparse matrices on two different machines show a noticeable performance improvement.
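The row-wise split described in this abstract can be illustrated with a small sketch (my own simplification: fixed throughput numbers stand in for the paper's learned performance models, and the sparse-format predictor is omitted):

```python
def partition_block_rows(block_rows, cpu_gflops, gpu_gflops):
    """Assign block-rows (id, nnz) to 'cpu' or 'gpu' so that nonzeros are
    split roughly in proportion to each device's assumed throughput."""
    total_nnz = sum(nnz for _, nnz in block_rows)
    gpu_share = gpu_gflops / (cpu_gflops + gpu_gflops)
    assignment, gpu_nnz = {}, 0
    # Greedily fill the GPU's nnz budget with the largest block-rows first.
    for rid, nnz in sorted(block_rows, key=lambda b: -b[1]):
        if gpu_nnz + nnz <= gpu_share * total_nnz:
            assignment[rid] = "gpu"
            gpu_nnz += nnz
        else:
            assignment[rid] = "cpu"
    return assignment
```

With a GPU three times faster than the CPU, the mapping sends roughly three quarters of the nonzeros to the GPU.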
APA, Harvard, Vancouver, ISO, and other styles
2

Narayana, Divyaprabha Kabbal, and Sudarshan Tekal Subramanyam Babu. "Optimal task partitioning to minimize failure in heterogeneous computational platform." International Journal of Electrical and Computer Engineering (IJECE) 15, no. 1 (February 1, 2025): 1079. http://dx.doi.org/10.11591/ijece.v15i1.pp1079-1088.

Full text of the source
Annotation:
The increased energy consumption of heterogeneous cloud platforms surges carbon emissions and reduces system reliability, thus making workload scheduling an extremely challenging process. The dynamic voltage-frequency scaling (DVFS) technique provides an efficient mechanism for improving the energy efficiency of a cloud platform; however, employing DVFS reduces reliability and increases the failure rate of resource scheduling. Most current workload scheduling methods have failed to optimize energy and reliability together on a central processing unit–graphics processing unit (CPU-GPU) heterogeneous computing platform; as a result, reducing energy consumption and task failure are the prime issues this work aims to address. This work introduces task failure minimization (TFM) through optimal task partitioning (OTP) for workload scheduling on the CPU-GPU cloud computational platform. TFM-OTP introduces a task partitioning model for the CPU-GPU pair; then, it provides a DVFS-based energy consumption model. Finally, the energy-load optimization problem is defined, and the optimal resource allocation design is presented. The experiment is conducted on two standard workloads, namely the SIPHT and CyberShake workloads. The results show that the proposed TFM-OTP model reduces energy consumption by 30.35%, makespan by 70.78%, and task failure energy overhead by 83.7% in comparison with the energy minimized scheduling (EMS) approach.
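DVFS energy models of the kind the abstract mentions are usually variants of the classic CMOS dynamic-power formula; the following is a generic textbook sketch (not the paper's exact model; the linear voltage-frequency scaling is an assumption):

```python
def dvfs_energy(cycles, freq, capacitance=1.0, v_nominal=1.0, f_nominal=1.0):
    """Dynamic energy under DVFS: supply voltage is assumed to scale
    linearly with frequency, dynamic power = C * V^2 * f, and runtime
    = cycles / f, so energy = C * V^2 * cycles."""
    v = v_nominal * (freq / f_nominal)
    power = capacitance * v * v * freq
    return power * (cycles / freq)
```

Under this model, halving the frequency quarters the energy of a fixed cycle count, which is exactly the saving that DVFS trades against longer runtimes and higher failure rates.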
APA, Harvard, Vancouver, ISO, and other styles
3

Huijing Yang and Tingwen Yu. "Two novel cache management mechanisms on CPU-GPU heterogeneous processors." Research Briefs on Information and Communication Technology Evolution 7 (June 15, 2021): 1–8. http://dx.doi.org/10.56801/rebicte.v7i.113.

Full text of the source
Annotation:
Heterogeneous multicore processors that take full advantage of CPUs and GPUs within the same chip raise an emerging challenge for sharing a series of on-chip resources, particularly Last-Level Cache (LLC) resources. Since the GPU core has good parallelism and memory latency tolerance, the majority of the LLC space is utilized by GPU applications. Under current cache management policies, the LLC share of CPU applications can be remarkably decreased due to the existence of GPU workloads, thus seriously affecting overall performance. To alleviate the unfair contention between CPUs and GPUs for cache capacity, we propose two novel cache supervision mechanisms: a static cache partitioning scheme based on an adaptive replacement policy (SARP) and a dynamic cache partitioning scheme based on GPU missing awareness (DGMA). The SARP scheme first uses cache partitioning to split the cache ways between CPUs and GPUs and then uses an adaptive cache replacement policy depending on the type of the requested message. The DGMA scheme monitors the GPU's cache performance metrics at run time and sets an appropriate threshold to dynamically change the cache ratio of the mutual LLC between various kernels. Experimental results show that the SARP mechanism can further increase CPU performance, by up to 32.6% and an average of 8.4%, and the DGMA scheme improves CPU performance under the premise that GPU performance is not affected, achieving a maximum increase of 18.1% and an average increase of 7.7%.
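A dynamic way-repartitioning step of the kind DGMA performs might look roughly like this toy sketch (the threshold, one-way step size, and bounds are my assumptions, not the paper's values):

```python
def adjust_ways(cpu_ways, gpu_ways, gpu_miss_rate, threshold=0.5):
    """Toy dynamic LLC way repartitioning: when the GPU's miss rate
    exceeds the threshold, extra ways are not helping it (latency-tolerant
    GPU kernels miss anyway), so move one way to the CPU; otherwise
    move one way back to the GPU."""
    if gpu_miss_rate > threshold and gpu_ways > 1:
        return cpu_ways + 1, gpu_ways - 1
    if gpu_miss_rate <= threshold and cpu_ways > 1:
        return cpu_ways - 1, gpu_ways + 1
    return cpu_ways, gpu_ways
```

Calling this once per monitoring interval lets the CPU/GPU way ratio track the GPU's observed cache behavior instead of a fixed static split.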
APA, Harvard, Vancouver, ISO, and other styles
4

Fang, Juan, Mengxuan Wang, and Zelin Wei. "A memory scheduling strategy for eliminating memory access interference in heterogeneous system." Journal of Supercomputing 76, no. 4 (January 10, 2020): 3129–54. http://dx.doi.org/10.1007/s11227-019-03135-7.

Full text of the source
Annotation:
Multiple CPUs and GPUs are integrated on the same chip to share memory, and access requests between cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance. Requests between multiple CPUs are intertwined when accessing memory, and their performance is greatly affected. The difference in access latency between GPU cores increases the average latency of memory accesses. In order to solve the problems encountered in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy, which improves system performance. The step-by-step memory scheduling strategy first creates a new memory request queue based on the request source and isolates CPU requests from GPU requests when the memory controller receives a memory request, thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is implemented, which dynamically maps requests to different bank sets according to the different memory characteristics of the applications, and eliminates memory request interference of multiple CPU applications without affecting bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between the cores. Based on the first-ready, first-come-first-served strategy, we implemented criticality-aware memory scheduling to balance the locality and criticality of application access.
APA, Harvard, Vancouver, ISO, and other styles
5

Merrill, Duane, and Andrew Grimshaw. "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing." Parallel Processing Letters 21, no. 02 (June 2011): 245–72. http://dx.doi.org/10.1142/s0129626411000187.

Full text of the source
Annotation:
The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
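The digit-binning passes at the heart of radix sorting can be sketched serially; computing each bin's output offset from the per-bin counts is exactly the prefix-scan step that the paper's GPU "multi-scan" parallelizes (this is a generic LSD radix sort, not the authors' GPU kernel):

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """Serial LSD radix sort over non-negative integers: each pass bins
    keys by one radix_bits-wide digit; concatenating the bins in order
    plays the role of the prefix scan + scatter on the GPU."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        bins = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            bins[(k >> shift) & mask].append(k)
        keys = [k for b in bins for k in b]  # stable per pass
    return keys
```

Because each pass is stable, four 8-bit passes fully sort 32-bit keys.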
APA, Harvard, Vancouver, ISO, and other styles
6

Vilches, Antonio, Rafael Asenjo, Angeles Navarro, Francisco Corbera, Rubén Gran, and María Garzarán. "Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips." Procedia Computer Science 51 (2015): 140–49. http://dx.doi.org/10.1016/j.procs.2015.05.213.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
7

Sung, Hanul, Hyeonsang Eom, and HeonYoung Yeom. "The Need of Cache Partitioning on Shared Cache of Integrated Graphics Processor between CPU and GPU." KIISE Transactions on Computing Practices 20, no. 9 (September 15, 2014): 507–12. http://dx.doi.org/10.5626/ktcp.2014.20.9.507.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
8

Wang, Shunjiang, Baoming Pu, Ming Li, Weichun Ge, Qianwei Liu, and Yujie Pei. "State Estimation Based on Ensemble DA–DSVM in Power System." International Journal of Software Engineering and Knowledge Engineering 29, no. 05 (May 2019): 653–69. http://dx.doi.org/10.1142/s0218194019400023.

Full text of the source
Annotation:
This paper investigates the state estimation problem of power systems. A novel, fast and accurate state estimation algorithm is presented to solve this problem based on the one-dimensional denoising autoencoder and deep support vector machine (1D DA–DSVM). Besides, for further reducing the computation burden, a partitioning method is presented to divide the power system into several sub-networks and the proposed algorithm can be applied to each sub-network. A hybrid computing architecture of Central Processing Unit (CPU) and Graphics Processing Unit (GPU) is employed in the overall state estimation, in which the GPU is used to estimate each sub-network and the CPU is used to integrate all the calculation results and output the state estimate. Simulation results show that the proposed method can effectively improve the accuracy and computational efficiency of the state estimation of power systems.
APA, Harvard, Vancouver, ISO, and other styles
9

Barreiros, Willian, Alba C. M. A. Melo, Jun Kong, Renato Ferreira, Tahsin M. Kurc, Joel H. Saltz, and George Teodoro. "Efficient microscopy image analysis on CPU-GPU systems with cost-aware irregular data partitioning." Journal of Parallel and Distributed Computing 164 (June 2022): 40–54. http://dx.doi.org/10.1016/j.jpdc.2022.02.004.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
10

Singh, Amit Kumar, Alok Prakash, Karunakar Reddy Basireddy, Geoff V. Merrett, and Bashir M. Al-Hashimi. "Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs." ACM Transactions on Embedded Computing Systems 16, no. 5s (October 10, 2017): 1–22. http://dx.doi.org/10.1145/3126548.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
11

Hou, Neng, Fazhi He, Yi Zhou, Yilin Chen, and Xiaohu Yan. "A Parallel Genetic Algorithm With Dispersion Correction for HW/SW Partitioning on Multi-Core CPU and Many-Core GPU." IEEE Access 6 (2018): 883–98. http://dx.doi.org/10.1109/access.2017.2776295.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
12

Mahmud, Shohaib, Haiying Shen, and Anand Iyer. "PACER: Accelerating Distributed GNN Training Using Communication-Efficient Partition Refinement and Caching." Proceedings of the ACM on Networking 2, CoNEXT4 (November 25, 2024): 1–18. http://dx.doi.org/10.1145/3697805.

Full text of the source
Annotation:
Despite recent breakthroughs in distributed Graph Neural Network (GNN) training, large-scale graphs still generate significant network communication overhead, decreasing time and resource efficiency. Although recently proposed partitioning or caching methods try to reduce communication inefficiencies and overheads, they are not sufficiently effective due to their sampling-pattern-agnostic nature. This paper proposes the Pipelined Partition Aware Caching and Communication Efficient Refinement System (Pacer), a communication-efficient distributed GNN training system. First, Pacer intelligently estimates each partition's access frequency for each vertex by jointly considering the sampling method and the graph topology. Then, it uses the estimated access frequencies to refine partitions and cache vertices in its two-level cache (CPU and GPU) to minimize data transfer latency. Furthermore, Pacer incorporates a pipeline-based minibatching method to mask the effect of network communication. Experimental results on real-world graphs show that Pacer outperforms state-of-the-art distributed GNN training systems in training time by 40% on average.
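The frequency-driven two-level cache described above can be sketched as follows (the slot counts and simple rank-by-frequency rule are illustrative assumptions, not Pacer's actual placement algorithm):

```python
def fill_two_level_cache(access_freq, gpu_slots, cpu_slots):
    """Place the hottest vertices in fast GPU memory and the next tier
    in CPU memory, given estimated per-vertex access frequencies."""
    ranked = sorted(access_freq, key=access_freq.get, reverse=True)
    gpu_cache = set(ranked[:gpu_slots])
    cpu_cache = set(ranked[gpu_slots:gpu_slots + cpu_slots])
    return gpu_cache, cpu_cache
```

Vertices missing from both levels would then be fetched over the network, which is the cost the frequency estimate is meant to minimize.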
APA, Harvard, Vancouver, ISO, and other styles
13

Chen, Hao, Anqi Wei, and Ye Zhang. "Three-level parallel-set partitioning in hierarchical trees coding based on the collaborative CPU and GPU for remote sensing images compression." Journal of Applied Remote Sensing 11, no. 04 (December 18, 2017): 1. http://dx.doi.org/10.1117/1.jrs.11.045015.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
14

Wu, Qunyong, Yuhang Wang, Haoyu Sun, Han Lin, and Zhiyuan Zhao. "A System Coupled GIS and CFD for Atmospheric Pollution Dispersion Simulation in Urban Blocks." Atmosphere 14, no. 5 (May 5, 2023): 832. http://dx.doi.org/10.3390/atmos14050832.

Full text of the source
Annotation:
Atmospheric pollution is a critical issue in public health systems. The simulation of atmospheric pollution dispersion in urban blocks, using CFD, faces several challenges, including the complexity and inefficiency of existing CFD software, time-consuming construction of CFD urban block geometry, and limited visualization and analysis capabilities of simulation outputs. To address these challenges, we have developed a prototype system that couples 3DGIS and CFD for simulating, visualizing, and analyzing atmospheric pollution dispersion. Specifically, a parallel algorithm for coordinate transformation was designed, and the relevant commands were encapsulated to automate the construction of geometry and meshing required for CFD simulations of urban blocks. Additionally, the Fluent-based command flow was parameterized and encapsulated, enabling the automatic generation of model calculation command flow files to simulate atmospheric pollution dispersion. Moreover, multi-angle spatial partitioning and spatiotemporal multidimensional visualization analysis were introduced to achieve an intuitive expression and analysis of CFD simulation results. The result shows that the constructed geometry is correct, and the mesh quality meets requirements with all values above 0.45. CPU and GPU parallel algorithms are 13.3× and 25× faster than serial. Furthermore, our case study demonstrates the developed system’s effectiveness in simulating, visualizing, and analyzing atmospheric pollution dispersion in urban blocks.
APA, Harvard, Vancouver, ISO, and other styles
15

Giannoula, Christina, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. "SparseP." Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, no. 1 (February 24, 2022): 1–49. http://dx.doi.org/10.1145/3508041.

Full text of the source
Annotation:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
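A 1D partitioning of matrix rows across cores, one of the scheme families the abstract mentions, can be sketched as follows (a simplified greedy nonzero balancer, not SparseP's exact algorithm):

```python
def partition_rows_1d(row_nnz, n_cores):
    """Assign contiguous row ranges to cores so that each range holds
    roughly total_nnz / n_cores nonzeros (row_nnz[i] = nonzeros in row i).
    Returns half-open row ranges (start, end), one per core."""
    target = sum(row_nnz) / n_cores
    parts, start, acc = [], 0, 0
    for i, nnz in enumerate(row_nnz):
        acc += nnz
        # Close the current range once it reaches the per-core target,
        # leaving at least one range for each remaining core.
        if acc >= target and len(parts) < n_cores - 1:
            parts.append((start, i + 1))
            start, acc = i + 1, 0
    parts.append((start, len(row_nnz)))
    return parts
```

Balancing by nonzeros rather than by row count matters for skewed sparsity patterns, where a few dense rows would otherwise overload one core.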
APA, Harvard, Vancouver, ISO, and other styles
16

Kumar, P. S. Jagadeesh, Tracy Lin Huan, and Yang Yung. "Computational Paradigm and Quantitative Optimization to Parallel Processing Performance of Still Image Compression." Circulation in Computer Science 2, no. 4 (May 20, 2017): 11–17. http://dx.doi.org/10.22632/ccs-2017-252-02.

Full text of the source
Annotation:
Fashionable and staggering evolution in inferring the parallel processing routine coupled with the necessity to amass and distribute huge magnitude of digital records especially still images has fetched an amount of confronts for researchers and other stakeholders. These disputes exorbitantly outlay and maneuvers the digital information among others, subsists the spotlight of the research civilization in topical days and encompasses the lead to the exploration of image compression methods that can accomplish exceptional outcomes. One of those practices is the parallel processing of a diversity of compression techniques, which facilitates split, an image into ingredients of reverse occurrences and has the benefit of great compression. This manuscript scrutinizes the computational intricacy and the quantitative optimization of diverse still image compression tactics and additional accede to the recital of parallel processing. The computational efficacy is analyzed and estimated with respect to the Central Processing Unit (CPU) as well as Graphical Processing Unit (GPU). The PSNR (Peak Signal to Noise Ratio) is exercised to guesstimate image re-enactment and eminence in harmonization. The moments are obtained and conferred with support on different still image compression algorithms such as Block Truncation Coding (BTC), Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), Dual Tree Complex Wavelet Transform (DTCWT), Set Partitioning in Hierarchical Trees (SPIHT), Embedded Zero-tree Wavelet (EZW). The evaluation is conceded in provisos of coding efficacy, memory constraints, image quantity and quality.
APA, Harvard, Vancouver, ISO, and other styles
17

Tanaka, Satoshi, Kyoko Hasegawa, Susumu Nakata, Hideo Nakajima, Takuya Hatta, Frederika Rambu Ngana, Takuma Kawamura, Naohisa Sakamoto, and Koji Koyamada. "Grid-independent Metropolis sampling for volume visualization." International Journal of Modeling, Simulation, and Scientific Computing 01, no. 02 (June 2010): 199–218. http://dx.doi.org/10.1142/s1793962310000158.

Full text of the source
Annotation:
We propose a method of sampling regular and irregular-grid volume data for visualization. The method is based on the Metropolis algorithm that is a type of Monte Carlo technique. Our method enables "importance sampling" of local regions of interest in the visualization by generating sample points intensively in regions where a user-specified transfer function takes the peak values. The generated sample-point distribution is independent of the grid structure of the given volume data. Therefore, our method is applicable to irregular grids as well as regular grids. We demonstrate the effectiveness of our method by applying it to regular cubic grids and irregular tetrahedral grids with adaptive cell sizes. We visualize volume data by projecting the generated sample points onto the 2D image plane. We tested our sampling with three rendering models: an X-ray model, a simple illuminant particle model, and an illuminant particle model with light-attenuation effects. The grid-independency and the efficiency in the parallel processing mean that our method is suitable for visualizing large-scale volume data. The former means that the required number of sample points is proportional to the number of 2D pixels, not the number of 3D voxels. The latter means that our method can be easily accelerated on the multiple-CPU and/or GPU platforms. We also show that our method can work with adaptive space partitioning of volume data, which also enables us to treat large-scale/complex volume data easily.
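The Metropolis sampling idea in the abstract reduces, in one dimension, to the following sketch (the paper works on 3-D volume data; only the standard accept/reject rule is shown, with step size and bounds as assumptions):

```python
import random

def metropolis_samples(density, n, x0=0.5, step=0.1, seed=0):
    """1-D Metropolis sampler: propose a local move and accept it with
    probability min(1, density(x_new) / density(x)), so samples
    concentrate where the user-specified density (here standing in for
    the transfer function) peaks, independently of any grid."""
    rng = random.Random(seed)
    xs, x = [], x0
    for _ in range(n):
        x_new = x + rng.uniform(-step, step)
        if 0.0 <= x_new <= 1.0:
            if rng.random() < min(1.0, density(x_new) / max(density(x), 1e-12)):
                x = x_new
        xs.append(x)
    return xs
```

Because only density ratios are evaluated, no normalization of the transfer function is needed, which is what makes the approach grid-independent.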
APA, Harvard, Vancouver, ISO, and other styles
18

Bloch, Aurelien, Simone Casale-Brunet, and Marco Mattavelli. "Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution." Journal of Low Power Electronics and Applications 12, no. 3 (June 23, 2022): 36. http://dx.doi.org/10.3390/jlpea12030036.

Full text of the source
Annotation:
The performance of programs executed on heterogeneous parallel platforms largely depends on the design choices regarding how to partition the processing on the various different processing units. In other words, it depends on the assumptions and parameters that define the partitioning, mapping, scheduling, and allocation of data exchanges among the various processing elements of the platform executing the program. The advantage of programs written in languages using the dataflow model of computation (MoC) is that executing the program with different configurations and parameter settings does not require rewriting the application software for each configuration setting, but only requires generating a new synthesis of the execution code corresponding to different parameters. The synthesis stage of dataflow programs is usually supported by automatic code generation tools. Another competitive advantage of dataflow software methodologies is that they are well-suited to support designs on heterogeneous parallel systems as they are inherently free of memory access contention issues and naturally expose the available intrinsic parallelism. So as to fully exploit these advantages and to be able to efficiently search the configuration space to find the design points that better satisfy the desired design constraints, it is necessary to develop tools and associated methodologies capable of evaluating the performance of different configurations and to drive the search for good design configurations, according to the desired performance criteria. The number of possible design assumptions and associated parameter settings is usually so large (i.e., the dimensions and size of the design space) that intuition as well as trial and error are clearly unfeasible, inefficient approaches. This paper describes a method for the clock-accurate profiling of software applications developed using the dataflow programming paradigm such as the formal RVC-CAL language. The profiling can be applied when the application program has been compiled and executed on GPU/CPU heterogeneous hardware platforms utilizing two main methodologies, denoted as static and dynamic. This paper also describes how a method for the qualitative evaluation of the performance of such programs as a function of the supplied configuration parameters can be successfully applied to heterogeneous platforms. The technique was illustrated using two different application software examples and several design points.
APA, Harvard, Vancouver, ISO, and other styles
19

Gallet, Benoit, and Michael Gowanlock. "Heterogeneous CPU-GPU Epsilon Grid Joins: Static and Dynamic Work Partitioning Strategies." Data Science and Engineering, October 21, 2020. http://dx.doi.org/10.1007/s41019-020-00145-x.

Full text of the source
Annotation:
Given two datasets (or tables) A and B and a search distance ε, the distance similarity join, denoted as A ⋉ε B, finds the pairs of points (pa, pb), where pa ∈ A and pb ∈ B, such that the distance between pa and pb is ≤ ε. If A = B, then the similarity join is equivalent to a similarity self-join, denoted as A ⋈ε A. We propose in this paper Heterogeneous Epsilon Grid Joins (HEGJoin), a heterogeneous CPU-GPU distance similarity join algorithm. Efficiently partitioning the work between the CPU and the GPU is a challenge. Indeed, the work partitioning strategy needs to consider the different characteristics and computational throughput of the processors (CPU and GPU), as well as the data-dependent nature of the similarity join that accounts for the overall execution time (e.g., the number of queries, their distribution, the dimensionality, etc.). In addition to HEGJoin, we design in this paper one dynamic and two static work partitioning strategies. We also propose a performance model for each static partitioning strategy to perform the distribution of the work between the processors. We evaluate the performance of all three partitioning methods by considering the execution time and the load imbalance between the CPU and GPU as performance metrics. HEGJoin achieves a speedup of up to 5.46× (3.97×) over the GPU-only (CPU-only) algorithms on our first test platform, and up to 1.97× (12.07×) on our second test platform over the GPU-only (CPU-only) algorithms.
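A brute-force reference for the distance similarity join defined above, in 2-D, looks as follows; this is the quadratic baseline that HEGJoin's epsilon-grid index and CPU-GPU work partitioning accelerate (my own sketch, not the paper's code):

```python
def similarity_join(A, B, eps):
    """Distance similarity join A ⋉_eps B over 2-D points: return all
    pairs (pa, pb) whose Euclidean distance is <= eps. Comparing squared
    distances avoids the square root."""
    eps2 = eps * eps
    return [(pa, pb)
            for pa in A for pb in B
            if (pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2 <= eps2]
```

With A = B this computes the self-join A ⋈ε A (including each point paired with itself).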
APA, Harvard, Vancouver, ISO, and other styles
20

Wu, Zhenlin, Haosong Zhao, Hongyuan Liu, Wujie Wen, and Jiajia Li. "gHyPart: GPU-friendly End-to-End Hypergraph Partitioner." ACM Transactions on Architecture and Code Optimization, January 10, 2025. https://doi.org/10.1145/3711925.

Full text of the source
Annotation:
Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and circuit partitioning in VLSI physical design, where high-performance solutions often demand substantial parallelism beyond what existing CPU-based solutions can offer. While GPUs are promising in this regard, their potential in hypergraph partitioning remains unexplored. In this work, we first develop an end-to-end deterministic hypergraph partitioner on GPUs, ported from state-of-the-art multi-threaded CPU work, and identify three major performance challenges by characterizing its performance. We propose the first end-to-end solution, gHyPart, to unleash the potential of hypergraph partitioning on GPUs. To overcome the challenges of GPU thread underutilization due to imbalanced workload, long critical paths, and high work complexity due to excessive operations, we redesign the GPU algorithms with diverse parallelization strategies, thus expanding the optimization space; to address the challenge that no one-size-fits-all implementation exists for various input hypergraphs, we propose a decision-tree-based strategy to choose a suitable parallelization strategy for each kernel. Evaluation on 500 hypergraphs shows up to 125.7× (17.5× on average), 640.0× (24.2× on average), and 171.6× (1.4× on average) speedups over two CPU partitioners and our GPU baseline gHyPart-B, respectively.
APA, Harvard, Vancouver, ISO, and other styles
21

"Improving Processing Speed of Real-Time Stereo Matching using Heterogenous CPU/GPU Model." International Journal of Innovative Technology and Exploring Engineering 9, no. 5 (March 10, 2020): 1983–87. http://dx.doi.org/10.35940/ijitee.e2982.039520.

Full text of the source
Annotation:
This paper presents an improvement of the processing speed of the stereo matching problem. The time required for stereo matching is a problem for many real-time applications such as robot navigation, self-driving vehicles, and object tracking. In this work, a real-time stereo matching system is proposed that utilizes the parallelism of the Graphics Processing Unit (GPU). An area-based stereo matching system is used to generate the disparity map. Four different sequential and parallel computational models are used to analyze the time consumed by stereo matching: 1) sequential CPU, 2) parallel multi-core CPU, 3) parallel GPU, and 4) parallel heterogeneous CPU/GPU. The dense disparity image is calculated, and the time is greatly reduced using the heterogeneous CPU/GPU model while maintaining the same accuracy as the other models. A static partitioning of the CPU and GPU workload is designed based on time analysis. Different cost functions are used to measure correspondence and to generate the disparity map. A sliding window is used to calculate the cost functions efficiently. A speed of more than 100 frames per second (f/s) is achieved using the parallel heterogeneous CPU/GPU model for a 640 x 480 image resolution and a disparity range of 50.
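A static CPU/GPU split based on time analysis can be sketched with a simple balance equation (the per-row times are assumed to be measured offline; this is an illustration of the design approach, not the paper's exact formula):

```python
def static_row_split(n_rows, t_cpu_per_row, t_gpu_per_row):
    """Choose the GPU's share of image rows so both devices finish
    together: rows_gpu * t_gpu == rows_cpu * t_cpu, hence
    rows_gpu = n_rows * t_cpu / (t_cpu + t_gpu)."""
    rows_gpu = round(n_rows * t_cpu_per_row / (t_cpu_per_row + t_gpu_per_row))
    return rows_gpu, n_rows - rows_gpu
```

For a 480-row image where the CPU needs three times as long per row as the GPU, the GPU would process 360 rows and the CPU 120, so neither device idles while the other finishes.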
22

Bloch, Aurelien, Simone Casale-Brunet and Marco Mattavelli. "Design Space Exploration for Partitioning Dataflow Program on CPU-GPU Heterogeneous System". Journal of Signal Processing Systems, 31.07.2023. http://dx.doi.org/10.1007/s11265-023-01884-6.

Abstract:
Dataflow programming is a methodology that enables the development of high-level, parametric programs that are independent of the underlying platform. This approach is particularly useful for heterogeneous platforms, as it eliminates the need to rewrite application software for each configuration; instead, it only requires new low-level implementation code, which is typically produced automatically by code-generation tools. The performance of programs running on heterogeneous parallel platforms is highly dependent on the partitioning and mapping of computation to the different processing units, determined by parameters that govern the partitioning, mapping, scheduling, and allocation of data exchanges among the processing elements of the platform. Determining the appropriate parameters for a specific application and set of architectures is a complex task and an active area of research. This paper presents a novel methodology for partitioning and mapping dataflow programs onto heterogeneous systems composed of both CPUs and GPUs. The objective is to identify the program configuration that provides the most efficient way to process a typical dataflow program by exploring its design space. This is an NP-complete problem that we address with a design space exploration approach based on a Tabu search meta-heuristic optimization algorithm driven by analysis of the execution trace graph of the program. The heuristic effectively identifies a solution that maps actors to processing units while improving overall performance. The parameters of the heuristic, such as the time limit and the proportion of neighboring solutions explored during each iteration, can be fine-tuned for optimal results. Additionally, the proposed approach allows for exploring solutions that do not utilize all hardware resources if that yields better performance. The effectiveness of the proposed approach is demonstrated through experimental results on dataflow programs.
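The Tabu search driver behind such a design-space exploration can be sketched generically. This is a minimal, hypothetical skeleton (the paper's actual cost and neighborhood come from the execution trace graph): move to the best neighbor not on the tabu list, keeping a short memory of recent states to escape local optima.

```python
def tabu_search(cost, neighbors, start, iters=200, tenure=8):
    """Generic tabu-search skeleton: greedily move to the best admissible
    neighbor; the tabu list blocks recently visited states."""
    best = cur = start
    tabu = [start]
    for _ in range(iters):
        candidates = [s for s in neighbors(cur) if s not in tabu]
        if not candidates:
            break
        cur = min(candidates, key=cost)   # best non-tabu move, even if uphill
        tabu.append(cur)
        if len(tabu) > tenure:
            tabu.pop(0)                   # forget the oldest state
        if cost(cur) < cost(best):
            best = cur
    return best
```

For a CPU-GPU mapping problem, a state would be an actor-to-processor assignment and a neighbor would differ in one actor's placement; a toy integer problem suffices to show the mechanics.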
23

Kemmler, Samuel, Christoph Rettinger, Ulrich Rüde, Pablo Cuéllar and Harald Köstler. "Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures". International Journal of High Performance Computing Applications, 10.01.2025. https://doi.org/10.1177/10943420241313385.

Abstract:
Current supercomputers often have a heterogeneous architecture using both conventional Central Processing Units (CPUs) and Graphics Processing Units (GPUs). At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware for various reasons, e.g., architectural requirements or pragmatism. This leads naturally to a software design where different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis for such a hybrid four-way coupled simulation of a fully resolved particle-laden flow. The Eulerian representation of the flow utilizes GPUs, while the Lagrangian model for the particles runs on conventional CPUs. Two characteristic model situations involving dense and dilute particle systems are used as benchmark scenarios. First, a roofline model is employed to predict the node-level performance and to show that the lattice-Boltzmann-based Eulerian fluid simulation reaches very good performance on a single GPU. Furthermore, the GPU-GPU communication for a large-scale Eulerian flow simulation results in only moderate slowdowns, thanks to the efficiency of CUDA-aware MPI communication combined with communication-hiding techniques. On 1024 A100 GPUs, an overall parallel efficiency of up to 71% is achieved. While the flow simulation has good performance characteristics, the integration of the stiff Lagrangian particle system requires frequent CPU-CPU communication that can become a bottleneck, especially when simulating the dense particle system. Special attention is also paid to the CPU-GPU communication overhead, since this is essential for coupling the particles to the flow simulation; thanks to our problem-aware co-partitioning, however, this overhead is found to be negligible. As a lesson learned from this development, four criteria are postulated that a hybrid implementation must meet for the efficient use of heterogeneous supercomputers.
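The node-level prediction mentioned above follows the classic roofline model: attainable performance is capped either by peak compute or by memory bandwidth times arithmetic intensity. A minimal sketch (the peak numbers in the test are illustrative, not taken from the paper):

```python
def roofline_gflops(peak_gflops, peak_gbps, intensity):
    """Attainable performance (GFLOP/s) under the naive roofline model,
    where intensity is arithmetic intensity in FLOPs per byte moved."""
    return min(peak_gflops, peak_gbps * intensity)
```

A memory-bound kernel such as a lattice-Boltzmann sweep (roughly 1 FLOP/byte) sits on the bandwidth slope, while a compute-bound kernel reaches the FLOP ceiling.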
24

Thomas, Beatrice, Roman Le Goff Latimier, Hamid Ben Ahmed, Gurvan Jodin, Abdelhafid El Ouardi and Samir Bouaziz. "Optimized CPU-GPU Partitioning for an ADMM Algorithm Applied to a Peer-to-Peer Energy Market". SSRN Electronic Journal, 2022. http://dx.doi.org/10.2139/ssrn.4186889.

25

Mu, Yifei, Ce Yu, Chao Sun, Kun Li, Yajie Zhang, Jizeng Wei, Jian Xiao and Jie Wang. "3DT-CM: A Low-complexity Cross-matching Algorithm for Large Astronomical Catalogues using 3d-tree Approach". Research in Astronomy and Astrophysics, 08.08.2023. http://dx.doi.org/10.1088/1674-4527/acee50.

Abstract:
Location-based cross-matching is a preprocessing step in astronomy that aims to identify records belonging to the same celestial body based on the angular distance formula. The traditional approach compares each record in one catalogue with every record in the other, a one-to-one comparison with high computational complexity. To reduce the computational time, index partitioning methods are used to divide the sky into regions and perform local cross-matching; in addition, cross-matching algorithms have been ported to high-performance architectures to improve their efficiency. However, index partitioning methods and computing architectures only increase the degree of parallelism; they cannot reduce the complexity of the pairwise cross-matching algorithm itself, so a better algorithm is needed to further improve cross-matching performance. In this paper, we propose a 3d-tree-based cross-matching algorithm that converts the angular distance formula into an equivalent 3d Euclidean distance and uses a 3d-tree to reduce the overall computational complexity and to avoid boundary issues. Furthermore, we demonstrate the superiority of the 3d-tree approach over the 2d-tree method and implement it using multi-threading during both the construction and querying phases. We have experimentally evaluated the proposed algorithm using publicly available catalogue data. The results show that our algorithm, running on two 32-core CPUs, achieves performance equivalent to previous experiments conducted on a six-node CPU-GPU cluster.
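The key reduction in the abstract above, replacing angular separation by an equivalent 3D Euclidean (chord) distance so that a standard 3d-tree can be used, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; for unit vectors u, v separated by angle θ, the chord length satisfies ||u − v|| = 2 sin(θ/2).

```python
import math

def radec_to_unit(ra_deg, dec_deg):
    """Equatorial coordinates -> unit vector on the celestial sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def chord_radius(theta_deg):
    """Euclidean search radius equivalent to an angular tolerance theta."""
    return 2.0 * math.sin(math.radians(theta_deg) / 2.0)
```

Feeding the unit vectors into any 3d-tree (e.g. scipy's cKDTree) and querying with chord_radius(tol) then returns exactly the pairs within the angular tolerance, with no spherical boundary cases.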
26

Magalhães, W. F., M. C. De Farias, H. M. Gomes, L. B. Marinho, G. S. Aguiar and P. Silveira. "Evaluating Edge-Cloud Computing Trade-Offs for Mobile Object Detection and Classification with Deep Learning". Journal of Information and Data Management 11, no. 1 (30.06.2020). http://dx.doi.org/10.5753/jidm.2020.2026.

Abstract:
Internet-of-Things (IoT) applications based on Artificial Intelligence, such as mobile object detection and recognition from images and videos, may greatly benefit from inferences made by state-of-the-art Deep Neural Network (DNN) models. However, adopting such models in IoT applications poses an important challenge, since DNNs usually require substantial computational resources (i.e., memory, disk, CPU/GPU, and power), which may prevent them from running on resource-limited edge devices. On the other hand, moving the heavy computation to the cloud may significantly increase running costs and latency of IoT applications. Possible strategies to tackle this challenge include: (i) DNN model partitioning between edge and cloud; and (ii) running simpler models on the edge and more complex ones in the cloud, with information exchange between models when needed. Variations of strategy (i) also include running the entire DNN on the edge device (sometimes not feasible) and running the entire DNN on the cloud. All these strategies involve trade-offs in terms of latency, communication, and financial costs. In this article we investigate such trade-offs in real-world scenarios. We conduct several experiments using deep learning models for image-based object detection and classification. Our setup includes a Raspberry Pi 3 B+ and a cloud server equipped with a GPU; different network bandwidths are also evaluated. Our results provide useful insights about the aforementioned trade-offs. The partitioning experiment showed that, overall, running the inferences entirely on the edge or entirely on the cloud server are the best options. The collaborative approach yielded a significant increase in accuracy without penalizing running costs too much.
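The latency side of the trade-off studied above can be captured by a simple additive model: time on the edge up to the split point, plus shipping the intermediate tensor, plus the remaining layers on the cloud. A hedged sketch; the per-split profiling numbers a real system would measure are made up here.

```python
def split_latency_ms(edge_ms, payload_bytes, bandwidth_mbps, cloud_ms):
    """End-to-end latency of one partitioned inference, in milliseconds."""
    transfer_ms = payload_bytes * 8 / (bandwidth_mbps * 1e6) * 1e3
    return edge_ms + transfer_ms + cloud_ms

def best_split(profiles, bandwidth_mbps):
    """profiles: one (edge_ms, payload_bytes, cloud_ms) tuple per candidate
    split point; returns the index with the lowest end-to-end latency."""
    return min(range(len(profiles)),
               key=lambda i: split_latency_ms(profiles[i][0], profiles[i][1],
                                              bandwidth_mbps, profiles[i][2]))
```

Note how the winner shifts with bandwidth, which matches the article's finding that the all-edge and all-cloud extremes often dominate at the bandwidth extremes.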
27

Sahebi, Amin, Marco Barbone, Marco Procaccini, Wayne Luk, Georgi Gaydadjiev and Roberto Giorgi. "Distributed large-scale graph processing on FPGAs". Journal of Big Data 10, no. 1 (04.06.2023). http://dx.doi.org/10.1186/s40537-023-00756-x.

Abstract:
Processing large-scale graphs is challenging due to the nature of the computation, which causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research proposes accelerating graph processing with Field-Programmable Gate Arrays (FPGAs). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph, so data needs to be repeatedly transferred to and from the FPGA, which makes data transfer time dominate over computation time. A possible way to overcome this resource limitation is to use a multi-FPGA distributed architecture with an efficient partitioning scheme that increases data locality and minimises communication between partitions. This work proposes an FPGA processing engine that overlaps, hides, and customises all data transfers so that the FPGA accelerator is fully utilised. The engine is integrated into a framework for using FPGA clusters and can use an offline partitioning method to facilitate the distribution of large-scale graphs. The framework uses Hadoop at a higher level to map a graph to the underlying hardware platform: the higher layer of computation gathers the blocks of data that have been pre-processed and stored on the host's file system and distributes them to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges. For the PageRank algorithm, widely used for ranking the importance of nodes in a graph, our implementation is the fastest compared to state-of-the-art CPU and GPU solutions, achieving a speedup of 13x compared to 8x and 3x, respectively. Moreover, for large-scale graphs the GPU solution fails due to memory limitations, while the CPU solution achieves a speedup of 12x compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multiple FPGAs in a distributed system can further improve performance by about 12x. This highlights the efficiency of our implementation for large datasets that do not fit in the on-chip memory of a hardware device.
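PageRank, the benchmark used above, reduces to a sparse scatter over the edge list each iteration, which is exactly the irregular access pattern that graph partitioning targets. A minimal power-iteration sketch (not the FPGA kernel):

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Power iteration over an edge list of (src, dst) pairs; dangling
    mass is redistributed uniformly so the ranks always sum to 1."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(rank[v] for v in range(n) if out_deg[v] == 0)
        new = [(1 - d) / n + d * dangling / n] * n
        for s, t in edges:   # irregular scatter: the partition-sensitive step
            new[t] += d * rank[s] / out_deg[s]
        rank = new
    return rank
```

Partitioning the edge list so that most (s, t) pairs stay inside one partition is what keeps this scatter local on a multi-FPGA system.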
28

Schmidt, Bertil, Felix Kallenborn, Alejandro Chacon and Christian Hundt. "CUDASW++4.0: ultra-fast GPU-based Smith–Waterman protein sequence database search". BMC Bioinformatics 25, no. 1 (02.11.2024). http://dx.doi.org/10.1186/s12859-024-05965-6.

Abstract:
Background: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive, and current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations. Results: CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide both efficient matrix tiling and sequence database partitioning schemes, and exploit next-generation floating-point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, and 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 gains over an order-of-magnitude performance improvement over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt. Conclusion: CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4.
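For reference, the recurrence that such tools accelerate is the textbook Smith-Waterman local alignment with each cell clamped at zero. A minimal score-only sketch with a linear gap penalty (real tools use affine gaps and substitution matrices such as BLOSUM62):

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """O(len(a)*len(b)) Smith-Waterman local alignment score using
    two-row dynamic programming; no traceback is kept."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            h = max(0,                                             # local: restart
                    prev[j - 1] + (match if ca == cb else mismatch),
                    prev[j] + gap,                                 # gap in b
                    cur[j - 1] + gap)                              # gap in a
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best
```

The quadratic cell count of this loop nest is what makes the GPU tiling and database-partitioning schemes in the paper worthwhile.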
29

Yanamala, Rama Muni Reddy, and Muralidhar Pullakandam. "Empowering edge devices: FPGA-based 16-bit fixed-point accelerator with SVD for CNN on 32-bit memory-limited systems". International Journal of Circuit Theory and Applications, 13.02.2024. http://dx.doi.org/10.1002/cta.3957.

Abstract:
Convolutional neural networks (CNNs) are now widely used in deep learning and computer vision applications. The convolutional layers account for most of the computation and should be computed fast on a local edge device. Field-programmable gate arrays (FPGAs) have been extensively explored as promising hardware accelerators for CNNs due to their high performance, energy efficiency, and reconfigurability. This paper develops an efficient FPGA-based 16-bit fixed-point hardware accelerator unit for deep learning applications on a 32-bit low-memory edge device (PYNQ-Z2 board). Additionally, singular value decomposition is applied to the fully connected layer for dimensionality reduction of the weight parameters. The accelerator unit was designed for all five layers and employs eight processing elements in convolution layers 1 and 2 for parallel computation. Array partitioning, loop unrolling, and pipelining are used to increase the speed of the calculations, and the AXI-Lite interface is used to communicate between the IP and other blocks. The design is tested with grayscale image classification on the MNIST handwritten digit dataset and color image classification on a tumor dataset. The experimental results show that the proposed accelerator unit performs faster than a software-based implementation: its inference speed is 89.03% higher than an Intel 3-core CPU, 86.12% higher than a Haswell 2-core CPU, and 82.45% higher than an NVIDIA Tesla K80 GPU. Furthermore, the throughput of the proposed design is 4.33 GOP/s, which is better than conventional CNN accelerator architectures.
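The SVD-based dimensionality reduction of the fully connected layer amounts to replacing the weight matrix with a low-rank factorization. A small NumPy sketch (ranks and shapes are illustrative, not the paper's):

```python
import numpy as np

def factorize_fc(weight, rank):
    """Approximate W (m x n) by A (m x r) @ B (r x n) via truncated SVD,
    cutting the parameter count from m*n down to r*(m + n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # scale the kept left singular vectors
    b = vt[:rank]
    return a, b
```

The layer then computes (x @ A) @ B, two thin matrix products instead of one large one, which is what makes the weights fit in limited on-chip memory.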
30

Liu, Chaoqiang, Xiaofei Liao, Long Zheng, Yu Huang, Haifeng Liu, Yi Zhang, Haiheng He, Haoyan Huang, Jingyi Zhou and Hai Jin. "L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform". ACM Transactions on Reconfigurable Technology and Systems, 14.03.2024. http://dx.doi.org/10.1145/3652609.

Abstract:
Due to the high complexity of constructing exact k-nearest neighbor graphs, approximate construction has become a popular research topic, with the NN-Descent algorithm among the representative in-memory approaches. To handle large datasets effectively, existing state-of-the-art solutions combine the divide-and-conquer approach with the NN-Descent algorithm: large datasets are divided into multiple partitions, and a subgraph is constructed for each partition before all the subgraphs are merged, reducing memory pressure significantly. However, such solutions fail to address inefficiencies in large-scale k-nearest neighbor graph construction. In this paper, we propose L-FNNG, a novel solution for accelerating large-scale k-nearest neighbor graph construction on a CPU-FPGA heterogeneous platform. The CPU is responsible for dividing the data and determining the order of partition processing, while the FPGA executes all construction tasks to fully utilize its acceleration capability. To accelerate the construction tasks, we design an efficient FPGA accelerator that includes Block-based Scheduling (BS) and Useless Computation Aborting (UCA) techniques to address the memory access and computation problems of the NN-Descent algorithm. We also propose an efficient scheduling strategy that includes a KD-tree-based data partitioning method and a hierarchical processing method to address scheduling inefficiency. We evaluate L-FNNG on a Xilinx Alveo U280 board hosted by a 64-core Xeon server. On multiple large-scale datasets, L-FNNG achieves, on average, a 2.3x construction speedup over the state-of-the-art GPU-based solution.
31

Aghapour, Ehsan, Dolly Sapra, Andy Pimentel and Anuj Pathania. "ARM-CO-UP: ARM COoperative Utilization of Processors". ACM Transactions on Design Automation of Electronic Systems, 08.04.2024. http://dx.doi.org/10.1145/3656472.

Abstract:
HMPSoCs combine different processors on a single chip. They enable powerful embedded devices, which increasingly perform ML inference tasks at the edge. State-of-the-art HMPSoCs can perform on-chip embedded inference using different processors, such as CPUs, GPUs, and NPUs, and can potentially overcome the low inference performance and efficiency of a single processor through cooperative use of multiple processors. However, standard inference frameworks for edge devices typically utilize only a single processor. We present the ARM-CO-UP framework, built on the ARM-CL library. The framework supports two modes of operation: Pipeline and Switch. In Pipeline mode, it optimizes inference throughput by pipelining the execution of network partitions across consecutive input frames; in Switch mode, it improves inference latency through layer-switched inference for a single input frame. Furthermore, it supports layer-wise CPU/GPU DVFS in both modes to improve power efficiency and energy consumption. ARM-CO-UP is a comprehensive framework for multi-processor CNN inference that automates CNN partitioning and mapping, pipeline synchronization, processor type switching, layer-wise DVFS, and closed-source NPU integration.
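The throughput/latency trade-off between the two modes follows from basic pipeline arithmetic: with one network partition per processor, steady-state throughput is limited by the slowest stage, while a single frame still pays the sum of all stages. A toy model (the stage times in the test are hypothetical):

```python
def pipeline_stats(stage_ms):
    """Return (throughput in frames/s, single-frame latency in ms) for a
    pipeline whose stages process consecutive frames concurrently."""
    return 1000.0 / max(stage_ms), float(sum(stage_ms))
```

This is why a pipelined partitioning helps a camera stream (throughput) but not a single query (latency), motivating the separate Switch mode.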
32

Vera-Parra, Nelson Enrique, Danilo Alfonso López-Sarmiento and Cristian Alejandro Rojas-Quintero. "Heterogeneous Computing to Accelerate the Search of Super K-mers Based on Minimizers". International Journal of Computing, 30.12.2020, 525–32. http://dx.doi.org/10.47839/ijc.19.4.1985.

Abstract:
K-mer processing techniques based on partitioning the data set on disk using minimizer-type seeds have led to a significant reduction in memory requirements; however, they add processes (searching for and distributing super k-mers) that can be intensive given the large volume of data. This paper presents a massively parallel processing model to enable the efficient use of heterogeneous computing for accelerating the search of super k-mers based on seeds (minimizers or signatures). The model includes three main contributions: a new data structure called CISK, which represents the super k-mers and their minimizers in an indexed and compact way, and two massive parallelization patterns: one for obtaining the canonical m-mers of a set of reads and another for searching for super k-mers based on minimizers. The model was implemented through two OpenCL kernels. The evaluation of the kernels shows favorable results in terms of execution times and memory requirements for constructing heterogeneous solutions with simultaneous execution (workload distribution), which perform co-processing using current super k-mer search methods on the CPU and the methods presented herein on the GPU. The implementation code is available in the repository: https://github.com/BioinfUD/K-mersCL.
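The two parallelization patterns operate on canonical m-mers and minimizers, whose sequential meaning can be stated in a few lines. This is an illustrative sketch of the standard definitions, not the paper's OpenCL kernels:

```python
def canonical(mer):
    """Canonical m-mer: lexicographic minimum of the m-mer and its
    reverse complement, so both DNA strands map to the same seed."""
    rc = mer.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(mer, rc)

def minimizer(kmer, m):
    """The minimizer of a k-mer: its smallest canonical m-mer.
    Consecutive k-mers sharing a minimizer form a super k-mer and are
    routed to the same disk partition."""
    return min(canonical(kmer[i:i + m]) for i in range(len(kmer) - m + 1))
```

Because the minimizer of each k-mer can be computed independently, both steps map naturally onto one GPU work-item per window, which is the parallelism the paper exploits.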
33

Karp, Martin, Estela Suarez, Jan H. Meinke, Måns I. Andersson, Philipp Schlatter, Stefano Markidis and Niclas Jansson. "Experience and analysis of scalable high-fidelity computational fluid dynamics on modular supercomputing architectures". International Journal of High Performance Computing Applications, 28.11.2024. http://dx.doi.org/10.1177/10943420241303163.

Abstract:
The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between a Booster module powered by GPUs and a Cluster module with conventional CPU nodes. We investigate several different flow cases and computer systems based on the Modular Supercomputing Architecture (MSA). We observe that for our simulations, the communication overhead and load-balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered; however, when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model that assesses when running across modules might be beneficial. As MSA is becoming more widespread and efforts to increase system utilization are growing more important, our results give insight into when and how a monolithic application can utilize and spread out to more than one module and obtain a faster time to solution.
34

Zhang, Yajie, Ce Yu, Chao Sun, Jian Xiao, Kun Li, Yifei Mu and Chenzhou Cui. "HLC2: a Highly Efficient Cross-matching Framework for Large Astronomical Catalogues on Heterogeneous Computing Environments". Monthly Notices of the Royal Astronomical Society, 10.01.2023. http://dx.doi.org/10.1093/mnras/stad067.

Abstract:
Cross-matching, which finds corresponding data for the same celestial object or region in multiple catalogues, is indispensable to astronomical data analysis and research. Due to the large number of astronomical catalogues generated by the ongoing and next-generation large-scale sky surveys, the time complexity of cross-matching is increasing dramatically. Heterogeneous computing environments make it theoretically possible to accelerate cross-matching, but the performance advantages of heterogeneous computing resources have not been fully utilized. To meet the challenge of cross-matching a substantially increasing amount of astronomical observation data, this paper proposes HLC2 (Heterogeneous-computing-enabled Large Catalogue Cross-matcher), a high-performance cross-matching framework based on spherical position deviation on CPU-GPU heterogeneous computing platforms. It supports scalable and flexible cross-matching and can be directly applied to the fusion of large astronomical catalogues from survey missions and astronomical data centers. A performance estimation model is proposed to locate performance bottlenecks and guide the optimizations. A two-level partitioning strategy is designed to generate an optimized data placement according to the positions of celestial objects, increasing throughput. To make HLC2 more adaptive, architecture-aware task splitting, thread parallelization, and concurrent scheduling strategies are designed and integrated. Moreover, a novel quad-direction strategy is proposed for the boundary problem to effectively balance performance and completeness. We have experimentally evaluated HLC2 using publicly released catalogue data. Experiments demonstrate that HLC2 scales well on different catalogue sizes and that the cross-matching speed is significantly improved compared to state-of-the-art cross-matchers.