Journal articles on the topic 'High-performance, graph processing, GPU'

Consult the top 50 journal articles for your research on the topic 'High-performance, graph processing, GPU.'


You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Zhou, Chao, and Tao Zhang. "High Performance Graph Data Imputation on Multiple GPUs." Future Internet 13, no. 2 (January 31, 2021): 36. http://dx.doi.org/10.3390/fi13020036.

Full text
Abstract:
In real applications, massive data with graph structures are often incomplete due to various restrictions. Therefore, graph data imputation algorithms have been widely used in the fields of social networks, sensor networks, and MRI to solve the graph data completion problem. To preserve the relationships within the data, the data structure is represented by a graph-tensor, in which each matrix is the vertex value of a weighted graph. The convolutional imputation algorithm has been proposed to solve the low-rank graph-tensor completion problem in which some data matrices are entirely unobserved. However, this data imputation algorithm has limited application scope because it is compute-intensive and performs poorly on CPUs. In this paper, we propose a scheme to run the convolutional imputation algorithm with higher time performance on GPUs (Graphics Processing Units) by exploiting the multi-core CUDA architecture. We propose optimization strategies to achieve coalesced memory access for graph Fourier transform (GFT) computation and to improve the utilization of GPU SM resources for singular value decomposition (SVD) computation. Furthermore, we design a scheme that extends the GPU-optimized implementation to multiple GPUs for large-scale computing. Experimental results show that the GPU implementation is both fast and accurate. On synthetic data of varying sizes, the GPU-optimized implementation running on a single Quadro RTX6000 GPU achieves up to 60.50× speedups over the GPU-baseline implementation. The multi-GPU implementation achieves up to 1.81× speedups on two GPUs versus the GPU-optimized implementation on a single GPU. On the ego-Facebook dataset, the GPU-optimized implementation achieves up to 77.88× speedups over the GPU-baseline implementation. Meanwhile, the GPU implementation and the CPU implementation achieve similar, low recovery errors.
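The abstract does not spell out the coalescing strategy for the GFT step, but the underlying idea, laying out the graph-tensor so that neighbouring threads touch neighbouring addresses, can be sketched in a few lines of CUDA. The following is a minimal illustrative sketch, not the authors' code; the element-major layout, the identity Fourier basis, and all names and sizes are assumptions. The GFT is treated as a dense transform U applied across the vertex dimension, and the inner reduction reads X so that consecutive threads (consecutive element index e) access consecutive memory:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Assumed layout: X[v * E + e] holds element e of the matrix attached to
// vertex v, so threads with consecutive e read consecutive addresses.
// U is the N x N graph Fourier basis (eigenvectors of the graph Laplacian).
__global__ void gftCoalesced(const float* U, const float* X, float* Y,
                             int N, int E) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (e >= E) return;
    for (int f = 0; f < N; ++f) {                   // output "frequency"
        float acc = 0.0f;
        for (int v = 0; v < N; ++v)                 // reduce over vertices
            acc += U[f * N + v] * X[v * E + e];     // coalesced read of X
        Y[f * E + e] = acc;
    }
}

int main() {
    const int N = 8, E = 1024;                      // 8 vertices, 32x32 matrices
    float *U, *X, *Y;
    cudaMallocManaged(&U, N * N * sizeof(float));
    cudaMallocManaged(&X, N * E * sizeof(float));
    cudaMallocManaged(&Y, N * E * sizeof(float));
    for (int i = 0; i < N * N; ++i) U[i] = (i % N == i / N) ? 1.0f : 0.0f;
    for (int i = 0; i < N * E; ++i) X[i] = 1.0f;
    gftCoalesced<<<(E + 255) / 256, 256>>>(U, X, Y, N, E);
    cudaDeviceSynchronize();
    printf("Y[0] = %g\n", Y[0]);                    // identity basis: Y == X
    return 0;
}
```

In practice the same effect is often obtained with a batched GEMM from cuBLAS; the explicit kernel is only meant to make the access pattern visible.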
APA, Harvard, Vancouver, ISO, and other styles
2

Wang, Yangzihao, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. "Gunrock: a high-performance graph processing library on the GPU." ACM SIGPLAN Notices 50, no. 8 (December 18, 2015): 265–66. http://dx.doi.org/10.1145/2858788.2688538.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Choudhury, Dwaipayan, Aravind Sukumaran Rajam, Ananth Kalyanaraman, and Partha Pratim Pande. "High-Performance and Energy-Efficient 3D Manycore GPU Architecture for Accelerating Graph Analytics." ACM Journal on Emerging Technologies in Computing Systems 18, no. 1 (January 31, 2022): 1–19. http://dx.doi.org/10.1145/3482880.

Full text
Abstract:
Recent advances in GPU-based manycore accelerators provide the opportunity to efficiently process large-scale graphs on chip. However, real-world graphs have a diverse range of topology and connectivity patterns (e.g., degree distributions) that make the design of input-agnostic hardware architectures a challenge. Network-on-Chip (NoC)-based architectures provide a way to overcome this challenge, as the architectural topology can be used to approximately model the expected traffic patterns that emerge from graph application workloads. In this paper, we first study the mix of long- and short-range traffic patterns generated on-chip using graph workloads, and subsequently use the findings to adapt the design of an optimal NoC-based architecture. In particular, by leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D manycore GPU architecture outperforms its traditional planar (2D) counterparts in both performance and energy consumption. Moreover, by adopting a joint performance-thermal optimization strategy, we address the thermal concerns in a 3D design without noticeably compromising the achievable performance. The 3D integration technology is also leveraged to incorporate Near Data Processing (NDP) to complement the performance benefits introduced by the SWNoC architecture. As graph applications are inherently memory intensive, off-chip data movement gives rise to latency and energy overheads in the presence of external DRAM. In conventional GPU architectures, as the main memory layer is not integrated with the logic, off-chip data movement negatively impacts overall performance and energy consumption. We demonstrate that NDP significantly reduces the overheads associated with such frequent and irregular memory accesses in graph-based applications. The proposed SWNoC-enabled NDP framework, which integrates 3D memory (like Micron's HMC) with a massive number of GPU cores, achieves 29.5% performance improvement and 30.03% less energy consumption on average compared to a conventional planar mesh-based design with external DRAM.
APA, Harvard, Vancouver, ISO, and other styles
4

Pan, Xiao Hui. "Efficient Graph Component Labeling on Hybrid CPU and GPU Platforms." Applied Mechanics and Materials 596 (July 2014): 276–79. http://dx.doi.org/10.4028/www.scientific.net/amm.596.276.

Full text
Abstract:
Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations of the component labeling problem are possible, and we explore their use with general-purpose graphical processing units (GPGPUs) and with the CUDA GPU programming language. We discuss implementation issues and performance results on CPUs and GPUs using CUDA, and we evaluate our system with real-world graphs. We show how accounting for the different architectural features of the GPU and the host CPUs achieves high performance.
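As a concrete example of one such data-parallel variation, below is a minimal label-propagation kernel for connected components, a textbook-style sketch rather than the paper's implementation; the edge-list representation and all names are assumptions. Every vertex starts in its own component and the smaller label is pulled across each edge until a fixed point is reached:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One sweep of label propagation: each edge (u, v) pulls the smaller label
// across and flags that another sweep is needed.
__global__ void propagate(const int* src, const int* dst, int* label,
                          int numEdges, int* changed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numEdges) return;
    int u = src[i], v = dst[i];
    int lu = label[u], lv = label[v];
    if (lu < lv)      { atomicMin(&label[v], lu); *changed = 1; }
    else if (lv < lu) { atomicMin(&label[u], lv); *changed = 1; }
}

int main() {
    // Tiny undirected graph: edges 0-1, 1-2, 3-4 -> components {0,1,2}, {3,4}.
    const int V = 5, E = 3;
    int hSrc[E] = {0, 1, 3}, hDst[E] = {1, 2, 4};
    int *src, *dst, *label, *changed;
    cudaMallocManaged(&src, E * sizeof(int));
    cudaMallocManaged(&dst, E * sizeof(int));
    cudaMallocManaged(&label, V * sizeof(int));
    cudaMallocManaged(&changed, sizeof(int));
    for (int i = 0; i < E; ++i) { src[i] = hSrc[i]; dst[i] = hDst[i]; }
    for (int v = 0; v < V; ++v) label[v] = v;   // each vertex starts alone
    do {
        *changed = 0;
        propagate<<<1, 256>>>(src, dst, label, E, changed);
        cudaDeviceSynchronize();
    } while (*changed);
    for (int v = 0; v < V; ++v)
        printf("vertex %d -> component %d\n", v, label[v]);
    return 0;
}
```

Convergence can take many sweeps on high-diameter graphs, which is why GPU implementations in this literature often combine label propagation with pointer jumping or hooking.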
APA, Harvard, Vancouver, ISO, and other styles
5

Lü, Yashuai, Hui Guo, Libo Huang, Qi Yu, Li Shen, Nong Xiao, and Zhiying Wang. "GraphPEG." ACM Transactions on Architecture and Code Optimization 18, no. 3 (June 2021): 1–24. http://dx.doi.org/10.1145/3450440.

Full text
Abstract:
Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data-parallel computations, such as graph processing. However, achieving high performance for graph processing with GPUs is non-trivial. Processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware units, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, several issues are beyond the capability of software techniques to address. In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms have a common pattern of graph traversal, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grain work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative, highly efficient GPU graph processing software frameworks, Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8× and 2.5× on average, and up to 7.3× and 7.0×, for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.
APA, Harvard, Vancouver, ISO, and other styles
6

Zhang, Yu, Da Peng, Xiaofei Liao, Hai Jin, Haikun Liu, Lin Gu, and Bingsheng He. "LargeGraph." ACM Transactions on Architecture and Code Optimization 18, no. 4 (December 31, 2021): 1–24. http://dx.doi.org/10.1145/3477603.

Full text
Abstract:
Many out-of-GPU-memory systems have recently been designed to support iterative processing of large-scale graphs. However, these systems still suffer from long convergence times because of inefficient propagation of active vertices' new states along graph paths. To efficiently support out-of-GPU-memory graph processing, this work designs a system called LargeGraph. Different from existing out-of-GPU-memory systems, LargeGraph proposes a dependency-aware data-driven execution approach, which can significantly accelerate active vertices' state propagations along graph paths with low data access cost and high parallelism. Specifically, according to the dependencies between the vertices, it only loads and processes the graph data associated with dependency chains originating from active vertices, for smaller access cost. Because of the power-law property, most active vertices frequently use a small evolving set of paths for their new states' propagation; this small set of paths is dynamically identified, maintained, and efficiently handled on the GPU to accelerate most propagations for faster convergence, whereas the remaining graph data are handled on the CPU. For out-of-GPU-memory graph processing, LargeGraph outperforms four cutting-edge systems: Totem (5.19–11.62×), Graphie (3.02–9.41×), Garaph (2.75–8.36×), and Subway (2.45–4.15×).
APA, Harvard, Vancouver, ISO, and other styles
7

SOMAN, JYOTHISH, KISHORE KOTHAPALLI, and P. J. NARAYANAN. "SOME GPU ALGORITHMS FOR GRAPH CONNECTED COMPONENTS AND SPANNING TREE." Parallel Processing Letters 20, no. 04 (December 2010): 325–39. http://dx.doi.org/10.1142/s0129626410000272.

Full text
Abstract:
Graphics Processing Units (GPUs) are application-specific accelerators that provide a high performance-to-cost ratio and are widely available and used, which makes them ubiquitous accelerators. The computing paradigm based on them is the general-purpose computing on the GPU (GPGPU) model. Due to its graphics lineage, the GPU is better suited to data-parallel, data-regular algorithms; its hardware architecture is less suitable for data-parallel but data-irregular algorithms such as graph connected components and list ranking. In this paper, we present results that show how to use GPUs efficiently for graph algorithms which are known to have irregular data access patterns. We consider two fundamental graph problems: finding the connected components and finding a spanning tree, both of which find applications in several graph-theoretical problems. We arrive at efficient GPU implementations for these two problems, with algorithms that focus on minimising irregularity at both the algorithmic and implementation levels. Our implementation achieves a speedup of 11-16 times over a corresponding best sequential implementation.
APA, Harvard, Vancouver, ISO, and other styles
8

Seliverstov, E. Yu. "Structural Mapping of Global Optimization Algorithms to Graphics Processing Unit Architecture." Herald of the Bauman Moscow State Technical University. Series Instrument Engineering, no. 2 (139) (June 2022): 42–59. http://dx.doi.org/10.18698/0236-3933-2022-2-42-59.

Full text
Abstract:
Graphics processing units (GPUs) deliver high execution efficiency for modern metaheuristic algorithms with high computational complexity. It is crucial to have an optimal mapping of the optimization algorithm to the parallel system architecture, as this strongly affects the efficiency of the optimization process. The paper proposes a novel algorithm for mapping a parallel metaheuristic algorithm to the GPU architecture, states the problem of mapping the algorithm graph model to the GPU model, and gives a formal definition of graph mapping and mapping restrictions. The algorithm graph model is a hierarchical graph model consisting of an island parallel model and a metaheuristic optimization algorithm model. A set of feasible mappings under the mapping restrictions makes it possible to formalize the GPU architecture and the features of the parallel model. The structural mapping algorithm is based on cooperatively solving the optimization problem and the discrete optimization problem of the structural model mapping. The study outlines parallel efficiency criteria which can be evaluated both experimentally and analytically to predict a model's efficiency. The experimental section introduces a parallel optimization algorithm based on the proposed structural mapping algorithm. Experimental results comparing the parallel efficiency of the parallel and sequential algorithms are presented and discussed.
APA, Harvard, Vancouver, ISO, and other styles
9

Toledo, Leonel, Pedro Valero-Lara, Jeffrey S. Vetter, and Antonio J. Peña. "Towards Enhancing Coding Productivity for GPU Programming Using Static Graphs." Electronics 11, no. 9 (April 20, 2022): 1307. http://dx.doi.org/10.3390/electronics11091307.

Full text
Abstract:
The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability, as well as limitations on the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation, which can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, again achieving accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference, optimized CUDA code. Our main target is to achieve a higher coding productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, better exploitation of the GPU capacity. The combination of Static Graphs with two of the currently most important GPU programming models (CUDA and OpenACC) considerably reduces the execution time with respect to using CUDA and OpenACC alone, achieving speedups that exceed one order of magnitude in the best cases. Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC specifications.
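The CUDA Graph mechanism that the authors build on is easiest to see through the stream-capture API. The sketch below is illustrative only (the kernel, sizes, and iteration counts are assumptions): a short sequence of launches is recorded once into a graph and then replayed with a single cudaGraphLaunch per iteration, which is how static graphs amortize per-kernel launch overhead:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record a fixed sequence of kernel launches into a graph once...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(x, 1.001f, n);
    cudaStreamEndCapture(stream, &graph);
    // CUDA 10/11 signature; on CUDA 12+ use cudaGraphInstantiate(&exec, graph, 0).
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it many times with one launch call per iteration,
    // paying the launch overhead once per graph instead of once per kernel.
    for (int it = 0; it < 100; ++it)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    printf("done\n");

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(x);
    return 0;
}
```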
APA, Harvard, Vancouver, ISO, and other styles
10

Quer, Stefano, and Andrea Calabrese. "Graph Reachability on Parallel Many-Core Architectures." Computation 8, no. 4 (December 2, 2020): 103. http://dx.doi.org/10.3390/computation8040103.

Full text
Abstract:
Many modern applications are modeled using graphs of some kind. Given a graph, reachability, that is, discovering whether there is a path between two given nodes, is a fundamental problem as well as one of the most important steps of many other algorithms. The rapid accumulation of very large graphs (up to tens of millions of vertices and edges) from a diversity of disciplines demands efficient and scalable solutions to the reachability problem. General-purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize algorithms that present a high degree of regularity. In this paper, we extend the applicability of GPU processing to graph-based manipulation by re-designing a simple but efficient state-of-the-art graph-labeling method, namely the GRAIL (Graph Reachability Indexing via RAndomized Interval) algorithm, for many-core CUDA-based GPUs. This algorithm first generates a label for each vertex of the graph, then exploits these labels to answer reachability queries. Unfortunately, the original algorithm executes a sequence of depth-first visits which are intrinsically recursive and cannot be efficiently implemented on parallel systems. For that reason, we design an alternative approach in which a sequence of breadth-first visits substitutes for the original depth-first traversal to generate the labeling, and in which a high number of concurrent visits is exploited during query evaluation. The paper describes our strategy to re-design these steps, the difficulties we encountered in implementing them, and the solutions adopted to overcome the main inefficiencies. To prove the validity of our approach, we compare (in terms of time and memory requirements) our GPU-based approach with the original sequential CPU-based tool. Finally, we report some hints on how to conduct further research in the area.
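The query side of a GRAIL-style index rests on a cheap pruning test: u can reach v only if v's interval is nested inside u's. The batch filter below is a hedged CUDA sketch under assumed array names (lo/hi hold each vertex's interval from one randomized traversal); queries that pass the filter still require the guided search, which is omitted here:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Necessary condition: u can reach v only if [lo[v], hi[v]] is nested in
// [lo[u], hi[u]]. A failed test proves unreachability; a passed test only
// means "maybe" and is resolved by a guided search (not shown).
__global__ void filterQueries(const int* lo, const int* hi, const int2* q,
                              char* maybe, int numQueries) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numQueries) return;
    int u = q[i].x, v = q[i].y;
    maybe[i] = (lo[u] <= lo[v] && hi[v] <= hi[u]);
}

int main() {
    // Intervals for the chain 0 -> 1 -> 2 from one post-order traversal:
    // hi is the vertex's post-order rank, lo the minimum rank in its subtree.
    int hLo[3] = {1, 1, 1}, hHi[3] = {3, 2, 1};
    int *lo, *hi; int2* q; char* maybe;
    cudaMallocManaged(&lo, sizeof(hLo));
    cudaMallocManaged(&hi, sizeof(hHi));
    cudaMallocManaged(&q, 2 * sizeof(int2));
    cudaMallocManaged(&maybe, 2);
    memcpy(lo, hLo, sizeof(hLo));
    memcpy(hi, hHi, sizeof(hHi));
    q[0] = make_int2(0, 2);   // truly reachable: passes the filter
    q[1] = make_int2(2, 0);   // unreachable: pruned immediately
    filterQueries<<<1, 32>>>(lo, hi, q, maybe, 2);
    cudaDeviceSynchronize();
    printf("0->2 maybe=%d, 2->0 maybe=%d\n", maybe[0], maybe[1]);
    return 0;
}
```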
APA, Harvard, Vancouver, ISO, and other styles
11

Trefftz, Christian, Hugh McGuire, Zachary Kurmas, and Jerry Scripps. "Exhaustive Community Enumeration in Parallel." Parallel Processing Letters 26, no. 02 (June 2016): 1650006. http://dx.doi.org/10.1142/s0129626416500067.

Full text
Abstract:
An algorithm to evaluate/count all the possible communities of a graph is presented, along with an associated unrank function. An implementation of an existing algorithm to evaluate all the possible partitions of a graph, based on an unrank function, is presented as well. Performance results of the parallelizations of these algorithms obtained on a shared memory machine, a cluster of workstations, and a Graphics Processing Unit (GPU) are included.
APA, Harvard, Vancouver, ISO, and other styles
12

Malek, Maximilian, and Christoph W. Sensen. "Instant Feedback Rapid Prototyping for GPU-Accelerated Computation, Manipulation, and Visualization of Multidimensional Data." International Journal of Biomedical Imaging 2018 (June 3, 2018): 1–9. http://dx.doi.org/10.1155/2018/2046269.

Full text
Abstract:
Objective. We have created an open-source application and framework for rapid GPU-accelerated prototyping, targeting image analysis, including volumetric images such as CT or MRI data. Methods. A visual graph editor enables the design of processing pipelines without programming. Run-time compiled compute shaders enable prototyping of complex operations in a matter of minutes. Results. GPU acceleration increases the processing speed by at least an order of magnitude compared to traditional multithreaded CPU-based implementations, while offering the flexibility of scripted implementations. Conclusion. Our framework enables real-time, intuition-guided, accelerated algorithm and method development, supported by built-in scriptable visualization. Significance. This is, to our knowledge, the first tool for medical data analysis that provides both high performance and rapid prototyping. As such, it has the potential to act as a force multiplier for further research, enabling the handling of high-resolution datasets while providing quasi-instant feedback and visualization of results.
APA, Harvard, Vancouver, ISO, and other styles
13

Beamer, Scott, Krste Asanović, and David Patterson. "Direction-Optimizing Breadth-First Search." Scientific Programming 21, no. 3-4 (2013): 137–48. http://dx.doi.org/10.1155/2013/702694.

Full text
Abstract:
Breadth-First Search is an important kernel used by many graph-processing applications. In many of the emerging applications of BFS, such as analyzing social networks, the input graphs are low-diameter and scale-free. We propose a hybrid approach that is advantageous for low-diameter graphs, which combines a conventional top-down algorithm with a novel bottom-up algorithm. The bottom-up algorithm can dramatically reduce the number of edges examined, which in turn accelerates the search as a whole. On a multi-socket server, our hybrid approach demonstrates speedups of 3.3–7.8× on a range of standard synthetic graphs and speedups of 2.4–4.6× on graphs from real social networks when compared to a strong baseline. We also typically double the performance of prior leading shared-memory (multicore and GPU) implementations.
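The hybrid can be sketched end to end in CUDA. The code below is a simplified illustration rather than the authors' implementation: both directions use an implicit frontier (the depth array), and a crude stand-in for the heuristic, comparing the frontier's outgoing edges against the total edge count scaled by a parameter alpha, picks the direction at each level (the paper's actual policy tracks unexplored edges and switches back on small frontiers):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Top-down step over an implicit frontier (vertices at depth level-1):
// frontier vertices claim unvisited neighbours.
__global__ void topDownStep(const int* rowPtr, const int* col, int* depth,
                            int n, int level, int* nextVerts, int* nextEdges) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n || depth[u] != level - 1) return;
    for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
        int v = col[e];
        if (atomicCAS(&depth[v], -1, level) == -1) {
            atomicAdd(nextVerts, 1);
            atomicAdd(nextEdges, rowPtr[v + 1] - rowPtr[v]);
        }
    }
}

// Bottom-up step: unvisited vertices look for any parent in the frontier.
__global__ void bottomUpStep(const int* rowPtr, const int* col, int* depth,
                             int n, int level, int* nextVerts, int* nextEdges) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || depth[v] != -1) return;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        if (depth[col[e]] == level - 1) {
            depth[v] = level;
            atomicAdd(nextVerts, 1);
            atomicAdd(nextEdges, rowPtr[v + 1] - rowPtr[v]);
            break;
        }
    }
}

int main() {
    // Undirected 6-cycle in CSR form (each edge stored in both directions).
    const int n = 6, m = 12, alpha = 14;   // alpha = 14 as tuned in the paper
    int hRow[n + 1] = {0, 2, 4, 6, 8, 10, 12};
    int hCol[m] = {1, 5, 0, 2, 1, 3, 2, 4, 3, 5, 4, 0};
    int *rowPtr, *col, *depth, *nextVerts, *nextEdges;
    cudaMallocManaged(&rowPtr, sizeof(hRow));
    cudaMallocManaged(&col, sizeof(hCol));
    cudaMallocManaged(&depth, n * sizeof(int));
    cudaMallocManaged(&nextVerts, sizeof(int));
    cudaMallocManaged(&nextEdges, sizeof(int));
    memcpy(rowPtr, hRow, sizeof(hRow));
    memcpy(col, hCol, sizeof(hCol));
    for (int v = 0; v < n; ++v) depth[v] = -1;
    depth[0] = 0;                                    // source vertex
    int frontierEdges = hRow[1] - hRow[0];
    for (int level = 1;; ++level) {
        *nextVerts = 0; *nextEdges = 0;
        if (frontierEdges * alpha > m)               // edge-heavy: go bottom-up
            bottomUpStep<<<1, 256>>>(rowPtr, col, depth, n, level,
                                     nextVerts, nextEdges);
        else
            topDownStep<<<1, 256>>>(rowPtr, col, depth, n, level,
                                    nextVerts, nextEdges);
        cudaDeviceSynchronize();
        if (*nextVerts == 0) break;
        frontierEdges = *nextEdges;
    }
    for (int v = 0; v < n; ++v) printf("depth[%d] = %d\n", v, depth[v]);
    return 0;
}
```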
APA, Harvard, Vancouver, ISO, and other styles
14

Nozdrzykowski, Łukasz, and Magdalena Nozdrzykowska. "Models for estimating the time of program loop execution in parallel on a CPU and with the use of OpenCL computation on a GPU." AUTOBUSY – Technika, Eksploatacja, Systemy Transportowe 19, no. 12 (December 31, 2018): 802–7. http://dx.doi.org/10.24136/atest.2018.501.

Full text
Abstract:
The authors present models for estimating the execution time of program loops compliant with the FAN model, having either no data dependencies or data dependencies only within the body of the programming loop, which can be executed either by CPUs or by the stream multiprocessors referred to as GPU cores. The presented models make it possible to determine whether it would be more efficient to execute a computation in the existing environment using the CPU (Central Processing Unit) or a state-of-the-art graphics card with a high-performance GPU (Graphics Processing Unit) and the super-fast memory often implemented in modern graphics cards. Validity checks confirming the developed time estimation model for the GPU are presented. The purpose of these models is to provide methods for accelerating the performance of applications performing various tasks, including transport tasks such as accelerated solution searching, path searching in graphs, or accelerating image processing algorithms in the vision systems of autonomous and semi-autonomous vehicles; the models allow an automatic task distribution system to be built between the CPU and the GPU under varying computing resources.
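The abstract does not give the models' exact form, but the flavour of such an estimate can be conveyed with a deliberately crude sketch in which every formula and constant is an assumption for illustration, not the authors' model: CPU time divides the loop's serial cost across cores, while GPU time adds PCIe transfer and kernel launch overhead to the device-side loop cost.

```cuda
#include <cstdio>

// Host-only illustrative estimates (compiles with nvcc or any C++ compiler).
// These formulas are placeholders, not the models from the paper.
double cpuTime(double n, double tIter, int cores) {
    return n * tIter / cores;                 // perfectly parallel loop body
}
double gpuTime(double n, double tIterGpu, double bytes,
               double bandwidth, double tLaunch) {
    return 2.0 * bytes / bandwidth            // host->device and back
         + tLaunch                            // kernel launch overhead
         + n * tIterGpu;                      // device-side loop cost
}

int main() {
    double n = 1e8, bytes = 4e8;              // 1e8 iterations over 400 MB
    double tCpu = cpuTime(n, 2e-9, 8);        // assume 2 ns/iter on 8 cores
    double tGpu = gpuTime(n, 2e-11, bytes, 12e9, 1e-5);  // PCIe ~12 GB/s
    printf("CPU: %.4f s, GPU: %.4f s -> run on %s\n",
           tCpu, tGpu, tGpu < tCpu ? "GPU" : "CPU");
    return 0;
}
```

Even this toy version captures the decision such models automate: with the parameters above, the data transfer alone outweighs the GPU's faster loop body, so the estimator keeps the loop on the CPU.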
APA, Harvard, Vancouver, ISO, and other styles
15

Yoon, Daegun, and Sangyoon Oh. "SURF: Direction-Optimizing Breadth-First Search Using Workload State on GPUs." Sensors 22, no. 13 (June 29, 2022): 4899. http://dx.doi.org/10.3390/s22134899.

Full text
Abstract:
Graph data structures have been used in a wide range of applications, including scientific and social network applications. Engineers and scientists analyze graph data to discover knowledge and insights by using various graph algorithms. A breadth-first search (BFS) is one of the fundamental building blocks of complex graph algorithms, and its implementation is included in graph libraries for large-scale graph processing. In this paper, we propose a novel direction selection method, SURF (Selecting directions Upon Recent workload of Frontiers), to enhance the performance of BFS on GPUs. A direction optimization that selects the proper traversal direction of a BFS execution between the push and pull phases is crucial to the performance as well as to efficient handling of the varying workloads of the frontiers. However, existing works select the direction using condition statements based on predefined thresholds without considering the changing workload state. To address this drawback, we define several metrics that describe the state of the workload and analyze their impact on BFS performance. To show that SURF selects the appropriate direction, we implement the direction selection method with a deep neural network model that adopts these metrics as its input features. Experimental results indicate that SURF achieves higher direction prediction accuracy and reduced execution time in comparison with existing state-of-the-art methods that support a direction-optimizing BFS. SURF yields up to a 5.62× and 3.15× speedup over the state-of-the-art graph processing frameworks Gunrock and Enterprise, respectively.
APA, Harvard, Vancouver, ISO, and other styles
16

Lai-Dang, Quoc-Vinh, Sarvar Hussain Nengroo, and Hojun Jin. "Learning Dense Features for Point Cloud Registration Using a Graph Attention Network." Applied Sciences 12, no. 14 (July 12, 2022): 7023. http://dx.doi.org/10.3390/app12147023.

Full text
Abstract:
Point cloud registration is a fundamental task in many applications such as localization, mapping, tracking, and reconstruction. Successful registration relies on extracting robust and discriminative geometric features. Although existing learning-based methods require high computing capacity to process a large number of raw points at the same time, this computational limitation is no longer an issue thanks to powerful GPU-based parallel computing. In this paper, we introduce a framework that efficiently and economically extracts dense features using a graph attention network for point cloud matching and registration (DFGAT). The detector of the DFGAT is responsible for finding highly reliable key points in large raw data sets. The descriptor of the DFGAT takes these keypoints combined with their neighbors to extract invariant density features in preparation for the matching. The graph attention network (GAT) uses an attention mechanism that enriches the relationships between point clouds. Finally, we treat this as an optimal transport problem and use the Sinkhorn algorithm to find positive and negative matches. We perform thorough tests on the KITTI dataset and evaluate the effectiveness of this approach. The results show that this method, with its efficiently compact keypoint selection and description, can achieve the best matching metrics and reach the highest registration success ratio of 99.88% in comparison with other state-of-the-art approaches.
APA, Harvard, Vancouver, ISO, and other styles
17

Sancho, Jaime, Pallab Sutradhar, Gonzalo Rosa, Miguel Chavarrías, Angel Perez-Nuñez, Rubén Salvador, Alfonso Lagares, Eduardo Juárez, and César Sanz. "GoRG: Towards a GPU-Accelerated Multiview Hyperspectral Depth Estimation Tool for Medical Applications." Sensors 21, no. 12 (June 14, 2021): 4091. http://dx.doi.org/10.3390/s21124091.

Full text
Abstract:
HyperSpectral (HS) images have been successfully used for brain tumor boundary detection during resection operations. Nowadays, these classification maps coexist with other technologies such as MRI or IOUS that improve a neurosurgeon's action, with their incorporation being a neurosurgeon's task. The project in which this work is framed generates a unified and more accurate 3D immersive model using HS, MRI, and IOUS information. To do so, the HS images need to include 3D information, and this information needs to be generated under real-time operating-room conditions, i.e., within a few seconds. This work presents Graph cuts Reference depth estimation in GPU (GoRG), a GPU-accelerated multiview depth estimation tool for HS images that is also able to process YUV images in less than 5.5 s on average. Compared to a high-quality state-of-the-art algorithm, MPEG DERS, GoRG YUV obtains quality losses of −0.93 dB, −0.6 dB, and −1.96% for WS-PSNR, IV-PSNR, and VMAF, respectively, using a video synthesis processing chain. For HS test images, GoRG obtains an average RMSE of 7.5 cm, with most of its errors in the background, needing around 850 ms to process one frame and view. These results demonstrate the feasibility of using GoRG during a tumor resection operation.
APA, Harvard, Vancouver, ISO, and other styles
18

Zhu, Qilin, Hongmin Deng, and Kaixuan Wang. "Skeleton Action Recognition Based on Temporal Gated Unit and Adaptive Graph Convolution." Electronics 11, no. 18 (September 19, 2022): 2973. http://dx.doi.org/10.3390/electronics11182973.

Full text
Abstract:
In recent years, great progress has been made in the recognition of skeletal behaviors based on graph convolutional networks (GCNs). In most existing methods, however, a fixed adjacency matrix and fixed graph structure are used for skeleton data feature extraction in the spatial dimension, which usually leads to weak spatial modeling ability, unsatisfactory generalization performance, and an excessive number of model parameters. Most of these methods follow the ST-GCN approach in the temporal dimension, which inevitably leads to a number of non-key frames, increasing the cost of feature extraction and making the model slower in terms of feature extraction and the required computational burden. In this paper, a gated temporally and spatially adaptive graph convolutional network is proposed. On the one hand, a learnable parameter matrix, which can adaptively learn the key information of the skeleton data in the spatial dimension, is added to the graph convolution layer, improving the feature extraction and generalizability of the model and reducing the number of parameters. On the other hand, a gated unit is added to the temporal feature extraction module to alleviate interference from non-critical frames and reduce computational complexity. A channel attention mechanism based on an SE module and a frame attention mechanism are used to enhance the model's feature extraction ability. To prevent model degradation and ensure more stable training, residual links are added to each feature extraction module. The proposed approach ultimately achieves 0.63% higher accuracy on the X-Sub benchmark with 4.46 M fewer parameters than GAT, one of the best state-of-the-art methods. The inference speed of our model reaches 86.23 sequences per second per GPU. Extensive experimental results further validate the effectiveness of our proposed approach on three large-scale datasets, namely, NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton.
APA, Harvard, Vancouver, ISO, and other styles
19

Lu, Shengliang, Bingsheng He, Yuchen Li, and Hao Fu. "Accelerating exact constrained shortest paths on GPUs." Proceedings of the VLDB Endowment 14, no. 4 (December 2020): 547–59. http://dx.doi.org/10.14778/3436905.3436914.

Full text
Abstract:
Recently emerging applications such as software-defined networks and autonomous vehicles require efficient and exact solutions for constrained shortest paths (CSP), which finds the shortest path in a graph while satisfying some user-defined constraints. Compared with common shortest path problems without constraints, CSP queries have a significantly larger number of subproblems. The most widely used labeling algorithm becomes prohibitively slow and impractical. Other existing approaches tend to find approximate solutions and build costly indices on graphs for fast query processing, which are not suitable for emerging applications that require exact solutions. A natural question is whether and how we can efficiently find the exact solution for CSP. In this paper, we propose Vine, a framework that parallelizes the labeling algorithm to efficiently find the exact CSP solution using GPUs. The major challenge addressed in Vine is how to deal with a large number of subproblems that are mostly unpromising but require a significant amount of memory and computational resources. Our solution is twofold. First, we develop a two-level pruning approach to eliminate the subproblems by making good use of the GPU's hierarchical memory. Second, we propose an adaptive parallelism control model based on the observation that the degree of parallelism (DOP) is the key to performance optimization with the given amount of computational resources. Extensive experiments show that Vine achieves 18× speedup on average over the widely adopted CPU-based solution running on 40 CPU threads. Vine also has over 5× speedup compared with a GPU approach that statically controls the DOP. Compared to the state-of-the-art approximate solution with preprocessed indices, Vine provides exact results with competitive or even better performance.
APA, Harvard, Vancouver, ISO, and other styles
20

WEI, ZHENG, and JOSEPH JAJA. "OPTIMIZATION OF LINKED LIST PREFIX COMPUTATIONS ON MULTITHREADED GPUS USING CUDA." Parallel Processing Letters 22, no. 04 (December 2012): 1250012. http://dx.doi.org/10.1142/s0129626412500120.

Full text
Abstract:
We present a number of optimization techniques to compute prefix sums on linked lists and implement them on the multithreaded GPUs Tesla C1060, Tesla C2050, and GTX 480 using CUDA. Prefix computations on linked structures in general involve highly irregular fine-grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high-bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data-parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations based on a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GT200 and Fermi series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060 and GTX 480.
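A useful reference point for the irregular accesses described here is Wyllie's pointer jumping, which ranks an n-node list in O(log n) rounds. The CUDA sketch below is a textbook formulation, not the paper's randomized algorithm; next and rank are double-buffered so each round reads only the previous round's values:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One round of Wyllie's pointer jumping: rank accumulates the distance to
// the tail while next skips ahead, halving the remaining list each round.
__global__ void jump(const int* next, const int* rank,
                     int* nextOut, int* rankOut, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int nx = next[i];
    if (nx != -1) {
        rankOut[i] = rank[i] + rank[nx];
        nextOut[i] = next[nx];
    } else {
        rankOut[i] = rank[i];
        nextOut[i] = -1;
    }
}

int main() {
    const int n = 8;
    int *next[2], *rank[2];
    for (int b = 0; b < 2; ++b) {
        cudaMallocManaged(&next[b], n * sizeof(int));
        cudaMallocManaged(&rank[b], n * sizeof(int));
    }
    // List 0 -> 1 -> ... -> 7; rank = hops to the tail (tail starts at 0).
    for (int i = 0; i < n; ++i) {
        next[0][i] = (i + 1 < n) ? i + 1 : -1;
        rank[0][i] = (i + 1 < n) ? 1 : 0;
    }
    int cur = 0;
    for (int round = 0; (1 << round) < n; ++round, cur ^= 1)
        jump<<<1, 256>>>(next[cur], rank[cur], next[cur ^ 1], rank[cur ^ 1], n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("rank[%d] = %d\n", i, rank[cur][i]);
    return 0;
}
```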
APA, Harvard, Vancouver, ISO, and other styles
21

Yang, Haoduo, Huayou Su, Qiang Lan, Mei Wen, and Chunyuan Zhang. "HPGraph: High-Performance Graph Analytics with Productivity on the GPU." Scientific Programming 2018 (December 11, 2018): 1–11. http://dx.doi.org/10.1155/2018/9340697.

Full text
Abstract:
The growing use of graphs in many fields has sparked broad interest in developing high-level graph analytics programs. Existing GPU implementations have limited performance while compromising on productivity. HPGraph, our high-performance bulk-synchronous graph analytics framework based on the GPU, provides an abstraction focused on mapping vertex programs to generalized sparse matrix operations on the GPU as the backend. HPGraph strikes a balance between performance and productivity by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that lets users implement various graph algorithms with relatively little effort. We evaluate the performance of HPGraph for four graph primitives (BFS, SSSP, PageRank, and TC). Our experiments show that HPGraph matches or even exceeds the performance of high-performance GPU graph libraries such as MapGraph, nvGraph, and Gunrock. HPGraph also runs significantly faster than advanced CPU graph libraries.
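The backend abstraction described here, vertex programs as generalized sparse matrix operations, bottoms out in kernels like the scalar CSR SpMV below. This is a minimal generic sketch rather than HPGraph's tuned primitive; one PageRank iteration or one BFS level can be phrased as such a product over the graph's adjacency structure:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Scalar CSR sparse matrix-vector multiply: one thread per row.
__global__ void spmvCsr(const int* rowPtr, const int* col, const float* val,
                        const float* x, float* y, int numRows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRows) return;
    float acc = 0.0f;
    for (int e = rowPtr[r]; e < rowPtr[r + 1]; ++e)
        acc += val[e] * x[col[e]];
    y[r] = acc;
}

int main() {
    // 3x3 matrix {{1,0,2},{0,3,0},{4,0,5}} in CSR form, multiplied by ones.
    int hRow[4] = {0, 2, 3, 5}, hCol[5] = {0, 2, 1, 0, 2};
    float hVal[5] = {1, 2, 3, 4, 5}, hX[3] = {1, 1, 1};
    int *rowPtr, *col; float *val, *x, *y;
    cudaMallocManaged(&rowPtr, sizeof(hRow));
    cudaMallocManaged(&col, sizeof(hCol));
    cudaMallocManaged(&val, sizeof(hVal));
    cudaMallocManaged(&x, sizeof(hX));
    cudaMallocManaged(&y, 3 * sizeof(float));
    memcpy(rowPtr, hRow, sizeof(hRow)); memcpy(col, hCol, sizeof(hCol));
    memcpy(val, hVal, sizeof(hVal));    memcpy(x, hX, sizeof(hX));
    spmvCsr<<<1, 32>>>(rowPtr, col, val, x, y, 3);
    cudaDeviceSynchronize();
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  // expect [3, 3, 9]
    return 0;
}
```

A one-thread-per-row kernel load-imbalances badly on power-law degree distributions, which is exactly the problem frameworks in this space attack with load-balanced or merge-based SpMV variants.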
APA, Harvard, Vancouver, ISO, and other styles
22

Merrill, Duane, Michael Garland, and Andrew Grimshaw. "High-Performance and Scalable GPU Graph Traversal." ACM Transactions on Parallel Computing 1, no. 2 (February 18, 2015): 1–30. http://dx.doi.org/10.1145/2717511.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Zhang, Tao, Wang Kan, and Xiao-Yang Liu. "High performance GPU primitives for graph-tensor learning operations." Journal of Parallel and Distributed Computing 148 (February 2021): 125–37. http://dx.doi.org/10.1016/j.jpdc.2020.10.011.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Morishima, Shin, and Hiroki Matsutani. "High-Performance with an In-GPU Graph Database Cache." IT Professional 19, no. 6 (November 2017): 58–64. http://dx.doi.org/10.1109/mitp.2017.4241461.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Abdellah, Marwan, Ayman Eldeib, and Amr Sharawi. "High Performance GPU-Based Fourier Volume Rendering." International Journal of Biomedical Imaging 2015 (2015): 1–13. http://dx.doi.org/10.1155/2015/590727.

Full text
Abstract:
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are O(N^3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive, competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117× compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
APA, Harvard, Vancouver, ISO, and other styles
26

Yang, Carl, Aydın Buluç, and John D. Owens. "GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU." ACM Transactions on Mathematical Software 48, no. 1 (March 31, 2022): 1–51. http://dx.doi.org/10.1145/3466795.

Full text
Abstract:
High-performance graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers; we describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in GraphBLAST, the first high-performance, open-source, linear-algebra-based graph framework on NVIDIA GPUs. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
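"Exploiting output sparsity" has a direct kernel-level reading: the backend skips rows whose output the algorithm has declared uninteresting. The masked CSR SpMV below is an illustrative sketch with assumed names and layout, not GraphBLAST's actual kernel, but it shows why the work then scales with the mask rather than with the matrix:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Masked CSR SpMV: rows with mask[r] == 0 are skipped entirely, so the work
// tracks the output's sparsity rather than the matrix size.
__global__ void spmvMasked(const int* rowPtr, const int* col, const float* val,
                           const float* x, const char* mask, float* y,
                           int numRows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRows || !mask[r]) return;
    float acc = 0.0f;
    for (int e = rowPtr[r]; e < rowPtr[r + 1]; ++e)
        acc += val[e] * x[col[e]];
    y[r] = acc;
}

int main() {
    // 3x3 matrix {{1,0,2},{0,3,0},{4,0,5}}; only rows 0 and 2 are wanted.
    int hRow[4] = {0, 2, 3, 5}, hCol[5] = {0, 2, 1, 0, 2};
    float hVal[5] = {1, 2, 3, 4, 5}, hX[3] = {1, 1, 1};
    char hMask[3] = {1, 0, 1};
    int *rowPtr, *col; float *val, *x, *y; char* mask;
    cudaMallocManaged(&rowPtr, sizeof(hRow));
    cudaMallocManaged(&col, sizeof(hCol));
    cudaMallocManaged(&val, sizeof(hVal));
    cudaMallocManaged(&x, sizeof(hX));
    cudaMallocManaged(&y, 3 * sizeof(float));
    cudaMallocManaged(&mask, sizeof(hMask));
    memcpy(rowPtr, hRow, sizeof(hRow)); memcpy(col, hCol, sizeof(hCol));
    memcpy(val, hVal, sizeof(hVal));    memcpy(x, hX, sizeof(hX));
    memcpy(mask, hMask, sizeof(hMask)); memset(y, 0, 3 * sizeof(float));
    spmvMasked<<<1, 32>>>(rowPtr, col, val, x, mask, y, 3);
    cudaDeviceSynchronize();
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  // expect [3, 0, 9]
    return 0;
}
```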
APA, Harvard, Vancouver, ISO, and other styles
27

Khan, Fazeel Mahmood, Peter Berczik, and Andreas Just. "Gravitational wave driven mergers and coalescence time of supermassive black holes." Astronomy & Astrophysics 615 (July 2018): A71. http://dx.doi.org/10.1051/0004-6361/201730489.

Full text
Abstract:
Aims. The evolution of supermassive black holes (SMBHs) initially embedded in the centres of merging galaxies realised with a stellar mass function (SMF) is studied from the onset of galaxy mergers until coalescence. Coalescence times of SMBH binaries are of great importance for black hole evolution and gravitational wave detection studies. Methods. We performed direct N-body simulations using the highly efficient and massively parallel phi-GRAPE+GPU code, capable of running on high-performance computer clusters supported by graphics processing units (GPUs). Post-Newtonian terms up to order 3.5 are used to drive the SMBH binary evolution in the relativistic regime. We performed a large set of simulations with three different slopes of the central stellar cusp and different random seeds. The impact of an SMF on the hardening rate and the coalescence time is investigated. Results. We find that SMBH binaries coalesce well within one billion years when our models are scaled to galaxies with a steep cusp at low redshift. Here, higher central densities provide a larger supply of stars to efficiently extract energy from the SMBH binary orbit and shrink it to the phase where gravitational wave (GW) emission becomes dominant, leading to the coalescence of the SMBHs. Mergers of models with shallow cusps that are representative of giant elliptical galaxies having central cores result in less efficient extraction of the binary's orbital energy, due to the lower stellar densities in the centre. However, the high values of eccentricity witnessed for SMBH binaries in such galaxy mergers ensure that the GW-emission-dominated phase sets in earlier, at larger values of the semi-major axis. This helps to compensate for the less efficient energy extraction during the phase dominated by stellar encounters, resulting in mergers of SMBHs in about 1 Gyr after the formation of the binary. Additionally, we witness mass segregation in the merger remnant, resulting in enhanced SMBH binary hardening rates. We show that at least the final phase of the merger in cuspy low-mass galaxies would be observable with the GW detector eLISA.
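The computational core of a direct N-body code in the phi-GRAPE family is the all-pairs force sum. The kernel below is a generic, softened direct-summation sketch in CUDA (G = 1 units, Plummer softening, no post-Newtonian terms), offered as an illustration rather than the phi-GRAPE+GPU source:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Direct-summation gravitational acceleration with Plummer softening eps.
// pos[j] = (x, y, z, mass); each thread accumulates over all bodies.
__global__ void accel(const float4* pos, float3* acc, int n, float eps2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;  // softened distance^2
        float inv = rsqrtf(r2);
        float s = pos[j].w * inv * inv * inv;           // m_j / r^3 (G = 1)
        a.x += s * dx; a.y += s * dy; a.z += s * dz;    // self-term adds zero
    }
    acc[i] = a;
}

int main() {
    const int n = 2;
    float4* pos; float3* acc;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&acc, n * sizeof(float3));
    pos[0] = make_float4(0.0f, 0.0f, 0.0f, 1.0f);  // unit masses,
    pos[1] = make_float4(1.0f, 0.0f, 0.0f, 1.0f);  // unit separation
    accel<<<1, 32>>>(pos, acc, n, 1e-6f);
    cudaDeviceSynchronize();
    printf("a0.x = %f (~+1), a1.x = %f (~-1)\n", acc[0].x, acc[1].x);
    return 0;
}
```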
APA, Harvard, Vancouver, ISO, and other styles
28

Erofeev, K. Yu, E. M. Khramchenkov, and E. V. Biryal’tsev. "High-performance Processing of Covariance Matrices Using GPU Computations." Lobachevskii Journal of Mathematics 40, no. 5 (May 2019): 547–54. http://dx.doi.org/10.1134/s1995080219050068.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Rossi, Stefano, and Enrico Boni. "Embedded GPU Implementation for High-Performance Ultrasound Imaging." Electronics 10, no. 8 (April 8, 2021): 884. http://dx.doi.org/10.3390/electronics10080884.

Full text
Abstract:
Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.
APA, Harvard, Vancouver, ISO, and other styles
30

EMMART, NIALL, and CHARLES WEEMS. "HIGH PRECISION INTEGER ADDITION, SUBTRACTION AND MULTIPLICATION WITH A GRAPHICS PROCESSING UNIT." Parallel Processing Letters 20, no. 04 (December 2010): 293–306. http://dx.doi.org/10.1142/s0129626410000259.

Full text
Abstract:
In this paper we evaluate the potential for using an NVIDIA graphics processing unit (GPU) to accelerate high precision integer multiplication, addition, and subtraction. The reported peak vector performance for a typical GPU appears to offer good potential for accelerating such a computation. Because of limitations in the on-chip memory, the high cost of kernel launches, and the nature of the architecture's support for parallelism, we used a hybrid algorithmic approach to obtain good performance on multiplication. On the GPU itself we adapt the Strassen FFT algorithm to multiply 32KB chunks, while on the CPU we adapt the Karatsuba divide-and-conquer approach to optimize application of the GPU's partial multiplies, which are viewed as "digits" by our implementation of Karatsuba. Even with this approach, the result is at best a factor of three increase in performance, compared with using the GMP package on a 64-bit CPU at a comparable technology node. Our implementations of addition and subtraction achieve up to a factor of eight improvement. We identify the issues that limit performance and discuss the likely impact of planned advances in GPU architecture.
APA, Harvard, Vancouver, ISO, and other styles
31

Gao, Pin, Mingxing Zhang, Kang Chen, Yongwei Wu, and Weimin Zheng. "High Performance Graph Processing with Locality Oriented Design." IEEE Transactions on Computers 66, no. 7 (July 1, 2017): 1261–67. http://dx.doi.org/10.1109/tc.2017.2652465.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Zhang, Weikang, Desheng Wen, Zongxi Song, Xin Wei, Gang Liu, and Zhixin Li. "High Resolution and Fast Processing of Spectral Reconstruction in Fourier Transform Imaging Spectroscopy." Sensors 18, no. 12 (November 27, 2018): 4159. http://dx.doi.org/10.3390/s18124159.

Full text
Abstract:
High-resolution spectrum estimation has continually attracted great attention in spectrum reconstruction based on Fourier transform imaging spectroscopy (FTIS). In this paper, a parallel solution for interference data processing using high-resolution spectrum estimation is proposed to reconstruct the spectrum in a fast, high-resolution way. In batch processing, we use high-performance parallel computing on the graphics processing unit (GPU) for higher efficiency and lower operation time. In addition, a parallel processing mechanism is designed for our parallel algorithm to obtain higher performance. At the same time, other solving algorithms for the modern spectrum estimation model are introduced for discussion and comparison. We compare traditional high-resolution solving algorithms running on the central processing unit (CPU) with the parallel algorithm on the GPU for processing the interferogram. The experimental results illustrate that runtime is reduced by about 70% using our parallel solution, and that the GPU has a great advantage in processing large data volumes and accelerating applications.
APA, Harvard, Vancouver, ISO, and other styles
33

LI, PING, HANQIU SUN, JIANBING SHEN, and CHEN HUANG. "HDR IMAGE RERENDERING USING GPU-BASED PROCESSING." International Journal of Image and Graphics 12, no. 01 (January 2012): 1250007. http://dx.doi.org/10.1142/s0219467812500076.

Full text
Abstract:
One essential process in image rerendering is to replace existing texture in the region of interest with other user-preferred textures, while preserving the shading and similar texture distortion. In this paper, we propose GPU-accelerated high dynamic range (HDR) image rerendering using revisited NLM processing in parallel on the GPU-CUDA platform, to reproduce realistic rendering of HDR images with retexturing and transparent/translucent effects. Our image-based approach, using a GPU-based pipeline in the gradient domain, provides efficient processing with easy-to-control image retexturing and special shading effects. The experimental results show the efficiency and high-quality performance of our approach.
APA, Harvard, Vancouver, ISO, and other styles
34

Jin, Kyung Chan, and Hyung Tae Kim. "GPU-Based Mojette Transform for High-Speed Reconstruction." Applied Mechanics and Materials 307 (February 2013): 23–26. http://dx.doi.org/10.4028/www.scientific.net/amm.307.23.

Full text
Abstract:
The Mojette Transform (MOT) is used mainly in the imaging implementation of mechatronic-based imaging systems to reconstruct a discrete signal from a finite set of projection planes. The MOT uses a specific algorithm, called Corner Based Inversion (CBI), to reconstruct an image from its projections, offering high-speed computing properties. Moreover, the MOT ensures very low complexity in comparison to reconstruction based on the Fast Fourier Transform (FFT). In this paper, a Graphics Processing Unit (GPU)-based MOT is presented, and CPU and GPU processing results for 128³ image pixels are reported. In the results, performance differences between the CPU and GPU architectures are discussed, and an approach for fast improvement in architectural efficiency is recommended.
APA, Harvard, Vancouver, ISO, and other styles
35

Liu, Gaogao, Wenbo Yang, Peng Li, Guodong Qin, Jingjing Cai, Youming Wang, Shuai Wang, Ning Yue, and Dongjie Huang. "MIMO Radar Parallel Simulation System Based on CPU/GPU Architecture." Sensors 22, no. 1 (January 5, 2022): 396. http://dx.doi.org/10.3390/s22010396.

Full text
Abstract:
The data volume and computational load of MIMO radar are huge, so very high-speed computation is necessary for its real-time processing. In this paper, we mainly study the time-division MIMO radar signal processing flow and propose an improved MIMO radar signal processing algorithm that raises the processing speed compared with previous algorithms; on this basis, a parallel simulation system for the MIMO radar based on the CPU/GPU architecture is proposed. The outer layer of the framework is coarse-grained and accelerated with OpenMP on the CPU, while the inner layer of fine-grained data processing is accelerated on the GPU. Its performance is significantly faster than serial computation, and satisfactory acceleration effects have been achieved in the CPU/GPU architecture simulation. The experimental results show that the MIMO radar parallel simulation system with the CPU/GPU architecture greatly improves on the computing power of the CPU-based method. Compared with the serial sequential CPU method, the GPU simulation achieves a speedup of 130 times. In addition, the MIMO radar signal processing parallel simulation system based on the CPU/GPU architecture has a performance improvement of 13% compared to the GPU-only method.
APA, Harvard, Vancouver, ISO, and other styles
36

Fajrianti, Evianita Dewi, Afis Asryullah Pratama, Jamal Abdul Nasyir, Alfandino Rasyid, Idris Winarno, and Sritrusta Sukaridhoto. "High-Performance Computing on Agriculture: Analysis of Corn Leaf Disease." JOIV : International Journal on Informatics Visualization 6, no. 2 (June 28, 2022): 411. http://dx.doi.org/10.30630/joiv.6.2.793.

Full text
Abstract:
In some cases, image processing relies on a large amount of training data to produce good and accurate models. An accurate model can be obtained by augmenting the data, adjusting the darkness level of the images, and adding noise to the images. However, training on more data requires higher computational costs. One way to address this is to add acceleration and parallel communication. This study discusses several scenarios of applying CUDA and MPI to train the 14.04 GB corn leaf disease dataset, using CUDA and MPI in the image pre-processing process. The resulting pre-processing image accuracy is 83.37%, while the precision value is 86.18%. In pre-processing with MPI, the load distribution process occurs on each slave, from loading the image to cutting the image to obtain the features, carried out in parallel. The resulting features are combined on the master for linear regression. Between the CPU and Hybrid versions without MPI there is a difference of 2 minutes, while between CPU MPI and GPU MPI there is a difference of 1 minute. This demonstrates that implementing accelerated and parallel communication can streamline the processing of data sets and save computational costs. In this case, the use of MPI and the GPU positively influences the proposed system.
APA, Harvard, Vancouver, ISO, and other styles
37

Perek, Piotr, Aleksander Mielczarek, and Dariusz Makowski. "High-Performance Image Acquisition and Processing for Stereoscopic Diagnostic Systems with the Application of Graphical Processing Units." Sensors 22, no. 2 (January 8, 2022): 471. http://dx.doi.org/10.3390/s22020471.

Full text
Abstract:
In recent years, cinematography and other digital content creators have been eagerly turning to Three-Dimensional (3D) imaging technology. The creators of movies, games, and augmented reality applications are aware of this technology's advantages, possibilities, and new means of expression. The development of electronic and IT technologies enables ever better quality of the recorded 3D image and many possibilities for its correction and modification in post-production. However, preparing a correct 3D image that does not cause perception problems for the viewer is still a complex and demanding task. Therefore, planning and then ensuring the correct parameters and quality of the recorded 3D video is essential. Despite better post-production techniques, fixing errors in a captured image can be difficult, time-consuming, and sometimes impossible. The detection of errors typical for stereo vision related to the depth of the image (e.g., depth budget violation, stereoscopic window violation) during the recording allows for their correction already on the film set, e.g., by different scene layouts and/or different camera configurations. The paper presents a prototype of an independent, non-invasive diagnostic system that supports the film crew in the process of calibrating stereoscopic cameras, as well as in analysing the 3D depth while working on a film set. The system acquires full HD video streams from professional cameras using the Serial Digital Interface (SDI), synchronises them, and estimates and analyses the disparity map. Objective depth analysis using computer tools while recording scenes allows stereographers to immediately spot errors in the 3D image, primarily those related to the violation of the viewing comfort zone. The paper also describes an efficient method of analysing a 3D video using the Graphics Processing Unit (GPU). The main steps of the proposed solution are uncalibrated rectification and disparity map estimation. The algorithms selected and implemented for the needs of this system do not require knowledge of intrinsic and extrinsic camera parameters. Thus, they can be used in non-cooperative environments, such as a film set, where the camera configuration often changes. Both steps are implemented with the use of a GPU to improve data processing efficiency. The paper presents the evaluation results of the algorithms' accuracy, as well as a comparison of the performance of two implementations, with and without GPU acceleration. The application of the described GPU-based method makes the system efficient and easy to use. The system can process a video stream with full HD resolution at a speed of several frames per second.
APA, Harvard, Vancouver, ISO, and other styles
38

Stepanenko, S. O., and P. Y. Yakimov. "Using high-performance deep learning platform to accelerate object detection." Information Technology and Nanotechnology, no. 2416 (2019): 354–60. http://dx.doi.org/10.18287/1613-0073-2019-2416-354-360.

Full text
Abstract:
Object classification using neural networks is extremely relevant today. YOLO is one of the most frequently used frameworks for object classification. It produces high accuracy, but its processing speed is not high enough, especially under limited computer performance. This article investigates the use of the NVIDIA TensorRT framework to optimize YOLO with the aim of increasing the image processing speed. While preserving the efficiency and quality of the neural network, TensorRT allows us to increase the processing speed through optimization of the architecture and of the calculations on a GPU.
APA, Harvard, Vancouver, ISO, and other styles
39

Wang, Long, Masaki Iwasawa, Keigo Nitadori, and Junichiro Makino. "petar: a high-performance N-body code for modelling massive collisional stellar systems." Monthly Notices of the Royal Astronomical Society 497, no. 1 (July 24, 2020): 536–55. http://dx.doi.org/10.1093/mnras/staa1915.

Full text
Abstract:
The numerical simulations of massive collisional stellar systems, such as globular clusters (GCs), are very time-consuming. Until now, only a few realistic million-body simulations of GCs with a small fraction of binaries (5 per cent) have been performed, using the nbody6++gpu code. Such models took half a year of computational time on a Graphics Processing Unit (GPU)-based supercomputer. In this work, we develop a new N-body code, petar, by combining the Barnes–Hut tree, the Hermite integrator and slow-down algorithmic regularization. The code can accurately handle an arbitrary fraction of multiple systems (e.g. binaries and triples) while maintaining high performance through hybrid parallelization with MPI, OpenMP, SIMD instructions and the GPU. Benchmarks indicate that petar and nbody6++gpu agree very well on the long-term evolution of the global structure, binary orbits and escapers. On a highly configured GPU desktop computer, a million-body simulation with all stars in binaries runs 11 times faster with petar than with nbody6++gpu. Moreover, on the Cray XC50 supercomputer, petar scales well as the number of cores increases. The 10-million-body problem, which covers the regime of ultracompact dwarfs and nuclear star clusters, thus becomes tractable.
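To make the integrator family concrete, here is a minimal NumPy sketch of the prediction half of a fourth-order Hermite scheme with direct-summation forces; it only illustrates the method named in the abstract and shares no code with petar, whose tree, regularization and parallelization layers are far more involved.

# Pairwise softened accelerations and jerks, then the Hermite predictor.
import numpy as np

def accel_jerk(pos, vel, mass, eps2=1e-6):
    dx = pos[None, :, :] - pos[:, None, :]        # (N, N, 3) separations
    dv = vel[None, :, :] - vel[:, None, :]
    r2 = (dx ** 2).sum(-1) + eps2
    inv_r3 = r2 ** -1.5
    np.fill_diagonal(inv_r3, 0.0)                 # no self-interaction
    rv = (dx * dv).sum(-1)
    acc = (mass[None, :, None] * dx * inv_r3[:, :, None]).sum(1)
    jerk = (mass[None, :, None] * (dv * inv_r3[:, :, None]
            - 3.0 * rv[:, :, None] * dx * (inv_r3 / r2)[:, :, None])).sum(1)
    return acc, jerk

def hermite_predict(pos, vel, acc, jerk, dt):
    # Taylor-series predictor; a corrector using re-evaluated forces follows
    # in a full Hermite step.
    pos_p = pos + vel * dt + acc * dt**2 / 2 + jerk * dt**3 / 6
    vel_p = vel + acc * dt + jerk * dt**2 / 2
    return pos_p, vel_p

pos, vel = np.random.rand(256, 3), np.zeros((256, 3))
mass = np.full(256, 1.0 / 256)
acc, jerk = accel_jerk(pos, vel, mass)
pos, vel = hermite_predict(pos, vel, acc, jerk, dt=1e-3)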
APA, Harvard, Vancouver, ISO, and other styles
40

Peng, Chao. "High-performance computer graphics technologies in engineering applications." World Journal of Engineering 16, no. 2 (April 8, 2019): 304–8. http://dx.doi.org/10.1108/wje-05-2018-0158.

Full text
Abstract:
Purpose The purpose of this paper is to investigate possibilities for adopting state-of-the-art computer graphics technologies for big data visualization in engineering applications. Toward this purpose, a conceptual heterogeneous system for graphical rendering is proposed, built from multiple central processing unit (CPU) cores and multiple graphics processing units (GPUs). Design/methodology/approach The design of the system supports both general-purpose computation and graphics-related computation. Three processing components are discussed to fulfill the execution requirements of load balancing, data streaming and display. This design fully utilizes computational and memory resources and enhances performance through GPU-based parallelization. Findings The advantages and disadvantages of particular technical methods for each processing component are discussed, and possible ways to integrate them are analyzed. Originality/value This work contributes to the adoption of computer graphics technologies in engineering applications.
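As a toy illustration of the load-balancing component, the sketch below round-robins chunks of a frame across all visible GPUs with CuPy; the paper proposes a conceptual architecture rather than code, so the chunking scheme and the stand-in kernel here are assumptions.

# Round-robin distribution of per-frame work across multiple GPUs.
import numpy as np
import cupy as cp

def render_pass(chunk):
    # Stand-in for a graphics-related kernel (e.g., a shading computation).
    return cp.asnumpy(cp.asarray(chunk) * 0.5)

def process_frame(frame, n_chunks=8):
    chunks = np.array_split(frame, n_chunks)
    n_gpu = cp.cuda.runtime.getDeviceCount()
    out = []
    for i, chunk in enumerate(chunks):
        with cp.cuda.Device(i % n_gpu):          # simple static load balancing
            out.append(render_pass(chunk))
    return np.concatenate(out)

frame = np.random.rand(2160, 3840).astype(np.float32)   # one 4K frame
result = process_frame(frame)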
APA, Harvard, Vancouver, ISO, and other styles
41

Li, Teng, Vikram Narayana, and Tarek El-Ghazawi. "Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing." Computers 2, no. 4 (November 19, 2013): 176–214. http://dx.doi.org/10.3390/computers2040176.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

STOJANOVIC, N., and D. STOJANOVIC. "High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU." Advances in Electrical and Computer Engineering 14, no. 4 (2014): 109–14. http://dx.doi.org/10.4316/aece.2014.04017.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Said, Issam, Pierre Fortin, Jean-Luc Lamotte, and Henri Calandra. "Leveraging the accelerated processing units for seismic imaging: A performance and power efficiency comparison against CPUs and GPUs." International Journal of High Performance Computing Applications 32, no. 6 (April 5, 2017): 819–37. http://dx.doi.org/10.1177/1094342017696562.

Full text
Abstract:
Oil and gas companies rely on high performance computing to run seismic imaging algorithms such as reverse time migration. Graphics processing units are used to accelerate reverse time migration, but these deployments suffer from limitations such as limited GPU memory capacity, frequent CPU-GPU communications that may be bottlenecked by the PCI bus transfer rate, and high power consumption. Recently, AMD launched the Accelerated Processing Unit (APU): a processor that merges a CPU and a graphics processing unit on the same die and features a unified CPU-GPU memory. In this paper, we explore how efficiently the APU can be applied to reverse time migration. Using OpenCL (along with MPI and OpenMP), a CPU/APU/GPU comparative study is conducted on a single node for the 3D acoustic reverse time migration, and then extended to up to 16 nodes. We show the relevance of overlapping the I/O and MPI communications with the computations on the APU and GPU clusters, that the performance of APUs falls between that of CPUs and that of GPUs, and that the power efficiency of the APU is greater than or equal to that of the GPU.
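The overlap of communication with computation that the abstract highlights follows a standard non-blocking MPI pattern; a minimal mpi4py sketch of it, with a placeholder stencil in place of the wave-equation kernel, is given below (run under mpirun; all sizes are assumptions).

# Overlap halo exchange with interior computation using non-blocking MPI.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
nx = 64                                          # placeholder subdomain size
field = np.random.rand(nx, nx)                   # local slab, 1-row halos

def update(rows):
    rows += 0.01 * rows                          # stand-in for the stencil

reqs = []
if rank > 0:                                     # post sends/receives first
    reqs.append(comm.Isend(field[1].copy(), dest=rank - 1))
    reqs.append(comm.Irecv(field[0], source=rank - 1))
if rank < size - 1:
    reqs.append(comm.Isend(field[-2].copy(), dest=rank + 1))
    reqs.append(comm.Irecv(field[-1], source=rank + 1))
update(field[2:-2])                              # interior overlaps transfers
MPI.Request.Waitall(reqs)                        # halos are now in place
update(field[1:2]); update(field[-2:-1])         # finish the boundary rows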
APA, Harvard, Vancouver, ISO, and other styles
44

Manacero, Aleardo, Emanuel Guariglia, Thiago Alexandre de Souza, Renata Spolon Lobato, and Roberta Spolon. "Parallel fuzzy minimals on GPU." Applied Sciences 12, no. 5 (February 25, 2022): 2385. http://dx.doi.org/10.3390/app12052385.

Full text
Abstract:
Clustering is a classification method that organizes objects into groups based on their similarity. Data clustering can extract valuable information, such as human behavior and trends, from large datasets using either hard or fuzzy approaches. However, it is a time-consuming problem due to the growing volumes of collected data. In this context, sequential execution is not feasible, and parallelization is mandatory to complete the process in an acceptable time. Parallelization requires redesigning algorithms to take advantage of massively parallel platforms. In this paper we propose a novel parallel implementation of the fuzzy minimals algorithm on a graphics processing unit as a high-performance, low-cost solution to common clustering problems. The performance of this implementation is compared with an equivalent algorithm based on the message passing interface. Numerical simulations show that the proposed graphics processing unit solution can achieve high performance with regard to the cost-accuracy ratio.
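The data-parallel core of such a port is a per-point membership update. Because the fuzzy minimals algorithm itself is not reproduced in the abstract, the CuPy sketch below shows the analogous update from fuzzy c-means instead, plainly a different algorithm, chosen only to illustrate how one fuzzy-clustering step vectorizes onto the GPU.

# One GPU-vectorized fuzzy membership update (fuzzy c-means form).
import cupy as cp

def fuzzy_memberships(points, centers, m=2.0):
    # Squared distance of every point to every cluster centre: (N, K).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

points = cp.random.rand(100_000, 3)
centers = cp.random.rand(8, 3)
u = fuzzy_memberships(points, centers)           # runs entirely on the GPU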
APA, Harvard, Vancouver, ISO, and other styles
45

Min, Seung Won, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, and Wen-mei Hwu. "Large graph convolutional network training with GPU-oriented data communication architecture." Proceedings of the VLDB Endowment 14, no. 11 (July 2021): 2087–100. http://dx.doi.org/10.14778/3476249.3476264.

Full text
Abstract:
Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU. This is because the CPU needs to (1) read sparse features from memory, (2) write features into memory as a dense format, and (3) transfer the features from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help. By removing the CPU gathering stage, our method significantly reduces the consumption of the host resources and data access latency. We further present two important techniques to achieve high host memory access efficiency by the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness using several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65–92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.
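For orientation, the sketch below reproduces the conventional CPU-gather pipeline that the paper improves upon, with technique (2), overlapping the host-to-device copy with kernel execution, expressed in stock PyTorch via pinned memory and a side stream. The paper's actual contribution goes further and has GPU threads read host memory directly through zero-copy kernels, which stock PyTorch code cannot show.

# Pinned-memory feature gather with transfer/compute overlap in PyTorch.
import torch

features = torch.randn(1_000_000, 128).pin_memory()   # host feature table
staging = torch.empty(1024, 128).pin_memory()          # pinned staging buffer
copy_stream = torch.cuda.Stream()

def gather_async(node_ids):
    torch.index_select(features, 0, node_ids, out=staging)  # CPU gather
    with torch.cuda.stream(copy_stream):
        return staging.to("cuda", non_blocking=True)   # async H2D copy

ids = torch.randint(0, features.size(0), (1024,))
batch = gather_async(ids)      # training kernels may run on the default stream
torch.cuda.current_stream().wait_stream(copy_stream)   # sync before using batch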
APA, Harvard, Vancouver, ISO, and other styles
46

Hasif Azman, Ahmad, Syed Abdul Mutalib Al Junid, Abdul Hadi Abdul Razak, Mohd Faizul Md Idros, Abdul Karimi Halim, and Fairul Nazmie Osman. "Performance Evaluation of SW Algorithm on NVIDIA GeForce GTX TITAN X Graphic Processing Unit (GPU)." Indonesian Journal of Electrical Engineering and Computer Science 12, no. 2 (November 1, 2018): 670. http://dx.doi.org/10.11591/ijeecs.v12.i2.pp670-676.

Full text
Abstract:
The demand for high-performance, sensitive alignment tools has grown as bioinformatics studies have revealed the value of Deoxyribonucleic Acid (DNA) analysis in molecular biology. This paper therefore reports a performance evaluation of a parallel Smith-Waterman algorithm implementation on the NVIDIA GeForce GTX Titan X Graphics Processing Unit (GPU), compared against a Central Processing Unit (CPU) implementation running on an Intel® Core™ i5-4440S CPU at 2.80 GHz. Both designs were developed in the C programming language and targeted at their respective platforms. The GPU code was developed and compiled using the NVIDIA Compute Unified Device Architecture (CUDA). The results clearly show that the GPU-based computation outperforms the CPU-based one, indicating that GPU-based DNA sequence alignment substantially accelerates the computational process of DNA sequence alignment.
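The usual reason Smith-Waterman parallelizes well is that all cells on one anti-diagonal of the score matrix are mutually independent, so a GPU kernel can sweep the matrix one diagonal at a time. The NumPy sketch below demonstrates that wavefront ordering on the CPU (scoring values are common defaults, not those of the paper).

# Smith-Waterman local alignment score, computed anti-diagonal by
# anti-diagonal: every cell within a diagonal is updated independently.
import numpy as np

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    sa = np.frombuffer(a.encode(), np.uint8)
    sb = np.frombuffer(b.encode(), np.uint8)
    H = np.zeros((n + 1, m + 1), dtype=np.int32)
    for d in range(2, n + m + 1):             # walk the anti-diagonals
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        sub = np.where(sa[i - 1] == sb[j - 1], match, mismatch)
        H[i, j] = np.maximum.reduce([np.zeros_like(i),
                                     H[i - 1, j - 1] + sub,
                                     H[i - 1, j] + gap,
                                     H[i, j - 1] + gap])
    return int(H.max())                        # best local alignment score

print(smith_waterman("GATTACA", "GCATGCU"))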
APA, Harvard, Vancouver, ISO, and other styles
47

Zhao, Di. "Mobile GPU Computing Based Filter Bank Convolution for Three-Dimensional Wavelet Transform." International Journal of Mobile Computing and Multimedia Communications 7, no. 2 (April 2016): 22–35. http://dx.doi.org/10.4018/ijmcmc.2016040102.

Full text
Abstract:
Mobile GPU computing, or System on Chip with embedded GPU (SoC GPU), has recently been in great demand. Since these SoCs are designed for mobile devices running real-time applications such as image and video processing, highly efficient implementations of the wavelet transform are essential for these chips. In this paper, the author develops two SoC GPU-based DWT implementations: signal-based parallelization for the discrete wavelet transform (sDWT) and coefficient-based parallelization for the discrete wavelet transform (cDWT), and evaluates the performance of the three-dimensional wavelet transform on the SoC GPU Tegra K1. Computational results show that SoC GPU-based DWT is significantly faster than SoC CPU-based DWT. They also show that sDWT can generally satisfy the requirement of real-time processing (30 frames per second) at image sizes of 352×288, 480×320, 720×480 and 1280×720, while cDWT achieves real-time processing only at the small image sizes of 352×288 and 480×320.
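A filter-bank view of one DWT level helps to see what the two parallelizations divide up: low- and high-pass convolution plus dyadic downsampling, applied along each of the three axes in turn. The NumPy sketch below uses Haar filters for brevity and makes no attempt to reproduce the paper's sDWT/cDWT kernels.

# One level of a separable 3D DWT as filter-bank convolution (Haar filters).
import numpy as np

LO = np.array([1.0, 1.0]) / np.sqrt(2.0)      # analysis low-pass
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)     # analysis high-pass

def analyze_axis(x, axis):
    lo = np.apply_along_axis(np.convolve, axis, x, LO, 'full')
    hi = np.apply_along_axis(np.convolve, axis, x, HI, 'full')
    keep = np.arange(1, lo.shape[axis], 2)    # dyadic downsampling
    return np.take(lo, keep, axis), np.take(hi, keep, axis)

def dwt3d_level(volume):
    # Two-band split along z, then y, then x: eight subbands in total.
    bands = [volume]
    for axis in range(3):
        bands = [sb for b in bands for sb in analyze_axis(b, axis)]
    return bands                               # [LLL, LLH, ..., HHH]

subbands = dwt3d_level(np.random.rand(64, 64, 64))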
APA, Harvard, Vancouver, ISO, and other styles
48

Wu, Chao, Bowen Yang, Wenwu Zhu, and Yaoxue Zhang. "Toward High Mobile GPU Performance Through Collaborative Workload Offloading." IEEE Transactions on Parallel and Distributed Systems 29, no. 2 (February 1, 2018): 435–49. http://dx.doi.org/10.1109/tpds.2017.2754482.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Zhang, Tao, Xiao-Yang Liu, and Xiaodong Wang. "High Performance GPU Tensor Completion With Tubal-Sampling Pattern." IEEE Transactions on Parallel and Distributed Systems 31, no. 7 (July 1, 2020): 1724–39. http://dx.doi.org/10.1109/tpds.2020.2975196.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Leong, Martin C. W., Kit-Hang Lee, Bowen P. Y. Kwan, Yui-Lun Ng, Zhiyu Liu, Nassir Navab, Wayne Luk, and Ka-Wai Kwok. "Performance-aware programming for intraoperative intensity-based image registration on graphics processing units." International Journal of Computer Assisted Radiology and Surgery 16, no. 3 (January 23, 2021): 375–86. http://dx.doi.org/10.1007/s11548-020-02303-y.

Full text
Abstract:
Purpose Intensity-based image registration has proven essential in many applications, credited to its unparalleled ability to resolve image misalignments. However, long registration times prohibit its use in intraoperative navigation systems. Much work has gone into accelerating registration by improving the algorithm's robustness, but the computation inherent to the registration algorithm has remained unaddressed.
Methods Intensity-based registration methods involve operations with a high arithmetic load and memory access demand, which can be reduced by graphics processing units (GPUs). Although GPUs are widespread and affordable, there is a lack of open-source GPU implementations optimized for non-rigid image registration. This paper demonstrates performance-aware programming techniques, involving the systematic exploitation of GPU features, by implementing the diffeomorphic log-demons algorithm.
Results By resolving the pinpointed computation bottlenecks on the GPU, our implementation of diffeomorphic log-demons on an Nvidia GTX Titan X GPU achieves a ~95 times speed-up compared to the CPU and registers a 1.3-M voxel image in 286 ms. Even for large 37-M voxel images, our implementation registers in 8.56 s, a ~258 times speed-up. Our solution effectively employs GPU computation units, memory, and data bandwidth to resolve the computation bottlenecks.
Conclusion The computation bottlenecks in diffeomorphic log-demons are pinpointed, analyzed, and resolved using various GPU performance-aware programming techniques. The proposed fast computation of basic image operations not only enhances the computation of diffeomorphic log-demons but can potentially be extended to speed up many other intensity-based approaches. Our implementation is open-source on GitHub at https://bit.ly/2PYZxQz.
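At the heart of any demons-type registration is a voxel-wise update force computed from the intensity difference and the fixed-image gradient, which is exactly the kind of expression a GPU evaluates in parallel. The sketch below implements the classic Thirion force in NumPy; it is a simplification for illustration, not the paper's diffeomorphic log-demons pipeline, which adds field exponentiation and smoothing stages.

# Classic demons update force: u = (F - M) * grad(F) / (|grad F|^2 + (F - M)^2).
import numpy as np

def demons_force(fixed, moving, eps=1e-9):
    diff = moving - fixed                          # intensity mismatch
    grad = np.stack(np.gradient(fixed), axis=-1)   # fixed-image gradient
    denom = (grad ** 2).sum(-1) + diff ** 2 + eps  # Thirion's normalization
    return -diff[..., None] * grad / denom[..., None]

fixed = np.random.rand(64, 64, 64).astype(np.float32)
moving = np.roll(fixed, 2, axis=0)                 # toy misalignment
u = demons_force(fixed, moving)                    # (64, 64, 64, 3) update field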
APA, Harvard, Vancouver, ISO, and other styles