Journal articles on the topic 'NVIDIA CUDA GPU'

To see the other types of publications on this topic, follow the link: NVIDIA CUDA GPU.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'NVIDIA CUDA GPU.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Nangla, Siddhante. "GPU Programming using NVIDIA CUDA." International Journal for Research in Applied Science and Engineering Technology 6, no. 6 (June 30, 2018): 79–84. http://dx.doi.org/10.22214/ijraset.2018.6016.

2

Liu, Zhi Yuan, and Xue Zhang Zhao. "Research and Implementation of Image Rotation Based on CUDA." Advanced Materials Research 216 (March 2011): 708–12. http://dx.doi.org/10.4028/www.scientific.net/amr.216.708.

Abstract:
GPU technology releases the CPU from burdensome graphics computing tasks. NVIDIA, the main GPU producer, has added CUDA technology to its new GPU models, which greatly enhances GPU functionality and offers clear advantages for computing complex matrices. This paper introduces general image-rotation algorithms and the structure of CUDA. An example of rotating an image with HALCON, implemented both with CPU instruction extensions and with CUDA technology, demonstrates the advantage of CUDA by comparing the two results.
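As a rough illustration of the per-pixel data parallelism such an image-rotation comparison exploits, here is a minimal CUDA sketch (not the paper's HALCON-based code; all names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Each thread computes one pixel of the rotated output by sampling the
// source image at the inversely rotated coordinate (nearest-neighbour,
// zero outside the source bounds). Illustrative sketch only.
__global__ void rotateImage(const unsigned char* src, unsigned char* dst,
                            int width, int height, float angleRad)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float cx = 0.5f * width, cy = 0.5f * height;   // rotate about the centre
    float c = cosf(angleRad), s = sinf(angleRad);
    // Inverse mapping: where in the source does this destination pixel come from?
    float sx =  c * (x - cx) + s * (y - cy) + cx;
    float sy = -s * (x - cx) + c * (y - cy) + cy;

    int isx = (int)lroundf(sx), isy = (int)lroundf(sy);
    dst[y * width + x] =
        (isx >= 0 && isx < width && isy >= 0 && isy < height)
            ? src[isy * width + isx] : 0;
}
// Host side (not shown): cudaMalloc both buffers and launch with a 16x16
// block grid covering the image, e.g. rotateImage<<<grid, block>>>(...).
```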
3

Borcovas, Evaldas, and Gintautas Daunys. "CPU AND GPU (CUDA) TEMPLATE MATCHING COMPARISON / CPU IR GPU (CUDA) PALYGINIMAS VYKDANT ŠABLONŲ ATITIKTIES ALGORITMĄ." Mokslas – Lietuvos ateitis 6, no. 2 (April 24, 2014): 129–33. http://dx.doi.org/10.3846/mla.2014.16.

Abstract:
Image processing, computer vision and other complicated optical-information processing algorithms require large resources, and it is often desired to execute them in real time. Such requirements are hard to fulfil with a single CPU. NVIDIA's CUDA technology enables the programmer to use the GPU resources in the computer. The research was carried out on an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I) and an NVIDIA GeForce GT320M CUDA-compatible graphics card (GPU I), and on an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II) and an NVIDIA GeForce GTX 560 CUDA-compatible graphics card (GPU II). The OpenCV 2.1 and CUDA-compatible OpenCV 2.4.0 libraries were used for the testing. The main tests used the standard MatchTemplate function from the OpenCV libraries. The algorithm takes a main image and a template, and the influence of both factors was tested: the main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the results, GPU computing on the hardware mentioned above is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and the GPU does not differ significantly. The choice of template size influences CPU computation. The difference in computing time between the two GPUs can be explained by the number of cores they have: in our study, the faster GPU had 16 times more cores and computed about 16 times faster.
4

Gonzalez Clua, Esteban Walter, and Marcelo Panaro Zamith. "Programming in CUDA for Kepler and Maxwell Architecture." Revista de Informática Teórica e Aplicada 22, no. 2 (November 21, 2015): 233. http://dx.doi.org/10.22456/2175-2745.56384.

Abstract:
Since the first version of CUDA was launched, many improvements have been made in GPU computing. Every new CUDA version has included important novel features, bringing this architecture closer and closer to a typical parallel high-performance language. This tutorial presents the GPU architecture and CUDA principles, conceptualizing novel features introduced by NVIDIA such as dynamic parallelism, unified memory and concurrent kernels. The text also includes some optimization remarks for CUDA programs.
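Two of the features the tutorial covers, unified memory and concurrent kernels, can be sketched in a few lines of CUDA. This is a hedged, minimal example (not taken from the tutorial itself):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* v, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    // Unified memory: one pointer valid on both host and device (Kepler+).
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Concurrent kernels: independent launches on separate streams may
    // overlap on Kepler/Maxwell-class hardware.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 3.0f);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 0.5f);
    cudaDeviceSynchronize();   // results now visible to the host

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b);
    return 0;
}
```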
5

Ahmed, Rafid, Md Sazzadul Islam, and Jia Uddin. "Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 1 (February 1, 2018): 70. http://dx.doi.org/10.11591/ijece.v8i1.pp70-75.

Abstract:
As the majority of compression algorithms are implemented for the CPU architecture, the primary focus of our work was to exploit the opportunities of GPU parallelism in audio compression. This paper presents an implementation of the Apple Lossless Audio Codec (ALAC) algorithm using the NVIDIA GPU Compute Unified Device Architecture (CUDA) framework. The core idea was to identify the areas where data parallelism could be applied and to use the CUDA parallel programming model to execute the identified parallel components on the Single Instruction Multiple Thread (SIMT) model of CUDA. The dataset was retrieved from the European Broadcasting Union's Sound Quality Assessment Material (SQAM). Faster execution of the algorithm reduced the execution time of audio coding for large audio files. The paper also presents the reduction in power usage achieved by running the parallel components on the GPU. Experimental results reveal that we achieve about an 80-90% speedup through CUDA on the identified components over the CPU implementation, while saving CPU power consumption.
6

Lin, Chun-Yuan, Chung-Hung Wang, Che-Lun Hung, and Yu-Shiang Lin. "Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs." International Journal of Genomics 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/950905.

Abstract:
Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmacy experiments. The time complexity of a pairwise compound comparison is O(n²), where n is the maximal length of the compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted, now numbering more than tens of millions, so comparison with a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC) is still time-consuming. The intrinsic time complexity of the MCC problem is O(k²n²) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single and multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In our experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and dual NVIDIA Tesla K20m GPU cards, respectively.
7

Blyth, Simon. "Meeting the challenge of JUNO simulation with Opticks: GPU optical photon acceleration via NVIDIA® OptiX™." EPJ Web of Conferences 245 (2020): 11003. http://dx.doi.org/10.1051/epjconf/202024511003.

Abstract:
Opticks is an open source project that accelerates optical photon simulation by integrating NVIDIA GPU ray tracing, accessed via NVIDIA OptiX, with Geant4 toolkit based simulations. A single NVIDIA Turing architecture GPU has been measured to provide optical photon simulation speedup factors exceeding 1500 times single threaded Geant4 with a full JUNO analytic GPU geometry automatically translated from the Geant4 geometry. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented within CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. Major recent developments enable Opticks to benefit from ray trace dedicated RT cores available in NVIDIA RTX series GPUs. Results of extensive validation tests are presented.
8

FUJIMOTO, NORIYUKI. "DENSE MATRIX-VECTOR MULTIPLICATION ON THE CUDA ARCHITECTURE." Parallel Processing Letters 18, no. 04 (December 2008): 511–30. http://dx.doi.org/10.1142/s0129626408003545.

Abstract:
Recently GPUs have acquired the ability to perform fast general purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments are conducted on a PC with GeForce 8800GTX and 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs a maximum of 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and 35.15 times faster than the Intel Math Kernel Library 9.1 on a single core x86 with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, which includes the data transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications.
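For comparison with the custom kernel benchmarked in the paper, a dense matrix-vector product is a single call in cuBLAS. The sketch below uses the modern cuBLAS v2 API, which differs in detail from the CUBLAS 1.1 interface the paper measured; names are illustrative:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Computes y = alpha*A*x + beta*y for a column-major m x n matrix A
// already resident on the device. Hedged sketch of the cuBLAS v2 API;
// the paper's own kernel is a hand-written alternative to this routine.
void gemv_on_gpu(const float* dA, const float* dx, float* dy, int m, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, dA, m,      // lda = m for column-major storage
                dx, 1, &beta, dy, 1);
    cublasDestroy(handle);
}
```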
9

Blyth, Simon. "Integration of JUNO simulation framework with Opticks: GPU accelerated optical propagation via NVIDIA® OptiX™." EPJ Web of Conferences 251 (2021): 03009. http://dx.doi.org/10.1051/epjconf/202125103009.

Abstract:
Opticks is an open source project that accelerates optical photon simulation by integrating NVIDIA GPU ray tracing, accessed via NVIDIA OptiX, with Geant4 toolkit based simulations. A single NVIDIA Turing architecture GPU has been measured to provide optical photon simulation speedup factors exceeding 1500 times single threaded Geant4 with a full JUNO analytic GPU geometry automatically translated from the Geant4 geometry. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented within CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. In this work we describe major recent developments to facilitate integration of Opticks with the JUNO simulation framework, including on-GPU collection-efficiency hit culling, which substantially reduces both the CPU memory needed for photon hits and copying overheads. Progress with the migration of Opticks to the all-new NVIDIA OptiX 7 API is also described.
10

Bi, Yujiang, Yi Xiao, WeiYi Guo, Ming Gong, Peng Sun, Shun Xu, and Yi-bo Yang. "Lattice QCD GPU Inverters on ROCm Platform." EPJ Web of Conferences 245 (2020): 09008. http://dx.doi.org/10.1051/epjconf/202024509008.

Abstract:
The open source ROCm/HIP platform for GPU computing provides a uniform framework that supports both NVIDIA and AMD GPUs, as well as the possibility of porting CUDA code to HIP-compatible code. We present the porting progress of the overlap fermion inverter (GWU-code) and of the general lattice QCD inverter package QUDA. A manual for using QUDA on HIP and tips for porting general CUDA code into the HIP framework are also provided.
11

Kim, Youngtae, and Gyuhyeon Hwang. "Efficient Parallel CUDA Random Number Generator on NVIDIA GPUs." Journal of KIISE 42, no. 12 (December 15, 2015): 1467–73. http://dx.doi.org/10.5626/jok.2015.42.12.1467.

12

Mu, Tian Hong, and Yun Yang. "A Method for Binary Image Component Parallel Labeling Algorithm Based on CUDA." Advanced Materials Research 811 (September 2013): 538–42. http://dx.doi.org/10.4028/www.scientific.net/amr.811.538.

Abstract:
In a GPU, more internal transistors are devoted to data processing rather than flow control. Compared with existing multi-core CPUs, the GPU has more processors and higher overall parallel-processing capability, which makes it suitable for large-scale supercomputing on a desktop platform. The CUDA platform, put forward by NVIDIA, is a new hardware and software architecture that realizes general-purpose GPU computation with high parallelism. We adopt the CUDA C programming language to realize a parallel binary-image connected-component labeling algorithm based on CUDA. The algorithm uses eight-connectivity labels, offers high parallelism and low coupling between steps, and leaves considerable room for further efficiency improvement.
13

Kommera, Pranay Reddy, Vinay Ramakrishnaiah, Christine Sweeney, Jeffrey Donatelli, and Petrus H. Zwart. "GPU-accelerated multitiered iterative phasing algorithm for fluctuation X-ray scattering." Journal of Applied Crystallography 54, no. 4 (July 30, 2021): 1179–88. http://dx.doi.org/10.1107/s1600576721005744.

Abstract:
The multitiered iterative phasing (MTIP) algorithm is used to determine the biological structures of macromolecules from fluctuation scattering data. It is an iterative algorithm that reconstructs the electron density of the sample by matching the computed fluctuation X-ray scattering data to the external observations, and by simultaneously enforcing constraints in real and Fourier space. This paper presents the first ever MTIP algorithm acceleration efforts on contemporary graphics processing units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to accelerate the MTIP algorithm on NVIDIA GPUs. The computational performance of the CUDA-based MTIP algorithm implementation outperforms the CPU-based version by an order of magnitude. Furthermore, the Heterogeneous-Compute Interface for Portability (HIP) runtime APIs are used to demonstrate portability by accelerating the MTIP algorithm across NVIDIA and AMD GPUs.
14

Vintache, Damien, Bernard Humbert, and David Brasse. "Iterative reconstruction for transmission tomography on GPU using Nvidia CUDA." Tsinghua Science and Technology 15, no. 1 (February 2010): 11–16. http://dx.doi.org/10.1016/s1007-0214(10)70002-x.

15

Jin, Nai Gao, Fei Mo Li, and Zhao Xing Li. "Quasi-Monte Carlo Gaussian Particle Filtering Acceleration Using CUDA." Applied Mechanics and Materials 130-134 (October 2011): 3311–15. http://dx.doi.org/10.4028/www.scientific.net/amm.130-134.3311.

Abstract:
A CUDA-accelerated Quasi-Monte Carlo Gaussian particle filter (QMC-GPF) is proposed to deal with real-time non-linear non-Gaussian problems. The GPF is especially suitable for parallel implementation as a result of the elimination of the resampling step. QMC-GPF is an efficient counterpart of the GPF that uses the QMC sampling method instead of MC. Since particles generated by the QMC method provide the best-possible distribution in the sampling space, QMC-GPF can make more accurate estimations with the same number of particles compared with the traditional particle filter. Experimental results show that our GPU implementation of QMC-GPF can achieve a maximum speedup ratio of 95 on an NVIDIA GeForce GTX 460.
16

Choi, Hyeonseong, and Jaehwan Lee. "Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training." Applied Sciences 11, no. 21 (November 4, 2021): 10377. http://dx.doi.org/10.3390/app112110377.

Abstract:
To achieve high accuracy when performing deep learning, it is necessary to use a large-scale training model. However, due to the limitations of GPU memory, it is difficult to train large-scale models within a single GPU. NVIDIA introduced a technology called CUDA Unified Memory with CUDA 6 to overcome the limitations of GPU memory by virtually combining GPU memory and CPU memory, and in CUDA 8 memory advise options were introduced to utilize CUDA Unified Memory efficiently. In this work, we propose a newly optimized scheme based on CUDA Unified Memory to use GPU memory efficiently by applying different memory advice to each data type according to its access pattern in deep learning training. We apply CUDA Unified Memory technology to PyTorch to observe the performance of large-scale learning models through the expanded GPU memory, and we conduct comprehensive experiments on how to utilize Unified Memory efficiently by applying memory advise options when performing deep learning. As a result, when the data used for deep learning are divided into three types and memory advice is applied to each according to its access pattern, the deep learning execution time is reduced by 9.4% compared to the default Unified Memory.
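The Unified Memory and memory advise mechanism the abstract builds on looks roughly as follows in CUDA; this is a hedged sketch with illustrative names and sizes, not the paper's code:

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 28;          // 256 MiB of managed memory (illustrative)
    float* weights;
    cudaMallocManaged(&weights, bytes);    // CUDA 6+: one pointer for CPU and GPU

    int device = 0;
    cudaGetDevice(&device);
    // CUDA 8+ advice: read-mostly data may be replicated on both sides...
    cudaMemAdvise(weights, bytes, cudaMemAdviseSetReadMostly, device);
    // ...or pinned to a preferred location, with remote access enabled for the CPU.
    cudaMemAdvise(weights, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(weights, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Optionally pre-populate GPU memory before the first kernel touches it.
    cudaMemPrefetchAsync(weights, bytes, device, 0);

    cudaFree(weights);
    return 0;
}
```

The paper's contribution is in choosing which advice to attach to which class of training data; the calls above are only the raw API surface.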
17

Lin, Chun-Yuan, Jin Ye, Che-Lun Hung, Chung-Hung Wang, Min Su, and Jianjun Tan. "Constructing a Bioinformatics Platform with Web and Mobile Services Based on NVIDIA Jetson TK1." International Journal of Grid and High Performance Computing 7, no. 4 (October 2015): 57–73. http://dx.doi.org/10.4018/ijghpc.2015100105.

Abstract:
Current high-end graphics processing units (abbreviated as GPUs), such as the NVIDIA Tesla, Fermi and Kepler series cards, which contain up to a thousand cores per chip, are widely used in high-performance computing fields. These GPU cards (called desktop GPUs) must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a high-performance computing platform with such desktop CPUs and GPUs are high. NVIDIA released the Tegra K1, on the board called Jetson TK1, which contains four ARM Cortex-A15 CPUs and 192 CUDA cores (a Kepler GPU); it is an embedded board with the advantages of low cost, low power consumption and high applicability for embedded applications. The NVIDIA Jetson TK1 has become a new research direction. Hence, in this paper, a bioinformatics platform was constructed based on the NVIDIA Jetson TK1. The ClustalWtk and MCCtk tools, for sequence alignment and compound comparison respectively, were designed on this platform. Moreover, web and mobile services for these two tools, with user-friendly interfaces, were also provided. The experimental results showed that the cost-performance ratio of the NVIDIA Jetson TK1 is higher than that of an Intel XEON E5-2650 CPU and an NVIDIA Tesla K20m GPU card.
18

Zhou, Yan, Tian Nan, Ya Li Cui, Tang Pei Cheng, and Jing Li Shao. "Numerical Simulation of Groundwater Flow Based on CUDA." Applied Mechanics and Materials 556-562 (May 2014): 3527–31. http://dx.doi.org/10.4028/www.scientific.net/amm.556-562.3527.

Abstract:
In this work, in order to improve the operation speed of a groundwater flow numerical model, we studied approaches to solving linear equations and the related acceleration problems on the CUDA platform. We developed a GPCG module on the GPU platform to replace the PCG module of MODFLOW 2005, using an NVIDIA TESLA C2070. By establishing a series of idealized and real-world models, we obtained effective acceleration on the GPU platform: the overall speedup of the models is around 2.5 times and the calculation speedup is about 10 times.
19

Chen, Dong, Hua You Su, Wen Mei, Li Xuan Wang, and Chun Yuan Zhang. "Scalable Parallel Motion Estimation on Muti-GPU System." Applied Mechanics and Materials 347-350 (August 2013): 3708–14. http://dx.doi.org/10.4028/www.scientific.net/amm.347-350.3708.

Abstract:
With NVIDIA's parallel computing architecture CUDA, using the GPU to speed up compute-intensive applications has become a research focus in recent years. In this paper, we propose a scalable method for multi-GPU systems to accelerate the motion estimation algorithm, the most time-consuming process in video encoding. Based on an analysis of data dependency and the multi-GPU architecture, a parallel computing model and a communication model are designed. We tested our parallel algorithm and analyzed its performance with 10 standard video sequences at different resolutions using 4 NVIDIA GTX460 GPUs, and calculated the overall speedup. Our results show a speedup of 36.1 times using 1 GPU and more than 120 times using 4 GPUs on 1920x1080 sequences. Further, our parallel algorithm demonstrates the potential of nearly linear speedup with the number of GPUs in the system.
20

Rao, Naseem, and Safdar Tanweer. "Performance Analysis of Healthcare data and its Implementation on NVIDIA GPU using CUDA-C." Journal of Drug Delivery and Therapeutics 9, no. 1-s (February 21, 2019): 361–63. http://dx.doi.org/10.22270/jddt.v9i1-s.2447.

Abstract:
In this paper we show how commodity-GPU-based data mining can help classify various healthcare data into different groups faster than traditional CPU-based systems. In addition, such systems are cheaper than various ASIC (Application-Specific Integrated Circuit) based solutions. Faster clustering of data could provide useful insights for making successful decisions in cases of emergency and outbreaks, and such a fast and economical way to obtain these insights is of paramount importance. In our work we used an NVIDIA GPU to implement an algorithm for healthcare data classification; specifically, as a proof of concept, we implemented the k-means algorithm on a healthcare-related data set. Speech disfluency and stuttering assessment can also be addressed by classifying audio/speech samples using ANN, k-NN, SVM, etc. Finally, we present conclusions based on our research so far. Keywords: NVIDIA; GPU; ECG; CPU; ANN.
21

Hasif Azman, Ahmad, Syed Abdul Mutalib Al Junid, Abdul Hadi Abdul Razak, Mohd Faizul Md Idros, Abdul Karimi Halim, and Fairul Nazmie Osman. "Performance Evaluation of SW Algorithm on NVIDIA GeForce GTX TITAN X Graphic Processing Unit (GPU)." Indonesian Journal of Electrical Engineering and Computer Science 12, no. 2 (November 1, 2018): 670. http://dx.doi.org/10.11591/ijeecs.v12.i2.pp670-676.

Abstract:
Nowadays, the requirement for high-performance, sensitive alignment tools has increased as the benefits of Deoxyribonucleic Acid (DNA) and molecular biology have been revealed through bioinformatics research. This paper therefore reports a performance evaluation of a parallel Smith-Waterman algorithm implementation on the new NVIDIA GeForce GTX Titan X Graphics Processing Unit (GPU), compared to a Central Processing Unit (CPU), an Intel® Core™ i5-4440S running at 2.80 GHz. Both designs were developed in the C programming language and targeted at their respective platforms. The GPU code was developed and compiled using the NVIDIA Compute Unified Device Architecture (CUDA). The results clearly show that GPU-based computation outperforms the CPU-based version, indicating that GPU-based DNA sequence alignment accelerates the computational process of DNA sequence alignment.
22

Zhu, Li, and Yi Min Yang. "Real-Time Multitasking Video Encoding Processing System of Multicore." Applied Mechanics and Materials 66-68 (July 2011): 2074–79. http://dx.doi.org/10.4028/www.scientific.net/amm.66-68.2074.

Abstract:
This paper presents optimizations based on NVIDIA's processor series, such as GeForce, Tegra and Nexus, and discusses the future development of the video image processor. It expounds the currently most popular DSP optimization techniques and objectives, optimizing the designs of methods available in the existing literature. Based on NVIDIA's product line, it specifically discusses the CUDA GPU architecture, surveys the hardware and algorithms of the currently most popular video encoding equipment, and applies practical technology to improve the transmission and encoding of multimedia data.
23

Semenenko, Julija, Aliaksei Kolesau, Vadimas Starikovičius, Artūras Mackūnas, and Dmitrij Šešok. "COMPARISON OF GPU AND CPU EFFICIENCY WHILE SOLVING HEAT CONDUCTION PROBLEMS." Mokslas - Lietuvos ateitis 12 (November 24, 2020): 1–5. http://dx.doi.org/10.3846/mla.2020.13500.

Abstract:
An overview of GPU usage for solving different engineering problems, a comparison between CPU and GPU computations, and an overview of the heat conduction problem are provided in this paper. The Jacobi iterative algorithm was implemented using Python, the TensorFlow GPU library and NVIDIA CUDA technology. Numerical experiments were conducted with 6 CPUs and 4 GPUs. The fastest GPU tested completed the calculations 19 times faster than the slowest CPU; on average, the GPU was 9 to 11 times faster than the CPU. A significant relative speed-up in the GPU calculations starts when the matrix contains at least 400² floating-point numbers.
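The Jacobi iteration in this study maps naturally onto one GPU thread per grid point. The paper's implementation is in Python/TensorFlow; purely as an illustration of the method, here is a minimal CUDA sketch of one Jacobi sweep for a 2D heat-conduction grid (all names illustrative):

```cuda
#include <cuda_runtime.h>

// One Jacobi sweep for the 2D steady-state heat equation: each interior
// point becomes the average of its four neighbours. Repeated sweeps,
// swapping in/out buffers, converge toward the solution. Sketch only.
__global__ void jacobiSweep(const float* in, float* out, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // boundary stays fixed
    out[j * nx + i] = 0.25f * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                               in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}
```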
24

Lo, Win-Tsung, Yue-Shan Chang, Ruey-Kai Sheu, Chun-Chieh Chiu, and Shyan-Ming Yuan. "CUDT: A CUDA Based Decision Tree Algorithm." Scientific World Journal 2014 (2014): 1–12. http://dx.doi.org/10.1155/2014/745640.

Abstract:
The decision tree is one of the famous classification methods in data mining, and much research has focused on improving decision-tree performance. However, those algorithms were developed to run on traditional distributed systems, and without the help of new technology the latency of processing the huge data generated by ubiquitous sensing nodes obviously cannot be improved. In order to improve data processing latency in huge data mining, in this paper we design and implement a new parallelized decision tree algorithm on CUDA (compute unified device architecture), a GPGPU solution provided by NVIDIA. In the proposed system, the CPU is responsible for flow control while the GPU is responsible for computation. We have conducted many experiments to evaluate the system performance of CUDT and compared it with the traditional CPU version. The results show that CUDT is 5-55 times faster than Weka-j48 and 18 times faster than SPRINT for large data sets.
25

Obrecht, Christian, Bernard Tourancheau, and Frédéric Kuznik. "Performance Evaluation of an OpenCL Implementation of the Lattice Boltzmann Method on the Intel Xeon Phi." Parallel Processing Letters 25, no. 03 (September 2015): 1541001. http://dx.doi.org/10.1142/s0129626415410017.

Abstract:
A portable OpenCL implementation of the lattice Boltzmann method targeting emerging many-core architectures is described. The main purpose of this work is to evaluate and compare the performance of this code on three mainstream hardware architectures available today, namely an Intel CPU, an Nvidia GPU, and the Intel Xeon Phi. Because of the similarities between OpenCL and CUDA, we chose to follow some of the strategies devised to implement efficient lattice Boltzmann solvers on Nvidia GPUs, while remaining as generic as possible. Being fairly configurable, the program makes it possible to ascertain the best options for each hardware platform. The achieved performance is quite satisfactory for both the CPU and the GPU. For the Xeon Phi, however, the results are below expectations. Nevertheless, comparison with data from the literature shows that on this architecture the code seems memory-bound.
26

Toledo, Leonel, Pedro Valero-Lara, Jeffrey S. Vetter, and Antonio J. Peña. "Towards Enhancing Coding Productivity for GPU Programming Using Static Graphs." Electronics 11, no. 9 (April 20, 2022): 1307. http://dx.doi.org/10.3390/electronics11091307.

Abstract:
The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability and limitations to the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and the Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation, which can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, achieving again accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference and optimized CUDA code. Our main target is to achieve a higher coding productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, a better exploitation of the GPU capacity. The combination of using Static Graphs with two of the current most important GPU programming models (CUDA and OpenACC) is able to reduce considerably the execution time w.r.t. the use of CUDA and OpenACC only, achieving accelerations of up to more than one order of magnitude. Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC Specifications.
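The CUDA Graph mechanism this work combines with CUDA and OpenACC replaces many individual kernel launches with a single replayable graph. The following is a hedged sketch of the standard stream-capture pattern, not the paper's code; `axpy` and the iteration counts are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void axpy(float* y, const float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

// Record a sequence of launches once, then replay the instantiated graph
// each iteration, amortizing the per-kernel launch overhead.
void run_with_graph(float* dy, const float* dx, int n, int iters)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k)                     // capture 10 launches as one graph
        axpy<<<(n + 255) / 256, 256, 0, stream>>>(dy, dx, 0.1f, n);
    cudaStreamEndCapture(stream, &graph);
    // CUDA 10/11 signature; CUDA 12 replaces the last three arguments with a flags value.
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    for (int it = 0; it < iters; ++it)               // one launch replays all 10 kernels
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```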
27

Narang, Hira, Fan Wu, and Abdul Rafae Mohammed. "An Efficient Acceleration of Solving Heat and Mass Transfer Equations with the Second Kind Boundary Conditions in Capillary Porous Radially Composite Cylinder Using Programmable Graphics Hardware." Computer and Information Science 13, no. 2 (April 29, 2020): 75. http://dx.doi.org/10.5539/cis.v13n2p75.

Abstract:
With the recent developments in computing technology, increased effort has gone into the simulation of various scientific methods and phenomena in engineering fields. One such case is the simulation of heat and mass transfer equations, which is becoming more and more important in analyzing various scenarios in engineering applications. Analyzing the heat and mass transfer phenomenon under various environmental conditions requires simulating it, but the numerical solution of heat and mass transfer equations is very time-consuming. This paper therefore applies an acceleration technique developed in the graphics community, which exploits the graphics processing unit (GPU), to the numerical solution of heat and mass transfer equations. The NVIDIA Compute Unified Device Architecture (CUDA) programming model is a good way of applying parallel computing to program the GPU. This paper shows a good improvement in performance when numerically solving, on the GPU, the heat and mass transfer equations for a capillary porous radially composite cylinder with boundary conditions of the second kind. The simulation is implemented using the CUDA platform on an NVIDIA Quadro FX 4800 graphics card. Our experimental results show a drastic performance improvement when the GPU is used, with a maximum observed speedup of more than 8-fold. The GPU is therefore a good approach to accelerating heat and mass transfer simulation.
28

Myasishchev, A., S. Lienkov, V. Dzhulii, and I. Muliar. "USING GPU NVIDIA FOR LINEAR ALGEBRA PROBLEMS." Collection of scientific works of the Military Institute of Kyiv National Taras Shevchenko University, no. 64 (2019): 144–57. http://dx.doi.org/10.17721/2519-481x/2019/64-14.

Abstract:
Research goals and objectives: the purpose of the article is to study the feasibility of using graphics processors for solving systems of linear equations and computing matrix multiplication, as compared with conventional multi-core processors. The peculiarities of using the MAGMA and CUBLAS libraries with various graphics processors are considered, and a performance comparison is made between the Tesla C2075 and GeForce GTX 480 GPUs and a six-core AMD processor. Subject of research: software was developed based on the MAGMA and CUBLAS libraries to study the performance of the NVIDIA Tesla C2075 and GeForce GTX 480 GPUs in solving systems of linear equations and computing matrix multiplication. Research methods used: libraries were used to parallelize the solution of linear algebra problems - MAGMA and CUBLAS for the GPUs, and ScaLAPACK and ATLAS for the multi-core processor. To study operational speed, methods and algorithms for parallelizing computational procedures similar to those in these libraries were used, and a software module was developed for solving systems of linear equations and computing matrix multiplication on parallel systems. Results of the research: it was determined that for double-precision numbers the performance of the GeForce GTX 480 and Tesla C2075 GPUs is approximately 3.5 and 6.3 times higher, respectively, than that of the AMD CPU, while the GeForce GTX 480 is 1.3 times faster than the Tesla C2075 for single-precision numbers. To achieve maximum performance of an NVIDIA CUDA GPU, one should use the MAGMA or CUBLAS libraries, which accelerate the calculations by about 6.4 times compared to the traditional programming method. It was also determined that when solving systems of equations on a 6-core CPU with the ScaLAPACK and ATLAS libraries, a maximum acceleration of 3.24 times over a single core can be achieved, instead of the theoretical 6-fold acceleration; processors with a large number of cores therefore cannot be used efficiently with the considered libraries. It is demonstrated that the advantage of the GPU over the CPU increases with the number of equations.
29

Syrocki, Łukasz, and Grzegorz Pestka. "Implementation of algebraic procedures on the GPU using CUDA architecture on the example of generalized eigenvalue problem." Open Computer Science 6, no. 1 (May 13, 2016): 79–90. http://dx.doi.org/10.1515/comp-2016-0006.

Abstract:
A ready-to-use set of functions to facilitate solving the generalized eigenvalue problem for symmetric matrices, in order to efficiently calculate eigenvalues and eigenvectors using NVIDIA's Compute Unified Device Architecture (CUDA) technology, is provided. An integral part of CUDA is a high-level programming environment enabling the tracking of code executed both on the Central Processing Unit and on the Graphics Processing Unit. The presented matrix structures allow analysis of the advantages of using graphics processors in such calculations.
30

Arpaio, Maximilian, Enrico Vitucci, and Franco Fuschini. "A Comparative Study of the Computation Efficiency of a GPU-Based Ray Launching Algorithm for UAV-Assisted Wireless Communications." Applied Computational Electromagnetics Society 35, no. 12 (February 15, 2021): 1456–62. http://dx.doi.org/10.47037/2020.aces.j.351201.

Abstract:
Graphics Processing Units (GPUs) have opened up new opportunities for speeding up general-purpose parallel computing applications. In this paper, we present the computational efficiency, in terms of time performance, of a novel ray launching field prediction algorithm that relies on NVIDIA GPUs and the Compute Unified Device Architecture (CUDA). The software tool assesses the propagation losses of a wireless transmitter, carried by an Unmanned Air Vehicle (UAV), over a 3D urban environment. Together with other effective features, the software tool is shown to reduce the computation time of simulations by several orders of magnitude. The performance and cost-benefit trade-offs of three different NVIDIA GPU configurations are investigated over three different urban scenarios, taken as test cases for Air-to-Ground (A2G) communications for 5G applications and beyond.
31

Masek, Jan, Radim Burget, Lukas Povoda, and Malay Kishore Dutta. "Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL." International Journal of Advances in Telecommunications, Electrotechnics, Signals and Systems 5, no. 2 (June 10, 2016): 101. http://dx.doi.org/10.11601/ijates.v5i2.142.

Abstract:
Using modern Graphics Processing Units (GPUs) is very useful for computing complex and time-consuming processes, as GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. The work compares the performance of the OpenCL and CUDA implementations, each of which is suitable for a different number of attributes. The proposed CUDA algorithm achieves an acceleration of up to 880x in comparison with a single-threaded CPU version. The common k-NN was modified to be faster when a lower number of k neighbors is set. The performance of the algorithm was verified with two dual-GPU NVIDIA GeForce GTX 690 cards and an Intel Core i7 3770 CPU at 4.1 GHz, and the speedups were measured for one, two, three and four GPUs. We performed several tests with data sets containing up to 4 million elements with various numbers of attributes.
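The data-parallel core of a GPU k-NN such as the one described is the distance computation, with one thread per reference point; a minimal illustrative CUDA sketch (not the paper's implementation) follows:

```cuda
#include <cuda_runtime.h>

// Each thread computes the squared Euclidean distance from one reference
// point to the query; the k smallest distances would then be selected
// (selection step not shown). Names are illustrative.
__global__ void knnDistances(const float* refs, const float* query,
                             float* dist, int nRefs, int dims)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRefs) return;
    float d = 0.0f;
    for (int a = 0; a < dims; ++a) {
        float diff = refs[r * dims + a] - query[a];
        d += diff * diff;
    }
    dist[r] = d;   // sort / partial-select on the host or with a GPU selection kernel
}
```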
32

Blyth, Simon. "Opticks: GPU Optical Photon Simulation for Particle Physics using NVIDIA® OptiX™." EPJ Web of Conferences 214 (2019): 02027. http://dx.doi.org/10.1051/epjconf/201921402027.

Abstract:
Opticks is an open source project that integrates the NVIDIA OptiX GPU ray tracing engine with Geant4 toolkit based simulations. Massive parallelism brings drastic performance improvements with optical photon simulation speedup expected to exceed 1000 times Geant4 with workstation GPUs. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented as CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. OptiX handles the creation and application of a choice of acceleration structures such as boundary volume hierarchies and the transparent use of multiple GPUs. A major recent advance is the implementation of GPU ray tracing of complex constructive solid geometry shapes, enabling automated translation of Geant4 geometries to the GPU without approximation. Using common initial photons and random number sequences allows the Opticks and Geant4 simulations to be run point-by-point aligned. Aligned running has reached near perfect equivalence with test geometries.
33

van Aart, Evert, Neda Sepasian, Andrei Jalba, and Anna Vilanova. "CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking." International Journal of Biomedical Imaging 2011 (2011): 1–12. http://dx.doi.org/10.1155/2011/698908.

Abstract:
Diffusion Tensor Imaging (DTI) allows the diffusion of water in fibrous tissue to be measured noninvasively. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but its applicability is limited by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times.
34

Zhang, Qing Hu, Dong Wang, Ya Peng Jiang, and Jun Quan Chen. "Parallel Solvers on the GPU for Large-Scale Finite Element Equations." Applied Mechanics and Materials 701-702 (December 2014): 207–13. http://dx.doi.org/10.4028/www.scientific.net/amm.701-702.207.

Abstract:
We present a parallel solution based on CUDA for accelerating the solution of large-scale finite element equations in electrical and magnetic fields. The JCG method is used for solving the equations, and a corresponding kernel function is designed for sparse matrix-vector multiplication (SpMV). A computation speed test for solving the FE equations was run on an NVIDIA Tesla K20c GPU hardware platform. The results show that the kernel-based method can run up to 17.1 times faster than the CPU solution; however, the advantage over the CPU cannot be ensured if only library functions on the GPU are used to solve the equations.
35

Sun, Lei, Han Tao Zhang, and Xiao Ping Zhou. "GPU-Based PSO Application in Multiuser Detection and Trajectory Parameter Estimation." Applied Mechanics and Materials 340 (July 2013): 829–32. http://dx.doi.org/10.4028/www.scientific.net/amm.340.829.

Abstract:
The parallel character of the particle swarm optimization (PSO) algorithm and NVIDIA's Graphics Processing Unit (GPU) technology with the Compute Unified Device Architecture (CUDA) are analyzed. Two methods of realizing PSO on a GPU are discussed: one uses an open-source particle swarm module supporting the GPU, applied to multiuser detection (MUD); the other uses the GPU-enabled module of MATLAB, applied to moving-trajectory parameter estimation. The test results show that PSO based on GPU technology can significantly improve computation speed when solving multi-dimensional global optimization problems whose real-time performance is otherwise poor, and it can be widely used in projects with demanding real-time requirements.
36

Chen, Yu Min, Fei Zeng, Jing Yang Wu, Qiao Wan, and Zhi Jun Su. "GPU-Accelerated Discrete Wavelet Transform for Images." Advanced Materials Research 718-720 (July 2013): 2086–91. http://dx.doi.org/10.4028/www.scientific.net/amr.718-720.2086.

Abstract:
The Discrete Wavelet Transform (DWT) has been brought into wide use in image processing, but it cannot meet the demands of huge image data because the computation time is vast. The GPU is an attractive platform for a broad field of applications, as it retains significant arithmetic processing capability and can therefore be used as a powerful accelerator without extra cost. CUDA (Compute Unified Device Architecture) provides a hardware and software environment for using the GPU to accelerate the DWT for images. In this paper, we use the CUDA-capable NVIDIA GeForce GT 650M to improve the execution time of the Discrete Wavelet Transform for images. The experimental results indicate that CUDA technology has the advantage of parallel processing and greatly improves the efficiency of the image transform. Moreover, it performs better on larger images (the maximum speedup is 15.9).
37

Cagigas-Muñiz, Daniel, Fernando Diaz-del-Rio, Manuel Ramón López-Torres, Francisco Jiménez-Morales, and José Luis Guisado. "Developing Efficient Discrete Simulations on Multicore and GPU Architectures." Electronics 9, no. 1 (January 19, 2020): 189. http://dx.doi.org/10.3390/electronics9010189.

Abstract:
In this paper we show how to efficiently implement parallel discrete simulations on multicore and GPU architectures through a real example of an application: a cellular automata model of laser dynamics. We describe the techniques employed to build and optimize the implementations using OpenMP and CUDA frameworks. We have evaluated the performance on two different hardware platforms that represent different target market segments: high-end platforms for scientific computing, using an Intel Xeon Platinum 8259CL server with 48 cores, and also an NVIDIA Tesla V100 GPU, both running on Amazon Web Server (AWS) Cloud; and on a consumer-oriented platform, using an Intel Core i9 9900k CPU and an NVIDIA GeForce GTX 1050 TI GPU. Performance results were compared and analyzed in detail. We show that excellent performance and scalability can be obtained in both platforms, and we extract some important issues that imply a performance degradation for them. We also found that current multicore CPUs with large core numbers can bring a performance very near to that of GPUs, and even identical in some cases.
38

Gong, Wei, Kévyn Johannes, and Frédéric Kuznik. "Numerical Simulation of Melting with Natural Convection Based on Lattice Boltzmann Method and Performed with CUDA Enabled GPU." Communications in Computational Physics 17, no. 5 (May 2015): 1201–24. http://dx.doi.org/10.4208/cicp.2014.m350.

Abstract:
A new solver is developed to numerically simulate the melting phase change with natural convection. This solver was implemented on a single Nvidia GPU based on CUDA technology in order to simulate the melting phase change in a 2D rectangular enclosure. The Rayleigh number is of the order of magnitude of 10⁸ and the Prandtl number is 50. The hybrid thermal lattice Boltzmann method (HTLBM) is employed to simulate the natural convection in the liquid phase, and the enthalpy formulation is used to simulate the phase change aspect. The model is validated against experimental data and published analytic results. The simulation results manifest a strong convection in the melted phase and a flow pattern different from the reference results at low Rayleigh number. In addition, the computational performance is estimated for single-precision arithmetic: the solver yields 703.31 MLUPS and a 61.89 GB/s device-to-device data throughput on an Nvidia Tesla C2050 GPU.
39

Mamri, Ayoub, Mohamed Abouzahir, Mustapha Ramzi, and Rachid Latif. "ORB-SLAM accelerated on heterogeneous parallel architectures." E3S Web of Conferences 229 (2021): 01055. http://dx.doi.org/10.1051/e3sconf/202122901055.

Abstract:
The SLAM algorithm permits a robot to map the desired environment while positioning itself in space. It is an efficient system increasingly adopted for autonomous vehicle navigation and robotic applications in ongoing research, yet it has not received any complete end-to-end hardware implementation. Our work aims at a hardware/software optimization of a computationally expensive functional block of monocular ORB-SLAM2. We implement the proposed optimization on an FPGA-based heterogeneous embedded architecture, which shows attractive results. We further carry out a comparative study with other heterogeneous architectures, including a powerful embedded GPGPU (NVIDIA Tegra TX1) and a high-end GPU (NVIDIA GeForce 920MX). The implementation is achieved using high-level-synthesis-based OpenCL for the FPGA and CUDA for the NVIDIA target boards.
40

Liu, Jie, Chunye Gong, Weimin Bao, Guojian Tang, and Yuewen Jiang. "Solving the Caputo Fractional Reaction-Diffusion Equation on GPU." Discrete Dynamics in Nature and Society 2014 (2014): 1–7. http://dx.doi.org/10.1155/2014/820162.

Abstract:
We present a parallel GPU solution of the Caputo fractional reaction-diffusion equation in one spatial dimension with an explicit finite difference approximation. The parallel solution, implemented with the CUDA programming model, consists of three procedures: preprocessing, the parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix-vector multiplication, vector-vector addition, and constant-vector multiplication. The most time-consuming loop, of vector-vector addition and constant-vector multiplication, is optimized, and an impressive performance improvement is obtained. The experimental results show that the GPU solution compares well with the exact solution. The optimized GPU solution on an NVIDIA Quadro FX 5800 is 2.26 times faster than the optimized parallel CPU solution on a multicore Intel Xeon E5540 CPU.
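The optimization the abstract mentions, fusing the vector-vector addition and constant-vector multiplication loops, can be illustrated with a hedged CUDA sketch (the names are ours, not the paper's):

```cuda
#include <cuda_runtime.h>

// Instead of launching one kernel for the vector-vector addition and
// another for the constant multiplication, a single fused kernel reads
// each element once and writes once, roughly halving global-memory traffic.
__global__ void fusedAddScale(float* y, const float* a, const float* b,
                              float c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = c * (a[i] + b[i]);   // one pass instead of two
}
```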
41

Gan, Xin Biao, Li Shen, Quan Yuan Tan, Cong Liu, and Zhi Ying Wang. "Performance Evaluation and Optimization on GPU." Advanced Materials Research 219-220 (March 2011): 1445–49. http://dx.doi.org/10.4028/www.scientific.net/amr.219-220.1445.

Abstract:
GPUs provide higher peak performance, with hundreds of cores, than their CPU counterparts. However, it is a big challenge to take full advantage of their computing power. In order to understand the performance bottlenecks of applications on many-core GPUs and then optimize parallel programs for GPU architectures, we propose a performance evaluation model based on the memory wall and classify applications into AbM (Applications bound in Memory) and AbC (Applications bound in Computing). Furthermore, we optimize kernels characterized by low memory bandwidth, including matrix multiplication and the FFT (Fast Fourier Transform), by employing the texture cache on an NVIDIA GTX280 using CUDA (Compute Unified Device Architecture). Experimental results show that the texture cache is helpful for AbM workloads with good data locality, so utilizing the GPU memory hierarchy efficiently is critical for performance improvement.
42

ROBERGE, VINCENT, and MOHAMMED TARBOUCHI. "COMPARISON OF PARALLEL PARTICLE SWARM OPTIMIZERS FOR GRAPHICAL PROCESSING UNITS AND MULTICORE PROCESSORS." International Journal of Computational Intelligence and Applications 12, no. 01 (March 2013): 1350006. http://dx.doi.org/10.1142/s1469026813500065.

Abstract:
In this paper, we present a parallel implementation of the particle swarm optimization (PSO) on graphical processing units (GPU) using CUDA. By fully utilizing the processing power of graphic processors, our implementation (CUDA-PSO) provides a speedup of 167× compared to a sequential implementation on CPU. This speedup is significantly superior to what has been reported in recent papers and is achieved by four optimizations we made to better adapt the parallel algorithm to the specific architecture of the NVIDIA GPU. However, because today's personal computers are usually equipped with a multicore CPU, it may be unfair to compare our CUDA implementation to a sequential one. For this reason, we implemented a parallel PSO for multicore CPUs using MPI (MPI-PSO) and compared its performance against our CUDA-PSO. The execution time of our CUDA-PSO remains 15.8× faster than our MPI-PSO which ran on a high-end 12-core workstation. Moreover, we show with statistical significance that the results obtained using our CUDA-PSO are of equal quality as the results obtained by the sequential PSO or the MPI-PSO. Finally, we use our parallel PSO for real-time harmonic minimization of multilevel power inverters with 20 DC sources while considering the first 100 harmonics and show that our CUDA-PSO is 294× faster than the sequential PSO and 32.5× faster than our parallel MPI-PSO.
43

Magro, A., K. Zarb Adami, and J. Hickish. "GPU-Powered Coherent Beamforming." Journal of Astronomical Instrumentation 04, no. 01n02 (June 2015): 1550002. http://dx.doi.org/10.1142/s2251171715500026.

Abstract:
Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves [Formula: see text] TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of a test observation.
44

D’Ambrosio, Donato, Giuseppe Filippone, Rocco Rongo, William Spataro, and Giuseppe A. Trunfio. "Cellular Automata and GPGPU." International Journal of Grid and High Performance Computing 4, no. 3 (July 2012): 30–47. http://dx.doi.org/10.4018/jghpc.2012070102.

Abstract:
This paper presents an efficient implementation of the SCIARA cellular automata computational model for simulating lava flows, using the Compute Unified Device Architecture (CUDA) interface developed by NVIDIA and carried out on Graphics Processing Units (GPUs). GPUs are specifically designed for efficiently processing graphic data sets, but they have recently also been exploited to achieve excellent computational results in applications not directly connected with computer graphics. The authors show an implementation of SCIARA and present results for a Tesla GPU computing processor, an NVIDIA device specifically designed for high-performance computing, and a GeForce GT 330M commodity graphics card. The experiments show that significant performance improvements are achieved, over a factor of 100, depending on the problem size and the type of memory optimization performed. The experiments confirm the effectiveness and validity of adopting graphics hardware as an alternative to expensive hardware solutions, such as cluster or multi-core machines, for the implementation of cellular automata models.
45

Stojanovic, Natalija, and Dragan Stojanovic. "Parallelizing Multiple Flow Accumulation Algorithm using CUDA and OpenACC." ISPRS International Journal of Geo-Information 8, no. 9 (September 3, 2019): 386. http://dx.doi.org/10.3390/ijgi8090386.

Abstract:
Watershed analysis, as a fundamental component of digital terrain analysis, is based on the Digital Elevation Model (DEM), which is a grid (raster) model of the Earth surface and topography. Watershed analysis consists of computationally and data intensive computing algorithms that need to be implemented by leveraging parallel and high-performance computing methods and techniques. In this paper, the Multiple Flow Direction (MFD) algorithm for watershed analysis is implemented and evaluated on multi-core Central Processing Units (CPU) and many-core Graphics Processing Units (GPU), which provides significant improvements in performance and energy usage. The implementation is based on NVIDIA CUDA (Compute Unified Device Architecture) implementation for GPU, as well as on OpenACC (Open ACCelerators), a parallel programming model, and a standard for parallel computing. Both phases of the MFD algorithm (i) iterative DEM preprocessing and (ii) iterative MFD algorithm, are parallelized and run over multi-core CPU and GPU. The evaluation of the proposed solutions is performed with respect to the execution time, energy consumption, and programming effort for algorithm parallelization for different sizes of input data. An experimental evaluation has shown not only the advantage of using OpenACC programming over CUDA programming in implementing the watershed analysis on a GPU in terms of performance, energy consumption, and programming effort, but also significant benefits in implementing it on the multi-core CPU.
46

KUMAR, PIYUSH, and ANUPAM AGRAWAL. "GPU-ACCELERATED INTERACTIVE VISUALIZATION OF 3D VOLUMETRIC DATA USING CUDA." International Journal of Image and Graphics 13, no. 02 (April 2013): 1340003. http://dx.doi.org/10.1142/s0219467813400032.

Abstract:
Improving image quality and rendering speed have always been challenges for programmers involved in large-scale volume rendering, especially in the field of medical image processing. This paper aims to perform volume rendering using the graphics processing unit (GPU), which, with its massively parallel capability, has the potential to revolutionize this field. The final results allow doctors to diagnose and analyze 2D computed tomography (CT) scan data using three-dimensional visualization techniques. The system has been used on multiple types of datasets, from 10 MB to 350 MB of medical volume data. Further, the use of the compute unified device architecture (CUDA) framework, a low-learning-curve technology, for this purpose would greatly reduce the cost involved in CT scan analysis and hence bring it to the common masses. The volume rendering has been done on an Nvidia Tesla C1060 card (240 CUDA cores, providing data-parallel execution), and its performance has been benchmarked.
47

Wang, Song, Shan Liang Yang, and Ge Li. "Study of Accelerating Infrared Imaging Simulation Based on CUDA." Applied Mechanics and Materials 651-653 (September 2014): 2045–49. http://dx.doi.org/10.4028/www.scientific.net/amm.651-653.2045.

Abstract:
This paper builds an infrared scene of a sphere target based on JAMSE, which provides an EO/IR environment and is suited to building infrared imaging simulation systems at the engineering and engagement level. In addition, to speed up this infrared imaging simulation, we analyzed the external rendering mode applied in the JMAES EO/IR environment and found that external rendering image compositing is a highly independent process, well suited to parallel computing. After testing on an NVIDIA TESLA C2075 GPU with CUDA, and comparing the performance with the corresponding sequential process on the CPU, we obtained a satisfying result: the process achieves a speedup of over 10.
48

He, Jiandong, Chong Wu, and Yining Jia. "A GPU-Accelerated 3D Mesh Deformation Method Based on Radial Basis Function Interpolation." Mathematical Problems in Engineering 2022 (October 13, 2022): 1–8. http://dx.doi.org/10.1155/2022/6018008.

Abstract:
In this paper, we developed a GPU-parallelized 3D mesh deformation method based on radial basis function (RBF) interpolation, using NVIDIA CUDA C++. The RBF mesh deformation method interpolates displacements of the boundary nodes to the whole mesh and can handle large mesh deformations caused by translations, rotations, and deformations. However, the computational performance of RBF mesh deformation depends on the number of grid points; for 3D mesh deformation, especially for meshes with a large number of boundary nodes, RBF interpolation has been shown to be computationally intensive. By introducing GPU acceleration, the computational performance of the RBF mesh deformation code improved significantly. Several benchmark test cases were carried out to show the feasibility and efficiency of the GPU parallelization. In summary, GPU-parallelized RBF interpolation shows the potential to become an alternative way to deal with 3D mesh deformation problems efficiently.
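As an illustration of the interpolation step that dominates RBF mesh deformation, here is a hedged CUDA sketch in the spirit of the paper; the Gaussian basis and all names are our assumptions, and the weight-solving step (a dense linear system over the boundary nodes) is omitted:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Once the weights w_i at the nBnd boundary nodes are known, each thread
// displaces one volume node as d(x) = sum_i w_i * phi(||x - x_i||),
// here with a Gaussian basis phi(r) = exp(-shape * r^2). Sketch only.
__global__ void rbfDisplace(const float3* nodes, float3* disp,
                            const float3* bndNodes, const float3* weights,
                            int nNodes, int nBnd, float shape)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nNodes) return;
    float3 x = nodes[t];
    float3 d = make_float3(0.0f, 0.0f, 0.0f);
    for (int i = 0; i < nBnd; ++i) {
        float dx = x.x - bndNodes[i].x;
        float dy = x.y - bndNodes[i].y;
        float dz = x.z - bndNodes[i].z;
        float phi = expf(-shape * (dx * dx + dy * dy + dz * dz));  // Gaussian RBF
        d.x += weights[i].x * phi;
        d.y += weights[i].y * phi;
        d.z += weights[i].z * phi;
    }
    disp[t] = d;
}
```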
49

Meng, Da-di, Yu-xin Hu, Tao Shi, Rui Sun, and Xiao-bo Li. "Airborne SAR Real-time Imaging Algorithm Design and Implementation with CUDA on NVIDIA GPU." JOURNAL OF RADARS 2, no. 4 (January 15, 2014): 481–91. http://dx.doi.org/10.3724/sp.j.1300.2013.13056.

50

Pogorilyy, S. D., and M. S. Slynko. "Research and development of Johnson's algorithm parallel schemes in GPGPU technology." PROBLEMS IN PROGRAMMING, no. 2-3 (June 2016): 105–12. http://dx.doi.org/10.15407/pp2016.02-03.105.

Abstract:
The application of Johnson's all-pairs shortest-path algorithm to an edge-weighted, directed graph is considered. It was formalized in terms of Glushkov's modified systems of algorithmic algebras. The expediency of using GPGPU technology to accelerate the algorithm is demonstrated, and a number of parallel algorithm schemes optimized for GPGPU were obtained. An approach to implementing the obtained schemes using the NVIDIA CUDA computing architecture is suggested. An experimental study of the performance improvement from using the GPU for the computations was carried out.