Academic literature on the topic 'NVIDIA CUDA GPU'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'NVIDIA CUDA GPU.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "NVIDIA CUDA GPU"

1

Nangla, Siddhante. "GPU Programming using NVIDIA CUDA." International Journal for Research in Applied Science and Engineering Technology 6, no. 6 (June 30, 2018): 79–84. http://dx.doi.org/10.22214/ijraset.2018.6016.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Liu, Zhi Yuan, and Xue Zhang Zhao. "Research and Implementation of Image Rotation Based on CUDA." Advanced Materials Research 216 (March 2011): 708–12. http://dx.doi.org/10.4028/www.scientific.net/amr.216.708.

Full text
Abstract:
GPU technology frees the CPU from burdensome graphics computing tasks. NVIDIA, the main GPU producer, has added CUDA technology to its newer GPU models; it greatly enhances GPU functionality and is particularly advantageous for complex matrix computation. This paper introduces general image-rotation algorithms and the structure of CUDA. An example of rotating an image with HALCON, implemented both with CPU instruction extensions and with CUDA technology, demonstrates the advantage of CUDA by comparing the two results.
APA, Harvard, Vancouver, ISO, and other styles
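
The kind of rotation algorithm benchmarked above can be sketched on the CPU with inverse mapping and nearest-neighbour sampling; in a CUDA version each destination pixel would be computed by its own thread. This is a minimal illustrative sketch, not the paper's HALCON/CUDA code:

```python
import math

def rotate_image(img, angle_deg):
    """Rotate a 2D grayscale image (list of lists) about its centre
    using inverse mapping with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each destination pixel back to its source location;
            # every (x, y) is independent, so this loop parallelises trivially.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out
```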
3

Borcovas, Evaldas, and Gintautas Daunys. "CPU AND GPU (CUDA) TEMPLATE MATCHING COMPARISON / CPU IR GPU (CUDA) PALYGINIMAS VYKDANT ŠABLONŲ ATITIKTIES ALGORITMĄ." Mokslas – Lietuvos ateitis 6, no. 2 (April 24, 2014): 129–33. http://dx.doi.org/10.3846/mla.2014.16.

Full text
Abstract:
Image processing, computer vision and other complicated optical-information-processing algorithms require large resources, and it is often desired to execute them in real time; such requirements are hard to fulfil with a single CPU. NVIDIA's CUDA technology enables the programmer to use the GPU resources in the computer. The current research was made with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I) and an NVIDIA GeForce GT320M CUDA-compatible graphics card (GPU I), and an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II) and an NVIDIA GeForce GTX 560 CUDA-compatible graphics card (GPU II). The CUDA-compatible libraries OpenCV 2.1 and OpenCV 2.4.0 were used for the testing. The main tests were made with the standard MatchTemplate function from the OpenCV libraries. The algorithm uses a main image and a template; the influence of these factors was tested by resizing both and measuring the algorithm's computing time and performance in Gtpix/s. According to the information obtained from the research, GPU computing on the hardware mentioned above is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and the GPU does not differ significantly. The choice of template size influences CPU computation time. The difference in computing time between the two GPUs can be explained by the number of cores they have. The Lithuanian version of the abstract repeats these findings and adds that the faster GPU had 16 times as many cores and accordingly computed 16 times faster.
APA, Harvard, Vancouver, ISO, and other styles
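
The structure of the MatchTemplate benchmark above can be sketched in plain Python with a sum-of-squared-differences score (OpenCV's function supports several correlation measures; this is a simplified variant, not OpenCV code). Every placement is scored independently, which is what maps well to GPU threads:

```python
def match_template(image, template):
    """Exhaustive template matching: slide the template over the image
    and score each placement by the sum of squared differences (SSD).
    Returns (best_row, best_col) of the lowest-SSD placement."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best, best_pos = None, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            # Each (r, c) placement is independent of all others,
            # so on a GPU each could be scored by its own thread.
            ssd = sum((image[r + i][c + j] - template[i][j]) ** 2
                      for i in range(th) for j in range(tw))
            if best is None or ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos
```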
4

Gonzalez Clua, Esteban Walter, and Marcelo Panaro Zamith. "Programming in CUDA for Kepler and Maxwell Architecture." Revista de Informática Teórica e Aplicada 22, no. 2 (November 21, 2015): 233. http://dx.doi.org/10.22456/2175-2745.56384.

Full text
Abstract:
Since the first version of CUDA was launched, many improvements have been made in GPU computing. Every new CUDA version has included important novel features, bringing the architecture ever closer to a typical parallel high-performance language. This tutorial presents the GPU architecture and CUDA principles, and conceptualizes novel features introduced by NVIDIA such as dynamic parallelism, unified memory and concurrent kernels. The text also includes some optimization remarks for CUDA programs.
APA, Harvard, Vancouver, ISO, and other styles
5

Ahmed, Rafid, Md Sazzadul Islam, and Jia Uddin. "Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 1 (February 1, 2018): 70. http://dx.doi.org/10.11591/ijece.v8i1.pp70-75.

Full text
Abstract:
As the majority of compression algorithms are implemented for CPU architectures, the primary focus of our work was to exploit the opportunities of GPU parallelism in audio compression. This paper presents an implementation of the Apple Lossless Audio Codec (ALAC) algorithm using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The core idea was to identify the areas where data parallelism could be applied and to use the CUDA parallel programming model to execute the identified parallel components on CUDA's Single Instruction Multiple Thread (SIMT) model. The dataset was retrieved from the European Broadcasting Union's Sound Quality Assessment Material (SQAM). Faster execution of the algorithm reduced execution time when applied to the coding of large audio files. This paper also presents the reduction in power usage achieved by running the parallel components on the GPU. Experimental results reveal that we achieve about 80-90% speedup through CUDA on the identified components over the CPU implementation while saving CPU power consumption.
APA, Harvard, Vancouver, ISO, and other styles
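
The data-parallel split described above, independent audio frames encoded concurrently, can be sketched with a toy predictor. The real ALAC predictor and entropy coder are not reproduced here; `encode_frame` is an illustrative stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def encode_frame(frame):
    """First-order linear prediction: keep the first sample and store
    the deltas - a toy stand-in for ALAC's per-frame predictor."""
    return [frame[0]] + [b - a for a, b in zip(frame, frame[1:])]

def encode_parallel(samples, frame_size=4096, workers=4):
    """Frames are independent, so they can be encoded concurrently -
    the same data-parallel split the paper maps onto CUDA's SIMT model."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_frame, frames))
```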
6

Lin, Chun-Yuan, Chung-Hung Wang, Che-Lun Hung, and Yu-Shiang Lin. "Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs." International Journal of Genomics 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/950905.

Full text
Abstract:
Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmaceutical experiments. The time complexity of a pairwise compound comparison is O(n²), where n is the maximal length of the compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted, even more than tens of millions, so comparing a large number of compounds (the multiple compound comparison problem, abbreviated MCC) is still time-consuming. The intrinsic time complexity of the MCC problem is O(k²n²) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single and multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation among thread blocks on GPUs. CUDA-MCC was implemented in C with OpenMP and CUDA. In the experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual NVIDIA Tesla K20m GPU configuration, respectively.
APA, Harvard, Vancouver, ISO, and other styles
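
A minimal CPU sketch of the MCC workload, assuming the common formulation of LINGOs as overlapping q-grams of SMILES strings scored with Tanimoto similarity. The paper's CUDA-MCC and its load-balancing strategies are more involved; sets are used here instead of multisets for brevity:

```python
from itertools import combinations

def lingos(smiles, q=4):
    """Set of overlapping q-grams (LINGOs) of a SMILES string."""
    return {smiles[i:i + q] for i in range(len(smiles) - q + 1)}

def tanimoto(a, b):
    """Tanimoto similarity of two LINGO sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def multiple_compound_comparison(compounds):
    """All-pairs similarity - the O(k^2) MCC workload.  Each pair is
    independent, so pairs can be dealt out to GPU thread blocks; the
    paper studies load-balancing strategies for exactly that step."""
    sets = {s: lingos(s) for s in compounds}
    return {(a, b): tanimoto(sets[a], sets[b])
            for a, b in combinations(compounds, 2)}
```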
7

Blyth, Simon. "Meeting the challenge of JUNO simulation with Opticks: GPU optical photon acceleration via NVIDIA® OptiX™." EPJ Web of Conferences 245 (2020): 11003. http://dx.doi.org/10.1051/epjconf/202024511003.

Full text
Abstract:
Opticks is an open source project that accelerates optical photon simulation by integrating NVIDIA GPU ray tracing, accessed via NVIDIA OptiX, with Geant4 toolkit based simulations. A single NVIDIA Turing architecture GPU has been measured to provide optical photon simulation speedup factors exceeding 1500 times single threaded Geant4 with a full JUNO analytic GPU geometry automatically translated from the Geant4 geometry. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented within CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures providing fast interpolated property lookup or wavelength generation. Major recent developments enable Opticks to benefit from ray trace dedicated RT cores available in NVIDIA RTX series GPUs. Results of extensive validation tests are presented.
APA, Harvard, Vancouver, ISO, and other styles
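
The inverse-CDF wavelength generation mentioned in the abstract can be sketched on the CPU: precompute a lookup table from the spectrum's cumulative distribution, then turn a uniform random number into a wavelength with one indexed read (the analogue of a GPU texture fetch). A minimal sketch under those assumptions, not Opticks code:

```python
import bisect

def build_inverse_cdf(wavelengths, intensities, n=256):
    """Tabulate the inverse CDF of a reemission spectrum so a uniform
    random number maps to a wavelength with one lookup - mimicking how
    Opticks stores such tables in GPU textures."""
    cdf, total = [], 0.0
    for inten in intensities:          # cumulative sum of the spectrum
        total += inten
        cdf.append(total)
    cdf = [c / total for c in cdf]     # normalise to [0, 1]
    table = []
    for i in range(n):                 # sample the inverse at n points
        u = i / (n - 1)
        j = min(bisect.bisect_left(cdf, u), len(wavelengths) - 1)
        table.append(wavelengths[j])
    return table

def sample_wavelength(table, u):
    """u in [0, 1) -> wavelength via the table (texture-fetch analogue)."""
    return table[min(int(u * len(table)), len(table) - 1)]
```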
8

FUJIMOTO, NORIYUKI. "DENSE MATRIX-VECTOR MULTIPLICATION ON THE CUDA ARCHITECTURE." Parallel Processing Letters 18, no. 04 (December 2008): 511–30. http://dx.doi.org/10.1142/s0129626408003545.

Full text
Abstract:
Recently GPUs have acquired the ability to perform fast general purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments are conducted on a PC with GeForce 8800GTX and 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs a maximum of 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and 35.15 times faster than the Intel Math Kernel Library 9.1 on a single core x86 with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, which includes the data transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications.
APA, Harvard, Vancouver, ISO, and other styles
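
The row-wise independence that a CUDA matrix-vector kernel exploits is easy to see in a serial sketch. This is a plain Python reference of the textbook operation, not the paper's algorithm, which also tiles the matrix and caches the vector:

```python
def matvec(A, x):
    """Dense matrix-vector product y = A.x.  Each row's dot product is
    independent of the others, so on CUDA each row (or a tile of it)
    would be assigned to its own thread or warp."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
```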
9

Blyth, Simon. "Integration of JUNO simulation framework with Opticks: GPU accelerated optical propagation via NVIDIA® OptiX™." EPJ Web of Conferences 251 (2021): 03009. http://dx.doi.org/10.1051/epjconf/202125103009.

Full text
Abstract:
Opticks is an open source project that accelerates optical photon simulation by integrating NVIDIA GPU ray tracing, accessed via NVIDIA OptiX, with Geant4 toolkit based simulations. A single NVIDIA Turing architecture GPU has been measured to provide optical photon simulation speedup factors exceeding 1500 times single-threaded Geant4 with a full JUNO analytic GPU geometry automatically translated from the Geant4 geometry. Optical physics processes of scattering, absorption, scintillator reemission and boundary processes are implemented within CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into GPU textures, providing fast interpolated property lookup or wavelength generation. In this work we describe major recent developments to facilitate integration of Opticks with the JUNO simulation framework, including on-GPU collection-efficiency hit culling, which substantially reduces both the CPU memory needed for photon hits and copying overheads. Progress with the migration of Opticks to the all-new NVIDIA OptiX 7 API is also described.
APA, Harvard, Vancouver, ISO, and other styles
10

Bi, Yujiang, Yi Xiao, WeiYi Guo, Ming Gong, Peng Sun, Shun Xu, and Yi-bo Yang. "Lattice QCD GPU Inverters on ROCm Platform." EPJ Web of Conferences 245 (2020): 09008. http://dx.doi.org/10.1051/epjconf/202024509008.

Full text
Abstract:
The open-source ROCm/HIP platform for GPU computing provides a uniform framework that supports both NVIDIA and AMD GPUs, along with the possibility of porting CUDA code to HIP-compatible code. We present the porting progress on the overlap fermion inverter (GWU-code) and on the general Lattice QCD inverter package QUDA. A manual for using QUDA on HIP and tips for porting general CUDA code into the HIP framework are also provided.
APA, Harvard, Vancouver, ISO, and other styles
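
The mechanical part of a CUDA-to-HIP port is renaming runtime-API calls, which the hipify tools automate. A toy sketch of that idea follows; the mapping table is an illustrative subset, not the actual tooling:

```python
import re

# A few of the mechanical CUDA -> HIP renamings performed by the
# hipify tools (illustrative subset of the full mapping table).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

def hipify(source):
    """Rewrite CUDA runtime-API identifiers to their HIP equivalents.
    Longest names first, so cudaMemcpyHostToDevice is not clobbered
    by the shorter cudaMemcpy."""
    pattern = re.compile("|".join(sorted(CUDA_TO_HIP, key=len, reverse=True)))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)
```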

Dissertations / Theses on the topic "NVIDIA CUDA GPU"

1

Ikeda, Patricia Akemi. "Um estudo do uso eficiente de programas em placas gráficas." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-25042012-212956/.

Full text
Abstract:
Initially designed for graphics processing, graphics cards (GPUs) have evolved into high-performance general-purpose parallel coprocessors. Because of the enormous potential they offer to many research and commercial areas, the manufacturer NVIDIA stands out as a pioneer for launching the CUDA architecture (compatible with many of its cards), an environment that exploits this computational power while offering easier programming. To take full advantage of the GPU's capacity, certain practices must be followed; one of them is to keep the hardware as busy as possible. This work proposes a practical and extensible tool that helps the programmer choose the best configuration to achieve this goal.
APA, Harvard, Vancouver, ISO, and other styles
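
The thesis's goal, keeping the hardware as busy as possible, is usually quantified as occupancy. The sketch below is a heavily simplified, hypothetical occupancy estimator, not the thesis's tool: the limit values are illustrative Fermi-era numbers, and real calculators also model allocation granularities:

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=1536, max_regs=32768, max_smem=49152,
              warp_size=32, max_warps=48):
    """Very simplified CUDA occupancy estimate: resident blocks per SM
    are limited by threads, registers and shared memory; occupancy is
    the resulting fraction of the SM's maximum active warps."""
    blocks_by_threads = max_threads // threads_per_block
    blocks_by_regs = max_regs // (regs_per_thread * threads_per_block)
    blocks_by_smem = (max_smem // smem_per_block
                      if smem_per_block else blocks_by_threads)
    blocks = min(blocks_by_threads, blocks_by_regs, blocks_by_smem)
    warps = blocks * (threads_per_block // warp_size)
    return min(warps, max_warps) / max_warps
```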
2

Rivera-Polanco, Diego Alejandro. "COLLECTIVE COMMUNICATION AND BARRIER SYNCHRONIZATION ON NVIDIA CUDA GPU." Lexington, Ky. : [University of Kentucky Libraries], 2009. http://hdl.handle.net/10225/1158.

Full text
Abstract:
Thesis (M.S.)--University of Kentucky, 2009.
Title from document title page (viewed on May 18, 2010). Document formatted into pages; contains: ix, 88 p. : ill. Includes abstract and vita. Includes bibliographical references (p. 86-87).
APA, Harvard, Vancouver, ISO, and other styles
3

Harvey, Jesse Patrick. "GPU acceleration of object classification algorithms using NVIDIA CUDA /." Online version of thesis, 2009. http://hdl.handle.net/1850/10894.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Lerchundi, Osa Gorka. "Fast Implementation of Two Hash Algorithms on nVidia CUDA GPU." Thesis, Norwegian University of Science and Technology, Department of Telematics, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9817.

Full text
Abstract:

User needs increase as time passes. We started with room-sized computers, where perforated cards played the role that machine code objects play today, and we have now reached a point where the number of processors in our graphics device is not enough for our requirements. A change in the evolution of computing is looming: we are in a transition in which sequential computation is losing ground to distributed computation. This trend did not begin with the arrival of easily accessible GPUs; long before, it powered projects such as SETI@Home, fightAIDS@Home and ClimatePrediction, under the formal name of grid computing. Until now the term was linked only to systems distributed over the network, but as the technology evolves it will take on a different meaning. With CUDA, NVIDIA has been one of the first companies to make this kind of software package noteworthy: rather than a proof of concept, it is a real tool, where the true artist is the programmer who uses it and achieves performance increases. As with many innovations, a worldwide community has grown behind this software package, each member doing its bit; notably, soon after CUDA's release many software developments appeared, such as the cracking of the hitherto insurmountable WPA. The same could be said of the Sony-Toshiba-IBM (STI) alliance: it has a great community and great software (IBM is the company in charge of maintenance). It is not as accessible as CUDA, but IBM is powerful enough to enter the home supercomputing market; after IBM released the PS3 SDK, a notable application named Folding@Home was created using the benefits of parallel computing, whose purpose is, among other things, to help find a cure for cancer.
To sum up, this is only the beginning, and this thesis sizes up the possibility of using this technology to accelerate cryptographic hash algorithms. BLUE MIDNIGHT WISH (the hash algorithm under study) is subjected to an environment change, adapting it to parallel-capable code in order to produce empirical measurements that can be compared with the current sequential implementations, answering questions that have not been answered before. BLUE MIDNIGHT WISH is a candidate hash function for the next NIST standard SHA-3, designed by Professor Danilo Gligoroski from NTNU and Vlastimil Klima, an independent cryptographer from the Czech Republic. So far, from the speed point of view, BLUE MIDNIGHT WISH is at the top of the charts (generally in second place, right behind EDON-R, another hash function by Professor Danilo Gligoroski). One part of the work in this thesis was to investigate whether it is possible to achieve faster processing of Blue Midnight Wish when the computations are distributed among the cores of a CUDA device card. My numerous experiments give a clear answer: no. Although the answer is negative, it still has significant scientific value. The point is that my work confirms the viewpoints and standings of the part of the cryptographic community that doubts cryptographic primitives will benefit from being executed in parallel on many cores. Indeed, my experiments show that the communication costs between cores in CUDA outweigh by a big margin the computational costs done inside one core (processor) unit.

APA, Harvard, Vancouver, ISO, and other styles
5

Sreenibha, Reddy Byreddy. "Performance Metrics Analysis of GamingAnywhere with GPU accelerated Nvidia CUDA." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16846.

Full text
Abstract:
The modern world has opened the gates to many advancements in cloud computing, particularly in the field of cloud gaming. The most recent development in this area is the open-source cloud gaming system called GamingAnywhere. The relationship between the CPU and the GPU is the main focus of this thesis. Graphical Processing Unit (GPU) performance plays a vital role in analyzing and enhancing the playing experience of GamingAnywhere. This thesis concentrates on virtualization of the GPU and suggests that accelerating this unit with NVIDIA CUDA is the key to better performance with GamingAnywhere. After extensive research, gVirtuS was chosen as the technique for employing NVIDIA CUDA. An experimental study evaluates the feasibility and performance of VMware GPU solutions in the cloud gaming scenarios provided by GamingAnywhere. Performance is measured in terms of bitrate, packet loss, jitter and frame rate. Different game resolutions are considered in our empirical research, and the results show that frame rate and bitrate increased across resolutions with the use of the NVIDIA CUDA-enhanced GPU.
APA, Harvard, Vancouver, ISO, and other styles
6

Savioli, Nicolo'. "Parallelization of the algorithm WHAM with NVIDIA CUDA." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/6377/.

Full text
Abstract:
The aim of my thesis is to parallelize the Weighted Histogram Analysis Method (WHAM), a popular algorithm used to calculate the free energy of a molecular system in Molecular Dynamics simulations. WHAM works in post-processing in cooperation with another algorithm called Umbrella Sampling. Umbrella Sampling adds a bias to the potential energy of the system in order to force the system to sample a specific region of configurational space; several (N) independent simulations are performed in order to sample all the regions of interest. Subsequently, the WHAM algorithm is used to estimate the original system energy starting from the N atomic trajectories. The parallelization of WHAM has been performed with CUDA, a language that makes it possible to exploit the parallel architecture of NVIDIA graphics cards. The parallel implementation can considerably speed up WHAM execution compared with previous serial CPU implementations; however, the WHAM CPU code presents some timing criticalities at very high numbers of iterations. The algorithm was written in C++ and executed on UNIX systems equipped with NVIDIA graphics cards. The results were satisfying, showing a performance increase when the model was executed on graphics cards with greater compute capability. Nonetheless, the GPUs used to test the algorithm were quite old and not designed for scientific computation; a further performance increase is likely if the algorithm were executed on GPU clusters with high computational efficiency. The thesis is organized as follows: Chapter 1 describes the mathematical formulation of Umbrella Sampling and the WHAM algorithm, with their applications to the study of ionic channels and to molecular docking; Chapter 2 presents the CUDA architecture used to implement the model; and Chapter 3 presents the results obtained on model systems.
APA, Harvard, Vancouver, ISO, and other styles
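
The WHAM self-consistency equations described in the abstract can be sketched in plain Python. This is a serial reference of the standard WHAM iteration under simplifying assumptions (histogrammed data, fixed iteration count), not the thesis's CUDA code:

```python
import math

def wham(hist, bias, n_samples, beta=1.0, iters=500):
    """Self-consistent WHAM iteration for K umbrella windows over M bins.
    hist[k][m]   - counts of window k in bin m
    bias[k][m]   - biasing potential V_k(x_m)
    n_samples[k] - total samples of window k
    Returns the normalised unbiased probability per bin.  The per-bin
    sums are independent, which is what makes WHAM a good GPU target."""
    K, M = len(hist), len(hist[0])
    f = [1.0] * K                     # window normalisation factors
    for _ in range(iters):
        p = []
        for m in range(M):            # unbiased estimate per bin
            num = sum(hist[k][m] for k in range(K))
            den = sum(n_samples[k] * f[k] * math.exp(-beta * bias[k][m])
                      for k in range(K))
            p.append(num / den if den else 0.0)
        for k in range(K):            # update each window's factor
            z = sum(p[m] * math.exp(-beta * bias[k][m]) for m in range(M))
            f[k] = 1.0 / z
    total = sum(p)
    return [x / total for x in p]
```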
7

Zaahid, Mohammed. "Performance Metrics Analysis of GamingAnywhere with GPU accelerated NVIDIA CUDA using gVirtuS." Thesis, Blekinge Tekniska Högskola, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16852.

Full text
Abstract:
The modern world has opened the gates to many advancements in cloud computing, particularly in the field of cloud gaming. The most recent development in this area is the open-source cloud gaming system called GamingAnywhere. The relationship between the CPU and the GPU is the main focus of this thesis. Graphical Processing Unit (GPU) performance plays a vital role in analyzing and enhancing the playing experience of GamingAnywhere. This thesis concentrates on virtualization of the GPU and suggests that accelerating this unit with NVIDIA CUDA is the key to better performance with GamingAnywhere. After extensive research, gVirtuS was chosen as the technique for employing NVIDIA CUDA. An experimental study evaluates the feasibility and performance of VMware GPU solutions in the cloud gaming scenarios provided by GamingAnywhere. Performance is measured in terms of bitrate, packet loss, jitter and frame rate. Different game resolutions are considered in our empirical research, and the results show that frame rate and bitrate increased across resolutions with the use of the NVIDIA CUDA-enhanced GPU.
APA, Harvard, Vancouver, ISO, and other styles
8

Virk, Bikram. "Implementing method of moments on a GPGPU using Nvidia CUDA." Thesis, Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/33980.

Full text
Abstract:
This thesis concentrates on the algorithmic aspects of Method of Moments (MoM) and Locally Corrected Nyström (LCN) numerical methods in electromagnetics. The data dependency in each step of the algorithm is analyzed to implement a parallel version that can harness the powerful processing power of a General Purpose Graphics Processing Unit (GPGPU). The GPGPU programming model provided by NVIDIA's Compute Unified Device Architecture (CUDA) is described to learn the software tools at hand enabling us to implement C code on the GPGPU. Various optimizations such as the partial update at every iteration, inter-block synchronization and using shared memory enable us to achieve an overall speedup of approximately 10. The study also brings out the strengths and weaknesses in implementing different methods such as Crout's LU decomposition and triangular matrix inversion on a GPGPU architecture. The results suggest future directions of study in different algorithms and their effectiveness on a parallel processor environment. The performance data collected show how different features of the GPGPU architecture can be enhanced to yield higher speedup.
APA, Harvard, Vancouver, ISO, and other styles
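
Crout's LU decomposition, one of the methods the thesis ports to the GPGPU, can be sketched serially as follows. This is a plain Python reference of the textbook algorithm (no pivoting), not the thesis code:

```python
def crout_lu(A):
    """Crout LU decomposition: A = L.U with U having a unit diagonal.
    Returns (L, U).  Within step j, the entries of L's column j (and
    then U's row j) are mutually independent - the parallelism the
    thesis maps onto GPGPU threads."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        U[j][j] = 1.0
        for i in range(j, n):          # column j of L
            L[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        for i in range(j + 1, n):      # row j of U
            U[j][i] = (A[j][i] - sum(L[j][k] * U[k][i]
                                     for k in range(j))) / L[j][j]
    return L, U
```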
9

Ekstam, Ljusegren Hannes, and Hannes Jonsson. "Parallelizing Digital Signal Processing for GPU." Thesis, Linköpings universitet, Programvara och system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167189.

Full text
Abstract:
Because of the increasing importance of signal processing in today's society, there is a need to experiment easily with new ways to process signals. Usually, fast digital signal processing is done with special-purpose hardware that is difficult to develop for. GPUs pose an alternative for fast digital signal processing. The work in this thesis is an analysis and implementation of a GPU version of a digital signal processing chain provided by SAAB. Through an iterative process of development and testing, a final implementation was achieved. Two benchmarks, each comprising 4.2 M test samples, were made to compare the CPU implementation with the GPU implementation. The benchmarks were run on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2. The results show that the parallelized version can reach several magnitudes higher throughput than the CPU implementation.
APA, Harvard, Vancouver, ISO, and other styles
10

Araújo, João Manuel da Silva. "Paralelização de algoritmos de Filtragem baseados em XPATH/XML com recurso a GPUs." Master's thesis, FCT - UNL, 2009. http://hdl.handle.net/10362/2530.

Full text
Abstract:
Master's dissertation in Computer Engineering (Engenharia Informática).
This dissertation studies the feasibility of using GPUs for parallel processing applied to notification-filtering algorithms in a publish/subscribe system. To this end, experimental results of the sequential version (on CPUs) were compared with those of the parallel version of a filtering algorithm chosen as a reference. The analysis sought to determine whether the potential gains from exploiting GPUs are sufficient to compensate for the greater complexity of the process.
APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "NVIDIA CUDA GPU"

1

Dagg, Michael. NVIDIA GPU Programming: Massively Parallel Programming with CUDA. Wiley & Sons, Incorporated, John, 2013.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Dagg, Michael. NVIDIA GPU Programming: Massively Parallel Programming with CUDA. Wiley & Sons, Incorporated, John, 2012.

Find full text
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "NVIDIA CUDA GPU"

1

Bhura, Mayank, Pranav H. Deshpande, and K. Chandrasekaran. "CUDA or OpenCL." In Research Advances in the Integration of Big Data and Smart Computing, 267–79. IGI Global, 2016. http://dx.doi.org/10.4018/978-1-4666-8737-0.ch015.

Full text
Abstract:
Usage of General Purpose Graphics Processing Units (GPGPUs) in high-performance computing is increasing as heterogeneous systems continue to become dominant. CUDA has been the programming environment for nearly all such NVIDIA GPU based GPGPU applications. Still, the framework runs only on NVIDIA GPUs; utilizing other available computing devices requires reimplementation. OpenCL provides a vendor-neutral and open programming environment, with many implementations available on CPUs, GPUs, and other types of accelerators; OpenCL can thus be regarded as a write-once, run-anywhere framework. Despite this, both frameworks have their own pros and cons. This chapter presents a comparison of the performance of the CUDA and OpenCL frameworks, using an algorithm to find the sum of all possible triple products on a list of integers, implemented on GPUs.
APA, Harvard, Vancouver, ISO, and other styles
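
The benchmark kernel the chapter describes can be stated very compactly on the CPU. This sketch assumes "all possible triple products" means products over unordered 3-element combinations, which is one plausible reading of the chapter's description:

```python
from itertools import combinations

def sum_triple_products(xs):
    """Sum of the products of every unordered 3-element combination of
    xs - a serial reference for the benchmark the chapter implements
    in CUDA and OpenCL."""
    return sum(a * b * c for a, b, c in combinations(xs, 3))
```

For [1, 2, 3, 4] the combinations are (1,2,3), (1,2,4), (1,3,4) and (2,3,4), giving 6 + 8 + 12 + 24 = 50.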
2

Lin, Chun-Yuan, Jin Ye, Che-Lun Hung, Chung-Hung Wang, Min Su, and Jianjun Tan. "Constructing a Bioinformatics Platform with Web and Mobile Services Based on NVIDIA Jetson TK1." In Data Analytics in Medicine, 629–44. IGI Global, 2020. http://dx.doi.org/10.4018/978-1-7998-1204-3.ch035.

Full text
Abstract:
Current high-end graphics processing units (GPUs), such as the NVIDIA Tesla, Fermi and Kepler series cards, which contain up to a thousand cores per chip, are widely used in high-performance computing. These GPU cards (desktop GPUs) must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a high-performance computing platform with such desktop CPUs and GPUs are high. NVIDIA's Tegra K1 board, the Jetson TK1, contains four ARM Cortex-A15 CPUs and 192 CUDA cores (a Kepler GPU); it is an embedded board with the advantages of low cost, low power consumption and high applicability for embedded applications, and it has become a new research direction. Hence, in this paper a bioinformatics platform was constructed on the NVIDIA Jetson TK1. The ClustalWtk and MCCtk tools, for sequence alignment and compound comparison respectively, were designed on this platform, and web and mobile services with user-friendly interfaces were provided for both tools. The experimental results showed that the cost-performance ratio of the NVIDIA Jetson TK1 is higher than that of an Intel XEON E5-2650 CPU with an NVIDIA Tesla K20m GPU card.
APA, Harvard, Vancouver, ISO, and other styles
3

Adhikari, Mainak, and Sukhendu Kar. "Advanced Topics GPU Programming and CUDA Architecture." In Advances in Systems Analysis, Software Engineering, and High Performance Computing, 175–203. IGI Global, 2016. http://dx.doi.org/10.4018/978-1-4666-8853-7.ch008.

Full text
Abstract:
A graphics processing unit (GPU) typically handles computation only for computer graphics, but any GPU providing a functionally complete set of operations on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or of large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the effort to address some of those challenges when building and running GPU programs in a high-performance computing (HPC) environment. Finally, the chapter points out the importance and standards of the CUDA architecture.
APA, Harvard, Vancouver, ISO, and other styles
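
The CUDA programming model summarized above organizes threads into blocks within a grid, and each thread derives a global index from its block and thread coordinates. The following Python sketch serially emulates a 1D launch to illustrate that index computation (the names `launch_kernel` and `saxpy_kernel` are illustrative, not from the chapter):

```python
def launch_kernel(kernel, grid_dim, block_dim, *args):
    """Serially emulate a 1D CUDA launch: run the kernel body once per
    (block, thread) pair, just as the hardware runs it once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out, n):
    """Each 'thread' computes one element; the guard handles the common case
    where grid_dim * block_dim overshoots the array length."""
    i = block_idx * block_dim + thread_idx   # global index, as in CUDA C
    if i < n:
        out[i] = a * x[i] + y[i]

n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
launch_kernel(saxpy_kernel, 3, 4, 2.0, x, y, out, n)   # 3 blocks x 4 threads >= 10 elements
```

In real CUDA C the two outer loops disappear: the hardware supplies `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, and all threads run concurrently.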
4

Khadtare, Mahesh Satish. "GPU Based Image Quality Assessment using Structural Similarity (SSIM) Index." In Advances in Systems Analysis, Software Engineering, and High Performance Computing, 276–82. IGI Global, 2016. http://dx.doi.org/10.4018/978-1-4666-8853-7.ch013.

Full text
Abstract:
This chapter deals with performance analysis of a CUDA implementation of an image quality assessment tool based on the structural similarity (SSIM) index. Since it was initially created at the University of Texas in 2002, the Structural SIMilarity (SSIM) image assessment algorithm has become a valuable tool for still-image and video processing analysis. SSIM provided a major advance over the MSE (Mean Square Error) and PSNR (Peak Signal to Noise Ratio) techniques because it is far more closely aligned with the results that would be obtained by subjective testing. For objective image analysis, this new technique represents as significant an advancement over SSIM as SSIM provided over PSNR. The method is computationally intensive, which poses problems wherever real-time quality assessment is desired. We developed a CUDA implementation of this technique that offers a speedup of approximately 30x on an NVIDIA GTX 275 and 80x on a C2050 over a single-core Intel processor.
APA, Harvard, Vancouver, ISO, and other styles
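
For reference, SSIM combines luminance, contrast, and structure statistics of two images. The sketch below applies the standard formula over a single window of pixels (the full algorithm applies these statistics over sliding windows and averages; the constants follow the usual K1/K2 defaults, and this is not the chapter's CUDA code):

```python
def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM of two equal-length pixel lists:
    ((2*mx*my + c1)(2*cov + c2)) / ((mx^2 + my^2 + c1)(vx + vy + c2))."""
    k1, k2 = 0.01, 0.03                       # standard stabilizing constants
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n           # means
    vx = sum((a - mx) ** 2 for a in x) / n    # variances
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))

img = [10.0, 50.0, 90.0, 130.0]
assert abs(ssim_global(img, img) - 1.0) < 1e-12   # identical images score 1
```

Because each window's statistics are independent, the per-window computations are exactly the kind of data-parallel work that a CUDA implementation distributes across threads.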
5

Yamada, Susumu, Masahiko Machida, and Toshiyuki Imamura. "High Performance Eigenvalue Solver for Hubbard Model: Tuning Strategies for LOBPCG Method on CUDA GPU." In Parallel Computing: Technology Trends. IOS Press, 2020. http://dx.doi.org/10.3233/apc200030.

Full text
Abstract:
Exact diagonalization is the most accurate approach for solving the Hubbard model; it calculates the ground state of the Hamiltonian derived exactly from the model. Since the Hamiltonian is a large sparse symmetric matrix, an iterative method is usually used, and the LOBPCG method has been reported to be one of the most effective solvers for this problem. Since most of its operations are linear-algebra operations, the method can be executed straightforwardly on a CUDA GPU, one of the mainstream processors, by using the cuBLAS and cuSPARSE libraries. However, since these routines are executed one after another, cached data cannot be reused between routines. In this research, we tune the routines by fusing some of their loop operations in order to reuse cached data. Moreover, we propose tuning strategies for the Hamiltonian-vector multiplication on a shared-memory system that take the character of the Hamiltonian into consideration. Numerical tests on an NVIDIA Tesla P100 show that the tuned LOBPCG code is about 1.5 times faster than the code using plain cuBLAS and cuSPARSE routines.
APA, Harvard, Vancouver, ISO, and other styles
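
The loop-fusion tuning described above can be illustrated independently of cuBLAS/cuSPARSE: two consecutive vector passes (an AXPY update followed by a norm-style reduction) are combined into one traversal, so each updated element feeds the reduction while it is still hot in cache or registers. A hedged Python sketch of the strategy (not the authors' kernels):

```python
def axpy_then_dot_unfused(alpha, x, y):
    """Two separate passes, as when calling two library routines in sequence:
    y <- y + alpha*x, then dot(y, y). The vector y is traversed twice."""
    y = [yi + alpha * xi for xi, yi in zip(x, y)]
    return y, sum(yi * yi for yi in y)

def axpy_then_dot_fused(alpha, x, y):
    """One fused pass: each updated element is squared and accumulated
    immediately, before moving on, so y is traversed only once."""
    out, acc = [], 0.0
    for xi, yi in zip(x, y):
        v = yi + alpha * xi
        out.append(v)
        acc += v * v
    return out, acc

x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
assert axpy_then_dot_unfused(0.5, x, y) == axpy_then_dot_fused(0.5, x, y)
```

On a GPU the same idea removes a full round trip of the intermediate vector through device memory, which is why fusing routines can beat calling tuned library kernels back to back.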
6

Peña-Cantillana, Francisco, Daniel Díaz-Pernil, Hepzibah A. Christinal, and Miguel A. Gutiérrez-Naranjo. "Implementation on CUDA of the Smoothing Problem with Tissue-Like P Systems." In Natural Computing for Simulation and Knowledge Discovery, 184–93. IGI Global, 2014. http://dx.doi.org/10.4018/978-1-4666-4253-9.ch012.

Full text
Abstract:
Smoothing is often used in digital imagery to improve the quality of an image by reducing its level of noise. This paper presents a parallel implementation of an algorithm for smoothing 2D images in the framework of Membrane Computing; the chosen formal framework is tissue-like P systems. The algorithm has been implemented using a novel device architecture called CUDA (Compute Unified Device Architecture), which allows parallel NVIDIA Graphics Processing Units (GPUs) to solve many complex computational problems. Some examples are presented and compared; research lines for the future are also discussed.
APA, Harvard, Vancouver, ISO, and other styles
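
Although this paper implements smoothing with tissue-like P systems, the underlying operation is a neighborhood average in which every output pixel is independent of every other — exactly the property that makes it GPU-friendly. A conventional 3x3 mean-filter sketch in Python (an illustration of the smoothing operation, not the paper's membrane-computing rules):

```python
def mean_filter_3x3(img):
    """3x3 mean filter with edge clamping; every output pixel depends only
    on its own neighborhood, so all pixels could be computed in parallel
    (one GPU thread per pixel in a CUDA implementation)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc, cnt = 0.0, 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:   # skip out-of-image neighbors
                        acc += img[ni][nj]
                        cnt += 1
            out[i][j] = acc / cnt
    return out

flat = [[7.0] * 4 for _ in range(4)]
assert mean_filter_3x3(flat) == flat   # a constant image is left unchanged
```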
7

"Practical Examples of Automated Development of Efficient Parallel Programs." In Advances in Systems Analysis, Software Engineering, and High Performance Computing, 180–216. IGI Global, 2021. http://dx.doi.org/10.4018/978-1-5225-9384-3.ch006.

Full text
Abstract:
In this chapter, some examples of application of the developed software tools for design, generation, transformation, and optimization of programs for multicore processors and graphics processing units are considered. In particular, the algebra-algorithmic-integrated toolkit for design and synthesis of programs (IDS) and the rewriting rules system TermWare.NET are applied for design and parallelization of programs for multicore central processing units. The developed algebra-dynamic models and the rewriting rules toolkit are used for parallelization and optimization of programs for NVIDIA GPUs supporting the CUDA technology. The TuningGenie framework is applied for parallel program auto-tuning: optimization of sorting, Brownian motion simulation, and meteorological forecasting programs to a target platform. The parallelization of Fortran programs using the rewriting rules technique on sample problems in the field of quantum chemistry is examined.
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "NVIDIA CUDA GPU"

1

Buck, Ian. "GPU computing with NVIDIA CUDA." In ACM SIGGRAPH 2007 courses. New York, New York, USA: ACM Press, 2007. http://dx.doi.org/10.1145/1281500.1281647.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Harris, Mark. "Many-core GPU computing with NVIDIA CUDA." In the 22nd annual international conference. New York, New York, USA: ACM Press, 2008. http://dx.doi.org/10.1145/1375527.1375528.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Kirk, David. "NVIDIA cuda software and gpu parallel computing architecture." In the 6th international symposium. New York, New York, USA: ACM Press, 2007. http://dx.doi.org/10.1145/1296907.1296909.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Li, Yazhou, and Yahong Rosa Zheng. "Profiling NVIDIA Jetson Embedded GPU Devices for Autonomous Machines." In 6th International Conference on Computer Science, Engineering And Applications (CSEA 2020). AIRCC Publishing Corporation, 2020. http://dx.doi.org/10.5121/csit.2020.101811.

Full text
Abstract:
This paper presents two tools, jtop (a GUI version of tegrastats) and Nsight Systems, for profiling NVIDIA Jetson embedded GPU devices on a model race car, a good platform for prototyping and field-testing autonomous driving algorithms. The two profilers analyze the power consumption, CPU/GPU utilization, and run time of CUDA C threads on a Jetson TX2 in five different working modes. The performance differences among the five modes are demonstrated using three example programs: vector add in C and CUDA C, a simple ROS (Robot Operating System) package implementing a wall-following algorithm in Python, and a complex ROS package implementing a particle-filter algorithm for SLAM (Simultaneous Localization and Mapping). The results show that these tools are effective means of selecting the operating mode of embedded GPU devices.
APA, Harvard, Vancouver, ISO, and other styles
5

Soni, Hemlata, and Pradeep Chhawcharia. "GPU-accelerated MoM based scattering/radiation analysis using NVIDIA CUDA." In 2015 IEEE International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN). IEEE, 2015. http://dx.doi.org/10.1109/icrcicn.2015.7434257.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Ramroach, Sterling, Jonathan Herbert, and Ajay Joshi. "CUDA-ACCELERATED FEATURE SELECTION." In International Conference on Emerging Trends in Engineering & Technology (IConETech-2020). Faculty of Engineering, The University of the West Indies, St. Augustine, 2020. http://dx.doi.org/10.47412/juqg5057.

Full text
Abstract:
Identifying important features in high-dimensional data is usually done with one-dimensional filtering techniques that discard noisy attributes and attributes that are constant throughout the data. This is a time-consuming task with scope for acceleration via high-performance computing on the graphics processing unit (GPU). The proposed algorithm uses the Compute Unified Device Architecture (CUDA) framework developed by NVIDIA, which facilitates seamless scaling of computation across CUDA-enabled GPUs. The Pearson correlation coefficient can thus be applied in parallel to each feature with respect to the response variable, and the resulting ranks determine the most relevant features to select. Using data from the UCI Machine Learning Repository, our results show an increase in efficiency for multi-dimensional analysis with a more reliable feature-importance ranking. When tested on a high-dimensional dataset of 1,000 samples and 10,000 features, we achieved a 1,230-fold speedup using CUDA. The speedup grows with problem size, as with any embarrassingly parallel task.
APA, Harvard, Vancouver, ISO, and other styles
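
The per-feature ranking described above is easy to state on the CPU: compute the Pearson correlation of each feature column against the response and sort by absolute value; since every column is scored independently, each can be handled by its own GPU thread. A hedged Python sketch (function names are illustrative, not from the paper):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def rank_features(features, response):
    """Rank feature columns by |r| against the response, best first.
    Each column's score is independent of the others, which is what makes
    this an embarrassingly parallel task on a GPU."""
    scores = [abs(pearson(col, response)) for col in features]
    return sorted(range(len(features)), key=lambda i: -scores[i])

resp = [1.0, 2.0, 3.0, 4.0, 5.0]
feats = [
    [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly (anti-)correlated: |r| = 1.0
    [1.0, 3.0, 2.0, 5.0, 4.0],   # partially correlated: |r| = 0.8
]
assert rank_features(feats, resp) == [0, 1]
```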
7

Carrigan, Travis J., Jacob Watt, and Brian H. Dennis. "Using GPU-Based Computing to Solve Large Sparse Systems of Linear Equations." In ASME 2011 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. ASMEDC, 2011. http://dx.doi.org/10.1115/detc2011-48452.

Full text
Abstract:
Often thought of as tools for image rendering or data visualization, graphics processing units (GPUs) are becoming increasingly popular in scientific computing due to their low-cost, massively parallel architecture. With the introduction of CUDA C by NVIDIA and of CUDA-enabled GPUs, general-purpose computations can now be performed without resorting to shading languages. One application that benefits from the capabilities of NVIDIA hardware is computational continuum mechanics (CCM), where the need to solve sparse linear systems of equations arises whenever partial differential equations are discretized. Such systems are often solved iteratively using domain decomposition among distributed processors working in parallel. In this paper we explore the benefits of using GPUs to improve the performance of sparse matrix operations, specifically sparse matrix-vector multiplication. Our approach does not require domain decomposition, so it is simpler than corresponding implementations for distributed-memory parallel computers. We demonstrate that for matrices produced from finite element discretizations on unstructured meshes, the matrix-vector multiplication operation is just under 13 times faster than when run serially on an Intel i5 system. Furthermore, we show that when used in conjunction with the biconjugate gradient stabilized method (BiCGSTAB), a gradient-based iterative linear solver, the method is over 13 times faster than the serially executed C equivalent. Lastly, we apply the method to solving Poisson's equation with the Galerkin finite element method and demonstrate over 10.5 times higher performance on the GPU than on the Intel i5 system.
APA, Harvard, Vancouver, ISO, and other styles
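
The sparse matrix-vector multiplication at the heart of this paper is commonly expressed over the compressed sparse row (CSR) format, where each output row is an independent dot product — the natural unit of GPU parallelism (one thread or warp per row). A minimal serial Python sketch of CSR SpMV (an illustration of the operation, not the paper's CUDA code):

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A*x with A in CSR form: row r owns the nonzeros in
    values[row_ptr[r] : row_ptr[r+1]], with columns in col_idx."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# CSR encoding of [[4, 0, 1],
#                  [0, 2, 0],
#                  [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
assert spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]) == [5.0, 2.0, 8.0]
```

Because rows from an unstructured finite element mesh have varying nonzero counts, real GPU kernels add load-balancing strategies on top of this basic per-row scheme.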
8

Zhou, Tao, Qiang-Ming Cai, Xin Cao, Wen Jiang, Yuying Zhu, Yuyu Zhu, and Jun Fan. "GPU-Accelerated HO-SIE-DDM Using NVIDIA CUDA for Analysis of Multiscale Problems." In 2022 Asia-Pacific International Symposium on Electromagnetic Compatibility (APEMC). IEEE, 2022. http://dx.doi.org/10.1109/apemc53576.2022.9888565.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Kelly, Jesse. "GPU-Accelerated Simulation of Two-Phase Incompressible Fluid Flow Using a Level-Set Method for Interface Capturing." In ASME 2009 International Mechanical Engineering Congress and Exposition. ASMEDC, 2009. http://dx.doi.org/10.1115/imece2009-13330.

Full text
Abstract:
Computational fluid dynamics has seen a surge of popularity as a tool for visual effects animators over the past decade, since Stam's seminal Stable Fluids paper [1]. Complex fluid dynamics simulations can be prohibitive to run because of the time needed to perform all of the necessary computations. This project proposes an accelerated two-phase incompressible fluid flow solver implemented on programmable graphics hardware. Modern graphics processing units (GPUs) are highly parallel computing devices, and in problems with large potential for parallel computation the GPU may vastly outperform the CPU. This project exploits the potential parallelism in the solution of the Navier-Stokes equations to write a GPU-accelerated flow solver. NVIDIA's Compute Unified Device Architecture (CUDA) language is used to program the parallel portions of the solver. CUDA is a C-like language introduced by the NVIDIA Corporation with the goal of simplifying general-purpose computing on the GPU. CUDA exploits data parallelism by executing the same or nearly the same code on different data streams simultaneously, so the algorithms used in the flow solver are designed to be highly data-parallel. Most finite-difference-based fluid solvers for computer graphics applications have used the traditional staggered marker-and-cell (MAC) grid introduced by Harlow and Welch [2]. The proposed approach improves upon the programmability of such solvers by using a non-staggered (collocated) grid, with an efficient technique to smooth the pressure oscillations that often result from using a collocated grid in the simulation of incompressible flows. To be appropriate for visual effects use, a fluid solver must have some means of tracking fluid interfaces in order to produce a renderable fluid surface; this project uses the level-set method [3] for interface tracking. The level set is treated as a scalar property, and its propagation in time is computed using the same transport algorithm used in the main fluid flow solver.
APA, Harvard, Vancouver, ISO, and other styles
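
The level-set transport mentioned at the end of the abstract advects a scalar field with the flow velocity. A hedged 1D sketch of one first-order upwind step for the transport equation d(phi)/dt + u*d(phi)/dx = 0 (a far simpler scheme and dimensionality than the solver's, shown only to illustrate what "propagating the level set as a scalar" means):

```python
def advect_upwind(phi, u, dx, dt):
    """One first-order upwind step for d(phi)/dt + u*d(phi)/dx = 0 on a
    periodic 1D grid with constant velocity u. The upwind difference is
    taken from the side the flow comes from, which keeps the scheme stable
    for |u|*dt/dx <= 1."""
    n = len(phi)
    out = [0.0] * n
    for i in range(n):
        if u >= 0:
            dphi = phi[i] - phi[(i - 1) % n]   # backward difference
        else:
            dphi = phi[(i + 1) % n] - phi[i]   # forward difference
        out[i] = phi[i] - u * dt / dx * dphi
    return out

# At CFL = u*dt/dx = 1, the upwind step reduces to an exact one-cell shift.
phi = [0.0, 1.0, 2.0, 3.0]
assert advect_upwind(phi, 1.0, 1.0, 1.0) == [3.0, 0.0, 1.0, 2.0]
```

Each grid point's update reads only its immediate neighbors, so the full 3D version of this stencil is the kind of per-cell work CUDA parallelizes naturally.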
10

Alam, Muhammad S., and Liang Cheng. "Parallelization of LBM Code Using CUDA Capable GPU Platform for 3D Single and Two-Sided Non-Facing Lid-Driven Cavity Flow." In ASME 2011 30th International Conference on Ocean, Offshore and Arctic Engineering. ASMEDC, 2011. http://dx.doi.org/10.1115/omae2011-50332.

Full text
Abstract:
In this paper, a lattice Boltzmann model is developed and then parallelized on a Compute Unified Device Architecture (CUDA)-capable NVIDIA GPU platform. Numerical algorithms are developed for the solution of 3D single-sided and two-sided non-facing lid-driven (TSNFL) cavity flow for Re = 10-1000. The algorithms are verified by solving both steady and unsteady 3D cavity and 3D TSNFL flow problems, with excellent agreement between numerical predictions and results available in the literature. The results show that the CUDA-enabled LBM code is computationally efficient: the implementation of the LBM on a GPU achieves at least thirty million lattice updates per second for 3D lid-driven cavity flow. Computations have also been carried out for a 2D lid-driven cavity flow, for which the LBM-GPU calculation achieves 641 million lattice updates per second.
APA, Harvard, Vancouver, ISO, and other styles
