Journal articles on the topic 'GPU code generation'

To see the other types of publications on this topic, follow the link: GPU code generation.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'GPU code generation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Emmart, Niall, and Charles Weems. "Search-Based Automatic Code Generation for Multiprecision Modular Exponentiation on Multiple Generations of GPU." Parallel Processing Letters 23, no. 04 (December 2013): 1340009. http://dx.doi.org/10.1142/s0129626413400094.

Full text
Abstract:
Multiprecision modular exponentiation has a variety of uses, including cryptography, prime testing and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accelerate these computations, but to use the GPU efficiently, a problem must involve many simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is an important problem that fits this profile. We are developing a framework that enables generation of highly efficient implementations of exponentiation operations for different NVIDIA GPU architectures and problem instances. One of the challenges in generating such code is that NVIDIA's PTX is not a true assembly language, but is instead a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. Thus, the same PTX code runs with different levels of efficiency on different machines. And as the precision of the computations changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. To make the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations, by generating PTX code for each combination, executing it, validating the numerical result, and evaluating its performance. Our framework automates much of this process, and produces exponentiation code that is up to six times faster than the best known hand-coded implementations for the NVIDIA GTX 580. Our goal for the framework is to enable users to relatively quickly find the best configuration for each new GPU architecture. However, in migrating to the GTX 680, which has three times as many cores as the GTX 580, we found that the best performance our system could achieve was significantly less than for the GTX 580. 
The decrease was traced to a radical shift in the NVIDIA architecture that greatly reduces the storage resources for each core. Further analysis and feasibility simulations indicate that it should be possible, through changes in our code generators to adapt for different storage models, to take greater advantage of the parallelism on the GTX 680. That will add a new dimension to our search space, but will also give our framework greater flexibility for dealing with future architectures.
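The operation at the heart of this entry is easy to state even though tuning it per GPU generation is hard. As an illustrative stand-in (plain Python integers rather than the generated multiprecision PTX), the square-and-multiply loop below computes one modular exponentiation; a data center handling TLS sessions runs many such independent exponentiations at once, which is what makes the workload GPU-friendly:

```python
def mod_exp(base, exp, mod):
    # Square-and-multiply: O(log exp) modular multiplications.
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                      # multiply step for each set bit
            result = (result * base) % mod
        base = (base * base) % mod       # square step
        exp >>= 1
    return result

# A batch of independent exponentiations, one per simulated TLS session.
batch = [(7, 65537, 2**31 - 1), (1234, 5678, 10**9 + 7)]
results = [mod_exp(b, e, m) for b, e, m in batch]
```

On a GPU, each exponentiation (or group of them) is mapped onto threads with the multiprecision digits split across lanes; the search the abstract describes chooses among such mappings per architecture and precision.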
APA, Harvard, Vancouver, ISO, and other styles
2

Allazov, Afar Nazim. "Automatic Generation of GPU Code in DVOR." University News. North-Caucasian Region. Technical Sciences Series, no. 3 (September 2015): 3–9. http://dx.doi.org/10.17213/0321-2653-2015-3-3-9.

3

Blazewicz, Marek, Ian Hinder, David M. Koppelman, Steven R. Brandt, Milosz Ciznicki, Michal Kierzynka, Frank Löffler, Erik Schnetter, and Jian Tao. "From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation." Scientific Programming 21, no. 1-2 (2013): 1–16. http://dx.doi.org/10.1155/2013/167841.

Abstract:
Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
4

Rodrigues, A. Wendell O., Frédéric Guyomarc'h, Jean-Luc Dekeyser, and Yvonnick Le Menach. "Automatic Multi-GPU Code Generation Applied to Simulation of Electrical Machines." IEEE Transactions on Magnetics 48, no. 2 (February 2012): 831–34. http://dx.doi.org/10.1109/tmag.2011.2179527.

5

Rawat, Prashant Singh, Miheer Vaidya, Aravind Sukumaran-Rajam, Mahesh Ravishankar, Vinod Grover, Atanas Rountev, Louis-Noel Pouchet, and P. Sadayappan. "Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations." Proceedings of the IEEE 106, no. 11 (November 2018): 1902–20. http://dx.doi.org/10.1109/jproc.2018.2862896.

6

Basu, Protonu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, and Mary Hall. "Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers." Parallel Computing 64 (May 2017): 50–64. http://dx.doi.org/10.1016/j.parco.2017.04.002.

7

Klöckner, Andreas, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, and Ahmed Fasih. "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation." Parallel Computing 38, no. 3 (March 2012): 157–74. http://dx.doi.org/10.1016/j.parco.2011.09.001.

8

Hagiescu, Andrei, Bing Liu, R. Ramanathan, Sucheendra K. Palaniappan, Zheng Cui, Bipasa Chattopadhyay, P. S. Thiagarajan, and Weng-Fai Wong. "GPU code generation for ODE-based applications with phased shared-data access patterns." ACM Transactions on Architecture and Code Optimization 10, no. 4 (December 2013): 1–19. http://dx.doi.org/10.1145/2541228.2555311.

9

Holzer, Markus, Martin Bauer, Harald Köstler, and Ulrich Rüde. "Highly efficient lattice Boltzmann multiphase simulations of immiscible fluids at high-density ratios on CPUs and GPUs through code generation." International Journal of High Performance Computing Applications 35, no. 4 (May 13, 2021): 413–27. http://dx.doi.org/10.1177/10943420211016525.

Abstract:
A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Meta-programming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single GPU code has been integrated into the multiphysics framework waLBerla to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behavior. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario—a three-dimensional rising air bubble in water.
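The generate-then-compile idea in this abstract can be sketched in miniature: a high-level symbolic rule is specialized into source text and compiled, so the hot loop contains only constants. Everything below (the template, the function names, the BGK-style relaxation rule) is a hypothetical illustration, not code from the paper's framework:

```python
# Sketch of meta-programming for kernel generation: the update rule is
# specialized at "code generation time" so the compiled function carries
# the relaxation rate omega as a baked-in constant.
CODEGEN_TEMPLATE = """
def relax(f, f_eq):
    # BGK-style relaxation toward equilibrium, omega baked in as a constant
    return [fi + {omega} * (fe - fi) for fi, fe in zip(f, f_eq)]
"""

def generate_relax(omega):
    namespace = {}
    exec(CODEGEN_TEMPLATE.format(omega=omega), namespace)
    return namespace["relax"]

relax = generate_relax(omega=1.0)  # specialized variant for omega = 1
```

Real frameworks of this kind emit C or CUDA source (and fuse several such kernels to cut memory traffic) rather than Python, but the specialization step is the same in spirit.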
10

Walsh, Stuart D. C., and Martin O. Saar. "Developing Extensible Lattice-Boltzmann Simulators for General-Purpose Graphics-Processing Units." Communications in Computational Physics 13, no. 3 (March 2013): 867–79. http://dx.doi.org/10.4208/cicp.351011.260112s.

Abstract:
Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior. These methods are well suited to parallel implementation, particularly on the single-instruction multiple-data (SIMD) parallel processing environments found in computer graphics processing units (GPUs). Although recent programming tools dramatically improve the ease with which GPU-based applications can be written, the programming environment still lacks the flexibility available to more traditional CPU programs. In particular, it may be difficult to develop modular and extensible programs that require variable on-device functionality with current GPU architectures. This paper describes a process of automatic code generation that overcomes these difficulties for lattice-Boltzmann simulations. It details the development of GPU-based modules for an extensible lattice-Boltzmann simulation package, LBHydra. The performance of the automatically generated code is compared to equivalent purpose-written codes for single-phase, multiphase, and multicomponent flows. The flexibility of the new method is demonstrated by simulating a rising, dissolving droplet moving through a porous medium with user-generated lattice-Boltzmann models and subroutines.
11

Xu, Shuqi, and Gilles Noguere. "Generation of thermal scattering files with the CINEL code." EPJ Nuclear Sciences & Technologies 8 (2022): 8. http://dx.doi.org/10.1051/epjn/2022004.

Abstract:
The CINEL code, dedicated to generating thermal neutron scattering files in ENDF-6 format for solid crystalline materials, free gas materials, and liquid water, is presented. Compared to the LEAPR module of the NJOY code, CINEL is able to calculate the coherent and incoherent elastic scattering cross sections for any solid crystalline material. Specific material properties such as anharmonicity and texture can be taken into account in CINEL. The calculation of the thermal scattering laws can be accelerated by using graphics processing units (GPUs), which makes it possible to remove the short collision time approximation for large values of momentum transfer. CINEL can automatically generate the grids of dimensionless momentum and energy transfers. The Sampling the Velocity of the Target nucleus (SVT) algorithm, which determines the scattered neutron distributions, is implemented in CINEL. The distributions obtained for free target nuclei such as hydrogen and oxygen are in good agreement with analytical results and Monte Carlo simulations when incident neutron energies are above a few eV. The introduction of the effective temperature and a rejection step to the SVT algorithm improves the neutron up-scattering treatment of hydrogen bound in liquid water.
12

Lapillonne, Xavier, and Oliver Fuhrer. "Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science." Parallel Processing Letters 24, no. 01 (March 2014): 1450003. http://dx.doi.org/10.1142/s0129626414500030.

Abstract:
For many scientific applications, Graphics Processing Units (GPUs) can be an interesting alternative to conventional CPUs, as they can deliver higher memory bandwidth and computing power. While it is conceivable to re-write the most execution-time-intensive parts using a low-level API for accelerator programming, it may not be feasible to do so for the entire application. But having only selected parts of the application running on the GPU requires repeatedly transferring data between the GPU and the host CPU, which may lead to a serious performance penalty. In this paper we assess the potential of compiler directives, based on the OpenACC standard, for porting large parts of code and thus achieving a full GPU implementation. As an illustrative and relevant example, we consider the climate and numerical weather prediction code COSMO (Consortium for Small Scale Modeling) and focus on the physical parametrizations, a part of the code which describes all physical processes not accounted for by the fundamental equations of atmospheric motion. We show, by porting three of the dominant parametrization schemes, the radiation, microphysics, and turbulence parametrizations, that compiler directives are an efficient tool both in terms of final execution time and implementation effort. Compiler directives make it possible to port large sections of the existing code with minor modifications while still allowing further optimization of the most performance-critical parts. With the example of the radiation parametrization, which contains the solution of a block tri-diagonal linear system, the required code modifications and key optimizations are discussed in detail. Performance tests for the three physical parametrizations show a speedup of between 3× and 7× in execution time on a GPU compared with a multi-core CPU of an equivalent generation.
13

Cesare, Valentina, Ugo Becciani, Alberto Vecchiato, Mario Gilberto Lattanzi, Fabio Pitari, Marco Aldinucci, and Beatrice Bucciarelli. "The MPI + CUDA Gaia AVU–GSR Parallel Solver Toward Next-generation Exascale Infrastructures." Publications of the Astronomical Society of the Pacific 135, no. 1049 (July 1, 2023): 074504. http://dx.doi.org/10.1088/1538-3873/acdf1e.

Abstract:
We ported to the GPU with CUDA the Astrometric Verification Unit–Global Sphere Reconstruction (AVU–GSR) Parallel Solver developed for the ESA Gaia mission, by optimizing a previous OpenACC porting of this application. The code aims to find, with a [10, 100] μarcsec precision, the astrometric parameters of ~10^8 stars, the attitude and instrumental settings of the Gaia satellite, and the global parameter γ of the parametrized Post-Newtonian formalism, by solving a system of linear equations, A × x = b, with the LSQR iterative algorithm. The coefficient matrix A of the final Gaia data set is large, with ~10^11 × 10^8 elements, and sparse, reaching a size of ~10-100 TB, typical for Big Data analysis, which requires an efficient parallelization to obtain scientific results in reasonable timescales. The speedup of the CUDA code over the original AVU–GSR solver, parallelized on the CPU with MPI + OpenMP, increases with the system size and the number of resources, reaching a maximum of ~14×, and >9× over the OpenACC application. This result is obtained by comparing the two codes on the CINECA cluster Marconi100, with 4 V100 GPUs per node. After verifying that the solutions of a set of systems with different sizes computed with the CUDA and the OpenMP codes agree and show the required precision, the CUDA code was put in production on Marconi100, essential for an optimal AVU–GSR pipeline and the successive Gaia Data Releases. This analysis represents a first step toward understanding the (pre-)Exascale behavior of a class of applications that follow the same structure as this code. In the coming months, we plan to run this code on the pre-Exascale platform Leonardo of CINECA, with 4 next-generation A200 GPUs per node, toward a porting on this infrastructure, where we expect to obtain even higher performance.
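To make the linear-algebra core concrete: the solver seeks the x minimizing ||A x - b|| for a tall, sparse A. The sketch below solves a toy instance with plain gradient descent as a stand-in for LSQR (which converges far faster on ill-conditioned systems and is what the paper actually parallelizes); the matrix, step size, and iteration count are illustrative only:

```python
def lsq_gradient_descent(A, b, steps=2000, lr=0.25):
    # Iteratively minimize ||A x - b||^2 for a dense list-of-lists A.
    # Each iteration computes the residual r = A x - b and the gradient
    # g = 2 A^T r, the same matrix-vector products an LSQR iteration
    # needs, which is why the workload maps well onto GPUs at scale.
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        g = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [x[j] - lr * g[j] for j in range(n)]
    return x

# Overdetermined 3x2 toy system standing in for the ~10^11 x 10^8 Gaia matrix.
x = lsq_gradient_descent([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.1])
```

The production solver never forms A densely; it streams the sparse matrix-vector products across GPUs, but the fixed point being sought is the same least-squares solution.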
14

Noguere, G., S. Xu, L. Desgrange, J. Boucher, E. Bourasseau, G. Carlot, A. Filhol, et al. "Generation of thermal scattering laws with the CINEL code." EPJ Web of Conferences 284 (2023): 17002. http://dx.doi.org/10.1051/epjconf/202328417002.

Abstract:
The thermal scattering laws (TSL) take into account the crystalline structure and atomic motions of isotopes bound in materials. This paper presents the CINEL code, which was developed to generate temperature-dependent TSL for solid, liquid, and free gas materials of interest for nuclear reactors. CINEL is able to calculate TSL from the phonon density of states (PDOS) of materials under the Gaussian-incoherent approximations. The PDOS can be obtained by using theoretical approaches (e.g., ab initio density functional theory and molecular dynamics) or experimental results. In this work, the PDOS presented in the ENDF/B-VIII.0 and NJOY-NCrystal libraries were used for numerical validation purposes. The CINEL results are in good agreement with those reported in these databases, even in the specific cases of TSL with the newly mixed elastic format. The coding flexibility offered by Python using the JupyterLab interface made it possible to investigate the limits of physical models reported in the literature, such as a four-site model for UO2, anharmonic behavior of oxygen atoms bound in an Fm3m structure, texture in Zry4 samples, and jump corrections in a roto-translational diffusion model for liquid water. The use of graphics processing units (GPUs) is a necessity to perform calculations in a few minutes. The performance of the CINEL code is illustrated with the results obtained on actinide oxides having an Fm3m structure (UO2, ThO2, NpO2 and PuO2), low-enriched fuel (UMo), cladding (Zry4), and moderators (H2O with a specific emphasis on ice).
15

Morales-Hernández, M., M. B. Sharif, S. Gangrade, T. T. Dullo, S. C. Kao, A. Kalyanapu, S. K. Ghafoor, K. J. Evans, E. Madadi-Kandjani, and B. R. Hodges. "High-performance computing in water resources hydrodynamics." Journal of Hydroinformatics 22, no. 5 (March 4, 2020): 1217–35. http://dx.doi.org/10.2166/hydro.2020.163.

Abstract:
This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). The advances to computing power, formerly driven by the improvement of central processing unit processors, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data as well as changing even the nature of the algorithm that solves the system of equations. These concepts along with other features such as the precision for the computations, dry regions management, and input/output data are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and ascertain the new challenges for the next-generation parallel water resources codes.
16

Frolov, Vladimir, Vadim Sanzharov, Vladimir Galaktionov, and Alexander Shcherbakov. "Development in Vulkan: a domain-specific approach." Proceedings of the Institute for System Programming of the RAS 33, no. 5 (2021): 181–204. http://dx.doi.org/10.15514/ispras-2021-33(5)-11.

Abstract:
In this paper we propose a high-level approach to developing GPU applications based on the Vulkan API. The purpose of the work is to reduce the complexity of developing and debugging applications that implement complex algorithms on the GPU using Vulkan. The proposed approach uses code generation, translating a C++ program into an optimized Vulkan implementation, which includes automatic shader generation, resource binding, and the use of synchronization mechanisms (Vulkan barriers). The proposed solution is not a general-purpose programming technology but specializes in specific tasks. At the same time, it is extensible, which allows the solution to be adapted to new problems. For a single input C++ program, we can generate several implementations for different cases (via translator options) or different hardware. For example, a call to virtual functions can be implemented through a switch construct in a kernel, through sorting threads and indirect dispatch via different kernels, or through the so-called callable shaders in Vulkan. Instead of creating a universal programming technology for building various software systems, we offer an extensible technology that can be customized for a specific class of applications. Unlike, for example, Halide, we do not use a domain-specific language, and the necessary knowledge is extracted from ordinary C++ code. Therefore, we do not add any new language constructs or directives, and the input source code is assumed to be normal C++ source code (albeit with some restrictions) that can be compiled by any C++ compiler. We use pattern matching to find specific patterns in C++ code and convert them to efficient GPU code using Vulkan. Patterns are expressed through classes, member functions, and the relationships between them.
Thus, the proposed technology makes it possible to ensure a cross-platform solution by generating different implementations of the same algorithm for different GPUs. At the same time, it provides access to the specific hardware functionality required in computer graphics applications. Patterns are divided into architectural and algorithmic. An architectural pattern defines the domain and the behavior of the translator as a whole (for example, image processing, ray tracing, neural networks, computational fluid dynamics, etc.). Algorithmic patterns express knowledge of data flow and control and define a narrower class of algorithms that can be efficiently implemented in hardware; algorithmic patterns can occur within architectural patterns. Examples include parallel reduction, compaction (parallel append), sorting, prefix sum, histogram calculation, map-reduce, etc. The proposed generator works on the principle of code morphing: given a certain class in the program and transformation rules, one can automatically generate another class with the desired properties (for example, a GPU implementation of the algorithm). The generated class inherits from the input class and thus has access to all data and functions of the input class. Overriding virtual functions in the generated class helps the user connect the generated code to other Vulkan code written by hand. Shaders can be generated in two variants: OpenCL shaders for Google's clspv compiler and GLSL shaders for an arbitrary GLSL compiler. The clspv variant is better for code that makes intensive use of pointers, while the GLSL generator is better if specific hardware features are used (such as hardware ray tracing acceleration). We have demonstrated our technology on several examples related to image processing and ray tracing, on which we obtain a 30-100× acceleration over a multithreaded CPU implementation.
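One of the algorithmic patterns listed above, the prefix sum, can be sketched as a data-parallel algorithm even in sequential Python: each pass combines elements one doubling stride apart (Hillis-Steele style), so an n-element scan takes about log2(n) passes. This illustrates the kind of pattern such a translator recognizes, not code it emits:

```python
def inclusive_scan(values):
    # Hillis-Steele inclusive prefix sum: every pass, element i adds the
    # element `stride` positions behind it, exactly as all GPU threads
    # would do simultaneously in one data-parallel step.
    data = list(values)
    stride = 1
    while stride < len(data):
        data = [data[i] + (data[i - stride] if i >= stride else 0)
                for i in range(len(data))]
        stride *= 2
    return data
```

On hardware the same pattern is typically emitted as a compute shader using shared memory and barriers between passes; the sequential list comprehension here models one whole pass.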
17

Golosio, Bruno, Jose Villamar, Gianmarco Tiddia, Elena Pastorelli, Jonas Stapmanns, Viviana Fanti, Pier Stanislao Paolucci, Abigail Morrison, and Johanna Senk. "Runtime Construction of Large-Scale Spiking Neuronal Network Models on GPU Devices." Applied Sciences 13, no. 17 (August 24, 2023): 9598. http://dx.doi.org/10.3390/app13179598.

Abstract:
Simulation speed matters for neuroscientific research: this includes not only how quickly the simulated model time of a large-scale spiking neuronal network progresses but also how long it takes to instantiate the network model in computer memory. On the hardware side, acceleration via highly parallel GPUs is being increasingly utilized. On the software side, code generation approaches ensure highly optimized code at the expense of repeated code regeneration and recompilation after modifications to the network model. Aiming for a greater flexibility with respect to iterative model changes, here we propose a new method for creating network connections interactively, dynamically, and directly in GPU memory through a set of commonly used high-level connection rules. We validate the simulation performance with both consumer and data center GPUs on two neuroscientifically relevant models: a cortical microcircuit of about 77,000 leaky-integrate-and-fire neuron models and 300 million static synapses, and a two-population network recurrently connected using a variety of connection rules. With our proposed ad hoc network instantiation, both network construction and simulation times are comparable or shorter than those obtained with other state-of-the-art simulation technologies while still meeting the flexibility demands of explorative network modeling.
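A high-level connection rule of the kind mentioned above can be sketched as follows; the rule name, function signature, and sequential dictionary representation are illustrative, whereas the paper builds the equivalent structure directly in GPU memory:

```python
import random

def connect_fixed_indegree(sources, targets, indegree, seed=42):
    # "Fixed indegree" connection rule: every target neuron draws
    # `indegree` distinct presynaptic sources at random. This sequential
    # sketch only illustrates the rule's semantics; a GPU simulator
    # instantiates the same connectivity in device memory.
    rng = random.Random(seed)
    return {t: rng.sample(sources, indegree) for t in targets}

# 10 target neurons, each receiving 5 connections from a 100-neuron pool.
conns = connect_fixed_indegree(range(100), range(10), indegree=5)
```

Other common rules (all-to-all, fixed total number, pairwise Bernoulli) fit the same shape: a rule name plus a few parameters expands into an explicit connection list at instantiation time.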
18

Hoffmann, Lars, Paul F. Baumeister, Zhongyin Cai, Jan Clemens, Sabine Griessbach, Gebhard Günther, Yi Heng, et al. "Massive-Parallel Trajectory Calculations version 2.2 (MPTRAC-2.2): Lagrangian transport simulations on graphics processing units (GPUs)." Geoscientific Model Development 15, no. 7 (April 5, 2022): 2731–62. http://dx.doi.org/10.5194/gmd-15-2731-2022.

Abstract:
Lagrangian models are fundamental tools to study atmospheric transport processes and for practical applications such as dispersion modeling for anthropogenic and natural emission sources. However, conducting large-scale Lagrangian transport simulations with millions of air parcels or more can become rather numerically costly. In this study, we assessed the potential of exploiting graphics processing units (GPUs) to accelerate Lagrangian transport simulations. We ported the Massive-Parallel Trajectory Calculations (MPTRAC) model to GPUs using the open accelerator (OpenACC) programming model. The trajectory calculations conducted within the MPTRAC model were fully ported to GPUs, i.e., except for feeding in the meteorological input data and for extracting the particle output data, the code operates entirely on the GPU devices without frequent data transfers between CPU and GPU memory. Model verification, performance analyses, and scaling tests of the Message Passing Interface (MPI) – Open Multi-Processing (OpenMP) – OpenACC hybrid parallelization of MPTRAC were conducted on the Jülich Wizard for European Leadership Science (JUWELS) Booster supercomputer operated by the Jülich Supercomputing Centre, Germany. The JUWELS Booster comprises 3744 NVIDIA A100 Tensor Core GPUs, providing a peak performance of 71.0 PFlop/s. As of June 2021, it is the most powerful supercomputer in Europe and listed among the most energy-efficient systems internationally. For large-scale simulations comprising 10^8 particles driven by the European Centre for Medium-Range Weather Forecasts' fifth-generation reanalysis (ERA5), the performance evaluation showed a maximum speed-up of a factor of 16 due to the utilization of GPUs compared to CPU-only runs on the JUWELS Booster. In the large-scale GPU run, about 67 % of the runtime is spent on the physics calculations, conducted on the GPUs.
Another 15 % of the runtime is required for file I/O, mostly to read the large ERA5 data set from disk. Meteorological data preprocessing on the CPUs also requires about 15 % of the runtime. Although this study identified potential for further improvements of the GPU code, we consider the MPTRAC model ready for production runs on the JUWELS Booster in its present form. The GPU code provides a much faster time to solution than the CPU code, which is particularly relevant for near-real-time applications of a Lagrangian transport model.
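The per-parcel work a Lagrangian transport code performs is, at its simplest, a time integration of positions through a wind field. The forward-Euler sketch below uses a constant wind purely for illustration; MPTRAC advects each parcel with meteorological fields interpolated from reanalysis data (e.g., ERA5):

```python
def advect(positions, wind, dt, steps):
    # Forward-Euler position update: the per-parcel kernel a Lagrangian
    # model evaluates for ~10^8 independent parcels per time step, which
    # is why the loop body maps naturally onto GPU threads.
    # Constant wind (wind[0], wind[1]) is an illustrative assumption.
    for _ in range(steps):
        positions = [(x + wind[0] * dt, y + wind[1] * dt)
                     for x, y in positions]
    return positions

# Two parcels advected for 4 steps of 0.5 time units each.
parcels = advect([(0.0, 0.0), (10.0, 5.0)], wind=(1.0, 2.0), dt=0.5, steps=4)
```

Because each parcel's update is independent, the only CPU-GPU traffic needed is loading the meteorological fields and extracting output, matching the abstract's description of a fully device-resident loop.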
19

Kiran, Utpal, Deepak Sharma, and Sachin Singh Gautam. "GPU-warp based finite element matrices generation and assembly using coloring method." Journal of Computational Design and Engineering 6, no. 4 (November 17, 2018): 705–18. http://dx.doi.org/10.1016/j.jcde.2018.11.001.

Abstract:
The finite element method has been successfully implemented on graphics processing units to achieve a significant reduction in simulation time. In this paper, new strategies for finite element matrix generation, including numerical integration and assembly, are proposed by using a warp per element for a given mesh. These strategies are developed using the well-known coloring method. The proposed strategies use a specialized algorithm to realize fine-grain parallelism and efficient use of on-chip memory resources. The warp shuffle feature of the Compute Unified Device Architecture (CUDA) is used to accelerate numerical integration. The evaluation of the elemental stiffness matrix is further optimized by adopting a partially parallel implementation of numerical integration. Performance evaluations of the proposed strategies are done for a three-dimensional elasticity problem using 8-noded hexahedral elements with three degrees of freedom per node. We obtain a speedup of up to 8.2× over the coloring-based assembly-by-element strategy (using a single thread per element) on an NVIDIA Tesla K40 GPU. The proposed strategies also achieve better arithmetic throughput and bandwidth. Highlights:
- CUDA warp-based strategies for FE matrix generation and assembly.
- Performed using the coloring method on linear hexahedral element meshes in 3D.
- Obtained speedups of 5.17× to 8.2× over the single-thread-per-element strategy on GPU.
- Strategies showed better arithmetic throughput and bandwidth through code profiling.
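The coloring method named above can be illustrated independently of GPUs: elements that share a mesh node would race when accumulating their contributions into the global matrix, so elements are grouped into colors such that no two elements of a color share a node, and each color group is assembled in parallel without conflicts. A minimal greedy coloring sketch (hypothetical helper names, not the paper's code):

```python
def color_elements(elements):
    # Greedy coloring: two elements get different colors whenever they
    # share a mesh node, so all elements of one color can write their
    # stiffness contributions concurrently without atomics.
    node_users = {}
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_users.setdefault(n, []).append(e)
    colors = {}
    for e, nodes in enumerate(elements):
        taken = {colors[o] for n in nodes for o in node_users[n] if o in colors}
        colors[e] = next(c for c in range(len(elements)) if c not in taken)
    return colors

# Two quad elements sharing an edge must land in different color groups;
# the isolated third element can reuse color 0.
mesh = [(0, 1, 4, 3), (1, 2, 5, 4), (6, 7, 8, 9)]
groups = color_elements(mesh)
```

The assembly loop then runs once per color, launching one kernel over that color's elements; the paper's contribution is mapping a whole warp (rather than a single thread) to each element within such a group.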
20

Sentyabov, A. V., A. A. Gavrilov, M. A. Krivov, A. A. Dekterev, and M. N. Pritula. "Efficiency analysis of hydrodynamic calculations on GPU and CPU clusters." Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), no. 3 (September 20, 2016): 329–38. http://dx.doi.org/10.26089/nummet.v17r331.

Abstract:
Speedup of parallel hydrodynamic calculations on clusters with CPUs and GPUs is considered. The CFD code SigmaFlow, developed by the authors and ported to GPUs by means of CUDA, is used in test calculations. The incompressible flow simulation is based on a SIMPLE-like procedure and on a discretization by the control volume method on unstructured hexahedral meshes. The performance evaluation shows a high efficiency of the new generation of GPUs for GPGPU calculations.
21

Song, Yankan, Ying Chen, Shaowei Huang, Yin Xu, Zhitong Yu, and Wei Xue. "Efficient GPU-Based Electromagnetic Transient Simulation for Power Systems With Thread-Oriented Transformation and Automatic Code Generation." IEEE Access 6 (2018): 25724–36. http://dx.doi.org/10.1109/access.2018.2833506.

22

Khalid, Muhammad Farhan, Kanzal Iman, Amna Ghafoor, Mujtaba Saboor, Ahsan Ali, Urwa Muaz, Abdul Rehman Basharat, et al. "PERCEPTRON: an open-source GPU-accelerated proteoform identification pipeline for top-down proteomics." Nucleic Acids Research 49, W1 (May 17, 2021): W510—W515. http://dx.doi.org/10.1093/nar/gkab368.

Full text
Abstract:
Abstract PERCEPTRON is a next-generation freely available web-based proteoform identification and characterization platform for top-down proteomics (TDP). The PERCEPTRON search pipeline brings together algorithms for (i) intact protein mass tuning, (ii) de novo sequence tags-based filtering, (iii) characterization of terminal as well as post-translational modifications, (iv) identification of truncated proteoforms, (v) in silico spectral comparison, and (vi) weight-based candidate protein scoring. High-throughput performance is achieved through the execution of optimized code via multiple threads in parallel, on graphics processing units (GPUs) using the NVidia Compute Unified Device Architecture (CUDA) framework. An intuitive graphical web interface allows for setting up of search parameters as well as for visualization of results. The accuracy and performance of the tool have been validated on several TDP datasets and against available TDP software. Specifically, results obtained from searching two published TDP datasets demonstrate that PERCEPTRON outperforms all other tools by up to 135% in terms of reported proteins and 10-fold in terms of runtime. In conclusion, the proposed tool significantly enhances the state-of-the-art in TDP search software and is publicly available at https://perceptron.lums.edu.pk. Users can also create in-house deployments of the tool by building code available on the GitHub repository (http://github.com/BIRL/Perceptron).
APA, Harvard, Vancouver, ISO, and other styles
23

Lessley, Brenton, Shaomeng Li, and Hank Childs. "HashFight: A Platform-Portable Hash Table for Multi-Core and Many-Core Architectures." Electronic Imaging 2020, no. 1 (January 26, 2020): 376–1. http://dx.doi.org/10.2352/issn.2470-1173.2020.1.vda-376.

Full text
Abstract:
We introduce a new platform-portable hash table and collision-resolution approach, HashFight, for use in visualization and data analysis algorithms. Designed entirely in terms of data-parallel primitives (DPPs), HashFight is atomics-free and consists of a single code base that can be invoked across a diverse range of architectures. To evaluate its hashing performance, we compare the single-node insert and query throughput of HashFight to that of two best-in-class GPU and CPU hash table implementations, using several experimental configurations and factors. Overall, HashFight maintains competitive performance across both modern and older generation GPU and CPU devices, which differ in computational and memory abilities. In particular, HashFight achieves stable performance across all hash table sizes, and has leading query throughput for the largest sets of queries, while remaining within a factor of 1.5X of the comparator GPU implementation on all smaller query sets. Moreover, HashFight performs better than the comparator CPU implementation across all configurations. Our findings reveal that our platform-agnostic implementation can perform as well as optimized, platform-specific implementations, which demonstrates the portable performance of our DPP-based design.
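The atomics-free "fight" idea can be illustrated with a toy, CPU-only sketch (hypothetical, not the authors' code): every unplaced key scatters itself into its candidate slot, the winner is whichever write survives, and losers rehash in the next round. NumPy's last-writer-wins fancy-index assignment stands in for the data-parallel scatter primitive; the hash constant and round count are arbitrary choices for the sketch.

```python
import numpy as np

EMPTY = -1

def hashfight_insert(keys, table_size, max_rounds=32):
    """Toy sketch of a HashFight-style atomics-free insertion loop."""
    table = np.full(table_size, EMPTY, dtype=np.int64)
    active = keys.copy()
    for round_no in range(max_rounds):
        if active.size == 0:
            break
        # candidate slot for this round (losers rehash each round)
        slots = (active * 2654435761 + round_no) % table_size
        free = table[slots] == EMPTY
        table[slots[free]] = active[free]  # scatter; last writer wins a slot
        won = table[slots] == active       # read back: did my write survive?
        active = active[~won]              # losers fight again next round
    return table, active

keys = np.arange(100, dtype=np.int64)
table, unplaced = hashfight_insert(keys, 512)
```

With a low load factor, a handful of rounds typically places every key; the real design replaces the Python loop body with GPU-wide data-parallel map and scatter operations.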
APA, Harvard, Vancouver, ISO, and other styles
24

He, Q., A. Rezaei, and S. Pursiainen. "Zeffiro User Interface for Electromagnetic Brain Imaging: a GPU Accelerated FEM Tool for Forward and Inverse Computations in Matlab." Neuroinformatics 18, no. 2 (October 9, 2019): 237–50. http://dx.doi.org/10.1007/s12021-019-09436-9.

Full text
Abstract:
Abstract This article introduces the Zeffiro interface (ZI) version 2.2 for brain imaging. ZI aims to provide a simple, accessible and multimodal open source platform for finite element method (FEM) based and graphics processing unit (GPU) accelerated forward and inverse computations in the Matlab environment. It allows one to (1) generate a given multi-compartment head model, (2) to evaluate a lead field matrix as well as (3) to invert and analyze a given set of measurements. GPU acceleration is applied in each of the processing stages (1)–(3). In its current configuration, ZI includes forward solvers for electro-/magnetoencephalography (EEG) and linearized electrical impedance tomography (EIT) as well as a set of inverse solvers based on the hierarchical Bayesian model (HBM). We report the results of EEG and EIT inversion tests performed with real and synthetic data, respectively, and demonstrate numerically how the inversion parameters affect the EEG inversion outcome in HBM. The GPU acceleration was found to be essential in the generation of the FE mesh and the LF matrix in order to achieve a reasonable computing time. The code package can be extended in the future based on the directions given in this article.
APA, Harvard, Vancouver, ISO, and other styles
25

Wrede, Fabian, and Herbert Kuchen. "Towards High-Performance Code Generation for Multi-GPU Clusters Based on a Domain-Specific Language for Algorithmic Skeletons." International Journal of Parallel Programming 48, no. 4 (May 22, 2020): 713–28. http://dx.doi.org/10.1007/s10766-020-00659-x.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Concas, M. "A vendor-agnostic, single code-based GPU tracking for the Inner Tracking System of the ALICE experiment." Journal of Physics: Conference Series 2438, no. 1 (February 1, 2023): 012134. http://dx.doi.org/10.1088/1742-6596/2438/1/012134.

Full text
Abstract:
Abstract During the LHC Run 3 the ALICE online computing farm will process up to 50 times more Pb-Pb events per second than in Run 2. The implied computing resource scaling requires a shift in the approach that comprises the extensive usage of Graphics Processing Units (GPU) for the processing. We will give an overview of the state of the art for the data reconstruction on GPUs in ALICE, with additional focus on the Inner Tracking System detector. A detailed teardown of the adopted techniques, implemented algorithms and approaches, together with a performance report, will be shown. Additionally, we will show how we support different GPU brands (NVIDIA and AMD) with a single code base, using automatic code translation and generation for different target architectures. Strengths and possible weaknesses of this approach will be discussed. Finally, an overview of the next steps towards an even more comprehensive usage of GPUs in ALICE software will be illustrated.
APA, Harvard, Vancouver, ISO, and other styles
27

Ikuyajolu, Olawale James, Luke Van Roekel, Steven R. Brus, Erin E. Thomas, Yi Deng, and Sarat Sreepathi. "Porting the WAVEWATCH III (v6.07) wave action source terms to GPU." Geoscientific Model Development 16, no. 4 (March 3, 2023): 1445–58. http://dx.doi.org/10.5194/gmd-16-1445-2023.

Full text
Abstract:
Abstract. Surface gravity waves play a critical role in several processes, including mixing, coastal inundation, and surface fluxes. Despite the growing literature on the importance of ocean surface waves, wind–wave processes have traditionally been excluded from Earth system models (ESMs) due to the high computational costs of running spectral wave models. The development of the Next Generation Ocean Model for the DOE's (Department of Energy) E3SM (Energy Exascale Earth System Model) Project partly focuses on the inclusion of a wave model, WAVEWATCH III (WW3), into E3SM. WW3, which was originally developed for operational wave forecasting, needs to be computationally less expensive before it can be integrated into ESMs. To accomplish this, we take advantage of heterogeneous architectures at DOE leadership computing facilities and the increasing computing power of general-purpose graphics processing units (GPUs). This paper identifies the wave action source term module, W3SRCEMD, as the most computationally intensive in WW3 and then accelerates it via GPU. Our experiments on two computing platforms, Kodiak (P100 GPU and Intel(R) Xeon(R) central processing unit, CPU, E5-2695 v4) and Summit (V100 GPU and IBM POWER9 CPU) show respective average speedups of 2× and 4× when mapping one Message Passing Interface (MPI) rank per GPU. An average speedup of 1.4× was achieved using all 42 CPU cores and 6 GPUs on a Summit node (with 7 MPI ranks per GPU). However, the GPU speedup over the 42 CPU cores remains relatively unchanged (∼ 1.3×) even when using 4 MPI ranks per GPU (24 ranks in total) and 3 MPI ranks per GPU (18 ranks in total). This corresponds to a 35 %–40 % decrease in both simulation time and usage of resources. Due to too many local scalars and arrays in the W3SRCEMD subroutine and the huge WW3 memory requirement, GPU performance is currently limited by the data transfer bandwidth between the CPU and the GPU. Ideally, OpenACC routine directives could be used to further improve performance. However, W3SRCEMD would require significant code refactoring to make this possible. We also discuss how the trade-off between the occupancy, register, and latency affects the GPU performance of WW3.
APA, Harvard, Vancouver, ISO, and other styles
28

Vasilev, Eugene, Dmitry Lachinov, Anton Grishin, and Vadim Turlapov. "Fast tetrahedral mesh generation and segmentation of an atlas-based heart model using a periodic uniform grid." Russian Journal of Numerical Analysis and Mathematical Modelling 33, no. 5 (November 27, 2018): 315–23. http://dx.doi.org/10.1515/rnam-2018-0026.

Full text
Abstract:
Abstract A fast procedure for the generation of a regular tetrahedral finite element mesh for objects with complex-shape cavities is proposed. The procedure, like LBIE-Mesher, can generate tetrahedral meshes for the volume interior to a polygonal surface, or for an interval volume between two surfaces having a complex shape and defined in the STL format. The procedure consists of several stages: generation of a regular tetrahedral mesh that fills the volume of the required object; generation of clipping of the uniform grid parts by a boundary surface; and shifting the vertices of the boundary layer to align onto the surface. We present sequential and parallel implementations of the algorithm and compare their performance with existing generators of tetrahedral grids such as TetGen, NETGEN, and CGAL. The current version of the algorithm using a mobile GPU is about 5 times faster than NETGEN. The source code of the developed software is available on GitHub.
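The first stage (filling a volume with a regular tetrahedral mesh from a uniform grid) can be sketched with the classic Kuhn/Freudenthal split, which decomposes each grid cube into six tetrahedra. This is a standard textbook construction offered only as an illustration, not the authors' implementation:

```python
import numpy as np
from itertools import permutations

def kuhn_tets(origin):
    # Split the unit cube at `origin` into 6 tetrahedra: one tet per
    # permutation of the axis order walked from corner (0,0,0) to (1,1,1).
    tets = []
    for perm in permutations(range(3)):
        verts = [np.array(origin, dtype=float)]
        p = np.array(origin, dtype=float)
        for axis in perm:
            p = p.copy()
            p[axis] += 1.0
            verts.append(p)
        tets.append(verts)
    return tets

def tet_volume(v):
    # volume = |det of the three edge vectors| / 6
    return abs(np.linalg.det(np.stack([v[1] - v[0],
                                       v[2] - v[0],
                                       v[3] - v[0]]))) / 6.0

# Fill a 3x3x3 block of unit cubes and verify the tetrahedra tile it exactly.
total = sum(tet_volume(t)
            for i in range(3) for j in range(3) for k in range(3)
            for t in kuhn_tets((i, j, k)))
```

Each cube yields six tets of volume 1/6, so the summed volume equals the box volume; a mesh generator would additionally deduplicate shared vertices and then clip and project the boundary layer onto the target surface.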
APA, Harvard, Vancouver, ISO, and other styles
29

Moscibrodzka, Monika A., and Aristomenis I. Yfantis. "Prospects for Ray-tracing Light Intensity and Polarization in Models of Accreting Compact Objects Using a GPU." Astrophysical Journal Supplement Series 265, no. 1 (March 1, 2023): 22. http://dx.doi.org/10.3847/1538-4365/acb6f9.

Full text
Abstract:
Abstract The Event Horizon Telescope (EHT) has recently released high-resolution images of accretion flows onto two supermassive black holes. Our physical understanding of these images depends on the accuracy and precision of numerical models of plasma and radiation around compact objects. The goal of this work is to speed up radiative-transfer simulations used to create mock images of black holes for comparison with the EHT observations. A ray-tracing code for general relativistic and fully polarized radiative transfer through plasma in strong gravity is ported onto a graphics processing unit (GPU). We describe our GPU implementation and carry out speedup tests using models of optically thin advection-dominated accretion flow onto a black hole realized semianalytically and in 3D general relativistic magnetohydrodynamic simulations, low and very high image pixel resolutions, and two different sets of CPU+GPUs. We show that a GPU with high double precision computing capability can significantly reduce the image production computational time, with a speedup factor of up to approximately 1200. The significant speedup facilitates, e.g., dynamic model fitting to the EHT data, including polarimetric data. The method extension may enable studies of emission from plasma with nonthermal particle distribution functions for which accurate approximate synchrotron emissivities are not available. The significant speedup reduces the carbon footprint of the generation of the EHT image libraries by at least an order of magnitude.
APA, Harvard, Vancouver, ISO, and other styles
30

Torky, Ahmed A., and Youssef F. Rashed. "High-performance practical stiffness analysis of high-rise buildings using superfloor elements." Journal of Computational Design and Engineering 7, no. 2 (April 1, 2020): 211–27. http://dx.doi.org/10.1093/jcde/qwaa018.

Full text
Abstract:
Abstract This study develops a high-performance computing method using OpenACC (Open Accelerator) for the stiffness matrix and load vector generation of shear-deformable plates in bending using the boundary element method on parallel processors. The boundary element formulation for plates in bending is used to derive fully populated displacement-based stiffness matrices and load vectors at degrees of freedom of interest. The computed stiffness matrix of the plate is defined as a single superfloor element and can be solved using stiffness analysis, $Ku = F$, instead of the conventional boundary element method, $Hu = Gt$. Fortran OpenACC code implementations are proposed for the computation of the superfloor element’s stiffness, which includes one serial computing code for the CPU (central processing unit) and two parallel computing codes for the GPU (graphics processing unit) and multicore CPU. As industrial level practical floors are full of supports and geometrical information, the computation time of superfloor elements is reduced dramatically when computing on parallel processors. It is demonstrated that the OpenACC implementation does not affect numerical accuracy. The feasibility and accuracy are confirmed by numerical examples that include real buildings with industrial level structural floors. Engineering computations for massive floors with immense geometrical detail and a multitude of load cases can be modeled as is without the need for simplification.
APA, Harvard, Vancouver, ISO, and other styles
31

Frontiere, Nicholas, J. D. Emberson, Michael Buehlmann, Joseph Adamo, Salman Habib, Katrin Heitmann, and Claude-André Faucher-Giguère. "Simulating Hydrodynamics in Cosmology with CRK-HACC." Astrophysical Journal Supplement Series 264, no. 2 (January 24, 2023): 34. http://dx.doi.org/10.3847/1538-4365/aca58d.

Full text
Abstract:
Abstract We introduce CRK-HACC, an extension of the Hardware/Hybrid Accelerated Cosmology Code (HACC), to resolve gas hydrodynamics in large-scale structure formation simulations of the universe. The new framework couples the HACC gravitational N-body solver with a modern smoothed-particle hydrodynamics (SPH) approach called conservative reproducing kernel SPH (CRKSPH). CRKSPH utilizes smoothing functions that exactly interpolate linear fields while manifestly preserving conservation laws (momentum, mass, and energy). The CRKSPH method has been incorporated to accurately model baryonic effects in cosmology simulations—an important addition targeting the generation of precise synthetic sky predictions for upcoming observational surveys. CRK-HACC inherits the codesign strategies of the HACC solver and is built to run on modern GPU-accelerated supercomputers. In this work, we summarize the primary solver components and present a number of standard validation tests to demonstrate code accuracy, including idealized hydrodynamic and cosmological setups, as well as self-similarity measurements.
APA, Harvard, Vancouver, ISO, and other styles
32

Cecilia, José M., Juan-Carlos Cano, Juan Morales-García, Antonio Llanes, and Baldomero Imbernón. "Evaluation of Clustering Algorithms on GPU-Based Edge Computing Platforms." Sensors 20, no. 21 (November 6, 2020): 6335. http://dx.doi.org/10.3390/s20216335.

Full text
Abstract:
Internet of Things (IoT) is becoming a new socioeconomic revolution in which data and immediacy are the main ingredients. IoT generates large datasets on a daily basis, but they are currently considered "dark data", i.e., data generated but never analyzed. The efficient analysis of this data is mandatory to create intelligent applications for the next generation of IoT applications that benefit society. Artificial Intelligence (AI) techniques are very well suited to identifying hidden patterns and correlations in this data deluge. In particular, clustering algorithms are of the utmost importance for performing exploratory data analysis to identify a set (a.k.a., cluster) of similar objects. Clustering algorithms are computationally heavy workloads and need to be executed on high-performance computing (HPC) clusters, especially to deal with large datasets. This execution on HPC infrastructures is an energy-hungry procedure with additional issues, such as high-latency communications or privacy. Edge computing is a paradigm to enable light-weight computations at the edge of the network that has been proposed recently to solve these issues. In this paper, we provide an in-depth analysis of emergent edge computing architectures that include low-power Graphics Processing Units (GPUs) to speed up these workloads. Our analysis includes performance and power consumption figures of the latest Nvidia AGX Xavier to compare the energy-performance ratio of these low-cost platforms with a high-performance cloud-based counterpart version. Three different clustering algorithms (i.e., k-means, Fuzzy Minimals (FM), and Fuzzy C-Means (FCM)) are designed to be optimally executed on edge and cloud platforms, showing a speed-up factor of up to 11× for the GPU code compared to sequential counterpart versions in the edge platforms and energy savings of up to 150% between the edge computing and HPC platforms.
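For concreteness, the distance-assignment step that dominates k-means, and that GPU ports of such algorithms parallelize, can be sketched as a minimal Lloyd iteration in NumPy. This is a generic textbook version, not the paper's optimized edge/cloud code; the blob data and seed are illustrative:

```python
import numpy as np

def kmeans(X, init_centers, iters=10):
    centers = init_centers.astype(float).copy()
    for _ in range(iters):
        # assignment step: distance of every point to every center --
        # the embarrassingly parallel hot loop that maps well to GPUs
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute each center as the mean of its points
        centers = np.stack([X[labels == c].mean(axis=0)
                            for c in range(len(centers))])
    return centers, labels

rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(50, 2))   # cluster near (0, 0)
blob_b = rng.normal(10.0, 0.1, size=(50, 2))  # cluster near (10, 10)
X = np.vstack([blob_a, blob_b])
centers, labels = kmeans(X, init_centers=X[[0, -1]])  # one seed per blob
```

On a GPU, the N×k distance matrix and the argmin reduction are computed by one thread (or warp) per point, which is where the reported 11× speed-up over the sequential version comes from.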
APA, Harvard, Vancouver, ISO, and other styles
33

Winter, Robin, Joren Retel, Frank Noé, Djork-Arné Clevert, and Andreas Steffen. "grünifai: interactive multiparameter optimization of molecules in a continuous vector space." Bioinformatics 36, no. 13 (May 5, 2020): 4093–94. http://dx.doi.org/10.1093/bioinformatics/btaa271.

Full text
Abstract:
Abstract Summary Optimizing small molecules in a drug discovery project is a notoriously difficult task as multiple molecular properties have to be considered and balanced at the same time. In this work, we present our novel interactive in silico compound optimization platform termed grünifai to support the ideation of the next generation of compounds under the constraints of a multiparameter objective. grünifai integrates adjustable in silico models, a continuous representation of the chemical space, a scalable particle swarm optimization algorithm and the possibility to actively steer the compound optimization through providing feedback on generated intermediate structures. Availability and implementation Source code and documentation are freely available under an MIT license and are openly available on GitHub (https://github.com/jrwnter/gruenifai). The backend, including the optimization method and distribution on multiple GPU nodes is written in Python 3. The frontend is written in ReactJS.
APA, Harvard, Vancouver, ISO, and other styles
34

Lehmann, Moritz. "Esoteric Pull and Esoteric Push: Two Simple In-Place Streaming Schemes for the Lattice Boltzmann Method on GPUs." Computation 10, no. 6 (June 2, 2022): 92. http://dx.doi.org/10.3390/computation10060092.

Full text
Abstract:
I present two novel thread-safe in-place streaming schemes for the lattice Boltzmann method (LBM) on graphics processing units (GPUs), termed Esoteric Pull and Esoteric Push, that result in the LBM only requiring one copy of the density distribution functions (DDFs) instead of two, greatly reducing memory demand. These build upon the idea of the existing Esoteric Twist scheme, to stream half of the DDFs at the end of one stream-collide kernel and the remaining half at the beginning of the next and offer the same beneficial properties over the AA-Pattern scheme—reduced memory bandwidth due to implicit bounce-back boundaries and the possibility of swapping pointers between even and odd time steps. However, the streaming directions are chosen in a way that allows the algorithm to be implemented in about one tenth the amount of code, as two simple loops, and is compatible with all velocity sets and suitable for automatic code-generation. The performance of the new streaming schemes is slightly increased over Esoteric Twist due to better memory coalescence. Benchmarks across a large variety of GPUs and CPUs show that for most dedicated GPUs, performance differs only insignificantly from the One-Step Pull scheme; however, for integrated GPUs and CPUs, performance is significantly improved. The two proposed algorithms greatly facilitate modifying existing code to allow for in-place streaming, even with extensions already in place, such as was demonstrated for the Free Surface LBM implementation FluidX3D. Their simplicity, together with their ideal performance characteristics, may enable more widespread adoption of in-place streaming across LBM GPU code.
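The in-place streaming idea behind this family of schemes can be illustrated in 1D with the classic swap trick, a much-simplified relative of Esoteric Twist/Pull/Push and not the paper's 3D algorithm: one pass of pairwise swaps streams both populations using a single copy of the data, with each value landing in the opposite-direction slot of its destination cell, which the next collision step simply reads swapped.

```python
import numpy as np

def swap_stream(fp, fm):
    # One in-place pass over a periodic 1D lattice: swap the (+1)-moving
    # population at cell i with the (-1)-moving population at cell i+1.
    # After the pass, every value sits in its streamed destination cell
    # but in the opposite-direction slot, so only ONE copy of the
    # distribution functions is ever stored.
    n = len(fp)
    for i in range(n):
        j = (i + 1) % n
        fp[i], fm[j] = fm[j], fp[i]

fp = np.arange(6, dtype=float)       # values moving right (+1)
fm = np.arange(6, 12, dtype=float)   # values moving left (-1)
fp0, fm0 = fp.copy(), fm.copy()
swap_stream(fp, fm)
# fm now holds the right-moving values shifted by +1,
# fp holds the left-moving values shifted by -1
```

The GPU schemes discussed in the paper achieve the same one-copy property without the sequential dependency of this toy loop, by fixing which half of the directions each thread pulls and which half it pushes on even and odd time steps.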
APA, Harvard, Vancouver, ISO, and other styles
35

Grosser, Tobias, Sven Verdoolaege, Albert Cohen, and P. Sadayappan. "The Relation Between Diamond Tiling and Hexagonal Tiling." Parallel Processing Letters 24, no. 03 (September 2014): 1441002. http://dx.doi.org/10.1142/s0129626414410023.

Full text
Abstract:
Iterative stencil computations are important in scientific computing and increasingly also in the embedded and mobile domains. Recent publications have shown that tiling schemes that ensure concurrent start provide efficient ways to execute these kernels. Diamond tiling and hybrid-hexagonal tiling are two tiling schemes that enable concurrent start. Both have different advantages: diamond tiling has been integrated in a general purpose optimization framework and uses a cost function to choose among tiling hyperplanes, whereas the greater flexibility with tile sizes for hybrid-hexagonal tiling has been exploited for effective generation of GPU code. In this paper we undertake a comparative study of these two tiling approaches and propose a hybrid approach that combines them. We analyze the effects of tile size and wavefront choices on tile-level parallelism, and formulate constraints for optimal diamond tile shapes. We then extend, for the case of two dimensions, the diamond tiling formulation into a hexagonal tiling one, which offers both the flexibility of hexagonal tiling and the generality of the original diamond tiling implementation. We also show how to compute tile sizes that maximize the compute-to-communication ratio, and apply this result to compare the best achievable ratio and the associated synchronization overhead for diamond and hexagonal tiling.
APA, Harvard, Vancouver, ISO, and other styles
36

Bloch, Aurelien, Simone Casale-Brunet, and Marco Mattavelli. "Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution." Journal of Low Power Electronics and Applications 12, no. 3 (June 23, 2022): 36. http://dx.doi.org/10.3390/jlpea12030036.

Full text
Abstract:
The performance of programs executed on heterogeneous parallel platforms largely depends on the design choices regarding how to partition the processing on the various different processing units. In other words, it depends on the assumptions and parameters that define the partitioning, mapping, scheduling, and allocation of data exchanges among the various processing elements of the platform executing the program. The advantage of programs written in languages using the dataflow model of computation (MoC) is that executing the program with different configurations and parameter settings does not require rewriting the application software for each configuration setting, but only requires generating a new synthesis of the execution code corresponding to different parameters. The synthesis stage of dataflow programs is usually supported by automatic code generation tools. Another competitive advantage of dataflow software methodologies is that they are well-suited to support designs on heterogeneous parallel systems as they are inherently free of memory access contention issues and naturally expose the available intrinsic parallelism. So as to fully exploit these advantages and to be able to efficiently search the configuration space to find the design points that better satisfy the desired design constraints, it is necessary to develop tools and associated methodologies capable of evaluating the performance of different configurations and of driving the search for good design configurations according to the desired performance criteria. The number of possible design assumptions and associated parameter settings is usually so large (i.e., the dimensions and size of the design space) that intuition as well as trial and error are clearly unfeasible, inefficient approaches. This paper describes a method for the clock-accurate profiling of software applications developed using the dataflow programming paradigm, such as programs written in the formal RVL-CAL language. The profiling can be applied once the application program has been compiled and executed on GPU/CPU heterogeneous hardware platforms, utilizing two main methodologies, denoted as static and dynamic. This paper also describes how a method for the qualitative evaluation of the performance of such programs as a function of the supplied configuration parameters can be successfully applied to heterogeneous platforms. The technique was illustrated using two different application software examples and several design points.
APA, Harvard, Vancouver, ISO, and other styles
37

Abdelfattah, A., H. Anzt, J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki, and A. YarKhan. "Linear algebra software for large-scale accelerated multicore computing." Acta Numerica 25 (May 1, 2016): 1–160. http://dx.doi.org/10.1017/s0962492916000015.

Full text
Abstract:
Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next-generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
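Of the three technologies named at the end, mixed-precision arithmetic is the easiest to sketch: solve in float32 (fast on accelerators), then recover double-precision accuracy by iteratively refining with double-precision residuals. A minimal NumPy illustration of the idea, not the PLASMA/MAGMA implementation (which reuses a single low-precision LU factorization rather than re-solving):

```python
import numpy as np

def solve_mixed_precision(A, b, refinements=5):
    A32 = A.astype(np.float32)
    # low-precision solve stands in for a fast float32 factorization
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(refinements):
        r = b - A @ x                                   # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned system
x_true = rng.standard_normal(n)
b = A @ x_true
x = solve_mixed_precision(A, b)
```

For well-conditioned matrices, each refinement step shrinks the error by roughly the single-precision unit roundoff times the condition number, so a few iterations reach near-double-precision accuracy while most flops run in the cheaper precision.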
APA, Harvard, Vancouver, ISO, and other styles
38

Liu, Weihuang, Xiaodong Cun, Chi-Man Pun, Menghan Xia, Yong Zhang, and Jue Wang. "CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (June 26, 2023): 1746–54. http://dx.doi.org/10.1609/aaai.v37i2.25263.

Full text
Abstract:
Image inpainting aims to fill the missing hole of the input. It is hard to solve this task efficiently when facing high-resolution images due to two reasons: (1) Large reception field needs to be handled for high-resolution image inpainting. (2) The general encoder and decoder network synthesizes many background pixels synchronously due to the form of the image matrix. In this paper, we try to break the above limitations for the first time thanks to the recent development of continuous implicit representation. In detail, we down-sample and encode the degraded image to produce the spatial-adaptive parameters for each spatial patch via an attentional Fast Fourier Convolution (FFC)-based parameter generation network. Then, we take these parameters as the weights and biases of a series of multi-layer perceptron (MLP), where the input is the encoded continuous coordinates and the output is the synthesized color value. Thanks to the proposed structure, we only encode the high-resolution image in a relatively low resolution for larger reception field capturing. Then, the continuous position encoding will be helpful to synthesize the photo-realistic high-frequency textures by re-sampling the coordinate in a higher resolution. Also, our framework enables us to query the coordinates of missing pixels only in parallel, yielding a more efficient solution than the previous methods. Experiments show that the proposed method achieves real-time performance on the 2048×2048 images using a single GTX 2080 Ti GPU and can handle 4096×4096 images, with much better performance than existing state-of-the-art methods visually and numerically. The code is available at: https://github.com/NiFangBaAGe/CoordFill.
APA, Harvard, Vancouver, ISO, and other styles
39

Thi Nga, Nguyen, Ha Thi Thu, Nguyen Thi Hoa, Vu Thi Hien, Nguyen Thu Trang, Nguyen Thanh Ba, Tran Van Khanh, et al. "Assessment of the genetic changes of the attenuated Hanvet1.vn strain compared with original virulent 02HY strain of the porcine reproductive and respiratory syndrome virus." Vietnam Journal of Biotechnology 20, no. 2 (June 30, 2022): 245–52. http://dx.doi.org/10.15625/1811-4989/16677.

Full text
Abstract:
The attenuated porcine reproductive and respiratory syndrome virus (PRRSV) strain Hanvet1.vn was developed by Hanvet Pharmaceutical Co., Ltd. by inoculating the virulent strain 02HY on Marc-145 cells for 80 generations, and it is used to produce a PRRS vaccine. In this study, we published the results of sequencing, analyzing and comparing the genome of the attenuated PRRSV strain Hanvet1.vn with that of the original pathogenic strain 02HY. The genomes of strains Hanvet1.vn and 02HY have 8 reading frames, coding for 8 non-structural and structural proteins: NSP1a, NSP1b, GP2, GP3, GP4, GP5, MP, NP. After sequencing and translating into proteins, the gene sequence of each open reading frame (ORF) of strain Hanvet1.vn was compared with the sequence of pathogenic strain 02HY to find nucleotide and amino acid changes. The results showed that the attenuated strain Hanvet1.vn genome (GenBank Accession KU842720), when compared with the pathogenic strain 02HY genome (Submission 2490633), had 89 nucleotide mutations that changed 51 amino acids across 7 ORFs and the 7 corresponding proteins. Notably, ORF6, encoding the M protein, is completely unchanged. The size of each reading frame is also exactly the same, showing that there were no insertion and deletion (indel) mutations in the ORFs of the attenuated strain after 80 generations of inoculation. There were changes in the genome that made the strain Hanvet1.vn become attenuated, but the gene encoding the GP5 protein, which induces the production of neutralizing antibodies, changed at only two nucleotide positions. The first, at position 471 (A->G), turns a TCA codon into a TCG codon; this is a silent mutation, and both codons code for the amino acid Serine (S). The second mutation, at position 587 (A->T), causes Glutamine (Q) to change into Leucine (L). However, this modification does not belong to the GP5 antigenic epitopes. In conclusion, after 80 passages, although changes occurred in the genes of the Hanvet1.vn strain as it became attenuated, the GP5 protein of the attenuated strain did not change its antigenic amino acids.
APA, Harvard, Vancouver, ISO, and other styles
40

Harris-Dewey, Jared, and Richard Klein. "Generative Adversarial Networks for Non-Raytraced Global Illumination on Older GPU Hardware." International Journal of Electronics and Electrical Engineering 10, no. 1 (March 2022): 1–6. http://dx.doi.org/10.18178/ijeee.10.1.1-6.

Full text
Abstract:
We give an overview of the different rendering methods and demonstrate that using a Generative Adversarial Network (GAN) for Global Illumination (GI) gives a superior-quality rendered image compared to a rasterised image. We utilise the Pix2Pix architecture and specify the hyper-parameters and methodology used to mimic ray-traced images from a set of input features. We also demonstrate that the GAN's quality is comparable to the quality of the ray-traced images, but that it is able to produce the image in a fraction of the time. Source Code: https://github.com/Jaredrhd/Global-Illumination-using-Pix2Pix-GAN
APA, Harvard, Vancouver, ISO, and other styles
41

Govett, Mark, Jim Rosinski, Jacques Middlecoff, Tom Henderson, Jin Lee, Alexander MacDonald, Ning Wang, Paul Madden, Julie Schramm, and Antonio Duarte. "Parallelization and Performance of the NIM Weather Model on CPU, GPU, and MIC Processors." Bulletin of the American Meteorological Society 98, no. 10 (October 1, 2017): 2201–13. http://dx.doi.org/10.1175/bams-d-15-00278.1.

Full text
Abstract:
The design and performance of the Non-Hydrostatic Icosahedral Model (NIM) global weather prediction model is described. NIM is a dynamical core designed to run on central processing unit (CPU), graphics processing unit (GPU), and Many Integrated Core (MIC) processors. It demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes and has been an effective way to make comparisons between traditional CPU and emerging fine-grain processors. The design of the NIM also serves as a useful guide in the fine-grain parallelization of the finite volume cubed (FV3) model recently chosen by the National Weather Service (NWS) to become its next operational global weather prediction model. This paper describes the code structure and parallelization of NIM using standards-compliant open multiprocessing (OpenMP) and open accelerator (OpenACC) directives. NIM uses the directives to support a single, performance-portable code that runs on CPU, GPU, and MIC systems. Performance results are compared for five generations of computer chips including the recently released Intel Knights Landing and NVIDIA Pascal chips. Single and multinode performance and scalability are also shown, along with a cost–benefit comparison based on vendor list prices.
APA, Harvard, Vancouver, ISO, and other styles
42

Holm, Håvard H., André R. Brodtkorb, and Martin L. Sætra. "GPU Computing with Python: Performance, Energy Efficiency and Usability." Computation 8, no. 1 (January 9, 2020): 4. http://dx.doi.org/10.3390/computation8010004.

Full text
Abstract:
In this work, we examine the performance, energy efficiency, and usability when using Python for developing high-performance computing codes running on the graphics processing unit (GPU). We investigate the portability of performance and energy efficiency between Compute Unified Device Architecture (CUDA) and Open Compute Language (OpenCL); between GPU generations; and between low-end, mid-range, and high-end GPUs. Our findings showed that the impact of using Python is negligible for our applications, and furthermore, CUDA and OpenCL applications tuned to an equivalent level can in many cases obtain the same computational performance. Our experiments showed that performance in general varies more between different GPUs than between using CUDA and OpenCL. We also show that tuning for performance is a good way of tuning for energy efficiency, but that specific tuning is needed to obtain optimal energy efficiency.
APA, Harvard, Vancouver, ISO, and other styles
43

Fang, Jian Wen, Jin Hui Yu, Shuang Xia Han, and Peng Wang. "Real-Time Ocean Water Animation in Cartoon Style." Key Engineering Materials 474-476 (April 2011): 2320–24. http://dx.doi.org/10.4028/www.scientific.net/kem.474-476.2320.

Full text
Abstract:
This paper presents a model for automatically generating 3D cartoon ocean water animations in real time. The dynamic ocean water surface is modeled by a spectral method. The cartoon rendering process is implemented in multiple passes on the GPU: first, we encode the normals of the ocean model and generate a normal map; next, we extract discontinuities from the normal map and smooth them into an edge map; finally, we combine the edge map with cartoon shading based on projective texture mapping. Experimental results demonstrate the visual quality and efficiency of the presented model.
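The passes described in this abstract can be imitated on the CPU. The sketch below is a hypothetical NumPy illustration of the core idea only, not the paper's GPU shaders: quantise diffuse lighting into flat cartoon bands, and mark edge pixels where the encoded normals change abruptly.

```python
import numpy as np

def toon_shade(ndotl, bands=3):
    # Quantise the diffuse term (N·L) into a few flat cartoon tones.
    return np.floor(np.clip(ndotl, 0.0, 1.0) * bands) / bands

def edge_map(normals, threshold=0.5):
    # Mark pixels where neighbouring encoded normals differ sharply,
    # mimicking the discontinuity-extraction pass over the normal map.
    dx = np.abs(np.diff(normals, axis=1, prepend=normals[:, :1]))
    dy = np.abs(np.diff(normals, axis=0, prepend=normals[:1, :]))
    return ((dx + dy) > threshold).astype(np.uint8)
```

In a real renderer these would run as fragment-shader passes over textures; here plain arrays stand in for the normal map and the lighting term.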
APA, Harvard, Vancouver, ISO, and other styles
44

Rojek, Krzysztof, Kamil Halbiniak, and Lukasz Kuczynski. "CFD code adaptation to the FPGA architecture." International Journal of High Performance Computing Applications 35, no. 1 (November 10, 2020): 33–46. http://dx.doi.org/10.1177/1094342020972461.

Full text
Abstract:
In recent years, we have observed the intensive development of accelerated computing platforms. Although current trends indicate a well-established position for GPU devices in the HPC environment, the FPGA (Field-Programmable Gate Array) aspires to be an alternative solution for offloading CPU computation. This paper presents a systematic adaptation of four various CFD (Computational Fluid Dynamics) kernels to the Xilinx Alveo U250 FPGA. The goal of this paper is to investigate the potential of the FPGA architecture as a future infrastructure able to provide the most complex numerical simulations in the area of fluid flow modeling. The selected kernels are customized to a real scientific scenario, compatible with the EULAG (Eulerian/semi-Lagrangian) fluid solver. The solver is used to simulate thermo-fluid flows across a wide range of scales and is extensively used in numerical weather prediction. The proposed adaptation is focused on the analysis of the strengths and weaknesses of the FPGA accelerator, considering performance and energy efficiency. The proposed adaptation is compared with a CPU implementation that was strongly optimized to provide realistic and objective benchmarks. The performance results are compared with a set of server CPUs spanning various Intel generations, including Intel Skylake-based CPUs such as the Xeon Gold 6148 and Xeon Platinum 8168, as well as the Intel Xeon E5-2695 CPU based on the Ivy Bridge architecture. Since all the kernels belong to the group of memory-bound algorithms, our main challenge is to saturate global memory bandwidth and provide data locality with intensive BRAM (Block RAM) reuse. Our adaptation allows us to improve the performance per watt by up to 80% compared to the CPUs.
APA, Harvard, Vancouver, ISO, and other styles
45

Liu, Yongjiu, Hao Gao, Qingyi Gu, Tadayoshi Aoyama, Takeshi Takaki, and Idaku Ishii. "High-Frame-Rate Structured Light 3-D Vision for Fast Moving Objects." Journal of Robotics and Mechatronics 26, no. 3 (June 20, 2014): 311–20. http://dx.doi.org/10.20965/jrm.2014.p0311.

Full text
Abstract:
[Figure: HFR 3D vision system] This paper presents a fast motion-compensated structured-light vision system that realizes 3-D shape measurement at 500 fps using a high-frame-rate camera-projector system. Multiple light patterns with an 8-bit Gray code are projected on the measured scene at 1000 fps and processed in real time to generate 512 × 512 depth images at 500 fps, using the parallel processing of a motion-compensated structured-light method on a GPU board. Several experiments were performed on fast-moving 3-D objects using the proposed method.
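The 8-bit Gray code used for the projected patterns has the property that adjacent code values differ in exactly one bit, so a small decoding error at a depth boundary corrupts at most one bit-plane. A minimal sketch of binary-reflected Gray encoding and decoding (an illustration of the coding scheme, not the authors' GPU implementation):

```python
def gray_encode(n: int) -> int:
    # Binary-reflected Gray code: consecutive integers differ in one bit.
    return n ^ (n >> 1)

def gray_decode(g: int) -> int:
    # Invert the encoding by folding each bit down into the lower bits.
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

In a structured-light decoder, the 8 thresholded bit-planes observed at a camera pixel form one 8-bit Gray value, which `gray_decode` maps back to a projector column index.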
APA, Harvard, Vancouver, ISO, and other styles
46

Bentley, Phillip. "Accurate Simulation of Neutrons in Less Than One Minute Pt. 2: Sandman—GPU-Accelerated Adjoint Monte-Carlo Sampled Acceptance Diagrams." Quantum Beam Science 4, no. 2 (June 16, 2020): 24. http://dx.doi.org/10.3390/qubs4020024.

Full text
Abstract:
A computational method in the modelling of neutron beams is described that blends neutron acceptance diagrams, GPU-based Monte-Carlo sampling, and a Bayesian approach to efficiency. The resulting code reaches orders of magnitude improvement in performance relative to existing methods. For example, data rates similar to world-leading, real instruments can be achieved on a 2017 laptop, generating 10^6 neutrons per second at the sample position of a high-resolution small angle scattering instrument. The method is benchmarked, and is shown to be in agreement with previous work. Finally, the method is demonstrated on a mature instrument design, where a sub-second turnaround in an interactive simulation process allows the rapid exploration of a wide range of options. This results in a doubling of the design performance, at the same time as reducing the hardware cost by 40%.
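As a toy illustration of the Monte-Carlo sampling idea underlying such codes, a plain CPU rejection sampler is sketched below; it is only a miniature of the acceptance concept, nothing like the paper's GPU-accelerated adjoint method:

```python
import random

def sample_accepted(pdf, pdf_max, lo, hi, n, seed=0):
    # Toy rejection sampler: draw candidates uniformly on [lo, hi] and
    # keep each with probability pdf(x) / pdf_max.
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi)
        if rng.random() * pdf_max <= pdf(x):
            out.append(x)
    return out
```

The accepted fraction falls as the distribution gets peakier, which is why production codes replace blind rejection with smarter sampling such as the acceptance-diagram approach described above.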
APA, Harvard, Vancouver, ISO, and other styles
47

ZEIBDAWI, Abed R., Jean E. GRUNDY, Bogna LASIA, and Edward L. G. PRYZDIAL. "Coagulation factor Va Glu-96-Asp-111: a chelator-sensitive site involved in function and subunit association." Biochemical Journal 377, no. 1 (January 1, 2004): 141–48. http://dx.doi.org/10.1042/bj20031205.

Full text
Abstract:
Coagulation FVa (factor Va) accelerates the essential generation of thrombin by FXa (factor Xa). Although the noncovalent Ca2+-dependent association between the FVa light and heavy subunits (FVaL and FVaH) is required for function, little is known about the specific residues involved. Previous fragmentation studies and homology modelling led us to investigate the contribution of Leu-94–Asp-112. Including prospective divalent cation-binding acidic amino acids, nine conserved residues were individually replaced with Ala in the recombinant B-domainless FVa precursor (ΔFV). While mutation of Thr-104, Glu-108, Asp-112 or Tyr-100 resulted in only minor changes to FXa-mediated thrombin generation, the functions of E96A (81%), D111A (70%) and D102A (60%) mutants (where the single-letter amino acid code is used) were notably reduced. The mutants targeting neighbouring acidic residues, Asp-79 and Glu-119, had activity comparable with ΔFV, supporting the specific involvement of select residues. Providing a basis for reduced activity, thrombin treatment of D111A resulted in spontaneous dissociation of subunits. Since FVaH and FVaL derived from E96A or D102A remained associated in the presence of Ca2+, like the wild type, but conversely dissociated rapidly upon chelation, a subtle difference in divalent cation co-ordination is implied. Subunit interactions for all other single-point mutants resembled the wild type. These data, along with corroborating multipoint mutants, reveal Asp-111 as essential for FVa subunit association. Although Glu-96 and Asp-102 can be mutated without gross changes to divalent cation-dependent FVaH–FVaL interactions, they too are required for optimal function. Thus Glu-96–Asp-111 imparts at least two discernible effects on FVa coagulation activity.
APA, Harvard, Vancouver, ISO, and other styles
48

Bartels, David W., and William D. Hutchison. "Microbial Control of First-Generation Ecb on Whorl Stage Corn, 1990." Insecticide and Acaricide Tests 16, no. 1 (January 1, 1991): 72. http://dx.doi.org/10.1093/iat/16.1.72.

Full text
Abstract:
Abstract This experiment was conducted at the Rosemount Agricultural Experiment Station, Rosemount, Minn. 'Green Giant Code 40' sweet corn was planted on 7 Jun. Plots consisted of a single 30-ft row on 36-inch centers and were arranged in a randomized complete block design with 4 replications. Plots were infested with first instar ECB larvae on 24 Jun to simulate heavy pest pressure. Approximately 50 larvae were placed into each whorl with a “bazooka” applicator. The corn was 20—25 inches tall with no tassels visible from above. Insecticide applications were delayed by heavy rains (2.15 inches) until 30 Jun. At the time of application, the percent of larvae in each of the following instars was I, 9.6%; early II, 42.3%; and late II, 48.1%. Granular formulations were banded (7-inch) over the whorl with a battery powered Gandy applicator mounted on a two-wheeled, hand pushed frame. Liquid formulations were applied using an R&D CO2-pressurized (35 psi) backpack sprayer. A single, hand held nozzle (LF-3) delivering 20 GPA was used to direct the spray over the whorl. Evaluations of the number of live larvae found on the entire plant were made 21 Aug.
APA, Harvard, Vancouver, ISO, and other styles
49

Kuśmirek, Wiktor, Wiktor Franus, and Robert Nowak. "Linking De Novo Assembly Results with Long DNA Reads Using the dnaasm-link Application." BioMed Research International 2019 (April 11, 2019): 1–10. http://dx.doi.org/10.1155/2019/7847064.

Full text
Abstract:
Currently, third-generation sequencing techniques, which make it possible to obtain much longer DNA reads compared to the next-generation sequencing technologies, are becoming more and more popular. There are many possibilities for combining data from next-generation and third-generation sequencing. Herein, we present a new application called dnaasm-link for linking contigs, the result of de novo assembly of second-generation sequencing data, with long DNA reads. Our tool includes an integrated module to fill gaps with a suitable fragment of an appropriate long DNA read, which improves the consistency of the resulting DNA sequences. This feature is very important, in particular for complex DNA regions. Our implementation is found to outperform other state-of-the-art tools in terms of speed and memory requirements, which may enable its usage for organisms with a large genome, something which is not possible in existing applications. The presented application has many advantages: (i) it significantly optimizes memory and reduces computation time; (ii) it fills gaps with an appropriate fragment of a specified long DNA read; (iii) it reduces the number of spanned and unspanned gaps in existing genome drafts. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3). The demo application, Docker image, and source code can be downloaded from project homepage.
APA, Harvard, Vancouver, ISO, and other styles
50

Alcantara, Licinius Dimitri Sá de. "Towards a simple and secure method for binary cryptography via linear algebra." Revista Brasileira de Computação Aplicada 9, no. 3 (October 31, 2017): 44. http://dx.doi.org/10.5335/rbca.v9i3.6556.

Full text
Abstract:
A simple and secure binary matrix encryption (BME) method is proposed and formalized on a linear algebra basis. The developed cryptography scheme does not require the design of a set of complex procedures or the generation of parallel bit streams for encryption of data; it only needs to capture binary data sequences from the unprotected digital data, which are transformed into encrypted binary sequences by a cipher matrix. This method can be performed at the physical or application layer, and can be easily applied to any digital storage and telecommunication system. It also has the advantage that the encrypted data length is not increased, which avoids additional burden for data storage and transmission. In order to validate the presented methodology, a GNU Octave program code was written to encrypt and decrypt data files.
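The cipher-matrix idea can be illustrated in a few lines. The sketch below is a hypothetical miniature, assuming 4-bit blocks and a self-inverse cipher matrix over GF(2); the paper formalises the general case and validates it in GNU Octave rather than Python.

```python
import numpy as np

# Hypothetical 4x4 cipher matrix over GF(2), chosen so that K @ K = I (mod 2):
# the same matrix multiplication both encrypts and decrypts.
K = np.array([[1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=np.uint8)

def apply_cipher(bits, M):
    # Multiply each 4-bit block of the bit sequence by M modulo 2.
    blocks = np.asarray(bits, dtype=np.uint8).reshape(-1, 4)
    return [int(b) for b in (blocks @ M.T % 2).reshape(-1)]
```

Note that the ciphertext has exactly the plaintext's length, matching the abstract's claim that encryption adds no storage or transmission overhead.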
APA, Harvard, Vancouver, ISO, and other styles