Journal articles on the topic 'Parallel code optimization'

Consult the top 50 journal articles for your research on the topic 'Parallel code optimization.'

1. Özcan, Ender, and Esin Onbaşioğlu. "Memetic Algorithms for Parallel Code Optimization." International Journal of Parallel Programming 35, no. 1 (December 2, 2006): 33–61. http://dx.doi.org/10.1007/s10766-006-0026-x.

2. Luo, Hao, Guoyang Chen, Pengcheng Li, Chen Ding, and Xipeng Shen. "Data-centric combinatorial optimization of parallel code." ACM SIGPLAN Notices 51, no. 8 (November 9, 2016): 1–2. http://dx.doi.org/10.1145/3016078.2851182.

3. Bailey, Duane A., Janice E. Cuny, and Bruce B. MacLeod. "Reducing communication overhead: A parallel code optimization." Journal of Parallel and Distributed Computing 4, no. 5 (October 1987): 505–20. http://dx.doi.org/10.1016/0743-7315(87)90021-9.

4. Shang, Zhi. "Large-Scale CFD Parallel Computing Dealing with Massive Mesh." Journal of Engineering 2013 (2013): 1–6. http://dx.doi.org/10.1155/2013/850148.

Abstract:
To run CFD codes more efficiently at large scales, parallel computing has to be employed. At industrial scales, for example, tens of thousands of mesh cells are typically used to capture the details of complex geometries. How to distribute these mesh cells among the multiprocessors to obtain good parallel computing performance is a real challenge. Because of the massive number of mesh cells, it is difficult for CFD codes without parallel optimizations to handle this kind of large-scale computing. Some open-source mesh partitioning software packages, such as Metis, ParMetis, Scotch, PT-Scotch, and Zoltan, are able to deal with the distribution of large numbers of mesh cells. They were therefore employed as parallel optimization tools ported into Code_Saturne, an open-source CFD code, to test whether they can solve the issue of dealing with massive mesh cells for CFD codes. The studies found that the mesh partitioning optimization packages help CFD codes not only deal with massive mesh cells but also achieve good high-performance computing (HPC) behaviour.
5. Özturan, Can, Balaram Sinharoy, and Boleslaw K. Szymanski. "Compiler Technology for Parallel Scientific Computation." Scientific Programming 3, no. 3 (1994): 201–25. http://dx.doi.org/10.1155/1994/243495.

Abstract:
There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL). Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.
6. Kiselev, E. A., P. N. Telegin, and A. V. Baranov. "Impact of Parallel Code Optimization on Computer Power Consumption." Lobachevskii Journal of Mathematics 44, no. 12 (December 2023): 5306–19. http://dx.doi.org/10.1134/s1995080223120211.

7. Safarik, Jakub, and Vaclav Snasel. "Acceleration of Particle Swarm Optimization with AVX Instructions." Applied Sciences 13, no. 2 (January 4, 2023): 734. http://dx.doi.org/10.3390/app13020734.

Abstract:
Parallel implementations of algorithms are usually compared with single-core CPU performance. The advantage of multicore vector processors decreases the performance gap between GPU and CPU computation, as shown in much recent research. With the AVX-512 instruction set, there will be another performance boost for CPU computations. The availability of parallel code running on CPUs makes them much easier to use and more accessible than GPUs. This article compares the performance of parallel implementations of the particle swarm optimization algorithm. The code was written in C++, and we used various techniques to obtain parallel execution through Advanced Vector Extensions. We present the performance on various benchmark functions and different problem configurations. The article describes and compares the performance boost gained from parallel execution on the CPU, along with the advantages and disadvantages of the parallelization techniques.
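
The kind of vectorization benchmarked here can be sketched with AVX2 intrinsics. A minimal, hypothetical example (not the authors' code): evaluating the common sphere benchmark f(x) = sum of x_i^2 for one particle, eight floats per instruction; the function name and the assumption that the dimension is a multiple of 8 are ours (compile with -mavx2 -mfma):

    #include <immintrin.h>

    // Hypothetical sphere-benchmark fitness for one particle, using AVX2 FMA.
    // Assumes dim is a multiple of 8.
    float sphere_avx2(const float* x, int dim) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < dim; i += 8) {
            __m256 v = _mm256_loadu_ps(x + i);  // load 8 coordinates
            acc = _mm256_fmadd_ps(v, v, acc);   // acc += v * v
        }
        float lane[8];
        _mm256_storeu_ps(lane, acc);            // horizontal sum of the 8 lanes
        float s = 0.0f;
        for (int k = 0; k < 8; ++k) s += lane[k];
        return s;
    }
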
8. Chowdhary, K. R., Rajendra Purohit, and Sunil Dutt Purohit. "Source-to-source translation for code-optimization." Journal of Information and Optimization Sciences 44, no. 3 (2023): 407–16. http://dx.doi.org/10.47974/jios-1350.

Abstract:
Multi-core design aims to serve a large market with user-oriented, high-productivity computing, as opposed to other parallel systems. The small processor counts typical of current multi-core systems are well suited to automatic parallelization on shared-memory architectures. The multi-core compiler optimization platform CETUS (a high-level to high-level compiler) initiates automatic parallelization in compiled programs. Its infrastructure is built with C programs in mind and is user-friendly and simple to use. It offers the significant parallelization passes and the underlying enabling techniques, and allows source-to-source conversions. The compiler has undergone numerous benchmark investigations and implementation iterations, and it can enhance the parallel performance of programs. The main drawback of advanced optimizing compilers, however, is that they lack runtime details such as the program's input data. The approaches presented in this paper facilitate dynamic optimization using CETUS, along with a large set of proposed compiler analyses and transformations for parallelization. To study the behaviour as well as the throughput gains, we investigated the features of both non-CETUS-based and CETUS-based parallelized programs in this work.
9. Wang, Shengyue, Pen-Chung Yew, and Antonia Zhai. "Code Transformations for Enhancing the Performance of Speculatively Parallel Threads." Journal of Circuits, Systems and Computers 21, no. 02 (April 2012): 1240008. http://dx.doi.org/10.1142/s0218126612400087.

Abstract:
As technology advances, microprocessors that integrate multiple cores on a single chip are becoming increasingly common. How to use these processors to improve the performance of a single program has been a challenge. For general-purpose applications, it is especially difficult to create efficient parallel execution due to the complex control flow and ambiguous data dependences. Thread-level speculation and transactional memory provide two hardware mechanisms that are able to optimistically parallelize potentially dependent threads. However, a compiler that performs detailed performance trade-off analysis is essential for generating efficient parallel programs for this hardware. Such a compiler must be able to take into consideration the cost of intra-thread as well as inter-thread value communication. On the other hand, the ubiquitous existence of complex, input-dependent control flow and data dependence patterns in general-purpose applications makes it impossible for one technique to optimize all program patterns. In this paper, we propose three optimization techniques to improve thread performance: (i) scheduling instructions and generating recovery code to reduce the critical forwarding path introduced by synchronizing memory-resident values; (ii) identifying reduction variables and transforming the code to minimize serializing execution; and (iii) dynamically merging consecutive iterations of a loop to avoid stalls due to unbalanced workload. Detailed evaluation of the proposed mechanisms shows that each optimization technique improves a subset of the SPEC2000 benchmarks, but none improves all of them. On average, the proposed optimizations improve performance by 7% for the set of SPEC2000 benchmarks that have already been optimized for register-resident value communication.
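
The reduction transformation in (ii) is easiest to see outside a speculative setting. A hedged analogy in C++ with OpenMP (the paper targets thread-level speculation hardware, not OpenMP): the serializing accumulation is rewritten so each thread keeps a private partial sum that is merged once at the end:

    #include <cstddef>
    #include <vector>

    // Serializing form: every iteration depends on s from the previous one.
    double sum_serial(const std::vector<double>& a) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i];
        return s;
    }

    // Reduction form: the cross-iteration dependence on s is removed.
    double sum_reduction(const std::vector<double>& a) {
        double s = 0.0;
        const long long n = static_cast<long long>(a.size());
        #pragma omp parallel for reduction(+ : s)
        for (long long i = 0; i < n; ++i) s += a[i];
        return s;
    }
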
10. Siow, C. L., Jaswar, and Efi Afrizal. "Computational Fluid Dynamic Using Parallel Loop of Multi-Cores Processor." Applied Mechanics and Materials 493 (January 2014): 80–85. http://dx.doi.org/10.4028/www.scientific.net/amm.493.80.

Abstract:
Computational Fluid Dynamics (CFD) software is often used to study fluid flow and the motion of structures in fluids. CFD normally requires large arrays and much computer memory, which leads to long execution times. Innovations in computer hardware, such as multi-core processors, provide an alternative way to improve this performance. This paper discusses loop parallelization on a multi-core processor for optimizing a sequential-looping CFD code. The loop-parallelized CFD was achieved by introducing multi-tasking or multi-threading code into the original CFD code, which was developed by one of the authors based on the Reynolds-Averaged Navier-Stokes (RANS) method and written in the Microsoft Visual Basic (VB) programming language. In the early stage, the whole CFD code was constructed as a sequential flow before being modified to parallel flow using VB's multi-threading library. For the comparison, fluid flow around the hull of a round-shaped FPSO was selected to compare the performance of both programming codes. Executed results of this self-developed code, such as the pressure distribution around the hull, are also presented in this paper.
11. Beard, Jonathan C., Peng Li, and Roger D. Chamberlain. "RaftLib: A C++ template library for high performance stream parallel processing." International Journal of High Performance Computing Applications 31, no. 5 (October 19, 2016): 391–404. http://dx.doi.org/10.1177/1094342016672542.

Abstract:
Stream processing is a compute paradigm that has been around for decades, yet until recently has failed to garner the same attention as other mainstream languages and libraries (e.g. C++, OpenMP, MPI). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism to process huge volumes of streaming data. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of extant legacy code. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations, while providing a platform for the exploration of integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling programmers to utilize the robust C++ standard library, and other legacy code, along with RaftLib’s parallelization framework. RaftLib supports several online optimization techniques: dynamic queue optimization, automatic parallelization, and real-time low overhead performance monitoring.
12. Soegiarso, R., and H. Adeli. "Optimization of Large Space Frame Steel Structures." Engineering Journal 34, no. 2 (June 30, 1997): 54–60. http://dx.doi.org/10.62913/engj.v34i2.681.

Abstract:
Optimization of large space frame steel structures subjected to realistic code-specified stress, displacement, and buckling constraints is investigated. The basis of design is the American Institute of Steel Construction (AISC) Allowable Stress Design (ASD) specifications. The types of structures considered are space moment resisting frames with and without bracings. The structures are subjected to wind loadings according to the Uniform Building Code (UBC) in addition to dead and live loads. The parallel-vector algorithm developed in this research is applied to three highrise building structures ranging in size from a 20-story structure with 1,920 members to a 60-story structure with 5,760 members, and its parallel processing and vectorization performance is evaluated. For the largest structure, speedups of 6.4 and 17.8 are achieved due to parallel processing (using eight processors) and vectorization, respectively. When vectorization is combined with parallel processing a very significant speedup of 97.1 is achieved.
13. Ge, Lixin, Zenghai Li, Cho-Kuen Ng, and Liling Xiao. "High Performance Computing in Parallel Electromagnetics Simulation Code suite ACE3P." Applied Computational Electromagnetics Society 35, no. 11 (February 4, 2021): 1332–33. http://dx.doi.org/10.47037/2020.aces.j.351135.

Abstract:
ACE3P (Advanced Computational Electromagnetics 3D Parallel), a comprehensive suite of parallel finite-element codes, is developed by SLAC for multi-physics modeling of particle accelerators, running on massively parallel computer platforms for high-fidelity and high-accuracy simulation. ACE3P enables rapid virtual prototyping of accelerator and RF component design, optimization and analysis. Advanced modeling capabilities have been facilitated by implementations of novel algorithms for numerical solvers. Code performance on state-of-the-art high performance computing (HPC) platforms for large-scale RF modeling in accelerator applications is presented in this paper. All the simulations have been performed on the supercomputers at the National Energy Research Scientific Computing Center (NERSC).
14. Williams, Dan, and Luc Bauwens. "Simulation of Compressible Flow on a Massively Parallel Architecture." Scientific Programming 4, no. 3 (1995): 193–201. http://dx.doi.org/10.1155/1995/453684.

Abstract:
This article describes the porting and optimization of an explicit, time-dependent, computational fluid dynamics code on an 8,192-node MasPar MP-1. The MasPar is a very fine-grained, single instruction, multiple data parallel computer. The code uses the flux-corrected transport algorithm. We describe the techniques used to port and optimize the code, and the behavior of a test problem. The test problem used to benchmark the flux-corrected transport code on the MasPar was a two-dimensional exploding shock with periodic boundary conditions. We discuss the performance that our code achieved on the MasPar, and compare its performance on the MasPar with its performance on other architectures. The comparisons show that the performance of the code on the MasPar is slightly better than on a CRAY Y-MP for a functionally equivalent, optimized two-dimensional code.
15. Passino, Kevin M. "Bacterial Foraging Optimization." International Journal of Swarm Intelligence Research 1, no. 1 (January 2010): 1–16. http://dx.doi.org/10.4018/jsir.2010010101.

Abstract:
The bacterial foraging optimization (BFO) algorithm mimics how bacteria forage over a landscape of nutrients to perform parallel nongradient optimization. In this article, the author provides a tutorial on BFO, including an overview of the biology of bacterial foraging and the pseudo-code that models this process. The algorithm's features are briefly compared to those in genetic algorithms, other bio-inspired methods, and nongradient optimization. The applications and future directions of BFO are also presented.
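
One chemotactic step of the BFO pseudo-code can be sketched as follows (a hedged illustration; all names and parameters are ours, not the article's listing): tumble to a random unit direction, then keep swimming while the objective improves:

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // One chemotactic step for a single bacterium minimizing J.
    void chemotaxis_step(std::vector<double>& pos, double stepSize, int maxSwim,
                         double (*J)(const std::vector<double>&),
                         std::mt19937& rng) {
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::vector<double> dir(pos.size());
        double norm = 0.0;
        for (double& d : dir) { d = gauss(rng); norm += d * d; }  // tumble
        norm = std::sqrt(norm);
        double best = J(pos);
        for (int m = 0; m < maxSwim; ++m) {       // swim while improving
            std::vector<double> trial = pos;
            for (std::size_t i = 0; i < pos.size(); ++i)
                trial[i] += stepSize * dir[i] / norm;
            double j = J(trial);
            if (j >= best) break;
            pos = trial;
            best = j;
        }
    }
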
16. Hückelheim, Jan, Paul Hovland, Michelle Mills Strout, and Jens-Dominik Müller. "Reverse-mode algorithmic differentiation of an OpenMP-parallel compressible flow solver." International Journal of High Performance Computing Applications 33, no. 1 (June 29, 2017): 140–54. http://dx.doi.org/10.1177/1094342017712060.

Abstract:
Reverse-mode algorithmic differentiation (AD) is an established method for obtaining adjoint derivatives of computer simulation applications. In computational fluid dynamics (CFD), adjoint derivatives of a cost function output such as drag or lift with respect to design parameters such as surface coordinates or geometry control points are a key ingredient for shape optimization, uncertainty quantification and flow control. The computational cost of CFD applications and their derivatives makes it essential to use high-performance computing hardware efficiently, including multi- and many-core architectures. Nevertheless, OpenMP is not supported in most AD tools, and previously shown methods achieve poor scalability of the derivative code. We present the AD of an OpenMP-parallelized finite volume compressible flow solver for unstructured meshes. Our approach enables us to reuse the parallelization of the original code in the computation of adjoint derivatives. The method works by identifying code segments that can be differentiated in reverse-mode without changing their memory access pattern. The OpenMP parallelization is integrated into the derivative code during the build process in a way that is robust to modifications of the original code and independent of the OpenMP support of the differentiation tool. We show the scalability of our adjoint CFD solver on test cases ranging from thousands to millions of finite volume mesh cells on CPUs with up to 16 threads as well as on an Intel XeonPhi card with 236 threads. We demonstrate that our approach is more practical to implement for production-sized CFD codes and produces more efficient adjoint derivative code than previously shown AD methods.
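
To make the reverse-mode idea concrete, here is a toy adjoint in C++ (our illustration, not the solver's generated code): for the primal y = a*b + sin(a), the adjoint sweep replays the statements in reverse order, accumulating sensitivities seeded with ybar:

    #include <cmath>

    // Given ybar = dJ/dy, compute abar = dJ/da and bbar = dJ/db
    // for y = a*b + sin(a).
    void adjoint(double a, double b, double ybar, double& abar, double& bbar) {
        // primal sweep
        double t = std::sin(a);     // t = sin(a)
        double y = a * b + t;       // y = a*b + t
        (void)y;
        // reverse sweep, statements in reverse order
        double tbar = ybar;         // adjoint of y = a*b + t w.r.t. t
        abar = ybar * b;            // ... w.r.t. a
        bbar = ybar * a;            // ... w.r.t. b
        abar += tbar * std::cos(a); // adjoint of t = sin(a)
    }
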
17. Hao, Huiqun, Jinrong Jiang, Tianyi Wang, Hailong Liu, Pengfei Lin, Ziyang Zhang, and Beifang Niu. "Deep Parallel Optimizations on an LASG/IAP Climate System Ocean Model and Its Large-Scale Parallelization." Applied Sciences 13, no. 4 (February 19, 2023): 2690. http://dx.doi.org/10.3390/app13042690.

Abstract:
This paper proposes a series of parallel optimizations on a high-resolution ocean model, the LASG/IAP Climate System Ocean Model (LICOM), which was independently developed by the Institute of Atmospheric Physics of the Chinese Academy of Sciences. The version of LICOM that we used was LICOM 2.1. In order to improve the parallel performance of LICOM, a series of parallel optimization methods were applied. We optimized the parallelization scheme to tackle the problem of load imbalance. Some communication optimizations were implemented, including data packing, the application of the least communication algorithm, and the replacement of communications with calculations. Furthermore, for the calculation procedures, we implemented some mature optimizations and expanded functions in a loop. Additionally, a hybrid of MPI and OpenMP, as well as an asynchronous parallel IO, was used. In this work, the optimized version of LICOM 2.1 was able to achieve a speedup of more than two times compared with the original code. The parallelization scheme optimization and the communication optimization produced considerable improvement in performance in the large-scale parallelization. Meanwhile, the newly optimized LICOM could scale up to 245,760 processor cores. However, for the original version, there was no speedup when scaled up to over 10,000 processor cores. Additionally, the problem of jumpy wall time during the time integration process was also tackled with this optimization. Finally, we conducted a practical simulation from 1993 to 2007 by using the optimized version of LICOM 2.1. The results showed that the mesoscale vortex was well simulated by the model.
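
The data-packing optimization mentioned above can be sketched as follows (names are ours, not LICOM's): many small halo messages are packed into one contiguous buffer so that a single MPI_Send replaces a sequence of per-row sends, cutting message-launch latency:

    #include <mpi.h>
    #include <vector>

    // Illustrative packed halo send: one message instead of one per row.
    void send_halo_packed(const std::vector<std::vector<double>>& haloRows,
                          int dest, int tag, MPI_Comm comm) {
        std::vector<double> buf;
        for (const auto& row : haloRows)
            buf.insert(buf.end(), row.begin(), row.end());  // pack once
        MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE,
                 dest, tag, comm);
    }
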
18. Bonati, Claudio, Simone Coscetti, Massimo D’Elia, Michele Mesiti, Francesco Negro, Enrico Calore, Sebastiano Fabio Schifano, Giorgio Silvi, and Raffaele Tripiccione. "Design and optimization of a portable LQCD Monte Carlo code using OpenACC." International Journal of Modern Physics C 28, no. 05 (March 9, 2017): 1750063. http://dx.doi.org/10.1142/s0129183117500632.

Abstract:
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processor Units (GPUs), exploiting aggressive data-parallelism and delivering higher performances for streaming computing applications. In this scenario, code portability (and performance portability) become necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.
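
The descriptive style OpenACC encourages can be illustrated with a minimal loop (the kernel is a placeholder, not the LQCD Dirac operator); the same source is mapped by the compiler onto a multi-core CPU or a GPU depending on the build target:

    // Minimal OpenACC sketch: the directive states what is parallel and
    // which arrays move; the compiler decides how to map it.
    void axpy(int n, double a, const double* x, double* y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }
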
19. Zhang, Jian, Zhe Dai, Ruitian Li, Liang Deng, Jie Liu, and Naichun Zhou. "Acceleration of a Production-Level Unstructured Grid Finite Volume CFD Code on GPU." Applied Sciences 13, no. 10 (May 18, 2023): 6193. http://dx.doi.org/10.3390/app13106193.

Abstract:
Due to the complex topological relationships, poor data locality, and data-racing problems in unstructured CFD computing, parallelizing finite volume method algorithms in shared memory to efficiently exploit the hardware capabilities of many-core GPUs has become a significant challenge. Based on a production-level unstructured CFD software package, three shared-memory parallel programming strategies (atomic operation, colouring, and reduction) were designed and implemented by deeply analysing its computing behaviour and memory access mode. Several data locality optimization methods (grid reordering, loop fusion, and multi-level memory access) were proposed. To address the sequential nature of the LU-SGS solution, two methods based on cell colouring and hyperplane were implemented. All the parallel methods and optimization techniques implemented were comprehensively analysed and evaluated on the three-dimensional grids of the M6 wing and the CHN-T1 aeroplane. The results show that using the Cuthill–McKee grid renumbering and loop fusion optimization techniques can improve memory access performance by 10%. The proposed reduction strategy, combined with multi-level memory access optimization, has a significant acceleration effect, speeding up the hot-spot subroutine with data races by a factor of three. Compared with the serial CPU version, the overall speed-up of the GPU code can reach 127. Compared with the parallel CPU version, the GPU code can achieve more than a thirty-fold speed-up with the same number of Message Passing Interface (MPI) ranks.
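
Of the three shared-memory strategies, colouring is the simplest to sketch (shown here with OpenMP as an illustration; the paper's kernels are CUDA): faces are grouped so that no two faces of one colour touch the same cell, so each colour class updates cells in parallel without atomics:

    #include <vector>

    struct Face { int left, right; double flux; };

    // Colour classes run one after another; faces within a class are
    // data-race free because they touch disjoint cells.
    void accumulate_by_colour(const std::vector<std::vector<Face>>& colours,
                              std::vector<double>& residual) {
        for (const auto& faces : colours) {
            #pragma omp parallel for
            for (long long f = 0; f < (long long)faces.size(); ++f) {
                residual[faces[f].left]  += faces[f].flux;
                residual[faces[f].right] -= faces[f].flux;
            }
        }
    }
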
20. Duran-Gonzalez, Julian, Victor Hugo Sanchez-Espinoza, Luigi Mercatali, Armando Gomez-Torres, and Edmundo del Valle-Gallegos. "Verification of the Parallel Transport Codes Parafish and AZTRAN with the TAKEDA Benchmarks." Energies 15, no. 7 (March 28, 2022): 2476. http://dx.doi.org/10.3390/en15072476.

Abstract:
With the increase in computational resources, parallel computation in neutron transport codes is inherent since it allows simulations with high spatial-angular resolution. Among the different methodologies available for the solution of the neutron transport equation, spherical harmonics (PN) and discrete-ordinates (SN) approximations have been widely used, as they are established classical methods for performing nuclear reactor calculations. This work focuses on describing and verifying two parallel deterministic neutron transport codes under development. The first one is the Parafish code that is based on the finite-element method and PN approximation. The second one is the AZTRAN code, based on the RTN-0 nodal method and SN approximation. The capabilities of these two codes have been tested on the TAKEDA benchmarks and the results obtained show good behavior and accuracy compared to the Monte Carlo reference solutions. Additionally, the speedup obtained by each code in the parallel execution is acceptable. In general, the results encourage further improvement in the codes to be comparable to other well-validated deterministic transport codes.
21. Coulaud, Olivier, Michaël Dussere, Pascal Hénon, Erik Lefebvre, and Jean Roman. "Optimization of a kinetic laser–plasma interaction code for large parallel systems." Parallel Computing 29, no. 9 (September 2003): 1175–89. http://dx.doi.org/10.1016/s0167-8191(03)00098-x.

22. Lou, John Z., and John D. Farrara. "Performance analysis and optimization on a parallel atmospheric general circulation model code." Concurrency: Practice and Experience 10, no. 7 (June 1998): 549–65. http://dx.doi.org/10.1002/(sici)1096-9128(199806)10:7<549::aid-cpe365>3.0.co;2-w.

23. Tsyganov, Andrey V., and Oleg I. Bulychov. "Implementing Parallel Metaheuristic Optimization Framework Using Metaprogramming and Design Patterns." Applied Mechanics and Materials 263-266 (December 2012): 1864–73. http://dx.doi.org/10.4028/www.scientific.net/amm.263-266.1864.

Abstract:
In the present paper we introduce an approach to implementing parallel metaheuristic optimization frameworks which is used in the design of the framework HeO. This experimental cross-platform framework is a collection of popular optimization methods implemented in C++ as algorithmic skeletons. The key feature of the discussed approach is the wide use of metaprogramming and design patterns, which increase the reusability of the code and ease the construction of hybrid algorithms for the end user. We consider the framework structure and implementation details and provide the results of numerical experiments for some well-known optimization problems.
24. Bergamaschi, Luca, Angeles Martínez, and Giorgio Pini. "Parallel Rayleigh Quotient Optimization with FSAI-Based Preconditioning." Journal of Applied Mathematics 2012 (2012): 1–14. http://dx.doi.org/10.1155/2012/872901.

Abstract:
The present paper describes a parallel preconditioned algorithm for the solution of partial eigenvalue problems for large sparse symmetric matrices on parallel computers. Namely, we consider the Deflation-Accelerated Conjugate Gradient (DACG) algorithm accelerated by factorized-sparse-approximate-inverse- (FSAI-) type preconditioners. We present an enhanced parallel implementation of the FSAI preconditioner and make use of the recently developed Block FSAI-IC preconditioner, which combines the FSAI and the Block Jacobi-IC preconditioners. Results on matrices of large size arising from finite element discretization of geomechanical models reveal that DACG accelerated by these types of preconditioners is competitive with respect to the publicly available parallel hypre package, especially in the computation of a few of the leftmost eigenpairs. The parallel DACG code accelerated by FSAI is written in MPI-Fortran 90 and exhibits good scalability up to one thousand processors.
25. Dolapchiev, Ivaylo, Kostadin Brandisky, and Petar Ivanov. "Eddy current testing probe optimization using a parallel genetic algorithm." Serbian Journal of Electrical Engineering 5, no. 1 (2008): 39–48. http://dx.doi.org/10.2298/sjee0801039d.

Abstract:
This paper uses the developed parallel version of Michalewicz's Genocop III Genetic Algorithm (GA) searching technique to optimize the coil geometry of an eddy current non-destructive testing probe (ECTP). The electromagnetic field is computed using FEMM 2D finite element code. The aim of this optimization was to determine coil dimensions and positions that improve ECTP sensitivity to physical properties of the tested devices.
26. Kolganov, Alexander Sergeevich, and Nikita Andreevich Kataev. "Data distribution and parallel code generation for heterogeneous computational clusters." Proceedings of the Institute for System Programming of the RAS 34, no. 4 (2022): 89–100. http://dx.doi.org/10.15514/ispras-2022-34(4)-7.

Abstract:
We present new techniques for the compilation of sequential programs with almost-affine accesses in loop nests for distributed-memory parallel architectures. Our approach is implemented as a source-to-source automatic parallelizing compiler that expresses parallelism with the DVMH directive-based programming model. Compared to all previous approaches, ours addresses the three main sub-problems of distributed-memory parallelization: data distribution, computation distribution, and communication optimization. Parallelization of sequential programs with structured grid computations is considered. In this paper, we use the NAS Parallel Benchmarks to evaluate the performance of the generated programs and provide experimental results on up to 9 nodes of a computational cluster with two 8-core processors per node.
27. Ivanov, Boyan D., and David J. Kropaczek. "Assessment of Parallel Simulated Annealing Performance with the NEXUS/ANC9 Core Design Code System." EPJ Web of Conferences 247 (2021): 02019. http://dx.doi.org/10.1051/epjconf/202124702019.

Abstract:
The method of parallel simulated annealing is being considered as a loading pattern optimization method to be used within the framework of the latest Westinghouse core design code system NEXUS/ANC9. A prototype version of NEXUS/ANC9 that incorporates the parallel simulated annealing method was developed and evaluated in terms of robustness, performance and results. The prototype code was used to optimize loading patterns for several plants and cycles, including 2-loop, 3-loop and 4-loop Westinghouse plants. Different fuel assembly lattices with IFBA, WABA and Gadolinium burnable absorbers were also exercised in these cores. Different strategies were evaluated using different options in the code. Special attention was paid to robustness and performance when different numbers of parallel processes were used with different Markov chain sizes.
28. Alebady, Wallaa Yaseen, and Ahmed Abdulkadhim Hamad. "Turbo polar code based on soft-cancelation algorithm." Indonesian Journal of Electrical Engineering and Computer Science 26, no. 1 (April 1, 2022): 521. http://dx.doi.org/10.11591/ijeecs.v26.i1.pp521-530.

Abstract:
Since the first polar code of Arikan, the research field of polar codes has been continuously active, with improving the performance of finite-length polar codes as its central point. In this paper, the parallel concatenated systematic turbo polar code (PCSTPC) model is proposed to improve polar code performance in the finite-length regime. On the encoder side, two systematic polar encoders are used as constituent encoders, while on the decoder side, two single-iteration soft-cancelation (SCAN) decoders are used as soft-in-soft-out (SISO) algorithms inside the iterative decoding algorithm. Compared to the optimized turbo polar code with SCAN and BP decoders, the proposed model has about 0.2 dB and 0.48 dB gains at BER = 10^-4, respectively, in addition to 0.1 dB, 0.31 dB, and 0.72 dB gains over the TPC-SSCL32, TPC-SSCL16, and TPC-SSCL8 models, respectively. Moreover, the proposed model offers less complexity than the other models, and therefore requires less memory and time resources.
29. Porter, Andrew R., Jeremy Appleyard, Mike Ashworth, Rupert W. Ford, Jason Holt, Hedong Liu, and Graham D. Riley. "Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0)." Geoscientific Model Development 11, no. 8 (August 27, 2018): 3447–64. http://dx.doi.org/10.5194/gmd-11-3447-2018.

Abstract:
We present an approach which we call PSyKAl that is designed to achieve portable performance for parallel finite-difference, finite-volume, and finite-element earth-system models. In PSyKAl the code related to the underlying science is formally separated from code related to parallelization and single-core optimizations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture and for optimization specialists to be able to tailor the code for a particular machine, independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new shallow-water model named NEMOLite2D. In doing this we have a code which is of a manageable size and yet which incorporates elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle, PSy, layer in order to achieve good performance, both serial and parallel. We have produced versions of the PSy layer parallelized with both OpenMP and OpenACC; in both cases we were able to leave the natural-science parts of the code unchanged while achieving good performance on both multi-core CPUs and GPUs. In quantifying whether or not the obtained performance is “good” we also consider the limitations of the basic roofline model and improve on it by generating kernel-specific CPU ceilings.
30. Vasilev, Vladimir S., Alexander I. Legalov, and Sergey V. Zykov. "The System for Transforming the Code of Dataflow Programs into Imperative." Modeling and Analysis of Information Systems 28, no. 2 (June 11, 2021): 198–214. http://dx.doi.org/10.18255/1818-1015-2021-2-198-214.

Abstract:
Functional dataflow programming languages are designed to create parallel portable programs. The source code of such programs is translated into a set of graphs that reflect information and control dependencies. The main way of executing them is interpretation, which does not allow calculations to be performed efficiently on real parallel computing systems and leads to poor performance. To run programs directly on existing computing systems, specific optimization and transformation methods that take into account the features of both the programming language and the architecture of the system must be used. Currently, the most common architecture is the von Neumann architecture; however, parallel programming for it is in most cases carried out using imperative languages with a static type system. For different architectures of parallel computing systems, there are various approaches to writing parallel programs. The transformation of dataflow parallel programs into imperative programs makes it possible to form a framework of imperative code fragments that directly express sequential calculations. In the future, this framework can be adapted to a specific parallel architecture. The paper considers an approach to performing this type of transformation, which consists of identifying fragments of dataflow parallel programs as templates that are subsequently replaced by equivalent fragments of imperative languages. The proposed transformation methods make it possible to generate program code to which various optimizing transformations, including parallelization for a target architecture, can subsequently be applied.
31. Dobreva, P. "Optimization of the Code of the Numerical Magnetosheath-Magnetosphere Model." Journal of Theoretical and Applied Mechanics 43, no. 2 (June 1, 2013): 77–82. http://dx.doi.org/10.2478/jtam-2013-0016.

Abstract:
The proposed three dimensional model contains two earlier developed 3D regional numerical models: a grid-characteristic model of the magnetosheath and a finite element model of the magnetosphere. The model output is the distribution of gas-dynamic parameters in the magnetosheath and of magnetic field inside the magnetosphere. The efforts are focused on the modernization of the existing software, written in Fortran, using several techniques for parallel programming such as OpenMP extensions. After analyzing the numerical performance of the model a possible scenario for the code optimization is shown. First results with the improved variant of the model are presented.
32. Kan, Guangyuan, Chenliang Li, Depeng Zuo, Xiaodi Fu, and Ke Liang. "Massively Parallel Monte Carlo Sampling for Xinanjiang Hydrological Model Parameter Optimization Using CPU-GPU Computer Cluster." Water 15, no. 15 (August 3, 2023): 2810. http://dx.doi.org/10.3390/w15152810.

Abstract:
The Monte Carlo sampling (MCS) method is a simple and practical way for hydrological model parameter optimization. The MCS procedure is used to generate a large number of data points. Therefore, its computational efficiency is a key issue when applied to large-scale problems. The MCS method is an internally concurrent algorithm that can be parallelized. It has the potential to execute on massively parallel hardware systems such as multi-node computer clusters equipped with multiple CPUs and GPUs, which are known as heterogeneous hardware systems. To take advantage of this, we parallelize the algorithm and implement it on a multi-node computer cluster that hosts multiple INTEL multi-core CPUs and NVIDIA many-core GPUs by using C++ programming language combined with the MPI, OpenMP, and CUDA parallel programming libraries. The parallel parameter optimization method is coupled with the Xinanjiang hydrological model to test the acceleration efficiency when tackling real-world applications that have a very high computational burden. Numerical experiments indicate, on the one hand, that the computational efficiency of the massively parallel parameter optimization method is significantly improved compared to single-core CPU code, and the multi-GPU code achieves the fastest speed. On the other hand, the scalability property of the proposed method is also satisfactory. In addition, the correctness of the proposed method is also tested using sensitivity and uncertainty analysis of the model parameters. Study results indicate good acceleration efficiency and reliable correctness of the proposed parallel optimization methods, which demonstrates excellent prospects in practical applications.
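
The hybrid MPI+OpenMP layer of such a sampler can be sketched as follows (a hedged illustration with the CUDA path omitted; model_error is a stand-in for one Xinanjiang model run, not the authors' code). Ranks split the sample budget, threads evaluate candidates, and the best objective is reduced across ranks:

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <random>

    // Stand-in objective: error of a two-parameter model run.
    static double model_error(double k1, double k2) {
        return (k1 - 0.7) * (k1 - 0.7) + (k2 - 0.3) * (k2 - 0.3);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        const long n = 100000;                    // samples per rank
        double best = 1e300;
        #pragma omp parallel
        {
            std::mt19937 rng(1234u + 97u * rank + omp_get_thread_num());
            std::uniform_real_distribution<double> u(0.0, 1.0);
            double local = 1e300;
            #pragma omp for nowait
            for (long i = 0; i < n; ++i) {
                double e = model_error(u(rng), u(rng));
                if (e < local) local = e;
            }
            #pragma omp critical
            if (local < best) best = local;       // merge thread minima
        }
        double globalBest;
        MPI_Reduce(&best, &globalBest, 1, MPI_DOUBLE, MPI_MIN, 0,
                   MPI_COMM_WORLD);
        if (rank == 0) std::printf("best error = %g\n", globalBest);
        MPI_Finalize();
        return 0;
    }
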
33. Torlapati, Jagadish, and T. Prabhakar Clement. "Using Parallel Genetic Algorithms for Estimating Model Parameters in Complex Reactive Transport Problems." Processes 7, no. 10 (September 20, 2019): 640. http://dx.doi.org/10.3390/pr7100640.

Abstract:
In this study, we present the details of an optimization method for parameter estimation of one-dimensional groundwater reactive transport problems using a parallel genetic algorithm (PGA). The performance of the PGA was tested with two problems that had published analytical solutions and two problems with published numerical solutions. The optimization model was provided with the published experimental results and reasonable bounds for the unknown kinetic reaction parameters as inputs. Benchmarking results indicate that the PGA estimated parameters that are close to the published parameters and it also predicted the observed trends well for all four problems. Also, OpenMP FORTRAN parallel constructs were used to demonstrate the speedup of the code on an Intel quad-core desktop computer. The parallel code showed a linear speedup with an increasing number of processors. Furthermore, the performance of the underlying optimization algorithm was tested to evaluate its sensitivity to the various genetic algorithm (GA) parameters, including initial population size, number of generations, and parameter bounds. The PGA used in this study is generic and can be easily scaled to higher-order water quality modeling problems involving real-world applications.
34. Viktorov, Ivan, and Ruslan Gibadullin. "The principles of building a machine-learning-based service for converting sequential code into parallel code." E3S Web of Conferences 431 (2023): 05012. http://dx.doi.org/10.1051/e3sconf/202343105012.

Abstract:
This article presents a novel approach for automating the parallelization of programming code using machine learning. The approach centers on a two-phase algorithm, incorporating a training phase and a transformation phase. In the training phase, a neural network is trained using data in the form of Abstract Syntax Trees, with Word2Vec being employed as the primary model for converting the syntax tree into numerical arrays. The choice of Word2Vec is attributed to its efficacy in encoding words with less reliance on context, compared to other natural language processing models such as GloVe and FastText. During the transformation phase, the trained model is applied to new sequential code, transforming it into parallel programming code. The article discusses in detail the mechanisms behind the algorithm, the rationale for the selection of Word2Vec, and the subsequent processing of code data. This methodology introduces an intelligent, automated system capable of understanding and optimizing the syntactic and semantic structures of code for parallel computing environments. The article is relevant for researchers and practitioners seeking to enhance code optimization techniques through the integration of machine learning models.
35. Liu, Jie, Tao Zhu, Yang Zhang, and Zhenyu Liu. "Parallel Particle Swarm Optimization Using Apache Beam." Information 13, no. 3 (February 28, 2022): 119. http://dx.doi.org/10.3390/info13030119.

Abstract:
The majority of complex research problems can be formulated as optimization problems. The Particle Swarm Optimization (PSO) algorithm is very effective in solving optimization problems because of its robustness, simplicity, and global search capabilities. Since the computational cost of these problems is usually high, it has been necessary to develop parallelized optimization algorithms. With the advent of big-data technology, such problems can be solved by distributed parallel computing. In previous related work, MapReduce (a programming model that implements a distributed parallel approach to processing and producing large datasets on a cluster) has been used to parallelize the PSO algorithm, but frequent file reads and writes make the execution time of MRPSO very long. We propose Apache Beam particle swarm optimization (BPSO), which uses the Apache Beam parallel programming model. In the experiment, we compared BPSO and PSO based on MapReduce (MRPSO) on four benchmark functions by changing the number of particles and the dimensions of the optimization problem. The experimental results show that the execution time of MRPSO remains largely constant while the number of particles is small (<1000) but increases rapidly once the number of particles exceeds a certain amount (>1000), whereas the execution time of BPSO grows slowly, and BPSO tends to yield better results than MRPSO. As the dimensionality of the optimization problem increases, BPSO can take half the time of MRPSO and obtain better results. MRPSO requires more execution time than BPSO as the problem complexity varies, but neither is very sensitive to problem complexity. All program code and input data are uploaded to GitHub.
36. Tahara, Y., F. Stern, and Y. Himeno. "Computational Fluid Dynamics–Based Optimization of a Surface Combatant." Journal of Ship Research 48, no. 04 (December 1, 2004): 273–87. http://dx.doi.org/10.5957/jsr.2004.48.4.273.

Abstract:
Computational fluid dynamics (CFD)-based optimization of a surface combatant is presented with the following main objectives: (1) development of a high-performance optimization module for a Reynolds-averaged Navier-Stokes (RANS) solver for the with-free-surface condition; and (2) demonstration of the capability of the optimization method for flow- and wave-field optimization of the Model 5415 hull form. The optimization module is based on extension of successive quadratic programming (SQP) to a higher-performance optimization method through the introduction of a parallel computing architecture, that is, the message passing interface (MPI) protocol. It is shown that the present parallel SQP module is nearly m (= 2k + 1, where k is the number of design parameters) times faster than conventional SQP, and the computational speed does not depend on the number of design parameters. The RANS solver is CFDSHIP-IOWA, a general-purpose parallel multiblock RANS code based on higher-order upwind finite difference and a projection method for velocity-pressure coupling; it offers the capability of free-surface flow calculation. The focus of the present study is on code development and demonstration of capability, which justifies use of a relatively simple turbulence model, a free-surface model without breaking model, static sinkage and trim, and simplified design constraints and geometry modeling. An overview is given of the high-performance optimization method and CFDSHIP-IOWA, and results are presented for stern optimization for minimization of transom wave field disturbance; sonar dome optimization for minimization of sonar-dome vortices; and bow optimization for minimization of bow wave. In conclusion, the present work has successfully demonstrated the capability of the CFD-based optimization method for flow- and wave-field optimization of the Model 5415 hull form. The present method is very promising and warrants further investigations for computer-aided design (CAD)-based hull form modification methods and more appropriate design constraints.
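
The factor m = 2k + 1 comes from central differencing: each of the k design parameters needs two perturbed objective evaluations plus the shared base point, and all m evaluations are independent, so they can run concurrently on m MPI ranks, which is why the speedup is nearly m. A hedged sketch of that distribution (evaluate_design stands in for a RANS run; nothing here is the CFDSHIP-IOWA code):

    #include <mpi.h>
    #include <vector>

    // Stand-in for one CFD evaluation of the objective at design x.
    static double evaluate_design(const std::vector<double>& x) {
        double f = 0.0;
        for (double xi : x) f += xi * xi;   // placeholder objective
        return f;
    }

    // Rank 0 keeps the base design; rank 2p+1 perturbs parameter p by +h,
    // rank 2p+2 by -h. Rank 0 gathers all m = 2k+1 values and can then
    // form the central-difference gradient.
    std::vector<double> gradient_objectives(std::vector<double> x, double h,
                                            MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);         // size assumed to be 2k+1
        if (rank > 0) {
            int p = (rank - 1) / 2;
            x[p] += (rank % 2 == 1) ? +h : -h;
        }
        double f = evaluate_design(x);
        std::vector<double> all(size);
        MPI_Gather(&f, 1, MPI_DOUBLE, all.data(), 1, MPI_DOUBLE, 0, comm);
        return all;                         // meaningful on rank 0
    }
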
37. Jacob, Ferosh, Jeff Gray, Jeffrey C. Carver, Marjan Mernik, and Purushotham Bangalore. "PPModel: a modeling tool for source code maintenance and optimization of parallel programs." Journal of Supercomputing 62, no. 3 (September 12, 2012): 1560–82. http://dx.doi.org/10.1007/s11227-012-0821-7.

38. Han, Changcai, Hui Li, and Weigang Chen. "Minimum Distance Optimization with Chord Edge Growth for High Girth Non-Binary LDPC Codes." Electronics 9, no. 12 (December 17, 2020): 2161. http://dx.doi.org/10.3390/electronics9122161.

Abstract:
Short or moderate-length non-binary low-density parity-check (NB-LDPC) codes have potential applications in future low-latency and high-reliability communication thanks to their strong error correction capability and parallel decoding. Because of the existence of the error floor, NB-LDPC codes usually cannot satisfy very low bit error rate (BER) requirements. In this paper, a low-complexity method is proposed for optimizing the minimum distance of the NB-LDPC code in a progressive chord edge growth manner. Specifically, chord edges connecting two non-adjacent vertices are added to the Hamiltonian cycle one by one. For each newly added chord edge, the configuration of non-zero entries corresponding to the chord edge is determined according to the so-called full rank condition (FRC) of all cycles that are related to the chord edge in the obtained subgraph. With minor modifications, the method can be used to construct NB-LDPC codes with an efficient encoding structure. The analysis results show that the method for designing NB-LDPC codes using progressive chord edge growth has lower complexity than traditional methods. The simulation results show that the proposed method can effectively improve the performance of the NB-LDPC code in the high signal-to-noise ratio (SNR) region. Using the proposed scheme, an NB-LDPC code with a quite low BER can be constructed with extremely low complexity.
39. Khebbou, Driss, Idriss Chana, and Hussain Ben-Azza. "Single parity check node adapted to polar codes with dynamic frozen bit equivalent to binary linear block codes." Indonesian Journal of Electrical Engineering and Computer Science 29, no. 2 (February 1, 2023): 816. http://dx.doi.org/10.11591/ijeecs.v29.i2.pp816-824.

Abstract:
In the context of decoding binary linear block codes by polar code decoding techniques, we propose in this paper a new optimization of the serial nature of decoding the polar codes equivalent to binary linear block codes. In addition to the special nodes proposed by the simplified successive-cancellation list technique, we propose a new special node allowing the bits of its sub-code to be estimated in parallel. The simulation is done in an additive white Gaussian noise (AWGN) channel for several linear block codes, namely Bose–Chaudhuri–Hocquenghem (BCH) codes, quadratic-residue (QR) codes, and linear block codes recently designed in the literature. The proposed technique offers the same performance in terms of frame error rate (FER) as the ordered statistics decoding (OSD) algorithm, which achieves that of the maximum likelihood decoder (MLD) but with high memory requirements and computational complexity.
40. Yessayan, Raffi, Yousry Y. Azmy, and R. Joseph Zerr. "Iterative and Parallel Performance Analysis of Non-Blocking Communication Algorithms in the Massively Parallel Neutron Transport Code PIDOTS." EPJ Web of Conferences 247 (2021): 03016. http://dx.doi.org/10.1051/epjconf/202124703016.

Abstract:
The PIDOTS neutral particle transport code utilizes a red/black implementation of the Parallel Gauss-Seidel algorithm to solve the SN approximation of the neutron transport equation on 3D Cartesian meshes. PIDOTS is designed for execution on massively parallel platforms and is capable of using the full resources of modern, leadership class high performance computers. Initial testing revealed that some configurations of PIDOTS’s Integral Transport Matrix Method solver demonstrated unexpectedly poor parallel scaling. Work at Idaho and Los Alamos National Laboratories then revealed that this inefficiency was a result of the accumulation of high-cost latency events in the complex blocking communication networks employed during each PIDOTS iteration. That work explored the possibility of minimizing those inefficiencies while maintaining a blocking communications model. While significant speedups were obtained, it was shown that fully mitigating the problem on general-purpose platforms was highly unlikely for a blocking code. This work continues that analysis by implementing a deeply interleaved non-blocking communication model into PIDOTS. This new model benefits from the optimization work performed on the blocking model while also providing significant opportunities to overlap the remaining un-mitigated communication costs with computation. Additionally, our new approach is easily transferable to other similarly spatially decomposed codes. The resulting algorithm was tested on LANL’s Trinity system at up to 32,768 processors and was found at that processor count to effectively hide 100% of MPI communication cost – equivalently 20% of the red/black phase time. It is expected that the implemented interleaving algorithm can fully support far higher processor counts and completely hide communication costs up to ~50% of total iteration time.
41. Chen, Cheng, Zheng Wang, Deepak Majeti, Nick Vrvilo, Timothy Warburton, Vivek Sarkar, and Gang Li. "Optimization of Lattice Boltzmann Simulation With Graphics-Processing-Unit Parallel Computing and the Application in Reservoir Characterization." SPE Journal 21, no. 04 (August 15, 2016): 1425–35. http://dx.doi.org/10.2118/179733-pa.

Abstract:
Summary Shale permeability is sufficiently low to require an unconventional scale of stimulation treatments, such as very-large-volume, high-rate, multistage hydraulic-fracturing applications. Upscaling of hydrocarbon transport processes in shales is challenging because of the low permeability and strong heterogeneity. Rock characterization with high-resolution imaging [X-ray tomography and scanning electron microscope (SEM)] is usually highly localized and contains significant uncertainties because of the small field of view. Therefore, an effective high-performance computing method is required to collect information over a larger scale to meet the ergodicity requirement in upscaling. The lattice Boltzmann (LB) method has received significant attention in computational fluid dynamics because of its capability in coping with complicated boundary conditions. A combination of high-resolution imaging and LB simulation is a powerful approach for evaluating the transport properties of a porous medium in a timely manner, on the basis of the numerical solution of the Navier-Stokes equations and Darcy's law. In this work, a graphics-processing-unit (GPU) -enhanced lattice Boltzmann simulator (GELBS) was developed, which was optimized by GPU parallel computing on the basis of the inherent parallelism of the LB method. Specifically, the LB method was used to implement the computational kernel; a sparse data structure was applied to optimize memory allocation; the OCCA (Medina et al. 2014) portability library was used, which enables the GELBS codes to use different application-programming interfaces (APIs) including open computing language (OpenCL), compute unified device architecture (CUDA), and open multiprocessing (OpenMP). OpenCL is an open standard for cross-platform parallel computing, CUDA is supported only by NVIDIA devices, and OpenMP is primarily used on central processing units (CPUs). It was found that the GPU-accelerated code was approximately 1,000 times faster than the unoptimized serial code and 10 times faster than the parallel code run on a standalone CPU. The CUDA code was slightly faster than OpenCL code on the NVIDA GPU because of the extra cost of OpenCL used to adapt to a heterogeneous platform. The GELBS was validated by comparing it with analytical solutions, laboratory measurements, and other independent numerical simulators in previous studies, and it was proved to have a second-order global accuracy. The GELBS was then used to analyze thin cuttings extracted from a sandstone reservoir and a shale-gas reservoir. The sandstone permeabilities were found relatively isotropic, whereas the shale permeabilities were strongly anisotropic because of the horizontal lamination structure. In shale cuttings, the average permeability in the horizontal direction was higher than that in the vertical direction by approximately two orders of magnitude. Correlations between porosity and permeability were observed in both rocks. The combination of GELBS and high-resolution imaging methods makes for a powerful tool for permeability evaluation when conventional laboratory measurement is impossible because of small cuttings sizes. The constitutive correlations between geometry and transport properties can be used for upscaling in different rock types. The GPU-optimized code significantly accelerates the computing speed; thus, many more samples can be analyzed given the same processing time. Consequently, the ergodicity requirement is met, which leads to a better reservoir characterization.
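
The "inherent parallelism of the LB method" mentioned above comes from the locality of the collision step: every lattice value relaxes toward its local equilibrium independently. A minimal sketch (OpenMP here as an illustration; GELBS maps the equivalent kernel to GPUs through OCCA, and tau and the array layout are ours):

    // BGK collision: all nNodes*Q updates are independent and parallel.
    void bgk_collide(double* f, const double* feq,
                     long long nNodes, int Q, double tau) {
        const long long total = nNodes * Q;
        #pragma omp parallel for
        for (long long n = 0; n < total; ++n)
            f[n] -= (f[n] - feq[n]) / tau;  // relax toward equilibrium
    }
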
APA, Harvard, Vancouver, ISO, and other styles
42

Yu, Qing Sheng, and Jian Zhang. "The Data Concurrent Operation of Multi Medial Xtension Technique." Applied Mechanics and Materials 263-266 (December 2012): 316–21. http://dx.doi.org/10.4028/www.scientific.net/amm.263-266.316.

Full text
Abstract:
The concurrent computation of SAD (sum of absolute differences) values can be implemented with SIMD technology. We present an improved image-data-organization optimization algorithm for Single Instruction Multiple Data parallel operation, introducing object-oriented ideas into the parallel process that computes SAD values. Speed tests of the codec's motion estimation optimized with multimedia-extension instructions show that, in H.264/AVC, currently the most complex video-coding standard, the algorithm noticeably increases encoder speed and helps make real-time video communication feasible over narrow-bandwidth links.
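As a concrete illustration of the SIMD idea, the PSADBW instruction (exposed as the SSE2 intrinsic _mm_sad_epu8) computes packed sums of absolute differences in a single step. The sketch below shows a 16x16-block SAD in C++; it demonstrates the general technique rather than the paper's actual MMX-era implementation, and the function name is illustrative.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// SAD of a 16x16 pixel block using PSADBW.  One intrinsic call replaces
// 16 scalar subtract/abs/add operations per row.
uint32_t sad16x16(const uint8_t* cur, const uint8_t* ref, int stride) {
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        __m128i a = _mm_loadu_si128((const __m128i*)(cur + row * stride));
        __m128i b = _mm_loadu_si128((const __m128i*)(ref + row * stride));
        // _mm_sad_epu8 sums |a_i - b_i| over each 8-byte half, leaving one
        // 16-bit partial sum in the low bits of each 64-bit lane.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    // Combine the two per-lane partial sums into the final SAD value.
    return (uint32_t)(_mm_cvtsi128_si32(acc) + _mm_extract_epi16(acc, 4));
}
```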
APA, Harvard, Vancouver, ISO, and other styles
43

Пушкарев, К. В., and В. Д. Кошур. "A hybrid heuristic parallel method of global optimization." Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), no. 2 (June 30, 2015): 242–55. http://dx.doi.org/10.26089/nummet.v16r224.

Full text
Abstract:
The problem of finding the global minimum of a continuous objective function of many variables over a multidimensional parallelepiped is considered. To solve complicated global optimization problems, a hybrid heuristic parallel method is proposed, based on combining and hybridizing various methods within a multi-agent system. It includes both new methods (for example, a neural-network approximation of inverse coordinate mappings that uses Generalized Regression Neural Networks (GRNN) to map objective-function values to coordinates) and modified classical methods (for example, a modified Hooke-Jeeves method). An implementation of the proposed method as a cross-platform (at the source-code level) library written in C++, communicating through the Message Passing Interface (MPI), is briefly described. The method is compared with 21 modern global optimization methods and with a genetic algorithm on 28 test objective functions of 50 variables.
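For reference, the classic (unmodified) Hooke-Jeeves pattern search alternates exploratory probes along coordinate axes with pattern moves in the improving direction; a minimal C++ sketch follows. The paper's modified variant and its GRNN component are not reproduced here, and all names are illustrative.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Objective = std::function<double(const std::vector<double>&)>;

// Exploratory move: probe +/- step along each axis, keeping improvements.
static double explore(const Objective& f, std::vector<double>& x,
                      double fx, double step) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        double xi = x[i];
        x[i] = xi + step;
        double v = f(x);
        if (v < fx) { fx = v; continue; }
        x[i] = xi - step;
        v = f(x);
        if (v < fx) { fx = v; continue; }
        x[i] = xi;  // neither direction improved: restore coordinate
    }
    return fx;
}

// Textbook Hooke-Jeeves: explore, then jump along the improving direction;
// contract the step when no probe improves the objective.
std::vector<double> hookeJeeves(const Objective& f, std::vector<double> x,
                                double step, double shrink, double tol) {
    double fx = f(x);
    while (step > tol) {
        std::vector<double> xNew = x;
        double fNew = explore(f, xNew, fx, step);
        if (fNew < fx) {
            // Pattern move: extrapolate from x through xNew, then re-explore.
            std::vector<double> xPat(x.size());
            for (std::size_t i = 0; i < x.size(); ++i)
                xPat[i] = 2.0 * xNew[i] - x[i];
            double fPat = explore(f, xPat, f(xPat), step);
            x = xNew; fx = fNew;
            if (fPat < fx) { x = xPat; fx = fPat; }
        } else {
            step *= shrink;  // e.g., shrink = 0.5
        }
    }
    return x;
}
```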
APA, Harvard, Vancouver, ISO, and other styles
44

Peredo, Oscar, Julián M. Ortiz, and José R. Herrero. "Acceleration of the Geostatistical Software Library (GSLIB) by code optimization and hybrid parallel programming." Computers & Geosciences 85 (December 2015): 210–33. http://dx.doi.org/10.1016/j.cageo.2015.09.016.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Calore, E., A. Gabbana, S. F. Schifano, and R. Tripiccione. "Optimization of lattice Boltzmann simulations on heterogeneous computers." International Journal of High Performance Computing Applications 33, no. 1 (April 24, 2017): 124–39. http://dx.doi.org/10.1177/1094342017703771.

Full text
Abstract:
High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
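The flavor of such a partitioning model can be shown with a minimal static-balance sketch: give each device work in proportion to its measured throughput, so host and accelerator finish a lattice sweep at the same time. This is an illustrative simplification, since the paper's model treats compute-bound and memory-bound kernels separately; the function name is an assumption.

```cpp
#include <cstddef>

// Minimal balance model for splitting a lattice between host and
// accelerator.  Throughputs are measured lattice-site updates per second.
// Both devices finish simultaneously when work is proportional to
// throughput: n_acc / S_acc == n_host / S_host.
std::size_t sitesOnAccelerator(std::size_t totalSites,
                               double hostThroughput,
                               double accThroughput) {
    double frac = accThroughput / (accThroughput + hostThroughput);
    return (std::size_t)(totalSites * frac + 0.5);  // round to nearest site
}
```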
APA, Harvard, Vancouver, ISO, and other styles
46

Hoffmann, Lars, Kaveh Haghighi Mood, Andreas Herten, Markus Hrywniak, Jiri Kraus, Jan Clemens, and Mingzhao Liu. "Accelerating Lagrangian transport simulations on graphics processing units: performance optimizations of Massive-Parallel Trajectory Calculations (MPTRAC) v2.6." Geoscientific Model Development 17, no. 9 (May 17, 2024): 4077–94. http://dx.doi.org/10.5194/gmd-17-4077-2024.

Full text
Abstract:
Lagrangian particle dispersion models are indispensable tools for the study of atmospheric transport processes. However, Lagrangian transport simulations can become numerically expensive when large numbers of air parcels are involved. To accelerate these simulations, we made considerable efforts to port the Massive-Parallel Trajectory Calculations (MPTRAC) model to graphics processing units (GPUs). Here we discuss performance optimizations of the major bottleneck of the GPU code of MPTRAC, the advection kernel. Timeline, roofline, and memory analyses of the baseline GPU code revealed that the application is memory-bound, and performance suffers from near-random memory access patterns. By changing the data structure of the horizontal wind and vertical velocity fields of the global meteorological data driving the simulations from structure of arrays (SoA) to array of structures (AoS) and by introducing a sorting method for better memory alignment of the particle data, performance was greatly improved. We evaluated the performance on NVIDIA A100 GPUs of the Jülich Wizard for European Leadership Science (JUWELS) Booster module at the Jülich Supercomputing Center, Germany. For our largest test case, transport simulations with 10^8 particles driven by the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis, we found that the runtime for the full set of physics computations was reduced by 75 %, including a reduction of 85 % for the advection kernel. In addition to demonstrating the benefits of code optimization for GPUs, we show that the runtime of central processing unit (CPU)-only simulations is also improved. For our largest test case, we found a runtime reduction of 34 % for the physics computations, including a reduction of 65 % for the advection kernel. The code optimizations discussed here bring the MPTRAC model closer to applications on upcoming exascale high-performance computing systems and will also be of interest for optimizing the performance of other models using particle methods.
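A minimal sketch of the two optimizations described: an array-of-structures record holding the wind components of one grid point, so the interpolation for a particle touches one contiguous record instead of three distant arrays, and a sort of the particle array by occupied cell so neighboring threads read neighboring records. The type and function names are assumptions for illustration, not identifiers from the MPTRAC sources.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// AoS record: all wind components of one grid point sit together in memory.
struct Wind { float u, v, w; };

struct Particle {
    double lon, lat, p;     // particle position
    std::size_t cellIndex;  // index of the meteorological grid cell occupied
};

// Sorting particles by grid-cell index turns the near-random access to the
// Wind array into a largely sequential pattern, improving memory alignment
// and cache/coalescing behavior on both GPUs and CPUs.
void sortParticlesByCell(std::vector<Particle>& parts) {
    std::sort(parts.begin(), parts.end(),
              [](const Particle& a, const Particle& b) {
                  return a.cellIndex < b.cellIndex;
              });
}
```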
APA, Harvard, Vancouver, ISO, and other styles
47

Wang, Nen-Zi, and Hsin-Yi Chen. "A cross-platform parallel programming model for fluid-film lubrication optimization." Industrial Lubrication and Tribology 70, no. 6 (August 13, 2018): 1002–11. http://dx.doi.org/10.1108/ilt-11-2016-0283.

Full text
Abstract:
Purpose A cross-platform paradigm (computing model) for fluid-film lubrication analysis is proposed, combining the graphical user interface of MATLAB with parallel Fortran programming. The purpose of this paper is to take advantage of OpenMP's effective multithreaded computing together with MATLAB's user-friendly interface and real-time display capability. Design/methodology/approach The computing performance of MATLAB and Fortran code is validated by solving two simple slider problems with iterative solution methods. Online display of the particles' search process is incorporated in the MATLAB code, while the air-foil-bearing optimum design runs in the background using OpenMP multithreaded computing. The optimization analysis is conducted with the particle swarm optimization method for an air foil bearing design. Findings The MATLAB programs require considerably longer execution times than their Fortran counterparts in iterative methods. The execution time of the air-foil-bearing optimum design is significantly reduced by OpenMP computing. As a result, the cross-platform paradigm provides a useful graphical user interface, and very little rewriting of the original numerical-model code, which is usually optimized for either serial or parallel computing, is required. Research limitations/implications Iterative methods are commonly applied in fluid-film lubrication analyses. In this study, iterative methods are used as the solution methods, which may not be an effective way to compute in MATLAB's setting. Originality/value This study proposes a cross-platform paradigm consisting of standalone MATLAB and Fortran codes. The approach combines the best of the two paradigms, and each code can be modified or maintained independently for different applications.
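The multithreaded backend can be illustrated with a single OpenMP-parallel Jacobi sweep of a Reynolds-type pressure equation, the kind of iterative kernel such an analysis runs in the background. This C++ analogue of the paper's Fortran backend is a hedged sketch: the unit-coefficient five-point stencil stands in for the real coefficients, which depend on the local film thickness.

```cpp
#include <cmath>
#include <vector>

// One Jacobi sweep on an nx-by-ny grid, parallelized over rows with OpenMP.
// Returns the maximum pointwise change, used by the outer iteration as a
// convergence monitor.
double jacobiSweep(const std::vector<double>& p, std::vector<double>& pNew,
                   const std::vector<double>& rhs, int nx, int ny) {
    double maxDiff = 0.0;
    #pragma omp parallel for reduction(max : maxDiff)
    for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            int k = j * nx + i;
            // Five-point stencil with unit coefficients (illustrative only).
            pNew[k] = 0.25 * (p[k - 1] + p[k + 1] +
                              p[k - nx] + p[k + nx] - rhs[k]);
            double d = std::fabs(pNew[k] - p[k]);
            if (d > maxDiff) maxDiff = d;
        }
    }
    return maxDiff;
}
```

Jacobi (rather than Gauss-Seidel) is chosen here because each update reads only the previous iterate, so the row loop parallelizes without data races.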
APA, Harvard, Vancouver, ISO, and other styles
48

Bonati, Claudio, Enrico Calore, Simone Coscetti, Massimo D’Elia, Michele Mesiti, Francesco Negro, Sebastiano Fabio Schifano, Giorgio Silvi, and Raffaele Tripiccione. "Portable LQCD Monte Carlo code using OpenACC." EPJ Web of Conferences 175 (2018): 09008. http://dx.doi.org/10.1051/epjconf/201817509008.

Full text
Abstract:
Ranging from multicore CPU processors to many-core GPUs, the present landscape of HPC architectures is extremely heterogeneous. In this context, code portability is increasingly important for easy maintainability of applications; this is especially relevant in scientific computing, where code changes are numerous and frequent. In this talk we present the design and optimization of a state-of-the-art, production-level LQCD Monte Carlo application using the OpenACC directive model. OpenACC aims to abstract parallel programming to a descriptive level, where programmers do not need to specify the mapping of the code onto the target machine. We describe the OpenACC implementation and show that the same code is able to target different architectures, including state-of-the-art CPUs and GPUs.
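The descriptive style OpenACC aims for can be seen in a generic fragment like the following: the directive declares the loop parallel and states the data movement, and the compiler maps it to the target device. This axpy-like loop is an illustration only, not code from the LQCD application.

```cpp
// Compile with an OpenACC-capable compiler (e.g., nvc++ -acc); without one,
// the pragma is ignored and the loop runs serially on the host.
void axpy(int n, double a, const double* x, double* y) {
    // copyin/copy describe the data traffic between host and device; the
    // compiler decides gangs, workers, and vector lanes for the target.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```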
APA, Harvard, Vancouver, ISO, and other styles
49

Ding, Wei, Yuanrui Zhang, Mahmut Kandemir, and Seung Woo Son. "Compiler-Directed File Layout Optimization for Hierarchical Storage Systems." Scientific Programming 21, no. 3-4 (2013): 65–78. http://dx.doi.org/10.1155/2013/167581.

Full text
Abstract:
The file layout of array data is a critical factor that affects the behavior of storage caches, yet it has so far received little attention in the context of hierarchical storage systems. The main contribution of this paper is a compiler-driven file-layout optimization scheme for hierarchical storage caches. This approach, fully automated within an optimizing compiler, analyzes a multithreaded application code and determines a file layout for each disk-resident array referenced by the code, such that the performance of the target storage-cache hierarchy is maximized. We tested our approach using 16 I/O-intensive application programs and compared its performance against two previously proposed approaches under different cache-space-management schemes. Our experimental results show that the proposed approach improves the execution time of these parallel applications by 23.7% on average.
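The core layout decision can be reduced to a sketch like the following: for each disk-resident array, pick the file order (for example, row-major or column-major) that makes the dominant access pattern detected by the compiler sequential in file offsets, and hence friendly to the storage cache. This is a simplified illustration of the idea, not the paper's actual scheme; all names are assumptions.

```cpp
#include <cstddef>

enum class Layout { RowMajor, ColMajor };

// File offset (in elements) of element (i, j) of an nRows-by-nCols
// disk-resident array under a given layout.  A column-walking loop over a
// row-major file touches a different file block on almost every access;
// matching the layout to the loop order makes the reads sequential.
std::size_t fileOffset(Layout l, std::size_t i, std::size_t j,
                       std::size_t nRows, std::size_t nCols) {
    return l == Layout::RowMajor ? i * nCols + j : j * nRows + i;
}
```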
APA, Harvard, Vancouver, ISO, and other styles
50

Știrb, Iulia. "Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree." Computers 7, no. 4 (December 3, 2018): 66. http://dx.doi.org/10.3390/computers7040066.

Full text
Abstract:
The paper presents a Non-Uniform Memory Access (NUMA)-aware compiler optimization for task-level parallel code. The optimization is based on the Non-Uniform Memory Access—Balanced Task and Loop Parallelism (NUMA-BTLP) algorithm (Ştirb, 2018). The algorithm determines the type of each thread in the source code through static analysis. After assigning a type to each thread, NUMA-BTLP (Ştirb, 2018) calls the NUMA-BTDM mapping algorithm (Ştirb, 2016), which uses the PThreads routine pthread_setaffinity_np to set the CPU affinities of the threads (i.e., thread-to-core associations) based on their type. The algorithms achieve an improved thread mapping for NUMA systems by mapping threads that share data onto the same core(s), allowing fast access to L1-cache data. The paper shows that PThreads-based task-level parallel code optimized at compile time by NUMA-BTLP (Ştirb, 2018) and NUMA-BTDM (Ştirb, 2016) runs time- and energy-efficiently on NUMA systems. The results show that energy consumption is reduced by up to 5% at the same execution time for one of the tested real benchmarks, and by up to 15% for another benchmark running in an infinite loop. The algorithms can be used in real-time control systems such as client/server-based applications that require efficient access to shared resources. Most often, task parallelism is used in the implementation of the server, and loop parallelism is used for the client.
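The mapping step relies on a real PThreads routine the abstract names, pthread_setaffinity_np (a glibc/Linux extension). A minimal sketch of pinning a thread to one core follows; the decision of which core each thread should get (the NUMA-BTDM logic) is not shown, and the helper name is illustrative.

```cpp
#include <pthread.h>
#include <sched.h>  // cpu_set_t, CPU_ZERO, CPU_SET (glibc, _GNU_SOURCE)

// Restrict a thread's CPU affinity to a single core.  Threads that share
// data are given the same core(s) so they reuse each other's L1-cached
// lines instead of paying remote-NUMA-node latency.
// Returns 0 on success, an errno value otherwise.
int pinThreadToCore(pthread_t thread, int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);  // allow exactly one core
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
}

// Example: pin the calling thread to core 2.
//   pinThreadToCore(pthread_self(), 2);
```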
APA, Harvard, Vancouver, ISO, and other styles