Journal articles on the topic 'Computation speedup'

To see the other types of publications on this topic, follow the link: Computation speedup.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 journal articles for your research on the topic 'Computation speedup.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Zhang, Guiming, and Jin Xu. "Multi-GPU-Parallel and Tile-Based Kernel Density Estimation for Large-Scale Spatial Point Pattern Analysis." ISPRS International Journal of Geo-Information 12, no. 2 (January 18, 2023): 31. http://dx.doi.org/10.3390/ijgi12020031.

Full text
Abstract:
Kernel density estimation (KDE) is a commonly used method for spatial point pattern analysis, but it is computationally demanding when analyzing large datasets. GPU-based parallel computing has been adopted to address such computational challenges. The existing GPU-parallel KDE method, however, utilizes only one GPU for parallel computing. Additionally, it assumes that the input data can be held in GPU memory all at once for computation, which is unrealistic when conducting KDE analysis over large geographic areas at high resolution. This study develops a multi-GPU-parallel and tile-based KDE algorithm to overcome these limitations. It exploits multiple GPUs to speed up complex KDE computation by distributing computation across GPUs, and approaches density estimation with a tile-based strategy to bypass the memory bottleneck. Experimental results show that the parallel KDE algorithm running on multiple GPUs achieves significant speedups over running on a single GPU, and higher speedups are achieved on KDE tasks of a larger problem size. The tile-based strategy renders it feasible to estimate high-resolution density surfaces over large areas even on GPUs with only limited memory. Multi-GPU parallel computing and tile-based density estimation, while incurring very little computational overhead, effectively enable conducting KDE for large-scale spatial point pattern analysis on geospatial big data.
APA, Harvard, Vancouver, ISO, and other styles
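The tile-based strategy described in the abstract above can be illustrated with a minimal, single-process sketch: the output surface is produced one tile at a time so that only a small block of grid cells is ever resident in memory. The Gaussian kernel, tile size, and bandwidth handling below are illustrative assumptions, not the authors' multi-GPU implementation.

```python
import numpy as np

def gaussian_kernel(d2, bandwidth):
    """Isotropic Gaussian kernel evaluated on squared distances."""
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kde_tiled(points, extent, cell, bandwidth, tile=256):
    """Estimate a density surface tile by tile to bound memory use.

    points : (N, 2) array of event coordinates
    extent : (xmin, ymin, xmax, ymax) of the output surface
    cell   : output cell size; tile : tile edge length in cells
    """
    xmin, ymin, xmax, ymax = extent
    ncols = int(np.ceil((xmax - xmin) / cell))
    nrows = int(np.ceil((ymax - ymin) / cell))
    density = np.zeros((nrows, ncols))
    for r0 in range(0, nrows, tile):            # loop over output tiles
        for c0 in range(0, ncols, tile):
            r1, c1 = min(r0 + tile, nrows), min(c0 + tile, ncols)
            ys = ymin + (np.arange(r0, r1) + 0.5) * cell
            xs = xmin + (np.arange(c0, c1) + 0.5) * cell
            gx, gy = np.meshgrid(xs, ys)
            # squared distance from every cell centre in the tile to every point
            d2 = ((gx[..., None] - points[:, 0]) ** 2 +
                  (gy[..., None] - points[:, 1]) ** 2)
            density[r0:r1, c0:c1] = gaussian_kernel(d2, bandwidth).sum(axis=-1)
    return density
```

Because each tile is independent, a multi-GPU scheduler can hand tiles to whichever device is idle, which is what makes the scheme both memory-bounded and embarrassingly parallel.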
2

Gao, Wen Hua, Li Qin Duan, Wei Zhou, and Pei Xin Ye. "Information-Based Complexity of Integration in the Randomized and Quantum Computation Model." Advanced Materials Research 403-408 (November 2011): 367–71. http://dx.doi.org/10.4028/www.scientific.net/amr.403-408.367.

Full text
Abstract:
In this paper, we investigate the integration of the Hölder-Nikolskii classes in the randomized and quantum computation model. We develop randomized and quantum algorithms for integration of functions from this class and analyze their convergence rates. Comparing our result with the convergence rates in the deterministic setting, we see that quantum computing can reach an exponential speedup over deterministic classical computation and a quadratic speedup over randomized classical computation.
APA, Harvard, Vancouver, ISO, and other styles
3

MORENO MAZA, MARC, and YUZHEN XIE. "BALANCED DENSE POLYNOMIAL MULTIPLICATION ON MULTI-CORES." International Journal of Foundations of Computer Science 22, no. 05 (August 2011): 1035–55. http://dx.doi.org/10.1142/s0129054111008556.

Full text
Abstract:
In symbolic computation, polynomial multiplication is a fundamental operation akin to matrix multiplication in numerical computation. We present efficient implementation strategies for FFT-based dense polynomial multiplication targeting multi-cores. We show that balanced input data can maximize parallel speedup and minimize cache complexity for bivariate multiplication. However, unbalanced input data, which are common in symbolic computation, are challenging. We provide efficient techniques, that we call contraction and extension, to reduce multivariate (and univariate) multiplication to balanced bivariate multiplication. Our implementation in Cilk++ demonstrates good speedup on multi-cores.
APA, Harvard, Vancouver, ISO, and other styles
4

Xu, Zhiqiang, Yiming Wang, Naidi Sun, Zhengying Li, Song Hu, and Quan Liu. "Parallel Computing for Quantitative Blood Flow Imaging in Photoacoustic Microscopy." Sensors 19, no. 18 (September 16, 2019): 4000. http://dx.doi.org/10.3390/s19184000.

Full text
Abstract:
Photoacoustic microscopy (PAM) is an emerging biomedical imaging technology capable of quantitative measurement of microvascular blood flow by correlation analysis. However, the computational cost is high, limiting its applications. Here, we report a parallel computation design based on the graphics processing unit (GPU) for high-speed quantification of blood flow in PAM. Two strategies were utilized to improve the computational efficiency. First, the correlation method in the algorithm was optimized to avoid redundant computation and a parallel computing structure was designed. Second, the parallel design was realized on the GPU and optimized by maximizing the utilization of computing resources in the GPU. The detailed timings and speedup for each calculation step are given, and the MATLAB and C/C++ code versions based on the CPU are presented as a comparison. A full performance test shows that a stable speedup of ~80-fold could be achieved with the same calculation accuracy, and the computation time could be reduced from minutes to just several seconds for imaging sizes ranging from 1 × 1 mm² to 2 × 2 mm². Our design accelerates PAM-based blood flow measurement and paves the way for real-time PAM imaging and processing by significantly improving the computational efficiency.
APA, Harvard, Vancouver, ISO, and other styles
5

Zhang, Zhigang, Songfeng Lu, Jie Sun, and Qing Zhou. "The Constant Speedup Mechanism on Adiabatic Quantum Computation." Journal of Computational and Theoretical Nanoscience 13, no. 10 (October 1, 2016): 7262–65. http://dx.doi.org/10.1166/jctn.2016.5997.

Full text
Abstract:
In the adiabatic quantum computation model, a computational procedure is described by the continuous time evolution of a time-dependent Hamiltonian. Classically, the unstructured search problem can be solved only in a running time of order O(G). However, by modifying the structure of the local Hamiltonian or using specific interpolating functions, it is possible for a quantum computer to do the calculation in constant time. This paper reveals the cause that leads to the speedup. We analyze two kinds of specific adiabatic quantum models and conclude that the values of the relevant elements on the back-diagonal of the local Hamiltonian are the main factors affecting the time complexity of adiabatic quantum algorithms. According to this speedup mechanism, we propose two kinds of adiabatic quantum algorithms that achieve constant time complexity.
APA, Harvard, Vancouver, ISO, and other styles
6

AKL, SELIM G. "INHERENTLY PARALLEL GEOMETRIC COMPUTATIONS." Parallel Processing Letters 16, no. 01 (March 2006): 19–37. http://dx.doi.org/10.1142/s0129626406002447.

Full text
Abstract:
A new computational paradigm is described which offers the possibility of superlinear (and sometimes unbounded) speedup, when parallel computation is used. The computations involved are subject only to given mathematical constraints and hence do not depend on external circumstances to achieve superlinear performance. The focus here is on geometric transformations. Given a geometric object A with some property, it is required to transform A into another object B which enjoys the same property. If the transformation requires several steps, each resulting in an intermediate object, then each of these intermediate objects must also obey the same property. We show that in transforming one triangulation of a polygon into another, a parallel algorithm achieves a superlinear speedup. In the case where a convex decomposition of a set of points is to be transformed, the improvement in performance is unbounded, meaning that a parallel algorithm succeeds in solving the problem as posed, while all sequential algorithms fail.
APA, Harvard, Vancouver, ISO, and other styles
7

Su, Huayou, Kaifang Zhang, and Songzhu Mei. "On the Transformation Optimization for Stencil Computation." Electronics 11, no. 1 (December 23, 2021): 38. http://dx.doi.org/10.3390/electronics11010038.

Full text
Abstract:
Stencil computation optimizations have been investigated quite a lot, and various approaches have been proposed. Loop transformation is a vital kind of optimization in modern production compilers and has been employed successfully within them. In this paper, we combine the two aspects to study the potential benefits some common transformation recipes may have for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balance, and a forward and backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, including 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms, demonstrate the respective effects of the transformation recipes. An average speedup of 1.65× is obtained, and the best is 1.88× for the single transformation recipes we analyze. The compound recipes demonstrate a maximum speedup of 1.92×.
APA, Harvard, Vancouver, ISO, and other styles
8

Wani, Mohsin Altaf, and Manzoor Ahmad. "Statically Optimal Binary Search Tree Computation Using Non-Serial Polyadic Dynamic Programming on GPU's." International Journal of Grid and High Performance Computing 11, no. 1 (January 2019): 49–70. http://dx.doi.org/10.4018/ijghpc.2019010104.

Full text
Abstract:
Modern GPUs perform computation at a very high rate compared to CPUs; as a result, they are increasingly used for general-purpose parallel computation. Determining a statically optimal binary search tree is an optimization problem: find the arrangement of nodes in a binary search tree that minimizes the average search time. Knuth's modification to the dynamic programming algorithm improves the time complexity to O(n²). We develop a multiple-GPU-based implementation of this algorithm using different approaches. Using a suitable GPU implementation for a given workload provides a speedup of up to four times over other GPU-based implementations. We achieve a speedup factor of 409 on an older GTX 570 and of 745 on a more modern GTX 1060 when compared to a conventional single-threaded CPU-based implementation.
APA, Harvard, Vancouver, ISO, and other styles
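For reference, the dynamic program that the abstract refers to, with Knuth's root-interval optimization, can be sketched sequentially in a few lines. The multi-GPU mapping from the paper is not reproduced here, and the access frequencies `p` are assumed to be given.

```python
def optimal_bst_cost(p):
    """Cost of a statically optimal BST over keys with access frequencies p.

    Uses Knuth's observation root[i][j-1] <= root[i][j] <= root[i+1][j],
    which reduces the dynamic program from O(n^3) to O(n^2).
    """
    n = len(p)
    INF = float("inf")
    cost = [[0.0] * (n + 1) for _ in range(n + 2)]
    root = [[0] * (n + 1) for _ in range(n + 2)]
    prefix = [0.0] * (n + 1)
    for i, w in enumerate(p):
        prefix[i + 1] = prefix[i] + w

    for i in range(1, n + 1):                  # single-key subtrees
        cost[i][i] = p[i - 1]
        root[i][i] = i
    for length in range(2, n + 1):             # growing subtree sizes
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j], weight = INF, prefix[j] - prefix[i - 1]
            for r in range(root[i][j - 1], root[i + 1][j] + 1):
                c = cost[i][r - 1] + cost[r + 1][j] + weight
                if c < cost[i][j]:
                    cost[i][j], root[i][j] = c, r
    return cost[1][n]
```

Each `length` wavefront is independent across `i`, which is the parallelism that GPU implementations of this recurrence typically exploit.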
9

YONG, XIE, and HSU WEN-JING. "ALIGNED MULTITHREADED COMPUTATIONS AND THEIR SCHEDULING WITH PERFORMANCE GUARANTEES." Parallel Processing Letters 13, no. 03 (September 2003): 353–64. http://dx.doi.org/10.1142/s0129626403001331.

Full text
Abstract:
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Earlier research in the Cilk project proposed the "strict" computational model, in which every dependency goes from a thread x only to one of x's ancestor threads, and guaranteed both linear speedup and linear expansion of space. However, Cilk threads are stateless, and the task graphs that the Cilk language expresses are series-parallel graphs, a proper subset of arbitrary task graphs. Moreover, Cilk does not support applications with pipelining. We propose the "aligned" multithreaded computational model, which extends the "strict" computational model of Cilk. In the aligned multithreaded computational model, dependencies can go from an arbitrary thread x not only to x's ancestor threads, but also to x's younger brother threads, which are spawned by x's parent thread after x. We use the same measures of time and space as those used in Cilk: T1 is the time required to execute the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. We show that for any aligned computation, there exists an execution schedule that achieves both efficient time and efficient space. Specifically, we show that for an execution of any aligned multithreaded computation on P processors, the time required is bounded by O(T1/P + T∞), and the space required can be loosely bounded by O(λ·S1P), where λ is the maximum number of younger brother threads that have the same parent thread and can be blocked during execution. If we assume that λ is a constant, and the space requirements for elder and younger brother threads are the same, then the space required would be bounded by O(S1P). We also show that the aligned multithreaded computational model supports pipelined applications. Furthermore, we propose a multithreaded programming language and show that it can express arbitrary task graphs.
APA, Harvard, Vancouver, ISO, and other styles
10

Al-Neama, Mohammed W., Naglaa M. Reda, and Fayed F. M. Ghaleb. "An Improved Distance Matrix Computation Algorithm for Multicore Clusters." BioMed Research International 2014 (2014): 1–12. http://dx.doi.org/10.1155/2014/406178.

Full text
Abstract:
The distance matrix has diverse uses in different research areas. Its computation is typically an essential task in most bioinformatics applications, especially in multiple sequence alignment. The gigantic explosion of biological sequence databases leads to an urgent need for accelerating these computations. The DistVect algorithm was introduced in the paper of Al-Neama et al. (in press) to present a recent approach for vectorizing distance matrix computation. It showed efficient performance in both sequential and parallel computing. However, the multicore cluster systems that are available now, with their scalability and performance/cost ratio, meet the need for more powerful and efficient performance. This paper proposes DistVect1, a highly efficient parallel vectorized algorithm with high performance for computing the distance matrix, addressed to multicore clusters. It reformulates the DistVect1 vectorized algorithm in terms of cluster primitives. It deduces an efficient approach to partitioning and scheduling computations, convenient for this type of architecture. Implementations employ the potential of both MPI and OpenMP libraries. Experimental results show that the proposed method achieves an improvement of around a 3-fold speedup over SSE2. It also achieves speedups of more than 9 orders of magnitude compared to the publicly available parallel implementation utilized in ClustalW-MPI.
APA, Harvard, Vancouver, ISO, and other styles
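A rough, CPU-only illustration of splitting a pairwise distance-matrix computation across workers is shown below. It is not the DistVect1 partitioning scheme; the distance used here is a plain mismatch count on equal-length sequences, chosen only to keep the sketch self-contained.

```python
from multiprocessing import Pool
import itertools
import numpy as np

def pair_distance(args):
    """Toy distance between two equal-length sequences (mismatch count)."""
    i, j, a, b = args
    return i, j, sum(x != y for x, y in zip(a, b))

def distance_matrix(seqs, workers=4):
    """Fill the symmetric pairwise distance matrix in parallel."""
    n = len(seqs)
    tasks = [(i, j, seqs[i], seqs[j]) for i, j in itertools.combinations(range(n), 2)]
    dist = np.zeros((n, n))
    with Pool(workers) as pool:
        for i, j, d in pool.imap_unordered(pair_distance, tasks, chunksize=64):
            dist[i, j] = dist[j, i] = d
    return dist

if __name__ == "__main__":
    print(distance_matrix(["ACGT", "ACGA", "TCGA"], workers=2))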
11

Zhang, Fan, Chenxi Zhao, Songtao Han, Fei Ma, and Deliang Xiang. "GPU-Based Parallel Implementation of VLBI Correlator for Deep Space Exploration System." Remote Sensing 13, no. 6 (March 23, 2021): 1226. http://dx.doi.org/10.3390/rs13061226.

Full text
Abstract:
Very Long Baseline Interferometry (VLBI) solution can yield accurate information of angular position, and has been successfully used in the field of deep space exploration, such as astrophysics, imaging, detector positioning, and so on. The increase in VLBI data volume puts higher demands on efficient processing. Essentially, the main step of VLBI is the correlation processing, through which the angular position can be calculated. Since the VLBI correlation processing is both computation-intensive and data-intensive, the CPU cluster is usually employed in practical application to perform complex distributed computation. In this paper, we propose a parallel implementation of VLBI correlator based on graphics processing unit (GPU) to realize a more efficient and economical angular position calculation of deep space target. On the basis of massively GPU parallel computing, the coalesced access strategy and the parallel pipeline strategy are introduced to further accelerate the VLBI correlator. Experimental results show that the optimized GPU-based VLBI method can meet the real-time processing requirements of the received data stream. Compared with the sequential method, the proposed approach can reach a 224.1 × calculation speedup, and a 36.8 × application speedup. Compared with the multi-CPUs method, it can achieve 28.6 × calculation speedup and 4.7 × application speedup.
APA, Harvard, Vancouver, ISO, and other styles
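The core of an FX-style correlation of the kind parallelized in the paper reduces, per baseline, to a segment-wise FFT, a conjugate multiplication, and an accumulation. A minimal single-baseline NumPy sketch follows; the segment length and the omission of delay and fringe corrections are simplifying assumptions.

```python
import numpy as np

def correlate_baseline(x, y, seg=1024):
    """Accumulate the cross-spectrum of two station streams segment by segment."""
    nseg = min(len(x), len(y)) // seg
    acc = np.zeros(seg, dtype=complex)
    for k in range(nseg):
        X = np.fft.fft(x[k * seg:(k + 1) * seg])   # "F" step: per-station FFT
        Y = np.fft.fft(y[k * seg:(k + 1) * seg])
        acc += X * np.conj(Y)                      # "X" step: conjugate multiply
    return acc / max(nseg, 1)
```

Each segment and each baseline is independent, which is why the workload maps naturally onto thousands of GPU threads.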
12

McDermott, M., S. K. Prasad, S. Shekhar, and X. Zhou. "INTERESTING SPATIO-TEMPORAL REGION DISCOVERY COMPUTATIONS OVER GPU AND MAPREDUCE PLATFORMS." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences II-4/W2 (July 10, 2015): 35–41. http://dx.doi.org/10.5194/isprsannals-ii-4-w2-35-2015.

Full text
Abstract:
Discovery of interesting paths and regions in spatio-temporal data sets is important to many fields such as the earth and atmospheric sciences, GIS, public safety and public health, both as a goal and as a preliminary step in a larger series of computations. This discovery is usually an exhaustive procedure that quickly becomes extremely time consuming to perform using traditional paradigms and hardware, and, given the rapidly growing sizes of today's data sets, it is quickly outpacing the speed at which computational capacity is growing. In our previous work (Prasad et al., 2013a) we achieved a 50-times speedup over the sequential implementation using a single GPU. We were able to achieve near-linear speedup over this result on interesting path discovery by using Apache Hadoop to distribute the workload across multiple GPU nodes. Leveraging the parallel architecture of GPUs, we were able to drastically reduce the computation time of a 3-dimensional spatio-temporal interest region search on a single tile of normalized difference vegetative index for Saudi Arabia. We were further able to see an almost linear speedup in compute performance by distributing this workload across several GPUs with a simple MapReduce model. This increases the speed of processing 10-fold over the comparable sequential implementation while simultaneously increasing the amount of data being processed 384-fold. This allowed us to process the entirety of the selected data set instead of a constrained window.
APA, Harvard, Vancouver, ISO, and other styles
13

Ghodsi, Seyed Roholah, and Mohammad Taeibi-Rahni. "A Novel Parallel Algorithm Based on the Gram-Schmidt Method for Tridiagonal Linear Systems of Equations." Mathematical Problems in Engineering 2010 (2010): 1–17. http://dx.doi.org/10.1155/2010/268093.

Full text
Abstract:
This paper introduces a new parallel algorithm based on the Gram-Schmidt orthogonalization method. This parallel algorithm can find almost exact solutions of tridiagonal linear systems of equations in an efficient way. The system of equations is partitioned in proportion to the number of processors, and each partition is solved by a processor with a minimum of requests for the other partitions' data. The considerable reduction in data communication between processors yields a considerable speedup. The relationships between partitions approximately disappear if some columns are switched. Hence, the speed of computation increases, and the computational cost decreases. Consequently, the obtained results show that the suggested algorithm is considerably scalable. In addition, this method of partitioning can significantly decrease the computational cost on a single processor and make it possible to solve larger systems of equations. To evaluate the performance of the parallel algorithm, speedup and efficiency are presented. The results reveal that the proposed algorithm is practical and efficient.
APA, Harvard, Vancouver, ISO, and other styles
14

Fan, Junfu, Chenghu Zhou, Ting Ma, Min Ji, Yuke Zhou, and Tao Xu. "DWSI: an approach to solving the polygon intersection-spreading problem with a parallel union algorithm at the feature layer level." Boletim de Ciências Geodésicas 20, no. 1 (March 2014): 159–82. http://dx.doi.org/10.1590/s1982-21702014000100011.

Full text
Abstract:
A dual-way seeds indexing (DWSI) method based on the R-tree and the Open Geospatial Consortium (OGC) simple feature model was proposed to solve the polygon intersection-spreading problem. A parallel polygon union algorithm based on the improved DWSI and the OpenMP parallel programming model was developed to validate the usability of the data partition method. The experimental results reveal that the improved DWSI method can implement a robust parallel task partition by overcoming the polygon intersection-spreading problem. The parallel union algorithm applying DWSI not only scaled up the data processing but also sped up the computation compared with the serial implementation, and it showed higher computational efficiency, with higher speedups, on larger-scale datasets. Therefore, the improved DWSI can be a potential approach to parallelizing vector data overlay algorithms based on the OGC simple data model at the feature layer level.
APA, Harvard, Vancouver, ISO, and other styles
15

Tang, Wenjie, Wentong Cai, Yiping Yao, Xiao Song, and Feng Zhu. "An alternative approach for collaborative simulation execution on a CPU+GPU hybrid system." SIMULATION 96, no. 3 (November 14, 2019): 347–61. http://dx.doi.org/10.1177/0037549719885178.

Full text
Abstract:
In the past few years, the graphics processing unit (GPU) has been widely used to accelerate time-consuming models in simulations. Since both model computation and simulation management are main factors that affect the performance of large-scale simulations, only accelerating model computation will limit the potential speedup. Moreover, models that can be well accelerated by a GPU could be insufficient, especially for simulations with many lightweight models. Traditionally, the parallel discrete event simulation (PDES) method is used to solve this class of simulation, but most PDES simulators only utilize the central processing unit (CPU) even though the GPU is commonly available now. Hence, we propose an alternative approach for collaborative simulation execution on a CPU+GPU hybrid system. The GPU supports both simulation management and model computation as CPUs. A concurrency-oriented scheduling algorithm was proposed to enable cooperation between the CPU and the GPU, so that multiple computation and communication resources can be efficiently utilized. In addition, GPU functions have also been carefully designed to adapt the algorithm. The combination of those efforts allows the proposed approach to achieve significant speedup compared to the traditional PDES on a CPU.
APA, Harvard, Vancouver, ISO, and other styles
16

Touati, Sid-Ahmed-Ali, Julien Worms, and Sébastien Briais. "The Speedup-Test: a statistical methodology for programme speedup analysis and computation." Concurrency and Computation: Practice and Experience 25, no. 10 (October 15, 2012): 1410–26. http://dx.doi.org/10.1002/cpe.2939.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Liu, Cong, Wen Wang, and Zhi Ying Wang. "Speculative High Performance Computation on Heterogeneous Multi-Core." Advanced Materials Research 1049-1050 (October 2014): 2126–30. http://dx.doi.org/10.4028/www.scientific.net/amr.1049-1050.2126.

Full text
Abstract:
Thread-level speculation (TLS) has been proposed and researched to parallelize traditional sequential applications on homogeneous multi-core architectures. In this paper, a heterogeneous multi-core hardware simulation system is presented, which provides a TLS execution mechanism. With a novel TLS programming model and a number of new speculative tuning techniques, the benchmark Gzip is parallelized from −3% to 195% on a four-core processor, and the speedups of the test benchmarks are 30%, 43%, and 156%, respectively, with arbitrary, hotspot, and insight speculation.
APA, Harvard, Vancouver, ISO, and other styles
18

Le Bras, Ronan, Yexiang Xue, Richard Bernstein, Carla Gomes, and Bart Selman. "A Human Computation Framework for Boosting Combinatorial Solvers." Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 2 (September 5, 2014): 121–32. http://dx.doi.org/10.1609/hcomp.v2i1.13155.

Full text
Abstract:
We propose a general framework for boosting combinatorial solvers through human computation. Our framework combines insights from human workers with the power of combinatorial optimization. The combinatorial solver is also used to guide requests for the workers, and thereby obtain the most useful human feedback quickly. Our approach also incorporates a problem decomposition approach with a general strategy for discarding incorrect human input. We apply this framework in the domain of materials discovery, and demonstrate a speedup of over an order of magnitude.
APA, Harvard, Vancouver, ISO, and other styles
19

Li, De Bo, Qi Sheng Xu, Yue Liang Shen, Zhi Yong Wen, and Ya Ming Liu. "Parallel Algorithms for Compressible Turbulent Flow Simulation Using Direct Numerical Method." Advanced Materials Research 516-517 (May 2012): 980–91. http://dx.doi.org/10.4028/www.scientific.net/amr.516-517.980.

Full text
Abstract:
In this study, SPMD parallel computation of a compressible turbulent jet flow with an explicit finite difference method by the direct numerical method is performed on an IBM Linux Cluster. The conservation equations, boundary conditions including NSCBC (characteristic boundary conditions), the grid generation method, and the solving process are carefully presented in order to give other researchers a clear understanding of large-scale parallel computing of compressible turbulent flows using an explicit finite difference method, which is scarce in the literature. The speedup factor and parallel computational efficiency are presented for different domain decomposition methods. In order to use our explicit finite difference method for large-scale parallel computing, the grid size imposed on each processor, the speedup factor, and the efficiency factor should be carefully chosen in order to design an efficient parallel code. Our newly developed parallel code is quite efficient, in terms of parallel computational efficiency, compared with implicit finite difference or spectral methods. This is quite useful for future research on gas-particle two-phase flow, for which highly efficient parallel code is still lacking.
APA, Harvard, Vancouver, ISO, and other styles
20

Bravyi, Sergey, Oliver Dial, Jay M. Gambetta, Darío Gil, and Zaira Nazario. "The future of quantum computing with superconducting qubits." Journal of Applied Physics 132, no. 16 (October 28, 2022): 160902. http://dx.doi.org/10.1063/5.0082975.

Full text
Abstract:
For the first time in history, we are seeing a branching point in computing paradigms with the emergence of quantum processing units (QPUs). Extracting the full potential of computation and realizing quantum algorithms with a super-polynomial speedup will most likely require major advances in quantum error correction technology. Meanwhile, achieving a computational advantage in the near term may be possible by combining multiple QPUs through circuit knitting techniques, improving the quality of solutions through error suppression and mitigation, and focusing on heuristic versions of quantum algorithms with asymptotic speedups. For this to happen, the performance of quantum computing hardware needs to improve and software needs to seamlessly integrate quantum and classical processors together to form a new architecture that we are calling quantum-centric supercomputing. In the long term, we see hardware that exploits qubit connectivity in higher than 2D topologies to realize more efficient quantum error correcting codes, modular architectures for scaling QPUs and parallelizing workloads, and software that evolves to make the intricacies of the technology invisible to the users and realize the goal of ubiquitous, frictionless quantum computing.
APA, Harvard, Vancouver, ISO, and other styles
21

Zhang, Jianfei, and Lei Zhang. "Efficient CUDA Polynomial Preconditioned Conjugate Gradient Solver for Finite Element Computation of Elasticity Problems." Mathematical Problems in Engineering 2013 (2013): 1–12. http://dx.doi.org/10.1155/2013/398438.

Full text
Abstract:
The graphics processing unit (GPU) has obtained great success in scientific computing for its tremendous computational horsepower and very high memory bandwidth. This paper discusses an efficient way to implement a polynomial preconditioned conjugate gradient solver for the finite element computation of elasticity on NVIDIA GPUs using the compute unified device architecture (CUDA). The sliced block ELLPACK (SBELL) format is introduced to store the sparse matrix arising from the finite element discretization of elasticity with fewer padding zeros than traditional ELLPACK-based formats. Polynomial preconditioning methods have been investigated in terms of both convergence and running time. Based on overall performance, the least-squares (L-S) polynomial method is chosen as the preconditioner in the PCG solver for the finite element equations derived from elasticity, owing to its best results on different example meshes. In the PCG solver, a mixed-precision algorithm is used not only to reduce the overall computational and storage requirements and bandwidth but also to make full use of the capacity of the GPU devices. With the SBELL format and the mixed-precision algorithm, the GPU-based L-S preconditioned CG achieves a speedup of about 7–9× over the CPU implementation.
APA, Harvard, Vancouver, ISO, and other styles
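The storage idea behind ELLPACK-style formats, of which the paper's SBELL is a sliced, blocked refinement, is to pad every row to a fixed number of nonzeros so that the sparse matrix-vector product inside CG becomes a regular loop. A plain ELL sketch is given below; the slicing and blocking of SBELL, and the polynomial preconditioner, are omitted.

```python
import numpy as np

def to_ell(rows):
    """Convert a list-of-(col, val) rows to padded ELLPACK arrays."""
    width = max(len(r) for r in rows)            # max nonzeros per row
    n = len(rows)
    cols = np.zeros((n, width), dtype=int)
    vals = np.zeros((n, width))
    for i, r in enumerate(rows):
        for k, (c, v) in enumerate(r):
            cols[i, k], vals[i, k] = c, v        # unused slots stay zero-padded
    return cols, vals

def ell_spmv(cols, vals, x):
    """y = A @ x with A in ELLPACK form; every row does the same amount of work."""
    return (vals * x[cols]).sum(axis=1)

# toy 3x3 sparse matrix
cols, vals = to_ell([[(0, 2.0)], [(0, 1.0), (2, 3.0)], [(1, 4.0)]])
print(ell_spmv(cols, vals, np.array([1.0, 1.0, 1.0])))
```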
22

FABRI, ANDREAS, and OLIVIER DEVILLERS. "SCALABLE ALGORITHMS FOR BICHROMATIC LINE SEGMENT INTERSECTION PROBLEMS ON COARSE GRAINED MULTICOMPUTERS." International Journal of Computational Geometry & Applications 06, no. 04 (December 1996): 487–506. http://dx.doi.org/10.1142/s0218195996000307.

Full text
Abstract:
We present output-sensitive scalable parallel algorithms for bichromatic line segment intersection problems in the coarse grained multicomputer model. Under the assumption that n ≥ p², where n is the number of line segments and p the number of processors, we obtain an intersection counting algorithm with a time complexity of [Formula: see text], where Ts(m, p) is the time used to sort m items on a p-processor machine. The first term captures the time spent in sequential computation performed locally by each processor. The second term captures the interprocessor communication time. An additional [Formula: see text] time in sequential computation is spent on the reporting of the k intersections. As the sequential time complexity is O(n log n) for counting and an additional O(k) for reporting, we obtain a speedup of [Formula: see text] in the sequential part of the algorithm. The speedup in the communication part obviously depends on the underlying architecture. For example, for a hypercube it ranges between [Formula: see text] and [Formula: see text], depending on the ratio of n and p. As the reporting does not involve more interprocessor communication than the counting, the algorithm achieves a full speedup of p for k ≥ O(max(n log n log p, n log³ p)), even on a hypercube.
APA, Harvard, Vancouver, ISO, and other styles
23

Mielikainen, J., B. Huang, H. L. A. Huang, M. D. Goldberg, and A. Mehta. "Speeding Up the Computation of WRF Double-Moment 6-Class Microphysics Scheme with GPU." Journal of Atmospheric and Oceanic Technology 30, no. 12 (December 1, 2013): 2896–906. http://dx.doi.org/10.1175/jtech-d-12-00218.1.

Full text
Abstract:
The Weather Research and Forecasting model (WRF) double-moment 6-class microphysics scheme (WDM6) implements a double-moment bulk microphysical parameterization of clouds and precipitation and is applicable in mesoscale and general circulation models. WDM6 extends the WRF single-moment 6-class microphysics scheme (WSM6) by incorporating the number concentrations for cloud and rainwater along with a prognostic variable of cloud condensation nuclei (CCN) number concentration. Moreover, it predicts the mixing ratios of six water species (water vapor, cloud droplets, cloud ice, snow, rain, and graupel), similar to WSM6. This paper describes improving the computational performance of WDM6 by exploiting its inherent fine-grained parallelism using the NVIDIA graphics processing unit (GPU). Compared to the single-threaded CPU, a single GPU implementation of WDM6 obtains a speedup of 150× with the input/output (I/O) transfer and 206× without the I/O transfer. Using four GPUs, the speedup reaches 347× and 715×, respectively.
APA, Harvard, Vancouver, ISO, and other styles
24

DU, LIU-GE, KANG LI, FAN-MIN KONG, and YUAN HU. "PARALLEL 3D FINITE-DIFFERENCE TIME-DOMAIN METHOD ON MULTI-GPU SYSTEMS." International Journal of Modern Physics C 22, no. 02 (February 2011): 107–21. http://dx.doi.org/10.1142/s012918311101618x.

Full text
Abstract:
Finite-difference time-domain (FDTD) is a popular but computationally intensive method for solving Maxwell's equations for electrical and optical device simulation. This paper presents implementations of three-dimensional FDTD with convolutional perfectly matched layer (CPML) absorbing boundary conditions on a graphics processing unit (GPU). Electromagnetic fields in Yee cells are calculated in parallel by millions of threads arranged as a grid of blocks with the compute unified device architecture (CUDA) programming model, and considerable speedup factors are obtained versus sequential CPU code. We extend the parallel algorithm to multiple GPUs in order to solve electrically large structures. An asynchronous memory copy scheme is used in the data exchange procedure to improve the computation efficiency. We successfully use this technique to simulate pointwise source radiation and validate the result by comparison to a high-precision computation, which shows favorable agreement. With four commodity GTX295 graphics cards in a single personal computer, more than 4000 million Yee cells can be updated in one second, which is hundreds of times faster than traditional CPU computation.
APA, Harvard, Vancouver, ISO, and other styles
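A one-dimensional Yee-grid update captures the structure of the field kernels that the paper offloads to GPUs. This is the textbook 1D FDTD leapfrog in normalized units, not the 3D CPML code described above.

```python
import numpy as np

def fdtd_1d(nz=200, nsteps=500, src=100):
    """Leapfrog update of Ex/Hy on a 1D Yee grid in normalized units."""
    ex = np.zeros(nz)
    hy = np.zeros(nz)
    for t in range(nsteps):
        # update H from the curl of E (fields staggered by half a cell)
        hy[:-1] += 0.5 * (ex[1:] - ex[:-1])
        # update E from the curl of H
        ex[1:] += 0.5 * (hy[1:] - hy[:-1])
        ex[src] += np.exp(-((t - 30) / 10.0) ** 2)   # soft Gaussian source
    return ex

field = fdtd_1d()
```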
25

Deforth, Kevin, Marc Desgroseilliers, Nicolas Gama, Mariya Georgieva, Dimitar Jetchev, and Marius Vuille. "XORBoost: Tree Boosting in the Multiparty Computation Setting." Proceedings on Privacy Enhancing Technologies 2022, no. 4 (October 2022): 66–85. http://dx.doi.org/10.56553/popets-2022-0099.

Full text
Abstract:
We present XORBoost, a novel protocol for both training gradient boosted tree models and for using these models for inference in the multiparty computation (MPC) setting. Our protocol supports training on generically split datasets (vertical and horizontal splitting, or a combination of those) while keeping all the information about features, thresholds, and evaluation paths private; only the tree depth and the number of binary trees are public parameters of the model. By using novel optimization techniques that reduce the number of oblivious permutation evaluations as well as sorting operations, we further speed up the algorithm. The protocol is agnostic to the underlying MPC framework or implementation.
APA, Harvard, Vancouver, ISO, and other styles
26

Wang, N., C. M. Tsai, and K. C. Cha. "A Study of Parallel Efficiency of Modified Direct Algorithm Applied to Thermohydrodynamic Lubrication." Journal of Mechanics 25, no. 2 (June 2009): 143–50. http://dx.doi.org/10.1017/s1727719100002598.

Full text
Abstract:
This study examines parallel computing as a means to minimize the execution time of an optimization applied to thermohydrodynamic (THD) lubrication. The objective of the optimization is to maximize the load capacity of a slider bearing with two design variables. A global optimization method, the DIviding RECTangle (DIRECT) algorithm, is used. The first approach was to apply parallel computing within the THD model in a shared-memory processing (SMP) environment to examine the parallel efficiency of fine-grain computation. Next, distributed parallel computing at the search level was conducted using the standard DIRECT algorithm. Then, the algorithm was modified to provide a version suitable for effective parallel computing. In the latter coarse-grain computation, the speedups obtained by the DIRECT algorithms are compared with some previous studies using other parallel optimization methods. In the fine-grain computation on the SMP machine, the communication and overhead time costs prohibit high speedup in the cases of four or more simultaneous threads. It is found that the standard DIRECT algorithm is an efficient sequential but less parallel-computing-friendly method. When the modified algorithm is used in the slider bearing optimization, a parallel efficiency of 96.3% is obtained on the 16-computing-node cluster. This study presents the modified DIRECT algorithm, an efficient parallel search method, for general engineering optimization problems.
APA, Harvard, Vancouver, ISO, and other styles
27

Bashashin, Maxim, Elena Zemlyanaya, and Konstantin Lukyanov. "Double-Folding Nucleus-Nucleus Optical Potential: Parallel MPI and OpenMP Implementations." EPJ Web of Conferences 226 (2020): 02004. http://dx.doi.org/10.1051/epjconf/202022602004.

Full text
Abstract:
The computation of the real part of the nucleus-nucleus optical potential based on the microscopic double-folding model was implemented within both the MPI and OpenMP parallelising techniques. Test calculations of the total cross section of the 6He + 28Si scattering at the energy 50 A MeV show that both techniques provide significant comparable speedup of the calculations.
APA, Harvard, Vancouver, ISO, and other styles
28

Alhussan, Amel Ali, Hussah Nasser AlEisa, Ghada Atteia, Nahed H. Solouma, Rania Ahmed Abdel Azeem Abul Seoud, Ola S. Ayoub, Vidan F. Ghoneim, and Nagwan Abdel Samee. "ForkJoinPcc Algorithm for Computing the Pcc Matrix in Gene Co-Expression Networks." Electronics 11, no. 8 (April 7, 2022): 1174. http://dx.doi.org/10.3390/electronics11081174.

Full text
Abstract:
High-throughput microarrays contain a huge number of genes. Determining the relationships between all these genes is a time-consuming computation. In this paper, the authors provide a parallel algorithm for finding Pearson's correlation coefficient between genes measured in Affymetrix microarrays. The main idea of the proposed algorithm, ForkJoinPcc, mimics the well-known parallel programming model: the fork–join model. The parallel MATLAB APIs have been employed and evaluated on shared- or distributed-multiprocessing systems. Two performance metrics—the processing and communication times—have been used to assess the performance of ForkJoinPcc. The experimental results reveal that the ForkJoinPcc algorithm achieves a substantial speedup of 62× on the cluster platform compared with a 3.8× speedup on the multicore platform.
APA, Harvard, Vancouver, ISO, and other styles
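The fork-join pattern that the ForkJoinPcc abstract describes can be sketched with a process pool that forks one task per block of gene rows and joins the partial correlation blocks. The block size and the use of Python's multiprocessing are illustrative choices, not the authors' MATLAB implementation.

```python
import numpy as np
from multiprocessing import Pool

def pcc_block(args):
    """Pearson correlations of one block of gene rows against all genes."""
    start, stop, data = args
    block = stop - start
    c = np.corrcoef(data[start:stop], data)      # (block + n) x (block + n)
    return start, c[:block, block:]              # block-vs-all correlations

def pcc_matrix(data, block=64, workers=4):
    """Full gene-by-gene Pearson correlation matrix via fork-join blocks."""
    n = len(data)
    tasks = [(s, min(s + block, n), data) for s in range(0, n, block)]   # fork
    out = np.empty((n, n))
    with Pool(workers) as pool:
        for start, rows in pool.imap_unordered(pcc_block, tasks):        # join
            out[start:start + rows.shape[0]] = rows
    return out

if __name__ == "__main__":
    genes = np.random.rand(200, 30)              # 200 genes x 30 samples
    print(pcc_matrix(genes).shape)
```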
29

Servais, Jason, and Ehsan Atoofian. "Adaptive Computation Reuse for Energy-Efficient Training of Deep Neural Networks." ACM Transactions on Embedded Computing Systems 20, no. 6 (November 30, 2021): 1–24. http://dx.doi.org/10.1145/3487025.

Full text
Abstract:
In recent years, Deep Neural Networks (DNNs) have been deployed into a diverse set of applications from voice recognition to scene generation mostly due to their high-accuracy. DNNs are known to be computationally intensive applications, requiring a significant power budget. There have been a large number of investigations into energy-efficiency of DNNs. However, most of them primarily focused on inference while training of DNNs has received little attention. This work proposes an adaptive technique to identify and avoid redundant computations during the training of DNNs. Elements of activations exhibit a high degree of similarity, causing inputs and outputs of layers of neural networks to perform redundant computations. Based on this observation, we propose Adaptive Computation Reuse for Tensor Cores (ACRTC) where results of previous arithmetic operations are used to avoid redundant computations. ACRTC is an architectural technique, which enables accelerators to take advantage of similarity in input operands and speedup the training process while also increasing energy-efficiency. ACRTC dynamically adjusts the strength of computation reuse based on the tolerance of precision relaxation in different training phases. Over a wide range of neural network topologies, ACRTC accelerates training by 33% and saves energy by 32% with negligible impact on accuracy.
APA, Harvard, Vancouver, ISO, and other styles
30

Weiss, Robin M., and Jeffrey Shragge. "Solving 3D anisotropic elastic wave equations on parallel GPU devices." GEOPHYSICS 78, no. 2 (March 1, 2013): F7—F15. http://dx.doi.org/10.1190/geo2012-0063.1.

Full text
Abstract:
Efficiently modeling seismic data sets in complex 3D anisotropic media by solving the 3D elastic wave equation is an important challenge in computational geophysics. Using a stress-stiffness formulation on a regular grid, we tested a 3D finite-difference time-domain solver using a second-order temporal and eighth-order spatial accuracy stencil that leverages the massively parallel architecture of graphics processing units (GPUs) to accelerate the computation of key kernels. The relatively small memory of an individual GPU limits the model domain sizes that can be computed on a single device. To circumvent this constraint and move toward modeling industry-sized 3D anisotropic elastic data sets, we parallelized computation across multiple GPU devices by using domain decomposition and, for each time step, employing an interdevice communication protocol to exchange data values falling within interior boundaries of each subdomain. For two or more GPU devices within a single compute node, we use direct peer-to-peer (i.e., GPU-to-GPU) communication, whereas for networked nodes we employed message-passing interface directives to route data over the network. Our 2D GPU-based anisotropic elastic modeling tests achieved a [Formula: see text] speedup relative to an OpenMP CPU implementation run on an eight-core machine, whereas our 3D tests using dual-GPU devices produced up to a [Formula: see text] speedup. The performance boost afforded by the GPU architecture allowed us to model seismic data for 3D anisotropic elastic models at lower hardware cost and in less time than has been previously possible.
APA, Harvard, Vancouver, ISO, and other styles
31

SAMPSON, MARINOS, DIMITRIOS VOUDOURIS, and GEORGE PAPAKONSTANTINOU. "USING SIMPLE DISJOINT DECOMPOSITION TO PERFORM SECURE COMPUTATIONS." Journal of Circuits, Systems and Computers 19, no. 07 (November 2010): 1559–69. http://dx.doi.org/10.1142/s0218126610006906.

Full text
Abstract:
This paper deals with the use of a minimal model for performing secure computations. The communication is based on a protocol which makes use of disjoint function decomposition and more precisely of minimal ESCT (Exclusive-or Sum of Complex Terms) expressions in order to perform a secure computation. The complexity of this protocol is directly proportional to the size of the ESCT expression in use, which is much smaller in comparison to other proposed minimal models (e.g., ESOP). Moreover, quantum algorithms are discussed that provide significant speedup to the process of producing the ESCT expressions, when compared to conventional ones. Hence, this paper provides a very useful application of the ESCT expressions in the field of cryptographic protocols.
APA, Harvard, Vancouver, ISO, and other styles
32

Yuan, Yirang, Qing Yang, Changfeng Li, and Tongjun Sun. "A Numerical Approximation Structured by Mixed Finite Element and Upwind Fractional Step Difference for Semiconductor Device with Heat Conduction and Its Numerical Analysis." Numerical Mathematics: Theory, Methods and Applications 10, no. 3 (June 20, 2017): 541–61. http://dx.doi.org/10.4208/nmtma.2017.y15013.

Full text
Abstract:
A coupled mathematical system of four quasi-linear partial differential equations and the initial-boundary value conditions is presented to interpret the transient behavior of a three-dimensional semiconductor device with heat conduction. The electric potential is defined by an elliptic equation, the electron and hole concentrations are determined by convection-dominated diffusion equations, and the temperature is interpreted by a heat conduction equation. A mixed finite element approximation is used to obtain the electric field potential, and one order of computational accuracy is gained. The two concentration equations and the heat conduction equation are solved by a fractional step scheme modified by a second-order upwind difference method, which can overcome numerical oscillation, dispersion, and computational complexity. This changes the computation of a three-dimensional problem into three successive computations of one-dimensional problems, where the speedup method is used and the computational work is greatly reduced. An optimal second-order error estimate in the L2 norm is derived by prior estimate theory and other special techniques of partial differential equations. This type of parallel method is important in numerical analysis and is most valuable in the numerical simulation of semiconductor devices, and it can successfully solve this internationally famous problem.
APA, Harvard, Vancouver, ISO, and other styles
33

Zhang, Shanghong, Wenda Li, Zhu Jing, Yujun Yi, and Yong Zhao. "Comparison of Three Different Parallel Computation Methods for a Two-Dimensional Dam-Break Model." Mathematical Problems in Engineering 2017 (2017): 1–12. http://dx.doi.org/10.1155/2017/1970628.

Full text
Abstract:
Three parallel methods (OpenMP, MPI, and OpenACC) are evaluated for the computation of a two-dimensional dam-break model using the explicit finite volume method. A dam-break event in the Pangtoupao flood storage area in China is selected as a case study to demonstrate the key technologies for implementing parallel computation. The subsequent acceleration of the methods is also evaluated. The simulation results show that the OpenMP and MPI parallel methods achieve a speedup factor of 9.8× and 5.1×, respectively, on a 32-core computer, whereas the OpenACC parallel method achieves a speedup factor of 20.7× on NVIDIA Tesla K20c graphics card. The results show that if the memory required by the dam-break simulation does not exceed the memory capacity of a single computer, the OpenMP parallel method is a good choice. Moreover, if GPU acceleration is used, the acceleration of the OpenACC parallel method is the best. Finally, the MPI parallel method is suitable for a model that requires little data exchange and large-scale calculation. This study compares the efficiency and methodology of accelerating algorithms for a dam-break model and can also be used as a reference for selecting the best acceleration method for a similar hydrodynamic model.
APA, Harvard, Vancouver, ISO, and other styles
34

Prots’ko, I., N. Kryvinska, and O. Gryshchuk. "THE RUNTIME ANALYSIS OF COMPUTATION OF MODULAR EXPONENTIATION." Radio Electronics, Computer Science, Control, no. 3 (October 6, 2021): 42–47. http://dx.doi.org/10.15588/1607-3274-2021-3-4.

Full text
Abstract:
Context. The problem of fast calculation of modular exponentiation requires the development of effective algorithmic methods using the latest information technologies. Fast computation of modular exponentiation is essential for efficient computation of theoretical-numerical transforms, for providing high cryptographic strength of information data, and in many other applications. Objective – the runtime analysis of software functions for computing modular exponentiation in the developed programs, based on the parallel organization of computation using multithreading. Method. Modular exponentiation is implemented using a 2^k-ary sliding-window algorithm, where k is chosen according to the size of the exponent. Parallelization of the computation consists in calculating the remainders of numbers raised to the powers 2^i modulo the modulus, and then multiplying them modulo the modulus in parallel. Results. The runtimes of three variants of functions for computing modular exponentiation are compared. The algorithm with parallel organization of computation using multithreading provides faster computation of modular exponentiation for exponent values larger than 1K binary digits compared to the modular exponentiation function of the MPIR library. The MPIR library, with an integer data type of 256 to 2048 binary digits, is used to develop the multithreaded algorithm for computing modular exponentiation. Conclusions. The developed software implementation of modular exponentiation on universal computer systems has been considered and analysed. One way to speed up the computation of modular exponentiation is to develop algorithms that can use multithreading technology on multi-core microprocessors. The multithreaded software implementation of modular exponentiation shows an improvement in computation time, compared with the modular exponentiation function of the MPIR library, as the number of binary digits of the exponent increases beyond 1024.
APA, Harvard, Vancouver, ISO, and other styles
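The 2^k-ary sliding-window exponentiation named in the abstract precomputes the odd powers of the base and scans the exponent in windows, trading a few multiplications for many squarings. A minimal single-threaded version is given below; the multithreaded splitting over powers of 2^i used in the paper is not reproduced.

```python
def modexp_sliding_window(base, exp, mod, k=4):
    """Left-to-right 2^k-ary sliding-window modular exponentiation."""
    if exp == 0:
        return 1 % mod
    # precompute odd powers base^1, base^3, ..., base^(2^k - 1) mod mod
    b2 = (base * base) % mod
    odd = {1: base % mod}
    for i in range(3, 1 << k, 2):
        odd[i] = (odd[i - 2] * b2) % mod

    bits = bin(exp)[2:]
    result, i = 1, 0
    while i < len(bits):
        if bits[i] == "0":
            result = (result * result) % mod        # square on a zero bit
            i += 1
        else:
            # take the longest window of at most k bits that ends in a 1
            j = min(i + k, len(bits))
            while bits[j - 1] == "0":
                j -= 1
            window = int(bits[i:j], 2)
            for _ in range(j - i):
                result = (result * result) % mod    # square once per window bit
            result = (result * odd[window]) % mod   # one multiply per window
            i = j
    return result

assert modexp_sliding_window(7, 560, 561) == pow(7, 560, 561)
```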
35

Jordan, Stephen P., Keith S. M. Lee, and John Preskill. "Quantum computation of scattering in scalar quantum field theories." Quantum Information and Computation 14, no. 11&12 (September 2014): 1014–80. http://dx.doi.org/10.26421/qic14.11-12-8.

Full text
Abstract:
Quantum field theory provides the framework for the most fundamental physical theories to be confirmed experimentally and has enabled predictions of unprecedented precision. However, calculations of physical observables often require great computational complexity and can generally be performed only when the interaction strength is weak. A full understanding of the foundations and rich consequences of quantum field theory remains an outstanding challenge. We develop a quantum algorithm to compute relativistic scattering amplitudes in massive $\phi^4$ theory in spacetime of four and fewer dimensions. The algorithm runs in a time that is polynomial in the number of particles, their energy, and the desired precision, and applies at both weak and strong coupling. Thus, it offers exponential speedup over existing classical methods at high precision or strong coupling.
APA, Harvard, Vancouver, ISO, and other styles
36

Zhou, Hongkuan, Ajitesh Srivastava, Hanqing Zeng, Rajgopal Kannan, and Viktor Prasanna. "Accelerating large scale real-time GNN inference using channel pruning." Proceedings of the VLDB Endowment 14, no. 9 (May 2021): 1597–605. http://dx.doi.org/10.14778/3461535.3461547.

Full text
Abstract:
Graph Neural Networks (GNNs) are proven to be powerful models to generate node embedding for downstream applications. However, due to the high computation complexity of GNN inference, it is hard to deploy GNNs for large-scale or real-time applications. In this paper, we propose to accelerate GNN inference by pruning the dimensions in each layer with negligible accuracy loss. Our pruning framework uses a novel LASSO regression formulation for GNNs to identify feature dimensions (channels) that have high influence on the output activation. We identify two inference scenarios and design pruning schemes based on their computation and memory usage for each. To further reduce the inference complexity, we effectively store and reuse hidden features of visited nodes, which significantly reduces the number of supporting nodes needed to compute the target embedding. We evaluate the proposed method with the node classification problem on five popular datasets and a real-time spam detection application. We demonstrate that the pruned GNN models greatly reduce computation and memory usage with little accuracy loss. For full inference, the proposed method achieves an average of 3.27X speedup with only 0.002 drop in F1-Micro on GPU. For batched inference, the proposed method achieves an average of 6.67X speedup with only 0.003 drop in F1-Micro on CPU. To the best of our knowledge, we are the first to accelerate large scale real-time GNN inference through channel pruning.
APA, Harvard, Vancouver, ISO, and other styles
37

AKL, SELIM G. "PARALLEL REAL-TIME COMPUTATION OF NONLINEAR FEEDBACK FUNCTIONS." Parallel Processing Letters 13, no. 01 (March 2003): 65–75. http://dx.doi.org/10.1142/s012962640300115x.

Full text
Abstract:
This paper focuses on the improvement in the quality of computation provided by parallelism. The problem of interest is that of computing the maximum of a nonlinear feedback function in a real-time environment. We show that the solution obtained in parallel is significantly, provably, and consistently better than a sequential one. It is important to note that our purpose is not to demonstrate merely that a parallel computer can obtain a solution to a computational problem that is of higher quality than one derived sequentially. The latter is an interesting (and often surprising) observation in its own right, but we wish to go further. It is shown here that the improvement in quality due to parallelism can be arbitrarily high. To be specific, the ratio of the parallel solution to the sequential one is typically superlinear in the number of processors used by the parallel computer. This result is akin to superlinear speedup—a phenomenon itself originally thought to be impossible.
APA, Harvard, Vancouver, ISO, and other styles
38

Valcan, Sorin, and Mihail Gaianu. "CUDA Implementation For Eye Location On Infrared Images." Scalable Computing: Practice and Experience 23, no. 1 (April 25, 2022): 1–8. http://dx.doi.org/10.12694/scpe.v23i1.1954.

Full text
Abstract:
Parallel programming using GPUs is a modern solution to reduce computation time for large tasks. This is done by dividing algorithms into smaller parts that can be executed simultaneously. CUDA has many practical applications, especially in video processing, medical imaging, and machine learning. This paper presents how parallel implementations can speed up a ground-truth data generation algorithm for eye location on infrared driver recordings, which is executed on a database with more than 2 million frames. The computation time is much shorter than that of a sequential CPU implementation, which makes it feasible to run the algorithm multiple times if updates are required and even to use it in real-time applications.
APA, Harvard, Vancouver, ISO, and other styles
39

DAS, B. K., R. N. MAHAPATRA, and B. N. CHATTERJI. "PERFORMANCE MODELING OF DISCRETE COSINE TRANSFORM FOR STAR GRAPH CONNECTED MULTIPROCESSORS." Journal of Circuits, Systems and Computers 06, no. 06 (December 1996): 635–48. http://dx.doi.org/10.1142/s0218126696000443.

Full text
Abstract:
The Discrete Cosine Transform algorithm has attracted research attention for its ability to address application problems in signal and image processing such as speech coding, image coding, filtering, cepstral analysis, topographic classification, progressive image transmission, data compression, etc. It has major applications in pattern recognition and image processing. In this paper, a Cooley-Tukey approach is proposed for the computation of the Discrete Cosine Transform, and the necessary mathematical formulations are developed for Star Graph connected multiprocessors. The signal flow graph of the algorithm has been designed for mapping onto the Star Graph. The modeling results are derived in terms of computation time, speedup, and efficiency.
APA, Harvard, Vancouver, ISO, and other styles
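For reference, the speedup and efficiency metrics mentioned in entry 40 above are commonly defined as follows (standard textbook definitions, not taken from the paper itself), where T_1 is the sequential execution time and T_p the execution time on p processors:

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}.
\]

An efficiency close to 1 indicates near-linear speedup on p processors.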
41

Lin, Chien-Chou, Chi-Chun Pan, and Jen-Hui Chuang. "A novel potential-based path planning of 3-D articulated robots with moving bases." Robotica 22, no. 4 (August 2004): 359–67. http://dx.doi.org/10.1017/s0263574704000062.

Full text
Abstract:
This paper proposes a novel path planning algorithm for 3-D articulated robots with moving bases, based on a generalized potential field model. The approach computes, similarly to what is done in electrostatics, repulsive forces and torques between charged objects. A collision-free path can be obtained by locally adjusting the robot configuration to search for minimum-potential configurations using these forces and torques. The proposed approach is efficient since these potential gradients are analytically tractable. In order to speed up the computation, a sequential planning strategy is adopted. Simulation results show that the proposed algorithm works well in terms of collision avoidance and computational efficiency.
APA, Harvard, Vancouver, ISO, and other styles
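The abstract of entry 41 above describes locally adjusting a configuration by following analytically tractable potential gradients. The sketch below is a generic artificial-potential-field descent for a 2-D point robot, with made-up goal, obstacle and gain values; it is not the authors' generalized potential model for 3-D articulated robots.

# Minimal artificial-potential-field sketch for a 2-D point robot (illustrative only).
import numpy as np

GOAL = np.array([9.0, 9.0])
OBSTACLE = np.array([5.0, 4.5])

def potential_gradient(q, k_att=1.0, k_rep=50.0, radius=2.0):
    # attractive term pulls toward the goal
    g = k_att * (q - GOAL)
    d = np.linalg.norm(q - OBSTACLE)
    if d < radius:
        # repulsive term (gradient of 0.5*k_rep*(1/d - 1/radius)^2) grows near the obstacle
        g += k_rep * (1.0 / radius - 1.0 / d) / d**3 * (q - OBSTACLE)
    return g

def plan(q_start, step=0.05, tol=0.1, max_iters=10000):
    q = np.asarray(q_start, dtype=float)
    path = [q.copy()]
    for _ in range(max_iters):
        if np.linalg.norm(q - GOAL) < tol:
            break
        g = potential_gradient(q)
        q = q - step * g / (np.linalg.norm(g) + 1e-12)   # small step downhill in the potential
        path.append(q.copy())
    return np.array(path)

print(plan([0.0, 0.0])[-1])   # should end close to GOAL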
42

Wang, Mao, Handong Tan, Yuzhu Wang, Changhong Lin, and Miao Peng. "Parallel Computation for Inversion Algorithm of 2D ZTEM." Applied Sciences 12, no. 24 (December 10, 2022): 12664. http://dx.doi.org/10.3390/app122412664.

Full text
Abstract:
ZTEM refers to the Z-axis tipper electromagnetic method, an airborne magnetotelluric sounding method based on differences in rock resistivity, using the natural electromagnetic field. The method is effective for exploring large-scale structures where the terrain is rugged. This paper introduces a conjugate gradient inversion algorithm for 2D ZTEM. The method, which avoids forming the Jacobian matrix, is very effective, but not efficient enough when the model is divided into a large grid. This study accelerates the computation using parallel computing. We compare the results of the serial algorithm with those of the parallel algorithm, which demonstrates that the parallel algorithm is correct. When the number of processes is between three and six, the speedup ratio is between 1.74 and 3.19, which improves the effectiveness of the parallel 2D ZTEM inversion.
APA, Harvard, Vancouver, ISO, and other styles
43

Zhang, Dejian, Bingqing Lin, Jiefeng Wu, and Qiaoying Lin. "GP-SWAT (v1.0): a two-level graph-based parallel simulation tool for the SWAT model." Geoscientific Model Development 14, no. 10 (September 30, 2021): 5915–25. http://dx.doi.org/10.5194/gmd-14-5915-2021.

Full text
Abstract:
High-fidelity and large-scale hydrological models are increasingly used to investigate the impacts of human activities and climate change on water availability and quality. However, the detailed representations of real-world systems and processes contained in these models inevitably lead to prohibitively high execution times, ranging from minutes to days. Such models become computationally prohibitive or even infeasible when large iterative model simulations are involved. In this study, we propose a generic two-level (i.e., watershed- and subbasin-level) model parallelization schema to reduce the run time of computationally expensive model applications through a combination of model spatial decomposition and the graph-parallel Pregel algorithm. Taking the Soil and Water Assessment Tool (SWAT) as an example, we implemented a generic tool named GP-SWAT, enabling watershed-level and subbasin-level model parallelization on a Spark computer cluster. We then evaluated GP-SWAT in two sets of experiments to demonstrate the ability of GP-SWAT to accelerate single and iterative model simulations and to run in different environments. In each test set, GP-SWAT was applied for the parallel simulation of four synthetic hydrological models with different input/output (I/O) burdens. The single-model parallelization results showed that GP-SWAT can obtain a 2.3–5.8-times speedup. For multiple simulations with subbasin-level parallelization, GP-SWAT yielded a remarkable speedup of 8.34–27.03 times. In both cases, the speedup ratios increased with an increasing computation burden. The experimental results indicate that GP-SWAT can effectively solve the high-computational-demand problems of the SWAT model. In addition, as a scalable and flexible tool, it can be run in diverse environments, from a commodity computer running the Microsoft Windows operating system to a Spark cluster consisting of a large number of computational nodes. Moreover, it is possible to apply this generic tool to other subbasin-based hydrological models or even acyclic models in other domains to alleviate I/O demands and to optimize model computational performance.
APA, Harvard, Vancouver, ISO, and other styles
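As a rough illustration of the subbasin-level parallelism described in entry 43 above, the sketch below simulates, in each round, all subbasins whose upstream neighbours have already finished, in the spirit of a Pregel-style superstep. The toy routing graph, the simulate_subbasin stub and the thread pool are assumptions for illustration, not GP-SWAT's Spark implementation.

# Toy subbasin-level parallel scheduling over an acyclic routing graph.
from concurrent.futures import ThreadPoolExecutor

# subbasin -> list of upstream subbasins it depends on (made-up example)
UPSTREAM = {1: [], 2: [], 3: [1, 2], 4: [3], 5: []}

def simulate_subbasin(sid):
    return f"subbasin {sid} done"

def run_watershed(upstream):
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(upstream):
            # all subbasins whose upstream dependencies are satisfied run in parallel
            ready = [s for s in upstream
                     if s not in done and all(u in done for u in upstream[s])]
            for sid, out in zip(ready, pool.map(simulate_subbasin, ready)):
                results[sid] = out
            done.update(ready)
    return results

print(run_watershed(UPSTREAM))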
44

Meissner, Adam. "Experimental analysis of some computation rules in a simple parallel reasoning system for the ALC description logic." International Journal of Applied Mathematics and Computer Science 21, no. 1 (March 1, 2011): 83–95. http://dx.doi.org/10.2478/v10006-011-0006-1.

Full text
Abstract:
A computation rule determines the order of selecting premises during an inference process. In this paper we empirically analyse three particular computation rules in a tableau-based, parallel reasoning system for the ALC description logic, which is built in the relational programming model in the Oz language. The system is constructed in the lean deduction style, namely, it has the form of a small program containing only basic mechanisms, which assure soundness and completeness of reasoning. In consequence, the system can act as a convenient test-bed for comparing various inference algorithms and their elements. We take advantage of this property and evaluate the studied methods of selecting premises with regard to their efficiency and speedup, which can be obtained by parallel processing.
APA, Harvard, Vancouver, ISO, and other styles
45

Onken, Derek, Samy Wu Fung, Xingjian Li, and Lars Ruthotto. "OT-Flow: Fast and Accurate Continuous Normalizing Flows via Optimal Transport." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9223–32. http://dx.doi.org/10.1609/aaai.v35i10.17113.

Full text
Abstract:
A normalizing flow is an invertible mapping between an arbitrary probability distribution and a standard normal distribution; it can be used for density estimation and statistical inference. Computing the flow follows the change of variables formula and thus requires invertibility of the mapping and an efficient way to compute the determinant of its Jacobian. To satisfy these requirements, normalizing flows typically consist of carefully chosen components. Continuous normalizing flows (CNFs) are mappings obtained by solving a neural ordinary differential equation (ODE). The neural ODE's dynamics can be chosen almost arbitrarily while ensuring invertibility. Moreover, the log-determinant of the flow's Jacobian can be obtained by integrating the trace of the dynamics' Jacobian along the flow. Our proposed OT-Flow approach tackles two critical computational challenges that limit a more widespread use of CNFs. First, OT-Flow leverages optimal transport (OT) theory to regularize the CNF and enforce straight trajectories that are easier to integrate. Second, OT-Flow features exact trace computation with time complexity equal to trace estimators used in existing CNFs. On five high-dimensional density estimation and generative modeling tasks, OT-Flow performs competitively to state-of-the-art CNFs while on average requiring one-fourth of the number of weights with an 8x speedup in training time and 24x speedup in inference.
APA, Harvard, Vancouver, ISO, and other styles
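For background on the trace identity mentioned in entry 45 above, continuous normalizing flows typically rely on the instantaneous change-of-variables formula; the notation here is generic and not taken from the paper. For dynamics dz/dt = f(z(t), t) with z(0) = x,

\[
\frac{d}{dt}\log p\bigl(z(t)\bigr) = -\operatorname{tr}\!\left(\frac{\partial f}{\partial z}\bigl(z(t),t\bigr)\right),
\qquad
\log p_x(x) = \log p_z\bigl(z(T)\bigr) + \int_0^T \operatorname{tr}\!\left(\frac{\partial f}{\partial z}\bigl(z(t),t\bigr)\right) dt ,
\]

so the log-determinant of the flow's Jacobian is obtained by integrating the trace of the dynamics' Jacobian along the trajectory.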
46

He, Dong, Supun C. Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, and Matteo Interlandi. "Query processing on tensor computation runtimes." Proceedings of the VLDB Endowment 15, no. 11 (July 2022): 2811–25. http://dx.doi.org/10.14778/3551793.3551833.

Full text
Abstract:
The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how database management systems can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10X over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9X speedup over CPU baselines.
APA, Harvard, Vancouver, ISO, and other styles
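As a toy illustration of the idea in entry 46 above of expressing relational operators as tensor operations, the snippet below runs a made-up filter-and-aggregate query over NumPy arrays; the column names and query are assumptions, and this is not TQP's actual operator implementation.

# Columnar "table": each column is a tensor.
import numpy as np

qty   = np.array([5, 12, 40, 3, 18])
price = np.array([2.0, 3.5, 1.0, 7.0, 0.5])

# SELECT SUM(qty * price) FROM lineitem WHERE qty > 10
mask = qty > 10                        # selection as a boolean tensor
result = (qty * price * mask).sum()    # projection + aggregation as tensor ops
print(result)                          # 12*3.5 + 40*1.0 + 18*0.5 = 91.0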
47

Zeng, Yao Yuan, Zheng Hua Wang, and Wen Tao Zhao. "Application of Parallel Computation in Numerical Simulation of Laser Propulsion." Applied Mechanics and Materials 130-134 (October 2011): 3027–31. http://dx.doi.org/10.4028/www.scientific.net/amm.130-134.3027.

Full text
Abstract:
For the problem of three-dimensional, multi-subdomain numerical simulation of laser propulsion, a domain decomposition strategy based on a message passing mechanism is applied in this paper to realize parallelization. A cell-centered finite volume scheme is used to solve the Euler equations. A five-step explicit Runge-Kutta scheme is used for time advancement. The spatial discretization for the inviscid fluid is computed by a high-order Godunov-type scheme. We test several examples on a cluster system, and the results show that the smallest speedup is more than 5.19 when the degree of parallelism is 8. In short, parallel computation is an inevitable choice for accelerating the study of the mechanism of laser propulsion.
APA, Harvard, Vancouver, ISO, and other styles
48

Kyriacou, Costas, Paraskevas Evripidou, and Pedro Trancoso. "CacheFlow: Cache Optimizations for Data Driven Multithreading." Parallel Processing Letters 16, no. 02 (June 2006): 229–44. http://dx.doi.org/10.1142/s0129626406002599.

Full text
Abstract:
Data-Driven Multithreading is a non-blocking multithreading model of execution that provides effective latency tolerance by allowing the computation processor to do useful work while a long-latency event is in progress. With the Data-Driven Multithreading model, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Data-driven sequencing leads to irregular memory access patterns that could negatively affect cache performance. Nevertheless, it enables the implementation of short-term optimal cache management policies. This paper presents the implementation of CacheFlow, an optimized cache management policy which eliminates the side effects of the loss of locality caused by data-driven sequencing and further reduces cache misses. CacheFlow employs thread-based prefetching to preload data blocks of threads deemed executable. Simulation results, for nine scientific applications, on a 32-node Data-Driven Multithreaded machine show an average speedup improvement from 19.8 to 22.6. Two techniques to further improve the performance of CacheFlow, conflict avoidance and thread reordering, are proposed and tested. Simulation experiments have shown speedup improvements of 24% and 32%, respectively. The average speedup for all applications on a 32-node machine with both optimizations is 26.1.
APA, Harvard, Vancouver, ISO, and other styles
49

Ottapura, Sayooj, Rahul Mistry, Jatin Keni, and Chaitanya Jage. "Underwater Image Processing using Graphics Processing Unit (GPU)." ITM Web of Conferences 32 (2020): 03041. http://dx.doi.org/10.1051/itmconf/20203203041.

Full text
Abstract:
Image processing is a method used to enhance an image or to extract useful information from it. It is a type of signal processing in which the input is an image and the output may be an image or characteristics/features associated with that image. In this paper we focus on a specific type of image processing, namely underwater image processing. Underwater image processing has always faced the problem of imbalance in colour distribution, and this problem can be tackled by a simple colour balancing algorithm. We proceed with the assumption that the highest values of R, G and B observed in the image correspond to white and the lowest values correspond to darkness. Underwater images are largely saturated by blue colour because of its short wavelength, and in this paper we aim to enhance the image. We propose a colour balancing algorithm for normalizing the image. The entire process is first carried out on a CPU and then on a GPU, and we compare the speedup obtained. Speedup is an important parameter in the field of image processing, since a better speedup can reduce computation time significantly while maintaining high efficiency.
APA, Harvard, Vancouver, ISO, and other styles
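Consistent with the assumption stated in entry 49 above (per-channel maxima map to white, minima to black), a per-channel min-max stretch can be sketched as follows with NumPy on the CPU. This is a generic illustration, not the authors' exact CPU/GPU code, and the synthetic blue-tinted test frame is made up.

# Per-channel min-max colour balancing on the CPU with NumPy.
import numpy as np

def balance_colours(image):
    """image: H x W x 3 uint8 array; returns a colour-balanced uint8 array."""
    img = image.astype(np.float32)
    out = np.empty_like(img)
    for c in range(3):                       # stretch each channel independently
        lo, hi = img[..., c].min(), img[..., c].max()
        scale = 255.0 / (hi - lo) if hi > lo else 1.0
        out[..., c] = (img[..., c] - lo) * scale
    return np.clip(out, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    # small synthetic blue-tinted image as a stand-in for an underwater frame
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 120, size=(64, 64, 3), dtype=np.uint8)
    frame[..., 2] += 100                     # push the blue channel up
    print(balance_colours(frame).mean(axis=(0, 1)))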
50

Ding, Yi, Zuhua Xu, Jun Zhao, and Zhijiang Shao. "Fast Model Predictive Control Combining Offline Method and Online Optimization with K-D Tree." Mathematical Problems in Engineering 2015 (2015): 1–10. http://dx.doi.org/10.1155/2015/982041.

Full text
Abstract:
Computation time is the main factor that limits the application of model predictive control (MPC). This paper presents a fast model predictive control algorithm that combines an offline method and online optimization to solve the MPC problem. The offline method uses a k-d tree instead of a table to implement partial enumeration, which accelerates the online search operation. Only a part of the explicit solution is stored in the k-d tree for online searching, and the k-d tree is updated at runtime to accommodate changes in the operating point. Online optimization is invoked when searching the k-d tree fails. Numerical experiments show that the proposed algorithm is efficient on both small-scale and large-scale processes. The average speedup factor in the large-scale process is at least 6, the worst-case speedup factor is at least 2, and the performance is less than 0.05% suboptimal.
APA, Harvard, Vancouver, ISO, and other styles
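A simplified sketch of the offline/online combination described in entry 50 above: previously computed states and their control moves are stored in a k-d tree, the nearest stored entry is reused when it is close enough, and a placeholder online optimizer is called otherwise. The class, tolerance and fallback solver are hypothetical illustrations, not the paper's implementation.

# Nearest-neighbour lookup in a k-d tree with an online-optimization fallback.
import numpy as np
from scipy.spatial import cKDTree

class FastMPC:
    def __init__(self, stored_states, stored_controls, tol=0.05):
        self.tree = cKDTree(stored_states)        # offline partial enumeration
        self.controls = np.asarray(stored_controls)
        self.tol = tol

    def solve_online(self, x):
        # placeholder for the full online optimization step
        return -0.5 * np.asarray(x)

    def control(self, x):
        dist, idx = self.tree.query(x, k=1)       # fast nearest-neighbour search
        if dist <= self.tol:                      # hit: reuse the stored solution
            return self.controls[idx]
        return self.solve_online(x)               # miss: fall back to optimization

if __name__ == "__main__":
    states = np.random.rand(1000, 2)
    controls = -0.5 * states                      # toy explicit control law
    mpc = FastMPC(states, controls)
    print(mpc.control(np.array([0.3, 0.7])))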