Journal articles on the topic 'SpMV Multiplication'

Consult the top 46 journal articles for your research on the topic 'SpMV Multiplication.'

1

Giannoula, Christina, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. "Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures." ACM SIGMETRICS Performance Evaluation Review 50, no. 1 (June 20, 2022): 33–34. http://dx.doi.org/10.1145/3547353.3522661.

Abstract:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make two key contributions. First, we design efficient SpMV algorithms to accelerate the SpMV kernel in current and future PIM systems, while covering a wide variety of sparse matrices with diverse sparsity patterns. Second, we provide the first comprehensive analysis of SpMV on a real PIM architecture. Specifically, we conduct our rigorous experimental analysis of SpMV kernels in the UPMEM PIM system, the first publicly available real-world PIM architecture. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems. For more information about our thorough characterization of SpMV PIM execution, results, insights and the open-source SparseP software package [21], we refer the reader to the full version of the paper [3, 4]. The SparseP software package is publicly and freely available at https://github.com/CMU-SAFARI/SparseP.
2

He, Guixia, and Jiaquan Gao. "A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs." Mathematical Problems in Engineering 2016 (2016): 1–12. http://dx.doi.org/10.1155/2016/8471283.

Abstract:
Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMVs on graphics processing units (GPUs), for example, CSR-scalar and CSR-vector, usually have poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU that is called PCSR. PCSR involves two kernels and accesses CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR on a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not the communication between GPUs is taken into account, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
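
For readers who want to see the baseline that CSR-scalar, CSR-vector, and PCSR all start from, the following is a minimal serial C sketch of CSR-based SpMV (not the paper's GPU kernels); the array names row_ptr, col_idx, and vals are generic CSR conventions, not identifiers taken from the paper.

```c
#include <stdio.h>

/* y = A*x for an m-row sparse matrix A stored in CSR:
 * row_ptr[i]..row_ptr[i+1] delimit the nonzeros of row i,
 * col_idx[k] is the column of the k-th stored value vals[k]. */
static void spmv_csr(int m, const int *row_ptr, const int *col_idx,
                     const double *vals, const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [[4 0 1], [0 2 0], [3 0 5]] */
    int row_ptr[] = {0, 2, 3, 5};
    int col_idx[] = {0, 2, 1, 0, 2};
    double vals[] = {4, 1, 2, 3, 5};
    double x[] = {1, 1, 1}, y[3];

    spmv_csr(3, row_ptr, col_idx, vals, x, y);
    printf("%.1f %.1f %.1f\n", y[0], y[1], y[2]);  /* 5.0 2.0 8.0 */
    return 0;
}
```
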
3

Gao, Jiaquan, Yuanshen Zhou, and Kesong Wu. "A Novel Multi-GPU Parallel Optimization Model for The Sparse Matrix-Vector Multiplication." Parallel Processing Letters 26, no. 04 (December 2016): 1640001. http://dx.doi.org/10.1142/s0129626416400016.

Abstract:
Accelerating the sparse matrix-vector multiplication (SpMV) on graphics processing units (GPUs) has attracted considerable attention recently. We observe that on a specific multiple-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization model. Our model involves two stages. In the first stage, a simple rule is defined to divide any given matrix among multiple GPUs, and a performance model, which is independent of the problems and dependent on the resources of the devices, is proposed to accurately predict the execution time of SpMV kernels. Using these models, in the second stage we construct an optimal multi-GPU parallel SpMV algorithm that is automatically and rapidly generated for the platform and any given problem. Given that our model for SpMV is general, independent of the problems, and dependent on the resources of the devices, it is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.
4

AlAhmadi, Sarah, Thaha Mohammed, Aiiad Albeshri, Iyad Katib, and Rashid Mehmood. "Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)." Electronics 9, no. 10 (October 13, 2020): 1675. http://dx.doi.org/10.3390/electronics9101675.

Abstract:
Graphics processing units (GPUs) have delivered remarkable performance for a variety of high performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector multiplication (SpMV), which is central to many scientific, engineering, and other applications including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices due to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed performance analysis of SpMV on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, nprvariance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Subsequently, based on the deeper insights gained through the detailed performance analysis, we propose a technique called the heterogeneous CPU–GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and provides better performance over the HYB format by an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work where the SpMV performance on GPUs has been discussed in such depth. We believe that this work on SpMV performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field in the future.
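
As a reminder of why the ELLPACK (ELL) scheme mentioned above is sensitive to the nonzeros-per-row variance that the paper measures, here is a small illustrative C sketch of ELL storage and the corresponding SpMV loop; the layout and names follow a common textbook form and are assumptions for illustration, not taken from the paper.

```c
#include <stdio.h>

#define M 3          /* rows */
#define K 2          /* max nonzeros per row (padding width) */

/* ELL stores, for every row, exactly K (value, column) slots,
 * padding short rows with zero values; rows with few nonzeros
 * waste space, which is why high npr variance hurts ELL. */
static void spmv_ell(int m, int k, int col[][K], double val[][K],
                     const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int j = 0; j < k; ++j)
            sum += val[i][j] * x[col[i][j]];  /* padded slots add 0 */
        y[i] = sum;
    }
}

int main(void)
{
    /* Same 3x3 matrix [[4 0 1],[0 2 0],[3 0 5]]; row 1 is padded. */
    int col[M][K]    = {{0, 2}, {1, 0}, {0, 2}};
    double val[M][K] = {{4, 1}, {2, 0}, {3, 5}};
    double x[M] = {1, 1, 1}, y[M];

    spmv_ell(M, K, col, val, x, y);
    printf("%.1f %.1f %.1f\n", y[0], y[1], y[2]);  /* 5.0 2.0 8.0 */
    return 0;
}
```
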
5

Liu, Sheng, Yasong Cao, and Shuwei Sun. "Mapping and Optimization Method of SpMV on Multi-DSP Accelerator." Electronics 11, no. 22 (November 11, 2022): 3699. http://dx.doi.org/10.3390/electronics11223699.

Abstract:
Sparse matrix-vector multiplication (SpMV) computes the product of a sparse matrix and a dense vector, and the sparsity of the matrix often exceeds 90%. Usually, the sparse matrix is compressed to save storage resources, but this causes irregular accesses to the dense vector, which takes considerable time and degrades the SpMV performance of the system. In this study, we design a dedicated channel in the DMA to implement an indirect memory access process to speed up the SpMV operation. On this basis, we propose six SpMV algorithm schemes and map them to optimize the performance of SpMV. The results show that the M processor's SpMV performance reached 6.88 GFLOPS. In addition, the average performance of the HPCG benchmark is 2.8 GFLOPS.
6

Anzt, Hartwig, Stanimire Tomov, and Jack Dongarra. "On the performance and energy efficiency of sparse linear algebra on GPUs." International Journal of High Performance Computing Applications 31, no. 5 (October 5, 2016): 375–90. http://dx.doi.org/10.1177/1094342016672081.

Abstract:
In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.
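
The block-wise SpMM idea described above, multiplying a sparse matrix by several vectors at once so that each loaded matrix entry is reused across all vectors, can be sketched in plain C as below; this is a serial illustration under assumed CSR naming, not the MAGMA kernel.

```c
#include <stdio.h>

#define NV 2  /* number of right-hand-side vectors processed together */

/* Y = A * X where A is CSR (m rows) and X, Y hold NV column vectors
 * stored row-major: each nonzero vals[k] is loaded once and reused
 * for all NV vectors, unlike NV separate SpMV calls. */
static void spmm_csr(int m, const int *row_ptr, const int *col_idx,
                     const double *vals, const double *X, double *Y)
{
    for (int i = 0; i < m; ++i) {
        double acc[NV] = {0};
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            for (int v = 0; v < NV; ++v)
                acc[v] += vals[k] * X[col_idx[k] * NV + v];
        for (int v = 0; v < NV; ++v)
            Y[i * NV + v] = acc[v];
    }
}

int main(void)
{
    int row_ptr[] = {0, 2, 3, 5};
    int col_idx[] = {0, 2, 1, 0, 2};
    double vals[] = {4, 1, 2, 3, 5};
    double X[3 * NV] = {1, 2,  1, 2,  1, 2};  /* two vectors: all ones, all twos */
    double Y[3 * NV];

    spmm_csr(3, row_ptr, col_idx, vals, X, Y);
    for (int i = 0; i < 3; ++i)
        printf("%.1f %.1f\n", Y[i * NV], Y[i * NV + 1]);  /* (5,10) (2,4) (8,16) */
    return 0;
}
```
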
7

Liu, Jie. "Accuracy Controllable SpMV Optimization on GPU." Journal of Physics: Conference Series 2363, no. 1 (November 1, 2022): 012008. http://dx.doi.org/10.1088/1742-6596/2363/1/012008.

Abstract:
Sparse matrix vector multiplication (SpMV) is a key kernel widely used in a variety of fields, and mixed-precision calculation brings opportunities to SpMV optimization. Researchers have proposed to store nonzero elements in the interval (-1, 1) in single precision and calculate SpMV in mixed precision. Though it leads to high performance, it also brings loss of accuracy. This paper proposes an accuracy controllable optimization method for SpMV. By limiting the error caused by converting double-precision floating-point numbers in the interval (-1, 1) into single-precision format, the calculation accuracy of mixed-precision SpMV is effectively improved. We tested sparse matrices from the SuiteSparse Matrix Collection on Tesla V100. Compared with the existing mixed-precision MpSpMV kernel, the mixed-precision SpMV proposed in this paper achieves an accuracy improvement.
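
A rough C sketch of the value-based split that this line of work relies on is given below: nonzeros with magnitude below 1 are kept in single precision, the remaining entries stay in double precision, and both partial products are accumulated in double. The array names and the shared row structure are simplifying assumptions for illustration, not the MpSpMV GPU implementation.

```c
#include <stdio.h>

/* Mixed-precision SpMV on a CSR matrix that was split by value:
 * entries with |a| < 1 were stored as float, the rest as double.
 * Both partitions keep their own CSR row pointers and column indices. */
static void spmv_mixed(int m,
                       const int *rp_s, const int *ci_s, const float *vs,   /* |a| < 1  */
                       const int *rp_d, const int *ci_d, const double *vd,  /* |a| >= 1 */
                       const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int k = rp_s[i]; k < rp_s[i + 1]; ++k)
            sum += (double)vs[k] * x[ci_s[k]];      /* single-precision values */
        for (int k = rp_d[i]; k < rp_d[i + 1]; ++k)
            sum += vd[k] * x[ci_d[k]];              /* double-precision values */
        y[i] = sum;
    }
}

int main(void)
{
    /* [[0.5 0 2.0],[0 0.25 0],[3.0 0 0.75]] split by |a| < 1 */
    int   rp_s[] = {0, 1, 2, 3};  int ci_s[] = {0, 1, 2};
    float vs[]   = {0.5f, 0.25f, 0.75f};
    int    rp_d[] = {0, 1, 1, 2}; int ci_d[] = {2, 0};
    double vd[]   = {2.0, 3.0};
    double x[] = {1, 1, 1}, y[3];

    spmv_mixed(3, rp_s, ci_s, vs, rp_d, ci_d, vd, x, y);
    printf("%.2f %.2f %.2f\n", y[0], y[1], y[2]);  /* 2.50 0.25 3.75 */
    return 0;
}
```
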
8

Zeng, Guangsen, and Yi Zou. "Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs." Electronics 12, no. 17 (August 31, 2023): 3687. http://dx.doi.org/10.3390/electronics12173687.

Abstract:
Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms is widely studied, where the memory access behavior of the GPU is often the performance bottleneck. NVIDIA's recent Ampere GPU architecture provides a new asynchronous memory copy instruction, memcpy_async, for more efficient data movement into shared memory. Leveraging the capability of this new memcpy_async instruction, we first propose CSR-Partial-Overlap to carefully overlap computation with the data copy from global memory to shared memory, allowing us to take full advantage of the data transfer time. In addition, we design the dynamic batch partition and the dynamic threads distribution to achieve effective load balancing, avoid the overhead of fixing up partial sums, and improve thread utilization. Furthermore, we propose CSR-Full-Overlap, based on CSR-Partial-Overlap, which also takes into account the overlap of data transfer from host to device with SpMV kernel execution. CSR-Full-Overlap unifies the two major overlaps in SpMV and hides the computation as much as possible behind the two important access behaviors of the GPU. This allows CSR-Full-Overlap to achieve the best performance gains from both overlaps. As far as we know, this paper is the first in-depth study of how memcpy_async can be applied to help accelerate SpMV computation on GPU platforms. We compare CSR-Full-Overlap to the current state-of-the-art cuSPARSE, where our experimental results show an average 2.03x performance gain and up to a 2.67x performance gain.
9

Gao, Jiaquan, Panpan Qi, and Guixia He. "Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU." Mathematical Problems in Engineering 2016 (2016): 1–14. http://dx.doi.org/10.1155/2016/4596943.

Abstract:
Sparse matrix-vector multiplication (SpMV) is an important operation in computational science and needs to be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns different numbers of rows to each thread block and executes different optimization implementations on the basis of the number of rows involved in each block. The process of accessing the CSR arrays is fully coalesced, and the GPU's DRAM bandwidth is efficiently utilized by loading data into shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms the perfect-CSR algorithm that inspired our work, the vendor-tuned CUSPARSE V6.5 and CUSP V0.5.1, and three popular algorithms: clSpMV, CSR5, and CSR-Adaptive.
10

Chen, Shizhao, Jianbin Fang, Chuanfu Xu, and Zheng Wang. "Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs." Applied Sciences 12, no. 19 (September 29, 2022): 9812. http://dx.doi.org/10.3390/app12199812.

Abstract:
Optimizing sparse matrix–vector multiplication (SpMV) is challenging due to the non-uniform distribution of the non-zero elements of the sparse matrix. The best-performing SpMV format changes depending on the input matrix and the underlying architecture, and there is no “one-size-fits-all” format. A hybrid scheme combining multiple SpMV storage formats allows one to choose an appropriate format for the target matrix and hardware. However, existing hybrid approaches are inadequate for utilizing the SIMD cores of modern multi-core CPUs, and it remains unclear how to best mix different SpMV formats for a given matrix. This paper presents a new hybrid storage format for sparse matrices, specifically targeting multi-core CPUs with SIMD units. Our approach partitions the target sparse matrix into two segments based on the regularity of the memory access pattern, where each segment is stored in a format suitable for its memory access patterns. Unlike prior hybrid storage schemes that rely on the user to determine the data partition among storage formats, we employ machine learning to build a predictive model that automatically determines the partition threshold on a per-matrix basis. Our predictive model is first trained offline, and the trained model can be applied to any new, unseen sparse matrix. We apply our approach to 956 matrices and evaluate its performance on three distinct multi-core CPU platforms: a 72-core Intel Knights Landing (KNL) CPU, a 128-core AMD EPYC CPU, and a 64-core Phytium ARMv8 CPU. Experimental results show that our hybrid scheme, combined with the predictive model, outperforms the best-performing alternative by 2.9%, 17.5% and 16% on average on KNL, AMD, and Phytium, respectively.
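
The row-partitioning step underlying such hybrid schemes can be illustrated with a short C sketch that assigns each row to one of two segments by a nonzeros-per-row threshold; in the paper the threshold comes from a trained predictive model, whereas here it is a hard-coded assumption.

```c
#include <stdio.h>

/* Partition the rows of a CSR matrix into two segments by a
 * nonzeros-per-row threshold: short rows go to segment 0 (e.g. a
 * SIMD-friendly regular format), long rows to segment 1 (e.g. CSR).
 * The threshold value here is purely illustrative. */
static void split_rows(int m, const int *row_ptr, int threshold,
                       int *seg_of_row)
{
    for (int i = 0; i < m; ++i) {
        int nnz_row = row_ptr[i + 1] - row_ptr[i];
        seg_of_row[i] = (nnz_row <= threshold) ? 0 : 1;
    }
}

int main(void)
{
    int row_ptr[] = {0, 1, 5, 6, 8};   /* rows with 1, 4, 1, 2 nonzeros */
    int seg[4];

    split_rows(4, row_ptr, 2, seg);
    for (int i = 0; i < 4; ++i)
        printf("row %d -> segment %d\n", i, seg[i]);  /* 0, 1, 0, 0 */
    return 0;
}
```
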
11

Wang, Yang, Jie Liu, Xiaoxiong Zhu, Qingyang Zhang, Shengguo Li, and Qinglin Wang. "Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP." Applied Sciences 13, no. 15 (August 3, 2023): 8952. http://dx.doi.org/10.3390/app13158952.

Abstract:
Structured grid-based sparse matrix-vector multiplication and Gauss–Seidel iterations are very important kernel functions in scientific and engineering computations, both of which are memory intensive and bandwidth-limited. GPDSP is a general purpose digital signal processor, which is a very significant embedded processor that has been introduced into high-performance computing. In this paper, we designed various optimization methods, which included a blocking method to improve data locality and increase memory access efficiency, a multicolor reordering method to develop Gauss–Seidel fine-grained parallelism, a data partitioning method designed for GPDSP memory structures, and a double buffering method to overlap computation and access memory on structured grid-based SpMV and Gauss–Seidel iterations for GPDSP. At last, we combined the above optimization methods to design a multicore vectorization algorithm. We tested the matrices generated with structured grids of different sizes on the GPDSP platform and obtained speedups of up to 41× and 47× compared to the unoptimized SpMV and Gauss–Seidel iterations, with maximum bandwidth efficiencies of 72% and 81%, respectively. The experiment results show that our algorithms could fully utilize the external memory bandwidth. We also implemented the commonly used mixed precision algorithm on the GPDSP and obtained speedups of 1.60× and 1.45× for the SpMV and Gauss–Seidel iterations, respectively.
12

Mahmoud, Mohammed, Mark Hoffmann, and Hassan Reza. "Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU." Computation 6, no. 3 (August 24, 2018): 45. http://dx.doi.org/10.3390/computation6030045.

Abstract:
Sparse matrix-vector multiplication (SpMV) can be used to solve diverse-scaled linear systems and eigenvalue problems that exist in numerous, and varying scientific applications. One of the scientific applications that SpMV is involved in is known as Configuration Interaction (CI). CI is a linear method for solving the nonrelativistic Schrödinger equation for quantum chemical multi-electron systems, and it can deal with the ground state as well as multiple excited states. In this paper, we have developed a hybrid approach in order to deal with CI sparse matrices. The proposed model includes a newly-developed hybrid format for storing CI sparse matrices on the Graphics Processing Unit (GPU). In addition to the new developed format, the proposed model includes the SpMV kernel for multiplying the CI matrix (proposed format) by a vector using the C language and the Compute Unified Device Architecture (CUDA) platform. The proposed SpMV kernel is a vector kernel that uses the warp approach. We have gauged the newly developed model in terms of two primary factors, memory usage and performance. Our proposed kernel was compared to the cuSPARSE library and the CSR5 (Compressed Sparse Row 5) format and already outperformed both.
13

Mohammed, Saira Banu Jamal, M. Rajasekhara Babu, and Sumithra Sriram. "GPU Implementation of Image Convolution Using Sparse Model with Efficient Storage Format." International Journal of Grid and High Performance Computing 10, no. 1 (January 2018): 54–70. http://dx.doi.org/10.4018/ijghpc.2018010104.

Abstract:
With the growth of data-parallel computing, the role of GPU computing in non-graphics applications such as image processing has become a focus of research. Convolution is an integral operation in filtering, smoothing and edge detection. In this article, the process of convolution is realized as a sparse linear system and is solved using sparse matrix-vector multiplication (SpMV). The Compressed Sparse Row (CSR) format of SpMV shows better CPU performance compared to normal convolution. To overcome the stalling of threads on short rows in the GPU implementation of CSR SpMV, a more efficient model is proposed, which uses the Adaptive-Compressed Row Storage (A-CSR) format. Using CSR in the convolution process achieves a 1.45x and a 1.159x increase in speed compared to normal convolution for the image smoothing and edge detection operations, respectively. An average speedup of 2.05x is achieved for the image smoothing technique and 1.58x for the edge detection technique on the GPU platform using the adaptive CSR format.
14

Giannoula, Christina, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. "SparseP." Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, no. 1 (February 24, 2022): 1–49. http://dx.doi.org/10.1145/3508041.

Abstract:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due to its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. The SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at https://github.com/CMU-SAFARI/SparseP. Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.
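
The 1D-partitioned execution described above can be pictured with a small C sketch in which contiguous row blocks are assigned to cores and each core computes its slice of the output vector; the sequential loop below merely stands in for the PIM cores and is not SparseP code.

```c
#include <stdio.h>

#define P 2   /* number of PIM cores (the loop below stands in for them) */

/* 1D partitioning in the sense described above: the matrix is split
 * into contiguous blocks of rows, one per core, and each core computes
 * its own slice of y from the whole input vector x. */
static void spmv_1d(int m, const int *row_ptr, const int *col_idx,
                    const double *vals, const double *x, double *y)
{
    for (int p = 0; p < P; ++p) {
        int r0 = (m * p) / P, r1 = (m * (p + 1)) / P;  /* this core's rows */
        for (int i = r0; i < r1; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += vals[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }
}

int main(void)
{
    int row_ptr[] = {0, 2, 3, 5, 6};
    int col_idx[] = {0, 2, 1, 0, 3, 2};
    double vals[] = {4, 1, 2, 3, 5, 6};
    double x[] = {1, 1, 1, 1}, y[4];

    spmv_1d(4, row_ptr, col_idx, vals, x, y);
    printf("%.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]);  /* 5 2 8 6 */
    return 0;
}
```
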
15

Merrill, Duane, and Michael Garland. "Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format." ACM SIGPLAN Notices 51, no. 8 (November 9, 2016): 1–2. http://dx.doi.org/10.1145/3016078.2851190.

16

Muhammed, Thaha, Rashid Mehmood, Aiiad Albeshri, and Iyad Katib. "SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs." Applied Sciences 9, no. 5 (March 6, 2019): 947. http://dx.doi.org/10.3390/app9050947.

Abstract:
Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (translates to speed in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments, and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism available in Nvidia GPUs is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high npr variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
17

Fagerlund, Olav Aanes, Takeshi Kitayama, Gaku Hashimoto, and Hiroshi Okuda. "Effect of GPU Communication-Hiding for SPMV Using OpenACC." International Journal of Computational Methods 13, no. 02 (March 2016): 1640011. http://dx.doi.org/10.1142/s0219876216400119.

Abstract:
In this study, we discuss overlapping possibilities of Sparse Matrix-Vector multiplication (SpMV) in cases where we have multiple RHS-vectors and where the whole sparse matrix data may or may not fit into the memory of the discrete GPU, at once, by using OpenACC. With GPUs, one can take advantage of their relatively high memory bandwidths. However, data needs to be transferred over the relatively slow PCIe bus. We implement communication-hiding to increase performance. In the case of three degrees of freedom and modeling 2,097,152 nodes, we observe a just above 40% performance increase by applying communication-hiding in our routine. This underlines the importance of applying such techniques in simulations, when it is suitable with the algorithmic structure of the problem in relation to the underlying computer architecture.
18

Guo, Ping, and Liqiang Wang. "Accurate cross-architecture performance modeling for sparse matrix-vector multiplication (SpMV) on GPUs." Concurrency and Computation: Practice and Experience 27, no. 13 (February 12, 2014): 3281–94. http://dx.doi.org/10.1002/cpe.3217.

19

Yu, Yu Fei, Bin Yan, Biao Wang, Lei Li, Yu Han, and Xiao Qi Xi. "GPU Accelerated Reconstruction in Compton Scattering Tomography Using Matrix Compression." Applied Mechanics and Materials 519-520 (February 2014): 102–7. http://dx.doi.org/10.4028/www.scientific.net/amm.519-520.102.

Abstract:
An acceleration strategy for the TV-ADM reconstruction algorithm in Compton scattering tomography (CST) is proposed. By analyzing the sparse characteristics of the CST projection matrices, the sparse CSR and ELL formats are first used to store them, which greatly reduces memory consumption. Then, a sparse matrix-vector multiplication (SpMV) method is utilized to accelerate the projection and back-projection processes. Finally, based on its parallel features, the TV-ADM is computed on a graphics processing unit (GPU). Numerical experiments show that the TV-ADM with the presented acceleration strategy achieves a 96-times speedup and a 224-times memory compression ratio without precision loss.
20

Tanaka, Teruo, Ryo Otsuka, Akihiro Fujii, Takahiro Katagiri, and Toshiyuki Imamura. "Implementation of D-Spline-Based Incremental Performance Parameter Estimation Method with ppOpen-AT." Scientific Programming 22, no. 4 (2014): 299–307. http://dx.doi.org/10.1155/2014/310879.

Abstract:
In automatic performance tuning (AT), a primary aim is to optimize performance parameters that are suitable for certain computational environments in ordinary mathematical libraries. For AT, an important issue is to reduce the estimation time required for optimizing performance parameters. To reduce the estimation time, we previously proposed the Incremental Performance Parameter Estimation method (IPPE method). This method estimates optimal performance parameters by inserting suitable sampling points that are based on computational results for a fitting function. As the fitting function, we introduced d-Spline, which is highly adaptable and requires little estimation time. In this paper, we report the implementation of the IPPE method with ppOpen-AT, which is a scripting language (set of directives) with features that reduce the workload of the developers of mathematical libraries that have AT features. To confirm the effectiveness of the IPPE method for the runtime phase AT, we applied the method to sparse matrix–vector multiplication (SpMV), in which the block size of the sparse matrix structure blocked compressed row storage (BCRS) was used for the performance parameter. The results from the experiment show that the cost was negligibly small for AT using the IPPE method in the runtime phase. Moreover, using the obtained optimal value, the execution time for the mathematical library SpMV was reduced by 44% on comparing the compressed row storage and BCRS (block size 8).
21

Benatia, Akrem, Weixing Ji, Yizhuo Wang, and Feng Shi. "Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms." International Journal of High Performance Computing Applications 34, no. 1 (November 14, 2019): 66–80. http://dx.doi.org/10.1177/1094342019886628.

Abstract:
The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most of the existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging, CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned across the different available processing units. The partitioning problem is made more challenging by the existence of many sparse formats whose performance depends both on the sparsity of the input matrix and on the hardware used. Thus, the best performance depends not only on how to partition the input sparse matrix but also on which sparse format to use for each partition. To address this challenge, we propose in this article a new CPU-GPU heterogeneous method for computing the SpMV kernel that combines different sparse formats to achieve better performance and better utilization of CPU-GPU heterogeneous platforms. The proposed solution horizontally partitions the input matrix into multiple block-rows and predicts their best sparse formats using machine learning-based performance models. A mapping algorithm is then used to assign the block-rows to the CPU and GPU(s) available in the system. Our experimental results using real-world large unstructured sparse matrices on two different machines show a noticeable performance improvement.
22

Favaro, Federico, Ernesto Dufrechou, Pablo Ezzatti, and Juan Pablo Oliver. "Energy-efficient algebra kernels in FPGA for High Performance Computing." Journal of Computer Science and Technology 21, no. 2 (October 21, 2021): e09. http://dx.doi.org/10.24215/16666038.21.e09.

Abstract:
The dissemination of multi-core architectures and the later irruption of massively parallel devices led to a revolution in High-Performance Computing (HPC) platforms over the last decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDL) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. In recent years, manufacturers have started to make big efforts to provide High-Level Synthesis (HLS) tools in order to allow a greater adoption of FPGAs in the HPC community. Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication (GEMM) and the sparse matrix-vector multiplication (SpMV). Specifically, we compare the behavior of fine-tuned kernels on a multi-core CPU processor and HLS implementations on FPGAs. We perform the experimental evaluation of our implementations on a low-end and a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library on the CPU.
23

Mehta, Mayuri A., and Devesh C. Jinwala. "A Hybrid Dynamic Load Balancing Algorithm for Distributed Systems Using Genetic Algorithms." International Journal of Distributed Systems and Technologies 5, no. 3 (July 2014): 1–23. http://dx.doi.org/10.4018/ijdst.2014070101.

Abstract:
Dynamic Load Balancing (DLB) is sine qua non in modern distributed systems to ensure the efficient utilization of computing resources therein. This paper proposes a novel framework for hybrid dynamic load balancing. Its framework uses a Genetic Algorithms (GA) based supernode selection approach within. The GA-based approach is useful in choosing optimally loaded nodes as the supernodes directly from data set, thereby essentially improving the speed of load balancing process. Applying the proposed GA-based approach, this work analyzes the performance of hybrid DLB algorithm under different system states such as lightly loaded, moderately loaded, and highly loaded. The performance is measured with respect to three parameters: average response time, average round trip time, and average completion time of the users. Further, it also evaluates the performance of hybrid algorithm utilizing OnLine Transaction Processing (OLTP) benchmark and Sparse Matrix Vector Multiplication (SPMV) benchmark applications to analyze its adaptability to I/O-intensive, memory-intensive, or/and CPU-intensive applications. The experimental results show that the hybrid algorithm significantly improves the performance under different system states and under a wide range of workloads compared to traditional decentralized algorithm.
24

Wilkinson, Lucas, Kazem Cheshmi, and Maryam Mehri Dehnavi. "Register Tiling for Unstructured Sparsity in Neural Network Inference." Proceedings of the ACM on Programming Languages 7, PLDI (June 6, 2023): 1995–2020. http://dx.doi.org/10.1145/3591302.

Abstract:
Unstructured sparse neural networks are an important class of machine learning (ML) models, as they compact model size and reduce floating point operations. The execution time of these models is frequently dominated by the sparse matrix multiplication (SpMM) kernel, C = A × B , where A is a sparse matrix, and B and C are dense matrices. The unstructured sparsity pattern of matrices in pruned machine learning models along with their sparsity ratio has rendered useless the large class of libraries and systems that optimize sparse matrix multiplications. Reusing registers is particularly difficult because accesses to memory locations should be known statically. This paper proposes Sparse Register Tiling, a new technique composed of an unroll-and-sparse-jam transformation followed by data compression that is specifically tailored to sparsity patterns in ML matrices. Unroll-and-sparse-jam uses sparsity information to jam the code while improving register reuse. Sparse register tiling is evaluated across 2396 weight matrices from transformer and convolutional models with a sparsity range of 60-95% and provides an average speedup of 1.72× and 2.65× over MKL SpMM and dense matrix multiplication, respectively, on a multicore CPU processor. It also provides an end-to-end speedup of 2.12× for MobileNetV1 with 70% sparsity on an ARM processor commonly used in edge devices.
25

Ernst, Thomas. "On the q-Lie group of q-Appell polynomial matrices and related factorizations." Special Matrices 6, no. 1 (February 1, 2018): 93–109. http://dx.doi.org/10.1515/spma-2018-0009.

Abstract:
Abstract In the spirit of our earlier paper [10] and Zhang and Wang [16], we introduce the matrix of multiplicative q-Appell polynomials of order M ∈ ℤ. This is the representation of the respective q-Appell polynomials in ke-ke basis. Based on the fact that the q-Appell polynomials form a commutative ring [11], we prove that this set constitutes a q-Lie group with two dual q-multiplications in the sense of [9]. A comparison with earlier results on q-Pascal matrices gives factorizations according to [7], which are specialized to q-Bernoulli and q-Euler polynomials. We also show that the corresponding q-Bernoulli and q-Euler matrices form q-Lie subgroups. In the limit q → 1 we obtain corresponding formulas for Appell polynomial matrices. We conclude by presenting the commutative ring of generalized q-Pascal functional matrices, which operates on all functions f ∈ C^∞_q.
26

Guzu, D., T. Hoffmann-Ostenhof, and A. Laptev. "On a class of sharp multiplicative Hardy inequalities." St. Petersburg Mathematical Journal 32, no. 3 (May 11, 2021): 523–30. http://dx.doi.org/10.1090/spmj/1659.

Abstract:
A class of weighted Hardy inequalities is treated. The sharp constants depend on the lowest eigenvalues of auxiliary Schrödinger operators on a sphere. In particular, for some block radial weights these sharp constants are given in terms of the lowest eigenvalue of a Legendre type equation.
27

Bakhadly, Bakhad, Alexander Guterman, and María Jesús de la Puente. "Orthogonality for (0, −1) tropical normal matrices." Special Matrices 8, no. 1 (February 17, 2020): 40–60. http://dx.doi.org/10.1515/spma-2020-0006.

Abstract:
Abstract We study pairs of mutually orthogonal normal matrices with respect to tropical multiplication. Minimal orthogonal pairs are characterized. The diameter and girth of three graphs arising from the orthogonality equivalence relation are computed.
28

Nikolski, N., and A. Pushnitski. "Szegő-type limit theorems for “multiplicative Toeplitz” operators and non-Følner approximations." St. Petersburg Mathematical Journal 32, no. 6 (October 20, 2021): 1033–50. http://dx.doi.org/10.1090/spmj/1683.

29

Ernst, Thomas. "On the q-exponential of matrix q-Lie algebras." Special Matrices 5, no. 1 (January 26, 2017): 36–50. http://dx.doi.org/10.1515/spma-2017-0003.

Abstract:
Abstract In this paper, we define several new concepts in the borderline between linear algebra, Lie groups and q-calculus. We first introduce the ring epimorphism r, the set of all inversions of the basis q, and then the important q-determinant and corresponding q-scalar products from an earlier paper. Then we discuss matrix q-Lie algebras with a modified q-addition, and compute the matrix q-exponential to form the corresponding n × n matrix, a so-called q-Lie group, or manifold, usually with q-determinant 1. The corresponding matrix multiplication is twisted under τ, which makes it possible to draw diagrams similar to Lie group theory for the q-exponential, or the so-called q-morphism. There is no definition of letter multiplication in a general alphabet, but in this article we introduce new q-number systems, the biring of q-integers, and the extended q-rational numbers. Furthermore, we provide examples of matrices in su_q(4) and its corresponding q-Lie group. We conclude with an example of a system of equations with Ward number coefficients.
30

Alkenani, Ahmad N., Mohammad Ashraf, and Aisha Jabeen. "Nonlinear generalized Jordan (σ, Γ)-derivations on triangular algebras." Special Matrices 6, no. 1 (December 20, 2017): 216–28. http://dx.doi.org/10.1515/spma-2017-0008.

Abstract:
Abstract Let R be a commutative ring with identity element, let A and B be unital algebras over R, and let M be an (A,B)-bimodule which is faithful as a left A-module and also faithful as a right B-module. Suppose that A = Tri(A,M,B) is a triangular algebra which is 2-torsion free and that σ, Γ are automorphisms of A. A map δ: A → A (not necessarily linear) is called a multiplicative generalized (σ, Γ)-derivation (resp. multiplicative generalized Jordan (σ, Γ)-derivation) on A associated with a (σ, Γ)-derivation (resp. Jordan (σ, Γ)-derivation) d on A if δ(xy) = δ(x)Γ(y) + σ(x)d(y) (resp. δ(x²) = δ(x)Γ(x) + σ(x)d(x)) holds for all x, y ∈ A. In the present paper it is shown that if δ: A → A is a multiplicative generalized Jordan (σ, Γ)-derivation on A, then δ is an additive generalized (σ, Γ)-derivation on A.
31

Farooq, Aamir, Mahvish Samar, Rewayat Khan, Hanyu Li, and Muhammad Kamran. "Perturbation analysis for the Takagi vector matrix." Special Matrices 10, no. 1 (July 3, 2021): 23–33. http://dx.doi.org/10.1515/spma-2020-0144.

Abstract:
Abstract In this article, we present some perturbation bounds for the Takagi vector matrix when the original matrix undergoes the additive or multiplicative perturbation. Two numerical examples are given to illuminate these bounds.
32

Watanabe, Aki, Takayuki Kawaguchi, Mai Sakimoto, Yuya Oikawa, Keiichiro Furuya, and Taichi Matsuoka. "Occupational Dysfunction as a Mediator between Recovery Process and Difficulties in Daily Life in Severe and Persistent Mental Illness: A Bayesian Structural Equation Modeling Approach." Occupational Therapy International 2022 (June 17, 2022): 1–11. http://dx.doi.org/10.1155/2022/2661585.

Abstract:
Background. This study is aimed at verifying a hypothetical model of the structural relationship between the recovery process and difficulties in daily life mediated by occupational dysfunction in severe and persistent mental illness (SPMI). Methods. Community-dwelling participants with SPMI were enrolled in this multicenter cross-sectional study. The Recovery Assessment Scale (RAS), the World Health Organization Disability Assessment Schedule second edition (WHODAS 2.0), and the Classification and Assessment of Occupational Dysfunction (CAOD) were used for assessment. Confirmatory factor analysis, multiple regression analysis, and Bayesian structural equation modelling (BSEM) were determined to analyze the hypothesized model. If the mediation model was significant, the path coefficient from difficulty in daily life to recovery and the multiplication of the path coefficients mediated by occupational dysfunction were considered as each the direct effect and the indirect effect. The goodness of fit in the model was determined by the posterior predictive P value (PPP). Each path coefficient was validated with median and 95% confidence interval (CI). Results. The participants comprised 98 individuals with SPMI. The factor structures of RAS, WHODAS 2.0, and CAOD were confirmed by confirmatory factor analysis to be similar to those of their original studies. Multiple regression analysis showed that the independent variables of RAS were WHODAS 2.0 and CAOD, and that of CAOD was WHODAS 2.0. The goodness of fit of the model in the BSEM was satisfactory with a PPP = 0.27 . The standardized path coefficients were, respectively, significant at -0.372 from “difficulty in daily life” to “recovery” as the direct effect and at -0.322 (95% CI: -0.477, -0.171) mediated by “occupational dysfunction” as the indirect effect. Conclusions. An approach for reducing not only difficulty in daily life but also occupational dysfunction may be an additional strategy of person-centered, recovery-oriented practice in SPMI.
33

Van Tran, Nam, and Imme van den Berg. "An algebraic model for the propagation of errors in matrix calculus." Special Matrices 8, no. 1 (March 5, 2020): 68–97. http://dx.doi.org/10.1515/spma-2020-0008.

Abstract:
Abstract We assume that every element of a matrix has a small, individual error, and model it by an external number, which is the sum of a nonstandard real number and a neutrix, the latter being a convex (external) additive group. The algebraic properties of external numbers formalize common error analysis, with rules for calculation which are a sort of mellowed form of the axioms for real numbers. We model the propagation of errors in matrix calculus by the calculus of matrices with external numbers, and study its algebraic properties. Many classical properties continue to hold, sometimes stated in terms of inclusion instead of equality. There are notable exceptions, for which we give counterexamples and investigate suitable adaptations. In particular we study addition and multiplication of matrices, determinants, near inverses, and generalized notions of linear independence and rank.
34

Radzikhovsky, N., I. Sokulskiy, and O. Dyshkant. "Pathomorphology of certain organs of immunogenesis by experimental reproduction of coronavirus infection in dogs." Scientific Messenger of LNU of Veterinary Medicine and Biotechnologies 22, no. 99 (October 28, 2020): 75–79. http://dx.doi.org/10.32718/nvlvet9912.

Abstract:
The article, based on the results of histological studies, presents data on the microscopic structure of the immune system – thymus, spleen, lymph nodes of dogs with experimental infection with coronavirus enteritis. Pathomorphological studies of immunocompetent organs from the dead (n = 5) puppies crossed Labrador breeds with outbred, infected with a coronavirus field isolate cultured on heterologous cell cultures (kidney kidney hamster (BHK-21), rabbit kidney (RK-13) and the renal mumps (SPEV). Pathological dissection of dogs was performed by partial evisceration in the usual sequence. Prepared histological sections were stained with hematoxylin and eosin according to standard recipes. The general histological structure and microstructural changes of histo- and cytostructures of organs in histological samples were studied under a light microscope. During coronavirus enteritis in dogs, pathomorphological changes in immunocompetent organs were found, which characterize the suppression of immunogenesis function during an infectious disease of viral etiology. Thus, in the spleen there are spotted hemorrhages, lymph nodes, moderate hyperplasia, with signs of hemorrhagic inflammation. Active proliferation of lymphoid cells, which leads to hyperplasia, is one of the markers of the pathogen's effect on the macroorganism in the form of an inflammatory process in regional lymph nodes, which indicates the multiplication of the virus and the development of immunological processes. Based on our analysis of literature sources, monitoring results and our own research, it was found that viral enteritis occupies a leading place in the infectious pathology of dogs and causes significant harm to animal owners. Thus, the need for additional research to clarify, supplement and summarize data on the pathomorphology of various organs and tissues in canine corona viridae enteritis, current immunoprophylaxis and treatment can significantly reduce the incidence and mortality from infection. We found a set of histological changes in the immune system during the experimental reproduction of coronavirus infection, can be considered a characteristic criterion for pathomorphological differential diagnosis of coronavirus enteritis in dogs.
35

Ratnakar, Shashi Kant, Subhajit Sanfui, and Deepak Sharma. "Graphics Processing Unit-Based Element-by-Element Strategies for Accelerating Topology Optimization of Three-Dimensional Continuum Structures Using Unstructured All-Hexahedral Mesh." Journal of Computing and Information Science in Engineering 22, no. 2 (December 9, 2021). http://dx.doi.org/10.1115/1.4052892.

Abstract:
Abstract Topology optimization has been successful in generating optimal topologies of various structures arising in real-world applications. Since these applications can have complex and large domains, topology optimization suffers from a high computational cost because of the use of unstructured meshes for the discretization of these domains and their finite element analysis (FEA). This article addresses this challenge by developing three graphics processing unit (GPU)-based element-by-element strategies targeting unstructured all-hexahedral meshes for the matrix-free preconditioned conjugate gradient (PCG) finite element solver. These strategies mainly perform the sparse matrix-vector multiplication (SpMV) arising in the FEA solver by allocating more GPU compute threads per element. Moreover, the strategies are developed to use the GPU's shared memory for efficient memory transactions. The proposed strategies are tested with the solid isotropic material with penalization (SIMP) method on four examples of 3D structural topology optimization. Results demonstrate that the proposed strategies achieve speedups of up to 8.2x over the standard GPU-based SpMV strategies from the literature.
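
The element-by-element, matrix-free product that such solvers rely on can be illustrated in plain C: the global stiffness matrix is never assembled, and each element's small dense matrix is applied to its own degrees of freedom and scattered into the global vector. The connectivity and element matrices below are made up for illustration; the paper's strategies run this pattern across GPU threads and use only the symmetric half of each element matrix.

```c
#include <stdio.h>

#define NE   2   /* number of finite elements      */
#define NPE  2   /* degrees of freedom per element */
#define NDOF 3   /* global degrees of freedom      */

/* Matrix-free, element-by-element y = K*x: each element applies its
 * small dense matrix ke to its own DOFs (gather from x) and the
 * result is scattered back into the global vector y. */
static void spmv_ebe(int conn[NE][NPE], double ke[NE][NPE][NPE],
                     const double *x, double *y)
{
    for (int i = 0; i < NDOF; ++i) y[i] = 0.0;
    for (int e = 0; e < NE; ++e) {
        for (int a = 0; a < NPE; ++a) {
            double sum = 0.0;
            for (int b = 0; b < NPE; ++b)
                sum += ke[e][a][b] * x[conn[e][b]];  /* gather x */
            y[conn[e][a]] += sum;                    /* scatter into y */
        }
    }
}

int main(void)
{
    /* Two 1D "bar" elements sharing DOF 1: K = [[1,-1,0],[-1,2,-1],[0,-1,1]] */
    int conn[NE][NPE] = {{0, 1}, {1, 2}};
    double ke[NE][NPE][NPE] = {{{1, -1}, {-1, 1}}, {{1, -1}, {-1, 1}}};
    double x[NDOF] = {1, 2, 4}, y[NDOF];

    spmv_ebe(conn, ke, x, y);
    printf("%.1f %.1f %.1f\n", y[0], y[1], y[2]);  /* -1.0 -1.0 2.0 */
    return 0;
}
```
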
36

Xiao, Guoqing, Chuanghui Yin, Tao Zhou, Xueqi Li, Yuedan Chen, and Kenli Li. "A Survey of Accelerating Parallel Sparse Linear Algebra." ACM Computing Surveys, June 17, 2023. http://dx.doi.org/10.1145/3604606.

Abstract:
Sparse linear algebra includes fundamental and important operations in various large-scale scientific computing and real-world applications. Sparse linear algebra faces a performance bottleneck because it mainly consists of memory-bound computations with low arithmetic intensity. How to improve its performance has increasingly become a focus of research efforts. Using parallel computing techniques to accelerate sparse linear algebra is currently the most popular approach, but it faces various challenges: large-scale data brings difficulties in storage, and the sparsity of data leads to irregular memory accesses and parallel load imbalance. Therefore, this paper provides a comprehensive overview of the acceleration of sparse linear algebra operations on parallel computing platforms, where we focus on four main classes: sparse matrix-vector multiplication (SpMV), sparse matrix-sparse vector multiplication (SpMSpV), sparse general matrix-matrix multiplication (SpGEMM), and sparse tensor algebra. The takeaways from this paper include the following: understanding the challenges of accelerating sparse linear algebra on various hardware platforms; understanding how structured data sparsity can improve storage efficiency; understanding how to optimize parallel load balance; understanding how to improve the efficiency of memory accesses; understanding how adaptive frameworks automatically select the optimal algorithms; and understanding recent design trends for the acceleration of parallel sparse linear algebra.
37

Cui, Huanyu, Nianbin Wang, Qilong Han, Ye Wang, and Jiahang Li. "A two‐stage parallel method on GPU based on hybrid‐compression‐format for diagonal matrix." Concurrency and Computation: Practice and Experience, August 10, 2023. http://dx.doi.org/10.1002/cpe.7887.

Abstract:
Abstract SpMV (sparse matrix-vector multiplication) is an important computing core in traditional high-performance computing and also one of the emerging data-intensive applications. For diagonal sparse matrices, it is frequently necessary to fill in a large number of zeros to maintain the diagonal structure when using the DIA (Diagonal) storage format. Filling with zeros consumes additional computing and memory resources, degrades the parallel computing performance of SpMV, and causes computing and storage redundancy. To address the deficiencies of the DIA format, a Two-stage parallel SpMV method is presented in this paper, which distributes the data of the diagonal part and the irregular part of a matrix to different CUDA kernels. As different compression methods are designed for the different matrix forms, a partition-based hybrid format of DIA and CSR (HPDC) is adopted in the two-stage method to ensure load balancing among computing resources and continuity of data access along the diagonals. Simultaneously, the standard deviation among blocks is used as a criterion to obtain the optimal number of blocks and distribution of data. The experiments were carried out on the Florida sparse matrix collection. Compared to DIA, cuSPARSE-CSR, HDC, and BRCSD, the execution time of the Two-stage method is shortened by factors of 4, 3.4, 1.9, and 1.15, respectively.
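
For context, the DIA layout that forces the zero fill discussed above stores one padded array per occupied diagonal; a minimal C sketch of DIA-based SpMV is shown below, with the names and the two-diagonal example chosen for illustration rather than taken from the paper.

```c
#include <stdio.h>

#define N  4   /* square matrix dimension    */
#define ND 2   /* number of stored diagonals */

/* DIA SpMV: offsets[d] identifies the diagonal (0 = main, +1 = first
 * superdiagonal, ...); diag[d][i] holds A[i][i + offsets[d]], with zero
 * padding where i + offset falls outside the matrix.  Matrices whose
 * nonzeros stray from a few diagonals force heavy zero fill, which is
 * the overhead a hybrid DIA/CSR scheme is designed to avoid. */
static void spmv_dia(const int *offsets, double diag[ND][N],
                     const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        double sum = 0.0;
        for (int d = 0; d < ND; ++d) {
            int j = i + offsets[d];
            if (j >= 0 && j < N)
                sum += diag[d][i] * x[j];
        }
        y[i] = sum;
    }
}

int main(void)
{
    /* Bidiagonal example: main diagonal of 2s, superdiagonal of 1s */
    int offsets[ND] = {0, 1};
    double diag[ND][N] = {
        {2, 2, 2, 2},      /* main diagonal                       */
        {1, 1, 1, 0}       /* superdiagonal, last slot is padding */
    };
    double x[N] = {1, 1, 1, 1}, y[N];

    spmv_dia(offsets, diag, x, y);
    printf("%.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]);  /* 3 3 3 2 */
    return 0;
}
```
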
38

Lu, Alec, Zhenman Fang, and Lesley Shannon. "Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking." ACM Transactions on Reconfigurable Technology and Systems, February 9, 2022. http://dx.doi.org/10.1145/3517131.

Abstract:
Both modern datacenter and embedded FPGAs provide great opportunities for high-performance and high energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the clock frequency of the accelerator design, 2) the number of concurrent memory access ports, 3) the data width of each port, 4) the maximum burst access length for each port, and 5) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and embedded FPGA (Xilinx ZCU104) when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. Comparing between the typically used soft and hardened memory systems respectively found on datacenter and embedded FPGAs, we further summarize their unique features and discuss the effective approaches to leverage these systems. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth, and achieve about 5.6x and 3.4x speedups over the 24-core CPU implementations.
APA, Harvard, Vancouver, ISO, and other styles
39

Ratnakar, Shashi Kant, Utpal Kiran, and Deepak Sharma. "Acceleration of structural topology optimization using symmetric element-by-element strategy for unstructured meshes on GPU." Engineering Computations, November 3, 2022. http://dx.doi.org/10.1108/ec-01-2022-0022.

Full text
Abstract:
Purpose: Structural topology optimization is computationally expensive due to the involvement of high-resolution meshes and the repetitive use of finite element analysis (FEA) for computing the structural response. Since FEA consumes most of the computational time in each optimization iteration, a novel GPU-based parallel strategy for FEA is presented and applied to the large-scale structural topology optimization of 3D continuum structures. Design/methodology/approach: A matrix-free solver based on the preconditioned conjugate gradient (PCG) method is proposed to minimize the computational time associated with the solution of the linear system of equations in FEA. The proposed solver uses an innovative strategy to utilize only the symmetric half of the elemental stiffness matrices for implementation of the element-by-element matrix-free solver on the GPU. Findings: Using the solid isotropic material with penalization (SIMP) method, the proposed matrix-free solver is tested on three 3D structural optimization problems that are discretized using all-hexahedral structured and unstructured meshes. Results show that the proposed strategy achieves a 3.1x–3.3x speedup for the FEA solver stage and an overall speedup of 2.9x–3.3x over the standard element-by-element strategy on the GPU. Moreover, the proposed strategy requires almost 1.8x less GPU memory than the standard element-by-element strategy. Originality/value: The proposed GPU-based matrix-free element-by-element solver takes a more general approach to the symmetry concept than previous works. It stores only the symmetric half of the elemental matrices in memory and performs matrix-free sparse matrix-vector multiplication (SpMV) without any inter-thread communication. A customized data storage format is also proposed to store and access only the symmetric half of the elemental stiffness matrices for coalesced read and write operations on the GPU over the unstructured mesh.
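A minimal serial sketch of the symmetric element-by-element idea (under assumed names and a packed row-major layout; it is not the authors' GPU kernel or their customized storage format): each element stores only the upper triangle of its stiffness matrix, and the matrix-free product y = K x is assembled element by element using the symmetry of the elemental matrices.

#include <vector>
#include <cstddef>

// Matrix-free y += K * x using only the packed upper triangle of each
// elemental stiffness matrix. 'dofs' maps each element's local DOFs to
// global DOFs; 'ke_upper' holds ndof*(ndof+1)/2 entries per element.
void ebe_spmv_symmetric(std::size_t num_elems, int ndof,
                        const std::vector<int>& dofs,        // num_elems * ndof
                        const std::vector<double>& ke_upper, // packed per element
                        const std::vector<double>& x,
                        std::vector<double>& y) {
    const std::size_t packed = static_cast<std::size_t>(ndof) * (ndof + 1) / 2;
    for (std::size_t e = 0; e < num_elems; ++e) {
        const int* map = &dofs[e * ndof];
        const double* ke = &ke_upper[e * packed];
        std::size_t k = 0;
        for (int i = 0; i < ndof; ++i) {
            for (int j = i; j < ndof; ++j, ++k) {
                double v = ke[k];
                y[map[i]] += v * x[map[j]];
                if (i != j)                 // mirror the off-diagonal term
                    y[map[j]] += v * x[map[i]];
            }
        }
    }
}

On a GPU this element loop is what gets parallelized; the serial form above only shows why storing the symmetric half is sufficient.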
APA, Harvard, Vancouver, ISO, and other styles
40

Appi Reddy, K., and T. Kurmayya. "Moore-Penrose inverses of Gram matrices leaving a cone invariant in an indefinite inner product space." Special Matrices 3, no. 1 (January 10, 2015). http://dx.doi.org/10.1515/spma-2015-0013.

Full text
Abstract:
In this paper we characterize Moore-Penrose inverses of Gram matrices leaving a cone invariant in an indefinite inner product space, using the indefinite matrix multiplication. This characterization includes the acuteness (or obtuseness) of certain closed convex cones.
APA, Harvard, Vancouver, ISO, and other styles
41

Tao, Zhuofu, Chen Wu, Yuan Liang, Kun Wang, and Lei He. "LW-GCN: A Lightweight FPGA-based Graph Convolutional Network Accelerator." ACM Transactions on Reconfigurable Technology and Systems, August 4, 2022. http://dx.doi.org/10.1145/3550075.

Full text
Abstract:
Graph convolutional networks (GCNs) have been introduced to effectively process non-Euclidean graph data. However, GCNs incur large amounts of irregularity in computation and memory access, which prevents efficient use of traditional neural network accelerators. Moreover, existing dedicated GCN accelerators demand high memory volumes and are difficult to implement onto resource-limited edge devices. In this work, we propose LW-GCN, a lightweight FPGA-based accelerator with a software-hardware co-designed process to tackle irregularity in computation and memory access in GCN inference. LW-GCN decomposes the main GCN operations into Sparse Matrix-Matrix Multiplication (SpMM) and Matrix-Matrix Multiplication (MM). We propose a novel compression format to balance workload across PEs and prevent data hazards. Moreover, we apply data quantization and workload tiling, and map both SpMM and MM of GCN inference onto a uniform architecture on resource-limited hardware. Evaluations of GCN and GraphSAGE are performed on a Xilinx Kintex-7 FPGA with three popular datasets. Compared to existing CPU, GPU, and state-of-the-art FPGA-based accelerators, LW-GCN reduces latency by up to 60x, 12x, and 1.7x and increases power efficiency by up to 912x, 511x, and 3.87x, respectively. Furthermore, compared with NVIDIA's latest edge GPU Jetson Xavier NX, LW-GCN achieves speedup and energy savings of 32x and 84x, respectively.
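The SpMM + MM decomposition mentioned above corresponds to evaluating a GCN layer Z = (A X) W in two steps: a sparse-dense product with the (normalized) adjacency matrix, followed by a dense product with the weight matrix. The following is an illustrative serial reference under assumed names, not LW-GCN's quantized, tiled hardware mapping.

#include <vector>
#include <cstddef>

// One GCN layer, Z = (A * X) * W, split into SpMM then MM.
// A: n x n adjacency in CSR; X: n x f_in features; W: f_in x f_out weights.
// All dense matrices are row-major. Serial reference for clarity only.
void gcn_layer(std::size_t n, std::size_t f_in, std::size_t f_out,
               const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
               const std::vector<double>& a_val,
               const std::vector<double>& X, const std::vector<double>& W,
               std::vector<double>& Z) {
    std::vector<double> AX(n * f_in, 0.0);
    // SpMM: each sparse row of A scales and accumulates dense rows of X.
    for (std::size_t i = 0; i < n; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            double a = a_val[k];
            const double* xrow = &X[static_cast<std::size_t>(col_idx[k]) * f_in];
            for (std::size_t c = 0; c < f_in; ++c)
                AX[i * f_in + c] += a * xrow[c];
        }
    // MM: Z = AX * W.
    Z.assign(n * f_out, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t c = 0; c < f_in; ++c) {
            double v = AX[i * f_in + c];
            for (std::size_t o = 0; o < f_out; ++o)
                Z[i * f_out + o] += v * W[c * f_out + o];
        }
}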
APA, Harvard, Vancouver, ISO, and other styles
42

Verde-Star, Luis. "Elementary triangular matrices and inverses of k-Hessenberg and triangular matrices." Special Matrices 3, no. 1 (January 6, 2015). http://dx.doi.org/10.1515/spma-2015-0025.

Full text
Abstract:
We use elementary triangular matrices to obtain some factorization, multiplication, and inversion properties of triangular matrices. We also obtain explicit expressions for the inverses of strict k-Hessenberg matrices and banded matrices. Our results can be extended to the cases of block triangular and block Hessenberg matrices. An n × n lower triangular matrix is called elementary if it is of the form I + C, where I is the identity matrix and C is lower triangular and has all of its nonzero entries in the k-th column, where 1 ≤ k ≤ n.
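As a small illustration of this definition (a sketch added here, not drawn from the paper): with n = 4 and k = 2, an elementary lower triangular matrix has the form shown below, and in the strictly lower case (zero diagonal entry in column k) its inverse is obtained simply by negating that column, since then C^2 = 0.

\[
T = I + C, \qquad
C = \begin{pmatrix}
0 & 0      & 0 & 0 \\
0 & c_{22} & 0 & 0 \\
0 & c_{32} & 0 & 0 \\
0 & c_{42} & 0 & 0
\end{pmatrix},
\qquad
c_{22} = 0 \;\Rightarrow\; C^{2} = 0 \ \text{and}\ (I + C)^{-1} = I - C .
\]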
APA, Harvard, Vancouver, ISO, and other styles
43

Nazarov, F., V. Vasyunin, and A. Volberg. "On a Bellman function associated with the Chang–Wilson–Wolff theorem: a case study." St. Petersburg Mathematical Journal, June 27, 2022. http://dx.doi.org/10.1090/spmj/1719.

Full text
Abstract:
The tail of distribution (i.e., the measure of the set $\{ f \ge x \}$) is estimated for those functions $f$ whose dyadic square function is bounded by a given constant. In particular, an estimate following from the Chang–Wilson–Wolff theorem is slightly improved. The study of the Bellman function corresponding to the problem reveals a curious structure of this function: it has jumps of the first derivative at a dense subset of the interval $[0,1]$ (where it is calculated exactly), but it is of $C^\infty$-class for $x > \sqrt{3}$ (where it is calculated up to a multiplicative constant). An unusual feature of the paper is the use of computer calculations in the proof. Nevertheless, all the proofs are quite rigorous, since only integer arithmetic was assigned to a computer.
APA, Harvard, Vancouver, ISO, and other styles
44

Hilberdink, T., and A. Pushnitski. "Spectral asymptotics for a family of LCM matrices." St. Petersburg Mathematical Journal, June 7, 2023. http://dx.doi.org/10.1090/spmj/1764.

Full text
Abstract:
The family of arithmetical matrices given explicitly by
\begin{equation*} E(\sigma,\tau) = \bigg\{ \frac{n^\sigma m^\sigma}{[n,m]^\tau} \bigg\}_{n,m=1}^\infty \end{equation*}
is studied, where $[n,m]$ is the least common multiple of $n$ and $m$ and the real parameters $\sigma$ and $\tau$ satisfy $\rho := \tau - 2\sigma > 0$, $\tau - \sigma > \frac{1}{2}$, and $\tau > 0$. It is proved that $E(\sigma,\tau)$ is a compact selfadjoint positive definite operator on $\ell^2(\mathbb{N})$, and the ordered sequence of eigenvalues of $E(\sigma,\tau)$ obeys the asymptotic relation
\begin{equation*} \lambda_n(E(\sigma,\tau)) = \frac{\varkappa(\sigma,\tau)}{n^\rho} + o(n^{-\rho}), \quad n \to \infty, \end{equation*}
with some $\varkappa(\sigma,\tau) > 0$. This fact is applied to the asymptotics of singular values of truncated multiplicative Toeplitz matrices with the symbol given by the Riemann zeta function on the vertical line with abscissa $\sigma > 1/2$. The relationship of the spectral analysis of $E(\sigma,\tau)$ with the theory of generalized prime systems is also pointed out.
APA, Harvard, Vancouver, ISO, and other styles
45

Burchard, Luk, Kristian Gregorius Hustad, Johannes Langguth, and Xing Cai. "Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation." Frontiers in Physics 11 (March 30, 2023). http://dx.doi.org/10.3389/fphy.2023.979699.

Full text
Abstract:
A new trend in processor architecture design is the packaging of thousands of small processor cores into a single device, where there is no device-level shared memory but each core has its own local memory. Thus, both the work and data of an application code need to be carefully distributed among the small cores, also termed tiles. In this paper, we investigate how numerical computations that involve unstructured meshes can be efficiently parallelized and executed on a massively tiled architecture. Graphcore IPUs are chosen as the target hardware platform, to which we port an existing monodomain solver that simulates cardiac electrophysiology over realistic 3D irregular heart geometries. There are two computational kernels in this simulator: a 3D diffusion equation is discretized over an unstructured mesh and numerically approximated by repeatedly executing sparse matrix-vector multiplications (SpMVs), while an individual system of ordinary differential equations (ODEs) is explicitly integrated per mesh cell. We demonstrate how a new style of programming that uses Poplar/C++ can be used to port these commonly encountered computational tasks to Graphcore IPUs. In particular, we describe a per-tile data structure that is adapted to facilitate the inter-tile data exchange needed for parallelizing the SpMVs. We also study the achievable performance of the ODE solver, which heavily depends on special mathematical functions, as well as their accuracy on Graphcore IPUs. Moreover, topics related to using multiple IPUs and performance analysis are addressed. In addition to demonstrating an impressive level of performance that can be achieved by IPUs for monodomain simulation, we also provide a discussion on the generic theme of parallelizing and executing unstructured-mesh multiphysics computations on massively tiled hardware.
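A minimal sketch of the data-layout problem the abstract describes for distributing SpMV across tiles (assumed names and a simple contiguous row partition; it is not the authors' Poplar/C++ per-tile data structure): rows of a CSR matrix are assigned to tiles, and each tile records which entries of x it must receive from other tiles before running its local SpMV.

#include <algorithm>
#include <set>
#include <vector>
#include <cstddef>

// For each tile, list the x-indices owned by other tiles that its local
// CSR rows reference; these form the inter-tile exchange (halo) lists.
// Rows are assigned to tiles in contiguous blocks for simplicity.
std::vector<std::vector<int>> build_halo_lists(
        std::size_t n, int num_tiles,
        const std::vector<int>& row_ptr, const std::vector<int>& col_idx) {
    std::size_t rows_per_tile = (n + num_tiles - 1) / num_tiles;
    std::vector<std::vector<int>> halo(num_tiles);
    for (int t = 0; t < num_tiles; ++t) {
        std::size_t r0 = t * rows_per_tile;
        std::size_t r1 = std::min(n, r0 + rows_per_tile);
        std::set<int> remote;
        for (std::size_t i = r0; i < r1; ++i)
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                int j = col_idx[k];
                if (j < static_cast<int>(r0) || j >= static_cast<int>(r1))
                    remote.insert(j);    // x[j] is owned by another tile
            }
        halo[t].assign(remote.begin(), remote.end());
    }
    return halo;
}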
APA, Harvard, Vancouver, ISO, and other styles
46

Pistelli, Laura, Cecilia Noccioli, Francesca D'Angiolillo, and Luisa Pistelli. "Composition of volatile in micropropagated and field grown aromatic plants from Tuscany Islands." Acta Biochimica Polonica 60, no. 1 (February 25, 2013). http://dx.doi.org/10.18388/abp.2013_1949.

Full text
Abstract:
Aromatic plant species present in the natural Park of Tuscany Archipelago are used as flavoring agents and spices, as dietary supplements and in cosmetics and aromatherapy. The plants are usually collected from wild stands, inducing a depletion of the natural habitat. Therefore, micropropagation of these aromatic plants can play a role in the protection of the natural ecosystem, can guarantee a massive sustainable production and can provide standardized plant materials for diverse economical purposes. The aim of this study is to compare the volatile organic compounds produced by the wild plants with those from in vitro plantlets using headspace solid phase micro-extraction (HS-SPME) followed by capillary gas-chromatography coupled to mass spectrometry (GC-MS). Typical plants of this natural area selected for this work were Calamintha nepeta L., Crithmum maritimum L., Lavandula angustifolia L., Myrtus communis L., Rosmarinus officinalis L., Salvia officinalis L. and Satureja hortensis L. Different explants were used: microcuttings with vegetative apical parts, axillary buds and internodes. Sterilization percentage, multiplication rate and shoot length, as well as root formation were measured. The volatile aromatic profiles produced from in vitro plantlets were compared with those of the wild plants, in particular for C. maritimum, R. officinalis, S. officinalis and S. hortensis. This study indicated that the micropropagation technique can represent a valid alternative to produce massive and sterile plant material characterised by the same aromatic flavour as in the wild grown plants.
APA, Harvard, Vancouver, ISO, and other styles
