Dissertations / Theses: 'Sparse Matrix Vector Multiplication'

1

Ashari, Arash. "Sparse Matrix-Vector Multiplication on GPU." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1417770100.

Full text

APA, Harvard, Vancouver, ISO, and other styles

2

Ramachandran, Shridhar. "Incremental PageRank acceleration using Sparse Matrix-Sparse Vector Multiplication." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462894358.

Full text

APA, Harvard, Vancouver, ISO, and other styles

3

Balasubramanian, Deepan Karthik. "Efficient Sparse Matrix Vector Multiplication for Structured Grid Representation." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1339730490.

Full text

APA, Harvard, Vancouver, ISO, and other styles

4

Mansour, Ahmad [Verfasser]. "Sparse Matrix-Vector Multiplication Based on Network-on-Chip / Ahmad Mansour." München : Verlag Dr. Hut, 2015. http://d-nb.info/1075409470/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

5

Singh, Kunal. "High-Performance Sparse Matrix-Multi Vector Multiplication on Multi-Core Architecture." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524089757826551.

Full text

APA, Harvard, Vancouver, ISO, and other styles

6

El-Kurdi, Yousef M. "Sparse Matrix-Vector floating-point multiplication with FPGAs for finite element electromagnetics." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=98958.

Full text

Abstract:

The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. Field Programmable Gate Arrays (FPGAs) have been shown to have higher peak floating-point performance than general purpose CPUs, and the trends are moving in favor of FPGAs. We present an architecture and implementation of an FPGA-based Sparse Matrix-Vector Multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. Our architecture exploits the FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements. The architecture is based on a pipelined linear array of Processing Elements (PEs). A hardware-oriented matrix "striping" scheme is developed which reduces the number of required processing elements. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA reconfigurable systems, this architecture can achieve 1.5 GFLOPS sustained performance. A single pipeline uses 30% of the logic resources and 40% of the memory resources of a Stratix S80 FPGA. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solvers such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.

APA, Harvard, Vancouver, ISO, and other styles

7

Godwin, Jeswin Samuel. "High-Performancs Sparse Matrix-Vector Multiplication on GPUS for Structured Grid Computations." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1357280824.

Full text

APA, Harvard, Vancouver, ISO, and other styles

8

Kunchum, Rakshith. "On Improving Sparse Matrix-Matrix Multiplication on GPUs." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1492694387445938.

Full text

APA, Harvard, Vancouver, ISO, and other styles

9

Pantawongdecha, Payut. "Autotuning divide-and-conquer matrix-vector multiplication." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105968.

Full text

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 73-75).
Divide and conquer is an important concept in computer science. It is used ubiquitously to simplify and speed up programs. However, it needs to be optimized, with respect to parameter settings for example, in order to achieve the best performance. The problem boils down to searching for the best implementation choice on a given set of requirements, such as which machine the program is running on. The goal of this thesis is to apply and evaluate the Ztune approach [14] on serial divide-and-conquer matrix-vector multiplication. We implemented Ztune to autotune serial divide-and-conquer matrix-vector multiplication on machines with different hardware configurations, and found that Ztuneoptimized codes ran 1%-5% faster than the hand-optimized counterparts. We also compared Ztune-optimized results with other matrix-vector multiplication libraries including the Intel Math Kernel Library and OpenBLAS. Since the matrix-vector multiplication problem is a level 2 BLAS, it is not as computationally intensive as level 3 BLAS problems such as matrix-matrix multiplication and stencil computation. As a result, the measurement in matrix-vector multiplication is more prone to error from factors such as noise, cache alignment of the matrix, and cache states, which lead to wrong decision choices for Ztune. We explored multiple options to get more accurate measurements and demonstrated the techniques that remedied these issues. Lastly, we applied the Ztune approach to matrix-matrix multiplication, and we were able to achieve 2%-85% speedup compared to the hand-tuned code. This thesis represents joint work with Ekanathan Palamadai Natarajan.
by Payut Pantawongdecha.
M. Eng.

APA, Harvard, Vancouver, ISO, and other styles

10

Belgin, Mehmet. "Structure-based Optimizations for Sparse Matrix-Vector Multiply." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/30260.

Full text

Abstract:

This dissertation introduces two novel techniques, OSF and PBR, to improve the performance of Sparse Matrix-vector Multiply (SMVM) kernels, which dominate the runtime of iterative solvers for systems of linear equations. SMVM computations that use sparse formats typically achieve only a small fraction of peak CPU speeds because they are memory bound due to their low flops:byte ratio, they access memory irregularly, and exhibit poor ILP due to inefficient pipelining. We particularly focus on improving the flops:byte ratio, which is the main limiter on performance, by exploiting recurring structures or sub-structures in matrices. Our techniques also support micro-architecture level optimizations to further improve performance. Operation Stacking Framework (OSF) stacks problems in large ensemble computations, which run the same sparse kernel using an identical matrix structure, such that they share a single copy of the indexing information to significantly reduce memory bandwidth usage. OSF provides performance improvements of up to 1.94x on an AMD Opteron compared to the CSR method. We validate performance results using hardware event counters, which demonstrate significantly improved cache and pipeline utilization. Pattern-based Representation (PBR) exploits recurring block nonzero patterns by generating custom code for each recurring block pattern. In this way, no indexing data for individual nonzero elements are read from memory, reducing the overall size of the indices by up to 98%. Our code generator emits highly tuned codes that utilize SSE vectorization and software prefetching. PBR accurately identifies a block size that achieves optimal or near-optimal performance using a linear multiple regression performance model. On recent multicore machines, PBR provides performance improvements of up to 3.4x sequentially and 5x in parallel, compared to the CSR method. The PBR library we provide converts matrices at runtime, allowing our method to be used as a drop-in replacement for existing methods. We compare PBRâ s overhead relative to its benefits and show that PBR is beneficial for many applications that repetitively call the SMVM kernel for the same matrix structure.
Ph. D.

APA, Harvard, Vancouver, ISO, and other styles

11

DeLorimier, Michael DeHon André. "Floating-point sparse matrix-vector multiply for FPGAs /." Diss., Pasadena, Calif. : California Institute of Technology, 2005. http://resolver.caltech.edu/CaltechETD:etd-05132005-144347.

Full text

APA, Harvard, Vancouver, ISO, and other styles

12

Thumma, Vineeth Reddy. "Optimizing Sparse Matrix-Matrix Multiplication for Graph Computations on GPUs and Multi-Core Systems." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524113772955789.

Full text

APA, Harvard, Vancouver, ISO, and other styles

13

Kuang, Da. "Nonnegative matrix factorization for clustering." Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/52299.

Full text

Abstract:

This dissertation shows that nonnegative matrix factorization (NMF) can be extended to a general and efficient clustering method. Clustering is one of the fundamental tasks in machine learning. It is useful for unsupervised knowledge discovery in a variety of applications such as text mining and genomic analysis. NMF is a dimension reduction method that approximates a nonnegative matrix by the product of two lower rank nonnegative matrices, and has shown great promise as a clustering method when a data set is represented as a nonnegative data matrix. However, challenges in the widespread use of NMF as a clustering method lie in its correctness and efficiency: First, we need to know why and when NMF could detect the true clusters and guarantee to deliver good clustering quality; second, existing algorithms for computing NMF are expensive and often take longer time than other clustering methods. We show that the original NMF can be improved from both aspects in the context of clustering. Our new NMF-based clustering methods can achieve better clustering quality and run orders of magnitude faster than the original NMF and other clustering methods. Like other clustering methods, NMF places an implicit assumption on the cluster structure. Thus, the success of NMF as a clustering method depends on whether the representation of data in a vector space satisfies that assumption. Our approach to extending the original NMF to a general clustering method is to switch from the vector space representation of data points to a graph representation. The new formulation, called Symmetric NMF, takes a pairwise similarity matrix as an input and can be viewed as a graph clustering method. We evaluate this method on document clustering and image segmentation problems and find that it achieves better clustering accuracy. In addition, for the original NMF, it is difficult but important to choose the right number of clusters. We show that the widely-used consensus NMF in genomic analysis for choosing the number of clusters have critical flaws and can produce misleading results. We propose a variation of the prediction strength measure arising from statistical inference to evaluate the stability of clusters and select the right number of clusters. Our measure shows promising performances in artificial simulation experiments. Large-scale applications bring substantial efficiency challenges to existing algorithms for computing NMF. An important example is topic modeling where users want to uncover the major themes in a large text collection. Our strategy of accelerating NMF-based clustering is to design algorithms that better suit the computer architecture as well as exploit the computing power of parallel platforms such as the graphic processing units (GPUs). A key observation is that applying rank-2 NMF that partitions a data set into two clusters in a recursive manner is much faster than applying the original NMF to obtain a flat clustering. We take advantage of a special property of rank-2 NMF and design an algorithm that runs faster than existing algorithms due to continuous memory access. Combined with a criterion to stop the recursion, our hierarchical clustering algorithm runs significantly faster and achieves even better clustering quality than existing methods. Another bottleneck of NMF algorithms, which is also a common bottleneck in many other machine learning applications, is to multiply a large sparse data matrix with a tall-and-skinny dense matrix. We use the GPUs to accelerate this routine for sparse matrices with an irregular sparsity structure. Overall, our algorithm shows significant improvement over popular topic modeling methods such as latent Dirichlet allocation, and runs more than 100 times faster on data sets with millions of documents.

APA, Harvard, Vancouver, ISO, and other styles

14

Muradov, Feruz. "Development, Implementation, Optimization and Performance Analysis of Matrix-Vector Multiplication on Eight-Core Digital Signal Processor." Thesis, KTH, Numerisk analys, NA, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-131289.

Full text

Abstract:

This thesis work aims at implementing the sparse matrix vector multiplication on eight-core Digital Signal Processor (DSP) and giving insights on how to optimize matrix multiplication on DSP to achieve high energy efficiency. We used two sparse matrix formats: the Compressed Sparse Row (CSR) and the Block Compressed Sparse Row (BCSR) formats. We carried out loop unrolling optimization of the naive algorithm. In addition, we implemented the Registerblocked and the Cache-blocked sparse matrix vector multiplications to optimize the naive algorithm. The computational performance improvement with loop unrolling technique was promising (≈12%). With this optimization, we observed a decrease of power usage (0.3 W) when using a matrix size of 600 and an increase of power usage (1.2 W), when using larger size matrices. The Register-blocked algorithm resulted to be the most efficient technique on DSP. With this algorithm, we were able to increase performance by a factor of six when compared to the naive algorithm, still retaining low power consumption (≈ 14 W). The Cache-blocked sparse matrix vector multiplication is known to be most convenient for large number of architectures with coherent caches. However, because DSP does not support coherency between caches, this method did not show large improvement in computational performance. In fact, we confirm that power consumption for the Cache-blocked method was higher when compared to other effective algorithms such as Register-blocked sparse matrix vector multiplication and loop unrolling of naive algorithm. In conclusion, we found that the DSP delivers low power consumption, excellent computational performance and energy efficiency when the Register-blocked sparse matrix vector multiplication technique is used.

APA, Harvard, Vancouver, ISO, and other styles

15

Murugandi, Iyyappa Thirunavukkarasu. "A New Representation of Structured Grids for Matrix-vector Operation and Optimization of Doitgen Kernel." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1276878729.

Full text

APA, Harvard, Vancouver, ISO, and other styles

16

Eibner, Tino, and Jens Markus Melenk. "Fast algorithms for setting up the stiffness matrix in hp-FEM: a comparison." Universitätsbibliothek Chemnitz, 2006. http://nbn-resolving.de/urn:nbn:de:swb:ch1-200601623.

Full text

Abstract:

We analyze and compare different techniques to set up the stiffness matrix in the hp-version of the finite element method. The emphasis is on methods for second order elliptic problems posed on meshes including triangular and tetrahedral elements. The polynomial degree may be variable. We present a generalization of the Spectral Galerkin Algorithm of [7], where the shape functions are adapted to the quadrature formula, to the case of triangles/tetrahedra. Additionally, we study on-the-fly matrix-vector multiplications, where merely the matrix-vector multiplication is realized without setting up the stiffness matrix. Numerical studies are included.

APA, Harvard, Vancouver, ISO, and other styles

17

Niu, Qingpeng. "Characterization and Enhancement of Data Locality and Load Balancing for Irregular Applications." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1420811652.

Full text

APA, Harvard, Vancouver, ISO, and other styles

18

Flegar, Goran. "Sparse Linear System Solvers on GPUs: Parallel Preconditioning, Workload Balancing, and Communication Reduction." Doctoral thesis, Universitat Jaume I, 2019. http://hdl.handle.net/10803/667096.

Full text

Abstract:

With the breakdown of Dennard scaling in the mid-2000s and the end of Moore's law on the horizon, the high performance computing community is turning its attention towards unconventional accelerator hardware to ensure the continued growth of computational capacity. This dissertation presents several contributions related to the iterative solution of sparse linear systems on the most widely used general purpose accelerator - the Graphics Processing Unit (GPU). Specifically, it accelerates the major building blocks of Krylov solvers, and describes their realization as part of a software library of reusable building blocks. The first part of the dissertation focuses on the sparse matrix-vector product and effective load balancing in the presence of irregular sparsity patterns. The second part describes the design of high-performance preconditioners. Finally, the third part demonstrates the potential of adaptive precision techniques for constructing preconditioners with lower memory footprint, and accuracy comparable to their full precision equivalents.
Con el final de la ley de Dennard y el cercano fin de la ley de Moore, la comunidad en computación de altas prestaciones se está centrando en tecnologías de aceleración no convencionales para asegurar el crecimiento exponencial de la capacidad de computación. Esta tesis contribuye a la solución iterativa de sistemas lineales dispersos en el acelerador más difundido: el procesador gráfico. Específicamente, el trabajo acelera los bloques fundamentales de los métodos de Krylov, y describe su implementación como parte de una biblioteca de bloques reutilizables. La primera parte del trabajo se centra en el producto matriz-vector disperso y el equilibrado de la carga ante patrones de dispersidad irregulares. La segunda parte describe el diseño de precondicionadores de alto rendimiento. Finalmente, la tercera parte demuestra el potencial de las técnicas de precisión adaptativa para construir precondicionadores con menor consumo de memoria, y fiabilidad comparable con las versiones de precisión completa.

APA, Harvard, Vancouver, ISO, and other styles

19

Boyer, Brice. "Multiplication matricielle efficace et conception logicielle pour la bibliothèque de calcul exact LinBox." Phd thesis, Université de Grenoble, 2012. http://tel.archives-ouvertes.fr/tel-00767915.

Full text

Abstract:

Dans ce mémoire de thèse, nous développons d'abord des multiplications matricielles efficaces. Nous créons de nouveaux ordonnancements qui permettent de réduire la taille de la mémoire supplémentaire nécessaire lors d'une multiplication du type Winograd tout en gardant une bonne complexité, grâce au développement d'outils externes ad hoc (jeu de galets), à des calculs fins de complexité et à de nouveaux algorithmes hybrides. Nous utilisons ensuite des technologies parallèles (multicœurs et GPU) pour accélérer efficacement la multiplication entre matrice creuse et vecteur dense (SpMV), essentielles aux algorithmes dits /boîte noire/, et créons de nouveaux formats hybrides adéquats. Enfin, nous établissons des méthodes de /design/ générique orientées vers l'efficacité, notamment par conception par briques de base, et via des auto-optimisations. Nous proposons aussi des méthodes pour améliorer et standardiser la qualité du code de manière à pérenniser et rendre plus robuste le code produit. Cela permet de pérenniser de rendre plus robuste le code produit. Ces méthodes sont appliquées en particulier à la bibliothèque de calcul exact LinBox.

APA, Harvard, Vancouver, ISO, and other styles

20

Hong, Changwan. "Code Optimization on GPUs." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1557123832601533.

Full text

APA, Harvard, Vancouver, ISO, and other styles

21

Tsai, Sung-Han, and 蔡松翰. "Optimization for sparse matrix-vector multiplication based on NVIDIA CUDA platform." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/qw23p7.

Full text

Abstract:

碩士
國立彰化師範大學
資訊工程學系
105
In recent years, large size sparse matrices are often used in fields such as science and engineering which usually apply in computing linear model. Using the ELLPACK format to store sparse matrices, it can reduce the matrix storage space. But if there is too much nonzero elements in one of row of the original sparse matrix, it still waste too much memory space. There are many research focusing on the Sparse Matrix–Vector Multiplication（SpMV）with ELLPACK format on Graphic Processing Unit（GPU）. Therefore, the purpose of our research is reducing the access space of sparse matrix which is transformed in Compressed Sparse Row（CSR）format after Reverse Cutthill-McKee（RCM）algorithm to accelerate for SpMV on GPU. Due to lower data access ratio from SpMV, the performance is restricted by memory bandwidth. Our propose is based on CSR format from two aspects:（1）reduce cache misses to enhance the vector locality and raise the performance, and（2）reduce accessed matrix data by index reduction to optimize the performance.

APA, Harvard, Vancouver, ISO, and other styles

22

Jheng, Hong-Yuan, and 鄭弘元. "FPGA Acceleration of Sparse Matrix-Vector Multiplication Based on Network-on-Chip." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/y884tf.

Full text

Abstract:

碩士
國立臺灣科技大學
電子工程系
99
The Sparse Matrix-Vector Multiplication (SMVM) is a pervasive operation in many scientific and engineering applications. Moreover, SMVM is a computational intensive operation that dominates the performance in most iterative linear system solvers. There are some optimization challenges in computations involving SMVM due to its high memory access rate and irregular memory access pattern. In this thesis, a new design concept for SMVM in an FPGA by using Network-on-Chip (NoC) is presented. In traditional circuit design on-chip communications have been designed with dedicated point-to-point interconnections or shared buses. Therefore, regular data transfer is the major concern of many parallel implementations. However, when dealing with the SMVM operation, the required data transfers are usually dependent on the sparsity structure of the matrix and can be extremely irregular. Using an NoC architecture makes it possible to deal with arbitrary structure of the data transfers, i.e. with arbitrary structured sparse matrices. In addition, the size of the pipelined SMVM calculator based on NoC architecture can be customized to 2×2, 4×4, ..., p×p (p∈N) due to its high scalibility and flexibility. The implementation is done in IEEE-754 single floating-point precision on the Xilinx Virtex-6 FPGA. The experimental results show that the proposed NoC-based implementation can achieve approximate 2.3 - 5.6 speed-up over the MATLAB-based software implementation in Matrix Market benchmark applications.

APA, Harvard, Vancouver, ISO, and other styles

23

Hsu, Wei-chun, and 徐偉郡. "Sparse Matrix-Vector Multiplication: A Low Communication Cost Data Mapping-Based Architecture." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/09761233547687794389.

Full text

Abstract:

碩士
國立臺灣科技大學
電子工程系
103
The performance of the sparse matrix-vector multiplication (SMVM) on a parallel system is strongly conditioned by the distribution of data among its components. Two costs arise as a result of the used data mapping method: arithmetic and communication. The communication cost of an algorithm often dominates the arithmetic cost, and the gap between these costs tends to increase. Therefore, finding a mapping method that reduces the communication cost is of high importance. On the other hand, the load distribution among the processing units must not be sacrificed as well. In this paper, a data mapping method is proposed for SMVM on Network-on-Chip (NoC) which achieves balanced working load and reduces the communication cost. Afterwards, an FPGA-based architecture is introduced which is designed to fit the proposed data mapping method. The experimental results show that the communication cost of the proposed design is 40\% lower than the previous work.

APA, Harvard, Vancouver, ISO, and other styles

24

TSAI, NIAN-YING, and 蔡念穎. "On Job Allocation Strategies for Running Sparse Matrix-Vector Multiplication on GPUs." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/mpwmh4.

Full text

Abstract:

碩士
國立彰化師範大學
資訊工程學系
105
In the era of big data, Graphic Processing Unit (GPU) has been widely used to deal with many parallelization problems as the amount of data to be processed. Sparse Matrix – Vector Multiplication is an important and basic operation in many fields, there are still many improved space for raising the performance of the GPU operation. This paper is mainly about job allocation strategies for running Sparse Matrix-Vector Multiplication on GPUs. The LightSpMV algorithm is based on the standard CSR format. The CSR format is a common sparse matrix storage format which is more flexible and better than other formats. The LightSpMV algorithm uses two dynamic configuration methods, whose matrix row is distributed to one for vector and the other for warp. Both of the methods use Atomic operations to get the Row index values. Because Atomic operations spent too much execution time, we proposed three strategies for this part of the workload allocation: (1) Using warp as the basic unit, through doubling the number of rows which have to be executed for each allocation, to make the number of Atomic operations reduced. (2) Using block as the basic unit, the number of rows are allocated dynamically. Compared to the dynamic configuration of using warp as basic unit, this strategy can reduce the number of Atomic operations. (3) Using block as the basic unit, the number of rows executed by blocks are static allocation. In each block we reuse warp as the basic unit and the warp are allocated dynamically instead of Atomic operations.After the implementation of our experiments in the work environment of the GTX980 GPU, the best performance is the third strategy and the performance improvement is nearly 100%.

APA, Harvard, Vancouver, ISO, and other styles

25

Tsai, Minzong, and 蔡旻容. "Implementing Simple Parallel Sparse Matrix-Matrix Multiplication Using OpenMP." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/69501441604898008038.

Full text

Abstract:

碩士
靜宜大學
資訊碩士在職專班
100
Recently, parallel programming is becoming more popular because of current trend of multi-core technology. In this thesis, we design a parallel simple sparse matrix-matrix multiplication using common sparse format. We analyze the data structure of sparse matrix named CRS (Compress Row Storage) and CCS (Compress Column Storage) in order to reduce unnecessary mathematical operation. Our algorithms are written in C and the parallel algorithms are implemented using OpenMP. We present four parallel algorithms of sparse matrix multiplication named CRSxCRS, CRSxCCS, CCSxCCS, and CCSxCRS using OpenMP and run on Dell 6950. Finally, we compare their performances. The preliminary experimental result shows that the CRSxCRS and CCSxCCS give the best in four algorithms.

APA, Harvard, Vancouver, ISO, and other styles

26

Mirza, Salma. "Scalable, Memory-Intensive Scientific Computing on Field Programmable Gate Arrays." 2010. https://scholarworks.umass.edu/theses/404.

Full text

Abstract:

Cache-based, general purpose CPUs perform at a small fraction of their maximum floating point performance when executing memory-intensive simulations, such as those required for many scientific computing problems. This is due to the memory bottleneck that is encountered with large arrays that must be stored in dynamic RAM. A system of FPGAs, with a large enough memory bandwidth, and clocked at only hundreds of MHz can outperform a CPU clocked at GHz in terms of floating point performance. An FPGA core designed for a target performance that does not unnecessarily exceed the memory imposed bottleneck can then be distributed, along with multiple memory interfaces, into a scalable architecture that overcomes the bandwidth limitation of a single interface. Interconnected cores can work together to solve a scientific computing problem and exploit a bandwidth that is the sum of the bandwidth available from all of their connected memory interfaces. The implementation demonstrates this concept of scalability with two memory interfaces through the use of available FPGA prototyping platforms. Even though the FPGAs operate at 133 MHz, which is twenty one times slower than an AMD Phenom X4 processor operating at 2.8 GHz, the system of two FPGAs performs eight times slower than the processor for the example problem of SMVM in heat transfer. However, the system is demonstrated to be scalable with a run-time that decreases linearly with respect to the available memory bandwidth. The floating point performance of a single board implementation is 12 GFlops which doubles to 24 GFlops for a two board implementation, for a gather or scatter operation on matrices of varying sizes.

APA, Harvard, Vancouver, ISO, and other styles

27

Wu, Xiaolong. "Optimizing Sparse Matrix-Matrix Multiplication on a Heterogeneous CPU-GPU Platform." 2015. http://scholarworks.gsu.edu/cs_theses/83.

Full text

Abstract:

Sparse Matrix-Matrix multiplication (SpMM) is a fundamental operation over irregular data, which is widely used in graph algorithms, such as finding minimum spanning trees and shortest paths. In this work, we present a hybrid CPU and GPU-based parallel SpMM algorithm to improve the performance of SpMM. First, we improve data locality by element-wise multiplication. Second, we utilize the ordered property of row indices for partial sorting instead of full sorting of all triples according to row and column indices. Finally, through a hybrid CPU-GPU approach using two level pipelining technique, our algorithm is able to better exploit a heterogeneous system. Compared with the state-of-the-art SpMM methods in cuSPARSE and CUSP libraries, our approach achieves an average of 1.6x and 2.9x speedup separately on the nine representative matrices from University of Florida sparse matrix collection.

APA, Harvard, Vancouver, ISO, and other styles

28

Batjargal, Delgerdalai, and 白德格. "Parallel Matrix Transposition and Vector Multiplication Using OpenMP." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/50550637149183575586.

Full text

Abstract:

碩士
靜宜大學
資訊工程學系
101
In this thesis, we propose two parallel algorithms for sparse matrix-transpose and vector multiplication using CSR (Compressed Sparse Row) format. Even though this storage format is simple and hence easy to understand and maintained, one of its limitation is difficult to parallelized, and a performance of a naïve parallel algorithm can be worst. But by preprocessing useful information that is hidden and indirect in its data structure during reading a matrix from a file, our algorithm of the matrix transposition can then be performed in parallel using OpenMP. Our codes are run on a quad-core Intel Xeon64 CPU E5507 platform. We measure, and compare the performance of our algorithms with that of using Compressed Sparse Block (CSB) format. Our experimental results show that our algorithms are comparable to the CSB based algorithm when the nonzero are scatter around the matrix and size of matrix is growing.

APA, Harvard, Vancouver, ISO, and other styles

29

Lin, Yi-sheng, and 林于勝. "The Design of a NoC-Based Sparse Matrix Multiplication System." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/17795356751665658959.

Full text

Abstract:

碩士
國立臺灣科技大學
電子工程系
100
With the advance of deep-submicron technologies, a huge number of IP blocks can be integrated into a single chip. However, the bus-based architecture becomes a performance bottleneck for the multicore SoC. In this thesis, a NoC-based architecture for sparse matrix multiplication is proposed to improve the performance of parallel computation. The architecture consists of routers, network interfaces, and processing elements. Our router is efficient by using the pipeline technique and wormhole switching with the XY routing algorithm. We also present a method of mapping and partitioning for large matrices to increase the load balancing and efficiency of packet distribution. In addition, the mesh-based network is fully parameterized so that it is flexible. Various sizes of the NoC-based architecture have been implemented, including 2×2, 2×4, 4×4, 4×8 ,and 8×8, on Xilinx Virtex 5 and TSMC 0.18 um cell library. In the FPGA implementation, the performance of our design is evaluated by a number of random and real-application matrices. Besides, the effects of network sizes, matrix sizes, and sparsity on the system performance are considered. Compared with MicroBlaze and Intel processors, our design achieves up to 40x and 2x speedup respectively. In the ASIC implementation, the core area of the 4×4 NoC architecture is 1,986.5 um×1,985.4 um, which is equivalent to 259,026 gates. The average power consumption is 417 mW at the operating frequency of 166 MHz.

APA, Harvard, Vancouver, ISO, and other styles

30

deLorimier, Michael John. "Floating-Point Sparse Matrix-Vector Multiply for FPGAs." Thesis, 2005. https://thesis.library.caltech.edu/1776/1/smvm_thesis.pdf.

Full text

Abstract:

Large, high density FPGAs with high local distributed memory bandwidth surpass the peak floating-point performance of high-end, general-purpose processors. Microprocessors do not deliver near their peak floating-point performance on efficient algorithms that use the Sparse Matrix-Vector Multiply (SMVM) kernel. In fact, microprocessors rarely achieve 33% of their peak floating-point performance when computing SMVM. We develop and analyze a scalable SMVM implementation on modern FPGAs and show that it can sustain high throughput, near peak, floating-point performance. Our implementation consists of logic design as well as scheduling and data placement techniques. For benchmark matrices from the Matrix Market Suite we project 1.5 double precision Gflops/FPGA for a single VirtexII-6000-4 and 12 double precision Gflops for 16 Virtex IIs (750 Mflops/FPGA). We also analyze the asymptotic efficiency of our architecture as parallelism scales using a constant rent-parameter matrix model. This demonstrates that our data placement techniques provide an asymptotic scaling benefit.

While FPGA performance is attractive, higher performance is possible if we re-balance the hardware resources in FPGAs with embedded memories. We show that sacrificing half the logic area for memory area rarely degrades performance and improves performance for large matrices, by up to 5 times. We also 0 the performance effect of adding custom floating-point using a simple area model to preserve total chip area. Sacrificing logic for memory and custom floating-point units increases single FPGA performance to 5 double precision Gflops.

APA, Harvard, Vancouver, ISO, and other styles

31

Piccolo, Alessandro, and Johan Soodla. "Performance of parallel sparse matrix-matrixmultiplication." Thesis, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-255134.

Full text

APA, Harvard, Vancouver, ISO, and other styles

32

LIN, ANG-HSUAN, and 林昂萱. "Implementing OpenCL Sparse Matrix Multiplication and Transposition Using DIA Format on Intel HD Graphics 5000 mobile device." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/86833354306613661461.

Full text

Abstract:

碩士
靜宜大學
資訊應用與科技管理碩士在職專班
105
Sparse matrix-matrix multiplication (SPMM) is a basic operation for mathematics, linear algebra and statistics. For many years, hundreds of researchers have tried to enhance the performance of the operation using multi-threads, grids and GPU (graphics processing unit) architecture. They have gotten huge achievements and leaded us to the new age of high performance computing. In this thesis, we implement a parallel sparse matrix-matrix multiplication and transposition in DIA format using OpenCL that run on Intel HD Graphics 5000 mobile device, which is a graphic chip built in Intel-i5-4260U CPU. Our experimental result shows that the speed up can be achieved between 2 and 12 times even though there is fewer units of work. Hence, it is also beneficial to use a mobile device for scientific computing.

APA, Harvard, Vancouver, ISO, and other styles

33

"Exploring the potential for accelerating sparse matrix-vector product on a Processing-in-Memory architecture." Thesis, 2009. http://hdl.handle.net/1911/61946.

Full text

Abstract:

As the importance of memory access delays on performance has mushroomed over the past few decades, researchers have begun exploring Processing-in-Memory (PIM) technology, which offers higher memory bandwidth, lower memory latency, and lower power consumption. In this study, we investigate whether an emerging PIM design from Sandia National Laboratories can boost performance for sparse matrix-vector product (SMVP). While SMVP is in the best-case bandwidth-bound, factors related to matrix structure and representation also limit performance. We analyze SMVP both in the context of an AMD Opteron processor and the Sandia PIM, exploring the performance limiters for each and the degree to which these can be ameliorated by data and code transformations. Over a range of sparse matrices, SMVP on the PIM outperformed the Opteron by a factor of 1.82. On the PIM, computational kernel and data structure transformations improved performance by almost 40% over conventional implementations using compressed-sparse row format.

APA, Harvard, Vancouver, ISO, and other styles

34

Zein, Ahmed H. El. "Use of graphics processing units for sparse matrix-vector products in statistical machine learning applications." Master's thesis, 2009. http://hdl.handle.net/1885/148368.

Full text

APA, Harvard, Vancouver, ISO, and other styles

35

Παπαδήμα, Ελισσάβετ. "Πειραματική αξιολόγηση μεθοδολογίας βελτιστοποίησης του αλγόριθμου πολλαπλασιασμού πίνακα επί διάνυσμα σε μονοπύρηνες και πολυπύρηνες αρχιτεκτονικές." Thesis, 2013. http://hdl.handle.net/10889/7283.

Full text

Abstract:

Στην παρούσα διπλωματική εργασία έγινε υλοποίηση και πειραματική αξιολόγηση μιας μεθοδολογίας η οποία έχει αναπτυχθεί στο Εργαστήριο Ολοκληρωμένων Κυκλωμάτων και αφορά τη βελτιστοποίηση του Πολλαπλασιασμού Πίνακα επί Διάνυσμα (ΠΠΔ) σε μονοπύρηνους και πολυπύρηνους επεξεργαστές. Η μεθοδολογία εκμεταλλεύεται το σύνολο των χαρακτηριστικών της αρχιτεκτονικής που χρησιμοποιείται και συγκεκριμένα (α) την ιεραρχία της μνήμης, (β) το μέγεθος της κρυφής μνήμης, (γ) το βαθμό συσχέτισης κάθε επιπέδου της κρυφής μνήμης, (δ) την καθυστέρηση της μνήμης και (ε) το πλήθος των πυρήνων. Είναι η πρώτη φορά που λαμβάνεται υπόψη ο βαθμός συσχέτισης της μνήμης. Σκοπός της μεθοδολογίας είναι η βελτιστοποίηση με βάση όλες τις παραμέτρους μαζί και όχι καθεμία ξεχωριστά. Για να βελτιωθεί η απόδοση προτείνεται διαφορετικός χρονοπρογραμματισμός ανάλογα με το μέγεθος του πίνακα. Για την πειραματική αξιολόγηση χρησιμοποιήθηκαν οι επεξεργαστές γενικού σκοπού Intel Core 2 Duo E6065 και Τ6600, ο Intel i7-3930K και ο ενσωματωμένος επεξεργαστής ειδικού σκοπού Microblaze από το Virtex-5 FPGA (Xilinx). Τα αποτελέσματα συγκρίνονται με την state-of-the-art βιβλιοθήκη ATLAS (Automatically Tuned Linear Algebra Software) και παρουσιάζουν βελτίωση 30%. Από τα πειραματικά αποτελέσματα είναι φανερό ότι η κύρια μνήμη είναι το bottleneck του προβλήματος. Επίσης, η απόδοση βελτιώνεται όταν αλλάζει το layout του πίνακα τόσο σε μονοπύρηνες όσο σε πολυπύρηνες αρχιτεκτονικές. Όσον αφορά τη μέθοδο του tiling τα πειραματικά αποτελέσματα δείχνουν ότι η μείωση των αστοχιών δεν βελτιώνει πάντα την απόδοση γιατί υπάρχει trade-off ανάμεσα στο μέγεθος του tile και στις εντολές διευθυνσιοδότησης. Επίσης, είναι φανερό ότι στις πολυπύρηνες αρχιτεκτονικές δεν υπάρχει γραμμική σχέση της απόδοσης και του πλήθους των πυρήνων που χρησιμοποιούνται. Αυτό οφείλεται στο περιορισμένο εύρος ζώνης της μνήμης.
The subject of this MSc Thesis is the implementation and the experimental evaluation of a methodology that has been developed at the Laboratory of Integrated Circuits and optimizes the Matrix Vector Multiplication (MVM) in single-core and multi-core processors. The methodology fully exploits the characteristics of the architecture. Specifically, it exploits (a) the hierarchy of the memory, (b) the cache size, (c) the cache associativity, (d) the memory latency and (e) the number of the cores. It is the first time that the cache associativity is taken into account. The methodology optimizes all the parameters together as one problem and not separately. A different scheduling is proposed according to the size of the matrix. The general purpose processors Intel Core 2 Duo E6065, Intel Core 2 Duo T6600 and Intel i7-3930K and the embedded processor Virtex-5 Microblaze have been used. The results have been compared with the state-of-the-art library ATLAS (Automatically Tuned Linear Algebra Software) and the performance is improved by 30%. According to the experimental results, it is obvious that the bottleneck is the memory latency. Moreover, the performance is increased when a new way of saving the matrix in the main memory (data array layout) is used in both single-core and multi-core architectures. As far as the tiling is concerned, the experimental results indicate that the decrease of the misses does not always improve the performance because there is a trade-off between the tile size and the addressing instructions. According to the experimental results, as far as multicore architectures are concerned, there is no linear relation between the performance and the number of the cores, because of the limited memory bandwidth.

APA, Harvard, Vancouver, ISO, and other styles

36

Heinemeyer, Eric. "Integral Equation Methods for Rough Surface Scattering Problems in three Dimensions." Doctoral thesis, 2008. http://hdl.handle.net/11858/00-1735-0000-000D-F15F-2.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Sparse Matrix Vector Multiplication'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles