
Journal articles on the topic 'Sparse Accelerator'



Consult the top 50 journal articles for your research on the topic 'Sparse Accelerator.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Xie, Xiaoru, Mingyu Zhu, Siyuan Lu, and Zhongfeng Wang. "Efficient Layer-Wise N:M Sparse CNN Accelerator with Flexible SPEC: Sparse Processing Element Clusters." Micromachines 14, no. 3 (February 24, 2023): 528. http://dx.doi.org/10.3390/mi14030528.

Abstract:
Recently, the layer-wise N:M fine-grained sparse neural network algorithm (i.e., every group of M weights contains N non-zero values) has attracted tremendous attention, as it can effectively reduce computational complexity with negligible accuracy loss. However, the speed-up potential of this algorithm will not be fully exploited if the right hardware support is lacking. In this work, we design an efficient accelerator for N:M sparse convolutional neural networks (CNNs) with layer-wise sparse patterns. First, we analyze the performance of different processing element (PE) structures and extensions to construct a flexible PE architecture. Second, variable sparse convolutional dimensions and sparsity ratios are incorporated into the hardware design. With a sparse PE cluster (SPEC) design, the hardware can efficiently accelerate CNNs with the layer-wise N:M pattern. Finally, we integrate the proposed SPEC into a CNN accelerator with a flexible network-on-chip and a specially designed dataflow. We implement hardware accelerators on the Xilinx ZCU102 and VCU118 FPGAs and evaluate them with classical CNNs such as AlexNet, VGG-16, and ResNet-50. Compared with existing accelerators designed for structured and unstructured pruned networks, our design achieves the best performance in terms of power efficiency.
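For readers unfamiliar with the N:M pattern, the following minimal Python/NumPy sketch shows layer-wise N:M magnitude pruning as described in the abstract (keeping the N largest-magnitude weights in every group of M consecutive weights). The per-layer (N, M) choices and tensor shapes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int, m: int) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights (along the flattened last dimension); zero out the rest."""
    w = weights.reshape(-1, m)                       # one group of m weights per row
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n] # the (m - n) smallest per group
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (w * mask).reshape(weights.shape)

# Hypothetical layer-wise pattern assignment (illustrative only).
layer_patterns = {"conv1": (2, 4), "conv2": (1, 4)}
rng = np.random.default_rng(0)
for name, (n, m) in layer_patterns.items():
    w = rng.standard_normal((64, 64))                # toy weight matrix
    sparse_w = prune_n_m(w, n, m)
    density = np.count_nonzero(sparse_w) / sparse_w.size
    print(f"{name}: {n}:{m} pattern -> density {density:.2f}")
```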
2

Li, Yihang. "Sparse-Aware Deep Learning Accelerator." Highlights in Science, Engineering and Technology 39 (April 1, 2023): 305–10. http://dx.doi.org/10.54097/hset.v39i.6544.

Abstract:
Hardware implementation of convolutional neural network (CNN) computing is difficult, and most previous CNN accelerator designs focused on the bottlenecks of computational performance and bandwidth while ignoring the importance of network sparsity for accelerator design. In recent years, a few CNN accelerators have been able to take advantage of sparsity, but they usually struggle to balance computational flexibility, parallel efficiency, and resource overhead. Meanwhile, the application of CNNs on embedded devices is limited by real-time requirements, and there is a large degree of sparsity in CNN convolution calculations. This paper summarizes sparsification methods at the algorithm level and at the FPGA level, introduces the different sparsification methods and research on their application at different layers, and analyzes and summarizes the advantages and development trends of sparsification.
3

Xu, Jia, Han Pu, and Dong Wang. "Sparse Convolution FPGA Accelerator Based on Multi-Bank Hash Selection." Micromachines 16, no. 1 (December 27, 2024): 22. https://doi.org/10.3390/mi16010022.

Abstract:
Reconfigurable processor-based acceleration of deep convolutional neural network (DCNN) algorithms has emerged as a widely adopted technique, with particular attention on sparse neural network acceleration as an active research area. However, many computing devices that claim high computational power still struggle to execute neural network algorithms with optimal efficiency, low latency, and minimal power consumption. Consequently, there remains significant potential for further exploration into improving the efficiency, latency, and power consumption of neural network accelerators across diverse computational scenarios. This paper investigates three key techniques for hardware acceleration of sparse neural networks. The main contributions are as follows: (1) Most neural network inference tasks are typically executed on general-purpose computing devices, which often fail to deliver high energy efficiency and are not well-suited for accelerating sparse convolutional models. In this work, we propose a specialized computational circuit for the convolutional operations of sparse neural networks. This circuit is designed to detect and eliminate the computational effort associated with zero values in the sparse convolutional kernels, thereby enhancing energy efficiency. (2) The data access patterns in convolutional neural networks introduce significant pressure on the high-latency off-chip memory access process. Due to issues such as data discontinuity, the data reading unit often fails to fully exploit the available bandwidth during off-chip read and write operations. In this paper, we analyze bandwidth utilization in the context of convolutional accelerator data handling and propose a strategy to improve off-chip access efficiency. Specifically, we leverage a compiler optimization plugin developed for Vitis HLS, which automatically identifies and optimizes on-chip bandwidth utilization. (3) In coefficient-based accelerators, the synchronous operation of individual computational units can significantly hinder efficiency. Previous approaches have achieved asynchronous convolution by designing separate memory units for each computational unit; however, this method consumes a substantial amount of on-chip memory resources. To address this issue, we propose a shared feature map cache design for asynchronous convolution in the accelerators presented in this paper. This design resolves address access conflicts when multiple computational units concurrently access a set of caches by utilizing a hash-based address indexing algorithm. Moreover, the shared cache architecture reduces data redundancy and conserves on-chip resources. Using the optimized accelerator, we successfully executed ResNet50 inference on an Intel Arria 10 1150GX FPGA, achieving a throughput of 497 GOPS, or an equivalent computational power of 1579 GOPS, with a power consumption of only 22 watts.
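The hash-based address indexing idea can be illustrated with a small Python sketch: strided addresses issued by parallel compute units all collide on one bank under plain modulo indexing, while a hashed index spreads them out. The XOR-folding hash and the access pattern below are illustrative assumptions, not the indexing function used in the paper.

```python
import numpy as np

def bank_modulo(addr: np.ndarray, banks: int) -> np.ndarray:
    """Naive bank selection from the low address bits (prone to stride conflicts)."""
    return addr % banks

def bank_hashed(addr: np.ndarray, banks: int) -> np.ndarray:
    """Hash-based selection: XOR-fold higher address bits into the bank index so
    strided accesses from parallel units spread across the banks."""
    return (addr ^ (addr >> 3) ^ (addr >> 6)) % banks

def worst_case_conflicts(bank_ids: np.ndarray) -> int:
    """Largest number of simultaneous requests that land in a single bank."""
    return int(np.bincount(bank_ids).max())

# Eight compute units reading one feature-map row with a power-of-two stride:
# the classic worst case for plain modulo indexing.
addrs = np.arange(8) * 8
print("modulo :", bank_modulo(addrs, 8), "-> conflicts:",
      worst_case_conflicts(bank_modulo(addrs, 8)))
print("hashed :", bank_hashed(addrs, 8), "-> conflicts:",
      worst_case_conflicts(bank_hashed(addrs, 8)))
```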
4

Zheng, Yong, Haigang Yang, Yiping Jia, and Zhihong Huang. "PermLSTM: A High Energy-Efficiency LSTM Accelerator Architecture." Electronics 10, no. 8 (April 8, 2021): 882. http://dx.doi.org/10.3390/electronics10080882.

Abstract:
Pruning and quantization are two commonly used approaches to accelerate the LSTM (Long Short-Term Memory) model. However, traditional linear quantization usually suffers from the problem of gradient vanishing, and existing pruning methods all have the problem of producing undesired irregular sparsity or large indexing overhead. To alleviate the vanishing-gradient problem, this work proposes a normalized linear quantization approach, which first normalizes operands regionally and then quantizes them within a local min-max range. To overcome the problem of irregular sparsity and large indexing overhead, this work adopts permuted block diagonal mask matrices to generate the sparse model. Because the sparse model is highly regular, the position of the non-zero weights can be obtained by a simple calculation, thus avoiding the large indexing overhead. Based on the sparse LSTM model generated from the permuted block diagonal mask matrices, this paper also proposes a high-energy-efficiency accelerator, PermLSTM, that comprehensively exploits the sparsity of weights, activations, and products in the matrix–vector multiplications, resulting in a 55.1% reduction in power consumption. The accelerator has been realized on Arria-10 FPGAs running at 150 MHz and achieves 2.19×–24.4× higher energy efficiency than previously reported FPGA-based LSTM accelerators.
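As a rough illustration of the mask idea, the Python sketch below builds a permuted block-diagonal mask in which the non-zero column block of each row follows from a simple index calculation (row index, block size, and a fixed permutation), so no per-weight index list is needed. The block size and permutation are illustrative assumptions rather than the paper's parameters.

```python
import numpy as np

def perm_block_diag_mask(size: int, block: int, perm: np.ndarray) -> np.ndarray:
    """Block-diagonal mask whose column blocks are shuffled by `perm`.
    Row i may only hold non-zeros in column block perm[i // block], so non-zero
    positions are computed on the fly instead of being stored as indices."""
    assert size % block == 0 and len(perm) == size // block
    mask = np.zeros((size, size), dtype=bool)
    for i in range(size):
        cb = perm[i // block]                   # the one column block allowed here
        mask[i, cb * block:(cb + 1) * block] = True
    return mask

# Toy 8x8 weight matrix with 2x2 blocks and an illustrative permutation.
perm = np.array([2, 0, 3, 1])
mask = perm_block_diag_mask(8, 2, perm)
rng = np.random.default_rng(1)
sparse_w = rng.standard_normal((8, 8)) * mask   # regular 25%-density sparse weights
print(mask.astype(int))
print("density:", mask.mean())
```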
5

Yavits, Leonid, and Ran Ginosar. "Accelerator for Sparse Machine Learning." IEEE Computer Architecture Letters 17, no. 1 (January 1, 2018): 21–24. http://dx.doi.org/10.1109/lca.2017.2714667.

6

Teodorovic, Predrag, and Rastislav Struharik. "Hardware Acceleration of Sparse Oblique Decision Trees for Edge Computing." Elektronika ir Elektrotechnika 25, no. 5 (October 6, 2019): 18–24. http://dx.doi.org/10.5755/j01.eie.25.5.24351.

Abstract:
This paper presents a hardware accelerator for sparse decision trees intended for FPGA applications. To the best of the authors' knowledge, this is the first accelerator of this type. Besides the hardware accelerator itself, a novel algorithm for the induction of sparse decision trees is also presented. Sparse decision trees can be attractive because they require fewer memory resources and can be processed more efficiently using specialized hardware compared to traditional oblique decision trees. This can be of significant interest, particularly in edge-based applications, where memory and compute resources as well as power consumption are severely constrained. The performance of the proposed sparse decision tree induction algorithm, as well as of the developed hardware accelerator, is studied using standard benchmark datasets obtained from the UCI Machine Learning Repository. The results of the experimental study indicate that the proposed algorithm and hardware accelerator compare very favourably with some of the existing solutions.
7

Vranjkovic, Vuk, Predrag Teodorovic, and Rastislav Struharik. "Universal Reconfigurable Hardware Accelerator for Sparse Machine Learning Predictive Models." Electronics 11, no. 8 (April 8, 2022): 1178. http://dx.doi.org/10.3390/electronics11081178.

Abstract:
This study presents a universal reconfigurable hardware accelerator for efficient processing of sparse decision trees, artificial neural networks, and support vector machines. The main idea is to develop a hardware accelerator that can directly process sparse machine learning models, resulting in shorter inference times and lower power consumption compared to existing solutions. To the authors' best knowledge, this is the first hardware accelerator of this type. Additionally, it is the first accelerator capable of processing sparse machine learning models of different types. Besides the hardware accelerator itself, algorithms for the induction of sparse decision trees and for the pruning of support vector machines and artificial neural networks are presented. Such sparse machine learning classifiers are attractive since they require significantly less memory for storing model parameters. This results in reduced data movement between the accelerator and DRAM memory, as well as a reduced number of operations required to process input instances, leading to faster and more energy-efficient processing. This can be of significant interest in edge-based applications, where memory, computation resources, and power consumption are severely constrained. The performance of the algorithms and of the developed hardware accelerator is demonstrated using standard benchmark datasets from the UCI Machine Learning Repository. The results of the experimental study reveal that the proposed algorithms and the presented hardware accelerator are superior to some of the existing solutions. Throughput is increased by up to 2 times for decision trees, 2.3 times for support vector machines, and 38 times for artificial neural networks. When processing latency is considered, the maximum performance improvement is even higher: up to a 4.4-times reduction for decision trees, an 84.1-times reduction for support vector machines, and a 22.2-times reduction for artificial neural networks. Finally, since it supports sparse classifiers, the proposed hardware accelerator leads to a significant reduction in the energy spent on DRAM data transfers: 50.16% for decision trees, 93.65% for support vector machines, and as much as 93.75% for artificial neural networks.
8

Gowda, Kavitha Malali Vishveshwarappa, Sowmya Madhavan, Stefano Rinaldi, Parameshachari Bidare Divakarachari, and Anitha Atmakur. "FPGA-Based Reconfigurable Convolutional Neural Network Accelerator Using Sparse and Convolutional Optimization." Electronics 11, no. 10 (May 22, 2022): 1653. http://dx.doi.org/10.3390/electronics11101653.

Abstract:
Nowadays, the dataflow architecture is considered a general solution for the acceleration of deep neural networks (DNNs) because of its higher parallelism. However, conventional DNN accelerators offer only restricted flexibility for diverse network models. To overcome this, a reconfigurable convolutional neural network (RCNN) accelerator needs to be developed on the field-programmable gate array (FPGA) platform. In this paper, sparse optimization of weights (SOW) and convolutional optimization (CO) are proposed to improve the performance of the RCNN accelerator. The combination of SOW and CO is used to optimize the feature map and weight sizes of the RCNN accelerator; therefore, the hardware resources consumed by the RCNN on the FPGA are minimized. The performance of RCNN-SOW-CO is analyzed in terms of feature map size, weight size, sparseness of the input feature map (IFM), weight parameter proportion, block random access memory (BRAM), digital signal processing (DSP) elements, look-up tables (LUTs), slices, delay, power, and accuracy. The existing architectures OIDSCNN, LP-CNN, and DPR-NN are used to demonstrate the efficiency of RCNN-SOW-CO. The LUT count of RCNN-SOW-CO with AlexNet on the Zynq-7020 is 5150, which is lower than that of OIDSCNN and DPR-NN.
9

Dey, Sumon, Lee Baker, Joshua Schabel, Weifu Li, and Paul D. Franzon. "A Scalable Cluster-based Hierarchical Hardware Accelerator for a Cortically Inspired Algorithm." ACM Journal on Emerging Technologies in Computing Systems 17, no. 4 (June 30, 2021): 1–29. http://dx.doi.org/10.1145/3447777.

Abstract:
This article describes a scalable, configurable and cluster-based hierarchical hardware accelerator through custom hardware architecture for Sparsey, a cortical learning algorithm. Sparsey is inspired by the operation of the human cortex and uses a Sparse Distributed Representation to enable unsupervised learning and inference in the same algorithm. A distributed on-chip memory organization is designed and implemented in custom hardware to improve memory bandwidth and accelerate the memory read/write operations for synaptic weight matrices. Bit-level data are processed from distributed on-chip memory and custom multiply-accumulate hardware is implemented for binary and fixed-point multiply-accumulation operations. The fixed-point arithmetic and fixed-point storage are also adapted in this implementation. At 16 nm, the custom hardware of Sparsey achieved an overall 24.39× speedup, 353.12× energy efficiency per frame, and 1.43× reduction in silicon area against a state-of-the-art GPU.
10

Liu, Sheng, Yasong Cao, and Shuwei Sun. "Mapping and Optimization Method of SpMV on Multi-DSP Accelerator." Electronics 11, no. 22 (November 11, 2022): 3699. http://dx.doi.org/10.3390/electronics11223699.

Abstract:
Sparse matrix-vector multiplication (SpMV) solves the product of a sparse matrix and dense vector, and the sparseness of a sparse matrix is often more than 90%. Usually, the sparse matrix is compressed to save storage resources, but this causes irregular access to dense vectors in the algorithm, which takes a lot of time and degrades the SpMV performance of the system. In this study, we design a dedicated channel in the DMA to implement an indirect memory access process to speed up the SpMV operation. On this basis, we propose six SpMV algorithm schemes and map them to optimize the performance of SpMV. The results show that the M processor’s SpMV performance reached 6.88 GFLOPS. Besides, the average performance of the HPCG benchmark is 2.8 GFLOPS.
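The irregular, indirect access that the dedicated DMA channel targets is the gather from the dense vector in compressed-format SpMV. A minimal Python/NumPy sketch of SpMV over a CSR-compressed matrix (illustrative toy data, not the paper's kernels) is:

```python
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in Compressed Sparse Row (CSR) form.
    The x[col_idx[...]] gather is the irregular indirect memory access."""
    y = np.zeros(len(row_ptr) - 1, dtype=values.dtype)
    for row in range(len(y)):
        start, end = row_ptr[row], row_ptr[row + 1]
        y[row] = np.dot(values[start:end], x[col_idx[start:end]])  # gather from x
    return y

# Toy 3x4 sparse matrix:
#   [[5, 0, 0, 2],
#    [0, 0, 3, 0],
#    [1, 0, 0, 4]]
values  = np.array([5.0, 2.0, 3.0, 1.0, 4.0])
col_idx = np.array([0, 3, 2, 0, 3])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(csr_spmv(values, col_idx, row_ptr, x))   # -> [13.  9. 17.]
```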
11

Vranjkovic, Vuk, and Rastislav Struharik. "Hardware Acceleration of Sparse Support Vector Machines for Edge Computing." Elektronika ir Elektrotechnika 26, no. 3 (June 27, 2020): 42–53. http://dx.doi.org/10.5755/j01.eie.26.3.25796.

Abstract:
In this paper, a hardware accelerator for sparse support vector machines (SVMs) is proposed. We believe that the proposed accelerator is the first of its kind. The accelerator is designed for use in field-programmable gate array (FPGA) systems. Additionally, a novel algorithm for the pruning of SVM models is developed. The pruned SVM model has a smaller memory footprint and can be processed faster than dense SVM models. In systems with memory-throughput, compute, or power constraints, such as edge computing, this can be a big advantage. Experiments on several standard datasets are conducted, whose aim is to compare the efficiency of the proposed architecture and the developed algorithm with existing solutions. The results of the experiments reveal that the proposed hardware architecture and SVM pruning algorithm have superior characteristics in comparison to previous work in the field. A memory reduction of 3% to 85% is achieved, with a speed-up ranging from 1.17× to 7.92×.
12

Liu, Peng, and Yu Wang. "A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow." Micromachines 16, no. 1 (January 16, 2025): 101. https://doi.org/10.3390/mi16010101.

Abstract:
General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, to map the compressed GEMM of various sizes onto the accelerator, an adaptable mapping scheme is designed. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers with the compressed format; denser vectors can exceed the buffer capacity, while sparser vectors may lead to the underutilization of buffers. To address this complication, this work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments with an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.
13

Wang, Deguang, Junzhong Shen, Mei Wen, and Chunyuan Zhang. "Efficient Implementation of 2D and 3D Sparse Deconvolutional Neural Networks with a Uniform Architecture on FPGAs." Electronics 8, no. 7 (July 18, 2019): 803. http://dx.doi.org/10.3390/electronics8070803.

Abstract:
Three-dimensional (3D) deconvolution is widely used in many computer vision applications. However, most previous works have only focused on accelerating two-dimensional (2D) deconvolutional neural networks (DCNNs) on Field-Programmable Gate Arrays (FPGAs), while the acceleration of 3D DCNNs has not been studied in depth, as they have higher computational complexity and sparsity than 2D DCNNs. In this paper, we focus on the acceleration of both 2D and 3D sparse DCNNs on FPGAs by proposing efficient schemes for mapping 2D and 3D sparse DCNNs onto a uniform architecture. Firstly, a pruning method is used to prune unimportant network connections and increase the sparsity of the weights. After pruning, the number of DCNN parameters is reduced significantly without accuracy loss. Secondly, the remaining non-zero weights are encoded in coordinate (COO) format, reducing the memory demands of the parameters. Finally, to demonstrate the effectiveness of our work, we implement our accelerator design on the Xilinx VC709 evaluation platform for four real-life 2D and 3D DCNNs. After the first two steps, the storage required by the DCNNs is reduced by up to 3.9×. Results show that our accelerator outperforms our prior work by 2.5× to 3.6× in latency.
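A short Python sketch of the second step, encoding the surviving weights in coordinate (COO) format after magnitude pruning, is given below; the threshold and matrix size are illustrative assumptions.

```python
import numpy as np

def prune_and_encode_coo(w: np.ndarray, threshold: float):
    """Zero out small-magnitude weights, then store the survivors in COO format
    as parallel (row, col, value) arrays."""
    pruned = np.where(np.abs(w) >= threshold, w, 0.0)
    rows, cols = np.nonzero(pruned)
    return rows, cols, pruned[rows, cols]

def coo_to_dense(rows, cols, vals, shape):
    """Rebuild the dense matrix from its COO triplets (for verification)."""
    out = np.zeros(shape)
    out[rows, cols] = vals
    return out

rng = np.random.default_rng(2)
w = rng.standard_normal((6, 6))
rows, cols, vals = prune_and_encode_coo(w, threshold=1.0)   # illustrative threshold
print(f"kept {len(vals)} of {w.size} weights")
assert np.allclose(coo_to_dense(rows, cols, vals, w.shape),
                   np.where(np.abs(w) >= 1.0, w, 0.0))
```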
14

He, Pengzhou, Yazheng Tu, Tianyou Bao, Çetin Kaya Koç, and Jiafeng Xie. "HSPA: High-Throughput Sparse Polynomial Multiplication for Code-based Post-Quantum Cryptography." ACM Transactions on Embedded Computing Systems 24, no. 1 (December 10, 2024): 1–24. https://doi.org/10.1145/3703837.

Abstract:
Increasing attention has been paid to code-based post-quantum cryptography (PQC) schemes, e.g., HQC (Hamming Quasi-Cyclic) and BIKE (Bit Flipping Key Encapsulation), since they have been selected as fourth-round National Institute of Standards and Technology (NIST) PQC standardization candidates. Although sparse polynomial multiplication is one of the critical components of HQC and BIKE, high-performance hardware implementations of sparse polynomial multipliers are rarely reported in the literature (due to the high dimension and sparsity of the polynomials involved in the computation). Based on this consideration, in this article, we propose two novel High-throughput Sparse Polynomial multiplication Accelerators (HSPA) for the two code-based PQC schemes mentioned above. Specifically, we have designed the two accelerators based on two different implementation strategies targeting potential applications with different resource availability, i.e., one accelerator deploys a memory-based structure for computation while the other does not need memory usage. We have proposed three layers of coherent, interdependent efforts to obtain the proposed accelerators. First, we propose two implementation strategies to execute the targeted sparse polynomial multiplication, i.e., a new parallel segment-based accumulation (PSA) approach and a novel permutating-with-power (PWP)-based method. Then, the two hardware accelerators are presented with detailed structural descriptions. Finally, a field-programmable gate array (FPGA)-based implementation is presented to showcase the superior performance of the proposed accelerators. A proper comparison is also carried out to confirm the efficiency of the proposed designs. For instance, the proposed accelerator (using the memory-based structure) has 56.84% and 80.25% less area-delay product (ADP) than the existing memory-based design (an extended high-speed version) on the UltraScale+ device, respectively, for n = 17,669 and ω = 75 (HQC) and n = 12,323 and ω = 142 (BIKE). The proposed design strategy fits well with the two targeted code-based PQC schemes and can be extended further to construct high-performance hardware cryptoprocessors. We hope the results of this work will be useful for the ongoing NIST PQC standardization process.
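Conceptually, the sparse polynomial multiplication in HQC/BIKE multiplies a dense polynomial by a sparse one given only by its non-zero exponents, in GF(2)[x]/(x^n - 1); each non-zero term contributes one cyclic shift that is accumulated with XOR. The toy Python sketch below illustrates that arithmetic only; it is not the PSA or PWP hardware strategy, and the parameters are far smaller than the real n and ω quoted above.

```python
import numpy as np

def sparse_poly_mul_gf2(dense_bits: np.ndarray, nonzero_exps, n: int) -> np.ndarray:
    """Multiply a dense polynomial (coefficient bit-vector of length n) by a sparse
    polynomial given as its non-zero exponents, in GF(2)[x]/(x^n - 1): accumulate
    one cyclic shift per non-zero term and reduce coefficients mod 2 with XOR."""
    acc = np.zeros(n, dtype=np.uint8)
    for e in nonzero_exps:
        acc ^= np.roll(dense_bits, e)     # x^e * f(x) mod (x^n - 1) is a cyclic shift
    return acc

# Toy size; the schemes above use n in the tens of thousands and weight ω of 75-142.
n = 17
rng = np.random.default_rng(3)
dense = rng.integers(0, 2, n, dtype=np.uint8)
sparse_exps = [0, 5, 11]                  # a weight-3 sparse polynomial
print(sparse_poly_mul_gf2(dense, sparse_exps, n))
```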
15

XIAO, Hao, Kaikai ZHAO, and Guangzhu LIU. "Efficient Hardware Accelerator for Compressed Sparse Deep Neural Network." IEICE Transactions on Information and Systems E104.D, no. 5 (May 1, 2021): 772–75. http://dx.doi.org/10.1587/transinf.2020edl8153.

16

Li, Jiajun, Shuhao Jiang, Shijun Gong, Jingya Wu, Junchao Yan, Guihai Yan, and Xiaowei Li. "SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules." IEEE Transactions on Computers 68, no. 11 (November 1, 2019): 1663–77. http://dx.doi.org/10.1109/tc.2019.2924215.

17

Li, Fanrong, Gang Li, Zitao Mo, Xiangyu He, and Jian Cheng. "FSA: A Fine-Grained Systolic Accelerator for Sparse CNNs." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, no. 11 (November 2020): 3589–600. http://dx.doi.org/10.1109/tcad.2020.3012212.

18

Yang, Tao, Zhezhi He, Tengchuan Kou, Qingzheng Li, Qi Han, Haibao Yu, Fangxin Liu, Yun Liang, and Li Jiang. "BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization." ACM Transactions on Reconfigurable Technology and Systems 14, no. 4 (December 31, 2021): 1–28. http://dx.doi.org/10.1145/3467476.

Abstract:
Field-programmable Gate Array (FPGA) is a high-performance computing platform for Convolutional Neural Network (CNN) inference. The Winograd algorithm, weight pruning, and quantization are widely adopted to reduce the storage and arithmetic overhead of CNNs on FPGAs. Recent studies strive to prune the weights in the Winograd domain; however, this results in irregular sparse patterns, leading to low parallelism and reduced resource utilization. Besides, few works discuss a suitable quantization scheme for Winograd. In this article, we propose a regular sparse pruning pattern in the Winograd-based CNN, namely the Sub-row-balanced Sparsity (SRBS) pattern, to overcome the challenge of the irregular sparse pattern. Then, we develop a two-step hardware co-optimization approach to improve the model accuracy using the SRBS pattern. Based on the pruned model, we implement mixed-precision quantization to further reduce the computational complexity of bit operations. Finally, we design an FPGA accelerator that takes advantage of both the SRBS pattern, to eliminate low-parallelism computation and irregular memory accesses, and the mixed-precision quantization, to obtain a layer-wise bit width. Experimental results on VGG16/VGG-nagadomi with CIFAR-10 and ResNet-18/34/50 with ImageNet show up to 11.8×/8.67× and 8.17×/8.31×/10.6× speedup, and 12.74×/9.19× and 8.75×/8.81×/11.1× energy efficiency improvement, respectively, compared with the state-of-the-art dense Winograd accelerator [20], with negligible loss of model accuracy. We also show that our design has a 4.11× speedup compared with the state-of-the-art sparse Winograd accelerator [19] on VGG16.
19

Wu, Di, Xitian Fan, Wei Cao, and Lingli Wang. "SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29, no. 5 (May 2021): 936–49. http://dx.doi.org/10.1109/tvlsi.2021.3060041.

20

Liu, Qingliang, Jinmei Lai, and Jiabao Gao. "An Efficient Channel-Aware Sparse Binarized Neural Networks Inference Accelerator." IEEE Transactions on Circuits and Systems II: Express Briefs 69, no. 3 (March 2022): 1637–41. http://dx.doi.org/10.1109/tcsii.2021.3119369.

21

Sun, Yichun, Hengzhu Liu, and Tong Zhou. "Sparse Cholesky Factorization on FPGA Using Parameterized Model." Mathematical Problems in Engineering 2017 (2017): 1–11. http://dx.doi.org/10.1155/2017/3021591.

Abstract:
Cholesky factorization is a fundamental problem in most engineering and science computation applications. When dealing with a large sparse matrix, numerical decomposition consumes the most time. We present a vector architecture to parallelize the numerical decomposition of Cholesky factorization. We construct an integrated analytical parameterized performance model to accurately predict the execution times of typical matrices under varying parameters. Our proposed approach is general to accelerators and is limited neither to field-programmable gate arrays (FPGAs) nor to application-specific integrated circuits. We implement a simplified module in FPGAs to prove the accuracy of the model. The experiments show that, for most cases, the performance differences between the predicted and measured execution times are less than 10%. Based on the performance model, we optimize the parameters and obtain a balance of resources and performance after analyzing the performance of varied parameter settings. Comparing with state-of-the-art CPU and GPU implementations, we find that the performance with the optimal parameters is 2× that of the CPU. Our model offers several advantages, particularly in power consumption. It provides guidance for the design of future acceleration components.
22

Wang, Renping, Shun Li, Enhao Tang, Sen Lan, Yajing Liu, Jing Yang, Shizhen Huang, and Hailong Hu. "SH-GAT: Software-hardware co-design for accelerating graph attention networks on FPGA." Electronic Research Archive 32, no. 4 (2024): 2310–22. http://dx.doi.org/10.3934/era.2024105.

Abstract:
Graph convolution networks (GCN) have demonstrated success in learning graph structures; however, they are limited in inductive tasks. Graph attention networks (GAT) were proposed to address the limitations of GCN and have shown high performance in graph-based tasks. Despite this success, GAT faces challenges in hardware acceleration, including: 1) the GAT algorithm has difficulty adapting to hardware; 2) sparse matrix multiplication (SPMM) is challenging to implement efficiently; and 3) irregular memory accesses cause complex addressing and pipeline stalls. To this end, this paper proposes SH-GAT, an FPGA-based GAT accelerator that achieves more efficient GAT inference. The proposed approach employs several optimizations to enhance GAT performance. First, this work optimizes the GAT algorithm using split weights and softmax approximation to make it more hardware-friendly. Second, a load-balanced SPMM kernel is designed to fully leverage potential parallelism and improve data throughput. Lastly, data preprocessing is performed by prefetching the source node and its neighbor nodes, effectively addressing the pipeline stalls and complex addressing issues arising from irregular memory access. SH-GAT was evaluated on the Xilinx Alveo U280 FPGA accelerator card with three popular datasets. Compared to existing CPU, GPU, and state-of-the-art (SOTA) FPGA-based accelerators, SH-GAT achieves speedups of up to 3283×, 13×, and 2.3×, respectively.
23

Xie, Xiaoru, Jun Lin, Zhongfeng Wang, and Jinghe Wei. "An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks." IEEE Transactions on Circuits and Systems I: Regular Papers 68, no. 7 (July 2021): 2936–49. http://dx.doi.org/10.1109/tcsi.2021.3074300.

24

Lai, Bo-Cheng, Jyun-Wei Pan, and Chien-Yu Lin. "Enhancing Utilization of SIMD-Like Accelerator for Sparse Convolutional Neural Networks." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, no. 5 (May 2019): 1218–22. http://dx.doi.org/10.1109/tvlsi.2019.2897052.

25

Lu, Yuntao, Chao Wang, Lei Gong, and Xuehai Zhou. "SparseNN: A Performance-Efficient Accelerator for Large-Scale Sparse Neural Networks." International Journal of Parallel Programming 46, no. 4 (October 3, 2017): 648–59. http://dx.doi.org/10.1007/s10766-017-0528-8.

26

Melham, R. "A systolic accelerator for the iterative solution of sparse linear systems." IEEE Transactions on Computers 38, no. 11 (1989): 1591–95. http://dx.doi.org/10.1109/12.42132.

27

Li, Tao, and Li Shen. "A sparse matrix vector multiplication accelerator based on high-bandwidth memory." Computers and Electrical Engineering 105 (January 2023): 108488. http://dx.doi.org/10.1016/j.compeleceng.2022.108488.

28

Zhu, Chaoyang, Kejie Huang, Shuyuan Yang, Ziqi Zhu, Hejia Zhang, and Haibin Shen. "An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, no. 9 (September 2020): 1953–65. http://dx.doi.org/10.1109/tvlsi.2020.3002779.

29

Wang, Zixiao, Ke Xu, Shuaixiao Wu, Li Liu, Lingzhi Liu, and Dong Wang. "Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2." IEEE Access 8 (2020): 116569–85. http://dx.doi.org/10.1109/access.2020.3004198.

30

Humble, Ryan, William Colocho, Finn O’Shea, Daniel Ratner, and Eric Darve. "Resilient VAE: Unsupervised Anomaly Detection at the SLAC Linac Coherent Light Source." EPJ Web of Conferences 295 (2024): 09033. http://dx.doi.org/10.1051/epjconf/202429509033.

Abstract:
Significant advances in utilizing deep learning for anomaly detection have been made in recent years. However, these methods largely assume the existence of a normal training set (i.e., uncontaminated by anomalies) or even a completely labeled training set. In many complex engineering systems, such as particle accelerators, labels are sparse and expensive; in order to perform anomaly detection in these cases, we must drop these assumptions and utilize a completely unsupervised method. This paper introduces the Resilient Variational Autoencoder (ResVAE), a deep generative model specifically designed for anomaly detection. ResVAE exhibits resilience to anomalies present in the training data and provides feature-level anomaly attribution. During the training process, ResVAE learns the anomaly probability for each sample as well as each individual feature, utilizing these probabilities to effectively disregard anomalous examples in the training data. We apply our proposed method to detect anomalies in the accelerator status at the SLAC Linac Coherent Light Source (LCLS). By utilizing shot-to-shot data from the beam position monitoring system, we demonstrate the exceptional capability of ResVAE in identifying various types of anomalies that are visible in the accelerator.
31

Liang, Zhongwei, Xiaochu Liu, Guilin Wen, and Jinrui Xiao. "Effectiveness prediction of abrasive jetting stream of accelerator tank using normalized sparse autoencoder-adaptive neural fuzzy inference system." Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture 234, no. 13 (June 26, 2020): 1615–39. http://dx.doi.org/10.1177/0954405420927582.

Abstract:
The abrasive jetting stream generated from an accelerator tank is crucial to the precision machining of industrial products during strengthening jet grinding. In this article, its effectiveness is predicted using a normalized sparse autoencoder-adaptive neural fuzzy inference system in order to obtain an optimal jetting stream. The proposed system calculates the concentration density of abrasive impact stress with the normalized sparse autoencoder and identifies the effectiveness indexes of abrasive jetting with the adaptive neural fuzzy inference system, predicting the stream effectiveness index in grinding practice. The results indicate that when the turbulence root-mean-square velocity (VRMS) is 420 m/s, turbulence intensity (Ti) is 570, turbulence kinetic energy (Tc) is 540 kJ, turbulence entropy (Te) is 620 J/K, and Reynolds shear stress (Rs) is 430 kPa (error tolerance = ±5%, the same as follows), the optimized effectiveness quality of the abrasive jetting stream can be ensured. The effectiveness prediction involves the following steps: measuring the jet impact data on the interior boundary surface of the accelerator tank, calculating the concentration density of abrasive impact stress, establishing the descriptive analytical framework of the normalized sparse autoencoder-adaptive neural fuzzy inference system, adaptively predicting the abrasive jetting stream effectiveness through its computation, and verifying the actual effectiveness prediction in terms of efficiency quantification and quality assessment against alternative approaches such as genetic, simulated annealing–genetic algorithm, Taguchi, artificial neural network–simulated annealing, and genetically optimized neural network system methods. The objective of this research is to adaptively predict the abrasive jetting stream effectiveness using the newly proposed prediction system; a stable and reliable abrasive jetting stream can therefore be achieved using a jetting pressure (Pw) of 320 MPa, a mass of cast steel grits (Mc) of 270 g, a mass of bearing steel grits (Mb) of 310 g, a mass of brown-fused alumina grits (Ma) of 360 g, and a mass rate of abrasives (Fa) of 0.46 kg/min. It is concluded that the normalized sparse autoencoder-adaptive neural fuzzy inference system has outstanding predictive capability and performs much better on typical calibration indexes of accuracy and efficiency, while ensuring high agreement between the fuzzy-predicted and actually measured values of the effectiveness indexes. This novel method can be promoted to improve the quality uniformity of the abrasive jetting stream and, consequently, to facilitate the productive management of abrasive jet machining.
32

Shimoda, Masayuki, Youki Sada, and Hiroki Nakahara. "FPGA-Based Inter-layer Pipelined Accelerators for Filter-Wise Weight-Balanced Sparse Fully Convolutional Networks with Overlapped Tiling." Journal of Signal Processing Systems 93, no. 5 (February 13, 2021): 499–512. http://dx.doi.org/10.1007/s11265-021-01642-6.

Abstract:
Convolutional neural networks (CNNs) exhibit state-of-the-art performance on computer-vision tasks. CNNs require high-speed, low-power, and high-accuracy hardware for various scenarios, such as edge environments. However, the number of weights is so large that embedded systems cannot store them owing to their limited on-chip memory. An alternative is to minimize the input image size for real-time processing, but this causes a considerable drop in accuracy. Although pruned sparse CNNs and special accelerators have been proposed, the requirement of random access incurs a large number of wide multiplexers for a high degree of parallelism, which becomes complicated and unsuitable for FPGA implementation. To address this problem, we propose filter-wise pruning with distillation and a block RAM (BRAM)-based zero-weight-skipping accelerator. It eliminates weights such that each filter has the same number of non-zero weights, performing retraining with distillation while retaining comparable accuracy. Further, filter-wise pruning enables our accelerator to exploit inter-filter parallelism, where a processing block for a layer executes filters concurrently with a straightforward architecture. We also propose an overlapped tiling algorithm, where tiles are extracted with overlap to prevent both accuracy degradation and high utilization of the BRAMs storing high-resolution images. Our evaluation using semantic-segmentation tasks showed a 1.8-times speedup and an 18.0-times increase in power efficiency of our FPGA design compared with a desktop GPU. Additionally, compared with the conventional FPGA implementation, the speedup and accuracy improvement were 1.09 times and 6.6 points, respectively. Therefore, our approach is useful for FPGA implementation and exhibits considerable accuracy for applications in embedded systems.
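A minimal Python sketch of the filter-wise pruning constraint (every filter keeps the same number of non-zero weights, so concurrently executed filters stay load-balanced) is shown below; the layer shape and keep count are illustrative, and the distillation-based retraining step is omitted.

```python
import numpy as np

def filter_wise_prune(weights: np.ndarray, keep_per_filter: int) -> np.ndarray:
    """Prune a conv weight tensor (filters, channels, kh, kw) so that every filter
    keeps exactly `keep_per_filter` largest-magnitude weights."""
    f = weights.reshape(weights.shape[0], -1)             # one row per filter
    keep = np.argsort(np.abs(f), axis=1)[:, -keep_per_filter:]
    mask = np.zeros_like(f, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (f * mask).reshape(weights.shape)

rng = np.random.default_rng(4)
w = rng.standard_normal((16, 8, 3, 3))                    # toy conv layer
sparse_w = filter_wise_prune(w, keep_per_filter=18)       # 18/72 = 25% density
nonzeros_per_filter = np.count_nonzero(sparse_w.reshape(16, -1), axis=1)
print(nonzeros_per_filter)                                # every filter holds 18
```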
33

Wang, Miao, Xiaoya Fan, Wei Zhang, Ting Zhu, Tengteng Yao, Hui Ding, and Danghui Wang. "Balancing memory-accessing and computing over sparse DNN accelerator via efficient data packaging." Journal of Systems Architecture 117 (August 2021): 102094. http://dx.doi.org/10.1016/j.sysarc.2021.102094.

34

Zhao, Yunping, Jianzhuang Lu, and Xiaowen Chen. "A Dynamically Reconfigurable Accelerator Design Using a Sparse-Winograd Decomposition Algorithm for CNNs." Computers, Materials & Continua 66, no. 1 (2020): 517–35. http://dx.doi.org/10.32604/cmc.2020.012380.

35

Liu, Zhi-Gang, Paul N. Whatmough, and Matthew Mattina. "Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference." IEEE Computer Architecture Letters 19, no. 1 (January 1, 2020): 34–37. http://dx.doi.org/10.1109/lca.2020.2979965.

36

Pham, Duc-An, and Bo-Cheng Lai. "Dataflow and microarchitecture co-optimisation for sparse CNN on distributed processing element accelerator." IET Circuits, Devices & Systems 14, no. 8 (November 1, 2020): 1185–94. http://dx.doi.org/10.1049/iet-cds.2019.0225.

37

Zhang, Min, Linpeng Li, Hai Wang, Yan Liu, Hongbo Qin, and Wei Zhao. "Optimized Compression for Implementing Convolutional Neural Networks on FPGA." Electronics 8, no. 3 (March 6, 2019): 295. http://dx.doi.org/10.3390/electronics8030295.

Abstract:
Field-programmable gate arrays (FPGAs) are widely considered a promising platform for convolutional neural network (CNN) acceleration. However, the large number of parameters of CNNs causes heavy computing and memory burdens for FPGA-based CNN implementations. To solve this problem, this paper proposes an optimized compression strategy and realizes an FPGA-based accelerator for CNNs. Firstly, a reversed-pruning strategy is proposed, which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility. Moreover, quantization gives another 4× reduction with negligible loss of accuracy. Secondly, an efficient storage technique is presented, which aims to reduce the overall cache overhead of the convolutional layer and the fully connected layer, respectively. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, the overall performance of our accelerator achieves 9.73 fps for the compressed AlexNet. Compared with central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, with 822.0× and 15.8× improvements in energy efficiency, respectively. This novel compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).
38

Liu, Chester, Sung-Gun Cho, and Zhengya Zhang. "A 2.56-mm2 718GOPS Configurable Spiking Convolutional Sparse Coding Accelerator in 40-nm CMOS." IEEE Journal of Solid-State Circuits 53, no. 10 (October 2018): 2818–27. http://dx.doi.org/10.1109/jssc.2018.2865457.

39

Aimar, Alessandro, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, et al. "NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps." IEEE Transactions on Neural Networks and Learning Systems 30, no. 3 (March 2019): 644–56. http://dx.doi.org/10.1109/tnnls.2018.2852335.

40

Qian, Cheng, Bruce Childers, Libo Huang, Hui Guo, and Zhiying Wang. "CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube." Electronics 7, no. 11 (November 7, 2018): 307. http://dx.doi.org/10.3390/electronics7110307.

Abstract:
Graph traversal is widely used in map routing, social network analysis, causal discovery and many more applications. Because it is a memory-bound process, graph traversal puts significant pressure on the memory subsystem. Due to poor spatial locality and the increasing size of today’s datasets, graph traversal consumes an ever-larger part of application execution time. One way to mitigate this cost is memory prefetching, which issues requests from the processor to the memory in anticipation of needing certain data. However, traditional prefetching does not work well for graph traversal due to data dependencies, the parallel nature of graphs and the need to move vast amounts of data from memory to the caches. In this paper, we propose a compressed sparse row representation-based graph accelerator on the Hybrid Memory Cube (HMC), called CGAcc. CGAcc combines Compressed Sparse Row (CSR) graph representation with in-memory prefetching and processing to improve the performance of graph traversal. Our approach integrates the prefetching and processing in the logic layer of a 3D stacked Dynamic Random-Access Memory (DRAM) architecture, based on Micron’s HMC. We selected HMC to implement CGAcc because it can provide quite high bandwidth and low access latency. Furthermore, this device has multiple DRAM layers connected to internal logic to control memory access and perform rudimentary computation. Using the CSR representation, CGAcc deploys prefetchers in the HMC to exploit the short transaction latency between the logic and DRAM layers. By doing this, it can also avoid large data movement costs. In the runtime, CGAcc pipelines the prefetching to fetch data from DRAM arrays to improve memory-level parallelism. To further reduce the access latency, several optimized internal caches are also introduced to hold the prefetched data to be Processed In-Memory (PIM). A comprehensive evaluation shows the effectiveness of CGAcc. Experimental results showed that, compared to a conventional HMC main memory equipped with a stream prefetcher, CGAcc achieved an average 3.51× speedup with moderate hardware cost.
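To make the access pattern concrete, here is a small Python sketch of level-by-level BFS over a CSR adjacency structure; the dependent col_idx[row_ptr[v]:row_ptr[v+1]] loads are the kind of poorly cached accesses that CGAcc prefetches near memory. The toy graph is illustrative.

```python
import numpy as np
from collections import deque

def bfs_csr(row_ptr, col_idx, source):
    """Breadth-first search over a graph stored in Compressed Sparse Row form;
    returns the hop distance of every vertex from `source` (-1 if unreachable)."""
    n = len(row_ptr) - 1
    dist = np.full(n, -1)
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for u in col_idx[row_ptr[v]:row_ptr[v + 1]]:   # neighbour list of v
            if dist[u] < 0:
                dist[u] = dist[v] + 1
                frontier.append(u)
    return dist

# Toy undirected 5-vertex graph with edges 0-1, 0-2, 1-3, 2-3, 3-4 in CSR form.
row_ptr = np.array([0, 2, 4, 6, 9, 10])
col_idx = np.array([1, 2, 0, 3, 0, 3, 1, 2, 4, 3])
print(bfs_csr(row_ptr, col_idx, source=0))             # -> [0 1 1 2 3]
```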
41

Bian, Haoqiong, Tiannan Sha, and Anastasia Ailamaki. "Using Cloud Functions as Accelerator for Elastic Data Analytics." Proceedings of the ACM on Management of Data 1, no. 2 (June 13, 2023): 1–27. http://dx.doi.org/10.1145/3589306.

Abstract:
Cloud function (CF) services, such as AWS Lambda, have been applied as the new computing infrastructure for implementing analytical query engines. For bursty and sparse workloads, a CF-based query engine is more elastic than traditional query engines running in servers, i.e., virtual machines (VMs), and might provide a higher performance/price ratio. However, it is still controversial whether CF services are a good fit for general analytical workloads, given the limitations of CFs in storage, network, and lifetime, as well as their much higher resource unit prices compared to VMs. In this paper, we first present micro-benchmark evaluations of the features of CFs and VMs. We reveal that for query processing, though a CF is more elastic than a VM, it is less scalable and is more expensive for continuous workloads. Then, to get the best of both worlds, we propose Pixels-Turbo, a hybrid query engine that processes queries in a scalable VM cluster by default and invokes CFs to accelerate the processing of unpredictable workload spikes. In the query engine, we propose several optimizations to improve the performance and scalability of the CF-based operators, and a cost-based optimizer to select the appropriate algorithm and parallelism for the physical query plan. Evaluations on TPC-H and a real-world workload show that our query engine has a one-to-two orders of magnitude higher performance/price ratio than state-of-the-art serverless query engines for sustained workloads, while not compromising elasticity for workload spikes.
42

Chen, Xi, Chang Gao, Zuowen Wang, Longbiao Cheng, Sheng Zhou, Shih-Chii Liu, and Tobi Delbruck. "Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 10 (March 24, 2024): 11399–406. http://dx.doi.org/10.1609/aaai.v38i10.29020.

Abstract:
Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ∼80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.
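The underlying delta principle can be sketched in a few lines of Python: only inputs whose change since the last propagated value exceeds a threshold contribute to the matrix-vector product, and the rest are skipped. This sketch shows the forward-pass version of the idea with made-up sizes and threshold; the paper's contribution is exploiting the same sparsity pattern symmetrically in the BPTT backward pass.

```python
import numpy as np

def delta_matvec(W, x_new, x_prev_used, threshold):
    """Delta-network update: only columns of W whose input changed by more than
    `threshold` are processed; small changes are skipped and their stale value
    is kept, which is where the saved matrix operations come from."""
    delta = x_new - x_prev_used
    active = np.abs(delta) > threshold              # neurons that fire an update
    partial = W[:, active] @ delta[active]          # work proportional to sparsity
    x_used = x_prev_used.copy()
    x_used[active] = x_new[active]                  # remember what was propagated
    return partial, x_used, active.mean()

rng = np.random.default_rng(5)
W = rng.standard_normal((32, 32))
x_prev = rng.standard_normal(32)
x_new = x_prev + rng.normal(scale=0.05, size=32)    # mostly sub-threshold changes
x_new[:4] += 1.0                                    # a few large changes
partial, x_used, frac = delta_matvec(W, x_new, x_prev, threshold=0.2)
exact = W @ (x_new - x_prev)
print(f"active fraction {frac:.2f}, max deviation from dense update "
      f"{np.abs(partial - exact).max():.3f}")
```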
43

Weng, Yui-Kai, Shih-Hsu Huang, and Hsu-Yu Kao. "Block-Based Compression and Corresponding Hardware Circuits for Sparse Activations." Sensors 21, no. 22 (November 10, 2021): 7468. http://dx.doi.org/10.3390/s21227468.

Abstract:
In a CNN (convolutional neural network) accelerator, to reduce memory traffic and power consumption, there is a need to exploit the sparsity of activation values. Therefore, some research efforts have been paid to skip ineffectual computations (i.e., multiplications by zero). Different from previous works, in this paper, we point out the similarity of activation values: (1) in the same layer of a CNN model, most feature maps are either highly dense or highly sparse; (2) in the same layer of a CNN model, feature maps in different channels are often similar. Based on the two observations, we propose a block-based compression approach, which utilizes both the sparsity and the similarity of activation values to further reduce the data volume. Moreover, we also design an encoder, a decoder and an indexing module to support the proposed approach. The encoder is used to translate output activations into the proposed block-based compression format, while both the decoder and the indexing module are used to align nonzero values for effectual computations. Compared with previous works, benchmark data consistently show that the proposed approach can greatly reduce both memory traffic and power consumption.
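A toy Python sketch of one way to realize block-based compression of sparse activations (a per-block bitmap of non-zero positions plus the packed non-zero values) is shown below; the block size, bit widths, and the omission of the cross-channel similarity reuse are all simplifying assumptions, not the paper's exact format.

```python
import numpy as np

def compress_blocks(act: np.ndarray, block: int):
    """Split a flattened activation map into fixed-size blocks and store each as a
    (non-zero bitmap, non-zero values) pair; highly sparse blocks shrink a lot."""
    flat = act.ravel()
    flat = np.pad(flat, (0, (-len(flat)) % block))
    out = []
    for b in flat.reshape(-1, block):
        bitmap = b != 0
        out.append((np.packbits(bitmap), b[bitmap]))   # 1 bit per position + payload
    return out

rng = np.random.default_rng(6)
act = np.maximum(rng.standard_normal((8, 16)), 0)      # ReLU output, roughly half zeros
compressed = compress_blocks(act, block=16)
bits_raw = act.size * 32                               # 32-bit dense storage
bits_new = sum(bm.size * 8 + vals.size * 32 for bm, vals in compressed)
print(f"compression ratio: {bits_raw / bits_new:.2f}x")
```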
44

Xu, Shiyao, Jingfei Jiang, Jinwei Xu, and Xifu Qian. "Efficient SpMM Accelerator for Deep Learning: Sparkle and Its Automated Generator." ACM Transactions on Reconfigurable Technology and Systems, June 7, 2024. http://dx.doi.org/10.1145/3665896.

Abstract:
Deep learning (DL) technology has made breakthroughs in a wide range of intelligent tasks such as vision, language, recommendation systems, etc. Sparse matrix multiplication (SpMM) is the key computation kernel of most sparse models. Conventional computing platforms such as CPUs, GPUs, and AI chips with regular processing units are unable to effectively support sparse computation due to their fixed structure and instruction sets. This work extends Sparkle, an accelerator architecture, which is developed specifically for processing SpMM in DL. During the balanced data loading process, some modifications are implemented to enhance the flexibility of the Sparkle architecture. Additionally, a Sparkle generator is proposed to accommodate diverse resource constraints and facilitate adaptable deployment. Leveraging Sparkle's structural parameters and template-based design methods, the generator enables automatic Sparkle circuit generation under varying parameters. An instantiated Sparkle accelerator is implemented on the Xilinx xqvu11p FPGA platform with a specific configuration. Compared to the state-of-the-art SpMM accelerator SIGMA, the Sparkle accelerator instance improves the sparse computing efficiency by about 10 to 20%. Furthermore, the Sparkle instance achieved 7.76× higher performance over the Nvidia Orin NX GPU. More instances of accelerators with different parameters were evaluated, demonstrating that the Sparkle architecture can effectively accelerate SpMM.
45

Hwang, Soojin, Daehyeon Baek, Jongse Park, and Jaehyuk Huh. "Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication." ACM Transactions on Architecture and Code Optimization, March 17, 2024. http://dx.doi.org/10.1145/3653020.

Abstract:
The multiplication of sparse matrix and vector (SpMV) is one of the most widely used kernels in high-performance computing as well as machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. There have been two widely used algorithms and data representations. Two algorithms, scalar multiplication and dot product, can be combined with two sparse data representations, compressed sparse and bitmap formats for the matrix and vector. Although the prior accelerators adopted one of the possible designs, it is yet to be investigated which design is the best one across different hardware resources and workload characteristics. This paper first investigates the impact of design choices with respect to the algorithm and data representation. Our evaluation shows that no single design always outperforms the others across different workloads, but the two best designs (i.e. compressed sparse format and bitmap format with dot product) have complementary performance with trade-offs incurred by the matrix characteristics. Based on the analysis, this study proposes Cerberus, a triple-mode accelerator supporting two sparse operation modes in addition to the base dense mode. To allow such multi-mode operation, it proposes a prediction model based on matrix characteristics under a given hardware configuration, which statically selects the best mode for a given sparse matrix with its dimension and density information. Our experimental results show that Cerberus provides 12.1× performance improvements from a dense-only accelerator, and 1.5× improvements from a fixed best SpMV design.
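The two algorithmic choices named above can be written down directly: a dot-product (row-wise) formulation and a scalar-multiplication (column-wise) formulation of SpMV. The dense NumPy sketch below is only meant to show that the two traversal orders compute the same result; it ignores the compressed-sparse and bitmap storage formats the paper pairs them with.

```python
import numpy as np

def spmv_dot_product(A, x):
    """Row-wise (dot-product) formulation: y[i] = <non-zeros of A[i, :], x>."""
    return np.array([np.dot(row[row != 0], x[row != 0]) for row in A])

def spmv_scalar_multiplication(A, x):
    """Column-wise (scalar-multiplication) formulation: for each non-zero x[j],
    scale column j of A and accumulate the partial result vectors."""
    y = np.zeros(A.shape[0])
    for j in np.nonzero(x)[0]:
        y += A[:, j] * x[j]
    return y

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 8)) * (rng.random((6, 8)) < 0.3)   # ~70% zero matrix
x = rng.standard_normal(8) * (rng.random(8) < 0.5)             # sparse vector too
assert np.allclose(spmv_dot_product(A, x), spmv_scalar_multiplication(A, x))
assert np.allclose(spmv_dot_product(A, x), A @ x)
print("row-wise and column-wise formulations agree with the dense reference")
```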
46

Xie, Kunpeng, Ye Lu, Xinyu He, Dezhi Yi, Huijuan Dong, and Yao Chen. "Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs." ACM Transactions on Architecture and Code Optimization, January 31, 2024. http://dx.doi.org/10.1145/3643682.

Abstract:
Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and weight pruning. However, harnessing the potential of both methods simultaneously introduces complexity in designing pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited for small tiles, with domain transformation dictating the sparsity ratio. The irregularities in data access and domain transformation pose challenges in accelerator design, especially for larger Winograd tiles. This paper introduces "Winols," an innovative algorithm-hardware co-design strategy that emphasizes the strengths of the large-tiling Winograd algorithm. Through a spatial-to-Winograd relevance degree evaluation, we extensively explore domain transformation and propose a cross-domain pruning technique that retains sparsity across both spatial and Winograd domains. To compress pruned weight matrices, we invent a relative column encoding scheme. We further design an FPGA-based accelerator for CNN models with large Winograd tiles and sparse matrix-vector operations. Evaluations indicate our pruning method achieves up to 80% weight tile sparsity in the Winograd domain without compromising accuracy. Our Winols accelerator outperforms the dense accelerator by a factor of 31.7× in inference latency. When compared with prevailing sparse Winograd accelerators, Winols reduces latency by an average of 10.9×, and improves DSP and energy efficiencies by over 5.6× and 5.7×, respectively. When compared with the CPU and GPU platform, the Winols accelerator with tile size 8 × 8 achieves 24.6× and 2.84× energy efficiency improvements, respectively.
APA, Harvard, Vancouver, ISO, and other styles
47

Wang, Bo, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, and Tiejun Li. "SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow." ACM Transactions on Design Automation of Electronic Systems, November 27, 2023. http://dx.doi.org/10.1145/3634703.

Full text
Abstract:
Deep learning has become a highly popular research field, and deep learning algorithms previously ran primarily on CPUs and GPUs. With the rapid development of deep learning, however, it became clear that existing processors could not meet its large-scale computing requirements, and custom deep learning accelerators have become popular. The majority of the primary workloads in deep learning are general matrix-matrix multiplications (GEMM), and emerging GEMMs are highly sparse and irregular. The TPU and SIGMA are representative recent GEMM accelerators, but the TPU does not support sparsity, and both the TPU and SIGMA suffer from insufficient processing element (PE) utilization. We design and implement SparGD, a sparse GEMM accelerator with dynamic dataflow. SparGD has specialized PE structures, flexible distribution and reduction networks, and a simple dataflow switching module. When running sparse and irregular GEMMs, SparGD can maintain high PE utilization while exploiting sparsity, and can switch to the optimal dataflow according to the computing environment. For sparse, irregular GEMMs, our experimental results show that SparGD outperforms systolic arrays by 30 times and SIGMA by 3.6 times.
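As a rough illustration of the sparsity such accelerators exploit (this is not SparGD's PE array or its dataflow-switching logic), the sketch below computes a sparse-dense GEMM from a CSR view of A, touching only A's non-zero entries, each of which scales one row of B. The helper name `sparse_gemm_csr` is an assumption for illustration.

```python
# Sketch: C = A @ B where A is sparse; zeros in A are never multiplied.
import numpy as np

def sparse_gemm_csr(values, col_idx, row_ptr, B):
    """Row-wise sparse-dense GEMM: C[r, :] += a_rk * B[k, :] for each non-zero a_rk."""
    C = np.zeros((len(row_ptr) - 1, B.shape[1]))
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            C[r, :] += values[k] * B[col_idx[k], :]
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 32))
A[rng.random(A.shape) < 0.8] = 0.0        # ~80% sparse, irregular pattern
B = rng.standard_normal((32, 16))

nz_rows, nz_cols = np.nonzero(A)          # build a CSR view of A
values  = A[nz_rows, nz_cols]
row_ptr = np.concatenate(([0], np.cumsum(np.bincount(nz_rows, minlength=A.shape[0]))))

assert np.allclose(sparse_gemm_csr(values, nz_cols, row_ptr, B), A @ B)
```

The hardware question the paper addresses is how to keep every PE busy when the per-row non-zero counts (the gaps between `row_ptr` entries) are irregular, which is where the flexible distribution/reduction networks and dataflow switching come in.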
APA, Harvard, Vancouver, ISO, and other styles
48

Soltaniyeh, Mohammadreza, Richard P. Martin, and Santosh Nagarakatte. "An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-Matrix Multiplication." ACM Transactions on Architecture and Code Optimization, April 25, 2022. http://dx.doi.org/10.1145/3532863.

Full text
Abstract:
This paper proposes a novel hardware accelerator for the inference task with sparse convolutional neural networks (CNNs) by building a hardware unit that performs the Image-to-Column (Im2Col) transformation of the input feature map, coupled with a systolic-array-based general matrix-matrix multiplication (GEMM) unit. Our design carefully overlaps the Im2Col transformation with the GEMM computation to maximize parallelism. We propose a novel design for the Im2Col unit that uses a set of distributed local memories connected by a ring network, which improves energy efficiency and latency by streaming the input feature map only once. The systolic-array-based GEMM unit in the accelerator can be dynamically configured as multiple GEMM units with square-shaped systolic arrays or as a single GEMM unit with a tall systolic array. This dynamic reconfigurability enables effective pipelining of the Im2Col and GEMM operations and attains high processing element utilization for a wide range of CNNs. Further, our accelerator is sparsity-aware, improving performance and energy efficiency by effectively mapping the sparse feature maps and weights to the processing elements, skipping ineffectual operations and unnecessary data movements involving zeros. Our prototype, SPOTS, is on average 2.16 ×, 1.74 ×, and 1.63 × faster than Gemmini, Eyeriss, and Sparse-PE, respectively, which are prior hardware accelerators for dense and sparse CNNs. SPOTS is also 78 × and 12 × more energy-efficient than CPU and GPU implementations, respectively.
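Below is a minimal sketch of the Im2Col + GEMM formulation that such an accelerator pipelines in hardware, assuming unit stride, no padding, and a single image; the `im2col` helper and the chosen shapes are illustrative, not the paper's hardware unit.

```python
# Sketch: convolution expressed as im2col followed by one GEMM.
import numpy as np

def im2col(x, kh, kw):
    """x: (C, H, W) -> (C*kh*kw, out_h*out_w); one column per output position."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i+kh, j:j+kw].ravel()
    return cols

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 8, 8))         # C x H x W input feature map
w = rng.standard_normal((4, 3, 3, 3))      # K filters of shape C x kh x kw

cols = im2col(x, 3, 3)                                 # (27, 36)
out  = (w.reshape(4, -1) @ cols).reshape(4, 6, 6)      # GEMM, then fold back

# Spot-check one output against direct correlation at that position.
ref = np.sum(x[:, 2:5, 3:6] * w[1])
assert np.isclose(out[1, 2, 3], ref)
```

The overlap the paper describes corresponds to producing columns of `cols` and consuming them in the systolic GEMM concurrently, rather than materializing the full matrix first; sparsity awareness then skips the zero entries of `cols` and of the flattened weights.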
APA, Harvard, Vancouver, ISO, and other styles
50

Del Sarto, Nicola, Diane A. Isabelle, Valentina Cucino, and Alberto Di Minin. "Engaging with startups through corporate accelerators: the case of H‐FARM's White Label Accelerator." R&D Management, July 9, 2024. http://dx.doi.org/10.1111/radm.12705.

Full text
Abstract:
Corporate accelerators have emerged in recent years as an innovation mechanism that builds bridges between corporations and startups. Through inbound Open Innovation (OI) activities, established firms open their innovation processes to startups in order to acquire their knowledge. Previous research has focused either on independent accelerators or on corporate accelerator programs that an established firm operates internally. The literature on how accelerators orchestrate different OI practices is sparse, yet large corporations are forging ahead with corporate accelerators. Furthermore, new corporate accelerator models have emerged, rendering the corporate accelerator phenomenon more heterogeneous. In this paper, adopting an OI lens, we explore the White Label Accelerator (WLA), a recent model in which an independent accelerator manages the program on behalf of a single corporate organization. Through an in-depth case study of the Technogym Wellness Accelerator, the pioneering and most important WLA in Italy, set up by H-FARM on behalf of Technogym, a large fitness and wellness company, we investigate how such accelerators function as an OI tool. We found four key dimensions that enable successful outcomes. In particular, the importance of entrepreneurial alertness as a key driver for the effective exploitation of intellectual property represents a significant finding. Our research contributes to the OI and entrepreneurial finance literature and provides insightful managerial implications for corporate accelerator stakeholders and startup managers.
APA, Harvard, Vancouver, ISO, and other styles