Journal articles on the topic "Neural network accelerator"

Consult the top 50 journal articles for your research on the topic "Neural network accelerator".

You can also download the full text of the academic publication in PDF format and read its abstract online whenever it is available in the metadata.

Explore journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Eliahu, Adi, Ronny Ronen, Pierre-Emmanuel Gaillardon, and Shahar Kvatinsky. "multiPULPly". ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (April 2021): 1–27. http://dx.doi.org/10.1145/3432815.

Abstract
Computationally intensive neural network applications often need to run on resource-limited low-power devices. Numerous hardware accelerators have been developed to speed up the performance of neural network applications and reduce power consumption; however, most focus on data centers and full-fledged systems. Acceleration in ultra-low-power systems has been only partially addressed. In this article, we present multiPULPly, an accelerator that integrates memristive technologies within standard low-power CMOS technology, to accelerate multiplication in neural network inference on ultra-low-power systems. This accelerator was designated for PULP, an open-source microcontroller system that uses low-power RISC-V processors. Memristors were integrated into the accelerator to enable power consumption only when the memory is active, to continue the task with no context-restoring overhead, and to enable highly parallel analog multiplication. To reduce the energy consumption, we propose novel dataflows that handle common multiplication scenarios and are tailored for our architecture. The accelerator was tested on FPGA and achieved a peak energy efficiency of 19.5 TOPS/W, outperforming state-of-the-art accelerators by 1.5× to 4.5×.
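The analog in-memory multiplication that memristive accelerators of this kind exploit can be summarized with a simple crossbar model: weights are stored as conductances, inputs are applied as row voltages, and each column wire accumulates current, producing a matrix-vector product in one step. The sketch below is a minimal idealized model (no device noise, quantization, or peripheral circuitry) and is not taken from the paper; the conductance range and mapping are illustrative assumptions.

```python
import numpy as np

def crossbar_mvm(weights, inputs, g_min=1e-6, g_max=1e-4):
    """Idealized memristive crossbar matrix-vector multiply.

    Weights are linearly mapped to conductances in [g_min, g_max]; inputs are
    applied as row voltages, and each column current is the analog dot product
    (Ohm's law plus Kirchhoff's current law).
    """
    w_min, w_max = weights.min(), weights.max()
    scale = (g_max - g_min) / (w_max - w_min + 1e-12)
    offset = g_min - w_min * scale
    g = weights * scale + offset          # weight -> conductance mapping
    currents = inputs @ g                 # column currents: all MACs in one analog step
    # Undo the linear mapping to recover the numerical dot products.
    return (currents - inputs.sum() * offset) / scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))               # 8 inputs x 4 output columns
x = rng.normal(size=8)
print(np.allclose(crossbar_mvm(W, x), x @ W))   # True: matches the digital result
```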
2

Cho, Jaechan, Yongchul Jung, Seongjoo Lee, and Yunho Jung. "Reconfigurable Binary Neural Network Accelerator with Adaptive Parallelism Scheme". Electronics 10, no. 3 (January 20, 2021): 230. http://dx.doi.org/10.3390/electronics10030230.

Abstract
Binary neural networks (BNNs) have attracted significant interest for the implementation of deep neural networks (DNNs) on resource-constrained edge devices, and various BNN accelerator architectures have been proposed to achieve higher efficiency. BNN accelerators can be divided into two categories: streaming and layer accelerators. Although streaming accelerators designed for a specific BNN network topology provide high throughput, they are infeasible for various sensor applications in edge AI because of their complexity and inflexibility. In contrast, layer accelerators with reasonable resources can support various network topologies, but they operate with the same parallelism for all the layers of the BNN, which degrades throughput performance at certain layers. To overcome this problem, we propose a BNN accelerator with adaptive parallelism that offers high throughput performance in all layers. The proposed accelerator analyzes target layer parameters and operates with optimal parallelism using reasonable resources. In addition, this architecture is able to fully compute all types of BNN layers thanks to its reconfigurability, and it can achieve a higher area–speed efficiency than existing accelerators. In performance evaluation using state-of-the-art BNN topologies, the designed BNN accelerator achieved an area–speed efficiency 9.69 times higher than previous FPGA implementations and 24% higher than existing VLSI implementations for BNNs.
3

Hong, JiUn, Saad Arslan, TaeGeon Lee, and HyungWon Kim. "Design of Power-Efficient Training Accelerator for Convolution Neural Networks". Electronics 10, no. 7 (March 26, 2021): 787. http://dx.doi.org/10.3390/electronics10070787.

Abstract
To realize deep learning techniques, a type of deep neural network (DNN) called a convolutional neural networks (CNN) is among the most widely used models aimed at image recognition applications. However, there is growing demand for light-weight and low-power neural network accelerators, not only for inference but also for training process. In this paper, we propose a training accelerator that provides low power and compact chip size targeted for mobile and edge computing applications. It accelerates to achieve the real-time processing of both inference and training using concurrent floating-point data paths. The proposed accelerator can be externally controlled and employs resource sharing and an integrated convolution-pooling block to achieve low area and low energy consumption. We implemented the proposed training accelerator in an FPGA (Field Programmable Gate Array) and evaluated its training performance using an MNIST CNN example in comparison with a PC with GPU (Graphics Processing Unit). While both methods achieved a similar training accuracy of 95.1%, the proposed accelerator, when implemented in a silicon chip, reduced the energy consumption by 480 times compared to the counterpart. Additionally, when implemented on an FPGA, an energy reduction of over 4.5 times was achieved compared to the existing FPGA training accelerator for the MNIST dataset. Therefore, the proposed accelerator is more suitable for deployment in mobile/edge nodes compared to the existing software and hardware accelerators.
4

Noskova, E. S., I. E. Zakharov, Y. N. Shkandybin, and S. G. Rykovanov. "Towards energy-efficient neural network calculations". Computer Optics 46, no. 1 (February 2022): 160–66. http://dx.doi.org/10.18287/2412-6179-co-914.

Abstract
Nowadays, the problem of creating high-performance and energy-efficient hardware for Artificial Intelligence tasks is very acute. The most popular solution to this problem is the use of Deep Learning Accelerators, such as GPUs and Tensor Processing Units to run neural networks. Recently, NVIDIA has announced the NVDLA project, which allows one to design neural network accelerators based on an open-source code. This work describes a full cycle of creating a prototype NVDLA accelerator, as well as testing the resulting solution by running the resnet-50 neural network on it. Finally, an assessment of the performance and power efficiency of the prototype NVDLA accelerator when compared to the GPU and CPU is provided, the results of which show the superiority of NVDLA in many characteristics.
5

Ferianc, Martin, Hongxiang Fan, Divyansh Manocha, Hongyu Zhou, Shuanglong Liu, Xinyu Niu, and Wayne Luk. "Improving Performance Estimation for Design Space Exploration for Convolutional Neural Network Accelerators". Electronics 10, no. 4 (February 23, 2021): 520. http://dx.doi.org/10.3390/electronics10040520.

Abstract
Contemporary advances in neural networks (NNs) have demonstrated their potential in different applications such as in image classification, object detection or natural language processing. In particular, reconfigurable accelerators have been widely used for the acceleration of NNs due to their reconfigurability and efficiency in specific application instances. To determine the configuration of the accelerator, it is necessary to conduct design space exploration to optimize the performance. However, the process of design space exploration is time consuming because of the slow performance evaluation for different configurations. Therefore, there is a demand for an accurate and fast performance prediction method to speed up design space exploration. This work introduces a novel method for fast and accurate estimation of different metrics that are of importance when performing design space exploration. The method is based on a Gaussian process regression model parametrised by the features of the accelerator and the target NN to be accelerated. We evaluate the proposed method together with other popular machine learning based methods in estimating the latency and energy consumption of our implemented accelerator on two different hardware platforms targeting convolutional neural networks. We demonstrate improvements in estimation accuracy, without the need for significant implementation effort or tuning.
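The core idea, a Gaussian process regression model that maps accelerator and network features to a performance metric, can be sketched with scikit-learn. The feature names and synthetic data below are illustrative assumptions, not the paper's parametrization.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(1)

# Hypothetical design-space features: [PE array size, buffer KiB, clock MHz, NN GOPs]
X = rng.uniform([8, 64, 100, 1], [64, 1024, 300, 40], size=(40, 4))
# Hypothetical measured latency (ms) for those configurations.
y = 5000 * X[:, 3] / (X[:, 0] * X[:, 2]) + 0.05 * rng.normal(size=40)

# GP regression parametrized by accelerator + network features.
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=[1.0] * 4),
                              normalize_y=True)
gp.fit(X, y)

# Fast latency estimate (with uncertainty) for an unseen configuration,
# usable inside a design-space-exploration loop instead of a slow evaluation.
candidate = np.array([[32, 512, 200, 15]])
mean, std = gp.predict(candidate, return_std=True)
print(f"predicted latency: {mean[0]:.2f} ms +/- {std[0]:.2f}")
```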
6

Sunny, Febin P., Asif Mirza, Mahdi Nikdast, and Sudeep Pasricha. "ROBIN: A Robust Optical Binary Neural Network Accelerator". ACM Transactions on Embedded Computing Systems 20, no. 5s (October 31, 2021): 1–24. http://dx.doi.org/10.1145/3476988.

Abstract
Domain specific neural network accelerators have garnered attention because of their improved energy efficiency and inference performance compared to CPUs and GPUs. Such accelerators are thus well suited for resource-constrained embedded systems. However, mapping sophisticated neural network models on these accelerators still entails significant energy and memory consumption, along with high inference time overhead. Binarized neural networks (BNNs), which utilize single-bit weights, represent an efficient way to implement and deploy neural network models on accelerators. In this paper, we present a novel optical-domain BNN accelerator, named ROBIN , which intelligently integrates heterogeneous microring resonator optical devices with complementary capabilities to efficiently implement the key functionalities in BNNs. We perform detailed fabrication-process variation analyses at the optical device level, explore efficient corrective tuning for these devices, and integrate circuit-level optimization to counter thermal variations. As a result, our proposed ROBIN architecture possesses the desirable traits of being robust, energy-efficient, low latency, and high throughput, when executing BNN models. Our analysis shows that ROBIN can outperform the best-known optical BNN accelerators and many electronic accelerators. Specifically, our energy-efficient ROBIN design exhibits energy-per-bit values that are ∼4 × lower than electronic BNN accelerators and ∼933 × lower than a recently proposed photonic BNN accelerator, while a performance-efficient ROBIN design shows ∼3 × and ∼25 × better performance than electronic and photonic BNN accelerators, respectively.
7

Anmin, Kong, and Zhao Bin. "A Parallel Loading Based Accelerator for Convolution Neural Network". International Journal of Machine Learning and Computing 10, no. 5 (October 5, 2020): 669–74. http://dx.doi.org/10.18178/ijmlc.2020.10.5.989.

8

Xia, Chengpeng, Yawen Chen, Haibo Zhang, Hao Zhang, Fei Dai, and Jigang Wu. "Efficient neural network accelerators with optical computing and communication". Computer Science and Information Systems, no. 00 (2022): 66. http://dx.doi.org/10.2298/csis220131066x.

Abstract
Conventional electronic Artificial Neural Networks (ANNs) accelerators focus on architecture design and numerical computation optimization to improve the training efficiency. However, these approaches have recently encountered bottlenecks in terms of energy efficiency and computing performance, which leads to an increase interest in photonic accelerator. Photonic architectures with low energy consumption, high transmission speed and high bandwidth have been considered as an important role for generation of computing architectures. In this paper, to provide a better understanding of optical technology used in ANN acceleration, we present a comprehensive review for the efficient photonic computing and communication in ANN accelerators. The related photonic devices are investigated in terms of the application in ANNs acceleration, and a classification of existing solutions is proposed that are categorized into optical computing acceleration and optical communication acceleration according to photonic effects and photonic architectures. Moreover, we discuss the challenges for these photonic neural network acceleration approaches to highlight the most promising future research opportunities in this field.
9

Tang, Wenkai, and Peiyong Zhang. "GPGCN: A General-Purpose Graph Convolution Neural Network Accelerator Based on RISC-V ISA Extension". Electronics 11, no. 22 (November 21, 2022): 3833. http://dx.doi.org/10.3390/electronics11223833.

Abstract
In the past two years, various graph convolution neural networks (GCNs) accelerators have emerged, each with their own characteristics, but their common disadvantage is that the hardware architecture is not programmable and it is optimized for a specific network and dataset. They may not support acceleration for different GCNs and may not achieve optimal hardware resource utilization for datasets of different sizes. Therefore, given the above shortcomings, and according to the development trend of traditional neural network accelerators, this paper proposes and implements GPGCN: a general-purpose GCNs accelerator architecture based on RISC-V instruction set extension, providing the software programming freedom to support acceleration for various GCNs, and achieving the best acceleration efficiency for different GCNs with different datasets. Compared with traditional CPU, and traditional CPU with vector expansion, GPGCN achieves above 1001×, 267× speedup for GCN with the Cora dataset. Compared with dedicated accelerators, GPGCN has software programmability and supports the acceleration of more GCNs.
10

An, Fubang, Lingli Wang, and Xuegong Zhou. "A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network". Electronics 12, no. 13 (June 27, 2023): 2847. http://dx.doi.org/10.3390/electronics12132847.

Abstract
Since the lightweight convolutional neural network EfficientNet was proposed by Google in 2019, the series of models have quickly become very popular due to their superior performance with a small number of parameters. However, the existing convolutional neural network hardware accelerators for EfficientNet still have much room to improve the performance of the depthwise convolution, squeeze-and-excitation module and nonlinear activation functions. In this paper, we first design a reconfigurable register array and computational kernel to accelerate the depthwise convolution. Next, we propose a vector unit to implement the nonlinear activation functions and the scale operation. An exchangeable-sequence dual-computational kernel architecture is proposed to improve the performance and the utilization. In addition, the memory architectures are designed to complete the hardware accelerator for the above computing architecture. Finally, in order to evaluate the performance of the hardware accelerator, the accelerator is implemented based on Xilinx XCVU37P. The results show that the proposed accelerator can work at the main system clock frequency of 300 MHz with the DSP kernel at 600 MHz. The performance of EfficientNet-B3 in our architecture can reach 69.50 FPS and 255.22 GOPS. Compared with the latest EfficientNet-B3 accelerator, which uses the same FPGA development board, the accelerator proposed in this paper can achieve a 1.28-fold improvement of single-core performance and 1.38-fold improvement of performance of each DSP.
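Two of the operations this accelerator targets, depthwise convolution and the squeeze-and-excitation (SE) scale step, are easy to state in a few lines of NumPy. The sketch below is a generic reference formulation of those layers, not the paper's hardware mapping; shapes and weights are arbitrary examples.

```python
import numpy as np

def depthwise_conv2d(x, w):
    """x: (C, H, W) feature map, w: (C, k, k) one filter per channel (stride 1, no padding)."""
    C, H, W = x.shape
    k = w.shape[1]
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                      # each channel is convolved independently
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * w[c])
    return out

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation: global average pool -> two FC layers -> sigmoid scaling."""
    s = x.mean(axis=(1, 2))                 # squeeze: one value per channel
    e = np.maximum(w1 @ s, 0.0)             # excitation FC 1 + ReLU (reduced width)
    e = 1.0 / (1.0 + np.exp(-(w2 @ e)))     # excitation FC 2 + sigmoid, back to C channels
    return x * e[:, None, None]             # per-channel rescaling of the feature map

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 16, 16))
y = depthwise_conv2d(x, rng.normal(size=(8, 3, 3)))
y = squeeze_excite(y, rng.normal(size=(2, 8)), rng.normal(size=(8, 2)))
print(y.shape)   # (8, 14, 14)
```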
11

Biookaghazadeh, Saman, Pravin Kumar Ravi, and Ming Zhao. "Toward Multi-FPGA Acceleration of the Neural Networks". ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (April 2021): 1–23. http://dx.doi.org/10.1145/3432816.

Abstract
High-throughput and low-latency Convolutional Neural Network (CNN) inference is increasingly important for many cloud- and edge-computing applications. FPGA-based acceleration of CNN inference has demonstrated various benefits compared to other high-performance devices such as GPGPUs. Current FPGA CNN-acceleration solutions are based on a single FPGA design, which are limited by the available resources on an FPGA. In addition, they can only accelerate conventional 2D neural networks. To address these limitations, we present a generic multi-FPGA solution, written in OpenCL, which can accelerate more complex CNNs (e.g., C3D CNN) and achieve a near linear speedup with respect to the available single-FPGA solutions. The design is built upon the Intel Deep Learning Accelerator architecture, with three extensions. First, it includes updates for better area efficiency (up to 25%) and higher performance (up to 24%). Second, it supports 3D convolutions for more challenging applications such as video learning. Third, it supports multi-FPGA communication for higher inference throughput. The results show that utilizing multiple FPGAs can linearly increase the overall bandwidth while maintaining the same end-to-end latency. In addition, the design can outperform other FPGA 2D accelerators by up to 8.4 times and 3D accelerators by up to 1.7 times.
12

Ge, Fen, Ning Wu, Hao Xiao, Yuanyuan Zhang, and Fang Zhou. "Compact Convolutional Neural Network Accelerator for IoT Endpoint SoC". Electronics 8, no. 5 (May 5, 2019): 497. http://dx.doi.org/10.3390/electronics8050497.

Abstract
As a classical artificial intelligence algorithm, the convolutional neural network (CNN) algorithm plays an important role in image recognition and classification and is gradually being applied in the Internet of Things (IoT) system. A compact CNN accelerator for the IoT endpoint System-on-Chip (SoC) is proposed in this paper to meet the needs of CNN computations. Based on analysis of the CNN structure, basic functional modules of CNN such as convolution circuit and pooling circuit with a low data bandwidth and a smaller area are designed, and an accelerator is constructed in the form of four acceleration chains. After the acceleration unit design is completed, the Cortex-M3 is used to construct a verification SoC and the designed verification platform is implemented on the FPGA to evaluate the resource consumption and performance analysis of the CNN accelerator. The CNN accelerator achieved a throughput of 6.54 GOPS (giga operations per second) by consuming 4901 LUTs without using any hardware multipliers. The comparison shows that the compact accelerator proposed in this paper makes the CNN computational power of the SoC based on the Cortex-M3 kernel two times higher than the quad-core Cortex-A7 SoC and 67% of the computational power of eight-core Cortex-A53 SoC.
13

Clements, Joseph, and Yingjie Lao. "DeepHardMark: Towards Watermarking Neural Network Hardware". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 4 (June 28, 2022): 4450–58. http://dx.doi.org/10.1609/aaai.v36i4.20367.

Abstract
This paper presents a framework for embedding watermarks into DNN hardware accelerators. Unlike previous works that have looked at protecting the algorithmic intellectual properties of deep learning systems, this work proposes a methodology for defending deep learning hardware. Our methodology embeds modifications into the hardware accelerator's functional blocks that can be revealed with the rightful owner's key DNN and corresponding key sample, verifying the legitimate owner. We propose an Lp-box ADMM based algorithm to co-optimize watermark's hardware overhead and impact on the design's algorithmic functionality. We evaluate the performance of the hardware watermarking scheme on popular image classifier models using various accelerator designs. Our results demonstrate that the proposed methodology effectively embeds watermarks while preserving the original functionality of the hardware architecture. Specifically, we can successfully embed watermarks into the deep learning hardware and reliably execute a ResNet ImageNet classifiers with an accuracy degradation of only 0.009%
14

Xia, Chengpeng, Yawen Chen, Haibo Zhang, and Jigang Wu. "STADIA: Photonic Stochastic Gradient Descent for Neural Network Accelerators". ACM Transactions on Embedded Computing Systems 22, no. 5s (September 9, 2023): 1–23. http://dx.doi.org/10.1145/3607920.

Abstract
Deep Neural Networks (DNNs) have demonstrated great success in many fields such as image recognition and text analysis. However, the ever-increasing sizes of both DNN models and training datasets make deep leaning extremely computation- and memory-intensive. Recently, photonic computing has emerged as a promising technology for accelerating DNNs. While the design of photonic accelerators for DNN inference and forward propagation of DNN training has been widely investigated, the architectural acceleration for equally important backpropagation of DNN training has not been well studied. In this paper, we propose a novel silicon photonic-based backpropagation accelerator for high performance DNN training. Specifically, a general-purpose photonic gradient descent unit named STADIA is designed to implement the multiplication, accumulation, and subtraction operations required for computing gradients using mature optical devices including Mach-Zehnder Interferometer (MZI) and Mircoring Resonator (MRR), which can significantly reduce the training latency and improve the energy efficiency of backpropagation. To demonstrate efficient parallel computing, we propose a STADIA-based backpropagation acceleration architecture and design a dataflow by using wavelength-division multiplexing (WDM). We analyze the precision of STADIA by quantifying the precision limitations imposed by losses and noises. Furthermore, we evaluate STADIA with different element sizes by analyzing the power, area and time delay for photonic accelerators based on DNN models such as AlexNet, VGG19 and ResNet. Simulation results show that the proposed architecture STADIA can achieve significant improvement by 9.7× in time efficiency and 147.2× in energy efficiency, compared with the most advanced optical-memristor based backpropagation accelerator.
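The multiplication, accumulation, and subtraction operations that the abstract maps onto optical devices are exactly those of a dense-layer backward pass with an SGD update. The NumPy sketch below is just that reference computation, not the photonic implementation.

```python
import numpy as np

def dense_backward_sgd(x, w, dy, lr=0.01):
    """One dense layer y = x @ w: backward-pass gradients and SGD update.

    All three steps reduce to multiply-accumulate plus a subtraction, which is
    what a photonic backpropagation unit would accelerate.
    """
    dw = x.T @ dy          # multiply-accumulate: gradient w.r.t. the weights
    dx = dy @ w.T          # multiply-accumulate: gradient passed to the previous layer
    w_new = w - lr * dw    # subtraction: SGD weight update
    return w_new, dx

rng = np.random.default_rng(3)
x = rng.normal(size=(32, 64))      # batch of activations
w = rng.normal(size=(64, 10))
dy = rng.normal(size=(32, 10))     # upstream gradient
w, dx = dense_backward_sgd(x, w, dy)
print(w.shape, dx.shape)           # (64, 10) (32, 64)
```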
15

Wei, Rongshan, Chenjia Li, Chuandong Chen, Guangyu Sun, and Minghua He. "Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller". Electronics 10, no. 4 (February 10, 2021): 438. http://dx.doi.org/10.3390/electronics10040438.

Abstract
Special accelerator architecture has achieved great success in processor architecture, and it is trending in computer architecture development. However, as the memory access pattern of an accelerator is relatively complicated, the memory access performance is relatively poor, limiting the overall performance improvement of hardware accelerators. Moreover, memory controllers for hardware accelerators have been scarcely researched. We consider that a special accelerator memory controller is essential for improving the memory access performance. To this end, we propose a dynamic random access memory (DRAM) memory controller called NNAMC for neural network accelerators, which monitors the memory access stream of an accelerator and transfers it to the optimal address mapping scheme bank based on the memory access characteristics. NNAMC includes a stream access prediction unit (SAPU) that analyzes the type of data stream accessed by the accelerator via hardware, and designs the address mapping for different banks using a bank partitioning model (BPM). The image mapping method and hardware architecture were analyzed in a practical neural network accelerator. In the experiment, NNAMC achieved significantly lower access latency of the hardware accelerator than the competing address mapping schemes, increased the row buffer hit ratio by 13.68% on average (up to 26.17%), reduced the system access latency by 26.3% on average (up to 37.68%), and lowered the hardware cost. In addition, we also confirmed that NNAMC efficiently adapted to different network parameters.
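What an address mapping scheme does can be illustrated by how a DRAM controller slices a physical address into column, bank, and row fields; placing the bank bits just above the column bits lets consecutive accesses interleave across banks instead of repeatedly reopening rows in one bank. The bit widths and the two schemes below are arbitrary illustrative choices, not NNAMC's actual mappings.

```python
def map_address(addr, col_bits=10, bank_bits=3, row_bits=14, bank_low=True):
    """Split a physical address into (row, bank, column) fields.

    bank_low=True  -> layout row:bank:column (consecutive blocks rotate over banks)
    bank_low=False -> layout bank:row:column (consecutive blocks stay in one bank)
    """
    col = addr & ((1 << col_bits) - 1)
    rest = addr >> col_bits
    if bank_low:
        bank = rest & ((1 << bank_bits) - 1)
        row = (rest >> bank_bits) & ((1 << row_bits) - 1)
    else:
        row = rest & ((1 << row_bits) - 1)
        bank = (rest >> row_bits) & ((1 << bank_bits) - 1)
    return row, bank, col

# A streaming access pattern touching consecutive 1 KiB blocks:
for blk in range(4):
    addr = blk << 10
    print("bank-interleaved:", map_address(addr, bank_low=True),
          " row-major:", map_address(addr, bank_low=False))
```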
16

Chen, Weijian, Zhi Qi, Zahid Akhtar, and Kamran Siddique. "Resistive-RAM-Based In-Memory Computing for Neural Network: A Review". Electronics 11, no. 22 (November 9, 2022): 3667. http://dx.doi.org/10.3390/electronics11223667.

Abstract
Processing-in-memory (PIM) is a promising architecture to design various types of neural network accelerators as it ensures the efficiency of computation together with Resistive Random Access Memory (ReRAM). ReRAM has now become a promising solution to enhance computing efficiency due to its crossbar structure. In this paper, a ReRAM-based PIM neural network accelerator is addressed, and different kinds of methods and designs of various schemes are discussed. Various models and architectures implemented for a neural network accelerator are determined for research trends. Further, the limitations or challenges of ReRAM in a neural network are also addressed in this review.
17

Hu, Jian, Xianlong Zhang, and Xiaohua Shi. "Simulating Neural Network Processors". Wireless Communications and Mobile Computing 2022 (February 23, 2022): 1–12. http://dx.doi.org/10.1155/2022/7500195.

Abstract
Deep learning has achieved competing results compared with human beings in many fields. Traditionally, deep learning networks are executed on CPUs and GPUs. In recent years, more and more neural network accelerators have been introduced in both academia and industry to improve the performance and energy efficiency for deep learning networks. In this paper, we introduce a flexible and configurable functional NN accelerator simulator, which could be configured to simulate u-architectures for different NN accelerators. The extensible and configurable simulator is helpful for system-level exploration of u-architecture, as well as operator optimization algorithm developments. The simulator is a functional simulator that simulates the latencies of calculation and memory access and the concurrent process between modules, and it gives the number of program execution cycles after the simulation is completed. We also integrated the simulator into the TVM compilation stack as an optional backend. Users can use TVM to write operators and execute them on the simulator.
18

Lim, Se-Min, and Sang-Woo Jun. "MobileNets Can Be Lossily Compressed: Neural Network Compression for Embedded Accelerators". Electronics 11, no. 6 (March 9, 2022): 858. http://dx.doi.org/10.3390/electronics11060858.

Abstract
Although neural network quantization is an imperative technology for the computation and memory efficiency of embedded neural network accelerators, simple post-training quantization incurs unacceptable levels of accuracy degradation on some important models targeting embedded systems, such as MobileNets. While explicit quantization-aware training or re-training after quantization can often reclaim lost accuracy, this is not always possible or convenient. We present an alternative approach to compressing such difficult neural networks, using a novel variant of the ZFP lossy floating-point compression algorithm to compress both model weights and inter-layer activations and demonstrate that it can be efficiently implemented on an embedded FPGA platform. Our ZFP variant, which we call ZFPe, is designed for efficient implementation on embedded accelerators, such as FPGAs, requiring a fraction of chip resources per bandwidth compared to state-of-the-art lossy compression accelerators. ZFPe-compressing the MobileNet V2 model with an 8-bit budget per weight and activation results in significantly higher accuracy compared to 8-bit integer post-training quantization and shows no loss of accuracy, compared to an uncompressed model when given a 12-bit budget per floating-point value. To demonstrate the benefits of our approach, we implement an embedded neural network accelerator on a realistic embedded acceleration platform equipped with the low-power Lattice ECP5-85F FPGA and a 32 MB SDRAM chip. Each ZFPe module consumes less than 6% of LUTs while compressing or decompressing one value per cycle, requiring a fraction of the resources compared to state-of-the-art compression accelerators while completely removing the memory bottleneck of our accelerator.
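As a point of reference for the 8-bit budget the paper compares against, plain post-training quantization of weights looks like the following. This sketch only illustrates the integer-quantization baseline and the error it introduces, not the ZFPe algorithm itself; the weight distribution is a synthetic assumption.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit post-training quantization (the baseline scheme)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.normal(scale=0.05, size=10_000).astype(np.float32)   # small, MobileNet-like weights
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
print("bits per weight:", q.itemsize * 8)   # the 8-bit budget used in the comparison
```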
19

Xie, Xiaoru, Mingyu Zhu, Siyuan Lu, and Zhongfeng Wang. "Efficient Layer-Wise N:M Sparse CNN Accelerator with Flexible SPEC: Sparse Processing Element Clusters". Micromachines 14, no. 3 (February 24, 2023): 528. http://dx.doi.org/10.3390/mi14030528.

Abstract
Recently, the layer-wise N:M fine-grained sparse neural network algorithm (i.e., every M-weights contains N non-zero values) has attracted tremendous attention, as it can effectively reduce the computational complexity with negligible accuracy loss. However, the speed-up potential of this algorithm will not be fully exploited if the right hardware support is lacking. In this work, we design an efficient accelerator for the N:M sparse convolutional neural networks (CNNs) with layer-wise sparse patterns. First, we analyze the performances of different processing element (PE) structures and extensions to construct the flexible PE architecture. Second, the variable sparse convolutional dimensions and sparse ratios are involved in the hardware design. With a sparse PE cluster (SPEC) design, the hardware can efficiently accelerate CNNs with the layer-wise N:M pattern. Finally, we employ the proposed SPEC into the CNN accelerator with flexible network-on-chip and specially designed dataflow. We implement hardware accelerators on Xilinx ZCU102 FPGA and Xilinx VCU118 FPGA and evaluate them with classical CNNs such as Alexnet, VGG-16, and ResNet-50. Compared with existing accelerators designed for structured and unstructured pruned networks, our design achieves the best performance in terms of power efficiency.
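The N:M constraint itself is simple: in every group of M consecutive weights, only the N largest-magnitude values are kept. Below is a NumPy sketch of applying such a mask (2:4 chosen as an example); the paper's actual pruning and training procedure is more involved.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m (along the last axis)."""
    flat = w.reshape(-1, m)
    # Zero out the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (flat * mask).reshape(w.shape)

rng = np.random.default_rng(5)
w = rng.normal(size=(4, 8))
w_sparse = nm_prune(w, n=2, m=4)
print((w_sparse.reshape(-1, 4) != 0).sum(axis=1))   # 2 nonzeros kept per group of 4
```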
20

Li, Yihang. "Sparse-Aware Deep Learning Accelerator". Highlights in Science, Engineering and Technology 39 (April 1, 2023): 305–10. http://dx.doi.org/10.54097/hset.v39i.6544.

Abstract
In view of the difficulty of hardware implementation of convolutional neural network computing, most of the previous convolutional neural network accelerator designs focused on solving the bottleneck of computational performance and bandwidth, ignoring the importance of convolutional neural network scarcity for accelerator design. In recent years, there are a few convolutional neural network accelerators that can take advantage of the scarcity, but they are usually difficult to consider in terms of computational flexibility, parallel efficiency and resource overhead. In view of the problem that the application of convolutional neural network (CNN) on the embedded side is limited by real-time, and there is a large degree of sparsity in CNN convolution calculation. This paper summarizes the methods of sparsification from the algorithm level and based on FPGA level. The different methods of sparsification and the research and analysis of different application layers are introduced. The advantages and development trend of sparsification are analyzed and summarized.
21

Afifi, Salma, Febin Sunny, Amin Shafiee, Mahdi Nikdast, and Sudeep Pasricha. "GHOST: A Graph Neural Network Accelerator using Silicon Photonics". ACM Transactions on Embedded Computing Systems 22, no. 5s (September 9, 2023): 1–25. http://dx.doi.org/10.1145/3609097.

Abstract
Graph neural networks (GNNs) have emerged as a powerful approach for modelling and learning from graph-structured data. Multiple fields have since benefitted enormously from the capabilities of GNNs, such as recommendation systems, social network analysis, drug discovery, and robotics. However, accelerating and efficiently processing GNNs require a unique approach that goes beyond conventional artificial neural network accelerators, due to the substantial computational and memory requirements of GNNs. The slowdown of scaling in CMOS platforms also motivates a search for alternative implementation substrates. In this paper, we present GHOST , the first silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates the costs associated with both vertex-centric and edge-centric operations. It implements separately the three main stages involved in running GNNs in the optical domain, allowing it to be used for the inference of various widely used GNN models and architectures, such as graph convolution networks and graph attention networks. Our simulation studies indicate that GHOST exhibits at least 10.2 × better throughput and 3.8 × better energy efficiency when compared to GPU, TPU, CPU and multiple state-of-the-art GNN hardware accelerators.
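The vertex-centric and edge-centric stages mentioned above correspond to the two matrix products of a graph convolution layer, H' = ReLU(Â H W): aggregation over edges (Â H) and the per-vertex feature transform (H W). A minimal NumPy reference of one GCN layer follows; the optical implementation is, of course, entirely different, and the graph here is a random example.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One graph convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 @ H @ W)."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    msgs = a_norm @ h          # edge-centric stage: aggregate neighbour features
    out = msgs @ w             # vertex-centric stage: transform aggregated features
    return np.maximum(out, 0.0)

rng = np.random.default_rng(6)
adj = (rng.random((5, 5)) < 0.4).astype(float)
adj = np.maximum(adj, adj.T)                      # undirected example graph
h = rng.normal(size=(5, 8))                       # 5 vertices, 8 input features
print(gcn_layer(adj, h, rng.normal(size=(8, 4))).shape)   # (5, 4)
```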
22

Yang, Zhi. "Dynamic Logo Design System of Network Media Art Based on Convolutional Neural Network". Mobile Information Systems 2022 (May 31, 2022): 1–10. http://dx.doi.org/10.1155/2022/3247229.

Abstract
Nowadays, we are in an era of rapid development of Internet technology and unlimited expansion of information dissemination. While the application of new media and digital multimedia has become more popular, it has also brought Earth shaking changes to our life. In order to solve the problem that the traditional static visual image has been difficult to meet people’s needs, a network media art dynamic logo design system based on convolutional neural network is proposed. Firstly, the software and hardware platform related to accelerator development is introduced, the advanced integrated design and calculation IP core are determined as the FPGA hardware accelerator, and the design objectives and requirements of the accelerator system are analyzed. The overall architecture of the accelerator system is designed. 76% of designers believe that the dynamic logo has promoted the corporate image. Then, the function and architecture of IP core are designed based on advanced synthesis, the code structure is standardized, the function is divided, and the operation acceleration is further optimized by using the instruction set of HLS. Finally, the design is integrated by Vivado HLS and Vivado IDE software. The experiment shows that the accelerator system has low power consumption and high resource utilization.
23

Hosseini, Morteza, and Tinoosh Mohsenin. "Binary Precision Neural Network Manycore Accelerator". ACM Journal on Emerging Technologies in Computing Systems 17, no. 2 (April 2021): 1–27. http://dx.doi.org/10.1145/3423136.

Abstract
This article presents a low-power, programmable, domain-specific manycore accelerator, Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit and can be packed several in one memory entry that minimizes memory footprint to its finest. Packing weights also facilitates executing single instruction, multiple data with simple circuitry that allows maximizing performance and efficiency. The proposed BiNMAC has light-weight cores that support domain-specific instructions, and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. With only 3.73% and 1.98% area and average power overhead, respectively, novel instructions such as Combined Population-Count-XNOR , Patch-Select , and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC, each of which replaces execution cycles of frequently used functions with 1 clock cycle that otherwise would have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory on a bit-level basis, that expedites reshaping intermediate data to be well-aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm 2 with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm 2 area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 Giga Operations Per Second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies including ResNet-20 and LeNet-5 for high-performance image classification, as well as a ConvNet and a multilayer perceptron for low-power physiological applications were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can expedite the performance by approximately 5×. When other new instructions are added to a RISC machine with existing population-count instruction, the performance is increased by 58% on average. To compare the performance of the BiNMAC with other commercial-off-the-shelf platforms, the case studies with their double-precision floating-point models are also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%--9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. On low power settings and within a margin of ∼3.7%--5.5% accuracy loss compared to ARM Cortex-A57 CPU implementation, BiNMAC is roughly ∼9.7×--17.2× (or 38.8×--68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
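The Combined Population-Count-XNOR instruction mentioned above fuses the core kernel of binary networks: with ±1 weights and activations packed as bits, a dot product reduces to an XNOR followed by a population count. The small Python illustration below shows that identity only; it is not BiNMAC's ISA or packing format.

```python
def pack_bits(vec):
    """Pack a +/-1 vector into an integer (bit = 1 for +1, 0 for -1)."""
    word = 0
    for i, v in enumerate(vec):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors via XNOR + popcount."""
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")  # popcount(XNOR)
    return 2 * matches - n      # +1 for each agreeing bit, -1 for each differing bit

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))
print(binary_dot(pack_bits(a), pack_bits(b), len(a)))   # 2
```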
24

Park, Sang-Soo, and Ki-Seok Chung. "CENNA: Cost-Effective Neural Network Accelerator". Electronics 9, no. 1 (January 10, 2020): 134. http://dx.doi.org/10.3390/electronics9010134.

Abstract
Convolutional neural networks (CNNs) are widely adopted in various applications. State-of-the-art CNN models deliver excellent classification performance, but they require a large amount of computation and data exchange because they typically employ many processing layers. Among these processing layers, convolution layers, which carry out many multiplications and additions, account for a major portion of computation and memory access. Therefore, reducing the amount of computation and memory access is the key for high-performance CNNs. In this study, we propose a cost-effective neural network accelerator, named CENNA, whose hardware cost is reduced by employing a cost-centric matrix multiplication that employs both Strassen’s multiplication and a naïve multiplication. Furthermore, the convolution method using the proposed matrix multiplication can minimize data movement by reusing both the feature map and the convolution kernel without any additional control logic. In terms of throughput, power consumption, and silicon area, the efficiency of CENNA is up to 88 times higher than that of conventional designs for the CNN inference.
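Strassen's trick, one ingredient of the cost-centric multiplication above, computes a 2×2 block product with 7 multiplications instead of 8. The sketch below shows one level of that recursion in NumPy; how CENNA combines it with naïve multiplication in hardware is specific to the paper.

```python
import numpy as np

def strassen_2x2_blocks(A, B):
    """One level of Strassen's algorithm on square matrices split into 2x2 blocks."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)   # 7 block multiplications instead of 8
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(7)
A, B = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(np.allclose(strassen_2x2_blocks(A, B), A @ B))   # True
```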
25

Kim, Dongyoung, Junwhan Ahn, and Sungjoo Yoo. "ZeNA: Zero-Aware Neural Network Accelerator". IEEE Design & Test 35, no. 1 (February 2018): 39–46. http://dx.doi.org/10.1109/mdat.2017.2741463.

26

Chen, Tianshi, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. "A High-Throughput Neural Network Accelerator". IEEE Micro 35, no. 3 (May 2015): 24–32. http://dx.doi.org/10.1109/mm.2015.41.

27

To, Chun-Hao, Eduardo Rozo, Elisabeth Krause, Hao-Yi Wu, Risa H. Wechsler, and Andrés N. Salcedo. "LINNA: Likelihood Inference Neural Network Accelerator". Journal of Cosmology and Astroparticle Physics 2023, no. 01 (January 1, 2023): 016. http://dx.doi.org/10.1088/1475-7516/2023/01/016.

Abstract
Abstract Bayesian posterior inference of modern multi-probe cosmological analyses incurs massive computational costs. For instance, depending on the combinations of probes, a single posterior inference for the Dark Energy Survey (DES) data had a wall-clock time that ranged from 1 to 21 days using a state-of-the-art computing cluster with 100 cores. These computational costs have severe environmental impacts and the long wall-clock time slows scientific productivity. To address these difficulties, we introduce LINNA: the Likelihood Inference Neural Network Accelerator. Relative to the baseline DES analyses, LINNA reduces the computational cost associated with posterior inference by a factor of 8–50. If applied to the first-year cosmological analysis of Rubin Observatory's Legacy Survey of Space and Time (LSST Y1), we conservatively estimate that LINNA will save more than U.S. $300,000 on energy costs, while simultaneously reducing CO2 emission by 2,400 tons. To accomplish these reductions, LINNA automatically builds training data sets, creates neural network emulators, and produces a Markov chain that samples the posterior. We explicitly verify that LINNA accurately reproduces the first-year DES (DES Y1) cosmological constraints derived from a variety of different data vectors with our default code settings, without needing to retune the algorithm every time. Further, we find that LINNA is sufficient for enabling accurate and efficient sampling for LSST Y10 multi-probe analyses. We make LINNA publicly available at https://github.com/chto/linna, to enable others to perform fast and accurate posterior inference in contemporary cosmological analyses.
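The workflow LINNA automates, training a neural emulator of an expensive likelihood and then sampling the posterior through the emulator, can be sketched generically. The toy likelihood, emulator size, and hand-written Metropolis sampler below are illustrative stand-ins, not the LINNA pipeline.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(8)

def expensive_loglike(theta):                 # stand-in for a slow cosmology likelihood
    return -0.5 * np.sum(((theta - 0.3) / 0.3) ** 2)

# 1) Build a training set by evaluating the slow likelihood on sampled points.
train_theta = rng.uniform(-1, 1, size=(500, 2))
train_logl = np.array([expensive_loglike(t) for t in train_theta])

# 2) Train a small neural-network emulator of the log-likelihood.
emu = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
emu.fit(train_theta, train_logl)

# 3) Metropolis-Hastings sampling driven by the cheap emulator instead of the slow code.
theta, logl = np.zeros(2), emu.predict(np.zeros((1, 2)))[0]
chain = []
for _ in range(5000):
    prop = theta + 0.1 * rng.normal(size=2)
    logl_prop = emu.predict(prop.reshape(1, -1))[0]
    if np.log(rng.random()) < logl_prop - logl:
        theta, logl = prop, logl_prop
    chain.append(theta.copy())
print(np.mean(chain, axis=0))   # posterior mean estimate; the true peak is at (0.3, 0.3)
```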
28

Liang, Yong, Junwen Tan, Zhisong Xie, Zetao Chen, Daoqian Lin, and Zhenhao Yang. "Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence". Sensors 24, no. 1 (December 31, 2023): 240. http://dx.doi.org/10.3390/s24010240.

Abstract
In recent years, edge intelligence (EI) has emerged, combining edge computing with AI, and specifically deep learning, to run AI algorithms directly on edge devices. In practical applications, EI faces challenges related to computational power, power consumption, size, and cost, with the primary challenge being the trade-off between computational power and power consumption. This has rendered traditional computing platforms unsustainable, making heterogeneous parallel computing platforms a crucial pathway for implementing EI. In our research, we leveraged the Xilinx Zynq 7000 heterogeneous computing platform, employed high-level synthesis (HLS) for design, and implemented two different accelerators for LeNet-5 using loop unrolling and pipelining optimization techniques. The experimental results show that when running at a clock speed of 100 MHz, the PIPELINE accelerator, compared to the UNROLL accelerator, experiences an 8.09% increase in power consumption but speeds up by 14.972 times, making the PIPELINE accelerator superior in performance. Compared to the CPU, the PIPELINE accelerator reduces power consumption by 91.37% and speeds up by 70.387 times, while compared to the GPU, it reduces power consumption by 93.35%. This study provides two different optimization schemes for edge intelligence applications through design and experimentation and demonstrates the impact of different quantization methods on FPGA resource consumption. These experimental results can provide a reference for practical applications, thereby providing a reference hardware acceleration scheme for edge intelligence applications.
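A first-order cycle model shows why pipelining a loop pays off so dramatically: with N iterations of depth D cycles and an initiation interval II, a pipelined loop finishes in roughly (N-1)*II + D cycles instead of N*D. The numbers below are arbitrary illustrative values, not measurements from the paper, and this generic model is not the authors' analysis.

```python
def sequential_cycles(n_iter, depth):
    return n_iter * depth

def pipelined_cycles(n_iter, depth, ii=1):
    # A new iteration starts every II cycles; the last one still needs `depth` cycles.
    return (n_iter - 1) * ii + depth

n, depth = 10_000, 12                     # hypothetical loop trip count and iteration latency
print(sequential_cycles(n, depth))        # 120000 cycles
print(pipelined_cycles(n, depth, ii=1))   # 10011 cycles, roughly a 12x speedup
```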
29

Liu, Yang, Yiheng Zhang, Xiaoran Hao, Lan Chen, Mao Ni, Ming Chen, and Rong Chen. "Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering". Electronics 13, no. 5 (March 4, 2024): 975. http://dx.doi.org/10.3390/electronics13050975.

Abstract
Convolutional neural networks have been widely applied in the field of computer vision. In convolutional neural networks, convolution operations account for more than 90% of the total computational workload. The current mainstream approach to achieving high energy-efficient convolution operations is through dedicated hardware accelerators. Convolution operations involve a significant amount of weights and input feature data. Due to limited on-chip cache space in accelerators, there is a significant amount of off-chip DRAM memory access involved in the computation process. The latency of DRAM access is 20 times higher than that of SRAM, and the energy consumption of DRAM access is 100 times higher than that of multiply–accumulate (MAC) units. It is evident that the “memory wall” and “power wall” issues in neural network computation remain challenging. This paper presents the design of a hardware accelerator for convolutional neural networks. It employs a dataflow optimization strategy based on on-chip data reordering. This strategy improves on-chip data utilization and reduces the frequency of data exchanges between on-chip cache and off-chip DRAM. The experimental results indicate that compared to the accelerator without this strategy, it can reduce data exchange frequency by up to 82.9%.
30

Ro, Yuhwan, Eojin Lee, and Jung Ahn. "Evaluating the Impact of Optical Interconnects on a Multi-Chip Machine-Learning Architecture". Electronics 7, no. 8 (July 27, 2018): 130. http://dx.doi.org/10.3390/electronics7080130.

Abstract
Following trends that emphasize neural networks for machine learning, many studies regarding computing systems have focused on accelerating deep neural networks. These studies often propose utilizing the accelerator specialized in a neural network and the cluster architecture composed of interconnected accelerator chips. We observed that inter-accelerator communication within a cluster has a significant impact on the training time of the neural network. In this paper, we show the advantages of optical interconnects for multi-chip machine-learning architecture by demonstrating performance improvements through replacing electrical interconnects with optical ones in an existing multi-chip system. We propose to use highly practical optical interconnect implementation and devise an arithmetic performance model to fairly assess the impact of optical interconnects on a machine-learning accelerator platform. In our evaluation of nine Convolutional Neural Networks with various input sizes, 100 and 400 Gbps optical interconnects reduce the training time by an average of 20.6% and 35.6%, respectively, compared to the baseline system with 25.6 Gbps electrical ones.
31

Chen, Zhimei. "Hardware Accelerated Optimization of Deep Learning Model on Artificial Intelligence Chip". Frontiers in Computing and Intelligent Systems 6, no. 2 (December 15, 2023): 11–14. http://dx.doi.org/10.54097/fcis.v6i2.03.

Abstract
With the rapid development of deep learning technology, the demand for computing resources is increasing, and the accelerated optimization of hardware on artificial intelligence (AI) chip has become one of the key ways to solve this challenge. This paper aims to explore the hardware acceleration optimization strategy of deep learning model on AI chip to improve the training and inference performance of the model. In this paper, the method and practice of optimizing deep learning model on AI chip are deeply analyzed by comprehensively considering the hardware characteristics such as parallel processing ability, energy-efficient computing, neural network accelerator, flexibility and programmability, high integration and heterogeneous computing structure. By designing and implementing an efficient convolution accelerator, the computational efficiency of the model is improved. The introduction of energy-efficient computing effectively reduces energy consumption, which provides feasibility for the practical application of mobile devices and embedded systems. At the same time, the optimization design of neural network accelerator becomes the core of hardware acceleration, and deep learning calculation such as convolution and matrix operation are accelerated through special hardware structure, which provides strong support for the real-time performance of the model. By analyzing the actual application cases of hardware accelerated optimization in different application scenarios, this paper highlights the key role of hardware accelerated optimization in improving the performance of deep learning model. Hardware accelerated optimization not only improves the computing efficiency, but also provides efficient and intelligent computing support for AI applications in different fields.
32

Huang, Hongmin, Zihao Liu, Taosheng Chen, Xianghong Hu, Qiming Zhang, and Xiaoming Xiong. "Design Space Exploration for YOLO Neural Network Accelerator". Electronics 9, no. 11 (November 16, 2020): 1921. http://dx.doi.org/10.3390/electronics9111921.

Abstract
The You Only Look Once (YOLO) neural network has great advantages and extensive applications in computer vision. The convolutional layers are the most important part of the neural network and take up most of the computation time. Improving the efficiency of the convolution operations can greatly increase the speed of the neural network. Field programmable gate arrays (FPGAs) have been widely used in accelerators for convolutional neural networks (CNNs) thanks to their configurability and parallel computing. This paper proposes a design space exploration for the YOLO neural network based on FPGA. A data block transmission strategy is proposed and a multiply and accumulate (MAC) design, which consists of two 14 × 14 processing element (PE) matrices, is designed. The PE matrices are configurable for different CNNs according to the given required functions. In order to take full advantage of the limited logical resources and the memory bandwidth on the given FPGA device and to simultaneously achieve the best performance, an improved roofline model is used to evaluate the hardware design to balance the computing throughput and the memory bandwidth requirement. The accelerator achieves 41.99 giga operations per second (GOPS) and consumes 7.50 W running at the frequency of 100 MHz on the Xilinx ZC706 board.
33

Brennsteiner, Stefan, Tughrul Arslan, John Thompson, and Andrew McCormick. "A Real-Time Deep Learning OFDM Receiver". ACM Transactions on Reconfigurable Technology and Systems 15, no. 3 (September 30, 2022): 1–25. http://dx.doi.org/10.1145/3494049.

Abstract
Machine learning in the physical layer of communication systems holds the potential to improve performance and simplify design methodology. Many algorithms have been proposed; however, the model complexity is often unfeasible for real-time deployment. The real-time processing capability of these systems has not been proven yet. In this work, we propose a novel, less complex, fully connected neural network to perform channel estimation and signal detection in an orthogonal frequency division multiplexing system. The memory requirement, which is often the bottleneck for fully connected neural networks, is reduced by ≈ 27 times by applying known compression techniques in a three-step training process. Extensive experiments were performed for pruning and quantizing the weights of the neural network detector. Additionally, Huffman encoding was used on the weights to further reduce memory requirements. Based on this approach, we propose the first field-programmable gate array based, real-time capable neural network accelerator, specifically designed to accelerate the orthogonal frequency division multiplexing detector workload. The accelerator is synthesized for a Xilinx RFSoC field-programmable gate array, uses small-batch processing to increase throughput, efficiently supports branching neural networks, and implements superscalar Huffman decoders.
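Huffman coding of the already pruned and quantized weights is a standard entropy-coding step; a compact Python sketch of building the code and measuring the resulting bits per weight is shown below. The synthetic weight distribution is an assumption, and the superscalar hardware decoder is the paper's own contribution, not reproduced here.

```python
import heapq
from collections import Counter
import numpy as np

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bit string) from a list of symbols."""
    freq = Counter(symbols)
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for s in lo[2]:
            lo[2][s] = "0" + lo[2][s]     # prefix codes in the left subtree with 0
        for s in hi[2]:
            hi[2][s] = "1" + hi[2][s]     # prefix codes in the right subtree with 1
        heapq.heappush(heap, [lo[0] + hi[0], tie, {**lo[2], **hi[2]}])
        tie += 1
    return heap[0][2]

rng = np.random.default_rng(9)
# Quantized weights are typically peaked around zero, so entropy coding helps.
weights = np.clip(np.round(rng.normal(scale=1.5, size=10_000)), -8, 7).astype(int)
code = huffman_code(weights.tolist())
bits = sum(len(code[w]) for w in weights.tolist())
print(f"{bits / weights.size:.2f} bits/weight vs 4-bit fixed encoding")
```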
34

Cho, Mannhee, and Youngmin Kim. "FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit". Electronics 10, no. 22 (November 19, 2021): 2859. http://dx.doi.org/10.3390/electronics10222859.

Abstract
Convolutional neural networks (CNNs) are widely used in modern applications for their versatility and high classification accuracy. Field-programmable gate arrays (FPGAs) are considered to be suitable platforms for CNNs based on their high performance, rapid development, and reconfigurability. Although many studies have proposed methods for implementing high-performance CNN accelerators on FPGAs using optimized data types and algorithm transformations, accelerators can be optimized further by investigating more efficient uses of FPGA resources. In this paper, we propose an FPGA-based CNN accelerator using multiple approximate accumulation units based on a fixed-point data type. We implemented the LeNet-5 CNN architecture, which performs classification of handwritten digits using the MNIST handwritten digit dataset. The proposed accelerator was implemented, using a high-level synthesis tool on a Xilinx FPGA. The proposed accelerator applies an optimized fixed-point data type and loop parallelization to improve performance. Approximate operation units are implemented using FPGA logic resources instead of high-precision digital signal processing (DSP) blocks, which are inefficient for low-precision data. Our accelerator model achieves 66% less memory usage and approximately 50% reduced network latency, compared to a floating point design and its resource utilization is optimized to use 78% fewer DSP blocks, compared to general fixed-point designs.
35

Choubey, Abhishek, and Shruti Bhargava Choubey. "A Promising Hardware Accelerator with PAST Adder". Advances in Science and Technology 105 (April 2021): 241–48. http://dx.doi.org/10.4028/www.scientific.net/ast.105.241.

Abstract
Recent neural network research has demonstrated a significant benefit in machine learning compared to conventional algorithms based on handcrafted models and features. In regions such as video, speech and image recognition, the neural network is now widely adopted. But the high complexity of neural network inference in computation and storage poses great differences on its application. These networks are computer-intensive algorithms that currently require the execution of dedicated hardware. In this case, we point out the difficulty of Adders (MOAs) and their high-resource utilization in a CNN implementation of FPGA .to address these challenge a parallel self-time adder is implemented which mainly aims at minimizing the amount of transistors and estimating different factors for PASTA, i.e. field, power, delay.
36

de Sousa, André L., Mário P. Véstias, and Horácio C. Neto. "Multi-Model Inference Accelerator for Binary Convolutional Neural Networks". Electronics 11, no. 23 (November 30, 2022): 3966. http://dx.doi.org/10.3390/electronics11233966.

Abstract
Binary convolutional neural networks (BCNN) have shown good accuracy for small to medium neural network models. Their extreme quantization of weights and activations reduces off-chip data transfer and greatly reduces the computational complexity of convolutions. Further reduction in the complexity of a BCNN model for fast execution can be achieved with model size reduction at the cost of network accuracy. In this paper, a multi-model inference technique is proposed to reduce the execution time of the binarized inference process without accuracy reduction. The technique considers a cascade of neural network models with different computation/accuracy ratios. A parameterizable binarized neural network with different trade-offs between complexity and accuracy is used to obtain multiple network models. We also propose a hardware accelerator to run multi-model inference throughput in embedded systems. The multi-model inference accelerator is demonstrated on low-density Zynq-7010 and Zynq-7020 FPGA devices, classifying images from the CIFAR-10 dataset. The proposed accelerator improves the frame rate per number of LUTs by 7.2× those of previous solutions on a ZYNQ7020 FPGA with similar accuracy. This shows the effectiveness of the multi-model inference technique and the efficiency of the proposed hardware accelerator.
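One common way to realize such a multi-model cascade is to let the cheapest model answer first and escalate to a larger one only when its prediction is not confident. Whether this particular confidence-threshold rule matches the paper's selection mechanism is an assumption, and the model callables below are placeholders.

```python
import numpy as np

def cascade_predict(x, small_model, large_model, threshold=0.9):
    """Run the cheap model first; fall back to the accurate model when unsure."""
    probs = small_model(x)
    if probs.max() >= threshold:          # easy input: the cheap answer is good enough
        return int(probs.argmax()), "small"
    probs = large_model(x)                # hard input: pay for the bigger network
    return int(probs.argmax()), "large"

# Placeholder "models" returning class probabilities for a CIFAR-10-style task.
small = lambda x: np.array([0.3, 0.4] + [0.3 / 8] * 8)   # unsure: max probability 0.4
large = lambda x: np.array([0.01] * 9 + [0.91])          # confident fallback model

label, used = cascade_predict(np.zeros((3, 32, 32)), small, large)
print(label, used)   # 9 large
```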
37

Kang, Soongyu, Seongjoo Lee and Yunho Jung. "Design of Network-on-Chip-Based Restricted Coulomb Energy Neural Network Accelerator on FPGA Device". Sensors 24, no. 6 (15 March 2024): 1891. http://dx.doi.org/10.3390/s24061891.

Abstract
Sensor applications in internet of things (IoT) systems, coupled with artificial intelligence (AI) technology, are becoming an increasingly significant part of modern life. For low-latency AI computation in IoT systems, there is a growing preference for edge-based computing over cloud-based alternatives. The restricted coulomb energy neural network (RCE-NN) is a machine learning algorithm well-suited for implementation on edge devices due to its simple learning and recognition scheme. In addition, because the RCE-NN generates neurons as needed, it is easy to adjust the network structure and learn additional data. Therefore, the RCE-NN can provide edge-based real-time processing for various sensor applications. However, previous RCE-NN accelerators have limited scalability when the number of neurons increases. In this paper, we propose a network-on-chip (NoC)-based RCE-NN accelerator and present the results of implementation on a field-programmable gate array (FPGA). NoC is an effective solution for managing massive interconnections. The proposed RCE-NN accelerator utilizes a hierarchical–star (H–star) topology, which efficiently handles a large number of neurons, along with routers specifically designed for the RCE-NN. These approaches result in only a slight decrease in the maximum operating frequency as the number of neurons increases. Consequently, the maximum operating frequency of the proposed RCE-NN accelerator with 512 neurons increased by 126.1% compared to a previous RCE-NN accelerator. This enhancement was verified with two datasets for gas and sign language recognition, achieving accelerations of up to 54.8% in learning time and up to 45.7% in recognition time. The NoC scheme of the proposed RCE-NN accelerator is an appropriate solution to ensure the scalability of the neural network while providing high-performance on-chip learning and recognition.
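
The following is a simplified software model of the restricted Coulomb energy learning rule as it is commonly described (the hypersphere radii, the shrink-on-conflict step, and the toy data are illustrative assumptions); it only shows why the neuron count grows with the training data, which is the scalability issue the NoC architecture addresses, and says nothing about the hardware itself:

    import numpy as np

    class RCENetwork:
        def __init__(self, r_max=1.0):
            self.neurons = []          # list of (center, radius, label)
            self.r_max = r_max

        def train_one(self, x, label):
            covered = False
            for i, (c, r, lab) in enumerate(self.neurons):
                d = np.linalg.norm(x - c)
                if d < r:
                    if lab == label:
                        covered = True                    # correctly covered
                    else:
                        self.neurons[i] = (c, d, lab)     # shrink conflicting neuron
            if not covered:
                self.neurons.append((np.array(x, float), self.r_max, label))

        def classify(self, x):
            hits = [lab for c, r, lab in self.neurons if np.linalg.norm(x - c) < r]
            return max(set(hits), key=hits.count) if hits else None

    net = RCENetwork()
    for x, y in [([0.1, 0.1], 0), ([0.9, 0.9], 1), ([0.2, 0.0], 0)]:
        net.train_one(np.array(x), y)
    print(net.classify(np.array([0.15, 0.05])), len(net.neurons))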
38

Cosatto, E. and H. P. Craf. "A neural network accelerator for image analysis". IEEE Micro 15, no. 3 (June 1995): 32–38. http://dx.doi.org/10.1109/40.387680.

39

Kuznar, Damian, Robert Szczygiel, Piotr Maj and Anna Kozioł. "Design of artificial neural network hardware accelerator". Journal of Instrumentation 18, no. 04 (1 April 2023): C04013. http://dx.doi.org/10.1088/1748-0221/18/04/c04013.

Abstract
We present the design of a scalable processor capable of providing artificial neural network (ANN) functionality, together with in-house developed tools for automatic conversion of an ANN model designed with the TensorFlow library into HDL code. The hardware is described in SystemVerilog, and the synthesized processor module can perform neural network calculations at clock frequencies exceeding 100 MHz. Our in-house software tool for ANN conversion supports translation of an arbitrary multilayer perceptron neural network into a state machine module that performs the necessary calculations. The design is also dynamically reconfigurable, so the ANN operating on the hardware can be changed after it is deployed as an ASIC. The project targets an in-pixel implementation for X-ray photon energy estimation. The energy estimate is to be delivered with an accuracy exceeding that of the ADC that feeds the ANN with data.
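
The conversion tool is not described in detail in the abstract; as a sketch of one step such a flow typically involves (the Q-format, bit width, and SystemVerilog parameter layout below are assumptions, not the authors' output format), trained MLP weights can be quantized to fixed point and emitted as HDL constants:

    import numpy as np

    def weights_to_sv_params(weights, name, frac_bits=8, width=16):
        # Convert one layer's float weights to signed fixed-point integers and
        # emit them as a SystemVerilog localparam array (illustrative format).
        q = np.clip(np.round(weights * (1 << frac_bits)),
                    -(1 << (width - 1)), (1 << (width - 1)) - 1).astype(int)
        flat = ", ".join(f"{width}'sd{v}" if v >= 0 else f"-{width}'sd{-v}"
                         for v in q.flatten())
        return f"localparam signed [{width-1}:0] {name} [0:{q.size-1}] = '{{{flat}}};"

    layer0 = np.array([[0.25, -1.5], [0.75, 0.125]])   # toy 2x2 weight matrix
    print(weights_to_sv_params(layer0, "W0"))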
40

Kumar, Pramod. "Review of Advanced Methods in Hardware Acceleration for Deep Neural Networks". International Journal for Research in Applied Science and Engineering Technology 12, no. 5 (31 May 2024): 4523–29. http://dx.doi.org/10.22214/ijraset.2024.62595.

Abstract
Convolutional neural networks have become very effective at tasks such as object detection, providing human-like accuracy. However, their practical implementation requires significant hardware resources and memory bandwidth. In the recent past, a great deal of research has been carried out on implementing such neural networks more efficiently in hardware. We focus on FPGAs for hardware implementation because of their flexibility in customization for such neural network architectures. In this paper we discuss the metrics for an efficient hardware accelerator and the general methods available for achieving an efficient design. Further, we discuss the methods used in recent research to implement deep neural networks, particularly for object-detection-related applications. These methods range from ASIC designs such as TPUs [1] for on-chip acceleration and state-of-the-art open-source designs such as Gemini, to techniques such as hardware reuse, reconfigurable nodes, and approximation in computation as a trade-off between speed and accuracy. This paper provides a valuable summary for researchers starting out in the field of hardware accelerator design for neural networks.
41

Wang, Yuejiao, Zhong Ma and Zunming Yang. "Sequential Characteristics Based Operators Disassembly Quantization Method for LSTM Layers". Applied Sciences 12, no. 24 (12 December 2022): 12744. http://dx.doi.org/10.3390/app122412744.

Abstract
Embedded computing platforms such as neural network accelerators deploying neural network models need to quantize the values into low-bit integers through quantization operations. However, most current embedded computing platforms with a fixed-point architecture do not directly support performing the quantization operation for the LSTM layer. Meanwhile, the influence of sequential input data for LSTM has not been taken into account by quantization algorithms. Aiming at these two technical bottlenecks, a new sequential-characteristics-based operators disassembly quantization method for LSTM layers is proposed. Specifically, the calculation process of the LSTM layer is split into multiple regular layers supported by the neural network accelerator. The quantization-parameter-generation process is designed as a sequential-characteristics-based combination strategy for sequential and diverse image groups. Therefore, LSTM is converted into multiple mature operators for single-layer quantization and deployed on the neural network accelerator. Comparison experiments with the state of the art show that the proposed quantization method has comparable or even better performance than the full-precision baseline in the field of character-/word-level language prediction and image classification applications. The proposed method has strong application potential in the subsequent addition of novel operators for future neural network accelerators.
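
As a simplified illustration of the disassembly idea (the quantization scales below are arbitrary and the paper's sequential-characteristics-based calibration is not reproduced), one LSTM time step can be expressed as separate matrix-multiply, add, and activation operators, with the matrix multiplies running in integer arithmetic the way a fixed-point accelerator would execute them:

    import numpy as np

    def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
    def quantize(x, scale): return np.round(x / scale).astype(np.int32)
    def dequantize(q, scale): return q.astype(np.float32) * scale

    def lstm_step_disassembled(x, h, c, W, U, b, s_in=0.01, s_w=0.005):
        # One LSTM step split into regular operators: integer matmuls for the
        # gate pre-activations, then bias, activation, and elementwise layers.
        qx, qh = quantize(x, s_in), quantize(h, s_in)
        qW, qU = quantize(W, s_w), quantize(U, s_w)
        gates = dequantize(qW @ qx + qU @ qh, s_in * s_w) + b
        i, f, g, o = np.split(gates, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

    n = 4
    rng = np.random.default_rng(0)
    h, c = np.zeros(n), np.zeros(n)
    W, U, b = rng.normal(size=(4 * n, n)), rng.normal(size=(4 * n, n)), np.zeros(4 * n)
    h, c = lstm_step_disassembled(rng.normal(size=n), h, c, W, U, b)
    print(h)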
42

Seto, Kenshu. "A Survey on System-Level Design of Neural Network Accelerators". Journal of Integrated Circuits and Systems 16, no. 2 (18 August 2021): 1–10. http://dx.doi.org/10.29292/jics.v16i2.505.

Abstract
In this paper, we present a brief survey on the system-level optimizations used for convolutional neural network (CNN) inference accelerators. For the nested loop of convolutional (CONV) layers, we discuss the effects of loop optimizations such as loop interchange, tiling, unrolling and fusion on CNN accelerators. We also explain memory optimizations that are effective with the loop optimizations. In addition, we discuss streaming architectures and single computation engine architectures that are commonly used in CNN accelerators. Optimizations for CNN models are briefly explained, followed by the recent trends and future directions of the CNN accelerator design.
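
As a textbook-style sketch of one of the loop optimizations surveyed (the tile sizes here are arbitrary), tiling the output-channel and output-width loops of a convolution bounds the working set that must sit in on-chip buffers without changing the result:

    import numpy as np

    def conv_tiled(ifm, weights, tile_oc=4, tile_ow=8):
        # Naive convolution with the output-channel and output-width loops
        # tiled, mimicking how an accelerator reuses on-chip buffers per tile.
        C, H, W = ifm.shape
        M, _, K, _ = weights.shape
        OH, OW = H - K + 1, W - K + 1
        ofm = np.zeros((M, OH, OW))
        for oc0 in range(0, M, tile_oc):               # tile over output channels
            for ow0 in range(0, OW, tile_ow):          # tile over output columns
                for oc in range(oc0, min(oc0 + tile_oc, M)):
                    for oh in range(OH):
                        for ow in range(ow0, min(ow0 + tile_ow, OW)):
                            patch = ifm[:, oh:oh + K, ow:ow + K]
                            ofm[oc, oh, ow] = np.sum(patch * weights[oc])
        return ofm

    ifm = np.random.rand(3, 8, 8)
    w = np.random.rand(6, 3, 3, 3)
    print(np.allclose(conv_tiled(ifm, w), conv_tiled(ifm, w, tile_oc=6, tile_ow=6)))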
43

Park, Sang-Soo and Ki-Seok Chung. "CONNA: Configurable Matrix Multiplication Engine for Neural Network Acceleration". Electronics 11, no. 15 (29 July 2022): 2373. http://dx.doi.org/10.3390/electronics11152373.

Abstract
Convolutional neural networks (CNNs) have demonstrated promising results in various applications such as computer vision, speech recognition, and natural language processing. One of the key computations in many CNN applications is matrix multiplication, which accounts for a significant portion of computation. Therefore, hardware accelerators to effectively speed up the computation of matrix multiplication have been proposed, and several studies have attempted to design hardware accelerators that perform matrix multiplication better in terms of both speed and power consumption. Typically, accelerators with either a two-dimensional (2D) systolic array structure or a single instruction multiple data (SIMD) architecture are effective only when the input matrix has a shape that is close to or similar to a square. However, several CNN applications require multiplications of non-square matrices with various shapes and dimensions, and such irregular shapes lead to poor utilization efficiency of the processing elements (PEs). This study proposes a configurable engine for neural network acceleration, called CONNA, whose computation engine can conduct matrix multiplications with highly utilized computing units, regardless of the access patterns, shapes, and dimensions of the input matrices, by changing the shape of the matrix multiplication conducted in the physical array. To verify the functionality of the CONNA accelerator, we implemented CONNA as an SoC platform that integrates a RISC-V MCU with CONNA on a Xilinx VC707 FPGA. SqueezeNet on CONNA achieved an inference performance of 100 frames per second (FPS) with 2.36 mm² and 83.55 mW in a 65 nm process, improving efficiency by up to 34.1 times compared with existing accelerators in terms of FPS, silicon area, and power consumption.
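
CONNA's actual configuration mechanism is not specified in the abstract; the sketch below only illustrates the underlying motivation with assumed numbers: for a fixed budget of processing elements, choosing a logical array shape that matches the output-matrix dimensions keeps fewer PEs idle on non-square workloads.

    def pe_utilization(M, N, rows, cols):
        # Fraction of PE slots doing useful work when an MxN output is tiled
        # onto a rows x cols array (padding in partial tiles is wasted work).
        tiles_m = -(-M // rows)          # ceiling division
        tiles_n = -(-N // cols)
        return (M * N) / (tiles_m * tiles_n * rows * cols)

    def best_shape(M, N, num_pes=64):
        # Pick the logical shape (rows * cols == num_pes) with the highest
        # utilization for this particular output shape.
        shapes = [(r, num_pes // r) for r in range(1, num_pes + 1) if num_pes % r == 0]
        return max(shapes, key=lambda s: pe_utilization(M, N, *s))

    # A tall, narrow output: a fixed 8x8 array wastes work, a 64x1 shape does not.
    print(best_shape(128, 3), pe_utilization(128, 3, 8, 8), pe_utilization(128, 3, 64, 1))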
44

Paulenka, D. A. "Comparative analysis of single-board computers for the development of a microarchitectural computing system for fire detection". Informatics 21, no. 2 (28 June 2024): 73–85. http://dx.doi.org/10.37661/1816-0301-2024-21-2-73-85.

Abstract
Objectives. The purpose of the work is to select the basic computing microplatform of the onboard microarchitectural computing complex for the detection of anomalous situations in the territory of the Republic of Belarus from space on the basis of artificial intelligence methods. Methods. The method of comparative analysis is used to select a computing platform. A series of performance tests and comparative analysis (benchmarking) are performed on the selected equipment. The comparative and benchmarking analyses are performed in accordance with the terms of reference for the current project. Results. A comparative analysis and performance testing of the Raspberry Pi 4 Model B and Cool Pi 4 Model B single-board computers, as well as the Google Coral USB Accelerator AI accelerator with Google Edge TPU, have been performed. The comparative analysis showed that the Raspberry Pi 4 Model B and Cool Pi 4 Model B fully meet the terms of reference for the current project. At the same time, the Cool Pi 4 Model B handles neural network calculations well, but four times slower than similar calculations on the Google Coral USB Accelerator. Neural network computations on the Raspberry Pi 4 Model B are 22 times slower than similar computations on the Google Coral USB Accelerator. The Cool Pi 4 Model B outperforms the Raspberry Pi 4 Model B by a factor of two to three for data copying and compression, and is almost six times faster for neural network computations. Conclusion. Although the Raspberry Pi 4 Model B meets the terms of reference of the project as a computational basis, when developing an onboard microarchitectural computing system for detecting anomalous situations it is worth using more powerful alternatives with built-in AI accelerators (e.g., Radxa Rock 5 Model A) or with an additional external AI accelerator (e.g., a combination of Cool Pi 4 Model B and Google Coral USB Accelerator). Using a Raspberry Pi 4 Model B with an additional AI accelerator is also acceptable and will speed up computations by several dozen times. AI accelerators provide the fastest neural network computations, but there are features related to the novelty of the technology that will be explored in further development.
45

Gao, Xiangang, Bin Wu, Peng Li and Zehuan Jing. "1D-CNN-Transformer for Radar Emitter Identification and Implemented on FPGA". Remote Sensing 16, no. 16 (12 August 2024): 2962. http://dx.doi.org/10.3390/rs16162962.

Abstract
Deep learning has brought great development to radar emitter identification technology. In addition, specific emitter identification (SEI), as a branch of radar emitter identification, has also benefited from it. However, the complexity of most deep learning algorithms makes it difficult to meet the low-power-consumption and high-performance processing requirements of SEI on embedded devices, so this article proposes solutions on both the software and hardware sides. On the software side, we design a Transformer variant network, the lightweight convolutional Transformer (LW-CT), which supports parameter sharing. We then cascade convolutional neural networks (CNNs) and the LW-CT to construct a one-dimensional CNN-Transformer (1D-CNN-Transformer) lightweight neural network model that can capture the long-range dependencies of radar emitter signals while extracting spatial-domain signal features. In terms of hardware, we design a low-power neural network accelerator based on an FPGA to perform real-time recognition of radar emitter signals. The accelerator not only provides high-efficiency computing engines for the network, but also a reconfigurable buffer called "Ping-pong CBUF" and a two-level pipeline architecture for the convolution layer to alleviate the bottleneck caused by the off-chip storage access bandwidth. Experimental results show that the algorithm achieves high SEI recognition performance with low computational overhead. In addition, the hardware acceleration platform not only meets the requirements of the radar emitter recognition system for low-power-consumption, high-performance processing, but also outperforms the accelerators in other papers in terms of the energy efficiency ratio of Transformer layer processing.
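
As a behavioural sketch of the ping-pong buffering idea (the function names and tile schedule are assumptions, and the real overlap of loading and computing comes from hardware concurrency that sequential Python does not model), the next tile is fetched into one buffer while the engine computes on the other:

    # Behavioural model of double buffering: buffer "pong" is filled with the
    # next tile while buffer "ping" is being consumed; in hardware this hides
    # off-chip access latency behind computation.
    def process_tiles(tiles, load, compute):
        buffers = [None, None]
        buffers[0] = load(tiles[0])                 # prime the first buffer
        results = []
        for i in range(len(tiles)):
            ping, pong = i % 2, 1 - (i % 2)
            if i + 1 < len(tiles):
                buffers[pong] = load(tiles[i + 1])  # prefetch the next tile
            results.append(compute(buffers[ping]))  # consume the current tile
        return results

    print(process_tiles([1, 2, 3, 4], load=lambda t: t * 10, compute=lambda b: b + 1))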
46

Wang, Hongzhe, Junjie Wang, Hao Hu, Guo Li, Shaogang Hu, Qi Yu, Zhen Liu, Tupei Chen, Shijie Zhou and Yang Liu. "Ultra-High-Speed Accelerator Architecture for Convolutional Neural Network Based on Processing-in-Memory Using Resistive Random Access Memory". Sensors 23, no. 5 (21 February 2023): 2401. http://dx.doi.org/10.3390/s23052401.

Abstract
Processing-in-Memory (PIM) based on Resistive Random Access Memory (RRAM) is an emerging acceleration architecture for artificial neural networks. This paper proposes an RRAM PIM accelerator architecture that does not use Analog-to-Digital Converters (ADCs) or Digital-to-Analog Converters (DACs). In addition, no extra memory is required, which avoids a large amount of data transportation in convolution computation. Partial quantization is introduced to reduce the accuracy loss. The proposed architecture can substantially reduce the overall power consumption and accelerate computation. The simulation results show that the image recognition rate for the Convolutional Neural Network (CNN) algorithm can reach 284 frames per second at 50 MHz using this architecture. The accuracy of the partial quantization remains almost unchanged compared to the algorithm without quantization.
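
The ADC/DAC-free readout scheme is specific to the paper and is not reproduced here; the sketch below only models the generic crossbar principle such PIM designs build on (the conductance range and linear mapping are arbitrary assumptions): weights are stored as cell conductances and a matrix-vector product appears as summed bit-line currents.

    import numpy as np

    def crossbar_mvm(weights, activations, g_on=1e-4, g_off=1e-7):
        # Map weights linearly onto the assumed conductance range, then let
        # Kirchhoff's current law do the accumulation: I = G @ V per bit line.
        w_min, w_max = weights.min(), weights.max()
        G = g_off + (weights - w_min) / (w_max - w_min) * (g_on - g_off)
        return G @ activations

    W = np.random.rand(4, 8)     # one crossbar tile of weights
    x = np.random.rand(8)        # word-line input voltages
    print(crossbar_mvm(W, x))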
47

Gowda, Kavitha Malali Vishveshwarappa, Sowmya Madhavan, Stefano Rinaldi, Parameshachari Bidare Divakarachari and Anitha Atmakur. "FPGA-Based Reconfigurable Convolutional Neural Network Accelerator Using Sparse and Convolutional Optimization". Electronics 11, no. 10 (22 May 2022): 1653. http://dx.doi.org/10.3390/electronics11101653.

Abstract
Nowadays, the dataflow architecture is considered a general solution for the acceleration of deep neural networks (DNNs) because of its high parallelism. However, conventional DNN accelerators offer only restricted flexibility for diverse network models. To overcome this, a reconfigurable convolutional neural network (RCNN) accelerator (the CNN being one type of DNN) needs to be developed on the field-programmable gate array (FPGA) platform. In this paper, sparse optimization of weights (SOW) and convolutional optimization (CO) are proposed to improve the performance of the RCNN accelerator. The combination of SOW and CO is used to optimize the feature map and weight sizes of the RCNN accelerator; therefore, the hardware resources consumed by the RCNN are minimized in the FPGA. The performance of RCNN-SOW-CO is analyzed in terms of feature map size, weight size, sparseness of the input feature map (IFM), weight parameter proportion, block random access memory (BRAM), digital signal processing (DSP) elements, look-up tables (LUTs), slices, delay, power, and accuracy. Existing architectures, namely OIDSCNN, LP-CNN, and DPR-NN, are used to demonstrate the efficiency of RCNN-SOW-CO. The LUT count of RCNN-SOW-CO with AlexNet implemented on the Zynq-7020 is 5150, which is lower than that of OIDSCNN and DPR-NN.
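
As a minimal sketch of the kind of computation that sparse weight optimization enables (the pruning threshold and the zero-skipping scheme below are assumptions, not the paper's SOW algorithm), small weights are zeroed so the corresponding multiplications can be skipped entirely:

    import numpy as np

    def prune_weights(w, threshold=0.1):
        # Zero out small weights so a zero-skipping datapath has work to skip.
        return np.where(np.abs(w) < threshold, 0.0, w)

    def sparse_dot(w, a):
        # Only multiply where the weight is non-zero, mimicking zero-skipping PEs.
        nz = np.nonzero(w)[0]
        return sum(w[i] * a[i] for i in nz)

    w = prune_weights(np.random.randn(64) * 0.2)
    a = np.random.randn(64)
    print(sparse_dot(w, a), float(np.dot(w, a)), int(np.count_nonzero(w)), "nonzeros")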
48

Hou, Jia, Zichu Liu, Zepeng Yang and Chen Yang. "Hardware Trojan Attacks on the Reconfigurable Interconnections of Field-Programmable Gate Array-Based Convolutional Neural Network Accelerators and a Physically Unclonable Function-Based Countermeasure Detection Technique". Micromachines 15, no. 1 (19 January 2024): 149. http://dx.doi.org/10.3390/mi15010149.

Abstract
Convolutional neural networks (CNNs) have demonstrated significant superiority in modern artificial intelligence (AI) applications. To accelerate the inference process of CNNs, reconfigurable CNN accelerators that support diverse networks are widely employed for AI systems. Given the ubiquitous deployment of these AI systems, there is a growing concern regarding the security of CNN accelerators and the potential attacks they may face, including hardware Trojans. This paper proposes a hardware Trojan designed to attack a crucial component of FPGA-based CNN accelerators: the reconfigurable interconnection network. Specifically, the hardware Trojan alters the data paths during activation, resulting in incorrect connections in the arithmetic circuit and consequently causing erroneous convolutional computations. To address this issue, the paper introduces a novel detection technique based on physically unclonable functions (PUFs) to safeguard the reconfigurable interconnection network against hardware Trojan attacks. Experimental results demonstrate that by incorporating a mere 0.27% hardware overhead to the accelerator, the proposed hardware Trojan can degrade the inference accuracy of popular neural network architectures, including LeNet, AlexNet, and VGG, by a significant range of 8.93% to 86.20%. The implemented arbiter-PUF circuit on a Xilinx Zynq XC7Z100 platform successfully detects the presence and location of hardware Trojans in a reconfigurable interconnection network. This research highlights the vulnerability of reconfigurable CNN accelerators to hardware Trojan attacks and proposes a promising detection technique to mitigate potential security risks. The findings underscore the importance of addressing hardware security concerns in the design and deployment of AI systems utilizing FPGA-based CNN accelerators.
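
The detection circuit of the paper is not reproduced here; the sketch below only shows the standard additive-delay model of an arbiter PUF, the building block the countermeasure is based on: the one-bit response is the sign of an accumulated, device-specific delay difference, so tampering with the evaluated paths changes the challenge-response behaviour.

    import numpy as np

    def arbiter_puf_response(challenge, delays):
        # Standard linear additive-delay model: the challenge bits are folded
        # into a parity feature vector, and the sign of its dot product with
        # the device-specific delay vector gives the response bit.
        phi = np.cumprod((1 - 2 * np.asarray(challenge))[::-1])[::-1]
        phi = np.append(phi, 1.0)
        return int(np.dot(delays, phi) > 0)

    rng = np.random.default_rng(1)
    delays = rng.normal(size=65)                  # device-specific delay signature
    challenge = rng.integers(0, 2, size=64)
    print(arbiter_puf_response(challenge, delays))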
49

C., Dr Aarthi and Kowsalya S. "Feed Forward Neural Network With Column-Wise Matrix–Vector Multiplication on FPGAs". International Research Journal of Computer Science 11, no. 04 (5 April 2024): 355–59. http://dx.doi.org/10.26562/irjcs.2024.v1104.42.

Abstract
This article presents a reconfigurable accelerator for recurrent neural networks with fine-grained column-wise matrix-vector multiplication (RENOWN). We propose a novel latency-hiding architecture for a recurrent neural network (RNN) accelerator using column-wise matrix-vector multiplication instead of the state-of-the-art row-wise operation. This hardware (HW) architecture can eliminate data dependencies to improve the throughput of RNN inference systems. Besides, we introduce a configurable checkerboard tiling strategy that allows large weight matrices, while incorporating various configurations of element-based parallelism (EP) and vector-based parallelism (VP). These optimizations improve the exploitation of parallelism to increase HW utilization and enhance system throughput. Evaluation results show that our design can achieve over 29.6 tera operations per second (TOPS), which would be among the highest for field-programmable gate array (FPGA)-based RNN designs. Compared to state-of-the-art accelerators on FPGAs, our design achieves 3.7–14.8 times better performance and has the highest HW utilization.
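
As a small numerical sketch of the difference the abstract describes (plain Python loops standing in for hardware pipelines), row-wise evaluation finishes one output element at a time after a full dot product, while column-wise evaluation consumes one input element per step and updates every partial output, so computation can begin before the whole input vector is available:

    import numpy as np

    def mvm_row_wise(W, x):
        # Each output element waits for a complete dot product over one row.
        y = np.zeros(W.shape[0])
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                y[i] += W[i, j] * x[j]
        return y

    def mvm_column_wise(W, x):
        # Each step consumes one input element and updates all outputs at once,
        # so partial results for every output grow as the inputs arrive.
        y = np.zeros(W.shape[0])
        for j in range(W.shape[1]):
            y += W[:, j] * x[j]
        return y

    W, x = np.random.rand(4, 6), np.random.rand(6)
    print(np.allclose(mvm_row_wise(W, x), mvm_column_wise(W, x)))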
50

Deng, Bao and Hao Lv. "Research on Dynamic Reconfigurable Convolutional Neural Network Accelerator". Journal of Physics: Conference Series 1952, no. 3 (1 June 2021): 032045. http://dx.doi.org/10.1088/1742-6596/1952/3/032045.
