Dissertations on the topic "Neural network accelerator"

To see the other types of publications on this topic, follow the link: Neural network accelerator.


Consult the top 50 dissertations for your research on the topic "Neural network accelerator".

Next to every work in the bibliography, the option "Add to bibliography" is available. If you use it, the bibliographic reference for the chosen work will be formatted automatically in the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the scholarly publication in PDF format and read its online abstract, whenever the relevant details are available in the metadata.

Browse dissertations from a wide range of disciplines and compile your bibliography correctly.

1

Tianxu, Yue. „Convolutional Neural Network FPGA-accelerator on Intel DE10-Standard FPGA“. Thesis, Linköpings universitet, Elektroniska Kretsar och System, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-178174.

Abstract:
Convolutional neural networks (CNNs) have been used extensively in many areas, such as face and speech recognition, image search and classification, and autonomous driving. Hence, CNN accelerators have become a trending research topic. Graphics processing units (GPUs) are widely used in CNN accelerators. However, field-programmable gate arrays (FPGAs) offer higher energy and resource efficiency than GPUs; moreover, high-level synthesis tools based on the Open Computing Language (OpenCL) can shorten the verification and implementation period for FPGAs. In this project, PipeCNN [1] is implemented on the Intel DE10-Standard FPGA. This OpenCL design accelerates AlexNet through the interaction between an Advanced RISC Machine (ARM) processor and the FPGA. PipeCNN optimizations based on memory reads and convolution are then analyzed and discussed.
2

Oudrhiri, Ali. „Performance of a Neural Network Accelerator Architecture and its Optimization Using a Pipeline-Based Approach“. Electronic Thesis or Diss., Sorbonne université, 2023. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2023SORUS658.pdf.

Abstract:
In recent years, neural networks have gained widespread popularity for their versatility and effectiveness in solving a wide range of complex tasks. Their ability to learn and make predictions from large datasets has revolutionized various fields. However, as neural networks continue to find applications in an ever-expanding array of domains, their significant computational requirements become a pressing challenge. This computational demand is particularly problematic when deploying neural networks in resource-constrained embedded devices, especially within the context of edge computing for inference tasks. Nowadays, neural network accelerator chips emerge as the optimal choice for supporting neural networks at the edge. These chips offer remarkable efficiency with their compact size, low power consumption, and reduced latency. Moreover, the fact that they are integrated on the same chip environment also enhances security by minimizing external data communication. In the frame of edge computing, diverse requirements have emerged, necessitating trade-offs in various performance aspects. This has led to the development of accelerator architectures that are highly configurable, allowing them to adapt to distinct performance demands. In this context, the focus lies on Gemini, a configurable inference neural network accelerator designed with an imposed architecture and implemented using High-Level Synthesis techniques. The considerations for its design and implementation were driven by the need for parallelization configurability and performance optimization. Once this accelerator was designed, demonstrating the power of its configurability became essential, helping users select the most suitable architecture for their neural networks. To achieve this objective, this thesis contributed to the development of a performance prediction strategy operating at a high level of abstraction, which considers the chosen architecture and neural network configuration. This tool assists clients in making decisions regarding the appropriate architecture for their specific neural network applications. During the research, we noticed that using one accelerator presents several limitations and that increasing parallelism had limited performance returns. Consequently, we adopted a new strategy for optimizing neural network acceleration. This time, we took a high-level approach that did not require fine-grained accelerator optimizations. We organized multiple Gemini instances into a pipeline and allocated layers to different accelerators to maximize performance. We proposed solutions for two scenarios. In a user scenario, the pipeline structure is predefined with a fixed number of accelerators, accelerator configurations, and RAM sizes; we proposed solutions to map the layers onto the different accelerators to optimize the execution performance. We did the same for a designer scenario, where the pipeline structure is not fixed; this time the number and configuration of the accelerators can be chosen to optimize both execution and hardware performance. This pipeline strategy has proven to be effective for the Gemini accelerator. Although this thesis originated from a specific industrial need, certain solutions developed during the research can be applied or adapted to other neural network accelerators. Notably, the performance prediction strategy and the high-level optimization of NN processing through pipelining multiple instances offer valuable insights for broader application.
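To illustrate the kind of layer-to-stage mapping problem this abstract describes, here is a minimal, purely illustrative Python sketch: assuming hypothetical per-layer latencies, it searches for the contiguous assignment of layers to a fixed number of pipeline stages that minimizes the slowest stage, which is what bounds pipeline throughput. The function name and the numbers are assumptions for illustration, not material from the thesis.

```python
from itertools import combinations

def best_pipeline_split(layer_latencies, num_stages):
    """Brute-force search for the contiguous layer-to-stage assignment that
    minimizes the latency of the slowest pipeline stage (the stage that
    bounds steady-state throughput)."""
    n = len(layer_latencies)
    best_bounds, best_bottleneck = None, float("inf")
    # Choose num_stages - 1 cut points between layers.
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        stage_latency = [sum(layer_latencies[a:b]) for a, b in zip(bounds, bounds[1:])]
        bottleneck = max(stage_latency)
        if bottleneck < best_bottleneck:
            best_bounds, best_bottleneck = bounds, bottleneck
    return best_bounds, best_bottleneck

# Hypothetical per-layer latencies (e.g. milliseconds) for one accelerator configuration.
latencies = [4.0, 7.5, 3.2, 6.1, 2.4, 5.0]
print(best_pipeline_split(latencies, num_stages=3))
```

A real design-space exploration would of course also vary the accelerator configuration and RAM size per stage, as the thesis describes; the sketch only shows the mapping step.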
3

Maltoni, Pietro. „Progetto di un acceleratore hardware per layer di convoluzioni depthwise in applicazioni di Deep Neural Network“. Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/24205/.

Abstract:
Steady technological progress and the constant monitoring, control and analysis of the surrounding environment have led to increasingly capable IoT devices, which is why Edge Computing has become a topic of discussion. These devices contain the resources to process sensor data directly on the device. This technology is well suited to CNNs, neural networks for image analysis and recognition. Separable convolutions represent a new frontier because they massively reduce the number of operations to be performed on data tensors by splitting the convolution into two parts: a depthwise and a pointwise stage. All of this yields very reliable results in terms of accuracy and speed, but power consumption remains the central problem, since the devices rely solely on an internal battery. A good trade-off between power consumption and computational capability is therefore required. To meet this technological challenge, the state of the art in this field offers various solutions, based on clusters with optimized cores and dedicated instructions, or on FPGAs. In this thesis we propose a hardware accelerator developed within PULP aimed at computing depthwise convolution layers. Thanks to an HWC data layout in memory and to a Window Buffer, a window that slides over the image to perform the convolutions channel by channel, it was possible to develop a datapath architecture oriented towards data reuse; as a result, the accelerator delivers a peak throughput of 4 pixels per clock cycle. With a performance of 6 GOP/s, an energy efficiency of 101 GOP/J and a power consumption in the order of milliwatts, figures obtained by integrating the IP into the cluster of Darkside, a new research chip in 65 nm TSMC technology, the depthwise accelerator is an ideal candidate for this type of application.
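For readers unfamiliar with the depthwise/pointwise split mentioned above, the following NumPy sketch shows a depthwise convolution over an HWC-ordered (channel-last) tensor, where each channel is filtered independently by its own kernel. The shapes, the 3x3 kernel and all names are illustrative assumptions, not the accelerator's implementation.

```python
import numpy as np

def depthwise_conv_hwc(x, k):
    """Depthwise convolution on an HWC tensor: each channel is convolved with
    its own 2-D kernel, with no mixing across channels (channel mixing is left
    to a separate 1x1 pointwise convolution)."""
    H, W, C = x.shape
    kh, kw, kc = k.shape
    assert kc == C
    out = np.zeros((H - kh + 1, W - kw + 1, C), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # A kh x kw x C window slides over the image (the "window buffer"),
            # and every channel is reduced independently.
            window = x[i:i + kh, j:j + kw, :]
            out[i, j, :] = np.sum(window * k, axis=(0, 1))
    return out

x = np.random.rand(8, 8, 16).astype(np.float32)   # hypothetical 8x8 feature map, 16 channels
k = np.random.rand(3, 3, 16).astype(np.float32)   # one 3x3 filter per channel
print(depthwise_conv_hwc(x, k).shape)             # (6, 6, 16)
```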
4

Xu, Hongjie. „Energy-Efficient On-Chip Cache Architectures and Deep Neural Network Accelerators Considering the Cost of Data Movement“. Doctoral thesis, Kyoto University, 2021. http://hdl.handle.net/2433/263786.

Abstract:
Associated degree program: Kyoto University excellence graduate school program "先端光・電子デバイス創成学" (advanced photonic and electronic device engineering)
Kyoto University
Doctoral degree under the new system (by coursework)
Doctor of Informatics
Degree No. Kō 23325 (Informatics No. 761)
Department of Communications and Computer Engineering, Graduate School of Informatics, Kyoto University
Examination committee: Professor Hidetoshi Onodera (chair), Professor Eiji Oki, Professor Takashi Sato
Qualified under Article 4, Paragraph 1 of the Degree Regulations
DFAM
5

Riera, Villanueva Marc. „Low-power accelerators for cognitive computing“. Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/669828.

Abstract:
Deep Neural Networks (DNNs) have achieved tremendous success for cognitive applications, and are especially efficient in classification and decision-making problems such as speech recognition or machine translation. Mobile and embedded devices increasingly rely on DNNs to understand the world. Smartphones, smartwatches and cars perform discriminative tasks, such as face or object recognition, on a daily basis. Despite the increasing popularity of DNNs, running them on mobile and embedded systems comes with several main challenges: delivering high accuracy and performance with a small memory and energy budget. Modern DNN models consist of billions of parameters requiring huge computational and memory resources and, hence, they cannot be directly deployed on low-power systems with limited resources. The objective of this thesis is to address these issues and propose novel solutions in order to design highly efficient custom accelerators for DNN-based cognitive computing systems. First, we focus on optimizing the inference of DNNs for sequence processing applications. We perform an analysis of the input similarity between consecutive DNN executions. Then, based on the high degree of input similarity, we propose DISC, a hardware accelerator implementing a Differential Input Similarity Computation technique to reuse the computations of the previous execution, instead of computing the entire DNN. We observe that, on average, more than 60% of the inputs of any neural network layer tested exhibit negligible changes with respect to the previous execution. Avoiding the memory accesses and computations for these inputs results in 63% energy savings on average. Second, we propose to further optimize the inference of FC-based DNNs. We first analyze the number of unique weights per input neuron of several DNNs. Exploiting common optimizations, such as linear quantization, we observe a very small number of unique weights per input for several FC layers of modern DNNs. Then, to improve the energy efficiency of FC computation, we present CREW, a hardware accelerator that implements a Computation Reuse and an Efficient Weight Storage mechanism to exploit the large number of repeated weights in FC layers. CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage. We evaluate CREW on a diverse set of modern DNNs. On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator. Third, we propose a mechanism to optimize the inference of RNNs. RNN cells perform element-wise multiplications across the activations of different gates, sigmoid and tanh being the common activation functions. We perform an analysis of the activation function values, and show that a significant fraction are saturated towards zero or one in popular RNNs. Then, we propose CGPA to dynamically prune activations from RNNs at a coarse granularity. CGPA avoids the evaluation of entire neurons whenever the outputs of peer neurons are saturated. CGPA significantly reduces the amount of computations and memory accesses while avoiding sparsity to a large extent, and can be easily implemented on top of conventional accelerators such as the TPU with negligible area overhead, resulting in 12% speedup and 12% energy savings on average for a set of widely used RNNs. Finally, in the last contribution of this thesis we focus on static DNN pruning methodologies.
DNN pruning reduces memory footprint and computational work by removing connections and/or neurons that are ineffectual. However, we show that prior pruning schemes require an extremely time-consuming iterative process that requires retraining the DNN many times to tune the pruning parameters. Then, we propose a DNN pruning scheme based on Principal Component Analysis and relative importance of each neuron's connection that automatically finds the optimized DNN in one shot.
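A software analogue of the differential input-similarity idea behind DISC can be sketched as follows: keep the previous layer output and apply corrections only for the inputs whose change exceeds a tolerance. The threshold, shapes and function name below are assumptions for illustration; the actual accelerator realizes this at the hardware datapath level.

```python
import numpy as np

def fc_with_input_reuse(W, x_new, x_prev, y_prev, tol=1e-3):
    """Fully connected layer y = W @ x, updated incrementally: only columns of W
    whose input changed by more than `tol` contribute a correction to the
    previously computed output."""
    delta = x_new - x_prev
    changed = np.abs(delta) > tol                  # inputs considered "different enough"
    y_new = y_prev + W[:, changed] @ delta[changed]
    return y_new, changed.mean()

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
x_prev = rng.standard_normal(512)
y_prev = W @ x_prev
x_new = x_prev.copy()
x_new[::10] += 0.05                                # hypothetical small change on ~10% of inputs
y_new, frac = fc_with_input_reuse(W, x_new, x_prev, y_prev, tol=1e-3)
print(np.allclose(y_new, W @ x_new), f"{frac:.1%} of inputs recomputed")
```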
6

Khan, Muhammad Jazib. „Programmable Address Generation Unit for Deep Neural Network Accelerators“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-271884.

Abstract:
Convolutional Neural Networks are getting more and more popular due to their applications in revolutionary technologies like Autonomous Driving, Biomedical Imaging, and Natural Language Processing. With this increase in adoption, the complexity of the underlying algorithms is also increasing. This trend entails implications for the computation platforms as well, i.e. GPU-, FPGA-, or ASIC-based accelerators, especially for the Address Generation Unit (AGU), which is responsible for memory access. Existing accelerators typically have Parametrizable Datapath AGUs, which have minimal adaptability towards evolution in algorithms. Hence new hardware is required for new algorithms, which is a very inefficient approach in terms of time, resources, and reusability. In this research, six algorithms with different implications for hardware are evaluated for address generation, and a fully Programmable AGU (PAGU) is presented, which can adapt to these algorithms. These algorithms are Standard, Strided, Dilated, Upsampled and Padded convolution, and MaxPooling. The proposed AGU architecture is a Very Long Instruction Word based Application Specific Instruction Processor which has specialized components like hardware counters and zero-overhead loops and a powerful Instruction Set Architecture (ISA), which can model static and dynamic constraints and affine and non-affine Address Equations. The target has been to minimize the flexibility vs. area, power, and performance trade-off. For a working test network for Semantic Segmentation, results have shown that PAGU achieves close to the ideal performance, one cycle per address, for all the algorithms under consideration except Upsampled Convolution, for which it is 1.7 cycles per address. The area of PAGU is approximately 4.6 times larger than that of the Parametrizable Datapath approach, which is still reasonable considering the high flexibility benefits. The potential of PAGU is not limited to neural network applications but extends to more general digital signal processing areas, which can be explored in the future.
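The address streams such an AGU must generate can be written as affine functions of the loop counters. The small sketch below emulates, in software, the flat address sequence of one convolution window for a strided and dilated convolution over an HWC-laid-out input; parameter names and sizes are assumptions used only for illustration, not the PAGU ISA.

```python
def window_addresses(H, W, C, out_y, out_x, kh, kw, stride=1, dilation=1):
    """Yield flat addresses (row-major HWC layout) of the input elements covered
    by one convolution window, as an address generation unit would produce them.
    address = ((y * W) + x) * C + c is an affine function of the loop counters."""
    for ky in range(kh):
        for kx in range(kw):
            y = out_y * stride + ky * dilation
            x = out_x * stride + kx * dilation
            for c in range(C):
                yield (y * W + x) * C + c

# Addresses for a 3x3 window at output position (1, 2), stride 2 and dilation 1,
# on a 16x16x4 input (hypothetical sizes).
addrs = list(window_addresses(16, 16, 4, out_y=1, out_x=2, kh=3, kw=3, stride=2, dilation=1))
print(addrs[:8])
```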
7

Jalasutram, Rommel. „Acceleration of spiking neural networks on multicore architectures“. Connect to this title online, 2009. http://etd.lib.clemson.edu/documents/1252424720/.

8

Han, Bing. „ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS“. University of Dayton / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1271368713.

9

Chen, Yu-Hsin Ph D. Massachusetts Institute of Technology. „Architecture design for highly flexible and energy-efficient deep neural network accelerators“. Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/117838.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 141-147).
Deep neural networks (DNNs) are the backbone of modern artificial intelligence (AI). However, due to their high computational complexity and diverse shapes and sizes, dedicated accelerators that can achieve high performance and energy efficiency across a wide range of DNNs are critical for enabling AI in real-world applications. To address this, we present Eyeriss, a co-design of software and hardware architecture for DNN processing that is optimized for performance, energy efficiency and flexibility. Eyeriss features a novel Row-Stationary (RS) dataflow to minimize data movement when processing a DNN, which is the bottleneck of both performance and energy efficiency. The RS dataflow supports highly-parallel processing while fully exploiting data reuse in a multi-level memory hierarchy to optimize for the overall system energy efficiency given any DNN shape and size. It achieves 1.4x to 2.5x higher energy efficiency than other existing dataflows. To support the RS dataflow, we present two versions of the Eyeriss architecture. Eyeriss v1 targets large DNNs that have plenty of data reuse. It features a flexible mapping strategy for high performance and a multicast on-chip network (NoC) for high data reuse, and further exploits data sparsity to reduce processing element (PE) power by 45% and off-chip bandwidth by up to 1.9x. Fabricated in a 65nm CMOS, Eyeriss v1 consumes 278 mW at 34.7 fps for the CONV layers of AlexNet, which is 10x more efficient than a mobile GPU. Eyeriss v2 addresses support for the emerging compact DNNs that introduce higher variation in data reuse. It features a RS+ dataflow that improves PE utilization, and a flexible and scalable NoC that adapts to the bandwidth requirement while also exploiting available data reuse. Together, they provide over 10x higher throughput than Eyeriss v1 at 256 PEs. Eyeriss v2 also exploits sparsity and SIMD for an additional 6x increase in throughput.
by Yu-Hsin Chen.
Ph. D.
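To make the row-stationary idea described in the abstract above concrete, the sketch below shows the 1-D row primitive it builds on: a filter row is held fixed in a processing element and reused for every output position while the input row streams past. This is only a functional illustration of the reuse pattern, not the Eyeriss microarchitecture.

```python
def row_stationary_1d(input_row, filter_row):
    """1-D convolution primitive in the spirit of a row-stationary PE: the filter
    row is held fixed ("stationary") and reused at every output position as the
    input row streams through local storage."""
    R = len(filter_row)
    out = []
    for o in range(len(input_row) - R + 1):
        acc = 0
        for r in range(R):                 # each weight is reused for every output o
            acc += input_row[o + r] * filter_row[r]
        out.append(acc)
    return out

print(row_stationary_1d([1, 2, 3, 4, 5, 6], [1, 0, -1]))   # [-2, -2, -2, -2]
```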
10

Gaura, Elena Ioana. „Neural network techniques for the control and identification of acceleration sensors“. Thesis, Coventry University, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.313132.

11

Anderson, Thomas. „Built-In Self Training of Hardware-Based Neural Networks“. University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1512039036199393.

12

Wijekoon, Jayawan. „Mixed signal VLSI circuit implementation of the cortical microcircuit models“. Thesis, University of Manchester, 2011. https://www.research.manchester.ac.uk/portal/en/theses/mixed-signal-vlsi-circuit-implementation-of-the-cortical-microcircuit-models(6deb2d34-5811-42ec-a4f1-e11cdb6816f1).html.

Abstract:
This thesis proposes a novel set of generic and compact biologically plausible VLSI (Very Large Scale Integration) neural circuits, suitable for implementing a parallel VLSI network that closely resembles the function of a small-scale neocortical network. The proposed circuits include a cortical neuron, two different long-term plastic synapses and four different short-term plastic synapses. These circuits operate in accelerated-time, where the time scale of neural responses is approximately three to four orders of magnitude faster than the biological-time scale of the neuronal activities, providing higher computational throughput in computing neural dynamics. Further, a novel biological-time cortical neuron circuit with similar dynamics as of the accelerated-time neuron is proposed to demonstrate the feasibility of migrating accelerated-time circuits into biological-time circuits. The fabricated accelerated-time VLSI neuron circuit is capable of replicating distinct firing patterns such as regular spiking, fast spiking, chattering and intrinsic bursting, by tuning two external voltages. It reproduces biologically plausible action potentials. This neuron circuit is compact and enables implementation of many neurons in a single silicon chip. The circuit consumes extremely low energy per spike (8pJ). Incorporating this neuron circuit in a neural network facilitates diverse non-linear neuron responses, which is an important aspect in neural processing. Two of the proposed long term plastic synapse circuits include spike-time dependent plasticity (STDP) synapse, and dopamine modulated STDP synapse. The short-term plastic synapses include excitatory depressing, inhibitory facilitating, inhibitory depressing, and excitatory facilitating synapses. Many neural parameters of short- and long- term synapses can be modified independently using externally controlled tuning voltages to obtain distinct synaptic properties. Having diverse synaptic dynamics in a network facilitates richer network behaviours such as learning, memory, stability and dynamic gain control, inherent in a biological neural network. To prove the concept in VLSI, different combinations of these accelerated-time neural circuits are fabricated in three integrated circuits (ICs) using a standard 0.35 µm CMOS technology. Using first two ICs, functions of cortical neuron and STDP synapses have been experimentally verified. The third IC, the Cortical Neural Layer (CNL) Chip is designed and fabricated to facilitate cortical network emulations. This IC implements neural circuits with a similar composition to the cortical layer of the neocortex. The CNL chip comprises 120 cortical neurons and 7 560 synapses. Many of these CNL chips can be combined together to form a six-layered VLSI neocortical network to validate the network dynamics and to perform neural processing of small-scale cortical networks. The proposed neuromorphic systems can be used as a simulation acceleration platform to explore the processing principles of biological brains and also move towards realising low power, real-time intelligent computing devices and control systems.
13

Ngo, Kalle. „FPGA Hardware Acceleration of Inception Style Parameter Reduced Convolution Neural Networks“. Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-205026.

Abstract:
Some researchers have noted that the growth rate in the number of network parameters of many recently proposed state-of-the-art CNN topologies is placing unrealistic demands on hardware resources and limits the practical applications of neural networks. This is particularly apparent when considering that many of the projected applications (IoT, autonomous vehicles, etc.) utilize embedded systems with even greater restrictions on computation and memory bandwidth than the typical research-class computer cluster that the CNN was designed on. The GoogLeNet CNN in 2014 proposed a new level of organization ("Inception Module") that was demonstrated in competition to achieve similar or better performance, while using an order of magnitude fewer network parameters than the other competing topologies. This thesis explores the characteristics of the new GoogLeNet inception modules and the implications they present for current CNN accelerator architectures. A custom FPGA accelerator is proposed to offset the inception module's increased need to buffer large intermediate convolution arrays, through array partitioning and by cascading two convolution operations into a single pipeline pass. A Xilinx Artix-7 FPGA was used to implement the architecture, where it was able to continuously supply data to the 331 utilized DSP blocks (approximately half of the total available) while using only a quarter of the DDR bandwidth, achieving a peak throughput of 9.11 GFLOPS. The low utilization of the DDR bandwidth suggests that, with some optimization, the design can be scaled up to better utilize the available resources and increase throughput.
14

Reiche, Myrgård Martin. „Acceleration of deep convolutional neural networks on multiprocessor system-on-chip“. Thesis, Uppsala universitet, Avdelningen för datorteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-385904.

Abstract:
In this master thesis, some of the most promising existing frameworks and implementations of deep convolutional neural networks on multiprocessor system-on-chips (MPSoCs) are researched and evaluated. The thesis's starting point was a previous thesis, conducted in the spring of 2018, which evaluated possible deep learning models and frameworks for object detection on infrared images. In order to fit an existing deep convolutional neural network (DCNN) on a multiprocessor system-on-chip, it needs modifications. Most DCNNs are trained on graphics processing units (GPUs) with a bit width of 32 bits. This is not optimal for a platform with hard memory constraints such as the MPSoC, which means the bit width needs to be shortened. The optimal bit width depends on the network structure and the requirements in terms of throughput and accuracy, although the accuracy of most currently available object detection networks drops significantly when the bit width is reduced below 6 bits. After reducing the bit width, the network needs to be quantized and pruned for better memory usage. After quantization it can be implemented using one of many existing frameworks. This thesis focuses on Xilinx CHaiDNN and DNNWeaver V2, though it also touches on reVISION, HLS4ML and DNNWeaver V1. In conclusion, implementations of two network models on the Xilinx Zynq UltraScale+ ZCU102 using CHaiDNN were evaluated. Conversion of an existing network was done and quantization was tested, though not fully working. The result was a two to six times more power-efficient implementation in comparison to GPU inference.
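The bit-width reduction discussed above is typically done with uniform (linear) quantization. The generic sketch below quantizes a weight tensor symmetrically to a chosen number of bits and reports the resulting approximation error, which grows quickly below roughly 6 bits; it is not the CHaiDNN or DNNWeaver quantization flow, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniformly quantize a weight tensor to `bits`-wide signed integers and
    return both the integer codes and the de-quantized approximation."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, q * scale

w = np.random.randn(1000).astype(np.float32)        # hypothetical layer weights
for bits in (8, 6, 4):
    _, w_hat = quantize_symmetric(w, bits)
    print(bits, "bits, mean abs error:", np.mean(np.abs(w - w_hat)))
```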
15

Silfa, Franyell. „Energy-efficient architectures for recurrent neural networks“. Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/671448.

Abstract:
Deep Learning algorithms have been remarkably successful in applications such as Automatic Speech Recognition and Machine Translation. Thus, these kinds of applications are ubiquitous in our lives and are found in a plethora of devices. These algorithms are composed of Deep Neural Networks (DNNs), such as Convolutional Neural Networks and Recurrent Neural Networks (RNNs), which have a large number of parameters and require a large amount of computations. Hence, the evaluation of DNNs is challenging due to their large memory and power requirements. RNNs are employed to solve sequence-to-sequence problems such as Machine Translation. They contain data dependencies among the executions of time-steps; hence, the amount of parallelism is severely limited. Thus, evaluating them in an energy-efficient manner is more challenging than evaluating other DNN algorithms. This thesis studies applications using RNNs to improve their energy efficiency on specialized architectures. Specifically, we propose novel energy-saving techniques and highly efficient architectures tailored to the evaluation of RNNs. We focus on the most successful RNN topologies, which are Long Short-Term Memory and the Gated Recurrent Unit. First, we characterize a set of RNNs running on a modern SoC. We identify that accessing the memory to fetch the model weights is the main source of energy consumption. Thus, we propose E-PUR: an energy-efficient processing unit for RNN inference. E-PUR achieves 6.8x speedup and improves energy consumption by 88x compared to the SoC. These benefits are obtained by improving the temporal locality of the model weights. In E-PUR, fetching the parameters is the main source of energy consumption. Thus, we strive to reduce memory accesses and propose a scheme to reuse previous computations. Our observation is that when evaluating the input sequences of an RNN model, the output of a given neuron tends to change only slightly between consecutive evaluations. Thus, we develop a scheme that caches the neurons' outputs and reuses them whenever it detects that the change between the current and previously computed output value for a given neuron is small, avoiding fetching the weights. In order to decide when to reuse a previous value we employ a Binary Neural Network (BNN) as a predictor of reusability. The low-cost BNN can be employed in this context since its output is highly correlated to the output of RNNs. We show that our proposal avoids more than 24.2% of computations. Hence, on average, energy consumption is reduced by 18.5% for a speedup of 1.35x. RNN models' memory footprint is usually reduced by using low precision for evaluation and storage. In this case, the minimum precision used is identified offline and it is set such that the model maintains its accuracy. This method utilizes the same precision to compute all time-steps. Yet, we observe that some time-steps can be evaluated with a lower precision while preserving the accuracy. Thus, we propose a technique that dynamically selects the precision used to compute each time-step. A challenge of our proposal is choosing a lower bit-width. We address this issue by recognizing that information from a previous evaluation can be employed to determine the precision required in the current time-step. Our scheme evaluates 57% of the computations on a bit-width lower than the fixed precision employed by static methods. We implement it on E-PUR and it provides 1.46x speedup and 19.2% energy savings on average.
16

Torcolacci, Veronica. „Implementation of Machine Learning Algorithms on Hardware Accelerators“. Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020.

Abstract:
Nowadays, cutting-edge technology, innovation and efficiency are the cornerstones on which industries are based. Therefore, prognosis and health management have started to play a key role in the prevention of crucial faults and failures. Recognizing malfunctions in a system in advance is fundamental both in economic and safety terms. This obviously requires a lot of data, mainly information from sensors or machine control, to be processed, and it is in this scenario that Machine Learning comes to the aid. This thesis aims to apply these methodologies to prognosis in automatic machines and has been carried out at LIAM lab (Laboratorio Industriale Automazione Macchine per il packaging), an industrial research laboratory born from the experience of leading companies in the sector. Machine learning techniques such as neural networks will be exploited to solve the classification problems that arise from the system under examination. Such algorithms will be combined with system identification techniques that estimate the plant parameters and perform feature reduction by compressing the data. This makes it easier for the neural networks to distinguish the different operating conditions and perform a good prognosis. In practice, the algorithms will be developed in Python and then implemented on two hardware accelerators, whose performance will be evaluated.
17

Tran, Ba-Hien. „Advancing Bayesian Deep Learning : Sensible Priors and Accelerated Inference“. Electronic Thesis or Diss., Sorbonne université, 2023. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2023SORUS280.pdf.

Abstract:
Over the past decade, deep learning has witnessed remarkable success in a wide range of applications, revolutionizing various fields with its unprecedented performance. However, a fundamental limitation of deep learning models lies in their inability to accurately quantify prediction uncertainty, posing challenges for applications that demand robust risk assessment. Fortunately, Bayesian deep learning provides a promising solution by adopting a Bayesian formulation for neural networks. Despite significant progress in recent years, there remain several challenges that hinder the widespread adoption and applicability of Bayesian deep learning. In this thesis, we address some of these challenges by proposing solutions to choose sensible priors and accelerate inference for Bayesian deep learning models. The first contribution of the thesis is a study of the pathologies associated with poor choices of priors for Bayesian neural networks for supervised learning tasks and a proposal to tackle this problem in a practical and effective way. Specifically, our approach involves reasoning in terms of functional priors, which are more easily elicited, and adjusting the priors of neural network parameters to align with these functional priors. The second contribution is a novel framework for conducting model selection for Bayesian autoencoders for unsupervised tasks, such as representation learning and generative modeling. To this end, we reason about the marginal likelihood of these models in terms of functional priors and propose a fully sample-based approach for its optimization. The third contribution is a novel fully Bayesian autoencoder model that treats both local latent variables and the global decoder in a Bayesian fashion. We propose an efficient amortized MCMC scheme for this model and impose sparse Gaussian process priors over the latent space to capture correlations between latent encodings. The last contribution is a simple yet effective approach to improve likelihood-based generative models through data mollification. This accelerates inference for these models by allowing accurate density estimation in low-density regions while addressing manifold overfitting.
18

CARRERAS, MARCO. „Acceleration of Artificial Neural Networks at the edge: adapting flexibly to emerging devices and models“. Doctoral thesis, Università degli Studi di Cagliari, 2022. http://hdl.handle.net/11584/333521.

Abstract:
Convolutional Neural Networks (CNNs) are nowadays ubiquitously used in a wide range of applications. While CNNs are usually designed to operate on images for computer vision (CV) tasks, more recently they have been applied in multiple other embedded domains, to analyze different information and data types. A key research topic involving CNNs is related to methodologies and instruments implementing a shift from cloud computing to the edge computing paradigm. The classic implementation of CNN-based systems relies on the cloud: an embedded system samples data acquired by adequate sensors and sends them to a remote cloud computing facility, where the data is analyzed on high-performance processing platforms. However, to really enable ubiquitous use of CNNs, some use-cases require moving the classification/recognition tasks to the edge of the network, executing the CNN inference near-sensor, directly on embedded processing systems. At-the-edge data processing has multiple potential benefits: it improves responsiveness and reliability, avoids disclosure of private information, and reduces the communication bandwidth requirements posed by the transmission of raw sensor data. Among the possible technology substrates that may be used to implement such embedded platforms, a widely used solution relies on processing systems integrating Field Programmable Gate Arrays (FPGAs). The Digital Signal Processing (DSP) slices available in modern FPGAs are very well suited for the execution of multiply-and-accumulate operations, which represent the heaviest workload in CNNs. In particular, All-Programmable Systems on Chip (AP-SoCs), i.e. heterogeneous processing systems designed to exploit the cooperation between general-purpose processing cores and FPGA resources, can accommodate quite effectively both the highly parallel data-crunching operations in the network and the other more control-like and housekeeping-related actions surrounding them within the overall software applications. The work in this thesis focuses on CNN inference acceleration on AP-SoCs. It starts from a reference architecture, an FPGA-based CNN inference accelerator named NEURAghe [73], and extends it to assess its flexibility to different target devices and its applicability to a wider range of design cases and network topologies. To this aim, in the first phase of the work, we have aggressively parameterized the architecture, to be capable of shaping it into different configurations to be implemented on various device sizes. In a second phase, we have tested and studied modifications to extend NEURAghe's approach from mainstream CNNs, whose execution is widely supported by multiple accelerators in the literature, to less deeply explored algorithm flavours, namely:
• Temporal Convolutional Networks (TCNs), operating with mono-dimensional dilated kernels on sequences of samples;
• depthwise separable convolutions, which reduce the number of Multiply-Accumulate operations (MACs) to be performed per layer and, consequently, if countermeasures are not taken, reduce the utilization rate of hardware MAC modules in NEURAghe;
• event-based Spiking Neural Networks (SNNs), which require an entirely different architecture pattern that needs to be finely tuned and integrated into the NEURAghe system template to be effectively used on FPGA;
19

Hofmann, Jaco [author], Andreas [academic supervisor] Koch and Mladen [academic supervisor] Berekovic. „An Improved Framework for and Case Studies in FPGA-Based Application Acceleration - Computer Vision, In-Network Processing and Spiking Neural Networks / Jaco Hofmann ; Andreas Koch, Mladen Berekovic“. Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2020. http://d-nb.info/1202923097/34.

20

Mealey, Thomas C. „Binary Recurrent Unit: Using FPGA Hardware to Accelerate Inference in Long Short-Term Memory Neural Networks“. University of Dayton / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1524402925375566.

21

Wu, Gang. „Using GPU acceleration and a novel artificial neural networks approach for ultra-fast fluorescence lifetime imaging microscopy analysis“. Thesis, University of Sussex, 2017. http://sro.sussex.ac.uk/id/eprint/71657/.

Abstract:
Fluorescence lifetime imaging microscopy (FLIM), which is capable of visualizing local molecular and physiological parameters in living cells, plays a significant role in biological sciences, chemistry, and medical research. In order to unveil dynamic cellular processes, it is necessary to develop high-speed FLIM technology. Thanks to the development of highly parallel time-to-digital converter (TDC) arrays, especially when integrated with single-photon avalanche diodes (SPADs), the acquisition rate of high-resolution fluorescence lifetime imaging has been dramatically improved. On the other hand, these technological advances and advanced data acquisition systems have generated massive amounts of data, which significantly increases the difficulty of FLIM analysis. Traditional FLIM systems rely on time-consuming iterative algorithms to retrieve the FLIM parameters. Therefore, lifetime analysis has become a bottleneck for high-speed FLIM applications, let alone real-time or video-rate FLIM systems. Although some simple algorithms have been proposed, most of them are only able to resolve a simple FLIM decay model. Moreover, existing FLIM systems based on CPU processing do not make use of available parallel acceleration. In order to tackle the existing problems, my study focused on introducing state-of-the-art general-purpose graphics processing units (GPUs) to FLIM analysis and building a data processing system based on both the CPU and GPUs. With a large number of parallel cores, the GPUs are able to significantly speed up lifetime analysis compared to CPU-only processing. In addition to transforming the existing algorithms into GPU computing, I have developed a new high-speed, GPU-friendly algorithm based on an artificial neural network (ANN). The proposed GPU-ANN-FLIM method has dramatically improved the efficiency of FLIM analysis, being at least 1000-fold faster than some traditional algorithms, meaning that it has great potential to fuel current revolutions in high-speed, high-resolution FLIM applications.
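The core idea of replacing iterative curve fitting with a learned estimator can be sketched in a few lines: train a small network on synthetic mono-exponential decay histograms so that, at inference time, a lifetime is obtained with a single forward pass per pixel. Everything below (the decay model, noise level, network size, and the use of scikit-learn) is an illustrative assumption, not the GPU-ANN-FLIM implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 64)                     # 64 time bins (hypothetical units)

def synth_decay(tau):
    """Poisson-noised mono-exponential decay histogram for lifetime tau."""
    ideal = np.exp(-t / tau)
    counts = rng.poisson(ideal * 500)
    return counts / counts.max()                   # simple per-pixel normalization

taus = rng.uniform(0.5, 5.0, 5000)
X = np.stack([synth_decay(tau) for tau in taus])

# A small MLP learns the histogram -> lifetime mapping once; afterwards each
# pixel needs only one forward pass instead of an iterative fit.
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
model.fit(X, taus)

test_tau = 2.3
print("estimated lifetime:", model.predict(synth_decay(test_tau)[None, :])[0])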
APA, Harvard, Vancouver, ISO und andere Zitierweisen
22

Kong, Yat Sheng [Verfasser], und Dieter [Akademischer Betreuer] Schramm. „Establishment of artificial neural network for suspension spring fatigue life prediction using strain and acceleration data / Yat Sheng Kong ; Betreuer: Dieter Schramm“. Duisburg, 2019. http://d-nb.info/1191692558/34.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
23

Viebke, André. „Accelerated Deep Learning using Intel Xeon Phi“. Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-45491.

Der volle Inhalt der Quelle
Annotation:
Deep learning, a sub-topic of machine learning inspired by biology, has attracted wide attention in industry and the research community recently. State-of-the-art applications in the areas of computer vision and speech recognition (among others) are built using deep learning algorithms. In contrast to traditional algorithms, where the developer fully instructs the application what to do, deep learning algorithms instead learn from experience when performing a task. However, for the algorithm to learn it requires training, which is a heavy computational challenge. High Performance Computing can help ease the burden through parallelization, thereby reducing the training time; this is essential to fully utilize the algorithms in practice. While numerous works targeting GPUs have investigated ways to speed up training, less attention has been paid to the Intel Xeon Phi coprocessor. In this thesis we present a parallelized implementation of a Convolutional Neural Network (CNN), a deep learning architecture, and our proposed parallelization scheme, CHAOS. Additionally, a theoretical analysis and a performance model discuss the algorithm in detail and allow for predictions if even more threads become available in the future. The algorithm is evaluated on an Intel Xeon Phi 7120p, a Xeon E5-2695v2 at 2.4 GHz and a Core i5 661 at 3.33 GHz using various architectures and thread counts on the MNIST dataset. Findings show a 103.5x, 99.9x and 100.4x speed-up for the large, medium and small architecture, respectively, for 244 threads compared to 1 thread on the coprocessor, and a 10.9x-14.1x (large to small) speed-up compared to the sequential version running on the Xeon E5. We managed to decrease training time from 7 days on the Core i5 and 31 hours on the Xeon E5 to 3 hours on the Intel Xeon Phi when training our large network for 15 epochs.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
24

Vogel, Sebastian A. A. [Verfasser], Gerd [Akademischer Betreuer] Ascheid und Walter [Akademischer Betreuer] Stechele. „Design and implementation of number representations for efficient multiplierless acceleration of convolutional neural networks / Sebastian A. A. Vogel ; Gerd Ascheid, Walter Stechele“. Aachen : Universitätsbibliothek der RWTH Aachen, 2020. http://d-nb.info/1220082716/34.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
25

Axillus, Viktor. „Comparing Julia and Python : An investigation of the performance on image processing with deep neural networks and classification“. Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-19160.

Der volle Inhalt der Quelle
Annotation:
Python is the most popular language when it comes to prototyping and developing machine learning algorithms. Python is an interpreted language that causes it to have a significant performance loss compared to compiled languages. Julia is a newly developed language that tries to bridge the gap between high performance but cumbersome languages such as C++ and highly abstracted but typically slow languages such as Python. However, over the years, the Python community have developed a lot of tools that addresses its performance problems. This raises the question if choosing one language over the other has any significant performance difference. This thesis compares the performance, in terms of execution time, of the two languages in the machine learning domain. More specifically, image processing with GPU-accelerated deep neural networks and classification with k-nearest neighbor on the MNIST and EMNIST dataset. Python with Keras and Tensorflow is compared against Julia with Flux for GPU-accelerated neural networks. For classification Python with Scikit-learn is compared against Julia with Nearestneighbors.jl. The results point in the direction that Julia has a performance edge in regards to GPU-accelerated deep neural networks. With Julia outperforming Python by roughly 1.25x − 1.5x. For classification with k-nearest neighbor the results were a bit more varied with Julia outperforming Python in 5 out of 8 different measurements. However, there exists some validity threats and additional research is needed that includes all different frameworks available for the languages in order to provide a more conclusive and generalized answer.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
26

Slouka, Lukáš. „Implementace neuronové sítě bez operace násobení“. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2018. http://www.nusl.cz/ntk/nusl-386017.

Der volle Inhalt der Quelle
Annotation:
The subject of this thesis is neural network acceleration with the goal of reducing the number of floating-point multiplications. The theoretical part of the thesis surveys current trends and methods used in the field of neural network acceleration, with a focus on binarization techniques, which allow multiplications to be replaced with logical operators. The theoretical base is put into practice in two ways. The first is a GPU implementation of the crucial binary operators in the Tensorflow framework, with a performance benchmark. The second is an application of these operators in a simple image classifier. The results are certainly encouraging: the implemented operators achieve a speed-up by a factor of 2.5 when compared to highly optimized cuBLAS operators. The last chapter compares the accuracies achieved by binarized models and their full-precision counterparts on various architectures.
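To make concrete how binarization lets multiplications be replaced by logical operators (the exact operators and data layout in the thesis may differ), here is a minimal Python sketch of the standard XNOR-popcount dot product over ±1 vectors packed into bits:

    # Hedged sketch of an XNOR-popcount dot product, the usual way binarized
    # networks replace multiplications with bitwise logic. Encoding: +1 -> 1, -1 -> 0.
    def binary_dot(a_bits, b_bits, n_bits):
        xnor = ~(a_bits ^ b_bits) & ((1 << n_bits) - 1)  # 1 where the signs agree
        matches = bin(xnor).count("1")                   # popcount
        return 2 * matches - n_bits                      # equals the sum of +/-1 products

    # Example: a = [+1, -1, +1, +1], b = [+1, +1, -1, +1] -> dot = 1 - 1 - 1 + 1 = 0
    a = 0b1011
    b = 0b1101
    print(binary_dot(a, b, 4))  # 0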
APA, Harvard, Vancouver, ISO und andere Zitierweisen
27

PETRINI, ALESSANDRO. „HIGH PERFORMANCE COMPUTING MACHINE LEARNING METHODS FOR PRECISION MEDICINE“. Doctoral thesis, Università degli Studi di Milano, 2021. http://hdl.handle.net/2434/817104.

Der volle Inhalt der Quelle
Annotation:
La Medicina di Precisione (Precision Medicine) è un nuovo paradigma che sta rivoluzionando diversi aspetti delle pratiche cliniche: nella prevenzione e diagnosi, essa è caratterizzata da un approccio diverso dal "one size fits all" proprio della medicina classica. Lo scopo delle Medicina di Precisione è di trovare misure di prevenzione, diagnosi e cura che siano specifiche per ciascun individuo, a partire dalla sua storia personale, stile di vita e fattori genetici. Tre fattori hanno contribuito al rapido sviluppo della Medicina di Precisione: la possibilità di generare rapidamente ed economicamente una vasta quantità di dati omici, in particolare grazie alle nuove tecniche di sequenziamento (Next-Generation Sequencing); la possibilità di diffondere questa enorme quantità di dati grazie al paradigma "Big Data"; la possibilità di estrarre da questi dati tutta una serie di informazioni rilevanti grazie a tecniche di elaborazione innovative ed altamente sofisticate. In particolare, le tecniche di Machine Learning introdotte negli ultimi anni hanno rivoluzionato il modo di analizzare i dati: esse forniscono dei potenti strumenti per l'inferenza statistica e l'estrazione di informazioni rilevanti dai dati in maniera semi-automatica. Al contempo, però, molto spesso richiedono elevate risorse computazionali per poter funzionare efficacemente. Per questo motivo, e per l'elevata mole di dati da elaborare, è necessario sviluppare delle tecniche di Machine Learning orientate al Big Data che utilizzano espressamente tecniche di High Performance Computing, questo per poter sfruttare al meglio le risorse di calcolo disponibili e su diverse scale, dalle singole workstation fino ai super-computer. In questa tesi vengono presentate tre tecniche di Machine Learning sviluppate nel contesto del High Performance Computing e create per affrontare tre questioni fondamentali e ancora irrisolte nel campo della Medicina di Precisione, in particolare la Medicina Genomica: i) l'identificazione di varianti deleterie o patogeniche tra quelle neutrali nelle aree non codificanti del DNA; ii) l'individuazione della attività delle regioni regolatorie in diverse linee cellulari e tessuti; iii) la predizione automatica della funzione delle proteine nel contesto di reti biomolecolari. Per il primo problema è stato sviluppato parSMURF, un innovativo metodo basato su hyper-ensemble in grado di gestire l'elevato grado di sbilanciamento che caratterizza l'identificazione di varianti patogeniche e deleterie in mezzo al "mare" di varianti neutrali nelle aree non-coding del DNA. L'algoritmo è stato implementato per sfruttare appositamente le risorse di supercalcolo del CINECA (Marconi - KNL) e HPC Center Stuttgart (HLRS Apollo HAWK), ottenendo risultati allo stato dell'arte, sia per capacità predittiva, sia per scalabilità. Il secondo problema è stato affrontato tramite lo sviluppo di reti neurali "deep", in particolare Deep Feed Forward e Deep Convolutional Neural Networks per analizzare - rispettivamente - dati di natura epigenetica e sequenze di DNA, con lo scopo di individuare promoter ed enhancer attivi in linee cellulari e tessuti specifici. L'analisi è compiuta "genome-wide" e sono state usate tecniche di parallelizzazione su GPU. 
Infine, per il terzo problema è stato sviluppato un algoritmo di Machine Learning semi-supervisionato su grafo basato su reti di Hopfield per elaborare efficacemente grandi network biologici, utilizzando ancora tecniche di parallelizzazione su GPU; in particolare, una parte rilevante dell'algoritmo è data dall'introduzione di una tecnica parallela di colorazione del grafo che migliora il classico approccio greedy introdotto da Luby. Tra i futuri lavori e le attività in corso, viene presentato il progetto inerente all'estensione di parSMURF che è stato recentemente premiato dal consorzio Partnership for Advance in Computing in Europe (PRACE) allo scopo di sviluppare ulteriormente l'algoritmo e la sua implementazione, applicarlo a dataset di diversi ordini di grandezza più grandi e inserire i risultati in Genomiser, lo strumento attualmente allo stato dell'arte per l'individuazione di varianti genetiche Mendeliane. Questo progetto è inserito nel contesto di una collaborazione internazionale con i Jackson Lab for Genomic Medicine.
Precision Medicine is a new paradigm which is reshaping several aspects of clinical practice, representing a major departure from the "one size fits all" approach in diagnosis and prevention featured in classical medicine. Its main goal is to find personalized prevention measures and treatments, on the basis of the personal history, lifestyle and specific genetic factors of each individual. Three factors contributed to the rapid rise of Precision Medicine approaches: the ability to quickly and cheaply generate a vast amount of biological and omics data, mainly thanks to Next-Generation Sequencing; the ability to efficiently access this vast amount of data, under the Big Data paradigm; the ability to automatically extract relevant information from data, thanks to innovative and highly sophisticated data processing analytical techniques. Machine Learning in recent years revolutionized data analysis and predictive inference, influencing almost every field of research. Moreover, high-throughput bio-technologies posed additional challenges to effectively manage and process Big Data in Medicine, requiring novel specialized Machine Learning methods and High Performance Computing techniques well-tailored to process and extract knowledge from big bio-medical data. In this thesis we present three High Performance Computing Machine Learning techniques that have been designed and developed for tackling three fundamental and still open questions in the context of Precision and Genomic Medicine: i) identification of pathogenic and deleterious genomic variants among the "sea" of neutral variants in the non-coding regions of the DNA; ii) detection of the activity of regulatory regions across different cell lines and tissues; iii) automatic protein function prediction and drug repurposing in the context of biomolecular networks. For the first problem we developed parSMURF, a novel hyper-ensemble method able to deal with the huge data imbalance that characterizes the detection of pathogenic variants in the non-coding regulatory regions of the human genome. We implemented this approach with highly parallel computational techniques using supercomputing resources at CINECA (Marconi – KNL) and HPC Center Stuttgart (HLRS Apollo HAWK), obtaining state-of-the-art results. For the second problem we developed Deep Feed Forward and Deep Convolutional Neural Networks to respectively process epigenetic and DNA sequence data to detect active promoters and enhancers in specific tissues at genome-wide level using GPU devices to parallelize the computation. Finally we developed scalable semi-supervised graph-based Machine Learning algorithms based on parametrized Hopfield Networks to process in parallel using GPU devices large biological graphs, using a parallel coloring method that improves the classical Luby greedy algorithm. We also present ongoing extensions of parSMURF, very recently awarded by the Partnership for Advance in Computing in Europe (PRACE) consortium to further develop the algorithm, apply them to huge genomic data and embed its results into Genomiser, a state-of-the-art computational tool for the detection of pathogenic variants associated with Mendelian genetic diseases, in the context of an international collaboration with the Jackson Lab for Genomic Medicine.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
28

Zmeškal, Jiří. „Extrémní učící se stroje pro předpovídání časových řad“. Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2018. http://www.nusl.cz/ntk/nusl-376967.

Der volle Inhalt der Quelle
Annotation:
The thesis is aimed at the possible use of extreme learning machines and echo state networks for time series forecasting, with the option of GPU acceleration. Such predictions are part of nearly everyone's daily life, through weather forecasting, market and stock-market prediction, power consumption prediction and much more. The thesis first familiarizes the reader with the theoretical basis of extreme learning machines and echo state networks, which take advantage of randomly generating the majority of the neural network parameters and thus avoid iterative training. It then demonstrates the use of programming tools, such as ND4J and the CUDA toolkit, to create our own programs. Finally, the prediction capability and the benefit of GPU acceleration are tested.
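A hedged, minimal illustration of that idea, in which the hidden-layer weights are drawn at random and only the output weights are solved in closed form (toy data, not the thesis implementation):

    import numpy as np

    # Minimal extreme learning machine sketch: random input weights,
    # closed-form least-squares solve for the output layer, no iterative training.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))            # 200 samples, 3 input features
    y = np.sin(X[:, 0]) + 0.1 * X[:, 1]      # toy regression target

    n_hidden = 50
    W = rng.normal(size=(3, n_hidden))       # random, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                   # random hidden-layer features

    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights via pseudoinverse
    y_hat = H @ beta
    print(np.mean((y - y_hat) ** 2))         # small training error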
APA, Harvard, Vancouver, ISO und andere Zitierweisen
29

Jasovský, Filip. „Realizace superpočítače pomocí grafické karty“. Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2014. http://www.nusl.cz/ntk/nusl-220617.

Der volle Inhalt der Quelle
Annotation:
This master's thesis deals with the realization of a supercomputer using a graphics card with CUDA technology. The theoretical part describes the function and capabilities of graphics cards and desktop computers and the processes taking place during calculations on them. The practical part deals with creating a system for calculations on the graphics card using an artificial intelligence algorithm, more specifically artificial neural networks. The generated program is subsequently used for data classification of a large input data file, and finally the results are compared.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
30

Reoyo-Prats, Reine. „Etude du vieillissement de récepteurs solaires : estimation de propriétés thermophysiques par méthode photothermique associée aux outils issus de l'intelligence artificielle“. Thesis, Perpignan, 2020. http://www.theses.fr/2020PERP0017.

Der volle Inhalt der Quelle
Annotation:
L’augmentation de la consommation énergétique et la prise de conscience du dérèglement climatique induit par l’augmentation des émissions de gaz à effet de serre engendrent un changement progressif du modèle énergétique. Les technologies faisant appel à des ressources renouvelables se développent depuis plusieurs décennies ; c’est notamment le cas des centrales solaires à concentration. La problématique de leur durabilité se pose donc. Cette thèse participe en premier lieu à la réflexion concernant la méthodologie de vieillissement accéléré des matériaux employés dans les récepteurs de ces centrales, partie soumise au rayonnement solaire concentré. Pour cela, plusieurs protocoles expérimentaux sont réalisés. Leur efficacité est principalement jugée au vue de l’évolution des propriétés radiatives des matériaux (absorptivité, émissivité). En parallèle, les propriétés thermophysiques que sont la conductivité thermique et la diffusivité sont étudiées sur un panel plus large de matériaux. Compte tenu des limites et des contraintes de caractérisation avec les méthodes actuelles, une nouvelle méthode d’estimation de ces propriétés est développée. Celle-ci est basée sur les réseaux de neurones artificiels et s’appuie sur des données expérimentales issues d’expériences photothermiques
The increasing energy consumption and the awareness of climate change induced by the increasing greenhouse gas emissions result in a progressive change of the energy model. Technologies based on renewable resources have been developing for several decades, such as concentrated solar power plants (CSP). So the issue of their sustainability is studied in many research programs. This thesis contributes to the development of a methodology for the accelerated ageing of the materials used in CSP receivers, which is the component submitted to concentrated solar radiation. For this purpose, several experimental protocols are carried out. Their efficiency is examined in light of the evolution of the radiative properties of the materials (absorptivity, emissivity). On another hand, the thermophysical properties such as the thermal conductivity and diffusivity are studied on a wider range of materials. Considering the limits of the current characterization methods, a new method for estimating these properties is developed. This is based on artificial neural networks and relies on photothermal experimental data
APA, Harvard, Vancouver, ISO und andere Zitierweisen
31

Pradhan, Manoj Kumar. „Conformal Thermal Models for Optimal Loading and Elapsed Life Estimation of Power Transformers“. Thesis, Indian Institute of Science, 2004. https://etd.iisc.ac.in/handle/2005/97.

Der volle Inhalt der Quelle
Annotation:
Power and Generator Transformers are important and expensive elements of a power system. Inadvertent failure of Power Transformers would cause long interruption in power supply with consequent loss of reliability and revenue to the supply utilities. The mineral oil impregnated paper, OIP, is an insulation of choice in large power transformers in view of its excellent dielectric and other properties, besides being relatively inexpensive. During the normal working regime of the transformer, the insulation thereof is subjected to various stresses, the more important among them are, electrical, thermal, mechanical and chemical. Each of these stresses, appearing singly, or in combination, would lead to a time variant deterioration in the properties of insulation, called Ageing. This normal and inevitable process of degradation in the several essential properties of the insulation is irreversible, is a non-Markov physico-chemical reaction kinetic process. The speed or the rapidity of insulation deterioration is a very strong function of the magnitude of the stresses and the duration over which they acted. This is further compounded, if the stresses are in synergy. During the processes of ageing, some, or all the vital properties undergo subtle changes, more often, not in step with the duration of time over which the damage has been accumulated. Often, these changes are non monotonic, thus presenting a random or a chaotic picture and understanding the processes leading to eventual failure becomes difficult. But, there is some order in this chaos, in that, the time average of the changes over short intervals of time, seems to indicate some degree of predictability. The status of insulation at any given point in time is assessed by measuring such of those properties as are sensitive to the amount of ageing and comparing it with earlier measurements. This procedure, called the Diagnostic or nondestructive Testing, has been in vogue for some time now. Of the many parameters used as sensitive indices of the dynamics of insulation degradation, temporal changes in temperatures at different locations in the body of the transformer, more precisely, the winding hot spots (HST) and top oil temperature (TOT) are believed to give a fairly accurate indication of the rate of degradation. Further, an accurate estimation of the temperatures would enable to determine the loading limit (loadability) of power transformer. To estimate the temperature rise reasonably accurately, one has to resort to classical mathematical techniques involving formulation and solution of boundary value problem of heat conduction under carefully prescribed boundary conditions. Several complications are encountered in the development of the governing equations for the emergent heat transfer problems. The more important among them are, the inhomogeneous composition of the insulation structure and of the conductor, divergent flow patterns of the oil phase and inordinately varying thermal properties of conductor and insulation. Validation and reconfirmation of the findings of the thermal models can be made using state of the art methods, such as, Artificial Intelligence (AI) techniques, Artificial Neural Network (ANN) and Genetic Algorithm (GA). Over the years, different criteria have been prescribed for the prediction of terminal or end of life (EOL) of equipment from the standpoint of its insulation. But, thus far, no straightforward and unequivocal criterion is forth coming. 
Calculation of elapsed life in line with the existing methodology, given by IEEE, IEC, introduces unacceptable degrees of uncertainty. It is needless to say that, any conformal procedure proposed in the accurate prediction of EOL, has to be based on a technically feasible and economically viable consideration. A systematic study for understanding the dynamical nature of ageing in transformers in actual service is precluded for reasons very well known. Laboratory experiments on prototypes or pro-rated units fabricated based on similarity studies, are performed under controlled conditions and at accelerated stress levels to reduce experimental time. The results thereof can then be judiciously extrapolated to normal operating conditions and for full size equipment. The terms of reference of the present work are as follows; 1. Computation of TOT and HST Theoretical model based on Boundary Value Problem of Heat Conduction Application of AI Techniques 2. Experimental Investigation for estimating the Elapsed Life of transformers Based on the experimental investigation a semi-empirical expression has been developed to estimate the loss of life of power and station transformer by analyzing gas content and furfural dissolved in oil without performing off-line and destructive tests.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
32

Pradhan, Manoj Kumar. „Conformal Thermal Models for Optimal Loading and Elapsed Life Estimation of Power Transformers“. Thesis, Indian Institute of Science, 2004. http://hdl.handle.net/2005/97.

Der volle Inhalt der Quelle
Annotation:
Power and Generator Transformers are important and expensive elements of a power system. Inadvertent failure of Power Transformers would cause long interruption in power supply with consequent loss of reliability and revenue to the supply utilities. The mineral oil impregnated paper, OIP, is an insulation of choice in large power transformers in view of its excellent dielectric and other properties, besides being relatively inexpensive. During the normal working regime of the transformer, the insulation thereof is subjected to various stresses, the more important among them are, electrical, thermal, mechanical and chemical. Each of these stresses, appearing singly, or in combination, would lead to a time variant deterioration in the properties of insulation, called Ageing. This normal and inevitable process of degradation in the several essential properties of the insulation is irreversible, is a non-Markov physico-chemical reaction kinetic process. The speed or the rapidity of insulation deterioration is a very strong function of the magnitude of the stresses and the duration over which they acted. This is further compounded, if the stresses are in synergy. During the processes of ageing, some, or all the vital properties undergo subtle changes, more often, not in step with the duration of time over which the damage has been accumulated. Often, these changes are non monotonic, thus presenting a random or a chaotic picture and understanding the processes leading to eventual failure becomes difficult. But, there is some order in this chaos, in that, the time average of the changes over short intervals of time, seems to indicate some degree of predictability. The status of insulation at any given point in time is assessed by measuring such of those properties as are sensitive to the amount of ageing and comparing it with earlier measurements. This procedure, called the Diagnostic or nondestructive Testing, has been in vogue for some time now. Of the many parameters used as sensitive indices of the dynamics of insulation degradation, temporal changes in temperatures at different locations in the body of the transformer, more precisely, the winding hot spots (HST) and top oil temperature (TOT) are believed to give a fairly accurate indication of the rate of degradation. Further, an accurate estimation of the temperatures would enable to determine the loading limit (loadability) of power transformer. To estimate the temperature rise reasonably accurately, one has to resort to classical mathematical techniques involving formulation and solution of boundary value problem of heat conduction under carefully prescribed boundary conditions. Several complications are encountered in the development of the governing equations for the emergent heat transfer problems. The more important among them are, the inhomogeneous composition of the insulation structure and of the conductor, divergent flow patterns of the oil phase and inordinately varying thermal properties of conductor and insulation. Validation and reconfirmation of the findings of the thermal models can be made using state of the art methods, such as, Artificial Intelligence (AI) techniques, Artificial Neural Network (ANN) and Genetic Algorithm (GA). Over the years, different criteria have been prescribed for the prediction of terminal or end of life (EOL) of equipment from the standpoint of its insulation. But, thus far, no straightforward and unequivocal criterion is forth coming. 
Calculation of elapsed life in line with the existing methodology, given by IEEE, IEC, introduces unacceptable degrees of uncertainty. It is needless to say that, any conformal procedure proposed in the accurate prediction of EOL, has to be based on a technically feasible and economically viable consideration. A systematic study for understanding the dynamical nature of ageing in transformers in actual service is precluded for reasons very well known. Laboratory experiments on prototypes or pro-rated units fabricated based on similarity studies, are performed under controlled conditions and at accelerated stress levels to reduce experimental time. The results thereof can then be judiciously extrapolated to normal operating conditions and for full size equipment. The terms of reference of the present work are as follows; 1. Computation of TOT and HST Theoretical model based on Boundary Value Problem of Heat Conduction Application of AI Techniques 2. Experimental Investigation for estimating the Elapsed Life of transformers Based on the experimental investigation a semi-empirical expression has been developed to estimate the loss of life of power and station transformer by analyzing gas content and furfural dissolved in oil without performing off-line and destructive tests.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
33

Narmack, Kirilll. „Dynamic Speed Adaptation for Curves using Machine Learning“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233545.

Der volle Inhalt der Quelle
Annotation:
The vehicles of tomorrow will be more sophisticated, intelligent and safe than the vehicles of today. The future is leaning towards fully autonomous vehicles. This degree project provides a data driven solution for a speed adaptation system that can be used to compute a vehicle speed for curves, suitable for the underlying driving style of the driver, road properties and weather conditions. A speed adaptation system for curves aims to compute a vehicle speed suitable for curves that can be used in Advanced Driver Assistance Systems (ADAS) or in Autonomous Driving (AD) applications. This degree project was carried out at Volvo Car Corporation. Literature in the field of speed adaptation systems and factors affecting the vehicle speed in curves was reviewed. Naturalistic driving data was both collected by driving and extracted from Volvo's data base and further processed. A novel speed adaptation system for curves was invented, implemented and evaluated. This speed adaptation system is able to compute a vehicle speed suitable for the underlying driving style of the driver, road properties and weather conditions. Two different artificial neural networks and two mathematical models were used to compute the desired vehicle speed in curves. These methods were compared and evaluated.
Morgondagens fordon kommer att vara mer sofistikerade, intelligenta och säkra än dagens fordon. Framtiden lutar mot fullständigt autonoma fordon. Detta examensarbete tillhandahåller en datadriven lösning för ett hastighetsanpassningssystem som kan beräkna ett fordons hastighet i kurvor som är lämpligt för förarens körstil, vägens egenskaper och rådande väder. Ett hastighetsanpassningssystem för kurvor har som mål att beräkna en fordonshastighet för kurvor som kan användas i Advanced Driver Assistance Systems (ADAS) eller Autonomous Driving (AD) applikationer. Detta examensarbete utfördes på Volvo Car Corporation. Litteratur kring hastighetsanpassningssystem samt faktorer som påverkar ett fordons hastighet i kurvor studerades. Naturalistisk bilkörningsdata samlades genom att köra bil samt extraherades från Volvos databas och bearbetades. Ett nytt hastighetsanpassningssystem uppfanns, implementerades samt utvärderades. Hastighetsanpassningssystemet visade sig vara kapabelt till att beräkna en lämplig fordonshastighet för förarens körstil under rådande väderförhållanden och vägens egenskaper. Två olika artificiella neuronnätverk samt två matematiska modeller användes för att beräkna fordonets hastighet. Dessa metoder jämfördes och utvärderades.
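For orientation, a simple physics-based baseline (not one of the data-driven models in the thesis) bounds the curve speed by a chosen comfortable lateral acceleration a_lat for a curve of radius R:

    v_{max} = \sqrt{a_{lat} \cdot R}

The neural-network and mathematical models developed in the thesis can be read as refining such a baseline by additionally accounting for driving style, road properties and weather conditions.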
APA, Harvard, Vancouver, ISO und andere Zitierweisen
34

Jebelli, Ali. „Development of Sensors and Microcontrollers for Underwater Robots“. Thesis, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31283.

Der volle Inhalt der Quelle
Annotation:
Nowadays, small autonomous underwater robots are strongly preferred for remote exploration of unknown and unstructured environments. Such robots allow the exploration and monitoring of underwater environments where a long term underwater presence is required to cover a large area. Furthermore, reducing the robot size, embedding electrical board inside and reducing cost are some of the challenges designers of autonomous underwater robots are facing. As a key device for reliable operation-decision process of autonomous underwater robots, a relatively fast and cost effective controller based on Fuzzy logic and proportional-integral-derivative method is proposed in this thesis. It efficiently models nonlinear system behaviors largely present in robot operation and for which mathematical models are difficult to obtain. To evaluate its response, the fault finding test approach was applied and the response of each task of the robot depicted under different operating conditions. The robot performance while combining all control programs and including sensors was also investigated while the number of program codes and inputs were increased.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
35

Lee, Heng, und 李亨. „Convolutional Neural Network Accelerator with Vector Quantization“. Thesis, 2019. http://ndltd.ncl.edu.tw/handle/w7kr56.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
107
Deep neural networks (DNNs) have demonstrated impressive performance in many edge computer vision tasks, causing an increasing demand for DNN accelerators on mobile and Internet of Things (IoT) devices. However, the massive power consumption and storage requirements make the hardware design challenging. In this paper, we introduce a DNN accelerator based on a model compression technique, vector quantization (VQ), which can reduce the network model size and the computation cost simultaneously. Moreover, a specialized processing element (PE) is designed with various SRAM bank configurations as well as dataflows, such that it can support different codebook/kernel sizes and keep high utilization under small input or output channel numbers. Compared to the state of the art, the proposed accelerator architecture achieves a 3.94x reduction in memory access and a 1.2x reduction in latency for batch-one inference.
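As a hedged sketch of why vector quantization reduces both model size and computation (illustrative shapes and codebook, not those of the accelerator above): weights are stored as small codebook indices, and dot products can be assembled from a per-codeword table of precomputed partial sums.

    import numpy as np

    # Hedged VQ inference sketch: a weight row is 8 codeword indices (32 weights).
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(16, 4))       # 16 codewords, sub-vectors of length 4
    indices = rng.integers(0, 16, size=8)     # compressed weights: indices only
    x = rng.normal(size=32)

    # Precompute the dot product of every codeword with each input sub-vector ...
    partial = codebook @ x.reshape(8, 4).T    # shape (16, 8)
    # ... then the full dot product is just 8 table lookups and additions.
    y = sum(partial[indices[j], j] for j in range(8))

    w = codebook[indices].reshape(-1)         # the implied full-precision weight row
    print(np.allclose(y, w @ x))              # True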
APA, Harvard, Vancouver, ISO und andere Zitierweisen
36

Chen, Chun-Chen, und 陳俊辰. „Design Exploration Methodology for Deep Convolutional Neural Network Accelerator“. Thesis, 2018. http://ndltd.ncl.edu.tw/handle/sj63xu.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
37

Hu, Yu-Lin, und 胡雨霖. „General Accelerator Study and Design for Convolutional Neural Network“. Thesis, 2019. http://ndltd.ncl.edu.tw/handle/86u346.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Cheng Kung University
Department of Electrical Engineering
107
The hardware design of convolutional neural networks (CNNs) faces the following problems: high computational complexity, a large amount of data movement, and structural divergence between different neural networks. Previous work has dealt well with the first two problems but fails to consider the third in a broad way. After analyzing state-of-the-art CNN accelerators and the design space they exploit, we try to develop a format that can describe the full design space. Based on our design space exploration and hardware evaluation, we propose a novel general CNN hardware accelerator, which contains hierarchical memory storage, a two-dimensional hardware processing-unit array of variable length and width, and an elastic data distributor. Our work shows higher multiplier usage in FPGA results compared with previous FPGA designs. On the other hand, our work is as efficient as two other recent works in ASIC synthesis estimates.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
38

ACHARJEE, SUVAJIT, und 蘇沃杰. „Hardware Efficient Accelerator for Binary Convolution Neural Network Inference“. Thesis, 2019. http://ndltd.ncl.edu.tw/handle/2c8792.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Chiao Tung University
EECS International Graduate Program
107
Binary neural networks (BNNs) are a rapidly evolving topic, improving day by day for use in computer vision tasks such as recognition, object detection and depth perception. However, most existing designs suffer from low hardware utilization or complex circuits that result in high hardware cost, and a large amount of computational redundancy still exists in BNN inference. To overcome these issues of hardware utilization and computational complexity, this design adopts a systolic array architecture that takes binarized inputs and weights. Storage is drastically reduced, since weights and activations can be stored as a single bit each, i.e., +1 is stored as 1 and -1 is stored as 0, and the computational problem is eased by replacing MAC operations with bitwise operations. In this design, eight PEs are used, each processed in parallel with its own accumulator, and a 3x3 convolution kernel is filtered in each PE block. Throughput is increased, with an operating frequency of at most 188.67 MHz and at least 125 MHz. Our results show that with eight PEs the design achieves 63.168 GOPS, which is 10x more area-efficient than other results. The power consumption after simulating the RTL synthesis is 0.014 W. The architecture was successfully implemented in Xilinx ISE 14.7 using a Spartan-6 series FPGA. This design also shows better area and bandwidth efficiency compared to other state-of-the-art works.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
39

Lin, Chien-Yu, und 林建宇. „Merlin: A Sparse Neural Network Accelerator Utilizing Both Neuron and Weight Sparsity“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/6aq7yc.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
40

Wu, Yi-Heng, und 吳奕亨. „Compressing Convolutional Neural Network by Vector Quantization: Implementation and Accelerator Design“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/959vy5.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
105
In recent years, deep convolutional neural networks (CNNs) have achieved ground-breaking success in many computer vision research fields. Due to the large model size and tremendous computation of CNNs, they cannot be efficiently executed on small devices like mobile phones. Although several hardware accelerator architectures have been developed, most of them can only efficiently address one of the two major layer types in CNNs, convolutional (CONV) and fully connected (FC) layers. In this thesis, based on algorithm-architecture co-exploration, our architecture targets executing both layer types with high efficiency. The vector quantization technique is first selected to compress the parameters, reduce the computation, and unify the behaviour of both CONV and FC layers. To fully exploit the gain of vector quantization, we then propose an accelerator architecture for quantized CNNs. Different DRAM access schemes are employed to reduce DRAM access and subsequently power consumption. We also design a high-throughput processing element architecture to accelerate quantized layers. Compared to state-of-the-art accelerators for CNNs, the proposed architecture achieves 1.2-5x less DRAM access and 1.5-5x higher throughput for both CONV and FC layers.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
41

Kung, Chu King, und 江子近. „An Energy-Efficient Accelerator SOC for Convolutional Neural Network Training“. Thesis, 2019. http://ndltd.ncl.edu.tw/handle/y475rn.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
107
The recent resurgence of artificial intelligence is due to advances in deep learning. Deep neural network (DNN) has exceeded human capability in many computer vision applications, such as object detection, image classification and playing games like Go. The idea of deep learning dates back to as early as the 1950s, with the key algorithmic breakthroughs occurred in the 1980s. Yet, it has only been in the past few years, that powerful hardware accelerators became available to train neural networks. Even now, the demand for machine learning algorithms is still increasing; and it is affecting almost every industry. Therefore, designing a powerful and efficient hardware accelerator for deep learning algorithms is of critical importance for the time being. The accelerators that run the deep learning algorithm must be general enough to support deep neural networks with various computational structures. For instance, general-purpose graphics processing units (GP-GPUs) were widely adopted for deep learning tasks ever since they allow users to execute arbitrary code on them. Other than graphics processing units, researchers have also paid a lot of attention to hardware acceleration of deep neural networks (DNNs) in the last few years. Google developed its own chip called the Tensor Processing Unit (TPU) to power its own machine learning services [8]; while Intel unveiled its first generation of ASIC processor, called Nervana, for deep learning a few years ago [9]. ASICs usually provide a better performance, compared with FPGA and software implementations. Nevertheless, existing accelerators mostly focus on inference. However, local DNN training is still required to meet the needs of new applications, such as incremental learning and on-device personalization. Unlike inference, training requires high dynamic range in order to deliver high learning quality. In this work, we introduce the floating-point signed digit (FloatSD) data representation format for reducing computational complexity required for both the inference and the training of a convolutional neural network (CNN). By co-designing data representation and circuit, we demonstrate that we can achieve high raw performance and optimal efficiency – both energy and area – without sacrificing the quality of training. This work focuses on the design of FloatSD based system on chip (SOC) for AI training and inference. The SOC consists of an AI IP, integrated DDR3 controller and ARC HS34 CPU through AXI/AHB standard AMBA interfaces. The platform can be programmed by the CPU via the AHB slave port to fit various neural network topologies. The completed SOC has been tested and validated on the HAPS-80 FPGA platform. A synthesis and automated place and route (APR) flow is used to tape out a 28 nm test chip, after testing and verifying the correctness of the SOC. At its normal operating condition (e.g. 400MHz), the accelerator is capable of 1.38 TFLOPs peak performance and 2.34 TFLOPS/W.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
42

Chen, Chih-Chiang, und 陳致強. „Energy-Efficient Accelerator and Data Processing Flow for Convolutional Neural Network“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/fy245e.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Chiao Tung University
Institute of Electronics
106
In recent years, machine learning and convolutional neural networks (CNNs) have become among the most popular research topics. Limited by hardware technology that had not yet matured, this topic could not be fully developed before. Since CNNs need a great deal of computation and a large amount of data access and movement, the energy cost of data access may even exceed that of the computation itself. Therefore, how to manage data reuse efficiently and reduce data access has become a research theme. In this thesis, we propose a processing element (PE) that exploits data reuse effectively, together with a data processing flow in which data can be propagated between PEs, making data reuse more frequent. Besides, we propose a 3D/2.5D accelerator system architecture; transmitting data through TSVs can further decrease the energy consumption. We provide a comparison of 2D, 2.5D and 3D implementations in speed, power and other metrics, and we propose an FPGA implementation design flow for reference and future research. We present an innovative reconfigurable accelerator for deep learning networks which has the advantages of both computation-intensive and data-intensive applications. This new reconfigurable computing hardware technique can mitigate the power and memory walls for computation- and data-intensive applications such as computer vision, computer graphics, convolutional neural networks and deep learning networks.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
43

Chen, Yi-Kai, und 陳奕愷. „Architecture Design of Energy-Efficient Reconfigurable Deep Convolutional Neural Network Accelerator“. Thesis, 2018. http://ndltd.ncl.edu.tw/handle/46a96s.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
44

Juang, Tzung-Han, und 莊宗翰. „Energy-Efficient Accelerator Architecture for Neural Network Training and Its Circuit Design“. Thesis, 2018. http://ndltd.ncl.edu.tw/handle/sffx7b.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
106
Artificial intelligence (AI) has become the most popular research topic in recent years. AI can be applied to image classification, object detection and natural language processing, and researchers have achieved breakthroughs in such fields with neural networks. Neural networks are known for their versatile and deep architectures, which can have more than hundreds of layers; such structures make neural networks require a large amount of computation and memory. Improvements in hardware acceleration on graphics processing units (GPUs) have made it possible to apply neural networks to practical applications. However, GPUs tend to be bulky and very power-hungry. Much research has focused on reducing the computational resources used by neural networks and on implementations on specific hardware, but most of these works only support acceleration of the inference phase. Beyond inference, this thesis proposes an architecture that also supports the training phase, based on the backpropagation algorithm for finding optimal neural network models. The training phase includes the forward pass, the backward pass and the weight update, while inference only contains the forward pass. This thesis is devoted to designing a unified architecture that can process these three stages of the training phase for convolutional neural networks (CNNs). In addition, IO bandwidth is always the bottleneck of accelerator design. To reduce the data bandwidth, this thesis uses the floating-point signed digit (FloatSD) algorithm and quantization techniques from previous work as a basis to reduce the neural network size and the bit width of data values; the previous work reaches a 0.8% loss of top-5 accuracy on the ImageNet dataset compared to the floating-point version. This thesis designs a hardware accelerator for training neural networks, including the processing dataflow, the AMBA interface and the memory settings. The design is an IP-level engine that can be applied to an SOC platform. In addition, this thesis also focuses on optimizing data reuse so that the system makes efficient use of DRAM access. Keywords: convolutional neural network, backpropagation, FloatSD
APA, Harvard, Vancouver, ISO und andere Zitierweisen
45

Hsu, Lien-Chih, und 徐連志. „ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator“. Thesis, 2018. http://ndltd.ncl.edu.tw/handle/859cgm.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Tsing Hua University
Department of Computer Science
107
Over the past decade, deep convolutional neural networks (CNNs) have been widely embraced in various visual recognition applications due to their extraordinary accuracies, which have even surpassed those of human beings. However, the high computational complexity and the massive amount of data storage are two challenges for CNN hardware design. Although GPUs can deal with the high computational complexity, the large energy consumption due to huge external memory access has pushed researchers towards dedicated CNN accelerator designs. Generally, the precision of modern CNN accelerators is set to 16-bit fixed point. To reduce data storage, Sakr et al. [1] show that less precision can be used under the constraint of 1% accuracy degradation in recognition, and that per-layer precision assignments can reach lower bit-width requirements than a uniform precision assignment for all layers. In this paper, we propose an energy-aware bit-serial streaming deep CNN accelerator to tackle the computational complexity, data storage and external memory access issues. With a ring streaming dataflow and an output reuse strategy to decrease data access, the amount of external DRAM access for the convolutional layers is reduced by 357.26x compared to the case without output reuse on AlexNet. In addition, we optimize hardware utilization and avoid unnecessary computations through loop tiling and by mapping strides of convolutional layers to unit strides for computational performance enhancement. Furthermore, the bit-serial processing element (PE) is designed to use fewer bits in the weights, which reduces both the amount of computation and the external memory access. We evaluate our design with the well-known roofline model, which is an efficient way to evaluate compared to real hardware implementation; the design space is explored to find the solution with the best computational performance and communication-to-computation (CTC) ratio. Assuming the same FPGA as Chen et al. [2], we can reach a 1.36x speed-up and reduce energy consumption for external memory access by 41% compared to the design in [2]. Regarding the hardware implementation of our PE-array architecture, the implementation reaches an operating frequency of 119 MHz and consumes 68 k gates with a power consumption of 10.08 mW under TSMC 90 nm technology. Compared to the 15.4 MB of external memory access for Eyeriss [3] on the convolutional layers of AlexNet, our work only needs 4.36 MB of external memory access, dramatically reducing the most energy-consuming part of the power consumption.
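For reference, the roofline model mentioned above bounds the attainable performance by the lesser of the compute peak and what the memory system can feed; in its usual textbook form (general notation, not thesis-specific):

    P_{attainable} = \min\left(P_{peak},\; I \times BW_{mem}\right)

where I is the operational intensity (operations per byte of external memory traffic), which plays the same role as the CTC ratio explored in the design space of the thesis.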
APA, Harvard, Vancouver, ISO und andere Zitierweisen
46

Shen, En-Ho, und 沈恩禾. „Reconfigurable Low Arithmetic Precision Convolution Neural Network Accelerator VLSI Design and Implementation“. Thesis, 2019. http://ndltd.ncl.edu.tw/handle/7678c2.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
107
Deep neural networks (DNNs) show promising results on various AI application tasks. However, such networks are typically executed on general-purpose GPUs, bulky in form factor and consuming hundreds of watts, which is unsuitable for mobile applications. In this thesis, we present a VLSI architecture able to process quantized, low-numeric-precision convolutional neural networks (CNNs), cutting down the power consumption from memory access and speeding up the model within a limited area budget, making it particularly fit for mobile devices. We first propose a quantization re-training algorithm for training low-precision CNNs, then a dataflow with a high data-reuse rate and a multiplication-accumulation strategy specially designed for such quantized models. To fully utilize the efficiency of computation with such low-precision data, we design a micro-architecture for low-bit-length multiplication and accumulation, an on-chip memory hierarchy and data re-alignment flow for power saving and for avoiding buffer bank conflicts, and a PE array designed to take broadcast data from the buffer and send finished data sequentially back to the buffer under this dataflow. The architecture is highly flexible for various CNN shapes and reconfigurable for low-bit-length quantized models. The design is synthesised with a 180 KB on-chip memory capacity and an area of 1340 k logic gates; the implementation result shows state-of-the-art hardware efficiency.
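As a hedged sketch of the kind of low-precision representation such an accelerator operates on (symmetric uniform quantization here; the thesis's actual quantization and re-training scheme may differ):

    import numpy as np

    # Hedged sketch: symmetric uniform quantization of weights to b bits.
    def quantize(w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / qmax                       # per-tensor scale factor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    w = np.random.randn(64).astype(np.float32)
    q, s = quantize(w, 4)                  # 4-bit signed integer codes
    w_hat = q.astype(np.float32) * s       # dequantized values seen during re-training
    print(np.max(np.abs(w - w_hat)))       # quantization error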
APA, Harvard, Vancouver, ISO und andere Zitierweisen
47

Mohammadi, Mahnaz. „An Accelerator for Machine Learning Based Classifiers“. Thesis, 2017. http://etd.iisc.ac.in/handle/2005/4245.

Der volle Inhalt der Quelle
Annotation:
Artificial Neural Networks (ANNs) are algorithmic techniques that simulate biological neural systems. Typical realizations of ANNs are software solutions using high-level languages (HLLs) such as C, C++, etc. Such solutions have performance limitations which can be attributed to one of the following reasons: code generated by the compiler cannot perform application-specific optimizations, and communication latencies between processors through a memory hierarchy can be significant due to the non-deterministic nature of the communications. In the data mining field, ANN algorithms have been widely used as classifiers for data classification applications. Classification involves predicting a certain outcome based on a given input. In order to predict the outcome more precisely, the training algorithm should discover relationships between the attributes that make the prediction possible, so that later, when an unseen pattern containing the same set of attributes except for the prediction attribute (which is not yet known) is given to the algorithm, it can process that pattern and produce its outcome. The prediction accuracy, which defines how good the algorithm is at recognizing unseen patterns, depends on how well the algorithm is trained. The Radial Basis Function Neural Network (RBFNN) is a type of neural network which has been widely used in classification applications. A pure software implementation of this network cannot cope with the performance expected of high-performance ANN applications. Accelerators can be used to speed up these kinds of applications; they take many forms, ranging from specially configured cores to reconfigurable circuits. Multi-core and GPU-based accelerators can speed up these applications by up to several orders of magnitude compared to general-purpose processors (GPPs), but the efficiency of accelerators for RBFNNs reduces as the network size increases. Custom hardware implementation is often required to exploit the parallelism and minimize computing time for real-time application requirements. Neural networks have been implemented on different hardware platforms such as Application-Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs). We provide a generic hardware solution for classification using RBFNN and a feed-forward neural network with the backpropagation learning algorithm (FFBPNN) on a reconfigurable datapath that overcomes the major drawback of fixed-function hardware datapaths, which offer limited flexibility in terms of application interchangeability and scalability. Our contributions in this thesis are as follows:
• Definition and implementation of open-source reference software implementations of a few categories of ANNs for classification purposes.
• Benchmarking the performance on general-purpose processors.
• Porting the source code for execution on GPU using the CUDA API and benchmarking the performance.
• Proposing scalable and area-efficient hardware architectures for training the learning parameters of ANNs.
• Synthesizing the ANNs on reconfigurable architectures.
• MPSoC implementation of ANNs for functional verification of our implementation.
• Demonstration of the performance advantage of ANN realization on reconfigurable architectures over CPU and GPU for classification applications.
• Proposing a generalized methodology for the realization of classification using ANNs on reconfigurable architectures.
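A hedged, minimal sketch of RBFNN classification inference as described above (Gaussian hidden units around trained centres, linear output layer; the centres, widths and weights here are toy values, not trained ones):

    import numpy as np

    # Hedged RBFNN inference sketch; training of centres/widths/weights
    # (e.g. k-means plus least squares) is assumed to have happened already.
    def rbf_layer(x, centres, sigma):
        d2 = np.sum((x[None, :] - centres) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    rng = np.random.default_rng(1)
    centres = rng.normal(size=(10, 4))    # 10 hidden RBF neurons, 4-D inputs
    weights = rng.normal(size=(10, 3))    # 3 output classes
    x = rng.normal(size=4)
    scores = rbf_layer(x, centres, 1.0) @ weights
    print(int(np.argmax(scores)))         # predicted class index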
APA, Harvard, Vancouver, ISO und andere Zitierweisen
48

Lee, Yen-Hsing, und 李彥興. „Design of Low Complexity Convolutional Neural Network Accelerator for Finger-Vein Identification System“. Thesis, 2019. http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dnclcdr&s=id=%22107NCHU5441060%22.&searchmode=basic.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
National Chung Hsing University
Department of Electrical Engineering
107
Vein identification is a vital branch of biometrics, offering better concealment and unique features: veins are a physiological characteristic of human beings, and a complete vein image can only be obtained by illumination with specific bands of light. Vein images are also highly reliable for identification, since each person's vein pattern is unique. Thanks to advances in computing, neural networks have rapidly become the mainstream method for image classification: by establishing a large database, we can use a neural network to analyse features that then become the classification basis for that database. Moreover, users usually do not want their personal information uploaded to the cloud, so edge computing has become a vital issue for protecting user privacy. Inspired by these concepts, we propose a low-complexity convolutional neural network for finger-vein recognition with a top-1 accuracy of 95%. This neural network system can operate independently in client mode: after capturing the user's finger-vein image through a near-infrared camera mounted on a Raspberry Pi embedded board, the vein features can be efficiently extracted by vein curving algorithms and the identity of the user quickly returned. In order to implement the concept of edge computing, our proposed system is characterized by a silicon intellectual property (SIP) design that shortens the inference time of the neural network; compared to an ARM Cortex-A9 dual-core CPU running in a Linux environment at 650 MHz, a 120x acceleration is obtained. Simulation and verification used the SDUMLA-HMT finger-vein database [28], provided by the Machine Learning and Data Mining Laboratory of Shandong University, and our laboratory's own database. Both datasets reach 95% accuracy for identifying 10 users when inference is run with our low-complexity neural network. Moreover, thanks to the low complexity, real-time identification can be achieved on the inference side, bringing the system closer to commercialization.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
49

Wu, I.-Chen, und 吳易真. „An Energy-Efficient Accelerator with Relative-Indexing Memory for Sparse Compressed Convolutional Neural Network“. Thesis, 2018. http://ndltd.ncl.edu.tw/handle/tx6yx4.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
50

Gu, Wen-Sheng, und 辜玟勝. „High Efficiency Accelerator for Deep Convolutional Neural Network by Using High-Level-Synthesis Design Flow“. Thesis, 2019. http://ndltd.ncl.edu.tw/handle/tqdb4m.

Der volle Inhalt der Quelle
Annotation:
Master's thesis
Feng Chia University
Department of Electronic Engineering
107
This work uses deep learning to detect objects, including cars, trucks, motorcycles and pedestrians. The study is divided into two parts, the training model and the hardware implementation. The training model uses compression techniques to reduce the number of parameters and augments the pedestrian samples to enhance the average precision (AP) for pedestrians; on this basis we propose the Agile Model. There are 19,061 training images and 4,950 test images, 24,011 images in total. Compared with Tiny-YOLO [17], the model size is reduced by 97.4%, and the execution speed is 15 FPS. In the hardware design, we use High-Level Synthesis to build a DCNN IP core containing a convolution layer, a batch normalization layer, Leaky ReLU and a pooling layer. In order to store the data in block RAM, the original floating-point 32-bit data is first truncated to a fixed-point 8-bit format, and the block RAM access pattern is improved to maximize block RAM utilization. A PS/PL interface is created to send the feature maps and weight values to the IP core for acceleration and to transfer the results back to DRAM, and a Python interface is used to control the data flow. This circuit runs the Agile Model at 100 MHz with 30.1 GOPS/W.
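A hedged sketch of the float32-to-fixed-point truncation step described above. The Q4.4 split (4 integer bits, 4 fraction bits) is an assumption for illustration; the thesis chooses its own integer/fraction split.

    import numpy as np

    # Hedged sketch: truncate float32 values to an 8-bit fixed-point code
    # so that feature maps and weights fit in block RAM.
    def to_fixed8(x, frac_bits=4):
        scaled = np.trunc(x * (1 << frac_bits))          # truncate toward zero, no rounding
        return np.clip(scaled, -128, 127).astype(np.int8)

    def from_fixed8(q, frac_bits=4):
        return q.astype(np.float32) / (1 << frac_bits)

    x = np.array([0.7531, -2.25, 3.999], dtype=np.float32)
    q = to_fixed8(x)
    print(q, from_fixed8(q))   # 8-bit codes and their reconstructed values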
APA, Harvard, Vancouver, ISO und andere Zitierweisen