Dissertations / Theses: 'Software acceleration'

1

Borgström, Fredrik. "Acceleration of FreeRTOS with Sierra RTOS accelerator : Implementation of a FreeRTOS software layer on Sierra RTOS accelerator." Thesis, KTH, Data- och elektroteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188518.

Abstract:

Today, the gains from the most common ways of improving the performance of embedded systems and real-time operating systems are stagnating. It is therefore of interest to examine new ways to push the performance boundaries of embedded systems and real-time operating systems even further. It has previously been demonstrated that the hardware-based real-time operating system Sierra performs better than the software-based real-time operating system FreeRTOS. These two real-time operating systems have also been shown to be similar in many respects, which means that Sierra can be used to accelerate FreeRTOS. In this thesis, such an acceleration has been implemented. Because existing real-time operating systems are constantly evolving, and several years had passed since the two systems were last compared, this thesis also compares FreeRTOS and Sierra in terms of functionality and architecture. The comparison showed that FreeRTOS and Sierra share the most fundamental functions of a real-time operating system, which can therefore be accelerated by Sierra, but that FreeRTOS also has a number of exclusive functions that make it easier to use. The information obtained from this comparison formed the basis for how the acceleration would be implemented. Performance tests showed that all but a few of the implemented functions had shorter execution times than the corresponding functions in the original version of FreeRTOS.

2

Kulkarni, Pallavi Anil. "Hardware acceleration of software library string functions." Ann Arbor, Mich. : ProQuest, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:1447245.

Abstract:

Thesis (M.S. in Computer Engineering)--S.M.U., 2007.
Title from PDF title page (viewed Nov. 19, 2009). Source: Masters Abstracts International, Volume: 46-03, page: 1577. Adviser: Mitch Thornton. Includes bibliographical references.

3

Blumer, Aric David. "Register Transfer Level Simulation Acceleration via Hardware/Software Process Migration." Diss., Virginia Tech, 2007. http://hdl.handle.net/10919/29380.

Abstract:

The run-time reconfiguration of Field Programmable Gate Arrays (FPGAs) opens new avenues to hardware reuse. Through the use of process migration between hardware and software, an FPGA provides a parallel execution cache: busy processes can be migrated into hardware-based parallel processors, and idle processes can be migrated out, increasing the utilization of the hardware. The application of hardware/software process migration to the acceleration of Register Transfer Level (RTL) circuit simulation is developed and analyzed. RTL code can exhibit a form of locality of reference such that executing processes tend to be executed again. This property is termed executive temporal locality, and it can be exploited by migration systems to accelerate RTL simulation. In this dissertation, process migration is first formally modeled using Finite State Machines (FSMs). Programs, processes, migration realms, and the migration of process state within a realm are then built on these FSMs. From this model, a taxonomy of migration realms is developed. Second, process migration is applied to the RTL simulation of digital circuits. The canonical form of an RTL process is defined, and transformations of HDL code are justified and demonstrated. These transformations allow a simulator to identify basic active units within the simulation and combine them to balance the load across a set of processors. Through the use of input monitors, executive locality of reference is identified and demonstrated on a set of six RTL designs. Finally, the implementation of a migration system that uses Virtual Machines (VMs) and Real Machines (RMs) in existing FPGAs is described. Empirical and algorithmic models are developed from the data collected from this implementation to evaluate the effect of optimizations and migration algorithms.
Ph. D.

4

Samothrakis, Stavros Nikolaou. "Acceleration techniques in ray tracing for dynamic scenes." Thesis, University of Sussex, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.241671.

5

Singh, Ajeet. "GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading." Thesis, Virginia Tech, 2009. http://hdl.handle.net/10919/34264.

Abstract:

Hardware-acceleration techniques continue to be used to boost the performance of scientific codes. To do so, software developers identify portions of these codes that are amenable to offloading and map them to hardware accelerators. However, offloading such tasks to specialized hardware accelerators is non-trivial. Furthermore, these accelerators can add significant cost to a computing system.

Consequently, this thesis proposes a framework called GePSeA (General Purpose Software Acceleration Framework), which uses a small fraction of the computational power on multi-core architectures to offload complex application-specific tasks. Specifically, GePSeA provides a lightweight process that acts as a helper agent to the application by executing application-specific tasks asynchronously and efficiently. GePSeA is not meant to replace hardware accelerators but to extend them. GePSeA provides several utilities, called core components, that offload tasks onto the core or onto special-purpose hardware, when available, in a way that is transparent to the application. Examples of such core components include a reliable communication service, distributed lock management, global memory management, dynamic load distribution, and network protocol processing. We then apply the GePSeA framework to two applications: mpiBLAST, an open-source computational biology application, and a file-transfer application based on Reliable Blast UDP (RBUDP). We observe significant speed-up for both applications.
Master of Science

6

Zhu, Huanzhou. "Developing graph-based co-scheduling algorithms with GPU acceleration." Thesis, University of Warwick, 2016. http://wrap.warwick.ac.uk/92000/.

Abstract:

On-chip cache is often shared between processes that run concurrently on different cores of the same processor. Resource contention of this type degrades the performance of the co-running processes. Contention-aware co-scheduling refers to the class of scheduling techniques that reduce this performance degradation. Most existing contention-aware co-schedulers only consider serial jobs. However, computing systems often run a mix of parallel and serial jobs. This thesis aims to tackle these issues. We start by modelling the problem of co-scheduling a mix of serial and parallel jobs as an Integer Programming (IP) problem. We then construct a co-scheduling graph to model the problem and develop a set of algorithms to find both optimal and near-optimal solutions. The results show that the proposed algorithms can find the optimal co-scheduling solution and that the proposed approximation technique is able to find near-optimal solutions. To improve the scalability of the algorithms, we use a GPU to accelerate the solving process. A graph processing framework, called WolfPath, is proposed in this thesis. By taking advantage of the co-scheduling graph, WolfPath achieves significant performance improvement. Because WolfPath has a long preprocessing time, we developed WolfGraph, a GPU-based graph processing framework that features minimal preprocessing time and uses the hard disk as a memory extension to solve large-scale graphs on a single machine equipped with a GPU device. Compared with existing GPU-based graph processing frameworks, WolfGraph achieves similar execution time with minimal preprocessing time.

7

Yalim, Hacer. "Acceleration Of Direct Volume Rendering With Texture Slabs On Programmable Graphics Hardware." Master's thesis, METU, 2005. http://etd.lib.metu.edu.tr/upload/12606195/index.pdf.

Abstract:

This thesis proposes an efficient method to accelerate ray-based volume rendering with texture slabs using programmable graphics hardware. In this method, empty space skipping and early ray termination are used without performing any preprocessing on the CPU side. The acceleration structure is created on the fly by making efficient use of the depth buffer on the Graphics Processing Unit (GPU) side. In the proposed method, texture slices are grouped together to form a texture slab. The resulting volume image is generated by rendering all the slabs in front-to-back viewing order over multiple rendering passes. Slab silhouette maps (SSMs) are created to identify and skip empty spaces along the ray direction at pixel level. These maps are created from the alpha component of the slab and stored in the depth buffer. In addition to the empty region information, an SSM also contains information about terminated rays. The method relies on hardware z-occlusion culling, realized by means of SSMs, to accelerate ray traversal. The cost of generating this acceleration data structure is very small compared with the total rendering time.
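
As a rough CPU-side illustration of the two ideas the method builds on, empty space skipping and early ray termination, the sketch below composites one ray front to back. It is not the slab/SSM depth-buffer technique of the thesis; the sample format, the function name `composite_ray`, and the 0.95 opacity threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed sample format: (intensity, opacity) of one sample along the ray.
Sample = Tuple[float, float]


def composite_ray(samples: List[Sample], termination_threshold: float = 0.95) -> Tuple[float, float, int]:
    """Front-to-back compositing of a single ray (illustrative sketch).

    Returns (accumulated_color, accumulated_alpha, samples_shaded).
    """
    color = 0.0
    alpha = 0.0
    shaded = 0
    for intensity, a in samples:
        if a <= 0.0:
            continue  # empty space skipping: a transparent sample adds nothing
        # Front-to-back "over" operator: weight by remaining transparency.
        color += (1.0 - alpha) * a * intensity
        alpha += (1.0 - alpha) * a
        shaded += 1
        if alpha >= termination_threshold:
            break  # early ray termination: later samples are (almost) hidden
    return color, alpha, shaded
```

In the GPU method described above, the same two decisions are made per pixel through SSMs and z-occlusion culling rather than by an explicit loop.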

8

Sherban, V. Yu. "Software components of the system for the kinematic and dynamic analysis of machines for sewing, textile and shoe industries." Thesis, Київський національний університет технологій та дизайну, 2017. https://er.knutd.edu.ua/handle/123456789/6655.

9

Wang, Tsu-Han. "Real-time Software Architectures and Performance Evaluation Methods for 5G Radio Systems." Electronic Thesis or Diss., Sorbonne université, 2022. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2022SORUS362.pdf.

Abstract:

The thesis deals with 5G real-time Software Defined Radio architectures. To meet 5G performance requirements, computational acceleration combined with real-time process scheduling methods is required. In 5G embedded systems, acceleration amounts to a judicious combination of additional hardware units for the most computationally costly functions with software for simpler arithmetic and complex control procedures. Fully software-based solutions are also appearing for certain applications, in particular in the so-called Open Radio-Access Network (openRAN) ecosystem. The contributions of this thesis lie in methods for purely software-based acceleration and real-time control of low-latency fronthaul interfaces. Since 5G has stringent latency requirements and must support very high-speed data traffic, methods for scheduling baseband processing need to be tailored to the specifics of the air interface. Specifically, we propose a functional decomposition of the 5G air interface that is amenable to multi-core software implementations targeting high-end servers exploiting single-instruction multiple-data (SIMD) acceleration. Moreover, we provide some avenues for multi-threaded processing through pipelining and the use of thread pools. We highlight the methods, and their performance evaluation, that were exploited during the development of the OpenAirInterface 5G implementation.

10

Tell, Eric. "Design of Programmable Baseband Processors." Doctoral thesis, Linköping : Univ, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-4377.

11

Axillus, Viktor. "Comparing Julia and Python : An investigation of the performance on image processing with deep neural networks and classification." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-19160.

Abstract:

Python is the most popular language for prototyping and developing machine learning algorithms. Because Python is an interpreted language, it suffers a significant performance loss compared with compiled languages. Julia is a newly developed language that tries to bridge the gap between high-performance but cumbersome languages such as C++ and highly abstracted but typically slow languages such as Python. However, over the years, the Python community has developed many tools that address its performance problems. This raises the question of whether choosing one language over the other makes any significant performance difference. This thesis compares the performance, in terms of execution time, of the two languages in the machine learning domain; more specifically, image processing with GPU-accelerated deep neural networks and classification with k-nearest neighbors on the MNIST and EMNIST datasets. Python with Keras and TensorFlow is compared against Julia with Flux for GPU-accelerated neural networks. For classification, Python with Scikit-learn is compared against Julia with NearestNeighbors.jl. The results indicate that Julia has a performance edge for GPU-accelerated deep neural networks, outperforming Python by roughly 1.25x to 1.5x. For classification with k-nearest neighbors, the results were more varied, with Julia outperforming Python in 5 out of 8 measurements. However, there are some threats to validity, and additional research covering all the frameworks available for both languages is needed to provide a more conclusive and generalized answer.

12

Závodník, Tomáš. "Architektura pro rekonstrukci knihy objednávek s nízkou latencí" [Architecture for low-latency order book reconstruction]. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255477.

Abstract:

Information technology forms an important part of the world, and algorithmic trading has already become a common concept among traders. High-Frequency Trading (HFT) requires special hardware accelerators that can respond to input with sufficiently low latency. This master's thesis focuses on the design and implementation of an architecture for order book building, which is an essential part of HFT solutions targeting financial exchanges. The goal is to use FPGA technology to process information about an exchange's state with latency low enough that the resulting solution is usable in practice. The resulting architecture combines hardware and software with fast lookup algorithms to achieve maximum performance without affecting the function or integrity of the order book.
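
A minimal sketch of order book building, assuming a hypothetical message feed with add, execute, and delete messages keyed by order id, is shown below. It only illustrates the data structures involved (a hash map from order id to order, plus aggregated price levels); the class and method names are illustrative, and the thesis's FPGA architecture and lookup algorithms are not reproduced.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Optional, Tuple


class OrderBook:
    """Price-level order book rebuilt from a hypothetical add/execute/delete feed."""

    def __init__(self) -> None:
        # order id -> (side, price in ticks, remaining qty); O(1) lookup by id
        self.orders: Dict[int, Tuple[str, int, int]] = {}
        # side ("B" = bid, "S" = ask) -> price -> aggregated quantity at that level
        self.levels: Dict[str, DefaultDict[int, int]] = {
            "B": defaultdict(int),
            "S": defaultdict(int),
        }

    def add(self, order_id: int, side: str, price: int, qty: int) -> None:
        self.orders[order_id] = (side, price, qty)
        self.levels[side][price] += qty

    def delete(self, order_id: int) -> None:
        side, price, qty = self.orders.pop(order_id)
        level = self.levels[side]
        level[price] -= qty
        if level[price] == 0:
            del level[price]  # remove empty price levels

    def execute(self, order_id: int, qty: int) -> None:
        """Reduce an order after a (partial) fill; a full fill removes it."""
        side, price, remaining = self.orders[order_id]
        if qty >= remaining:
            self.delete(order_id)
            return
        self.orders[order_id] = (side, price, remaining - qty)
        self.levels[side][price] -= qty

    def best_bid(self) -> Optional[Tuple[int, int]]:
        bids = self.levels["B"]
        if not bids:
            return None
        price = max(bids)  # linear scan kept for brevity
        return price, bids[price]

    def best_ask(self) -> Optional[Tuple[int, int]]:
        asks = self.levels["S"]
        if not asks:
            return None
        price = min(asks)
        return price, asks[price]
```

The best-price queries scan all levels for brevity; a latency-sensitive design would keep price levels in a sorted structure or a fixed price-indexed array instead.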

13

Kekely, Lukáš. "Softwarově řízené monitorování síťového provozu" [Software-controlled network traffic monitoring]. Doctoral thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-412592.

Abstract:

This dissertation deals with the design of a new approach to software-controlled (software-defined) hardware acceleration for modern high-speed computer networks. The main goal of the work is to formulate a general, flexible, and easy-to-use acceleration concept applicable to various security and monitoring applications, which would enable their real deployment in 100 Gb/s and faster networks. The dissertation begins with an analysis of the current state of the art in network monitoring, security, and methods for accelerating the processing of high-speed network data. Based on this analysis, a completely new concept called Software Defined Monitoring (SDM) is formulated and designed. The key functionality of this concept is built on hardware-accelerated, application-specific (application-controlled), flow-based, informed reduction and distribution of captured network data. This is achieved by combining high-speed hardware processing with flexible software control, which together enable the easy creation of various complex, high-performance network applications. Advanced optimizations and improvements of the basic SDM concept and selected components are also investigated, leading to the design of a unique, generally applicable FPGA architecture for a modular packet header parser and a high-performance packet classifier based on cuckoo hashing. Finally, a high-speed SDM prototype built on an FPGA-based network acceleration card is created and thoroughly verified under real network deployment conditions. The achievable performance improvements are measured and discussed for several selected monitoring and security use cases. The SDM prototype is also deployed for production monitoring of the real backbone network of the Cesnet association and has been commercialized by Netcope Technologies.
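
The cuckoo-hashing packet classifier mentioned above relies on each key having exactly two candidate slots, so a lookup needs at most two table probes. The sketch below is a generic two-table cuckoo hash keyed by a flow 5-tuple; the table size, kick limit, SHA-256-derived hash functions, and the small overflow stash are assumptions for illustration, not details of the dissertation's FPGA design.

```python
import hashlib
from typing import Hashable, List, Optional, Tuple

Entry = Tuple[Hashable, object]


class CuckooTable:
    """Two-way cuckoo hash table: every key lives in one of two candidate slots."""

    def __init__(self, size: int = 64, max_kicks: int = 32) -> None:
        self.size = size
        self.max_kicks = max_kicks
        self.tables: List[List[Optional[Entry]]] = [[None] * size, [None] * size]
        self.stash: List[Entry] = []  # overflow for entries that could not be placed

    def _slot(self, which: int, key: Hashable) -> int:
        # Two hash functions derived by prefixing the key with the table index.
        digest = hashlib.sha256(f"{which}:{key!r}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def lookup(self, key: Hashable) -> Optional[object]:
        # At most two table probes (plus the small stash) per lookup.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        for k, v in self.stash:
            if k == key:
                return v
        return None

    def insert(self, key: Hashable, value: object) -> bool:
        """Insert or update; returns False if the entry ended up in the stash."""
        for which in (0, 1):
            idx = self._slot(which, key)
            entry = self.tables[which][idx]
            if entry is not None and entry[0] == key:
                self.tables[which][idx] = (key, value)  # update in place
                return True
        for i, (k, _) in enumerate(self.stash):
            if k == key:
                self.stash[i] = (key, value)
                return False
        item: Entry = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            idx = self._slot(which, item[0])
            evicted = self.tables[which][idx]
            self.tables[which][idx] = item
            if evicted is None:
                return True
            item = evicted  # displaced entry moves to its slot in the other table
            which = 1 - which
        self.stash.append(item)  # give up after max_kicks; keep the entry in the stash
        return False
```

Bounded lookup cost is the property that maps well to hardware: the two probes can be issued in parallel, while the variable-length displacement loop only affects insertions.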

14

David, Radu Alin. "Improving Channel Estimation and Tracking Performance in Distributed MIMO Communication Systems." Digital WPI, 2015. https://digitalcommons.wpi.edu/etd-dissertations/229.

Abstract:

This dissertation develops and analyzes several techniques for improving channel estimation and tracking performance in distributed multi-input multi-output (D-MIMO) wireless communication systems. D-MIMO communication systems have been studied for the last decade and are known to offer the benefits of antenna arrays, e.g., improved range and data rates, to systems of single-antenna devices. D-MIMO communication systems are considered a promising technology for future wireless standards including advanced cellular communication systems. This dissertation considers problems related to channel estimation and tracking in D-MIMO communication systems and is focused on three related topics: (i) characterizing oscillator stability for nodes in D-MIMO systems, (ii) the development of an optimal unified tracking framework and a performance comparison to previously considered sub-optimal tracking approaches, and (iii) incorporating independent kinematics into dynamic channel models and using accelerometers to improve channel tracking performance. A key challenge of D-MIMO systems is estimating and tracking the time-varying channels present between each pair of nodes in the system. Even if the propagation channel between a pair of nodes is time-invariant, the independent local oscillators in each node cause the carrier phases and frequencies and the effective channels between the nodes to have random time-varying phase offsets. The first part of this dissertation considers the problem of characterizing the stability parameters of the oscillators used as references for the transmitted waveforms. Having good estimates of these parameters is critical to facilitate optimal tracking of the phase and frequency offsets. We develop a new method for estimating these oscillator stability parameters based on Allan deviation measurements and compare this method to several previously developed parameter estimation techniques based on innovation covariance whitening. 
The Allan deviation method is validated with both simulations and experimental data from low-precision and high-precision oscillators. The second part of this dissertation considers a D-MIMO scenario with $N_t$ transmitters and $N_r$ receivers. While there are $N_t \times N_r$ node-to-node pairwise channels in such a system, there are only $N_t + N_r$ independent oscillators. We develop a new unified tracking model where one Kalman filter jointly tracks all of the pairwise channels and compare the performance of unified tracking to previously developed suboptimal local tracking approaches where the channels are not jointly tracked. Numerical results show that unified tracking tends to provide similar beamforming performance to local tracking but can provide significantly better nullforming performance in some scenarios. The third part of this dissertation considers a scenario where the transmit nodes in a D-MIMO system have independent kinematics. In general, this makes the channel tracking problem more difficult since the independent kinematics make the D-MIMO channels less predictable. We develop dynamics models which incorporate the effects of acceleration on oscillator frequency and displacement on propagation time. The tracking performance of a system with conventional feedback is compared to a system with conventional feedback and local accelerometer measurements. Numerical results show that the tracking performance is significantly improved with local accelerometer measurements.
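
The first part of this abstract estimates oscillator stability parameters from Allan deviation measurements. As a sketch of the measurement side only, the function below computes the standard non-overlapping Allan deviation from fractional-frequency samples; the function name and input format are assumptions, and the dissertation's parameter-fitting step is not shown.

```python
import math
from typing import Sequence


def allan_deviation(y: Sequence[float], m: int = 1) -> float:
    """Non-overlapping Allan deviation of fractional-frequency samples.

    y: fractional frequency samples taken at a basic interval tau0.
    m: averaging factor, so the result corresponds to tau = m * tau0.
    Uses sigma_y^2(tau) = 1 / (2 (M - 1)) * sum_i (ybar_{i+1} - ybar_i)^2,
    where ybar_i are the M averages over consecutive blocks of m samples.
    """
    blocks = len(y) // m
    if blocks < 2:
        raise ValueError("need at least two averaging blocks")
    ybar = [sum(y[i * m:(i + 1) * m]) / m for i in range(blocks)]
    sq = sum((ybar[i + 1] - ybar[i]) ** 2 for i in range(blocks - 1))
    return math.sqrt(sq / (2 * (blocks - 1)))
```

Evaluating this for several averaging factors `m` yields the sigma-versus-tau curve to which noise-model parameters are typically fitted.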

15

Lee, Joo Hong. "Hybrid Parallel Computing Strategies for Scientific Computing Applications." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28882.

Abstract:

Multi-core, multi-processor, and Graphics Processing Unit (GPU) computer architectures pose significant challenges with respect to the efficient exploitation of parallelism for large-scale scientific computing simulations. For example, a simulation of the human tonsil at the cellular level involves computing the motion and interaction of millions of cells over extended periods of time. Also, the simulation of Radiative Heat Transfer (RHT) effects by the Photon Monte Carlo (PMC) method is an extremely computationally demanding problem. The PMC method is an example of the Monte Carlo simulation method, an approach used extensively in a wide range of application areas. Although the basic algorithmic framework of these Monte Carlo methods is simple, they can be extremely computationally intensive. Therefore, an efficient parallel realization of these simulations depends on a careful analysis of the nature of these problems and the development of an appropriate software framework. The overarching goal of this dissertation is to develop and understand what the appropriate parallel programming model should be to exploit these disparate architectures, both in terms of efficiency and from a software engineering perspective. In this dissertation we examine these issues through a performance study of PathSim2, a software framework for the simulation of large-scale biological systems, using two different parallel architectures: distributed and shared memory. First, a message-passing implementation of a multiple germinal center simulation by PathSim2 is developed and analyzed for distributed memory architectures. Second, a germinal center simulation is implemented on a shared memory architecture with two parallelization strategies based on Pthreads and OpenMP. Finally, we present work targeting a complete hybrid, parallel computing architecture.
With this work we develop and analyze a software framework for generic Monte Carlo simulations implemented on multiple, distributed memory nodes consisting of a multi-core architecture with attached GPUs. This simulation framework is divided into two asynchronous parts: (a) a threaded, GPU-accelerated pseudo-random number generator (or producer), and (b) a multi-threaded Monte Carlo application (or consumer). The advantage of this approach is that this software framework can be directly used within any Monte Carlo application code, without requiring application-specific programming of the GPU. We examine this approach through a performance study of the simulation of RHT effects by the PMC method on a hybrid computing architecture. We present a theoretical analysis of our proposed approach, discuss methods to optimize performance based on this analysis, and compare this analysis to experimental results obtained from simulations run on two different hybrid, parallel computing architectures.
Ph. D.
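
The producer/consumer split described above can be sketched with a bounded queue between two threads. Here a CPU thread using Python's `random` module stands in for the GPU-accelerated generator, and a single consumer estimates pi rather than running a PMC simulation; all names, batch sizes, and the seed are illustrative assumptions. Because of CPython's global interpreter lock, this shows the structure of the pipeline rather than real parallel speed-up.

```python
import queue
import random
import threading
from typing import List, Optional


def producer(q: "queue.Queue[Optional[List[float]]]", batches: int, batch_size: int, seed: int) -> None:
    """Stand-in for the GPU random-number generator: emits batches of uniforms."""
    rng = random.Random(seed)
    for _ in range(batches):
        q.put([rng.random() for _ in range(batch_size)])
    q.put(None)  # sentinel: no more batches


def consumer(q: "queue.Queue[Optional[List[float]]]", result: List[float]) -> None:
    """Monte Carlo consumer: estimates pi from pairs of uniform numbers."""
    inside = 0
    pairs = 0
    while True:
        batch = q.get()
        if batch is None:
            break
        for i in range(0, len(batch) - 1, 2):
            x, y = batch[i], batch[i + 1]
            if x * x + y * y <= 1.0:
                inside += 1
            pairs += 1
    result.append(4.0 * inside / pairs)


def estimate_pi(batches: int = 50, batch_size: int = 4000, seed: int = 1) -> float:
    q: "queue.Queue[Optional[List[float]]]" = queue.Queue(maxsize=4)  # bounded buffer
    result: List[float] = []
    threads = [
        threading.Thread(target=producer, args=(q, batches, batch_size, seed)),
        threading.Thread(target=consumer, args=(q, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The bounded queue (`maxsize=4`) is what decouples the two parts: the producer runs ahead until the buffer is full and then blocks, so neither side needs to know how the other is implemented.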

16

Agha, Shahrukh. "Software and hardware techniques for accelerating MPEG2 motion estimation." Thesis, Loughborough University, 2006. https://dspace.lboro.ac.uk/2134/33935.

Abstract:

The aim of this thesis is to accelerate the process of motion estimation (ME) for the implementation of real-time, portable video encoding. To this end, a number of different techniques have been considered and investigated in detail. Data Level Parallelism (DLP) is exploited first, through the use of vector instruction extensions on configurable/re-configurable processors, to form a fast System-On-Chip (SoC) video encoder capable of embedding both full search and fast ME methods. Further parallelism is then exploited in the form of Thread Level Parallelism (TLP), introduced into the ME process through the use of multiple processors incorporated onto a single SoC. A theoretical explanation of the results obtained with these methodologies is then developed for algorithmic optimisations. This is followed by the investigation of an efficient, orthogonal technique based on the use of a reduced number of bits (RBSAD) for the purposes of image comparison. This technique, which provides savings of both power and time, is investigated along with a number of criteria for its improvement to full resolution. Finally, a VLSI layout of a low-power ME engine capable of using this technique is presented. The combination of DLP, TLP and RBSAD is found to reduce the clock frequency requirement by around an order of magnitude.
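
To make the reduced-bit comparison concrete, the sketch below computes a sum of absolute differences (SAD) over an n x n block and uses it in an exhaustive (full search) block matcher. Treating RBSAD as comparing only the most significant `bits` bits of each 8-bit pixel is an assumption; the abstract does not specify how the bits are reduced, and all function names are illustrative.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # 8-bit luma samples, indexed [row][col]


def sad(cur: Frame, ref: Frame, bx: int, by: int, dx: int, dy: int, n: int, bits: int = 8) -> int:
    """SAD between the n x n block at (bx, by) in cur and the block displaced by (dx, dy) in ref.

    With bits < 8, only the `bits` most significant bits of each pixel are
    compared (assumed form of a reduced-bit SAD).
    """
    shift = 8 - bits
    total = 0
    for r in range(n):
        for c in range(n):
            a = cur[by + r][bx + c] >> shift
            b = ref[by + dy + r][bx + dx + c] >> shift
            total += abs(a - b)
    return total


def full_search(cur: Frame, ref: Frame, bx: int, by: int, n: int, search_range: int, bits: int = 8) -> Tuple[int, int]:
    """Exhaustive block matching; returns the motion vector (dx, dy) with the lowest cost."""
    height, width = len(ref), len(ref[0])
    best: Optional[int] = None
    best_mv = (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidate blocks that fall outside the reference frame.
            if not (0 <= by + dy and by + dy + n <= height and 0 <= bx + dx and bx + dx + n <= width):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n, bits)
            if best is None or cost < best:
                best, best_mv = cost, (dx, dy)
    return best_mv
```

Lowering `bits` shrinks each subtractor and adder in a hardware datapath, which is where the power and time savings would come from; the trade-off is that distinct blocks may become indistinguishable.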

17

Linford, John Christian. "Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/27599.

Abstract:

The new generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economic constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling and simulation with emerging multi-core technologies are developed. A general model is developed to simulate atmospheric chemical transport and atmospheric chemical kinetics. The Cell Broadband Engine Architecture (CBEA), General Purpose Graphics Processing Units (GPGPUs), and homogeneous multi-core processors (e.g., Intel Quad-core Xeon) are introduced. These architectures are used in case studies of transport modeling and kinetics modeling, which demonstrate per-kernel speedups as high as 40x. A general analysis and code generation tool for chemical kinetics called "KPPA" is developed. KPPA generates highly tuned C, Fortran, or Matlab code that uses every layer of heterogeneous parallelism in the CBEA, GPGPU, and homogeneous multi-core architectures. A scalable method for simulating chemical transport is also developed. The Weather Research and Forecasting Model with Chemistry (WRF-Chem) is accelerated with these methods with good results: real forecasts of air quality are generated for the Eastern United States 65% faster than with state-of-the-art models.
Ph. D.

18

Yu, Jason Kwok Kwun. "Vector processing as a soft-core processor accelerator." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/2394.

Abstract:

Soft processors simplify hardware design by being able to implement complex control strategies using software. However, they are not fast enough for many intensive data-processing tasks, such as highly data-parallel embedded applications. This thesis suggests adding a vector processing core to the soft processor as a general-purpose accelerator for these types of applications. The approach has the benefits of a purely software-oriented development model, a fixed ISA allowing parallel software and hardware development, a single accelerator that can accelerate multiple functions in an application, and scalable performance with a single source code. With no hardware design experience needed, a software programmer can make area-versus-performance tradeoffs by scaling the number of functional units and register file bandwidth with a single parameter. The soft vector processor can be further customized by a number of secondary parameters to add and remove features for the specific application to optimize resource utilization. This thesis shows that a vector processing architecture maps efficiently into an FPGA and provides a scalable amount of performance for a reasonable amount of area. Configurations of the soft vector processor with different performance levels are estimated to achieve speedups of 2-24x for 5-26x the area of a Nios II/s processor on three benchmark kernels.

19

Bashford-Rogers, Thomas. "Accelerating global illumination for physically-based rendering." Thesis, University of Warwick, 2011. http://wrap.warwick.ac.uk/36762/.

Abstract:

Lighting is essential to generate realistic images using computer graphics. The computation of lighting takes into account the multitude of ways in which light propagates around a virtual scene. This is termed global illumination, and is a vital part of physically-based rendering. Although it produces compelling and accurate images, this is a computationally expensive process. This thesis presents several methods to improve the speed of global illumination computation, and therefore enables faster image synthesis. Global illumination can be calculated in an offline process, typically taking many minutes to hours to compute an accurate solution, or it can be approximated at interactive or real-time rates. This work proposes three methods that tackle the problem of improving the efficiency of computing global illumination. The first is an interactive method for calculating multiple-bounce global illumination on graphics hardware, which exploits the power of the graphics pipeline to create a voxelised representation of the scene through which light transport is computed. The second is an unbiased physically-based algorithm for improving the efficiency of path generation when calculating global illumination in complicated scenes. It is adaptive: it learns information about the lighting in the scene as rendering progresses and uses this to reduce variance in the image. In both common scenes used in graphics and situations that involve difficult light paths, this method gives a 30-70% boost in performance. The third method in this thesis is a sampling method that improves the efficiency of the common indoor-outdoor lighting scenario. This is done both by combining the lighting distribution with view importance and by automatically determining the important areas of the scene in which to start light paths. This gives a speedup of between three times and two orders of magnitude, depending on scene and lighting complexity.
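
The abstract describes the second method only at a high level. The sketch below illustrates the general principle it names, adaptive and unbiased importance sampling, on a one-dimensional integral rather than on light paths. The sampler learns where |f| is large and samples those regions more often. It divides each sample by its probability density so the estimate stays unbiased. The binning scheme, the uniform_mix floor and all parameter values are assumptions for illustration, not the thesis's algorithm.

```python
import math
import random
from typing import Callable, List


def adaptive_integrate(f: Callable[[float], float], n_bins: int = 16, passes: int = 8,
                       samples_per_pass: int = 2000, uniform_mix: float = 0.1,
                       seed: int = 1) -> float:
    """Unbiased Monte Carlo estimate of the integral of f over [0, 1).

    The sampling density is piecewise constant over n_bins and is re-learned
    after every pass from the mean |f| observed so far in each bin.
    """
    rng = random.Random(seed)
    bins = range(n_bins)
    weights: List[float] = [1.0] * n_bins
    sum_abs = [0.0] * n_bins
    hits = [0] * n_bins
    total, count = 0.0, 0
    for _ in range(passes):
        s = sum(weights)
        # Mixing in a uniform floor keeps every bin's probability > 0, which is
        # what keeps the estimator unbiased wherever f is nonzero.
        probs = [(1 - uniform_mix) * w / s + uniform_mix / n_bins for w in weights]
        for _ in range(samples_per_pass):
            i = rng.choices(bins, weights=probs)[0]
            x = (i + rng.random()) / n_bins
            pdf = probs[i] * n_bins           # density of x under this pass's pdf
            fx = f(x)
            total += fx / pdf                 # importance-weighted sample
            count += 1
            sum_abs[i] += abs(fx)
            hits[i] += 1
        # Learn: favour bins where |f| has been large on average.
        weights = [sum_abs[i] / hits[i] if hits[i] else weights[i] for i in bins]
        if sum(weights) == 0.0:
            weights = [1.0] * n_bins
    return total / count


if __name__ == "__main__":
    def peak(x: float) -> float:
        return math.exp(-100.0 * (x - 0.3) ** 2)

    print(adaptive_integrate(peak), math.sqrt(math.pi) / 10.0)
```

Because each pass samples from a fixed density and weights by 1/pdf, every pass is unbiased on its own. Learning only changes the variance, not the expected value.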

20

Kancharla, Akshitha, and Akhil Pannala. "Factors for Accelerating the Development Speed in Systems of Artificial Intelligence." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18420.

Abstract:

Background: With the increasing application of Artificial Intelligence, there is a pressing need to find ways to increase the development speed of these systems (time-to-market), because time is one of the most expensive and valuable resources in software development. Faster development speed is essential for companies to survive. Articles in the literature state the factors/antecedents for improving development speed in Traditional Software Engineering. However, we cannot draw direct conclusions from these factors, because development in Traditional Software Engineering and in Artificial Intelligence differ. Objectives: The primary objectives of this research are: a) conduct a literature review to identify the list of factors that affect the development speed of Traditional Software Engineering; b) perform an in-depth interview study to evaluate whether the listed factors of Traditional Software Engineering can be applied to accelerating the development of AI systems engineering. Methods: The method chosen to address research question 1 is a Systematic Literature Review (SLR). The SLR was selected because it follows a specific, well-defined structure to identify, analyze and interpret the data about the research question with the evidence. Snowballing is the best-suited search strategy for conducting the SLR according to the guidelines given by Wohlin. The method chosen to address research question 2 is an in-depth interview study. We conduct interviews to gather information related to our research. Here, the participant is the interviewee, who may be a data scientist or project manager in the field of AI, and the interviewer is a student. Each interviewee lists the factors that affect the development speed of AI systems and ranks them based on their importance using Trello. Results: The result of the systematic literature review is the list of papers obtained from snowball sampling.
From the collected data, factors are extracted, which are then used for the interviews. The interviews are conducted based on the prepared questionnaire. All the interviews are recorded and then transcribed. The transcribed data is analyzed using Conventional Content Analysis. Conclusions: The study identifies the factors that will help accelerate the development speed of Artificial Intelligence systems. The identified factors are mostly non-technical, such as team leadership and trust. The objectives are addressed by selecting suitable research methods for each research question.

21

Woods, Andrew. "Accelerating software radio astronomy FX correlation with GPU and FPGA co-processors." Master's thesis, University of Cape Town, 2010. http://hdl.handle.net/11427/12212.

Abstract:

Includes abstract.
Includes bibliographical references (leaves [117]-121).
This thesis attempts to accelerate compute-intensive sections of a frequency-domain radio astronomy correlator using dedicated co-processors. Two co-processor implementations were made independently, one using reconfigurable hardware (Xilinx Virtex-4 LX100) and the other a graphics processor (Nvidia 9800GT). The objective of a radio astronomy correlator is to compute the complex-valued correlation products for each baseline, which can be used to reconstruct the sky's radio brightness distribution. Radio astronomy correlators have huge computational demands, and this dissertation focuses on the computational aspects of correlation, concentrating on the X-engine stage of the correlator.
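
The following pure-Python reference sketch shows the computation the X-engine performs in an FX correlator, for readers unfamiliar with that stage. After the F-engine has channelized each antenna's signal, the X-engine multiplies every antenna's spectrum by the complex conjugate of every other antenna's spectrum (and its own) and accumulates the products over time. The nested-list data layout and the function name x_engine are assumptions. The thesis's FPGA and GPU implementations are not reproduced here.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Sequence, Tuple


def x_engine(spectra: Sequence[Sequence[Sequence[complex]]]) -> Dict[Tuple[int, int], List[complex]]:
    """Accumulate visibilities V_ij(f) = sum_t X_i(t, f) * conj(X_j(t, f)).

    spectra[a][t][f] is the channelized (F-engine) sample of antenna a at time
    step t in frequency channel f. Pairs with i == j are autocorrelations.
    """
    n_ant = len(spectra)
    n_time = len(spectra[0])
    n_chan = len(spectra[0][0])
    vis: Dict[Tuple[int, int], List[complex]] = {}
    for i, j in combinations_with_replacement(range(n_ant), 2):
        acc = [0j] * n_chan
        for t in range(n_time):
            xi, xj = spectra[i][t], spectra[j][t]
            for f in range(n_chan):
                acc[f] += xi[f] * xj[f].conjugate()   # complex multiply-accumulate
        vis[(i, j)] = acc
    return vis
```

With N antennas there are N(N+1)/2 products per channel, including autocorrelations. The X-engine's cost therefore grows quadratically with array size, which makes it a natural target for co-processor acceleration.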

22

Enes, Petter. "Build and Release Management : Supporting development of accelerator control software at CERN." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2007. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8708.

Abstract:

Software configuration management deals with controlling the evolution of complex computer systems. The ability to handle changes, corrections and extensions is decisive for the outcome of a software project. Automated processes for handling these elements are therefore a crucial part of software development. This thesis focuses on build and release management in the context of developing a control system for the world's biggest particle accelerator. Build and release management covers topics such as build support, versioning, dependency management and release management. The main part of the work has consisted of extending an in-house solution supporting the development process of accelerator control software at CERN. The main focus of this report is on the practical work done in this context. Based on a literature survey and an examination of available tools, this thesis presents the state of the art in build and release management before elaborating on the practical work. Based on the experience gained from this work, I conclude with a discussion of whether it is beneficial to stick with the in-house solution, or whether switching to an external tool could prove better for the development process implemented.

23

Motyka, Mikael. "Impact of Usability for Particle Accelerator Software Tools Analyzing Availability and Reliability." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-14394.

Abstract:

The importance of considering usability when developing software is widely recognized in the literature. This non-functional system aspect focuses on the ease, effectiveness and efficiency of handling a system. However, usability cannot be defined as a specific system aspect, since it depends on the field of application. In this work, the impact of usability for accelerator tools targeting availability and reliability analysis is investigated by further developing the existing software tool Availsim. The tool has proven unique in accounting for special accelerator complexities that cannot be modeled with commercial software. However, it is not used across facilities because of constraints introduced by previous modifications. The study was conducted in collaboration with the European Spallation Source ERIC, a multidisciplinary research center based on the world's most powerful neutron source, currently being built in Lund, Sweden. The work was conducted in the safety group within the accelerator division, where the availability and reliability studies were performed. Design Science Research was used as the research methodology to answer how the proposed tool can help improve usability in the analysis domain, and to identify existing usability issues in the field. To obtain an overview of the current field, three questionnaires were sent out and one interview was conducted. These listed important properties to consider for the tool to be developed, along with how usability is perceived in the accelerator field of analysis. The developed software tool was evaluated with the After Scenario Questionnaire and the System Usability Scale, two standardized instruments for measuring usability, along with custom-made statements explicitly targeting important attributes identified when questioning the researchers.
The results highlighted issues in the current field. They listed multiple tools used for the analysis along with their positive and negative aspects, indicating a lengthy and tedious process for obtaining the required analysis results. It was also found that the adapted Availsim version improves usability over the previous versions, and specific attributes could be identified as correlating with the improved usability, fulfilling the purpose of the study. However, the results indicate that existing commercial tools obtained higher scores than the new Availsim version on the standardized usability tests, pointing towards room for improvement.
The importance of considering usability in software development is well known in the literature. This non-functional system aspect focuses on the simplicity and efficiency of handling a system. However, the usability of a system cannot be defined as a specific system aspect, since it depends on the field of application. This work investigates the impact of usability for tools used in the availability and reliability analysis of particle accelerators by further developing the existing software Availsim. The software has been shown to be uniquely able to account for accelerator-specific considerations that cannot be reproduced with the commercial tools available today. Despite its unique properties, the software is not in use. This is due to earlier modifications whose limitations allow the software to be used only at one specific facility. The study was carried out in collaboration with the European Spallation Source ERIC. ESS is a multidisciplinary research facility based on the world's most powerful neutron source, currently under construction in Lund, Sweden. The work was carried out in the safety group within the accelerator division, where the analysis of the accelerator's availability and reliability is performed. Design Science Research was used as the research methodology to answer how the proposed software can help improve usability in this analysis, and to define the existing usability problems in the field. To obtain an overview of how the analysis is currently conducted, three questionnaires were sent out and one interview was conducted. These compiled important properties to consider when developing the new software, together with how researchers perceive usability for this type of analysis. The developed software was evaluated with two standardized questionnaires for measuring system usability, the "After Scenario Questionnaire" and the "System Usability Scale".
A third set of questions was also constructed to explicitly measure the important properties that emerged from the questionnaires and the interview. The results highlight problems in the current field, listing the tools used for the analysis together with their positive and negative properties. These properties indicated a cumbersome and lengthy process for obtaining the desired analysis results. It was also established that the adapted Availsim version improves usability compared with earlier versions, and specific properties were identified as directly affecting how usability is perceived. The results also showed that the existing commercial tool Reliasoft obtained higher scores in the standardized tests, which suggests room for improvement.
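
The System Usability Scale used in this study is scored with a fixed, published procedure. The function below is a small worked sketch of that procedure; it does not reproduce the study's data or tooling. It converts ten 1-to-5 Likert responses into the 0-100 SUS score. Odd-numbered (positively worded) items contribute the response minus 1. Even-numbered (negatively worded) items contribute 5 minus the response. The sum is then multiplied by 2.5.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """System Usability Scale score (0-100) from ten responses on a 1-5 scale."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects exactly ten responses, each between 1 and 5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # prints 85.0
```

For example, answering 3 to every item gives 50.0, and the best possible pattern (5 on odd items, 1 on even items) gives 100.0.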

24

Khasymski, Aleksandr Sergeev. "Accelerated Storage Systems." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51612.

Abstract:

Today's large-scale, high-performance, data-intensive applications put tremendous stress on data centers to store, index, and retrieve large amounts of data. Exemplified by technologies such as social media, photo and video sharing, and e-commerce, the rise of the real-time web demands that data stores support minimal latencies, always-on availability and ever-growing capacity. These requirements have fostered the development of a large number of high-performance storage systems, arguably the most important of which are Key-Value (KV) stores. An emerging approach to achieving low latency and high throughput in this space uses both DRAM and flash. It stores an efficient index for the data in memory and minimizes accesses to flash, where both keys and values are stored. Many proposals have examined how to improve KV store performance in this area. However, these systems have shortcomings, including expensive sorting and excessive read and write amplification, which is detrimental to the life of the flash. Another trend in recent years equips large-scale deployments with energy-efficient, high-performance co-processors, such as Graphics Processing Units (GPUs). Recent work has explored using GPUs to accelerate compute-intensive I/O workloads, including RAID parity generation, encryption, and compression. This research has proven the viability of GPUs for accelerating these workloads. We argue, however, that there are significant benefits to be had from developing methods and data structures for deep integration of GPUs inside the storage stack, in order to achieve better performance, scalability, and reliability. In this dissertation, we propose comprehensive frameworks that leverage emerging technologies, such as GPUs and flash-based SSDs, to accelerate modern storage systems. For our accelerator-based solution, we focus on developing a system that features deep integration of the GPU in a distributed parallel file system (PFS).
We use a framework that builds on the resources available in the file system and coordinates the workload in a way that minimizes data movement across the PCIe bus, while exposing data parallelism to maximize the potential for acceleration on the GPU. Our research aims to improve the overall reliability of a PFS by developing distributed per-file parity generation that provides end-to-end data integrity and unprecedented flexibility. Finally, we design a high-performance KV store using a novel data structure tailored to specific flash requirements. It arranges data on flash so as to minimize write amplification, which is detrimental to the flash cells. The system achieves very low read amplification through the use of a trie index and a false-positive filter.
Ph. D.
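
Per-file parity, like RAID parity, rests on the XOR identity that lets any single lost block be rebuilt from the others. The minimal sketch below shows only that arithmetic, for one stripe on a CPU. The dissertation's distributed, GPU-accelerated scheme (block placement, striping across servers, integrity checks) is not reproduced, and the function names are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of one or more equal-length blocks."""
    if not blocks or len({len(b) for b in blocks}) != 1:
        raise ValueError("need one or more blocks of equal length")
    # zip(*blocks) yields the i-th byte of every block; XOR them together.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks: Sequence[bytes]) -> bytes:
    """Parity P = D0 ^ D1 ^ ... ^ Dn-1 for one stripe."""
    return xor_blocks(data_blocks)


def recover(surviving: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data block: XOR of the survivors and the parity."""
    return xor_blocks(list(surviving) + [parity])


if __name__ == "__main__":
    stripe = [b"abcd", b"efgh", b"ijkl"]
    p = make_parity(stripe)
    print(recover([stripe[0], stripe[2]], p))  # prints b'efgh'
```

Each output byte depends only on the bytes at the same offset. This independence is what makes parity generation data-parallel and therefore a reasonable fit for GPU offload.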

25

Alhamwi, Ali. "Co-design hardware/software of real time vision system on FPGA for obstacle detection." Thesis, Toulouse 3, 2016. http://www.theses.fr/2016TOU30342/document.

Abstract:

Obstacle detection and localization and 2D occupancy map reconstruction are basic functions for a robot navigating an indoor environment, where interaction with objects takes place in a cluttered space. Commonly used computer-vision solutions such as SLAM (simultaneous localization and mapping) or optical flow tend to be computationally intensive. These solutions require powerful computing resources to meet real-time constraints, even at low speeds. We present a hardware architecture for real-time obstacle detection and localization and 2D occupancy map reconstruction. The proposed system is built using an FPGA (field-programmable gate array) vision architecture and odometry sensors for obstacle detection, localization and mapping. Fusing these two complementary sources of information yields an improved model of the environment around the robot. The proposed architecture is a low-cost system with reduced computation time, a high frame rate, and low power consumption.
Obstacle detection, localization and occupancy map reconstruction are essential abilities for a mobile robot navigating an environment. Solutions based on passive monocular vision, such as simultaneous localization and mapping (SLAM) or optical flow (OF), require intensive computation. Systems based on these methods often rely on over-sized computation resources to meet real-time constraints. Inverse perspective mapping allows obstacle detection at a low computational cost under the hypothesis of a flat ground observed during motion. It is thus possible to build an occupancy grid map by integrating obstacle detection along the sensor's trajectory. In this work we propose a hardware/software system for real-time obstacle detection, localization and 2D occupancy map reconstruction. The proposed system uses an FPGA-based design for vision and proprioceptive sensors for localization. Fusing this information allows the construction of a simple model of the environment surrounding the sensor. The resulting architecture is a low-cost, low-latency, high-throughput and low-power system.
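
The abstract does not state how per-frame obstacle detections are integrated into the occupancy grid. A common technique, assumed here rather than taken from the thesis, is a log-odds grid in which each observation adds a fixed amount of evidence to the observed cell. The sensor-model probabilities (0.7/0.3) and the dictionary-backed grid are illustrative choices.

```python
import math
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]

L_OCC = math.log(0.7 / 0.3)    # assumed evidence for "obstacle seen in this cell"
L_FREE = math.log(0.3 / 0.7)   # assumed evidence for "cell seen as free ground"


class OccupancyGrid:
    """Log-odds occupancy grid; cells never observed have probability 0.5."""

    def __init__(self) -> None:
        self.logodds: Dict[Cell, float] = {}

    def update(self, occupied: Iterable[Cell], free: Iterable[Cell]) -> None:
        # Integrate one frame of detections (e.g. obstacle cells projected onto
        # the ground plane) by adding log-odds evidence to each observed cell.
        for c in occupied:
            self.logodds[c] = self.logodds.get(c, 0.0) + L_OCC
        for c in free:
            self.logodds[c] = self.logodds.get(c, 0.0) + L_FREE

    def probability(self, cell: Cell) -> float:
        # Logistic function: p = e^l / (1 + e^l).
        return 1.0 - 1.0 / (1.0 + math.exp(self.logodds.get(cell, 0.0)))
```

Storing log-odds turns the Bayesian update into one addition per cell. That is cheap in software and can map well to fixed-point hardware; clamping the log-odds to keep the map responsive to change is omitted.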

26

Magnuson, Martin. "Process Control Methods for Operation of Superconducting Cavities at the LEP Accelerator at CERN." Thesis, Linköpings universitet, Institutionen för fysik, kemi och biologi, 1992. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-56503.

Abstract:

The aim of this thesis is to analyse the cryogenic process for cooling superconducting radio frequency accelerator test cavities in the LEP accelerator at CERN. A liquefaction cryoplant is analysed, including the production of liquid helium at 4.5 K, the systems for distribution and regulation of liquid helium, and the radio frequency field used for accelerating particles. After discussing regulation problems and modifications planned for a new cavity installation in 1992, different techniques for specifying the control programs for the new installation are evaluated. Various diagramming techniques, standards and methodologies, and Computer Aided Software Engineering-tools, are compared as to their practical usefulness in this kind of process control. Finally, in accordance with anticipated requirements, possible ways of making high and low level control program specifications are suggested.

27

Ouedraogo, Ganda Stéphane. "Automatic synthesis of hardware accelerator from high-level specifications of physical layers for flexible radio." Thesis, Rennes 1, 2014. http://www.theses.fr/2014REN1S183/document.

Abstract:

The Internet of Things aims to connect billions of physical objects and make them accessible from the digital world represented by today's internet. To do so, access to these objects will mostly be wireless, without predefined infrastructures or specific standards. Such a technology requires defining and implementing smart radio nodes capable of adapting to different physical-layer communication protocols. Our research consisted of defining a design flow for these smart nodes, from their high-level modeling down to their implementation on FPGA targets. This flow aims to improve waveform programmability through executable and synthesizable high-level specifications. It relies on High-Level Synthesis (HLS) for rapid prototyping of the basic building blocks, and on the dataflow model of computation of radio waveforms. The entry point of the flow is a Domain-Specific Language (DSL) that makes it possible to model a waveform at a high level while inserting implementation constraints for reconfigurable architectures such as FPGAs. It is paired with a compiler that generates synthesizable code as well as synthesis scripts. The final waveform consists of a datapath and a control entity implemented as a hierarchical state machine.
The Internet of Things (IoT) aims at connecting billions of communicating devices through an internet-like network. To this end, these things are expected to be accessed via wireless technologies without any predefined infrastructures or standards. This technology requires defining and implementing smart nodes capable of adapting to different radio communication protocols. In this thesis, we have defined a design methodology/flow for such smart nodes, starting from their high-level specification down to their implementation in FPGA fabrics. This flow aims at improving the programmability of the waveforms by leveraging high-level specifications. It relies on High-Level Synthesis (HLS) for rapid prototyping of the waveforms' functional blocks, as well as on the dataflow model of computation. Its entry point is a Domain-Specific Language (DSL) that enables modeling a waveform while inserting implementation constraints for reconfigurable architectures such as FPGAs. The flow includes a compiler whose purpose is to produce synthesis scripts and generate RTL source code. The final waveform consists of a datapath and a control unit implemented as a Hierarchical Finite State Machine (HFSM).

28

Silva, João Paulo Sá da. "Data processing in Zynq APSoC." Master's thesis, Universidade de Aveiro, 2014. http://hdl.handle.net/10773/14703.

Abstract:

Master's degree in Computer and Telematics Engineering
Field-Programmable Gate Arrays (FPGAs) were invented by Xilinx in 1985, i.e. less than 30 years ago. The influence of FPGAs on many areas of engineering is growing continuously and rapidly. There are many reasons for this progress, the most important being the inherent reconfigurability of FPGAs and their relatively low development cost. Recent field-configurable micro-chips combine the capabilities of software and hardware by incorporating multi-core processors and reconfigurable logic. This enables the development of highly optimized computational systems for a vast variety of practical applications, including high-performance computing, data, signal and image processing, embedded systems, and many others. In this context, the main goals of the thesis are to study the new micro-chips, namely the Zynq-7000 family, and to apply them to two selected case studies: data sorting and Hamming weight calculation for long vectors.
Field-Programmable Gate Arrays (FPGAs) were invented by Xilinx in 1985, that is, less than 30 years ago. The influence of FPGAs is growing continuously and rapidly in many branches of engineering. There are several reasons for this evolution, the most important being their inherent reconfigurability and low development costs. The most recent FPGA-based micro-chips combine software and hardware capabilities by incorporating multi-core processors and reconfigurable logic. This enables the development of highly optimized computational systems for a wide variety of practical applications, including high-performance computing, data, signal and image processing, embedded systems, and many others. In this context, the main objective of this work is to study these new micro-chips, namely the Zynq-7000 family, to find the best ways of exploiting the advantages of this system, using case studies such as data sorting and Hamming weight calculation for long vectors.
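
A software reference model helps show what the Hamming-weight case study computes, and an FPGA design could be checked against such a model. The sketch below uses the well-known SWAR (SIMD within a register) population count on 64-bit words. The word size and function names are assumptions, and the hardware circuits studied in the thesis are not shown.

```python
from typing import Sequence

M1 = 0x5555555555555555
M2 = 0x3333333333333333
M4 = 0x0F0F0F0F0F0F0F0F
H01 = 0x0101010101010101
MASK64 = 0xFFFFFFFFFFFFFFFF


def popcount64(x: int) -> int:
    """SWAR population count of one 64-bit word."""
    x = x - ((x >> 1) & M1)                # each 2-bit field holds its bit count
    x = (x & M2) + ((x >> 2) & M2)         # each 4-bit field holds its bit count
    x = (x + (x >> 4)) & M4                # each byte holds its bit count
    return ((x * H01) & MASK64) >> 56      # sum all bytes into the top byte


def hamming_weight(words: Sequence[int]) -> int:
    """Hamming weight of a long bit vector stored as 64-bit words."""
    return sum(popcount64(w & MASK64) for w in words)
```

The per-word counts are independent and are combined by a final sum. The same structure, many small counters feeding an adder tree, is one natural way to lay the computation out in reconfigurable logic.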

29

Jönsson, Oscar. "An explorative study of the technology transfer coach as a preliminary for the design of a computer aid." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-108308.

Abstract:

The university technology transfer coach has an important role in supporting the commercialization of research results. This thesis studies the technology transfer coach and their needs in the coaching process. The goal has been to investigate the information needs of the technology transfer coach as a preliminary for the design of computer aids. Using a grounded theory approach, we interviewed 17 coaches working in the Swedish technology transfer environment. Extracted quotes from the interviews were openly coded and categorized. The analysis shows three main problem areas related to the information needs of the technology transfer coach: awareness, communication, and resources. Moreover, 20 features for future computer aids were extracted from the interview data, and scenarios and personas were developed to exemplify the future use of computer aids. We conclude that there is a need for computer support in the coaching process. Such systems should aid the coach in three areas: awareness, helping the coach focus on meetings; communication, helping the coach transfer commercialisation knowledge; and resources, helping the coach access and deliver resources to the coachee. However, it is imperative that the computer aids do not interfere with the coach's current process, and that the computer aid is not seen as the sole solution.

30

Yang, Fu-Kai, and 楊復凱. "Acceleration and Improvement of MPEG View Synthesis Reference Software on NVIDIA CUDA." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/62819489159539479165.

Abstract:

Master's thesis
National Chiao Tung University
Institute of Electronics
100
With the growth of 3D technology, Free Viewpoint Television (FTV) has become a popular research topic. “View Synthesis” is a key step in FTV, and important open issues include real-time operation and complexity reduction. NVIDIA Compute Unified Device Architecture (CUDA) is an effective platform for handling data-intensive applications. To implement the MPEG view synthesis reference software (VSRS) on CUDA, we parallelize the VSRS structure. At the same time, our proposed parallel scheme improves picture quality. We first propose an intra hole filling scheme to replace the original median filter. Then, to avoid data dependence, we partition the data so that it can be processed by parallel GPU threads. We also rearrange the data processing order in the threads to reduce branching instructions. Combining these techniques, we save more than 94% of the computing time while achieving similar image quality.
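
The abstract does not detail the intra hole filling scheme, so the sketch below illustrates only the partitioning idea. If each image row is filled using pixels from that row alone, rows have no data dependences and can be handed to independent workers, as GPU threads would be. The nearest-valid-pixel rule, the use of None for holes and the thread pool are hypothetical stand-ins, not the thesis's method. In CPython the threads show the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence

Row = List[Optional[int]]  # None marks a hole (disoccluded pixel) after warping


def fill_row(row: Sequence[Optional[int]]) -> Row:
    """Fill each hole with the nearest valid pixel in the same row (ties go left)."""
    n = len(row)
    left: List[Optional[int]] = [None] * n    # nearest valid index at or left of x
    right: List[Optional[int]] = [None] * n   # nearest valid index at or right of x
    last: Optional[int] = None
    for x in range(n):
        if row[x] is not None:
            last = x
        left[x] = last
    last = None
    for x in range(n - 1, -1, -1):
        if row[x] is not None:
            last = x
        right[x] = last
    out: Row = []
    for x in range(n):
        lo, hi = left[x], right[x]
        if lo is None and hi is None:
            out.append(None)                  # no valid pixel anywhere in this row
        elif hi is None or (lo is not None and x - lo <= hi - x):
            out.append(row[lo])
        else:
            out.append(row[hi])
    return out


def fill_image(image: Sequence[Sequence[Optional[int]]], workers: int = 4) -> List[Row]:
    """Rows are independent, so each can be processed by a separate worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fill_row, image))
```

Keeping each row's work free of branches that depend on neighbouring rows is also in the spirit of the abstract's goal of reducing divergent branching across threads.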

31

Wu, Jyun-Cheng, and 吳峻丞. "Design of a Real-time Software-Based GPS Baseband Receiver Using GPU Acceleration." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/21287776369433988238.

Abstract:

Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
99
Nowadays, personal navigation devices are increasingly popular, and the demand for GPS receivers in any form is also increasing. Increasing processor computation power makes it feasible to implement a GPS receiver in software. Compared with a traditional hardware receiver, a software-based receiver has many advantages. In system integration, upgrades, adoption of new algorithms and platform changes, the software-based receiver is much more flexible than a traditional hardware receiver. In this thesis, I improve the GPS software baseband receiver developed in a previous student's research. I address three issues: software robustness, efficiency and position accuracy. For software robustness, I use a dynamic satellite list that lets the software receiver change its available satellites based on the strength of the satellite signals. The receiver therefore adapts much better to real-world environments. For the efficiency of software execution, I adopted the CUDA parallel programming model. Moving the most computationally expensive elements to the GPU not only reduces the influence of CPU load on receiver performance, but also speeds up the receiver. Furthermore, it can reduce the energy consumption of our receiver. Finally, I modify the fine time estimation equations in order to improve the position accuracy of our receiver.
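
The abstract does not list which receiver stages were moved to the GPU. A typical compute-heavy stage in software GPS receivers is acquisition, which correlates the received signal with a local PRN code at every code phase. Each phase is an independent dot product, which is why this stage parallelizes well. The sketch below uses a short random ±1 code as a stand-in for the 1023-chip C/A code. It omits Doppler search, noise and the FFT-based correlation that practical receivers often use.

```python
import random
from typing import List, Sequence, Tuple


def circular_correlation(signal: Sequence[int], code: Sequence[int]) -> List[int]:
    """corr[k] = sum_n signal[n] * code[(n - k) mod N] for every code phase k."""
    n = len(code)
    # Each phase k is an independent dot product, so all N phases could run in
    # parallel (e.g. one GPU thread or block per phase).
    return [sum(signal[i] * code[(i - k) % n] for i in range(n)) for k in range(n)]


def acquire(signal: Sequence[int], code: Sequence[int]) -> Tuple[int, int]:
    """Return (code phase with the largest |correlation|, that correlation)."""
    corr = circular_correlation(signal, code)
    best = max(range(len(corr)), key=lambda k: abs(corr[k]))
    return best, corr[best]


if __name__ == "__main__":
    rng = random.Random(7)
    code = [rng.choice((-1, 1)) for _ in range(31)]   # stand-in for a PRN code
    delay = 11
    signal = [code[(i - delay) % len(code)] for i in range(len(code))]
    print(acquire(signal, code))
```

The correlation peak equals the code length at the true delay. Its position gives the code phase that the tracking loops then refine.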

32

Zhou, Boyou. "A multi-layer approach to designing secure systems: from circuit to software." Thesis, 2019. https://hdl.handle.net/2144/36149.

Abstract:

In the last few years, security has become one of the key challenges in computing systems. Failures in the secure operation of these systems have led to massive information leaks and cyber-attacks. Cases in point are the identity leaks from Equifax in 2016, the Spectre and Meltdown attacks on Intel and AMD processors in 2017, and cyber-attacks on Facebook in 2018. These recent attacks have shown that intruders attack different layers of the systems, from low-level hardware to software as a service (SaaS). To protect the systems, defense mechanisms should confront attacks in the different layers of the systems. In this work, we propose four security mechanisms for computing systems: (i) using backside imaging to detect Hardware Trojans (HTs) in Application-Specific Integrated Circuit (ASIC) chips, (ii) developing energy-efficient reconfigurable cryptographic engines, (iii) examining the feasibility of malware detection using Hardware Performance Counters (HPCs). Most threat models assume that the root of trust is the hardware running beneath the software stack. However, attackers can insert malicious hardware blocks, i.e. HTs, into Integrated Circuits (ICs) that provide back-doors to the attackers or leak confidential information. HTs inserted during fabrication are extremely hard to detect, since their overheads in performance and power are below the variations in performance and power caused by manufacturing. In our work, we have developed an optical method that identifies modified or replaced gates in ICs. We use near-infrared light to image the ICs, because silicon is transparent to near-infrared light and metal reflects infrared light. We leverage near-infrared imaging to identify the location of each gate, based on the signatures of metal structures reflected by the lowest metal layer.
By comparing the imaged results to the pre-fabrication design, we can identify any modifications, shifts, or replacements in the circuits and thereby detect HTs. Once the silicon can be trusted, the computing system must use secure communication channels for its applications. Low-energy devices, such as Internet of Things (IoT) devices, rely on strong cryptographic algorithms (e.g., AES, RSA, and SHA) during communication. These cryptographic operations cost IoT devices a significant amount of power, so the power budget limits their applications. To mitigate the high power consumption, modern processors embed these cryptographic operations as hardware primitives, which also improves system performance. A hardware unit embedded in the processor provides high energy efficiency at low energy cost, but hardware implementations limit flexibility. The lifetime of IoT devices can exceed that of the cryptographic algorithms they implement, and replacing the devices is costly and sometimes prohibitive, e.g., monitors in nuclear reactors. To allow cryptographic algorithms to be reconfigured in hardware, we have developed a system with a reconfigurable encryption engine on the Zedboard platform. The hardware implementation of the engine ensures fast, energy-efficient cryptographic operations. With reliable hardware and secure communication channels in place, the computing system should detect any malicious behavior in its processes. We have explored the use of HPCs in malware detection. HPCs are hardware units that count micro-architectural events, such as cache hits/misses and floating-point operations. Anti-virus software is commonly used to detect malware, but it also introduces performance overhead. To reduce this overhead, many researchers propose using HPCs with machine learning models for malware detection.
However, it is counter-intuitive that high-level program behaviors would manifest themselves in low-level statistics. We run experiments on 2-3× more programs than previous works and perform a rigorous analysis to determine whether HPCs can be used to detect malware. Our results show that the False Discovery Rate of malware detection can reach 20%. If we deployed this detection system on a freshly installed Windows 7 system, 198 of its 1,323 binaries would be flagged as malware.
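
The deployment figure is easiest to read as arithmetic. The sketch below uses the standard definition FDR = FP / (FP + TP) and the two counts quoted in the abstract; the assumption that a fresh Windows 7 install contains only benign binaries (so every flag is a false positive) is ours, and the function name is illustrative.

```python
def false_discovery_rate(false_positives: int, true_positives: int) -> float:
    """FDR = FP / (FP + TP): the share of flagged items that are not actually malware."""
    flagged = false_positives + true_positives
    return false_positives / flagged if flagged else 0.0


# Counts quoted in the abstract: 198 of 1,323 binaries flagged on a fresh install.
total_binaries = 1323
flagged_binaries = 198

# Assumption: a fresh install holds only benign binaries, so all flags are false positives.
share_flagged = flagged_binaries / total_binaries
print(f"{share_flagged:.1%} of benign binaries flagged")
print(f"FDR on this install: {false_discovery_rate(flagged_binaries, 0):.0%}")
```

Under that assumption, roughly 15% of the installed binaries would be flagged, and every one of those alerts would be a false alarm.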

33

Nüssle, Mondrian. "Acceleration of the hardware software interface of a communication device for parallel systems / submitted by Mondrian Benediktus Nüßle." 2009. http://d-nb.info/993238440/34.

34

Taddei, Ruggero. "Numerical Techniques for Antenna Arrays: Multi-Objective Optimization and Method of Moments Acceleration." Doctoral thesis, 2015. http://hdl.handle.net/2158/976428.

Abstract:

The approximate solution of Maxwell's equations by numerical techniques is known as Computational Electromagnetics (CEM). CEM techniques have been available for close to four decades, and they now form an invaluable part of RF, antenna, and microwave engineering practice. The present work focuses on two different applications of numerical techniques in Computational Electromagnetics: numerical optimization and full-wave techniques.

35

Abell, Stephen W. "Parallel acceleration of deadlock detection and avoidance algorithms on GPUs." Thesis, 2013. http://hdl.handle.net/1805/3653.

Abstract:

Indiana University-Purdue University Indianapolis (IUPUI)
Current mainstream computing systems have become increasingly complex. Most of them have Central Processing Units (CPUs) that spawn multiple threads for their computing tasks. The growing issue with these systems is resource contention, and with resource contention comes the risk of the system reaching a deadlock. Various software and hardware approaches implement deadlock detection/avoidance techniques; however, they lack either the speed or the problem-size capability needed for real-time systems. The research conducted for this thesis aims to resolve the issues of past approaches by converging the two platforms (software and hardware) by means of the Graphics Processing Unit (GPU). This thesis presents two GPU-based deadlock detection algorithms and one GPU-based deadlock avoidance algorithm: (i) GPU-OSDDA: A GPU-based Single Unit Resource Deadlock Detection Algorithm, (ii) GPU-LMDDA: A GPU-based Multi-Unit Resource Deadlock Detection Algorithm, and (iii) GPU-PBA: A GPU-based Deadlock Avoidance Algorithm. Both GPU-OSDDA and GPU-LMDDA use the Resource Allocation Graph (RAG) to represent the resource allocation status of the system, with the RAG stored as integer-length bit-vectors. This approach has several advantages: (i) less memory is required for the algorithm matrices, (ii) 32 computations are performed per instruction (in most cases), and (iii) the algorithms can handle large numbers of processes and resources. The deadlock detection algorithms also require minimal interaction with the CPU because matrix storage and algorithm computations reside on the GPU, which provides an interactive, service-like behavior. As a result, both algorithms achieve speedups of up to more than two orders of magnitude over their serial CPU implementations (3.17-317.42x for GPU-OSDDA and 37.17-812.50x for GPU-LMDDA).
Lastly, GPU-PBA is the first parallel deadlock avoidance algorithm implemented on the GPU. While it does not achieve two orders of magnitude speedup over its CPU implementation, it does provide a platform for future deadlock avoidance research for the GPU.
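
To illustrate why bit-vectors help, here is a minimal serial sketch of classic RAG reduction for single-unit resources. Each process's requested and held resources are packed into one integer (Python ints stand in for the 32-bit words), so a single AND tests all of a process's requests at once. `deadlocked_processes` is a hypothetical name, and this is not the thesis's GPU-OSDDA kernel, only the underlying graph-reduction idea.

```python
def deadlocked_processes(requests, holdings):
    """Reduce a single-unit resource allocation graph stored as bit-vectors.

    requests[p]: bit r set if process p is waiting for resource r.
    holdings[p]: bit r set if process p currently holds resource r.
    Returns the processes left unreduced (deadlocked or blocked on a deadlock).
    """
    active = set(range(len(requests)))
    progress = True
    while progress:
        progress = False
        for p in sorted(active):
            # Union of everything held by the other unfinished processes.
            held_by_others = 0
            for q in active:
                if q != p:
                    held_by_others |= holdings[q]
            # One AND covers every requested resource: zero means nothing blocks p.
            if (requests[p] & held_by_others) == 0:
                active.remove(p)  # p can finish and release what it holds
                progress = True
                break
    return active
```

For example, if P0 holds R0 and waits for R1 while P1 holds R1 and waits for R0, `deadlocked_processes([0b10, 0b01, 0], [0b01, 0b10, 0b100])` reduces the idle P2 and leaves `{0, 1}`. A GPU version would evaluate the per-process tests in parallel rather than one at a time.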

36

Lin, Jing-bin, and 林景彬. "Software Accelerator Discussion for H.264/AVC." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/29527963119602892761.

Abstract:

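Master's thesis
Southern Taiwan University of Science and Technology
Department of Electronic Engineering
Academic year 96 (ROC calendar)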
With the flourishing development of multimedia technology and the Internet, multimedia applications have become very popular. Because large amounts of image data must be transmitted and stored, high-performance video compression techniques play an important and unavoidable role in image processing. A new video compression standard, H.264, was proposed after the release of the MPEG-1, MPEG-2, and MPEG-4 standards. It achieves a higher compression ratio than MPEG-4 and has recently become a major standard in the multimedia field. The H.264 decoder is highly complex. If the decoder is implemented in hardware, the cost is high and the flexibility is low. However, performance is very low when the decoder is implemented in software (C language). In this thesis, we discuss a design methodology of software acceleration for implementing an H.264 decoder on an embedded system. In the H.264 decoder, the IDCT and memory operations account for a large share of the computation. Instead of implementing these parts in C, we implement them in assembly language, using the multimedia instructions provided by the platform (Wireless MMX) to improve decoder performance. The new decoder runs on the embedded system under Windows CE (WinCE). The experimental results show that the software-accelerated decoder performs 16.76% better than the C-language implementation.
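
In H.264, the inverse transform is a 4x4 integer approximation of the IDCT built only from additions, subtractions, and shifts. That makes it a natural target for packed SIMD instructions such as Wireless MMX. Below is a minimal scalar reference in Python, written as a sketch of the kind of loop such assembly replaces. It covers the core transform only: dequantization, residual addition, and clipping are omitted. `idct4x4` and `_butterfly` are hypothetical names, and this is not the thesis's code.

```python
def _butterfly(d0, d1, d2, d3):
    """One 1-D pass of the 4-point inverse core transform (adds, subtracts, shifts only)."""
    e = d0 + d2
    f = d0 - d2
    g = (d1 >> 1) - d3
    h = d1 + (d3 >> 1)
    return [e + h, f + g, f - g, e - h]


def idct4x4(block):
    """Inverse 4x4 integer transform of a dequantized coefficient block (row-major lists)."""
    rows = [_butterfly(*row) for row in block]  # horizontal pass over each row
    # Vertical pass: cols[j] holds the transformed column j.
    cols = [_butterfly(*(rows[i][j] for i in range(4))) for j in range(4)]
    # Final rounding and scaling by 1/64, transposed back to row-major order.
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]
```

A DC-only block such as `[[64, 0, 0, 0], [0] * 4, [0] * 4, [0] * 4]` yields a 4x4 block of ones. Each `_butterfly` call maps onto a few packed add/subtract/shift instructions when four lanes are processed at once.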

37

Yuan, Yi. "A microprocessor performance and reliability simulation framework using the speculative functional-first methodology." Thesis, 2011. http://hdl.handle.net/2152/ETD-UT-2011-12-4848.

Abstract:

With the high complexity of modern-day microprocessors and the slow speed of cycle-accurate simulations, architects are often unable to adequately evaluate their designs during the architectural exploration phases of chip design. This thesis presents the design and implementation of the timing partition of the cycle-accurate, microarchitecture-level SFFSim-Bear simulator. SFFSim-Bear is an implementation of the speculative functional-first (SFF) methodology and uses a hybrid software-FPGA platform to increase simulation throughput. The timing partition, implemented in the FPGA, features throughput-oriented, latency-tolerant designs to cope with the challenges of the hybrid platform. Furthermore, a fault injection framework is added to this implementation that allows designers to study the reliability aspects of their processors. The result is a simulator that is fast, accurate, flexible, and extensible.

38

Lin, Zi-Gang, and 林子剛. "Design of Stack Memory Device and System Software for Java Accelerator IP." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/61631031609034851274.

39

Neto, Nuno Miguel Ladeira. "A Container-based architecture for accelerating software tests via setup state caching and parallelization." Master's thesis, 2019. https://hdl.handle.net/10216/122203.

40

Chang, Keng-Chia, and 張耿嘉. "Adaboost-based Hardware Accelerator DIP Design and Hardware/Software Co-simulation for Face Detection." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/96k2r2.

Abstract:

Master's thesis
National Chung Hsing University
Department of Electrical Engineering
Academic year 106 (ROC calendar)
In recent years, car accidents caused by driver fatigue have become frequent. Scholars and experts around the world have therefore devoted great effort to this issue, developing suitable detection technologies to reduce accidents caused by driver drowsiness. For fatigue detection, the driver's alertness can be evaluated through eye blinking. The proposed design therefore implements a hardware accelerator that processes the large amount of highly repetitive data on a hardware/software co-design platform for a drowsiness detection system. In the proposed fatigue detection system, an eye detection method with hardware acceleration, which locates the face and eyes accurately, is proposed to make driver fatigue detection more efficient. The proposed system has four parts: face detection, eyeglasses-bridge detection, eye detection, and eye closure detection. First, input images are captured by a near-infrared (NIR) camera with 720x480 resolution. The system uses gray-scale images without any color information in all steps, and the design works effectively in both daytime and nighttime. Second, for face detection, the system uses a machine learning method to detect the face position and size, and this geometric information is used to narrow the search range for the driver's eyes. In this thesis, the design uses an Adaboost-based hardware accelerator for face detection. Once the face size and position are known, the system can reduce the eye search range. The hardware accelerator architecture for face detection is the main contribution of the thesis, and the accelerator has expandable classifier features: if the system needs a more complex machine learning classifier, the hardware-based classifier can be conveniently expanded and improved in future work.
In the experimental results, the proposed hardware accelerator, implemented in 90 nm CMOS technology, achieves an average processing rate of 331 frames/s, which meets the requirements of real-time applications. Furthermore, the hardware architecture of the face classifier could be expanded for a more complex training module. To meet real-time requirements, the input image size and the complexity of the face classifier can be adjusted to improve system accuracy.
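
The abstract says only "Adaboost-based". The sketch below therefore assumes a Viola-Jones-style detector, the usual form of Adaboost face detection. Rectangle features are read from an integral image in four lookups. Each boosted stage sums the weights (alpha) of the weak classifiers that fire and compares the total with a stage threshold, and a window is rejected at the first failing stage. All function names and the `(feature, threshold, polarity, alpha)` tuple layout are hypothetical. This is a software reference model, not the thesis's hardware architecture.

```python
def integral_image(img):
    """ii[y][x] = sum of img over rows < y and columns < x (zero-padded first row/column)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def stage_passes(ii, weak_classifiers, stage_threshold):
    """Boosted stage: add alpha for each weak classifier that fires, then threshold the sum."""
    score = 0.0
    for feature, threshold, polarity, alpha in weak_classifiers:
        if polarity * feature(ii) < polarity * threshold:
            score += alpha
    return score >= stage_threshold


def cascade_detect(ii, stages):
    """A window is accepted only if every stage passes; all() stops at the first failure."""
    return all(stage_passes(ii, weak, thr) for weak, thr in stages)
```

The early exit in `cascade_detect` is what makes cascades cheap on average, since most windows fail in the first stages. The fixed, repetitive feature arithmetic is the part that suits a hardware pipeline.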

41

Ojogbo, Ejebagom J. "ZipThru: A software architecture that exploits Zipfian skew in datasets for accelerating Big Data analysis." Thesis, 2020.

Abstract:

In the past decade, Big Data analysis has become a central part of many industries, including entertainment, social networking, and online commerce. MapReduce, pioneered by Google, is a popular programming model for Big Data analysis, known for its easy programmability thanks to automatic data partitioning, fault tolerance, and high performance. The majority of MapReduce workloads are summarizations, where the final output is a per-key "reduced" version of the input, highlighting a shared property of each key in the input dataset.

While MapReduce was originally proposed for massive data analyses on networked clusters, the model also applies to datasets small enough to be analyzed on a single server. In this single-server context, the intermediate tuple state generated by mappers is saved to memory, and reducers may process it only after all Map tasks have finished. This Map-then-Reduce sequential execution leads to distant reuse of the intermediate state, resulting in poor locality for memory accesses. In addition, the intermediate state is often too large to fit in the on-chip caches, leading to numerous cache misses as the state grows during execution and further degrading performance. It is well known, however, that many large datasets used in these workloads possess a Zipfian/power-law skew, where a minority of keys (e.g., 10%) appear in a majority of tuples/records (e.g., 70%).

I propose ZipThru, a novel MapReduce software architecture that exploits this skew to keep the tuples for the popular keys on-chip, processing them on the fly and thus improving reuse of their intermediate state and curtailing off-chip misses. ZipThru achieves this using four key mechanisms: 1) Concurrent execution of both Map and Reduce phases; 2) Holding only the small, reduced state of the minority of popular keys on-chip during execution; 3) Using a lookup table built from pre-processing a subset of the input to distinguish between popular and unpopular keys; and 4) Load balancing the concurrently executing Map and Reduce phases to efficiently share on-chip resources.

Evaluations using Phoenix, a shared-memory MapReduce implementation, on 16- and 32-core servers show that ZipThru incurs 72% fewer cache misses on average than traditional MapReduce while achieving average speedups of 2.75x and 1.73x on the two machines, respectively.
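
Mechanisms 2 and 3 can be illustrated with a word count, the canonical summarization workload. The sketch below is sequential Python. It assumes that a prefix of the input is a good enough sample for picking popular keys, and it omits the concurrent execution and load balancing of mechanisms 1 and 4. Popular keys are reduced on the fly into a small table. Tuples for unpopular keys are buffered and reduced afterwards, as in conventional MapReduce. The function names and the `sample_size`/`hot_fraction` parameters are hypothetical, not ZipThru's actual interface.

```python
from collections import Counter, defaultdict


def build_hot_keys(records, sample_size, hot_fraction):
    """Pre-process a prefix of the input to guess which keys are popular."""
    sample_counts = Counter(records[:sample_size])
    k = max(1, int(len(sample_counts) * hot_fraction))
    return {key for key, _ in sample_counts.most_common(k)}


def skew_aware_word_count(records, sample_size=1000, hot_fraction=0.1):
    """Word count in which popular keys are reduced on the fly and the rest are deferred."""
    hot_state = {key: 0 for key in build_hot_keys(records, sample_size, hot_fraction)}
    cold_tuples = []  # intermediate (key, 1) pairs for unpopular keys
    for key in records:  # "map" emits (key, 1) for every record
        if key in hot_state:
            hot_state[key] += 1  # reduce immediately into the small, reused state
        else:
            cold_tuples.append((key, 1))
    result = defaultdict(int, hot_state)
    for key, value in cold_tuples:  # conventional deferred reduce for the long tail
        result[key] += value
    return dict(result)
```

With Zipfian input, most records hit the small `hot_state` table, which is the part intended to stay cache-resident. Only the long tail grows the intermediate buffer.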

42

"Efficient and Secure Deep Learning Inference System: A Software and Hardware Co-design Perspective." Doctoral diss., 2020. http://hdl.handle.net/2286/R.I.62825.

Abstract:

The recent advances in Deep Learning (DL) have demonstrated its potential to approach or surpass human-level performance across multiple domains. Consequently, there is rising demand to deploy state-of-the-art DL algorithms, e.g., Deep Neural Networks (DNNs), in real-world applications to relieve people of repetitive work. On the one hand, the impressive performance of DNNs normally comes with intensive memory and power usage due to enormous model sizes and heavy computation workloads, which significantly hampers their deployment on resource-limited cyber-physical systems or edge devices. Improving DNN inference efficiency has therefore attracted great research interest across various communities. On the other hand, scientists and engineers still have insufficient knowledge of the principles of DNNs, which are therefore mostly treated as black boxes. Under such circumstances, a DNN is like "the sword of Damocles": its security and fault tolerance are essential concerns that cannot be circumvented. Motivated by these concerns, this dissertation comprehensively investigates the emerging efficiency and security issues of DNNs from both software and hardware design perspectives. From the efficiency perspective, model compression via quantization, the foundational technique for efficient inference of a target DNN, is elaborated. To maximize the inference performance boost, the deployment of quantized DNNs on a revolutionary Computing-in-Memory based neural accelerator is presented in a cross-layer (device/circuit/system) fashion. From the security perspective, the well-known adversarial attack is investigated, spanning from its original input attack form (a.k.a. adversarial example generation) to its parameter attack variant.
Dissertation/Thesis
Doctoral Dissertation Electrical Engineering 2020
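
The abstract does not say which quantization scheme the dissertation uses. As a point of reference, the sketch below shows uniform symmetric per-tensor quantization, a common baseline. All weights share one scale chosen so that the largest magnitude maps to the largest signed integer, and reconstruction error is bounded by half a step. The function names are hypothetical.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8-bit weights
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return [v * scale for v in q]
```

Storing `q` as 8-bit integers cuts weight memory roughly fourfold compared with 32-bit floats, which is the kind of saving that makes deployment on edge devices and in-memory accelerators practical.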

43

Ramesh, Chinthala. "Hardware-Software Co-Design Accelerators for Sparse BLAS." Thesis, 2017. http://etd.iisc.ac.in/handle/2005/4276.

Abstract:

Sparse Basic Linear Algebra Subroutines (Sparse BLAS) is an important library. Sparse BLAS includes three levels of subroutines: Level 1, Level 2, and Level 3. Level 1 Sparse BLAS routines compute over a sparse vector and a sparse/dense vector. Level 2 deals with sparse matrix-vector operations. Level 3 deals with sparse matrix-dense matrix operations. On General Purpose Processors (GPPs), these Sparse BLAS routines not only underutilize hardware resources but also take more compute time than the workload warrants, due to the poor data locality of sparse vector/matrix storage formats. In the literature, tremendous effort has gone into software to improve the performance of these routines on GPPs. GPPs are best suited to applications with high data locality, whereas Sparse BLAS routines operate on data with low locality; hence, GPP performance is poor. Various Custom Function Units (hardware accelerators) have been proposed in the literature and have proved more efficient than software attempts to accelerate Sparse BLAS subroutines. Although existing hardware accelerators improve Sparse BLAS performance compared to software routines, there is still much room to improve them. This thesis describes both the existing software solutions and hardware/software co-designs (HW/SW co-designs) and identifies the limitations of these existing solutions. We propose a new sparse data representation called Sawtooth Compressed Row Storage (SCRS) and corresponding SpMV and SpMM algorithms. SCRS-based SpMV and SpMM perform better than existing software solutions, but they still do not reach theoretical peak performance. The knowledge gained from studying the limitations of existing solutions, including the proposed SCRS-based SpMV and SpMM, is used to propose new HW/SW co-designs.
Because software accelerators are limited by the hardware properties of GPPs and GPUs themselves, we propose HW/SW co-designs to accelerate a few basic Sparse BLAS operations (SpVV and SpMV). Our proposed Parallel Sparse BLAS HW/SW co-design achieves near-theoretical peak performance with reasonable hardware resources.
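
The abstract does not describe the SCRS layout itself. For context, the sketch below shows SpMV over conventional Compressed Row Storage (CRS, also called CSR), which SCRS presumably builds on given its name. The indirect read `x[col_idx[k]]` is the poor-locality access pattern the abstract blames for low GPP performance. `csr_spmv` is a hypothetical name, and this is a baseline illustration, not the thesis's algorithm.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for A stored in CSR form.

    values[k]  : k-th nonzero, stored row by row
    col_idx[k] : column index of values[k]
    row_ptr[i] : offset of row i's first nonzero; row_ptr[-1] == len(values)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # gather from x: irregular, cache-unfriendly
        y[i] = acc
    return y
```

For the matrix `[[1, 0, 2], [0, 0, 3], [4, 5, 0]]`, the CSR arrays are `values = [1, 2, 3, 4, 5]`, `col_idx = [0, 2, 2, 0, 1]`, and `row_ptr = [0, 2, 3, 5]`. Multiplying by `x = [1, 2, 3]` gives `[7.0, 9.0, 14.0]`.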
