Dissertations / Theses on the topic 'Coarse-Grained Reconfigurable Arrays'

Consult the top 41 dissertations / theses for your research on the topic 'Coarse-Grained Reconfigurable Arrays'.

1

Lee, Jong-Suk Mark. "FleXilicon: a New Coarse-grained Reconfigurable Architecture for Multimedia and Wireless Communications." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/77094.

Full text
Abstract:
High computing power and flexibility are important design factors for multimedia and wireless communication applications, due to the demand for high quality services and the frequent evolution of standards. The ASIC (Application Specific Integrated Circuit) approach provides an area-efficient, high-performance solution, but is inflexible. In contrast, the general purpose processor approach is flexible, but often fails to provide sufficient computing power. Reconfigurable architectures, which have been introduced as a compromise between the two extreme solutions, have been applied successfully to multimedia and wireless communication applications. In this thesis, we investigated a new coarse-grained reconfigurable architecture called FleXilicon, which is designed to execute critical loops efficiently and is embedded in an SoC with a host processor. FleXilicon improves resource utilization and achieves a high degree of loop-level parallelism (LLP). The proposed architecture aims to mitigate major shortcomings of existing architectures through three schemes: (i) wider memory bandwidth, (ii) a reconfigurable controller, and (iii) flexible wordlength support. Increased memory bandwidth satisfies the memory access requirements of LLP execution. The new reconfigurable controller design minimizes reconfiguration overhead and improves area efficiency. Flexible wordlength support improves LLP by increasing the number of processing elements that can execute concurrently. The simulation results indicate that FleXilicon reduces the number of clock cycles and increases the speed for all five applications simulated. The speedup ratios compared with conventional architectures are as large as two orders of magnitude for some applications. VLSI implementation of FleXilicon in a 65 nm CMOS process indicates that the proposed architecture can operate at frequencies up to 1 GHz with moderate silicon area.
Ph. D.
2

Saraswat, Rohit. "A Finite Domain Constraint Approach for Placement and Routing of Coarse-Grained Reconfigurable Architectures." DigitalCommons@USU, 2010. https://digitalcommons.usu.edu/etd/689.

Full text
Abstract:
Scheduling, placement, and routing are important steps in Very Large Scale Integration (VLSI) design. Researchers have developed numerous techniques to solve placement and routing problems. As the complexity of Application Specific Integrated Circuits (ASICs) increased over the past decades, so did the demand for improved place and route techniques. The primary objective of these place and route approaches has typically been wirelength minimization due to its impact on signal delay and design performance. With the advent of Field Programmable Gate Arrays (FPGAs), the same place and route techniques were applied to FPGA-based design. However, traditional place and route techniques may not work for Coarse-Grained Reconfigurable Architectures (CGRAs), which are reconfigurable devices offering wider path widths than FPGAs and more flexibility than ASICs, due to the differences in architecture and routing network. Further, the routing network of several types of CGRAs, including the Field Programmable Object Array (FPOA), has deterministic timing as compared to the routing fabric of most ASICs and FPGAs reported in the literature. This necessitates a fresh look at alternative approaches to place and route designs. This dissertation presents a finite domain constraint-based, delay-aware placement and routing methodology targeting an FPOA. The proposed methodology takes advantage of the deterministic routing network of CGRAs to perform a delay aware placement.
3

Das, Satyajit. "Architecture and Programming Model Support for Reconfigurable Accelerators in Multi-Core Embedded Systems." Thesis, Lorient, 2018. http://www.theses.fr/2018LORIS490/document.

Full text
Abstract:
Emerging trends in embedded systems and applications call for high throughput and low power consumption. Due to the increasing demand for low power computing and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient hardware accelerators. The main drawback of hardware accelerators is that they are not programmable: their utilization can be low because each performs one specific function, and increasing the number of accelerators in a system on chip (SoC) causes scalability issues. Programmable accelerators provide flexibility and solve the scalability issues. The Coarse-Grained Reconfigurable Array (CGRA), an architecture consisting of several processing elements with word-level granularity, is a promising choice of programmable accelerator. Inspired by the promising characteristics of programmable accelerators, the potential of CGRAs in near-threshold computing platforms is studied and an end-to-end CGRA research framework is developed in this thesis. The major contributions of this framework are: CGRA design, implementation, integration in a computing system, and compilation for the CGRA. First, the design and implementation of a CGRA named Integrated Programmable Array (IPA) is presented. Next, the problem of mapping applications with control and data flow onto a CGRA is formulated. From this formulation, several efficient algorithms are developed that use the internal resources of the CGRA, with a vision for low power acceleration. The algorithms are integrated into an automated compilation flow. Finally, the IPA accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing Platform, to explore heterogeneous computing.
4

Dogan, Rabia. "System Level Exploration of RRAM for SRAM Replacement." Thesis, Linköpings universitet, Elektroniksystem, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-92819.

Full text
Abstract:
Recently, effective usage of the chip area has come to play an essential role in System-on-Chip (SoC) designs. Nowadays on-chip memories take up more than 50% of the total die area and are responsible for more than 40% of the total energy consumption. Cache memory alone occupies 30% of the on-chip area in the latest microprocessors. This thesis project, "System Level Exploration of RRAM for SRAM Replacement", describes a Resistive Random Access Memory (RRAM) based memory organization for Coarse Grained Reconfigurable Array (CGRA) processors. Compared to a conventional Static Random Access Memory (SRAM) based memory organization, the RRAM based organization offers benefits in both energy and area requirements. Due to the ever-growing problems faced by conventional memories with Dynamic Voltage Scaling (DVS), emerging memory technologies have gained importance. RRAM is typically seen as a possible non-volatile memory (NVM) candidate as Flash approaches its scaling limits. Replacing SRAM in the lowest layers of the memory hierarchies of embedded systems with RRAM is a very attractive research topic; RRAM technology offers reduced energy and area requirements, but it has limitations with regard to endurance and write latency. Because of the technological limitations and restrictions on solving RRAM write-related issues, it becomes beneficial to explore memory access schemes that tolerate the longer write times. Therefore, since the RRAM write time cannot realistically be reduced, we have to derive instruction memory and data memory access schemes that tolerate the longer write times. We present an instruction memory access scheme that copes with these problems. In addition to the modified instruction memory architecture, we investigate the effect of the longer write times on the data memory. Experimental results show that the proposed architectural modifications can reduce read energy consumption by a significant margin without any performance penalty.
5

Zain-ul-Abdin. "Programming of coarse-grained reconfigurable architectures." Doctoral thesis, Örebro universitet, Akademin för naturvetenskap och teknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-15246.

Full text
Abstract:
Coarse-grained reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention in order to meet not only the increased computational demands of high-performance embedded systems, but also the need for adaptability to the functional requirements of the application. This thesis focuses on the programming aspects of such coarse-grained reconfigurable computing devices, including the relevant computation models that are capable of exposing different kinds of parallelism inherent in the application and the ability of these models to capture the adaptability requirements of the application. The thesis suggests the occam-pi language for programming a broad class of coarse-grained reconfigurable architectures as an intermediate language; we call it intermediate, since we believe that application programming is best done in a high-level domain-specific language. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language, and backends were developed to target two different coarse-grained reconfigurable architectures: XPP and Ambric. The results on XPP reveal that the occam-pi based implementations produce throughput comparable to that of NML programs, while programming at a much higher level of abstraction than that of NML. Similarly, the two occam-pi implementations of autofocus criterion calculation targeted to the Ambric platform outperform the CPU implementation by factors of 11-23. Thus, the results of the implemented case studies suggest that the occam-pi language based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
6

Ul-Abdin, Zain. "Programming of Coarse-Grained Reconfigurable Architectures." Doctoral thesis, Högskolan i Halmstad, Centrum för forskning om inbyggda system (CERES), 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-15050.

Full text
Abstract:
Coarse-grained reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention in order to meet not only the increased computational demands of high-performance embedded systems, but also the need for adaptability to the functional requirements of the application. This thesis focuses on the programming aspects of such coarse-grained reconfigurable computing devices, including the relevant computation models that are capable of exposing different kinds of parallelism inherent in the application and the ability of these models to capture the adaptability requirements of the application. The thesis suggests the occam-pi language for programming a broad class of coarse-grained reconfigurable architectures as an intermediate language; we call it intermediate, since we believe that application programming is best done in a high-level domain-specific language. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language, and backends were developed to target two different coarse-grained reconfigurable architectures: XPP and Ambric. The results on XPP reveal that the occam-pi based implementations produce throughput comparable to that of NML programs, while programming at a much higher level of abstraction than that of NML. Similarly, the two occam-pi implementations of autofocus criterion calculation targeted to the Ambric platform outperform the CPU implementation by factors of 11-23. Thus, the results of the implemented case studies suggest that the occam-pi language based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
7

Guo, Yuanqing. "Mapping applications to a coarse-grained reconfigurable architecture." Enschede : University of Twente [Host], 2006. http://doc.utwente.nl/57121.

Full text
8

Bag, Zeki Ozan. "Energy-Aware Coarse Grained Reconfigurable Architectures Using Dynamically Reconfigurable Isolation Cells." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-108217.

Full text
Abstract:
This thesis presents a self-adaptive power management system to improve the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). CGRAs can host multiple applications on a single platform. Moreover, a single application may have multiple versions with different degrees of parallelism (fully serial, partially serial, fully parallel, etc.). Selection of the optimum application version depends on runtime conditions such as resource availability on the platform. A traditional worst-case design that satisfies the specifications results in poor power efficiency. Existing solutions to this problem rely on costly hardware, mainly to employ dynamic voltage and frequency scaling (DVFS). We propose exploiting reconfiguration of the available resources on the CGRA. Our solution makes use of dynamically reconfigurable isolation cells (DRICs) instead of dedicated hardware. We also introduce autonomous parallelism, voltage and frequency selection (APVFS) to realize the DVFS functionality and to select the optimum version. Three applications are used for simulations, namely matrix multiplication, finite impulse response (FIR) filtering and the fast Fourier transform (FFT). Results show that up to 72% power and 55% energy can be saved, respectively. Synthesis of the fabric shows a considerable reduction in area overhead compared to existing designs employing DVFS.
9

Plessl, Christian [Verfasser]. "Hardware Virtualization on a Coarse-Grained Reconfigurable Processor / Christian Plessl." Aachen : Shaker, 2006. http://d-nb.info/1166513963/34.

Full text
10

Yadav, Anil. "Exploration Of Energy And Area Efficient Techniques For Coarse-grained Reconfigurable Fabrics." Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc103413/.

Full text
Abstract:
Coarse-grained fabrics are comprised of multi-bit configurable logic blocks and configurable interconnect. This work is focused on area and energy optimization techniques for coarse-grained reconfigurable fabric architectures. A variety of design techniques have been explored to improve the utilization of computational resources and increase energy savings, including splitting, folding and multi-level vertical interconnect. In addition, I have studied fully connected homogeneous and heterogeneous architectures, and a 3D architecture. I have also examined some hybrid strategies for arranging the computation units. In order to perform energy and area analysis, I selected a set of signal and image processing benchmarks from the MediaBench suite. I implemented the various fabric architectures using a 90 nm ASIC process from Synopsys. Results show area improvements with energy savings as compared to the baseline architecture.
11

Yang, Yu. "BENCHMARK OF TRIGGERED INSTRUCTION BASED COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR RADIO BASE STATION." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177446.

Full text
Abstract:
Spatially programmed architectures such as FPGAs are among the most prevalent hardware in various application areas. However, FPGAs suffer from large overheads in area, latency and power efficiency. Coarse-Grained Reconfigurable Architectures (CGRAs) are designed to compensate for these disadvantages of FPGAs. In this thesis, a novel triggered-instruction-based CGRA designed by Intel is evaluated. The benchmarking work in this thesis focuses on the signal processing domain. Three performance-limiting functions, Channel Estimation, Radix-2 FFT and Interleaving, are selected from the open-source LTE Uplink Receiver PHY Benchmark, and are implemented and analyzed on the Triggered Instruction Architecture (TIA). Throughput-area and throughput/area-area relationships are summarized as curves using a resource estimation method. The benchmark results show that TIA offers good flexibility for temporal and spatial execution, and a mix of the two. Designs in TIA are scalable and adjustable according to different performance requirements. Moreover, based on the development work, this thesis discusses the development flow of TIA, various programming techniques, low latency mapping solutions, code size comparison, the development environment and the integration of a heterogeneous system with TIA.
12

Malik, Omer. "Pragma-Based Approach For Mapping DSP Functions On A Coarse Grained Reconfigurable Architecture." Licentiate thesis, KTH, Elektronik och Inbyggda System, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166410.

Full text
13

Nabi, Syed Waqar. "A coarse-grained dynamically reconfigurable MAC processor for power-sensitive multi-standard devices." Thesis, Connect to e-thesis, 2009. http://theses.gla.ac.uk/865/.

Full text
Abstract:
Thesis (Eng.D.) - University of Glasgow, 2009.
Eng.D. thesis submitted to the Universities of Glasgow, Strathclyde, Edinburgh and Heriot-Watt for the degree of Doctor of Engineering in System Level Integration, University of Glasgow, 2009. Includes bibliographical references. Print version also available.
14

FANNI, TIZIANA. "Power and Energy Management in Coarse-Grained Reconfigurable Systems: methodologies, automation and assessments." Doctoral thesis, Università degli Studi di Cagliari, 2019. http://hdl.handle.net/11584/260390.

Full text
Abstract:
In the era of Cyber-Physical Systems (CPS), designers need to cope with several constraints that have to be met at the same time. CPS are complex systems composed of different interactive and deeply intertwined components that have to change their behavioural modalities according to several factors, such as the environment status, requests from the user and even their internal status, thus requiring high flexibility and performance, possibly with low power consumption. The spectrum of existing computing systems ranges from general purpose to application specific systems. General purpose systems such as CPUs, GPUs and DSPs offer high flexibility but are not able to provide high performance, due to their poor specialization. On the other side, Application Specific Integrated Circuits (ASICs) offer high performance but do not provide flexibility at all, being designed to compute a single, specific application. In the middle between general purpose systems and ASICs lie the reconfigurable systems, which provide a valuable solution for addressing different requirements simultaneously. Reconfigurable systems offer a certain level of flexibility while guaranteeing high performance. However, two major issues still limit their wide applicability: high design complexity, implying huge engineering effort, as well as power inefficiencies. The activities behind my thesis address both these issues, with the primary focus on power consumption. The starting point is the definition of a set of strategies that, depending on the considered scenario and the chosen target device (ASIC or FPGA), may enable power/energy awareness and consumption optimization. In parallel, these strategies have been automated within different extensions of a dataflow-to-hardware design suite for coarse-grained reconfigurable systems.
15

Han, Wei. "Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/3812.

Full text
Abstract:
Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support.
16

Sciaraffa, Rocco. "A Reconfigurable Device for GALS Systems." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-235712.

Full text
Abstract:
Globally Asynchronous Locally Synchronous (GALS) Field-Programmable Gate Arrays (FPGAs) are composed of standard synchronous reconfigurable logic islands that communicate with each other via asynchronous means. Past research into fully asynchronous FPGAs has demonstrated high throughput and reliability by adopting dual-rail encoding. GALS FPGAs have been proposed that rely on bundled-data encoding and fixed asynchronous communication between synchronous islands. This thesis proposes a new GALS FPGA architecture with a fully reconfigurable asynchronous fabric, which relies on coarse-grained Configurable Logic Blocks (CLBs) to improve the communication capability of the device. Through dedicated datapath elements, asynchronous pipelines are efficiently mapped onto the device. The architecture is presented together with the customized tool flow needed to compile Verilog for this new coarse-grained reconfigurable circuit. The main purpose of this thesis is to map communication-oriented user circuits onto the proposed asynchronous fabric and evaluate their performance. The benchmark circuits target the design of a Network-on-Chip (NoC) router and employ a two-phase bundled-data protocol. The results are obtained through simulation and compared with the performance of the same circuits in a fine-grained classical FPGA style. The proposed architecture achieves up to 3.2x higher throughput and 2.9x lower latency than the classical one. The results show that the coarse-grained style efficiently maps asynchronous communication circuits, and it may be the starting point for future reconfigurable GALS systems. Future work should focus on improving the back-end synthesis and evaluating the FPGA GALS system as a whole.
17

Zhao, Xin. "High efficiency coarse-grained customised dynamically reconfigurable architecture for digital image processing and compression technologies." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6187.

Full text
Abstract:
Digital image processing and compression technologies have significant market potential, especially the JPEG2000 standard, which offers outstanding codestream flexibility and a high compression ratio. Strong demand for high performance digital image processing and compression system solutions is forcing designers to seek architectures that offer competitive advantages in terms of all performance metrics, such as speed and power. Traditional architectures such as ASICs, FPGAs and DSPs are limited by either low flexibility or high power consumption. On the other hand, through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching those of an ASIC, coarse-grained dynamically reconfigurable architectures are proving to be strong candidates for future high performance digital image processing and compression systems. This thesis investigates dynamically reconfigurable architectures and especially the newly emerging RICA paradigm. Case studies such as a Reed-Solomon decoder and a WiMAX OFDM timing synchronisation engine are implemented in order to explore the potential of RICA-based architectures and possible optimisation approaches such as eliminating conditional branches, reducing memory accesses and constructing kernels. Based on the investigations in this thesis, a novel customised dynamically reconfigurable architecture targeting digital image processing and compression applications is devised, which can be tailored to different applications. A demosaicing engine based on the Freeman algorithm is designed and implemented on the proposed architecture as the pre-processing module in a digital imaging system. An efficient data buffer rotating scheme is designed with the aim of reducing memory accesses, and an investigation into mapping the demosaicing engine onto a dual-core RICA platform is performed. After optimisation, the performance of the proposed engine is carefully evaluated and compared in terms of throughput and consumed computational resources. For the JPEG2000 standard, the core tasks, namely the 2-D Discrete Wavelet Transform (DWT) and Embedded Block Coding with Optimal Truncation (EBCOT), are implemented and optimised on the proposed architecture. A novel 2-D DWT architecture based on vector operations associated with the RICA paradigm is developed, and the complete DWT application is highly optimised for both throughput and area. For the EBCOT implementation, a novel Partial Parallel Architecture (PPA) for the most computationally intensive module in EBCOT, termed Context Modeling (CM), is devised. Based on the algorithm evaluation, an ARM core is integrated into the proposed architecture for performance enhancement. A ping-pong memory switching mode with a carefully designed communication scheme between the RICA-based architecture and the ARM core is proposed. Simulation results demonstrate that the proposed architecture offers a significant throughput advantage for JPEG2000.
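The 2-D DWT at the heart of JPEG2000 is separable and is normally built from 1-D lifting passes. Purely as background (this is not the vector-based RICA design described in the abstract), a minimal C sketch of one forward 1-D lifting pass of the reversible CDF 5/3 wavelet might look as follows; the even-length input and the simple symmetric edge handling are assumptions made for illustration.

```c
#include <stdio.h>

/* One forward 1-D lifting pass of the reversible CDF 5/3 wavelet (JPEG2000
 * lossless mode). Even samples produce the low-pass band s[], odd samples
 * the high-pass band d[]. n is assumed even; edges use a simple symmetric
 * extension. Reference sketch only, not a hardware-optimised version. */
static void dwt53_forward_1d(const int *x, int n, int *s, int *d)
{
    int half = n / 2;

    /* Predict step: detail coefficients from the odd samples. */
    for (int i = 0; i < half; i++) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i]; /* mirror edge */
        d[i] = x[2 * i + 1] - ((left + right) >> 1);
    }

    /* Update step: smoothed coefficients from the even samples. */
    for (int i = 0; i < half; i++) {
        int dl = (i > 0) ? d[i - 1] : d[0];                     /* mirror edge */
        s[i] = x[2 * i] + ((dl + d[i] + 2) >> 2);
    }
}

int main(void)
{
    int x[8] = {10, 12, 14, 13, 11, 9, 8, 8};
    int s[4], d[4];

    dwt53_forward_1d(x, 8, s, d);
    for (int i = 0; i < 4; i++)
        printf("s[%d]=%d  d[%d]=%d\n", i, s[i], i, d[i]);
    return 0;
}
```

A full 2-D transform applies such a pass first to every row and then to every column of the tile, which is where the vector operations and memory-access optimisations mentioned in the abstract pay off.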
18

Malla, Tika Kumari. "Case Studies to Learn Human Mapping Strategies in a Variety of Coarse-Grained Reconfigurable Architectures." Thesis, University of North Texas, 2017. https://digital.library.unt.edu/ark:/67531/metadc984195/.

Full text
Abstract:
Computer hardware and algorithm design have seen significant progress over the years. Yet there are still several domains in which humans are more effective than computers: for example, in image recognition, image tagging, and natural language understanding and processing, humans often accomplish with ease what requires complicated algorithms on a computer. This thesis presents case studies that examine human mapping strategies in order to tackle the mapping problem for coarse-grained reconfigurable architectures (CGRAs). To achieve optimum performance and low energy consumption in CGRAs, the place-and-route problem has always been a major concern. Human characteristics, such as pattern recognition and learning from experience, can be helpful in such problems. A computer mapping game called UNTANGLED was therefore analyzed as a medium for gathering insights into human mapping strategies across a variety of architectures. The purpose of this research was to learn from humans so that we can devise better algorithms that outperform the existing ones. We observed how human strategies vary as we present players with different architectures, different architectures with constraints, and different visualizations, as well as how the quality of the solutions changes with experience. All the case studies obtained from exploiting human strategies provide useful feedback that can improve upon existing algorithms. These insights can be adapted to find the best architectural solution for a particular domain and to guide future research on mapping onto mesh- and stripe-based CGRAs.
19

Jayabalan, Arun. "Development of a Massively Parallel Coarse Grained Reconfigurable Fabric verification Environment using Universal Verification Methodology." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-206099.

Full text
Abstract:
According to the International Technology Roadmap for Semiconductors (ITRS), there should be a 1000X improvement in performance with only a 120% increase in the power budget and no increase in design team size to deal with designs that are 10X more complex. One solution for coping with this complexity is to increase the granularity of the building blocks used when developing new architectures. As such a solution, the Dynamically Reconfigurable Resource Array (DRRA) with the Distributed Memory Architecture (DiMArch) was developed. As the design complexity increased, the need for verification became inevitable in the design flow. To support reusability, a reconfigurable verification environment is required to effectively verify the device under test (DUT) and also improve productivity in the design cycle. The thesis work begins with the specification and design, as well as the verification plans, for the DRRA and DiMArch. The major task of the thesis is the development of a reconfigurable verification environment for the DRRA using the Universal Verification Methodology (UVM) and a system-level verification test bench for the DiMArch. This thesis also considers possible power optimizations in the design.
20

Balavendran, Joseph Rani Deepika. "Gamification to Solve a Mapping Problem in Electrical Engineering." Thesis, University of North Texas, 2020. https://digital.library.unt.edu/ark:/67531/metadc1703330/.

Full text
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) are promising for developing high performance, low-power portable applications. In this research, we crowdsource a mapping problem using gamification to harness human intelligence. A scientific puzzle game, Untangled, was developed to solve a mapping problem by encapsulating architectural characteristics. The primary motive of this research is to draw insights from the mapping solutions of players, who possess innate abilities such as decision-making, creative problem-solving, recognizing patterns, and learning from experience. In this dissertation, an extensive analysis was conducted to investigate how players' computational skills help to solve an open-ended problem with different constraints. From this analysis, we discovered a few common strategies among players, and subsequently a library of dictionaries containing patterns identified in players' solutions was developed. The findings help to propose a better version of the game that incorporates the techniques recognized from the players' experience. In the future, an updated version of the game may help lower-performing players provide better solutions to the mapping problem. Eventually, these solutions may help to develop efficient mapping algorithms. In addition, this research can serve as an exemplar for future researchers who want to crowdsource such electrical engineering problems, and the approach can also be applied to other domains.
21

Jung, Lukas Johannes [Verfasser], Christian [Akademischer Betreuer] Hochberger, and Diana [Akademischer Betreuer] Göhringer. "Optimization of the Memory Subsystem of a Coarse Grained Reconfigurable Hardware Accelerator / Lukas Johannes Jung ; Christian Hochberger, Diana Göhringer." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2019. http://d-nb.info/1187919810/34.

Full text
22

Safdar, Muhammad. "Modeling in Simulink and Synthesis of Digital Pre-Distortion for WLAN Power Amplifiers on a Coarse-Grained Reconfigurable Fabric." Thesis, Linköpings universitet, Elektroniska Kretsar och System, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-132323.

Full text
Abstract:
High data rates are in strong demand nowadays in most communication systems, such as audio/video broadcasting, cable networks and wireless networks. They can be achieved using Orthogonal Frequency Division Multiplexing (OFDM), which is a bandwidth-efficient method. However, the major drawback of the OFDM technique is its high Peak-to-Average Power Ratio (PAPR). Due to this high PAPR, the amplified signal is distorted if its peaks are not controlled. This thesis investigates a PAPR reduction technique called the Fourier Projection Algorithm (FPA). In the thesis, the FPA is successfully designed to reduce the PAPR in OFDM systems and so avoid clipping. The results of the FPA show that the efficiency of the system depends on the throughput, the complexity, and the Tone Rate Loss (TRL) of the system. The simulations are first carried out in the SIMULINK and MATLAB environments, and the design is later synthesized on a coarse-grained reconfigurable fabric platform.
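For context, the PAPR figure that the algorithm targets is simply the ratio of the peak instantaneous power of the time-domain OFDM symbol to its average power, usually quoted in dB. The short C sketch below (illustrative only; the sample values are invented and this is not code from the thesis) computes that figure for a block of complex baseband samples.

```c
#include <math.h>
#include <stdio.h>

/* Peak-to-Average Power Ratio of a complex baseband block, in dB.
 * re[] and im[] hold the time-domain samples, n is the sample count. */
static double papr_db(const double *re, const double *im, int n)
{
    double peak = 0.0, sum = 0.0;

    for (int i = 0; i < n; i++) {
        double p = re[i] * re[i] + im[i] * im[i];  /* instantaneous power */
        if (p > peak)
            peak = p;
        sum += p;
    }
    return 10.0 * log10(peak / (sum / n));
}

int main(void)
{
    /* Hypothetical samples of a short OFDM symbol (not from the thesis). */
    double re[] = { 0.2, -0.9,  1.5, 0.1, -0.4, 0.7, -1.1,  0.3 };
    double im[] = { 0.5,  0.2, -0.3, 1.2, -0.8, 0.0,  0.6, -0.2 };

    printf("PAPR = %.2f dB\n", papr_db(re, im, 8));
    return 0;
}
```

A PAPR reduction scheme such as the FPA tries to lower this number so that the power amplifier can be driven closer to saturation without clipping the signal peaks.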
23

Muir, Mark I. R. "Re-targetable tools and methodologies for the efficient deployment of high-level source code on coarse-grained dynamically reconfigurable architectures." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/27072.

Full text
Abstract:
Reconfigurable computing traditionally consists of a data path machine (such as an FPGA) acting as a co-processor to a conventional microprocessor. This involves partitioning the application such that the data path intensive parts are implemented on the reconfigurable fabric, and the control flow intensive parts are implemented on the microprocessor. Often the two parts have to be written in different languages. New highly parallel data path architectures allow parallelism approaching that of FPGAs, but are able to be reconfigured very rapidly. As a result, it is possible to use these architectures to perform control flow in a manner similar to a microprocessor, and thus a complete program can be described from an unmodified high-level language (in particular C). This overcomes the historical instruction-level parallelism (ILP) wall. To make full use of the available parallelism, existing microprocessor tool flows are insufficient. Data path machines are typically programmed via HDL tools from the ASIC design world. This expresses algorithms at a lower level than the application algorithms are typically developed in. The work in this thesis builds upon earlier work to allow applications to be described from high-level languages, by employing low-level optimisations in the compiler back-end and working from the assembly, to maximise parallel efficiency. This consists of scheduling, where known techniques are used to pack instructions into basic blocks that map well to the reconfigurable core (optimising spatial efficiency); then automatic pipelining is applied to dramatically improve the achievable throughput (optimising temporal efficiency). Together these can be thought of as 'instruction-level parallelism done right'. Speed-ups of more than an order of magnitude were achieved, yielding throughputs of 180-380 MPixels/s on typical image signal processing tasks, matching the performance of hard-wired ASICs. Furthermore, conventional software-based simulation technologies for data path machines are too slow for use in application verification. This thesis demonstrates how a high-speed software emulator can be created for self-controlled dynamically reconfigurable data path machines, using a static serialisation of the data paths in each configuration context. This yields run-time performance several orders of magnitude higher than existing techniques, making it suitable for use in feedback-directed optimisation.
24

SAU, CARLO. "Dataflow based design suite for the development and management of multi-functional reconfigurable systems." Doctoral thesis, Università degli Studi di Cagliari, 2016. http://hdl.handle.net/11584/266751.

Full text
Abstract:
Embedded systems development constitutes an extremely challenging scenario for designers, since several constraints have to be met at the same time. Flexibility, performance and power efficiency are typically colliding requirements that are hardly addressed together. Reconfigurable systems provide a valuable alternative to common architectures for tackling all those issues at once. Such systems, and in particular coarse-grained ones, exhibit a certain level of flexibility while guaranteeing strong performance. However, they suffer from increased design and management complexity. This thesis discusses a fully automated methodology for the development of coarse-grained reconfigurable platforms, exploiting dataflow models for the description of the desired functionalities. The thesis describes a whole design suite that offers, besides reconfigurable substrate composition, structural optimisation, dynamic power management and co-processing support. All the provided features have been validated on different signal, image and video processing scenarios, targeting both FPGA and ASIC.
25

Peyret, Thomas. "Architecture matérielle et flot de programmation associé pour la conception de systèmes numériques tolérants aux fautes." Thesis, Lorient, 2014. http://www.theses.fr/2014LORIS348/document.

Full text
Abstract:
Whether in automotive applications subject to heat stress or in the aerospace and nuclear fields subject to cosmic, neutron and gamma radiation, the environment can lead to the occurrence of faults in electronic systems. These faults, which can be transient or permanent, lead to erroneous results that are unacceptable in some application contexts. The use of so-called rad-hard components is sometimes compromised by their high costs and by supply problems associated with export rules. This thesis proposes a joint hardware and software approach, independent of the integration technology, for using digital programmable devices in environments that generate faults. Our approach includes the definition of a Coarse-Grained Reconfigurable Architecture (CGRA) able to execute entire application codes, together with all the hardware and software mechanisms needed to make it tolerant to transient and permanent faults. This is achieved by combining redundancy with dynamic reconfiguration of the CGRA, based on a library of configurations generated by a complete design flow. This tool flow maps a code represented as a Control and Data Flow Graph (CDFG) onto the CGRA architecture, directly obtaining a large number of different configurations, and allows the full potential of the architecture to be exploited. This work, which has been validated through experiments with applications in the field of signal and image processing, has been the subject of two publications in international conferences and two patents.
26

"Path Selection Based Branching for Coarse Grained Reconfigurable Arrays." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.26802.

Full text
Abstract:
Coarse Grain Reconfigurable Arrays (CGRAs) are promising accelerators capable of achieving high performance at low power consumption. While CGRAs can efficiently accelerate loop kernels, accelerating loops with control flow (loops with if-then-else structures) is quite challenging. Techniques that handle control flow execution in CGRAs generally use predication. Such techniques execute both branches of an if-then-else structure and select the outcome of either branch to commit based on the result of the conditional. This results in poor utilization of the CGRA's computational resources. The dual-issue scheme, which is the state-of-the-art technique for control flow, fetches instructions from both paths of the branch and selects one to execute at runtime based on the result of the conditional. This technique incurs an overhead in instruction fetch bandwidth. In this thesis, to improve the performance of control flow execution in CGRAs, I propose a solution in which the result of the conditional expression that decides the branch outcome is communicated to the instruction fetch unit so that it selectively issues instructions from the path taken by the branch at run time. Experimental results show that my solution can achieve 34.6% better performance and a 52.1% improvement in energy efficiency on average compared to the state-of-the-art dual-issue scheme, without imposing any overhead in instruction fetch bandwidth.
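The predication scheme criticised above can be pictured at the source level as if-conversion: both arms of the if-then-else are computed every time and a select keeps the result of the arm the condition actually chose. The C sketch below is an illustration of that general idea only (it is not the thesis's CGRA instruction set), and shows why half of the issued operations are wasted work.

```c
#include <stdio.h>

/* Original control flow: only one arm executes per input. */
static int branch_version(int x)
{
    if (x & 1)
        return 3 * x + 1;   /* taken path   */
    return x >> 1;          /* fall-through */
}

/* If-converted (fully predicated) form: both arms execute unconditionally
 * and a select, keyed by the predicate, keeps the live result. */
static int predicated_version(int x)
{
    int pred     = x & 1;               /* predicate = branch condition    */
    int then_val = 3 * x + 1;           /* computed whether needed or not  */
    int else_val = x >> 1;              /* computed whether needed or not  */
    return pred ? then_val : else_val;  /* select on the predicate         */
}

int main(void)
{
    for (int x = 1; x <= 6; x++)
        printf("%d -> branch %d, predicated %d\n",
               x, branch_version(x), predicated_version(x));
    return 0;
}
```

The path-selection approach proposed in the thesis instead steers instruction issue with the predicate, so only the taken arm's instructions are fetched and executed.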
Dissertation/Thesis
Masters Thesis Electrical Engineering 2014
27

Kim, Yoonjin. "DESIGNING COST-EFFECTIVE COARSE-GRAINED RECONFIGURABLE ARCHITECTURE." 2009. http://hdl.handle.net/1969.1/ETD-TAMU-2009-05-649.

Full text
Abstract:
Application-specific optimization of embedded systems has become inevitable, as designers must satisfy market demand while meeting ever tighter constraints on cost, performance and power. On the other hand, the flexibility of a system is also important to accommodate the short time-to-market requirements of embedded systems. To reconcile these incompatible demands, the coarse-grained reconfigurable architecture (CGRA) has emerged as a suitable solution. A typical CGRA requires many processing elements (PEs) and a configuration cache for reconfiguration of its PE array. However, such a structure consumes significant area and power. Therefore, designing cost-effective CGRAs has been a serious concern for the reliability of CGRA-based embedded systems. As an effort to provide such a cost-effective design, the first half of this work focuses on reducing power in the configuration cache. For power saving in the configuration cache, a low power reconfiguration technique is presented based on reusable context pipelining, achieved by merging the concept of context reuse into context pipelining. In addition, we propose dynamic context compression, capable of keeping only the required bits of the context words enabled and the redundant bits disabled. Finally, we provide dynamic context management, capable of reducing power consumption in the configuration cache by controlling the read/write operations on redundant context words. In the second part of this dissertation, we focus on designing a cost-effective PE array to reduce area and power. For area and power saving in the PE array, we devise a cost-effective array fabric that addresses a novel rearrangement of processing elements and their interconnection designs to reduce area and power consumption. In addition, hierarchical reconfigurable computing arrays are proposed, consisting of two reconfigurable computing blocks with two types of communication structure. The two computing blocks share critical resources, and such a sharing structure provides an efficient communication interface between them while reducing overall area. Based on the proposed design approaches, a CGRA combining the multiple design schemes is presented to verify the synergy effect of the integrated approach. Experimental results show that the integrated approach reduces area by 23.07% and power by up to 72% when compared with a conventional CGRA.
28

Kwok, Zion Siu-On. "Register file architecture optimization in a coarse-grained reconfigurable array." Thesis, 2005. http://hdl.handle.net/2429/16551.

Full text
Abstract:
This thesis investigates the impact of the global and local register file architecture on a reconfigurable system based on the ADRES architecture. The register files consume a significant amount of area on the reconfigurable device, and their architecture has a strong impact on the performance. We found that the global registers should be tightly connected to as many functional units as possible, while the connection of the local register files to their neighbours is less critical. We found that the global register file should contain 14 registers, while each local register file should only contain two registers. We used these results to propose two new architectures that demonstrate between -33% and 383% higher instructions per cycle per unit area compared to the original 4x4 and 8x8 array architectures, with 56% and 88% average improvement over a set of benchmarks for the new 4x4 and 8x8 array architectures, respectively.
Applied Science, Faculty of
Electrical and Computer Engineering, Department of
Graduate
29

Ross, Dian Marie. "On designing coarse grain reconfigurable arrays to operate in weak inversion." Thesis, 2012. http://hdl.handle.net/1828/4362.

Full text
Abstract:
Field Programmable Gate Arrays (FPGAs) support the reconfigurable computing paradigm by providing an integrated circuit hardware platform that facilitates software-like reconfigurability. The addition of an embedded microprocessor and peripherals to traditional FPGA Configurable Logic Blocks (CLBs), interleaved with interconnections, has effectively resulted in a programmable system on-chip. FPGAs are used to support flexible implementations of Application Specific Integrated Circuit (ASIC) functions. Because FPGAs are reconfigurable, they are often used in place of ASICs during the circuit design process. FPGAs are also used when only a small number of ICs is required: ASICs necessitate large manufacturing runs to be economically viable; for smaller runs the use of FPGAs is an economic alternative. Application domains of interest, such as intelligent guidance systems, medical devices, and sensors, often require low power, inexpensive calculation of transcendental functions. COordinate Rotation DIgital Computer (CORDIC) is an iterative algorithm used to emulate hardware-expensive multipliers, such as Multiply/ACcumulate (MAC) units, with only shift and add operations. However, because CORDIC is a sequential algorithm, characterized as having the latency of a serial multiplier, techniques that speed up its computation have many applications. To this end, three implementations of standard CORDIC, (i) unrolled hardwired, (ii) unrolled programmable, and (iii) rolled programmable, were implemented on four Xilinx FPGA families: Virtex-4, -5, and -6, and Spartan-6. Although the hardwired unrolled version was found to have the greatest speed at the expense of no runtime flexibility, and the rolled programmable version was found to have the greatest flexibility and lowest silicon area consumption at the expense of the longest propagation delay, improvements to CORDIC implementations were still sought. Three parallelized CORDIC techniques, P-CORDIC, Flat-CORDIC, and Para-CORDIC, were implemented on the same four FPGA families. P-CORDIC and Flat-CORDIC were shown to have the lowest latency under various conditions; Para-CORDIC was found to perform well in deeply pipelined, high throughput circuits. Design rules for when to use standard versus precomputation CORDIC techniques are presented. To address the low power requirements of many applications of interest, the Unfolded Multiplexor-LRB (UMUX-LRB), a patent held by Sima et al., was analyzed in weak inversion across four transistor technology nodes (180nm, 130nm, 90nm, and 65nm). Previous work was also expanded from strong inversion across the 180nm, 130nm, and 90nm technology nodes to include 65nm. The UMUX-LRB interconnection network is based upon the Xilinx commercial interconnection network. Therefore, this network (MUX-LRB) and another static circuit technique, CMOS-Transmission Gates (CMOS-TG), were profiled across all four technology nodes to provide a baseline for comparison. This analysis found the UMUX-LRB to have the smallest and most balanced rising and falling edge propagation delay, in addition to having the greatest reliability under temperature and process variation.
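As background for the implementations compared above, the rolled (iterative) form of circular CORDIC in rotation mode needs only shifts, adds and a small arctangent table per iteration. A minimal fixed-point C sketch is shown below; the Q16 scaling, iteration count and table values are illustrative assumptions, not one of the FPGA implementations evaluated in the thesis.

```c
#include <stdio.h>

#define ITER 16
#define Q    65536.0                 /* Q16 fixed-point scale */

/* arctan(2^-i) in Q16 radians, i = 0..15 */
static const int atan_tab[ITER] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

/* Rotation-mode CORDIC: rotates (x, y) towards angle z using only shifts and
 * adds. Starting from x = 1/K (inverse of the CORDIC gain) and y = 0, the
 * result converges to x ~ cos(z), y ~ sin(z). */
static void cordic_sincos(int z, int *cos_out, int *sin_out)
{
    int x = 39797;                   /* 0.60725 * 2^16 for 16 iterations */
    int y = 0;

    for (int i = 0; i < ITER; i++) {
        int xs = x >> i, ys = y >> i;
        if (z >= 0) { x -= ys; y += xs; z -= atan_tab[i]; }
        else        { x += ys; y -= xs; z += atan_tab[i]; }
    }
    *cos_out = x;
    *sin_out = y;
}

int main(void)
{
    int c, s;
    int angle = (int)(0.5 * Q);      /* 0.5 rad in Q16 */

    cordic_sincos(angle, &c, &s);
    printf("cos(0.5) ~ %.5f, sin(0.5) ~ %.5f\n", c / Q, s / Q);
    return 0;
}
```

An unrolled hardware version simply instantiates one such iteration per pipeline stage, trading area for latency, which is exactly the trade-off the three implementations above explore.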
Graduate
APA, Harvard, Vancouver, ISO, and other styles
30

"Scalable Register File Architecture for CGRA Accelerators." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.40738.

Full text
Abstract:
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable of accelerating even non-parallel loops and loops with low trip-counts. One challenge in compiling for CGRAs is to manage both recurring and nonrecurring variables in the register file (RF) of the CGRA. Although prior works have managed recurring variables via a rotating RF, they access the nonrecurring variables through either a global RF or a constant memory. The former does not scale well, and the latter degrades the mapping quality. This work proposes a hardware-software codesign approach in order to manage all the variables in a local nonrotating RF. The hardware provides a modulo-addition-based indexing mechanism to enable correct addressing of recurring variables in a nonrotating RF. The compiler determines the number of registers required for each recurring variable and configures the boundary between the registers used for recurring and nonrecurring variables. The compiler also pre-loads the read-only variables and constants into the local registers in the prologue of the schedule. Synthesis and place-and-route results of the previous and the proposed RF designs show that the proposed solution achieves 17% better cycle time. Experiments mapping several important and performance-critical loops collected from MiBench show that the proposed approach improves performance (through better mapping) by 18%, compared to using constant memory.
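A minimal sketch of the indexing idea, with hypothetical names and layout (not the RTL evaluated in the thesis): registers below a compiler-configured boundary hold recurring variables addressed with modulo addition over the window allotted to each variable, while registers above the boundary hold nonrecurring values pre-loaded in the prologue:

```python
# Hypothetical layout, for illustration only (not the thesis RTL).
class LocalRegisterFile:
    def __init__(self, size, boundary):
        self.regs = [0] * size
        self.boundary = boundary          # configured by the compiler

    def recurring_index(self, base, iteration, window):
        # 'window' = number of registers the compiler reserved for this
        # recurring variable; modulo addition keeps successive iterations'
        # values inside that window without a rotating register file.
        return base + (iteration % window)

    def write_recurring(self, base, iteration, window, value):
        self.regs[self.recurring_index(base, iteration, window)] = value

    def read_nonrecurring(self, index):
        assert index >= self.boundary, "nonrecurring region starts at the boundary"
        return self.regs[index]

rf = LocalRegisterFile(size=8, boundary=4)
rf.regs[5] = 42                            # constant pre-loaded in the prologue
for it in range(6):
    rf.write_recurring(base=0, iteration=it, window=3, value=it * 10)
print(rf.regs[:4], rf.read_nonrecurring(5))
```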
Dissertation/Thesis
Masters Thesis Computer Science 2016
APA, Harvard, Vancouver, ISO, and other styles
31

Shehan, Basher [Verfasser]. "Dynamic coarse grained reconfigurable architectures / presented by Basher Shehan." 2010. http://d-nb.info/1010124390/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Varadarajan, Keshavan. "A Coarse Grained Reconfigurable Architecture Framework Supporting Macro-Dataflow Execution." Thesis, 2012. http://etd.iisc.ernet.in/handle/2005/2302.

Full text
Abstract:
A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform which constitutes an interconnection of coarse-grained computation units (viz. Function Units (FUs), Arithmetic Logic Units (ALUs)). These units communicate directly using send-receive-like primitives, as opposed to the shared-memory-based communication used in multi-core processors. CGRAs are a well-researched topic and the design space of a CGRA is quite large. The design space can be represented as a 7-tuple (C, N, T, P, O, M, H) where the terms have the following meanings: C - choice of computation unit, N - choice of interconnection network, T - choice of number of context frames (single or multiple), P - presence of partial reconfiguration, O - choice of orchestration mechanism, M - design of memory hierarchy, and H - host-CGRA coupling. In this thesis, we develop an architectural framework for a macro-dataflow based CGRA where we make the following choices for each of these parameters: C - ALU, N - Network-on-Chip (NoC), T - multiple contexts, P - support for partial reconfiguration, O - macro-dataflow based orchestration, M - data memory banks placed at the periphery of the reconfigurable fabric (the reconfigurable fabric is the name given to the interconnection of computation units), H - loose coupling between the host processor and the CGRA, enabling our CGRA to execute an application independent of the host processor's intervention. The motivations for developing such a CGRA are to execute applications efficiently through reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP). We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease exploitation of TLP and DLP. Macro-dataflow serves as a lightweight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration unit, namely a hardware-controlled orchestration unit and a compiler-controlled orchestration unit. We employ a NoC as it helps reduce the reconfiguration overhead. Another motivation is to permit customization of the CGRA for a particular domain through the use of domain-specific custom Intellectual Property (IP) blocks; this aids in improving application performance and makes the CGRA energy efficient. A further goal is to develop a CGRA which is completely programmable and accepts any program written using the C89 standard. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically programmed through an application by a compiler. In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed, and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant under these changes, hence it is referred to as a framework. We now elucidate the compilation and execution flow for this CGRA framework. An application written in the C language is compiled and transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution.
Each ALU (in the computation unit) is capable of waiting for the availability of the input data prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on the fabric, thus eliminating the need to launch the instructions. The CGRA framework has been implemented using Bluespec System Verilog. We evaluate the performance of two of these CGRA instances: one for cryptographic applications and another for linear algebra kernels. We also run other general purpose integer and floating point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations, viz. pipeline optimizations (i.e. changing the value of T), different forms of macro-dataflow orchestration such as a hardware-controlled orchestration unit and a compiler-controlled orchestration unit, different execution modes including resident loops, pipeline parallelism, changes to the router, etc. As a result of these optimizations we observe a 2.5x improvement in performance compared to the base version. The reconfiguration overhead was hidden by overlapping the launching of instructions with execution. The perceived reconfiguration overhead is reduced drastically to about 9-11 cycles for each HyperOp, invariant of the size of the HyperOp. This can be mainly attributed to the data-dependent instruction execution and the use of the NoC. The overhead of the macro-dataflow execution unit was reduced to a minimum with the compiler-controlled orchestration unit. To benchmark the performance of these CGRA instances, we compare them with an Intel Core 2 Quad running at 2.66GHz. On the cryptographic CGRA instance, running at 700MHz, we observe one to two orders of magnitude improvement in performance for cryptographic applications, and up to one order of magnitude performance degradation for the linear algebra CGRA instance. This relatively poor performance of the linear algebra kernels can be attributed to the inability to exploit ILP across computation units interconnected by the NoC, the long latency in accessing data memory placed at the periphery of the reconfigurable fabric, and the unavailability of pipelined floating point units (which are critical to the performance of linear algebra kernels). The superior performance of the cryptographic kernels can be attributed to a higher computation-to-load-instruction ratio, a careful choice of custom IP blocks, the ability to construct large HyperOps which allows a greater portion of the communication to be performed directly (as opposed to communication through a register file in a general purpose processor), and the use of the resident loops execution mode. The power consumption of a computation unit employed in the cryptography CGRA instance, along with its router, is about 76mW, as estimated by Synopsys Design Vision using the Faraday 90nm technology library for an activity factor of 0.5. The power of other instances would depend on the specific instantiation of the domain-specific units. This implies that for a reconfigurable fabric of size 5 x 6 the total power consumption is about 2.3W. The area and power (about 84mW) dissipated by the macro-dataflow orchestration unit, which is common to both instances, are comparable to those of a single computation unit, making it an effective and low overhead technique to exploit TLP.
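The firing rule at the heart of the macro-dataflow orchestration described above can be sketched as follows; the class and function names are hypothetical, and the sketch ignores the NoC, instruction transfer and partial reconfiguration, showing only the "launch a HyperOp when all its inputs are available" behaviour:

```python
# Illustrative sketch of a macro-dataflow firing rule (names are hypothetical,
# not the REDEFINE RTL): a HyperOp is launched onto the fabric only when all
# of its input operands have arrived.
from collections import deque

class HyperOp:
    def __init__(self, name, num_inputs, consumers):
        self.name = name
        self.pending = num_inputs          # inputs still missing
        self.consumers = consumers         # names of HyperOps fed by this one

def run(hyperops, initial_tokens):
    ready = deque()
    for name in initial_tokens:            # external inputs arriving
        h = hyperops[name]
        h.pending -= 1
        if h.pending == 0:
            ready.append(h)
    while ready:
        h = ready.popleft()
        print(f"launching {h.name} onto the reconfigurable fabric")
        for c in h.consumers:               # results become consumers' inputs
            hyperops[c].pending -= 1
            if hyperops[c].pending == 0:
                ready.append(hyperops[c])

hops = {
    "A": HyperOp("A", num_inputs=1, consumers=["C"]),
    "B": HyperOp("B", num_inputs=1, consumers=["C"]),
    "C": HyperOp("C", num_inputs=2, consumers=[]),
}
run(hops, initial_tokens=["A", "B"])        # A and B fire, then C
```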
APA, Harvard, Vancouver, ISO, and other styles
33

Jiang, Jun-Bin, and 江俊賓. "A Predicate-Aware Modulo Scheduling for Coarse Grained Reconfigurable Architectures." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/qrf68u.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Industrial IC Design Program, College of Electrical and Computer Engineering
100
To balance efficiency and flexibility, a coarse-grained reconfigurable architecture (CGRA) has been proposed, which exploits the parallelism of a program without compromising its flexibility. However, finding more operation parallelism is a complicated problem for compilation. Modulo scheduling is one of the most widely adopted operation scheduling techniques in recent years; it introduces more parallelism by overlapping the iterations of a loop. Although modulo scheduling parallelizes many operations, we observe that hardware resource utilization is still limited by the 37.8% of operations that are conditionally executed. In this research, we propose a predicate-aware modulo scheduling which may map two disjoint operations onto the same processing element to reduce the hardware resource requirements; the corresponding architecture is also proposed. In addition, a weighted-cost mapping decision heuristic is designed to improve execution performance on the reconfigurable architecture. Our experimental results indicate that the initiation interval of a loop in the selected benchmarks can be reduced by 12% to 25.2% compared with a related work, and an 18% reduction remains when compared with the related work equipped with more resources.
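The resource-sharing idea behind predicate-aware scheduling can be illustrated with a small sketch (not the thesis' mapper or cost heuristic): two operations guarded by complementary predicates never execute in the same iteration, so a scheduler may pack them into a single processing-element slot:

```python
# Minimal, hypothetical sketch of disjoint-operation packing.
def disjoint(op_a, op_b):
    """Operations are disjoint if their guards are complementary."""
    return op_a["guard"] == ("not", op_b["guard"]) or \
           op_b["guard"] == ("not", op_a["guard"])

ops = [
    {"name": "add1", "guard": "p"},            # executes when p is true
    {"name": "mul1", "guard": ("not", "p")},   # executes when p is false
    {"name": "add2", "guard": "q"},
]

# Greedy packing of operations into PE slots; disjoint pairs share a slot.
slots = []
for op in ops:
    for slot in slots:
        if all(disjoint(op, other) for other in slot):
            slot.append(op)
            break
    else:
        slots.append([op])

print(f"{len(ops)} operations packed into {len(slots)} PE slots")
for i, slot in enumerate(slots):
    print(i, [op["name"] for op in slot])
```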
APA, Harvard, Vancouver, ISO, and other styles
34

Alle, Mythri. "Compiling For Coarse-Grained Reconfigurable Architectures Based On Dataflow Execution Paradigm." Thesis, 2012. http://etd.iisc.ernet.in/handle/2005/2453.

Full text
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) can be employed for accelerating computational workloads that demand both flexibility and performance. CGRAs comprise a set of computation elements interconnected using a network, and this interconnection of computation elements is referred to as a reconfigurable fabric. The size of the application that can be accommodated on the reconfigurable fabric is limited by the size of the instruction buffers associated with each compute element. When an application cannot be accommodated entirely, the application is partitioned such that each partition can be executed on the reconfigurable fabric. These partitions are scheduled by an orchestrator. The orchestrator employs the dynamic dataflow execution paradigm, which has inherent support for synchronization and helps in exploiting the parallelism that exists across application partitions. In this thesis, we present a compiler that targets such CGRAs. The compiler presented in this thesis is capable of accepting applications specified in the C89 standard. To enable architectural design space exploration, the compiler is designed such that it can be customized for several instances of CGRAs employing the dataflow execution paradigm at the orchestrator. This is achieved by specifying the appropriate configuration parameters to the compiler. The focus of this thesis is to provide efficient support for various kinds of parallelism while ensuring correctness. The compiler is designed to support fine-grained task level parallelism that exists across iterations of loops and function calls. Additionally, the compiler can also support pipeline parallelism, where a loop is split into multiple stages that execute in a pipelined manner. The prototype compiler, which targets multiple instances of a CGRA, is demonstrated in this thesis. We used this compiler to target multiple variants of CGRAs employing the dataflow execution paradigm. We varied the reconfigurable fabric, the orchestration mechanism employed, and the size of the instruction buffers. We also chose applications from two different domains, viz. cryptography and linear algebra. The execution time of the CGRA (the best among all instances) is compared against an Intel Quad core processor. Cryptography applications show a performance improvement ranging from more than one order of magnitude to close to two orders of magnitude. These applications have large amounts of ILP and our compiler could successfully expose the ILP available in these applications. Further, the domain customization also played an important role in achieving good performance. We employed two custom functional units for accelerating cryptography applications and the compiler could use them efficiently. In linear algebra kernels we observe multiple iterations of the loop executing in parallel, effectively exploiting loop-level parallelism at runtime. In spite of this, we notice close to an order of magnitude performance degradation. The reason for this degradation can be attributed to the use of non-pipelined floating point units and the delays involved in accessing memory. Pipeline parallelism was demonstrated using this compiler for FFT and QR factorization. Thus, the compiler is capable of efficiently supporting different kinds of parallelism and can support the complete C89 standard. Further, the compiler can also support different instances of CGRAs employing the dataflow execution paradigm.
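A toy sketch of the buffer-capacity constraint mentioned above, under the simplifying assumptions that instructions are already topologically ordered and that data dependences across cut points are handled by the orchestrator (the function names are illustrative, not the compiler's):

```python
# Illustrative only: capacity-driven partitioning of an application whose
# instruction count exceeds the per-compute-element instruction buffer.
def partition(instructions, buffer_capacity):
    partitions, current = [], []
    for instr in instructions:              # assumed topologically sorted
        if len(current) == buffer_capacity:
            partitions.append(current)
            current = []
        current.append(instr)
    if current:
        partitions.append(current)
    return partitions

program = [f"i{n}" for n in range(10)]
for idx, part in enumerate(partition(program, buffer_capacity=4)):
    print(f"partition {idx}: {part}")        # 4 + 4 + 2 instructions
```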
APA, Harvard, Vancouver, ISO, and other styles
35

Γεωργιόπουλος, Σταύρος. "Μεθοδολογίες μεταγλώττισης σε επαναπροσδιοριζόμενα συστήματα αρχιτεκτονικών πίνακα [Compilation methodologies for reconfigurable array architecture systems]." Thesis, 2011. http://hdl.handle.net/10889/5806.

Full text
Abstract:
This PhD thesis focuses on the development of efficient compilation (mapping) techniques for coarse-grained reconfigurable array architectures. Data-intensive applications were used to evaluate the proposed methodologies. The aim is to optimize application execution with respect to characteristics of reconfigurable systems such as performance, instructions per cycle, integration area and processing resource utilization. This is achieved by introducing novel mapping techniques and by finding optimal architectures. In the first part of the thesis, research, development and automation of mapping techniques targeting coarse-grained reconfigurable arrays was carried out. The main feature of these architectures is the presence of a large number of processing elements working in parallel, thus speeding up the execution of applications that exhibit operation-level parallelism. In embedded systems these architectures operate as coprocessors. Research on reconfigurable array architectures has gained considerable interest because of their flexibility, scalability and performance, particularly for data-intensive applications. Nevertheless, compiling applications onto them is characterized by a high degree of complexity. Appropriate tools and special mapping methodologies are needed to exploit the characteristics of these architectures. With this in mind, we proposed a novel retargetable methodology for mapping applications, which has also been automated through a prototype compiler tool targeting a parametric architectural template. The result was finding the best architectures with respect to performance, instructions per cycle and tool execution time for a set of applications. The efficiency of a reconfigurable array architecture in terms of speed and hardware cost is difficult to measure, so few studies have examined the effect of architectural parameters on factors such as integration area and the number of instructions per clock cycle. Moreover, no prior work has examined the impact of multipliers embedded in the processing elements of reconfigurable architectures. Using the existing retargetable mapping methodology and a parametric implementation of the architecture in a hardware description language, we examine the effect of multipliers from both the mapping and the architecture perspective. We also describe an original mapping methodology introduced for efficiently mapping the Fast Fourier Transform (FFT) algorithm onto reconfigurable array architectures. The FFT algorithm is characterized by a large number of operations, primarily multiplications, that slow down the performance of a reconfigurable architecture. Exploiting the internal repetitive structure of the algorithm and using a reconfigurable architecture template of 16 processing elements, we developed a novel mapping technique. Additionally, our technique takes into account the memory hierarchy between main memory and the reconfigurable architecture in order to further accelerate the execution of the FFT algorithm. Using the proposed mapping technique leads to processing element utilization of over 90%, which is at least 37% higher than the best value reported in the literature.
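The repetitive structure exploited by the proposed FFT mapping is the radix-2 butterfly; the sketch below shows that kernel and a small recursive FFT built from it, purely as an illustration of the operation pattern and not of the 16-processing-element mapping itself:

```python
import cmath

def butterfly(a, b, twiddle):
    """Radix-2 butterfly: one complex multiply plus two additions, the kernel
    that a CGRA mapping replicates across processing elements."""
    t = b * twiddle
    return a + t, a - t

def fft(x):
    """Recursive radix-2 FFT built entirely from butterflies (len(x) must be
    a power of two). Illustrative of the operation pattern only."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out

# Magnitudes of an 8-point FFT of a rectangular pulse.
print([round(abs(v), 3) for v in fft([1, 1, 1, 1, 0, 0, 0, 0])])
```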
APA, Harvard, Vancouver, ISO, and other styles
36

"Register File Organization for Coarse-Grained Reconfigurable Architectures: Compiler-Microarchitecture Perspective." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.25844.

Full text
Abstract:
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising fabric for improving the performance and power-efficiency of computing devices. CGRAs are composed of components that are well-optimized to execute loops, and the rotating register file is an example of such a component present in CGRAs. Due to the rotating nature of register indexes in a rotating register file, it is very challenging, if at all possible, to hold and properly index memory addresses (pointers) and static values. In this thesis, different structures for CGRA register files are investigated. Those structures are experimentally compared in terms of performance of mapped applications, design frequency, and area. It is shown that a register file that can logically be partitioned into rotating and non-rotating regions is an excellent choice because it imposes the minimum restriction on the underlying CGRA mapping algorithm while resulting in efficient resource utilization.
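A sketch of the partitioned register file idea, with illustrative naming and layout rather than the evaluated RTL: logical indices below the partition boundary are offset by an iteration counter (rotating region), while indices above it address pointers and static values directly (non-rotating region):

```python
# Hypothetical layout, for illustration only.
class PartitionedRF:
    def __init__(self, size, rotating_size):
        self.regs = [0] * size
        self.rotating_size = rotating_size   # registers 0..rotating_size-1 rotate
        self.rb = 0                           # rotation base, bumped each iteration

    def physical(self, logical):
        if logical < self.rotating_size:                  # rotating region
            return (logical + self.rb) % self.rotating_size
        return logical                                    # non-rotating region

    def next_iteration(self):
        self.rb = (self.rb + 1) % self.rotating_size

rf = PartitionedRF(size=8, rotating_size=4)
rf.regs[6] = 0x1000                    # a pointer kept in the non-rotating region
for it in range(3):
    rf.regs[rf.physical(0)] = it       # logical r0 maps to a fresh physical register
    rf.next_iteration()
print(rf.regs)                         # pointer untouched; r0 values spread over the rotating region
```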
Dissertation/Thesis
Masters Thesis Computer Science 2014
APA, Harvard, Vancouver, ISO, and other styles
37

Biswas, Prasenjit. "Hardware Consolidation Of Systolic Algorithms On A Coarse Grained Runtime Reconfigurable Architecture." Thesis, 2011. http://etd.iisc.ernet.in/handle/2005/2108.

Full text
Abstract:
Application domains such as Bio-informatics, DSP, Structural Biology, Fluid Dynamics, high resolution direction finding, state estimation, adaptive noise cancellation, etc. demand high performance computing solutions for their simulation environments. The core computations of these applications are Numerical Linear Algebra (NLA) kernels. Direct solvers are predominantly required in domains like DSP and estimation algorithms such as the Kalman Filter, where the matrices on which operations need to be performed are either small or medium sized, but dense. Faddeev's algorithm is often used for solving dense linear systems of equations. The Modified Faddeev's Algorithm (MFA) is a general algorithm on which LU decomposition, QR factorization or SVD of matrices can be realized. MFA has the good property of realizing a host of matrix operations by computing Schur complements on four blocked matrices, thereby reducing the overall computation requirements. We use MFA as a representative direct solver in this work. We further discuss the Givens rotation based QR algorithm for decomposition of any matrix, often used to solve the linear least squares problem. Systolic array architectures are widely accepted ASIC solutions for NLA algorithms. But the "can of worms" associated with this traditional solution spawns the need for alternative solutions. While custom hardware solutions in the form of systolic arrays can deliver high performance, their rigid structure makes them neither scalable nor reconfigurable, and hence not commercially viable. We show how a reconfigurable computing platform can serve to contain the "can of worms". REDEFINE, a coarse grained runtime reconfigurable architecture, has been used for systolic actualization of NLA kernels. We elaborate upon streaming NLA-specific enhancements to REDEFINE in order to meet expected performance goals. We explore the need for an algorithm-aware custom compilation framework. We bring about a proposition to realize Faddeev's Algorithm on REDEFINE. We show that REDEFINE performs several times faster than traditional GPPs. Further, we direct our interest to QR decomposition as the next NLA kernel as it ensures better stability than LU and other decompositions. We use QR decomposition as a case study to explore the design space of the proposed solution on REDEFINE. We also investigate the architectural details of the Custom Functional Units (CFUs) for these NLA kernels. We determine the right size of the sub-array in accordance with the optimal pipeline depth of the core execution units and the number of such units to be used per sub-array. The framework used to realize QR decomposition can be generalized for the realization of other algorithms dealing with decompositions like LU, Faddeev's Algorithm, Gauss-Jordan, etc. with different CFU definitions.
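The Schur-complement property of Faddeev's algorithm mentioned above can be demonstrated with a few lines of NumPy (a numerical illustration only, not the systolic realization on REDEFINE; pivoting is omitted for brevity, so a well-conditioned A with non-zero pivots is assumed):

```python
import numpy as np

def faddeev(A, B, C, D):
    """Arrange the blocks as [[A, B], [-C, D]] and forward-eliminate the
    first n columns; the lower-right block then holds D + C @ inv(A) @ B,
    which covers products, inverses and linear-system solves by choosing
    A, B, C, D appropriately. No pivoting (illustrative only)."""
    n = A.shape[0]
    M = np.block([[A, B], [-C, D]]).astype(float)
    for k in range(n):                       # eliminate the first n columns
        for i in range(k + 1, M.shape[0]):
            factor = M[i, k] / M[k, k]
            M[i, :] -= factor * M[k, :]
    return M[n:, n:]                         # the Schur complement

A = np.array([[4.0, 1.0], [2.0, 3.0]])
B = np.eye(2)
C = np.array([[1.0, 2.0], [3.0, 4.0]])
D = np.zeros((2, 2))
print(np.allclose(faddeev(A, B, C, D), C @ np.linalg.inv(A) @ B))   # True
```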
APA, Harvard, Vancouver, ISO, and other styles
38

Jung, Lukas Johannes. "Optimization of the Memory Subsystem of a Coarse Grained Reconfigurable Hardware Accelerator." Phd thesis, 2019. https://tuprints.ulb.tu-darmstadt.de/8674/1/2019-05-13_Jung_Lukas_Johannes.pdf.

Full text
Abstract:
Fast and energy efficient processing of data has always been a key requirement in processor design. The latest developments in technology emphasize these requirements even further. The widespread usage of mobile devices increases the demand for energy efficient solutions. Many new applications, like advanced driver assistance systems, focus more and more on machine learning algorithms and have to process large data sets in hard real time. Up to the 1990s the increase in processor performance was mainly achieved by new and better manufacturing technologies for processors. That way, processors could operate at higher clock frequencies while the processor microarchitecture stayed largely the same. At the beginning of the 21st century this development stopped. New manufacturing technologies made it possible to integrate more processor cores onto one chip, but almost no further improvements were achieved in terms of clock frequencies. This required new approaches in both processor microarchitecture and software design. Instead of improving the performance of a single processor, the problem at hand has to be divided into several subtasks that can be executed in parallel on different processing elements, which speeds up the application. One common approach is to use multi-core processors or GPUs (Graphics Processing Units) in which each processing element calculates one subtask of the problem. This approach requires new programming techniques, and legacy software has to be reformulated. Another approach is the usage of hardware accelerators which are coupled to a general purpose processor. For each problem a dedicated circuit is designed which can solve the problem fast and efficiently. The actual computation is then executed on the accelerator and not on the general purpose processor. The disadvantage of this approach is that a new circuit has to be designed for each problem. This results in an increased design effort, and typically the circuit cannot be adapted once it is deployed. This work covers reconfigurable hardware accelerators. They can be reconfigured during runtime so that the same hardware is used to accelerate different problems. During runtime, time consuming code fragments can be identified, and the processor itself starts a process that creates a configuration for the hardware accelerator. This configuration can then be loaded, and the code will be executed on the accelerator faster and more efficiently. A coarse grained reconfigurable architecture was chosen because creating a configuration for it is much less complex than creating a configuration for a fine grained reconfigurable architecture like an FPGA (Field Programmable Gate Array). Additionally, the smaller overhead for the reconfigurability results in higher clock frequencies. One advantage of this approach is that programmers don't need any knowledge about the underlying hardware, because the acceleration is done automatically during runtime. It is also possible to accelerate legacy code without user interaction (even when no source code is available anymore). One challenge that is relevant for all approaches is the efficient and fast data exchange between processing elements and main memory. Therefore, this work concentrates on the optimization of the memory interface between the coarse grained reconfigurable hardware accelerator and the main memory. To achieve this, a simulator for a Java processor coupled with a coarse grained reconfigurable hardware accelerator was developed during this work.
Several strategies were developed to improve the performance of the memory interface. The solutions range from different hardware designs to software solutions that try to optimize the usage of the memory interface during the creation of the configuration of the accelerator. The simulator was used to search the design space for the best implementation. With this optimization of the memory interface a performance improvement of 22.6% was achieved. Apart from that, a first prototype of this kind of accelerator was designed and implemented on an FPGA to show the correct functionality of the whole approach and the simulator.
APA, Harvard, Vancouver, ISO, and other styles
39

Obeid, Abdulfattah Mohammad. "Architectural Synthesis of a Coarse-Grained Run-Time-Reconfigurable Accelerator for DSP Applications." Phd thesis, 2006. https://tuprints.ulb.tu-darmstadt.de/668/1/ObeidDissG_Part1v2.pdf.

Full text
Abstract:
Given all its merits and potential, Reconfigurable Computing has attracted a lot of research work. Reconfiguration costs, as well as new Reconfigurable Computing specific challenges, have so far been the main obstacles hindering the attainment of optimal reconfigurable computing solutions. Because of the flexibility offered by Reconfigurable Computing, many new design parameters that were previously unknown now exist. Dynamic reconfiguration, partial reconfiguration, context management and HW/SW issues are among these. Depending on the target set of applications, different design decisions can be made in order to optimize the reconfigurable solution according to the target application constraints. In this thesis the HPad, an efficient coarse-grained dynamically reconfigurable solution targeted at DSP computation, is proposed. The HPad architecture was greatly influenced by reported VLSI architectures of a variety of DSP algorithms. Based on observations of the characteristics of these DSP algorithms and their architectures, the HPad was chosen to be a heterogeneous and dynamically reconfigurable coarse grained solution. The HPad features partial, dynamic, and background reconfiguration capabilities. In addition, the HPad data path architecture is tailored to efficiently realize the studied DSP applications. Through the use of local reconfiguration interface sockets around each processing element, the dynamic reconfiguration problem is partitioned and efficiently solved. The HPad was modeled and synthesized with parameterizable VHDL code written at the RTL level. Parameterizing the code was beneficial since it permitted generation of new designs simply by changing a few constants and recompiling. The model consisted of several thousand lines of code. Mapping and routing of several pipelined architectures of DSP algorithms were examined to demonstrate the suitability and validity of the HPad for the proposed scope of DSP applications.
APA, Harvard, Vancouver, ISO, and other styles
40

Obeid, Abdulfattah Mohammad [Verfasser]. "Architectural synthesis of a coarse-grained run-time-reconfigurable accelerator for DSP applications / Abdulfattah Mohammad Obeid." 2006. http://d-nb.info/979006651/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Merchant, Farhad. "Algorithm-Architecture Co-Design for Dense Linear Algebra Computations." Thesis, 2015. http://etd.iisc.ernet.in/2005/3958.

Full text
Abstract:
Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing application, which is used, for example, in LU and QR factorizations. Unfortunately, modern off-the-shelf microprocessors fall significantly short of achieving the theoretical lower bound in CPI for high performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelism and propose suitable algorithmic and architectural variations to significantly improve the computation efficiency. There are two standard approaches for improving the computation efficiency: first, to perform application-specific architecture customization, and second, to do algorithmic tuning. In the same manner, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism thus identified, we design a custom processing element for improving the CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By performing detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations to report a higher energy-performance-area product. DLA computations are encountered in many engineering and scientific computing applications ranging from Computational Fluid Dynamics (CFD) to the Eigenvalue problem. Traditionally, these applications are written in highly tuned High Performance Computing (HPC) software packages like the Linear Algebra Package (LAPACK) and/or the Scalable Linear Algebra Package (ScaLAPACK). The basic building block for these packages is the Basic Linear Algebra Subprograms (BLAS). Algorithms pertaining to LAPACK/ScaLAPACK are written in terms of BLAS to achieve high throughput. Despite extensive intellectual efforts in the development and tuning of these packages, there still exists scope for further tuning. In this thesis, we revisit the most prominent and widely used compute-bound algorithms, like GMM, for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate the sequential performance of the algorithms in BLAS and LAPACK and then focus on the parallel realization of these algorithms. The major contributions in algorithmic tuning in this thesis are as follows. We present a graph based analysis of General Matrix Multiplication (GMM) and discuss the different types of parallelism available in GMM. We present an analysis of Givens Rotation based QR factorization, where we improve GR and derive Column-wise GR (CGR) that can annihilate multiple elements of a column of a matrix simultaneously; we show that the multiplications in CGR are fewer than in GR. We generalize CGR further and derive Generalized GR (GGR) that can annihilate multiple elements of the columns of a matrix simultaneously; we show that the parallelism exhibited by GGR is much higher than that of GR and the Householder Transform (HT). We extend the generalizations to Square-root Free GR (also known as Fast Givens Rotation) and Square-root and Division Free GR (SDFG) and derive Column-wise Fast Givens and Column-wise SDFG.
We also extend the generalization to complex matrices and derive the Complex Column-wise Givens Rotation. Coarse-grained Reconfigurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also exhibit support for domain customizations. REDEFINE is an array of Tiles where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while the Compute Elements in REDEFINE can be domain-customized to accelerate applications pertaining to the domain of interest. In this thesis, we consider the REDEFINE base architecture as a starting point and design a Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach the lower bound of the CPI. For the parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance compared to earlier customized architectures for DLA. The major contributions in architecture in this thesis are as follows. We present the design of a PE for acceleration of GMM, which is a Level-3 BLAS operation. We methodically enhance the PE with different features for improvement in the performance of GMM. For efficient realization of the Linear Algebra Package (LAPACK), we use a PE that can efficiently execute GMM and show better performance. For further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in LU and QR factorizations and realize them on a reconfigurable data-path, resulting in 25-30% lower run-time.
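For reference, the baseline Givens-rotation QR factorization that CGR and GGR generalize annihilates one sub-diagonal element per plane rotation; the sketch below is illustrative only and does not implement the column-wise variants derived in the thesis:

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def qr_givens(A):
    """QR factorization by annihilating sub-diagonal entries one at a time
    with plane rotations (the baseline GR scheme the thesis generalizes)."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):          # zero R[i, j] against R[i-1, j]
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.eye(m)
            G[[i - 1, i - 1, i, i], [i - 1, i, i - 1, i]] = [c, s, -s, c]
            R = G @ R
            Q = Q @ G.T
    return Q, R

A = np.random.default_rng(0).standard_normal((4, 3))
Q, R = qr_givens(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0.0))   # True True
```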
APA, Harvard, Vancouver, ISO, and other styles