To see the other types of publications on this topic, follow the link: Sparse Accelerator.

Dissertations / Theses on the topic 'Sparse Accelerator'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 31 dissertations / theses for your research on the topic 'Sparse Accelerator.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Syed, Akber. "A Hardware Interpreter for Sparse Matrix LU Factorization." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1024934521.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Jamal, Aygul. "A parallel iterative solver for large sparse linear systems enhanced with randomization and GPU accelerator, and its resilience to soft errors." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS269/document.

Full text
Abstract:
In this PhD thesis, we address three challenges faced by linear algebra solvers in the perspective of future exascale systems: accelerating convergence using innovative techniques at the algorithm level, taking advantage of GPU (Graphics Processing Units) accelerators to enhance the performance of computations on hybrid CPU/GPU systems, and evaluating the impact of errors in the context of an increasing level of parallelism in supercomputers. We are interested in studying methods that enable us to accelerate the convergence and execution time of iterative solvers for large sparse linear systems. The solver specifically considered in this work is the parallel Algebraic Recursive Multilevel Solver (pARMS), a distributed-memory parallel solver based on Krylov subspace methods. First, we integrate a randomization technique referred to as Random Butterfly Transformations (RBT) that has been successfully applied to remove the cost of pivoting in the solution of dense linear systems. Our objective is to apply this method in the ARMS preconditioner to solve the last Schur complement system more efficiently in the application of the recursive multilevel process in pARMS. The experimental results show an improvement of the convergence and the accuracy. Due to memory concerns for some test problems, we also propose to use a sparse variant of RBT followed by a sparse direct solver (SuperLU), resulting in an improvement of the execution time. Then we explain how a non-intrusive approach can be applied to implement GPU computing in the pARMS solver, especially for the local preconditioning phase that represents a significant part of the time to compute the solution. We compare the CPU-only and hybrid CPU/GPU variants of the solver on several test problems coming from physical applications. The performance results of the hybrid CPU/GPU solver using the ARMS preconditioning combined with RBT, or the ILU(0) preconditioning, show a performance gain of up to 30% on the test problems considered in our experiments. Finally, we study the effect of soft fault errors on the convergence of the commonly used flexible GMRES (FGMRES) algorithm, which is used to solve the preconditioned system in pARMS. The test problem in our experiments is an elliptic PDE problem on a regular grid. We consider two types of preconditioners: an incomplete LU factorization with dual threshold (ILUT), and the ARMS preconditioner combined with RBT randomization. We consider two soft fault error modeling approaches where we perturb the matrix-vector multiplication and the application of the preconditioner, and we compare their potential impact on the convergence of the solver.
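To make the randomization idea concrete, here is a minimal NumPy sketch of a depth-1 random butterfly transformation applied to a small dense system; it follows the general RBT construction rather than the pARMS/ARMS implementation, and all names and parameter choices are illustrative.

```python
import numpy as np

def butterfly(n, rng):
    """Random butterfly matrix of even order n:
    B = 1/sqrt(2) * [[R0, R1], [R0, -R1]], with R0, R1 random diagonal."""
    h = n // 2
    r0 = np.diag(np.exp(rng.uniform(-0.05, 0.05, h)))
    r1 = np.diag(np.exp(rng.uniform(-0.05, 0.05, h)))
    return np.block([[r0, r1], [r0, -r1]]) / np.sqrt(2.0)

def rbt_transform(A, b, rng):
    """Depth-1 two-sided transform: solve (U^T A V) y = U^T b, then x = V y.
    The transformed system can typically be factorized without pivoting."""
    n = A.shape[0]
    U, V = butterfly(n, rng), butterfly(n, rng)
    return U.T @ A @ V, U.T @ b, V

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)); b = rng.standard_normal(8)
Ar, br, V = rbt_transform(A, b, rng)
x = V @ np.linalg.solve(Ar, br)   # recover the solution of the original system
print(np.allclose(A @ x, b))      # True
```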
APA, Harvard, Vancouver, ISO, and other styles
3

Pradels, Léo. "Efficient CNN inference acceleration on FPGAs : a pattern pruning-driven approach." Electronic Thesis or Diss., Université de Rennes (2023-....), 2024. http://www.theses.fr/2024URENS087.

Full text
Abstract:
CNN-based deep learning models provide state-of-the-art performance in image and video processing tasks, particularly for image enhancement or classification. However, these models are computationally and memory intensive, making them unsuitable for real-time constraints on embedded FPGA systems. As a result, compressing these CNNs and designing accelerator architectures for inference that integrate compression in a hardware-software co-design approach are essential. While software optimizations like pruning have been proposed, they often lack the structured approach needed for effective accelerator integration. To address these limitations, this thesis focuses on accelerating CNNs on FPGAs while complying with real-time constraints on embedded systems. This is achieved through several key contributions. First, it introduces pattern pruning, which imposes structure on network sparsity, enabling efficient hardware acceleration with minimal accuracy loss due to compression. Second, a scalable accelerator for CNN inference is presented, which adapts its architecture based on input performance criteria, FPGA specifications, and the target CNN model architecture. An efficient method for integrating pattern pruning within the accelerator and a complete flow for CNN acceleration are proposed. Finally, improvements in network compression are explored through Shift&Add quantization, which modifies FPGA computation methods while maintaining baseline network accuracy.
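As a rough illustration of the pattern-pruning idea, the Python sketch below zeroes each 3x3 kernel outside one of a small, fixed set of sparsity patterns; the pattern set, its size, and the magnitude-based selection rule are our own illustrative assumptions, not the thesis's.

```python
import numpy as np

# Four 4-entry patterns over a flattened 3x3 kernel (illustrative choices).
PATTERNS = [np.array(p) for p in ([0, 1, 3, 4], [1, 2, 4, 5],
                                  [3, 4, 6, 7], [4, 5, 7, 8])]

def pattern_prune(kernel):
    """Zero a 3x3 kernel outside the pattern preserving the most magnitude.
    Every pruned kernel then shares one of a few fixed sparsity shapes,
    which is the structure a hardware accelerator can exploit."""
    w = kernel.ravel()
    best = max(PATTERNS, key=lambda idx: np.abs(w[idx]).sum())
    mask = np.zeros(9, dtype=bool)
    mask[best] = True
    return (w * mask).reshape(3, 3), best

k = np.random.default_rng(1).standard_normal((3, 3))
pruned, pattern = pattern_prune(k)
print(pattern, pruned, sep="\n")
```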
APA, Harvard, Vancouver, ISO, and other styles
4

Ramachandran, Shridhar. "Incremental PageRank acceleration using Sparse Matrix-Sparse Vector Multiplication." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462894358.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Fernández, Becerra David. "Multicore acceleration of sparse electromagnetics computations." Thesis, McGill University, 2011. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=104641.

Full text
Abstract:
Multicore processors have become the dominant industry trend to increase computer systems performance, driving electromagnetics (EM) practitioners to redesign their applications using parallel programming paradigms. This is especially true for computations involving complex data structures, such as the sparse matrix computations that often arise in EM simulations with the finite element method (FEM). These computations require pointer manipulation that renders many compiler optimizations and parallel shared-memory frameworks (e.g., OpenMP) ineffective. This work presents new sparse data structures and techniques to efficiently exploit multicore parallelism and short-vector units (the latter of which have not been exploited by state-of-the-art sparse matrix libraries) for recurrent computationally intensive kernels in EM simulations, such as the sparse matrix-vector multiplication (SMVM) and conjugate gradient (CG) algorithms. Speedups of up to 14x are demonstrated for the accelerated SMVM kernel and 5.8x for the CG kernel using the proposed methods over conventional approaches on two different multicore architectures. Finally, a new method to solve the FEM for parallel processing is presented and an optimized implementation is realized on two different generations of NVIDIA GPU (manycore) accelerators, with performance increases of up to 27.53x compared to compiler-optimized CPU results.
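For context, a scalar reference version of the SMVM kernel over the standard CSR format is sketched below; the indirect gather through the column-index array is precisely the pointer-driven access pattern the abstract describes as hostile to compiler optimization. This is a plain textbook illustration, not the thesis's optimized data structure.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """Reference CSR y = A @ x. The x[indices[j]] gather is the irregular,
    data-dependent access that defeats auto-vectorization."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for j in range(indptr[i], indptr[i + 1]):
            acc += data[j] * x[indices[j]]
        y[i] = acc
    return y

# 3x3 example: A = [[4,0,1],[0,3,0],[2,0,5]]
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data    = np.array([4., 1., 3., 2., 5.])
print(spmv_csr(indptr, indices, data, np.ones(3)))  # [5. 3. 7.]
```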
APA, Harvard, Vancouver, ISO, and other styles
6

Grigoras, Paul. "Instance directed tuning for sparse matrix kernels on reconfigurable accelerators." Thesis, Imperial College London, 2018. http://hdl.handle.net/10044/1/62634.

Full text
Abstract:
We present a novel method to optimise sparse matrix kernels for reconfigurable accelerators, through instance directed tuning - the tuning of reconfigurable architectures based on a sparse matrix instance. First, we present two novel reconfigurable architectures for the Conjugate Gradient Method that are optimised based on the problem dimension and sparsity pattern. These architectures provide the context for illustrating the opportunities and challenges of tuning sparse matrix kernels, which guide the design of the proposed method. Second, we introduce CASK, a novel framework for sparse matrix kernels on reconfigurable accelerators. CASK is: (1) instance directed, since it can account for differences in the matrix instances to generate and select adequate architectures; (2) unified, as it can be applied to a broad range of kernels and optimisations; (3) systematic, since it can support optimisations at multiple levels of the encompassed reconfigurable architectures; and (4) automated, since it can operate with minimal user input, encapsulating and simplifying the tuning process. Third, we demonstrate the benefits of the proposed approach by applying it to the Sparse Matrix Vector Multiplication kernel: (1) to tune a novel parametric reconfigurable architecture, resulting in up to 2x energy-efficiency gains compared to optimised GPU and Xeon Phi implementations; (2) to include a novel compression method for nonzero values, resulting in up to a 2.5x compression ratio compared to the Compressed Sparse Row format; and (3) to tune a novel architecture for the block-diagonal sparsity pattern arising in the Finite Element Method, enabling larger problems to be supported with up to 3x speedup compared to an optimised CPU implementation.
APA, Harvard, Vancouver, ISO, and other styles
7

Segura, Salvador Albert. "High-performance and energy-efficient irregular graph processing on GPU architectures." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/671449.

Full text
Abstract:
Graph processing is an established and prominent domain that is the foundation of new emerging applications in areas such as Data Analytics and Machine Learning, empowering applications such as road navigation, social networks and automatic speech recognition. The large amount of data employed in these domains requires high-throughput architectures such as GPGPUs. Although the processing of large graph-based workloads exhibits a high degree of parallelism, memory access patterns tend to be highly irregular, leading to poor efficiency due to memory divergence. In order to ameliorate these issues, GPGPU graph applications perform stream compaction operations which process active nodes/edges so subsequent steps work on a compacted dataset. We propose to offload this task to the Stream Compaction Unit (SCU), a hardware extension tailored to the requirements of these operations, which additionally performs pre-processing by filtering and reordering the elements processed. We show that memory divergence inefficiencies prevail in GPGPU irregular graph-based applications, yet we find that it is possible to relax the strict relationship between thread and processed data to enable new optimizations. As such, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension integrated in the GPU pipeline that reorders and filters the data processed by the threads on irregular accesses, improving memory coalescing. Finally, we leverage the strengths of both previous approaches to achieve synergistic improvements. We do so by proposing the IRU-enhanced SCU (ISCU), which employs the efficient pre-processing mechanisms of the IRU to improve SCU stream compaction efficiency and the NoC throughput limitations due to SCU pre-processing operations. We evaluate the ISCU with state-of-the-art graph-based applications, achieving a 2.2x performance improvement and a 10x energy-efficiency gain.
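A minimal sketch of the stream compaction operation that the SCU offloads is given below, with an exclusive prefix sum assigning output slots the way a parallel implementation would; the function names and interface are illustrative.

```python
import numpy as np

def stream_compact(active, node_ids):
    """Keep only active nodes so subsequent kernels iterate over a dense
    array. An exclusive scan gives each surviving element its output slot,
    mirroring how a GPU (or the SCU) parallelizes the scatter."""
    slots = np.cumsum(active) - active        # exclusive prefix sum
    out = np.empty(int(active.sum()), dtype=node_ids.dtype)
    for i, flag in enumerate(active):
        if flag:
            out[slots[i]] = node_ids[i]
    return out

active = np.array([1, 0, 0, 1, 1, 0, 1])
print(stream_compact(active, np.arange(7)))   # [0 3 4 6]
```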
APA, Harvard, Vancouver, ISO, and other styles
8

Yee, Wai Min. "Cache Design for a Hardware Accelerated Sparse Texture Storage System." Thesis, University of Waterloo, 2004. http://hdl.handle.net/10012/1197.

Full text
Abstract:
Hardware texture mapping is essential for real-time rendering. Unfortunately, memory bandwidth and latency often bound performance in current graphics architectures. Bandwidth consumption can be reduced by compressing the texture map or by using a cache. However, the way a texture map occupies memory and how it is accessed affect the pattern of memory accesses, which in turn affects cache performance. Thus texture compression schemes and cache architectures must be designed in conjunction with each other. We define a sparse texture to be a texture where a substantial percentage of the texture is constant. Sparse textures are of interest as they occur often, and they are used as parts of more general texture compression schemes. We present a hardware-compatible implementation of sparse textures based on B-tree indexing and explore cache designs for it. We demonstrate that it is possible to have the bandwidth consumption and miss rate due to the texture data alone scale with the area of the region of interest. We also show that the additional bandwidth consumption and hideable latency due to the B-tree indices are low. Furthermore, the caches necessary for these textures can be quite small.
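A toy model of a sparse texture, with a Python dict standing in for the B-tree index described in the abstract, could look like this; the tile size and interface are illustrative assumptions.

```python
import numpy as np

TILE = 4  # tile edge length in texels (illustrative)

class SparseTexture:
    """Only non-constant tiles are stored; lookups into missing tiles
    return the constant color. A dict plays the role of the B-tree index
    (same contract: tile key -> tile payload)."""
    def __init__(self, constant):
        self.constant = constant
        self.tiles = {}

    def store_tile(self, tx, ty, texels):
        self.tiles[(tx, ty)] = texels

    def sample(self, u, v):
        tile = self.tiles.get((u // TILE, v // TILE))
        return self.constant if tile is None else tile[v % TILE, u % TILE]

tex = SparseTexture(constant=0.0)
tex.store_tile(1, 0, np.full((TILE, TILE), 0.8))
print(tex.sample(2, 2), tex.sample(5, 1))   # 0.0 (constant region), 0.8 (stored tile)
```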
APA, Harvard, Vancouver, ISO, and other styles
9

Mantell, Rosemary Genevieve. "Accelerated sampling of energy landscapes." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/267990.

Full text
Abstract:
In this project, various computational energy landscape methods were accelerated using graphics processing units (GPUs). Basin-hopping global optimisation was treated using a version of the limited-memory BFGS algorithm adapted for CUDA, in combination with GPU-acceleration of the potential calculation. The Lennard-Jones potential was implemented using CUDA, and an interface to the GPU-accelerated AMBER potential was constructed. These results were then extended to form the basis of a GPU-accelerated version of hybrid eigenvector-following. The doubly-nudged elastic band method was also accelerated using an interface to the potential calculation on GPU. Additionally, a local rigid body framework was adapted for GPU hardware. Tests were performed for eight biomolecules represented using the AMBER potential, ranging in size from 81 to 22,811 atoms, and the effects of minimiser history size and local rigidification on the overall efficiency were analysed. Improvements relative to CPU performance of up to two orders of magnitude were obtained for the largest systems. These methods have been successfully applied to both biological systems and atomic clusters. An existing interface between a code for free energy basin-hopping and the SuiteSparse package for sparse Cholesky factorisation was refined, validated and tested. Tests were performed for both Lennard-Jones clusters and selected biomolecules represented using the AMBER potential. Significant acceleration of the vibrational frequency calculations was achieved, with negligible loss of accuracy, relative to the standard diagonalisation procedure. For the larger systems, exploiting sparsity reduces the computational cost by factors of 10 to 30. The acceleration of these computational energy landscape methods opens up the possibility of investigating much larger and more complex systems than previously accessible. A wide array of new applications are now computationally feasible.
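As a loose illustration of why exploiting sparsity pays off here, the sketch below computes a few low-lying modes of a sparse model Hessian with SciPy's shift-invert Lanczos solver, whose internal sparse factorization plays a role analogous to the SuiteSparse Cholesky step mentioned above; the model system and the solver choice are ours, not the thesis's pipeline.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Sparse stand-in Hessian: a 1-D harmonic chain (tridiagonal, mostly zeros).
n = 2000
H = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsc()

# Lowest modes via shift-invert Lanczos: with sigma=0 the solver factorizes
# the sparse matrix once, which is where sparsity cuts the cost relative to
# dense diagonalisation.
vals, vecs = eigsh(H, k=4, sigma=0, which='LM')
print(np.sort(np.sqrt(np.abs(vals))))   # mode frequencies, arbitrary units
```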
APA, Harvard, Vancouver, ISO, and other styles
10

Chen, Dong. "Acceleration of the spatial selective excitation of MRI via sparse approximation." kostenfrei, 2009. https://mediatum2.ub.tum.de/node?id=956913.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Samarawickrama, Mahendra. "Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field." Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/17302.

Full text
Abstract:
Plane-wave decomposition by sparse recovery is a reliable and accurate technique that can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate plane-wave decomposition by sparse recovery. The method consists of two main algorithms: spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully utilized for sparse recovery. On the other hand, implementing the SFT on an FPGA helps to integrate the microphones flexibly and improves the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables quick design of SFT architectures on FPGAs. The model considers the number of microphones, the number of SFT channels, and the cost of the FPGA, and produces a resource-optimized, cost-effective FPGA architecture as its output. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms. We introduce novel sparse-recovery techniques which use non-uniform dictionaries to improve the performance of sparse recovery on a parallel architecture.
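As an illustration of the sparse-recovery stage, here is a toy orthogonal matching pursuit over a plane-wave-like dictionary; the dictionary construction and all parameters are illustrative assumptions, and the thesis's actual solvers and their parallel implementations differ.

```python
import numpy as np

def omp(D, b, k):
    """Orthogonal Matching Pursuit: greedily pick the atom most correlated
    with the residual, then re-fit the selected atoms by least squares."""
    residual, support = b.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.conj().T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], b, rcond=None)
        residual = b - D[:, support] @ coef
    x = np.zeros(D.shape[1], dtype=complex)
    x[support] = coef
    return x

# Toy "plane-wave" dictionary: complex exponentials at 64 candidate angles.
m, n = 16, 64
angles = np.linspace(0, np.pi, n, endpoint=False)
D = np.exp(1j * np.outer(np.arange(m), np.cos(angles)))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(n, dtype=complex); x_true[[7, 40]] = [1.0, 0.5]
x_hat = omp(D, D @ x_true, k=2)
print(np.flatnonzero(np.abs(x_hat) > 1e-6))   # recovered atoms: [ 7 40]
```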
APA, Harvard, Vancouver, ISO, and other styles
12

Thürck, Daniel [Verfasser], Kristian [Akademischer Betreuer] Kersting, Matthias [Akademischer Betreuer] Bollhöfer, and Michael [Akademischer Betreuer] Goesele. "Irregularity mitigation and portability abstractions for accelerated sparse matrix factorization / Daniel Thürck ; Kristian Kersting, Matthias Bollhöfer, Michael Goesele." Darmstadt : Universitäts- und Landesbibliothek, 2021. http://d-nb.info/1232914843/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Karkouri, Jabrane. "Exploiting sparse spectrum to accelerate spiral magnetic resonance spectroscopic imaging : method, simulation and applications to the functional exploration of skeletal muscle." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1295.

Full text
Abstract:
Quantifying energetic muscular metabolism and mitochondrial capacity is of crucial interest to reveal muscular disorders, metabolic diseases or cardiovascular diseases such as mitochondrial myopathy, diabetes or peripheral artery disease. 31P spectroscopy is a non-invasive way to monitor energetic metabolism and the dynamic concentrations of phosphorylated metabolites during exercise or during recovery, and it provides information on mitochondrial and oxidative capacity. The assessment of energetic metabolism via 31P spectroscopy can be done with non-localized spectroscopy, single-voxel spectroscopy, or Magnetic Resonance Spectroscopic Imaging (MRSI). In clinical practice, mostly non-localized 31P spectroscopy is performed, which prevents metabolic information from different individual muscles from being measured; instead, an average signal from the whole muscle is collected at once by the surface coil used for the experiment. Localized 31P spectroscopy would give access to spatially resolved information, which motivates the development of new sequences integrating the most advanced technical developments. The Chemical Shift Imaging (CSI) sequences available on clinical systems have very long acquisition times that limit their clinical use to static acquisitions, while what is of interest is essentially the capacity to measure 31P dynamically during an exercise protocol. The methodological developments on MRSI realized in this thesis aimed precisely at reducing the acquisition time, in view of clinical applications. A fast MRSI acquisition method has thus been developed, involving non-Cartesian k-space sampling (spiral sampling) coupled with a smart under-sampling of the temporal dimension, exploiting the a priori known sparsity of the spectral support and a least-squares estimation for signal reconstruction. This method has been validated using simulations, implemented on an MR scanner, optimized, and then tested in vivo on the calf muscle for 1H and 31P MRSI applications. Dynamic 31P applications were also performed at 3T, and the under-sampled spiral CSI sequence was shown to adequately reveal the expected dynamic changes in PCr. Quantification of the signal further enabled us to assess mitochondrial capacity, with a dynamic temporal resolution twice as high as in the fully sampled spiral CSI case, and a temporal resolution similar to the conventionally used non-localized MRS sequence. These developments are of crucial interest for a spatially resolved assessment of mitochondrial capacity within different muscles, i.e. to point out individual muscle alterations related to specific damage or differences in energy consumption between muscles during exercise. Sequence improvements on 1D localized 31P spectroscopy were also integrated into an ongoing clinical protocol, in order, in the long term, to apply the sequence developments carried out during this thesis in a clinical context. First tested on healthy volunteers for reproducibility, the protocol involves patients who suffer from lower-leg arteriopathy. The objective was to assess the mitochondrial capacity of those patients before and after revascularization of the damaged artery. Results showed a significant improvement in mitochondrial capacity after revascularization.
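The reconstruction idea, recovering a spectrum from under-sampled time points by least squares restricted to an a priori known sparse spectral support, can be sketched in a few lines of NumPy; the grid size, support, and sampling scheme below are illustrative stand-ins, not the actual sequence parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256                               # spectral grid size (illustrative)
support = np.array([40, 41, 90])      # a priori known metabolite peak bins
F = np.exp(2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)

x = np.zeros(N, dtype=complex)
x[support] = [1.0, 0.6, 0.3]          # "true" spectrum, sparse on the support
signal = F @ x                        # fully sampled time-domain signal

keep = rng.choice(N, size=N // 4, replace=False)   # 4x temporal under-sampling
A = F[keep][:, support]               # keep only the columns on the support
x_hat, *_ = np.linalg.lstsq(A, signal[keep], rcond=None)
print(np.round(np.abs(x_hat), 3))     # [1.  0.6 0.3]: recovered from 25% of samples
```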
APA, Harvard, Vancouver, ISO, and other styles
14

Kulunchakov, Andrei. "Optimisation stochastique pour l'apprentissage machine à grande échelle : réduction de la variance et accélération." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALM057.

Full text
Abstract:
A goal of this thesis is to explore several topics in optimization for high-dimensional stochastic problems. The first task is related to various incremental approaches, which rely on exact gradient information, such as SVRG, SAGA, MISO, SDCA. While the minimization of large limit sums of functions was thoroughly analyzed, we suggest in Chapter 2 a new technique, which allows to consider all these methods in a generic fashion and demonstrate their robustness to possible stochastic perturbations in the gradient information.Our technique is based on extending the concept of estimate sequence introduced originally by Yu. Nesterov in order to accelerate deterministic algorithms.Using the finite-sum structure of the problems, we are able to modify the aforementioned algorithms to take into account stochastic perturbations. At the same time, the framework allows to derive naturally new algorithms with the same guarantees as existing incremental methods. Finally, we propose a new accelerated stochastic gradient descent algorithm and a new accelerated SVRG algorithm that is robust to stochastic noise. This acceleration essentially performs the typical deterministic acceleration in the sense of Nesterov, while preserving the optimal variance convergence.Next, we address the problem of generic acceleration in stochastic optimization. For this task, we generalize in Chapter 3 the multi-stage approach called Catalyst, which was originally aimed to accelerate deterministic methods. In order to apply it to stochastic problems, we improve its flexibility on the choice of surrogate functions minimized at each stage. Finally, given an optimization method with mild convergence guarantees for strongly convex problems, our developed multi-stage procedure, accelerates convergence to a noise-dominated region, and then achieves the optimal (up to a logarithmic factor) worst-case convergence depending on the noise variance of the gradients. Thus, we successfully address the acceleration of various stochastic methods, including the variance-reduced approaches considered and generalized in Chapter 2. Again, the developed framework bears similarities with the acceleration performed by Yu. Nesterov using the estimate sequences. In this sense, we try to fill the gap between deterministic and stochastic optimization in terms of Nesterov's acceleration. A side contribution of this chapter is a generic analysis that can handle inexact proximal operators, providing new insights about the robustness of stochastic algorithms when the proximal operator cannot be exactly computed.In Chapter 4, we study properties of non-Euclidean stochastic algorithms applied to the problem of sparse signal recovery. A sparse structure significantly reduces the effects of noise in gradient observations. We propose a new stochastic algorithm, called SMD-SR, allowing to make better use of this structure. This method is a multi-step procedure which uses the stochastic mirror descent algorithm as a building block over its stages. Essentially, SMD-SR has two phases of convergence with the linear bias convergence during the preliminary phase and the optimal asymptotic rate during the asymptotic phase.Comparing to the most effective existing solution to the sparse stochastic optimization problems, we offer an improvement in several aspects. 
First, we establish the linear bias convergence (similar to the one of the deterministic gradient descent algorithm, when the full gradient observation is available), while showing the optimal robustness to noise. Second, we achieve this rate for a large class of noise models, including sub-Gaussian, Rademacher, multivariate Student distributions and scale mixtures. Finally, these results are obtained under the optimal condition on the level of sparsity which can approach the total number of iterations of the algorithm (up to a logarithmic factor)
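For readers unfamiliar with the variance-reduced methods discussed above, a minimal SVRG sketch is given below; the test problem, step size, and epoch count are illustrative, and the thesis's analysis covers perturbed and accelerated variants well beyond this.

```python
import numpy as np

def svrg(grad_i, x0, n, lr, epochs, rng=None):
    """SVRG: each epoch computes one full ("anchor") gradient, then takes
    cheap stochastic steps whose gradients are variance-corrected by it."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(epochs):
        anchor = x.copy()
        full_grad = sum(grad_i(anchor, i) for i in range(n)) / n
        for _ in range(n):
            i = rng.integers(n)
            x -= lr * (grad_i(x, i) - grad_i(anchor, i) + full_grad)
    return x

# Least squares (1/n) * sum_i (a_i . x - b_i)^2 as a sanity-check problem.
rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5)); x_star = rng.standard_normal(5); b = A @ x_star
gi = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]
x_hat = svrg(gi, np.zeros(5), n=100, lr=0.01, epochs=30)
print(np.linalg.norm(x_hat - x_star))   # small: iterates approach the solution
```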
APA, Harvard, Vancouver, ISO, and other styles
15

Pham, Duc-An, and 范德安. "Analysis and Optimization Methodology for Sparse CNN Accelerator." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/qgub7n.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Wang, Bang-Chyuan, and 王邦全. "A Sparse Optimization Design for Pointwise Convolutions on the CNN Accelerator." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/xzx4ds.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Institute of Computer Science and Engineering
Academic year 106 (2017–2018)
In the past, IoT devices only processed simple computations or collected sensor data. With the development of Artificial Intelligence (AI), IoT devices now need the computing power to run AI algorithms such as image recognition and speech recognition, so machine learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) must be supported on these devices, even though their computing ability and power budgets are restricted. For such complex applications, there are Application Specific Integrated Circuit (ASIC) designs that accelerate the computation, i.e. accelerators. Our analysis shows that, compared with memory access time, most of the time in CNN computation is consumed by convolutional computations. On the other hand, analysis of CNN models shows that their weights contain many zeros, which means the accelerator performs many redundant zero-valued multiplications at runtime, costing energy. Therefore, we propose a sparse optimization mechanism based on a real accelerator to reduce this overhead and improve performance. In addition, we implement a performance-evaluation tool that combines the accelerator's behavior and hardware parameters with a deep learning compiler, NNVM.
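A minimal sketch of the kind of zero-skipping the abstract describes, applied here to a pointwise (1x1) convolution, is shown below; the data layout and API are our own illustrative choices, not the accelerator's actual mechanism.

```python
import numpy as np

def pointwise_conv_sparse(x, W):
    """1x1 convolution that enumerates only nonzero weights.
    x: (C_in, H, W) feature maps; W: (C_out, C_in) pointwise kernel.
    Zero weights contribute nothing, so their multiplications (and the
    fetch of the matching input channel) are skipped entirely."""
    y = np.zeros((W.shape[0],) + x.shape[1:])
    for o, i in zip(*np.nonzero(W)):   # iterate the sparse weight list only
        y[o] += W[o, i] * x[i]         # one MAC stream per nonzero weight
    return y

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 6, 6))
W = rng.standard_normal((4, 8)) * (rng.random((4, 8)) < 0.3)   # ~70% zero weights
print(np.allclose(pointwise_conv_sparse(x, W),
                  np.einsum('oi,ihw->ohw', W, x)))   # True: same result, fewer MACs
```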
APA, Harvard, Vancouver, ISO, and other styles
17

Lin, Chien-Yu, and 林建宇. "Merlin: A Sparse Neural Network Accelerator Utilizing Both Neuron and Weight Sparsity." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/6aq7yc.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Pan, Jyun-Wei, and 潘俊瑋. "Enhancing PE Utilization on SIMD-like Accelerator for Sparse Convolutional Neural Networks." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/6qt4ud.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Chang, Chia-Hung, and 張嘉宏. "Design of an Inference Accelerator for CNN with Sparse Row-wise Kernel." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/vgqj7n.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

"Algorithm Architecture Co-design for Dense and Sparse Matrix Computations." Master's thesis, 2018. http://hdl.handle.net/2286/R.I.51737.

Full text
Abstract:
With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Applications of linear algebra are ubiquitous in the fields of scientific computing, machine learning, statistics, etc., with matrix computations being fundamental to these linear algebra based solutions. Designing multiple dense (or sparse) matrix computation routines on the same platform is quite challenging. Added to the complexity is the fact that dense and sparse matrix computations have large differences in their storage and access patterns and are difficult to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PEs). The 2D array of PEs is of size 4x4 and is scheduled to perform 4x4-sized matrix updates efficiently. A sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between on-chip memories and the cores are presented. An optimal core configuration is developed from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core.
Master's thesis, Computer Engineering, 2018
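To illustrate how fixed-size block updates regularize a sparse kernel, here is a toy block-CSR SpMV in which every stored nonzero is a dense 4x4 tile, matching the 4x4 PE array's unit of work; the layout is plain BCSR for illustration, whereas the thesis's PBCSC/PBCSR format adds further partitioning.

```python
import numpy as np

B = 4  # block edge, matching the 4x4 PE array

def spmv_bcsr(indptr, indices, blocks, x):
    """Block-CSR y = A @ x: every inner step is a regular 4x4 tile update,
    the granularity the PE array is scheduled to execute."""
    y = np.zeros((len(indptr) - 1) * B)
    for bi in range(len(indptr) - 1):
        for k in range(indptr[bi], indptr[bi + 1]):
            bj = indices[k]
            y[bi*B:(bi+1)*B] += blocks[k] @ x[bj*B:(bj+1)*B]
    return y

# Two block-rows, three stored 4x4 tiles: A = [[B0, B1], [0, B2]]
blocks  = np.stack([np.eye(B), 2 * np.eye(B), 3 * np.eye(B)])
indptr  = np.array([0, 2, 3])
indices = np.array([0, 1, 1])
print(spmv_bcsr(indptr, indices, blocks, np.ones(2 * B)))  # [3. 3. ... 3.]
```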
APA, Harvard, Vancouver, ISO, and other styles
21

Wu, I.-Chen, and 吳易真. "An Energy-Efficient Accelerator with Relative-Indexing Memory for Sparse Compressed Convolutional Neural Network." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/tx6yx4.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Ramesh, Chinthala. "Hardware-Software Co-Design Accelerators for Sparse BLAS." Thesis, 2017. http://etd.iisc.ac.in/handle/2005/4276.

Full text
Abstract:
Sparse Basic Linear Algebra Subroutines (Sparse BLAS) is an important library. Sparse BLAS includes three levels of subroutines: Level 1, Level 2, and Level 3. Level 1 Sparse BLAS routines perform computations over a sparse vector and a sparse/dense vector. Level 2 deals with sparse matrix and vector operations. Level 3 deals with sparse matrix and dense matrix operations. The computations of these Sparse BLAS routines on General Purpose Processors (GPPs) not only suffer from low utilization of hardware resources but also take more compute time than the workload warrants, due to the poor data locality of sparse vector/matrix storage formats. In the literature, tremendous effort has been put into software to improve the performance of these Sparse BLAS routines on GPPs. GPPs best suit applications with high data locality, whereas Sparse BLAS routines operate on applications with low data locality; hence, GPP performance is poor. Various custom function units (hardware accelerators) have been proposed in the literature and proved to be more efficient than software attempts to accelerate Sparse BLAS subroutines. Though existing hardware accelerators improved Sparse BLAS performance compared to software routines, there is still a lot of scope to improve these accelerators. This thesis describes both the existing software and hardware/software co-designs (HW/SW co-designs) and identifies the limitations of these existing solutions. We propose a new sparse data representation called Sawtooth Compressed Row Storage (SCRS) and corresponding SpMV and SpMM algorithms. SCRS-based SpMV and SpMM perform better than existing software solutions. Even though SCRS-based SpMV and SpMM algorithms perform better than existing solutions, they still could not reach theoretical peak performance. The knowledge gained from studying the limitations of these existing solutions, including the proposed SCRS-based SpMV and SpMM, is used to propose new HW/SW co-designs. Software accelerators are limited by the hardware properties of GPPs and GPUs themselves; hence, we propose HW/SW co-designs to accelerate a few basic Sparse BLAS operations (SpVV and SpMV). Our proposed parallel Sparse BLAS HW/SW co-design achieves near theoretical peak performance with reasonable hardware resources.
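As a simple example of the kind of Level 1 operation targeted (SpVV), here is a two-pointer dot product of two sparse vectors stored as sorted index/value pairs; the storage convention is a generic one, not the SCRS format proposed in the thesis.

```python
def spvv_dot(idx_a, val_a, idx_b, val_b):
    """Level 1 Sparse BLAS-style dot product of two sparse vectors held as
    sorted (index, value) lists; a two-pointer merge touches each nonzero
    exactly once, so the cost scales with nnz rather than vector length."""
    i = j = 0
    acc = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            acc += val_a[i] * val_b[j]
            i += 1; j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return acc

# a = [0,2,0,5,0,1], b = [3,0,0,4,0,2]  ->  5*4 + 1*2 = 22
print(spvv_dot([1, 3, 5], [2., 5., 1.], [0, 3, 5], [3., 4., 2.]))  # 22.0
```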
APA, Harvard, Vancouver, ISO, and other styles
23

"Algorithm and Hardware Design for High Volume Rate 3-D Medical Ultrasound Imaging." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.55684.

Full text
Abstract:
Ultrasound B-mode imaging is an increasingly significant medical imaging modality for clinical applications. Compared to other imaging modalities like computed tomography (CT) or magnetic resonance imaging (MRI), ultrasound imaging has the advantage of being safe, inexpensive, and portable. While two-dimensional (2-D) ultrasound imaging is very popular, three-dimensional (3-D) ultrasound imaging provides distinct advantages over its 2-D counterpart by providing volumetric imaging, which leads to more accurate analysis of tumors and cysts. However, the amount of received data at the front end of a 3-D system is extremely large, making it impractical for power-constrained portable systems. In this thesis, algorithm and hardware design techniques to support a hand-held 3-D ultrasound imaging system are proposed. Synthetic aperture sequential beamforming (SASB) is chosen since its computations can be split into two stages, where the output generated by Stage 1 is significantly smaller in size compared to the input. This characteristic enables Stage 1 to be done in the front end while Stage 2 can be sent out to be processed elsewhere. The contributions of this thesis are as follows. First, 2-D SASB is extended to 3-D. Techniques to increase the volume rate of 3-D SASB through a new multi-line firing scheme and the use of a linear chirp as the excitation waveform are presented. A new sparse array design that not only reduces the number of active transducers but also avoids the imaging degradation caused by grating lobes is proposed. A combination of these techniques increases the volume rate of 3-D SASB by 4x without introducing extra computations at the front end. Next, algorithmic techniques to further reduce the Stage 1 computations in the front end are presented. These include reducing the number of distinct apodization coefficients and operating with narrow-bit-width fixed-point data. A 3-D die-stacked architecture is designed for the front end. This highly parallel architecture enables the signals received by 961 active transducers to be digitized, routed by a network-on-chip, and processed in parallel. The processed data are accumulated through a bus-based structure. This architecture is synthesized using the TSMC 28 nm technology node, and the estimated power consumption of the front end is less than 2 W. Finally, the Stage 2 computations are mapped onto a reconfigurable multi-core architecture, TRANSFORMER, which supports different types of on-chip memory banks and run-time reconfigurable connections between general processing elements and memory banks. The matched filtering step and the beamforming step in Stage 2 are mapped onto TRANSFORMER with different memory configurations. Gem5 simulations show that the private cache mode gives shorter execution time and higher computation efficiency than the other cache modes. The overall execution time for Stage 2 is 14.73 ms. The average power consumption and the average giga-operations-per-second/Watt in a 14 nm technology node are 0.14 W and 103.84, respectively.
Doctoral dissertation, Engineering, 2019
APA, Harvard, Vancouver, ISO, and other styles
24

"On-Chip Learning and Inference Acceleration of Sparse Representations." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.54867.

Full text
Abstract:
The past decade has seen a tremendous surge in running machine learning (ML) functions on mobile devices, from mere novelty applications to now indispensable features for the next generation of devices. While mobile platform capabilities range widely, long battery life and reliability are common design concerns that are crucial to remain competitive. Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining powerful CPUs with GPUs to accelerate the computation of deep neural networks (DNNs), which are the most common structures used to perform ML operations. But traditional von Neumann architectures are not optimized for the high memory bandwidth and massively parallel computation demands required by DNNs, propelling research into non-von Neumann architectures to support these demands. The re-imagining of computer architectures to perform efficient DNN computations requires focusing on the prohibitive demands presented by DNNs and alleviating them. The two central challenges for efficient computation are (1) large memory storage and movement due to the weights of the DNN, and (2) massively parallel multiplications to compute the DNN output. Introducing sparsity into the DNNs, where a certain percentage of either the weights or the outputs of the DNN are zero, greatly helps with both challenges. This, along with algorithm-hardware co-design to compress the DNNs, is demonstrated to provide efficient solutions that greatly reduce the power consumption of hardware that computes DNNs. Additionally, exploring emerging technologies such as non-volatile memories and 3-D stacking of silicon in conjunction with algorithm-hardware co-design architectures will pave the way for the next generation of mobile devices. Towards the objectives stated above, our specific contributions include (a) an architecture based on a resistive crosspoint array that can update all stored values and compute a matrix-vector multiplication in parallel within a single cycle, (b) a framework for training DNNs with block-wise sparsity to drastically reduce the memory storage and total number of computations required to compute the output of DNNs, (c) the exploration of hardware implementations of sparse DNNs and architectural guidelines to reduce power consumption for implementations in monolithic 3D integrated circuits, and (d) a prototype chip in 65 nm CMOS that accelerates long short-term memory networks trained with the proposed block-wise sparsity scheme.
Doctoral dissertation, Electrical Engineering, 2019
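A small sketch of the block-wise sparsity in contribution (b) is given below: whole tiles of a weight matrix are kept or zeroed according to their L1 norm, so storage and compute can be skipped at the tile level; the block size, keep ratio, and scoring rule are illustrative assumptions.

```python
import numpy as np

def blockwise_prune(W, block=8, keep_ratio=0.25):
    """Zero entire (block x block) tiles of W, keeping the tiles with the
    largest L1 norm. Coarse-grained zeros let hardware skip whole tiles."""
    r, c = W.shape[0] // block, W.shape[1] // block
    tiles = W.reshape(r, block, c, block)
    scores = np.abs(tiles).sum(axis=(1, 3))          # one score per tile
    cutoff = np.quantile(scores, 1.0 - keep_ratio)
    mask = (scores >= cutoff)[:, None, :, None]      # broadcast to tile shape
    return (tiles * mask).reshape(W.shape)

rng = np.random.default_rng(5)
W = blockwise_prune(rng.standard_normal((32, 32)))
print((W != 0).mean())   # ~0.25: three quarters of the tiles are all-zero
```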
APA, Harvard, Vancouver, ISO, and other styles
25

Thürck, Daniel. "Irregularity mitigation and portability abstractions for accelerated sparse matrix factorization." Phd thesis, 2021. https://tuprints.ulb.tu-darmstadt.de/17951/1/20210420_dthuerck_dissertation_pdfa.pdf.

Full text
Abstract:
In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels which are portable across a diverse set of compute accelerator architectures through our novel compiler borG. Be it weather prediction, climate models, personalized medicine, genetic analysis or autonomous driving: some of today's central challenges require processing vast amounts of data, feeding large-scale simulations or AI models. As the scale of these problems outpaces processing power and available storage capacity, it becomes crucial to exploit their inherent sparsity. Such sparse topologies, i.e., graph topologies where most of the nodes are not directly connected, are often the source of sparse linear systems of equations whose solution poses a major computational challenge. At the same time, we are witnessing a shift in hardware for high-performance computing: as hardware designers try to avoid the quadratically increasing energy consumption of higher clock frequencies, compute setups increase parallelism and specialization instead. Notably, most of the accelerators in use today are optimized for massive parallelism on regular structures and dense data structures. Processing sparse workloads efficiently on novel, heterogeneous architectures presents a challenge that demands systemic solutions. In this thesis, we investigate strategies and systems focusing on an important building block for the computational sciences: sparse numerical (matrix) factorizations. Factorizations exhibit irregularity in two respects. First, the sparse data structures complicate workload distribution on accelerators geared towards regular grids. Second, numerically mandated pivoting introduces irregularity into the control flow, leading to expensive synchronization points and costly re-building of data structures. We propose two building blocks that help mitigate these problems on accelerators. First, a generalization of sparse factorizations to block-sparse matrices, leading to the use of batched, heavily templated compute kernels. Such kernels are relatively simple and can be tuned for the accelerator architecture in question. Second, we propose a data structure for block-sparse matrices that enables global pivoting through parallel index modifications. Additionally, we demonstrate how pivoting can be introduced into register-focused GPU kernels, leading to a two-level, threshold a-posteriori pivoting scheme. Both concepts are validated on implementations of sparse LDLt factorizations for GPUs. Once we extend the block-sparse approach to other architectures, we risk maintaining divergent, device-specific code bases for the batched kernels. Therefore, we present the source-to-source compiler borG. Based on a novel intermediate representation unifying the programming models of distinct parallel architectures, borG compiles OpenCL code that uses a generalization of the warp register cache idiom to AVX512-based CPUs, NVIDIA GPUs and NEC's SX-Aurora vector processor. The generated kernels may be specialized for a specific problem size, e.g., processing a batch of m*n matrices, and compare favorably to hand-coded kernels in the systems' native development stacks. Building on the work so far, we extend the concept of block-sparse decomposition and generalize it into a meta-algorithm.
Motivated by the rise of domain-specific, fixed-function accelerators, we simplify the device-side code for factorizations even further. The resulting concept, METAPACK, can be implemented not just on traditional compute accelerators but also on "non-von Neumann" types of hardware, including systolic arrays and FPGAs. Relying only on host-side pipelining and batched kernel submission, we lay out the blueprint of a versatile factorization generator. As a full-stack system, METAPACK will allow rapid creation of customized, parameterized sparse factorization code across the many systems that support batched or pipelined parallelism. Several experiments confirm the validity of the concept and encourage a full-fledged implementation. For the sake of simplicity, METAPACK regularizes sparse matrices by subdividing them into a regular grid. To reduce memory waste and exploit compute resources more efficiently, accelerators would need native support for work items of different sizes. Expanding on concepts from borG, we propose such support, combining an incremental change to current accelerator architectures with a novel compiler pass that simplifies such irregular scheduling on the hardware side. A prototypical software simulation shows this architecture's potential to simplify support for irregular compute loads in the future. In summary, our work helps to improve and simplify the handling of sparse matrix factorizations and their efficiency on accelerators. Through a block-centric decomposition of the factorization process and a simplification of the primitive numerical kernels, we enable the use of novel and future accelerator architectures for the sparse problems of computational science.
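To make the block-centric decomposition described above concrete, the following Python/NumPy sketch (illustrative only, not code from the dissertation) shows a right-looking blocked Cholesky factorization, the symmetric positive-definite special case of the LDLt schemes the thesis targets, on a dictionary of dense blocks. The function name and block storage are assumptions made for this example; the point is that each inner level consists of batches of independent, identical dense kernels, which is what maps well onto batched accelerator APIs.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def factorize_blocked(blocks, nblk, b):
    """Right-looking blocked Cholesky on {(i, j): dense b x b block}, i >= j.

    Every inner level is a batch of independent dense kernels -- first
    triangular solves on the panel, then GEMM-shaped Schur-complement
    updates -- which is the property that makes such factorizations
    amenable to batched accelerator APIs. (Sketch only; no pivoting.)
    """
    for k in range(nblk):
        blocks[(k, k)] = cholesky(blocks[(k, k)], lower=True)
        Lkk = blocks[(k, k)]
        panel = [i for i in range(k + 1, nblk) if (i, k) in blocks]
        # Batch 1: independent triangular solves, one per panel block.
        for i in panel:
            blocks[(i, k)] = solve_triangular(Lkk, blocks[(i, k)].T, lower=True).T
        # Batch 2: independent Schur-complement updates; fill-in blocks
        # are created on demand (real codes predict fill-in symbolically).
        for i in panel:
            for j in panel:
                if j <= i:
                    blk = blocks.setdefault((i, j), np.zeros((b, b)))
                    blk -= blocks[(i, k)] @ blocks[(j, k)].T
    return blocks
```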
APA, Harvard, Vancouver, ISO, and other styles
26

Hamlett, Matthew. "A scalable architecture for hardware acceleration of large sparse matrix calculations." 2006. http://www.lib.ncsu.edu/theses/available/etd-11062006-023159/unrestricted/etd.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Jheng, Hong-Yuan, and 鄭弘元. "FPGA Acceleration of Sparse Matrix-Vector Multiplication Based on Network-on-Chip." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/y884tf.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Electronic Engineering
Academic year 99 (ROC calendar, 2010-2011)
Sparse Matrix-Vector Multiplication (SMVM) is a pervasive operation in many scientific and engineering applications. Moreover, SMVM is a computationally intensive operation that dominates the performance of most iterative linear system solvers. Computations involving SMVM are challenging to optimize due to their high memory access rate and irregular memory access patterns. In this thesis, a new design concept for SMVM on an FPGA using a Network-on-Chip (NoC) is presented. In traditional circuit design, on-chip communication has been implemented with dedicated point-to-point interconnections or shared buses, so regular data transfer is the major concern of many parallel implementations. When dealing with the SMVM operation, however, the required data transfers depend on the sparsity structure of the matrix and can be extremely irregular. Using an NoC architecture makes it possible to handle arbitrary data-transfer structures, i.e., arbitrarily structured sparse matrices. In addition, the size of the pipelined SMVM calculator based on the NoC architecture can be customized to 2×2, 4×4, ..., p×p (p∈N) thanks to its high scalability and flexibility. The implementation is done in IEEE-754 single-precision floating point on a Xilinx Virtex-6 FPGA. The experimental results show that the proposed NoC-based implementation achieves an approximate 2.3–5.6× speed-up over a MATLAB-based software implementation on Matrix Market benchmark applications.
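For reference, a minimal software SMVM in CSR format is sketched below in Python/NumPy; it is not the thesis's FPGA design, and the function name is an assumption for illustration. The gather through the column-index array is the irregular, matrix-dependent memory access that motivates routing operands over an NoC instead of fixed point-to-point wiring.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def spmv_csr(indptr, indices, data, x):
    """Reference y = A @ x for a CSR-stored sparse matrix.

    The gather x[indices[lo:hi]] is the irregular, matrix-dependent
    access pattern that fixed interconnects handle poorly and that a
    packet-switched NoC can route for arbitrary sparsity structures.
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for row in range(n):
        lo, hi = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y

# Sanity check against SciPy's built-in CSR matvec.
A = sparse_random(8, 8, density=0.25, format="csr", random_state=0)
x = np.arange(8, dtype=float)
assert np.allclose(spmv_csr(A.indptr, A.indices, A.data, x), A @ x)
```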
APA, Harvard, Vancouver, ISO, and other styles
28

Chen, Dong [author]. "Acceleration of the spatial selective excitation of MRI via sparse approximation / Dong Chen." 2009. http://d-nb.info/1000072541/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Meila, Marina. "An Accelerated Chow and Liu Algorithm: Fitting Tree Distributions to High Dimensional Sparse Data." 1999. http://hdl.handle.net/1721.1/6676.

Full text
Abstract:
Chow and Liu introduced an algorithm for fitting a multivariate distribution with a tree, i.e., a density model that assumes only pairwise dependencies between variables and that the graph of these dependencies is a spanning tree. The original algorithm is quadratic in the dimension of the domain and linear in the number of data points that define the target distribution $P$. This paper shows that for sparse, discrete data, fitting a tree distribution can be done in time and memory that are jointly subquadratic in the number of variables and the size of the data set. The new algorithm, called the acCL algorithm, takes advantage of the sparsity of the data to accelerate the computation of pairwise marginals and the sorting of the resulting mutual informations, achieving speed-ups of up to 2–3 orders of magnitude in the experiments.
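As context for what the acCL algorithm accelerates, the following Python sketch shows the classical, dense Chow and Liu procedure: compute all O(n^2) pairwise mutual informations, then extract a maximum-weight spanning tree. None of this code is from the paper; the acCL algorithm's contribution is precisely to avoid this quadratic pass on sparse data.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X):
    """Classical Chow-Liu on binary data X of shape (samples, variables):
    O(n^2) pairwise mutual informations, then a maximum spanning tree.
    """
    m, n = X.shape
    w = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            mi = 0.0
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((X[:, u] == a) & (X[:, v] == b)) + 1e-12
                    p_a = np.mean(X[:, u] == a) + 1e-12
                    p_b = np.mean(X[:, v] == b) + 1e-12
                    mi += p_ab * np.log(p_ab / (p_a * p_b))
            # Negate (and offset, so zero-MI pairs still count as edges):
            # a minimum spanning tree on -MI is a maximum-MI spanning tree.
            w[u, v] = -(mi + 1.0)
    tree = minimum_spanning_tree(w)
    return list(zip(*tree.nonzero()))

# Toy usage on random binary data.
rng = np.random.default_rng(0)
print(chow_liu_edges(rng.integers(0, 2, size=(200, 5))))
```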
APA, Harvard, Vancouver, ISO, and other styles
30

Sandhu, Ali Imran. "Efficient and Accurate Numerical Techniques for Sparse Electromagnetic Imaging." Diss., 2020. http://hdl.handle.net/10754/662627.

Full text
Abstract:
Electromagnetic (EM) imaging schemes are inherently non-linear and ill-posed. Although there exist remedies to these fundamental problems, more efficient solutions are still being sought. To this end, in this thesis, the non-linearity is tackled by incorporating a multitude of techniques, ranging from the Born approximation (linear) and inexact Newton (linearized) methods to fully nonlinear iterative Landweber schemes, that can account for weak to strong scattering problems. The ill-posedness of the EM inverse scattering problem is circumvented by formulating the above methods as a minimization problem with a sparsity constraint. More specifically, four novel inverse scattering schemes are formulated and implemented. (i) A greedy algorithm is used together with a simple artificial neural network (ANN) for efficient and accurate EM imaging of weak scatterers. The ANN is used to predict the sparsity level of the investigation domain, which is then used as the L0-constraint parameter for the greedy algorithm. (ii) An inexact Newton scheme that enforces the sparsity constraint on the derivative of the unknown material properties (which are not necessarily sparse themselves) is proposed. The inverse scattering problem is formulated as a nonlinear function of the derivative of the material properties. This approach results in significant sparsification, where any sparsity regularization method can be efficiently applied. (iii) A sparsity-regularized nonlinear contrast source (CS) framework is developed to directly solve the nonlinear minimization problem using Landweber iterations, whose convergence is accelerated using a self-adaptive projected accelerated steepest descent algorithm. (iv) A 2.5D finite-difference frequency-domain (FDFD) based inverse scattering scheme is developed for imaging scatterers embedded in lossy and inhomogeneous media. The FDFD-based inversion algorithm does not require the Green's function of the background medium and appears to be a promising technique for biomedical and subsurface imaging with reasonable computational time. Numerical experiments, carried out using synthetically generated measurements, show that the images recovered by these sparsity-regularized methods are sharper and more accurate than those produced by existing methods. The methods developed in this work have potential application areas ranging from oil/gas reservoir engineering to biological imaging, where sparse domains naturally exist.
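A minimal sketch of the generic machinery behind scheme (iii), a sparsity-regularized Landweber iteration with soft-thresholding (ISTA-style), is given below in Python/NumPy. It is a textbook skeleton under stated assumptions, not the thesis's algorithm; in particular, the self-adaptive projected accelerated step-size rule is not reproduced, and the linear matrix A stands in for the forward scattering operator.

```python
import numpy as np

def sparse_landweber(A, y, step, tau, iters=500):
    """Sparsity-regularized Landweber iteration (ISTA-style skeleton):
    a gradient step on ||Ax - y||^2 followed by soft-thresholding.
    A: forward operator, y: measurements, step: step size
    (must be < 2 / ||A||^2 for convergence), tau: threshold.
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + step * (A.T @ (y - A @ x))                 # Landweber step
        x = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)  # soft threshold
    return x

# Toy usage: recover a sparse vector from underdetermined measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 120)) / np.sqrt(60.0)
x_true = np.zeros(120)
x_true[rng.choice(120, size=6, replace=False)] = 1.0
step = 1.0 / np.linalg.norm(A, 2) ** 2  # safely below 2 / ||A||^2
x_hat = sparse_landweber(A, A @ x_true, step, tau=0.01)
```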
APA, Harvard, Vancouver, ISO, and other styles
31

Sitaraman, Hariswaran. "Magneto-hydrodynamics simulation study of high density thermal plasmas in plasma acceleration devices." 2013. http://hdl.handle.net/2152/21618.

Full text
Abstract:
The development of a magneto-hydrodynamics (MHD) numerical tool to study high-density thermal plasmas in plasma acceleration devices is presented. The MHD governing equations represent eight conservation equations for the evolution of density, momentum, energy, and induced magnetic fields in a plasma. A matrix-free implicit method is developed to solve these conservation equations within the framework of an unstructured-grid finite volume formulation. The analytic form of the convective flux Jacobian is derived for general unstructured grids. A Lower-Upper Symmetric Gauss-Seidel (LU-SGS) technique is developed as part of the implicit scheme. A coloring-based algorithm for parallelizing this technique is also presented, and its computational efficiency is compared with a global matrix-solve technique that uses the GMRES (Generalized Minimal Residual) algorithm available in the PETSc (Portable, Extensible Toolkit for Scientific Computation) libraries. The verification cases used for this study are the MHD shock tube problem in one, two, and three dimensions, the oblique shock, and the Hartmann flow problem. The matrix-free method is comparatively faster and shows excellent scaling on multiple cores compared to the global matrix-solve technique. The numerical model was thus verified against the above-mentioned standard test cases, and two application problems were studied: the simulation of the plasma deflagration phenomenon in a coaxial plasma accelerator, and a novel high-speed flow control device called the Rail Plasma Actuator (RailPAc). Experimental studies on coaxial plasma accelerators have revealed two different modes of operation based on the delay between gas loading and discharge ignition: longer delays lead to the detonation or snowplow mode, while shorter delays lead to the relatively efficient stationary or deflagration mode. One of the theories explaining the two modes is based on plasma resistivity. A numerical modeling study is presented here in the context of a coaxial plasma accelerator, and the effect of plasma resistivity is dealt with in detail. The simulated axial distributions of radial currents are compared with experimental measurements and show good agreement. The simulations show that magnetic field diffusion is dominant at lower conductivities, which tends to form a stationary region of high current density close to the inlet end of the device, whereas higher conductivities lead to the formation of propagating current-sheet-like features due to greater convection of the magnetic field. This study also validates the theory behind the two modes of operation based on plasma resistivity. The RailPAc is a novel flow control device that uses magnetic Lorentz forces for fluid flow actuation at atmospheric pressure. Experimental studies reveal that actuation of ~10–100 m/s can be achieved with this device, which is much larger than what conventional electro-hydrodynamic (EHD) force-based plasma actuators achieve. A magneto-hydrodynamics simulation study of this device is presented, with the model further developed to incorporate the applied electric and magnetic fields seen in this device. The snowplow model, typically used for studying pulsed plasma thrusters, is used to predict the arc velocities, which agree well with experimental measurements. Two-dimensional simulations were performed to study the effect of Lorentz forcing and heating on fluid flow actuation.
Actuation on the order of 100 m/s is attained at the head of the current sheet due to Lorentz forcing alone, while the inclusion of heating effects leads to isotropic, blast-wave-like actuation that is detrimental to the performance of the RailPAc. This study also revealed the deficiencies of a single-fluid model, and a more accurate multi-fluid approach is proposed for future work.
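The phrase "matrix-free implicit method" admits a compact illustration: the Jacobian of the residual is never assembled, because Krylov solvers such as GMRES only need its action on a vector, and a single finite difference of the residual supplies that action. The Python/NumPy sketch below is a generic illustration of this standard idea, not code from this work; the `residual` callable stands in for the MHD flux residual.

```python
import numpy as np

def jacobian_vector_product(residual, u, v, eps=1e-7):
    """Matrix-free approximation of J(u) @ v, where J is the Jacobian of
    residual R: R^n -> R^n. Krylov solvers only need this product, so the
    Jacobian matrix itself is never formed or stored.
    """
    # Standard JFNK-style step scaling to balance truncation and round-off.
    scale = eps * np.sqrt(1.0 + np.linalg.norm(u)) / max(np.linalg.norm(v), 1e-30)
    return (residual(u + scale * v) - residual(u)) / scale

# Toy check against the analytic Jacobian of R(u) = u^2 (elementwise),
# for which J(u) @ v = 2 * u * v.
u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])
approx = jacobian_vector_product(lambda w: w ** 2, u, v)
assert np.allclose(approx, 2.0 * u * v, rtol=1e-5)
```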
APA, Harvard, Vancouver, ISO, and other styles