Dissertations / Theses on the topic 'Parallel code optimization'

Consult the top 24 dissertations / theses for your research on the topic 'Parallel code optimization.'

1

Cordeiro, Silvio Ricardo. "Code profiling and optimization in transactional memory systems." Biblioteca Digital de Teses e Dissertações da UFRGS, 2014. http://hdl.handle.net/10183/97866.

Abstract:
Transactional Memory has shown itself to be a promising paradigm for the implementation of shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock that is shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back the modifications in the event of a data access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparatively recent, and programmers are usually tasked with unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, which forms the basis for the automated tuning of the underlying speculative system for the efficient execution of that particular application. We also propose a profile-guided approach to the scheduling of threads in a software-based implementation of Transactional Memory, using data collected by the profiler to predict the likelihood of conflicts and determine which thread to schedule based on this prediction. We present the results achieved under both designs.
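The optimistic execution model described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: `VersionedCell`, `atomically` and `run_demo` are hypothetical names, and a single versioned cell stands in for a full transactional memory. A transaction works on a private snapshot, commits only if no other commit has happened in the meantime, and otherwise discards its result and retries.

```python
import threading


class VersionedCell:
    """Toy shared cell for illustration only; not the system from the thesis."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # guards only the short commit step

    def try_commit(self, seen_version, new_value):
        """Install new_value only if nobody committed since seen_version."""
        with self._commit_lock:
            if self.version != seen_version:
                return False  # conflict detected: the caller's work is discarded
            self.value = new_value
            self.version += 1
            return True


def atomically(cell, update):
    """Run update() optimistically; return how many attempts were aborted."""
    aborts = 0
    while True:
        # Read the version first, so a commit racing with this read is
        # always detected later by try_commit.
        seen_version, snapshot = cell.version, cell.value
        new_value = update(snapshot)  # speculative work on a private copy
        if cell.try_commit(seen_version, new_value):
            return aborts
        aborts += 1  # "roll back" by dropping new_value, then retry


def run_demo(threads=4, increments=1000):
    """Increment one cell from several threads and return the final value."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            atomically(cell, lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.value
```

Calling `run_demo(4, 1000)` should leave the cell at 4000 however the threads interleave. The abort count returned by `atomically` is the kind of signal a profiler could collect, although the metrics actually used in the thesis are not specified here.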
2

Hong, Changwan. "Code Optimization on GPUs." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1557123832601533.

3

Faber, Peter. "Code Optimization in the Polyhedron Model - Improving the Efficiency of Parallel Loop Nests." kostenfrei, 2007. http://www.opus-bayern.de/uni-passau/volltexte/2008/1251/.

4

Fassi, Imen. "XFOR (Multifor) : A new programming structure to ease the formulation of efficient loop optimizations." Thesis, Strasbourg, 2015. http://www.theses.fr/2015STRAD043/document.

Abstract:
We propose a new programming structure named XFOR (Multifor), dedicated to data-reuse-aware programming. It allows several for-loops to be handled simultaneously and their respective iteration domains to be mapped onto each other. Additionally, XFOR eases the application and composition of loop transformations. Experiments show that XFOR codes provide significant speed-ups compared with the original code versions, and also compared with the versions optimized by the Pluto polyhedral loop optimizer. We implemented the XFOR structure through the development of three software tools: (1) a source-to-source compiler named IBB (Iterate-But-Better!), which automatically translates any C/C++ code containing XFOR-loops into equivalent code where XFOR-loops have been translated into semantically equivalent for-loops. IBB also benefits from optimizations implemented in the polyhedral code generator CLooG, which IBB invokes to generate for-loops from an OpenScop specification; (2) an XFOR programming environment named XFOR-WIZARD, which assists the programmer in rewriting a program with classical for-loops into an equivalent but more efficient program using XFOR-loops; (3) a tool named XFORGEN, which automatically generates XFOR-loops from any OpenScop representation of transformed loop nests produced by an automatic optimizer.
5

Irigoin, François. "Partitionnement des boucles imbriquées : une technique d'optimisation pour les programmes scientifiques." Paris 6, 1987. http://www.theses.fr/1987PA066437.

Abstract:
We propose a new program transformation, called supernode partitioning, which applies to nested loops and achieves the following goals: saturation of vector parallelism and of the elementary processors, good locality of references so that execution is not limited by main-memory bandwidth, and an acceptable synchronization cost.
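Supernode partitioning groups the iterations of a loop nest into blocks (supernodes) that are executed as units, which is the idea now commonly known as loop tiling. The following Python sketch is an illustration under stated assumptions, not the formulation from the thesis: it uses rectangular tiles of an arbitrary size on a matrix product, whereas the thesis may define supernodes more generally. The function names are hypothetical.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, with the iteration space cut into tile^3 blocks.

    The three outer loops enumerate the blocks; the three inner loops
    run over the iterations inside one block. Each block touches only a
    small part of A, B and C, which is what improves reference locality.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        for k in range(kk, min(kk + tile, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions perform the same multiply-add operations, only in a different order, so with integer inputs they return identical results. In a compiled language the tile size would be chosen so that one block fits in cache or in vector registers; in pure Python the sketch only shows the loop structure.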
6

He, Guanlin. "Parallel algorithms for clustering large datasets on CPU-GPU heterogeneous architectures." Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG062.

Abstract:
Clustering, which aims at achieving natural groupings of data, is a fundamental and challenging task in machine learning and data mining. Numerous clustering methods have been proposed in the past, among which k-means is one of the best known and most commonly used due to its simplicity and efficiency. Spectral clustering is a more recent approach that usually achieves higher clustering quality than k-means. However, classical spectral clustering algorithms suffer from a lack of scalability due to their high complexity in terms of number of operations and memory space requirements. This scalability challenge can be addressed by applying approximation methods or by employing parallel and distributed computing. The objective of this thesis is to accelerate spectral clustering and make it scalable to large datasets by combining representatives-based approximation with parallel computing on CPU-GPU platforms. Considering different scenarios, we propose several parallel processing chains for large-scale spectral clustering. We design optimized parallel algorithms and implementations for each module of the proposed chains: parallel k-means on CPU and GPU, parallel spectral clustering on GPU using a sparse storage format, parallel filtering of data noise on GPU, etc. Our various experiments reach high performance and validate the scalability of each module and of the complete chains.
7

Fang, Juing. "Décodage pondère des codes en blocs et quelques sujets sur la complexité du décodage." Paris, ENST, 1987. http://www.theses.fr/1987ENST0005.

Abstract:
A study of the theoretical complexity of decoding block codes through a family of algorithms based on the principle of combinatorial optimization. A parallel algebraic decoding algorithm whose complexity depends on the channel noise level is then addressed. Finally, a Viterbi algorithm is introduced for chained-processing applications.
8

Tagliavini, Giuseppe. "Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amsdottorato.unibo.it/8068/1/TESI.pdf.

Abstract:
Nowadays many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim to guarantee code quality and reduce software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives overcoming the current state-of-the-art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim to show how to achieve best performance by exploiting the characteristics of a specific application domain. 
To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level.
9

Drebes, Andi. "Dynamic optimization of data-flow task-parallel applications for large-scale NUMA systems." Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066330/document.

Abstract:
Within the last decade, microprocessor development reached a point at which higher clock rates and more complex micro-architectures became less energy-efficient, such that power consumption and energy density were pushed beyond reasonable limits. As a consequence, the industry has shifted to more energy-efficient multi-core designs, integrating multiple processing units (cores) on a single chip. The number of cores is expected to grow exponentially and future systems are expected to integrate thousands of processing units. In order to provide sufficient memory bandwidth in these systems, main memory is physically distributed over multiple memory controllers with non-uniform access to memory (NUMA). Past research has identified programming models based on fine-grained, dependent tasks as a key technique to unleash the parallel processing power of massively parallel general-purpose computing architectures. However, the execution of task-parallel programs on architectures with non-uniform memory access and the dynamic optimizations to mitigate NUMA effects have received little attention. In this thesis, we explore the main factors affecting the performance and data locality of task-parallel programs and propose a set of transparent, portable and fully automatic on-line mechanisms for mapping tasks to cores and data to memory controllers in order to improve data locality and performance. Placement decisions are based on information about point-to-point data dependences, readily available in the run-time systems of modern task-parallel programming frameworks. The experimental evaluation of these techniques is conducted on our implementation in the run-time of the OpenStream language and a set of high-performance scientific benchmarks. Finally, we designed and implemented Aftermath, a tool for performance analysis and debugging of task-parallel applications and run-times.
10

Child, Ryan. "Performance and Power Optimization of Parallel Discrete Event Simulations Using DVFS." University of Cincinnati / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1342730759.

11

Drebes, Andi. "Dynamic optimization of data-flow task-parallel applications for large-scale NUMA systems." Electronic Thesis or Diss., Paris 6, 2015. http://www.theses.fr/2015PA066330.

Abstract:
Within the last decade, microprocessor development reached a point at which higher clock rates and more complex micro-architectures became less energy-efficient, such that power consumption and energy density were pushed beyond reasonable limits. As a consequence, the industry has shifted to more energy-efficient multi-core designs, integrating multiple processing units (cores) on a single chip. The number of cores is expected to grow exponentially and future systems are expected to integrate thousands of processing units. In order to provide sufficient memory bandwidth in these systems, main memory is physically distributed over multiple memory controllers with non-uniform access to memory (NUMA). Past research has identified programming models based on fine-grained, dependent tasks as a key technique to unleash the parallel processing power of massively parallel general-purpose computing architectures. However, the execution of task-parallel programs on architectures with non-uniform memory access and the dynamic optimizations to mitigate NUMA effects have received little attention. In this thesis, we explore the main factors affecting the performance and data locality of task-parallel programs and propose a set of transparent, portable and fully automatic on-line mechanisms for mapping tasks to cores and data to memory controllers in order to improve data locality and performance. Placement decisions are based on information about point-to-point data dependences, readily available in the run-time systems of modern task-parallel programming frameworks. The experimental evaluation of these techniques is conducted on our implementation in the run-time of the OpenStream language and a set of high-performance scientific benchmarks. Finally, we designed and implemented Aftermath, a tool for performance analysis and debugging of task-parallel applications and run-times.
12

Belgin, Mehmet. "Structure-based Optimizations for Sparse Matrix-Vector Multiply." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/30260.

Abstract:
This dissertation introduces two novel techniques, OSF and PBR, to improve the performance of Sparse Matrix-vector Multiply (SMVM) kernels, which dominate the runtime of iterative solvers for systems of linear equations. SMVM computations that use sparse formats typically achieve only a small fraction of peak CPU speeds because they are memory bound due to their low flops:byte ratio, they access memory irregularly, and exhibit poor ILP due to inefficient pipelining. We particularly focus on improving the flops:byte ratio, which is the main limiter on performance, by exploiting recurring structures or sub-structures in matrices. Our techniques also support micro-architecture level optimizations to further improve performance. Operation Stacking Framework (OSF) stacks problems in large ensemble computations, which run the same sparse kernel using an identical matrix structure, such that they share a single copy of the indexing information to significantly reduce memory bandwidth usage. OSF provides performance improvements of up to 1.94x on an AMD Opteron compared to the CSR method. We validate performance results using hardware event counters, which demonstrate significantly improved cache and pipeline utilization. Pattern-based Representation (PBR) exploits recurring block nonzero patterns by generating custom code for each recurring block pattern. In this way, no indexing data for individual nonzero elements are read from memory, reducing the overall size of the indices by up to 98%. Our code generator emits highly tuned codes that utilize SSE vectorization and software prefetching. PBR accurately identifies a block size that achieves optimal or near-optimal performance using a linear multiple regression performance model. On recent multicore machines, PBR provides performance improvements of up to 3.4x sequentially and 5x in parallel, compared to the CSR method. 
The PBR library we provide converts matrices at runtime, allowing our method to be used as a drop-in replacement for existing methods. We compare PBR's overhead relative to its benefits and show that PBR is beneficial for many applications that repetitively call the SMVM kernel for the same matrix structure.
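For context, the CSR (compressed sparse row) method that this abstract uses as its baseline can be sketched in a few lines of Python. This is the standard textbook kernel, not OSF or PBR, and the helper names `csr_from_dense` and `csr_spmv` are chosen here for illustration. The point to notice is the column-index load that accompanies every nonzero: this is the indexing traffic that lowers the flops:byte ratio.

```python
def csr_from_dense(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # nonzero value
                col_idx.append(j)  # its column, stored per nonzero
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A stored in CSR form."""
    y = [0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Two flops (multiply, add) against three loads:
            # values[k], col_idx[k], and the irregular access x[col_idx[k]].
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the matrix `[[1, 0, 2], [0, 0, 3], [4, 5, 0]]` the CSR arrays are `values = [1, 2, 3, 4, 5]`, `col_idx = [0, 2, 2, 0, 1]` and `row_ptr = [0, 2, 3, 5]`. Sharing or eliminating `col_idx`, as the abstract describes for OSF and PBR respectively, removes one of the three loads from the inner loop.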
Ph. D.
13

Gao, Xiaoyang. "Integrated compiler optimizations for tensor contractions." Columbus, Ohio : Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1198874631.

14

Darwish, Mohammed. "Lot-sizing and scheduling optimization using genetic algorithm." Thesis, Högskolan i Skövde, Institutionen för ingenjörsvetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-17045.

Abstract:
The simultaneous lot-sizing and scheduling problem is the problem of deciding which products are to be produced on which machine and in which order, as well as the quantity of each product. Problems of this type are hard to solve. They have therefore been studied for years, and a considerable number of papers have been published on solving different lot-sizing and scheduling problems, specifically real-case problems. This work proposes a Real-Coded Genetic Algorithm (RCGA) with a new chromosome representation to solve a non-identical parallel machine capacitated lot-sizing and scheduling problem with sequence-dependent setup times and costs, machine cost and backlogging. Such a problem can be found in a real-world production line at a furniture manufacturer in Sweden. Backlogging is an important concept in this problem, and it is often ignored in the literature. This study implements three different types of crossover; one of them has been chosen based on numerical experiments. Four mutation operators have been combined to allow the genetic algorithm to scan the search area and maintain genetic diversity. Other steps, such as the initialization of the population and a reinitialization process, have been designed carefully to achieve the best performance and to prevent the algorithm from becoming trapped in a local optimum. The proposed algorithm is implemented in MATLAB and tested on a set of standard medium- to large-size problems taken from the literature. A variety of problems were solved to measure the impact of different problem characteristics, such as the number of periods, machines, and products, on the quality of the solution provided by the proposed RCGA. To evaluate the performance of the proposed algorithm, the average deviation from the lower bound and the runtime of the proposed RCGA are compared with those of three other algorithms from the literature. The results show that, in addition to its high computational speed, the proposed RCGA outperforms the other algorithms for non-identical parallel machine problems, while it is outperformed by the other algorithms for problems with more identical parallel machines. The results also show that characteristics of problem instances such as increasing setup cost and problem size negatively influence the quality of the solutions provided by the proposed RCGA.
15

Тевяшев, А. Д., and Д. І. Гольдинер. "System Analysis of The Parallel Execution Problem." Thesis, 2019. http://openarchive.nure.ua/handle/document/11959.

Abstract:
The article studies the prerequisites for the appearance and dissemination of the approach to optimizing the execution of program code through the simultaneous execution in several threads, analyzes the problem of parallel computing, identifies factors affecting the feasibility and effectiveness of the implementation of the architectural solution for multi-threaded program execution. Also, the problems that may arise during the implementation of optimization are considered. The main advantages and disadvantages of using the Go programming language for implementing parallel execution of a software product are considered.
16

Faber, Peter [Verfasser]. "Code optimization in the polyhedron model : improving the efficiency of parallel loop nests / Peter Faber." 2008. http://d-nb.info/991047869/34.

17

Mullapudi, Ravi Teja. "Polymage : Automatic Optimization for Image Processing Pipelines." Thesis, 2015. http://etd.iisc.ac.in/handle/2005/3757.

Abstract:
Image processing pipelines are ubiquitous. Every image captured by a camera and every image uploaded on social networks like Google+ or Facebook is processed by a pipeline. Applications in a wide range of domains like computational photography, computer vision and medical imaging use image processing pipelines. Many of these applications demand high performance, which requires effective utilization of modern architectures. Given the proliferation of camera-enabled devices and social networks, optimizing these emerging workloads has become important both at the data center and the embedded device scales. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, sampling, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth, preventing effective utilization of the parallelism available on modern architectures. The traditional options are to use optimized libraries like OpenCV or to optimize manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. In this thesis, we present the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. The focus of the system is on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. We achieve such automation with: • tiling techniques to improve parallelism and locality by introducing redundant computation, • a model-driven fusion heuristic which enables a trade-off between locality and redundant computations, and • an autotuner which leverages the fusion heuristic to explore a small subset of pipeline implementations and find the best performing one. Our optimization approach primarily relies on the transformation and code generation capabilities of the polyhedral compiler framework. To the best of our knowledge, this is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization fully automatically. We evaluate our framework on a modern multicore system using a set of seven benchmarks which vary widely in structure and complexity. Experimental results show that the performance of pipeline implementations generated by our approach is: • up to 1.81× better than pipeline implementations manually tuned using Halide, a state-of-the-art language and compiler for image processing pipelines, • on average 5.39× better than pipeline implementations automatically tuned using Halide and OpenTuner, and • on average 3.3× better than naive pipeline implementations which only exploit parallelism without optimizing for locality. We also demonstrate that the performance of PolyMage generated code is better than or comparable to implementations using OpenCV, a state-of-the-art image processing and computer vision library.
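The trade-off between locality and redundant computation mentioned in this abstract can be made concrete with a small Python sketch. It is a hand-written illustration, not PolyMage code or its language: the two-stage 1-D pipeline, the function names and the tile size are all assumptions made for the example. Fusing the two stencil stages per output tile keeps the intermediate values tile-local, at the price of recomputing a few of them where neighbouring tiles overlap.

```python
def stage(src):
    """3-point stencil over the valid region (output is 2 shorter than input)."""
    return [src[i] + src[i + 1] + src[i + 2] for i in range(len(src) - 2)]


def pipeline_unfused(image):
    """Run stage 1 over the whole image, then stage 2 over its full result."""
    return stage(stage(image))


def pipeline_fused_tiled(image, tile=4):
    """Compute the same result tile by tile with both stages fused.

    Returns (output, intermediate_evals), where intermediate_evals counts
    how many stage-1 values were computed in total, including the ones
    recomputed in the overlap between neighbouring tiles.
    """
    n_out = len(image) - 4
    out = []
    intermediate_evals = 0
    for start in range(0, n_out, tile):
        stop = min(start + tile, n_out)
        # Outputs [start, stop) need stage-1 values [start, stop + 2),
        # which in turn need input pixels [start, stop + 4).
        window = image[start:stop + 4]
        mid = stage(window)  # small, tile-local intermediate buffer
        intermediate_evals += len(mid)
        out.extend(stage(mid))
    return out, intermediate_evals
```

For a 12-pixel input and a tile size of 4, the unfused version computes 10 intermediate values once, while the fused version computes 12 (two tiles of 6, with 2 recomputed in the overlap) but never materializes the full intermediate array. Choosing which stages to fuse and how large to make the tiles is the decision that the abstract assigns to the fusion heuristic and the autotuner.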
18

Mullapudi, Ravi Teja. "Polymage : Automatic Optimization for Image Processing Pipelines." Thesis, 2015. http://etd.iisc.ernet.in/2005/3757.

Abstract:
Image processing pipelines are ubiquitous. Every image captured by a camera and every image uploaded on social networks like Google+ or Facebook is processed by a pipeline. Applications in a wide range of domains like computational photography, computer vision and medical imaging use image processing pipelines. Many of these applications demand high performance, which requires effective utilization of modern architectures. Given the proliferation of camera-enabled devices and social networks, optimizing these emerging workloads has become important both at the data center and the embedded device scales. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, sampling, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth, preventing effective utilization of parallelism available on modern architectures. The traditional options are using optimized libraries like OpenCV or optimizing manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. In this thesis, we present the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. The focus of the system is on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language.
We achieve such automation with:
• tiling techniques to improve parallelism and locality by introducing redundant computation,
• a model-driven fusion heuristic which enables a trade-off between locality and redundant computations, and
• an autotuner which leverages the fusion heuristic to explore a small subset of pipeline implementations and find the best performing one.
Our optimization approach primarily relies on the transformation and code generation capabilities of the polyhedral compiler framework. To the best of our knowledge, this is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization fully automatically. We evaluate our framework on a modern multicore system using a set of seven benchmarks which vary widely in structure and complexity. Experimental results show that the performance of pipeline implementations generated by our approach is:
• up to 1.81× better than pipeline implementations manually tuned using Halide, a state-of-the-art language and compiler for image processing pipelines,
• on average 5.39× better than pipeline implementations automatically tuned using Halide and OpenTuner, and
• on average 3.3× better than naive pipeline implementations which only exploit parallelism without optimizing for locality.
We also demonstrate that the performance of PolyMage generated code is better or comparable to implementations using OpenCV, a state-of-the-art image processing and computer vision library.
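The idea of tiling with redundant computation mentioned in the abstract can be illustrated with a small sketch. This is not PolyMage code or its DSL: the two-stage 1D blur, the function names (`blur3`, `pipeline_unfused`, `pipeline_fused_tiled`) and the tile size are all assumptions chosen for illustration. Each tile recomputes the few intermediate values it needs at its borders, so tiles become independent of one another and the intermediate stage stays small enough to remain in cache.

```python
def blur3(src, i):
    """3-point box filter at index i; caller guarantees 1 <= i <= len(src) - 2."""
    return (src[i - 1] + src[i] + src[i + 1]) / 3.0


def pipeline_unfused(img):
    """Run two stencil stages one after the other over the whole signal."""
    n = len(img)
    # stage1[k] corresponds to input position k + 1
    stage1 = [blur3(img, i) for i in range(1, n - 1)]
    # the result corresponds to input positions 2 .. n - 3
    return [blur3(stage1, j) for j in range(1, len(stage1) - 1)]


def pipeline_fused_tiled(img, tile=4):
    """Fuse both stages per tile, recomputing stage-1 values at tile borders.

    Returns the output and the number of stage-1 evaluations performed,
    so the cost of the redundant work is visible.
    """
    n = len(img)
    out = []
    stage1_evals = 0
    for lo in range(2, n - 2, tile):          # output positions 2 .. n - 3
        hi = min(lo + tile, n - 2)
        # Stage 1 for this tile plus a one-element halo on each side.
        local = [blur3(img, q) for q in range(lo - 1, hi + 1)]
        stage1_evals += len(local)
        # local[0] holds position lo - 1, so position p sits at index p - lo + 1.
        out.extend(blur3(local, p - lo + 1) for p in range(lo, hi))
    return out, stage1_evals
```

In this sketch every tile touches only `img` and its own `local` buffer, so the tiles could be processed in parallel; the price is that halo values are evaluated once per neighbouring tile. Choosing how much of that recomputation to accept is the kind of trade-off the abstract attributes to the fusion heuristic.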
19

Nandakumar, K. S. "Combining Conditional Constant Propagation And Interprocedural Alias Analysis." Thesis, 1995. https://etd.iisc.ac.in/handle/2005/1739.

20

Nandakumar, K. S. "Combining Conditional Constant Propagation And Interprocedural Alias Analysis." Thesis, 1995. http://etd.iisc.ernet.in/handle/2005/1739.

21

Sjöblom, William. "Idiom-driven innermost loop vectorization in the presence of cross-iteration data dependencies in the HotSpot C2 compiler." Thesis, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-172789.

Abstract:
This thesis presents a technique for automatic vectorization of innermost single statement loops with a cross-iteration data dependence by analyzing data-flow to recognize frequently recurring program idioms. Recognition is carried out by matching the circular SSA data-flow found around the loop body’s φ-function against several primitive patterns, forming a tree representation of the relevant data-flow that is then pruned down to a single parameterized node, providing a high-level specification of the data-flow idiom at hand used to guide algorithmic replacement applied to the intermediate representation. The versatility of the technique is shown by presenting an implementation supporting vectorization of both a limited class of linear recurrences as well as prefix sums, where the latter shows how the technique generalizes to intermediate representations with memory state in SSA-form. Finally, a thorough performance evaluation is presented, showing the effectiveness of the vectorization technique.
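To make the cross-iteration dependence concrete, the following sketch contrasts a scalar prefix sum with a blocked version that scans each block in a logarithmic number of shift-and-add steps. It is only an illustration of a common way prefix sums are restructured for SIMD execution: Python lists stand in for vector registers, the names `scan_block` and `prefix_sum_blocked` and the width of 4 are assumptions, and nothing here is taken from the HotSpot C2 implementation described in the thesis.

```python
def prefix_sum_scalar(xs):
    """Reference loop: each iteration depends on the previous accumulator."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out


def scan_block(block):
    """Inclusive scan of one 'vector' using about log2(width) shift-and-add steps."""
    lanes = list(block)
    shift = 1
    while shift < len(lanes):
        # Shift lanes right by `shift`, filling with zeros, then add lane-wise.
        shifted = [0] * shift + lanes[:-shift]
        lanes = [a + b for a, b in zip(lanes, shifted)]
        shift *= 2
    return lanes


def prefix_sum_blocked(xs, width=4):
    """Scan fixed-width blocks and carry the running total between blocks."""
    out, carry = [], 0
    for start in range(0, len(xs), width):
        scanned = scan_block(xs[start:start + width])
        out.extend(carry + v for v in scanned)  # broadcast-add the carry
        carry = out[-1]
    return out
```

The only value that still flows from one block to the next is `carry`, which is what lets the work inside a block proceed lane-parallel despite the recurrence.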
22

Ryoo, Shane. "Program optimization strategies for data-parallel many-core processors /." 2008. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3314878.

Abstract:
Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2008.
Source: Dissertation Abstracts International, Volume: 69-05, Section: B, page: 3190. Adviser: Wen-mei W. Hwu. Includes bibliographical references (leaves 137-145). Available on microfilm from ProQuest Information and Learning.
23

Chen, Shih-Chang, and 陳世璋. "Developing GEN_BLOCK Redistribution Algorithms and Optimization Techniques on Parallel, Distributed and Multi-Core Systems." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/31998536558283534640.

Abstract:
Doctoral degree
Chung Hua University
Ph.D. Program in Engineering Science
99
Parallel computing systems have been used to solve complex scientific problems, using aggregates of data such as arrays to extend sequential programming languages. With the improvement of hardware architectures, a parallel system can be a cluster, multiple clusters, or a multi-cluster with multi-core machines. Under this paradigm, appropriate data distribution is critical to the performance of each phase in a multi-phase program. Because the phases of a program differ from one another, the optimal distribution changes with the characteristics of each phase as well as those of the following phase. Data redistribution at runtime is therefore critical to achieving good load balancing, improved data locality and reduced inter-processor communication. In this study, formulas for message generation, three scheduling algorithms (for a single cluster, for multiple clusters, and for a multi-cluster system with multi-core machines) and a power-saving technique are proposed to solve GEN_BLOCK redistribution problems. The message generation formulas provide the source, destination and data information that the scheduling algorithms need in order to produce effective results. Each node can use the formulas to obtain this information simply, effectively and independently. An effective scheduling algorithm for a cluster system is proposed for heterogeneous environments. It not only guarantees the minimal number of schedule steps but also reduces communication cost. Multi-cluster computing performs GEN_BLOCK redistribution over complex networks and heterogeneous processors. To adapt to this architecture, a new scheduling algorithm is proposed that gives better results in terms of communication cost. This technique classifies transmissions among clusters into three types and schedules transmissions inside a node together to avoid synchronization delay.
When multi-core machines are part of a parallel system, it is doubtful that existing scheduling algorithms deliver good performance. In addition, existing scheduling algorithms do not consider efficient power-saving techniques. Therefore, four kinds of transmission time are designed for messages to increase scheduling efficiency. While the proposed scheduling algorithm runs, the power-saving technique is also executed to evaluate the voltage value that saves energy for each core of the complicated system.
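As a rough illustration of what message generation for GEN_BLOCK redistribution involves, the sketch below derives the messages from two lists of per-processor block sizes by intersecting index ranges. The abstract does not give the thesis's formulas, so this interval-overlap approach, the function names (`block_bounds`, `redistribution_messages`) and the tuple layout are assumptions; scheduling and power saving are not covered.

```python
from itertools import accumulate


def block_bounds(sizes):
    """Turn per-processor block sizes into half-open [start, end) index ranges."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def redistribution_messages(src_sizes, dst_sizes):
    """List (source, destination, start, end) for every non-empty overlap.

    A GEN_BLOCK distribution assigns consecutive, possibly unequal blocks of
    an array to processors; a message is needed wherever a source block and
    a destination block share elements.
    """
    if sum(src_sizes) != sum(dst_sizes):
        raise ValueError("both distributions must cover the same array")
    messages = []
    for sp, (s_lo, s_hi) in enumerate(block_bounds(src_sizes)):
        for dp, (d_lo, d_hi) in enumerate(block_bounds(dst_sizes)):
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:  # the two blocks overlap in [lo, hi)
                messages.append((sp, dp, lo, hi))
    return messages
```

For example, redistributing a 12-element array from block sizes `[4, 2, 6]` to `[3, 5, 4]` yields five messages, two of which (processor 0 to 0 and processor 1 to 1) stay on the same processor in this sketch. Because the computation depends only on the two size lists, each node could evaluate its own row of the result without communicating, which matches the independence property the abstract describes.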
24

Nikjah, Reza. "Performance evaluation and protocol design of fixed-rate and rateless coded relaying networks." Phd thesis, 2010. http://hdl.handle.net/10048/1674.

Abstract:
The importance of cooperative relaying communication in substituting for, or complementing, multiantenna systems is described, and a brief literature review is presented. Amplify-and-forward (AF) and decode-and-forward (DF) relaying are investigated and compared for a dual-hop relay channel. The optimal strategy, source and relay optimal power allocation, and maximum cooperative gain are determined for the relay channel. It is shown that while DF relaying is preferable to AF relaying for strong source-relay links, AF relaying leads to more gain for strong source-destination or relay-destination links. Superimposed and selection AF relaying are investigated for multirelay, dual-hop relaying. Selection AF relaying is shown to be globally strictly outage suboptimal. A necessary condition for the selection AF outage optimality, and an upper bound on the probability of this optimality are obtained. A near-optimal power allocation scheme is derived for superimposed AF relaying. The maximum instantaneous rates, outage probabilities, and average capacities of multirelay, dual-hop relaying schemes are obtained for superimposed, selection, and orthogonal DF relaying, each with parallel channel cooperation (PCC) or repetition-based cooperation (RC). It is observed that the PCC over RC gain can be as much as 4 dB for the outage probabilities and 8.5 dB for the average capacities. Increasing the number of relays deteriorates the capacity performance of orthogonal relaying, but improves the performances of the other schemes. The application of rateless codes to DF relaying networks is studied by investigating three single-relay protocols, one of which is new, and three novel, low complexity multirelay protocols for dual-hop networks. The maximum rate and minimum energy per bit and per symbol are derived for the single-relay protocols under a peak power and an average power constraint. 
The long-term average rate and energy per bit, and relay-to-source usage ratio (RSUR), a new performance measure, are evaluated for the single-relay and multirelay protocols. The new single-relay protocol is the most energy efficient single-relay scheme in most cases. All the multirelay protocols exhibit near-optimal rate performances, but are vastly different in the RSUR. Several future research directions for fixed-rate and rateless coded cooperative systems, and frameworks for comparing these systems, are suggested.
Communications