Dissertations / Theses on the topic 'Algorithmes GPU'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Algorithmes GPU.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Ballage, Marion. "Algorithmes de résolution rapide de problèmes mécaniques sur GPU." Thesis, Toulouse 3, 2017. http://www.theses.fr/2017TOU30122/document.
Full text
Generating a conformal mesh on complex geometries leads to large model sizes in structural finite element simulations. The meshing time is directly linked to the geometry complexity and can contribute significantly to the total turnaround time. Graphics processing units (GPUs) are highly parallel programmable processors, delivering real performance gains on computationally complex, large problems. GPUs are used to implement a new finite element method on a Cartesian mesh. A Cartesian mesh is well adapted to the parallelism needed by GPUs and reduces the meshing time to almost zero. The novel method relies on the finite element method and the extended finite element formulation. The extended finite element method was introduced in the field of fracture mechanics. It consists in enriching the basis functions to take care of the geometry and the interface. This method does not need a conformal mesh to represent cracks and avoids refining during their propagation. Our method is based on the extended finite element method, with a geometry implicitly defined, which allows for a good approximation of the geometry and boundary conditions without a conformal mesh. To represent the model on a Cartesian grid, we use a level set representing a density. This density is greater than 0.5 inside the domain and less than 0.5 outside. It takes the value 0.5 on the boundary. A new integration technique is proposed, adapted to this geometrical representation. For elements cut by the level set, only the part full of material has to be integrated, so Gauss quadrature is no longer suitable. We introduce a quadrature method with integration points on a dense Cartesian grid. In order to reduce the computational effort, a learning approach is then considered to form the elementary stiffness matrices as functions of the density values on the vertices of the elements. This learning method reduces the computation time of the stiffness matrices. Results obtained after analysis by the finite element method or the novel finite element method can have a large storage size, depending on the model complexity and the exactness of the resolution scheme. Due to the limited memory of graphics processing units, the result data are compressed. We compress the model and the finite element results with a wavelet transform. The compression helps with storage issues and also with data visualization.
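As a hedged illustration of the cut-element quadrature idea described in this abstract (not the thesis code; all names such as densityAt, integrateCutElements and N_SUB are assumptions): integration points are laid on a dense Cartesian sub-grid inside each element, and only the points where the interpolated level-set density exceeds 0.5 contribute to the element integral.

```cuda
// Hypothetical bilinear interpolation of the nodal density field inside
// one square element; d[4] holds the densities at the element's corners.
__device__ float densityAt(const float d[4], float x, float y) {
    return d[0] * (1.f - x) * (1.f - y) + d[1] * x * (1.f - y)
         + d[2] * (1.f - x) * y         + d[3] * x * y;
}

// One thread block per cut element: threads sweep a dense N_SUB x N_SUB
// grid of integration points and accumulate the material area only where
// the density is above the 0.5 iso-value.
// Launch with <<<numElements, N_SUB>>> so blockDim.x == N_SUB.
#define N_SUB 32
__global__ void integrateCutElements(const float* nodalDensity, // 4 per element
                                     float* materialArea,       // 1 per element
                                     int numElements) {
    int e = blockIdx.x;
    if (e >= numElements) return;
    __shared__ float partial[N_SUB];
    const float* d = &nodalDensity[4 * e];
    float acc = 0.f;
    float h = 1.f / N_SUB;                    // sub-cell size (unit element)
    for (int i = threadIdx.x; i < N_SUB * N_SUB; i += blockDim.x) {
        float x = (i % N_SUB + 0.5f) * h;
        float y = (i / N_SUB + 0.5f) * h;
        if (densityAt(d, x, y) > 0.5f)        // keep only the material part
            acc += h * h;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();
    // Simple tree reduction over the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) materialArea[e] = partial[0];
}
```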
Luong, Thé Van. "Métaheuristiques parallèles sur GPU." Thesis, Lille 1, 2011. http://www.theses.fr/2011LIL10058/document.
Full text
Real-world optimization problems are often complex and NP-hard. Their modeling is continuously evolving in terms of constraints and objectives, and their resolution is CPU time-consuming. Although near-optimal algorithms such as metaheuristics (generic heuristics) make it possible to reduce the temporal complexity of their resolution, they fail to tackle large problems satisfactorily. Over the last decades, parallel computing has been revealed as an unavoidable way to deal with large instances of difficult optimization problems. The design and implementation of parallel metaheuristics are strongly influenced by the computing platform. GPU computing has recently been revealed effective to deal with time-intensive problems. This new emerging technology is believed to be extremely useful to speed up many complex algorithms. One of the major issues for metaheuristics is to rethink existing parallel models and programming paradigms to allow their deployment on GPU accelerators. Generally speaking, the major issues we have to deal with are: the distribution of data processing between CPU and GPU, thread synchronization, the optimization of data transfer between the different memories, the memory capacity constraints, etc. The contribution of this thesis is to deal with such issues in the redesign of parallel models of metaheuristics to allow the solving of large-scale optimization problems on GPU architectures. Our objective is to rethink the existing parallel models and to enable their deployment on GPUs. Thereby, we propose in this document a new generic guideline for building efficient parallel metaheuristics on GPU. Our challenge is to come out with the GPU-based design of the whole hierarchy of parallel models. For this purpose, very efficient approaches are proposed for CPU-GPU data transfer optimization, thread control, mapping of solutions to GPU threads, and memory management. These approaches have been exhaustively experimented using five optimization problems and four GPU configurations. Compared to a CPU-based execution, experiments report up to 80-fold acceleration for large combinatorial problems and up to 2000-fold speed-up for a continuous problem. The different works related to this thesis have been accepted in a dozen publications, including the IEEE Transactions on Computers journal.
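As an illustration of the "mapping of solutions to GPU threads" idea discussed above, here is a minimal CUDA sketch in which each GPU thread evaluates one neighbor of the current solution of a local search (here, a 2-opt move on a tour with a dense distance matrix). The kernel and all names are illustrative assumptions, not the thesis code.

```cuda
// One thread per neighbor: neighbor k replaces tour edges (a,b) and (c,d)
// by (a,c) and (b,d). The host launches with numNeighbors = n*(n-1)/2
// threads and then reduces delta[] to pick the best move.
__global__ void evaluate2optNeighborhood(const int* tour, const float* dist,
                                         int n, float* delta, int numNeighbors) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numNeighbors) return;
    // Decode the k-th candidate pair (i, j) with i < j from the linear index.
    int i = 0, r = k;
    while (r >= n - 1 - i) { r -= n - 1 - i; ++i; }
    int j = i + 1 + r;
    int a = tour[i], b = tour[(i + 1) % n];
    int c = tour[j], d = tour[(j + 1) % n];
    // Cost change of the 2-opt move (negative means an improving neighbor).
    delta[k] = dist[a * n + c] + dist[b * n + d]
             - dist[a * n + b] - dist[c * n + d];
}
```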
Viard, Thomas. "Algorithmes de visualisation des incertitudes en géomodélisation sur GPU." Thesis, Vandoeuvre-les-Nancy, INPL, 2010. http://www.theses.fr/2010INPL042N/document.
Full text
Most of the subsurface is inaccessible to direct observation in geosciences. Consequently, only local or imprecise data are available when building or updating a geological model; uncertainties are therefore central to geomodeling. Inverse problem theory and stochastic simulation methods provide a framework for the generation of large sets of likely representations of the subsurface, also termed realizations. In practice, however, the size of the set of realizations severely impacts further interpretation or processing of the geological model. This thesis aims at providing expert geologists with visualization algorithms that allow them to explore, analyze and communicate the spatial uncertainties associated with large sets of realizations. Our contributions are: (1) we propose a set of techniques dedicated to petrophysical uncertainty visualization, based on a GPU programming approach that maintains their interoperability; (2) we propose two techniques dedicated to structural uncertainty visualization that can handle both geometrical and topological uncertainties (e.g., the existence of a surface or its relationships with other surfaces); (3) we assess the quality of our uncertainty visualization algorithms through two user studies, which respectively focus on the perception of static and animated methods. These studies bring new insights into how uncertainty should be represented.
Lefèbvre, Matthieu. "Algorithmes sur GPU pour la simulation numérique en mécanique des fluides." Paris 13, 2012. http://scbd-sto.univ-paris13.fr/intranet/edgalilee_th_2012_lefebvre.pdf.
Full text
Numerical simulations in fluid mechanics require tremendous computational power; GPU computing is one of the newest approaches to accelerate such simulations. On one hand, this thesis studies the case of fluid mechanics algorithms on structured meshes. The mesh structure naturally brings well-suited memory arrangements and allows guidelines to be derived for using GPUs in numerical simulations. On the other hand, we examine the case of fluid mechanics on unstructured meshes with the help of three different algorithmic strategies. The first of these techniques is a reorganization of data to produce consecutive accesses, at the cost of expensive data copies in both time and memory. The second technique, a cell partitioning approach, is developed and allows extensive use of modern GPUs' cache memories. The third technique consists of a generic refinement: the initial mesh is made of coarse elements that are all refined in the exact same way in order to produce consecutive memory accesses. This approach brings significant performance improvements for fluid mechanics simulations on unstructured meshes.
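A minimal CUDA sketch of the data-reordering idea: unstructured-mesh values are gathered into a new array following a precomputed permutation so that threads of a warp read and write consecutive memory. The permutation itself (e.g., from a renumbering of the mesh) is assumed given; all names are illustrative.

```cuda
// Gather node values into the new ordering: thread t writes position t,
// so writes are fully coalesced; reads follow the (irregular) permutation.
__global__ void reorderValues(const float* valuesOld, const int* newToOld,
                              float* valuesNew, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) valuesNew[t] = valuesOld[newToOld[t]];
}
```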
Marin, Manuel. "GPU-enhanced power flow analysis." Thesis, Perpignan, 2015. http://www.theses.fr/2015PERP0041.
Full text
This thesis addresses the utilization of Graphics Processing Units (GPUs) for improving the Power Flow (PF) analysis of modern power systems. Currently, GPUs are challenged by applications exhibiting an irregular computational pattern, as is the case of most known methods for PF analysis. At the same time, PF analysis needs to be improved in order to cope with new requirements of efficiency and accuracy coming from the Smart Grid concept. The relevance of GPU-enhanced PF analysis is twofold. On one hand, it expands the application domain of GPUs to a new class of problems. On the other hand, it consistently increases the computational capacity available for power system operation and design. The present work attempts to achieve that in two complementary ways: (i) by developing novel GPU programming strategies for available PF algorithms, and (ii) by proposing novel PF analysis methods that can exploit the numerous features present in GPU architectures. Specific contributions on GPU computing include: (i) a comparison of two programming paradigms, namely regularity and load-balancing, for implementing the so-called treefix operations; (ii) a study of the impact of the representation format on performance and accuracy for fuzzy interval algebraic operations; and (iii) the utilization of architecture-specific design as a novel strategy to improve the performance scalability of applications. Contributions on PF analysis include: (i) the design and evaluation of a novel method for uncertainty assessment, based on the fuzzy interval approach; and (ii) the development of an intrinsically parallel method for PF analysis, which is not affected by Amdahl's law.
Van, Luong Thé. "Métaheuristiques parallèles sur GPU." Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2011. http://tel.archives-ouvertes.fr/tel-00638820.
Full text
Buatois, Luc. "Algorithmes sur GPU de visualisation et de calcul pour des maillages non-structurés." Phd thesis, Institut National Polytechnique de Lorraine - INPL, 2008. http://tel.archives-ouvertes.fr/tel-00331935.
Full text
Chakroun, Imen. "Algorithmes Branch and Bound parallèles hétérogènes pour environnements multi-coeurs et multi-GPU." Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2013. http://tel.archives-ouvertes.fr/tel-00841965.
Full text
Mansouri, Abdelkhalek. "Generic heuristics on GPU to superpixel segmentation and application to optical flow estimation." Thesis, Bourgogne Franche-Comté, 2020. http://www.theses.fr/2020UBFCA012.
Full text
Finding clusters in point clouds and matching graphs to graphs are recurrent tasks in computer science, data analysis, and image processing that are most often modeled as NP-hard optimization problems. With the development and accessibility of cheap multiprocessors, acceleration of the heuristic procedures for these tasks becomes possible and necessary. We propose parallel implementations on GPU (graphics processing unit) systems of some generic algorithms, applied here to image superpixel segmentation and the image optical flow problem. The aim is to provide generic algorithms based on standard decentralized data structures that are easy to improve and customize for many optimization problems and parallel platforms. The proposed parallel algorithm implementations include the classical k-means algorithm and the application of minimum spanning forest computation for superpixel segmentation. They also include a parallel local search procedure and a population-based memetic algorithm applied to optical flow estimation based on superpixel matching. While data operations fully exploit the GPU, the memetic algorithm operates like a coalition of processes executed in parallel on the multi-core CPU and requesting GPU resources. Images are point clouds in 3D Euclidean space (the space-gray value domain), and are also graphs to which processor grids are assigned. GPU kernels execute parallel transformations under CPU control, whose limited role consists only in evaluating stopping criteria or sequencing transformations. The presented contribution contains two main parts. Firstly, we present tools for superpixel segmentation. A parallel implementation of the k-means algorithm is presented with application to 3D data. It is based on a cellular grid subdivision of 3D space that allows closest-point finding in constant optimal time for bounded distributions. We present an application of the parallel Boruvka minimum spanning tree algorithm to compute watershed minimum spanning forests. Secondly, based on the generated superpixels and segmentation, we derive parallel optimization procedures for optical flow estimation with edge-aware filtering. The method includes construction and improvement heuristics, such as winner-take-all and parallel local search, and their embedding into a population-based metaheuristic framework. The algorithms are presented and evaluated in comparison to state-of-the-art algorithms.
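A minimal CUDA sketch of the parallel k-means assignment step mentioned above: one thread per point computes its nearest centroid. The cellular-grid acceleration and the centroid-update step are omitted; names are illustrative.

```cuda
#include <cfloat>

// One thread per 3D point: brute-force nearest-centroid assignment.
// (The thesis accelerates this search with a cellular grid; this sketch
// shows only the massively parallel structure of the assignment step.)
__global__ void assignToNearestCentroid(const float3* points,
                                        const float3* centroids,
                                        int* label, int numPoints, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;
    float3 p = points[i];
    float best = FLT_MAX;
    int bestC = 0;
    for (int c = 0; c < k; ++c) {
        float dx = p.x - centroids[c].x;
        float dy = p.y - centroids[c].y;
        float dz = p.z - centroids[c].z;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 < best) { best = d2; bestC = c; }
    }
    label[i] = bestC;
}
```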
Legrand, Hélène. "Algorithmes parallèles pour le traitement rapide de géométries 3D." Electronic Thesis or Diss., Paris, ENST, 2017. http://www.theses.fr/2017ENST0053.
Full text
Over the last twenty years, the main signal processing concepts have been adapted for digital geometry, in particular for 3D polygonal meshes. However, the processing time required for large models is significant. This computational load becomes an obstacle in the current context, where the massive amounts of data that are generated every second may need to be processed with several operators. The ability to run geometry processing operators with strong time constraints is a critical challenge in dynamic 3D systems. In this context, we seek to speed up some of the current algorithms by several orders of magnitude, and to reformulate or approximate them in order to reduce their complexity or make them parallel. In this thesis, we build on a compact and effective object to analyze 3D surfaces at different scales: error quadrics. In particular, we propose new high-performance algorithms that maintain error quadrics on the surface to represent the geometry. One of the main challenges lies in the effective generation of the right structures for parallel processing, in order to take advantage of the GPU.
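As a hedged sketch of how error quadrics can be maintained in parallel on a GPU (an illustration of the standard quadric construction, not the thesis's algorithm): each face's plane quadric Q = p p^T is computed by one thread and atomically accumulated onto the face's vertices.

```cuda
#include <cuda_runtime.h>

// Accumulate per-face plane quadrics Q = p p^T (p = (a,b,c,d), the face
// plane) onto vertex quadrics, stored as 10 floats per vertex (the upper
// triangle of the symmetric 4x4 matrix). Illustrative sketch only.
__global__ void accumulateQuadrics(const float3* v, const int3* tri,
                                   float* vertexQ /* 10 per vertex */,
                                   int numTris) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;
    int3 f = tri[t];
    float3 p0 = v[f.x], p1 = v[f.y], p2 = v[f.z];
    // Face normal (a, b, c) and plane offset d, with the normal normalized.
    float3 e1 = make_float3(p1.x - p0.x, p1.y - p0.y, p1.z - p0.z);
    float3 e2 = make_float3(p2.x - p0.x, p2.y - p0.y, p2.z - p0.z);
    float a = e1.y * e2.z - e1.z * e2.y;
    float b = e1.z * e2.x - e1.x * e2.z;
    float c = e1.x * e2.y - e1.y * e2.x;
    float len = sqrtf(a * a + b * b + c * c);
    if (len == 0.f) return;                  // degenerate face
    a /= len; b /= len; c /= len;
    float d = -(a * p0.x + b * p0.y + c * p0.z);
    float p[4] = {a, b, c, d};
    // Upper triangle of p p^T, row-major: 10 coefficients.
    float q[10]; int k = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = i; j < 4; ++j) q[k++] = p[i] * p[j];
    int verts[3] = {f.x, f.y, f.z};
    for (int s = 0; s < 3; ++s)              // concurrent faces share vertices,
        for (int i = 0; i < 10; ++i)         // hence atomic accumulation
            atomicAdd(&vertexQ[10 * verts[s] + i], q[i]);
}
```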
Recur, Benoît. "Précision et qualité en reconstruction tomographique : algorithmes et applications." Thesis, Bordeaux 1, 2010. http://www.theses.fr/2010BOR14113/document.
Full text
A wide range of methods is now available to acquire an object in a non-destructive way (X-ray scanner, micro-scanner, terahertz waves, transmission electron microscopy, etc.). These tools acquire a projection set around the object, and a reconstruction step leads to a representation of the acquired domain. The main limitation of these methods is that they rely on a continuous domain modeling whereas they compute in a finite domain. The resulting discretization step introduces errors in the obtained images. Moreover, the acquisition step is not performed ideally and may be corrupted by artifacts and noise. Many direct or iterative methods have been developed to try to reduce errors and to give an image more representative of reality. An overview of these reconstructions is proposed, enriched with a study on quality, precision, and noise robustness. Since the discretization is one of the major limitations, we try to adjust discrete methods for the reconstruction of real data. These methods are accurate in a finite domain but are not suitable for real acquisitions, especially because of their error sensitivity. Therefore, we propose a link between the two worlds and we develop new discrete and noise-robust methods. Finally, we are interested in the missing-data problem, i.e., when the acquisition is not uniform around the object, introducing deformations in the reconstructed images. Since discrete reconstructions are insensitive to this effect, we propose a first solution using the tools developed previously.
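The reconstruction step discussed here is a classic GPU workload. As a hedged illustration (a textbook parallel-beam backprojection, not the thesis's method), each thread accumulates one output pixel over all projection angles:

```cuda
// One thread per reconstructed pixel: accumulate the (pre-filtered)
// projections over all angles (parallel-beam geometry, nearest-neighbor
// detector sampling). sino is numAngles x numDet; image is n x n.
__global__ void backproject(const float* sino, float* image,
                            const float* cosA, const float* sinA,
                            int numAngles, int numDet, int n) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= n || y >= n) return;
    float cx = x - 0.5f * n, cy = y - 0.5f * n;   // center the pixel grid
    float acc = 0.f;
    for (int a = 0; a < numAngles; ++a) {
        float s = cx * cosA[a] + cy * sinA[a] + 0.5f * numDet; // detector coord
        int si = (int)(s + 0.5f);
        if (si >= 0 && si < numDet) acc += sino[a * numDet + si];
    }
    image[y * n + x] = acc * (3.14159265f / numAngles);
}
```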
Romera, Thomas. "Adéquation algorithme architecture pour flot optique sur GPU embarqué." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS450.
Full text
This thesis focuses on the optimization and efficient implementation of pixel motion (optical flow) estimation algorithms on embedded graphics processing units (GPUs). Two iterative algorithms have been studied: the Total Variation - L1 (TV-L1) method and the Horn-Schunck method. The primary objective of this work is to achieve real-time processing, with a target frame processing time of less than 40 milliseconds, on low-power platforms, while maintaining acceptable image resolution and flow estimation quality for the intended applications. Various levels of optimization strategies have been explored. High-level algorithmic transformations, such as operator fusion and operator pipelining, have been implemented to maximize data reuse and enhance spatial/temporal locality. Additionally, GPU-specific low-level optimizations, including the utilization of vector instructions and vector data types, as well as efficient memory access management, have been incorporated. The impact of floating-point number representation (single-precision versus half-precision) has also been investigated. The implementations have been assessed on Nvidia's Jetson Xavier, TX2, and Nano embedded platforms in terms of execution time, power consumption, and optical flow accuracy. Notably, the TV-L1 method exhibits higher complexity and computational intensity compared to Horn-Schunck. The fastest versions of these algorithms achieve a processing rate of 0.21 nanoseconds per pixel per iteration in half-precision on the Xavier platform, representing a 22x time reduction over efficient and parallel CPU versions. Furthermore, energy consumption is reduced by a factor of 5.3. Among the tested boards, the Xavier embedded platform, being both the most powerful and the most recent, consistently delivers the best results in terms of speed and energy efficiency. Operator merging and pipelining have proven to be instrumental in improving GPU performance by enhancing data reuse. This data reuse is made possible through GPU shared memory, a small, high-speed memory that enables data sharing among threads within the same GPU thread block. While merging multiple iterations yields performance gains, it is constrained by the size of the shared memory, necessitating trade-offs between resource utilization and speed. The adoption of half-precision numbers accelerates iterative algorithms and achieves superior optical flow accuracy within the same time frame compared to single-precision counterparts. Half-precision implementations converge more rapidly due to the increased number of iterations possible within a given time window. Specifically, the use of half-precision numbers on the best GPU architecture accelerates execution by up to 2.2x for TV-L1 and 3.7x for Horn-Schunck. This work underscores the significance of GPU-specific optimizations for computer vision algorithms, along with the use and study of reduced-precision floating-point numbers. They pave the way for future enhancements through new algorithmic transformations, alternative numerical formats, and hardware architectures. This approach can potentially be extended to other families of iterative algorithms.
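A minimal sketch of what operator fusion means for Horn-Schunck (an illustration of the standard update, not the thesis implementation): the neighborhood averaging and the flow update are merged into one kernel so the averaged values stay in registers instead of a temporary global array.

```cuda
// One fused Horn-Schunck Jacobi step. Float version; a half-precision
// variant would replace float with __half2 arithmetic, as studied above.
__global__ void hornSchunckStep(const float* Ix, const float* Iy, const float* It,
                                const float* u, const float* v,
                                float* uNew, float* vNew,
                                int w, int h, float alpha2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int i = y * w + x;
    // 4-neighbor average with clamped borders (operator 1).
    int xm = max(x - 1, 0), xp = min(x + 1, w - 1);
    int ym = max(y - 1, 0), yp = min(y + 1, h - 1);
    float ub = 0.25f * (u[y * w + xm] + u[y * w + xp] + u[ym * w + x] + u[yp * w + x]);
    float vb = 0.25f * (v[y * w + xm] + v[y * w + xp] + v[ym * w + x] + v[yp * w + x]);
    // Flow update (operator 2), fused with the averaging above.
    float num = Ix[i] * ub + Iy[i] * vb + It[i];
    float den = alpha2 + Ix[i] * Ix[i] + Iy[i] * Iy[i];
    float s = num / den;
    uNew[i] = ub - Ix[i] * s;
    vNew[i] = vb - Iy[i] * s;
}
```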
Soua, Mahmoud. "Extraction hybride et description structurelle de caractères pour une reconnaissance efficace de texte dans les documents hétérogènes scannés : Méthodes et Algorithmes parallèles." Thesis, Paris Est, 2016. http://www.theses.fr/2016PESC1069/document.
Full text
Optical Character Recognition (OCR) is a process that converts text images into editable text documents. Today, these systems are widely used in dematerialization applications such as mail sorting, bill management, etc. In this context, the aim of this thesis is to propose an OCR system that provides a better compromise between recognition rate and processing speed, allowing reliable, real-time document dematerialization. To ensure its recognition, the text is first extracted from the background. Then, it is segmented into disjoint characters that are described based on their structural characteristics. Finally, the characters are recognized by comparing their descriptors with predefined ones. Text extraction based on binarization methods remains difficult in heterogeneous, scanned documents with a complex and noisy background, where the text may be confused with a textured background or degraded by noise. On the other hand, the description of characters, and the extraction of segments, is often complex, relying on the computation of geometrical transformations or polygons, involving a large number of characteristics, or giving low discrimination when the selected characteristics are sensitive to variations of scale, style, etc. For this, we adapt our algorithms to heterogeneous, scanned documents. We also provide high discrimination between characters, whose description is based on the study of the character structure according to its horizontal and vertical projections. To ensure real-time processing, we parallelize the developed algorithms on the graphics processor (GPU). Our main contributions in the proposed OCR system are as follows. A new binarization method for heterogeneous and scanned documents, including text regions with complex or homogeneous backgrounds. In this method, an image analysis process is used, followed by a classification of the document areas into images (text with a complex background) and text (text with a homogeneous background). For text regions, text extraction is performed using a hybrid method based on the k-means classification algorithm (CHK) that we have developed for this aim. This method combines local and global approaches. It improves the quality of text/background separation while minimizing the amount of distortion for text extraction from documents that are noisy because of the digitization process. The image areas are improved with Gamma Correction (CG) before applying CHK. According to our experiments, our text extraction method gives a 98% character recognition rate on heterogeneous scanned documents. A Unified Character Descriptor based on the study of the character structure. It employs a sufficient number of characteristics, resulting from the unification of the descriptors of the horizontal and vertical projections of the characters, for efficient discrimination. The advantage of this descriptor lies both in its high performance and in its simple computation. It supports the recognition of alphanumeric and multiscale characters. The proposed descriptor provides 100% character recognition for a given typeface and font size. Parallelization of the proposed character recognition system. The GPU graphics processor has been used as the parallelization platform. Flexible and powerful, this architecture provides an effective solution for accelerating intensive image processing algorithms. Our implementation combines coarse- and fine-grained parallelization strategies to speed up the steps of the OCR chain. In addition, CPU-GPU communication overheads are avoided and good memory management is ensured. The effectiveness of our implementation is validated through extensive experiments.
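A minimal CUDA sketch of the structural features behind the descriptor above: the horizontal and vertical projection profiles of a binary character image, computed with one thread per pixel (illustrative, not the thesis code).

```cuda
// Compute the row and column foreground-pixel counts of a binary
// character image; rowSum and colSum must be zero-initialized.
__global__ void projectionProfiles(const unsigned char* img, int w, int h,
                                   int* rowSum, int* colSum) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    if (img[y * w + x]) {            // foreground pixel
        atomicAdd(&rowSum[y], 1);    // horizontal projection
        atomicAdd(&colSum[x], 1);    // vertical projection
    }
}
```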
He, Guanlin. "Parallel algorithms for clustering large datasets on CPU-GPU heterogeneous architectures." Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG062.
Full text
Clustering, which aims at achieving natural groupings of data, is a fundamental and challenging task in machine learning and data mining. Numerous clustering methods have been proposed in the past, among which k-means is one of the most famous and commonly used due to its simplicity and efficiency. Spectral clustering is a more recent approach that usually achieves higher clustering quality than k-means. However, classical algorithms for spectral clustering suffer from a lack of scalability due to their high complexity in terms of number of operations and memory space requirements. This scalability challenge can be addressed by applying approximation methods or by employing parallel and distributed computing. The objective of this thesis is to accelerate spectral clustering and make it scalable to large datasets by combining representatives-based approximation with parallel computing on CPU-GPU platforms. Considering different scenarios, we propose several parallel processing chains for large-scale spectral clustering. We design optimized parallel algorithms and implementations for each module of the proposed chains: parallel k-means on CPU and GPU, parallel spectral clustering on GPU using a sparse storage format, parallel filtering of data noise on GPU, etc. Our various experiments reach high performance and validate the scalability of each module and of the complete chains.
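The core GPU workload in spectral clustering is sparse linear algebra on the similarity or Laplacian matrix. As a hedged illustration of the kind of "sparse storage format" module mentioned above (a textbook kernel, not the thesis code), here is a sparse matrix-vector product in CSR format:

```cuda
// y = A * x with A in CSR format: one thread per row (a simple variant;
// production code typically assigns a warp per row to coalesce accesses).
__global__ void spmvCsr(const int* rowPtr, const int* colIdx, const float* val,
                        const float* x, float* y, int numRows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRows) return;
    float acc = 0.f;
    for (int j = rowPtr[r]; j < rowPtr[r + 1]; ++j)
        acc += val[j] * x[colIdx[j]];
    y[r] = acc;
}
```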
Mabrouk, Lhoussein. "Contribution à l'implémentation des algorithmes de vision avec parallélisme massif de données sur les architectures hétérogènes CPU/GPU." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALT009.
Full text
Mixture of Gaussians (MoG) and Compressive Sensing (CS) are two common algorithms in many image and audio processing systems. Their combination, CS-MoG, was recently used for mobile object detection through background subtraction. However, the implementations of CS-MoG presented in previous works do not take full advantage of the evolution of heterogeneous architectures. This thesis proposes two contributions for the efficient implementation of CS-MoG on heterogeneous parallel CPU/GPU architectures. These technologies nowadays offer great programming flexibility, which allows optimizing performance as well as energy efficiency. Our first contribution consists in offering the best acceleration-precision compromise on CPU and GPU. The second is the proposition of a new adaptive approach for data partitioning, allowing full exploitation of the CPUs and GPUs. Whatever their relative performance, this approach, called the Optimal Data Distribution Cursor (ODDC), aims to ensure automatic balancing of the computational load by estimating the optimal proportion of data that has to be assigned to each processor, taking into account its computing capacity. This approach updates the partitioning online, which allows taking into account any influence of the irregularity of the content of the processed images. In terms of mobile objects, we mainly target vehicles, whose detection presents some challenges, but in order to generalize our approach, we also test scenes containing other types of targets. Experimental results, on different platforms and datasets, show that the combination of our two contributions makes it possible to reach 98% of the maximum possible performance on these platforms. These results can also be beneficial for other algorithms where calculations are performed independently on small grains of data.
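A minimal host-side sketch of the ODDC idea (illustrative names and smoothing constants, not the thesis code): split each frame between CPU and GPU in proportion to their measured throughputs, and update the split online after every frame.

```cuda
// Host-side load balancer: keep the fraction of rows given to the GPU
// proportional to its observed throughput, re-estimated every frame.
struct OddcCursor {
    double gpuShare = 0.5;                        // initial 50/50 split
    void update(double cpuTimeMs, double gpuTimeMs,
                double cpuRows, double gpuRows) {
        double cpuRate = cpuRows / cpuTimeMs;     // rows per millisecond
        double gpuRate = gpuRows / gpuTimeMs;
        double target = gpuRate / (cpuRate + gpuRate);
        gpuShare = 0.7 * gpuShare + 0.3 * target; // smooth online update
        if (gpuShare < 0.05) gpuShare = 0.05;     // keep both devices busy
        if (gpuShare > 0.95) gpuShare = 0.95;
    }
    int gpuRows(int totalRows) const { return (int)(gpuShare * totalRows); }
};
```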
Perrotte, Lancelot. "Algorithmes robustes de lancer de rayons pour calcul de dose interactif." Toulouse, ISAE, 2011. http://www.theses.fr/2011ESAE0005.
Full text
Luo, Jia. "Algorithmes génétiques parallèles pour résoudre des problèmes d'ordonnancement de tâches dynamiques de manière efficace en prenant en compte l'énergie." Thesis, Toulouse 3, 2019. http://www.theses.fr/2019TOU30001.
Full text
Due to new government legislation, customers' environmental concerns and the continuously rising cost of energy, energy efficiency has become an essential parameter of industrial manufacturing processes in recent years. Most efforts considering energy issues in scheduling problems have focused on static scheduling. But in fact, scheduling problems are dynamic in the real world, with uncertain new jobs arriving after execution has started. In this thesis, two energy-efficient dynamic scheduling problems are studied. Model I analyzes the total tardiness and the makespan with a peak power limitation while considering the flexible flow shop with newly arrived jobs. A periodic complete rescheduling approach is adopted to represent the optimization problem. Model II concerns an investigation into minimizing total tardiness and total energy consumption in the job shop with new urgent arrival jobs. An event-driven schedule repair approach is utilized to deal with the updated schedule. As an adequate renewed scheduling plan needs to be obtained in a short response time in a dynamic environment, two parallel Genetic Algorithms (GAs) are proposed to solve these two models respectively. Parallel GA I is a CUDA-based hybrid model consisting of an island GA at the upper level and a fine-grained GA at the lower level. It combines metrics of two hierarchical layers and takes full advantage of CUDA's compute capability. Parallel GA II is a dual heterogeneous design composed of a cellular GA and a pseudo GA. The islands with these two different structures increase population diversity and can be well parallelized on GPUs simultaneously with a multi-core CPU. Finally, numerical experiments are conducted and show that our approaches can not only solve the problems flexibly, but also obtain competitive results and reduce time requirements.
Delevacq, Audrey. "Métaheuristiques pour l'optimisation combinatoire sur processeurs graphiques (GPU)." Thesis, Reims, 2013. http://www.theses.fr/2013REIMS011/document.
Full text
Several combinatorial optimization problems are NP-hard and can only be solved optimally by exact algorithms for small instances. Metaheuristics have proved to be effective in solving many of these problems by finding approximate solutions in a reasonable time. However, when dealing with large instances, they may require considerable computation time and memory space to explore the search space efficiently. Therefore, the interest devoted to their deployment on high performance computing architectures has increased over the past years. Existing parallelization approaches generally follow the message-passing and shared-memory computing paradigms, which are suitable for traditional architectures based on microprocessors, also called CPUs (Central Processing Units). However, research in the field of parallel computing is rapidly evolving and new architectures emerge, including hardware accelerators which offload the CPU from some of its tasks. Among them, graphics processors or GPUs (Graphics Processing Units) have a massively parallel architecture with great potential but also imply new algorithmic and programming challenges. In fact, existing parallelization models of metaheuristics are generally unsuited to computing environments like GPUs. Few works have tackled this subject, and they do not provide a comprehensive and fundamental view of it. The general purpose of this thesis is to propose a framework for the effective implementation of metaheuristics on parallel architectures based on GPUs. It begins with a state of the art describing existing works on GPU parallelization of metaheuristics and general classifications of parallel metaheuristics. An original taxonomy is then designed to classify the identified implementations and to formalize GPU parallelization strategies in a coherent methodological framework. This thesis also aims to validate this taxonomy by exploiting its main components to propose original parallelization strategies specifically tailored to GPU architectures. Several effective implementations based on the Ant Colony Optimization and Iterated Local Search metaheuristics are thus proposed for solving the Travelling Salesman Problem. A structured and thorough experimental study is conducted to evaluate and compare the performance of the approaches on criteria related to solution quality and computing time reduction.
Roccia, Jean-Patrick. "Accélération de l'algorithme du lancer de rayons en environnement parallèle hétérogène." Toulouse 3, 2013. http://thesesups.ups-tlse.fr/2008/.
Full text
Ray tracing achieves high accuracy in simulating radiative physical phenomena, in diverse areas. However, despite its relative simplicity, many problems arise when it comes to performance. Modern hardware architectures offer growing parallelism, symbolized by a growing number of available cores, whether on CPU or GPU. The ray tracing algorithm should take advantage of this available computing power. Indeed, instead of treating the rays sequentially, parallel processing can significantly increase performance. The contributions of this thesis cover all the main elements of ray tracing acceleration. The first one speeds up the construction of a high-quality acceleration structure, a KD-Tree: this method accelerates the construction by jointly using CPU and GPU, and outperforms previously published methods. The second contribution spreads the KD-Tree traversal steps between CPU and GPU transparently for users. The third contribution concerns the ray-triangle intersection test applied to the KD-Tree and maximizes the use of SIMD instructions on CPU in the intersection test while controlling memory usage and maintaining performance on GPU. Finally, the last contribution is more general, and automatically and transparently spreads parallel computation tasks between heterogeneous computing units.
Jamal, Aygul. "A parallel iterative solver for large sparse linear systems enhanced with randomization and GPU accelerator, and its resilience to soft errors." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS269/document.
Full text
In this PhD thesis, we address three challenges faced by linear algebra solvers in the perspective of future exascale systems: accelerating convergence using innovative techniques at the algorithm level, taking advantage of GPU (Graphics Processing Unit) accelerators to enhance the performance of computations on hybrid CPU/GPU systems, and evaluating the impact of errors in the context of an increasing level of parallelism in supercomputers. We are interested in studying methods that enable us to accelerate the convergence and execution time of iterative solvers for large sparse linear systems. The solver specifically considered in this work is the parallel Algebraic Recursive Multilevel Solver (pARMS), a distributed-memory parallel solver based on Krylov subspace methods. First, we integrate a randomization technique referred to as Random Butterfly Transformations (RBT) that has been successfully applied to remove the cost of pivoting in the solution of dense linear systems. Our objective is to apply this method in the ARMS preconditioner to solve more efficiently the last Schur complement system in the application of the recursive multilevel process in pARMS. The experimental results show an improvement in convergence and accuracy. Due to memory concerns for some test problems, we also propose to use a sparse variant of RBT followed by a sparse direct solver (SuperLU), resulting in an improvement of the execution time. Then we explain how a non-intrusive approach can be applied to implement GPU computing in the pARMS solver, especially for the local preconditioning phase, which represents a significant part of the time needed to compute the solution. We compare the CPU-only and hybrid CPU/GPU variants of the solver on several test problems coming from physical applications. The performance results of the hybrid CPU/GPU solver using the ARMS preconditioning combined with RBT, or the ILU(0) preconditioning, show a performance gain of up to 30% on the test problems considered in our experiments. Finally, we study the effect of soft fault errors on the convergence of the commonly used flexible GMRES (FGMRES) algorithm, which is also used to solve the preconditioned system in pARMS. The test problem in our experiments is an elliptic PDE problem on a regular grid. We consider two types of preconditioners: an incomplete LU factorization with dual threshold (ILUT), and the ARMS preconditioner combined with RBT randomization. We consider two soft fault error modeling approaches, in which we perturb the matrix-vector multiplication and the application of the preconditioner, and we compare their potential impact on the convergence of the solver.
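A hedged sketch of a depth-1 Random Butterfly Transformation of the kind the thesis builds on: the matrix A is two-sided-multiplied, U^T A V, by random butterfly matrices B = (1/sqrt(2)) [[R1, R2], [R1, -R2]] with random positive diagonals, which statistically removes the need for pivoting. This toy host-side version (illustrative names; dense square A of even size n) applies one butterfly level in place.

```cuda
#include <cmath>
#include <cstdlib>
#include <vector>

// Replace A by U^T * A * V for one random butterfly level.
void applyRbt(std::vector<double>& A, int n) {
    int m = n / 2;                        // n assumed even
    auto rnd = [] { return std::exp(0.1 * ((double)std::rand() / RAND_MAX - 0.5)); };
    std::vector<double> r1(m), r2(m), s1(m), s2(m);
    for (int i = 0; i < m; ++i) { r1[i] = rnd(); r2[i] = rnd(); s1[i] = rnd(); s2[i] = rnd(); }
    const double c = 1.0 / std::sqrt(2.0);
    // Left multiply by U^T: mixes row i with row i+m.
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double a = A[i * n + j], b = A[(i + m) * n + j];
            A[i * n + j]       = c * r1[i] * (a + b);
            A[(i + m) * n + j] = c * r2[i] * (a - b);
        }
    // Right multiply by V: mixes column j with column j+m.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            double a = A[i * n + j], b = A[i * n + j + m];
            A[i * n + j]     = c * s1[j] * (a + b);
            A[i * n + j + m] = c * s2[j] * (a - b);
        }
}
```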
Yu, Tianyi. "Décomposition en temps réel de signaux iEMG : filtrage bayésien implémenté sur GPU." Thesis, Ecole centrale de Nantes, 2019. http://www.theses.fr/2019ECDN0006/document.
Full text
A sequential decomposition algorithm based on a Hidden Markov Model of the EMG, using Bayesian filtering to estimate the unknown parameters of the discharge series of motor units, was previously proposed in the laboratory LS2N. This algorithm successfully decomposed an experimental iEMG signal with four motor units. However, it is highly time-consuming. In this work, we first validated the proposed algorithm in a serial structure. We proposed some modifications of the activation process of the recruitment model in the Hidden Markov Model and implemented two signal pre-processing techniques to improve the performance of the algorithm. Then, we realized a GPU-oriented implementation of this algorithm, as well as of the modifications applied to the original model, in order to achieve real-time performance. We achieved the decomposition of 10 experimental iEMG signals acquired from two different muscles, by fine wire electrodes and needle electrodes respectively. The number of motor units ranges from 2 to 8. The percentage of superposition, representing the complexity of the iEMG signal, ranges from 6.56% to 28.84%. The accuracy for almost all experimental iEMG signals is more than 90%, except for two signals at 30% MVC (more than 85%). Moreover, we realized the real-time decomposition of all these experimental signals with the parallel implementation. We are the first to realize real-time full decomposition of single-channel iEMG signals with up to 10 motor units, where full decomposition means resolving the superposition problem. For signals with more than 10 MUs, we can also decompose them quickly, but without reaching the real-time level.
Tran, Tuan Tu. "Comparaisons de séquences biologiques sur architecture massivement multi-coeurs." Thesis, Lille 1, 2012. http://www.theses.fr/2012LIL10138/document.
Full text
Searching for similarities between sequences is a fundamental operation in bioinformatics, providing insight into biological functions as well as tools for high-throughput data. There is a need for algorithms able to process billions of sequences efficiently. To look for approximate similarities, a common heuristic is to consider short words that appear exactly in both sequences, the seeds, and then to try to extend this similarity to the neighborhoods of the seeds. The thesis focuses on this second stage of seed-based heuristics: how can we retrieve and compare the neighborhoods of the seeds efficiently? The thesis proposes several solutions tailored for manycore processors such as today's GPUs. Such processors are making massively parallel computing more and more popular. The thesis proposes direct approaches (an extension of the bit-parallel Wu-Manber algorithm, published in PBC 2011, and binary search) and approaches with an additional index (with perfect hash functions). Each of these solutions was conceived to obtain as much fine-grained parallelism as possible, requiring intensive but homogeneous computational operations. All proposed methods were implemented in OpenCL and benchmarked. Finally, the thesis presents MAROSE, a prototype parallel read mapper using these concepts. In some situations, MAROSE is more efficient than existing read mappers with comparable sensitivity.
Seznec, Mickaël. "From the algorithm to the targets, optimization flow for high performance computing on embedded GPUs." Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG074.
Full text
Current digital processing algorithms require more computing power to achieve more accurate results and process larger data. In the meantime, hardware architectures are becoming more specialized, with highly efficient accelerators designed for specific tasks. In this context, the path of deployment from the algorithm to the implementation becomes increasingly complex. It is, therefore, crucial to determine how algorithms can be modified to take advantage of new hardware capabilities. Our study focused on graphics processing units (GPUs), massively parallel processors. Our algorithmic work was done in the context of radio-astronomy and optical flow estimation and consisted in finding the best adaptation of the software to the hardware. At the level of a mathematical operator, we modified the traditional image convolution algorithm to use the matrix units and showed that its performance doubles for large convolution kernels. At a broader method level, we evaluated linear solvers for the combined local-global optical flow to find the most suitable one on GPU. With additional optimizations, such as iteration fusion or memory buffer re-utilization, the method is twice as fast as the initial implementation, running at 60 frames per second on an embedded platform (30 W). Finally, we also pointed out the interest of this hardware-aware algorithm design method in the context of deep neural networks. For that, we showed the hybridization of a convolutional neural network for optical flow estimation with a pre-trained image classification network, MobileNet, initially designed for efficient image classification on low-power platforms.
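Mapping convolution onto matrix units, as mentioned above, typically goes through an im2col-style rewrite so the convolution becomes a matrix multiply that tensor/matrix units can execute. A minimal illustrative CUDA kernel (single channel, "valid" padding; names are assumptions, not the thesis code):

```cuda
// im2col: each thread writes one column of the patch matrix, i.e. the
// k*k input window that produces one output pixel. A convolution then
// becomes (filters as rows) x (patch matrix), a plain GEMM that can be
// dispatched to matrix/tensor units.
__global__ void im2col(const float* in, float* cols,
                       int w, int h, int k) {
    int ow = w - k + 1, oh = h - k + 1;       // "valid" output size
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= ow * oh) return;
    int ox = idx % ow, oy = idx / ow;
    for (int dy = 0; dy < k; ++dy)
        for (int dx = 0; dx < k; ++dx)
            cols[(dy * k + dx) * (ow * oh) + idx] = in[(oy + dy) * w + (ox + dx)];
}
```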
Avril, Quentin. "Détection de Collision pour Environnements Large Échelle : Modèle Unifié et Adaptatif pour Architectures Multi-coeur et Multi-GPU." Phd thesis, INSA de Rennes, 2011. http://tel.archives-ouvertes.fr/tel-00642067.
Full text
Ghannam, Boutros. "Modélisation ultra-rapide des transferts de chaleur par rayonnement et par conduction et exemple d'application." Phd thesis, Ecole Nationale Supérieure des Mines de Paris, 2012. http://pastel.archives-ouvertes.fr/pastel-00958145.
Full text
Wang, Hongjian. "Cellular matrix for parallel k-means and local search to Euclidean grid matching." Thesis, Belfort-Montbéliard, 2015. http://www.theses.fr/2015BELF0280/document.
Full text
In this thesis, we propose a parallel computing model, called cellular matrix, to address the issues raised by parallel computation applied to Euclidean graph-matching problems. These NP-hard optimization problems involve data distributed in the plane and elastic structures represented by graphs that must match the data. They include problems known under various names, such as geometric k-means, elastic net, topographic mapping, and elastic image matching. The Euclidean traveling salesman problem (TSP), the median cycle problem, and the image matching problem are also examples that can be modeled by graph matching. The contribution presented is divided into three parts. In the first part, we present the cellular matrix model, which partitions data and defines the level of granularity of parallel computation. We present a generic loop for parallel computations, and this loop models the projection between graphs and their matching. In the second part, we apply the parallel computing model to k-means algorithms in the plane extended with topology. The proposed algorithms are applied to the TSP, structured mesh generation, and image segmentation following the concept of superpixel. The approach is called superpixel adaptive segmentation map (SPASM). In the third part, we propose a parallel local search algorithm, called distributed local search (DLS). The solution results from the many local operations, including local evaluation, neighborhood search, and structured moves, performed on the data distributed in the plane. The algorithm is applied to Euclidean graph-matching problems, including stereo matching and optical flow.
Baboud, Lionel. "Représentations alternatives du détail visuel pour le rendu en temps-réel." Phd thesis, Université Joseph Fourier (Grenoble), 2009. http://tel.archives-ouvertes.fr/tel-00452293.
Full text
Baboud, Lionel. "Représentations alternatives du détail visuel pour le rendu en temps-réel." Phd thesis, Grenoble 1, 2009. http://www.theses.fr/2009GRE10222.
Full text
This thesis takes place in the context of real-time image synthesis. It addresses the problem of efficient rendering of visual detail, the main component of an image's realism. To cope with the complexity of visual detail it is necessary to possess representations adapted both to the objects to be rendered and to the capabilities of modern graphics processors. The first research effort concerns the use of relief for efficiently representing and rendering geometrical detail. The compact and structured representation of relief by a height map allows the design of efficient and accurate rendering algorithms. We propose two of them: the first one can render animated reliefs, while the second one focuses on static ones by exploiting the possibility of preprocessing the height map. We also discuss the use of relief for the representation of general surfaces, and present an application to real-time rendering of realistic water volumes. The second research effort focuses on non-surfacic representations, necessary when geometrical representations are inadequate or even non-existent. This is notably the case for distant objects or objects possessing dense geometry, like a tree's foliage for example. The problem here is to be able to represent the visual aspect of an object without resorting to any geometrical model. We propose a method taking the light field of an object as its only input data, to determine the optimal parameters for an efficiently renderable representation.
Zola, Wagner Machado Nunan. "Parallel GPU algorithms for compressed implicit octrees." Repositório Institucional da UFPR, 2015. http://hdl.handle.net/1884/45749.
Full text
Doctoral thesis (tese de doutorado) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 10/09/2015.
Includes references: f. 97-101.
Abstract: The Barnes-Hut algorithm is a widely used approximation method for the N-Body simulation problem, which involves the construction and traversal of sparse trees at each simulation step, thus reducing the complexity of solving large practical problems. The irregular nature of this tree walking code presents interesting challenges for its computation on parallel systems. Additional problems arise in effectively exploiting the processing capacity of GPU architectures. Octrees are data structures that efficiently represent spatial data in many fields such as scientific computing, computer graphics and image processing, among others. In this work we explicitly deal with dynamic irregular patterns in data accesses with data remapping and data transformation, depending on the data structures being accessed, and by controlling the execution flow divergence of threads. We present a new compact data structure for sparse octree layouts, and also GPU parallel algorithms for tree transformation and parallel walking using software Simulated Wide-Warps (SWW). The benefits of our techniques come from transposing the tree algorithm to execute regular patterns that match the GPU parallel model. The data structure allows exploiting locality during traversals while conserving space in caches or scratchpad memory. This way, fast intra-core memory can be dedicated to speeding up traversals. Control flow divergence is also algorithmically constrained, enforcing a mostly uniform execution of threads. Our experiments show significant performance improvement over the best known GPU solutions to this algorithm. We have developed a novel efficient parallel algorithm that directly generates entire compressed implicit octrees at once, as a massively parallel method. This method brings new insight on how to efficiently deal with the irregular nature of algorithms for constructing sparse octrees. The proposed algorithm has immediate application to our GPU parallel Barnes-Hut implementation and other N-Body methods. We envision that the techniques and algorithms proposed in this dissertation can also be applied in other contexts. Keywords: Massively Parallel Octree Generation Algorithm; Sparse Octrees; Implicit Octree; N-Body; Barnes-Hut; GPGPU; Software Simulated Wide-Warp; CUDA; Irregular Parallel Algorithm; Parallel Algorithms; Manycore Computing; Accelerator Computing.
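Implicit octrees of the kind described here are commonly built over Morton (Z-order) codes of the bodies, so that sorting the codes linearizes the tree. A small illustrative CUDA kernel computing 30-bit 3D Morton codes (a standard building block, not code from this dissertation):

```cuda
// Expand 10 bits so that each bit is separated by two zeros
// (standard bit-interleaving helper for 3D Morton codes).
__device__ unsigned int expandBits(unsigned int v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// One thread per body: quantize its position in the unit cube to 10 bits
// per axis and interleave into a 30-bit Morton code. Sorting by this key
// groups spatially close bodies and yields the implicit octree order.
__global__ void mortonCodes(const float3* pos, unsigned int* code, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = fminf(fmaxf(pos[i].x * 1024.f, 0.f), 1023.f);
    float y = fminf(fmaxf(pos[i].y * 1024.f, 0.f), 1023.f);
    float z = fminf(fmaxf(pos[i].z * 1024.f, 0.f), 1023.f);
    code[i] = (expandBits((unsigned)x) << 2) |
              (expandBits((unsigned)y) << 1) |
               expandBits((unsigned)z);
}
```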
Kang, Seunghwa. "On the design of architecture-aware algorithms for emerging applications." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/39503.
Full text
Ingram, Stephen F. "Multilevel multidimensional scaling on the GPU." Thesis, University of British Columbia, 2007. http://hdl.handle.net/2429/409.
Full text
Zaouche, Abdelouahib. "Egalisation aveugle utilisant des techniques d'optimisation évolutionistes et de recherche exhaustive par motifs généralistes." Valenciennes, 2007. https://ged.uphf.fr/nuxeo/site/esupversions/057b4ef6-37a6-4118-b768-5684a88e0912.
Full textZheng, Lu. "Parallel Junction Tree Algorithm on GPU." Thesis, Carnegie Mellon University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3575068.
Full text
Bayesian networks (BNs) are an effective tool in a diverse range of applications that require representation and reasoning with uncertain knowledge and data. Mathematically, a Bayesian network (BN) is a type of statistical model that can compactly represent complex probability distributions of random variables. Inference over BNs can be either exact or approximate. The junction tree algorithm is perhaps the most popular exact inference algorithm. The junction tree algorithm propagates beliefs (or posteriors) over a junction tree when a new piece of evidence comes in, so that the factors over the junction tree stay consistent with each other.
However, belief propagation over junction trees is known to be computationally hard. The computational complexity hinders the application of BNs in cases where real-time inference is required. This thesis accelerates the junction tree belief propagation algorithm by parallelizing it on a state-of-the-art manycore computing platform. Recent manycore computing platforms, like the Graphics Processing Units (GPUs) from NVIDIA and Intel's Knights Ferry, employ a Single Instruction Multiple Data (SIMD) architecture. They can provide massive computational power to address various computational problems. However, it is not trivial to map a problem instance to a manycore platform. The problem has to be carefully parallelized and the implementation carefully optimized in order to make full use of the computation resources of a GPU.
In this thesis, we thoroughly investigate the junction tree algorithm and identify the possible parallelization opportunities. Our work mainly focuses on node-level parallelization, i.e., we parallelize each message passing of junction tree belief propagation. We first identify two kinds of parallelism in the message passing, namely element-wise parallelism and arithmetic parallelism. Then, based on these two kinds of parallelism, we propose a two-dimensional parallel message passing algorithm. For message passings that do not contain enough parallel opportunity, we also propose a junction tree pruning technique called clique merging. Clique merging eliminates extremely small nodes in a junction tree so that the junction tree is better adjusted to the GPU platform.
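The element-wise parallelism identified above can be illustrated with the marginalization at the heart of a message passing: each thread computes one entry of the separator potential by summing the clique potential over the hidden variables. A hedged CUDA sketch (flattened potential tables, hidden variables assumed contiguous as the minor index; names are assumptions, not the thesis code):

```cuda
// Message passing core: marginalize a clique potential down to its
// separator. The clique table is laid out so that the hiddenSize states
// summed out are contiguous per separator entry; one thread produces one
// separator entry (element-wise parallelism). Splitting the inner sum
// across more threads would add the arithmetic dimension.
__global__ void marginalize(const float* cliquePot, float* sepPot,
                            int sepSize, int hiddenSize) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= sepSize) return;
    float acc = 0.f;
    for (int h = 0; h < hiddenSize; ++h)
        acc += cliquePot[s * hiddenSize + h];
    sepPot[s] = acc;
}
```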
Performance tuning for the parallel junction tree algorithm is necessary. Yet the diversity in size typically seen in junction tree cliques makes the GPU input workload vary; the two dimensions of parallelism further add to the complexity of this tuning issue. Therefore it is hard to set the GPU configuration parameters for optimal performance. In this thesis, we parameterize the GPU tuning process and propose a machine learning framework to automatically optimize performance. This framework essentially trains a machine learning model to capture causal relationships between a GPU's workload, its configuration, and the resulting performance characteristics. The machine learning model is then used to optimize the GPU configuration. Experiments show that this auto-tuning framework can improve performance significantly more than extensive manual tuning.
We implemented the parallel junction tree algorithm on an NVIDIA GTX 460 GPU card, and employed the clique merging and auto-tuning techniques. Compared with a benchmark sequential code that runs on an Intel Core 2 Quad CPU, the fully tuned code shows up to a 19.90x speedup. On average over all data sets, we get a speedup of 10.70x (arithmetic average) or 8.68x (geometric average).
Söderholm, Anders, and Justus Sörman. "GPU-acceleration of image rendering and sorting algorithms with the OpenCL framework." Thesis, Linköpings universitet, Programvara och system, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-127479.
Full textMcLaughlin, Adam Thomas. "Power-constrained performance optimization of GPU graph traversal." Thesis, Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50209.
Full textSreehari, Ambuluri. "Implementations of the FFT algorithm on GPU." Thesis, Linköpings universitet, Elektroniksystem, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-91351.
Full textZhu, Huanzhou. "Developing graph-based co-scheduling algorithms with GPU acceleration." Thesis, University of Warwick, 2016. http://wrap.warwick.ac.uk/92000/.
Full textHarvey, Jesse Patrick. "GPU acceleration of object classification algorithms using NVIDIA CUDA /." Online version of thesis, 2009. http://hdl.handle.net/1850/10894.
Full textKnight, David Warwick. "Algorithms for MARS spectral CT." Thesis, University of Canterbury. Medical Physics, 2015. http://hdl.handle.net/10092/10422.
Full textLerchundi, Osa Gorka. "Fast Implementation of Two Hash Algorithms on nVidia CUDA GPU." Thesis, Norwegian University of Science and Technology, Department of Telematics, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9817.
Full textUser needs increases as time passes. We started with computers like the size of a room where the perforated plaques did the same function as the current machine code object does and at present we are at a point where the number of processors within our graphic device unit its not enough for our requirements. A change in the evolution of computing is looming. We are in a transition where the sequential computation is losing ground on the benefit of the distributed. And not because of the birth of the new GPUs easily accessible this trend is novel but long before it was used for projects like SETI@Home, fightAIDS@Home, ClimatePrediction and there were shouting from the rooftops about what was to come. Grid computing was its formal name. Until now it was linked only to distributed systems over the network, but as this technology evolves it will take different meaning. nVidia with CUDA has been one of the first companies to make this kind of software package noteworthy. Instead of being a proof of concept its a real tool. Where the transition is expressed in greater magnitude in which the true artist is the programmer who uses it and achieves performance increases. As with many innovations, a community distributed worldwide has grown behind this software package and each one doing its bit. It is noteworthy that after CUDA release a lot of software developments grown like the cracking of the hitherto insurmountable WPA. With Sony-Toshiba-IBM (STI) alliance it could be said the same thing, it has a great community and great software (IBM is the company in charge of maintenance). Unlike nVidia is not as accessible as it is but IBM is powerful enough to enter home made supercomputing market. In this case, after IBM released the PS3 SDK, a notorious application was created using the benefits of parallel computing named Folding@Home. Its purpose is to, inter alia, find the cure for cancer. To sum up, this is only the beginning, and in this thesis is sized up the possibility of using this technology for accelerating cryptographic hash algorithms. BLUE MIDNIGHT WISH (The hash algorithm that is applied to the surgery) is undergone to an environment change adapting it to a parallel capable code for creating empirical measures that compare to the current sequential implementations. It will answer questions that nowadays havent been answered yet. BLUE MIDNIGHT WISH is a candidate hash function for the next NIST standard SHA-3, designed by professor Danilo Gligoroski from NTNU and Vlastimil Klima an independent cryptographer from Czech Republic. So far, from speed point of view BLUE MIDNIGHT WISH is on the top of the charts (generally on the second place right behind EDON-R - another hash function from professor Danilo Gligoroski). One part of the work on this thesis was to investigate is it possible to achieve faster speeds in processing of Blue Midnight Wish when the computations are distributed among the cores in a CUDA device card. My numerous experiments give a clear answer: NO. Although the answer is negative, it still has a significant scientific value. The point is that my work acknowledges viewpoints and standings of a part of the cryptographic community that is doubtful that the cryptographic primitives will benefit when executed in parallel in many cores in one CPU. Indeed, my experiments show that the communication costs between cores in CUDA outweigh by big margin the computational costs done inside one core (processor) unit.
Marques, Ricardo Jorge dos Santos. "Algorithmic skeleton framework for the orchestration of GPU computations." Master's thesis, Faculdade de Ciências e Tecnologia, 2012. http://hdl.handle.net/10362/8382.
Full textThe Graphics Processing Unit (GPU) is gaining popularity as a co-processor to the Central Processing Unit (CPU), due to its ability to surpass the latter's performance in certain application fields. Nonetheless, harnessing the GPU's capabilities is a non-trivial exercise that requires good knowledge of parallel programming. Thus, providing ways to extract such computational power has become an emerging research topic. In this context, there have been several proposals in the field of GPGPU (General-purpose Computation on Graphics Processing Units) development. However, most of these still offer a low-level abstraction of the GPU computing model, forcing the developer to adapt application computations to the SPMD model, as well as to orchestrate the low-level details of the execution. On the other hand, the higher-level approaches have limitations that prevent the full exploitation of GPUs when the purpose goes beyond the simple offloading of a kernel. To this end, our proposal builds on the recent trend of applying the notion of algorithmic patterns (skeletons) to GPU computing. We propose Marrow, a high-level algorithmic skeleton framework that expands the set of skeletons currently available in this field. Marrow's skeletons orchestrate the execution of OpenCL computations and introduce optimizations that overlap communication and computation, thus combining programming simplicity with performance gains in many application scenarios. Additionally, these skeletons can be combined (nested) to create more complex applications. We evaluated the proposed constructs by comparing them against comparable GPGPU skeleton libraries, as well as against hand-tuned OpenCL programs. The results are favourable, indicating that Marrow's skeletons are both flexible and efficient in the context of GPU computing.
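The communication/computation overlap that Marrow's skeletons automate can be pictured with the following minimal sketch. Marrow itself orchestrates OpenCL; CUDA streams are used here purely to illustrate the idea of splitting the input into chunks so that the transfer of one chunk proceeds while another is computed. All names are illustrative.

    #include <cuda_runtime.h>

    __global__ void process(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];   // placeholder computation
    }

    int main() {
        const int n = 1 << 22, chunks = 4, chunk = n / chunks;
        float *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost((void**)&h_in, n * sizeof(float));   // pinned memory is
        cudaMallocHost((void**)&h_out, n * sizeof(float));  // needed for overlap
        cudaMalloc((void**)&d_in, n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        cudaStream_t streams[chunks];
        for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);
        for (int c = 0; c < chunks; ++c) {
            int off = c * chunk;
            // Copy-in, compute, and copy-out of different chunks overlap
            // because each chunk lives in its own stream.
            cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[c]);
            process<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(
                d_in + off, d_out + off, chunk);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[c]);
        }
        cudaDeviceSynchronize();
        for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
        cudaFree(d_in); cudaFree(d_out);
        cudaFreeHost(h_in); cudaFreeHost(h_out);
        return 0;
    }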
FCT-MCTES - financing the equipment
Alexandre, Fernando Jorge Marques. "Multi-GPU support on the marrow algorithmic skeleton framework." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/10746.
Full textWith the proliferation of general-purpose GPUs, workload parallelization and data-transfer optimization became an increasing concern. The natural evolution from using a single GPU is multiplying the number of available processors, which presents new challenges, such as tuning workload decompositions and load balancing when dealing with heterogeneous systems. Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent in the currently used GPGPU APIs (OpenCL and CUDA), with their low-level nature and code overhead. This can be addressed by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations, such as a transparent load balancing mechanism and reduced explicit code overhead. Algorithmic skeletons, previously used in cluster environments, have recently been adapted to the GPGPU context. Skeletons abstract most sources of code overhead by defining computation patterns of commonly used algorithms. The Marrow algorithmic skeleton library is one of these, taking advantage of the abstractions to automate the orchestration needed for an efficient GPU execution. This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons in the modular and efficient programming of multiple heterogeneous GPUs within a single machine. We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single GPU.
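A minimal sketch of the static, even workload decomposition that such a framework starts from; Marrow's balancer would adapt the split to heterogeneous devices, and all names here are illustrative assumptions.

    #include <cuda_runtime.h>

    __global__ void work(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;          // placeholder computation
    }

    // Launches an even share of n elements on every visible GPU;
    // d_slices[d] is assumed to have been allocated on device d.
    void runOnAllDevices(float** d_slices, int n) {
        int devCount = 0;
        cudaGetDeviceCount(&devCount);
        int share = n / devCount;            // a real balancer would weight
        for (int d = 0; d < devCount; ++d) { // shares by device capability
            cudaSetDevice(d);
            work<<<(share + 255) / 256, 256>>>(d_slices[d], share);
        }
        for (int d = 0; d < devCount; ++d) {
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
    }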
projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/2009
Nilsson, Mattias. "Evaluation of Computer Vision Algorithms Optimized for Embedded GPU:s." Thesis, Linköpings universitet, Datorseende, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-112575.
Full textHopson, Benjamin Thomas Ken. "Techniques of design optimisation for algorithms implemented in software." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/20435.
Full textKump, Joseph Lee. "Efficient Algorithms for Data Analytics in Geophysical Imaging." Thesis, Virginia Tech, 2021. http://hdl.handle.net/10919/103864.
Full textMaster of Science
Modern sensor designs make it easier to collect large quantities of seismic vibration data. While this data can provide valuable insight, it is difficult to store and analyze such a high data volume effectively. We propose a few new, general-purpose algorithms that speed up two common methods in geophysical modeling and data analytics: cross-correlation, which provides a measure of similarity between signals, and multichannel analysis of surface waves, a seismic imaging technique. Our algorithms take advantage of hardware and software typically available on modern computers, as well as the mathematical properties of these two methods.
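As a flavour of the first method, here is a minimal CUDA sketch of sliding-dot-product cross-correlation with one thread per lag; the thesis's algorithms are more elaborate, and an FFT-based formulation is preferable for long signals.

    // r[lag] = sum_i x[i] * y[i + lag], for lags 0..maxLag.
    __global__ void xcorr(const float* x, const float* y, float* r,
                          int n, int maxLag)
    {
        int lag = blockIdx.x * blockDim.x + threadIdx.x;
        if (lag > maxLag) return;
        float acc = 0.0f;
        for (int i = 0; i + lag < n; ++i)
            acc += x[i] * y[i + lag];
        r[lag] = acc;
    }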
Ponce, Sean Philip. "Towards Algorithm Transformation for Temporal Data Mining on GPU." Thesis, Virginia Tech, 2009. http://hdl.handle.net/10919/34387.
Full textMaster of Science
Wang, Kaibo. "Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1447685368.
Full textValladares, Cereceda Ignacio. "GPU parallel algorithms for reporting movement behaviour patterns in spatiotemporal databases." Doctoral thesis, Universitat de Girona, 2013. http://hdl.handle.net/10803/119544.
Full textIn this thesis we address and solve several problems related to the computation of movement behaviour patterns in spatiotemporal databases, designing and implementing parallel algorithms on GPUs. First, we propose an algorithm that uses the GPU graphics pipeline to report the 'Popular Places' pattern. We then study the problem of reporting all groups of subtrajectories of a trajectory; to measure the similarity between curves we chose the Fréchet distance. Finally, we solve the 'Flock' pattern problem. To this end, we present two algorithms for two problems related to the 'Flock' pattern: finding all maximal sets of a family, and intersecting two families of sets. We propose parallel algorithms for both problems, which are then used to report 'Flock' patterns.
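A hedged sketch of the second subproblem, intersecting two families of sets, under the simplifying assumption that each set fits in a 64-bit mask; one thread handles one pair of sets (the thesis's algorithms differ in detail).

    // out[a*nb + b] = A[a] & B[b]; assumes na * nb fits in an int.
    __global__ void intersectFamilies(const unsigned long long* A, int na,
                                      const unsigned long long* B, int nb,
                                      unsigned long long* out)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= na * nb) return;
        out[idx] = A[idx / nb] & B[idx % nb];  // bitwise set intersection
    }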
Koskinas, Stefanos. "Denoising using randomized matching pursuit: algorithmic improvements and GPU implementation." Thesis, McGill University, 2013. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=116991.
Full textMatching Pursuit (MP) techniques have recently been considered for signal denoising. The MP algorithm can reconstruct a clean signal from a noisy observation through a greedy least-squares (LS) estimation, as a linear combination of elements (atoms) from a dictionary. This reconstruction method avoids the intensive computation of the LS solution for large problems, and it exhibits improved denoising performance, particularly when the signal is sparse. The original MP algorithm and other MP-type algorithms, such as Orthogonal Matching Pursuit (OMP), iteratively solve the reconstruction problem using, at each iteration, deterministic selections of dictionary atoms. However, recent research has proposed the use of randomized atom selection instead of deterministic selection, which can further improve denoising performance. In this thesis, we explore the performance of the MP and OMP algorithms with randomized atom selections, and we compare them against deterministic selections using Monte Carlo simulations and a geometric analysis of their behaviour in two dimensions. Motivated by our studies, we propose Multi-Stage MP (MS-MP), a new MP approach that uses several stages to remove noise. Simulations show that, under certain conditions, MS-MP can improve performance significantly. Finally, we observe that MP algorithms with randomized atom selections offer potential for parallelism, and we implement a parallel RMP algorithm on a CUDA-compatible GPU. Our implementation achieves up to a nine-fold improvement in execution time compared to a classical serial algorithm.
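To fix ideas, a serial host-side sketch (plain C++, compilable with nvcc) of the randomized atom selection described above; the per-atom correlation loop is the part that maps naturally onto GPU threads. Function and parameter names are illustrative assumptions.

    #include <cmath>
    #include <random>
    #include <vector>

    // dict holds unit-norm atoms of length n; at each iteration a random
    // subset of atoms is examined and the best match is subtracted.
    std::vector<float> rmpDenoise(const std::vector<std::vector<float>>& dict,
                                  std::vector<float> residual,
                                  int iters, int subsetSize, unsigned seed)
    {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> pick(0, (int)dict.size() - 1);
        std::vector<float> estimate(residual.size(), 0.0f);
        for (int it = 0; it < iters; ++it) {
            int best = -1;
            float bestCorr = 0.0f;
            for (int s = 0; s < subsetSize; ++s) {         // random subset
                int a = pick(rng);
                float c = 0.0f;
                for (std::size_t i = 0; i < residual.size(); ++i)
                    c += dict[a][i] * residual[i];         // correlation
                if (std::fabs(c) > std::fabs(bestCorr)) { best = a; bestCorr = c; }
            }
            if (best < 0) break;                           // nothing matched
            for (std::size_t i = 0; i < residual.size(); ++i) {
                estimate[i] += bestCorr * dict[best][i];   // add atom
                residual[i] -= bestCorr * dict[best][i];   // update residual
            }
        }
        return estimate;
    }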
Pospíchal, Petr. "Akcelerace genetického algoritmu s využitím GPU." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-236783.
Full text