Theses: "GPU-CPU"

1

Fang, Zhuowen. « Java GPU vs CPU Hashing Performance ». Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-33994.

Abstract:

In recent years, public interest in blockchain technology has grown steadily since it was introduced in 2008, primarily because of its ability to create an immutable ledger for storing information that never will or can be changed. A blockchain is an expanding chain structure; the act of nodes adding blocks to the chain is called mining, which is regulated by a consensus mechanism. In the most widely used consensus mechanism, Proof of Work, this process is based on computationally heavy guessing of block hashes. Thanks to advances in hardware, there are today several prominent ways of performing this guessing: using the regular general-purpose central processing unit (CPU), the more specialized graphics processing unit (GPU), or dedicated hardware. This thesis studied the working principles of blockchain and implemented the crucial hash function used in the Proof of Work consensus mechanism and other blockchain structures in the popular programming language Java on various platforms. The CPU implementation uses Java's built-in functions, and for the GPU I used JOCL, a Java binding for OpenCL. This project gives a quantified measurement of hash rate on different devices and determines that all the GPUs tested have an advantage over CPUs in performance and memory consumption. Java's built-in functions are easier to use, but both implementations do well in platform independence: the same code can easily be executed on different platforms. Furthermore, based on the measurements, I explored the principles in depth, proposed future work, and analyzed the application value of both approaches, together with future possibilities of blockchain, in terms of implementation difficulty and performance.
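
The hash guessing described in this abstract can be illustrated with a minimal CPU-only sketch. The thesis itself used Java and JOCL; the Python below is only an illustration, and the function names, the 8-byte nonce encoding, the use of a single SHA-256 pass, and the difficulty expressed in leading zero bits are all assumptions rather than details taken from the thesis.

```python
import hashlib
import time


def mine(block_data: bytes, difficulty_bits: int, max_nonce: int = 2**32):
    """Try nonces in order until the SHA-256 digest falls below the target.

    Returns (nonce, hex digest), or None if no nonce below max_nonce works.
    The nonce layout (8 bytes, big-endian, appended) is an assumption.
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None


def hash_rate(block_data: bytes, n_hashes: int = 100_000) -> float:
    """Measure how many SHA-256 hashes per second this CPU computes."""
    start = time.perf_counter()
    for nonce in range(n_hashes):
        hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    elapsed = time.perf_counter() - start
    return n_hashes / elapsed
```

Each nonce is tested independently of the others, which is why the search maps naturally onto the many parallel threads of a GPU.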

2

Dollinger, Jean-François. « A framework for efficient execution on GPU and CPU+GPU systems ». Thesis, Strasbourg, 2015. http://www.theses.fr/2015STRAD019/document.

Abstract:

Technological limitations faced by semiconductor manufacturers in the early 2000s halted the rapid growth in performance of sequential computation units. The current trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first technique consists in jointly using the CPU and GPU to execute a code. To achieve higher performance, load balance must be considered, in particular by predicting execution times. The runtime uses the profiling results, and the scheduler computes the execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the considered code are executed simultaneously on the CPU and GPU. The winner of the competition notifies the other instance of its completion, causing the latter to terminate.
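
The load-balancing idea in the first technique can be sketched as follows. This is not the thesis's scheduler: the function names and the assumption that profiling yields a single predicted time per iteration for each device are hypothetical simplifications.

```python
def split_iterations(total: int, t_cpu: float, t_gpu: float) -> tuple[int, int]:
    """Split `total` loop iterations so both devices are predicted to finish together.

    t_cpu and t_gpu are predicted times per iteration (hypothetical profiling
    output). Solves n_cpu * t_cpu == n_gpu * t_gpu with n_cpu + n_gpu == total.
    """
    if total < 0 or t_cpu <= 0 or t_gpu <= 0:
        raise ValueError("total must be >= 0 and per-iteration times must be > 0")
    n_cpu = round(total * t_gpu / (t_cpu + t_gpu))
    return n_cpu, total - n_cpu


def rebalance(total: int, n_cpu: int, n_gpu: int,
              measured_cpu: float, measured_gpu: float) -> tuple[int, int]:
    """Recompute the split from the times measured for the previous chunk sizes."""
    return split_iterations(total, measured_cpu / n_cpu, measured_gpu / n_gpu)
```

For example, if the GPU is predicted to be four times faster per iteration, `split_iterations(1000, 4.0, 1.0)` assigns 200 iterations to the CPU and 800 to the GPU.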

3

Gjermundsen, Aleksander. « CPU and GPU Co-processing for Sound ». Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2010. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-11794.

Abstract:

In voice communications, one problematic phenomenon is participants hearing an echo of their own voice. Acoustic echo cancellation (AEC) is used to remove this echo, but can be computationally demanding. The recent OpenCL standard allows high-level programs to run on multi-core CPUs as well as on Graphics Processing Units (GPUs) and custom accelerators. This opens up new possibilities for offloading computations, which is especially important for real-time applications. Although many algorithms for image and video processing have been studied on the GPU, audio processing algorithms have not been researched as thoroughly. This may be because these algorithms are not viewed as computationally heavy, and thus not as suitable for GPU offloading as, for instance, dense linear algebra. This thesis studies the AEC filter from the open-source library Speex for speech compression and audio preprocessing. We translate the original code into an optimized OpenCL program that can run on both CPUs and GPUs. Since the overhead of the OpenCL vendor implementations dominates running times, our results show that the existing reference implementation is faster for single-channel input/output, due to its simplicity and low computational intensity. However, by increasing the number of channels processed by the filter and the length of the echo tail, a speed-up of up to 5 times on CPU+GPU over CPU only was achieved. Although these cases may not be the most common, the techniques developed in this thesis are expected to be of increasing importance as GPUs and CPUs become more integrated, especially on embedded devices. This makes latencies less of an issue and hence strengthens the value of our results. An outline for future work in this area is also included.
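
To make the idea of echo cancellation concrete, here is a small time-domain sketch using a normalized least-mean-squares (NLMS) adaptive filter. This is a swapped-in textbook technique, not the Speex filter studied in the thesis, and the function names, tap count, step size, and the toy echo path are all assumptions chosen for illustration.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_end from mic (NLMS)."""
    weights = [0.0] * taps
    history = [0.0] * taps          # history[k] holds far_end[n - k]
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate        # what remains after removing the echo
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def simulate_echo(far_end, path):
    """Hypothetical room model: the microphone hears far_end convolved with path."""
    return [sum(h * far_end[n - k] for k, h in enumerate(path) if n - k >= 0)
            for n in range(len(far_end))]


def demo(n_samples=1000, seed=0):
    """Run the canceller on seeded noise and return (mic, residual)."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    mic = simulate_echo(far_end, [0.0, 0.0, 0.6, 0.0, 0.0, 0.3])
    return mic, nlms_echo_cancel(far_end, mic)
```

The per-sample cost grows with the number of taps (the echo tail length) and with the number of channels, which matches the abstract's observation that offloading only pays off for longer tails and more channels.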

4

CARLOS, EDUARDO TELLES. « HYBRID FRUSTUM CULLING USING CPU AND GPU ». PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2009. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=31453@1.

Abstract:

Visibility determination is one of the oldest problems in computer graphics. Several algorithms have been developed to enable the visualization of ever larger and more detailed models. Among these, frustum culling stands out; its role is to remove objects that are not visible to the observer. This algorithm, very common in many applications, has been improved over the years to further accelerate its execution. Although it is treated as a well-solved problem in computer graphics, some points can still be improved and new forms of culling developed. For massive models, high-performance algorithms are required, since the amount of computation increases significantly. This work evaluates the frustum culling algorithm and its optimizations, aiming to obtain the best possible algorithm implemented on the CPU, and analyzes the influence of each of its steps on massive models. Based on this analysis, new frustum culling techniques that use the computational power of the GPU (Graphics Processing Unit) will be developed and compared with the results obtained using only the CPU. As a result, a hybrid form of frustum culling will be proposed that tries to take advantage of the best of the CPU and the GPU.
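
A basic form of frustum culling tests each object's bounding sphere against the planes of the view volume. The sketch below shows that standard test only; it is not the optimized or hybrid algorithm of the thesis. The plane representation, the function names, and the axis-aligned box used as a stand-in for a real frustum are assumptions.

```python
def sphere_outside_frustum(center, radius, planes):
    """Return True if the sphere lies entirely behind at least one plane.

    planes: iterable of (nx, ny, nz, d) with unit normals pointing into the
    view volume, so a point p is inside a plane when n . p + d >= 0.
    """
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        if nx * cx + ny * cy + nz * cz + d < -radius:
            return True
    return False


def cull(objects, planes):
    """Keep the names of objects whose bounding spheres may be visible."""
    return [name for name, center, radius in objects
            if not sphere_outside_frustum(center, radius, planes)]


# Stand-in for a real frustum: the box -1 <= x, y, z <= 1 (an assumption).
BOX_PLANES = [
    (1, 0, 0, 1), (-1, 0, 0, 1),
    (0, 1, 0, 1), (0, -1, 0, 1),
    (0, 0, 1, 1), (0, 0, -1, 1),
]
```

The test is conservative: a sphere near a corner of the volume may be kept even though it is not visible. Because every object is tested independently, the loop in `cull` is the part that a GPU version would run in parallel.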

5

Farooqui, Naila. « Runtime specialization for heterogeneous CPU-GPU platforms ». Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54915.

Abstract:

Heterogeneous parallel architectures like those composed of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput.
Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.

6

Smith, Michael Shawn. « Performance Analysis of Hybrid CPU/GPU Environments ». PDXScholar, 2010. https://pdxscholar.library.pdx.edu/open_access_etds/300.

Abstract:

We present two metrics to assist the performance analyst to gain a unified view of application performance in a hybrid environment: GPU Computation Percentage and GPU Load Balance. We analyze the metrics using a matrix multiplication benchmark suite and a real scientific application. We also extend an experiment management system to support GPU performance data and to calculate and store our GPU Computation Percentage and GPU Load Balance metrics.

7

Wong, Henry Ting-Hei. « Architectures and limits of GPU-CPU heterogeneous systems ». Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/2529.

Abstract:

As we continue to be able to put an increasing number of transistors on a single chip, the perpetual question of what the best processor we could build with those transistors is remains unanswered. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). It presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrates speedups of up to 8.8x. This thesis also presents a limit study, in which we measure the limit of algorithm parallelism that can be usefully extracted from a set of general-purpose applications in the context of a heterogeneous system. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and to the latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies.
As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs.
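
The claim that abundant parallelism still yields a modest speedup when the parallel cores are individually slow can be illustrated with an Amdahl-style back-of-the-envelope model. This is not the methodology of the limit study; the function name, parameters, and the sample numbers are assumptions for illustration only.

```python
def heterogeneous_speedup(parallel_fraction, n_cores, core_speed, comm_overhead=0.0):
    """Amdahl-style speedup over a single CPU core.

    parallel_fraction: share of the work that can run on the parallel cores.
    n_cores: number of GPU-like cores.
    core_speed: speed of one parallel core relative to the CPU (below 1 means
        poorer sequential performance).
    comm_overhead: CPU-GPU communication cost as a fraction of the original
        run time (hypothetical parameter).
    """
    serial_time = 1.0 - parallel_fraction
    parallel_time = parallel_fraction / (n_cores * core_speed)
    return 1.0 / (serial_time + parallel_time + comm_overhead)
```

With 90% parallel work, 100 cores, and each core at one tenth of the CPU's speed, the model gives a speedup of about 5.3, and no number of cores can push it past 10 because of the serial remainder.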

8

Gummadi, Deepthi. « Improving GPU performance by regrouping CPU-memory data ». Thesis, Wichita State University, 2014. http://hdl.handle.net/10057/10959.

Abstract:

High-performance computing is essential for fast, effective analysis of large, complex systems. The NVIDIA Compute Unified Device Architecture (CUDA)-assisted central processing unit (CPU) / graphics processing unit (GPU) computing platform has proven its potential for high-performance computing. In CPU/GPU computing, original data and instructions are copied from CPU main memory to GPU global memory. Inside the GPU, it is beneficial to keep the data in shared memory (shared only by the threads of a block) rather than in global memory (shared by all threads). However, shared memory is much smaller than global memory (for the Fermi Tesla C2075, total shared memory per block is 48 KB and total global memory is 6 GB). In this paper, we introduce a CPU-memory to GPU-global-memory mapping technique that improves GPU and overall system performance by increasing the effectiveness of GPU shared memory. We use NVIDIA 448-core Fermi and 2496-core Kepler GPU cards in this study. Experimental results from solving Laplace's equation for 512x512 matrices on a Fermi GPU card show that the proposed CPU-to-GPU memory mapping technique helps decrease the overall execution time by more than 75%.
Thesis (M.S.)--Wichita State University, College of Engineering, Dept. of Electrical Engineering and Computer Science
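
One way to picture regrouping CPU-side data for shared memory is to reorder a row-major matrix so that each square tile becomes contiguous before it is copied to the GPU. The abstract does not describe the actual mapping, so the sketch below is only a plausible illustration; the function names, the square-tile layout, and the 4-byte element size are assumptions. The 48 KB default comes from the per-block figure quoted in the abstract.

```python
def regroup_into_tiles(matrix, tile):
    """Flatten a 2D list so that every tile x tile block is stored contiguously."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % tile or cols % tile:
        raise ValueError("matrix dimensions must be multiples of the tile size")
    flat = []
    for block_row in range(0, rows, tile):
        for block_col in range(0, cols, tile):
            for r in range(block_row, block_row + tile):
                flat.extend(matrix[r][block_col:block_col + tile])
    return flat


def tile_fits_shared(tile, bytes_per_element=4, shared_bytes=48 * 1024):
    """Check whether one tile fits in the per-block shared memory budget."""
    return tile * tile * bytes_per_element <= shared_bytes
```

With this layout, the data a thread block needs is one contiguous run in global memory, so it can be staged into shared memory with a single sequential read.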

9

Chen, Wei. « Dynamic Workload Division in GPU-CPU Heterogeneous Systems ». The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1364250106.

10

Ben, Romdhanne Bilel. « Simulation des réseaux à grande échelle sur les architectures de calculs hétérogènes ». Thesis, Paris, ENST, 2013. http://www.theses.fr/2013ENST0088/document.

Abstract:

Simulation is a primary step in the evaluation of modern networked systems. The scalability and efficiency of simulation tools, in view of the increasing complexity of emerging networks, is key to deriving valuable results. Discrete event simulation is recognized as the most scalable model and copes with both parallel and distributed architectures. However, existing software architectures do not take advantage of recent hardware, which provides new heterogeneous computing resources that can be exploited in parallel. The main aim of this thesis is to provide new mechanisms and optimizations that enable efficient and scalable parallel simulation on heterogeneous computing nodes that include multicore CPUs and GPUs. To address efficiency, we propose to describe events that differ only in their data as a single entry, moving from one descriptor per event to one descriptor per group of events, which reduces the event management cost. At run time, the proposed hybrid scheduler dispatches and injects the events on the most appropriate computing target, based on the event descriptor and the current load obtained through a feedback mechanism, so that the hardware usage rate is maximized. Results have shown a gain on the order of 100 times compared to traditional CPU-based approaches. To increase the scalability of the system, we propose a new simulation model, denoted general purpose coordinator-master-worker (GP-CMW), to jointly address the challenges of distributed and parallel simulation at different levels. The performance of a distributed simulation that relies on the GP-CMW architecture tends toward the maximal theoretical efficiency in a homogeneous deployment. The scalability of this simulation model is validated on the largest European GPU-based supercomputer.
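
The idea of describing events that differ only in their data as a single entry can be sketched as below. The event tuple layout, the function names, and the fixed batch-size threshold are hypothetical; the thesis's scheduler also uses load feedback, which this sketch leaves out.

```python
from collections import defaultdict


def group_events(events):
    """Merge events sharing a timestamp and type into one descriptor.

    events: iterable of (timestamp, event_type, payload).
    Returns a list of (timestamp, event_type, [payloads]) ordered by timestamp.
    """
    groups = defaultdict(list)
    for timestamp, event_type, payload in events:
        groups[(timestamp, event_type)].append(payload)
    ordered = sorted(groups.items(), key=lambda item: item[0])
    return [(t, kind, payloads) for (t, kind), payloads in ordered]


def choose_target(batch_size, gpu_threshold=64):
    """Send large batches to the GPU and small ones to the CPU (assumed policy)."""
    return "gpu" if batch_size >= gpu_threshold else "cpu"
```

The scheduler then handles one queue entry per group instead of one per event, and a large group is exactly the kind of uniform, data-parallel work a GPU handles well.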

11

Sundberg, Andreas. « Skapa digitalt fingeravtryck med hjälp av CPU och GPU ». Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-12851.

Abstract:

Digital fingerprinting is a technique used to create targeted advertising and to prevent fraud. There are many fingerprinting techniques, such as using cookies, IP addresses, and JavaScript. Many of these techniques are easy to evade, for example by disabling cookies or changing IP address, which makes it harder to detect the user. This work investigates whether it is possible to identify computers by measuring how long it takes a computer to execute a number of scripts on the CPU and the GPU. To answer the question, six different scripts were created, three of which execute on the CPU while the other three execute on the GPU or on both the CPU and the GPU. Once the scripts were complete, the timings were compared across eight computers, and the study showed that it is possible to identify computers with different hardware, but not computers with identical hardware.
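
A timing-based fingerprint of this kind can be sketched in two steps: time a set of workloads, then compare the resulting vector with stored profiles. The sketch below is not the thesis's scripts or matching method; the function names, the use of the median, and the relative-difference distance are assumptions.

```python
import time


def time_script(fn, repeats=5):
    """Return the median wall-clock time of calling fn(), in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def closest_profile(fingerprint, known_profiles):
    """Return the name of the stored timing profile nearest to fingerprint."""
    def distance(a, b):
        # Sum of relative differences, so slow and fast scripts count equally.
        return sum(abs(x - y) / max(x, y, 1e-12) for x, y in zip(a, b))
    return min(known_profiles,
               key=lambda name: distance(fingerprint, known_profiles[name]))
```

Two machines with identical hardware would produce nearly identical vectors, which is consistent with the finding that only computers with different hardware could be told apart.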

The author was in contact with Västgöta-Data AB, which gave feedback on the work and also suggested ways to carry it out.

12

Krishnasamy, Ezhilmathi. « Hybrid CPU-GPU Parallel Simulations of 3D Front Propagation ». Thesis, Linköpings universitet, Hållfasthetslära, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-114935.

Abstract:

This master's thesis studies GPU-enabled parallel implementations of the 3D Parallel Marching Method (PMM). 3D PMM is aimed at solving the non-linear static Jacobi-Hamilton equations, which have real-world applications such as the study of geological folding, where each layer of the Earth's crust is considered as a front propagating over time. Parallel computer architectures make fast simulations possible, leading to less time consumption and quicker understanding of the inner Earth, and enabling early exploration of oil and gas reserves. Currently, 3D PMM is implemented on shared-memory architectures using the OpenMP Application Programming Interface (API) and the Mint programming model, which translates C code into Compute Unified Device Architecture (CUDA) code for a single Graphics Processing Unit (GPU). Parallel architectures, especially GPUs, have seen rapid growth in recent years, allowing faster simulations. In this thesis work, a new parallel implementation of 3D PMM has been developed to exploit multicore CPU architectures as well as single and multiple GPUs. In the multiple-GPU implementation, the 3D data is decomposed into 1D data for each GPU. CUDA streams are used to overlap computation and communication within a single GPU. Part of the decomposed 3D volume data is kept on the respective GPU to avoid complete data transfers between the GPUs over a number of iterations. In total, two kinds of data transfer are involved in the multiple-GPU computation: boundary value data transfer and decomposed 3D volume data transfer. The decomposed 3D volume data transfer between the GPUs is optimized using peer-to-peer memory transfer in CUDA. The speedup is shown and compared between shared-memory CPUs (E5-2660, 16 cores), a single GPU (GTX-590, C2050 and K20m) and multiple GPUs.
Hand-coded CUDA has shown slightly better performance than the Mint-translated CUDA, and the multiple-GPU implementation showed promising speedup compared to the shared-memory multicore CPU and single-GPU implementations.
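
The multiple-GPU decomposition can be pictured as cutting the volume into slabs along one axis, with each GPU also holding a thin layer of its neighbours' boundary values. The sketch below assumes that reading of the abstract; the function name, the choice of the z axis, and the halo width of one plane are assumptions.

```python
def decompose_slabs(nz, n_gpus, halo=1):
    """Split z-planes [0, nz) into contiguous slabs, one per GPU.

    Returns one tuple per GPU: (own_start, own_end, load_start, load_end).
    Planes own_start..own_end-1 are updated by that GPU; the wider load range
    adds the neighbouring boundary planes that must be exchanged each iteration.
    """
    base, extra = divmod(nz, n_gpus)
    slabs = []
    start = 0
    for gpu in range(n_gpus):
        size = base + (1 if gpu < extra else 0)
        end = start + size
        slabs.append((start, end, max(0, start - halo), min(nz, end + halo)))
        start = end
    return slabs
```

Only the halo planes need to move between GPUs on each iteration, while the bulk of each slab stays resident, which corresponds to the two kinds of transfer the abstract distinguishes.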

13

Lindqvist, Sebastian. « Performance Evaluation of Boids on the GPU and CPU ». Thesis, Blekinge Tekniska Högskola, Institutionen för kreativa teknologier, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-15970.

Abstract:

Context. Agent-based models are used to simulate complex systems by using multiple agents that follow a set of rules. One such model is the boid model, which is used to simulate movements of synchronized groups of animals. Executing agent-based models partially or fully on the GPU has previously been shown to increase performance, opening up the possibility of larger simulations. However, few articles have compared a full GPU implementation of the boid model with a multi-threaded CPU implementation. Objectives. The objective of this thesis is to find how parallel execution of the boid model performs on the CPU and the GPU respectively, based on two variables: frames per second and average boid computation time per frame. Methods. A performance benchmark experiment is set up in which three implementations of the boid model are implemented and tested. Results. The collected data is summarized in both tables and graphs, showing the results of the experiment for frames per second and average boid computation time per frame. Additionally, the average results are summarized in two tables. Conclusions. For the largest flock size, the GPGPU implementation performs best, with an average FPS 42 times that of the single-core implementation, while the multi-core implementation achieves an average FPS 6 times that of the single-core implementation. For the smallest flock size, the single-core implementation is the most efficient: the GPGPU implementation has a 1.6 times slower average update time, and the multi-core implementation an average update time 11 times slower than the single-core implementation.
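
For readers unfamiliar with the boid model, here is a minimal single-threaded 2D update step with the three classic rules (cohesion, alignment, separation). It is not one of the three implementations benchmarked in the thesis; the function name, the weights, and the radii are assumptions.

```python
import math


def step_boids(positions, velocities, radius=5.0, min_dist=1.0,
               w_cohesion=0.01, w_align=0.05, w_separate=0.1, dt=1.0):
    """Advance all boids one step; returns (new_positions, new_velocities)."""
    new_positions, new_velocities = [], []
    for i, (px, py) in enumerate(positions):
        vx, vy = velocities[i]
        sum_x = sum_y = sum_vx = sum_vy = push_x = push_y = 0.0
        neighbours = 0
        for j, (qx, qy) in enumerate(positions):
            if i == j:
                continue
            dx, dy = qx - px, qy - py
            dist = math.hypot(dx, dy)
            if dist >= radius:
                continue
            neighbours += 1
            sum_x += qx
            sum_y += qy
            sum_vx += velocities[j][0]
            sum_vy += velocities[j][1]
            if dist < min_dist:      # too close: steer away
                push_x -= dx
                push_y -= dy
        if neighbours:
            new_vx = (vx + w_cohesion * (sum_x / neighbours - px)
                      + w_align * (sum_vx / neighbours - vx)
                      + w_separate * push_x)
            new_vy = (vy + w_cohesion * (sum_y / neighbours - py)
                      + w_align * (sum_vy / neighbours - vy)
                      + w_separate * push_y)
        else:
            new_vx, new_vy = vx, vy
        new_velocities.append((new_vx, new_vy))
        new_positions.append((px + new_vx * dt, py + new_vy * dt))
    return new_positions, new_velocities
```

Each boid reads only the previous frame's state, so the outer loop can run as one thread per boid. For small flocks the cost of launching that parallel work can exceed the work itself, which is consistent with the single-core result reported for the smallest flock size.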

14

Venkatasubramanian, Sundaresan. « Tuned and asynchronous stencil kernels for CPU/GPU systems ». Thesis, Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/29728.

Abstract:

Thesis (M. S.)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Vuduc, Richard; Committee Member: Kim, Hyesoon; Committee Member: Vetter, Jeffrey. Part of the SMARTech Electronic Thesis and Dissertation Collection.

15

Lind, Eric, and Velasquez Ävelin Pantigoso. « A performance comparison between CPU and GPU in TensorFlow ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-260240.

Abstract:

The fast-growing field of Machine Learning has become more common in recent years, going from a restricted research area to general use. Frameworks such as TensorFlow have been developed to scale and analyze artificial neural networks, which are used in Deep Learning, one of the areas of Machine Learning. This paper studies how well the TensorFlow framework performs with regard to time and memory allocation on the CPU and GPU, since these are often the limiting resources. Three neural networks were used to measure how TensorFlow allocates resources and computes the operations used to process each network during the training phase. Using TensorFlow's profiler, we traced how each operation was executed on the CPU and GPU; from the gathered data, we analyzed how the operations allocated memory and time. Our results show that training a more complex neural network benefits from being executed on the GPU, while a simple neural network gains little or nothing from being executed on the GPU rather than the CPU. The results also point to possible topics for further research, such as processor utilization, since the gaps in the scheduling were not studied in this paper.

16

Lagerhult, Christopher. « Smartphone CPU : An Energy efficient alternative to the GPU ». Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-397426.

17

Ospici, Matthieu. « Modèles de programmation et d'exécution pour les architectures parallèles et hybrides. Applications à des codes de simulation pour la physique ». Phd thesis, Université de Grenoble, 2013. http://tel.archives-ouvertes.fr/tel-00934266.

Abstract:

This thesis focuses on large hybrid parallel architectures, that is, parallel architectures that combine general-purpose processors (Intel Xeon, for example) and accelerator processors (Nvidia GPUs). The efficient use of these hybrid clusters for high-performance computing is at the heart of our work. The heterogeneity of computing resources within hybrid clusters raises many issues when one wants to exploit them efficiently with large existing scientific applications. Two main issues were addressed. The first concerns the sharing of accelerators among MPI applications, and the second concerns the programming and concurrent execution of code between CPU and accelerator. Hybrid architectures are very heterogeneous: depending on the architecture, the ratio between the number of accelerators and the number of CPU cores varies widely. We therefore first proposed a notion of accelerator virtualization, which gives applications the illusion that they can use a number of accelerators that is not tied to the number of physical accelerators available in the hardware. An execution model based on accelerator sharing is thus put in place and exposes a more homogeneous hybrid architecture to applications. We also proposed extensions to programming models based on MPI and threads to address the problem of concurrent execution between CPUs and accelerators. To this end, we proposed a model based on two types of threads, CPU threads and accelerator threads, which makes it possible to set up hybrid computations that use the CPUs and the accelerators simultaneously. In both cases, the deployment and execution of code on the hybrid resources is crucial.
For this purpose, we proposed two software libraries, S_GPU 1 and S_GPU 2, whose role is to deploy and execute computations on the hybrid hardware. S_GPU 1 handles virtualization, and S_GPU 2 handles concurrent CPU-accelerator execution. To observe the deployment and execution of code on complex GPU-based architectures, we integrated tracing mechanisms that make it possible to analyze the execution of programs that use our libraries. Our proposals were validated on two large scientific applications: BigDFT (ab initio simulation) and SPECFEM3D (seismic wave simulation). We adapted them so that they can use S_GPU 1 (for BigDFT) and S_GPU 2 (for SPECFEM3D).

18

Norgren, David. « Implementing and Evaluating CPU/GPU Real-Time Ray Tracing Solutions ». Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-32076.

Résumé :

Ray tracing is a popular algorithm used to simulate the behavior of light and is commonly used to render images with high levels of visual realism. Modern multicore CPUs and many-core GPUs can take advantage of the parallel nature of ray tracing to accelerate the rendering process and produce new images in real-time. For non-specialized hardware however, such implementations are often limited to low screen resolutions, simple scene geometry and basic graphical effects. In this work, a C++ framework was created to investigate how the ray tracing algorithm can be implemented and accelerated on the CPU and GPU, respectively. The framework is capable of utilizing two third-party ray tracing libraries, Intel’s Embree and NVIDIA’s OptiX, to ray trace various 3D scenes. The framework also supports several effects for added realism, a user controlled camera and triangle meshes with different materials and textures. In addition, a hybrid ray tracing solution is explored, running both libraries simultaneously to render subsections of the screen. Benchmarks performed on a high-end CPU and GPU are finally presented for various scenes and effects. Throughout these results, OptiX on a Titan X performed better by a factor of 2-4 compared to Embree running on an 8-core hyperthreaded CPU within the same price range. Due to this imbalance of the CPU and GPU along with possible interferences between the libraries, the hybrid solution did not give a significant speedup, but created possibilities for future research.

19

Sandgren, Julius. « Transfer Time Reduction of Data Transfers between CPU and GPU ». Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-205272.

Résumé :

In real-time video processing, data transfer between CPU and GPU is a time-critical action; time spent transferring data is processing time lost. Several variants of standard transfer methods were developed and evaluated on nine computers, and two smart decision algorithms were designed to help choose the fastest method for each occasion. Results showed that the standard transfer methods can be beaten; by using the designed decision algorithms, transfer times between CPU and GPU (both ways) can be reduced by a factor of 7 compared to always using the standard methods.

20

Liljeqvist, Erik. « Evaluating a CPU/GPU Implementation for Real-Time Ray Tracing ». Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-35768.

21

Svantesson, David, et Martin Eklund. « A naive implementation of Topological Sort on GPU : A comparative study between CPU and GPU performance ». Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186417.

Résumé :

Topological sorting is a graph problem encountered in various areas of computer science. Many graph problems have benefited from execution on a GPU rather than a CPU due to the GPU's capability for parallelism. The purpose of this report is to determine whether topological sorting may benefit from a naive implementation on the GPU compared to the CPU. This is accomplished by constructing a parallel implementation using the CUDA platform by NVIDIA for GPGPU programming. The runtime of this implementation on several different graphs is compared to that of a sequential implementation in C running on the CPU. The results indicate that the GPU algorithm is beneficial only on large, shallow graphs.
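
The abstract does not spell out the algorithm, so the following is only a hedged sketch of one level-by-level formulation (Kahn's algorithm), written in Python rather than the CUDA and C used in the thesis. The function name `topological_levels` and the example graphs are illustrative assumptions. Each level holds nodes whose dependencies are all resolved, so the nodes of one level could in principle be processed in parallel; a shallow graph yields few, wide levels, which is consistent with the reported result for large, shallow graphs.

```python
from collections import defaultdict


def topological_levels(num_nodes, edges):
    """Group the nodes of a DAG into dependency levels (Kahn's algorithm).

    Hypothetical helper for illustration; not the thesis implementation.
    """
    successors = defaultdict(list)
    in_degree = [0] * num_nodes
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Nodes without incoming edges form the first level.
    frontier = [n for n in range(num_nodes) if in_degree[n] == 0]
    levels = []
    visited = 0
    while frontier:
        levels.append(frontier)
        visited += len(frontier)
        next_frontier = []
        # Nodes in the same frontier do not depend on each other, so a
        # parallel version could handle them concurrently.
        for node in frontier:
            for succ in successors[node]:
                in_degree[succ] -= 1
                if in_degree[succ] == 0:
                    next_frontier.append(succ)
        frontier = next_frontier

    if visited != num_nodes:
        raise ValueError("graph contains a cycle")
    return levels


# A shallow graph: four independent nodes feeding one sink.
wide = topological_levels(5, [(0, 4), (1, 4), (2, 4), (3, 4)])
# A deep graph: a single chain, one node per level.
chain = topological_levels(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
```

In this sketch the shallow graph needs two levels while the chain needs five, so the chain offers no parallel work within a level.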

22

Zhang, Junchi. « GPU computing of Heat Equations ». Digital WPI, 2015. https://digitalcommons.wpi.edu/etd-theses/515.

Résumé :

There is an increasing amount of evidence in scientific research and industrial engineering indicating that the graphics processing unit (GPU) is more efficient than the CPU at certain computations. The heat equation is one of the best-known partial differential equations, with well-developed theory and applications in engineering. Thus, we chose in this report to use the heat equation to numerically solve for the heat distributions at different time points using both GPU and CPU programs. The heat equation with three different boundary conditions (Dirichlet, Neumann and periodic) was solved on the given domain and discretized by finite difference approximations. The programs solving the linear system from the heat equation with different boundary conditions were implemented on GPU and CPU. A convergence analysis and a stability analysis for the finite difference method were performed to guarantee the success of the program. Iterative methods and direct methods to solve the linear system are also discussed for the GPU. The results show that the GPU has a large advantage over the CPU in terms of time spent on large problems.
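
As a hedged illustration of a finite difference discretization of the heat equation, the sketch below performs explicit forward-time, centered-space steps in one dimension with Dirichlet (fixed) end values. This is not necessarily the scheme used in the thesis: the abstract mentions solving a linear system, which suggests an implicit method. The function name `heat_step` and the parameter `r = alpha * dt / dx**2` are assumptions introduced here.

```python
def heat_step(u, r):
    """One explicit finite-difference step of u_t = alpha * u_xx in 1D.

    r = alpha * dt / dx**2; the explicit scheme is stable only for r <= 0.5.
    The two end values are kept fixed (Dirichlet boundary conditions).
    Hypothetical helper for illustration; not the thesis implementation.
    """
    if not 0 < r <= 0.5:
        raise ValueError("explicit scheme requires 0 < r <= 0.5")
    new_u = list(u)
    # Each interior point depends only on the previous time level, so all
    # interior updates are independent and could run in parallel.
    for i in range(1, len(u) - 1):
        new_u[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new_u


# A unit heat spike in the middle of a rod whose ends are held at zero.
initial = [0.0, 0.0, 1.0, 0.0, 0.0]
after_one_step = heat_step(initial, 0.25)
```

Because every interior update reads only values from the previous time level, this kind of stencil maps naturally onto one GPU thread per grid point.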

23

Vekterli, Tor Brede. « Parallelization of Artificial Spiking Neural Networks on the CPU and GPU ». Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9838.

Résumé :

Conventional artificial neural networks have traditionally faced inherent problems with efficient parallelization of neuron processing. Recent research has shown how artificial spiking neural networks can, with the introduction of biologically plausible synaptic conduction delays, be fully parallelized regardless of their network topology. This, in conjunction with the influx of fast, massively parallel desktop-level computing hardware leaves the field of efficient, large-scale spiking neural network simulations potentially open to even those with no access to supercomputers or large computing clusters. This thesis aims to show how such a parallelization is possible as well as present a network model that enables it. This model will then be used as a base for implementing a parallel artificial spiking neural network on both the CPU and the GPU and subsequently evaluating some of the challenges involved, the performance and scalability measured and the potential that is exhibited.

24

Enmyren, Johan. « A Skeleton Programming Library for Multicore CPU and Multi-GPU Systems ». Thesis, Linköpings universitet, Institutionen för datavetenskap, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-60319.

Résumé :

This report presents SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP back end. It also supports multi-GPU systems. Benchmarks show that copying data between the host and the GPU is often a bottleneck. Therefore a container which uses lazy memory copying has been implemented to avoid unnecessary memory transfers. SkePU was evaluated with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that skeletal parallel programming is indeed a viable approach for GPU computing and that a generalized interface for multiple back ends is also reasonable. The best performance gains are achieved when the computational load is large compared to memory I/O (the lazy memory copying can help to achieve this). SkePU offers good performance on a more complex and realistic task such as ODE solving, with run times up to ten times faster when using SkePU with a GPU back end than with a sequential solver running on a fast CPU. SkePU does, however, have some disadvantages: there is some overhead in using the library, which can be seen in the dot product and LibSolve benchmarks. Although not large, it is still there, and if performance is of the utmost importance, a hand-coded solution would be best. Nor can all calculations be expressed in terms of skeletons; for such problems, specialized routines must still be written.
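
The idea of lazy memory copying can be sketched as follows. This is a toy Python model under stated assumptions, not SkePU's C++ API: the class `LazyVector`, its methods, and the simulated device buffer are all hypothetical. The container tracks which copy (host or device) is current and transfers data only when a stale copy is actually needed.

```python
class LazyVector:
    """Toy container that copies between host and a simulated device lazily.

    Hypothetical illustration only; SkePU's real containers are C++ templates.
    """

    def __init__(self, values):
        self.host = list(values)
        self.device = None          # simulated device buffer
        self.host_valid = True
        self.device_valid = False
        self.transfers = 0          # number of host<->device copies made

    def _to_device(self):
        # Upload only if the device copy is missing or stale.
        if not self.device_valid:
            self.device = list(self.host)
            self.device_valid = True
            self.transfers += 1
        return self.device

    def to_host(self):
        # Download only if the host copy is stale.
        if not self.host_valid:
            self.host = list(self.device)
            self.host_valid = True
            self.transfers += 1
        return self.host

    def device_map(self, fn):
        """Apply fn element-wise 'on the device' without copying back."""
        buffer = self._to_device()
        self.device = [fn(x) for x in buffer]
        self.host_valid = False     # host copy is now out of date


v = LazyVector([1, 2, 3])
v.device_map(lambda x: x * 2)       # one upload
v.device_map(lambda x: x + 1)       # no transfer: device copy is current
result = v.to_host()                # one download
```

Two chained device operations cost two transfers in this model instead of four, which is the kind of saving the abstract attributes to lazy copying.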

25

Berthou, Gautier. « Implementation of an object-detection algorithm on a CPU+GPU target ». Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-206178.

Résumé :

Systems like autonomous vehicles may require real-time embedded image processing under hardware constraints. This paper provides directions for designing time- and resource-efficient Haar cascade detection algorithms. It also reviews some software architecture and hardware aspects. The considered algorithms were meant to be run on platforms equipped with a CPU and a GPU under power consumption limitations. The main aim of the project was to design and develop real-time underwater object detection algorithms. However, the concepts presented in this paper are generic and can be applied to other domains where object detection is required, face detection for instance. The results show how the solutions outperform the OpenCV cascade detector in terms of execution time while having the same accuracy.

26

Concha, Ramírez Francisca Andrea. « FADRA : A CPU-GPU framework for astronomical data reduction and Analysis ». Tesis, Universidad de Chile, 2016. http://repositorio.uchile.cl/handle/2250/140769.

Résumé :

Master of Science, specialization in Computer Science
This thesis lays the foundations of FADRA: Framework for Astronomical Data Reduction and Analysis. The FADRA framework was designed to be efficient, simple to use, modular, extensible, and open source. Today, astronomy is inseparable from computing, but some of the most widely used software was developed three decades ago and is not designed to cope with current big data paradigms. The world of astronomical software must evolve not only toward practices that understand and embrace the big data era, but also toward practices focused on the collaborative work of the community. The work carried out consisted of designing and implementing the basic algorithms for astronomical data analysis, thereby starting the development of the framework. This included implementing data structures that are efficient when working with a large number of images, implementing algorithms for the calibration or reduction of astronomical images, and designing and developing algorithms for computing photometry and obtaining light curves. Both the reduction and the light-curve algorithms were implemented in CPU and GPU versions. For the GPU implementations, algorithms were designed that minimize the amount of data to be processed so as to reduce data transfer between CPU and GPU, a slow process that often eclipses the execution-time gains obtainable through parallelization. Although FADRA was designed with the idea of using its algorithms within scripts, a wrapper module for interacting through graphical interfaces was also implemented. One of the main goals of this thesis was to validate the results obtained with FADRA. To that end, reduction results and light curves were compared with results from AstroPy, a Python package with various utilities for astronomers. The experiments were run on six datasets of real astronomical images. For the reduction of astronomical images, the Normalized Root Mean Squared Error (NRMSE) was used as the similarity metric between images. For the light curves, the shapes of the curves were shown to be the same by determining constant offsets between the numerical values of each of the points belonging to the different curves. In terms of validity, both reduction and light-curve extraction, in their CPU and GPU implementations, produced correct results when compared with those of AstroPy, which means that the developments and approximations designed for FADRA yield results that can be used safely for the scientific analysis of astronomical images. In terms of execution time, the data-intensive nature of the reduction process makes the GPU version even slower than the CPU version. However, for light-curve extraction, the GPU algorithm shows a significant decrease in execution time compared with its CPU counterpart.
This work was partially funded by Fondecyt Project 1120299.
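
The NRMSE metric named in the abstract can be sketched as below. The abstract does not say how the error is normalized, so dividing by the value range of the reference image is an assumption made here (normalizing by the mean is another common convention). The function name `nrmse` and the flat pixel lists are illustrative.

```python
import math


def nrmse(reference, candidate):
    """Root mean squared error divided by the reference value range.

    Assumption: range normalization; the thesis may use another convention.
    Inputs are flat sequences of pixel values of equal length.
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    value_range = max(reference) - min(reference)
    if value_range == 0:
        raise ValueError("reference has zero range; NRMSE is undefined")
    return math.sqrt(mse) / value_range


identical = nrmse([0.0, 10.0], [0.0, 10.0])
slightly_off = nrmse([0.0, 10.0], [1.0, 9.0])
```

A value of zero means the two images match exactly; larger values indicate a larger discrepancy relative to the dynamic range of the reference.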

27

Öhberg, Tomas. « Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU ». Thesis, Linköpings universitet, Programvara och system, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-149605.

Résumé :

The trend in computer architectures has for several years been heterogeneous systems consisting of a regular CPU and at least one additional, specialized processing unit, such as a GPU. The different characteristics of the processing units and the requirement of multiple tools and programming languages make programming such systems a challenging task. Although there exist tools for programming each processing unit, utilizing the full potential of a heterogeneous computer still requires specialized implementations involving multiple frameworks and hand-tuning of parameters. To fully exploit the performance of heterogeneous systems for a single computation, hybrid execution is needed, i.e. execution where the workload is distributed between multiple, heterogeneous processing units working simultaneously on the computation. This thesis presents the implementation of a new hybrid execution backend in the algorithmic skeleton framework SkePU. The skeleton framework already gives programmers a user-friendly interface to algorithmic templates, executable on different hardware using OpenMP, CUDA and OpenCL. With this extension it is now also possible to divide the computational work of the skeletons between multiple processing units, such as between a CPU and a GPU. The results show an improvement in execution time with the hybrid execution implementation for all skeletons in SkePU. It is also shown that the new implementation results in a lower and more predictable execution time compared to a dynamic scheduling approach based on an earlier implementation of hybrid execution in SkePU.

28

Vivanloc, Vincent. « Rendu distribué sur grappe de CPU/GPU et effets d'éclairage global ». Toulouse 3, 2008. http://thesesups.ups-tlse.fr/823/.

Résumé :

Virtual prototyping and design review require realistic real-time rendering. This leads to two research directions: on the one hand, real-time rendering of indirect illumination effects and, on the other hand, real-time distributed high-resolution rendering. Simulating global illumination effects improves the quality of a synthetic image produced by rasterisation. We worked on indirect lighting and specular reflection rendering. For low-frequency lighting, indirect illumination can be updated in real time. For a broader range of frequencies, indirect illumination rendering is limited to static scenes, because this case requires long computation times or a complex parametrisation of the geometry. Our contribution is a fast reconstruction of global illumination from a photon map without any prior parametrisation. The photon map is simplified into an octree of virtual directional lights. The lighting is then evaluated on the fly by the graphics card to allow real-time navigation in a scene under indirect illumination. We also studied a way to improve real-time specular reflections in rasterisation, in order to avoid a costly ray-tracing simulation. In rasterisation, computed reflections are exact only for reflected items located at infinity. Existing solutions attempt to construct near-field reflections, but they rely on oversimplified assumptions, which restricts their applicability to a limited number of scene topologies. We therefore devised a method based on an iterative search to obtain a plausible reflection for nearby items. However, the gain in accuracy comes with disocclusion artefacts caused by the apparent motion due to parallax. These problems are partly limited by a local reconstruction of the geometry using our projected geometry buffer. Many existing solutions provide real-time rendering at high resolutions. On the one hand, hardware solutions quickly become obsolete and scale only to a limited extent. On the other hand, software distributions are more flexible but offer only basic renderings. Moreover, modifying these solutions to improve the quality of the rendered pictures with multipass shaders is relatively difficult: legacy software interleaves the rendering procedures with the data distribution algorithms. In contrast, a modular architecture can improve the reusability of a distributed system; the development of rendering methods becomes independent of any data distribution code. This is what HiD2RA aims to provide, assisted by its meta scenegraph. This implementation of the remote proxy design pattern offers an extensible interface for developing real-time, high-quality rendering applications on display walls.

29

Trichy, Ravi Vignesh. « Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures ». The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1338324367.

30

He, Guanlin. « Parallel algorithms for clustering large datasets on CPU-GPU heterogeneous architectures ». Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG062.

Résumé :

Clustering, which aims at achieving natural groupings of data, is a fundamental and challenging task in machine learning and data mining. Numerous clustering methods have been proposed in the past, among which k-means is one of the most famous and commonly used methods due to its simplicity and efficiency. Spectral clustering is a more recent approach that usually achieves higher clustering quality than k-means. However, classical algorithms of spectral clustering suffer from a lack of scalability due to their high complexities in terms of number of operations and memory space requirements. This scalability challenge can be addressed by applying approximation methods or by employing parallel and distributed computing. The objective of this thesis is to accelerate spectral clustering and make it scalable to large datasets by combining representatives-based approximation with parallel computing on CPU-GPU platforms. Considering different scenarios, we propose several parallel processing chains for large-scale spectral clustering. We design optimized parallel algorithms and implementations for each module of the proposed chains: parallel k-means on CPU and GPU, parallel spectral clustering on GPU using a sparse storage format, parallel filtering of data noise on GPU, etc. Our various experiments reach high performance and validate the scalability of each module and of the complete chains.

31

Kankatala, Sriram. « Performance Analysis of kNN on large datasets using CUDA & Pthreads : Comparing between CPU & GPU ». Thesis, Blekinge Tekniska Högskola, Institutionen för kommunikationssystem, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-10830.

Résumé :

Several organizations have large databases that grow rapidly day by day and need to be maintained regularly. Content-based searches are similarity searches based on features obtained from various multimedia data. For applications such as multimedia content retrieval, data mining, and pattern recognition, performing nearest neighbor search on multidimensional data is a challenging task. The important factors in k-nearest-neighbor (kNN) search are search speed and accuracy. Implementing kNN on GPUs has been an active research topic in recent years, with a focus on improving kNN performance. With these aspects in mind, we began our research and identified a gap in this area. This master's thesis exploits parallelism on multi-core CPUs and on GPUs and compares their performance with that of a single-core CPU. It presents an experimental implementation of kNN on a single-core CPU, a multi-core CPU, and a GPU using C, Pthreads, and CUDA, respectively. We considered different input levels (size, dimensions) to evaluate performance. The experiments show that, for kNN, the GPU outperforms the single-core CPU by a factor of approximately 5.8 to 16 and the multi-core CPU by a factor of approximately 1.2 to 3, depending on the input level.
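
For orientation, a brute-force kNN search can be sketched as below. This is a sequential Python illustration under stated assumptions, not the C, Pthreads, or CUDA code of the thesis; the function name `knn` and the use of Euclidean distance are choices made here. Every distance computation is independent of the others, which is what makes the problem attractive for multi-core CPUs and GPUs.

```python
import heapq
import math


def knn(points, query, k):
    """Return the indices of the k points closest to query (Euclidean distance).

    Hypothetical helper for illustration; not the thesis implementation.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Each distance is independent of the others, so this step is the
    # natural candidate for parallel execution.
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: math.dist(points[i], query)
    )


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest_two = knn(points, (0.0, 0.0), 2)
```

The indices are returned in order of increasing distance from the query point.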

32

Topcu, Tumer. « Data Parallelism For Ray Casting Large Scenes On A Cpu-gpu Cluster ». Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609494/index.pdf.

Résumé :

In the last decade, computational power, memory bandwidth and programmability capabilities of graphics processing units (GPU) have rapidly evolved. Therefore, much research has been performed on using GPUs in advanced graphics rendering. Because of its high degree of parallelism, ray tracing has been one of the first algorithms studied on GPUs. However, the rendering of large scenes with ray tracing can easily exceed the GPU's memory capacity. The algorithm proposed in this work uses a data parallel approach where the scene is partitioned and assigned to CPU-GPU couples in a cluster to overcome this problem. Our algorithm focuses on ray casting, which is a special case of ray tracing mainly used in visualization of volumetric data. CPUs are pretty efficient in flow control and branching while GPUs are very fast at performing intense floating point operations. Using these facts, the GPUs in the cluster are assigned the task of performing ray casting while the CPUs are responsible for traversing the rays. In the end, we were able to visualize large scenes successfully by utilizing CPU-GPU couples effectively and observed that the performance is highly dependent on the viewing angle as a result of load imbalance.

33

Sharma, Vishist. « Sparse-Matrix support for the SkePU library for portable CPU/GPU programming ». Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-129687.

Résumé :

In this thesis work we have extended the SkePU framework by designing a new container data structure for the representation of generic two dimensional sparse matrices. Computation on matrices is an integral part of many scientific and engineering problems. Sometimes it is unnecessary to perform costly operations on zero entries of the matrix. If the number of zeroes is relatively large then a requirement for more efficient data structure arises. Beyond the sparse matrix representation, we propose an algorithm to judge the condition where computation on sparse matrices is more beneficial in terms of execution time for an ongoing computation and to adapt a matrix's state accordingly, which is the main concern of this thesis work. We present and implement an approach to switch automatically between two data container types dynamically inside the SkePU framework for a multi-core GPU-based heterogeneous system. The new sparse matrix data container supports all SkePU skeletons and nearly all SkePU operations. We provide compression and decompression algorithms from dense matrix to sparse matrix and vice versa on CPU and GPUs using SkePU data parallel skeletons. We have also implemented a context aware switching mechanism in order to switch between two data container types on the CPU or the GPU. A multi-state matrix representation, and selection on demand is also made possible. In order to evaluate and test effectiveness and efficiency of our extension to the SkePU framework, we have considered Matrix-Vector Multiplication as our benchmark program because iterative solvers like Conjugate Gradient and Generalized Minimum Residual use Sparse Matrix-Vector Multiplication as their basic operation. Through our benchmark program we have demonstrated adaptive switching between two data container types, implementation selection between CUDA and OpenMP, and converting the data structure depending on the density of non-zeroes in a matrix. 
Our experiments on GPU-based architectures show that our automatic switching mechanism adapts to the fastest SkePU implementation variant and has a limited training cost.

34

Ferenczi, Daniel. « Användning av Dynamisk Arbetslastbalansering mellan CPU och GPU för att Simulera Rök ». Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-11025.

35

Pinto, Vinícius Garcia. « Escalonamento por roubo de tarefas em sistemas Multi-CPU e Multi-GPU ». reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/71270.

Texte intégral

Résumé :

In recent years, one of the alternatives adopted to increase performance in high-performance computing systems has been the use of hybrid architectures. These architectures consist of multicore processors and specialized coprocessors, such as GPUs. The coprocessors act as accelerators for some types of operations. On the other hand, current parallel programming models and tools are not suitable for hybrid scenarios and produce applications with poor portability. Task parallelism, considered a generic and high-level programming paradigm, can be used in this scenario. However, it requires dynamic scheduling algorithms, such as work stealing. In this context, this work presents a middleware (WORMS) that supports task parallelism with work-stealing scheduling in hybrid multi-CPU and multi-GPU systems. The middleware allows tasks to have implementations for both CPU and GPU execution, and decides at runtime which implementation will run according to the available hardware resources. The results obtained with WORMS show that it is possible, in some applications, to outperform both reference tools for CPU execution and reference tools for GPU execution.
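
As a rough illustration of work stealing with per-device task implementations, the following single-threaded simulation may help. It is not WORMS code: the `Worker` class, the `make_task` helper and the "steal from the most loaded worker" victim choice are assumptions made to keep the example deterministic (real schedulers usually run workers concurrently and often pick victims at random).

```python
from collections import deque


class Worker:
    def __init__(self, name, kind):
        self.name = name
        self.kind = kind      # "cpu" or "gpu": selects the task implementation
        self.tasks = deque()  # this worker's own task queue
        self.done = []


def make_task(i):
    # Hypothetical task with one implementation per device kind.
    return {"cpu": lambda: ("cpu", i), "gpu": lambda: ("gpu", i)}


def run(workers):
    """Round-robin simulation: each worker executes one task per round."""
    while any(w.tasks for w in workers):
        for w in workers:
            if w.tasks:
                task = w.tasks.pop()           # owner takes its newest task
            else:
                victim = max(workers, key=lambda v: len(v.tasks))
                if not victim.tasks:
                    continue                   # nothing left to steal
                task = victim.tasks.popleft()  # thief takes the oldest task
            # The implementation is chosen at run time from the worker's kind.
            w.done.append(task[w.kind]())
```

If four tasks are queued on a CPU worker and a GPU worker starts idle, the GPU worker steals the two oldest tasks and runs their GPU implementations while the CPU worker processes the two newest.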

36

Mestre, Nuno Roberto Pereira. « Comparação do desempenho do FDTD com implementação em CPU e em GPU ». Master's thesis, Universidade de Aveiro, 2012. http://hdl.handle.net/10773/10939.

Texte intégral

Résumé :

Master's in Computer and Telematics Engineering
The Finite-Difference Time-Domain (FDTD) method is used in computational electromagnetics to simulate the propagation of electromagnetic waves in media whose characteristics may not be uniform. The method has countless applications, so it is advantageous to increase its performance, preferably using low-cost computer systems. The purpose of this thesis is to use two emerging and relatively low-cost technologies to increase FDTD performance in one and two dimensions. These technologies are systems with multi-core processors and graphics cards whose massively parallel processing capabilities can be used to run general-purpose code. To exploit a multi-core system, the originally sequential algorithm was changed to be executed by multiple threads. To take advantage of CUDA technology, the algorithm was converted so that it could be executed on a GPU. The performance gains obtained indicate that using these technologies is advantageous compared with purely sequential implementations.
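
For readers unfamiliar with FDTD, a minimal sequential one-dimensional update loop is sketched below. It is not the thesis code: the normalized field units, the Courant factor of 0.5, the Gaussian soft source and all parameter names are assumptions chosen for illustration, and no absorbing boundary is implemented.

```python
import math


def fdtd_1d(n_cells=200, n_steps=100, src=100, courant=0.5):
    """Leapfrog update of Ez and Hy on a staggered 1D grid (normalized units)."""
    ez = [0.0] * n_cells
    hy = [0.0] * n_cells
    for t in range(n_steps):
        # Electric field update: each cell only reads neighbouring hy values,
        # so all cells could be updated in parallel.
        for k in range(1, n_cells):
            ez[k] += courant * (hy[k - 1] - hy[k])
        # Gaussian pulse added at the source cell (soft source).
        ez[src] += math.exp(-0.5 * ((t - 30) / 10.0) ** 2)
        # Magnetic field update: again independent per cell.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k] - ez[k + 1])
    return ez, hy
```

Because every cell update within one half-step depends only on values from the previous half-step, the two inner loops are natural candidates for the multi-threaded and GPU versions the abstract describes.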

37

Ansaloni, Pietro. « Analisi di immagini con trasformata Ranklet : ottimizzazioni computazionali su CPU e GPU ». Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/5037/.

Texte intégral

38

Wen, Hao. « IMPROVING PERFORMANCE AND ENERGY EFFICIENCY FOR THE INTEGRATED CPU-GPU HETEROGENEOUS SYSTEMS ». VCU Scholars Compass, 2018. https://scholarscompass.vcu.edu/etd/5664.

Texte intégral

Résumé :

Current heterogeneous CPU-GPU architectures integrate general-purpose CPUs and highly thread-level-parallel GPUs (Graphics Processing Units) on the same die. This dissertation focuses on improving the energy efficiency and performance of such heterogeneous CPU-GPU systems. Leakage energy has become an increasingly large fraction of total energy consumption, so reducing it is important for improving overall energy efficiency. Caches occupy a large on-chip area, which makes them good targets for leakage energy reduction. For the CPU cache, we study how to reduce cache leakage energy efficiently in a hybrid SPM (Scratch-Pad Memory) and cache architecture. The access pattern of the GPU cache differs from that of the CPU cache: it usually has little locality and a high miss rate. In addition, the GPU can hide memory latency more effectively thanks to multi-threading. For these reasons, we find that the cache lines of GPU data caches can be placed into low-power mode more aggressively than in traditional leakage management for CPU caches, which reduces more leakage energy without significant performance degradation. Contention for resources shared between the CPU and the GPU, such as the last-level cache (LLC), the interconnection network and DRAM, may degrade both CPU and GPU performance. We propose a simple yet effective probability-based method to control the LLC replacement policy, reducing the CPU's inter-core conflict misses caused by the GPU without significantly impacting GPU performance. In addition, we develop two strategies that combine the probability-based method for the LLC with an existing technique for the interconnection network, virtual channel partition (VCP), to further improve CPU performance.
Breadth-first search (BFS), a basis for graph search and a core building block for many higher-level graph analysis applications, is a typical example of a parallel computation that is inefficient on GPU architectures. In a graph, a small portion of the nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. These irregularities limit the parallelism of BFS executing on GPUs. Unlike previous work, which focuses on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS), which virtually changes the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism so that more GPU threads can work concurrently. This approach ensures correctness and can significantly improve both performance and energy efficiency on GPUs.
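
The idea of splitting high-degree nodes into bounded groups can be sketched sequentially as follows. This is an interpretation, not the VBFS implementation: the abstract does not give the data layout, so the chunked adjacency lists, the `max_deg` parameter and the function names are assumptions.

```python
def virtualize(adj, max_deg):
    """Split each adjacency list into chunks of at most max_deg neighbours.

    Each chunk plays the role of one virtual vertex of the original vertex.
    """
    return {
        v: [nbrs[i:i + max_deg] for i in range(0, len(nbrs), max_deg)]
        for v, nbrs in adj.items()
    }


def bfs_virtual(adj, source, max_deg):
    """Level-synchronous BFS whose work items are virtual vertices."""
    virt = virtualize(adj, max_deg)
    dist = {source: 0}
    frontier = [source]
    while frontier:
        # One work item per virtual vertex: every item scans at most max_deg
        # edges, so the per-item work is bounded (on a GPU, one thread each).
        work = [(v, chunk) for v in frontier for chunk in virt[v]]
        next_frontier = []
        for v, chunk in work:
            for u in chunk:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    next_frontier.append(u)
        frontier = next_frontier
    return dist
```

For a star graph whose hub has nine neighbours and `max_deg=4`, the hub becomes three work items of sizes 4, 4 and 1, and the resulting distances match an ordinary BFS.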

39

Giuntoli, Guido. « Hybrid CPU/GPU implementation for the FE2 multi-scale method for composite problems ». Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/668824.

Texte intégral

Résumé :

This thesis aims to develop a High-Performance Computing implementation for solving large composite-material problems with the FE2 multi-scale method. Previous works have not been able to scale the FE2 strategy to real-size problems with mesh resolutions of more than 10K elements at the macro-scale and 100^3 elements at the micro-scale, because of the computational requirements of these calculations. This work identifies the most computationally intensive parts of the FE2 algorithm and ports several parts of the micro-scale computations to GPUs. The cases considered assume small deformations and steady-state equilibrium conditions. The work provides a feasible parallel strategy that can be used in real engineering cases to optimize the design of composite material structures. To this end, it presents a coupling scheme between the MPI multi-physics code Alya (macro-scale) and the CPU/GPU-accelerated code Micropp (micro-scale). The coupled system is designed to work on multi-GPU architectures and to exploit GPU overloading. In addition, a Multi-Zone coupling methodology combined with weighted partitioning is proposed to reduce the computational cost and to solve the load-balance problem. The thesis demonstrates that the proposed method scales notably well for the target problems, especially on hybrid architectures with distributed CPU nodes communicating with multiple GPUs. Moreover, it clarifies the advantages of the CPU/GPU-accelerated version with respect to the pure CPU approach.

40

Barrientos, Rojel Ricardo Javier. « Búsqueda por Similitud en Espacios Métricos Sobre Plataformas Multi-Core (CPU y GPU) ». Tesis, Universidad de Chile, 2011. http://www.repositorio.uchile.cl/handle/2250/102738.

Texte intégral

41

Sajjapongse, Kittisak. « Hierarchical scheduling and uniform access programming frameworks for heterogeneous CPU-GPU computing clusters ». Thesis, University of Missouri - Columbia, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10178997.

Texte intégral

Résumé :

The advance of the GPU hardware architecture has made GPUs attractive devices for general-purpose computing. Modern GPUs are equipped with an increasing number of cores, a flexible memory hierarchy, and a large memory capacity. While the computational power of modern GPU devices has allowed their introduction in high-performance computing (HPC) clusters and the efficient processing of ever larger workloads, existing software components for HPC clusters still offer basic support for hardware heterogeneity and often cause performance limitations in the presence of GPU devices. In particular, two kinds of limitations are associated with these software components: runtime support and programmability. We found that these limitations are due to the fact that existing software frameworks for heterogeneous clusters treat GPUs as dedicated coprocessor devices.

In this dissertation, we propose two software frameworks for addressing the performance and hardware underutilization issues found in heterogeneous CPU-GPU clusters as well as increasing their programmability. Our frameworks provide a uniform view of compute resources and treat CPUs and GPUs equally as first-class resources, allowing efficient management of heterogeneous compute resources. First, we propose a hierarchical scheduling framework consisting of a node-level runtime and a cluster-level scheduler that provides abstraction of heterogeneous compute resources at different granularities. This hierarchical framework targets existing applications and does not require their modification. In the node-level runtime, we identify and design mechanisms, such as virtual GPUs, GPU virtual memory, dynamic load balancing and pre-emption, which are necessary to support efficient sharing and load balancing schemes for GPUs within a compute node. In the cluster-level scheduler, we introduce mechanisms to abstract compute nodes and perform load balancing in concert with the node-level runtime. Our hierarchical scheduling framework allows supporting different load balancing policies and does not require additional inputs (such as profiling information) from users. Second, we propose a programming framework based on a novel memory and execution model. Our memory model hides disjoint addressing spaces (corresponding to different CPUs, GPUs and compute nodes) and provides a view of a single virtual memory space that can be accessed by all compute resources in a heterogeneous cluster. Our execution model provides uniform access to compute resources and allows our framework to treat all CPUs and GPUs equally and to access data in the virtual memory space.

42

Wahlberg, Björn. « Att procedurellt generera ett 2D landskap parallellt på GPU vs seriellt på CPU ». Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-18759.

Texte intégral

Résumé :

Procedurally generated content (PCG) is very common in games today, largely to increase replayability. Popular examples of games that use PCG are Terraria (2011) and Minecraft (2011). As hardware becomes more and more powerful, the demands on games that use these techniques also increase, since content can be generated in real time. But is there untapped potential in the graphics card? The trend of rising processor clock frequencies has slowed in recent years and has been replaced by a growing number of cores. Here, parallelization of program code can be used to get more out of the computer's hardware. A technology-oriented experiment was carried out, first on a serial CPU solution and then on a parallel GPU solution, to measure how long each method took. This was done on maps of varying size in order to determine whether there was a relationship between size and time. The implementation used the SFML library for the GPU variant, in which a fragment shader performed all the parallel computations for the map generation. The CPU method used the same techniques as the GPU method, but without any parallelization. Both techniques were validated by using SFML to draw the generated maps with simple graphics.

43

Fauzia, Naznin. « Characterization of Data Locality Potential of CPU and GPU Applications through Dynamic Analysis ». The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1420759839.

Texte intégral

44

Van, Winkle Scott E. « Dynamic Bandwidth and Laser Scaling for CPU-GPU Heterogenous Network-on-Chip Architectures ». Ohio University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1500992706350957.

Texte intégral

45

Xue, Weicheng. « CPU/GPU Code Acceleration on Heterogeneous Systems and Code Verification for CFD Applications ». Diss., Virginia Tech, 2021. http://hdl.handle.net/10919/102073.

Texte intégral

Résumé :

Computational Fluid Dynamics (CFD) applications usually involve intensive computations, which can be accelerated through using open accelerators, especially GPUs due to their common use in the scientific computing community. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented numerically correctly, which is called code verification. This dissertation focuses on accelerating research CFD codes on multi-CPUs/GPUs using MPI and OpenACC, as well as the code verification for turbulence model implementation using the method of manufactured solutions and code-to-code comparisons. First, a variety of performance optimizations both agnostic and specific to applications and platforms are developed in order to 1) improve the heterogeneous CPU/GPU compute utilization; 2) improve the memory bandwidth to the main memory; 3) reduce communication overhead between the CPU host and the GPU accelerator; and 4) reduce the tedious manual tuning work for GPU scheduling. Both finite difference and finite volume CFD codes and multiple platforms with different architectures are utilized to evaluate the performance optimizations used. A maximum speedup of over 70 is achieved on 16 V100 GPUs over 16 Xeon E5-2680v4 CPUs for multi-block test cases. In addition, systematic studies of code verification are performed for a second-order accurate finite volume research CFD code. Cross-term sinusoidal manufactured solutions are applied to verify the Spalart-Allmaras and k-omega SST model implementation, both in 2D and 3D. This dissertation shows that the spatial and temporal schemes are implemented numerically correctly.
Doctor of Philosophy
Computational Fluid Dynamics (CFD) is a numerical method to solve fluid problems, which usually requires a large amount of computations. A large CFD problem can be decomposed into smaller sub-problems which are stored in discrete memory locations and accelerated by a large number of compute units. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented correctly, which is called code verification. This dissertation focuses on the CFD code acceleration as well as the code verification for turbulence model implementation. In this dissertation, multiple Graphic Processing Units (GPUs) are utilized to accelerate two CFD codes, considering that the GPU has high computational power and high memory bandwidth. A variety of optimizations are developed and applied to improve the performance of CFD codes on different parallel computing systems. The program execution time can be reduced significantly especially when multiple GPUs are used. In addition, code-to-code comparisons with some NASA CFD codes and the method of manufactured solutions are utilized to verify the correctness of a research CFD code.

46

Said, Issam. « Apports des architectures hybrides à l'imagerie profondeur : étude comparative entre CPU, APU et GPU ». Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066531/document.

Texte intégral

Résumé :

In an exploration context, Oil and Gas (O&G) companies rely on HPC to accelerate depth imaging algorithms. Solutions based on CPU clusters and hardware accelerators are widely embraced by the industry. Graphics Processing Units (GPUs), with their large compute power and high memory bandwidth, have attracted significant interest. However, deploying heavy imaging workflows on such hardware, the Reverse Time Migration (RTM) being the best known, suffers from a few limitations: limited memory capacity, frequent CPU-GPU communications that may be bottlenecked by the PCI transfer rate, and high power consumption. AMD has recently launched the Accelerated Processing Unit (APU): a processor that merges a CPU and a GPU on the same die, with promising features, notably a unified CPU-GPU memory. Throughout this thesis, we explore how efficiently the APU technology can be applied in an O&G context, and study whether it can overcome the limitations of CPU-based and GPU-based solutions. The APU is evaluated with a suite of OpenCL benchmarks covering memory, applications and power efficiency. The feasibility of the hybrid utilization of APUs is explored. The efficiency of a directive-based approach is also investigated. By analyzing a selection of seismic applications (modeling and RTM) at the node level and at large scale, a comparative study between the CPU, the APU and the GPU is conducted. We show the relevance of overlapping I/O and MPI communications with computations on APU and GPU clusters, that APUs deliver performance ranging between that of CPUs and that of GPUs, and that the APU can be as power efficient as the GPU.

47

Said, Issam. « Apports des architectures hybrides à l'imagerie profondeur : étude comparative entre CPU, APU et GPU ». Electronic Thesis or Diss., Paris 6, 2015. http://www.theses.fr/2015PA066531.

Texte intégral


48

Sjölander, Erik. « Krypteringsalgoritmer i OpenCL : AES-256 och ECC ElGamal ». Thesis, Linköpings universitet, Institutionen för systemteknik, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-81660.

Texte intégral

Résumé :

In recent years, graphics cards have been transformed from pure rendering units into devices capable of general-purpose computation, much like an ordinary processor. With languages such as OpenCL, graphics cards become powerful devices that can be used efficiently for large computations. The goal of this thesis was to show cryptographic algorithms that are well suited to acceleration with OpenCL on graphics cards. A further goal was to show that a program does not need extensive rewriting to work in OpenCL: C code can be translated into an OpenCL kernel with only small syntax changes. Two encryption algorithms were ported to run on graphics cards. The first, AES-256, was tested in two implementations, an 8-bit and a 32-bit variant. The second was Elliptic Curve Cryptography with the ElGamal scheme. These two were chosen to show that both fast symmetric schemes and slower public-key schemes can be accelerated. AES-256 in ECB mode on the GPU reached a throughput of 7 Gbit/s, an acceleration of 25 times compared with a CPU. For elliptic curves, a single scalar point multiplication on the NIST curve B-163 is computed on the GPU in 65 µs. Used in the ElGamal encryption scheme, this gave an acceleration of 55 times for encryption and 67 times for decryption. Both implementations are based on data parallelism, with the data elements distributed over the available hardware. The work was carried out at Syntronic Software Innovations AB in Linköping, Sweden.

49

Löfgren, Robin, et Kristoffer Dahl. « Beräkningar med GPU vs CPU : En jämförelsestudie av beräkningseffektivitet med avseende på energi- och tidsförbrukning ». Thesis, Linnaeus University, School of Computer Science, Physics and Mathematics, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-5782.

Texte intégral

Résumé :

The thesis is a comparative study of computational efficiency, in terms of energy and time consumption, of graphics cards and processors in personal computers and PlayStation 3 consoles.

The problem is studied in order to make the public aware that part of the energy problem associated with computation can be solved by increasing the energy efficiency of the computational units.

The study was conducted in an exploratory way, examining the relationship between processors and graphics cards and which performs best in which context. Performance tests are carried out with the molecular computation program F@H and the file compression program WinRAR. The tests are performed on multi-core and single-core PCs and on PS3s with different characteristics. In some tests, power consumption is measured in order to calculate how energy-efficient certain systems are.

The results clearly show how average power consumption and energy efficiency differ between the test systems under load, in sleep mode, and for different types of computation.
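
As a rough illustration of the kind of calculation the abstract refers to, energy use follows from average power and run time, and energy efficiency can be expressed as work done per unit of energy. The sketch below is not taken from the thesis: the function names, the choice of watt-hours, and the generic "work units" are assumptions made for the example.

```python
def energy_wh(avg_power_w, duration_s):
    """Energy in watt-hours from average power (W) and run time (s)."""
    return avg_power_w * duration_s / 3600.0


def work_per_wh(work_units, avg_power_w, duration_s):
    """Hypothetical efficiency metric: completed work units per watt-hour."""
    return work_units / energy_wh(avg_power_w, duration_s)


# Two made-up systems finishing the same job (1 work unit):
# a fast, power-hungry one and a slow, frugal one.
fast = work_per_wh(1, avg_power_w=200, duration_s=1800)   # uses 100 Wh
slow = work_per_wh(1, avg_power_w=50, duration_s=3600)    # uses 50 Wh
```

With these invented numbers the slower system is the more energy-efficient one, which is why time and energy have to be measured separately.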

50

Chavez, Daniel. « Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks ». Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-190883.

Abstract:

Map projections lie at the core of geographic information systems and numerous projections are used today. The reprojection between different map projections is recurring in a geographic information system and it can be parallelized with multi-core CPUs and GPUs. This thesis implements a parallel analytic reprojection algorithm of raster data in C/C++ with the parallel programming frameworks Pthreads, C++11 STL threads, OpenMP, Intel TBB, CUDA and OpenCL. The thesis compares the execution times from the different implementations on small, medium and large raster data sets, where OpenMP had the best speedup of 6, 6.2 and 5.5, respectively. Meanwhile, the GPU implementations were 293 % faster than the fastest CPU implementations, where profiling shows that the CPU implementations spend most time on trigonometry functions. The results show that the reprojection algorithm is well suited for the GPU, while OpenMP and Intel TBB are the fastest of the CPU frameworks.
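
The reason such a reprojection parallelizes well is that each pixel's coordinates can be transformed independently of every other pixel. The sketch below illustrates only that decomposition. It is not the thesis's algorithm or code: it uses Python rather than C/C++, a spherical Mercator forward projection chosen as an example, and hypothetical function names. In standard CPython the threads mostly show how the rows are split up rather than deliver a real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor

R = 6378137.0  # sphere radius in metres (assumed; the WGS84 semi-major axis)


def mercator_forward(lon_deg, lat_deg):
    """Spherical Mercator: longitude/latitude in degrees -> x/y in metres."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def project_row(row):
    """Project every pixel centre of one raster row; rows are independent."""
    lat, lons = row
    return [mercator_forward(lon, lat) for lon in lons]


def project_grid(lats, lons, workers=4):
    """Hand one row at a time to a pool of workers and keep the row order."""
    rows = [(lat, lons) for lat in lats]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(project_row, rows))


grid = project_grid(lats=[-60.0, 0.0, 60.0], lons=[-180.0, 0.0, 180.0])
```

The trigonometric and logarithmic calls inside `mercator_forward` are the kind of per-pixel work that the abstract's profiling remark refers to.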
