Dissertations / Theses on the topic 'GPU Systems'

Consult the top 50 dissertations / theses for your research on the topic 'GPU Systems.'

You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Yuan, George Lai. "GPU compute memory systems." Thesis, University of British Columbia, 2009. http://hdl.handle.net/2429/15877.

Full text
Abstract:
Modern Graphics Processing Units (GPUs) offer orders of magnitude more raw computing power than contemporary CPUs by using many simpler in-order single-instruction, multiple-data (SIMD) cores optimized for multi-thread performance rather than single-thread performance. As such, GPUs operate much closer to the "Memory Wall", thus requiring much more careful memory management. This thesis proposes changes to the memory system of our detailed GPU performance simulator, GPGPU-Sim, to allow proper simulation of general-purpose applications written using NVIDIA's Compute Unified Device Architecture (CUDA) framework. To test these changes, fourteen CUDA applications with varying degrees of memory intensity were collected. With these changes, we show that our simulator predicts performance of commodity GPU hardware with 86% correlation. Furthermore, we show that increasing chip resources to allow more threads to run concurrently does not necessarily increase performance due to increased contention for the shared memory system. Moreover, this thesis proposes a hybrid analytical DRAM performance model that uses memory address traces to predict the efficiency of a DRAM system when using a conventional First-Ready First-Come First-Serve (FR-FCFS) memory scheduling policy. To stress the proposed model, a massively multithreaded architecture based upon contemporary high-end GPUs is simulated to generate the memory address trace needed. The results show that the hybrid analytical model predicts DRAM efficiency to within 11.2% absolute error when arithmetically averaged across a memory-intensive subset of the CUDA applications introduced in the first part of this thesis. Finally, this thesis proposes a complexity-effective solution to memory scheduling that recovers most of the performance loss incurred by a naive in-order First-in First-out (FIFO) DRAM scheduler compared to an aggressive out-of-order FR-FCFS scheduler. While FR-FCFS scheduling re-orders memory requests to improve row access locality, we instead employ an interconnection network arbitration scheme that preserves the inherently high row access locality of memory request streams from individual "shader cores" and, in doing so, achieve DRAM efficiency and system performance close to that of FR-FCFS with a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, ring, and mesh networks and show that, when coupled with a banked FIFO in-order scheduler, it achieves up to 91.0% of the performance obtainable with an out-of-order memory scheduler with eight-entry DRAM controller queues.
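The contrast between the two scheduling policies discussed above can be illustrated with a small host-side sketch. The code below is our own simplified illustration, not part of GPGPU-Sim, and it models only the request-selection rule: FR-FCFS serves the oldest request that hits the currently open row of its bank and falls back to first-come, first-served order otherwise, whereas FIFO always serves the oldest request.

```cuda
// Simplified, illustrative DRAM request selection (not the GPGPU-Sim model).
// FR-FCFS prefers requests that hit the row already open in their bank,
// which improves row access locality; FIFO simply serves the oldest request.
#include <cstdio>
#include <vector>

struct Request { int bank; int row; int arrival; };

int pickFifo(const std::vector<Request>& queue) {
    return queue.empty() ? -1 : 0;                   // oldest request is at index 0
}

int pickFrFcfs(const std::vector<Request>& queue, const std::vector<int>& openRow) {
    for (size_t i = 0; i < queue.size(); ++i)
        if (openRow[queue[i].bank] == queue[i].row)  // row-buffer hit: serve it first
            return static_cast<int>(i);
    return queue.empty() ? -1 : 0;                   // no hit: fall back to oldest
}

int main() {
    std::vector<int> openRow = {5, 9};               // currently open row in each of two banks
    std::vector<Request> queue = {{0, 7, 0}, {1, 9, 1}, {0, 5, 2}};  // oldest first
    printf("FIFO serves the request that arrived at t=%d; FR-FCFS serves t=%d\n",
           queue[pickFifo(queue)].arrival, queue[pickFrFcfs(queue)].arrival);
    return 0;
}
```

The complexity-effective design proposed in the thesis keeps the simple FIFO rule in the DRAM controller and instead relies on interconnection network arbitration to deliver request streams that are already row-local.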
APA, Harvard, Vancouver, ISO, and other styles
2

Arnau, Jose Maria. "Energy-efficient mobile GPU systems." Doctoral thesis, Universitat Politècnica de Catalunya, 2015. http://hdl.handle.net/10803/290736.

Full text
Abstract:
The design of mobile GPUs is all about saving energy. Smartphones and tablets are battery-operated and thus any type of rendering needs to use as little energy as possible. Furthermore, smartphones do not include sophisticated cooling systems due to their small size, making heat dissipation a primary concern. Improving the energy-efficiency of mobile GPUs will be absolutely necessary to achieve the performance required to satisfy consumer expectations, while maintaining operating time per battery charge and keeping the GPU within its thermal limits. The first step in optimizing energy consumption is to identify the sources of energy drain. Previous studies have demonstrated that the register file is one of the main sources of energy consumption in a GPU. As graphics workloads are highly data- and memory-parallel, GPUs rely on massive multithreading to hide the memory latency and keep the functional units busy. However, aggressive multithreading requires a huge register file to keep the registers of thousands of simultaneous threads. Such a big register file exceeds the power budget typically available for an embedded graphics processor and, hence, more energy-efficient memory latency tolerance techniques are necessary. On the other hand, prior research showed that the off-chip accesses to system memory are one of the most expensive operations in terms of energy in a mobile GPU. Therefore, optimizing memory bandwidth usage is a primary concern in mobile GPU design. Many bandwidth saving techniques, such as texture compression or ARM's transaction elimination, have been proposed in both industry and academia. The purpose of this thesis is to study the characteristics of mobile graphics processors and mobile workloads in order to propose different energy saving techniques specifically tailored for the low-power segment. Firstly, we focus on energy-efficient memory latency tolerance. We analyze several techniques such as multithreading and prefetching and conclude that they are effective but not energy-efficient. Next, we propose an architecture for the fragment processors of a mobile GPU that is based on the decoupled access/execute paradigm. The results obtained by using a cycle-accurate mobile GPU simulator and several commercial Android games show that the decoupled architecture combined with a small degree of multithreading provides the most energy-efficient solution for hiding memory latency. More specifically, the decoupled access/execute-like design with just 4 SIMD threads/processor is able to achieve 97% of the performance of a larger GPU with 16 SIMD threads/processor, while providing 20.5% energy savings on average. Secondly, we focus on optimizing memory bandwidth in a mobile GPU. We analyze the bandwidth usage in a set of commercial Android games and find that most of the bandwidth is employed for fetching textures, and also that consecutive frames share most of the texture dataset as they tend to be very similar. However, the GPU cannot capture inter-frame texture re-use due to the large size of the texture dataset for one frame. Based on this analysis, we propose Parallel Frame Rendering (PFR), a technique that overlaps the processing of multiple frames in order to exploit inter-frame texture re-use and save bandwidth. By processing multiple frames in parallel, textures are fetched once every two frames instead of once per frame as in conventional GPUs.
PFR provides 23.8% memory bandwidth savings on average in our set of Android games, which results in a 12% speedup and 20.1% energy savings. Finally, we improve PFR by introducing a hardware memoization system on top. We analyze the redundancy in mobile games and find that more than 38% of the Fragment Program executions are redundant on average. We thus propose a task-level hardware-based memoization system that provides a 15% speedup and 12% energy savings on average over a PFR-enabled GPU.
El diseño de las GPUs (Graphics Procesing Units) móviles se centra fundamentalmente en el ahorro energético. Los smartphones y las tabletas son dispositivos alimentados mediante baterías y, por lo tanto, cualquier tipo de renderizado debe utilizar la menor cantidad de energía posible. Mejorar la eficiencia energética de las GPUs móviles será absolutamente necesario para alcanzar el rendimiento requirido para satisfacer las expectativas de los usuarios, sin reducir el tiempo de vida de la batería. El primer paso para optimizar el consumo energético consiste en identificar qué componentes son los principales consumidores de la batería. Estudios anteriores han identificado al banco de registros y a los accessos a memoria principal como las mayores fuentes de consumo energético en una GPU. El propósito de esta tesis es estudiar las características de los procesadores gráficos móviles y de las aplicaciones móviles con el objetivo de proponer distintas técnicas de ahorro energético. En primer lugar, la investigación se centra en desarrollar métodos energéticamente eficientes para ocultar la latencia de la memoria principal. El resultado de la investigación es una arquitectura desacoplada para los Fragment Processors de la GPU. Los resultados experimentales utilizando un simulador de ciclo y distintos juegos de Android muestran que una arquitectura desacoplada, combinada con un nivel de multithreading moderado, proporciona la solución más eficiente desde el punto de vista energético para ocultar la latencia de la memoria prinicipal. Más específicamente, la arquitectura desacoplada con sólo 4 SIMD threads/processor es capaz de alcanzar el 97% del rendimiento de una GPU más grande con 16 SIMD threads/processor, al tiempo que se reduce el consumo energético en un 20.5%. En segundo lugar, el trabajo de investigación se centró en optimizar el ancho de banda en una GPU móvil. Se realizó un estudio del uso del ancho de banda en distintos juegos de Android y se observó que la mayor parte del ancho de banda se utiliza para leer texturas. Además, se observó que frames consecutivos comparten una gran parte de las texturas. Sin embargo, la GPU no puede capturar el reuso de texturas entre frames dado que el tamaño de las texturas utilizadas por un frame es mucho mayor que la caché de segundo nivel. Basándose en este análisis, se desarrolló Parallel Frame Rendering (PFR), una técnica que solapa el procesado de multiples frames consecutivos con el objetivo de explotar el reuso de texturas entre frames y ahorrar así ancho de bando. Al procesar múltiples frames en paralelo las texturas se leen de memoria principal una vez cada dos frames en lugar de leerse en cada frame como sucede en una GPU convencional. PFR proporciona un ahorro del 23.8% en ancho de banda en promedio para distintos juegos de Android, este ahorro de ancho de banda redunda en un incremento del rendimiento del 12% y un ahorro energético del 20.1%. Por último, se mejoró PFR introduciendo un sistema hardware capaz de evitar cómputos redundantes. Un análisis de distintos juegos de Android reveló que más de un 38% de las ejecuciones del Fragment Program eran redundantes en promedio. Así pues, se propuso un sistema hardware capaz de identificar y eliminar parte de los cómputos y accessos a memoria redundantes, dicho sistema proporciona un incremento del rendimiento del 15% y un ahorro energético del 12% en promedio con respecto a una GPU móvil basada en PFR.
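As a rough software analogue of the task-level memoization idea described in the abstract above, the sketch below hashes the inputs of a fragment-program execution and reuses a cached result when the same signature has been seen before. This is our own host-side illustration with invented names; the thesis proposes a hardware lookup table, not this code.

```cuda
// Software analogue of task-level memoization for fragment shading (illustrative only).
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct FragmentInputs { uint32_t programId; uint32_t texel; uint32_t interpolants; };
struct Colour { uint8_t r, g, b, a; };

uint64_t hashInputs(const FragmentInputs& in) {
    uint32_t vals[3] = {in.programId, in.texel, in.interpolants};
    uint64_t h = 1469598103934665603ull;              // FNV-1a over the inputs
    for (uint32_t v : vals) { h ^= v; h *= 1099511628211ull; }
    return h;
}

Colour runFragmentProgram(const FragmentInputs& in) { // stand-in for the real shading work
    return Colour{uint8_t(in.texel & 0xFFu), uint8_t(in.interpolants & 0xFFu), 0, 255};
}

Colour shadeMemoized(const FragmentInputs& in,
                     std::unordered_map<uint64_t, Colour>& cache) {
    uint64_t key = hashInputs(in);
    auto hit = cache.find(key);
    if (hit != cache.end()) return hit->second;       // redundant execution skipped
    Colour c = runFragmentProgram(in);
    cache.emplace(key, c);
    return c;
}

int main() {
    std::unordered_map<uint64_t, Colour> cache;
    FragmentInputs frag{1, 42, 7};
    shadeMemoized(frag, cache);                       // first execution: computed and cached
    shadeMemoized(frag, cache);                       // identical inputs: served from the cache
    printf("cache entries: %zu\n", cache.size());
    return 0;
}
```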
APA, Harvard, Vancouver, ISO, and other styles
4

Dollinger, Jean-François. "A framework for efficient execution on GPU and CPU+GPU systems." Thesis, Strasbourg, 2015. http://www.theses.fr/2015STRAD019/document.

Full text
Abstract:
Les verrous technologiques rencontrés par les fabricants de semi-conducteurs au début des années deux-mille ont abrogé la flambée des performances des unités de calculs séquentielles. La tendance actuelle est à la multiplication du nombre de cœurs de processeur par socket et à l'utilisation progressive des cartes GPU pour des calculs hautement parallèles. La complexité des architectures récentes rend difficile l'estimation statique des performances d'un programme. Nous décrivons une méthode fiable et précise de prédiction du temps d'exécution de nids de boucles parallèles sur GPU basée sur trois étapes : la génération de code, le profilage offline et la prédiction online. En outre, nous présentons deux techniques pour exploiter l'ensemble des ressources disponibles d'un système pour la performance. La première consiste en l'utilisation conjointe des CPUs et GPUs pour l'exécution d'un code. Afin de préserver les performances il est nécessaire de considérer la répartition de charge, notamment en prédisant les temps d'exécution. Le runtime utilise les résultats du profilage et un ordonnanceur calcule des temps d'exécution et ajuste la charge distribuée aux processeurs. La seconde technique présentée met le CPU et le GPU en compétition : des instances du code cible sont exécutées simultanément sur CPU et GPU. Le vainqueur de la compétition notifie sa complétion à l'autre instance, impliquant son arrêt
Technological limitations faced by semiconductor manufacturers in the early 2000s restricted the increase in performance of sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first technique consists in jointly using the CPU and GPU to execute a code. In order to achieve higher performance, it is mandatory to consider load balance, in particular by predicting execution time. The runtime uses the profiling results, and the scheduler computes the execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the considered code are executed simultaneously on the CPU and GPU. The winner of the competition notifies the other instance of its completion, causing the latter to terminate.
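A minimal sketch of the second technique, the CPU/GPU competition, is shown below. It is our own illustration under simplifying assumptions: the per-chunk work is a placeholder and the kernel launch is only indicated by a comment. The point is the shared completion flag that lets the winning instance stop the losing one.

```cuda
// Illustrative sketch of racing a CPU instance against a GPU instance of the
// same computation: the first to finish sets a flag, the other stops at its
// next cancellation check.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> finished{false};

void runOnCpu(int chunks) {
    for (int c = 0; c < chunks && !finished.load(); ++c) {
        // ... execute chunk c of the loop nest on the CPU ...
    }
    if (!finished.exchange(true)) printf("CPU instance won\n");
}

void runOnGpu(int chunks) {
    for (int c = 0; c < chunks && !finished.load(); ++c) {
        // ... launch the CUDA kernel for chunk c and synchronize on its stream ...
    }
    if (!finished.exchange(true)) printf("GPU instance won\n");
}

int main() {
    std::thread cpu(runOnCpu, 1 << 16);
    std::thread gpu(runOnGpu, 1 << 10);
    cpu.join();
    gpu.join();
    return 0;
}
```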
APA, Harvard, Vancouver, ISO, and other styles
5

Yanggratoke, Rerngvit. "GPU Network Processing." Thesis, KTH, Telekommunikationssystem, TSLab, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-103694.

Full text
Abstract:
Networking technology is connecting more and more people around the world. It has become an essential part of our daily life. For this connectivity to be seamless, networks need to be fast. Nonetheless, rapid growth in network traffic and the variety of communication protocols overwhelm the Central Processing Units (CPUs) processing packets in the networks. Existing solutions to this problem, such as ASICs, FPGAs, NPUs, and TOEs, are neither cost-effective nor easy to manage because they require special hardware and custom configurations. This thesis approaches the problem differently, by offloading the network processing to off-the-shelf Graphics Processing Units (GPUs). The thesis's primary goal is to find out how the GPUs should be used for the offloading. The thesis follows the case study approach, and the selected case studies are layer 2 Bloom filter forwarding and flow lookup in an OpenFlow switch. Implementation alternatives and an evaluation methodology are proposed for both case studies. Then, prototype implementations comparing the traditional CPU-only and the GPU-offloading approaches are developed and evaluated. The primary findings from this work are criteria for network processing functions suitable for GPU offloading and the tradeoffs involved. The criteria are: no inter-packet dependency, similar processing flows for all packets, and within-packet parallel processing opportunity. This offloading trades higher latency and memory consumption for higher throughput.
Nätverksteknik ansluter fler och fler människor runt om i världen. Det har blivit en viktig del av vårt dagliga liv. För att denna anslutning skall vara sömlös, måste nätet vara snabbt. Den snabba tillväxten i nätverkstrafiken och olika kommunikationsprotokoll sätter stora krav på processorer som hanterar all trafik. Befintliga lösningar på detta problem, t.ex. ASIC, FPGA, NPU, och TOE är varken kostnadseffektivt eller lätta att hantera, eftersom de kräver speciell hårdvara och anpassade konfigurationer. Denna avhandling angriper problemet på ett annat sätt genom att avlasta nätverks processningen till grafikprocessorer som sitter i vanliga pc-grafikkort. Avhandlingen främsta mål är att ta reda på hur GPU bör användas för detta. Avhandlingen följer fallstudie modell och de valda fallen är lager 2 Bloom filter forwardering och ``flow lookup'' i Openflow switch. Implementerings alternativ och utvärderingsmetodik föreslås för både fallstudierna. Sedan utvecklas och utvärderas en prototyp för att jämföra mellan traditionell CPU- och GPU-offload. Det primära resultatet från detta arbete utgör kriterier för nätvärksprocessfunktioner lämpade för GPU offload och vilka kompromisser som måste göras. Kriterier är inget inter-paket beroende, liknande processflöde för alla paket. och möjlighet att köra fler processer på ett paket paralellt. GPU offloading ger ökad fördröjning och minneskonsumption till förmån för högre troughput.
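To make the first case study above concrete, here is a minimal CUDA sketch of a Bloom filter membership test for a batch of destination addresses. It is our own illustration (the hash functions, filter size, and port convention are invented), but it matches the criteria identified in the thesis: no inter-packet dependency and the same processing flow for every packet.

```cuda
// Illustrative layer-2 Bloom filter lookup on the GPU: one thread per packet.
#include <cstdio>
#include <cuda_runtime.h>

#define FILTER_BITS 4096u
#define FILTER_WORDS (FILTER_BITS / 32u)

__host__ __device__ unsigned hash1(unsigned x) { return (x * 2654435761u) % FILTER_BITS; }
__host__ __device__ unsigned hash2(unsigned x) { return ((x ^ (x >> 13)) * 0x9E3779B1u) % FILTER_BITS; }

__global__ void bloomLookup(const unsigned* dstAddr, const unsigned* filter,
                            int* decision, int nPackets) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPackets) return;
    unsigned a = dstAddr[i];
    unsigned b1 = hash1(a), b2 = hash2(a);
    bool maybeKnown = ((filter[b1 / 32u] >> (b1 % 32u)) & 1u) &&
                      ((filter[b2 / 32u] >> (b2 % 32u)) & 1u);
    decision[i] = maybeKnown ? 1 : 0;   // 1: forward on the learned port, 0: flood
}

int main() {
    unsigned hFilter[FILTER_WORDS] = {0};
    unsigned hAddr[2] = {0x00A1B2C3u, 0x00D4E5F6u};
    unsigned i1 = hash1(hAddr[0]), i2 = hash2(hAddr[0]);  // insert the first address
    hFilter[i1 / 32u] |= 1u << (i1 % 32u);
    hFilter[i2 / 32u] |= 1u << (i2 % 32u);

    unsigned *dAddr, *dFilter;
    int *dDecision, hDecision[2];
    cudaMalloc(&dAddr, sizeof(hAddr));
    cudaMalloc(&dFilter, sizeof(hFilter));
    cudaMalloc(&dDecision, sizeof(hDecision));
    cudaMemcpy(dAddr, hAddr, sizeof(hAddr), cudaMemcpyHostToDevice);
    cudaMemcpy(dFilter, hFilter, sizeof(hFilter), cudaMemcpyHostToDevice);

    bloomLookup<<<1, 32>>>(dAddr, dFilter, dDecision, 2);
    cudaMemcpy(hDecision, dDecision, sizeof(hDecision), cudaMemcpyDeviceToHost);
    printf("packet 0 -> %d, packet 1 -> %d\n", hDecision[0], hDecision[1]);

    cudaFree(dAddr); cudaFree(dFilter); cudaFree(dDecision);
    return 0;
}
```

Batching many packets per kernel launch is what trades the higher latency and memory consumption mentioned above for higher throughput.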
APA, Harvard, Vancouver, ISO, and other styles
6

Spampinato, Daniele. "Modeling Communication on Multi-GPU Systems." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9068.

Full text
Abstract:

Coupling commodity CPUs and modern GPUs gives you cheap, high-performance heterogeneous systems with impressive FLOPS counts. The recent evolution of GPGPU models and technologies makes these systems even more appealing as compute devices for a range of HPC applications including image processing, seismic processing and other physical modeling, as well as linear programming applications. In fact, graphics vendors such as NVIDIA and AMD are now targeting HPC with some of their products. Due to the power and frequency walls, the trend is now to use multiple GPUs on a given system, much like you will find multiple cores on CPU-based systems. However, deepening the resource hierarchy widens the spectrum of factors that may impact the performance of the system. The lack of good models for GPU-based, heterogeneous systems also makes it harder to understand which factors impact performance the most. The goal of this thesis is to analyze such factors by investigating and benchmarking NVIDIA's multi-GPU solution, their recent NVIDIA Tesla S1070 Computing System. This system combines four T10 GPUs, making available up to 4 TFLOPS of computational power. Based on a comparative study of fundamental parallel computing models and on the specific heterogeneous features exposed by the system, we define a test space for performance analysis. As a case study, we develop a red-black SOR PDE solver for Laplace equations with Dirichlet boundaries, well known for requiring constant communication in order to exchange neighboring data. To aid both design and analysis, we propose a model for multi-GPU systems targeting communication between the GPUs. The main variables exposed by the benchmark application are: domain size and shape, kind of data partitioning, number of GPUs, width of the borders to exchange, kernels to use, and kind of synchronization between the GPU contexts. Among other results, the framework is able to point out the most critical bounds of the S1070 system when dealing with applications like the one in our case study. We show that the multi-GPU system greatly benefits from using all its four GPUs on very large data volumes. Our results show that the four GPUs are almost four times faster than a single GPU, and twice as fast as two. Our analysis outcomes also allow us to refine our static communication model, enriching it with regression-based predictions.
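The case study can be summarized with a short kernel sketch. The code below is our own single-GPU illustration of red-black SOR sweeps for the Laplace equation with fixed (Dirichlet) borders; in the multi-GPU setting studied in the thesis, the rows along each partition boundary would additionally be exchanged between neighbouring GPUs after every iteration.

```cuda
// Illustrative red-black SOR sweeps for a 2D Laplace problem (single GPU).
// Points of one checkerboard colour are updated in parallel; they only read
// neighbours of the other colour, so each sweep is race-free.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sorSweep(float* u, int nx, int ny, float omega, int colour) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;        // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;        // row index
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // Dirichlet border stays fixed
    if (((i + j) & 1) != colour) return;                          // skip the other colour
    int idx = j * nx + i;
    float gaussSeidel = 0.25f * (u[idx - 1] + u[idx + 1] + u[idx - nx] + u[idx + nx]);
    u[idx] = (1.0f - omega) * u[idx] + omega * gaussSeidel;
}

int main() {
    const int nx = 64, ny = 64;
    float* h = new float[nx * ny]();
    for (int i = 0; i < nx; ++i) h[i] = 1.0f;             // hot boundary along row 0

    float* d;
    cudaMalloc(&d, nx * ny * sizeof(float));
    cudaMemcpy(d, h, nx * ny * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    for (int iter = 0; iter < 500; ++iter) {
        sorSweep<<<grid, block>>>(d, nx, ny, 1.5f, 0);    // update "red" points
        sorSweep<<<grid, block>>>(d, nx, ny, 1.5f, 1);    // update "black" points
        // Multi-GPU version: exchange boundary rows with neighbouring GPUs here.
    }
    cudaMemcpy(h, d, nx * ny * sizeof(float), cudaMemcpyDeviceToHost);
    printf("u at the domain centre: %f\n", h[(ny / 2) * nx + nx / 2]);

    cudaFree(d);
    delete[] h;
    return 0;
}
```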

APA, Harvard, Vancouver, ISO, and other styles
7

Lulec, Andac. "Solution Of Sparse Systems On Gpu Architecture." Master's thesis, METU, 2011. http://etd.lib.metu.edu.tr/upload/12613355/index.pdf.

Full text
Abstract:
The solution of the linear system of equations is one of the core aspects of Finite Element Analysis (FEA) software. Since a large number of arithmetic operations is required to solve the system obtained by FEA, the solution of the linear equations has a significant influence on the performance of the software. In recent years, the increasing demand for performance in the game industry led to significant improvements in the performance of Graphics Processing Units (GPUs). With their massive floating-point operation capability, they became attractive sources of performance for general-purpose programmers. For this reason, GPUs are chosen as the target hardware to develop an efficient parallel direct solver for the solution of the linear equations obtained from FEA.
APA, Harvard, Vancouver, ISO, and other styles
8

Dastgeer, Usman. "Skeleton Programming for Heterogeneous GPU-based Systems." Licentiate thesis, Linköpings universitet, PELAB - Laboratoriet för programmeringsomgivningar, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-70234.

Full text
Abstract:
In this thesis, we address issues associated with programming modern heterogeneous systems while focusing on a special kind of heterogeneous systems that include multicore CPUs and one or more GPUs, called GPU-based systems. We consider the skeleton programming approach to achieve high-level abstraction for efficient and portable programming of these GPU-based systems, and present our work on the SkePU library, a skeleton library for these systems. We extend the existing SkePU library with a two-dimensional (2D) data type and skeleton operations and implement several new applications using the newly made skeletons. Furthermore, we consider the algorithmic choice present in SkePU and implement support to specify and automatically optimize the algorithmic choice for a skeleton call, on a given platform. To show how to achieve performance, we provide a case study on an optimized GPU-based skeleton implementation for 2D stencil computations and introduce two metrics to maximize resource utilization on a GPU. By devising a mechanism to automatically calculate these two metrics, performance can be retained while porting an application from one GPU architecture to another. Another contribution of this thesis is the implementation of runtime support for the SkePU skeleton library. This is achieved with the help of the StarPU runtime system. With this implementation, support for dynamic scheduling and load balancing for SkePU skeleton programs is achieved. Furthermore, a capability for hybrid execution, i.e., parallel execution on all available CPUs and GPUs in a system even for a single skeleton invocation, is developed. SkePU initially supported only data-parallel skeletons. The first task-parallel skeleton (farm) in SkePU is implemented with support for performance-aware scheduling and hierarchical parallel execution by enabling all data-parallel skeletons to be usable as tasks inside the farm construct. Experimental evaluations are carried out and presented for the algorithmic selection, performance portability, dynamic scheduling and hybrid execution aspects of our work.
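As a rough idea of what a data-parallel skeleton hides from the programmer, the sketch below (our own simplified illustration, not SkePU's actual interface) implements a 'map' skeleton: the user supplies only an elementwise functor, and the skeleton owns the allocation, transfers, and kernel launch of its CUDA backend.

```cuda
// Minimal "map" skeleton sketch (illustration of the idea, not the SkePU API).
#include <cstdio>
#include <cuda_runtime.h>

struct Square {
    __device__ float operator()(float x) const { return x * x; }
};

template <typename F>
__global__ void mapKernel(const float* in, float* out, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(in[i]);
}

template <typename F>
void mapSkeleton(const float* hostIn, float* hostOut, int n, F f) {
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    int block = 256, grid = (n + block - 1) / block;
    mapKernel<<<grid, block>>>(dIn, dOut, n, f);          // backend-specific launch
    cudaMemcpy(hostOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}

int main() {
    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f}, out[4];
    mapSkeleton(in, out, 4, Square{});
    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

A CPU (e.g., OpenMP) backend, or a hybrid split across CPU and GPU, could sit behind the same call, which is what makes the algorithmic selection, dynamic scheduling, and hybrid execution discussed above possible without changing application code.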
APA, Harvard, Vancouver, ISO, and other styles
9

Lee, Kenneth Sydney. "Characterization and Exploitation of GPU Memory Systems." Thesis, Virginia Tech, 2012. http://hdl.handle.net/10919/34215.

Full text
Abstract:
Graphics Processing Units (GPUs) are workhorses of modern performance due to their ability to achieve massive speedups on parallel applications. The massive number of threads that can be run concurrently on these systems allows applications with data-parallel computations to achieve better performance when compared to traditional CPU systems. However, the GPU is not perfect for all types of computation. The massively parallel SIMT architecture of the GPU can still be constraining in terms of achievable performance. GPU-based systems will typically only be able to achieve between 40% and 60% of their peak performance. One of the major problems affecting this efficiency is the GPU memory system, which is tailored to the needs of graphics workloads instead of general-purpose computation. This thesis intends to show the importance of memory optimizations for GPU systems. In particular, this work addresses problems of data transfer and global atomic memory contention. Using the novel AMD Fusion architecture, we gain overall performance improvements over discrete GPU systems for data-intensive applications. The fused architecture systems offer an interesting trade-off by increasing data transfer rates at the cost of some raw computational power. We characterize the performance of different memory paths that are possible because of the shared memory space present on the fused architecture. In addition, we provide a theoretical model which can be used to correctly predict the comparative performance of memory movement techniques for a given data-intensive application and system. In terms of global atomic memory contention, we show improvements in scalability and performance for global synchronization primitives by avoiding contentious global atomic memory accesses. In general, this work shows the importance of understanding the memory system of the GPU architecture to achieve better application performance.
Master of Science
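One of the memory problems addressed above, global atomic memory contention, can be sketched generically: each thread block first combines its contributions in fast shared memory so that only one global atomic operation per block reaches device memory. The code below is our own illustration of that general pattern, not code from the thesis; the naive kernel is included only for contrast.

```cuda
// Reducing global atomic contention: per-block aggregation in shared memory,
// then a single atomicAdd per block to the global counter.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void countPositiveNaive(const int* data, int n, unsigned int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) atomicAdd(counter, 1u);      // every thread contends on one address
}

__global__ void countPositiveBlocked(const int* data, int n, unsigned int* counter) {
    __shared__ unsigned int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) atomicAdd(&blockCount, 1u);  // cheap shared-memory atomic
    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(counter, blockCount);  // one global atomic per block
}

int main() {
    const int n = 1 << 20;
    int* h = new int[n];
    for (int i = 0; i < n; ++i) h[i] = (i % 3 == 0) ? 1 : -1;

    int* d;
    unsigned int *dCount, hCount = 0;
    cudaMalloc(&d, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));

    countPositiveBlocked<<<(n + 255) / 256, 256>>>(d, n, dCount);  // low-contention version
    cudaMemcpy(&hCount, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("positive elements: %u\n", hCount);

    cudaFree(d); cudaFree(dCount);
    delete[] h;
    return 0;
}
```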
APA, Harvard, Vancouver, ISO, and other styles
10

Rustico, Eugenio. "Fluid Dynamics Simulations on Multi-GPU Systems." Doctoral thesis, Università di Catania, 2012. http://hdl.handle.net/10761/1030.

Full text
Abstract:
The thesis describes the original design, implementation and testing of the multi-GPU version of two fluid flow simulation models, focusing on the cellular automaton MAGFLOW lava flow simulator and the GPU-SPH model for Navier-Stokes. In both cases, a spatial subdivision of the domain is performed, with a minimal overlap to ensure the correct evaluation of the bordering elements (cells in MAGFLOW, particles in GPUSPH). The latencies introduced by the continuous transfer of the overlapping borders are completely hidden through the use of asynchronous transfers performed concurrently with computations. Different load balancing techniques are used (a priori for MAGFLOW, a posteriori for GPUSPH) and compared. The obtained speedup is linear in the number of devices used and closely follows the ideal speedup. The performance results are formally analyzed and discussed. While the results are close to the ideal achievable speedups, some future improvements are hypothesized and open problems are mentioned.
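The latency-hiding scheme described above can be sketched with CUDA streams: the border elements are updated first, their transfer is started asynchronously, and the interior elements are updated while that transfer is in flight. The code below is our own single-device skeleton of this pattern with placeholder kernels and sizes, not the MAGFLOW or GPUSPH implementation.

```cuda
// Skeleton of overlapping halo transfers with interior computation using two
// CUDA streams (placeholders only; one device and one halo shown).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void updateCells(float* cells, int first, int count) {
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < first + count) cells[i] += 1.0f;               // placeholder update rule
}

int main() {
    const int nCells = 1 << 20, halo = 1024;                // last 'halo' cells border a neighbour
    float* d;
    float* hHalo;
    cudaMalloc(&d, nCells * sizeof(float));
    cudaMemset(d, 0, nCells * sizeof(float));
    cudaMallocHost(&hHalo, halo * sizeof(float));           // pinned buffer, required for async copies

    cudaStream_t transfer, compute;
    cudaStreamCreate(&transfer);
    cudaStreamCreate(&compute);

    for (int step = 0; step < 10; ++step) {
        // 1. Update the border cells first, in the transfer stream.
        updateCells<<<(halo + 255) / 256, 256, 0, transfer>>>(d, nCells - halo, halo);
        // 2. Start sending the border towards the neighbouring device asynchronously.
        cudaMemcpyAsync(hHalo, d + (nCells - halo), halo * sizeof(float),
                        cudaMemcpyDeviceToHost, transfer);
        // 3. Meanwhile, update the interior cells in the compute stream.
        updateCells<<<((nCells - halo) + 255) / 256, 256, 0, compute>>>(d, 0, nCells - halo);
        // 4. Wait for both streams before the next step.
        cudaDeviceSynchronize();
    }
    printf("first halo value after 10 steps: %.1f\n", hHalo[0]);

    cudaStreamDestroy(transfer);
    cudaStreamDestroy(compute);
    cudaFreeHost(hHalo);
    cudaFree(d);
    return 0;
}
```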
APA, Harvard, Vancouver, ISO, and other styles
11

Zhang, Junchi. "GPU computing of Heat Equations." Digital WPI, 2015. https://digitalcommons.wpi.edu/etd-theses/515.

Full text
Abstract:
There is an increasing amount of evidence in scientific research and industrial engineering indicating that the graphics processing unit (GPU) is more efficient than the CPU and better able to process certain computations. The heat equation is one of the most well-known partial differential equations, with well-developed theory and applications in engineering. Thus, in this report we chose to use the heat equation to numerically solve for the heat distributions at different time points using both GPU and CPU programs. The heat equation with three different boundary conditions (Dirichlet, Neumann, and periodic) was discretized by finite difference approximations and solved on the given domain. The programs solving the linear system arising from the heat equation with different boundary conditions were implemented on the GPU and the CPU. A convergence analysis and a stability analysis of the finite difference method were performed to guarantee the success of the programs. Iterative methods and direct methods for solving the linear system on the GPU are also discussed. The results show that the GPU has a huge advantage in terms of time spent compared with the CPU on large problems.
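For reference, a minimal CUDA sketch of one explicit finite-difference time step for the 1D heat equation with Dirichlet boundaries is given below; it is our own illustration of the kind of discretization described in the abstract, not the code developed in the report.

```cuda
// Explicit finite-difference stepping for the 1D heat equation u_t = alpha * u_xx
// with Dirichlet boundaries (both ends held fixed). Stability of this explicit
// scheme requires r = alpha * dt / dx^2 <= 0.5.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void heatStep(const float* u, float* uNew, int n, float r) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;                      // boundary values stay fixed
    uNew[i] = u[i] + r * (u[i - 1] - 2.0f * u[i] + u[i + 1]);
}

int main() {
    const int n = 256;
    const float r = 0.25f;                                 // within the stability limit
    float* h = new float[n]();
    h[0] = 1.0f;                                           // hot left boundary, cold elsewhere

    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMemcpy(dA, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h, n * sizeof(float), cudaMemcpyHostToDevice);  // keeps boundaries in both buffers

    for (int step = 0; step < 1000; ++step) {
        heatStep<<<(n + 127) / 128, 128>>>(dA, dB, n, r);
        float* tmp = dA; dA = dB; dB = tmp;                // ping-pong the buffers
    }
    cudaMemcpy(h, dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("u next to the hot boundary after 1000 steps: %f\n", h[1]);

    cudaFree(dA); cudaFree(dB);
    delete[] h;
    return 0;
}
```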
APA, Harvard, Vancouver, ISO, and other styles
12

Olsson, Martin Wexö. "GPU based particle system." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-3761.

Full text
Abstract:
GPGPU (General purpose computing on graphics processing units) is quite common in today's computer games when doing heavy simulation calculations like game physics or particle systems. GPU programming is not only used in games but also in scientific research for heavy calculations such as molecular structure and protein folding computations. The reason to use the GPU for these kinds of tasks is that you can gain an incredible performance speedup in your application. Previous research shows that particle systems scale very well to the GPU architecture. When simulating a very large particle system, the GPU can run up to 79 times faster than the CPU. But for some very small particle systems the CPU proved to be faster. This research aims to compare the GPU and CPU when it comes to simulating many smaller particle systems, and to see what happens to the performance as the particle systems become smaller and smaller.
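To give a flavour of why particle systems map so naturally onto the GPU, here is a small, self-contained CUDA sketch of one simulation step: one thread integrates one particle under gravity. It is a generic illustration, not the implementation compared in the thesis.

```cuda
// Generic GPU particle update: one thread per particle, so a large system maps
// directly onto thousands of concurrent threads.
#include <cstdio>
#include <cuda_runtime.h>

struct Particle { float x, y, vx, vy; };

__global__ void updateParticles(Particle* p, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    p[i].vy -= 9.81f * dt;          // gravity
    p[i].x  += p[i].vx * dt;        // Euler integration of the position
    p[i].y  += p[i].vy * dt;
    if (p[i].y < 0.0f) {            // bounce on the ground plane with damping
        p[i].y = 0.0f;
        p[i].vy = -0.5f * p[i].vy;
    }
}

int main() {
    const int n = 1 << 16;
    Particle* h = new Particle[n];
    for (int i = 0; i < n; ++i) h[i] = {0.0f, 10.0f, float(i % 7) * 0.1f, 0.0f};

    Particle* d;
    cudaMalloc(&d, n * sizeof(Particle));
    cudaMemcpy(d, h, n * sizeof(Particle), cudaMemcpyHostToDevice);

    for (int step = 0; step < 600; ++step)                 // ten seconds at 60 steps per second
        updateParticles<<<(n + 255) / 256, 256>>>(d, n, 1.0f / 60.0f);

    cudaMemcpy(h, d, n * sizeof(Particle), cudaMemcpyDeviceToHost);
    printf("particle 0 position: (%.2f, %.2f)\n", h[0].x, h[0].y);

    cudaFree(d);
    delete[] h;
    return 0;
}
```

For many small systems, the fixed cost of launches and transfers in this pattern is what can let the CPU win, which is the trade-off the thesis sets out to measure.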
APA, Harvard, Vancouver, ISO, and other styles
13

Campeanu, Gabriel. "GPU-aware Component-based Development for Embedded Systems." Licentiate thesis, Mälardalens högskola, Inbyggda system, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-33368.

Full text
Abstract:
Nowadays, more and more embedded systems are equipped with, e.g., various sensors that produce large amounts of data. One of the challenges of traditional (CPU-based) embedded systems is to process this considerable amount of data at the performance level demanded by embedded applications. A solution comes from the usage of a specialized processing unit such as the Graphics Processing Unit (GPU). A GPU can process large amounts of data thanks to its parallel processing architecture, delivering improved performance compared to the CPU. A characteristic of the GPU is that it cannot work alone; the CPU must trigger all its activities. Today, taking advantage of the latest technology breakthroughs, we can benefit from GPU technology in the context of embedded systems by using heterogeneous CPU-GPU embedded systems. Component-based development has been demonstrated to be a promising methodology for handling software complexity. Through component models, which describe the component specification and their interaction, the methodology has been successfully used in the embedded systems domain. The existing component models, designed to handle CPU-based embedded systems, face challenges in developing embedded systems with GPU capabilities. For example, current solutions realize the communication between components with GPU capabilities via the RAM system. This introduces an undesired overhead that negatively affects system performance. This licentiate thesis presents methods and techniques that address the component-based development of embedded systems with GPU capabilities. More concretely, we provide means for component models to explicitly address GPU-aware component-based development by using specific artifacts. For example, the overhead introduced by the traditional way of communicating via RAM is reduced by inserting automatically generated adapters that facilitate direct component communication over the GPU memory. Another contribution of the thesis is a component allocation method over the system hardware. The proposed solution offers alternative options for optimizing the total system performance and balancing various system properties (e.g., memory usage, GPU load). For the validation of our proposed solutions, we use an underwater robot demonstrator equipped with GPU hardware.
APA, Harvard, Vancouver, ISO, and other styles
14

Cabezas, Rodríguez Javier. "On the programmability of multi-GPU computing systems." Doctoral thesis, Universitat Politècnica de Catalunya, 2015. http://hdl.handle.net/10803/308500.

Full text
Abstract:
Multi-GPU systems are widely used in High Performance Computing environments to accelerate scientific computations. This trend is expected to continue as integrated GPUs will be introduced to processors used in multi-socket servers and servers will pack a higher number of GPUs per node. GPUs are currently connected to the system through the PCI Express interconnect, which provides limited bandwidth (compared to the bandwidth of the memory in GPUs) and often becomes a bottleneck for performance scalability. Current programming models present GPUs as isolated devices with their own memory, even if they share the host memory with the CPU. Programmers explicitly manage allocations in all GPU memories and use primitives to communicate data between GPUs. Furthermore, programmers are required to use mechanisms such as command queues and inter-GPU synchronization. This explicit model harms the maintainability of the code and introduces new sources for potential errors. The first proposal of this thesis is the HPE model. HPE builds a simple, consistent programming interface based on three major features. (1) All device address spaces are combined with the host address space to form a Unified Virtual Address Space. (2) Programs are provided with an Asymmetric Distributed Shared Memory system for all the GPUs in the system. It allows memory objects to be allocated that can be accessed by any GPU or CPU. (3) Every CPU thread can request a data exchange between any two GPUs, through simple memory copy calls. Such a simple interface allows HPE to always provide the optimal implementation, eliminating the need for application code to handle different system topologies. Experimental results show improvements on real applications that range from 5% in compute-bound benchmarks to 2.6x in communication-bound benchmarks. HPE transparently implements sophisticated communication schemes that can deliver up to a 2.9x speedup in I/O device transfers. The second proposal of this thesis is a shared memory programming model that exploits the new GPU capabilities for remote memory accesses to remove the need for explicit communication between GPUs. This model turns a multi-GPU system into a shared memory system with NUMA characteristics. In order to validate the viability of the model we also perform an exhaustive performance analysis of remote memory accesses over PCIe. We show that the unique characteristics of the GPU execution model and memory hierarchy help to hide the costs of remote memory accesses. Results show that PCI Express 3.0 is able to hide the costs of up to 10% of remote memory accesses depending on the access pattern, while caching of remote memory accesses can have a large impact on kernel performance. Finally, we introduce AMGE, a programming interface, compiler support and runtime system that automatically executes computations that are programmed for a single GPU across all the GPUs in the system. The programming interface provides a data type for multidimensional arrays that allows for robust, transparent distribution of arrays across all GPU memories. The compiler extracts the dimensionality information from the type of each array, and is able to determine the access pattern in each dimension of the array. The runtime system uses the compiler-provided information to automatically choose the best computation and data distribution configuration to minimize inter-GPU communication and memory footprint.
This model effectively frees programmers from the task of decomposing and distributing computation and data to exploit several GPUs. AMGE achieves almost linear speedups for a wide range of dense computation benchmarks on a real 4-GPU system with a moderate-bandwidth interconnect. We show that irregular computations can also benefit from AMGE.
Los sistemas multi-GPU son muy comúnmente utilizados en entornos de computación de altas prestaciones para acelerar cálculos científicos. Esta tendencia continuará con la introducción de GPUs integradas en los procesadores de los servidores procesador y con una mayor densidad de GPUs por nodo. Las GPUs actualmente se contectan al sistema a través de una interconexión PCI Express, que provee un ancho de banda reducido (comparado con las memorias de las GPUs) y habitualmente se convierte en el cuello de botella para escalar el rendimiento. Los modelos de programación actuales exponen las GPUs como dispositivos aislados con su propia memoria, incluso si comparten la memoria física con la CPU. Los programadores manejan diferentes reservas en todas las memorias de GPU y usan primitivas para comunicar datos entre GPUs. Además, los programadores deben utilizar mecanismos como colas de comandos y sincronicación entre GPUs. Este modelo explícito empeora la programabilidad del código e introduce nuevas fuentes de errores potenciales. La primera propuesta de esta tesis es el modelo HPE. HPE construye una interfaz de programaci ón consistente basada en tres características principales. (1) Todos los espacios de direcciones de los dispositivos son combinados para formar un espacio de direcciones unificado. (2) Los programas usan un sistema asimétrico distribuido de memoria compartida para todas las GPUs del sistema, que permite declarar objetos de memoria que pueden ser accedidos por cualquier GPU o CPU. (3) Cada hilo de ejecución de la CPU puede lanzar un intercambio de datos entre dos GPUs a través de simples llamadas de copia de memoria. Esta interfaz simplificada permite a HPE usar la implementaci ón óptima; sinque la aplicación contemple diferentes topologías de sistema. Los resultados experimentales muestran mejoras en aplicaciones reales que van desde un 5% en aplicaciones limitadas por el cómputo a 2.6x aplicaciones imitadas por la comunicación. HPE implementa sofisticados esquemas de transferencia para dispositivos de E/S que proporcionan mejoras de rendimiento de 2.9x. La segunda propuesta de esta tesis es un modelo de programación basado en memoria compartida que aprovecha las nuevas capacidades acceso remoto de memoria de las GPUs para eliminar la comunicación explícita entre memorias de GPU. Este modelo convierte un sistema multi-GPU en un sistema de memoria compartida con características NUMA. Para validar la viabilidad del modelo realizamos un anlásis exhaustivo del rendimiento los accessos de memoria remotos sobre PCIe. Los resultados muestran que PCI Express 3.0 elimina los costes de hasta un 10% de accesos remotos, dependiendo en el patrón de acceso, mientras que guardar los accesos remotos en memorias cache tiene un gran inpacto en el rendimiento de las computaciones. Finalmente, presentamos AMGE, una interfaz de programación con soporte de compilación y un sistema que ejecuta, de forma automática, computaciones programadas para una única GPU en todas las GPUs del sistema. La interfaz de programación proporciona un tipo de datos para arreglos multidimensionales que permite una distribuci ón transparente y robusta de los datos en todas las memorias de GPU. El compilador extrae la información sobre la dimensionalidad de cada arreglo y puede determinar el patrón de acceso en cada dimensión de forma individual. 
El sistema utiliza, en tiempo de ejecución, la información del compilador para elegir la mejor descomposición de la computación y los datos para minimizar la comunicación entre GPUs y el uso de memoria. AMGE consigue mejoras de rendimiento que crecen de forma lineal con el número de GPUs para un amplio abanico de computaciones densas en un sistema real con 4 GPUs. También mostramos que las computaciones con patrones irregulares también se pueden beneficiar de AMGE.
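The kind of primitive that such a unified model builds on can be illustrated with plain CUDA runtime calls: with unified virtual addressing, a single copy call can move a buffer directly between two GPU memories, peer-to-peer over PCI Express where supported. The snippet below is our own minimal illustration of that mechanism, not the HPE or AMGE runtime.

```cuda
// Minimal illustration of a direct GPU-to-GPU copy with unified virtual
// addressing and peer access (needs a system with at least two GPUs).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices < 2) { printf("need at least two GPUs\n"); return 0; }

    const size_t bytes = 1 << 20;
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaMemset(buf0, 0x3f, bytes);                         // some payload on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);                      // allow direct GPU 0 <-> GPU 1 traffic

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);

    // With unified virtual addressing the runtime infers the source and
    // destination devices from the pointers, so one call expresses the exchange.
    cudaMemcpy(buf1, buf0, bytes, cudaMemcpyDefault);
    cudaDeviceSynchronize();
    printf("copied %zu bytes from GPU 0 to GPU 1\n", bytes);

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}
```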
APA, Harvard, Vancouver, ISO, and other styles
15

Valderhaug, Thor Kristian. "The Lattice Boltzmann Simulation on Multi-GPU Systems." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-13920.

Full text
Abstract:
The Lattice Boltzmann Method (LBM) is widely used to simulate different types of flow, such as water, oil and gas in porous reservoirs. In the oil industry it is commonly used to estimate petrophysical properties of porous rocks, such as the permeability. To achieve the required accuracy it is necessary to use big simulation models requiring large amounts of memory. The method is highly data-intensive, making it suitable for offloading to the GPU. However, the limited amount of memory available on modern GPUs severely limits the size of the dataset possible to simulate. In this thesis, we increase the size of the datasets possible to simulate using techniques to lower the memory requirement while retaining numerical precision. These techniques improve the size possible to simulate on a single GPU by about 20 times for datasets with 15% porosity. We then develop multi-GPU simulations for different hardware configurations using OpenCL and MPI to investigate how LBM scales when simulating large datasets. The performance of the implementations is measured using three porous rock datasets provided by Numerical Rocks AS. By connecting two Tesla S2070s to a single host we are able to achieve a speedup of 1.95, compared to using a single GPU. For large datasets we are able to completely hide the host-to-host communication in a cluster configuration, showing that LBM scales well and is suitable for simulation on a cluster with GPUs. The correctness of the implementations is confirmed against an analytically known flow, and three datasets with known permeability also provided by Numerical Rocks AS.
APA, Harvard, Vancouver, ISO, and other styles
16

Wong, Henry Ting-Hei. "Architectures and limits of GPU-CPU heterogeneous systems." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/2529.

Full text
Abstract:
As we continue to be able to put an increasing number of transistors on a single chip, the question of what is the best processor we could build with those transistors remains open. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrates speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism that can be usefully extracted from a set of general-purpose applications in the context of a heterogeneous system. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and to the latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor, while optimal power efficiency requires a higher-performance parallel processor than today's GPUs.
APA, Harvard, Vancouver, ISO, and other styles
17

Dastgeer, Usman. "Performance-aware Component Composition for GPU-based systems." Doctoral thesis, Linköpings universitet, Programvara och system, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-104314.

Full text
Abstract:
This thesis addresses issues associated with efficiently programming modern heterogeneous GPU-based systems, containing multicore CPUs and one or more programmable Graphics Processing Units (GPUs). We use ideas from component-based programming to address programming, performance and portability issues of these heterogeneous systems. Specifically, we present three approaches that all use the idea of having multiple implementations for each computation; performance is achieved/retained either a) by selecting a suitable implementation for each computation on a given platform or b) by dividing the computation work across different implementations running on CPU and GPU devices in parallel. In the first approach, we work on a skeleton programming library (SkePU) that provides high-level abstraction while making intelligent implementation selection decisions underneath, either before or during actual program execution. In the second approach, we develop a composition tool that parses extra information (metadata) from XML files, makes certain decisions online, and, in the end, generates code for making the final decisions at runtime. The third approach is a framework that uses source-code annotations and program analysis to generate code for the runtime library to make the selection decision at runtime. With a generic performance modeling API alongside program analysis capabilities, it supports online tuning as well as complex program transformations. These approaches differ in terms of genericity, intrusiveness, capabilities and knowledge about the program source code; however, they all demonstrate the usefulness of component programming techniques for programming GPU-based systems. With experimental evaluation, we demonstrate how all three approaches, although different in their own way, provide good performance on different GPU-based systems for a variety of applications.
APA, Harvard, Vancouver, ISO, and other styles
18

Chen, Wei. "Dynamic Workload Division in GPU-CPU Heterogeneous Systems." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1364250106.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Larsson, Andreas. "Real-Time Persistent Mesh Painting with GPU Particle Systems." Thesis, Linköpings universitet, Informationskodning, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-138145.

Full text
Abstract:
Particle systems are used to create visual effects in real-time applications such as computer games. However, emitted particles are often transient and do not leave a lasting impact on a 3D scene. This thesis work presents a real-time method that enables GPU particle systems to paint meshes in a 3D scene as the result of particle collisions, thus adding detail to and leaving a lasting impact on a scene. The method uses screen-space collision detection and a mapping from screen space to the texture space of meshes to determine where to apply paint. The method was tested for its time complexity and how well it performed in scenarios similar to those found in computer games. The results show that the method can probably be used in computer games. The performance and visual fidelity of the paint application do not depend directly on the number of simulated particles, but only on the complexity of the meshes and their texture mapping, as well as the resolution of the paint. It is concluded that the method is renderer-agnostic and could be added to existing GPU particle systems, and that other types of effects than those shown in the thesis could be achieved by using the method.
APA, Harvard, Vancouver, ISO, and other styles
20

Venkatasubramanian, Sundaresan. "Tuned and asynchronous stencil kernels for CPU/GPU systems." Thesis, Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/29728.

Full text
Abstract:
Thesis (M. S.)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Vuduc, Richard; Committee Member: Kim, Hyesoon; Committee Member: Vetter, Jeffrey. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
21

Peniak, Martin. "GPU computing for cognitive robotics." Thesis, University of Plymouth, 2014. http://hdl.handle.net/10026.1/3052.

Full text
Abstract:
This thesis presents the first investigation of the impact of GPU computing on cognitive robotics by providing a series of novel experiments in the area of action and language acquisition in humanoid robots and computer vision. Cognitive robotics is concerned with endowing robots with high-level cognitive capabilities to enable the achievement of complex goals in complex environments. Reaching the ultimate goal of developing cognitive robots will require tremendous amounts of computational power, which was until recently provided mostly by standard CPU processors. CPU cores are optimised for serial code execution at the expense of parallel execution, which renders them relatively inefficient when it comes to high-performance computing applications. The ever-increasing market demand for high-performance, real-time 3D graphics has evolved the GPU into a highly parallel, multithreaded, many-core processor with extraordinary computational power and very high memory bandwidth. These vast computational resources of modern GPUs can now be used by most cognitive robotics models, as they tend to be inherently parallel. Various interesting and insightful cognitive models were developed and addressed important scientific questions concerning action-language acquisition and computer vision. While they have provided us with important scientific insights, their complexity and application have not improved much in recent years. The experimental tasks as well as the scale of these models are often minimised to avoid excessive training times that grow exponentially with the number of neurons and the training data. This impedes further progress and the development of complex neurocontrollers that would be able to take cognitive robotics research a step closer to reaching the ultimate goal of creating intelligent machines. This thesis presents several cases where the application of GPU computing to cognitive robotics algorithms resulted in the development of large-scale neurocontrollers of previously unseen complexity, enabling the novel experiments described herein.
APA, Harvard, Vancouver, ISO, and other styles
22

de Laval, Johnny. "Trådlösa Nätverk : säkerhet och GPU." Thesis, Högskolan på Gotland, Institutionen för speldesign, teknik och lärande, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:hgo:diva-1063.

Full text
Abstract:
Wireless networks are inherently vulnerable to eavesdropping since they communicate by radio waves, and they are therefore protected by encryption. WEP was the first encryption standard to see widespread use, but it proved to have several serious vulnerabilities; the encryption could be circumvented within a few minutes. WPA was developed as a response to the weaknesses of WEP, and shortly thereafter WPA2 was released, which is the standard in use today. The main weakness of WPA2 lies in WPA2-PSK when weak passwords are used: software can easily work through large dictionaries to test whether a password can be recovered. That process is time-consuming and therefore offers wireless networks some measure of protection. However, graphics processors have begun to be used for password recovery. Graphics cards are more efficient and recover weak passwords considerably faster than ordinary CPUs, which opens the door to testing passwords against even larger dictionaries and more combinations of words. That is what this study aims to shed light on: how the efficiency of graphics cards has affected the security of wireless networks from a corporate perspective.
APA, Harvard, Vancouver, ISO, and other styles
23

Liljeqvist, Erik. "Evaluating a CPU/GPU Implementation for Real-Time Ray Tracing." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-35768.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Matz, Alexander [Verfasser], and Holger [Akademischer Betreuer] Fröning. "Exploiting BSP Abstractions for Compiler Based Optimizations of GPU Applications on multi-GPU Systems / Alexander Matz ; Betreuer: Holger Fröning." Heidelberg : Universitätsbibliothek Heidelberg, 2020. http://d-nb.info/1223546578/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Mei, Xinxin. "Energy conservation techniques for GPU computing." HKBU Institutional Repository, 2016. https://repository.hkbu.edu.hk/etd_oa/298.

Full text
Abstract:
The emerging general-purpose graphics processing unit (GPGPU) computing has tremendously sped up a great variety of commercial and scientific applications. GPUs have become prevalent accelerators in current high-performance clusters. Though the computational capacity per watt of GPUs is much higher than that of CPUs, hybrid GPU clusters still consume enormous amounts of power, so conserving energy on such clusters is of critical significance. In this thesis, we pursue energy-conservative computing on GPU-accelerated servers. We introduce our studies as follows. First, we dissect the GPU memory hierarchy, since most GPU applications suffer from the GPU memory bottleneck. We find that conventional CPU cache models cannot be applied to modern GPU caches, and that the microbenchmarks used to study conventional CPU caches are invalid for the GPU. We propose GPU-specific microbenchmarks to examine the GPU memory structures and properties. Our benchmark results verify that the design goal of the GPU has transformed from pure computation performance to better energy efficiency. Second, we investigate the impact of dynamic voltage and frequency scaling (DVFS), a successful energy management technique for CPUs, on GPU platforms. Our experimental results suggest that GPU DVFS is still promising for conserving energy, but the patterns that save energy differ strongly from those of the CPU. Moreover, the effect of GPU DVFS depends on individual application characteristics. Third, we derive GPU DVFS power and performance models from our experimental results, based on which we find the optimal GPU voltage and frequency setting to minimize the energy consumption of a single GPU task. We then study the problem of scheduling multiple tasks on a hybrid CPU-GPU cluster to minimize the total energy consumption by GPU DVFS. We design an effective offline scheduling algorithm which can reduce the energy consumption significantly. Finally, we combine GPU DVFS and dynamic resource sleep (DRS), another energy management technique, to further conserve energy for online task scheduling on hybrid clusters. Though the idle energy consumption increases significantly compared to the offline problem, our online scheduling algorithm still achieves more than 30% energy conservation with appropriate runtime GPU DVFS readjustments.
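As a rough illustration of the trade-off such DVFS models capture, the C++ sketch below evaluates a toy energy model over a few candidate voltage/frequency settings and picks the one with the lowest predicted energy. The model form and all constants are illustrative assumptions, not the models derived in the thesis.

    #include <cstdio>
    #include <vector>

    // Illustrative DVFS energy model (not the thesis' actual model):
    //  - runtime splits into a frequency-sensitive part and a memory-bound part
    //  - power is a static term plus a dynamic term ~ C * V^2 * f
    struct Setting { double voltage; double freqGHz; };

    double runtimeSec(double tComputeAtRef, double tMemory, double fRefGHz, double f) {
        return tComputeAtRef * (fRefGHz / f) + tMemory;   // compute part scales with 1/f
    }

    double powerWatt(double pStatic, double capC, double V, double fGHz) {
        return pStatic + capC * V * V * fGHz;             // simple dynamic-power term
    }

    int main() {
        // Placeholder kernel characteristics and candidate settings.
        double tComputeAtRef = 2.0, tMemory = 1.0, fRef = 1.0;
        std::vector<Setting> candidates = {{1.05, 1.2}, {0.95, 1.0}, {0.85, 0.8}};

        double bestEnergy = 1e30; Setting best{};
        for (const Setting& s : candidates) {
            double t = runtimeSec(tComputeAtRef, tMemory, fRef, s.freqGHz);
            double e = powerWatt(10.0, 30.0, s.voltage, s.freqGHz) * t;   // E = P * T
            if (e < bestEnergy) { bestEnergy = e; best = s; }
        }
        std::printf("best: %.2f V @ %.2f GHz, energy %.1f J\n",
                    best.voltage, best.freqGHz, bestEnergy);
        return 0;
    }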
APA, Harvard, Vancouver, ISO, and other styles
26

Young, Emily Clark. "GPU-Accelerated Demodulation for a Satellite Ground Station." DigitalCommons@USU, 2019. https://digitalcommons.usu.edu/etd/7635.

Full text
Abstract:
One consequence of the increasing number of small satellite missions is an increasing demand for high data rate downlinks. As the satellites transmit at high data rates, ground-side receivers need to demodulate the transmitted data as quickly as possible. While application specific hardware can be designed, software defined radio solutions for ground stations are attractive for their flexibility, adaptability, and portability. Another industry trend is the increasing use of Graphics Processing Units (GPUs) in general-purpose processing. By performing many operations simultaneously, GPUs are capable of accelerating processing when given a problem that can be implemented in a parallel manner. Furthermore, once a parallel algorithm is implemented, further speedups are possible by increasing hardware resources without need for any revision in the algorithm. This project combines the above ideas by implementing a software defined radio algorithm to quickly demodulate high-speed data on a GPU. It demonstrates the viability of the GPU in software defined radio applications and particularly in the area of fast demodulation.
APA, Harvard, Vancouver, ISO, and other styles
27

Gruslys, Audrūnas. "Development and applications of GPU based medical image registration." Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.708078.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Ching, Bryan. "OPTIMIZING LEMPEL-ZIV FACTORIZATION FOR THE GPU ARCHITECTURE." DigitalCommons@CalPoly, 2014. https://digitalcommons.calpoly.edu/theses/1238.

Full text
Abstract:
Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General-purpose computing on graphics processing units (GPGPU) allows us to take advantage of the massively parallel nature of GPUs for computations other than their original purpose of rendering graphics. Our work targets the use of GPUs for general lossless data compression. Specifically, we developed and ported an algorithm that constructs the Lempel-Ziv factorization directly on the GPU. Our implementation bypasses the sequential nature of the LZ factorization and attempts to compute the factorization in parallel. By breaking down the LZ factorization into what we call the PLZ, we are able to outperform the fastest serial CPU implementations by up to 24x and perform comparably to a parallel multicore CPU implementation. To achieve these speeds, our implementation output LZ factorizations that were on average only 0.01 percent larger than the optimal solution that could be computed sequentially. We also reevaluate the fastest GPU suffix array construction algorithm, which is needed to compute the LZ factorization, and find speedups of up to 5x over the fastest CPU implementations.
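For reference, a naive sequential LZ factorization looks as follows; it only clarifies what a parallel (PLZ-style) factorization has to reproduce and is not the algorithm developed in the thesis.

    #include <cstdio>
    #include <string>
    #include <vector>

    // One LZ factor: either a literal character (len == 0) or a reference
    // (pos, len) to an earlier occurrence in the text.
    struct Factor { size_t pos; size_t len; char literal; };

    // Naive O(n^2) sequential LZ factorization, shown only as a reference point.
    std::vector<Factor> lzFactorize(const std::string& s) {
        std::vector<Factor> factors;
        size_t i = 0;
        while (i < s.size()) {
            size_t bestLen = 0, bestPos = 0;
            for (size_t j = 0; j < i; ++j) {               // candidate earlier start
                size_t len = 0;
                while (i + len < s.size() && s[j + len] == s[i + len]) ++len;
                if (len > bestLen) { bestLen = len; bestPos = j; }
            }
            if (bestLen == 0) { factors.push_back({0, 0, s[i]}); ++i; }          // literal
            else { factors.push_back({bestPos, bestLen, 0}); i += bestLen; }     // reference
        }
        return factors;
    }

    int main() {
        for (const Factor& f : lzFactorize("abababcabab")) {
            if (f.len == 0) std::printf("lit '%c'\n", f.literal);
            else            std::printf("ref (%zu,%zu)\n", f.pos, f.len);
        }
        return 0;
    }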
APA, Harvard, Vancouver, ISO, and other styles
29

Enmyren, Johan. "A Skeleton Programming Library for Multicore CPU and Multi-GPU Systems." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-60319.

Full text
Abstract:
This report presents SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP back end. It also supports multi-GPU systems. Benchmarks show that copying data between the host and the GPU is often a bottleneck. Therefore a container which uses lazy memory copying has been implemented to avoid unnecessary memory transfers. SkePU was evaluated with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that skeletal parallel programming is indeed a viable approach for GPU computing and that a generalized interface for multiple back ends is also reasonable. The best performance gains are obtained when the computation load is large compared to memory I/O (the lazy memory copying can help to achieve this). SkePU offers good performance with a more complex and realistic task such as ODE solving, with up to ten times faster run times when using SkePU with a GPU back end compared to a sequential solver running on a fast CPU. SkePU does, however, have some disadvantages: there is some overhead in using the library, which can be seen in the dot product and LibSolve benchmarks. Although not large, it is still there, and if performance is of utmost importance, a hand-coded solution would be best. Nor can all calculations be expressed in terms of skeletons; for such problems, specialized routines must still be created.
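The lazy memory copying idea can be sketched in plain C++ as a container that tracks whether the host and device copies are valid and only transfers data when a stale copy is actually needed. This is not SkePU's actual container or API; the transfer functions below are placeholders for real CUDA/OpenCL copies.

    #include <cstddef>
    #include <vector>

    // Sketch of a "lazy copying" container: data moves between host and device
    // only when a stale copy is accessed. Transfer calls are placeholders.
    template <typename T>
    class LazyVector {
    public:
        explicit LazyVector(std::size_t n) : host_(n), hostValid_(true), deviceValid_(false) {}

        // Host write: make sure the host copy is current, then invalidate the device copy.
        T& at(std::size_t i) {
            syncToHost();
            deviceValid_ = false;
            return host_[i];
        }

        // Called by a skeleton before launching a kernel on the device.
        void prepareForDevice() {
            if (!deviceValid_) { uploadToDevice(host_); deviceValid_ = true; }
            hostValid_ = false;            // the kernel may write on the device
        }

        // Called before the host reads results back.
        void syncToHost() {
            if (!hostValid_) { downloadToHost(host_); hostValid_ = true; }
        }

    private:
        void uploadToDevice(const std::vector<T>&) { /* e.g. a host-to-device copy */ }
        void downloadToHost(std::vector<T>&)       { /* e.g. a device-to-host copy */ }

        std::vector<T> host_;
        bool hostValid_;
        bool deviceValid_;
    };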
APA, Harvard, Vancouver, ISO, and other styles
30

Trichy, Ravi Vignesh. "Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1338324367.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Ausavarungnirun, Rachata. "Techniques for Shared Resource Management in Systems with Throughput Processors." Research Showcase @ CMU, 2017. http://repository.cmu.edu/dissertations/905.

Full text
Abstract:
The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data parallel (GPGPU) applications. However, this success has been accompanied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. This dissertation identifies and eliminates performance bottlenecks caused by major sources of interference throughout the memory hierarchy. Specifically, we provide an in-depth analysis of inter- and intra-application as well as inter-address-space interference that significantly degrades the performance and efficiency of GPU-based systems. To minimize such interference, we introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications' characteristics. We introduce mechanisms to dynamically analyze different applications' characteristics and propose four major changes throughout the memory hierarchy. First, we introduce Memory Divergence Correction (MeDiC), a cache management mechanism that mitigates intra-application interference in GPGPU applications by allowing the shared L2 cache and the memory controller to be aware of the GPU's warp-level memory divergence characteristics. MeDiC uses this warp-level memory divergence information to give more cache space and more memory bandwidth to warps that benefit most from utilizing such resources. Our evaluations show that MeDiC significantly outperforms multiple state-of-the-art caching policies proposed for GPUs. Second, we introduce the Staged Memory Scheduler (SMS), an application-aware CPU-GPU memory request scheduler that mitigates inter-application interference in heterogeneous CPU-GPU systems. SMS creates a fundamentally new approach to memory controller design that decouples the memory controller into three significantly simpler structures, each of which has a separate task. These structures operate together to greatly improve both system performance and fairness. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus on inter-application scheduling decisions. These two stages enforce high-level policies regarding performance and fairness. As a result, the last stage is simple logic that deals only with the low-level DRAM commands and timing. SMS is also configurable: it allows the system software to trade off between the quality of service provided to the CPU versus GPU applications. Our evaluations show that SMS not only reduces inter-application interference caused by the GPU, thereby improving heterogeneous system performance, but also provides better scalability and power efficiency compared to multiple state-of-the-art memory schedulers. Third, we redesign the GPU memory management unit to efficiently handle new problems caused by the massive address translation parallelism present in GPU computation units in multi-GPU-application environments. Running multiple GPGPU applications concurrently induces significant inter-core thrashing on the shared address translation/protection units, e.g., the shared Translation Lookaside Buffer (TLB), a new phenomenon that we call inter-address-space interference. To reduce this interference, we introduce Multi Address Space Concurrent Kernels (MASK). MASK introduces TLB-awareness throughout the GPU memory hierarchy and introduces TLB- and cache-bypassing techniques to increase the effectiveness of a shared TLB. Finally, we introduce Mosaic, a hardware-software cooperative technique that further increases the effectiveness of the TLB by modifying the memory allocation policy in the system software. Mosaic introduces a high-throughput method to support large pages in multi-GPU-application environments. The key idea is to ensure that memory allocation preserves address space contiguity to allow pages to be coalesced without any data movement. Our evaluations show that the MASK-Mosaic combination provides a simple mechanism that eliminates the performance overhead of address translation in GPUs without significant changes to GPU hardware, thereby greatly improving GPU system performance. The key conclusion of this dissertation is that a combination of GPU-aware cache and memory management techniques can effectively mitigate the memory interference on current and future GPU-based systems as well as other types of throughput processors.
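To make the row-buffer-locality grouping of the first scheduling stage concrete, the C++ sketch below batches an incoming request stream so that consecutive requests from the same source to the same DRAM row end up in one batch. The request format, row size, and data structures are illustrative assumptions, not the hardware design from the dissertation.

    #include <cstdint>
    #include <deque>
    #include <vector>

    // Requests from one source are grouped into a batch as long as they target
    // the same DRAM row, so later stages can schedule whole batches at a time.
    struct Request { uint64_t addr; int sourceId; };

    constexpr uint64_t kRowShift = 13;                 // assume 8 KiB rows (illustrative)
    inline uint64_t rowOf(uint64_t addr) { return addr >> kRowShift; }

    std::vector<std::vector<Request>> formBatches(const std::deque<Request>& queue) {
        std::vector<std::vector<Request>> batches;
        for (const Request& r : queue) {
            if (!batches.empty() &&
                batches.back().front().sourceId == r.sourceId &&
                rowOf(batches.back().front().addr) == rowOf(r.addr)) {
                batches.back().push_back(r);           // same source, same row: extend batch
            } else {
                batches.push_back({r});                // start a new batch
            }
        }
        return batches;
    }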
APA, Harvard, Vancouver, ISO, and other styles
32

Wu, Jiadong. "Improving the throughput of novel cluster computing systems." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/53890.

Full text
Abstract:
Traditional cluster computing systems such as supercomputers are equipped with specially designed high-performance hardware, which escalates the manufacturing cost and the energy cost of those systems. Due to such drawbacks and the diversified demand in computation, two new types of clusters have been developed: GPU clusters and Hadoop clusters. The GPU cluster combines a traditional CPU-only computing cluster with general-purpose GPUs to accelerate applications. Thanks to the massively parallel architecture of the GPU, this type of system can deliver much higher performance-per-watt than traditional computing clusters. The Hadoop cluster is another popular type of cluster computing system. It uses inexpensive off-the-shelf components and standard Ethernet to minimize manufacturing cost. Hadoop systems are widely used throughout the industry. Alongside the lowered cost, these new systems also bring their own unique challenges. According to our study, GPU clusters are prone to severe under-utilization due to the heterogeneous nature of their computation resources, and Hadoop clusters are vulnerable to network congestion due to their limited network resources. In this research, we try to improve the throughput of these novel cluster computing systems by increasing workload parallelism and network I/O parallelism.
APA, Harvard, Vancouver, ISO, and other styles
33

Tasoulas, Zois Gerasimos. "Resource management and application customization for hardware accelerated systems." OpenSIUC, 2021. https://opensiuc.lib.siu.edu/dissertations/1907.

Full text
Abstract:
Computational demands are continuously increasing, driven by the growing resource demands of applications. In the era of big data, large-scale applications, and real-time applications, there is an enormous need for quick processing of large amounts of data. To meet these demands, computer systems have shifted towards multi-core solutions. Technology scaling has allowed the incorporation of ever larger numbers of transistors and cores into chips. Nevertheless, area constraints, power consumption limitations, and thermal dissipation limit the ability to design and sustain ever-larger chips. To overcome these limitations, system designers have turned towards the use of hardware accelerators. These accelerators can take the form of modules attached to each core of a multi-core system, forming a network-on-chip of cores with attached accelerators. Another form of hardware accelerator is the Graphics Processing Unit (GPU). GPUs can be connected to a general-purpose system through a host-device model and are used to off-load parts of a workload. Additionally, accelerators can be functionality-dedicated units: they can be part of a chip, and the main processor can offload specific workloads to the hardware accelerator unit. In this dissertation we present: (a) a microcoded synchronization mechanism for systems with hardware accelerators that provide distributed shared memory, (b) a Streaming Multiprocessor (SM) allocation policy for single-application execution on GPUs, (c) an SM allocation policy for concurrent applications that execute on GPUs, and (d) a framework to map neural network (NN) weights to approximate multiplier accuracy levels. The aforementioned mechanisms coexist in the resource management domain. Specifically, the methodologies introduce ways to boost system performance by using hardware accelerators. In tandem with improved performance, the methodologies explore and balance the trade-offs that the use of hardware accelerators introduces.
APA, Harvard, Vancouver, ISO, and other styles
34

Roque, Pedro Miguel da Silva. "Contraint solving on massively parallel systems." Doctoral thesis, Universidade de Évora, 2020. http://hdl.handle.net/10174/27976.

Full text
Abstract:
Applying parallelism to constraint solving seems a promising approach and it has been done with varying degrees of success. Early attempts to parallelize constraint propagation, which constitutes the core of traditional interleaved propagation and search constraint solving, were hindered by its essentially sequential nature. Recently, parallelization efforts have focused mainly on the search part of constraint solving. A particular source of parallelism has become pervasive, in the guise of GPUs, able to run thousands of parallel threads, and they have naturally drawn the attention of researchers in parallel constraint solving. This thesis addresses the challenges faced when using multiple devices for constraint solving, especially GPUs, such as deciding on the appropriate level of parallelism to employ, load balancing and inter-device communication. To overcome these challenges, new techniques were implemented in a new constraint solver, named Parallel Heterogeneous Architecture Constraint Toolkit (PHACT), which allows the use of one or more CPUs, GPUs, Intel Many Integrated Cores (MICs) and any other device compatible with OpenCL to solve a constraint problem. Several tests were made to measure the capabilities of some GPUs to solve constraint problems, and the conclusions of these tests are described in this thesis. PHACT's architecture is presented and its performance was measured on each of five machines, comprising eleven CPUs, six GPUs and two MICs. The tests were made using 10 constraint satisfaction problems, consisting of counting all solutions, finding one solution, or optimizing. Each of the problems was instantiated with up to three different dimensions. PHACT's performance was also compared with that of Gecode, Choco and OR-Tools. In the end, these tests made it possible to detect which techniques implemented in PHACT were already achieving the expected results, and to identify changes that may improve PHACT's performance.
APA, Harvard, Vancouver, ISO, and other styles
35

Sundin, Patricia. "Adaptation of algorithms for underwater sonar data processing to GPU-based systems." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-94023.

Full text
Abstract:
In this master thesis, algorithms for acoustic simulations in underwater environments are ported for GPU processing. The GPU parallel computing platforms used are CUDA, OpenCL and SkePU. The purpose of this master thesis is to adapt and evaluate the ported algorithms' performance on two modern NVIDIA GPUs, the Tesla K20 and the Quadro K5000. Several optimizations described in the existing literature on GPU processing (e.g. usage of shared memory, coalesced memory accesses) are implemented, and multiple versions of each algorithm are created to study their trade-offs. Evaluation on the two GPUs showed that different versions of the same algorithm have different performance characteristics and that execution with the best-performing version can give better performance than the original algorithm executing on 8 CPUs. A performance comparison between the CUDA, OpenCL and SkePU versions of one algorithm is also made.
APA, Harvard, Vancouver, ISO, and other styles
36

Wen, Hao. "IMPROVING PERFORMANCE AND ENERGY EFFICIENCY FOR THE INTEGRATED CPU-GPU HETEROGENEOUS SYSTEMS." VCU Scholars Compass, 2018. https://scholarscompass.vcu.edu/etd/5664.

Full text
Abstract:
Current heterogeneous CPU-GPU architectures integrate general-purpose CPUs and highly thread-level-parallel GPUs (Graphics Processing Units) on the same die. This dissertation focuses on improving the energy efficiency and performance of such heterogeneous CPU-GPU systems. Leakage energy has become an increasingly large fraction of total energy consumption, making it important to reduce leakage energy to improve overall energy efficiency. Caches occupy a large on-chip area, which makes them good targets for leakage energy reduction. For the CPU cache, we study how to reduce cache leakage energy efficiently in a hybrid SPM (Scratch-Pad Memory) and cache architecture. For the GPU cache, the access pattern differs from that of the CPU, usually exhibiting little locality and a high miss rate. In addition, the GPU can hide memory latency more effectively due to multi-threading. For these reasons, we find it is possible to place the cache lines of the GPU data caches into low-power mode more aggressively than traditional leakage management for CPU caches does, which reduces more leakage energy without significant performance degradation. Contention in resources shared between the CPU and GPU, such as the last-level cache (LLC), interconnection network and DRAM, may degrade both CPU and GPU performance. We propose a simple yet effective probability-based method to control the LLC replacement policy, reducing the CPU's inter-core conflict misses caused by the GPU without significantly impacting GPU performance. In addition, we develop two strategies that combine the probability-based method for the LLC with an existing technique called virtual channel partitioning (VCP) for the interconnection network to further improve CPU performance. We also consider breadth-first search (BFS), a basis for graph search and a core building block for many higher-level graph analysis applications, which is a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. These irregularities limit the parallelism of BFS executing on GPUs. Unlike previous works that focus on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism such that more GPU threads can work concurrently. This approach ensures correctness and can significantly improve both performance and energy efficiency on GPUs.
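As a sketch of the virtual-vertex idea, the C++ code below rewrites a CSR graph so that no (virtual) vertex keeps more than a fixed number of neighbours, recording which original vertex each virtual vertex belongs to. The data structures are assumed for illustration; this is not the VBFS implementation from the dissertation.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // CSR graph: rowPtr has numVertices + 1 entries; colIdx holds the
    // concatenated adjacency lists.
    struct Csr {
        std::vector<std::size_t> rowPtr;
        std::vector<int> colIdx;
    };

    // Split every vertex with more than maxDeg neighbours into several virtual
    // vertices of at most maxDeg neighbours each, to even out per-thread work.
    Csr splitHighDegree(const Csr& g, std::size_t maxDeg,
                        std::vector<int>& virtualToReal)   // virtual vertex -> original vertex
    {
        Csr out;
        out.rowPtr.push_back(0);
        virtualToReal.clear();
        std::size_t n = g.rowPtr.size() - 1;
        for (std::size_t v = 0; v < n; ++v) {
            std::size_t begin = g.rowPtr[v], end = g.rowPtr[v + 1];
            for (std::size_t chunk = begin; chunk < end || chunk == begin; chunk += maxDeg) {
                std::size_t chunkEnd = std::min(chunk + maxDeg, end);
                for (std::size_t e = chunk; e < chunkEnd; ++e) out.colIdx.push_back(g.colIdx[e]);
                out.rowPtr.push_back(out.colIdx.size());
                virtualToReal.push_back(static_cast<int>(v));
                if (chunkEnd == end) break;      // also covers zero-degree vertices
            }
        }
        return out;
    }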
APA, Harvard, Vancouver, ISO, and other styles
37

Wang, Dongwei. "A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE." VCU Scholars Compass, 2016. http://scholarscompass.vcu.edu/etd/4840.

Full text
Abstract:
As a throughput-oriented device, the Graphics Processing Unit (GPU) now integrates caches, similar to CPU cores. However, applications in GPGPU computing exhibit distinct memory access patterns. Typically, the caches in GPU cores suffer from thread contention and resource over-utilization, yet few detailed works investigate the root of this phenomenon. In this work, we analyze the memory accesses of twenty benchmarks based on reuse distance theory and quantify their patterns. Additionally, we discuss optimization suggestions and implement a Bypassing-Aware (BA) cache which can intelligently bypass thrashing-prone candidates. The BA cache is a cost-efficient cache design with two extra bits in each line, which serve as flags to make the bypassing decision and find the victim cache line. Experimental results show that the BA cache can improve system performance by around 20% and reduce the cache miss rate by around 11% compared with the traditional design.
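The analysis above rests on reuse distance theory; as a reminder of the underlying metric, the C++ sketch below computes, for each access in a trace, the number of distinct addresses touched since the previous access to the same address. It is a naive host-side illustration, not the profiling tool used in the work.

    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <unordered_map>
    #include <vector>

    // Reuse distance of an access: number of *distinct* addresses touched since
    // the last access to the same address (-1 for a first touch / cold access).
    std::vector<long> reuseDistances(const std::vector<uint64_t>& trace) {
        std::vector<long> dist(trace.size());
        std::unordered_map<uint64_t, std::size_t> lastSeen;   // address -> last index
        for (std::size_t i = 0; i < trace.size(); ++i) {
            auto it = lastSeen.find(trace[i]);
            if (it == lastSeen.end()) {
                dist[i] = -1;                                  // first touch
            } else {
                std::set<uint64_t> distinct(trace.begin() + it->second + 1,
                                            trace.begin() + i);
                dist[i] = static_cast<long>(distinct.size());
            }
            lastSeen[trace[i]] = i;
        }
        return dist;
    }

    int main() {
        std::vector<uint64_t> trace = {0x10, 0x20, 0x30, 0x10, 0x20};
        for (long d : reuseDistances(trace)) std::printf("%ld ", d);   // -1 -1 -1 2 2
        std::printf("\n");
        return 0;
    }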
APA, Harvard, Vancouver, ISO, and other styles
38

Arafat, Md Humayun. "Runtime Systems for Load Balancing and Fault Tolerance on Distributed Systems." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1408972218.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Xiao, Shucai. "Generalizing the Utility of Graphics Processing Units in Large-Scale Heterogeneous Computing Systems." Diss., Virginia Tech, 2013. http://hdl.handle.net/10919/51845.

Full text
Abstract:
Today, heterogeneous computing systems are widely used to meet the increasing demand for high-performance computing. These systems commonly use powerful and energy-efficient accelerators to augment general-purpose processors (i.e., CPUs). The graphics processing unit (GPU) is one such accelerator. Originally designed solely for graphics processing, GPUs have evolved into programmable processors that can deliver massive parallel processing power for general-purpose applications. Using SIMD (Single Instruction, Multiple Data) based components as building units, the current GPU architecture is well suited for data-parallel applications where the execution of each task is independent. With the delivery of programming models such as the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL), programming GPUs has become much easier than before. However, developing and optimizing an application on a GPU is still a challenging task, even for well-trained computing experts. Such programming tasks will be even more challenging in large-scale heterogeneous systems, particularly in the context of utility computing, where GPU resources are used as a service. These challenges are largely due to the limitations in the current programming models: (1) there are no intra- and inter-GPU cooperative mechanisms that are natively supported; (2) current programming models only support the utilization of GPUs installed locally; and (3) to use GPUs on another node, application programs need to explicitly call application programming interface (API) functions for data communication. To reduce the mapping efforts and to better utilize GPU resources, we investigate generalizing the utility of GPUs in large-scale heterogeneous systems with GPUs as accelerators. We generalize the utility of GPUs through the transparent virtualization of GPUs, which enables applications to view all GPUs in the system as if they were installed locally. As a result, all GPUs in the system can be used as local GPUs. Moreover, GPU virtualization is a key capability to support the notion of "GPU as a service." Specifically, we propose the virtual OpenCL (or VOCL) framework for the transparent virtualization of GPUs. To achieve good performance, we optimize and extend the framework in three aspects: (1) optimize VOCL by reducing the data transfer overhead between the local node and the remote node; (2) propose GPU synchronization to reduce the overhead of switching back and forth if multiple kernel launches are needed for data communication across different compute units on a GPU; and (3) extend VOCL to support live virtual GPU migration for quick system maintenance and load rebalancing across GPUs. With the above optimizations and extensions, we thoroughly evaluate VOCL along three dimensions: (1) show the performance improvement for each of our optimization strategies; (2) evaluate the overhead of using remote GPUs via several microbenchmark suites as well as a few real-world applications; and (3) demonstrate the overhead as well as the benefit of live virtual GPU migration. Our experimental results indicate that VOCL can generalize the utility of GPUs in large-scale systems at a reasonable virtualization and migration cost.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
40

Pettersson, Johan. "Real-time Object Recognition on a GPU." Thesis, Linköping University, Department of Electrical Engineering, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-10238.

Full text
Abstract:

Shape-based matching (SBM) is a known method for 2D object recognition that is rather robust against illumination variations, noise, clutter and partial occlusion. The objects to be recognized can be translated, rotated and scaled. The translation of an object is determined by evaluating a similarity measure for all possible positions (similar to cross correlation). The similarity measure is based on dot products between normalized gradient directions in edges. Rotation and scale are determined by evaluating all possible combinations, spanning a huge search space. A resolution pyramid is used to form a heuristic for the search, which then gains real-time performance. For SBM, a model consisting of normalized edge gradient directions is constructed for all possible combinations of rotation and scale. We have avoided this by using (bilinear) interpolation in the search gradient map, which greatly reduces the amount of storage required. SBM is highly parallelizable by nature, and with our suggested improvements it becomes well suited for running on a GPU. This has been implemented and tested, and the results clearly outperform those of our reference CPU implementation (by factors of hundreds). It is also very scalable and easily benefits from future devices without effort. Extensive evaluation material and tools for evaluating object recognition algorithms have been developed, and the implementation is evaluated and compared to two commercial 2D object recognition solutions. The results show that the method is very powerful when dealing with the distortions listed above and competes well with its opponents.
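To make the similarity measure concrete, the C++ sketch below scores one candidate translation by averaging the dot products between the model's unit gradient directions and those found at the corresponding positions in the search image. Data layouts and names are assumptions for illustration; the actual implementation also handles rotation, scale, interpolation and the resolution pyramid.

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Vec2 { float x, y; };

    // Average of dot products between model and search-image gradient directions
    // (cosine of the angle difference for unit vectors) at one candidate position.
    float sbmScore(const std::vector<Vec2>& modelDirs,                  // unit model gradients
                   const std::vector<std::pair<int,int>>& modelPts,     // model point offsets
                   const std::vector<Vec2>& searchDirs,                 // unit search gradients, row-major
                   int searchW, int searchH,
                   int posX, int posY)                                  // candidate translation
    {
        float sum = 0.0f;
        for (std::size_t k = 0; k < modelPts.size(); ++k) {
            int x = posX + modelPts[k].first;
            int y = posY + modelPts[k].second;
            if (x < 0 || y < 0 || x >= searchW || y >= searchH) continue;
            const Vec2& m = modelDirs[k];
            const Vec2& s = searchDirs[y * searchW + x];
            sum += m.x * s.x + m.y * s.y;
        }
        return sum / static_cast<float>(modelPts.size());
    }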

APA, Harvard, Vancouver, ISO, and other styles
41

Wang, Qiang. "Performance and power modeling of GPU systems with dynamic voltage and frequency scaling." HKBU Institutional Repository, 2020. https://repository.hkbu.edu.hk/etd_oa/814.

Full text
Abstract:
To address the ever-increasing demand for computing capacity, more and more heterogeneous systems have been designed to use both general-purpose and special-purpose processors. Their huge energy consumption raises new environmental concerns and challenges. Besides performance, energy efficiency is another key factor to be considered by system designers and consumers. In particular, contemporary graphics processing units (GPUs) support dynamic voltage and frequency scaling (DVFS) to balance computational performance and energy consumption. However, accurate and straightforward performance and power estimation for a given GPU kernel under different frequency settings is still lacking for real hardware, which is essential to determine the best frequency configuration for energy saving. In this thesis, we investigate how to improve the energy efficiency of GPU systems by accurately modeling the effects of GPU DVFS on the target GPU kernel. We also propose efficient algorithms to solve the communication contention problem in scheduling multiple distributed deep learning (DDL) jobs on GPU clusters. We introduce our studies as follows. First, we present a benchmark suite, EPPMiner, for evaluating the performance, power, and energy of different heterogeneous systems. EPPMiner consists of 16 benchmark programs that cover a broad range of application domains, and it shows a great variety in the intensity of utilizing the processors. We have implemented a prototype of EPPMiner that supports OpenMP, CUDA, and OpenCL, and demonstrated its usage through three showcases. The showcases justify that GPUs provide much better energy efficiency than other types of computing systems, and especially illustrate the effectiveness of GPU dynamic voltage and frequency scaling (DVFS) on the energy efficiency of GPU applications. Second, we present a fine-grained analytical model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Compared to cycle-level simulators, which are too slow to apply to real hardware, our model only needs one-off micro-benchmarks to extract a set of hardware parameters and kernel performance counters, without any source code analysis. Our experimental results show that the proposed performance model can capture the kernel performance scaling behaviors under different frequency settings and achieve decent accuracy. Third, we design a cross-benchmarking suite, which simulates kernels with a wide range of instruction distributions. The synthetic kernels generated by this suite can be used for model pre-training or as supplementary training samples. We then build machine learning models to predict the execution time and runtime power of a GPU kernel under different voltage and frequency settings. Validated on three modern GPUs with a wide frequency scaling range, using a collection of 24 real application kernels, the model trained only with our cross-benchmarking suite is able to achieve considerably accurate results. Finally, we establish a new DDL job scheduling framework which organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes. We then propose an efficient job placement algorithm, Least-Workload-First- (LWF-), to balance GPU utilization and consolidate the allocated GPUs for each job. When scheduling the communication tasks, we propose Ada-SRSF for the DDL job scheduling problem to address the communication contention issue. Our simulation results show that LWF- achieves up to 1.59x improvement over the classical first-fit algorithms. More importantly, Ada-SRSF reduces the average job completion time by up to 36.7%, compared to the solutions of either avoiding all communication contention or accepting all of it.
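The greedy intuition behind a least-workload-first placement can be sketched in a few lines of C++: each arriving job takes GPUs from the nodes whose accumulated workload is currently smallest. This is only an illustration of the general idea; the actual LWF- algorithm, its workload metric, and its consolidation rules are those defined in the thesis.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Node { int id; double load; int freeGpus; };
    struct Job  { int gpus; double work; };

    // Assign the job's GPUs starting from the least-loaded nodes.
    // (A real scheduler would roll back the allocation if placement fails.)
    bool place(std::vector<Node>& nodes, const Job& job, std::vector<int>& chosen) {
        std::sort(nodes.begin(), nodes.end(),
                  [](const Node& a, const Node& b) { return a.load < b.load; });
        chosen.clear();
        int remaining = job.gpus;
        for (Node& n : nodes) {
            if (remaining == 0) break;
            if (n.freeGpus == 0) continue;
            int take = std::min(n.freeGpus, remaining);
            n.freeGpus -= take;
            n.load += job.work * take / job.gpus;   // spread the job's work over its GPUs
            remaining -= take;
            chosen.push_back(n.id);
        }
        return remaining == 0;                       // false: not enough free GPUs right now
    }

    int main() {
        std::vector<Node> cluster = {{0, 0.0, 4}, {1, 2.0, 4}, {2, 1.0, 4}};
        std::vector<int> where;
        if (place(cluster, Job{6, 12.0}, where))
            for (int id : where) std::printf("job uses node %d\n", id);
        return 0;
    }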
APA, Harvard, Vancouver, ISO, and other styles
42

Klenk, Benjamin [Verfasser], and Holger [Akademischer Betreuer] Fröning. "Communication Architectures for Scalable GPU-centric Computing Systems / Benjamin Klenk ; Betreuer: Holger Fröning." Heidelberg : Universitätsbibliothek Heidelberg, 2018. http://d-nb.info/1177691078/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Xue, Weicheng. "CPU/GPU Code Acceleration on Heterogeneous Systems and Code Verification for CFD Applications." Diss., Virginia Tech, 2021. http://hdl.handle.net/10919/102073.

Full text
Abstract:
Computational Fluid Dynamics (CFD) applications usually involve intensive computations, which can be accelerated through the use of open accelerators, especially GPUs, due to their common use in the scientific computing community. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented numerically correctly, which is called code verification. This dissertation focuses on accelerating research CFD codes on multi-CPUs/GPUs using MPI and OpenACC, as well as the code verification for turbulence model implementation using the method of manufactured solutions and code-to-code comparisons. First, a variety of performance optimizations both agnostic and specific to applications and platforms are developed in order to 1) improve the heterogeneous CPU/GPU compute utilization; 2) improve the memory bandwidth to the main memory; 3) reduce communication overhead between the CPU host and the GPU accelerator; and 4) reduce the tedious manual tuning work for GPU scheduling. Both finite difference and finite volume CFD codes and multiple platforms with different architectures are utilized to evaluate the performance optimizations used. A maximum speedup of over 70 is achieved on 16 V100 GPUs over 16 Xeon E5-2680v4 CPUs for multi-block test cases. In addition, systematic studies of code verification are performed for a second-order accurate finite volume research CFD code. Cross-term sinusoidal manufactured solutions are applied to verify the Spalart-Allmaras and k-omega SST model implementation, both in 2D and 3D. This dissertation shows that the spatial and temporal schemes are implemented numerically correctly.
Doctor of Philosophy
Computational Fluid Dynamics (CFD) is a numerical method to solve fluid problems, which usually requires a large amount of computations. A large CFD problem can be decomposed into smaller sub-problems which are stored in discrete memory locations and accelerated by a large number of compute units. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented correctly, which is called code verification. This dissertation focuses on the CFD code acceleration as well as the code verification for turbulence model implementation. In this dissertation, multiple Graphic Processing Units (GPUs) are utilized to accelerate two CFD codes, considering that the GPU has high computational power and high memory bandwidth. A variety of optimizations are developed and applied to improve the performance of CFD codes on different parallel computing systems. The program execution time can be reduced significantly especially when multiple GPUs are used. In addition, code-to-code comparisons with some NASA CFD codes and the method of manufactured solutions are utilized to verify the correctness of a research CFD code.
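Code verification with manufactured solutions typically ends in an order-of-accuracy check; the small C++ sketch below computes the observed order from the error norms on two systematically refined grids. The error values in the example are made up for illustration.

    #include <cmath>
    #include <cstdio>

    // Observed order of accuracy from two grid levels related by refinement
    // ratio r:  p = log(E_coarse / E_fine) / log(r).  For a correctly
    // implemented second-order scheme, p should approach 2 as the grid refines.
    double observedOrder(double errCoarse, double errFine, double refinementRatio) {
        return std::log(errCoarse / errFine) / std::log(refinementRatio);
    }

    int main() {
        // Illustrative error norms from two grid levels (made-up numbers).
        double eCoarse = 4.1e-3, eFine = 1.0e-3, r = 2.0;
        std::printf("observed order of accuracy: %.2f\n", observedOrder(eCoarse, eFine, r));
        return 0;
    }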
APA, Harvard, Vancouver, ISO, and other styles
44

Chien, Wei Der. "An Evaluation of TensorFlow as a Programming Framework for HPC Applications." Thesis, KTH, Beräkningsvetenskap och beräkningsteknik (CST), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233795.

Full text
Abstract:
In recent years deep learning, a branch of machine learning, has gained increasing popularity due to its extensive applications and performance. At the core of these applications is dense matrix-matrix multiplication. Graphics Processing Units (GPUs) are commonly used in the training process due to their massively parallel computation capabilities. In addition, specialized low-precision accelerators have emerged to specifically address tensor operations. Software frameworks, such as TensorFlow, have also emerged to increase the expressiveness of neural network model development. In TensorFlow, computation problems are expressed as computation graphs where nodes denote operations and edges denote data movement between operations. With an increasing number of heterogeneous accelerators that may co-exist on the same cluster system, it has become increasingly difficult for users to program efficient and scalable applications. TensorFlow provides a high level of abstraction, and it is possible to place operations of a computation graph on a device easily through a high-level API. In this work, the usability of TensorFlow as a programming framework for HPC applications is reviewed. We give an introduction to TensorFlow as a programming framework and paradigm for distributed computation. Two sample applications are implemented in TensorFlow: tiled matrix multiplication and a conjugate gradient solver for solving large linear systems. We illustrate how such problems can be expressed as computation graphs for distributed computation. We perform scalability tests, comment on performance scaling results, and quantify how TensorFlow can take advantage of HPC systems by micro-benchmarking communication performance. Through this work, we show that TensorFlow is an emerging and promising platform which is well suited for a particular class of problems that require very little synchronization.
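For reference, the conjugate gradient iteration that the thesis expresses as a distributed computation graph looks as follows when written as a plain, dense C++ solver. This is the generic textbook formulation, not the thesis' TensorFlow implementation, and the small test system is made up.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    static Vec matVec(const Mat& A, const Vec& x) {
        Vec y(A.size(), 0.0);
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
        return y;
    }

    // Conjugate gradient for a symmetric positive definite system A x = b.
    Vec conjugateGradient(const Mat& A, const Vec& b, int maxIter, double tol) {
        Vec x(b.size(), 0.0), r = b, p = r;
        double rsOld = dot(r, r);
        for (int k = 0; k < maxIter && std::sqrt(rsOld) > tol; ++k) {
            Vec Ap = matVec(A, p);
            double alpha = rsOld / dot(p, Ap);
            for (std::size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rsNew = dot(r, r);
            for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + (rsNew / rsOld) * p[i];
            rsOld = rsNew;
        }
        return x;
    }

    int main() {
        Mat A = {{4.0, 1.0}, {1.0, 3.0}};
        Vec b = {1.0, 2.0};
        Vec x = conjugateGradient(A, b, 100, 1e-10);
        std::printf("x = (%.4f, %.4f)\n", x[0], x[1]);   // expect about (0.0909, 0.6364)
        return 0;
    }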
APA, Harvard, Vancouver, ISO, and other styles
45

Alhowaidi, Mohammad. "Real-Time Systems with Radiation-Hardened Processors : A GPU-based Framework to Explore Tradeoffs." Thesis, Linköpings universitet, ESLAB - Laboratoriet för inbyggda system, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-77261.

Full text
Abstract:
Radiation-hardened processors are designed to be resilient against soft errors, but such processors are slower than Commercial Off-The-Shelf (COTS) processors as well as significantly costlier. In order to mitigate the high costs, software techniques such as task re-executions must be deployed together with adequately hardened processors to provide reliability. This leads to a huge design space comprising the hardening level of the processors and the number of re-executions of each task in the system. Each configuration in this design space represents a tradeoff between processor load, reliability and costs. The reliability comes at the price of higher costs due to higher levels of hardening and of performance degradation due to hardening or due to re-executions. Thus, the tradeoffs between performance, reliability and costs must be carefully studied. Pertinent questions that arise in such a design scenario are: (i) how many times must a task be re-executed, and (ii) what should the hardening level be, such that the system reliability is satisfied? In order to evaluate such tradeoffs efficiently, in this thesis we propose a novel framework that harnesses the computational power of Graphics Processing Units (GPUs). Our framework is based on a system failure probability analysis that connects the probability of failure of tasks to the overall system reliability. Based on characteristics of this probabilistic analysis as well as real-time deadlines, we derive bounds on the design space to prune infeasible solutions. Finally, we illustrate the benefits of our proposed framework with several experiments.
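The basic relation driving this design space can be made explicit in a short C++ sketch: if a single execution of a task fails with probability p (lower for more heavily hardened processors), then k re-executions give a success probability of 1 - p^(k+1), from which the minimum number of re-executions for a target reliability follows. The failure probabilities and the target below are placeholders, not values from the thesis.

    #include <cstdio>

    // Smallest k such that 1 - p^(k+1) >= targetReliability, where p is the
    // per-execution failure probability of the task.
    int minReexecutions(double pFail, double targetReliability) {
        int k = 0;
        double pAllFail = pFail;                 // probability that every attempt so far failed
        while (1.0 - pAllFail < targetReliability) { ++k; pAllFail *= pFail; }
        return k;
    }

    int main() {
        double hardeningFail[] = {1e-2, 1e-4, 1e-6};    // per-execution failure probabilities
        for (double p : hardeningFail)
            std::printf("p = %.0e -> %d re-execution(s) for 1 - 1e-9 reliability\n",
                        p, minReexecutions(p, 1.0 - 1e-9));
        return 0;
    }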
APA, Harvard, Vancouver, ISO, and other styles
46

Grewe, Dominik. "Mapping parallel programs to heterogeneous multi-core systems." Thesis, University of Edinburgh, 2014. http://hdl.handle.net/1842/8852.

Full text
Abstract:
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to not only handle graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level and thus more user-friendly programming language. Both challenges are addressed in this thesis. For the task mapping problem a machine learning-based approach is presented in this thesis. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods are able to outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. This approach is shown to outperform both a competing approach as well as hand-tuned code.
APA, Harvard, Vancouver, ISO, and other styles
47

Villarroel, Felipe Andres Cruz. "Particle flow simulation using a parallel FMM on distributed memory systems and GPU architectures." Thesis, University of Bristol, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.541607.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Sjöström, Oskar. "Parallelizing the Edge application for GPU-based systems using the SkePU skeleton programming library." Thesis, Linköpings universitet, Programvara och system, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-122255.

Full text
Abstract:
SkePU is an auto-tunable multi-backend skeleton programming library for multi-GPU systems. SkePU is implemented as a C++ template library and has been developed at Linköping University. In this thesis the CFD flow solver Edge has been ported to SkePU. This combines the paradigm of skeleton programming with the utilization of the unstructured grid structure used by Edge. In order to do this certain extensions have been made to the SkePU library. The performance of the ported implementation has been evaluated to identify if a performance gain can be achieved by parallelizing this type of application with the help of SkePU. A moderate speedup of the application has been achieved given the size of the ported section of the Edge application. Another important outcome of the project is the provided feedback for further development of the SkePU framework.
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Kaibo. "Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1447685368.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Sköld, Philip. "Real Time Volumetric Ray Marching with Ordered Dithering : Reducing required samples for ray marched volumetric lighting on the GPU." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240619.

Full text
Abstract:
Volumetric lighting is a collective term for visual phenomena that occur due to how light interacts inside so-called participating media, and it accounts for many recognizable effects such as fog or light shafts. Because it is very computationally expensive, calculating volumetric lighting both accurately and efficiently has been an important problem within computer graphics. Ray marching is a technique that has been used extensively in non-real-time applications to compute volumetric lighting and has recently been adapted for real-time applications by use of the GPU. In this thesis we implement and evaluate volumetric ray marching with ordered dithering. The results show how ordered dithering yields significant performance improvements, retaining high quality while lowering the number of samples. We conclude that, with ordered dithering, volumetric ray marching is a suitable approach for real-time volumetric lighting on the GPU, and we discuss both important additional optimizations and how ordered dithering will likely remain important in future ray marching implementations.
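The effect of ordered dithering on sample placement can be sketched as follows: each pixel offsets the start of its ray march by a fraction of the step length taken from a 4x4 Bayer matrix, so neighbouring pixels sample different depths and fewer samples per ray suffice. The code is a host-side C++ illustration of that sample placement, not the GPU shader from the thesis.

    #include <cstdio>

    // Standard 4x4 Bayer matrix, normalised to [0, 1).
    static const float kBayer4x4[4][4] = {
        { 0.0f/16,  8.0f/16,  2.0f/16, 10.0f/16},
        {12.0f/16,  4.0f/16, 14.0f/16,  6.0f/16},
        { 3.0f/16, 11.0f/16,  1.0f/16,  9.0f/16},
        {15.0f/16,  7.0f/16, 13.0f/16,  5.0f/16}};

    // Fill outDistances with the distances along the ray at which this pixel samples.
    void samplePositions(int pixelX, int pixelY, float rayStart, float rayEnd,
                         int numSamples, float* outDistances)
    {
        float step = (rayEnd - rayStart) / numSamples;
        float offset = kBayer4x4[pixelY & 3][pixelX & 3] * step;   // per-pixel dither offset
        for (int i = 0; i < numSamples; ++i)
            outDistances[i] = rayStart + offset + i * step;
    }

    int main() {
        float d[4];
        samplePositions(1, 2, 0.0f, 8.0f, 4, d);
        for (float x : d) std::printf("%.3f ", x);   // samples shifted by the dither offset
        std::printf("\n");
        return 0;
    }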
APA, Harvard, Vancouver, ISO, and other styles
