Dissertations / Theses: 'High-performance, graph processing, GPU'

1

Segura, Salvador Albert. "High-performance and energy-efficient irregular graph processing on GPU architectures." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/671449.

Full text

Abstract:

Graph processing is an established and prominent domain that is the foundation of new emerging applications in areas such as Data Analytics and Machine Learning, empowering applications such as road navigation, social networks and automatic speech recognition. The large amount of data employed in these domains requires high throughput architectures such as GPGPU. Although the processing of large graph-based workloads exhibits a high degree of parallelism, memory access patterns tend to be highly irregular, leading to poor efficiency due to memory divergence.In order to ameliorate these issues, GPGPU graph applications perform stream compaction operations which process active nodes/edges so subsequent steps work on a compacted dataset. We propose to offload this task to the Stream Compaction Unit (SCU) hardware extension tailored to the requirements of these operations, which additionally performs pre-processing by filtering and reordering elements processed.We show that memory divergence inefficiencies prevail in GPGPU irregular graph-based applications, yet we find that it is possible to relax the strict relationship between thread and processed data to empower new optimizations. As such, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension integrated in the GPU pipeline that reorders and filters data processed by the threads on irregular accesses improving memory coalescing.Finally, we leverage the strengths of both previous approaches to achieve synergistic improvements. We do so by proposing the IRU-enhanced SCU (ISCU), which employs the efficient pre-processing mechanisms of the IRU to improve SCU stream compaction efficiency and NoC throughput limitations due to SCU pre-processing operations. We evaluate the ISCU with state-of-the-art graph-based applications achieving a 2.2x performance improvement and 10x energy-efficiency.
El processament de grafs és un domini prominent i establert com a la base de noves aplicacions emergents en àrees com l'anàlisi de dades i Machine Learning, que permeten aplicacions com ara navegació per carretera, xarxes socials i reconeixement automàtic de veu. La gran quantitat de dades emprades en aquests dominis requereix d’arquitectures d’alt rendiment, com ara GPGPU. Tot i que el processament de grans càrregues de treball basades en grafs presenta un alt grau de paral·lelisme, els patrons d’accés a la memòria tendeixen a ser irregulars, fet que redueix l’eficiència a causa de la divergència d’accessos a memòria. Per tal de millorar aquests problemes, les aplicacions de grafs per a GPGPU realitzen operacions de stream compaction que processen nodes/arestes per tal que els passos posteriors funcionin en un conjunt de dades compactat. Proposem deslliurar d’aquesta tasca a la extensió hardware Stream Compaction Unit (SCU) adaptada als requisits d’aquestes operacions, que a més realitza un pre-processament filtrant i reordenant els elements processats.Mostrem que les ineficiències de divergència de memòria prevalen en aplicacions GPGPU basades en grafs irregulars, tot i que trobem que és possible relaxar la relació estricta entre threads i les dades processades per obtenir noves optimitzacions. Com a tal, proposem la Irregular accesses Reorder Unit (IRU), una nova extensió de maquinari integrada al pipeline de la GPU que reordena i filtra les dades processades pels threads en accessos irregulars que milloren la convergència d’accessos a memòria. Finalment, aprofitem els punts forts de les propostes anteriors per aconseguir millores sinèrgiques. Ho fem proposant la IRU-enhanced SCU (ISCU), que utilitza els mecanismes de pre-processament eficients de la IRU per millorar l’eficiència de stream compaction de la SCU i les limitacions de rendiment de NoC a causa de les operacions de pre-processament de la SCU.

APA, Harvard, Vancouver, ISO, and other styles

2

McLaughlin, Adam Thomas. "Power-constrained performance optimization of GPU graph traversal." Thesis, Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50209.

Full text

Abstract:

Graph traversal represents an important class of graph algorithms that is the nucleus of many large scale graph analytics applications. While improving the performance of such algorithms using GPUs has received attention, understanding and managing performance under power constraints has not yet received similar attention. This thesis first explores the power and performance characteristics of breadth first search (BFS) via measurements on a commodity GPU. We utilize this analysis to address the problem of minimizing execution time below a predefined power limit or power cap exposing key relationships between graph properties and power consumption. We modify the firmware on a commodity GPU to measure power usage and use the GPU as an experimental system to evaluate future architectural enhancements for the optimization of graph algorithms. Specifically, we propose and evaluate power management algorithms that scale i) the GPU frequency or ii) the number of active GPU compute units for a diverse set of real-world and synthetic graphs. Compared to scaling either frequency or compute units individually, our proposed schemes reduce execution time by an average of 18.64% by adjusting the configuration based on the inter- and intra-graph characteristics.

APA, Harvard, Vancouver, ISO, and other styles

3

Lee, Dongwon. "High-performance computer system architectures for embedded computing." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42766.

Full text

Abstract:

The main objective of this thesis is to propose new methods for designing high-performance embedded computer system architectures. To achieve the goal, three major components - multi-core processing elements (PEs), DRAM main memory systems, and on/off-chip interconnection networks - in multi-processor embedded systems are examined in each section respectively. The first section of this thesis presents architectural enhancements to graphics processing units (GPUs), one of the multi- or many-core PEs, for improving performance of embedded applications. An embedded application is first mapped onto GPUs to explore the design space, and then architectural enhancements to existing GPUs are proposed for improving throughput of the embedded application. The second section proposes high-performance buffer mapping methods, which exploit useful features of DRAM main memory systems, in DSP multi-processor systems. The memory wall problem becomes increasingly severe in multiprocessor environments because of communication and synchronization overheads. To alleviate the memory wall problem, this section exploits bank concurrency and page mode access of DRAM main memory systems for increasing the performance of multiprocessor DSP systems. The final section presents a network-centric Turbo decoder and network-centric FFT processors. In the era of multi-processor systems, an interconnection network is another performance bottleneck. To handle heavy communication traffic, this section applies a crossbar switch - one of the indirect networks - to the parallel Turbo decoder, and applies a mesh topology to the parallel FFT processors. When designing the mesh FFT processors, a very different approach is taken to improve performance; an optical fiber is used as a new interconnection medium.

APA, Harvard, Vancouver, ISO, and other styles

4

Sedaghati, Mokhtari Naseraddin. "Performance Optimization of Memory-Bound Programs on Data Parallel Accelerators." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1452255686.

Full text

APA, Harvard, Vancouver, ISO, and other styles

5

Hong, Changwan. "Code Optimization on GPUs." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1557123832601533.

Full text

APA, Harvard, Vancouver, ISO, and other styles

6

Hassan, Mohamed Wasfy Abdelfattah. "Using Workload Characterization to Guide High Performance Graph Processing." Diss., Virginia Tech, 2021. http://hdl.handle.net/10919/103469.

Full text

Abstract:

Graph analytics represent an important application domain widely used in many fields such as web graphs, social networks, and Bayesian networks. The sheer size of the graph data sets combined with the irregular nature of the underlying problem pose a significant challenge for performance, scalability, and power efficiency of graph processing. With the exponential growth of the size of graph datasets, there is an ever-growing need for faster more power efficient graph solvers. The computational needs of graph processing can take advantage of the FPGAs' power efficiency and customizable architecture paired with CPUs' general purpose processing power and sophisticated cache policies. CPU-FPGA hybrid systems have the potential for supporting performant and scalable graph solvers if both devices can work coherently to make up for each other's deficits. This study aims to optimize graph processing on heterogeneous systems through interdisciplinary research that would impact both the graph processing community, and the FPGA/heterogeneous computing community. On one hand, this research explores how to harness the computational power of FPGAs and how to cooperatively work in a CPU-FPGA hybrid system. On the other hand, graph applications have a data-driven execution profile; hence, this study explores how to take advantage of information about the graph input properties to optimize the performance of graph solvers. The introduction of High Level Synthesis (HLS) tools allowed FPGAs to be accessible to the masses but they are yet to be performant and efficient, especially in the case of irregular graph applications. Therefore, this dissertation proposes automated frameworks to help integrate FPGAs into mainstream computing. This is achieved by first exploring the optimization space of HLS-FPGA designs, then devising a domain-specific performance model that is used to build an automated framework to guide the optimization process. Moreover, the architectural strengths of both CPUs and FPGAs are exploited to maximize graph processing performance via an automated framework for workload distribution on the available hardware resources.
Doctor of Philosophy
Graph processing is a very important application domain, which is emphasized by the fact that many real-world problems can be represented as graph applications. For instance, looking at the internet, web pages can be represented as the graph vertices while hyper links between them represent the edges. Analyzing these types of graphs is used for web search engines, ranking websites, and network analysis among other uses. However, graph processing is computationally demanding and very challenging to optimize. This is due to the irregular nature of graph problems, which can be characterized by frequent indirect memory accesses. Such a memory access pattern is dependent on the data input and impossible to predict, which renders CPUs' sophisticated caching policies useless to performance. With the rise of heterogeneous computing that enabled using hardware accelerators, a new research area was born, attempting to maximize performance by utilizing the available hardware devices in a heterogeneous ecosystem. This dissertation aims to improve the efficiency of utilizing such heterogeneous systems when targeting graph applications. More specifically, this research focuses on the collaboration of CPUs and FPGAs (Field Programmable Gate Arrays) in a CPU-FPGA hybrid system. Innovative ideas are presented to exploit the strengths of each available device in such a heterogeneous system, as well as addressing some of the inherent challenges of graph processing. Automated frameworks are introduced to efficiently utilize the FPGA devices, in addition to distributing and scheduling the workload across multiple devices to maximize the performance of graph applications.

APA, Harvard, Vancouver, ISO, and other styles

7

Smith, Michael Shawn. "Performance Analysis of Hybrid CPU/GPU Environments." PDXScholar, 2010. https://pdxscholar.library.pdx.edu/open_access_etds/300.

Full text

Abstract:

We present two metrics to assist the performance analyst to gain a unified view of application performance in a hybrid environment: GPU Computation Percentage and GPU Load Balance. We analyze the metrics using a matrix multiplication benchmark suite and a real scientific application. We also extend an experiment management system to support GPU performance data and to calculate and store our GPU Computation Percentage and GPU Load Balance metrics.

APA, Harvard, Vancouver, ISO, and other styles

8

Cyrus, Sam. "Fast Computation on Processing Data Warehousing Queries on GPU Devices." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6214.

Full text

Abstract:

Current database management systems use Graphic Processing Units (GPUs) as dedicated accelerators to process each individual query, which results in underutilization of GPU. When a single query data warehousing workload was run on an open source GPU query engine, the utilization of main GPU resources was found to be less than 25%. The low utilization then leads to low system throughput. To resolve this problem, this paper suggests a way to transfer all of the desired data into the global memory of GPU and keep it until all queries are executed as one batch. The PCIe transfer time from CPU to GPU is minimized, which results in better performance in less time of overall query processing. The execution time was improved by up to 40% when running multiple queries, compared to dedicated processing.

APA, Harvard, Vancouver, ISO, and other styles

9

Madduri, Kamesh. "A high-performance framework for analyzing massive complex networks." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/24712.

Full text

Abstract:

Thesis (Ph.D.)--Computing, Georgia Institute of Technology, 2009.
Committee Chair: Bader, David; Committee Member: Berry, Jonathan; Committee Member: Fujimoto, Richard; Committee Member: Saini, Subhash; Committee Member: Vuduc, Richard

APA, Harvard, Vancouver, ISO, and other styles

10

Hordemann, Glen J. "Exploring High Performance SQL Databases with Graphics Processing Units." Bowling Green State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1380125703.

Full text

APA, Harvard, Vancouver, ISO, and other styles

11

Collet, Julien. "Exploration of parallel graph-processing algorithms on distributed architectures." Thesis, Compiègne, 2017. http://www.theses.fr/2017COMP2391/document.

Full text

Abstract:

Avec l'explosion du volume de données produites chaque année, les applications du domaine du traitement de graphes ont de plus en plus besoin d'être parallélisées et déployées sur des architectures distribuées afin d'adresser le besoin en mémoire et en ressource de calcul. Si de telles architectures larges échelles existent, issue notamment du domaine du calcul haute performance (HPC), la complexité de programmation et de déploiement d’algorithmes de traitement de graphes sur de telles cibles est souvent un frein à leur utilisation. De plus, la difficile compréhension, a priori, du comportement en performances de ce type d'applications complexifie également l'évaluation du niveau d'adéquation des architectures matérielles avec de tels algorithmes. Dans ce contexte, ces travaux de thèses portent sur l’exploration d’algorithmes de traitement de graphes sur architectures distribuées en utilisant GraphLab, un Framework de l’état de l’art dédié à la programmation parallèle de tels algorithmes. En particulier, deux cas d'applications réelles ont été étudiées en détails et déployées sur différentes architectures à mémoire distribuée, l’un venant de l’analyse de trace d’exécution et l’autre du domaine du traitement de données génomiques. Ces études ont permis de mettre en évidence l’existence de régimes de fonctionnement permettant d'identifier des points de fonctionnements pertinents dans lesquels on souhaitera placer un système pour maximiser son efficacité. Dans un deuxième temps, une étude a permis de comparer l'efficacité d'architectures généralistes (type commodity cluster) et d'architectures plus spécialisées (type serveur de calcul hautes performances) pour le traitement de graphes distribué. Cette étude a démontré que les architectures composées de grappes de machines de type workstation, moins onéreuses et plus simples, permettaient d'obtenir des performances plus élevées. Cet écart est d'avantage accentué quand les performances sont pondérées par les coûts d'achats et opérationnels. L'étude du comportement en performance de ces architectures a également permis de proposer in fine des règles de dimensionnement et de conception des architectures distribuées, dans ce contexte. En particulier, nous montrons comment l’étude des performances fait apparaitre les axes d’amélioration du matériel et comment il est possible de dimensionner un cluster pour traiter efficacement une instance donnée. Finalement, des propositions matérielles pour la conception de serveurs de calculs plus performants pour le traitement de graphes sont formulées. Premièrement, un mécanisme est proposé afin de tempérer la baisse significative de performance observée quand le cluster opère dans un point de fonctionnement où la mémoire vive est saturée. Enfin, les deux applications développées ont été évaluées sur une architecture à base de processeurs basse-consommation afin d'étudier la pertinence de telles architectures pour le traitement de graphes. Les performances mesurés en utilisant de telles plateformes sont encourageantes et montrent en particulier que la diminution des performances brutes par rapport aux architectures existantes est compensée par une efficacité énergétique bien supérieure
With the advent of ever-increasing graph datasets in a large number of domains, parallel graph-processing applications deployed on distributed architectures are more and more needed to cope with the growing demand for memory and compute resources. Though large-scale distributed architectures are available, notably in the High-Performance Computing (HPC) domain, the programming and deployment complexity of such graphprocessing algorithms, whose parallelization and complexity are highly data-dependent, hamper usability. Moreover, the difficult evaluation of performance behaviors of these applications complexifies the assessment of the relevance of the used architecture. With this in mind, this thesis work deals with the exploration of graph-processing algorithms on distributed architectures, notably using GraphLab, a state of the art graphprocessing framework. Two use-cases are considered. For each, a parallel implementation is proposed and deployed on several distributed architectures of varying scales. This study highlights operating ranges, which can eventually be leveraged to appropriately select a relevant operating point with respect to the datasets processed and used cluster nodes. Further study enables a performance comparison of commodity cluster architectures and higher-end compute servers using the two use-cases previously developed. This study highlights the particular relevance of using clustered commodity workstations, which are considerably cheaper and simpler with respect to node architecture, over higher-end systems in this applicative context. Then, this thesis work explores how performance studies are helpful in cluster design for graph-processing. In particular, studying throughput performances of a graph-processing system gives fruitful insights for further node architecture improvements. Moreover, this work shows that a more in-depth performance analysis can lead to guidelines for the appropriate sizing of a cluster for a given workload, paving the way toward resource allocation for graph-processing. Finally, hardware improvements for next generations of graph-processing servers areproposed and evaluated. A flash-based victim-swap mechanism is proposed for the mitigation of unwanted overloaded operations. Then, the relevance of ARM-based microservers for graph-processing is investigated with a port of GraphLab on a NVIDIA TX2-based architecture

APA, Harvard, Vancouver, ISO, and other styles

12

Ling, Cheng. "High performance bioinformatics and computational biology on general-purpose graphics processing units." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6260.

Full text

Abstract:

Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis firstly presents a multi-threaded parallel design of the Smith- Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (Inter-task and Intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPPbased implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up improvements compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method however only gives one possible tree which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGA) technology.

APA, Harvard, Vancouver, ISO, and other styles

13

Cooper, Lee Alex Donald. "High Performance Image Analysis for Large Histological Datasets." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1250004647.

Full text

APA, Harvard, Vancouver, ISO, and other styles

14

Shanmugam, Sakthivadivel Saravanakumar. "Fast-NetMF: Graph Embedding Generation on Single GPU and Multi-core CPUs with NetMF." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1557162076041442.

Full text

APA, Harvard, Vancouver, ISO, and other styles

15

Abu, Doleh Anas. "High Performance and Scalable Matching and Assembly of Biological Sequences." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1469092998.

Full text

APA, Harvard, Vancouver, ISO, and other styles

16

Henriksson, Jonas. "Implementation of a real-time Fast Fourier Transform on a Graphics Processing Unit with data streamed from a high-performance digitizer." Thesis, Linköpings universitet, Programvara och system, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-113389.

Full text

Abstract:

In this thesis we evaluate the prospects of performing real-time digital signal processing on a graphics processing unit (GPU) when linked together with a high-performance digitizer. A graphics card is acquired and an implementation developed that address issues such as transportation of data and capability of coping with the throughput of the data stream. Furthermore, it consists of an algorithm for executing consecutive fast Fourier transforms on the digitized signal together with averaging and visualization of the output spectrum. An empirical approach has been used when researching different available options for streaming data. For better performance, an analysis of the introduced noise of using single-precision over double-precision has been performed to decide on the required precision in the context of this thesis. The choice of graphics card is based on an empirical investigation coupled with a measurement-based approach. An implementation in single-precision with streaming from the digitizer, by means of double buffering in CPU RAM, capable of speeds up to 3.0 GB/s is presented. Measurements indicate that even higher bandwidths are possible without overflowing the GPU. Tests show that the implementation is capable of computing the spectrum for transform sizes of $2^{21}$ , however measurements indicate that higher and lower transform sizes are possible. The results of the computations are visualized in real-time.

APA, Harvard, Vancouver, ISO, and other styles

17

Banihashemi, Seyed Parsa. "Parallel explicit FEM algorithms using GPU's." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54391.

Full text

Abstract:

The Explicit Finite Element Method is a powerful tool in nonlinear dynamic finite element analysis. Recent major developments in computational devices, in particular, General Purpose Graphical Processing Units (GPGPU's) now make it possible to increase the performance of the explicit FEM. This dissertation investigates existing explicit finite element method algorithms which are then redesigned for GPU's and implemented. The performance of these algorithms is assessed and a new asynchronous variational integrator spatial decomposition (AVISD) algorithm is developed which is flexible and encompasses all other methods and can be tuned based for a user-defined problem and the performance of the user's computer. The mesh-aware performance of the proposed explicit finite element algorithm is studied and verified by implementation. The current research also introduces the use of a Particle Swarm Optimization method to tune the performance of the proposed algorithm automatically given a finite element mesh and the performance characteristics of a user's computer. For this purpose, a time performance model is developed which depends on the finite element mesh and the machine performance. This time performance model is then used as an objective function to minimize the run-time cost. Also, based on the performance model provided in this research and predictions about the changes in GPU's in the near future, the performance of the AVISD method is predicted for future machines. Finally, suggestions and insights based on these results are proposed to help facilitate future explicit FEM development.

APA, Harvard, Vancouver, ISO, and other styles

18

Nedel, Werner Mauricio. "Analise dos efeitos de falhas transientes no conjunto de banco de registradores em unidades gráficas de processamento." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2015. http://hdl.handle.net/10183/140441.

Full text

Abstract:

Unidades gráficas de processamento, mais conhecidas como GPUs (Graphics Processing Unit), são dispositivos que possuem um grande poder de processamento paralelo com respectivo baixo custo de operação. Sua capacidade de simultaneamente manipular grandes blocos de memória a credencia a ser utilizada nas mais variadas aplicações, tais como processamento de imagens, controle de tráfego aéreo, pesquisas acadêmicas, dentre outras. O termo GPGPUs (General Purpose Graphic Processing Unit) designa o uso de GPUs utilizadas na computação de aplicações de uso geral. A rápida proliferação das GPUs com ao advento de um modelo de programação amigável ao usuário fez programadores utilizarem essa tecnologia em aplicações onde confiabilidade é um requisito crítico, como aplicações espaciais, automotivas e médicas. O crescente uso de GPUs nestas aplicações faz com que novas arquiteturas deste dispositivo sejam propostas a fim de explorar seu alto poder computacional. A arquitetura FlexGrip (FLEXible GRaphIcs Processor) é um exemplo de GPGPU implementada em FPGA (Field Programmable Gate Array), sendo compatível com programas implementados especificamente para GPUs, com a vantagem de possibilitar a customização da arquitetura de acordo com a necessidade do usuário. O constante aumento da demanda por tecnologia fez com que GPUs de última geração sejam fabricadas em tecnologias com processo de fabricação de até 28nm, com frequência de relógio de até 1GHz. Esse aumento da frequência de relógio e densidade de transistores, combinados com a redução da tensão de operação, faz com que os transistores fiquem mais suscetíveis a falhas causadas por interferência de radiação. O modelo de programação utilizado pelas GPUs faz uso de constantes acessos a memórias e registradores, tornando estes dispositivos sensíveis a perturbações transientes em seus valores armazenados. Estas perturbações são denominadas Single Event Upset (SEU), ou bit-flip, e podem resultar em erros no resultado final da aplicação. Este trabalho tem por objetivo apresentar um modelo de injeção de falhas transientes do tipo SEU nos principais bancos de registradores da GPGPU Flexgrip, avaliando o comportamento da execução de diferentes algoritmos em presença de SEUs. O impacto de diferentes distribuições de recursos computacionais da GPU em sua confiabilidade também é abordado. Resultados podem indicar maneiras eficientes de obter-se confiabilidade explorando diferentes configurações de GPUs.
Graphic Process Units (GPUs) are specialized massively parallel units that are widely used due to their high computing processing capability with respective lower costs. The ability to rapidly manipulate high amounts of memory simultaneously makes them suitable for solving computer-intensive problems, such as analysis of air traffic control, academic researches, image processing and others. General-Purpose Graphic Processing Units (GPGPUs) designates the use of GPUs in applications commonly handled by Central Processing Units (CPUs). The rapid proliferation of GPUs due to the advent of significant programming support has brought programmers to use such devices in safety critical applications, like automotive, space and medical. This crescent use of GPUs pushed developers to explore its parallel architecture and proposing new implementations of such devices. The FLEXible GRaphics Processor (FlexGrip) is an example of GPGPU optimized for Field Programmable Arrays (FPGAs) implementation, fully compatible with GPU’s compiled programs. The increasing demand for computational has pushed GPUs to be built in cuttingedge technology down to 28nm fabrication process for the latest NVIDIA devices with operating clock frequencies up to 1GHz. The increases in operating frequencies and transistor density combined with the reduction of voltage supplies have made transistors more susceptible to faults caused by radiation. The program model adopted by GPUs makes constant accesses to its memories and registers, making this device sensible to transient perturbations in its stored values. These perturbations are called Single Event Upset (SEU), or just bit-flip, and might cause the system to experience an error. The main goal of this work is to study the behavior of the GPGPU FlexGrip under the presence of SEUs in a range of applications. The distribution of computational resources of the GPUs and its impact in the GPU confiability is also explored, as well as the characterization of the errors observed in the fault injection campaigns. Results can indicate efficient configurations of GPUs in order to avoid perturbations in the system under the presence of SEUs.

APA, Harvard, Vancouver, ISO, and other styles

19

Vitor, Giovani Bernardes 1985. "Rastreamento de alvo móvel em mono-visão aplicado no sistema de navegação autônoma utilizando GPU." [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/264975.

Full text

Abstract:

Orientador: Janito Vaqueiro Ferreira
Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Mecânica
Made available in DSpace on 2018-08-16T19:38:32Z (GMT). No. of bitstreams: 1 Vitor_GiovaniBernardes_M.pdf: 6258094 bytes, checksum: fbd34947eb1efdce50b97b27f56c1920 (MD5) Previous issue date: 2010
Resumo: O sistema de visão computacional é bastante útil em diversas aplicações de veículos autônomos, como em geração de mapas, desvio de obstáculos, tarefas de posicionamento e rastreamento de alvos. Além disso, a visão computacional pode proporcionar um ganho significativo na confiabilidade, versatilidade e precisão das tarefas robóticas, questões cruciais na maioria das aplicações reais. O presente trabalho tem como objetivo principal o desenvolvimento de uma metodologia de controle servo visual em veículos robóticos terrestres para a realização de rastreamento e perseguição de um alvo. O procedimento de rastreamento é baseado na correspondência da região alvo entre a seqüência de imagens, e a perseguição pela geração do movimento de navegação baseado nas informações da região alvo. Dentre os aspectos que contribuem para a solução do procedimento de rastreamento proposto, considera-se o uso das técnicas de processamento de imagens como filtro KNN, filtro Sobel, filtro HMIN e transformada Watershed que unidas proporcionam a robustez desejada para a solução. No entanto, esta não é uma técnica compatível com sistema de tempo real. Deste modo, tais algoritmos foram modelados para processamento paralelo em placas gráficas utilizando CUDA. Experimentos em ambientes reais foram analisados, apresentando diversos resultados para o procedimento de rastreamento, bem como validando a utilização das GPU's para acelerar o processamento do sistema de visão computacional
Abstract: The computer vision system is useful in several applications of autonomous vehicles, such as map generation, obstacle avoidance tasks, positioning tasks and target tracking. Furthermore, computer vision can provide a significant gain in reliability, versatility and accuracy of robotic tasks, which are important concerns in most applications. The present work aims at the development of a visual servo control method in ground robotic vehicles to perform tracking and follow of a target. The procedure for tracking is based on the correspondence between the target region sequence of images, and persecution by the generation of motion based navigation of information from target region. Among the aspects that contribute to the solution of the proposed tracking procedure, we consider the use of imaging techniques such as KNN filter, Sobel filter, HMIN filter and Watershed transform that together provide the desired robustness for the solution. However, this is not a technique compatible with real-time system. Thus, these algorithms were modeled for parallel processing on graphics cards using CUDA. Experiments in real environments were analyzed showed different results for the procedure for tracking and validating the use of GPU's to accelerate the processing of computer vision system
Mestrado
Mecanica dos Sólidos e Projeto Mecanico
Mestre em Engenharia Mecânica

APA, Harvard, Vancouver, ISO, and other styles

20

PETRINI, ALESSANDRO. "HIGH PERFORMANCE COMPUTING MACHINE LEARNING METHODS FOR PRECISION MEDICINE." Doctoral thesis, Università degli Studi di Milano, 2021. http://hdl.handle.net/2434/817104.

Full text

Abstract:

La Medicina di Precisione (Precision Medicine) è un nuovo paradigma che sta rivoluzionando diversi aspetti delle pratiche cliniche: nella prevenzione e diagnosi, essa è caratterizzata da un approccio diverso dal "one size fits all" proprio della medicina classica. Lo scopo delle Medicina di Precisione è di trovare misure di prevenzione, diagnosi e cura che siano specifiche per ciascun individuo, a partire dalla sua storia personale, stile di vita e fattori genetici. Tre fattori hanno contribuito al rapido sviluppo della Medicina di Precisione: la possibilità di generare rapidamente ed economicamente una vasta quantità di dati omici, in particolare grazie alle nuove tecniche di sequenziamento (Next-Generation Sequencing); la possibilità di diffondere questa enorme quantità di dati grazie al paradigma "Big Data"; la possibilità di estrarre da questi dati tutta una serie di informazioni rilevanti grazie a tecniche di elaborazione innovative ed altamente sofisticate. In particolare, le tecniche di Machine Learning introdotte negli ultimi anni hanno rivoluzionato il modo di analizzare i dati: esse forniscono dei potenti strumenti per l'inferenza statistica e l'estrazione di informazioni rilevanti dai dati in maniera semi-automatica. Al contempo, però, molto spesso richiedono elevate risorse computazionali per poter funzionare efficacemente. Per questo motivo, e per l'elevata mole di dati da elaborare, è necessario sviluppare delle tecniche di Machine Learning orientate al Big Data che utilizzano espressamente tecniche di High Performance Computing, questo per poter sfruttare al meglio le risorse di calcolo disponibili e su diverse scale, dalle singole workstation fino ai super-computer. In questa tesi vengono presentate tre tecniche di Machine Learning sviluppate nel contesto del High Performance Computing e create per affrontare tre questioni fondamentali e ancora irrisolte nel campo della Medicina di Precisione, in particolare la Medicina Genomica: i) l'identificazione di varianti deleterie o patogeniche tra quelle neutrali nelle aree non codificanti del DNA; ii) l'individuazione della attività delle regioni regolatorie in diverse linee cellulari e tessuti; iii) la predizione automatica della funzione delle proteine nel contesto di reti biomolecolari. Per il primo problema è stato sviluppato parSMURF, un innovativo metodo basato su hyper-ensemble in grado di gestire l'elevato grado di sbilanciamento che caratterizza l'identificazione di varianti patogeniche e deleterie in mezzo al "mare" di varianti neutrali nelle aree non-coding del DNA. L'algoritmo è stato implementato per sfruttare appositamente le risorse di supercalcolo del CINECA (Marconi - KNL) e HPC Center Stuttgart (HLRS Apollo HAWK), ottenendo risultati allo stato dell'arte, sia per capacità predittiva, sia per scalabilità. Il secondo problema è stato affrontato tramite lo sviluppo di reti neurali "deep", in particolare Deep Feed Forward e Deep Convolutional Neural Networks per analizzare - rispettivamente - dati di natura epigenetica e sequenze di DNA, con lo scopo di individuare promoter ed enhancer attivi in linee cellulari e tessuti specifici. L'analisi è compiuta "genome-wide" e sono state usate tecniche di parallelizzazione su GPU. Infine, per il terzo problema è stato sviluppato un algoritmo di Machine Learning semi-supervisionato su grafo basato su reti di Hopfield per elaborare efficacemente grandi network biologici, utilizzando ancora tecniche di parallelizzazione su GPU; in particolare, una parte rilevante dell'algoritmo è data dall'introduzione di una tecnica parallela di colorazione del grafo che migliora il classico approccio greedy introdotto da Luby. Tra i futuri lavori e le attività in corso, viene presentato il progetto inerente all'estensione di parSMURF che è stato recentemente premiato dal consorzio Partnership for Advance in Computing in Europe (PRACE) allo scopo di sviluppare ulteriormente l'algoritmo e la sua implementazione, applicarlo a dataset di diversi ordini di grandezza più grandi e inserire i risultati in Genomiser, lo strumento attualmente allo stato dell'arte per l'individuazione di varianti genetiche Mendeliane. Questo progetto è inserito nel contesto di una collaborazione internazionale con i Jackson Lab for Genomic Medicine.
Precision Medicine is a new paradigm which is reshaping several aspects of clinical practice, representing a major departure from the "one size fits all" approach in diagnosis and prevention featured in classical medicine. Its main goal is to find personalized prevention measures and treatments, on the basis of the personal history, lifestyle and specific genetic factors of each individual. Three factors contributed to the rapid rise of Precision Medicine approaches: the ability to quickly and cheaply generate a vast amount of biological and omics data, mainly thanks to Next-Generation Sequencing; the ability to efficiently access this vast amount of data, under the Big Data paradigm; the ability to automatically extract relevant information from data, thanks to innovative and highly sophisticated data processing analytical techniques. Machine Learning in recent years revolutionized data analysis and predictive inference, influencing almost every field of research. Moreover, high-throughput bio-technologies posed additional challenges to effectively manage and process Big Data in Medicine, requiring novel specialized Machine Learning methods and High Performance Computing techniques well-tailored to process and extract knowledge from big bio-medical data. In this thesis we present three High Performance Computing Machine Learning techniques that have been designed and developed for tackling three fundamental and still open questions in the context of Precision and Genomic Medicine: i) identification of pathogenic and deleterious genomic variants among the "sea" of neutral variants in the non-coding regions of the DNA; ii) detection of the activity of regulatory regions across different cell lines and tissues; iii) automatic protein function prediction and drug repurposing in the context of biomolecular networks. For the first problem we developed parSMURF, a novel hyper-ensemble method able to deal with the huge data imbalance that characterizes the detection of pathogenic variants in the non-coding regulatory regions of the human genome. We implemented this approach with highly parallel computational techniques using supercomputing resources at CINECA (Marconi – KNL) and HPC Center Stuttgart (HLRS Apollo HAWK), obtaining state-of-the-art results. For the second problem we developed Deep Feed Forward and Deep Convolutional Neural Networks to respectively process epigenetic and DNA sequence data to detect active promoters and enhancers in specific tissues at genome-wide level using GPU devices to parallelize the computation. Finally we developed scalable semi-supervised graph-based Machine Learning algorithms based on parametrized Hopfield Networks to process in parallel using GPU devices large biological graphs, using a parallel coloring method that improves the classical Luby greedy algorithm. We also present ongoing extensions of parSMURF, very recently awarded by the Partnership for Advance in Computing in Europe (PRACE) consortium to further develop the algorithm, apply them to huge genomic data and embed its results into Genomiser, a state-of-the-art computational tool for the detection of pathogenic variants associated with Mendelian genetic diseases, in the context of an international collaboration with the Jackson Lab for Genomic Medicine.

APA, Harvard, Vancouver, ISO, and other styles

21

Lundgren, Jacob. "Pricing of American Options by Adaptive Tree Methods on GPUs." Thesis, Uppsala universitet, Avdelningen för beräkningsvetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-265257.

Full text

Abstract:

An assembled algorithm for pricing American options with absolute, discrete dividends using adaptive lattice methods is described. Considerations for hardware-conscious programming on both CPU and GPU platforms are discussed, to provide a foundation for the investigation of several approaches for deploying the program onto GPU architectures. The performance results of the approaches are compared to that of a central processing unit reference implementation, and to each other. In particular, an approach of designating subtrees to be calculated in parallel by allowing multiple calculation of overlapping elements is described. Among the examined methods, this attains the best performance results in a "realistic" region of calculation parameters. A fifteen- to thirty-fold improvement in performance over the CPU reference implementation is observed as the problem size grows sufficiently large.

APA, Harvard, Vancouver, ISO, and other styles

22

Prades, Gasulla Javier. "Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/168081.

Full text

Abstract:

[ES] En la última década la utilización de la GPGPU (General Purpose computing in Graphics Processing Units; Computación de Propósito General en Unidades de Procesamiento Gráfico) se ha vuelto tremendamente popular en los centros de datos de todo el mundo. Las GPUs (Graphics Processing Units; Unidades de Procesamiento Gráfico) se han establecido como elementos aceleradores de cómputo que son usados junto a las CPUs formando sistemas heterogéneos. La naturaleza masivamente paralela de las GPUs, destinadas tradicionalmente al cómputo de gráficos, permite realizar operaciones numéricas con matrices de datos a gran velocidad debido al gran número de núcleos que integran y al gran ancho de banda de acceso a memoria que poseen. En consecuencia, aplicaciones de todo tipo de campos, tales como química, física, ingeniería, inteligencia artificial, ciencia de materiales, etc. que presentan este tipo de patrones de cómputo se ven beneficiadas, reduciendo drásticamente su tiempo de ejecución. En general, el uso de la aceleración del cómputo en GPUs ha significado un paso adelante y una revolución. Sin embargo, no está exento de problemas, tales como problemas de eficiencia energética, baja utilización de las GPUs, altos costes de adquisición y mantenimiento, etc. En esta tesis pretendemos analizar las principales carencias que presentan estos sistemas heterogéneos y proponer soluciones basadas en el uso de la virtualización remota de GPUs. Para ello hemos utilizado la herramienta rCUDA, desarrollada en la Universitat Politècnica de València, ya que multitud de publicaciones la avalan como el framework de virtualización remota de GPUs más avanzado de la actualidad. Los resutados obtenidos en esta tesis muestran que el uso de rCUDA en entornos de Cloud Computing incrementa el grado de libertad del sistema, ya que permite crear instancias virtuales de las GPUs físicas totalmente a medida de las necesidades de cada una de las máquinas virtuales. En entornos HPC (High Performance Computing; Computación de Altas Prestaciones), rCUDA también proporciona un mayor grado de flexibilidad de uso de las GPUs de todo el clúster de cómputo, ya que permite desacoplar totalmente la parte CPU de la parte GPU de las aplicaciones. Además, las GPUs pueden estar en cualquier nodo del clúster, independientemente del nodo en el que se está ejecutando la parte CPU de la aplicación. En general, tanto para Cloud Computing como en el caso de HPC, este mayor grado de flexibilidad se traduce en un aumento hasta 2x de la productividad de todo el sistema al mismo tiempo que se reduce el consumo energético en un 15%. Finalmente, también hemos desarrollado un mecanismo de migración de trabajos de la parte GPU de las aplicaciones que ha sido integrado dentro del framework rCUDA. Este mecanismo de migración ha sido evaluado y los resultados muestran claramente que, a cambio de una pequeña sobrecarga, alrededor de 400 milisegundos, en el tiempo de ejecución de las aplicaciones, es una potente herramienta con la que, de nuevo, aumentar la productividad y reducir el gasto energético del sistema. En resumen, en esta tesis se analizan los principales problemas derivados del uso de las GPUs como aceleradores de cómputo, tanto en entornos HPC como de Cloud Computing, y se demuestra cómo a través del uso del framework rCUDA, estos problemas pueden solucionarse. Además se desarrolla un potente mecanismo de migración de trabajos GPU, que integrado dentro del framework rCUDA, se convierte en una herramienta clave para los futuros planificadores de trabajos en clusters heterogéneos.
[CAT] En l'última dècada la utilització de la GPGPU(General Purpose computing in Graphics Processing Units; Computació de Propòsit General en Unitats de Processament Gràfic) s'ha tornat extremadament popular en els centres de dades de tot el món. Les GPUs (Graphics Processing Units; Unitats de Processament Gràfic) s'han establert com a elements acceleradors de còmput que s'utilitzen al costat de les CPUs formant sistemes heterogenis. La naturalesa massivament paral·lela de les GPUs, destinades tradicionalment al còmput de gràfics, permet realitzar operacions numèriques amb matrius de dades a gran velocitat degut al gran nombre de nuclis que integren i al gran ample de banda d'accés a memòria que posseeixen. En conseqüència, les aplicacions de tot tipus de camps, com ara química, física, enginyeria, intel·ligència artificial, ciència de materials, etc. que presenten aquest tipus de patrons de còmput es veuen beneficiades reduint dràsticament el seu temps d'execució. En general, l'ús de l'acceleració del còmput en GPUs ha significat un pas endavant i una revolució, però no està exempt de problemes, com ara poden ser problemes d'eficiència energètica, baixa utilització de les GPUs, alts costos d'adquisició i manteniment, etc. En aquesta tesi pretenem analitzar les principals mancances que presenten aquests sistemes heterogenis i proposar solucions basades en l'ús de la virtualització remota de GPUs. Per a això hem utilitzat l'eina rCUDA, desenvolupada a la Universitat Politècnica de València, ja que multitud de publicacions l'avalen com el framework de virtualització remota de GPUs més avançat de l'actualitat. Els resultats obtinguts en aquesta tesi mostren que l'ús de rCUDA en entorns de Cloud Computing incrementa el grau de llibertat del sistema, ja que permet crear instàncies virtuals de les GPUs físiques totalment a mida de les necessitats de cadascuna de les màquines virtuals. En entorns HPC (High Performance Computing; Computació d'Altes Prestacions), rCUDA també proporciona un major grau de flexibilitat en l'ús de les GPUs de tot el clúster de còmput, ja que permet desacoblar totalment la part CPU de la part GPU de les aplicacions. A més, les GPUs poden estar en qualsevol node del clúster, sense importar el node en el qual s'està executant la part CPU de l'aplicació. En general, tant per a Cloud Computing com en el cas del HPC, aquest major grau de flexibilitat es tradueix en un augment fins 2x de la productivitat de tot el sistema al mateix temps que es redueix el consum energètic en aproximadament un 15%. Finalment, també hem desenvolupat un mecanisme de migració de treballs de la part GPU de les aplicacions que ha estat integrat dins del framework rCUDA. Aquest mecanisme de migració ha estat avaluat i els resultats mostren clarament que, a canvi d'una petita sobrecàrrega, al voltant de 400 mil·lisegons, en el temps d'execució de les aplicacions, és una potent eina amb la qual, de nou, augmentar la productivitat i reduir la despesa energètica de sistema. En resum, en aquesta tesi s'analitzen els principals problemes derivats de l'ús de les GPUs com acceleradors de còmput, tant en entorns HPC com de Cloud Computing, i es demostra com a través de l'ús del framework rCUDA, aquests problemes poden solucionar-se. A més es desenvolupa un potent mecanisme de migració de treballs GPU, que integrat dins del framework rCUDA, esdevé una eina clau per als futurs planificadors de treballs en clústers heterogenis.
[EN] In the last decade the use of GPGPU (General Purpose computing in Graphics Processing Units) has become extremely popular in data centers around the world. GPUs (Graphics Processing Units) have been established as computational accelerators that are used alongside CPUs to form heterogeneous systems. The massively parallel nature of GPUs, traditionally intended for graphics computing, allows to perform numerical operations with data arrays at high speed. This is achieved thanks to the large number of cores GPUs integrate and the large bandwidth of memory access. Consequently, applications of all kinds of fields, such as chemistry, physics, engineering, artificial intelligence, materials science, and so on, presenting this type of computational patterns are benefited by drastically reducing their execution time. In general, the use of computing acceleration provided by GPUs has meant a step forward and a revolution, but it is not without problems, such as energy efficiency problems, low utilization of GPUs, high acquisition and maintenance costs, etc. In this PhD thesis we aim to analyze the main shortcomings of these heterogeneous systems and propose solutions based on the use of remote GPU virtualization. To that end, we have used the rCUDA middleware, developed at Universitat Politècnica de València. Many publications support rCUDA as the most advanced remote GPU virtualization framework nowadays. The results obtained in this PhD thesis show that the use of rCUDA in Cloud Computing environments increases the degree of freedom of the system, as it allows to create virtual instances of the physical GPUs fully tailored to the needs of each of the virtual machines. In HPC (High Performance Computing) environments, rCUDA also provides a greater degree of flexibility in the use of GPUs throughout the computing cluster, as it allows the CPU part to be completely decoupled from the GPU part of the applications. In addition, GPUs can be on any node in the cluster, regardless of the node on which the CPU part of the application is running. In general, both for Cloud Computing and in the case of HPC, this greater degree of flexibility translates into an up to 2x increase in system-wide throughput while reducing energy consumption by approximately 15%. Finally, we have also developed a job migration mechanism for the GPU part of applications that has been integrated within the rCUDA middleware. This migration mechanism has been evaluated and the results clearly show that, in exchange for a small overhead of about 400 milliseconds in the execution time of the applications, it is a powerful tool with which, again, we can increase productivity and reduce energy foot print of the computing system. In summary, this PhD thesis analyzes the main problems arising from the use of GPUs as computing accelerators, both in HPC and Cloud Computing environments, and demonstrates how thanks to the use of the rCUDA middleware these problems can be addressed. In addition, a powerful GPU job migration mechanism is being developed, which, integrated within the rCUDA framework, becomes a key tool for future job schedulers in heterogeneous clusters.
This work jointly supported by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grants (20524/PDC/18, 20813/PI/18 and 20988/PI/18) and by the Spanish MEC and European Commission FEDER under grants TIN2015-66972-C5-3-R, TIN2016-78799-P and CTQ2017-87974-R (AEI/FEDER, UE). We also thank NVIDIA for hardware donation under GPU Educational Center 2014-2016 and Research Center 2015-2016. The authors thankfully acknowledge the computer resources at CTE-POWER and the technical support provided by Barcelona Supercomputing Center - Centro Nacional de Supercomputación (RES-BCV-2018-3-0008). Furthermore, researchers from Universitat Politècnica de València are supported by the Generalitat Valenciana under Grant PROMETEO/2017/077. Authors are also grateful for the generous support provided by Mellanox Technologies Inc. Prof. Pradipta Purkayastha, from Department of Chemical Sciences, Indian Institute of Science Education and Research (IISER) Kolkata, is acknowledged for kindly providing the initial ligand and DNA structures.
Prades Gasulla, J. (2021). Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/168081
TESIS

APA, Harvard, Vancouver, ISO, and other styles

23

Zheng, Mai. "Towards Manifesting Reliability Issues In Modern Computer Systems." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1436283400.

Full text

APA, Harvard, Vancouver, ISO, and other styles

24

Obrecht, Christian. "High performance lattice Boltzmann solvers on massively parallel architectures with applications to building aeraulics." Phd thesis, INSA de Lyon, 2012. http://tel.archives-ouvertes.fr/tel-00776986.

Full text

Abstract:

With the advent of low-energy buildings, the need for accurate building performance simulations has significantly increased. However, for the time being, the thermo-aeraulic effects are often taken into account through simplified or even empirical models, which fail to provide the expected accuracy. Resorting to computational fluid dynamics seems therefore unavoidable, but the required computational effort is in general prohibitive. The joint use of innovative approaches such as the lattice Boltzmann method (LBM) and massively parallel computing devices such as graphics processing units (GPUs) could help to overcome these limits. The present research work is devoted to explore the potential of such a strategy. The lattice Boltzmann method, which is based on a discretised version of the Boltzmann equation, is an explicit approach offering numerous attractive features: accuracy, stability, ability to handle complex geometries, etc. It is therefore an interesting alternative to the direct solving of the Navier-Stokes equations using classic numerical analysis. From an algorithmic standpoint, the LBM is well-suited for parallel implementations. The use of graphics processors to perform general purpose computations is increasingly widespread in high performance computing. These massively parallel circuits provide up to now unrivalled performance at a rather moderate cost. Yet, due to numerous hardware induced constraints, GPU programming is quite complex and the possible benefits in performance depend strongly on the algorithmic nature of the targeted application. For LBM, GPU implementations currently provide performance two orders of magnitude higher than a weakly optimised sequential CPU implementation. The present thesis consists of a collection of nine articles published in international journals and proceedings of international conferences (the last one being under review). These contributions address the issues related to single-GPU implementations of the LBM and the optimisation of memory accesses, as well as multi-GPU implementations and the modelling of inter-GPU and internode communication. In addition, we outline several extensions to the LBM, which appear essential to perform actual building thermo-aeraulic simulations. The test cases we used to validate our codes account for the strong potential of GPU LBM solvers in practice.

APA, Harvard, Vancouver, ISO, and other styles

25

Kerr, Andrew. "A model of dynamic compilation for heterogeneous compute platforms." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/47719.

Full text

Abstract:

Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include: - characterization of data-parallel workloads - machine-independent application metrics - framework for performance modeling and prediction - execution model translation for vector processors - region-based compilation and scheduling We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables the execution of GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling which attempts to maximize control locality.

APA, Harvard, Vancouver, ISO, and other styles

26

Passerat-Palmbach, Jonathan. "Contributions to parallel stochastic simulation : application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte Carlo simulations." Phd thesis, Université Blaise Pascal - Clermont-Ferrand II, 2013. http://tel.archives-ouvertes.fr/tel-00858735.

Full text

Abstract:

The race to computing power increases every day in the simulation community. A few years ago, scientists have started to harness the computing power of Graphics Processing Units (GPUs) to parallelize their simulations. As with any parallel architecture, not only the simulation model implementation has to be ported to the new parallel platform, but all the tools must be reimplemented as well. In the particular case of stochastic simulations, one of the major element of the implementation is the pseudorandom numbers source. Employing pseudorandom numbers in parallel applications is not a straightforward task, and it has to be done with caution in order not to introduce biases in the results of the simulation. This problematic has been studied since parallel architectures are available and is called pseudorandom stream distribution. While the literature is full of solutions to handle pseudorandom stream distribution on CPU-based parallel platforms, the young GPU programming community cannot display the same experience yet. In this thesis, we study how to correctly distribute pseudorandom streams on GPU. From the existing solutions, we identified a need for good software engineering solutions, coupled to sound theoretical choices in the implementation. We propose a set of guidelines to follow when a PRNG has to be ported to GPU, and put these advice into practice in a software library called ShoveRand. This library is used in a stochastic Polymer Folding model that we have implemented in C++/CUDA. Pseudorandom streams distribution on manycore architectures is also one of our concerns. It resulted in a contribution named TaskLocalRandom, which targets parallel Java applications using pseudorandom numbers and task frameworks. Eventually, we share a reflection on the methods to choose the right parallel platform for a given application. In this way, we propose to automatically build prototypes of the parallel application running on a wide set of architectures. This approach relies on existing software engineering tools from the Java and Scala community, most of them generating OpenCL source code from a high-level abstraction layer.

APA, Harvard, Vancouver, ISO, and other styles

27

Teng, Sin Yong. "Intelligent Energy-Savings and Process Improvement Strategies in Energy-Intensive Industries." Doctoral thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2020. http://www.nusl.cz/ntk/nusl-433427.

Full text

Abstract:

S tím, jak se neustále vyvíjejí nové technologie pro energeticky náročná průmyslová odvětví, stávající zařízení postupně zaostávají v efektivitě a produktivitě. Tvrdá konkurence na trhu a legislativa v oblasti životního prostředí nutí tato tradiční zařízení k ukončení provozu a k odstavení. Zlepšování procesu a projekty modernizace jsou zásadní v udržování provozních výkonů těchto zařízení. Současné přístupy pro zlepšování procesů jsou hlavně: integrace procesů, optimalizace procesů a intenzifikace procesů. Obecně se v těchto oblastech využívá matematické optimalizace, zkušeností řešitele a provozní heuristiky. Tyto přístupy slouží jako základ pro zlepšování procesů. Avšak, jejich výkon lze dále zlepšit pomocí moderní výpočtové inteligence. Účelem této práce je tudíž aplikace pokročilých technik umělé inteligence a strojového učení za účelem zlepšování procesů v energeticky náročných průmyslových procesech. V této práci je využit přístup, který řeší tento problém simulací průmyslových systémů a přispívá následujícím: (i)Aplikace techniky strojového učení, která zahrnuje jednorázové učení a neuro-evoluci pro modelování a optimalizaci jednotlivých jednotek na základě dat. (ii) Aplikace redukce dimenze (např. Analýza hlavních komponent, autoendkodér) pro vícekriteriální optimalizaci procesu s více jednotkami. (iii) Návrh nového nástroje pro analýzu problematických částí systému za účelem jejich odstranění (bottleneck tree analysis – BOTA). Bylo také navrženo rozšíření nástroje, které umožňuje řešit vícerozměrné problémy pomocí přístupu založeného na datech. (iv) Prokázání účinnosti simulací Monte-Carlo, neuronové sítě a rozhodovacích stromů pro rozhodování při integraci nové technologie procesu do stávajících procesů. (v) Porovnání techniky HTM (Hierarchical Temporal Memory) a duální optimalizace s několika prediktivními nástroji pro podporu managementu provozu v reálném čase. (vi) Implementace umělé neuronové sítě v rámci rozhraní pro konvenční procesní graf (P-graf). (vii) Zdůraznění budoucnosti umělé inteligence a procesního inženýrství v biosystémech prostřednictvím komerčně založeného paradigmatu multi-omics.

APA, Harvard, Vancouver, ISO, and other styles

28

Busato, Federico. "High-Performance and Power-Aware Graph Processing on GPUs." Doctoral thesis, 2018. http://hdl.handle.net/11562/979445.

Full text

Abstract:

Graphs are a common representation in many problem domains, including engineering, finance, medicine, and scientific applications. Different problems map to very large graphs, often involving millions of vertices. Even though very efficient sequential implementations of graph algorithms exist, they become impractical when applied on such actual very large graphs. On the other hand, graphics processing units (GPUs) have become widespread architectures as they provide massive parallelism at low cost. Parallel execution on GPUs may achieve speedup up to three orders of magnitude with respect to the sequential counterparts. Nevertheless, accelerating efficient and optimized sequential algorithms and porting (i.e., parallelizing) their implementation to such many-core architectures is a very challenging task. The task is made even harder since energy and power consumption are becoming constraints in addition, or in same case as an alternative, to performance. This work aims at developing a platform that provides (I) a library of parallel, efficient, and tunable implementations of the most important graph algorithms for GPUs, and (II) an advanced profiling model to analyze both performance and power consumption of the algorithm implementations. The platform goal is twofold. Through the library, it aims at saving developing effort in the parallelization task through a primitive-based approach. Through the profiling framework, it aims at customizing such primitives by considering both the architectural details and the target efficiency metrics (i.e., performance or power).

APA, Harvard, Vancouver, ISO, and other styles

29

Mishra, Ashirbad. "Efficient betweenness Centrality Computations on Hybrid CPU-GPU Systems." Thesis, 2016. http://hdl.handle.net/2005/2718.

Full text

Abstract:

Analysis of networks is quite interesting, because they can be interpreted for several purposes. Various features require different metrics to measure and interpret them. Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Between’s Centrality (BC) is one such metric that plays a key role in many real world applications. BC is an important graph analytics application for large-scale graphs. However it is one of the most computationally intensive kernels to execute, and measuring centrality in billion-scale graphs is quite challenging. While there are several existing e orts towards parallelizing BC algorithms on multi-core CPUs and many-core GPUs, in this work, we propose a novel ne-grained CPU-GPU hybrid algorithm that partitions a graph into two partitions, one each for CPU and GPU. Our method performs BC computations for the graph on both the CPU and GPU resources simultaneously, resulting in a very small number of CPU-GPU synchronizations, hence taking less time for communications. The BC algorithm consists of two phases, the forward phase and the backward phase. In the forward phase, we initially and the paths that are needed by either partitions, after which each partition is executed on each processor in an asynchronous manner. We initially compute border matrices for each partition which stores the relative distances between each pair of border vertex in a partition. The matrices are used in the forward phase calculations of all the sources. In this way, our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We present proof of correctness and the bounds for the number of iterations for each source. We also perform a novel hybrid and asynchronous backward phase, in which each partition communicates with the other only when there is a path that crosses the partition, hence it performs minimal CPU-GPU synchronizations. We use a variety of implementations for our work, like node-based and edge based parallelism, which includes data-driven and topology based techniques. In the implementation we show that our method also works using variable partitioning technique. The technique partitions the graph into unequal parts accounting for the processing power of each processor. Our implementations achieve almost equal percentage of utilization on both the processors due to the technique. For large scale graphs, the size of the border matrix also becomes large, hence to accommodate the matrix we present various techniques. The techniques use the properties inherent in the shortest path problem for reduction. We mention the drawbacks of performing shortest path computations on a large scale and also provide various solutions to it. Evaluations using a large number of graphs with different characteristics show that our hybrid approach without variable partitioning and border matrix reduction gives 67% improvement in performance, and 64-98.5% less CPU-GPU communications than the state of art hybrid algorithm based on the popular Bulk Synchronous Paradigm (BSP) approach implemented in TOTEM. This shows our algorithm's strength which reduces the need for larger synchronizations. Implementing variable partitioning, border matrix reduction and backward phase optimizations on our hybrid algorithm provides up to 10x speedup. We compare our optimized implementation, with CPU and GPU standalone codes based on our forward phase and backward phase kernels, and show around 2-8x speedup over the CPU-only code and can accommodate large graphs that cannot be accommodated in the GPU-only code. We also show that our method`s performance is competitive to the state of art multi-core CPU and performs 40-52% better than GPU implementations, on large graphs. We show the drawbacks of CPU and GPU only implementations and try to motivate the reader about the challenges that graph algorithms face in large scale computing, suggesting that a hybrid or distributed way of approaching the problem is a better way of overcoming the hurdles.

APA, Harvard, Vancouver, ISO, and other styles

30

Lain, Jiang-Siang, and 連江祥. "High-performance Cholesky Factorization using the GPU and CPU parallel processing for band matrix." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/36458624994930511368.

Full text

Abstract:

碩士
國立交通大學
土木工程學系
100
The required memory storage and processing time will be increased and elongated when solver linear system in larger matrices. Hence, the application of parallel computing technology on solving of linear system has received considerable interest in the last decade. Most of the parallel computing technologies of the previous studies have focused on iterative algorithm on the distributed parallel computing platforms。 However, the performance of iterative algorithms can realize only for matrices with larger-scaled linear system on super computers. The aim of this study focuses on developing more complicated direct parallel algorithm, on the multi-core CPU (Multi-core) and GPU parallel computing platforms. There are three stages in this study. First, the direct linear system solving algorithms are parallelized and implemented on the multi-core platform. The computing time and precision of solution were investigated and compared to conclude the performance of these different algorithms. Following, the blocked-Cholesky algorithm was utilized and optimized to develop a novel parallel algorithm. Finally, the optimized novel blocked-Cholesky algorithm was implemented on multi-core CPU and GPU parallel computing platforms. The computing results revealed that a 2.3 speed-up achieved fir band-matrices of bandwidth greater than 100 on a four-core platform as compared with performance on a single-core platform. Moreover, the computing performance accomplished 3.3 when the bandwidth of matrices greater than the1000. Notable, a ten-time performance can be reached when the novel algorithm was implemented on a platform of GPU with CUDA technology. The results also revealed that the more the bandwidth of matrices, the higher the achieved performance for computing on GPU platforms.

APA, Harvard, Vancouver, ISO, and other styles

31

Azevedo, José Maria Pantoja Mata Vale e. "Image Stream Similarity Search in GPU Clusters." Master's thesis, 2018. http://hdl.handle.net/10362/58447.

Full text

Abstract:

Images are an important part of today’s society. They are everywhere on the internet and computing, from news articles to diverse areas such as medicine, autonomous vehicles and social media. This enormous amount of images requires massive amounts of processing power to process, upload, download and search for images. The ability to search an image, and find similar images in a library of millions of others empowers users with great advantages. Different fields have different constraints, but all benefit from the quick processing that can be achieved. Problems arise when creating a solution for this. The similarity calculation between several images, performing thousands of comparisons every second, is a challenge. The results of such computations are very large, and pose a challenge when attempting to process. Solutions for these problems often take advantage of graphs in order to index images and their similarity. The graph can then be used for the querying process. Creating and processing such a graph in an acceptable time frame poses yet another challenge. In order to tackle these challenges, we take advantage of a cluster of machines equipped with Graphics Processing Units (GPUs), enabling us to parallelize the process of describing an image visually and finding other images similar to it in an acceptable time frame. GPUs are incredibly efficient at processing data such as images and graphs, through algorithms that are heavily parallelizable. We propose a scalable and modular system that takes advantage of GPUs, distributed computing and fine-grained parallellism to detect image features, index images in a graph and allow users to search for similar images. The solution we propose is able to compare up to 5000 images every second. It is also able to query a graph with thousands of nodes and millions of edges in a matter of milliseconds, achieving a very efficient query speed. The modularity of our solution allows the interchangeability of algorithms and different steps in the solution, which provides great adaptability to any needs.

APA, Harvard, Vancouver, ISO, and other styles

32

CARBONE, Giancarlo. "HPC techniques for large scale data analysis." Doctoral thesis, 2015. http://hdl.handle.net/11573/864567.

Full text

Abstract:

In the present work we apply High-Performance Computing techniques to two Big Data problems. The frst one deals with the analysis of large graphs by using a parallel distributed architecture, whereas the second one consists in the design and implementation of a scalable solution for fast indexing and searching of large datasets of heterogeneous documents.

APA, Harvard, Vancouver, ISO, and other styles

33

CHANG, CHIA-LONG, and 張嘉龍. "The Research on the Benefits of Leading Cloud Computing on Cross Virtual and Real Graph Platform – A "GPU Nomogram Developed a Common Farm Cluster" of National Center for High-Performance Computing as a Research Object." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/u9qf2h.

Full text

Abstract:

碩士
國立臺北教育大學
數位科技設計學系(含玩具與遊戲設計碩士班)
107
National Network Center recently built the completion of the "GPU map farm R & amp; d and sharing cluster", the main users from the domestic animation, special effects, television, film, advertising, communication related school departments and industries, so the use of high efficiency, stability and availability and other needs, in order to complete the calculation project in a limited time. Previous studies on cloud computing performance have focused on building cloud-based virtual systems from the user's point of view for performance evaluation, while few studies have been conducted on the performance of cloud computing systems of cloud computing providers themselves. Therefore, this study focuses on the newly built "GPU Graphics Farm R&D Common Cluster" to study the benefits of cloud computing. The results of this study are as follows: 1. Calculation node, server Management node, GPU Acceleration section, GPU Remote logon node, etc. in the remote graphical interface, remote hardware self-diagnosis, remote power (boot and shutdown and Reset), Virtual Media, remote control keyboard and mouse functional testing, are through the results. 2.SPEC Int test Score, in addition to the Server Management node, graph calculation node, GPU acceleration section, GPU Remote login node and other scores are more than 1500 points, fractional performance is quite ideal. 3. This graph system in the Remote management function test report, you can see the calculation node, server Management node, GPU Acceleration section, GPU Remote logon node and other functional tests, are the results of the adoption. The SPEC Int test score is more than 1500 points in the calculation node, GPU acceleration section, GPU Remote login node, etc., and the score performance is quite ideal. Finally, using the graph farm platform to verify the GPU performance in 4K and 1080p two picture quality and use the solid nVida Tesla V100 and the virtual platform GPU measurement to import the cloud platform without the doubt that the calculation efficiency is reduced. 4.The benefits of a cross virtual and real platform, that is, a virtualized cloud platform, while calculating the performance of the graph, do not differ too much by using cloud computing but can have the benefits of cloud computing, so it is helpful to cross virtual platforms.

APA, Harvard, Vancouver, ISO, and other styles

34

Ramashekar, Thejas. "Automatic Data Allocation, Buffer Management And Data Movement For Multi-GPU Machines." Thesis, 2013. http://etd.iisc.ernet.in/handle/2005/2627.

Full text

Abstract:

Multi-GPU machines are being increasingly used in high performance computing. These machines are being used both as standalone work stations to run computations on medium to large data sizes (tens of gigabytes) and as a node in a CPU-Multi GPU cluster handling very large data sizes (hundreds of gigabytes to a few terabytes). Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and managed at a on each GPU. A significant body of scientific applications that utilize multi-GPU machines contain computations inside affine loop nests, i.e., loop nests that have affine bounds and affine array access functions. These include stencils, linear-algebra kernels, dynamic programming codes and data-mining applications. Data allocation, buffer management, and coherency handling are critical steps that need to be performed to run affine applications on multi-GPU machines. Existing works that propose to automate these steps have limitations and in efficiencies in terms of allocation sizes, exploiting reuse, transfer costs and scalability. An automatic multi-GPU memory manager that can overcome these limitations and enable applications to achieve salable performance is highly desired. One technique that has been used in certain memory management contexts in the literature is that of bounding boxes. The bounding box of an array, for a given tile, is the smallest hyper-rectangle that encapsulates all the array elements accessed by that tile. In this thesis, we exploit the potential of bounding boxes for memory management far beyond their current usage in the literature. In this thesis, we propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding Box based Memory Manager (BBMM). BBMM is a compiler-assisted runtime memory manager. At compile time, it use static analysis techniques to identify a set of bounding boxes accessed by a computation tile. At run time, it uses the bounding box set operations such as union, intersection, difference, finding subset and superset relation to compute a set of disjoint bounding boxes from the set of bounding boxes identified at compile time. It also exploits the architectural capability provided by GPUs to perform fast transfers of rectangular (strided) regions of memory and hence performs all data transfers in terms of bounding boxes. BBMM uses these techniques to automatically allocate, and manage data required by applications (suitably tiled and parallelized for GPUs). This allows It to (1) allocate only as much data (or close to) as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence, maximize data reuse across tiles and minimize the data transfer overhead, (3) and as a result, enable applications to maximize the utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a system with four GPUs with various scientific programs showed that BBMM is able to reduce data allocations on each GPU by up to 75% compared to current allocation schemes, yield at least 88% of the performance of hand-optimized Open CL codes and allows excellent weak scaling.

APA, Harvard, Vancouver, ISO, and other styles

35

Patel, Parita. "Compilation of Graph Algorithms for Hybrid, Cross-Platform and Distributed Architectures." Thesis, 2017. http://etd.iisc.ernet.in/2005/3803.

Full text

Abstract:

1. Main Contributions made by the supplicant: This thesis proposes an Open Computing Language (OpenCL) framework to address the challenges of implementation of graph algorithms on parallel architectures and large scale graph processing. The proposed framework uses the front-end of the existing Falcon DSL compiler, andso, programmers enjoy conventional, imperative and shared memory programming style. The back-end of the framework generates implementations of graph algorithms in OpenCL to target single device architectures. The generated OpenCL code is portable across various platforms, e.g., CPU and GPU, and also vendors, e.g., NVIDIA, Intel and AMD. The framework automatically generates code for thread management and memory management for the devices. It hides all the lower level programming details from the programmers. A few optimizations are applied to reduce the execution time. The large graph processing challenge is tackled through graph partitioning over multiple devices of a single node and multiple nodes of a distributed cluster. The programmer codes a graph algorithm in Falcon assuming that the graph fits into single machine memory and the framework handles graph partitioning without any intervention by the programmer. The framework analyses the Abstract Syntax Tree (AST) generated by Falcon to find all the necessary information about communication and synchronization. It automatically generates code for message passing to hide the complexity of programming in a distributed environment. The framework also applies a set of optimizations to minimize the communication latency. The thesis reports results of several experiments conducted on widely used graph algorithms: single source shortest path, pagerank and minimum spanning tree to name a few. Experimental evaluations show that the reported results are comparable to the state-of-art non-portable graph DSLs and frameworks on a single node. Experiments in a distributed environment to show the scalability and efficiency of the framework are also described. 2. Summary of the Referees' Written Comments: Extracts from the referees' reports are provided below. A copy of the written replies to the clarifications sought by the external examiner is appended to this report. Referee 1: This thesis extends the Falcon framework with OpenCL for parallel graph processing on multi-device and multi-node architectures. The thesis makes important contributions. Processing large graphs in short time is very important, and making use of multiple nodes and devices is perhaps the only way to achieve this. Towards this, the thesis makes good contributions for easy programming, compiler transformations and efficient runtime systems. One of the commendable aspects of the thesis that it demonstrates with graphs that cannot be accommodated In the memory of a single device. The thesis is generally written well. The related work coverage is very good. The magnitude of thesis excellent for a Masters work. The experimental setup is very comprehensive with good set of graphs, good experimental comparisons with state-of-art works and good platforms. Particularly. the demonstration with a GPU cluster with multiple GPU nodes (Chapter 5) is excellent. The attempt to demonstrate scalability with 2, 4 and 8 nodes is also noteworthy. However, the contributions on optimizations are weak. Most of the optimizations and compiler transformations are straight-forward. There should be summary observations on the results in Chapter 3, especially given that the results are mixed and don't quite clearly convey the clear advantages of their work. The same is the case with multi-device results in chapter 4, where the results are once again mixed. Similarly, the speedups and scalability achieved with multiple nodes are not great. The problem size justification in the multi-node results is not clear. (Referee 1 also indicates a couple of minor changes to the thesis). Referee 2: The thesis uses the OpenCL framework to address the problem of programming graph algorithms on distributed systems. The use of OpenCL ensures that the generated code is platform-agnoistic and vendor-agnoistic. Sufficient experimentation with large scale graphs and reasonable size clusters have been conducted to demonstrate the scalability and portability of the code generated by the framework. The automatically generated code is almost as efficient as manually written code. The thesis is well written and is of high quality. The related work section is well organized and displays a good knowledge of the subject matter under consideration. The author has made important contributions to a good publication as well. 3. An Account of the Open Oral Examination: The oral examination of Ms. Parita Patel took place during 10 AM and 11AM on 27th November 2017, in the Seminar Hall of the Department of Computer Science and Automation. The members of the Oral Examination Board present were, Prof. Sathish Vadhiyar, external examiner and Prof. Y. N. Srikant, research supervisor. The candidate presented the work in an open defense seminar highlighting the problem domain, the methodology used, the investigations carried out by her, and the resulting contributions documented in the thesis before an audience consisting of the examiners, some faculty members, and students. Some of the questions posed by the examiners and the members of the audience during the oral examination are listed below. 1. How much is the overlap between Falcon work and this thesis? Response: We have used the Falcon front end in our work. Further, the existing Falcon compiler was useful to us to test our own implementation of algorithms in Falcon. 2. Why are speedup and scalability not very high with multiple nodes? Response: For the multi-node architecture, we were not able to achieve linear scalability because, with the increase in number of nodes, communication cost increases significantly. Unless the computation cost in the nodes is significant and is much more than the communication cost, this is bound to happen. 3. Do you have plans of making the code available for use by the community? Response: The code includes some part of Falcon implementation (front-end parsing/grammar) also. After discussion with the author of Falcon, the code can be made available to the community. 4. How can a graph that does not fit into a single device fit into a single node in the case of multiple nodes? Response: Single node machine used in the experiments of “multi-device architecture” contains multiple devices while each node used in experiments of “multi-node architecture” contains only a single device. So, the graph which does not fit into single-node-single-device memory can fit into single-node-multi-device after partitioning. 5. Is there a way to permit morph algorithms to be coded in your framework? Response: Currently, our framework does not translate morph algorithms. Supporting morph algorithms will require some kind of runtime system to manage memory on GPU since morph algorithms add and remove the vertices and edges to the graph dynamically. This can be further explored in future work. 6. Is it possible to accommodate FPGA devices in your framework? Response: Yes, we can support FPGA devices (or any other device that is compatible for OpenCL) just by specifying the device type in the command line argument. We did not work with other devices because CPU and GPU are generally used to process graph algorithms. The candidate provided satisfactory answers to all the questions posed and the clarifications sought by the audience and the examiners during the presentation. The candidate's overall performance during the open defense and the oral examination was very satisfactory to the oral examination board. 4. Certificate of Corrections and Changes: All the necessary corrections and changes suggested by the examiners have been made in the thesis and these have been verified by the members of the oral examination board. The thesis has been recommended for acceptance in its revised form. 5. Final Recommendation: In view of the recommendations of the referees and the satisfactory performance of the candidate in the oral examination, the oral examination board recommends that the thesis of Ms. ParitaPatel be accepted for the award of the M.Sc(Engg.) Degree of the Institute. Response to the comments by the external examiner on the M.Sc(Engg.) thesis “Compilation of Graph Algorithms for Hybrid, Cross-Platform, and Distributed Architectures” by Parita Patel 1. Comment: The contributions on optimizations are weak. Response: The novelty of this thesis is to make the Falcon platform agnostic, and additionally process large scale graphs on multi-devices of a single node and multi-node clusters seamlessly. Our framework performs similar to the existing frameworks, but at the same time, it targets several types of architectures which are not possible in the existing works. Advanced optimizations are beyond the scope of this thesis. 2. Comment: The translation of Falcon to OpenCL is simple. While the translation of Falcon to OpenCL was not hard, figuring out the details of the translation for multi-device and multi-node architectures was not simple. For example, design of implementations for collection, set, global variables, concurrency, etc., were non-trivial. These designs have already been explained in the appropriate places in the thesis. Further, such large software introduced its own intricacies during development. 3. Comment: Lines between Falcon work and this work are not clear. Response: Appendix-A shows the falcon implementation of all the algorithms which we used to run the experiments. We compiled these falcon implementations through our framework and subsequently ran the generated code on different types of target architectures and compared the results with other framework's generated code. These falcon programs were written by us. We have also used the front-end of the Falcon compiler and this has already been stated in the thesis (page 16). 4. Comment: There should be a summary of observations in chapter 3. Response: Summary of observations have been added to chapter 3 (pages 35-36), chapter 4 (page 46), and chapter 5 (page 51) of the thesis. 5. Comment: Speedup and scalability achieved with multiple nodes are not great. Response: For the multi-node architecture, we were not able to achieve linear scalability because, with the increase in number of nodes, communication cost increases significantly. Unless the computation cost in the nodes is significant and is much more than the communication cost, this is bound to happen. 6. Comment: It will be good to separate the related work coverage into a separate chapter. Response: The related work is coherent with the flow in chapter 1. It consists of just 4.5 pages and separating it into a separate chapter would make both (rest of) chapter 1 and the new chapter very small. Therefore, we do not recommend it. 7. Comment: The code should be made available for use by the community. Response: The code includes some part of Falcon code (front-end parsing/grammar) also. After discussion with the author of Falcon, the code can be made available to the community. 8. Comment: Page 28: Shouldn’t the else part be inside the kernel? Response: There was some missing text and a few minor changes in Figure 3.14 (page 28) which have been incorporated in the corrected thesis. 9. Comment: Figure 4.1 needs to be explained better. Response: Explanation for Figure 4.1 (pages 38-39) has been added to the thesis. 10. Comment: The problem size justification in the multi-node results is not clear. Response: Single node machine used in the experiments of “multi-device architecture” contains multiple devices while each node used in experiments of “multi-node architecture” contains only a single device. So, the graph which does not fit into single-node-single-device memory can fit into single-node-multi-device after partitioning. Name of the Candidate: Parita Patel (S.R. No. 04-04-00-10-21-14-1-11610) Degree Registered: M.Sc(Engg.) Department: Computer Science & Automation Title of the Thesis: Compilation of Graph Algorithms for Hybrid, Cross-Platform and Graph algorithms are abundantly used in various disciplines. These algorithms perform poorly due to random memory access and negligible spatial locality. In order to improve performance, parallelism exhibited by these algorithms can be exploited by leveraging modern high performance parallel computing resources. Implementing graph algorithms for these parallel architectures requires manual thread management and memory management which becomes tedious for a programmer. Large scale graphs cannot fit into the memory of a single machine. One solution is to partition the graph either on multiple devices of a single node or on multiple nodes of a distributed network. All the available frameworks for such architectures demand unconventional programming which is difficult and error prone. To address these challenges, we propose a framework for compilation of graph algorithms written in an intuitive graph domain-specific language, Falcon. The framework targets shared memory parallel architectures, computational accelerators and distributed architectures (CPU and GPU cluster). First, it analyses the abstract syntax tree (generated by Falcon) and gathers essential information. Subsequently, it generates optimized code in OpenCL for shared-memory parallel architectures and computational accelerators, and OpenCL coupled with MPI code for distributed architectures. Motivation behind generating OpenCL code is its platform-agnostic and vendor-agnostic behavior, i.e., it is portable to all kinds of devices. Our framework makes memory management, thread management, message passing, etc., transparent to the user. None of the available domain-specific languages, frameworks or parallel libraries handle portable implementations of graph algorithms. Experimental evaluations demonstrate that the generated code performs comparably to the state-of-the-art non-portable implementations and hand-tuned implementations. The results also show portability and scalability of our framework.

APA, Harvard, Vancouver, ISO, and other styles

36

Kriske, Jeffery Edward Jr. "A scalable approach to processing adaptive optics optical coherence tomography data from multiple sensors using multiple graphics processing units." Thesis, 2014. http://hdl.handle.net/1805/6458.

Full text

Abstract:

Indiana University-Purdue University Indianapolis (IUPUI)
Adaptive optics-optical coherence tomography (AO-OCT) is a non-invasive method of imaging the human retina in vivo. It can be used to visualize microscopic structures, making it incredibly useful for the early detection and diagnosis of retinal disease. The research group at Indiana University has a novel multi-camera AO-OCT system capable of 1 MHz acquisition rates. Until this point, a method has not existed to process data from such a novel system quickly and accurately enough on a CPU, a GPU, or one that can scale to multiple GPUs automatically in an efficient manner. This is a barrier to using a MHz AO-OCT system in a clinical environment. A novel approach to processing AO-OCT data from the unique multi-camera optics system is tested on multiple graphics processing units (GPUs) in parallel with one, two, and four camera combinations. The design and results demonstrate a scalable, reusable, extensible method of computing AO-OCT output. This approach can either achieve real time results with an AO-OCT system capable of 1 MHz acquisition rates or be scaled to a higher accuracy mode with a fast Fourier transform of 16,384 complex values.

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'High-performance, graph processing, GPU'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles