Dissertations / Theses on the topic 'High performance computing'


Consult the top 50 dissertations / theses for your research on the topic 'High performance computing.'



1

KHAN, OMAR USMAN. "High Performance Computing using GPGPU's." Doctoral thesis, Politecnico di Torino, 2013. http://hdl.handle.net/11583/2506369.

Full text
Abstract:
Computer-based simulation software with a basis in numerical methods plays a major role in research in the area of natural and physical sciences. These tools allow scientists to attempt problems that are too large to solve using analytical methods. But even these tools can fail to give solutions due to computational or storage limits. However, as the performance of computer hardware gets better and better, the computational limits can also be addressed. One such area of work is that of magnetic field modelling, which plays a crucial role in various fields of research, especially those related to nanotechnology. Due to remarkable advancements made in this field, magnetic modelling has attracted newfound interest, and rightly so. The most significant impact of this interest is perhaps felt in increasing areal densities for data storage devices, which are projected to reach almost atomic scales. Computational limits, and subsequently their solutions based on hardware delivering high performance, are therefore a key component of research in this field. The scale of length and time plays a crucial role in observing magnetic phenomena, and as these scales are reduced, new behaviours can be observed. Coarser scales may be beneficial when modelling larger systems, but when working with sub-μm scales, a finer scale has to be selected. Doing so will capture the proper magnetic behaviour of the materials, but will come with its share of problems, which are addressed in this thesis. Simulations are usually configured before being started. The configuration is performed using scripting-based methods which need to reflect the proper environmental conditions. For example, simulating multiple bodies with varying orientations, non-uniform geometries, bodies consisting of multiple layers with each layer having different properties, etc. will all need different configuration methods. A performance-based solution would need to be optimized for each type of simulation. This may require re-structuring of different components of a simulator. This thesis is devoted to addressing the problems listed above with a focus on performance-based solutions. The scope of the work has been limited to magnetostatic field calculations in particular because they consume the most time in the overall simulation. The scope has also been confined to regular structured rectangular meshes, which are popular in major micromagnetic simulation software. Using regular meshes, magnetostatic field calculations can exploit a performance boost by using Fast Fourier Transforms. Therefore, fast FFT libraries using open standards will also be addressed in this thesis. In particular, this thesis will be based on the development process of open standards for magnetic field modelling. The major contributions in this regard include an OpenCL-specific FFT library for GPUs and a GPU-based magnetostatic field solver which is used as an extension to the OOMMF simulator. The thesis covers some novel numerical techniques that have been developed to target particular simulation configurations to obtain maximum performance.
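The FFT shortcut mentioned in this abstract is easy to illustrate: on a regular mesh the magnetostatic field is a discrete convolution of the magnetization with a precomputed kernel, which Fourier transforms turn into a pointwise product. The NumPy sketch below shows the idea in one dimension with an invented kernel; it only illustrates the principle and is not the thesis's OpenCL FFT library or the OOMMF extension.

```python
# Illustration only: FFT-accelerated convolution of a magnetization array
# with a precomputed demag-like kernel (O(N log N) instead of O(N^2)).
# The 1/r^3-style kernel below is an assumption for demonstration purposes.
import numpy as np

def field_via_fft(m, kernel):
    """Circular convolution of magnetization m with kernel via real FFTs."""
    return np.fft.irfft(np.fft.rfft(m) * np.fft.rfft(kernel), n=m.size)

n = 1024
m = np.random.rand(n)                        # toy magnetization component
r = np.minimum(np.arange(n), n - np.arange(n))
kernel = 1.0 / (1.0 + r) ** 3                # hypothetical decaying interaction
h = field_via_fft(m, kernel)
print(h.shape)
```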
2

ROOZMEH, MEHDI. "High Performance Computing via High Level Synthesis." Doctoral thesis, Politecnico di Torino, 2018. http://hdl.handle.net/11583/2710706.

Full text
Abstract:
As more and more powerful integrated circuits appear on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is in particular devoted to High Performance Computing applications, where these trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs). These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance. An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. OpenCL (Open Computing Language), developed by the Khronos Group, enables developers to program CPUs, GPUs and, recently, FPGAs using functionally portable (but sadly not performance-portable) source code, which creates new possibilities and challenges both for research and industry. FPGAs have always been used for mid-size designs and ASIC prototyping thanks to their energy-efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches have been developed and deployed to address this issue and shorten the gap between software and hardware in the FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers. Moreover, FPGA usage in data centers is already growing, regardless of and in addition to their use as computational accelerators, because they can be used as high-performance, low-power and secure switches inside data centers. High-Level Synthesis (HLS) is the methodology that enables designers to map their applications onto FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written in C-based programming languages, e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through a wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application-dependent hardware and let designers benefit from the flexible parallel architecture of FPGAs. Model Based Design (MBD) is a high-level and visual process used to generate implementations that solve mathematical problems through a varied set of IP blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development, and hardware design, to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MathWorks, is a model-based design tool for simulation and development of complex dynamical systems. Moreover, Simulink embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program micro-controllers and FPGAs. This PhD thesis presents a study using the automatic code generators of Simulink to target Xilinx FPGAs with both HDL and C/C++ code, to demonstrate the capabilities and challenges of the high-level synthesis process. To do so, firstly, the digital signal processing unit of a real-time radar application is developed using Simulink blocks. Secondly, the generated C-based model is used for the high-level synthesis process and, finally, the implementation cost of HLS is compared to traditional HDL synthesis using the Xilinx tool chain. As an alternative to the model-based design approach, this work also presents an analysis of FPGA programming via high-level synthesis techniques for computationally intensive algorithms and demonstrates the importance of HLS by comparing the performance-per-watt of GPUs (NVIDIA) and FPGAs (Xilinx) manufactured in the same node running standard OpenCL benchmarks. We conclude that generation of high-quality RTL from an OpenCL model requires a stronger hardware background than the MBD approach; however, the availability of fast and broad design space exploration and the portability of the OpenCL code, e.g. to CPUs and GPUs, motivates FPGA industry leaders to provide users with OpenCL software development environments that promise FPGA programming in a CPU/GPU-like fashion. Our experiments, through extensive design space exploration (DSE), suggest that FPGAs have higher performance-per-watt than two high-end GPUs manufactured in the same technology (28 nm). Moreover, FPGAs with more available resources and using a more modern process (20 nm) can outperform the tested GPUs while consuming much less power, at the cost of more expensive devices.
3

Balakrishnan, Suresh Reuben A/L. "Hybrid High Performance Computing (HPC) + Cloud for Scientific Computing." Thesis, Curtin University, 2022. http://hdl.handle.net/20.500.11937/89123.

Full text
Abstract:
The HPC+Cloud framework has been built to enable on-premise HPC jobs to use resources from cloud computing nodes. As part of designing the software framework, public cloud providers, namely Amazon AWS, Microsoft Azure and NeCTAR were benchmarked against one another, and Microsoft Azure was determined to be the most suitable cloud component in the proposed HPC+Cloud software framework. Finally, an HPC+Cloud cluster was built using the HPC+Cloud software framework and then was validated by conducting HPC processing benchmarks.
4

Roberts, Stephen I. "Energy-aware performance engineering in high performance computing." Thesis, University of Warwick, 2017. http://wrap.warwick.ac.uk/107784/.

Full text
Abstract:
Advances in processor design have delivered performance improvements for decades. As physical limits are reached, however, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests that software modifications will be needed to exploit the resulting improvements in current and future hardware. New tools are required to capitalise on this new class of optimisation. This thesis investigates the field of energy-aware performance engineering. It begins by examining the current state of the art, which is characterised by ad-hoc techniques and a lack of standardised metrics. Work in this thesis addresses these deficiencies and lays stable foundations for others to build on. The first contribution made includes a set of criteria which define the properties that energy-aware optimisation metrics should exhibit. These criteria show that current metrics cannot meaningfully assess the utility of code or correctly guide its optimisation. New metrics are proposed to address these issues, and theoretical and empirical proofs of their advantages are given. This thesis then presents the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. POSE is used to study the optimisation characteristics of codes from the Mantevo mini-application suite running on a Haswell-based cluster. The results obtained show that of these codes TeaLeaf has the most scope for power optimisation while PathFinder has the least. Finally, POSE modelling techniques are extended to evaluate the system-wide scope for energy-aware performance optimisation. System Summary POSE allows developers to assess the scope a system has for energy-aware software optimisation independent of the code being run.
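As a toy illustration of the abstract's point that the choice of energy-aware metric can change which code variant looks "better", the sketch below compares two hypothetical variants under plain energy and under the energy-delay product. All numbers are invented and the POSE model itself is not reproduced here.

```python
# Illustrative only: two made-up code variants ranked by different
# energy-aware metrics. Not the POSE model from the thesis.
def energy(power_w, time_s):
    return power_w * time_s                 # joules

def energy_delay_product(power_w, time_s):
    return energy(power_w, time_s) * time_s # J*s, penalizes slow variants

variants = {"baseline": (95.0, 120.0), "power-capped": (70.0, 150.0)}
for name, (p, t) in variants.items():
    print(name, "E =", energy(p, t), "J, EDP =", energy_delay_product(p, t))
# The power-capped variant wins on raw energy but loses on EDP,
# which is why the metric criteria discussed above matter.
```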
5

Palamadai, Natarajan Ekanathan. "Portable and productive high-performance computing." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108988.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 115-120).
Performance portability of computer programs, and programmer productivity in writing them are key expectations in software engineering. These expectations lead to the following questions: Can programmers write code once, and execute it at optimal speed on any machine configuration? Can programmers write parallel code to simple models that hide the complex details of parallel programming? This thesis addresses these questions for certain "classes" of computer programs. It describes "autotuning" techniques that achieve performance portability for serial divide-and-conquer programs, and an abstraction that improves programmer productivity in writing parallel code for a class of programs called "Star". We present a "pruned-exhaustive" autotuner called Ztune that optimizes the performance of serial divide-and-conquer programs for a given machine configuration. Whereas the traditional way of autotuning divide-and-conquer programs involves simply coarsening the base case of recursion optimally, Ztune searches for optimal divide-and-conquer trees. Although Ztune, in principle, exhaustively enumerates the search domain, it uses pruning properties that greatly reduce the size of the search domain without significantly sacrificing the quality of the autotuned code. We illustrate how to autotune divide-and-conquer stencil computations using Ztune, and present performance comparisons with state-of-the-art "heuristic" autotuning. Not only does Ztune autotune significantly faster than a heuristic autotuner, the Ztuned programs also run faster on average than their heuristic autotuner tuned counterparts. Surprisingly, for some stencil benchmarks, Ztune actually autotuned faster than the time it takes to execute the stencil computation once. We introduce the Star class that includes many seemingly different programs like solving symmetric, diagonally-dominant tridiagonal systems, executing "watershed" cuts on graphs, sample sort, fast multipole computations, and all-prefix-sums and its various applications. We present a programming model, which is also called Star, to generate and execute parallel code for the Star class of programs. The Star model abstracts the pattern of computation and interprocessor communication in the Star class of programs, hides low-level parallel programming details, and offers ease of expression, thereby improving programmer productivity in writing parallel code. Besides, we also present parallel algorithms, which offer asymptotic improvements over prior art, for two programs in the Star class - a Trip algorithm for solving symmetric, diagonally-dominant tridiagonal systems, and a Wasp algorithm for executing watershed cuts on graphs. The Star model is implemented in the Julia programming language, and leverages Julia's capabilities in expressing parallelism in code concisely, and in supporting both shared-memory and distributed-memory parallel programming alike.
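For readers unfamiliar with the "traditional" autotuning that Ztune improves on, the sketch below shows base-case coarsening: timing a recursive kernel for several candidate cutoffs and keeping the fastest. It is a generic illustration with an invented kernel, not Ztune's pruned-exhaustive search over whole divide-and-conquer trees.

```python
# Sketch of base-case coarsening, the "traditional" autotuning approach the
# abstract contrasts with Ztune. The summation kernel is purely illustrative.
import time

def dac_sum(a, lo, hi, cutoff):
    if hi - lo <= cutoff:                    # coarsened base case
        return sum(a[lo:hi])
    mid = (lo + hi) // 2
    return dac_sum(a, lo, mid, cutoff) + dac_sum(a, mid, hi, cutoff)

data = list(range(1 << 18))
best = None
for cutoff in (64, 256, 1024, 4096):         # candidate base-case sizes
    t0 = time.perf_counter()
    dac_sum(data, 0, len(data), cutoff)
    elapsed = time.perf_counter() - t0
    if best is None or elapsed < best[1]:
        best = (cutoff, elapsed)
print("best cutoff:", best)
```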
by Ekanathan Palamadai Natarajan.
Ph. D.
6

Zhou, He. "High Performance Computing Architecture with Security." Diss., The University of Arizona, 2015. http://hdl.handle.net/10150/578611.

Full text
Abstract:
Multi-processor embedded systems are the future promise of high-performance computing architecture. However, they still suffer from low network efficiency and security threats. Simply upgrading to multi-core systems has been proven to provide only minor speedup compared with single-core systems. The router architecture of a network-on-chip (NoC) uses shared input buffers such as virtual channels and crossbar switches that only allow sequential data access. The speed and efficiency of on-chip communication is therefore limited. In addition, the performance of conventional NoC topologies is limited by routing latency and energy consumption, because the network diameter increases with the rising number of nodes. Security has also become a serious problem for embedded systems. Even with cryptographic algorithms, embedded systems are still very vulnerable to side-channel attacks (SCAs). Among SCA approaches, power analysis is an efficient and powerful attack. Once the encryption location in an instruction sequence is identified, power analysis can be applied to exploit the embedded system. To improve on-chip network parallelism, this dissertation proposes a new router microarchitecture based on a new data structure called the virtual collision array. Sequential data requests are partially eliminated in the virtual collision array before entering the router pipeline. To facilitate the new router architecture, a new workload assignment is applied to increase data request elimination. Through a task flow partitioning algorithm, we minimize sequential data access and then schedule tasks while minimizing the total router delay. For NoC topology, this dissertation presents a new hybrid NoC (HyNoC) architecture. We introduce an adaptive routing scheme to provide reconfigurable on-chip communication with both wired and wireless links. In addition, based on a mathematical model established on cross-correlation, this dissertation proposes two obfuscation methodologies, Real Instruction Insertion and AES Mimic, to prevent SCA power analysis attacks.
7

Mani, Sindhu. "Empirical Performance Analysis of High Performance Computing Benchmarks Across Variations in Cloud Computing." UNF Digital Commons, 2012. http://digitalcommons.unf.edu/etd/418.

Full text
Abstract:
High Performance Computing (HPC) applications are data-intensive scientific software requiring significant CPU and data storage capabilities. Researchers have examined the performance of the Amazon Elastic Compute Cloud (EC2) environment across several HPC benchmarks; however, an extensive HPC benchmark study and a comparison between Amazon EC2 and Windows Azure (Microsoft's cloud computing platform), with metrics such as memory bandwidth, Input/Output (I/O) performance, and communication and computational performance, are largely absent. The purpose of this study is to perform an exhaustive HPC benchmark comparison on the EC2 and Windows Azure platforms. We implement existing benchmarks to evaluate and analyze the performance of two public clouds spanning both IaaS and PaaS types. We use Amazon EC2 and Windows Azure as platforms for hosting HPC benchmarks with variations such as instance types, number of nodes, hardware and software. This is accomplished by running benchmarks including STREAM, IOR and the NPB suite on these platforms on a varied number of nodes for small and medium instance types. These benchmarks measure memory bandwidth, I/O performance, and communication and computational performance. Benchmarking cloud platforms provides useful objective measures of their worthiness for HPC applications, in addition to assessing their consistency and predictability in supporting them.
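The memory-bandwidth figure that STREAM reports can be approximated in a few lines; the sketch below mimics the triad kernel (a = b + scalar*c) in NumPy to give a rough, illustrative estimate on whatever machine it runs on. It is not the actual STREAM, IOR or NPB benchmark code used in the study, and temporaries in the NumPy expression make it only a loose lower bound on true traffic.

```python
# Rough STREAM-triad-style bandwidth estimate (illustrative, not the real
# STREAM benchmark): a[i] = b[i] + scalar * c[i] touches three arrays.
import numpy as np
import time

n = 10_000_000
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

t0 = time.perf_counter()
a = b + scalar * c
elapsed = time.perf_counter() - t0
bytes_moved = 3 * n * 8          # read b, read c, write a (float64)
print("triad bandwidth ~", bytes_moved / elapsed / 1e9, "GB/s")
```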
8

Choi, Jee Whan. "Power and performance modeling for high-performance computing algorithms." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/53561.

Full text
Abstract:
The overarching goal of this thesis is to provide an algorithm-centric approach to analyzing the relationship between time, energy, and power. This research is aimed at algorithm designers and performance tuners so that they may be able to make decisions on how algorithms should be designed and tuned depending on whether the goal is to minimize time or to minimize energy on current and future systems. First, we present a simple analytical cost model for energy and power. Assuming a simple von Neumann architecture with a two-level memory hierarchy, this model predicts energy and power for algorithms using just a few simple parameters, such as the number of floating point operations (FLOPs or flops) and the amount of data moved (bytes or words). Using highly optimized microbenchmarks and a small number of test platforms, we show that although this model uses only a few simple parameters, it is, nevertheless, accurate. We can also visualize this model using energy “arch lines,” analogous to the “rooflines” in time. These “rooflines in energy” allow users to easily assess and compare different algorithms’ intensities in energy and time to various target systems’ balances in energy and time. This visualization of our model gives us many interesting insights, and as such, we refer to our analytical model as the energy roofline model. Second, we present the results of our microbenchmarking study of time, energy, and power costs of computation and memory access of several candidate compute-node building blocks of future high-performance computing (HPC) systems. Over a dozen server-, desktop-, and mobile-class platforms that span a range of compute and power characteristics were evaluated, including x86 (both conventional and Xeon Phi accelerator), ARM, graphics processing units (GPU), and hybrid (AMD accelerated processing units (APU) and other system-on-chip (SoC)) processors. The purpose of this study was twofold: first, it was to extend the validation of the energy roofline model to a more comprehensive set of target systems to show that the model works well independent of system hardware and microarchitecture; second, it was to improve the model by uncovering and remedying potential shortcomings, such as incorporating the effects of power “capping,” multi-level memory hierarchy, and different implementation strategies on power and performance. Third, we incorporate dynamic voltage and frequency scaling (DVFS) into the energy roofline model to explore its potential for saving energy. Rather than the more traditional approach of using DVFS to reduce energy, whereby a “slack” in computation is used as an opportunity to dynamically cycle down the processor clock, the energy roofline model can be used to determine precisely how the time and energy costs of different operations, both compute and memory, change with respect to frequency and voltage settings. This information can be used to target a specific optimization goal, whether that be time, energy, or a combination of both. In the final chapter of this thesis, we use our model to predict the energy dissipation of a real application running on a real system. The fast multipole method (FMM) kernel was executed on the GPU component of the Tegra K1 SoC under various frequency and voltage settings, and a breakdown of instructions and data access pattern was collected via performance counters. The total energy dissipation of FMM was then calculated as a weighted sum of these instructions and the associated costs in energy. On eight different voltage and frequency settings and eight different algorithm-specific input parameters per setting, for a total of 64 test cases, the accuracy of the energy roofline model for predicting total energy dissipation was within 6.2%, with a standard deviation of 4.7%, when compared to actual energy measurements. Despite its simplicity and its foundation on the first principles of algorithm analysis, the energy roofline model has proven to be both practical and accurate for real applications running on a real system. And as such, it can be an invaluable tool for algorithm designers and performance tuners with which they can more precisely analyze the impact of their design decisions on both performance and energy efficiency.
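The core of the energy roofline idea described above fits in a few lines: model total energy as a weighted sum of flops and bytes moved, and compare an algorithm's arithmetic intensity against the machine's energy balance point. The per-flop and per-byte costs below are invented placeholders, not the measured values from the thesis.

```python
# Sketch of the energy-roofline idea: energy = flops * e_flop + bytes * e_byte.
# The e_flop and e_byte coefficients are made-up placeholders.
def predicted_energy(flops, bytes_moved, e_flop=50e-12, e_byte=500e-12):
    return flops * e_flop + bytes_moved * e_byte      # joules

def energy_balance(e_flop=50e-12, e_byte=500e-12):
    """Arithmetic intensity (flops/byte) where flop energy equals
    data-movement energy -- the ridge point of the energy roofline."""
    return e_byte / e_flop

# toy kernel: 2*n flops and 24*n bytes, like a daxpy-style triad
n = 10_000_000
print("E ~", predicted_energy(2 * n, 24 * n), "J")
print("balance point ~", energy_balance(), "flops/byte")
```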
9

Ge, Rong. "Theories and Techniques for Efficient High-End Computing." Diss., Virginia Tech, 2007. http://hdl.handle.net/10919/28863.

Full text
Abstract:
Today, power consumption costs supercomputer centers millions of dollars annually, and the heat produced can reduce system reliability and availability. Achieving high performance while reducing power consumption is challenging since power and performance are inextricably interwoven; reducing power often results in degradation in performance. This thesis aims to address these challenges by providing theories, techniques, and tools to 1) accurately predict performance and improve it in systems with advanced hierarchical memories, 2) understand and evaluate power and its impacts on performance, and 3) control power and performance for maximum efficiency. Our theories, techniques, and tools have been applied to high-end computing systems. Our theoretical models can improve algorithm performance by up to 59% and accurately predict the impacts of power on performance. Our techniques can evaluate the power consumption of high-end computing systems and their applications with fine granularity and save up to 36% energy with little performance degradation.
Ph. D.
10

Orobitg, Cortada Miquel. "High performance computing on biological sequence alignment." Doctoral thesis, Universitat de Lleida, 2013. http://hdl.handle.net/10803/110930.

Full text
Abstract:
Multiple Sequence Alignment (MSA) is a powerful tool for important biological applications. MSAs are computationally difficult to calculate, and most formulations of the problem lead to NP-Hard optimization problems. To perform large-scale alignments, with thousands of sequences, new challenges need to be resolved to adapt the MSA algorithms to the High-Performance Computing era. In this thesis we propose three different approaches to solve some limitations of the main MSA methods. The first proposal consists of a new guide tree construction algorithm to improve the degree of parallelism in order to resolve the bottleneck of the progressive alignment stage. The second proposal consists of optimizing the consistency library, improving the execution time and the scalability of MSA to enable the method to treat more sequences. Finally, we propose Multiple Trees Alignment (MTA), an MSA method to align in parallel multiple guide trees, evaluate the alignments obtained and select the best one as the result. The experimental results demonstrated that MTA considerably improves the quality of the alignments.
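The MTA selection step lends itself to a compact sketch: align with several candidate guide trees in parallel and keep the highest-scoring alignment. The aligner and the scoring function below are toy stand-ins (a real implementation would use a progressive aligner and, for example, a sum-of-pairs score); only the overall structure is illustrated.

```python
# Sketch of "align with multiple guide trees in parallel, keep the best" (MTA).
# align_with_tree and score_alignment are toy stand-ins, not real tools.
from concurrent.futures import ProcessPoolExecutor

def align_with_tree(args):
    sequences, guide_tree = args
    # stand-in: a real aligner would align progressively along guide_tree
    return {"tree": guide_tree, "columns": max(len(s) for s in sequences)}

def score_alignment(alignment):
    # stand-in for an objective function such as a sum-of-pairs score
    return -alignment["columns"]

def mta(sequences, guide_trees, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        alignments = list(pool.map(align_with_tree,
                                   [(sequences, t) for t in guide_trees]))
    return max(alignments, key=score_alignment)

if __name__ == "__main__":
    seqs = ["ACGT", "ACGGT", "AGT"]
    print(mta(seqs, guide_trees=["tree-a", "tree-b"]))
```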
11

Bentz, Jonathan Lee. "Hybrid programming in high performance scientific computing." [Ames, Iowa : Iowa State University], 2006.

Find full text
12

Algire, Martin. "Distributed multi-processing for high performance computing." Thesis, McGill University, 2000. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=31180.

Full text
Abstract:
Parallel computing can take many forms. From a user's perspective, it is important to consider the advantages and disadvantages of each methodology. The following project attempts to provide some perspective on the methods of parallel computing and indicate where the tradeoffs lie along the continuum. Problems that are parallelizable enable researchers to maximize the computing resources available for a problem, and thus push the limits of the problems that can be solved. Solving any particular problem in parallel will require some very important design decisions to be made. These decisions may dramatically affect portability, performance, and cost of implementing a software solution to the problem. The results gained from this work indicate that although performance improvements are indeed possible, they are heavily dependent on the application in question and may require much more programming effort and expertise to implement.
13

King, Graham A. "High performance computing systems for signal processing." Thesis, Southampton Solent University, 1996. http://ssudl.solent.ac.uk/2424/.

Full text
Abstract:
The submission begins by demonstrating that the conditions required for consideration under the University's research degrees regulations have been met in full. There then follows a commentary which starts by explaining the origin of the research theme concerned and which continues by discussing the nature and significance of the work. This has been an extensive programme to devise new methods of improving the computational speed and efficiency required for effective implementation of FIR and IIR digital filters and transforms. The problems are analysed and initial experimental work is described which sought to quantify the performance to be derived from peripheral vector processors. For some classes of computation, especially in real time, it was necessary to turn to pure systolic array hardware engines and a large number of innovations are suggested, both in array architecture and in the creation of a new hybrid opto-electronic adder capable of improving the performance of processing elements for the array. This significant and original research is extended further by including a class of computation involving a bit sliced co-processor. A means of measuring the performance of this system is developed and discussed. The contribution of the work has been evident in: software innovation for horizontal architecture microprocessors; improved multi-dimensional systolic array designs; the development of completely new implementations of processing elements in such arrays; and in the details of co-processing architectures for bit sliced microprocessors. The use of Read Only Memory in creating n-dimensional FIR or IIR filters, and in executing the discrete cosine transform, is a further innovative contribution that has enabled researchers to re-examine the case for pre-calculated systems previously using stored squares. The Read Only Memory work has suggested that Read Only Memory chips may be combined in a way architecturally similar to systolic array processing elements. This led to original concepts of pipelining for memory devices. The work is entirely coherent in that it covers the application of these contributions to a set of common processes, producing a set of performance graded and scaleable solutions. In order that effective solutions are proposed it was necessary to demonstrate a solid underlying appreciation of the computational mechanics involved. Whilst the published papers within this submission assume such an understanding, two appendices are provided to demonstrate the essential groundwork necessary to underpin the work resulting in these publications. The improved results obtained from the programme were threefold: execution time; theoretical clocking speeds and circuit areas; and speed up ratios. In the case of the investigations involving vector signal processors the issue was one of quantifying the performance bounds of the architecture in performing specific combinations of signal processing functions. An important aspect of this work was the optimisation achieved in the programming of the device. The use of innovative techniques reduced the execution time for the complex combinational algorithms involved to sub 10 milliseconds. Given the real time constraints for typical applications and the aims for this research the work evolved toward dedicated hardware solutions. Systolic arrays were thus a significant area of investigation.
In such systems meritorious criteria are concerned with achieving: a higher regularity in architectural structure; data exchanges only with nearest neighbour processing elements; minimised global distribution functions such as power supplies and clock lines; minimised latency; minimisation in the use of latches; the elimination of output adders; and the design of higher speed processing elements. The programme has made original and significant contributions to the art of effective array design culminating in systems calculated to clock at 100 MHz when using 1 micron CMOS technology, whilst creating reductions in transistor count when compared with contemporary implementations. The improvements vary by specific design but are of the order of 30-100% speed advantage and 20-30% less real estate usage. The third type of result was obtained when considering operations best executed by dedicated microcode running on bit sliced engines. The main issues for this part of the work were the development of effective interactions between host processors and the bit sliced processors used for computationally intensive and repetitive functions, together with the evaluation of the relative performance of new bit sliced microcode solutions. The speed up obtained relative to a range of state of the art microprocessors (68040, 80386, 32032) ranged from 2:1 to 8:1. The programme of research is represented by sixteen papers divided into three groups corresponding to the following stages in the work: problem definition and initial responses involving vector processors; the synthesis of higher performance solutions using dedicated hardware; and bit sliced solutions.
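The FIR dataflow that these systolic arrays pipeline in hardware can be simulated serially in a few lines: coefficients stay in the processing elements, the input sample is broadcast each cycle, and partial sums move one element per clock. The sketch below is only a software model of that dataflow, not any of the hardware designs described in the submitted papers.

```python
# Software model of a weight-stationary systolic FIR: each PE holds one tap,
# the current input sample is broadcast, and partial sums shift one PE per
# clock cycle. Illustrative only; unrelated to the actual hardware designs.
def systolic_fir(samples, coeffs):
    n_pe = len(coeffs)
    taps = list(reversed(coeffs))       # PE 0 holds the last tap
    partial = [0.0] * n_pe              # partial sums flowing between PEs
    out = []
    for x in samples:                   # one iteration = one clock cycle
        for i in range(n_pe - 1, 0, -1):
            partial[i] = partial[i - 1] + taps[i] * x
        partial[0] = taps[0] * x
        out.append(partial[-1])         # output tapped from the last PE
    return out

# y[n] = 0.5*x[n] + 0.25*x[n-1] + 0.25*x[n-2]
print(systolic_fir([1, 2, 3, 4, 5], [0.5, 0.25, 0.25]))
```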
14

Debbage, Mark. "Reliable communication protocols for high-performance computing." Thesis, University of Southampton, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.358359.

Full text
15

Cox, Simon J. "Development and applications of high performance computing." Thesis, University of Southampton, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.242712.

Full text
16

Tavara, Shirin. "High-Performance Computing For Support Vector Machines." Licentiate thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-16556.

Full text
Abstract:
Machine learning algorithms are very successful in solving classification and regression problems; however, the immense amount of data created by digitalization slows down the training and prediction processes, if they are solvable at all. High-Performance Computing (HPC) and particularly parallel computing are promising tools for improving the performance of machine learning algorithms in terms of time. Support Vector Machines (SVM) is one of the most popular supervised machine learning techniques that enjoys the advancement of HPC to overcome the problems regarding big data; however, efficient parallel implementation of SVMs is a complex endeavour. While there are many parallel techniques to facilitate the performance of SVMs, there is no clear roadmap for every application scenario. This thesis is based on a collection of publications. It addresses the problems regarding parallel implementations of SVM through four research questions, all of which are answered through three research articles. In the first research question, the thesis investigates the influence of important factors such as parallel algorithms, HPC tools, and heuristics on the efficiency of parallel SVM implementations. This leads to identifying the state-of-the-art parallel implementations of SVMs, their pros and cons, and suggesting possible avenues for future research. It is up to the user to create a balance between the computation time and the classification accuracy. In the second research question, the thesis explores the impact of changes in problem size, and the values of the corresponding SVM parameters that lead to significant performance. This leads to addressing the impact of the problem size on the optimal choice of important parameters. Besides, the thesis shows the existence of a threshold between the number of cores and the training time. In the third research question, the thesis investigates the impact of the network topology on the performance of a network-based SVM. This leads to three key contributions. The first contribution is to show how much the expansion property of the network impacts the convergence. The next is to show which network topology is preferable to efficiently use the computing power. The third is to supply an implementation making the theoretical advances practically available. The results show that graphs with large spectral gaps and higher degrees exhibit accelerated convergence. In the last research question, the thesis combines all the contributions of the articles and offers recommendations towards implementing an efficient framework for SVMs regarding large-scale problems.
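The third contribution above ties convergence of the network-based SVM to the spectral gap of the topology. The sketch below computes that gap for two simple candidate topologies (a ring and a complete graph) with NumPy; it illustrates the graph property being discussed, not the SVM solver or the thesis's experiments.

```python
# Spectral gap of a topology's random-walk matrix; larger gaps (better
# expansion) are reported above to give faster convergence. Illustrative only.
import numpy as np

def spectral_gap(adj):
    deg = adj.sum(axis=1)
    walk = adj / deg[:, None]                 # row-normalized adjacency
    eig = np.sort(np.real(np.linalg.eigvals(walk)))[::-1]
    return eig[0] - eig[1]                    # gap below the leading eigenvalue 1

def ring(n):
    a = np.zeros((n, n))
    for i in range(n):
        a[i, (i + 1) % n] = a[(i + 1) % n, i] = 1.0
    return a

def complete(n):
    return np.ones((n, n)) - np.eye(n)

print("ring(16) gap:    ", round(spectral_gap(ring(16)), 4))
print("complete(16) gap:", round(spectral_gap(complete(16)), 4))
```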
17

Wong, Yee Lok Ph D. Massachusetts Institute of Technology. "High-performance computing with PetaBricks and Julia." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/67818.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 163-170).
We present two recent parallel programming languages, PetaBricks and Julia, and demonstrate how we can use these two languages to re-examine classic numerical algorithms in new approaches for high-performance computing. PetaBricks is an implicitly parallel language that allows programmers to naturally express algorithmic choice explicitly at the language level. The PetaBricks compiler and autotuner is not only able to compose a complex program using fine-grained algorithmic choices but also find the right choice for many other parameters including data distribution, parallelization and blocking. We re-examine classic numerical algorithms with PetaBricks, and show that the PetaBricks autotuner produces nontrivial optimal algorithms that are difficult to reproduce otherwise. We also introduce the notion of variable accuracy algorithms, in which accuracy measures and requirements are supplied by the programmer and incorporated by the PetaBricks compiler and autotuner in the search of optimal algorithms. We demonstrate the accuracy/performance trade-offs by benchmark problems, and show how nontrivial algorithmic choice can change with different user accuracy requirements. Julia is a new high-level programming language that aims at achieving performance comparable to traditional compiled languages, while remaining easy to program and offering flexible parallelism without extensive effort. We describe a problem in large-scale terrain data analysis which motivates the use of Julia. We perform classical filtering techniques to study the terrain profiles and propose a measure based on Singular Value Decomposition (SVD) to quantify terrain surface roughness. We then give a brief tutorial of Julia and present results of our serial blocked SVD algorithm implementation in Julia. We also describe the parallel implementation of our SVD algorithm and discuss how flexible parallelism can be further explored using Julia.
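As a rough illustration of the SVD-based roughness idea mentioned above (written here in NumPy rather than Julia, and with an assumed definition rather than the thesis's exact measure): a smooth terrain patch is captured by a few singular values, while a rough one needs many more to reach the same energy fraction.

```python
# Assumed SVD-based roughness stand-in: fraction of singular values needed
# to capture a given share of a terrain patch's energy. Not the exact measure
# used in the thesis.
import numpy as np

def svd_roughness(patch, energy=0.95):
    s = np.linalg.svd(patch, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy)) + 1   # singular values needed
    return k / len(s)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 128)
smooth = np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))  # rank-1 surface
rough = smooth + 0.5 * rng.standard_normal((128, 128))           # noisy surface
print("smooth:", svd_roughness(smooth), "rough:", svd_roughness(rough))
```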
by Yee Lok Wong.
Ph.D.
18

Ravindrudu, Rahul. "Benchmarking More Aspects of High Performance Computing." Ames, Iowa : Oak Ridge, Tenn. : Ames Laboratory ; distributed by the Office of Scientific and Technical Information, U.S. Dept. of Energy, 2004. http://www.osti.gov/servlets/purl/837280-06M7ga/webviewable/.

Full text
Abstract:
Thesis (M.S.); Submitted to Iowa State Univ., Ames, IA (US); 19 Dec 2004.
Published through the Information Bridge: DOE Scientific and Technical Information. "IS-T 2196" Rahul Ravindrudu. US Department of Energy 12/19/2004. Report is also available in paper and microfiche from NTIS.
19

Jararweh, Yaser. "Autonomic Programming Paradigm for High Performance Computing." Diss., The University of Arizona, 2010. http://hdl.handle.net/10150/193527.

Full text
Abstract:
The advances in computing and communication technologies and software tools have resulted in an explosive growth in networked applications and information services that cover all aspects of our life. These services and applications are inherently complex, dynamic and heterogeneous. In a similar way, the underlying information infrastructure, e.g. the Internet, is large, complex, heterogeneous and dynamic, globally aggregating large numbers of independent computing and communication resources. The combination of the two results in application development and management complexities that break current computing paradigms, which are based on static behaviors. As a result, applications, programming environments and information infrastructures are rapidly becoming fragile, unmanageable and insecure. This has led researchers to consider alternative programming paradigms and management techniques that are based on strategies used by biological systems. The autonomic programming paradigm is inspired by the human autonomic nervous system that handles complexity, uncertainties and abnormality. The overarching goal of the autonomic programming paradigm is to help build systems and applications capable of self-management. Firstly, we investigated large-scale scientific computing applications, which generally experience different execution phases at run time, where each phase has different computational, communication and storage requirements as well as different physical characteristics. In this dissertation, we present the Physics Aware Optimization (PAO) paradigm that enables programmers to identify the appropriate solution methods to exploit the heterogeneity and the dynamism of the application execution states. We implement a Physics Aware Optimization Manager to exploit the PAO paradigm. On the other hand, we present a self-configuration paradigm based on the principles of autonomic computing that can efficiently handle complexity, dynamism and uncertainty in configuring server and networked systems and their applications. Our approach is based on making any resource/application operate as an Autonomic Component (that is, a self-managed component) by using our autonomic programming paradigm. Our PAO technique for a medical application yielded about a 3X improvement in performance with 98.3% simulation accuracy compared to traditional techniques for performance optimization. Also, our self-configuration management for power and performance in a GPU cluster demonstrated 53.7% power savings for a CUDA workload while maintaining the cluster performance within given acceptable thresholds.
20

HEMMATPOUR, MASOUD. "High Performance Computing using Infiniband-based clusters." Doctoral thesis, Politecnico di Torino, 2019. http://hdl.handle.net/11583/2750549.

Full text
21

Nassar, Samuel. "High performance parallel Java with Javaparty." Thesis, Monterey, Calif. : Naval Postgraduate School, 2008. http://handle.dtic.mil/100.2/ADA483466.

Full text
Abstract:
Thesis (M.S. in Electrical Engineering)--Naval Postgraduate School, June 2008.
Thesis Advisor(s): Su, Weilian. "June 2008." Description based on title screen as viewed on August 26, 2008. Includes bibliographical references (p. 59-60). Also available in print.
22

Petkov, Ventsislav [Verfasser]. "Automatic Performance Engineering Workflows for High Performance Computing / Ventsislav Petkov." München : Verlag Dr. Hut, 2014. http://d-nb.info/1051549671/34.

Full text
23

Beserra, David Willians dos Santos Cavalcanti. "Performance analysis of virtualization technologies in high performance computing environments." Universidade Federal de Sergipe, 2016. https://ri.ufs.br/handle/riufs/3382.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
High Performance Computing (HPC) aggregates computing power in order to solve large and complex problems in different knowledge areas, such as science and engineering, ranging from 3D real-time medical imaging to simulation of the universe. Nowadays, HPC users can utilize virtualized Cloud infrastructures as a low-cost alternative to deploy their applications. Although Cloud infrastructures can be used as HPC platforms, many questions regarding the overheads introduced by virtualization remain unanswered. In this work, we analyze the performance of several virtualization solutions - Linux Containers (LXC), Docker, VirtualBox and KVM - under HPC activities. For our experiments, we consider CPU, communication (physical network and internal buses) and disk I/O performance. Results show that each virtualization technology impacts performance differently according to the hardware resource type used by the HPC application and the resource sharing conditions adopted.
24

Roloff, Eduardo. "Viability and performance of high-performance computing in the cloud." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/79594.

Full text
Abstract:
Cloud computing is a new paradigm, where computational resources are offered as services. In this context, the user does not need to buy infrastructure; the resources can be rented from a provider and used for a period of time. Furthermore, the user can easily allocate as many resources as needed, and deallocate them as well, in a totally elastic environment. The resources need to be paid for only for the effective usage time. On the other hand, High-Performance Computing (HPC) requires a large amount of computational power. To acquire systems capable of HPC, large financial investments are necessary. Apart from the initial investment, the user must pay the maintenance costs, and has only limited computational resources. To overcome these issues, this thesis aims to evaluate the cloud computing paradigm as a candidate environment for HPC. We analyze the efforts and challenges of porting and deploying HPC applications to the cloud. We evaluate whether this computing model can provide sufficient capacity for running HPC applications, and compare its cost efficiency to traditional HPC systems, such as clusters. The cloud computing paradigm was analyzed to identify which models have the potential to be used for HPC purposes. The identified models were then evaluated using major cloud providers, Microsoft Windows Azure, Amazon EC2 and Rackspace, and compared to a traditional HPC system. We analyzed the capabilities to create HPC environments, and evaluated their performance. For the evaluation of the cost efficiency, we developed an economic model. The results show that all the evaluated providers have the capability to create HPC environments. In terms of performance, there are some cases where cloud providers present better performance than the traditional system. From the cost perspective, the cloud presents an interesting alternative due to the pay-per-use model. Summarizing the results, this dissertation shows that cloud computing can be used as a realistic alternative for HPC environments.
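A toy version of the kind of comparison the economic model formalizes: the total cost of owning a cluster (purchase plus upkeep) against renting cloud nodes only for the hours actually used. Every figure below is an invented placeholder, not a number from the dissertation.

```python
# Toy ownership-vs-rental cost comparison; all prices and utilisation
# figures are invented placeholders.
def cluster_cost(years, purchase=200_000.0, upkeep_per_year=30_000.0):
    return purchase + upkeep_per_year * years

def cloud_cost(years, hours_per_year=2_000.0, nodes=32, price_per_node_hour=1.5):
    return years * hours_per_year * nodes * price_per_node_hour

for years in (1, 3, 5):
    print(years, "y: cluster", cluster_cost(years), "cloud", cloud_cost(years))
```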
25

Hutchins, Richard Chad. "Feasibility of virtual machine and cloud computing technologies for high performance computing." Thesis, Monterey, California. Naval Postgraduate School, 2013. http://hdl.handle.net/10945/42447.

Full text
Abstract:
Approved for public release; distribution is unlimited
Reissued May 2014 with additions to the acknowledgments
Knowing the future weather on the battlefield with high certainty can result in a higher advantage over the adversary. To create this advantage for the United States, the U.S. Navy utilizes the Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS) to create high spatial resolution, regional, numerical weather prediction (NWP) forecasts. To compute a forecast, COAMPS runs on high performance computing (HPC) systems. These HPC systems are large, dedicated supercomputers with little ability to scale or move. This makes these systems vulnerable to outages without a costly, equally powerful secondary system. Recent advancements in cloud computing and virtualization technologies provide a method for high mobility and scalability without sacrificing performance. This research used standard benchmarks in order to quantitatively compare a virtual machine (VM) to a native HPC cluster. The benchmark tests showed that the VM was a feasible platform for executing HPC applications. We then ran the COAMPS NWP on a VM within a cloud infrastructure to prove the ability to run an HPC application in a virtualized environment. The VM COAMPS model run performed better than the native HPC machine model run. These results show that VM and cloud computing technologies can be used to run HPC applications for the Department of Defense.
26

Lofstead, Gerald Fredrick. "Extreme scale data management in high performance computing." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/37232.

Full text
Abstract:
Extreme scale data management in high performance computing requires consideration of the end-to-end scientific workflow process. Of particular importance for runtime performance, the write-read cycle must be addressed as a complete unit. Any optimization made to enhance writing performance must consider the subsequent impact on reading performance. Only by addressing the full write-read cycle can scientific productivity be enhanced. The ADIOS middleware developed as part of this thesis provides an API nearly as simple as the standard POSIX interface, but with the flexibility to choose what transport mechanism(s) to employ at or during runtime. The accompanying BP file format is designed for high performance parallel output with limited coordination overheads while incorporating features to accelerate subsequent use of the output for reading operations. This pair of optimizations of the output mechanism and the output format is done such that they either do not negatively impact or greatly improve subsequent reading performance when compared to popular self-describing file formats. This end-to-end advantage of the ADIOS architecture is further enhanced through techniques to better enable asynchronous data transports, affording the incorporation of 'in flight' data processing operations and pseudo-transport mechanisms that can trigger workflows or other operations.
27

Lee, Dongwon. "High-performance computer system architectures for embedded computing." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42766.

Full text
Abstract:
The main objective of this thesis is to propose new methods for designing high-performance embedded computer system architectures. To achieve the goal, three major components - multi-core processing elements (PEs), DRAM main memory systems, and on/off-chip interconnection networks - in multi-processor embedded systems are examined in each section respectively. The first section of this thesis presents architectural enhancements to graphics processing units (GPUs), one of the multi- or many-core PEs, for improving performance of embedded applications. An embedded application is first mapped onto GPUs to explore the design space, and then architectural enhancements to existing GPUs are proposed for improving throughput of the embedded application. The second section proposes high-performance buffer mapping methods, which exploit useful features of DRAM main memory systems, in DSP multi-processor systems. The memory wall problem becomes increasingly severe in multiprocessor environments because of communication and synchronization overheads. To alleviate the memory wall problem, this section exploits bank concurrency and page mode access of DRAM main memory systems for increasing the performance of multiprocessor DSP systems. The final section presents a network-centric Turbo decoder and network-centric FFT processors. In the era of multi-processor systems, an interconnection network is another performance bottleneck. To handle heavy communication traffic, this section applies a crossbar switch - one of the indirect networks - to the parallel Turbo decoder, and applies a mesh topology to the parallel FFT processors. When designing the mesh FFT processors, a very different approach is taken to improve performance; an optical fiber is used as a new interconnection medium.
APA, Harvard, Vancouver, ISO, and other styles
28

Andrade, Jorge. "Grid and High-Performance Computing for Applied Bioinformatics." Doctoral thesis, Stockholm : Bioteknologi, Kungliga Tekniska högskolan, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4573.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Starke, Christoph [Verfasser]. "High Performance Computing zur technischen Finanzmarktanalyse / Christoph Starke." Kiel : Universitätsbibliothek Kiel, 2012. http://d-nb.info/1026442737/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Mukhamedov, Farukh. "High performance computing for the discontinuous Galerkin methods." Thesis, Brunel University, 2018. http://bura.brunel.ac.uk/handle/2438/16769.

Full text
Abstract:
Discontinuous Galerkin methods form a class of numerical methods for solving partial differential equations by combining features of finite element and finite volume methods. The methods are defined using a weak form of a particular model problem, allowing for discontinuities in the discrete trial and test spaces. Using a discontinuous discrete space mesh provides proper flexibility and a compact discretisation pattern, allowing multidomain and multiphysics simulation. Discontinuous Galerkin methods with a higher approximation polynomial order, the so-called p-version, perform better in terms of convergence rate compared with the low-order h-version with smaller element sizes and a larger mesh. However, the condition number of the Galerkin system grows as a consequence. This causes a surge in the amount of required storage, in computational complexity, and in the time required for computation. We use the following three approaches to keep the advantages and eliminate the disadvantages. The first approach is a specific choice of basis functions which we call C1 polynomials. These ensure that the majority of the integrals over the edges of the mesh elements vanish, which reduces the total number of non-zero elements in the resulting system. This decreases the computational complexity without loss of precision. This approach does not affect the number of iterations required by the chosen Conjugate Gradients method when compared with other choices of basis functions; it actually decreases the total number of algebraic operations performed. The second approach is the introduction of suitable preconditioners. In our case, the additive two-layer Schwarz method, developed in [4], for the iterative Conjugate Gradients method is considered. This directly affects the spectral condition number of the system matrix and decreases the number of iterations required for the computation. This approach, however, increases the total number of algebraic operations and might require more operational time. To tackle the rise in the number of algebraic operations, we introduce a modified additive two-layer non-overlapping Schwarz method with a multigrid process, which uses a fixed low-order approximation polynomial degree on a coarse grid. We show that this approach is spectrally equivalent to the first preconditioner and requires less time for computation. The third approach is the development of an efficient mathematical framework for a distributed data structure. This allows a high-performance, massively parallel implementation of the discontinuous Galerkin method. We demonstrate that it is possible to exploit properties of the system matrix and of the C1 polynomials as basis functions to optimize the parallel structures. This parallel data structure allows us to parallelize, at the same time, both the matrix-vector multiplication routines for the Conjugate Gradients method and the preconditioner routines at the solver level, which minimizes the transfer ratio across the distributed system. Finally, we combine all three approaches into a framework that allowed us to successfully implement all of the above.
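Since the computational kernel described here is a preconditioned Conjugate Gradients iteration, a compact sketch of that algorithm may help fix ideas. The sketch below uses NumPy and a simple Jacobi (diagonal) preconditioner standing in for the two-level Schwarz preconditioner developed in the thesis; the small test matrix is illustrative only.

```python
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-10, max_iter=1000):
    """Preconditioned Conjugate Gradients for a symmetric positive definite A.
    apply_Minv applies the preconditioner M^{-1} to a residual vector."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Jacobi preconditioner as a stand-in for the additive two-level Schwarz method.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
diag = np.diag(A)
x = pcg(A, b, lambda r: r / diag)
```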
APA, Harvard, Vancouver, ISO, and other styles
31

Luchangco, Victor. "Memory consistency models for high performance distributed computing." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/86772.

Full text
Abstract:
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 2001.
Includes bibliographical references (p. 195-205) and index.
This thesis develops a mathematical framework for specifying the consistency guarantees of high performance distributed shared memory multiprocessors. This framework is based on computations, which specify the operations requested and constraints on how these operations may be applied; we call the framework computation-centric. This framework is expressive enough to specify high level synchronization mechanisms such as locks. We use the computation-centric framework to specify and compare several memory models, to characterize programming disciplines, and to prove that weakly consistent systems provide strong consistency guarantees when certain programming disciplines are obeyed. Specifically, we define computation-centric versions of several memory models from the literature, including sequential consistency, weak ordering and release consistency, and we give a computation-centric characterization of data-race-free programs. We prove that when running data-race-free programs, weakly ordered systems appear sequentially consistent. We also define memory models that have higher level guarantees such as locks and transactions.
(cont.) The strongly consistent versions of these models make guarantees that are stronger than sequential consistency, and thus are easier for programmers to use. We introduce a new model called weak sequential locking, which has very weak guarantees, and prove that it guarantees sequential consistency and mutually exclusive locking for programs that protect memory accesses using locks. We also show that by using two-phase locking, programmers can implement serializable transactions on any memory system with weak sequential locking. The framework is intended primarily to help programmers of such systems reason about their programs. It supports a high level of abstraction, insulating programmers from system details and enhancing the portability of their programs. The framework is also useful for implementors of such systems, in determining what guarantees their implementations provide and in assessing the advantages of providing one memory model rather than another.
by Victor Luchangco.
Sc.D.
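A small, concrete illustration of the data-race-free discipline the thesis formalizes: when every access to a shared location is protected by a lock, a weakly ordered system behaves as if it were sequentially consistent. The sketch below shows only the programming discipline, using Python threads; the thesis's computation-centric framework itself is not captured here.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:            # every access to the shared location is lock-protected,
            counter += 1      # so the program is data-race-free

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 400_000     # the outcome matches a sequentially consistent execution
```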
APA, Harvard, Vancouver, ISO, and other styles
32

Istoan, Matei Valentin. "High-performance coarse operators for FPGA-based computing." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSEI030/document.

Full text
Abstract:
Field-Programmable Gate Arrays (FPGAs) have been shown to sometimes outperform mainstream microprocessors. The circuit paradigm enables efficient application-specific parallel computations. FPGAs also enable arithmetic efficiency: a bit is only computed if it is useful to the final result. To achieve this, FPGA arithmetic shouldn’t be limited to basic arithmetic operations offered by microprocessors. This thesis studies the implementation of coarser operations on FPGAs, in three main directions: New FPGA-specific approaches for evaluating the sine, cosine and the arctangent have been developed. Each function is tuned for its context and is as versatile and flexible as possible. Arithmetic efficiency requires error analysis and parameter tuning, and a fine understanding of the algorithms used. Digital filters are an important family of coarse operators resembling elementary functions: they can be specified at a high level as a transfer function with constraints on the signal/noise ratio, and then be implemented as an arithmetic datapath based on additions and multiplications. The main result is a method which transforms a high-level specification into a filter in an automated way. The first step is building an efficient method for computing sums of products by constants. Based on this, FIR and IIR filter generators are constructed. For arithmetic operators to achieve maximum performance, context-specific pipelining is required. Even if the designer’s knowledge is of great help when building and pipelining an arithmetic datapath, this remains complex and error-prone. A user-directed, automated method for pipelining has been developed. This thesis provides a generator of high-quality, ready-made operators for coarse computing cores, which brings FPGA-based computing a step closer to mainstream adoption. The cores are part of an open-ended generator, where functions are described as high-level objects such as mathematical expressions
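The filter generators described above start from the observation that a FIR filter is a sum of products by constants. A minimal software reference for that computation is sketched below; the coefficient values are illustrative, and in hardware each constant multiplication would be realized with shifts and additions rather than a general multiplier.

```python
import numpy as np

# Illustrative low-pass taps; a hardware generator would turn each constant
# multiplication into a shift-and-add network.
coeffs = np.array([0.25, 0.5, 0.25])

def fir(x, h):
    """Direct-form FIR filter: y[n] = sum_k h[k] * x[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k, hk in enumerate(h):
            if n - k >= 0:
                y[n] += hk * x[n - k]
    return y

signal = np.sin(np.linspace(0.0, 10.0, 100))
filtered = fir(signal, coeffs)
```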
APA, Harvard, Vancouver, ISO, and other styles
33

Adhinarayanan, Vignesh. "Models and Techniques for Green High-Performance Computing." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/98660.

Full text
Abstract:
High-performance computing (HPC) systems have become power limited. For instance, the U.S. Department of Energy set a power envelope of 20MW in 2008 for the first exascale supercomputer now expected to arrive in 2021--22. Toward this end, we seek to improve the greenness of HPC systems by improving their performance per watt at the allocated power budget. In this dissertation, we develop a series of models and techniques to manage power at micro-, meso-, and macro-levels of the system hierarchy, specifically addressing data movement and heterogeneity. We target the chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level. Overall, our goal is to improve the greenness of HPC systems by intelligently managing power. The first part of this dissertation focuses on measurement and modeling problems for power. First, we study how to infer chip-interconnect power by observing the system-wide power consumption. Our proposal is to design a novel micro-benchmarking methodology based on data-movement distance by which we can properly isolate the chip interconnect and measure its power. Next, we study how to develop software power meters to monitor a GPU's power consumption at runtime. Our proposal is to adapt performance counter-based models for their use at runtime via a combination of heuristics, statistical techniques, and application-specific knowledge. In the second part of this dissertation, we focus on managing power. First, we propose to reduce the chip-interconnect power by proactively managing its dynamic voltage and frequency (DVFS) state. Toward this end, we develop a novel phase predictor that uses approximate pattern matching to forecast future requirements and in turn, proactively manage power. Second, we study the problem of applying a power cap to a heterogeneous node. Our proposal proactively manages the GPU power using phase prediction and a DVFS power model but reactively manages the CPU. The resulting hybrid approach can take advantage of the differences in the capabilities of the two devices. Third, we study how in-situ techniques can be applied to improve the greenness of HPC clusters. Overall, in our dissertation, we demonstrate that it is possible to infer power consumption of real hardware components without directly measuring them, using the chip interconnect and GPU as examples. We also demonstrate that it is possible to build models of sufficient accuracy and apply them for intelligently managing power at many levels of the system hierarchy.
Doctor of Philosophy
Past research in green high-performance computing (HPC) mostly focused on managing the power consumed by general-purpose processors, known as central processing units (CPUs) and to a lesser extent, memory. In this dissertation, we study two increasingly important components: interconnects (predominantly focused on those inside a chip, but not limited to them) and graphics processing units (GPUs). Our contributions in this dissertation include a set of innovative measurement techniques to estimate the power consumed by the target components, statistical and analytical approaches to develop power models and their optimizations, and algorithms to manage power statically and at runtime. Experimental results show that it is possible to build models of sufficient accuracy and apply them for intelligently managing power on multiple levels of the system hierarchy: chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level.
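The performance-counter-based power models mentioned above are, at their simplest, linear fits from counter readings to measured power. The sketch below fits such a model on synthetic data; the counter choices and weights are assumptions made for illustration, not the models developed in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-interval "counter" readings: [core utilization, memory traffic, constant term]
X = np.column_stack([rng.random(50), rng.random(50), np.ones(50)])
true_weights = np.array([60.0, 25.0, 40.0])                 # hypothetical watts per unit of activity
measured_power = X @ true_weights + rng.normal(0.0, 1.0, 50)

# Least-squares fit of a linear counter-based power model: P ~ w0*util + w1*mem + w2
weights, *_ = np.linalg.lstsq(X, measured_power, rcond=None)
predicted_power = X @ weights
```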
APA, Harvard, Vancouver, ISO, and other styles
34

Aji, Ashwin M. "Programming High-Performance Clusters with Heterogeneous Computing Devices." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/52366.

Full text
Abstract:
Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjointed computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which are seamlessly leveraged by the application developers. MPI-ACC provides a familiar, flexible and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization.
Ph. D.
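The productivity problem MPI-ACC targets is the explicit staging of accelerator data through host buffers around MPI communication. The sketch below shows that baseline pattern using mpi4py and CuPy (both assumed installed); MPI-ACC's own extended interface, which hides this staging, is not reproduced here.

```python
# Baseline hybrid pattern: explicit device<->host staging around MPI calls.
# Run with e.g.: mpiexec -n 2 python staging_demo.py
import numpy as np
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    d_data = cp.arange(1024, dtype=cp.float64)   # data resident on the GPU
    h_buf = cp.asnumpy(d_data)                   # explicit device-to-host copy
    comm.Send(h_buf, dest=1, tag=0)              # MPI only ever sees the host buffer
elif rank == 1:
    h_buf = np.empty(1024, dtype=np.float64)
    comm.Recv(h_buf, source=0, tag=0)
    d_data = cp.asarray(h_buf)                   # explicit host-to-device copy on arrival
```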
APA, Harvard, Vancouver, ISO, and other styles
35

Ali, Nawab. "Rethinking I/O in High-Performance Computing Environments." The Ohio State University, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=osu1259094051.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Stock, Kevin Alan. "Vectorization and Register Reuse in High Performance Computing." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1408969385.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Calatrava, Arroyo Amanda. "High Performance Scientific Computing over Hybrid Cloud Platforms." Doctoral thesis, Universitat Politècnica de València, 2016. http://hdl.handle.net/10251/75265.

Full text
Abstract:
Scientific applications generally have large computational, memory and data management requirements for their execution. Such applications have traditionally used high-performance resources, such as shared memory supercomputers, clusters of PCs with distributed memory, or resources from Grid infrastructures, on which the application needs to be adapted to run successfully. In recent years, the advent of virtualization techniques, together with the emergence of Cloud Computing, has caused a major shift in the way these applications are executed. However, the execution management of scientific applications on high performance elastic platforms is not a trivial task. In this doctoral thesis, Elastic Cloud Computing Cluster (EC3) has been developed. EC3 is an open-source tool able to execute high performance scientific applications by creating self-managed, cost-efficient virtual hybrid elastic clusters on top of IaaS Clouds. These self-managed clusters have the capability to adapt the size of the cluster, i.e. the number of nodes, to the workload, thus creating the illusion of a real cluster without requiring an investment beyond the actual usage. They can be fully customized and migrated from one provider to another in an automatic and transparent process for the users and jobs running in the cluster. EC3 can also deploy hybrid clusters across on-premises and public Cloud resources, where on-premises resources are supplemented with public Cloud resources to accelerate the execution process. Different instance types and the use of spot instances combined with on-demand resources are also cluster configurations supported by EC3. Moreover, using spot instances together with checkpointing techniques, the tool can significantly reduce the total cost of executions while introducing automatic fault tolerance. EC3 is conceived to facilitate the use of virtual clusters by users who might not have extensive knowledge of these technologies but can still benefit from them. Thus, the tool offers two different interfaces to its users: a web interface where EC3 is exposed as a service for non-experienced users, and a powerful command line interface. Moreover, this thesis explores the field of lightweight virtualization using containers as an alternative to the traditional virtualization solution based on virtual machines. This study analyzes the scenarios suitable for the use of containers and proposes an architecture for the deployment of elastic virtual clusters based on this technology. Finally, to demonstrate the functionality and advantages of the tools developed during this thesis, this document includes several use cases covering different scenarios and fields of knowledge, such as structural analysis of buildings, astrophysics and biodiversity.
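The core elasticity idea, growing the virtual cluster while jobs are queued and shrinking it when nodes sit idle, can be sketched with a toy model. The class below is an illustrative assumption, not EC3's actual interface or policy.

```python
class ElasticCluster:
    """Toy model of an EC3-style elasticity rule (illustrative, not EC3's API)."""

    def __init__(self, min_nodes=1, max_nodes=8):
        self.nodes = min_nodes
        self.min_nodes, self.max_nodes = min_nodes, max_nodes

    def resize(self, queued_jobs):
        if queued_jobs > self.nodes and self.nodes < self.max_nodes:
            self.nodes += 1        # scale out: pending work exceeds current capacity
        elif queued_jobs == 0 and self.nodes > self.min_nodes:
            self.nodes -= 1        # scale in: release nodes that would otherwise sit idle but billed
        return self.nodes

cluster = ElasticCluster()
for load in [0, 3, 5, 5, 2, 0, 0]:
    print(f"queued={load} -> nodes={cluster.resize(load)}")
```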
Calatrava Arroyo, A. (2016). High Performance Scientific Computing over Hybrid Cloud Platforms [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/75265
TESIS
APA, Harvard, Vancouver, ISO, and other styles
38

Aitken, Michael James. "A reconfigurable accelerator card for high performance computing." Master's thesis, University of Cape Town, 2008. http://hdl.handle.net/11427/5234.

Full text
Abstract:
Includes abstract.
Includes bibliographical references (leaves 68-70).
This thesis describes the design, implementation, and testing of a reconfigurable accelerator card. The goal of the project was to provide a hardware platform for future students to carry out research into reconfigurable computing. Our accelerator design is an expansion card for a traditional Von Neumann host machine, and contains two field-programmable gate arrays. By inserting the card into a host machine, intrinsically parallel processing tasks can be exported to the FPGAs. This is similar to the way in which video game rendering tasks can be exported to the GPU on a graphics accelerator. We show how an FPGA is a suitable processing element, in terms of performance per watt, for many computing tasks. We set out to design and build a reconfigurable card that harnessed the latest FPGAs and fastest available I/O interfaces. The resultant design is one which can run within a host machine, in an array of host machines, or as a stand-alone processing node.
APA, Harvard, Vancouver, ISO, and other styles
39

Barrett, Brian W. "One-sided communication for high performance computing applications." [Bloomington, Ind.] : Indiana University, 2009. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3354909.

Full text
Abstract:
Thesis (Ph.D.)--Indiana University, Dept. of Computer Science, 2009.
Title from PDF t.p. (viewed on Feb. 4, 2010). Source: Dissertation Abstracts International, Volume: 70-04, Section: B, page: 2379. Adviser: Andrew Lumsdaine.
APA, Harvard, Vancouver, ISO, and other styles
40

Ahmed, Kishwar. "Energy Demand Response for High-Performance Computing Systems." FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3569.

Full text
Abstract:
The growing computational demand of scientific applications has greatly motivated the development of large-scale high-performance computing (HPC) systems in the past decade. To accommodate the increasing demand of applications, HPC systems have been going through dramatic architectural changes (e.g., introduction of many-core and multi-core systems, rapid growth of complex interconnection networks for efficient communication between thousands of nodes), as well as significant increases in size (e.g., modern supercomputers consist of hundreds of thousands of nodes). With such changes in architecture and size, the energy consumption of these systems has increased significantly. With the advent of exascale supercomputers in the next few years, power consumption of HPC systems will surely increase; some systems may even consume hundreds of megawatts of electricity. Demand response programs are designed to help energy service providers stabilize the power system by reducing the energy consumption of participating systems during periods of high power demand or temporary shortages in power supply. This dissertation focuses on developing energy-efficient demand-response models and algorithms to enable HPC systems' demand response participation. In the first part, we present interconnection network models for performance prediction of large-scale HPC applications. They are based on interconnect topologies widely used in HPC systems: dragonfly, torus, and fat-tree. Our interconnect models are fully integrated with an implementation of the message-passing interface (MPI) that can mimic most of its functions with packet-level accuracy. Extensive experiments show that our integrated models provide good accuracy for predicting the network behavior, while at the same time allowing for good parallel scaling performance. In the second part, we present an energy-efficient demand-response model to reduce HPC systems' energy consumption during demand response periods. We propose HPC job scheduling and resource provisioning schemes to enable HPC systems' emergency demand response participation. In the final part, we propose an economic demand-response model to allow both the HPC operator and HPC users to jointly reduce the HPC system's energy cost. Our proposed model allows the participation of HPC systems in economic demand-response programs through a contract-based rewarding scheme that can incentivize HPC users to participate in demand response.
APA, Harvard, Vancouver, ISO, and other styles
41

Chiesi, Matteo <1984&gt. "Heterogeneous Multi-core Architectures for High Performance Computing." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2014. http://amsdottorato.unibo.it/6469/1/strutt.pdf.

Full text
Abstract:
This thesis deals with heterogeneous architectures in standard workstations. Heterogeneous architectures represent an appealing alternative to traditional supercomputers because they are based on commodity components fabricated in large quantities; hence their price-performance ratio is unparalleled in the world of high performance computing (HPC). In particular, different aspects related to the performance and power consumption of heterogeneous architectures have been explored. The thesis initially focuses on an efficient implementation of a parallel application where the execution time is dominated by a high number of floating point instructions. Then the thesis addresses the central problem of efficient management of power peaks in heterogeneous computing systems. Finally it discusses a memory-bound problem, where the execution time is dominated by the memory latency. Specifically, the following main contributions have been carried out. A novel framework for the design and analysis of solar fields for Central Receiver Systems (CRS) has been developed; the implementation, based on a desktop workstation equipped with multiple Graphics Processing Units (GPUs), is motivated by the need for an accurate and fast simulation environment for studying mirror imperfections and non-planar geometries. Secondly, a power-aware scheduling algorithm on heterogeneous CPU-GPU architectures, based on an efficient distribution of the computing workload to the resources, has been realized. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The two main contributions of this work follow: the approach reduces the supply cost due to high peak power whilst having negligible impact on the parallelism of computational nodes; from another point of view, the developed model allows designers to increase the number of cores without increasing the capacity of the power supply unit. Finally, an implementation for efficient graph exploration on reconfigurable architectures is presented. The purpose is to accelerate graph exploration, reducing the number of random memory accesses.
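The power-aware scheduling idea, spreading the workload so that no node's power draw spikes, can be illustrated with a simple greedy placement. The per-task power estimates and node count below are hypothetical, and the thesis's scheduler is considerably more sophisticated than this sketch.

```python
def place_tasks(task_power_watts, n_nodes):
    """Greedy placement: assign each task to the currently least-loaded node,
    which keeps the cluster-wide peak power down (illustrative only)."""
    node_power = [0.0] * n_nodes
    placement = []
    for p in sorted(task_power_watts, reverse=True):
        target = min(range(n_nodes), key=lambda i: node_power[i])
        node_power[target] += p
        placement.append((p, target))
    return placement, max(node_power)

placement, peak = place_tasks([120.0, 80.0, 75.0, 60.0, 40.0], n_nodes=2)
print(placement, f"peak per-node power: {peak} W")
```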
APA, Harvard, Vancouver, ISO, and other styles
42

Chiesi, Matteo <1984&gt. "Heterogeneous Multi-core Architectures for High Performance Computing." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2014. http://amsdottorato.unibo.it/6469/.

Full text
Abstract:
This thesis deals with heterogeneous architectures in standard workstations. Heterogeneous architectures represent an appealing alternative to traditional supercomputers because they are based on commodity components fabricated in large quantities; hence their price-performance ratio is unparalleled in the world of high performance computing (HPC). In particular, different aspects related to the performance and power consumption of heterogeneous architectures have been explored. The thesis initially focuses on an efficient implementation of a parallel application where the execution time is dominated by a high number of floating point instructions. Then the thesis addresses the central problem of efficient management of power peaks in heterogeneous computing systems. Finally it discusses a memory-bound problem, where the execution time is dominated by the memory latency. Specifically, the following main contributions have been carried out. A novel framework for the design and analysis of solar fields for Central Receiver Systems (CRS) has been developed; the implementation, based on a desktop workstation equipped with multiple Graphics Processing Units (GPUs), is motivated by the need for an accurate and fast simulation environment for studying mirror imperfections and non-planar geometries. Secondly, a power-aware scheduling algorithm on heterogeneous CPU-GPU architectures, based on an efficient distribution of the computing workload to the resources, has been realized. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The two main contributions of this work follow: the approach reduces the supply cost due to high peak power whilst having negligible impact on the parallelism of computational nodes; from another point of view, the developed model allows designers to increase the number of cores without increasing the capacity of the power supply unit. Finally, an implementation for efficient graph exploration on reconfigurable architectures is presented. The purpose is to accelerate graph exploration, reducing the number of random memory accesses.
APA, Harvard, Vancouver, ISO, and other styles
43

AZIMI, SARAH. "Digital design techniques for dependable High-Performance Computing." Doctoral thesis, Politecnico di Torino, 2019. http://hdl.handle.net/11583/2734213.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

MANCA, EMANUELE. "Grid and high performance computing applied to bioinformatics." Doctoral thesis, Università degli Studi di Cagliari, 2015. http://hdl.handle.net/11584/266595.

Full text
Abstract:
Recent advances in genome sequencing technologies and modern biological data analysis technologies used in bioinformatics have led to a fast and continuous increase in biological data. The difficulty of managing the huge amounts of data currently available to researchers and the need to have results within a reasonable time have led to the use of distributed and parallel computing infrastructures for their analysis. In this context Grid computing has been successfully used. Grid computing is based on a distributed system which interconnects several computers and/or clusters to access global-scale resources. This infrastructure is flexible, highly scalable and can achieve high performance with data- and compute-intensive algorithms. Recently, bioinformatics has been exploring new approaches based on the use of hardware accelerators, such as Graphics Processing Units (GPUs). Initially developed as graphics cards, GPUs have recently been introduced for scientific purposes by reason of their performance per watt and the better cost/performance ratio achieved in terms of throughput and response time compared to other high-performance computing solutions. Although developers must have an in-depth knowledge of GPU programming and hardware to be effective, GPU accelerators have produced a lot of impressive results. The use of high-performance computing infrastructures raises the question of finding a way to parallelize the algorithms while limiting data dependency issues in order to accelerate computations on massively parallel hardware. In this context, the research activity in this dissertation focused on the assessment and testing of the impact of these innovative high-performance computing technologies on computational biology. In order to achieve high levels of parallelism and, in the final analysis, obtain high performance, some of the bioinformatic algorithms applicable to genome data analysis were selected, analyzed and implemented. These algorithms have been highly parallelized and optimized, thus maximizing the use of the GPU hardware resources. The overall results show that the proposed parallel algorithms are highly performant, thus justifying the use of such technology. In addition, a software infrastructure for workflow management has been devised to provide support for CPU and GPU computation on a distributed GPU-based infrastructure. Moreover, this software infrastructure allows a further coarse-grained data-parallel parallelization across multiple GPUs. Results show that the proposed application speed-up increases with the number of GPUs.
APA, Harvard, Vancouver, ISO, and other styles
45

Dugani, Vishwanath. "Continuous system-wide profiling of High Performance Computing parallel applications : Profiling high performance applications." Thesis, KTH, Parallelldatorcentrum, PDC, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-224926.

Full text
Abstract:
Profiling an application identifies the parts of the code being executed, using hardware performance counters, and thus characterizes the application's performance. Profiling has long been standard in the development process, focused on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important. As supercomputing grows in pervasiveness and scale, understanding the performance and utilization characteristics of parallel applications is critically important, because even minor performance improvements translate into large cost savings. The study first surveys various profiling tools. Perfminer was then integrated into SCANIA's Linux clusters to profile CFD and FEA applications, exploiting the batch queue system features for continuous system-wide profiling, which provides performance insights for high performance applications with negligible overhead. Perfminer provides stable, accurate profiles and a cluster-scale tool for performance analysis, and effectively highlights micro-architectural bottlenecks.
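As a minimal illustration of counter-based profiling for a single run, the sketch below wraps Linux `perf stat` around an application launch. The application name is a hypothetical placeholder, the event list is illustrative, and Perfminer's own continuous, cluster-wide collection pipeline is not shown.

```python
# Sketch: collect a few hardware counters for one run with Linux `perf stat`.
# "./my_solver input.dat" is a hypothetical application, not Scania's CFD/FEA codes.
import subprocess

events = "cycles,instructions,cache-misses"
cmd = ["perf", "stat", "-e", events, "--", "./my_solver", "input.dat"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stderr)   # perf writes its counter summary to stderr
```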
APA, Harvard, Vancouver, ISO, and other styles
46

Zong, Ziliang Qin Xiao. "Energy-efficient resource management for high-performance computing platforms." Auburn, Ala, 2008. http://repo.lib.auburn.edu/EtdRoot/2008/SUMMER/Computer_Science_and_Software_Engineering/Dissertation/Zong_Ziliang_54.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Engelmann, Christian. "Symmetric active/active high availability for high-performance computing system services." Thesis, University of Reading, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.559245.

Full text
Abstract:
In order to address anticipated high failure rates, reliability, availability and serviceability have become an urgent priority for next-generation high-performance computing (HPC) systems. This thesis aims to pave the way for highly available HPC systems by focusing on their most critical components and by reinforcing them with appropriate high availability solutions. Service components, such as head and service nodes, are the "Achilles heel" of an HPC system. A failure typically results in a complete system-wide outage. This thesis targets efficient software state replication mechanisms for service component redundancy to achieve high availability as well as high performance. Its methodology relies on defining a modern theoretical foundation for providing service-level high availability, identifying availability deficiencies of HPC systems, and comparing various service-level high availability methods. This thesis showcases several developed proof-of-concept prototypes providing high availability for services running on HPC head and service nodes using the symmetric active/active replication method, i.e., state-machine replication, to complement prior work in this area using active/standby and asymmetric active/active configurations. Presented contributions include a generic taxonomy for service high availability, an insight into availability deficiencies of HPC systems, and a unified definition of service-level high availability methods. Further contributions encompass a fully functional symmetric active/active high availability prototype for an HPC job and resource management service that does not require modification of the service, a fully functional symmetric active/active high availability prototype for an HPC parallel file system metadata service that offers high performance, and two preliminary prototypes for a transparent symmetric active/active replication software framework for client-service and dependent-service scenarios that hide the replication infrastructure from clients and services. Assuming a mean-time to failure of 5,000 hours for a head or service node, all presented prototypes improve service availability from 99.285% to 99.995% in a two-node system, and to 99.99996% with three nodes.
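The availability figures quoted above follow from the standard redundancy formula once a repair time is assumed; with MTTF = 5,000 hours, the stated single-node availability of 99.285% corresponds to an MTTR of roughly 36 hours (an assumption consistent with, but not stated in, the abstract):

```latex
% With MTTF = 5000 h and an assumed MTTR of about 36 h:
A_1 = \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}} = \frac{5000}{5036} \approx 99.285\%, \qquad
U_1 = 1 - A_1 \approx 7.15\times 10^{-3}
% A service replicated on n nodes is unavailable only when all replicas are down:
A_n = 1 - U_1^{\,n}: \quad
A_2 \approx 1 - 5.1\times 10^{-5} \approx 99.995\%, \qquad
A_3 \approx 1 - 3.7\times 10^{-7} \approx 99.99996\%
```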
APA, Harvard, Vancouver, ISO, and other styles
48

MA, LIANG. "Low power and high performance heterogeneous computing on FPGAs." Doctoral thesis, Politecnico di Torino, 2019. http://hdl.handle.net/11583/2727228.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Abraham, Subil. "On the Use of Containers in High Performance Computing." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/99319.

Full text
Abstract:
The lightweight, portable, and flexible nature of containers is driving their widespread adoption in cloud solutions. Data analysis and deep learning applications have especially benefited from containerized solutions. As such data analysis is also being utilized in the high performance computing (HPC) domain, the need for container support in HPC has become paramount. However, container adoption in HPC faces crucial performance and I/O challenges. One obstacle is that while there have been container solutions for HPC, such solutions have not been thoroughly investigated, especially from the aspect of their impact on the crucial I/O throughput needs of HPC. To this end, this paper provides a first-of-its-kind empirical analysis of state-of-the-art representative container solutions (Docker, Podman, Singularity, and Charliecloud) in HPC environments, especially how containers interact with the HPC storage systems. We present the design of an analysis framework that is deployed on all nodes in an HPC environment, and captures aspects such as CPU, memory, network, and file I/O statistics from the nodes and the storage system. We are able to garner key insights from our analysis, e.g., Charliecloud outperforms other container solutions in terms of container start-up time, while Singularity and Charliecloud are equivalent in I/O throughput. But this comes at a cost, as Charliecloud invokes the most metadata and I/O operations on the underlying Lustre file system. By identifying such optimization opportunities, we can enhance the performance of containers atop HPC and help the aforementioned applications.
Master of Science
Containers are a technology that allows applications to be packaged along with their ideal environment, all the way down to the preferred operating system. This allows an application to run anywhere that can support containers without a huge hit to application performance. Hence containers have seen wide adoption for use in the cloud. These qualities have also made them very appealing for use in the world of scientific research in national labs. Modern research heavily relies on the power of computing in order to model, simulate, and test the behavior of real world entities, often making use of large amounts of data and utilizing machine learning and deep learning. Doing this often requires the high performance computing power found in supercomputers. In most cases, scientists just want to be able to write their code and expect it to just work. Their applications might depend on other source code that forms part of their standard toolkit and would be expected to also be installed in the supercomputing environment. This may not always be the case, taking the scientist's focus away from their work in order to ensure their requirements are set up in the supercomputing environment, which might require extensive cooperation with the operations team responsible for the supercomputers. Containers easily solve this problem because they can package everything together. However, the use of containers in these environments has not been extensively tested, especially for applications that are very heavy on the analysis of large quantities of data. To fill this gap, this work analyzes the performance of several state-of-the-art container technologies (Docker, Podman, Singularity, Charliecloud), with a particular focus on their interaction with the Lustre data storage systems widely used in supercomputing environments. As part of this work, we design an analysis setup that captures the behavior of various aspects of the high performance computing environment, like CPU, memory, network usage and data movement, while using containers to run data-heavy applications. We garner important insights about their performance that can help inform the best choice of container technology given an environment and the kind of application that needs to be run.
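One of the measured quantities, container start-up time, can be reproduced in miniature by timing a trivial container run. Only the Docker invocation is shown below and the image is illustrative (and assumed to be already pulled); Podman, Singularity, and Charliecloud would be timed the same way with their own run commands.

```python
import subprocess, time

def startup_time(cmd, repeats=5):
    """Return the best-of-N wall-clock time for launching a trivial container."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - t0)
    return min(times)

docker_cmd = ["docker", "run", "--rm", "alpine:latest", "true"]
print(f"docker start-up: {startup_time(docker_cmd):.3f} s")
```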
APA, Harvard, Vancouver, ISO, and other styles
50

Pulla, Gautam. "High Performance Computing Issues in Large-Scale Molecular Statics Simulations." Thesis, Virginia Tech, 1999. http://hdl.handle.net/10919/33206.

Full text
Abstract:
Successful application of parallel high performance computing to practical problems requires overcoming several challenges. These range from the need to make sequential and parallel improvements in programs to the implementation of software tools which create an environment that aids sharing of high performance hardware resources and limits losses caused by hardware and software failures. In this thesis we describe our approach to meeting these challenges in the context of a Molecular Statics code. We describe sequential and parallel optimizations made to the code and also a suite of tools constructed to facilitate the execution of the Molecular Statics program on a network of parallel machines with the aim of increasing resource sharing, fault tolerance and availability.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles