Dissertations / Theses on the topic 'MPI'

Consult the top 50 dissertations / theses for your research on the topic 'MPI.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Kamal, Humaira. "FG-MPI : Fine-Grain MPI." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/44668.

Full text
Abstract:
The Message Passing Interface (MPI) is widely used to write sophisticated parallel applications ranging from cognitive computing to weather prediction and is almost universally adopted for High Performance Computing (HPC). Many popular MPI implementations bind MPI processes to OS-processes. This runtime model has closely matched single or multi-processor compute clusters. Since 2008, however, clusters of multicore nodes have been the predominant architecture for HPC, with the opportunity for parallelism inside one compute node. There are a number of popular parallel programming languages for multicore that use message passing. One notable difference between MPI and these languages is the granularity of the MPI processes. Processes written using MPI tend to be coarse-grained and designed to match the number of processes to the available hardware, rather than the program structure. Binding MPI processes to OS-processes fails to take full advantage of the finer-grain parallelism available on today's multicore systems. Our goal was to take advantage of the type of runtime systems used by fine-grain languages and integrate that into MPI to obtain the best of these programming models: the ability to have fine-grain parallelism, while maintaining MPI's rich support for communication inside clusters. Fine-Grain MPI (FG-MPI) is a system that extends the execution model of MPI to include interleaved concurrency through integration into the MPI middleware. FG-MPI is integrated into the MPICH2 middleware, which is an open source, production-quality implementation of MPI. The FG-MPI runtime uses coroutines to implement light-weight MPI processes that are non-preemptively scheduled by its MPI-aware scheduler. The use of coroutines enables fast context-switching time and low communication and synchronization overhead. FG-MPI enables expression of finer-grain function-level parallelism, which allows for flexible process mapping, scalability, and can lead to better program performance. We have demonstrated FG-MPI's ability to scale to over 100 million MPI processes on a large cluster of 6,480 cores. This is the first time any system has executed such a large number of MPI processes, and this capability will be useful in exploring scalability issues of the MPI middleware as systems move towards compute clusters with millions of processor cores.
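For readers new to the fine-grain model, here is a minimal C sketch of an ordinary MPI ring exchange; it is not taken from the thesis and assumes nothing beyond a standard MPI installation. The point is that the code is written against ranks, not cores: under FG-MPI the same program can be launched with far more ranks than cores, each rank running as a coroutine inside an OS process.

```c
#include <mpi.h>
#include <stdio.h>

/* A plain MPI ring: each rank forwards a token to its right-hand neighbour.
 * Under an OS-process MPI this needs one process per rank; under FG-MPI the
 * same code can run with many more ranks than cores, each rank being a
 * coroutine scheduled inside an OS process.  Run with at least two ranks. */
int main(int argc, char **argv) {
    int rank, size, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("token visited %d ranks, final value %d\n", size, token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        token++;
        MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```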
APA, Harvard, Vancouver, ISO, and other styles
2

Ramesh, Srinivasan. "MPI Performance Engineering with the MPI Tools Information Interface." Thesis, University of Oregon, 2018. http://hdl.handle.net/1794/23779.

Full text
Abstract:
The desire for high performance on scalable parallel systems is increasing the complexity and the need to tune MPI implementations. The MPI Tools Information Interface (MPI_T) introduced in the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level to detect scalability issues. The interface also provides a mechanism to fine-tune the performance of the MPI library dynamically at runtime. This thesis describes the motivation, design, and challenges involved in developing an MPI performance engineering infrastructure using MPI_T for two performance toolkits: the TAU Performance System and Caliper. I validate the design of the infrastructure for TAU by developing optimizations for production and synthetic applications. I show that the MPI_T runtime introspection mechanism in Caliper enables a meaningful analysis of performance data. This thesis includes previously published co-authored material.
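As a hedged illustration of the interface the thesis builds on, the following C sketch enumerates the performance variables (pvars) exposed through MPI_T; it is a generic MPI-3 example, not code from the TAU or Caliper integration described above.

```c
#include <mpi.h>
#include <stdio.h>

/* List the performance variables the MPI library exposes through MPI_T.
 * A tool like TAU or Caliper would bind handles to interesting pvars and
 * read them periodically; here we only print their names and descriptions. */
int main(int argc, char **argv) {
    int provided, num_pvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); /* MPI_T has its own init */
    MPI_Init(&argc, &argv);

    MPI_T_pvar_get_num(&num_pvars);
    for (int i = 0; i < num_pvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dtype, &enumtype, desc, &desc_len,
                            &bind, &readonly, &continuous, &atomic);
        printf("pvar %d: %s -- %s\n", i, name, desc);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```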
APA, Harvard, Vancouver, ISO, and other styles
3

Massetto, Francisco Isidro. "Hybrid MPI - uma implementação MPI para ambientes distribuídos híbridos." Universidade de São Paulo, 2007. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-08012008-100937/.

Full text
Abstract:
O crescente desenvolvimento de aplicações de alto desempenho é uma realidade presente nos dias atuais. Entretanto, a diversidade de arquiteturas de máquinas, incluindo monoprocessadores e multiprocessadores, clusters com ou sem máquina front-end, variedade de sistemas operacionais e implementações da biblioteca MPI tem aumentado cada dia mais. Tendo em vista este cenário, bibliotecas que proporcionem a integração de diversas implementações MPI, sistemas operacionais e arquiteturas de máquinas são necessárias. Esta tese apresenta o HyMPI, uma implementação da biblioteca MPI voltada para integração, em um mesmo ambiente distribuído de alto desempenho, nós com diferentes arquiteturas, clusters com ou sem máquina front-end, sistemas operacionais e implementações MPI. HyMPI oferece um conjunto de primitivas compatíveis com a especificação MPI, incluindo comunicação ponto a ponto, operações coletivas, inicio e termino, além de outras primitivas utilitárias.
The increasing development of high-performance applications is a reality today. However, the diversity of computer architectures, including mono- and multiprocessor machines, clusters with or without a front-end node, and the variety of operating systems and MPI implementations, keeps growing. Focused on this scenario, programming libraries that allow the integration of several MPI implementations, operating systems and computer architectures are needed. This thesis introduces HyMPI, an MPI implementation aimed at integrating, in a single distributed high-performance environment, nodes with different architectures, clusters with or without a front-end machine, operating systems and MPI implementations. HyMPI offers a set of primitives compatible with the MPI specification, including point-to-point communication, collective operations, startup and finalization, and some other utility functions.
APA, Harvard, Vancouver, ISO, and other styles
4

Subotic, Vladimir. "Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs." Doctoral thesis, Universitat Politècnica de Catalunya, 2013. http://hdl.handle.net/10803/129573.

Full text
Abstract:
Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance achieved when executing the parallel program on a parallel architecture is usually far from optimal: computational imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques to better exploit parallelism in parallel applications, with emphasis on techniques that increase asynchronism. In theory, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes to the original MPI application. Our technique automatically identifies the application's MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on a parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. This thesis also studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience in an expert system that can autonomously lead the search process. Throughout the work on this thesis, we also designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador, a tool to help port MPI applications to the MPI/OmpSs programming model.
Tareador provides a simple interface for proposing a decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador has already proved useful, having been included in parallel programming courses at UPC.
La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de un programa paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela. En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo. En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica de mensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes. En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento. También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones. Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas. Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa. En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programación MPI/OmpSs. 
Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs. Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC.
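To make the overlapping idea concrete, here is a small, hand-written C sketch of the kind of transformation the thesis automates: blocking halo exchanges are replaced by non-blocking requests so that interior computation proceeds while messages are in flight. The function and buffer names are illustrative only.

```c
#include <mpi.h>

/* Exchange one halo value with each neighbour using non-blocking calls so
 * that work on the interior of the local array overlaps with the transfers. */
static void halo_exchange_overlap(double *u, double *halo, int n,
                                  int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];

    MPI_Irecv(&halo[0], 1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&halo[1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[0],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* ... update interior points u[1 .. n-2] here: independent of the halo ... */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ... now update the two boundary points using halo[0] and halo[1] ... */
}
```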
APA, Harvard, Vancouver, ISO, and other styles
5

Träff, Jesper. "Aspects of the efficient implementation of the message passing interface (MPI)." Aachen Shaker, 2009. http://d-nb.info/994501803/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Young, Bobby Dalton. "MPI WITHIN A GPU." UKnowledge, 2009. http://uknowledge.uky.edu/gradschool_theses/614.

Full text
Abstract:
GPUs offer high-performance floating-point computation at commodity prices, but their usage is hindered by programming models which expose the user to irregularities in the current shared-memory environments and require learning new interfaces and semantics. This thesis will demonstrate that the message-passing paradigm can be conceptually cleaner than the current data-parallel models for programming GPUs because it can hide the quirks of current GPU shared-memory environments, as well as GPU-specific features, behind a well-established and well-understood interface. This will be shown by demonstrating a proof-of-concept MPI implementation which provides cleaner, simpler code with a reasonable performance cost. This thesis will also demonstrate that, although there is a virtualization constraint imposed by MPI, this constraint is harmless as long as the virtualization was already chosen to be optimal in terms of a strong execution model and nearly-optimal execution time. This will be demonstrated by examining execution times with varying virtualization using a computationally-expensive micro-kernel.
APA, Harvard, Vancouver, ISO, and other styles
7

Angadi, Raghavendra. "Best effort MPI/RT as an alternative to MPI design and performance comparison /." Master's thesis, Mississippi State : Mississippi State University, 2002. http://library.msstate.edu/etd/show.asp?etd=etd-12032002-162333.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Sankarapandian, Dayala Ganesh R. Kamal Raj. "Profiling MPI Primitives in Real-time Using OSU INAM." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587336162238284.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Hoefler, Torsten. "Communication/Computation Overlap in MPI." Universitätsbibliothek Chemnitz, 2006. http://nbn-resolving.de/urn:nbn:de:swb:ch1-200600021.

Full text
Abstract:
This talk discusses optimized collective algorithms and the benefits of leveraging independent hardware entities in a pipelined manner. The resulting approach uses overlap of computation and communication to achieve this. Several examples are given.
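A minimal sketch of the idea in modern MPI terms: the talk predates the MPI-3 non-blocking collectives, so MPI_Ibcast is used here only as a convenient stand-in for the pipelined, overlapped collectives it discusses.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double buf[1024] = {0};
    MPI_Request req;

    /* Start the broadcast, keep computing on unrelated data, finish later. */
    MPI_Ibcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* ... independent computation that does not touch buf goes here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```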
APA, Harvard, Vancouver, ISO, and other styles
10

Chung, Ryan Ki Sing. "CMCMPI : Compose-Map-Configure MPI." Thesis, University of British Columbia, 2014. http://hdl.handle.net/2429/51185.

Full text
Abstract:
In order to manage the complexities of Multiple Program, Multiple Data (MPMD) program deployment and to optimize for performance, we propose (CM)²PI as a specification and tool that employs a four-stage approach to create a separation of concerns between distinct decisions: architecture interactions, software size, resource constraints, and function. With function-level parallelism in mind, we use multi-level compositions to create a scalable architecture specification and to improve re-usability and encapsulation. We explore different ways to abstract communication away from the tight coupling of MPI ranks and placement. One of the proposed methods is flow-controlled channels, which also aims at tackling the common issues of buffer limitations and termination. The specification increases compatibility with optimization tools, which enables the automatic optimization of program run time with respect to resource constraints. Together these features simplify the development of MPMD MPI programs.
Faculty of Science, Department of Computer Science, Graduate
APA, Harvard, Vancouver, ISO, and other styles
11

Mir, Taheri Seyed M. "Scalability of communicators in MPI." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/33128.

Full text
Abstract:
This thesis offers a novel framework for representing groups and communicators in Message Passing Interface (MPI) middleware. MPI is a widely used paradigm in cluster environments that supports communication between nodes. In our framework, we have implemented and evaluated scalable techniques for groups and communicators in MPI. We have tested this framework using FG-MPI, a fine-grain version of MPI that scales to millions of MPI processes. Groups in MPI are the primary means for creating communicators. A group map is the underlying structure that stores the processes participating in the communication. We introduce a framework for concise representations of the group map. This framework is based on the observation that a map can be decomposed into a set and a permutation. This decomposition allows us to use a compact set representation for the cases where a specific mapping is not required, i.e. lists in monotonically increasing order. In other cases, the representation adds a permutation as well. A variety of set compression techniques have been used. Furthermore, the framework is open to the integration of new representations. One advantage of such a decomposition is the ability to implicitly represent a set with set representations such as BDDs. BDDs and similar representations are well suited to the types of operations used in the construction of communicators. In addition to set representations for unordered maps, we incorporated Wavelet Trees on Runs, a library designed to represent permutations. We have also included general compression techniques in the framework, such as BWT. This allows some degree of compression in memory-constrained environments where there is no discernible pattern in the group structure. We have investigated time and space trade-offs among the representations to develop strategies available to the framework. The strategies tune the framework based on the user's requirements. The first strategy optimizes the framework to be fast and is called the time strategy. The second strategy optimizes the framework with regard to space. The final strategy is a hybrid of the two and tries to strike a reasonable trade-off between time and space. These strategies let the framework accommodate a wider range of applications and users.
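For context, the following standard C snippet shows the group-to-communicator path whose internal group maps the thesis compresses; it uses only stock MPI calls and is not code from the framework itself.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Build a group holding the even-numbered ranks; the rank list passed to
     * MPI_Group_incl is exactly the kind of group map the thesis compresses. */
    MPI_Group world_group, even_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    int nevens = (size + 1) / 2;
    int evens[nevens];
    for (int i = 0; i < nevens; i++) evens[i] = 2 * i;

    MPI_Group_incl(world_group, nevens, evens, &even_group);

    MPI_Comm even_comm;
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) MPI_Comm_free(&even_comm);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
```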
APA, Harvard, Vancouver, ISO, and other styles
12

Silva, Rafael Ennes. "Escalonamento estático de programas-MPI." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2006. http://hdl.handle.net/10183/11472.

Full text
Abstract:
O bom desempenho de uma aplicação paralela é obtido conforme o modo como as técnicas de paralelização são empregadas. Para utilizar essas técnicas, é preciso encontrar uma forma adequada de extrair o paralelismo. Esta extração pode ser feita através de um grafo representativo da aplicação. Neste trabalho são aplicados métodos de particionamento de grafos para otimizar as comunicações entre os processos que fazem parte de uma computação paralela. Nesse contexto, a alocação dos processos almeja minimizar a quantidade de comunicações entre processadores. Esta técnica é frequentemente adotada em Processamento de Alto Desempenho - PAD. No entanto, a construção do grafo geralmente está embutida no programa, cujas estruturas de dados privadas são empregadas na construção do grafo. A proposta é usar ferramentas diretamente em programas MPI, empregando apenas os recursos padrões da norma MPI 1.2. O objetivo é fornecer uma biblioteca (b-MPI) portável para o escalonamento estático de programas MPI. O escalonamento estático realizado pela biblioteca é feito através do mapeamento de processos. Esse mapeamento busca agrupar os processos que trocam muitas informações em uma mesma máquina, o que nesse caso diminui o volume de dados trafegados pela rede. O mapeamento será realizado estaticamente após uma execução prévia do programa MPI. As aplicações alvo para o uso da b-MPI são aquelas que mantêm o mesmo padrão de comunicação após execuções sucessivas. A validação da biblioteca foi realizada através da Transformada Rápida de Fourier disponível no pacote FFTW, da resolução do Problema de Transferência de Calor através do Método de Schwarz e Multigrid e da Fatoração LU implementada no benchmark HPL. Os resultados mostraram que a b-MPI pode ser utilizada para distribuir os processos eficientemente, minimizando o volume de mensagens trafegadas pela rede.
A good performance of a parallel application is obtained according to how the parallelization techniques are applied. To make use of these techniques, it is necessary to find an appropriate way to extract the parallelism. This extraction can be done through a representative graph of the application. In this work, graph partitioning methods are applied to optimize the communication between processes that belong to a parallel computation. In this context, the process allocation aims to minimize the amount of communication between processors. This technique is frequently adopted in High Performance Computing (HPC). However, the graph construction is generally embedded inside the program, whose private data structures are employed in building the graph. The proposal is to use tools directly on MPI programs, employing only standard features of the MPI 1.2 norm. The goal is to provide a portable library (b-MPI) for the static scheduling of MPI programs. The static scheduling performed by the library is done through the mapping of processes. This mapping seeks to cluster the processes that exchange a lot of information on the same machine, which decreases the data volume passed through the network. The mapping is done statically after a previous execution of the MPI program. The target applications for b-MPI are those that keep the same communication pattern across successive executions. The library was validated with the Fast Fourier Transform available in the FFTW package, the solution of the Heat Transfer problem through the Schwarz and Multigrid methods, and the LU factorization implemented in the HPL benchmark. The results show that b-MPI can be used to distribute the processes efficiently, minimizing the volume of messages exchanged through the network.
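As a rough illustration of how a communication graph can be gathered from an unmodified MPI program, here is a tiny profiling wrapper based on the standard PMPI interface; it is a sketch of the general idea, not the b-MPI library, and the fixed-size table and world-rank assumption are simplifications.

```c
#include <mpi.h>

/* Intercept MPI_Send through the standard PMPI profiling interface and
 * accumulate, per destination, the number of bytes sent.  The resulting
 * matrix is the communication graph that a partitioner can later use to
 * map processes onto machines.  Assumes at most 4096 ranks and that dest
 * is a world rank (a real tool would translate ranks of sub-communicators). */
static long long bytes_to_peer[4096];

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    int type_size;
    MPI_Type_size(datatype, &type_size);
    bytes_to_peer[dest] += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```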
APA, Harvard, Vancouver, ISO, and other styles
13

Marjanović, Vladimir. "The MPI/OmpSs parallel programming model." Doctoral thesis, Universitat Politècnica de Catalunya, 2016. http://hdl.handle.net/10803/398135.

Full text
Abstract:
Even today, supercomputing systems have reached millions of cores in a single machine, connected by a complex interconnection network. Reducing communication time across processes becomes the most important issue in order to achieve the highest possible performance. The Message Passing Interface (MPI), which is the most widely used programming model for large distributed-memory machines, supports asynchronous communication primitives for overlapping communication and computation. However, these primitives are difficult to use and increase code complexity, requiring more development effort and producing less readable programs. This thesis presents a new programming model which allows the programmer to easily introduce the asynchrony necessary to overlap communication and computation. The proposed programming model is based on MPI and a task-based shared-memory framework, namely OmpSs. The thesis further describes the implementation details that allow efficient interoperation of the OmpSs runtime and MPI. The thesis demonstrates the hybrid use of MPI/OmpSs with several applications, of which the HPL benchmark is the most important case study. The hybrid MPI/OmpSs versions significantly improve the performance of the applications compared with their pure MPI counterparts. For HPL we get close to the asymptotic performance at relatively small problem sizes and still get significant benefits at large problem sizes. In addition, the hybrid MPI/OmpSs approach substantially reduces code complexity and is less sensitive to network bandwidth and operating system noise than the pure MPI versions. The thesis also analyzes and compares current techniques for overlapping computation and collective communication, including approaches using point-to-point communications and additional communication threads, respectively. The thesis stresses the importance of understanding the characteristics of a computational kernel that runs concurrently with communication. Experimental evaluation is done using the Communication Computation Concurrent (CCUBE) synthetic benchmark, developed in this thesis, as well as HPL.
Las supercomputadoras están formadas por un creciente número de núcleos, del orden de millones en la actualidad, que se comunican a través de una compleja red de interconexión. Para obtener el más alto rendimiento posible es necesario reducir el tiempo de comunicación entre procesos. MPI ("Message Passing Interface", Interfaz de Paso de Mensajes), el modelo de programación más usado para grandes sistemas con memoria distribuida, permite llamadas de comunicación asíncrona para solapar la comunicación y la computación. Sin embargo, dichas llamadas son difíciles de usar e incrementan la complejidad del código, necesitándose un mayor esfuerzo en la implementación del código y dando lugar a programas más difíciles de leer. Esta tesis presenta un nuevo modelo de programación que permite al programador introducir fácilmente la asincronía necesaria para solapar la comunicación y la computación. El modelo de programación propuesto está fundamentado en MPI y la infraestructura basada en tareas y memoria compartida OmpSs. La tesis describe en profundidad los detalles de la implementación para la eficiente interoperabilidad entre OmpSs y MPI. En la tesis se demuestra el uso híbrido de MPI/OmpSs con distintas aplicaciones de las cuales el benchmark HPL es el más importante. La versión híbrida MPI/OmpSs mejora significativamente el rendimiento de las aplicaciones respecto a las versiones MPI originales. En el caso de HPL se acerca a un rendimiento asintótico para problemas relativamente pequeños, obteniendo mejoras significativas para problemas grandes. Además la versión híbrida MPI/OmpSs reduce substancialmente la complejidad del código y se ve menos afectada por el ancho de banda de la red y el ruido del sistema operativo que la versión MPI pura. Esta tesis también analiza y compara otros métodos actuales para solapar computación y comunicación colectiva, tales como usar comunicación punto a punto con hilos adicionales para la comunicación. La tesis resalta la importancia de entender las características de la computación que se ejecuta simultáneamente con la comunicación. Los resultados experimentales se han obtenido usando el benchmark sintético CCUBE ("Communication Computation Concurrent", Comunicación Computación Concurrente), desarrollado en esta tesis, además de HPL.
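The following sketch gives the flavour of the hybrid model: MPI communication is wrapped in tasks with data dependences so the runtime can overlap it with computation. It assumes OmpSs-style in/out/label clauses and array-section syntax, and is illustrative rather than taken from the thesis.

```c
#include <mpi.h>

/* One step of a solver expressed in the hybrid style: the exchange of a block
 * becomes a task with data dependences, so the runtime may run it
 * concurrently with compute tasks that touch other data. */
void exchange_and_update(double *block, double *halo, int n,
                         int peer, MPI_Comm comm) {
    #pragma omp task in(block[0;n]) out(halo[0;n]) label(halo_exchange)
    {
        MPI_Sendrecv(block, n, MPI_DOUBLE, peer, 0,
                     halo,  n, MPI_DOUBLE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }

    #pragma omp task inout(block[0;n]) in(halo[0;n]) label(update)
    {
        for (int i = 0; i < n; i++)
            block[i] += 0.5 * halo[i];   /* placeholder computation */
    }

    #pragma omp taskwait
}
```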
APA, Harvard, Vancouver, ISO, and other styles
14

Tsai, Mike Yao Chen. "Hybrid design of MPI over SCTP." Thesis, University of British Columbia, 2007. http://hdl.handle.net/2429/32492.

Full text
Abstract:
The Message Passing Interface (MPI) is a popular message-passing interface for writing parallel applications. It has been designed to run over many different types of network interconnects, ranging from commodity Ethernet to more specialized hardware including shared memory and Remote Direct Memory Access (RDMA) devices such as InfiniBand and the recently standardized Internet Wide Area RDMA Protocol (iWARP). The API itself provides both point-to-point and remote memory access (RMA) operations to the application. However, it is often implemented on top of one kind of underlying network device, namely entirely RDMA or entirely point-to-point. As a result, it is often not possible to provide a direct mapping from the software semantics to the underlying hardware. In this work, we propose a hybrid approach to designing MPI in which the network device to use can depend on the functional requirement. This allows the MPI API to exploit the potential performance benefits of the underlying hardware more directly. Another highlight of this work is the design of the MPI middleware to be IP-based in order to support both cluster and wide-area network environments; this can be achieved via the use of a commodity transport layer protocol, namely the Stream Control Transmission Protocol (SCTP). We demonstrate how SCTP can be used to support MPI with different kinds of network devices and to provide multirailing support from the transport layer.
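For readers unfamiliar with the two API families mentioned above, this short C program (assumed to run with exactly two ranks; not code from the thesis) uses both an RMA put through a window and an ordinary two-sided send/receive.

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 writes into rank 1's window with MPI_Put, then rank 1 returns the
 * value with a plain send: the one-sided and two-sided operations whose
 * mapping onto network devices the thesis examines.  Run with two ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = -1.0, incoming = 0.0, value = 42.0;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 1) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&incoming, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got back %.1f\n", incoming);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```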
Faculty of Science, Department of Computer Science, Graduate
APA, Harvard, Vancouver, ISO, and other styles
15

Zhang, Wenbin. "Libra: Detecting Unbalance MPI Collective Calls." The Ohio State University, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=osu1313160584.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Cheng, Chih-Kai. "Java simulation of MPI collective communications." Leeds, 2001. http://www.leeds.ac.uk/library/counter2/compstmsc/20002001/cheng.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Florez-Larrahondo, German. "A trusted environment for MPI programs." Master's thesis, Mississippi State : Mississippi State University, 2002. http://library.msstate.edu/etd/show.asp?etd=etd-10172002-103135.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Mohror, Kathryn Marie. "Infrastructure For Performance Tuning MPI Applications." PDXScholar, 2004. https://pdxscholar.library.pdx.edu/open_access_etds/2660.

Full text
Abstract:
Clusters of workstations are becoming increasingly popular as a low-budget alternative for supercomputing power. In these systems, message-passing is often used to allow the separate nodes to act as a single computing machine. Programmers of such systems face a daunting challenge in understanding the performance bottlenecks of their applications. This is largely due to the vast amount of performance data that is collected, and the time and expertise necessary to use traditional parallel performance tools to analyze that data. The goal of this project is to increase the level of performance tool support for message-passing application programmers on clusters of workstations. We added support for LAM/MPI into the existing parallel performance tool, Paradyn. LAM/MPI is a commonly used, freely available implementation of the Message Passing Interface (MPI), and also includes several newer MPI features, such as dynamic process creation. In addition, we added support for non-shared filesystems into Paradyn and enhanced the existing support for the MPICH implementation of MPI. We verified that Paradyn correctly measures the performance of the majority of LAM/MPI programs on Linux clusters and show the results of those tests. In addition, we discuss MPI-2 features that are of interest to parallel performance tool developers and design support for these features for Paradyn.
APA, Harvard, Vancouver, ISO, and other styles
19

Ford, Corey. "Lazy Fault Detection for Redundant MPI." DigitalCommons@CalPoly, 2016. https://digitalcommons.calpoly.edu/theses/1561.

Full text
Abstract:
As the scale of supercomputers grows, it is becoming increasingly important for software to efficiently withstand hardware and software faults. Process replication is one resilience technique, but typical implementations require replicas to stay closely synchronized with each other. We propose algorithms to lazily detect faults in replicated MPI applications, allowing for more flexibility in replica scheduling and potential power savings. Evaluation shows that, when all processes are operated at full power, this approach allows applications to complete substantially faster as compared to using a synchronized model, and often as fast as in non-replicated execution.
APA, Harvard, Vancouver, ISO, and other styles
20

Gabriel, Edgar. "Erweiterung einer MPI-Umgebung zur Interoperabilität verteilter MPP-Systeme." [S.l.] : Universität Stuttgart , Zentrale Universitätseinrichtung (RUS, UB etc.), 1996. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB6783410.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Cooper, Ian Michael. "MPI-style Web services : an investigation into the potential of using Web services for MPI-style applications." Thesis, Cardiff University, 2009. http://orca.cf.ac.uk/54979/.

Full text
Abstract:
This research investigates the potential of the Web services architecture to act as a platform for the execution of MPI-style applications. The work in this thesis is based upon extending current Web service methodologies and merging them with ideas from other research domains, such as high performance computing. MPIWS, an API to extend the functionality of standard Web services is introduced. MPIWS provides MPI-style message passing functionality to facilitate the execution of MPI-style applications using Web service based communication protocols. The thesis then presents a large selection of experiments that perform a comprehensive evaluation of MPIWS's performance. This performance is compared with an existing MPI implementation that has the option of transmitting data either via Java serialised objects, or via the Java native interface to an underlying C implementation of MPI. From the results obtained from these experiments, it can be concluded that using MPIWS for applications requiring MPI-style message passing between services is potentially a practical and efficient way of distributing coarse grained parallel applications. The results also show that the use of collective communication techniques within the Web services architecture can significantly improve the efficiency of suitable applications such as molecular dynamics simulation. MPI-style communication can also be used to enhance the performance of Web service based workflow execution. Tests conducted have evaluated a range of functionality that can be provided by the MPIWS tool. This evaluation shows that direct messaging between services, without sending data via the workflow manager, can improve the efficiency of Web service based workflow execution.
APA, Harvard, Vancouver, ISO, and other styles
22

Hoefler, Torsten, Mirko Reinhardt, Frank Mietke, Torsten Mehlan, and Wolfgang Rehm. "Low Overhead Ethernet Communication for Open MPI on Linux Clusters." Universitätsbibliothek Chemnitz, 2006. http://nbn-resolving.de/urn:nbn:de:swb:ch1-200601112.

Full text
Abstract:
This paper describes the basic concepts of our solution to improve the performance of Ethernet communication in a Linux cluster environment by introducing Reliable Low Latency Ethernet Sockets. We show that about 25% of the socket latency can be saved by using our simplified protocol. In particular, we put emphasis on demonstrating that this performance benefit is able to speed up MPI-level communication. Therefore we have developed a new BTL component for Open MPI, an open-source MPI-2 implementation which, with its Modular Component Architecture, offers a nearly ideal environment in which to implement our changes. Microbenchmarks of MPI collective and point-to-point operations were performed. We see a performance improvement of 8% to 16% for the LU and SP implementations of the NAS parallel benchmark suite, which spend a significant amount of time in MPI. Practical application tests with Abinit, an electronic structure calculation program, show that the runtime can be nearly halved on a 4-node system. Thus we show evidence that our new Ethernet communication protocol is able to increase the speedup of parallel applications considerably.
APA, Harvard, Vancouver, ISO, and other styles
23

Nagel, Wolfgang E., Alfred Arnold, Michael Weber, Hans-Christian Hoppe, and Karl Solchenbach. "VAMPIR: Visualization and Analysis of MPI Resources." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2010. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-26639.

Full text
Abstract:
Performance analysis is most often based on detailed knowledge of program behavior. One option to get this information is tracing. Based on the research tool PARvis, the visualization environment VAMPIR was developed at KFA; it now supports the new message passing standard MPI. VAMPIR translates a given trace file into a variety of graphical views, e.g., state diagrams, activity charts, time-line displays, and statistics. Moreover, it supports an animation mode that can help to locate performance bottlenecks, and it provides flexible filter operations to reduce the amount of information displayed. The most interesting part of VAMPIR is the powerful zooming feature that allows the user to identify problems at any level of detail.
APA, Harvard, Vancouver, ISO, and other styles
24

Kubiš, Milan. "Optimalizace sběrného výfukového potrubí Škoda 1,2 MPI." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2015. http://www.nusl.cz/ntk/nusl-230446.

Full text
Abstract:
The subject of this diploma thesis is the optimization of the exhaust manifold of the ŠKODA 1,2 MPI engine with respect to plastic deformation under thermal stress. The first part focuses on a general description of the converter module, of which the exhaust manifold is a component. The next part of the thesis presents the computation of the heat load of the exhaust manifold. The last part is devoted to the seal analysis of the whole converter module of the ŠKODA three-cylinder engine.
APA, Harvard, Vancouver, ISO, and other styles
25

Grabowsky, L., Th Ermer, and J. Werner. "Nutzung von MPI für parallele FEM-Systeme." Universitätsbibliothek Chemnitz, 1998. http://nbn-resolving.de/urn:nbn:de:bsz:ch1-199801365.

Full text
Abstract:
The Message Passing Interface (MPI) standard gives the developer of parallel applications a powerful tool for designing software efficiently and largely independently of the details of the parallel system. As part of a project, the communication library of an existing FEM program was ported to the MPI mechanism. The results are summarized in the description of the Cubecom implementation given here. The second part of this work investigates how the coupling-boundary communication can also be carried out with a uniform and efficient scheme using the functionality available in MPI. The efficiency of both the base implementation and the MPI-based coupling-boundary communication is examined, and an outlook on further possible applications is given.
APA, Harvard, Vancouver, ISO, and other styles
26

Nakashima, Raul Junji. "Paralelização de programas sisal para sistemas MPI." Universidade de São Paulo, 1996. http://www.teses.usp.br/teses/disponiveis/76/76132/tde-06052008-105502/.

Full text
Abstract:
Este trabalho teve como finalidade a implementação de um método para a paralelização parcial de programas, escritos na linguagem funcional, SISAL utilizando as bibliotecas do padrão MPI (Message Passing Interface). Para tal, propusemos a transformação dos programas SISAL através do particionamento do loop paralelo forall, através do método de particionamento slice e a utilização do modelo de implementação do paralelismo SPMD (Single Program Multiple Data) no estilo de programas mestre/escravo. A validação de nossa proposta foi obtida através da realização de testes onde foram comparados os resultados obtidos com os programas originais e os programas com as alterações propostas
This work describes a method for the partial parallelization of SISAL programs into programs with calls to MPI routines. We focused on the parallelization of the forall loop (through slicing of the index range). The generated code is a master/slave SPMD program. The work was validated through the compilation of some simple SISAL programs and comparison of the results with an unmodified version
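A minimal C sketch of the slice partitioning described above: the index range of a forall is split into contiguous blocks, each rank evaluates its block, and the master gathers the results. The loop body is a placeholder, not generated SISAL code.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                /* assumes size divides N evenly */
    int lo = rank * chunk;

    double local[chunk], result[N];
    for (int i = 0; i < chunk; i++)      /* the body of the original forall */
        local[i] = (double)(lo + i) * (lo + i);

    /* Master/slave SPMD style: the master (rank 0) reassembles the slices. */
    MPI_Gather(local, chunk, MPI_DOUBLE, result, chunk, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result[N-1] = %f\n", result[N - 1]);

    MPI_Finalize();
    return 0;
}
```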
APA, Harvard, Vancouver, ISO, and other styles
27

Ignatenko, S. N., and S. A. Petrov. "Application of mpi technology for allocated calculations." Thesis, Вид-во СумДУ, 2009. http://essuir.sumdu.edu.ua/handle/123456789/17001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Kühnemann, Matthias, Thomas Rauber, and Gudula Rünger. "Optimizing MPI Collective Communication by Orthogonal Structures." Universitätsbibliothek Chemnitz, 2007. http://nbn-resolving.de/urn:nbn:de:swb:ch1-200701061.

Full text
Abstract:
Many parallel applications from scientific computing use MPI collective communication operations to collect or distribute data. Since the execution times of these communication operations increase with the number of participating processors, scalability problems might occur. In this article, we show for different MPI implementations how the execution time of collective communication operations can be significantly improved by a restructuring based on orthogonal processor structures with two or more levels. As platform, we consider a dual Xeon cluster, a Beowulf cluster and a Cray T3E with different MPI implementations. We show that the execution time of operations like MPI Bcast or MPI Allgather can be reduced by 40% and 70% on the dual Xeon cluster and the Beowulf cluster. But also on a Cray T3E a significant improvement can be obtained by a careful selection of the processor groups. We demonstrate that the optimized communication operations can be used to reduce the execution time of data parallel implementations of complex application programs without any other change of the computation and communication structure. Furthermore, we investigate how the execution time of orthogonal realization can be modeled using runtime functions. In particular, we consider the modeling of two-phase realizations of communication operations. We present runtime functions for the modeling and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
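The following self-contained C sketch shows the kind of two-level restructuring evaluated in the article, using MPI_Comm_split to form row and column communicators and broadcasting in two phases; the grid shape is a hard-coded assumption.

```c
#include <mpi.h>

/* Two-level ("orthogonal") broadcast: arrange the processes in a logical
 * grid, broadcast first along the root's column, then along every row. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int cols = 4;                        /* assumes size is a multiple of 4 */
    int row = rank / cols, col = rank % cols;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    double buf[1024] = {0};

    /* Phase 1: the root (world rank 0 = row 0, col 0) sends down its column. */
    if (col == 0)
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, col_comm);

    /* Phase 2: every column head forwards the data along its row. */
    MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```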
APA, Harvard, Vancouver, ISO, and other styles
29

Kazilas, Panagiotis. "Augmenting MPI Programming Process with Cognitive Computing." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-88913.

Full text
Abstract:
Cognitive Computing is a new and quickly advancing technology. In the last decade Cognitive Computing has been used to assist researchers in their endeavors in many different scientific fields such as health and medicine, education, marketing, psychology and financial services. On the other hand, parallel programming is a more complex concept than sequential programming. The additional complexity of parallel programming is introduced by its nature, which requires the implementation of more complex algorithms and introduces additional concepts to developers, namely the communication between the processes that execute the parallel program (distributed-memory systems) and their synchronization (shared-memory systems). As a result of this additional complexity, many novice developers are reserved in their attempts to implement parallel programs. The objective of this research project was to investigate whether we can assist the parallel programming process through cognitive computing solutions. In order to achieve our objective, the MPI Assistant, a Q&A system, has been developed, and a case study has been carried out to determine our application's efficiency in our attempt to assist parallel programming developers. The case study showed that our MPI Assistant system indeed helped developers reduce the time they spend to develop their solutions, but not improve the quality of the program or its efficiency, as these improvements require features that are out of this research project's scope. However, the case study had a limited number of participants, which may affect our results' reliability. As a next step in our attempt to determine whether cognitive computing technologies are able to assist developers in their parallel programming development, we investigated whether cognitive solutions can extract better and more complete responses compared to the manually created responses we wrote for the MPI Assistant. We experimented with two different approaches to the problem: an approach where we manually created responses for the MPI Assistant, and an approach where we investigated whether cognitive solutions can automatically extract better and more complete responses. We compared the quality of the latter automatic responses with the quality of the former, manually created ones.
APA, Harvard, Vancouver, ISO, and other styles
30

Wang, Liqiang. "An Efficient Platform for Large-Scale MapReduce Processing." ScholarWorks@UNO, 2009. http://scholarworks.uno.edu/td/963.

Full text
Abstract:
In this thesis we proposed and implemented MMR, a new open-source MapReduce model built on MPI for parallel and distributed programming. MMR combines Pthreads, MPI and Google's MapReduce processing model to support multi-threaded as well as distributed parallelism. Experiments show that our model significantly outperforms the leading open-source solution, Hadoop. It demonstrates linear scaling for CPU-intensive processing and even super-linear scaling for indexing-related workloads. In addition, we designed an MMR live DVD which facilitates the automatic installation and configuration of a Linux cluster with an integrated MMR library, enabling the development and execution of MMR applications.
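To give the flavour of MapReduce expressed directly over MPI, here is a toy C sketch in which each rank "maps" over its share of the input and a single MPI_Reduce performs the "reduce" phase; it illustrates the concept only and does not use the MMR API.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Map: pretend each rank scanned its own chunk of the input records. */
    long local_count = 1000 + rank;      /* placeholder per-rank result */

    /* Reduce: combine the per-rank counts at rank 0. */
    long total = 0;
    MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total records: %ld (from %d mappers)\n", total, size);

    MPI_Finalize();
    return 0;
}
```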
APA, Harvard, Vancouver, ISO, and other styles
31

Almeida, Alexandre Vinicius. "Uso de auto-tuning para otimização de decomposição de domínios paralela." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2011. http://hdl.handle.net/10183/39121.

Full text
Abstract:
O desenvolvimento de aplicações de forma a atingir níveis de desempenho próximos aos níveis teóricos de uma determinada plataforma é uma tarefa que exige conhecimento técnico do ambiente de hardware, uma vez que o software deve explorar detalhes específicos da plataforma em questão. Pelo fato do software ser específico à plataforma, caso ela evolua ou se altere, as otimizações realizadas podem não explorar a nova arquitetura de forma eficiente. Auto-tuners são sistemas que surgiram como um meio automatizado de adaptar um determinado software a uma arquitetura alvo. Essa adaptação ocorre através de uma busca empírica de valores ótimos para parâmetros específicos de uma aplicação, a fim de ajustá-los às características do hardware, ou ainda através da geração de códigofonte otimizado para a plataforma. Este trabalho propõe um módulo auto-tuner orientado à adaptação parametrizada de uma aplicação paralela, que trabalha variando os fatores da dimensão do domínio bidimensional, o número de processos e a extensão das regiões de sobreposição. Para cada variação dos fatores, o auto-tuner testa a aplicação na arquitetura paralela de forma a buscar a combinação de parâmetros com melhor desempenho. Para possibilitar o auto-tuning, foi desenvolvida uma classe em linguagem C++ denominada Mesh, baseada no padrão MPI. A classe busca abstrair a decomposição de domínios de uma aplicação paralela por meio do uso de Orientação a Objetos, e facilita a variação da extensão das regiões de sobreposição entre os subdomínios. Os resultados experimentais demonstraram que o auto-tuner explora o ganho de desempenho pela variação do número de processos da aplicação, que também é tratado pelo módulo auto-tuner. A arquitetura paralela utilizada na validação não se mostrou ideal para uma otimização através do aumento da extensão das regiões sobrepostas entre subdomínios.
Achieving the peak performance level of a particular platform requires technical knowledge of the hardware environment involved, since the software must exploit specific details inherent to the hardware. Once the software is optimized for a target platform, if the hardware evolves or is changed, the software would probably not be as efficient in the new environment. This performance portability problem is addressed by software auto-tuning, which emerged in the past decade as an automated technique to adapt software to the underlying hardware. The software adaptation is performed by an auto-tuner. The auto-tuner is an entity that empirically adjusts specific application parameters in order to improve the overall application performance, or even generates source code optimized for the target platform. This dissertation proposes an auto-tuner to optimize the domain decomposition of a parallel application that performs stencil computations. The proposed auto-tuner works in a parameterized adaptation fashion and varies the dimensions of a 2D domain, the number of parallel processes and the extension of the overlapping zones between subdomains. For each combination of parameter values, the auto-tuner probes the application on the parallel architecture in order to seek the best combination of values. In order to make auto-tuning possible, we propose a C++ class called Mesh, based on the Message Passing Interface (MPI) standard. The role of this class is to abstract the domain decomposition from the application using the Object Orientation facilities provided by C++, and also to enable the extension of the overlapping zones between subdomains. The experimental results showed that the performance gains were mainly due to the variation of the number of processes, which was one of the application factors dealt with by the auto-tuner. The parallel architecture used in the experiments proved not to be adequate for optimizing the domain decomposition by increasing the extension of the overlapping zones.
APA, Harvard, Vancouver, ISO, and other styles
32

Dickov, Branimir. "MPI layer techniques to improve network energy efficiency." Doctoral thesis, Universitat Politècnica de Catalunya, 2015. http://hdl.handle.net/10803/334181.

Full text
Abstract:
Interconnection networks represent the backbone of large-scale parallel systems. In order to build ultra-scale supercomputers, larger interconnection networks are being designed and deployed. As compute nodes become more energy-efficient, the interconnect is accounting for an increasing proportion of the total system energy consumption. The interconnect's energy consumption is, however, only starting to receive serious attention. Most of this power consumption is due to the interconnection links. The problem, in terms of power, with an interconnect link is that its power consumption is almost constant whether or not it is actively exchanging data, since both ends stay active to maintain synchronization. This thesis complements ongoing efforts related to power reduction and energy proportionality of the interconnection network. The thesis contemplates two directions for power savings in the interconnection network: one is the possibility to use lower-bandwidth links during the communication phases and thus save energy, while the second addresses shifting links to low-power mode during computation phases when they are unused. To address the first one, we investigate the potential benefits of MPI data compression. When compression of MPI data is possible, the reduction in link bandwidth is enabled without incurring any performance penalty. Consequently, lower bandwidth leads to lower link energy consumption. In the past, several compression techniques have been proposed as a way to improve the performance and scalability of parallel applications. Those works have shown significant speed-ups when applying compressors to the MPI transfers of certain algorithmic kernels. However, these techniques have not seen widespread adoption in current supercomputers. In this thesis we will show that although data compression naturally leads to improved performance, the benefit is small for modern high-performance networks, and it varies greatly between applications. In contrast, combining data compression with switching to low-power mode preserves performance while delivering effective and consistent energy savings, in proportion with the reduction in data rate. In general, application developers view time spent in communication as an overhead, and therefore strive to keep it to a minimum. This leads to high peak bandwidth demand and latency sensitivity, but low average utilization, which provides significant opportunities for energy savings. It is therefore possible to save energy using low-power modes, but link wake-up latencies must not lead to a loss in performance. Thus, we propose a mechanism that can accurately predict when links are idle, allowing them to be switched to a more power-efficient mode. Our runtime system, called the Pattern Prediction System (PPS), can accurately predict not only when a link will become unused but also when it will become active again, allowing links to be switched off during the idle periods and switched back on again in time to avoid incurring a significant performance degradation. Many HPC applications benefit from prediction, since they have repetitive computation and communication phases. Because the energy-saving mechanisms are implemented inside the MPI library, existing MPI programs do not need to be modified.
We also develop a more advanced version of the prediction system, the Self-Tuned Pattern Prediction System (SPPS), which is capable of automatically tuning itself to the communication characteristics of the current application and shaping the switching on/off of the links in the most appropriate way. The proposed compression and prediction techniques are evaluated using an event-driven simulator, which is able to replay traces from real executions of MPI applications. Experimental results show significant energy savings in the IB links, while the performance overhead due to wake-up latencies and additional computation time has a negligible effect on the final application performance.
In recent years, the energy consumption of the interconnection network has come to be regarded as one of the factors that may condition the race towards Exascale systems. Within the interconnection network, most of this energy consumption is due to the network links, whose consumption remains constant regardless of whether data are being actively exchanged, since both endpoints must stay active to maintain synchronization. This thesis complements the research efforts currently being carried out internationally with the aim of reducing power and achieving energy consumption proportional to the bandwidth required by communications. Two complementary directions are considered to reach these goals: on the one hand, using only the bandwidth needed during communication phases; and, on the other, using low-power modes during computation phases in which the interconnection network is not required. To address the first, the potential benefits of compressing the data transferred in MPI messages are investigated. When compression is possible, communication can be carried out with a lower link-bandwidth requirement without necessarily incurring a penalty in application performance. Several compression techniques have been proposed in the literature with the aim of reducing communication time and improving the scalability of parallel applications. Although these techniques have shown significant potential in certain computational kernels, they have not been adopted in real systems. This thesis shows how data compression in MPI messages can reduce energy consumption by reducing the number of active links required to carry out the communication, in proportion to the reduction in the number of bytes that must be transferred. In general, application developers regard time spent in communication as an unnecessary expense and therefore strive to keep it to a minimum. This leads to a demand for bandwidth able to absorb peak traffic and to latency sensitivity, but with a low median utilization, which offers significant opportunities for energy savings. It is therefore possible to save energy by relying on low-power modes, provided that the link reactivation latencies do not cause a loss of performance. This thesis proposes a mechanism that accurately predicts the idle periods of the links, allowing them to be switched to the most energy-efficient mode available in the network infrastructure. The proposal operates at runtime and is called the Pattern Prediction System (SPP). SPP accurately predicts not only when a link becomes unused, but also when its reactivation will be required, allowing links to enter low-power mode during idle periods and to become active again in time, avoiding significant performance degradation. Many HPC (High-Performance Computing) applications can benefit from this prediction, since they have repetitive computation and communication phases.
By implementing the energy-saving mechanisms inside the MPI library, existing MPI programs require no modification. In this thesis we also develop a more advanced version of the prediction system, called the Self-Tuned Pattern Prediction System (SPPA), which additionally adjusts autonomously one of the important SPP parameters, the one that determines the degree of message aggregation in the prediction algorithm.
APA, Harvard, Vancouver, ISO, and other styles
33

Hagen, Knut Imar. "Fault-tolerance for MPI Codes on Computational Clusters." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2007. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8728.

Full text
Abstract:

This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application runs on a very large cluster with thousands of processors, it is likely that a process will crash due to a hardware or software failure. Fault-tolerance is the ability of a system to respond gracefully to an unexpected hardware or software failure. A test application which is meant to run for several weeks on several nodes is used in this thesis. The application is a seismic MPI application, written in Fortran 90. This application was provided by Statoil, who wanted a fault-tolerant implementation. The original test application had no degree of fault-tolerance; if one process or one node crashed, the entire application also crashed. In this thesis, a collection of fault-tolerance techniques is analysed, including checkpointing, MPI Error handlers, extending MPI, replication, fault detection, atomic clocks and multiple simultaneous failures. Several MPI implementations are described, such as MPICH1, MPICH2, LAM/MPI and Open MPI. Next, some fault-tolerant products developed at other universities are described, such as FT-MPI, FEMPI, MPICH-V including its five protocols, the fault-tolerant functionality of Open MPI, and MPI Error handlers. A fault-tolerant simulator which simulates the application's behaviour is developed. The simulator uses two fault-tolerance methods: FT-MPI and MPI Error handlers. Next, our test application is similarly made fault-tolerant with FT-MPI using three proposed approaches: MPI_Reduce(), MPI_Barrier(), and the final and current implementation, MPI Loop. Tests of the MPI Loop implementation are run on a small and a large cluster to verify the fault-tolerant behaviour. The seismic application survives a crash of n-2 nodes/processes. Process number 0 must stay alive since it acts as an I/O server, and there must be at least one process left to compute data. Processes can also be restarted rather than left out, but the test application needs to be modified to support this.
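To illustrate one of the techniques surveyed in this abstract, the following minimal C sketch (illustrative only, not code from the thesis) shows how an MPI error handler can be installed on a communicator so that a failed communication call returns an error code to the application instead of aborting the whole job:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* By default MPI aborts on errors; switch the world communicator
           to return error codes so the application can react to failures. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank, sum = 0;
        int rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
            /* Application-specific recovery (e.g. restart from a checkpoint)
               would go here. */
        }

        MPI_Finalize();
        return 0;
    }

FT-MPI-style recovery would go further and rebuild the communicator after a failure; the sketch only shows the point at which the application regains control.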

APA, Harvard, Vancouver, ISO, and other styles
34

Karlbom, David. "A Performance Evaluation of MPI Shared Memory Programming." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188676.

Full text
Abstract:
The thesis investigates the Message Passing Interface (MPI) support for shared memory programming on modern hardware architecture with multiple Non-Uniform Memory Access (NUMA) domains. We investigate its performance in two case studies: matrix-matrix multiplication and Conway’s game of life. We compare MPI shared memory performance in terms of execution time and memory consumption with the performance of implementations using OpenMP and MPI point-to-point communication, also called "MPI two-sided". We perform strong scaling tests in both test cases. We observe that the MPI two-sided implementation is 21% and 18% faster than the MPI shared and OpenMP implementations respectively for matrix-matrix multiplication when using 32 processes. MPI shared uses less memory space: when compared to MPI two-sided, MPI shared uses 45% less memory. In Conway’s game of life, we find that the MPI two-sided implementation is 10% and 82% faster than the MPI shared and OpenMP implementations respectively when using 32 processes. We also observe that not mapping virtual memory to a specific NUMA domain can lead to an increase in execution time of 64% when using 32 processes. The use of MPI shared is viable for intranode communication on modern hardware architecture with multiple NUMA domains.
In this Master's thesis we investigate the Message Passing Interface (MPI) support for shared memory programming on modern hardware architecture with multiple Non-Uniform Memory Access (NUMA) domains. We investigate performance using two case studies: matrix-matrix multiplication and Conway's game of life. We compare the performance of MPI shared, in terms of execution time and memory consumption, against OpenMP and MPI point-to-point communication, also known as MPI two-sided. We perform strong scaling tests for both case studies. We observe that MPI two-sided is 21% faster than MPI shared and 18% faster than OpenMP for matrix-matrix multiplication when 32 processes are used. For the same test data, MPI shared has a 45% lower memory consumption than MPI two-sided. For Conway's game of life, MPI two-sided is 10% faster than MPI shared and 82% faster than the OpenMP implementation when using 32 processes. We could also discern that not mapping virtual memory to a specific NUMA domain leads to an increase in execution time of up to 64% when 32 processes are used. We conclude that MPI shared is usable for intranode communication on modern hardware architecture with multiple NUMA domains.
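As a hedged illustration of the MPI shared memory model evaluated above (a sketch assuming one shared window per node; not code from the thesis), the following C fragment allocates a window with MPI_Win_allocate_shared so that ranks on the same node can access each other's data with plain loads and stores:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Group the ranks that share a node so they can share memory directly. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* Allocate a shared window; every rank on the node can load/store it. */
        const MPI_Aint local_cells = 1024;
        double *my_slice;
        MPI_Win win;
        MPI_Win_allocate_shared(local_cells * sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node_comm, &my_slice, &win);

        /* Query rank 0's base address to address the node's whole segment. */
        MPI_Aint size;
        int disp_unit;
        double *base;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &base);
        (void)base;   /* base points at the start of the contiguous shared segment */

        MPI_Win_lock_all(0, win);
        my_slice[0] = (double)node_rank;   /* direct stores, no message passing */
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }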
APA, Harvard, Vancouver, ISO, and other styles
35

Sihota, Amit Kaur. "Conjugate gradient methods using MPI for distributed systems." Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=81569.

Full text
Abstract:
The expanding use of distributed multi-processor supercomputers has made a significant impact on the speed and complexity of the problems which can be solved by the finite element (FE) method. Moreover, the use of the standard Message Passing Interface (MPI) protocol has facilitated the portability of parallel applications across a wide variety of parallel architectures. The conjugate gradient method plays an important role in solving discretized partial differential equations resulting from FE methods. In this thesis, a scalability and performance analysis of three versions of a parallel conjugate gradient solver is carried out on a Sun symmetric multiprocessor system with 4 processors, using problems stemming from finite element analysis. The implemented solvers demonstrate good scaling behaviour for matrices with different sparsity patterns stemming from practical finite element applications.
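A minimal sketch of the communication pattern that dominates such a distributed conjugate gradient solver (illustrative only; the function name and the row-wise distribution of the vectors are assumptions, not taken from the thesis): each rank computes a local partial dot product and MPI_Allreduce combines the contributions.

    #include <mpi.h>

    /* Global dot product over row-distributed vectors: each rank reduces its
       local contribution, then MPI_Allreduce combines them across all ranks. */
    double dot(const double *x, const double *y, int n_local, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; ++i)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }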
APA, Harvard, Vancouver, ISO, and other styles
36

Aguilar, Xavier. "Towards Scalable Performance Analysis of MPI Parallel Applications." Licentiate thesis, KTH, High Performance Computing and Visualization (HPCViz), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-165043.

Full text
Abstract:
A considerable fraction of scientific discovery nowadays relies on computer simulations. High Performance Computing (HPC) provides scientists with the means to simulate processes ranging from climate modeling to protein folding. However, achieving good application performance and making optimal use of HPC resources is a heroic task due to the complexity of parallel software. Therefore, performance tools and runtime systems that help users to execute applications in the most efficient way are of utmost importance in the landscape of HPC. In this thesis, we explore different techniques to tackle the challenges of collecting, storing, and using fine-grained performance data. First, we investigate the automatic use of real-time performance data in order to run applications in an optimal way. To that end, we present a prototype of an adaptive task-based runtime system that uses real-time performance data for task scheduling. This runtime system has a performance monitoring component that provides real-time access to the performance behavior of an application while it runs. The implementation of this monitoring component is presented and evaluated within this thesis. Secondly, we explore lossless compression approaches for MPI monitoring. One of the main problems that performance tools face is the huge amount of fine-grained data that can be generated from an instrumented application. Collecting fine-grained data from a program is the best method to uncover the root causes of performance bottlenecks; however, it is infeasible for extremely parallel applications or applications with long execution times. On the other hand, collecting coarse-grained data is scalable but sometimes not enough to discern the root cause of a performance problem. Thus, we propose a new method for performance monitoring of MPI programs using event flow graphs. Event flow graphs incur very low overhead in terms of execution time and storage size, and can be used to reconstruct fine-grained trace files of application events ordered in time.


APA, Harvard, Vancouver, ISO, and other styles
37

Saifi, Mohamad Maamoun El. "PMPI: uma implementação MPI multi-plataforma, multi-linguagem." Universidade de São Paulo, 2006. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-08122006-154811/.

Full text
Abstract:
This dissertation presents PMPI, an implementation of the MPI standard on heterogeneous platforms. Unlike other MPI implementations, PMPI allows a parallel application to run on a multi-platform system and allows programs written in different programming languages to take part in the same computation. PMPI is built on top of the Dotnet Framework. With PMPI, the processing nodes call MPI functions that are transparently executed on other nodes participating in the parallel computation over the communication network. PMPI can span multiple geographically distributed administrative domains. To programmers, the grid looks like a local MPI computation; the computation model is indistinguishable from standard MPI computation. This dissertation studies the implementation of PMPI with the Microsoft Dotnet Framework and with MONO in order to provide a library that supports a multi-language, multi-platform environment. The results obtained from tests executed on heterogeneous systems using PMPI are analyzed. They show that the PMPI implementation is a viable solution, with several advantages that can still be explored further.
This dissertation describes PMPI, an implementation of the MPI standard on a heterogeneous platform. Unlike other MPI implementations, PMPI permits MPI computation to run on a multiplatform system. In addition, PMPI permits programs executing on different nodes to be written in different programming languages. PMPI is built on top of the Dotnet framework. With PMPI, nodes call MPI functions that are transparently executed on the participating nodes across the network. PMPI can span multiple administrative domains distributed geographically. To programmers, the grid looks like a local MPI computation. The model of computation is indistinguishable from that of standard MPI computation. This dissertation studies the implementation of PMPI with the Microsoft Dotnet framework and the MONO Dotnet framework to provide a common layer for a multi-language, multi-platform MPI library. Results obtained from tests running PMPI on a heterogeneous system are analyzed. They show that the PMPI implementation is feasible and has many advantages that can be explored further.
APA, Harvard, Vancouver, ISO, and other styles
38

Michel, Martial. "Contribution au transfert de données : application à MPI." Nancy 1, 2001. http://www.theses.fr/2001NAN10197.

Full text
Abstract:
MPI is a standard defining a library based on the message-passing concept for solving parallel problems. Handling data types composed of several basic types is difficult with MPI; it is a long and repetitive task, which is solved with AutoMap. MPI also lacks mechanisms for the automatic transfer of pointer-linked data types; this is solved with AutoLink. These two tools form the MPI Data-Types Tools and constitute the thesis work discussed in this document. We discuss the guided evolution of the tools and their respective algorithms, describe how the serialization function works, present the experimental results of the performance tests carried out, discuss how the tools are used, compare their principles to tools based on similar principles, and finally explain the possible evolutions of the current algorithms.
MPI is a standard that defines a library based upon the message-passing concept, aimed at solving parallel problems by allowing direct communication between tasks. MPI has many basic data types available, but the creation of data types composed of other basic data types is a long and repetitive process, eased by AutoMap. MPI has no mechanism for the automatic transfer of pointer-linked data types: AutoLink is a library developed to answer the need for the adaptation and transfer of such structures. These tools compose the MPI Data-Types Tools, and they are the work discussed in this PhD thesis. We present the evolution of the inner mechanisms as well as the algorithms, describe the way AutoLink produces serialized data to be transferred using buffers to enhance communications, present experimental performance studies, explain uses of the MPI Data-Types Tools, compare them with other similar tools, and finally introduce possible extensions to the current algorithms.
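The following C sketch (illustrative only; the particle structure and the function name are assumptions) shows the kind of hand-written derived-datatype registration that a tool such as AutoMap automates:

    #include <mpi.h>
    #include <stddef.h>

    /* Example user type; generating this boilerplate is the repetitive task
       that AutoMap takes over. */
    struct particle {
        double pos[3];
        double vel[3];
        int    id;
    };

    /* Hand-written registration of the type with MPI. */
    MPI_Datatype make_particle_type(void)
    {
        int          blocklens[3] = {3, 3, 1};
        MPI_Aint     displs[3]    = {offsetof(struct particle, pos),
                                     offsetof(struct particle, vel),
                                     offsetof(struct particle, id)};
        MPI_Datatype types[3]     = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};

        MPI_Datatype particle_type;
        MPI_Type_create_struct(3, blocklens, displs, types, &particle_type);
        MPI_Type_commit(&particle_type);
        return particle_type;   /* usable in MPI_Send/MPI_Recv and collectives */
    }

Pointer-linked structures (lists, trees) cannot be described this way at all, which is the gap AutoLink's serialization addresses.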
APA, Harvard, Vancouver, ISO, and other styles
39

Liu, Jiuxing. "Designing high performance and scalable MPI over InfiniBand." The Ohio State University, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=osu1095296555.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Varia, Siddharth. "REGULARIZED MARKOV CLUSTERING IN MPI AND MAP REDUCE." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1374153215.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Mosch, Marek Höfler Torsten. "Entwicklung einer optimierten kollektiven Komponente für Open MPI." [S.l. : s.n.], 2007.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
42

Ribeiro, Hethini do Nascimento. "Paralelização do algoritmo DIANA com OpenMP e MPI." Universidade Estadual Paulista (UNESP), 2018. http://hdl.handle.net/11449/157280.

Full text
Abstract:
At the beginning of this decade there were about 5 billion phones in use generating data. This global production increased by approximately 40% per year at the beginning of the last decade. These large data sets that can be captured, communicated, aggregated, stored and analyzed, also called Big Data, are posing inevitable challenges in many areas and, in particular, in the Machine Learning field. Machine Learning algorithms are able to extract useful information from these large data repositories, and for this reason their study is becoming increasingly important. The programs capable of performing this task can be called classification and clustering algorithms. These applications are computationally expensive. To cite some examples of this cost, the Quality Threshold Clustering algorithm has, in the worst case, complexity O(n⁵). The hierarchical algorithms AGNES and DIANA, in turn, have complexity O(n²) and O(2ⁿ) respectively. There is therefore a great challenge, which is to process large amounts of data within a realistic period of time, encouraging the development of parallel algorithms suited to the volume of data. The objective of this work is to present the parallelization of the divisive hierarchical algorithm DIANA. The algorithm was implemented in MPI and OpenMP, becoming up to three times faster than the single-process version, showing that although distributed-memory environments require synchronization and message exchange, for a certain degree of parallelism it is advantageous to apply this kind of optimization to this algorithm.
Earlier in this decade there were about 5 billion phones in use generating data. This global production increased by approximately 40% per year at the beginning of the last decade. These large data sets that can be captured, communicated, aggregated, stored and analyzed, also called Big Data, are posing inevitable challenges in many areas, and in particular in the Machine Learning field. Machine Learning algorithms are able to extract useful information from these large data repositories, and for this reason their study is becoming increasingly important. The programs that can perform this task can be called classification and clustering algorithms. These applications are computationally expensive. To cite some examples of this cost, the Quality Threshold Clustering algorithm has, in the worst case, complexity O(n⁵). The hierarchical algorithms AGNES and DIANA, in turn, have complexity O(n²) and O(2ⁿ) respectively. Thus, there is a great challenge, which is to process large amounts of data in a realistic period of time, encouraging the development of parallel algorithms that fit the volume of data. The objective of this work is to present the parallelization of the DIANA divisive hierarchical algorithm. The algorithm was implemented in MPI and OpenMP, running up to three times faster than the single-process version, showing that although distributed-memory environments require synchronization and message exchange, for a certain degree of parallelism it is advantageous to apply this kind of optimization to this algorithm.
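A hedged sketch of the hybrid MPI+OpenMP decomposition described above (illustrative only; the function name, data layout and the use of MPI_MAXLOC are assumptions, not taken from the dissertation): MPI ranks share the rows of the dissimilarity matrix, OpenMP threads process rows within a rank, and a global reduction selects the object DIANA uses to start splitting a cluster.

    #include <mpi.h>
    #include <omp.h>

    /* Each MPI rank owns rows [row_lo, row_hi) of the n x n dissimilarity
       matrix; OpenMP threads parallelize the per-row averages; MPI_MAXLOC
       finds the object with the largest average dissimilarity globally. */
    double max_avg_dissimilarity(const double *diss, int n, int row_lo, int row_hi,
                                 int *arg_global, MPI_Comm comm)
    {
        double local_best = -1.0;
        int local_arg = -1;

        #pragma omp parallel
        {
            double best = -1.0;
            int arg = -1;
            #pragma omp for nowait
            for (int i = row_lo; i < row_hi; ++i) {
                double sum = 0.0;
                for (int j = 0; j < n; ++j)
                    sum += diss[(size_t)i * n + j];
                double avg = sum / (n - 1);
                if (avg > best) { best = avg; arg = i; }
            }
            #pragma omp critical
            if (best > local_best) { local_best = best; local_arg = arg; }
        }

        struct { double val; int idx; } in = { local_best, local_arg }, out;
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, comm);
        *arg_global = out.idx;
        return out.val;
    }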
APA, Harvard, Vancouver, ISO, and other styles
43

Hoefler, Torsten. "Fast Barrier Synchronization for InfiniBand." Universitätsbibliothek Chemnitz, 2006. http://nbn-resolving.de/urn:nbn:de:swb:ch1-200600019.

Full text
Abstract:
Barrier Synchronization is crucial for many parallel systems. This talk introduces different synchronization mechanisms and demonstrates new approaches to leverage special hardware properties of InfiniBand to lower the Barrier latency.
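For context, a classic software barrier against which hardware-assisted approaches are typically compared is the dissemination barrier; the following C sketch over MPI point-to-point calls is illustrative only and is not the InfiniBand mechanism proposed in the talk:

    #include <mpi.h>

    /* Dissemination barrier: in round k, rank r signals rank (r + 2^k) mod P
       and waits for a signal from (r - 2^k) mod P. After ceil(log2(P)) rounds
       every rank has (transitively) heard from all others. */
    void dissemination_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int dist = 1; dist < size; dist <<= 1) {
            int to   = (rank + dist) % size;
            int from = (rank - dist + size) % size;
            MPI_Sendrecv(NULL, 0, MPI_BYTE, to,   0,
                         NULL, 0, MPI_BYTE, from, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }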
APA, Harvard, Vancouver, ISO, and other styles
44

Šeinauskas, Vytenis. "Lygiagrečių programų efektyvumo tyrimas." Master's thesis, Lithuanian Academic Libraries Network (LABT), 2008. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2008~D_20080811_151827-94348.

Full text
Abstract:
This Master's thesis is devoted to analysing the efficiency of parallel programs using purpose-built software for studying parallel program performance. The main goal of the work is to develop, study and apply educational software intended for the analysis of parallel programs. To this end, the capabilities of the developed program were investigated, and improvements to the software were planned and carried out. Studies of example parallel programs were also performed using the developed software, in order to demonstrate ways of analysing the efficiency of parallel programs and the capabilities of the developed performance analysis software.
Parallel program execution is often used to overcome the constraints of processing speed and memory size when executing complex and time-consuming algorithms. The downside of this approach is the increased overall complexity of programs and their implementations. Parallel execution introduces a new class of software bugs and performance shortcomings that are usually difficult to trace using traditional methods and tools. Hence, new tools and methods need to be introduced which deal specifically with problems encountered in parallel programs. The goal of this project is the development of an MPI-based parallel program performance monitoring tool and research into the ways this tool can be used for measuring, comparing and improving the performance of target programs.
APA, Harvard, Vancouver, ISO, and other styles
45

Sehrish, Saba. "IMPROVING PERFORMANCE AND PROGRAMMER PRODUCTIVITY FOR I/O-INTENSIVE HIGH PERFORMANCE COMPUTING APPLICATIONS." Doctoral diss., University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/3300.

Full text
Abstract:
Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. HPC applications are generating and processing large amounts of data ranging from terabytes (TB) to petabytes (PB). This new trend of growth in data for HPC applications has imposed challenges as to what is an appropriate parallel programming framework to efficiently process large data sets. In this work, we study the applicability of two programming models (MPI/MPI-IO and MapReduce) to a variety of I/O-intensive HPC applications ranging from simulations to analytics. We identify several performance and programmer-productivity limitations of these existing programming models when used for I/O-intensive applications. We propose new frameworks which improve both performance and programmer productivity for the emerging I/O-intensive applications. Message Passing Interface (MPI) is widely used for writing HPC applications. MPI/MPI-IO allows fine-grained control of data assignment and task distribution. At the programming framework level, various optimizations have been proposed to improve the performance of MPI/MPI-IO function calls. These performance optimizations are exposed to programmers as various function options. In order to write efficient code, programmers are required to know the exact usage of the optimization functions, hence programmer productivity is limited. We propose an abstraction called Reduced Function Set Abstraction (RFSA) for MPI-IO to reduce the number of I/O functions and provide methods to automate the selection of the appropriate I/O function for writing HPC simulation applications. The purpose of RFSA is to hide the performance optimization functions from the application developer and relieve the developer from deciding on a specific function. The proposed set of functions relies on a selection algorithm to decide among the most common optimizations provided by MPI-IO. Additionally, many application scientists are looking to integrate data-intensive computing into computation-intensive High Performance Computing facilities, particularly for data analytics. We have observed several scientific applications which must migrate their data from an HPC storage system to a data-intensive one. There is a gap between the data semantics of HPC storage and data-intensive systems; hence, once migrated, the data must be further refined and reorganized. This reorganization must be performed before existing data-intensive tools such as MapReduce can be effectively used to analyze the data. It requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application, in the form of excessive I/O operations. For every MapReduce application that must be run in order to complete the desired data analysis, a distributed read and write operation on the file system must be performed. Our contribution is to extend MapReduce to eliminate the multiple scans and also reduce the number of pre-processing MapReduce programs.
We have added additional expressiveness to the MapReduce language in our novel framework called MapReduce with Access Patterns (MRAP), which allows users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. We also provide a scheduling mechanism to further improve the performance of these applications. The main contributions of this thesis are: 1) we implement a selection algorithm for I/O functions such as read/write, merge a set of functions for data types and file views, and optimize the atomicity function by automating the locking mechanism in RFSA. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show improved programmer productivity (35.7% on average). This approach incurs an overhead of 2-5% for one particular optimization, and shows a performance improvement of 17% when a combination of different optimizations is required by an application. 2) We provide an augmented MapReduce system (MRAP), which consists of an API and corresponding optimizations, i.e. data restructuring and scheduling. We have demonstrated up to 33% throughput improvement in one real application (read-mapping in bioinformatics), and up to 70% in an I/O kernel of another application (halo catalogs analytics). Our scheduling scheme shows a performance improvement of 18% for an I/O kernel of another application (QCD analytics).
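As a hedged illustration of the kind of MPI-IO decision RFSA is meant to hide (a sketch with assumed names and a simple block layout, not code from the dissertation), the following C fragment uses the collective MPI_File_write_at_all call, one of the several read/write variants a programmer would otherwise have to choose among:

    #include <mpi.h>

    /* Each rank writes its contiguous block of a shared file with a collective
       call; choosing between independent and collective I/O (and setting hints)
       is exactly the kind of decision an abstraction like RFSA would automate. */
    void write_block(const char *path, const double *buf, int count, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }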
APA, Harvard, Vancouver, ISO, and other styles
46

Cera, Marcia Cristina. "Providing adaptability to MPI applications on current parallel architectures." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2012. http://hdl.handle.net/10183/55464.

Full text
Abstract:
Nowadays, adaptability is a desired feature in parallel applications. For instance, the growing number of users competing for resources on parallel architectures causes constant changes in the set of available processors. Adaptive applications are able to execute using a volatile set of processors, offering better resource utilization. This adaptive behavior is known as malleability. Another example comes from the constant evolution of multi-core architectures, which increase the number of cores on their chips with each new generation. Adaptability is the key to allowing parallel programs to be portable from one machine to another, so that they can adapt the unfolding of parallelism to the specific degree of parallelism of the target architecture. This behavior can be seen as a particular case of evolutivity. In this sense, this thesis focuses on: (i) malleability, to adapt the execution of parallel applications to changes in processor availability; and (ii) evolutivity, to adapt the unfolding of parallelism according to properties of the architecture and of the input data. The remaining question is therefore "How to provide and support adaptive applications?". This thesis aims to answer that question based on MPI (Message-Passing Interface), the standard parallel API for HPC in distributed environments. Our work builds on the MPI-2 features that allow processes to be created at runtime, giving some flexibility to MPI applications. Malleable MPI applications use dynamic process creation to expand themselves in growth actions (to use extra processors). Shrinkage actions (to release processors) terminate the MPI processes running on the requested processors while preserving the application's data. Note that malleable applications require support from the runtime environment, since they must be notified about processor availability. Evolving MPI applications follow the explicit task parallelism paradigm to allow runtime adaptation. Dynamic process creation is thus used to unfold parallelism, that is, to create new MPI tasks on demand. To provide such applications we defined abstract MPI tasks, implemented synchronization among them through message exchange, and proposed an approach to adjust MPI task granularity, aiming at efficiency in distributed environments. The experimental results validated our hypothesis that adaptive applications can be provided using MPI-2 features. In addition, this thesis identified the runtime-environment requirements to support them on clusters. Malleable MPI applications thus improved cluster resource utilization, and explicit-task applications adapted the unfolding of parallelism to the target architecture, showing that this paradigm is also efficient in distributed environments.
Currently, adaptability is a desired feature in parallel applications. For instance, the increasing number of users competing for resources on parallel architectures causes dynamic changes in the set of available processors. Adaptive applications are able to execute using a set of volatile processors, providing better resource utilization. This adaptive behavior is known as malleability. Another example comes from the constant evolution of multi-core architectures, which increase the number of cores with each new generation of chips. Adaptability is the key to allowing parallel programs to be portable from one multi-core machine to another. Thus, parallel programs can adapt the unfolding of their parallelism to the specific degree of parallelism of the target architecture. This adaptive behavior can be seen as a particular case of evolutivity. In this sense, this thesis is focused on: (i) malleability, to adapt the execution of parallel applications to changes in processor availability; and (ii) evolutivity, to adapt the unfolding of the parallelism at runtime to the architecture and input data properties. Thus, the open issue is "How to provide and support adaptive applications?". This thesis aims to answer this question taking into account MPI (Message-Passing Interface), which is the standard parallel API for HPC in distributed-memory environments. Our work is based on MPI-2 features that allow spawning processes at runtime, adding some flexibility to MPI applications. Malleable MPI applications use dynamic process creation to expand themselves in growth actions (to use further processors). Shrinkage actions (to release processors) end the execution of the MPI processes on the required processors in such a way that the application's data are preserved. Notice that malleable applications require runtime environment support to execute, since they must be notified about processor availability. Evolving MPI applications follow the explicit task parallelism paradigm to allow their runtime adaptation. Thus, dynamic process creation is used to unfold the parallelism, i.e., to create new MPI tasks on demand. To provide these applications we defined abstract MPI tasks, implemented the synchronization among these tasks through message exchanges, and proposed an approach to adjust MPI task granularity aiming at efficiency in distributed-memory environments. Experimental results validated our hypothesis that adaptive applications can be provided using the MPI-2 features. Additionally, this thesis identifies the requirements to support these applications in cluster environments. Thus, malleable MPI applications were able to improve cluster utilization, and the explicit-task ones were able to adapt the unfolding of the parallelism to the target architecture, showing that this programming paradigm can be efficient also in distributed-memory contexts.
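A minimal C sketch of the MPI-2 dynamic process creation that such growth actions rely on (illustrative only; the worker binary name and the helper function are assumptions, not taken from the thesis):

    #include <mpi.h>

    /* Growth action of a malleable application: the running job spawns extra
       worker processes at runtime and merges them into a single communicator. */
    void grow(int extra_procs, MPI_Comm *expanded_comm)
    {
        MPI_Comm intercomm;
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, extra_procs,
                       MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                       &intercomm, MPI_ERRCODES_IGNORE);

        /* Merge parents and children so they can communicate as one group. */
        MPI_Intercomm_merge(intercomm, /* high = */ 0, expanded_comm);
        MPI_Comm_free(&intercomm);
    }

A shrinkage action would go the other way: the processes on the processors to be released redistribute their data and call MPI_Finalize, subject to the runtime-environment support discussed in the abstract.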
APA, Harvard, Vancouver, ISO, and other styles
47

Attari, Sanya. "An Investigation of I/O Strategies for MPI Workloads." Thesis, Virginia Tech, 2010. http://hdl.handle.net/10919/36212.

Full text
Abstract:
Different techniques can be used to improve application performance in parallel systems. Studies have shown that I/O communication delay is the main reason for the distinct behavior of I/O-intensive applications, which have specific requirements for performance optimization. Common strategies, generally defined for and effective on computationally intensive applications, may therefore not have the same effect on performance for these applications. Moreover, the background system configuration affects the behavior of the application and its performance. The growing use of parallel multi-core systems is an important factor in increasing performance and speeding up applications. Since changing multi-core system hardware is not an efficient way to satisfy the differing expectations of each application, it is the application developer's responsibility to design flexible and scalable code that is compatible with different environments. On the other hand, predicting application behavior and I/O requirements for I/O-intensive applications with irregular communication patterns is a complicated and time-consuming task that pushes the problem to runtime. Addressing this issue, we provide an overview of different techniques used for solving this problem. We have studied I/O-bound parallel applications that use MPI as the communication method in order to define a general perspective for optimizing the cost/performance ratio. Our experiments cover different setups for these applications in order to identify criteria that should be considered at design time as well as at runtime. Moreover, targeting one of the popular I/O-intensive applications, we discuss some possible solutions to speed it up on a multi-core system.
APA, Harvard, Vancouver, ISO, and other styles
48

Grass, Max. "Parallelisierung einer hybriden Partikel/Finite-Volumen Simulationsplattform mittels MPI." Zürich : ETH, Eidgenössische Technische Hochschule Zürich, Institut für Fluiddynamik, 2006. http://e-collection.ethbib.ethz.ch/show?type=dipl&nr=240.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Castellanos, Carrazana Abel. "Performance model for hybrid MPI+OpenMP master/worker applications." Doctoral thesis, Universitat Autònoma de Barcelona, 2014. http://hdl.handle.net/10803/283403.

Full text
Abstract:
In the current environment, various branches of science need the support of high-performance computing to obtain results within a relatively short time. This is mainly due to the high volume of information that needs to be processed and to the computational cost demanded by these calculations. The benefit of performing this processing in a distributed and parallel way is that it notably shortens the waiting time for results. To support this, there are essentially two widespread programming models: the message-passing model, through libraries based on the MPI standard, and the shared-memory model, using OpenMP. Hybrid applications are those that combine both models in order to exploit, in each case, the specific strengths of the parallelism of each one. Unfortunately, practice has shown that using this combination of models does not necessarily guarantee an improvement in application behavior. Several parameters must be considered to determine the application configuration that yields the best execution time. The number of processes to use, the number of threads per node, the data distribution among processes and threads, and so on, are parameters that seriously affect application performance. The appropriate value of such parameters depends, on the one hand, on the architectural characteristics of the system (communication latency, communication bandwidth, size and layout of the cache levels, computing capacity, etc.) and, on the other hand, on the characteristics of the application's behavior. The main contribution of this thesis is the use of a novel technique for predicting the performance and efficiency of hybrid Master/Worker applications. Within the machine-learning field, this prediction method is known as regression trees based on analytical models. The experimental results obtained allow us to be optimistic about the use of this algorithm for predicting both metrics and for selecting the best configuration of execution parameters for the application.
In the current environment, various branches of science need the support of high-performance computing to obtain results within a relatively short time. This is mainly due to the high volume of information that needs to be processed and to the computational cost demanded by these calculations. The benefit of performing this processing using distributed and parallel programming mechanisms is that it achieves shorter waiting times in obtaining the results. To support this, there are basically two widespread programming models: the message-passing model based on libraries implementing the MPI standard, and the shared-memory model using OpenMP. Hybrid applications are those that combine both models in order to exploit, in each case, the specific potential for parallelism of each one. Unfortunately, experience has shown that using this combination of models does not necessarily guarantee an improvement in the behavior of applications. There are several parameters that must be considered to determine the configuration of the application that provides the best execution time. The number of processes to use, the number of threads on each node, the data distribution among processes and threads, and so on, are parameters that seriously affect the performance of the application. On the one hand, the appropriate value of such parameters depends on the architectural features of the system (communication latency, communication bandwidth, cache memory size and architecture, computing capabilities, etc.) and, on the other hand, on the features of the application. The main contribution of this thesis is a novel technique for predicting the performance and efficiency of parallel hybrid Master/Worker applications. This technique is known in the field of machine learning as model-based regression trees. The experimental results obtained allow us to be optimistic about the use of this algorithm for predicting both metrics and for selecting the best application execution parameters.
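A hedged sketch of the worker side of such a hybrid master/worker application (illustrative only; the tags, the placeholder workload and the function name are assumptions): MPI distributes tasks among ranks while OpenMP parallelizes the work inside each task, which is precisely the configuration space (process count, thread count, granularity) the proposed model predicts over.

    #include <mpi.h>
    #include <omp.h>

    #define TASK_TAG 1
    #define STOP_TAG 2

    /* Hybrid master/worker skeleton: rank 0 (the master, not shown) sends task
       indices to worker ranks; each worker processes its task with an OpenMP
       parallel loop and returns a result. */
    void worker(MPI_Comm comm)
    {
        for (;;) {
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, comm, &st);
            if (st.MPI_TAG == STOP_TAG) break;

            double result = 0.0;
            #pragma omp parallel for reduction(+:result)
            for (int i = 0; i < 1000000; ++i)      /* placeholder workload */
                result += (double)task * i;

            MPI_Send(&result, 1, MPI_DOUBLE, 0, TASK_TAG, comm);
        }
    }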
APA, Harvard, Vancouver, ISO, and other styles
50

Yu, Weikuan. "Enhancing MPI with modern networking mechanisms in cluster interconnects." Columbus, Ohio : Ohio State University, 2006. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1150470374.

Full text
APA, Harvard, Vancouver, ISO, and other styles