Dissertations / Theses on the topic 'Checkpointing'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Checkpointing.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Oliner, Adam Jamison. "Cooperative checkpointing for supercomputing systems." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/32102.
Full textThis electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 91-94).
A system-level checkpointing mechanism, with global knowledge of the state and health of the machine, can improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. This thesis presents such a technique, called cooperative checkpointing, and models its behavior as an online algorithm. Where C is the checkpoint overhead and I is the request interval, a worst-case analysis proves a lower bound of (2 + [C/I])-competitiveness for deterministic cooperative checkpointing algorithms, and proves that a number of simple algorithms meet this bound. Using an expected-case analysis, this thesis proves that an optimal periodic checkpointing algorithm that assumes an exponential failure distribution may be arbitrarily bad relative to an optimal cooperative checkpointing algorithm that permits a general failure distribution. Calculations suggest that, under realistic conditions, an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing. Finally, the thesis suggests an embodiment of cooperative checkpointing for a large-scale high performance computer system and presents the results of some preliminary simulations. These results show that, in extreme cases, cooperative checkpointing improved system utilization by more than 25%, reduced bounded slowdown by a factor of 9, while simultaneously reducing the amount of work lost due to failures by 30%. This thesis contributes a unique approach to providing large-scale system reliability through cooperative checkpointing, techniques for analyzing the approach, and blueprints for implementing it in practice.
by Adam Jamison Oliner.
M.Eng.
Kalaiselvi, S. "Checkpointing Algorithms for Parallel Computers." Thesis, Indian Institute of Science, 1997. https://etd.iisc.ac.in/handle/2005/3908.
Full textKalaiselvi, S. "Checkpointing Algorithms for Parallel Computers." Thesis, Indian Institute of Science, 1997. http://hdl.handle.net/2005/67.
Full textNilsson, Christoffer, and Sebastian Karlsson. "Adaptive Checkpointing for Emergency Communication Systems." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-130876.
Full textVieira, Gustavo Maciel Dias. "Estudo comparativo de algoritmos para checkpointing." [s.n.], 2001. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276435.
Full textDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-01T02:33:00Z (GMT). No. of bitstreams: 1 Vieira_GustavoMacielDias_M.pdf: 3096254 bytes, checksum: 30b7155e50de3e9afd753dd40520b771 (MD5) Previous issue date: 2001
Resumo: Esta dissertação fornece um estudo comparativo abrangente de algoritmos quase-síncronos para checkpointing. Para tanto, utilizamos a simulação de sistemas distribuídos que nos oferece liberdade para construirmos modelos de sistemas com grande facilidade. O estudo comparativo avaliou pela primeira vez de forma uniforme o impacto sobre o desempenho dos algoritmos de fatores como a escala do sistema, a freqüência de check points básicos e a diferença na velocidade dos processos da aplicação. Com base nestes dados obtivemos um profundo conhecimento sobre o comportamento destes algoritmos e produzimos um valioso referencial para projetistas de sistemas em busca de algoritmos para check pointing para as suas aplicações distribuídas
Abstract: This dissertation provides a comprehensive comparative study ofthe performance of quase synchronous check pointing algorithms. To do so we used the simulation of distributed systems, which provides freedom to build system models easily. The comparative study assessed for the first time in an uniform environment the impact of the algorithms' performance with respect to factors such as the system's scale, the basic checkpoint rate and the relative processes' speed. By analyzing these data we acquired a deep understanding of the behavior of these algorithms and were able to produce a valuable reference to system architects looking for check pointing algorithms for their distributed applications
Mestrado
Mestre em Ciência da Computação
Bai, Yunhao. "A Checkpointing Methodology for Android Smartphone." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1461171667.
Full textJeyakumar, Ashwin Raju. "Metamori: A library for Incremental File Checkpointing." Thesis, Virginia Tech, 2004. http://hdl.handle.net/10919/9969.
Full textMaster of Science
Schmidt, Rodrigo Malta. "Coleta de lixo para protocolos de checkpointing." [s.n.], 2003. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276424.
Full textDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Matematica, Estatistica e Computação Cientifica
Made available in DSpace on 2018-08-03T19:18:25Z (GMT). No. of bitstreams: 1 Schmidt_RodrigoMalta_M.pdf: 745421 bytes, checksum: c32cef5e0a61fe3580cc8a211902f9fd (MD5) Previous issue date: 2003
Mestrado
Vasireddy, Rahul. "ASYNCHRONOUS CHECKPOINTING AND RECOVERY APPROACH FOR DISTRIBUTED SYSTEMS." Available to subscribers only, 2009. http://proquest.umi.com/pqdweb?did=1967797571&sid=6&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textMontón, i. Macián Màrius. "Checkpointing for virtual platforms and systemC-TLM-2.0." Doctoral thesis, Universitat Autònoma de Barcelona, 2010. http://hdl.handle.net/10803/32099.
Full textOne advantage of using a virtual platform or virtual prototype over real hardware for embedded software development and testing is the ability of some simulators to take checkpoints of their state. If the entire system model is detailed enough, it might take several minutes (or even hours) to simulate booting the O.S. If a snapshot of the simulation is saved just after it has finished booting, each time it is necessary to run the embedded software, designers can simply restore the snapshot and go. Restarting a checkpoint typically takes a few seconds. This can translate into a major productivity gain, especially when working with embedded system with complex SW stacks and O.S. like modern embedded devices. In this dissertation we present in firstly our work on adding a description level language as SystemC to two Virtual Platforms. This work was done for a commercial Virtual Platform, and later translated to a open-sourced Platform. This thesis also presents a set of modifications to SystemC language to support checkpointing. These modifications will make it possible to take the state of a SystemC running simulation and save it to disk. Later, the same simulation can be restored to the same point it was before, without any change to the simulated modules. These changes would help SystemC to be suitable for use by Virtual Platforms as a description language.
Hägglund, Joakim. "Policies and Checkpointing in a Distributed Automotive Middleware." Thesis, Uppsala University, Department of Information Technology, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-103035.
Full textThe goal of this master thesis is to analyze different methods for checkpointing with rollback recovery and evaluate how these would perform in a distributed heterogeneous automotive system. The conclusions from this study lead to a design of a checkpointing system for the Selfconfigurable HighAvailability and Policy based platform for Embedded system (SHAPE) middleware together with an API to use this new functionality. This design has been implemented in SHAPE to prove that it works. The conclusion is that this design should be enough for most applications in SHAPE although it requires some extra work by the application programmer.
This report also covers some aspects of policy based computing and presents adesign which aims to simplify the usage of policies. This results in a system with a policy manager and a context manager which is implemented in SHAPE.This implementation was shown to greatly improve the usability of policies andcontexts without removing any functionalities.
Thakre, Anupam S. "Asynchronous checkpointing scheme for distributed mobile computing environment /." Available to subscribers only, 2005. http://proquest.umi.com/pqdweb?did=1095439851&sid=1&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textHeller, Richard. "Checkpointing without operating system intervention implementing Griewank's algorithm." Ohio : Ohio University, 1998. http://www.ohiolink.edu/etd/view.cgi?ohiou1176494831.
Full textWu, Jiang. "CHECKPOINTING AND RECOVERY IN DISTRIBUTED AND DATABASE SYSTEMS." UKnowledge, 2011. http://uknowledge.uky.edu/cs_etds/2.
Full text周志賢 and Chi-yin Edward Chow. "Adaptive recovery with hierarchical checkpointing on workstation clusters." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1999. http://hub.hku.hk/bib/B29812914.
Full textChow, Chi-yin Edward. "Adaptive recovery with hierarchical checkpointing on workstation clusters /." Hong Kong : University of Hong Kong, 1999. http://sunzi.lib.hku.hk/hkuto/record.jsp?B20792700.
Full textBentria, Dounia. "Combining checkpointing and other resilience mechanisms for exascale systems." Thesis, Lyon, École normale supérieure, 2014. http://www.theses.fr/2014ENSL0971/document.
Full textIn this thesis, we are interested in scheduling and optimization problems in probabilistic contexts. The contributions of this thesis come in two parts. The first part is dedicated to the optimization of different fault-Tolerance mechanisms for very large scale machines that are subject to a probability of failure and the second part is devoted to the optimization of the expected sensor data acquisition cost when evaluating a query expressed as a tree of disjunctive Boolean operators applied to Boolean predicates. In the first chapter, we present the related work of the first part and then we introduce some new general results that are useful for resilience on exascale systems.In the second chapter, we study a unified model for several well-Known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies. We propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated exascale designs.In the third, fourth, and fifth chapters, we study the combination of different fault tolerant mechanisms (replication, fault prediction and detection of silent errors) with the traditional checkpoint/restart mechanism. We evaluated several models using simulations. Our results show that these models are useful for a set of models of applications in the context of future exascale systems.In the second part of the thesis, we study the problem of minimizing the expected sensor data acquisition cost when evaluating a query expressed as a tree of disjunctive Boolean operators applied to Boolean predicates. The problem is to determine the order in which predicates should be evaluated so as to shortcut part of the query evaluation and minimize the expected cost.In the sixth chapter, we present the related work of the second part and in the seventh chapter, we study the problem for queries expressed as a disjunctive normal form. We consider the more general case where each data stream can appear in multiple predicates and we consider two models, the model where each predicate can access a single stream and the model where each predicate can access multiple streams
Jangjaimon, Itthichok. "Effective Checkpointing for Networked Multicore Systems and Cloud Computing." Thesis, University of Louisiana at Lafayette, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3615289.
Full textCheckpointing has been widely adopted in support of fault-tolerance and job migration essential for large-scale networked multicore systems and cloud computing. This dissertation pursues an effective checkpointing mechanism to handle failures and unavailable events in such systems and thus to reduce the expected job turnaround time, the aggregated file size, and the monetary cost involved. To withstand unavailability/failures of local nodes in networked systems, multi-level checkpointing is indispensable, with checkpoint files kept not only locally but also at remote storage. As the number of nodes in such a system grows, I/O bandwidth to remote storage quickly becomes the bottleneck for multi-level checkpointing.
The first part of this work deals with an effective mechanism, dubbed adaptive incremental checkpointing (AIC), which reduces the checkpointing file size considerably to lower its involved overhead and thus to shorten the expected job turnaround time. Given production multicore systems are observed to often have unused cores available, we design AIC to make use of separate, otherwise unused, cores for carrying out delta compression at desirable points of time adaptively. AIC permits multi-level checkpointing effectively, with checkpoint files of execution nodes written to their partner nodes and to remote storage concurrently during job execution. AIC is observed in our implemented testbed to substantially lower the normalized expected turnaround time (by up to 41%) and the aggregated file size (by up to 1,000×) when compared to its static counterpart and a recent multi-level checkpointing scheme with fixed checkpoint intervals.
The second part presents design and implementation of our enhanced adaptive incremental checkpointing (EAIC) for multithreaded applications on the RaaS clouds under spot instance pricing. EAIC model takes into account spot instance revocation events, besides hardware failures, for fast and accurately predicting the desirable points of time to take checkpoints so as to markedly reduce the expected job turnaround time and the monetary cost. The experimental results from our established testbed under real spot instance price traces from Amazon EC2 show that EAIC lowers both the application turnaround time and the monetary cost markedly (by up to 58% and 59%, respectively) in comparison to its recent checkpointing counterpart.
Komal, Hooria. "An investigation of semantic checkpointing in dynamically reconfigurable applications." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/46033.
Full textIncludes bibliographical references (p. 74-75).
As convergence continues to blur lines between different hardware and software platforms, distributed systems are becoming increasingly important. With this rise in the use of such systems, the ability to provide continuous functionality to the user under varying conditions and across different platforms becomes important.This thesis investigates the issues that must be dealt with in implementing a system performing dynamic reconfiguration by enabling semantic checkpointing: capturing application level state. The reconfiguration under question is not restricted to replacement of a few modules or moving modules from one machine to another. The individual components can be completely changed as long as the configuration provides identical higher-level functionality. As part of this thesis work, I have implemented a model application that supports semantic checkpointing and is capable of functioning even after a reconfiguration. This application will serve as an illustrative example of a system that provides these capabilities and demonstrates various types of reconfigurations. In addition, the application allows easy understanding and investigation of the issues involved in providing application-level checkpointing and reconfiguration support. While the details of the checkpointing and reconfiguration will be application specific, there are many common features that all runtime reconfigurable applications share. The aim of this thesis is to use this application implementation to highlight these common features. Furthermore, I use our approach as a means to derive more general guidelines and strategies that could be combined to create a complete framework for enabling such behavior.
by Hooria Komal.
M.Eng.
Sánchez, Verdejo Rommel. "HPC memory systems: Implications of system simulation and checkpointing." Doctoral thesis, Universitat Politècnica de Catalunya, 2022. http://hdl.handle.net/10803/673620.
Full textEl sistema de memoria es el mayor contribuidor de los desafíos actuales en el campo de la arquitectura de ordenadores como lo son los cuellos de botella en el rendimiento de las aplicaciones, así como los costos operativos en los grandes centros de datos. Con la llegada de tecnologías emergentes de memoria, existe una invitación para que los investigadores mejoren y optimicen las implementaciones actuales con novedosos diseños en la jerarquía de memoria. La simulación de los ordenadores es el enfoque preferido para realizar exploraciones de arquitectura debido al bajo costo que representan frente a la realización de prototipos físicos, arrojando estimaciones de rendimiento aceptables con predicciones precisas. A pesar del amplio uso de simuladores de ordenadores, su validación no está estandarizada ya sea porque el propósito principal del simulador no es imitar al sistema real o porque las suposiciones de diseño son demasiado específicas. Esta tesis proporciona los primeros pasos hacia una metodología sistemática para validar simuladores de ordenadores cuando son comparados con sistemas reales. Primero se descubren los parámetros de microarquitectura en la máquina real a través de un conjunto de micro-pruebas diseñadas para actualizar la infraestructura de simulación con el fin de mejorar la precisión en el dominio de la simulación. Para evaluar la precisión de la simulación, proponemos "el factor de retiro", una extensión a una conocida herramienta para medir el rendimiento de las aplicaciones, pero enfocada al impacto del ajuste de parámetros en el simulador. Además, presentamos "la cola de retardo", una modificación virtual al controlador de memoria que agrega un retraso configurable a todas las transacciones de memoria que alcanzan la memoria principal. Usando el factor de retiro, la cola de retraso nos permite identificar el origen de las desviaciones entre la infraestructura del simulador y el sistema real. Todos los accesos de memoria afectan directamente el rendimiento de la aplicación. Desde el acceso de lectura a una única localidad memoria hasta operaciones simultáneas de lectura/escritura a una o varias localidades de memoria, una propiedad que permite reflejar el uso de memoria de la aplicación es su "huella de memoria". En esta tesis encontramos un vínculo entre la huella de memoria de las aplicaciones de alto desempeño y su rendimiento en simulación. Las tecnologías de memoria emergentes se están implementando en sistemas de alto desempeño en cantidades limitadas en comparación con la memoria principal haciéndolas adecuadas para su uso en aplicaciones con baja huella de memoria. En este trabajo, habilitamos y evaluamos el uso de un sistema de memoria heterogéneo basado en un sistema emergente de memoria. Nuestra implementación agrega una carga despreciable al mismo tiempo que ofrece una interfaz simple para ubicar, administrar y migrar datos entre sistemas de memoria heterogéneos. Además, demostramos que el uso de una tecnología de memoria emergente no es una solución directa a los cuellos de botella en el desempeño. La implementación es fundamental a la hora de obtener el mejor rendimiento ya sea ubicando correctamente los datos, o bien diseñando código especializado. En general, esta tesis proporciona una técnica para validar los simuladores respecto al sistema de memoria principal cuando se integra en una infraestructura de simulación y se compara con sistemas reales. Además, exploramos un vínculo entre la huella de memoria de la carga de trabajo y el rendimiento de la simulación en cargas de trabajo de aplicaciones de alto desempeño. Finalmente, habilitamos aplicaciones de alto desempeño con soporte de resiliencia mientras que se benefician de manera transparente con el uso de un sistema de memoria emergente.
Arquitectura de Computadors
Berg, Gustaf, and Magnus Brattlöf. "Distributed Checkpointing with Docker Containers in High Performance Computing." Thesis, Högskolan Väst, Avdelningen för data-, elektro- och lantmäteriteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:hv:diva-11645.
Full textLightweight container virtualization has gained widespread adoption in recent years after updates to namespace and cgroups features in the Linux kernel. At the same time the Industrial High Performance community suffers from expensive licensing costs that could be managed with virtualization. To demonstrate that Docker could be used for suspending distributed containers with parallel processes, experiments were designed to find out if the experimental checkpoint feature is ready for this community. We run the well-known NAS Parallel Benchmark (NPB) inside containers spread over two systems under test to prove this concept. Then, pausing containers and unpausing them in different sequence orders we were able resume the benchmark. After that, we further demonstrate that if you carefully consider the order in which you Checkpoint/Restore containers, then the checkpoint feature is also able to resume the benchmark successfully. Finally, the concept of restoring distributed containers, running the benchmark, on a different system from where it started was proven to be working with a high success rate. Our tests demonstrate the performance, possibilities and flexibilities of Dockers future in the industrial HPC community. This might very well tip the community over to running their simulations and virtual engineering-applications inside containers instead of running them on native hardware.
Manivannan, D. "Quasi-synchronous checkpointing and failure recovery in distributed systems /." The Ohio State University, 1997. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487946776022687.
Full textSunke, Phani Kanth. "A new asynchronous checkpointing/recovery mechanism for cluster federations /." Available to subscribers only, 2007. http://proquest.umi.com/pqdweb?did=1400971261&sid=4&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textFoster, Mark S. "Process forensics the crossroads of checkpointing and intrusion detection /." [Gainesville, Fla.] : University of Florida, 2004. http://purl.fcla.edu/fcla/etd/UFE0008063.
Full textDoudalis, Ioannis. "Hardware assisted memory checkpointing and applications in debugging and reliability." Diss., Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/42700.
Full textSilva, Marcelo Pereira da. "Adaptive Remus: replicação de máquinas virtuais Xen com checkpointing adaptável." Universidade do Estado de Santa Catarina, 2015. http://tede.udesc.br/handle/handle/2046.
Full textCoordenação de Aperfeiçoamento de Pessoal de Nível Superior
With the ever-increasing dependence on computers and networks, many systems are required to be continuously available in order to fulfill their mission. Virtualization technology enables high availability to be offered in a convenient, cost-effective manner: with the encapsulation provided by virtual machines (VMs), entire systems can be replicated transparently in software, obviating the need for expensive fault-tolerant hardware. Remus is a VM replication mechanism for the Xen hypervisor that provides high availability despite crash failures. Replication is performed by checkpointing the VM at fixed intervals. However, there is an antagonism between processing and communication regarding the optimal checkpointing interval: while longer intervals benefit processorintensive applications, shorter intervals favor network-intensive applications. Thus, any chosen interval may not always be suitable for the hosted applications, limiting Remus usage in many scenarios. This work introduces Adaptive Remus, a proposal for adaptive checkpointing in Remus that dynamically adjusts the replication frequency according to the characteristics of running applications. Experimental results indicate that our proposal improves performance for applications that require both processing and communication, without harming applications that use only one type of resource.
Com a dependência cada vez maior de computadores e redes, muitos sistemas precisam estar continuamente disponíveis para cumprir sua missão. A tecnologia de virtualização permite prover alta disponibilidade de forma conveniente e a um custo razoável: com o encapsulamento oferecido pelas máquinas virtuais (MVs), sistemas inteiros podem ser replicados em software, de forma transparente, eliminando a necessidade de hardware tolerante a faltas dispendioso. Remus é um mecanismo de replicação de MVs que fornece alta disponibilidade diante de faltas de parada. A replicação é realizada através de checkpointing, seguindo um intervalo fixo de tempo predeterminado. Todavia, existe um antagonismo entre processamento e comunicação em relação ao intervalo ideal entre checkpoints: enquanto intervalos maiores beneficiam aplicações com processamento intensivo, intervalos menores favorecem as aplicações cujo desempenho é dominado pela rede. Logo, o intervalo utilizado nem sempre é o adequado para as características de uso de recursos da aplicação em execução na MV, limitando a aplicabilidade de Remus em determinados cenários. Este trabalho apresenta Adaptive Remus, uma proposta de checkpointing adaptativo para Remus, que ajusta dinamicamente a frequência de replicação de acordo com as características das aplicações em execução. Os resultados indicam que a proposta obtém um melhor desempenho de aplicações que utilizam tanto recursos de processamento como de comunicação, sem prejudicar aplicações que usam apenas um dos tipos de recursos.
Silva, Francisco Assis da. "Recuperação com base em Checkpointing : uma abordagem orientada a objetos." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2002. http://hdl.handle.net/10183/1929.
Full textChimakurthy, Prasanth. "A new roll-forward checkpointing/recovery mechanism for cluster federation /." Available to subscribers only, 2007. http://proquest.umi.com/pqdweb?did=1464129401&sid=6&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textVemuri, Aditya. "A high performance non-blocking checkpointing/recovery algorithm for ring networks /." Available to subscribers only, 2006. http://proquest.umi.com/pqdweb?did=1203587991&sid=8&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textGad, Ramy Mohamed Aly [Verfasser]. "Enhancing application checkpointing and migration in HPC / Ramy Mohamed Aly Gad." Mainz : Universitätsbibliothek Mainz, 2018. http://d-nb.info/1172105073/34.
Full textSilva, Ulisses Furquim Freire da. "Arquitetura de software para recuperaçao de falhas utilizando checkpointing quase-sincrono." [s.n.], 2005. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276505.
Full textDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-06T15:21:09Z (GMT). No. of bitstreams: 1 Silva_UlissesFurquimFreireda_M.pdf: 705102 bytes, checksum: 5b4ebc6853f67fd40696b21c87297f43 (MD5) Previous issue date: 2005
Resumo: Um sistema distribuído tolerante a falhas que utilize recuperação por retrocesso de estado deve selecionar os checkpoints dos seus processos que serão gravados. Além dessa seleção, definida por um protocolo de checkpointing, o sistema precisa realizar uma coleta de lixo, para eliminar os checkpoints que se tornam obsoletos à medida que a aplicação executa. Assim, na ocorrência de uma falha, a computação pode ser retrocedida para um estado consistente salvo anteriormente. Esta dissertação discute os aspectos teóricos e práticos de um sistema distribuído tolerante a falhas que utiliza protocolos de checkpointing quase-síncronos e algoritmos para a coleta de lixo e recuperação por retrocesso. Existem vários protocolos de checkpointing na literatura, e nesta dissertação foram estudados os protocolos de checkpointing quase-síncronos. Esses protocols enviam informações de controle juntamente com as mensagens da aplicação, e podem exigir a gravação de checkpoints forçados, mas não necessitam de sincronização ou troca de mensagens de controle entre os processos. Com base nesse estudo, um framework para protocolos de checkpointing quase-sincronos foi implementado numa biblioteca de troca de mensagens chamada LAM/MPI. Além disso, uma arquitetura de software para recuperação de falhas por retrocesso de estado chamada Curupira também foi estudada e implementada naquela biblioteca. O Curupira_e a primeira arquitetura de software que n~ao precisa de troca de mensagens de controle ou qualquer sincronização entre os processos na execução dos protocolos de checkpointing e de coleta de lixo
Abstract: A fault-tolerant distributed system based on rollback-recovery has to checkpoints of its processes are stored. Besides this selection, that is controlled checkpointing protocol, the system has to do garbage collection, in order to eliminate that become obsolete while the application executes. The garbage collection because checkpoints require the use of storage resources and the storage has limited capacity. So, when some fault occurs, the whole distributed be restored to a consistent global state previously stored. This dissertation practical and theoretical aspects of a fault-tolerant distributed system quasisynchronous checkpointing protocols and also garbage collection and algorithms. There are several checkpointing protocols proposed in the literature, quasisynchronous ones were studied in this dissertation. These protocols information in the application's messages and can induce forced checkpoints, need any synchronization or exchanging of control messages among on that study, a framework for quasi-synchronous checkpointing implemented in a message passing library called LAM/MPI. Moreover, a based on rollback-recovery from faults named Curupira was also implemented in that library. Curupira is the _rst software architecture exchanging of control messages or any synchronization among the execution of the checkpointing and garbage collection protocols
Mestrado
Sistemas Distribuidos
Mestre em Ciência da Computação
Sakata, Tiemi Christine. "Uma ponte entre as abordagens sincrona e quase-sincrona para checkpointing." [s.n.], 2006. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276255.
Full textTese (doutorado) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-08T07:37:22Z (GMT). No. of bitstreams: 1 Sakata_TiemiChristine_D.pdf: 843635 bytes, checksum: 7f950e8bee6e5c7a1dfb19c6212897c2 (MD5) Previous issue date: 2007
Resumo: Protocolos de checkpointing são responsáveis pelo armazenamento de estados dos processos de um sistema distribuído em memória estável para tolerar falhas. Os protocolos síncronos minimais induzem apenas um número minimal de processos a salvarem checkpoints durante uma execução do protocolo bloqueando os processos envolvidos. Uma versão não-bloqueante desta abordagem garante a minimalidade no número de checkpoints salvos em memória estável com o uso de checkpoints mutáveis, checkpoints que podem ser salvos em memória não-estável. Porém, a complexidade deste protocolo e o fato de ele tolerar apenas a presença de uma execução de checkpointing a cada instante nos motivou a procurar soluções para estes problemas na teoria desenvolvida para os protocolos quase-síncronos. A nova abordagem nos permitiu fazer uma revisão de alguns protocolos síncronos bloqueantes existentes na literatura que até então eram considerados minimais. Nesta mesma linha, obtivemos novos resultados na análise de minimalidade dos protocolos síncronos não-bloqueantes, ao considerarmos a aplicação como um todo e também a existência de execuções concorrentes de checkpointing. Ao estabelecermos esta ponte entre as abordagens para checkpointing, conseguimos desenvolver dois novos protocolos síncronos não-bloqueantes. Ambos fazem uso de checkpoints mutáveis, permitem execuções concorrentes de checkpointing e possuem um mecanismo simples de coleta de lixo. No entanto, o fato de cada um dos protocolos derivar de classes diferentes de protocolos quase-síncronos leva a comportamentos distintos, como evidenciado por resultados de simulação
Abstract: Checkpointing protocols are responsible for the selection of checkpoints in fault-tolerant distributed systems. Minimal checkpointing protocols minimize the number of checkpoints blocking processes during checkpointing. A non-blocking version of this approach assures a minimal number of checkpoints saved in stable memory using mutable checkpoints, those checkpoints can be saved in a non-stable storage. However, the complexity of this protocol and the absence of concurrent checkpointing executions have motivated us to find new solutions in the quasi-synchronous theory. The new approach has allowed us to review some blocking synchronous protocols existent in the literature which were, until now, considered as minimals. In the same way, we present new results analysing the minimality on the number of checkpoints in nonblocking synchronous protocols, considering the whole application and also the existence of concurrent checkpointing executions. On bridging the gap between the checkpointing approaches we could develop two new non-blocking synchronous protocols. Both use mutable checkpoints, allow concurrent checkpointing executions and have a simple mechanism of garbage collection. However, since each protocol derives from a diferent class of quasi-synchronous protocols, they present distinct behaviours, which are evident in the simulation results
Doutorado
Sistemas Distribuidos
Doutor em Ciência da Computação
Aïssaoui, François. "Autonomic Approach based on Semantics and Checkpointing for IoT System Management." Thesis, Toulouse 1, 2018. http://www.theses.fr/2018TOU10061/document.
Full textKantheti, Vinod. "Design of an efficient checkpointing-recovery algorithm for distributed cluster computing environment /." Available to subscribers only, 2005. http://proquest.umi.com/pqdweb?did=1079672401&sid=2&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textCamargo, Raphael Yokoingawa de. "\"Armazenamento distribuído de dados e checkpointing de aplicações paralelas em grades oportunistas\"." Universidade de São Paulo, 2007. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-30082007-115609/.
Full textOpportunistic computational grids use idle resources from shared machines to execute applications that need large amounts of computational power and/or deal with large amounts of data. But executing computationally intensive parallel applications in dynamic and heterogeneous environments, such as opportunistic grids, is a daunting task. Machines may fail, become inaccessible, or change from idle to occupied unexpectedly, compromising the application execution. A fault tolerance mechanism that supports heterogeneous architectures is an important requisite for such systems. In this work, we analyze, implement and evaluate a checkpointing-based fault tolerance mechanism for parallel applications running on opportunistic grids. The mechanism monitors application execution and allows the migration of applications between heterogeneous nodes of the grid. But besides application execution, it is necessary to manage data generated and used by those applications. We want a low cost data storage infrastructure that utilizes the unused disk space of grid shared machines. The system should use the machines to store and recover data only during their idle periods, requiring the system to be redundant and fault-tolerant. To solve the data storage problem in opportunistic grids, we designed, implemented and evaluated the OppStore middleware. This middleware provides reliable distributed storage for application data, which can be accessed from any machine in the grid. The machines are organized in clusters, connected by a self-organizing and fault-tolerant peer-to-peer network. During storage, data is codified into redundant fragments, allowing the reconstruction of the original file using only a subset of those fragments. Finally, to deal with resource heterogeneity, we developed an extension to the Pastry peer-to-peer routing substrate, enabling heterogeneity-aware load-balancing message routing.
Bhupathi, Raghavendra Kumar. "A fast and efficient non-blocking coordinated checkpointing approach for mobile computing systems /." Available to subscribers only, 2006. http://proquest.umi.com/pqdweb?did=1203588041&sid=22&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textPetersen, Karsten. "Management-Elemente für mehrdimensional heterogene Cluster." Master's thesis, Universitätsbibliothek Chemnitz, 2003. http://nbn-resolving.de/urn:nbn:de:swb:ch1-200300851.
Full textVutukuru, Ashwin. "A sender based message logging and independent checkpointing approach for recovery in mobile computing environment /." Available to subscribers only, 2009. http://proquest.umi.com/pqdweb?did=1797219881&sid=3&Fmt=2&clientId=1509&RQT=309&VName=PQD.
Full textVutukuru, Ashwin. "A Sender Based Message Logging and Independent Checkpointing Approach for Recovery in Mobile Computing Environment." OpenSIUC, 2008. https://opensiuc.lib.siu.edu/theses/426.
Full textTran, Van Long. "Optimization of checkpointing and execution model for an implementation of OpenMP on distributed memory architectures." Thesis, Evry, Institut national des télécommunications, 2018. http://www.theses.fr/2018TELE0017/document.
Full textOpenMP and MPI have become the standard tools to develop parallel programs on shared-memory and distributed-memory architectures respectively. As compared to MPI, OpenMP is easier to use. This is due to the ability of OpenMP to automatically execute code in parallel and synchronize results using its directives, clauses, and runtime functions while MPI requires programmers do all this manually. Therefore, some efforts have been made to port OpenMP on distributed-memory architectures. However, excluding CAPE, no solution has successfully met both requirements: 1) to be fully compliant with the OpenMP standard and 2) high performance. CAPE stands for Checkpointing-Aided Parallel Execution. It is a framework that automatically translates and provides runtime functions to execute OpenMP program on distributed-memory architectures based on checkpointing techniques. In order to execute an OpenMP program on distributed-memory system, CAPE uses a set of templates to translate OpenMP source code to CAPE source code, and then, the CAPE source code is compiled by a C/C++ compiler. This code can be executed on distributed-memory systems under the support of the CAPE framework. Basically, the idea of CAPE is the following: the program first run on a set of nodes on the system, each node being executed as a process. Whenever the program meets a parallel section, the master distributes the jobs to the slave processes by using a Discontinuous Incremental Checkpoint (DICKPT). After sending the checkpoints, the master waits for the returned results from the slaves. The next step on the master is the reception and merging of the resulting checkpoints before injecting them into the memory. For slave nodes, they receive different checkpoints, and then, they inject it into their memory to compute the divided job. The result is sent back to the master using DICKPTs. At the end of the parallel region, the master sends the result of the checkpoint to every slaves to synchronize the memory space of the program as a whole. In some experiments, CAPE has shown very high-performance on distributed-memory systems and is a viable and fully compatible with OpenMP solution. However, CAPE is in the development stage. Its checkpoint mechanism and execution model need to be optimized in order to improve the performance, ability, and reliability. This thesis aims at presenting the approaches that were proposed to optimize and improve checkpoints, design and implement a new execution model, and improve the ability for CAPE. First, we proposed arithmetics on checkpoints, which aims at modeling checkpoint’s data structure and its operations. This modeling contributes to optimize checkpoint size and reduces the time when merging, as well as improve checkpoints capability. Second, we developed TICKPT which stands for Time-stamp Incremental Checkpointing as an instance of arithmetics on checkpoints. TICKPT is an improvement of DICKPT. It adds a timestamp to checkpoints to identify the checkpoints order. The analysis and experiments to compare it to DICKPT show that TICKPT do not only provide smaller in checkpoint size, but also has less impact on the performance of the program using checkpointing. Third, we designed and implemented a new execution model and new prototypes for CAPE based on TICKPT. The new execution model allows CAPE to use resources efficiently, avoid the risk of bottlenecks, overcome the requirement of matching the Bernstein’s conditions. As a result, these approaches make CAPE improving the performance, ability as well as reliability. Four, Open Data-sharing attributes are implemented on CAPE based on arithmetics on checkpoints and TICKPT. This also demonstrates the right direction that we took, and makes CAPE more complete
Tran, Van Long. "Optimization of checkpointing and execution model for an implementation of OpenMP on distributed memory architectures." Electronic Thesis or Diss., Evry, Institut national des télécommunications, 2018. http://www.theses.fr/2018TELE0017.
Full textOpenMP and MPI have become the standard tools to develop parallel programs on shared-memory and distributed-memory architectures respectively. As compared to MPI, OpenMP is easier to use. This is due to the ability of OpenMP to automatically execute code in parallel and synchronize results using its directives, clauses, and runtime functions while MPI requires programmers do all this manually. Therefore, some efforts have been made to port OpenMP on distributed-memory architectures. However, excluding CAPE, no solution has successfully met both requirements: 1) to be fully compliant with the OpenMP standard and 2) high performance. CAPE stands for Checkpointing-Aided Parallel Execution. It is a framework that automatically translates and provides runtime functions to execute OpenMP program on distributed-memory architectures based on checkpointing techniques. In order to execute an OpenMP program on distributed-memory system, CAPE uses a set of templates to translate OpenMP source code to CAPE source code, and then, the CAPE source code is compiled by a C/C++ compiler. This code can be executed on distributed-memory systems under the support of the CAPE framework. Basically, the idea of CAPE is the following: the program first run on a set of nodes on the system, each node being executed as a process. Whenever the program meets a parallel section, the master distributes the jobs to the slave processes by using a Discontinuous Incremental Checkpoint (DICKPT). After sending the checkpoints, the master waits for the returned results from the slaves. The next step on the master is the reception and merging of the resulting checkpoints before injecting them into the memory. For slave nodes, they receive different checkpoints, and then, they inject it into their memory to compute the divided job. The result is sent back to the master using DICKPTs. At the end of the parallel region, the master sends the result of the checkpoint to every slaves to synchronize the memory space of the program as a whole. In some experiments, CAPE has shown very high-performance on distributed-memory systems and is a viable and fully compatible with OpenMP solution. However, CAPE is in the development stage. Its checkpoint mechanism and execution model need to be optimized in order to improve the performance, ability, and reliability. This thesis aims at presenting the approaches that were proposed to optimize and improve checkpoints, design and implement a new execution model, and improve the ability for CAPE. First, we proposed arithmetics on checkpoints, which aims at modeling checkpoint’s data structure and its operations. This modeling contributes to optimize checkpoint size and reduces the time when merging, as well as improve checkpoints capability. Second, we developed TICKPT which stands for Time-stamp Incremental Checkpointing as an instance of arithmetics on checkpoints. TICKPT is an improvement of DICKPT. It adds a timestamp to checkpoints to identify the checkpoints order. The analysis and experiments to compare it to DICKPT show that TICKPT do not only provide smaller in checkpoint size, but also has less impact on the performance of the program using checkpointing. Third, we designed and implemented a new execution model and new prototypes for CAPE based on TICKPT. The new execution model allows CAPE to use resources efficiently, avoid the risk of bottlenecks, overcome the requirement of matching the Bernstein’s conditions. As a result, these approaches make CAPE improving the performance, ability as well as reliability. Four, Open Data-sharing attributes are implemented on CAPE based on arithmetics on checkpoints and TICKPT. This also demonstrates the right direction that we took, and makes CAPE more complete
Walther, Andrea. "Program Reversal Schedules for Single- and Multi-processor Machines." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2002. http://nbn-resolving.de/urn:nbn:de:swb:14-1011612099250-05302.
Full textFor adjoint calculations, parameter estimation, and similar purposes one may need to reverse the execution of a computer program. The simplest option is to record a complete execution log and then to read it backwards. This requires massive amounts of storage. Instead one may generate the execution log piecewise by restarting the ``forward'' calculation repeatedly from suitably placed checkpoints. The basic structure of the resulting reversal schedules is illustrated. Various strategies are analysed with respect to the resulting temporal and spatial complexity on serial and parallel machines. For serial machines known optimal compromises between operations count and memory requirement are explained, and they are extended to more general situations. For program execution reversal on multi-processors the new challenges and demands on an optimal reversal schedule are described. We present parallel reversal schedules that are provably optimal with regards to the number of concurrent processes and the total amount of memory required
Jiang, Qiangfeng. "ALGORITHMS FOR FAULT TOLERANCE IN DISTRIBUTED SYSTEMS AND ROUTING IN AD HOC NETWORKS." UKnowledge, 2013. http://uknowledge.uky.edu/cs_etds/16.
Full textKhan, Salman. "Putting checkpoints to work in thread level speculative execution." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/4676.
Full textAupy, Guillaume. "Resilient and energy-efficient scheduling algorithms at scale." Thesis, Lyon, École normale supérieure, 2014. http://www.theses.fr/2014ENSL0928.
Full textThis thesis deals with two issues for future Exascale platforms, namelyresilience and energy.In the first part of this thesis, we focus on the optimal placement ofperiodic coordinated checkpoints to minimize execution time.We consider fault predictors, a software used by system administratorsthat tries to predict (through the study of passed events) where andwhen faults will strike. In this context, we propose efficientalgorithms, and give a first-order optimal formula for the amount ofwork that should be done between two checkpoints.We then focus on silent data corruption errors. Contrarily to fail-stopfailures, such latent errors cannot be detected immediately, and amechanism to detect them must be provided. We compute the optimal periodin order to minimize the waste.In the second part of the thesis we address the energy consumptionchallenge.The speed scaling technique consists in diminishing the voltage of theprocessor, hence diminishing its execution speed. Unfortunately, it waspointed out that DVFS increases the probability of failures. In thiscontext, we consider the speed scaling technique coupled withreliability-increasing techniques such as re-execution, replication orcheckpointing. For these different problems, we propose variousalgorithms whose efficiency is shown either through thoroughsimulations, or approximation results relatively to the optimalsolution. Finally, we consider the different energetic costs involved inperiodic coordinated checkpointing and compute the optimal period tominimize energy consumption, as we did for execution time
Rough, Justin, and mikewood@deakin edu au. "A Platform for reliable computing on clusters using group communications." Deakin University. School of Computing and Mathematics, 2001. http://tux.lib.deakin.edu.au./adt-VDU/public/adt-VDU20060412.141015.
Full textDhoke, Aditya Anil. "On Partial Aborts and Reducing Validation Costs in Fault-tolerant Distributed Transactional Memory." Thesis, Virginia Tech, 2013. http://hdl.handle.net/10919/23890.
Full textMaster of Science
Walther, Andrea. "Program Reversal Schedules for Single- and Multi-processor Machines." Doctoral thesis, Technische Universität Dresden, 1999. https://tud.qucosa.de/id/qucosa%3A24113.
Full textFor adjoint calculations, parameter estimation, and similar purposes one may need to reverse the execution of a computer program. The simplest option is to record a complete execution log and then to read it backwards. This requires massive amounts of storage. Instead one may generate the execution log piecewise by restarting the ``forward'' calculation repeatedly from suitably placed checkpoints. The basic structure of the resulting reversal schedules is illustrated. Various strategies are analysed with respect to the resulting temporal and spatial complexity on serial and parallel machines. For serial machines known optimal compromises between operations count and memory requirement are explained, and they are extended to more general situations. For program execution reversal on multi-processors the new challenges and demands on an optimal reversal schedule are described. We present parallel reversal schedules that are provably optimal with regards to the number of concurrent processes and the total amount of memory required.
Lu, Peng. "Resilire: Achieving High Availability Through Virtual Machine Live Migration." Diss., Virginia Tech, 2013. http://hdl.handle.net/10919/25434.
Full textPh. D.
Väyrynen, Mikael. "Fault-Tolerant Average Execution Time Optimization for General-Purpose Multi-Processor System-On-Chips." Thesis, Linköping University, Linköping University, Linköping University, Department of Computer and Information Science, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-17705.
Full textFault tolerance is due to the semiconductor technology development important, not only for safety-critical systems but also for general-purpose (non-safety critical) systems. However, instead of guaranteeing that deadlines always are met, it is for general-purpose systems important to minimize the average execution time (AET) while ensuring fault tolerance. For a given job and a soft (transient) no-error probability, we define mathematical formulas for AET using voting (active replication), rollback-recovery with checkpointing (RRC) and a combination of these (CRV) where bus communication overhead is included. And, for a given multi-processor system-on-chip (MPSoC), we define integer linear programming (ILP) models that minimize the AET including bus communication overhead when: (1) selecting the number of checkpoints when using RRC or a combination where RRC is included, (2) finding the number of processors and job-to-processor assignment when using voting or a combination where voting is used, and (3) defining fault tolerance scheme (voting, RRC or CRV) per job and defining its usage for each job. Experiments demonstrate significant savings in AET.