Academic literature on the topic 'Checkpointing'

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Checkpointing.'

Journal articles on the topic "Checkpointing"

1

Ahn, Jin Ho. "Scalable Checkpointing-Based Rollback Recovery Protocol for Geographically Distributed Systems." Applied Mechanics and Materials 263-266 (December 2012): 1492–96. http://dx.doi.org/10.4028/www.scientific.net/amm.263-266.1492.

Abstract:
Two opposite approaches have been proposed to address the scalability problems that result from the synchronization of coordinated checkpointing during failure-free operation: minimizing the number of checkpointing participants and making the checkpointing process non-blocking. However, these approaches, oblivious to the underlying network, may not provide the high scalability required in very large-scale P2P-based systems. This paper proposes a non-blocking coordinated checkpointing protocol that significantly reduces checkpointing synchronization overhead by structuring the peer-to-peer network into a set of groups according to a particular criterion. In this protocol, one process in each group is designated as the representative, with two special roles: intra-group and inter-group checkpointing coordination. Intra-group checkpointing coordination handles the checkpointing procedure among processes within a group. Inter-group checkpointing coordination, by contrast, is performed only among representatives. As a result, the proposed protocol may considerably reduce the number of checkpointing control messages routed on core networks compared with existing protocols.
2

Çelikel, Özdinç, and Tolga Ovatman. "Distributed Application Checkpointing for Replicated State Machines." Scalable Computing: Practice and Experience 22, no. 1 (February 9, 2021): 67–79. http://dx.doi.org/10.12694/scpe.v22i1.1840.

Abstract:
Application checkpointing is a widely used recovery mechanism that consists of saving an application's state periodically to be used in case of a failure. In this study we investigate the utilisation of distributed checkpointing for replicated state machines. Conventionally, for replicated state machines, checkpointing information is stored either in a replicated way in each of the replicas or separately in a single instance. Applying distributed checkpointing provides a means to adjust the level of fault tolerance of the checkpointing approach at the expense of recovery time. We use a local cluster and a cloud environment to examine the effects of distributed checkpointing in a simple state machine example and compare the results with conventional approaches. As expected, distributed checkpointing reduces memory consumption and allows different levels of fault tolerance, while performing worse in terms of recovery time.
3

Kumar, Parveen, and Rachit Garg. "Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems." International Journal of Distributed Systems and Technologies 2, no. 1 (January 2011): 1–13. http://dx.doi.org/10.4018/jdst.2011010101.

Abstract:
Minimum-process coordinated checkpointing is a suitable approach for introducing fault tolerance into mobile distributed systems transparently. In order to balance the checkpointing overhead and the loss of computation on recovery, the authors propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has been executed a fixed number of times. In coordinated checkpointing, if a single process fails to take its checkpoint, all the checkpointing effort goes to waste, because each process has to abort its tentative checkpoint. In order to take the tentative checkpoint, an MH (Mobile Host) needs to transfer large checkpoint data to its local MSS over wireless channels. In this regard, the authors propose that in the first phase, all concerned MHs take a soft checkpoint only. A soft checkpoint is similar to a mutable checkpoint. In this case, if some process fails to take a checkpoint in the first phase, the MHs need to abort only their soft checkpoints. The effort of taking a soft checkpoint is negligibly small compared with that of a tentative one. In the minimum-process coordinated checkpointing algorithm, an effort has been made to minimize the number of useless checkpoints and the blocking of processes using a probabilistic approach.
4

Jafary, Bentolhoda, Lance Fiondella, and Ping-Chen Chang. "Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure." Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability 234, no. 4 (May 4, 2020): 636–48. http://dx.doi.org/10.1177/1748006x19893569.

Abstract:
Checkpointing is a technique to back up work at periodic intervals so that if computation fails, it will not be necessary to restart from the beginning but will instead be able to restart from the latest checkpoint. Performing checkpointing operations requires time. Therefore, it is necessary to consider the tradeoff between the time to perform checkpointing operations and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and implements checkpointing operations at equidistant periods during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered. The first assumes that reaching a checkpoint resets the failure distribution. The second allows the probability of failure to progress. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints, despite this correlation.
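The tradeoff described in this abstract can be illustrated with a much simpler model than the one in the article. The sketch below assumes independent, exponentially distributed failures (not the correlated failures the article studies), zero restart cost, and that a failure during a segment or its checkpoint sends execution back to the previous checkpoint. The function names and parameters are hypothetical and chosen only for illustration.

```python
import math


def expected_total_time(work, n_segments, ckpt_cost, failure_rate):
    """Expected completion time with n equal segments, each ending in a checkpoint.

    Simplifying assumptions (not taken from the article): independent
    exponential failures with rate `failure_rate`, no restart cost, and a
    failure anywhere in a segment (including its checkpoint) repeats that
    segment from the previous checkpoint.
    """
    segment = work / n_segments + ckpt_cost
    # Expected time to get through a stretch of length `segment` when every
    # failure restarts the stretch: (e^(rate * segment) - 1) / rate.
    per_segment = math.expm1(failure_rate * segment) / failure_rate
    return n_segments * per_segment


def best_segment_count(work, ckpt_cost, failure_rate, max_segments=1000):
    """Brute-force search for the segment count with the lowest expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_total_time(work, n, ckpt_cost, failure_rate),
    )
```

Too few checkpoints make each failure expensive, and too many make the checkpoint cost dominate, so under these assumptions the minimum lies at an interior segment count.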
5

Plank, J. S., Kai Li, and M. A. Puening. "Diskless checkpointing." IEEE Transactions on Parallel and Distributed Systems 9, no. 10 (1998): 972–86. http://dx.doi.org/10.1109/71.730527.

6

Nam, Hyochang, Jong Kim, Sung Je Hong, and Sunggu Lee. "Secure checkpointing." Journal of Systems Architecture 48, no. 8-10 (March 2003): 237–54. http://dx.doi.org/10.1016/s1383-7621(02)00137-6.

7

Kumar, Parveen. "A Low-Cost Hybrid Coordinated Checkpointing Protocol for Mobile Distributed Systems." Mobile Information Systems 4, no. 1 (2008): 13–32. http://dx.doi.org/10.1155/2008/982349.

Abstract:
Mobile distributed systems raise new issues such as mobility, low bandwidth of wireless channels, disconnections, limited battery power and lack of reliable stable storage on mobile nodes. In minimum-process coordinated checkpointing, some processes may not checkpoint for several checkpoint initiations. In the case of a recovery after a fault, such processes may roll back to a far earlier checkpointed state and thus cause a greater loss of computation. In all-process coordinated checkpointing, the recovery line is advanced for all processes, but the checkpointing overhead may be exceedingly high. To optimize both metrics, the checkpointing overhead and the loss of computation on recovery, we propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has been executed a fixed number of times. Thus, mobile nodes with low activity or in doze mode operation may not be disturbed in the case of minimum-process checkpointing, and the recovery line is advanced for each process after an all-process checkpoint. Additionally, we try to minimize the information piggybacked onto each computation message. For minimum-process checkpointing, we design a blocking algorithm in which no useless checkpoints are taken and an effort has been made to optimize the blocking of processes. We propose to delay selective messages at the receiver end. By doing so, processes are allowed to perform their normal computation, send messages and partially receive them during their blocking period. The proposed minimum-process blocking algorithm forces zero useless checkpoints at the cost of very small blocking.
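The hybrid schedule in this abstract can be sketched as follows. This is only an illustration: the function names are hypothetical, and the minimum-process participant set is computed here as a plain transitive closure over "received a message from" dependencies, which is an assumption about how such sets are commonly formed and not the paper's blocking algorithm.

```python
def checkpoint_kind(initiation_index, min_process_runs):
    """Kind of checkpoint used by the given initiation (0-based index).

    After `min_process_runs` minimum-process checkpoints, the next
    initiation is an all-process checkpoint, and the cycle repeats.
    """
    cycle = min_process_runs + 1
    if initiation_index % cycle == min_process_runs:
        return "all-process"
    return "minimum-process"


def participants(initiator, depends_on, all_processes, kind):
    """Processes that take part in a checkpoint started by `initiator`.

    `depends_on[p]` is the set of processes from which p has received
    messages since its last checkpoint (an assumed bookkeeping structure).
    """
    if kind == "all-process":
        return set(all_processes)
    selected = {initiator}
    stack = [initiator]
    while stack:
        current = stack.pop()
        for other in depends_on.get(current, ()):
            if other not in selected:
                selected.add(other)
                stack.append(other)
    return selected
```

With `min_process_runs=3`, initiations 0 to 2 are minimum-process and initiation 3 is all-process, so idle nodes are disturbed only once per cycle while the recovery line still advances for every process.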
8

Yang, Na, and Yun Wang. "A Checkpointing Recovery Approach for Soft Errors Based on Detector Locations." Electronics 12, no. 4 (February 6, 2023): 805. http://dx.doi.org/10.3390/electronics12040805.

Abstract:
Soft errors are transient errors caused by single-event effects (SEEs) resulting from a strike by high-energy particles acting on sensitive areas of integrated circuits. Soft errors frequently occur in the space environment, adversely affecting the reliability of aerospace-based computing. A recovery process is launched to recover the program when soft errors are detected. A periodic checkpointing recovery approach is widely utilized to prevent soft errors. However, this approach does not consider the detector locations, resulting in a large time overhead. This paper proposes a checkpointing recovery approach for soft errors based on detector locations called DLCKPT. DLCKPT reduces the time overhead by considering detector locations. The experimental results show that the percentage decrease in the time overhead between the DLCKPT and the periodic checkpointing recovery approach is 13.4%. The average recovery rate and average space overhead are 99.3% and 44.4% for the periodic checkpointing recovery approach and 99.4% and 34.6% for the DLCKPT. These results show that the DLCKPT and the periodic checkpointing recovery approach produce comparable results for the recovery rate. The DLCKPT has a lower time overhead and a slightly lower space overhead than the periodic checkpointing recovery approach, demonstrating its effectiveness.
9

Ahn, Jinho. "Communication-Induced Checkpointing with Message Logging beyond the Piecewise Deterministic (PWD) Model for Distributed Systems." Electronics 10, no. 12 (June 14, 2021): 1428. http://dx.doi.org/10.3390/electronics10121428.

Abstract:
This paper introduces an effective communication-induced checkpointing protocol using message logging to enable the number of extra checkpoints to be far lower than the previous number. Even if a situation occurs in which it is decided that a process receiving a message has to perform forced checkpointing, our protocol allows the process to skip the forced checkpointing action if it recognizes that the state of its sender right before the receipt of the message is recoverable. Additionally, the communication-induced checkpointing protocol is thus not required to assume the piecewise deterministic model, despite being combined with message logging. This protocol can maintain these features by piggybacking a one-bit variable and an n-size vector on each message sent. Our simulation results verify our claim that the presented protocol performs much better than the representative optimized protocol with respect to the forced checkpointing frequency, regardless of the communication pattern.
10

Sumit Tomar, Ashish Kumar Mishra, and Dharmendra K Yadav. "Knowledge-based checkpointing strategy for spot instances in cloud computing." Journal of Current Science and Technology 13, no. 2 (July 13, 2023): 412–27. http://dx.doi.org/10.59796/jcst.v13n2.2023.1754.

Abstract:
The Amazon EC2 offers spot-priced virtual machines (VMs) at a reduced price compared to on-demand and reserved VMs. However, Amazon EC2 can terminate these VMs anytime due to the spot price and demand fluctuation. Using spot VMs results in a longer execution time and disrupts service availability. Users can use fault-tolerant techniques such as checkpointing, migration, and job duplication to mitigate the unreliability of spot VMs. In this paper, a knowledge-based checkpointing strategy is proposed to minimize the overall checkpointing overhead during the execution of jobs. The proposed scheme uses real-time price history to decide when to take a checkpoint. Results show that the proposed approach can significantly reduce the turnaround time by 18% compared to Hourly Checkpointing Strategy and 9% compared to Rising-Edge Checkpointing Strategy. One can also achieve 54% to 78% reliability with a cost saving of 78% for the workload used with the described approach.
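The two baselines named in this abstract can be sketched from their names alone. The abstract does not define them, so the code below assumes that a rising-edge strategy checkpoints whenever the sampled spot price goes up, and that the hourly strategy checkpoints at a fixed sampling interval. The paper's own knowledge-based strategy is not reproduced here, and all names are hypothetical.

```python
def rising_edge_checkpoints(prices):
    """Sample indices at which the price rose compared with the previous sample."""
    return [i for i in range(1, len(prices)) if prices[i] > prices[i - 1]]


def fixed_interval_checkpoints(n_samples, every):
    """Sample indices for a checkpoint every `every` samples (e.g. hourly)."""
    return list(range(every, n_samples, every))
```

A rising price is read here as a sign that the VM may soon be terminated, which is why that baseline reacts to price increases instead of the clock.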

Dissertations / Theses on the topic "Checkpointing"

1

Oliner, Adam Jamison. "Cooperative checkpointing for supercomputing systems." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/32102.

Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 91-94).
A system-level checkpointing mechanism, with global knowledge of the state and health of the machine, can improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. This thesis presents such a technique, called cooperative checkpointing, and models its behavior as an online algorithm. Where C is the checkpoint overhead and I is the request interval, a worst-case analysis proves a lower bound of (2 + [C/I])-competitiveness for deterministic cooperative checkpointing algorithms, and proves that a number of simple algorithms meet this bound. Using an expected-case analysis, this thesis proves that an optimal periodic checkpointing algorithm that assumes an exponential failure distribution may be arbitrarily bad relative to an optimal cooperative checkpointing algorithm that permits a general failure distribution. Calculations suggest that, under realistic conditions, an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing. Finally, the thesis suggests an embodiment of cooperative checkpointing for a large-scale high performance computer system and presents the results of some preliminary simulations. These results show that, in extreme cases, cooperative checkpointing improved system utilization by more than 25%, reduced bounded slowdown by a factor of 9, while simultaneously reducing the amount of work lost due to failures by 30%. This thesis contributes a unique approach to providing large-scale system reliability through cooperative checkpointing, techniques for analyzing the approach, and blueprints for implementing it in practice.
by Adam Jamison Oliner.
M.Eng.
2

Kalaiselvi, S. "Checkpointing Algorithms for Parallel Computers." Thesis, Indian Institute of Science, 1997. https://etd.iisc.ac.in/handle/2005/3908.

Abstract:
Checkpointing is a technique widely used in parallel/distributed computers for rollback error recovery. Checkpointing is defined as the coordinated saving of process state information at specified time instances. Checkpoints help in restoring the computation from the latest saved state in case of failure. In addition to fault recovery, checkpointing has applications in fault detection, distributed debugging and process migration. Checkpointing in uniprocessor systems is easy because there is a single clock and events occur with respect to this clock. There is a clear demarcation between events that happen before a checkpoint and events that happen after a checkpoint. In parallel computers a large number of computers coordinate to solve a single problem. Since there might be multiple streams of execution, checkpoints have to be introduced along all these streams simultaneously. The absence of a global clock necessitates explicit coordination to obtain a consistent global state. Events occurring in a distributed system can be partially ordered using Lamport's happens-before relation. Lamport's happens-before relation -> is a partial ordering relation used to identify dependent and concurrent events occurring in a distributed system. It is defined as follows: if two events a and b happen in the same process, and if a happens before b, then a -> b; if a is the sending event of a message and b is the receiving event of the same message, then a -> b; if neither a -> b nor b -> a, then a and b are said to be concurrent. A consistent global state may have concurrent checkpoints. In the first chapter of the thesis we discuss issues regarding ordering of events in a parallel computer, the need for coordination among checkpoints and other aspects related to checkpointing. Checkpointing locations can be identified either statically or dynamically.
The static approach assumes that a representation of a program to be checkpointed is available with information that enables a programmer to specify the places where checkpoints are to be taken. The dynamic approach identifies the checkpointing locations at run time. In this thesis, we have proposed algorithms for both static and dynamic checkpointing. The main contributions of this thesis are as follows: 1. Parallel computers that are being built now have faster communication and hence more efficient clock synchronisation compared to those built a few years ago. Based on efficient clock synchronisation protocols, the clock drift in current machines can be maintained within a few microseconds. We have proposed a dynamic checkpointing algorithm for parallel computers assuming bounded clock drifts. 2. The shared memory paradigm is convenient for programming while the message passing paradigm is easy to scale. Distributed Shared Memory (DSM) systems combine the advantages of both paradigms and can be visualized easily on top of a network of workstations. IEEE has recently proposed an interconnect standard called Scalable Coherent Interface (SCI) to configure computers as a Distributed Shared Memory system. A periodic dynamic checkpointing algorithm has been proposed in the thesis for a DSM system which uses the SCI standard. 3. When information about a parallel program is available one can make use of this knowledge to perform efficient checkpointing. A static checkpointing approach based on task graphs is proposed for parallel programs. The proposed task graph based static checkpointing approach has been implemented on a Parallel Virtual Machine (PVM) platform. We now give a gist of the various chapters of the thesis. Chapter 2 of the thesis gives a classification of existing checkpointing algorithms. The chapter surveys algorithms that have been reported in the literature for checkpointing parallel/distributed systems.
A point to be noted is that most of the algorithms published for checkpointing message passing systems are based on the seminal article by Chandy & Lamport. A large number of checkpointing algorithms have been published by relaxing the assumptions made in the above mentioned article and by extending the features to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Chapter 2 concludes with brief comments on the desirable features of a checkpointing algorithm. In Chapter 3, we develop a dynamic checkpointing algorithm for message passing systems assuming that the clock drift of processors in the system is bounded. Efficient clock synchronisation protocols have been implemented on recent parallel computers owing to the fact that communication between processors is very fast. Based on efficient clock synchronisation protocols, clock skew can be limited to a few microseconds. The algorithm proposed in the thesis uses clocks for checkpoint coordination and vector counts for identifying messages to be logged. The algorithm is a periodic, distributed algorithm. We prove correctness of the algorithm and compare it with similar clock based algorithms. Distributed Shared Memory (DSM) systems provide the benefit of ease of programming in a scalable system. The recently proposed IEEE Scalable Coherent Interface (SCI) standard, facilitates the construction of scalable coherent systems. In Chapter 4 we discuss a checkpointing algorithm for an SCI based DSM system. 
SCI maintains cache coherence in hardware using a distributed cache directory which scales with the number of processors in the system. SCI recommends a two phase transaction protocol for communication. Our algorithm is a two phase centralised coordinated algorithm. Phase one initiates checkpoints and the checkpointing activity is completed in phase two. The correctness of the algorithm is established theoretically. The chapter concludes with the discussion of the features of SCI exploited by the checkpointing algorithm proposed in the thesis. In Chapter 5, a static checkpointing algorithm is developed assuming that the program to be executed on a parallel computer is given as a directed acyclic task graph. We assume that the estimates of the time to execute each task in the task graph is given. Given the timing at which checkpoints are to be taken, the algorithm identifies a set of edges where checkpointing tasks can be placed ensuring that they form a consistent global checkpoint. The proposed algorithm eliminates coordination overhead at run time. It significantly reduces the context saving overhead by taking checkpoints along edges of the task graph. The algorithm is used as a preprocessing step before scheduling the tasks to processors. The algorithm complexity is O(km) where m is the number of edges in the graph and k the maximum number of global checkpoints to be taken. The static algorithm is implemented on a parallel computer with a PVM environment as it is widely available and portable. The task graph of a program can be constructed manually or through program development tools. Our implementation is a collection of preprocessing and run time routines. The preprocessing routines operate on the task graph information to generate a set of edges to be checkpointed for each global checkpoint and write the information on disk. The run time routines save the context along the marked edges. 
In case of recovery, the recovery algorithms read the information from stable storage and reconstruct the context. The limitation of our static checkpointing algorithm is that it can operate only on deterministic task graphs. To demonstrate the practical feasibility of the proposed approach, case studies of checkpointing some parallel programs are included in the thesis. We conclude the thesis with a summary of proposed algorithms and possible directions to continue research in the area of checkpointing.
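The happens-before relation defined in this abstract is often made checkable with vector timestamps. That is a standard technique assumed here for illustration; the abstract does not say the thesis uses it in this form. In the sketch below each event or checkpoint carries a tuple with one counter per process, and the function names are hypothetical.

```python
from itertools import combinations


def happened_before(a, b):
    """True if vector timestamp a precedes b, i.e. a -> b."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither a -> b nor b -> a holds for two distinct timestamps."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def is_consistent(checkpoints):
    """True if no checkpoint in the set happened before another one.

    `checkpoints` holds one vector timestamp per process; a set in which
    some checkpoint causally precedes another cannot form a consistent
    global state.
    """
    return all(
        not happened_before(a, b) and not happened_before(b, a)
        for a, b in combinations(checkpoints, 2)
    )
```

For example, timestamps (2, 0) and (0, 1) are concurrent and may belong to the same global checkpoint, while (1, 0) and (1, 1) are causally ordered and may not.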
3

Kalaiselvi, S. "Checkpointing Algorithms for Parallel Computers." Thesis, Indian Institute of Science, 1997. http://hdl.handle.net/2005/67.

Full text
Abstract:
Checkpointing is a technique widely used in parallel/distributed computers for rollback error recovery. Checkpointing is defined as the coordinated saving of process state information at specified time instances. Checkpoints help in restoring the computation from the latest saved state, in case of failure. In addition to fault recovery, checkpointing has applications in fault detection, distributed debugging and process migration. Checkpointing in uniprocessor systems is easy due to the fact that there is a single clock and events occur with respect to this clock. There is a clear demarcation of events that happens before a checkpoint and events that happens after a checkpoint. In parallel computers a large number of computers coordinate to solve a single problem. Since there might be multiple streams of execution, checkpoints have to be introduced along all these streams simultaneously. Absence of a global clock necessitates explicit coordination to obtain a consistent global state. Events occurring in a distributed system, can be ordered partially using Lamport's happens before relation. Lamport's happens before relation ->is a partial ordering relation to identify dependent and concurrent events occurring in a distributed system. It is defined as follows: ·If two events a and b happen in the same process, and if a happens before b, then a->b ·If a is the sending event of a message and b is the receiving event of the same message then a -> b ·If neither a à b nor b -> a, then a and b are said to be concurrent. A consistent global state may have concurrent checkpoints. In the first chapter of the thesis we discuss issues regarding ordering of events in a parallel computer, need for coordination among checkpoints and other aspects related to checkpointing. Checkpointing locations can either be identified statically or dynamically. 
The static approach assumes that a representation of a program to be checkpointed is available with information that enables a programmer to specify the places where checkpoints are to be taken. The dynamic approach identifies the checkpointing locations at run time. In this thesis, we have proposed algorithms for both static and dynamic checkpointing. The main contributions of this thesis are as follows: 1. Parallel computers that are being built now have faster communication and hence more efficient clock synchronisation compared to those built a few years ago. Based on efficient clock synchronisation protocols, the clock drift in current machines can be maintained within a few microseconds. We have proposed a dynamic checkpointing algorithm for parallel computers assuming bounded clock drifts. 2. The shared memory paradigm is convenient for programming while message passing paradigm is easy to scale. Distributed Shared Memory (DSM) systems combine the advantage of both paradigms and can be visualized easily on top of a network of workstations. IEEE has recently proposed an interconnect standard called Scalable Coherent Interface (SCI), to con6gure computers as a Distributed Shared Memory system. A periodic dynamic checkpointing algorithm has been proposed in the thesis for a DSM system which uses the SCI standard. 3. When information about a parallel program is available one can make use of this knowledge to perform efficient checkpointing. A static checkpointing approach based on task graphs is proposed for parallel programs. The proposed task graph based static checkpointing approach has been implemented on a Parallel Virtual Machine (PVM) platform. We now give a gist of various chapters of the thesis. Chapter 2 of the thesis gives a classification of existing checkpointing algorithms. The chapter surveys algorithm that have been reported in literature for checkpointing parallel/distributed systems. 
A point to be noted is that most of the algorithms published for checkpointing message-passing systems are based on the seminal article by Chandy & Lamport. A large number of checkpointing algorithms have been published by relaxing the assumptions made in that article and by extending its features to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. However, they also include methods for storing the status of distributed memory in stable storage. Chapter 2 concludes with brief comments on the desirable features of a checkpointing algorithm. In Chapter 3, we develop a dynamic checkpointing algorithm for message-passing systems, assuming that the clock drift of processors in the system is bounded. Efficient clock synchronisation protocols have been implemented on recent parallel computers owing to the fact that communication between processors is very fast. With such protocols, clock skew can be limited to a few microseconds. The algorithm proposed in the thesis uses clocks for checkpoint coordination and vector counts for identifying messages to be logged. The algorithm is a periodic, distributed algorithm. We prove the correctness of the algorithm and compare it with similar clock-based algorithms. Distributed Shared Memory (DSM) systems provide the benefit of ease of programming in a scalable system. The recently proposed IEEE Scalable Coherent Interface (SCI) standard facilitates the construction of scalable coherent systems. In Chapter 4, we discuss a checkpointing algorithm for an SCI-based DSM system.
SCI maintains cache coherence in hardware using a distributed cache directory which scales with the number of processors in the system. SCI recommends a two-phase transaction protocol for communication. Our algorithm is a two-phase, centralised, coordinated algorithm. Phase one initiates checkpoints, and the checkpointing activity is completed in phase two. The correctness of the algorithm is established theoretically. The chapter concludes with a discussion of the features of SCI exploited by the checkpointing algorithm proposed in the thesis. In Chapter 5, a static checkpointing algorithm is developed, assuming that the program to be executed on a parallel computer is given as a directed acyclic task graph. We assume that estimates of the time to execute each task in the task graph are given. Given the times at which checkpoints are to be taken, the algorithm identifies a set of edges where checkpointing tasks can be placed, ensuring that they form a consistent global checkpoint. The proposed algorithm eliminates coordination overhead at run time. It significantly reduces the context-saving overhead by taking checkpoints along edges of the task graph. The algorithm is used as a preprocessing step before scheduling the tasks to processors. The algorithm's complexity is O(km), where m is the number of edges in the graph and k is the maximum number of global checkpoints to be taken. The static algorithm is implemented on a parallel computer with a PVM environment, as PVM is widely available and portable. The task graph of a program can be constructed manually or through program development tools. Our implementation is a collection of preprocessing and run-time routines. The preprocessing routines operate on the task graph information to generate a set of edges to be checkpointed for each global checkpoint and write the information to disk. The run-time routines save the context along the marked edges.
In case of recovery, the recovery algorithms read the information from stable storage and reconstruct the context. The limitation of our static checkpointing algorithm is that it can operate only on deterministic task graphs. To demonstrate the practical feasibility of the proposed approach, case studies of checkpointing some parallel programs are included in the thesis. We conclude the thesis with a summary of the proposed algorithms and possible directions for further research in the area of checkpointing.
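The edge-selection step of the static algorithm described in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes every task starts as soon as its predecessors finish (unlimited processors), and the function names `earliest_finish_times` and `checkpoint_edges` are hypothetical. For a checkpoint time t, the tasks that have finished by t form a predecessor-closed set, so the edges leaving that set form a consistent cut. Scanning the m edges once for each of the k checkpoint times costs O(km) in total for the edge scans.

```python
from collections import defaultdict, deque


def earliest_finish_times(durations, edges):
    """Return the earliest finish time of every task in a DAG.

    Assumption for this sketch: a task starts as soon as all of its
    predecessors have finished (no processor contention).
    """
    successors = defaultdict(list)
    indegree = {task: 0 for task in durations}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    start = {task: 0.0 for task in durations}
    finish = {}
    ready = deque(task for task, deg in indegree.items() if deg == 0)
    while ready:  # Kahn's topological traversal
        u = ready.popleft()
        finish[u] = start[u] + durations[u]
        for v in successors[u]:
            start[v] = max(start[v], finish[u])
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if len(finish) != len(durations):
        raise ValueError("task graph contains a cycle")
    return finish


def checkpoint_edges(durations, edges, checkpoint_times):
    """For each checkpoint time, list the edges crossing a consistent cut.

    The cut separates tasks that have finished by time t from all others.
    A finished task's predecessors are also finished, so no edge points
    from the unfinished side back into the finished side.
    """
    finish = earliest_finish_times(durations, edges)
    cuts = []
    for t in checkpoint_times:
        done = {task for task, f in finish.items() if f <= t}
        cuts.append([(u, v) for u, v in edges if u in done and v not in done])
    return cuts
```

For a diamond-shaped graph with edges (a, b), (a, c), (b, d), (c, d) and durations 2, 3, 1, 2 for a, b, c, d, a checkpoint at time 4 selects the edges (a, b) and (c, d): task b is still running, so its input is saved and b is re-executed on recovery.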
4

Nilsson, Christoffer, and Sebastian Karlsson. "Adaptive Checkpointing for Emergency Communication Systems." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-130876.

Abstract:
The purpose of an emergency communication system is to be ready and available at all times during an emergency situation. This means that emergency systems have specific usage characteristics: they experience long idle periods followed by usage spikes. To achieve high availability, it is important to have a fault-tolerant solution. This report focuses on warm passive replication. When using warm passive replication, checkpointing is the procedure of transferring the current state from a primary server to its replicas. In order to utilize resources more effectively than a fixed-interval checkpointing method does, an adaptive checkpointing method is proposed. A simulation-based comparison is carried out using MATLAB and Simulink to test both the proposed adaptive method and the fixed-interval method. Two metrics, response time and time to recover, and four parameters are used in the simulation. The results show that an adaptive method can increase efficiency, but that making a good adaptive method requires specific information regarding system configuration and usage characteristics.
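The contrast between fixed-interval and adaptive checkpointing can be made concrete with a small sketch. The abstract does not describe the thesis's actual policy, so the rule below is only one plausible assumption: shorten the interval when the primary's state changes quickly (a usage spike) and lengthen it during idle periods, within fixed bounds. The function name and all parameters are hypothetical.

```python
def next_checkpoint_interval(
    updates_per_second: float,
    updates_per_checkpoint: float = 100.0,
    min_interval: float = 1.0,
    max_interval: float = 60.0,
) -> float:
    """Pick the delay (in seconds) before the next checkpoint.

    Illustrative policy, not taken from the thesis: aim to transfer state
    to the replicas roughly every `updates_per_checkpoint` state changes,
    but never more often than `min_interval` and never less often than
    `max_interval`.
    """
    if updates_per_second <= 0:
        # Idle system: fall back to the longest allowed interval.
        return max_interval
    interval = updates_per_checkpoint / updates_per_second
    return max(min_interval, min(max_interval, interval))
```

Under this rule an idle system checkpoints once every `max_interval` seconds, while a spike of 1000 updates per second drives the interval down to `min_interval`, which bounds the amount of state a replica can be missing after a failover.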
5

Vieira, Gustavo Maciel Dias. "Estudo comparativo de algoritmos para checkpointing." [s.n.], 2001. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276435.

Abstract:
Advisor: Luiz Eduardo Buzato
Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-01T02:33:00Z (GMT). No. of bitstreams: 1 Vieira_GustavoMacielDias_M.pdf: 3096254 bytes, checksum: 30b7155e50de3e9afd753dd40520b771 (MD5) Previous issue date: 2001
Abstract: This dissertation provides a comprehensive comparative study of the performance of quasi-synchronous checkpointing algorithms. To do so, we used the simulation of distributed systems, which gives us the freedom to build system models easily. The comparative study assessed, for the first time in a uniform environment, the impact on the algorithms' performance of factors such as the system's scale, the basic checkpoint rate, and the relative speed of the application's processes. By analyzing these data, we acquired a deep understanding of the behavior of these algorithms and were able to produce a valuable reference for system architects looking for checkpointing algorithms for their distributed applications.
Master's
Master in Computer Science
6

Bai, Yunhao. "A Checkpointing Methodology for Android Smartphone." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1461171667.

7

Jeyakumar, Ashwin Raju. "Metamori: A library for Incremental File Checkpointing." Thesis, Virginia Tech, 2004. http://hdl.handle.net/10919/9969.

Abstract:
The advent of cluster computing has resulted in a thrust towards providing software mechanisms for reliability on clusters. The prevalent model for such mechanisms is to take a snapshot of the state of an application, called a checkpoint and commit it to stable storage. This checkpoint has sufficient meta-data, so that if the application fails, it can be restarted from the checkpoint. This operation is called a restore. In order to record a process' complete state, both its volatile and persistent state must be checkpointed. Several libraries exist for checkpointing volatile state. Some of these libraries feature incremental checkpointing, where only the changes since the last checkpoint are recorded in the next checkpoint. Such incremental checkpointing is advantageous since otherwise, the time taken for each successive checkpoint becomes larger and larger. Also, when checkpointing is done in increments, we can restore state to any of the previous checkpoints; a vital feature for adaptive applications. This thesis presents a user-level incremental checkpointing library for files: Metamori. This brings the advantages of incremental memory checkpointing to files as well, thereby providing a low-overhead approach to checkpoint persistent state. Thus, the complete state of an application can now be incrementally checkpointed, as compared to earlier approaches where volatile state was checkpointed incrementally but persistent state had no such facilities.
Master of Science
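The incremental idea described in this abstract (record only what changed since the last checkpoint, yet allow a restore to any earlier checkpoint) can be sketched at block granularity. This is an in-memory illustration under stated assumptions, not Metamori's API or on-disk format: the class name `IncrementalCheckpoints`, its methods, and the tiny `BLOCK_SIZE` are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to trace


def _blocks(data: bytes):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class IncrementalCheckpoints:
    """Keeps one increment per checkpoint: the blocks that changed."""

    def __init__(self):
        self._digests = []     # block digests as of the latest checkpoint
        self._increments = []  # (file length, {block index: block bytes})

    def checkpoint(self, data: bytes) -> int:
        """Record only the blocks that differ from the previous checkpoint."""
        blocks = _blocks(data)
        digests = [hashlib.sha256(b).digest() for b in blocks]
        changed = {
            i: block
            for i, block in enumerate(blocks)
            if i >= len(self._digests) or digests[i] != self._digests[i]
        }
        self._digests = digests
        self._increments.append((len(data), changed))
        return len(self._increments) - 1

    def restore(self, checkpoint_id: int) -> bytes:
        """Rebuild the contents by replaying increments 0..checkpoint_id."""
        blocks = {}
        length = 0
        for length, changed in self._increments[:checkpoint_id + 1]:
            blocks.update(changed)
        n_blocks = -(-length // BLOCK_SIZE)  # ceiling division
        return b"".join(blocks[i] for i in range(n_blocks))[:length]
```

Replaying increments from the first checkpoint keeps the sketch short; a real library would more likely bound restore time, for example by taking an occasional full checkpoint.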
8

Schmidt, Rodrigo Malta. "Coleta de lixo para protocolos de checkpointing." [s.n.], 2003. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276424.

Abstract:
Advisors: Luiz Eduardo Buzato, Islene Calciolari Garcia
Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Matematica, Estatistica e Computação Cientifica
Made available in DSpace on 2018-08-03T19:18:25Z (GMT). No. of bitstreams: 1 Schmidt_RodrigoMalta_M.pdf: 745421 bytes, checksum: c32cef5e0a61fe3580cc8a211902f9fd (MD5) Previous issue date: 2003
Master's
9

Vasireddy, Rahul. "ASYNCHRONOUS CHECKPOINTING AND RECOVERY APPROACH FOR DISTRIBUTED SYSTEMS." Available to subscribers only, 2009. http://proquest.umi.com/pqdweb?did=1967797571&sid=6&Fmt=2&clientId=1509&RQT=309&VName=PQD.

10

Montón, i. Macián Màrius. "Checkpointing for virtual platforms and systemC-TLM-2.0." Doctoral thesis, Universitat Autònoma de Barcelona, 2010. http://hdl.handle.net/10803/32099.

Abstract:
One advantage of using a virtual platform or virtual prototype over real hardware for embedded software development and testing is the ability of some simulators to take checkpoints of their state. If the entire system model is detailed enough, it might take several minutes (or even hours) to simulate booting the operating system. If a snapshot of the simulation is saved just after it has finished booting, then each time it is necessary to run the embedded software, designers can simply restore the snapshot and continue from it. Restoring a checkpoint typically takes a few seconds. This can translate into a major productivity gain, especially when working with embedded systems that run complex software stacks on operating systems, as modern embedded devices do. In this dissertation, we first present our work on adding SystemC, a system description language, to two virtual platforms. This work was done for a commercial virtual platform and later ported to an open-source platform. This thesis also presents a set of modifications to the SystemC language to support checkpointing. These modifications make it possible to take the state of a running SystemC simulation and save it to disk. Later, the same simulation can be restored to the point where it was, without any change to the simulated modules. These changes should help SystemC become more widely used as a description language for virtual platforms.

Books on the topic "Checkpointing"

1

Chaudhary, Vipin. Computation checkpointing and migration. Hauppauge NY: Nova Science Publishers, 2009.

2

Chaudhary, Vipin. Computation checkpointing and migration. Hauppauge NY: Nova Science Publishers, 2009.

3

Kent, Fuchs W., and United States. National Aeronautics and Space Administration., eds. Optimal message log reclamation for independent checkpointing. [Urbana, Ill.]: Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign, 1993.

4

United States. National Aeronautics and Space Administration., ed. Space reclamation for uncoordinated checkpointing in message-passing systems. [Urbana, IL]: Coordinated Science Laboratory, College of Engineering, 1993.

5

Wolter, Katinka. Stochastic models for fault tolerance: Restart, rejuvenation and checkpointing. Heidelberg: Springer, 2010.

6

United States. National Aeronautics and Space Administration., ed. Space reclamation for uncoordinated checkpointing in message-passing systems. [Urbana, IL]: Coordinated Science Laboratory, College of Engineering, 1993.

7

Reducing space overhead for independent checkpointing. Urbana, Ill: Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign, 1992.

8

Wolter, Katinka. Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing. Springer, 2014.

9

Wolter, Katinka. Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing. Springer, 2010.

10

Bieker, Bernd. Fault Tolerance for Scalable Applications: Checkpointing Protocols for Parallel Message-Passing-Systems. Lang GmbH, Internationaler Verlag der Wissenschaften, Peter, 2003.


Book chapters on the topic "Checkpointing"

1

Steele, Guy L., Xiaowei Shen, Josep Torrellas, Mark Tuckerman, Eric J. Bohm, Laxmikant V. Kalé, Glenn Martyna, et al. "Checkpointing." In Encyclopedia of Parallel Computing, 264–73. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-09766-4_62.

2

Wolter, Katinka. "Checkpointing Systems." In Stochastic Models for Fault Tolerance, 171–76. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-11257-7_8.

3

Raynal, Michel. "Asynchronous Distributed Checkpointing." In Distributed Algorithms for Message-Passing Systems, 189–218. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-38123-2_8.

4

Danilecki, Arkadiusz, and Michał Szychowiak. "Speculation Meets Checkpointing." In Computational Science – ICCS 2006, 753–60. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11758501_100.

5

Baldoni, Roberto, Francesco Quaglia, and Michel Raynal. "Distributed Database Checkpointing." In Euro-Par’99 Parallel Processing, 450–58. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999. http://dx.doi.org/10.1007/3-540-48311-x_61.

6

Wolter, Katinka. "Stochastic Models for Checkpointing." In Stochastic Models for Fault Tolerance, 177–236. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-11257-7_9.

7

Aupy, Guillaume, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, and Laurent Lefèvre. "Energy-Aware Checkpointing Strategies." In Computer Communications and Networks, 279–317. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-20943-2_5.

8

Bronevetsky, Greg, Daniel Marques, Keshav Pingali, and Radu Rugina. "Compiler-Enhanced Incremental Checkpointing." In Languages and Compilers for Parallel Computing, 1–15. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. http://dx.doi.org/10.1007/978-3-540-85261-2_1.

9

Ahlroth, Lauri, Olli Pottonen, and André Schumacher. "Approximately Uniform Online Checkpointing." In Lecture Notes in Computer Science, 297–306. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. http://dx.doi.org/10.1007/978-3-642-22685-4_27.

10

Koch, Uwe, Eva Kanellopoulos, and Dietmar Kaletta. "Architekturunabhängiges Checkpointing durch Präprozessing." In Software Engineering im Scientific Computing, 225–29. Wiesbaden: Vieweg+Teubner Verlag, 1996. http://dx.doi.org/10.1007/978-3-322-85027-0_29.


Conference papers on the topic "Checkpointing"

1

Hyo-Chang Nam, Jong Kim, SungJe Hong, and Sunggu Lee. "Probabilistic checkpointing." In Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing. IEEE, 1997. http://dx.doi.org/10.1109/ftcs.1997.614077.

2

Oliner, Adam J., Larry Rudolph, and Ramendra K. Sahoo. "Cooperative checkpointing." In the 20th annual international conference. New York, New York, USA: ACM Press, 2006. http://dx.doi.org/10.1145/1183401.1183406.

3

Vogt, Dirk, Cristiano Giuffrida, Herbert Bos, and Andrew S. Tanenbaum. "Lightweight Memory Checkpointing." In 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2015. http://dx.doi.org/10.1109/dsn.2015.45.

4

Vogt, Dirk, Armando Miraglia, Georgios Portokalidis, Herbert Bos, Andy Tanenbaum, and Cristiano Giuffrida. "Speculative Memory Checkpointing." In Middleware '15: 16th International Middleware Conference. New York, NY, USA: ACM, 2015. http://dx.doi.org/10.1145/2814576.2814802.

5

Oliner, A., L. Rudolph, and R. Sahoo. "Cooperative checkpointing theory." In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. IEEE, 2006. http://dx.doi.org/10.1109/ipdps.2006.1639368.

6

Koch, Dirk, Christian Haubelt, and Jürgen Teich. "Efficient hardware checkpointing." In the 2007 ACM/SIGDA 15th international symposium. New York, New York, USA: ACM Press, 2007. http://dx.doi.org/10.1145/1216919.1216950.

7

Goulart, Henrique, Álvaro Franco, and Odorico Mendizabal. "Checkpointing Techniques in Distributed Systems: A Synopsis of Diverse Strategies Over the Last Decades." In Workshop de Testes e Tolerância a Falhas. Sociedade Brasileira de Computação - SBC, 2023. http://dx.doi.org/10.5753/wtf.2023.785.

Abstract:
This paper concisely reviews checkpointing techniques in distributed systems, focusing on various aspects such as coordinated and uncoordinated checkpointing, incremental checkpoints, fuzzy checkpoints, adaptive checkpoint intervals, and kernel-based and user-space checkpoints. The review highlights interesting points, outlines how each checkpoint approach works, and discusses their advantages and drawbacks. It also provides a brief overview of the adoption of checkpoints in different contexts in distributed computing, including Database Management Systems (DBMS), State Machine Replication (SMR), and High-Performance Computing (HPC) environments. Additionally, the paper briefly explores the application of checkpointing strategies in modern cloud and container environments, discussing their role in live migration and application state management. The review offers valuable insights into their adoption and application across various distributed computing contexts by summarizing the historical development, advances, and challenges in checkpointing techniques.
8

Lumpp, J. E. "Checkpointing with multicast communication." In 1998 IEEE Aerospace Conference. Proceedings. IEEE, 1998. http://dx.doi.org/10.1109/aero.1998.682213.

9

Luo, ZongWei. "Checkpointing for workflow recovery." In the 38th annual. New York, New York, USA: ACM Press, 2000. http://dx.doi.org/10.1145/1127716.1127735.

10

Ahn, Jinho. "Scalable Distributed Checkpointing Algorithm." In 2020 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, 2020. http://dx.doi.org/10.1109/csci51800.2020.00237.


Reports on the topic "Checkpointing"

1

Li, Chung-Chi J., Elliot M. Stewart, and W. K. Fuchs. Compiler-Assisted Full Checkpointing. Fort Belvoir, VA: Defense Technical Information Center, January 1990. http://dx.doi.org/10.21236/ada274291.

2

Neves, Nuno, and W. Kent Fuchs. Coordinated Checkpointing Without Direct Coordination. Fort Belvoir, VA: Defense Technical Information Center, January 1998. http://dx.doi.org/10.21236/ada348851.

3

Wong, D., G. Lloyd, and M. Gokhale. A Memory-mapped Approach to Checkpointing. Office of Scientific and Technical Information (OSTI), April 2013. http://dx.doi.org/10.2172/1084700.

4

Koo, Richard, and Sam Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. Fort Belvoir, VA: Defense Technical Information Center, October 1985. http://dx.doi.org/10.21236/ada161126.

5

Solano-Quinde, Lizandro Damian. Parallelization and checkpointing of GPU applications through program transformation. Office of Scientific and Technical Information (OSTI), January 2012. http://dx.doi.org/10.2172/1082971.

6

Neves, Nuno, and W. K. Fuchs. Using Time to Improve the Performance of Coordinated Checkpointing,. Fort Belvoir, VA: Defense Technical Information Center, January 1996. http://dx.doi.org/10.21236/ada310228.

7

Wang, Yi-Min, Pi-Yu Chung, In-Jen Lin, and W. K. Fuchs. Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message- Passing Systems. Fort Belvoir, VA: Defense Technical Information Center, January 1991. http://dx.doi.org/10.21236/ada274186.

8

Moody, A., G. Bronevetsky, K. Mohror, and B. de Supinski. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. Office of Scientific and Technical Information (OSTI), April 2010. http://dx.doi.org/10.2172/984082.
