Journal articles on the topic 'Checkpointing'

1

Ahn, Jin Ho. "Scalable Checkpointing-Based Rollback Recovery Protocol for Geographically Distributed Systems." Applied Mechanics and Materials 263-266 (December 2012): 1492–96. http://dx.doi.org/10.4028/www.scientific.net/amm.263-266.1492.

Abstract:
Two opposite approaches have been proposed to address the scalability problem caused by coordinated checkpointing's synchronization during failure-free operation: minimizing the number of checkpointing participants and making the checkpointing process non-blocking. However, these previous approaches, oblivious to the underlying network, may not fundamentally provide any breakthrough for ensuring the high scalability required in very large-scale P2P-based systems. This paper proposes a non-blocking coordinated checkpointing protocol that significantly reduces checkpointing synchronization overhead by structuring the peer-to-peer network into a set of groups according to a particular criterion. In this protocol, one process in each group is designated as the representative, with two special roles: intra-group and inter-group checkpointing coordination. Intra-group checkpointing coordination addresses the checkpointing procedure among processes within a group. Inter-group checkpointing coordination, on the other hand, is performed only among representatives. Thanks to this feature, the proposed protocol may considerably reduce the number of checkpointing control messages routed on core networks compared with existing protocols.
2

Çelikel, Özdinç, and Tolga Ovatman. "Distributed Application Checkpointing for Replicated State Machines." Scalable Computing: Practice and Experience 22, no. 1 (February 9, 2021): 67–79. http://dx.doi.org/10.12694/scpe.v22i1.1840.

Abstract:
Application checkpointing is a widely used recovery mechanism that consists of periodically saving an application's state for use in case of a failure. In this study we investigate the utilisation of distributed checkpointing for replicated state machines. Conventionally, for replicated state machines, checkpointing information is stored in a replicated way in each of the replicas or separately in a single instance. Applying distributed checkpointing provides a means to adjust the level of fault tolerance of the checkpointing approach at the expense of recovery time. We use a local cluster and a cloud environment to examine the effects of distributed checkpointing in a simple state machine example and compare the results with conventional approaches. As expected, distributed checkpointing reduces memory consumption and supports different levels of fault tolerance while performing worse in terms of recovery time.
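The basic mechanism named in the first sentence of this abstract (save state periodically, resume from the last saved state after a failure) can be sketched in a few lines. This is a minimal single-process illustration, not the distributed scheme studied in the article; the function names, the pickle file format, and the `fail_at` crash switch are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_counter(path, target, interval, fail_at=None):
    """Sum 1..target, checkpointing every `interval` steps.

    fail_at (hypothetical test hook) raises once that many steps are done,
    simulating a crash; a later call resumes from the last checkpoint.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < target:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

With `interval=5`, a crash after step 7 loses only steps 6 and 7: the next call reloads the step-5 checkpoint and redoes them.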
3

Kumar, Parveen, and Rachit Garg. "Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems." International Journal of Distributed Systems and Technologies 2, no. 1 (January 2011): 1–13. http://dx.doi.org/10.4018/jdst.2011010101.

Abstract:
Minimum-process coordinated checkpointing is a suitable approach to introduce fault tolerance in mobile distributed systems transparently. To balance the checkpointing overhead and the loss of computation on recovery, the authors propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has executed a fixed number of times. In coordinated checkpointing, if a single process fails to take its checkpoint, all the checkpointing effort is wasted, because each process has to abort its tentative checkpoint. To take the tentative checkpoint, an MH (Mobile Host) needs to transfer large checkpoint data to its local MSS over wireless channels. In this regard, the authors propose that in the first phase, all concerned MHs take a soft checkpoint only. A soft checkpoint is similar to a mutable checkpoint. In this case, if some process fails to take a checkpoint in the first phase, then MHs need to abort their soft checkpoints only. The effort of taking a soft checkpoint is negligibly small compared to that of a tentative one. In the minimum-process coordinated checkpointing algorithm, an effort has been made to minimize the number of useless checkpoints and the blocking of processes using a probabilistic approach.
4

Jafary, Bentolhoda, Lance Fiondella, and Ping-Chen Chang. "Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure." Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability 234, no. 4 (May 4, 2020): 636–48. http://dx.doi.org/10.1177/1748006x19893569.

Abstract:
Checkpointing is a technique to back up work at periodic intervals so that if computation fails, it will not be necessary to restart from the beginning but will instead be able to restart from the latest checkpoint. Performing checkpointing operations requires time. Therefore, it is necessary to consider the tradeoff between the time to perform checkpointing operations and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and implements checkpointing operations at equidistant periods during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered. The first assumes that reaching a checkpoint resets the failure distribution. The second allows the probability of failure to progress. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints, despite this correlation.
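The tradeoff described in this abstract can be illustrated with a much simpler textbook model than the article's Markov model. The sketch below assumes independent, exponentially distributed failures with a constant rate, no restart delay, and a checkpoint at the end of every segment; none of these assumptions comes from the article, and in particular the sketch ignores correlation. Under those assumptions, a segment of length s that restarts from its beginning after each failure takes (e^(rate*s) - 1) / rate on average. Function and parameter names are hypothetical.

```python
import math


def expected_time(work, n, cost, rate):
    """Expected completion time with n equidistant checkpoints.

    Assumes exponential failures at `rate`; each of the n segments consists
    of work/n units of computation plus one checkpoint of length `cost`,
    and is re-executed from its start whenever a failure interrupts it.
    """
    segment = work / n + cost
    return n * math.expm1(rate * segment) / rate


def best_checkpoint_count(work, cost, rate, max_n=1000):
    """Brute-force search for the n in 1..max_n minimizing expected_time."""
    return min(range(1, max_n + 1),
               key=lambda n: expected_time(work, n, cost, rate))
```

Too few checkpoints make each failure expensive; too many make the checkpoint cost dominate. The minimum of `expected_time` over `n` sits between those extremes.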
5

Plank, J. S., Kai Li, and M. A. Puening. "Diskless checkpointing." IEEE Transactions on Parallel and Distributed Systems 9, no. 10 (1998): 972–86. http://dx.doi.org/10.1109/71.730527.

6

Nam, Hyochang, Jong Kim, Sung Je Hong, and Sunggu Lee. "Secure checkpointing." Journal of Systems Architecture 48, no. 8-10 (March 2003): 237–54. http://dx.doi.org/10.1016/s1383-7621(02)00137-6.

7

Kumar, Parveen. "A Low-Cost Hybrid Coordinated Checkpointing Protocol for Mobile Distributed Systems." Mobile Information Systems 4, no. 1 (2008): 13–32. http://dx.doi.org/10.1155/2008/982349.

Abstract:
Mobile distributed systems raise new issues such as mobility, low bandwidth of wireless channels, disconnections, limited battery power and lack of reliable stable storage on mobile nodes. In minimum-process coordinated checkpointing, some processes may not checkpoint for several checkpoint initiations. In the case of a recovery after a fault, such processes may roll back to a far earlier checkpointed state and thus may cause greater loss of computation. In all-process coordinated checkpointing, the recovery line is advanced for all processes but the checkpointing overhead may be exceedingly high. To optimize both metrics, the checkpointing overhead and the loss of computation on recovery, we propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has executed a fixed number of times. Thus, mobile nodes with low activity or in doze mode may not be disturbed in the case of minimum-process checkpointing, and the recovery line is advanced for each process after an all-process checkpoint. Additionally, we try to minimize the information piggybacked onto each computation message. For minimum-process checkpointing, we design a blocking algorithm in which no useless checkpoints are taken and an effort has been made to optimize the blocking of processes. We propose to delay selective messages at the receiver end. By doing so, processes are allowed to perform their normal computation, send messages and partially receive them during their blocking period. The proposed minimum-process blocking algorithm forces zero useless checkpoints at the cost of very small blocking.
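The hybrid schedule in this abstract (k minimum-process rounds followed by one all-process round) can be sketched as a participant-selection function. This is only an illustration of the scheduling idea, not the article's protocol: it leaves out messaging, blocking, and piggybacked information. It assumes the usual dependency rule that a process which received a message from another since its last checkpoint forces that sender to checkpoint too. The names `minimum_set`, `participants`, and `received_from` are hypothetical.

```python
def minimum_set(initiator, received_from):
    """Processes that must checkpoint with the initiator.

    received_from maps each process to the set of processes it has received
    messages from since its last checkpoint; the result is the initiator plus
    everything reachable through that dependency relation.
    """
    needed = {initiator}
    stack = [initiator]
    while stack:
        p = stack.pop()
        for q in received_from.get(p, ()):
            if q not in needed:
                needed.add(q)
                stack.append(q)
    return needed


def participants(initiation_no, k, initiator, received_from, all_processes):
    """Hybrid schedule: after every k minimum-process rounds, one all-process round.

    initiation_no counts checkpoint initiations starting from 1.
    """
    if initiation_no % (k + 1) == 0:
        return set(all_processes)
    return minimum_set(initiator, received_from)
```

With `k = 2`, initiations 1, 2, 4 and 5 involve only the dependency closure of the initiator, while initiations 3 and 6 advance the recovery line for every process.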
8

Yang, Na, and Yun Wang. "A Checkpointing Recovery Approach for Soft Errors Based on Detector Locations." Electronics 12, no. 4 (February 6, 2023): 805. http://dx.doi.org/10.3390/electronics12040805.

Abstract:
Soft errors are transient errors caused by single-event effects (SEEs) resulting from a strike by high-energy particles acting on sensitive areas of integrated circuits. Soft errors frequently occur in the space environment, adversely affecting the reliability of aerospace-based computing. A recovery process is launched to recover the program when soft errors are detected. A periodic checkpointing recovery approach is widely utilized to prevent soft errors. However, this approach does not consider the detector locations, resulting in a large time overhead. This paper proposes a checkpointing recovery approach for soft errors based on detector locations called DLCKPT. DLCKPT reduces the time overhead by considering detector locations. The experimental results show that the percentage decrease in the time overhead between the DLCKPT and the periodic checkpointing recovery approach is 13.4%. The average recovery rate and average space overhead are 99.3% and 44.4% for the periodic checkpointing recovery approach and 99.4% and 34.6% for the DLCKPT. These results show that the DLCKPT and the periodic checkpointing recovery approach produce comparable results for the recovery rate. The DLCKPT has a lower time overhead and a slightly lower space overhead than the periodic checkpointing recovery approach, demonstrating its effectiveness.
9

Ahn, Jinho. "Communication-Induced Checkpointing with Message Logging beyond the Piecewise Deterministic (PWD) Model for Distributed Systems." Electronics 10, no. 12 (June 14, 2021): 1428. http://dx.doi.org/10.3390/electronics10121428.

Abstract:
This paper introduces an effective communication-induced checkpointing protocol using message logging to enable the number of extra checkpoints to be far lower than the previous number. Even if a situation occurs in which it is decided that a process receiving a message has to perform forced checkpointing, our protocol allows the process to skip the forced checkpointing action if it recognizes that the state of its sender right before the receipt of the message is recoverable. Additionally, the communication-induced checkpointing protocol is thus not required to assume the piecewise deterministic model, despite being combined with message logging. This protocol can maintain these features by piggybacking a one-bit variable and an n-size vector on each message sent. Our simulation results verify our claim that the presented protocol performs much better than the representative optimized protocol with respect to the forced checkpointing frequency, regardless of the communication pattern.
10

Tomar, Sumit, Ashish Kumar Mishra, and Dharmendra K. Yadav. "Knowledge-based checkpointing strategy for spot instances in cloud computing." Journal of Current Science and Technology 13, no. 2 (July 13, 2023): 412–27. http://dx.doi.org/10.59796/jcst.v13n2.2023.1754.

Abstract:
Amazon EC2 offers spot-priced virtual machines (VMs) at a reduced price compared to on-demand and reserved VMs. However, Amazon EC2 can terminate these VMs at any time due to spot price and demand fluctuations. Using spot VMs results in a longer execution time and disrupts service availability. Users can use fault-tolerant techniques such as checkpointing, migration, and job duplication to mitigate the unreliability of spot VMs. In this paper, a knowledge-based checkpointing strategy is proposed to minimize the overall checkpointing overhead during the execution of jobs. The proposed scheme uses real-time price history to decide when to take a checkpoint. Results show that the proposed approach can reduce the turnaround time by 18% compared to the Hourly Checkpointing Strategy and by 9% compared to the Rising-Edge Checkpointing Strategy. One can also achieve 54% to 78% reliability with a cost saving of 78% for the workload used with the described approach.
11

Chaturvedi, Amit, Syed Sajad Hussain, and Vikas Kumar. "A Study of Mutable Checkpointing Approach to Reduce the Overheads Associated with Coordinated Checkpointing." SIJ Transactions on Computer Networks & Communication Engineering 01, no. 03 (August 20, 2013): 06–10. http://dx.doi.org/10.9756/sijcnce/v1i3/0103520201.

12

Hsu, Shang-Te, and Ruei-Chuan Chang. "Continuous checkpointing: joining the checkpointing with virtual memory paging." Software: Practice and Experience 27, no. 9 (September 1997): 1103–20. http://dx.doi.org/10.1002/(sici)1097-024x(199709)27:9<1103::aid-spe130>3.0.co;2-2.

13

Mandal, Partha Sarathi, and Krishnendu Mukhopadhyaya. "Mobile Agent Based Checkpointing with Concurrent Initiations." International Journal of Foundations of Computer Science 18, no. 05 (October 2007): 1107–22. http://dx.doi.org/10.1142/s0129054107005157.

Abstract:
Traditional message passing based checkpointing and rollback recovery algorithms perform well for tightly coupled systems. In wide area distributed systems these algorithms may suffer from large overhead due to message passing delay and network traffic. Mobile agents offer an attractive option for designing checkpointing schemes for wide area distributed systems. Network topology is assumed to be arbitrary. Processes are mobile agent enabled. When a process wants to take a checkpoint, it just creates one mobile agent. Concurrent initiations by multiple processes are allowed. Synchronization and creation of a consistent global state (CGS) for checkpointing are managed by the mobile agent(s). In the worst case, for k concurrent initiations among n processes, the checkpointing algorithm requires a total of O(kn) hops by all the mobile agents. On average, a mobile agent carries data of size O(n/k).
14

Vaidya, N. H. "Staggered consistent checkpointing." IEEE Transactions on Parallel and Distributed Systems 10, no. 7 (July 1999): 694–702. http://dx.doi.org/10.1109/71.780864.

15

Wu, Song, Fang Zhou, Xiang Gao, Hai Jin, and Jinglei Ren. "Dual-Page Checkpointing." ACM Transactions on Architecture and Code Optimization 15, no. 4 (January 8, 2019): 1–27. http://dx.doi.org/10.1145/3291057.

16

Hakkarinen, D., and Zizhong Chen. "Multilevel Diskless Checkpointing." IEEE Transactions on Computers 62, no. 4 (April 2013): 772–83. http://dx.doi.org/10.1109/tc.2012.17.

17

Ahn, Jinho. "Scalable Communication-Induced Checkpointing Protocol with Little Overhead for Distributed Computing Environments." Electronics 12, no. 12 (June 16, 2023): 2702. http://dx.doi.org/10.3390/electronics12122702.

Abstract:
Existing communication-induced checkpointing protocols may not scale well due to their slow acquisition of the most recent timestamps of the next checkpoints of other processes. Accurate situation awareness with diversified information conveyance paths is needed to keep the number of unnecessary forced checkpoints as low as possible. In this paper, a scalable communication-induced checkpointing protocol is proposed to considerably cut down the possibility of performing unnecessary forced checkpointing by exploiting the beneficial features of reliable communication channels. The protocol enables the sender of an application message to swiftly obtain the most recent timestamp-related information of the next checkpoint of its receiver and accelerate the spread of the information to others, with little overhead. This behavioral feature may significantly improve the accuracy of the awareness of the situations in which forced checkpointing is actually needed for useless checkpoint-free recovery. In addition, it generates no extra control message and no message logging overhead while significantly lessening the latency of message sending. Moreover, the protocol can always be operated under the non-deterministic execution model. The evaluation results indicate that the proposed protocol outperforms the existing ones, reducing forced checkpointing overheads by 12.5% to 84.2% and total execution times by 2.5% to 11.5%.
18

Han, Li, Valentin Le Fèvre, Louis-Claude Canon, Yves Robert, and Frédéric Vivien. "A generic approach to scheduling and checkpointing workflows." International Journal of High Performance Computing Applications 33, no. 6 (August 12, 2019): 1255–74. http://dx.doi.org/10.1177/1094342019866891.

Abstract:
This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which, unlike fail-stop errors, corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.
19

Nikolov, Dimitar, and Erik Larsson. "Clustered checkpointing: Maximizing the level of confidence for non-equidistant checkpointing." Integration 58 (June 2017): 549–62. http://dx.doi.org/10.1016/j.vlsi.2016.10.013.

20

Singh, Dilbag, Amit Chhabra, and Jaswinder Singh. "IMCLA: Performance Evaluation of Integrated Multilevel Checkpointing Algorithms using Checkpointing Efficiency." International Journal of Computing and Digital Systems 2, no. 1 (January 1, 2013): 9–19. http://dx.doi.org/10.12785/ijcds/020102.

21

Yanagihara, Masayoshi, Masanori Odagiri, Shunji Osaki, and Naoto Kaio. "Optimal checkpointing procedures taking into account system failure caused by checkpointing." Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 78, no. 10 (October 1995): 69–79. http://dx.doi.org/10.1002/ecjc.4430781008.

22

Symes, William W. "Reverse time migration with optimal checkpointing." GEOPHYSICS 72, no. 5 (September 2007): SM213–SM221. http://dx.doi.org/10.1190/1.2742686.

Abstract:
Reverse time migration (RTM) requires that fields computed in forward time be accessed in reverse order. Such out-of-order access, to recursively computed fields, requires that some part of the recursion history be stored (checkpointed), with the remainder computed by repeating parts of the forward computation. Optimal checkpointing algorithms choose checkpoints in such a way that the total storage is minimized for a prescribed level of excess computation, or vice versa. Optimal checkpointing dramatically reduces the storage required by RTM, compared to that needed for nonoptimal implementations, at the price of a small increase in computation. This paper describes optimal checkpointing in a form which applies both to RTM and other applications of the adjoint state method, such as construction of velocity updates from prestack wave equation migration.
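The store-some, recompute-the-rest idea in this abstract can be sketched with a deliberately simple checkpoint placement. The code below stores every `stride`-th state and rebuilds the states in between by repeating the forward recursion. This uniform-stride scheme is swapped in for illustration and is not the optimal schedule the article describes. The function name and parameters are hypothetical, and `step` is assumed to be a deterministic function that returns a new state without mutating its input.

```python
def reverse_states(step, x0, n, stride):
    """Yield x_n, x_{n-1}, ..., x_0 where x_{i+1} = step(x_i).

    Only every stride-th state is kept during the forward sweep; during the
    reverse sweep each block of up to `stride` states is recomputed from its
    checkpoint, so peak storage is about n/stride + stride states.
    """
    # Forward sweep: keep checkpoints at indices 0, stride, 2*stride, ...
    checkpoints = {}
    x = x0
    for i in range(n + 1):
        if i % stride == 0:
            checkpoints[i] = x
        if i < n:
            x = step(x)

    # Reverse sweep: rebuild one block at a time, last block first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + stride - 1, n)
        block = [checkpoints[start]]
        for _ in range(start, end):
            block.append(step(block[-1]))
        yield from reversed(block)
```

Each state other than a checkpoint is computed twice (once in each sweep), which is the extra computation traded for the reduced storage.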
23

Ni, Weigang, Susan V. Vrbsky, and Sibabrata Ray. "Pitfalls in Distributed Nonblocking Checkpointing." Journal of Interconnection Networks 05, no. 01 (March 2004): 47–78. http://dx.doi.org/10.1142/s0219265904001027.

Abstract:
Coordinated checkpointing has low stable storage requirements and simplifies the recovery process by reserving a set of consistent global checkpoints. Unfortunately, most algorithms that were proposed either incurred a high communication overhead or blocked all processes. Then, a coordinated algorithm was presented which was nonblocking and which forced only a subset of all processes to participate in a checkpointing event. This algorithm was shown to create inconsistencies in some situations and new algorithms to take consistent checkpoints were proposed. However, we found that these algorithms can still result in inconsistencies when typical behavior in a distributed environment is considered, such as multiple forced checkpoints and multiple concurrent checkpoint initiations. In this paper we identify the inconsistencies that can occur and present an efficient nonblocking algorithm that collects consistent global checkpoints and avoids some of the pitfalls in distributed nonblocking checkpointing.
24

Neeraj, Rathore. "Checkpointing: Fault Tolerance Mechanism." i-manager’s Journal on Cloud Computing 4, no. 1 (2017): 28. http://dx.doi.org/10.26634/jcc.4.1.13756.

25

Cao, Guohong, and Mukesh Singhal. "Checkpointing with mutable checkpoints." Theoretical Computer Science 290, no. 2 (January 2003): 1127–48. http://dx.doi.org/10.1016/s0304-3975(02)00566-2.

26

Stainov, Rumen. "An asynchronous checkpointing service." Microprocessing and Microprogramming 31, no. 1-5 (April 1991): 117–20. http://dx.doi.org/10.1016/s0165-6074(08)80055-5.

27

Li, Chung-Chi Jim, Elliot M. Stewart, and W. Kent Fuchs. "Compiler-assisted full checkpointing." Software: Practice and Experience 24, no. 10 (October 1994): 871–86. http://dx.doi.org/10.1002/spe.4380241002.

28

Ziarek, Lukasz, Philip Schatz, and Suresh Jagannathan. "Modular Checkpointing for Atomicity." Electronic Notes in Theoretical Computer Science 174, no. 9 (June 2007): 85–115. http://dx.doi.org/10.1016/j.entcs.2007.04.008.

29

Kim, Bongjae, Jungkyu Han, Joonhyouk Jang, Jinman Jung, Junyoung Heo, Hong Min, and Dong Sop Rhee. "A Dynamic Checkpoint Interval Decision Algorithm for Live Migration-Based Drone-Recovery System." Drones 7, no. 5 (April 24, 2023): 286. http://dx.doi.org/10.3390/drones7050286.

Abstract:
Numerous services and applications have been developed to monitor anomalies or collect various sensing information in large-scale monitoring areas using drones. Nonetheless, interruptions of drone missions in such areas occasionally occur due to network errors, low battery levels, or physical defects, such as damage to the rotor and propeller. Checkpointing is a technique that periodically saves the system’s state, allowing it to be restored to that point in the event of a failure. In such circumstances, checkpointing techniques can be used to periodically save information related to the drone mission and replace a malfunctioning drone with the saved checkpoint information. In this paper, we propose a dynamic checkpoint interval decision algorithm for a live migration-based drone-recovery system. The proposed scheme minimizes the drone’s energy consumption while efficiently performing checkpointing. According to the basic experimental results, the proposed scheme consumed only about 3.51% more energy, while performing about 25.97% more checkpoint operations compared to the FIC (Fixed Interval Checkpointing) scheme. By using the proposed scheme, it is possible to increase the availability of checkpoint information and quickly resume drone missions, while minimizing the increase in energy consumption of the drone by saving checkpoints more frequently. Therefore, the proposed scheme can improve the reliability and stability of drone-based services.
30

Saha, Debashis. "How Small and Medium Enterprises (SMEs) Should Bid for Spot Instances of Amazon's EC2 Cloud." International Journal of Business Data Communications and Networking 10, no. 4 (October 2014): 43–59. http://dx.doi.org/10.4018/ijbdcn.2014100103.

Abstract:
In cloud service provisioning, spot instances are spare slots for which a cloud service provider (CSP) has no pre-booking, unlike reserved or on-demand instances, for which it has a priori booking. CSPs like Amazon prefer the spot instance approach to sell their “idle” computing resources as and when these idle slots appear. Though they price the spot instances dynamically depending on supply-demand status, spot instances are usually relatively cheap. Hence, Amazon's spot instances are an attractive option for IT managers in small and medium enterprises (SMEs) that normally have sporadic requirements for resources. However, SMEs have to win their desired spot instances through the auction mechanism conducted by Amazon. Since the IT manager always looks to finish her job quickly within some specified budget, finding how to bid for spot instances in order to stay within her limited budget is a challenging task. She may continue to consume spot instances as long as her bid exceeds the current spot price. But if she loses at any point, the unfinished task must be put on hold by some checkpointing mechanism so that it may resume from the same point when she next wins the spot. Using simulations for a very popular cloud, namely Amazon EC2, it has been found that, at a lower bid price, OPTIMAL checkpointing leads to a total cost higher than the total HOURLY checkpointing cost at a much higher bid value. Therefore, SMEs should go for higher bid prices when using OPTIMAL checkpointing and lower bid prices with HOURLY checkpointing. In the process, the author has observed some interesting correlations among checkpoint strategy, task reliability and completion time, which are reported here.
31

Zheng, Junjun, Hiroyuki Okamura, and Tadashi Dohi. "Availability Analysis of Software Systems with Rejuvenation and Checkpointing." Mathematics 9, no. 8 (April 13, 2021): 846. http://dx.doi.org/10.3390/math9080846.

Abstract:
In software reliability engineering, software-rejuvenation and -checkpointing techniques are widely used for enhancing system reliability and strengthening data protection. In this paper, a stochastic framework composed of a composite stochastic Petri reward net and its resulting non-Markovian availability model is presented to capture the dynamic behavior of an operational software system in which time-based software rejuvenation and checkpointing are both aperiodically conducted. In particular, apart from the software-aging problem that may cause the system to fail, human-error factors (i.e., a system operator’s misoperations) during checkpointing are also considered. The non-Markovian availability model is derived from the reachability graph of stochastic Petri reward nets and is not one of the trivial stochastic models such as the semi-Markov process or the Markov regenerative process, so the phase-expansion approach is used to obtain its stationary solution. In numerical experiments, we illustrate steady-state system availability and find optimal software-rejuvenation policies that maximize steady-state system availability. The effects of human-error factors on both steady-state system availability and the optimal software-rejuvenation trigger timing are also evaluated. Numerical results showed that human errors during checkpointing both decreased system availability and had a significant effect on the optimal rejuvenation-trigger timing, so they should not be overlooked during system modeling.
32

Subasi, Omer, Tatiana Martsinkevich, Ferad Zyulkyarov, Osman Unsal, Jesus Labarta, and Franck Cappello. "Unified fault-tolerance framework for hybrid task-parallel message-passing applications." International Journal of High Performance Computing Applications 32, no. 5 (September 26, 2016): 641–57. http://dx.doi.org/10.1177/1094342016669416.

Abstract:
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
33

Zhang, Tianyu, Kaige Liu, Jack Kosaian, Juncheng Yang, and Rashmi Vinayak. "Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding." Proceedings of the VLDB Endowment 16, no. 11 (July 2023): 3137–50. http://dx.doi.org/10.14778/3611479.3611514.

Abstract:
Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content. In addition to using neural networks, DLRMs have large, sparsely-accessed embedding tables, which map categorical features to a learned dense representation. Due to the large sizes of embedding tables, DLRM training is typically distributed across the memory of tens or hundreds of nodes. Node failures are common in such large systems and must be mitigated to enable training to complete within production deadlines. Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger overhead for future DLRMs, which are expected to grow. This calls for rethinking fault tolerance in DLRM training. We present ECRec, a DLRM training system that achieves efficient fault tolerance by coupling erasure coding with the unique characteristics of DLRM training. ECRec takes a hybrid approach between erasure coding and replicating different DLRM parameters, correctly and efficiently updates redundant parameters, and enables training to proceed without pauses, while maintaining the consistency of the recovered parameters. We implement ECRec atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRec reduces training-time overhead on large DLRMs by up to 66%, recovers from failure up to 9.8× faster, and continues training during recovery with only a 7–13% drop in throughput (whereas checkpointing must pause).
34

GUPTA, SUNIL KUMAR, R. K. CHAUHAN, and PARVEEN KUMAR. "A MINIMUM-PROCESS COORDINATED CHECKPOINTING PROTOCOL FOR MOBILE COMPUTING SYSTEMS." International Journal of Foundations of Computer Science 19, no. 04 (August 2008): 1015–38. http://dx.doi.org/10.1142/s0129054108006108.

Abstract:
Checkpoint is a designated place in a program at which normal process is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. A checkpoint algorithm for mobile distributed systems needs to handle many new issues like: mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power and high failure rate of mobile nodes. These issues make traditional checkpointing techniques unsuitable for such environments. Minimum-process coordinated checkpointing is an attractive approach to introduce fault tolerance in mobile distributed systems transparently. This approach is domino-free, requires at most two checkpoints of a process on stable storage, and forces only a minimum number of processes to checkpoint. But, it requires extra synchronization messages, blocking of the underlying computation or taking some useless checkpoints. In this paper, we design a minimum-process checkpointing algorithm for mobile distributed systems, where no useless checkpoint is taken. We reduce the blocking of processes by allowing the processes to do their normal computations, send messages and receive selective messages during their blocking period.
35

Yang, Pengliang, Romain Brossier, Ludovic Métivier, and Jean Virieux. "Wavefield reconstruction in attenuating media: A checkpointing-assisted reverse-forward simulation method." GEOPHYSICS 81, no. 6 (November 2016): R349–R362. http://dx.doi.org/10.1190/geo2016-0082.1.

Abstract:
Three-dimensional implementations of reverse time migration (RTM) and full-waveform inversion (FWI) require efficient schemes to access the incident field to apply the imaging condition of RTM or build the gradient of FWI. Wavefield reconstruction by reverse propagation using final snapshot and saved boundaries appears quite efficient but unstable in attenuating media, whereas the checkpointing strategy is a stable alternative at the expense of increased computational cost through repeated forward modeling. We have developed a checkpointing-assisted reverse-forward simulation (CARFS) method in the context of viscoacoustic wave propagation with a generalized Maxwell body. At each backward reconstruction step, the CARFS algorithm makes a smart decision between forward modeling using checkpoints and reverse propagation based on the minimum time-stepping cost and an energy measure. Numerical experiments demonstrated that the CARFS method allows accurate wavefield reconstruction using less timesteppings than optimal checkpointing, even if seismic attenuation is very strong. For RTM and FWI applications involving a huge number of independent sources and/or applications on architectures with limited memory, CARFS will provide an efficient tool with adequate accuracy in practical implementation.
36

Andrijauskas, Fabio, Igor Sfiligoi, Diego Davila, Aashay Arora, Jonathan Guiang, Brian Bockelman, Greg Thain, and Frank Würthwein. "CRIU - Checkpoint Restore in Userspace for computational simulations and scientific applications." EPJ Web of Conferences 295 (2024): 07046. http://dx.doi.org/10.1051/epjconf/202429507046.

Abstract:
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do that independently. This paper evaluates the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool for the GNU/Linux environments, emphasizing the OSG’s OSPool HTCondor setup. CRIU allows checkpointing the process state into a disk image and can deal with both open files and established network connections seamlessly. Furthermore, it can checkpoint traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. However, some limitations prevent it from being usable in all circumstances.
37

Kraemer, Stefan, Rainer Leupers, Dietmar Petras, Thomas Philipp, and Andreas Hoffmann. "Checkpointing SystemC-Based Virtual Platforms." International Journal of Embedded and Real-Time Communication Systems 2, no. 4 (October 2011): 21–37. http://dx.doi.org/10.4018/jertcs.2011100102.

Abstract:
The ability to restore a virtual platform from a previously saved simulation state can considerably shorten the typical edit-compile-debug cycle for software developers and therefore enhance productivity. For SystemC based virtual platforms (VP), dedicated checkpoint/restore (C/R) solutions are required, taking into account the specific characteristics of such platforms. Apart from restoring the simulation process from a checkpoint image, the proposed checkpoint solution also takes care of re-attaching debuggers and interactive GUIs to the restored virtual platform. The checkpointing is handled automatically for most of the SystemC modules, only the usage of host OS resources requires user provision. A process checkpointing based C/R has been selected in order to minimize the adaption required for existing VPs at the expense of large checkpoint sizes. This drawback is overcome by introducing an online compression to the checkpoint process. A case study based on the SHAPES Virtual Platform is conducted to investigate the applicability of the proposed framework as well as the impact of checkpoint compression in a realistic system environment.
38

Chiu, J. F., and G. M. Chiu. "Hardware-supported asynchronous checkpointing scheme." IEE Proceedings - Computers and Digital Techniques 145, no. 2 (1998): 109. http://dx.doi.org/10.1049/ip-cdt:19981908.

39

Rönngren, Robert, and Rassul Ayani. "Adaptive checkpointing in Time Warp." ACM SIGSIM Simulation Digest 24, no. 1 (July 1994): 110–17. http://dx.doi.org/10.1145/195291.182577.

40

Benoit, Anne, Aurelien Cavelan, Valentin Le Fevre, Yves Robert, and Hongyang Sun. "Towards Optimal Multi-Level Checkpointing." IEEE Transactions on Computers 66, no. 7 (July 1, 2017): 1212–26. http://dx.doi.org/10.1109/tc.2016.2643660.

41

Aupy, Guillaume, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. "Checkpointing algorithms and fault prediction." Journal of Parallel and Distributed Computing 74, no. 2 (February 2014): 2048–64. http://dx.doi.org/10.1016/j.jpdc.2013.10.010.

42

ZIAREK, LUKASZ, and SURESH JAGANNATHAN. "Lightweight checkpointing for concurrent ML." Journal of Functional Programming 20, no. 2 (March 2010): 137–73. http://dx.doi.org/10.1017/s0956796810000067.

Abstract:
Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics for safe reexecution in multithreaded code is not obvious, however. For a thread to reexecute correctly a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If not done properly, data inconsistencies and other undesirable behavior might result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of its design, and provide a detailed description of its implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads to use stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
43

RODRIGUEZ, G. "Controller/Precompiler for Portable Checkpointing." IEICE Transactions on Information and Systems E89-D, no. 2 (February 1, 2006): 408–17. http://dx.doi.org/10.1093/ietisy/e89-d.2.408.

44

Baldoni, R. "Consistent Checkpointing for Transaction Systems." Computer Journal 44, no. 2 (February 1, 2001): 92–100. http://dx.doi.org/10.1093/comjnl/44.2.92.

45

Agbaria, Adnan, and Roy Friedman. "Virtual-machine-based heterogeneous checkpointing." Software: Practice and Experience 32, no. 12 (2002): 1175–92. http://dx.doi.org/10.1002/spe.478.

46

Wong, Kenneth F., and Mark Franklin. "Checkpointing in Distributed Computing Systems." Journal of Parallel and Distributed Computing 35, no. 1 (May 1996): 67–75. http://dx.doi.org/10.1006/jpdc.1996.0069.

47

Kukreja, Navjot, Jan Hückelheim, Mathias Louboutin, John Washbourne, Paul H. J. Kelly, and Gerard J. Gorman. "Lossy checkpoint compression in full waveform inversion: a case study with ZFPv0.5.5 and the overthrust model." Geoscientific Model Development 15, no. 9 (May 12, 2022): 3815–29. http://dx.doi.org/10.5194/gmd-15-3815-2022.

Abstract:
This paper proposes a new method that combines checkpointing methods with error-controlled lossy compression for large-scale high-performance full-waveform inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.
48

CAPPELLO, FRANCK, HENRI CASANOVA, and YVES ROBERT. "PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS." Parallel Processing Letters 21, no. 02 (June 2011): 111–32. http://dx.doi.org/10.1142/s0129626411000126.

Abstract:
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance for fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the nature of the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our results show that each approach is likely to achieve poor utilization for large-scale platforms (e.g., 2^20 nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization in future large-scale platforms will require a combination of techniques.
49

Savin, G. I., B. M. Shabanov, R. S. Fedorov, A. V. Baranov, and P. N. Telegin. "Checkpointing Tools in a Supercomputer Center." Lobachevskii Journal of Mathematics 41, no. 12 (December 2020): 2603–13. http://dx.doi.org/10.1134/s1995080220120355.

50

Shuyu Chen, Guoliang Liu, and Xiaoqin Zhang. "Low-Overhead Checkpointing/Rollback Recovery Algorithms." International Journal of Advancements in Computing Technology 4, no. 17 (September 30, 2012): 244–53. http://dx.doi.org/10.4156/ijact.vol4.issue17.29.
