Academic literature on the topic 'Diskless checkpointing'

Author: Grafiati

Published: 4 June 2021

Last updated: 15 February 2022

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Diskless checkpointing.'

Journal articles on the topic "Diskless checkpointing"

1

Plank, J. S., Kai Li, and M. A. Puening. "Diskless checkpointing." IEEE Transactions on Parallel and Distributed Systems 9, no. 10 (1998): 972–86. http://dx.doi.org/10.1109/71.730527.

2

Hakkarinen, D., and Zizhong Chen. "Multilevel Diskless Checkpointing." IEEE Transactions on Computers 62, no. 4 (April 2013): 772–83. http://dx.doi.org/10.1109/tc.2012.17.

3

Rao, Ch. D. V. Subba, M. M. Naidu, and V. Sai Krishna. "Efficient Diskless Checkpointing and Log Based Recovery Schemes." International Journal of Computer Applications 5, no. 12 (August 10, 2010): 29–36. http://dx.doi.org/10.5120/959-1336.

4

Chiu, Ge-Ming, and Jane-Ferng Chiu. "A New Diskless Checkpointing Approach for Multiple Processor Failures." IEEE Transactions on Dependable and Secure Computing 8, no. 4 (July 2011): 481–93. http://dx.doi.org/10.1109/tdsc.2010.76.

5

Plank, James S., Youngbae Kim, and Jack J. Dongarra. "Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing." Journal of Parallel and Distributed Computing 43, no. 2 (June 1997): 125–38. http://dx.doi.org/10.1006/jpdc.1997.1336.

6

Song, Xiaodong, Wanfeng Dou, Guoan Tang, Kun Yang, and Kejian Qian. "A Diskless Checkpointing Algorithm for Cluster Architectures Applied to Geospatial Raster Data Processing." Journal of Algorithms & Computational Technology 8, no. 4 (December 2014): 369–87. http://dx.doi.org/10.1260/1748-3018.8.4.369.

7

Kofahi, Najib A., Said Al-Bokhitan, and Ahmed Al-Nazer. "On Disk-based and Diskless Checkpointing for Parallel and Distributed Systems: An Empirical Analysis." Information Technology Journal 4, no. 4 (September 15, 2005): 367–76. http://dx.doi.org/10.3923/itj.2005.367.376.

8

Cappello, Franck. "Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities." International Journal of High Performance Computing Applications 23, no. 3 (July 20, 2009): 212–26. http://dx.doi.org/10.1177/1094342009106189.

Abstract:

The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback-recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.

Dissertations / Theses on the topic "Diskless checkpointing"

1

Rough, Justin. "A Platform for reliable computing on clusters using group communications." Deakin University. School of Computing and Mathematics, 2001. http://tux.lib.deakin.edu.au./adt-VDU/public/adt-VDU20060412.141015.

Abstract:

Shared clusters represent an excellent platform for the execution of parallel applications given their low price/performance ratio and the presence of cluster infrastructure in many organisations. The focus of recent research efforts is on parallelism management, transport and efficient access to resources, and making clusters easy to use. In this thesis, we examine reliable parallel computing on clusters. The aim of this research is to demonstrate the feasibility of developing an operating system facility providing transport fault tolerance using existing, enhanced and newly built operating system services for supporting parallel applications. In particular, we use existing process duplication and process migration services, and synthesise a group communications facility for use in a transparent checkpointing facility. This research is carried out using the methods of experimental computer science. To provide a foundation for the synthesis of the group communications and checkpointing facilities, we survey and review related work in both fields. For group communications, we examine the V Distributed System, the x-kernel and Psync, the ISIS Toolkit, and Horus. We identify a need for services that consider the placement of processes on computers in the cluster. For checkpointing, we examine Manetho, KeyKOS, libckpt, and Diskless Checkpointing. We observe the use of remote computer memories for storing checkpoints, and the use of copy-on-write mechanisms to reduce the time to create a checkpoint of a process. We propose a group communications facility providing two sets of services: user-oriented services and system-oriented services. User-oriented services provide transparency and target applications. System-oriented services supplement the user-oriented services for supporting other operating systems services and do not provide transparency. Additional flexibility is achieved by providing delivery and ordering semantics independently.
An operating system facility providing transparent checkpointing is synthesised using coordinated checkpointing. To ensure a consistent set of checkpoints is generated by the facility, instead of blindly blocking the processes of a parallel application, only non-deterministic events are blocked. This allows the processes of the parallel application to continue execution during the checkpoint operation. Checkpoints are created by adapting process duplication mechanisms, and checkpoint data is transferred to remote computer memories and disk for storage using the mechanisms of process migration. The services of the group communications facility are used to coordinate the checkpoint operation, and to transport checkpoint data to remote computer memories and disk. Both the group communications facility and the checkpointing facility have been implemented in the GENESIS cluster operating system and provide proof-of-concept. GENESIS uses a microkernel and client-server based operating system architecture, and is demonstrated to provide an appropriate environment for the development of these facilities. We design a number of experiments to test the performance of both the group communications facility and checkpointing facility, and to provide proof-of-performance. We present our approach to testing, the challenges raised in testing the facilities, and how we overcome them. For group communications, we examine the performance of a number of delivery semantics. Good speed-ups are observed and system-oriented group communication services are shown to provide significant performance advantages over user-oriented semantics in the presence of packet loss. For checkpointing, we examine the scalability of the facility given different levels of resource usage and a variable number of computers. Low overheads are observed for checkpointing a parallel application.
It is made clear by this research that the microkernel and client-server based cluster operating system provides an ideal environment for the development of a high performance group communications facility and a transparent checkpointing facility for generating a platform for reliable parallel computing on clusters.

2

Leifsson, Egir Örn. "Recovery in Distributed Real-Time Database Systems." Thesis, University of Skövde, Department of Computer Science, 1999. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-395.

Abstract:

Recovery is a fundamental service in database systems. In this work, we present a new mechanism for diskless real-time recovery in fully replicated distributed real-time database systems. Traditionally, recovery has relied on disk-resident redundant data. Unfortunately, disks cannot always be used in real-time systems since these systems are sometimes used in environments which do not allow the use of disks. Also, minimizing the amount of hardware can save money, especially in mass-produced products. Instead of loading the database from disk, our recovery mechanism enables a restarted node to retrieve a copy of the database from an arbitrary remote node. The recovery mechanism does not violate timeliness during normal processing and, during recovery, all nodes except for the recovering node can guarantee the timeliness of critical transactions. The mechanism uses fuzzy checkpointing to copy the database to the recovering node. Fuzzy checkpointing has been chosen since it copies the database without regard to concurrency control and, thus, does not increase data contention in the database. We conclude that the suggested recovery mechanism is a feasible option for fully replicated distributed real-time database systems.

Book chapters on the topic "Diskless checkpointing"

1

Chen, Zizhong. "Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing." In Handbook of Research on Scalable Computing Technologies, 760–83. IGI Global, 2010. http://dx.doi.org/10.4018/978-1-60566-661-7.ch033.

Abstract:

Today’s long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that needs to be saved into stable storage also increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without checkpoint or rollback recovery. Coding approaches and floating-point erasure correcting codes are also introduced to help applications to survive multiple simultaneous process failures. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques are highly scalable.

Conference papers on the topic "Diskless checkpointing"

1

Hakkarinen, Douglas, and Zizhong Chen. "N-Level Diskless Checkpointing." In 2009 11th IEEE International Conference on High Performance Computing and Communications. IEEE, 2009. http://dx.doi.org/10.1109/hpcc.2009.55.

2

Menderico, Raphael Marcos, and Islene Calciolari Garcia. "Diskless Checkpointing with Rollback-Dependency Trackability." In 2010 IEEE International Symposium on Reliable Distributed Systems (SRDS). IEEE, 2010. http://dx.doi.org/10.1109/srds.2010.17.

3

Chen, Zizhong, and Jack Dongarra. "A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing." In 2008 IEEE 11th High-Assurance Systems Engineering Symposium (HASE). IEEE, 2008. http://dx.doi.org/10.1109/hase.2008.13.

4

Chiu, Jane-Ferng, and Wei-Hua Hao. "Mutual-Aid: Diskless Checkpointing Scheme for Tolerating Double Faults." In 2008 10th IEEE International Conference on High Performance Computing and Communications (HPCC). IEEE, 2008. http://dx.doi.org/10.1109/hpcc.2008.123.

5

Eckart, Ben, Xubin He, Chentao Wu, Ferrol Aderholdt, Fang Han, and Stephen Scott. "Distributed Virtual Diskless Checkpointing: A Highly Fault Tolerant Scheme for Virtualized Clusters." In 2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2012. http://dx.doi.org/10.1109/ipdpsw.2012.136.

6

Yang, Jin-Min, and Enquan Yan. "A Diskless Checkpointing Scheme Based on Vertical Encoding to Lower Fault Tolerance Overhead." In 2017 IEEE 19th International Conference on High Performance Computing and Communications, IEEE 15th International Conference on Smart City and IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 2017. http://dx.doi.org/10.1109/hpcc-smartcity-dss.2017.70.
