Dissertations / Theses: 'Fault-tolerant computing'

1

潘忠強 and Chung-keung Poon. "Fault tolerant computing on hypercubes." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1991. http://hub.hku.hk/bib/B31209944.

2

Nickerson, Naomi. "Practical fault-tolerant quantum computing." Thesis, Imperial College London, 2015. http://hdl.handle.net/10044/1/31475.

Abstract:

Quantum computing has the potential to transform information technology by offering algorithms for certain tasks, such as quantum simulation, that are vastly more efficient than what is possible with any classical device. But experimentally implementing practical quantum information processing is a very difficult task. Here we study two important, and closely related, aspects of this challenge: architectures for quantum computing, and quantum error correction. Exquisite quantum control has now been achieved in small ion traps, in nitrogen-vacancy centres and in superconducting qubit clusters, but the challenge remains of how to scale these systems to build practical quantum devices. In Part I of this thesis we analyse one approach to building a scalable quantum computer by networking together many simple processor cells, thus avoiding the need to create a single complex structure. The difficulty is that realistic quantum links are very error prone. Here we describe a method by which even these error-prone cells can perform quantum error correction. Groups of cells generate and purify shared resource states, which then enable stabilization of topologically encoded data. Given a realistically noisy network (10% error rate) we find that our protocol can succeed provided that all intra-cell error rates are below 0.8%. Furthermore, we show that with some adjustments, the protocols we employ can be made robust also against high levels of loss in the network interconnects. We go on to analyse the potential running speed of such a device. Using levels of fidelity that are either already achievable in experimental systems, or will be in the near future, we find that employing a surface code approach in a highly noisy and lossy network architecture can result in kilohertz computer clock speeds. In Part II we consider the question of quantum error correction beyond the surface code. We consider several families of topological codes, and determine the minimum requirements to demonstrate proof-of-principle error suppression in each type of code. A particularly promising code is the gauge color code, which admits a universal transversal gate set. Furthermore, a recent result of Bombin shows the gauge color code supports an error-correction protocol that achieves tolerance to noisy measurements without the need for repeated measurements, so-called single-shot error correction. Here, we demonstrate the promise of single-shot error correction by designing a decoder and investigating its performance. We simulate fault-tolerant error correction with the gauge color code, and estimate a sustainable error rate, i.e. the threshold for the long time limit, of ~0.31% for a phenomenological noise model using a simple decoding algorithm.

3

Kurt, Mehmet Can. "Fault-tolerant Programming Models and Computing Frameworks." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1437390499.

4

Su, Xueyuan. "Efficient Fault-Tolerant Infrastructure for Cloud Computing." Thesis, Yale University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3578459.

Abstract:

Cloud computing is playing a vital role for processing big data. The infrastructure is built on top of large-scale clusters of commodity machines. It is very challenging to properly manage the hardware resources in order to utilize them effectively and to cope with the inevitable failures that will arise with such a large collection of hardware. In this thesis, task assignment and checkpoint placement for cloud computing infrastructure are studied.

As data locality is critical in determining the cost of running a task on a specific machine, how tasks are assigned to machines has a big impact on job completion time. An idealized abstract model is presented for a popular cloud computing platform called Hadoop. Although Hadoop task assignment (HTA) is [special characters omitted]-hard, an algorithm is presented with only an additive approximation gap. A connection is established between the HTA problem and the minimum makespan scheduling problem under the restricted assignment model. A new competitive ratio bound for the online GREEDY algorithm is obtained.

Checkpoints allow recovery of long-running jobs from failures. Checkpoints themselves might fail. The effect of checkpoint failures on job completion time is investigated. The sum of task success probability and checkpoint reliability greatly affects job completion time. When possible checkpoint placements are constrained, retaining only the most recent Ω(log n) possible checkpoints has at most a constant factor penalty. When task failures follow the Poisson distribution, two symmetries for non-equidistant placements are proved and a first order approximation to optimum placement interval is generalized.
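
As a rough illustration of the "first order approximation" mentioned in this abstract: the classical form, usually attributed to Young (1974), sets the checkpoint interval to sqrt(2 * C * M), where C is the cost of writing one checkpoint and M is the mean time between failures. The sketch below assumes that classical form only; it is not the generalization developed in the thesis, and the function name and sample values are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Classical first-order optimum checkpoint interval: sqrt(2 * C * M).

    checkpoint_cost: time to write one checkpoint (C), in seconds.
    mtbf: mean time between failures (M), in seconds.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical values: 30 s per checkpoint, one failure per 24 h on average.
interval = young_interval(30.0, 24 * 3600.0)
print(f"checkpoint roughly every {interval:.0f} s")
```

The approximation is only reasonable when the checkpoint cost is small compared with the mean time between failures.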

5

Deconda, Keerthi. "Fault tolerant pulse synchronization." Thesis, [College Station, Tex.] : Texas A&M University, 2008. http://hdl.handle.net/1969.1/ETD-TAMU-2331.

6

Garnsworthy, Johnathan Randall. "Fundamental concepts for fault tolerant systems." Thesis, University of Newcastle Upon Tyne, 1990. http://hdl.handle.net/10443/2055.

Abstract:

In order to be able to think clearly about any subject we need precise definitions of its basic terminology and concepts. If one reads the literature describing fault tolerant computing there is less agreement on fundamental models, concepts and terminology than would perhaps be expected. There are well established usages in particular subcommunities and many other individual workers take care to use terms carefully. Unfortunately there are also many papers in which terms are freely applied to concepts in an arbitrary and inconsistent way. This thesis attempts to bring together some of the concepts of fault tolerant computing and place them in a formal framework. The approach taken is to develop formal models of system structure and behaviour, and to define the basic concepts and terminology in terms of those models. The model of system structure is based on directed graphs and the model of behaviour is based on trace theory.

7

Leal, William. "A foundation for fault tolerant components." The Ohio State University, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=osu1486402957194912.

8

Prakash, Ravi. "Fault tolerant resource management in mobile computing systems." The Ohio State University, 1996. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487940308431572.

9

Stainer, Julien. "Computability Abstractions for Fault-tolerant Asynchronous Distributed Computing." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S054/document.

Abstract:

This thesis studies computability in systems composed of multiple computers exchanging messages or sharing memory. The models considered take into account the possible failure of some of these computers, as well as the variability and heterogeneity of their execution speeds. The results presented mainly concern agreement problems, systems prone to partitioning, and failure detectors. The document establishes relations between known iterated models and the concept of failure detector, and presents a hierarchy of problems generalizing k-set agreement and s-simultaneous consensus. It also introduces a new universal construction based on s-simultaneous consensus objects and a family of iterated models allowing several processes to run in isolation.

10

Mohammadi, Shahram. "Distributed recovery in fault-tolerant interconnected networks." Thesis, University of Manchester, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.304628.

11

周繼鵬 and Jipeng Zhou. "Fault-tolerant wormhole routing for mesh computers." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B31242789.

12

Zhou, Jipeng. "Fault-tolerant wormhole routing for mesh computers." Hong Kong : University of Hong Kong, 2001. http://sunzi.lib.hku.hk:8888/cgi-bin/hkuto%5Ftoc%5Fpdf?B23000909.

13

Pierce, Evelyn Tumlin. "Self-adjusting quorum systems for Byzantine fault tolerance." Full text (PDF) from UMI/Dissertation Abstracts International, 2000. http://wwwlib.umi.com/cr/utexas/fullcit?p3004357.

14

O'Gorman, Joe. "Architectures for fault-tolerant quantum computation." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:4219548d-798b-45f8-b376-91025bbe3ec4.

Abstract:

Quantum computing has enormous potential, but this can only be realised if quantum errors can be controlled sufficiently to allow quantum algorithms to be completed reliably. However, quantum-error-corrected logical quantum bits (qubits) which can be said to have achieved meaningful error suppression have not yet been demonstrated. This thesis reports research on several topics related to the challenge of designing fault-tolerant quantum computers. The first topic is a proposal for achieving large-scale error correction with the surface code in a silicon donor based quantum computing architecture. This proposal relaxes some of the stringent requirements in donor placement precision set by previous ideas from the single atom level to the order of 10 nm in some regimes. This is shown by means of numerical simulation of the surface code threshold. The second topic is the development of a method for benchmarking and assessing the performance of small error correcting codes in few-qubit systems, introducing a metric called 'integrity' (closely linked to the trace distance) and a proposal for experiments to demonstrate various stepping stones on the way to 'strictly superior' quantum error correction. Most quantum error correcting codes, including the surface code, do not allow for fault-tolerant universal computation without the addition of extra gadgets. One method of achieving universality is through a process of distilling and then consuming high quality 'magic states'. This process adds additional overhead to quantum computation over and above that incurred by the use of the base level quantum error correction. The latter parts of this thesis report an investigation into how many physical qubits are needed in a 'magic state factory' within a surface code quantum computer and introduce a number of techniques to reduce the overhead of leading magic state techniques. It is found that universal quantum computing is achievable with ∼ 16 million qubits if error rates across a device are kept below 10^-4. In addition, the thesis introduces improved methods of achieving magic state distillation for unconventional magic states that allow for logical small angle rotations, and shows that this can be more efficient than synthesising these operations from the gates provided by traditional magic states.

15

Adam, Johan D. "Failure diagnostic expert systems : a case study in fault diagnosis." Master's thesis, This resource online, 1991. http://scholar.lib.vt.edu/theses/available/etd-01202010-020148/.

16

Bakken, David Edward. "Supporting fault-tolerant parallel programming in Linda." Diss., The University of Arizona, 1994. http://hdl.handle.net/10150/186872.

Abstract:

As people are becoming increasingly dependent on computerized systems, the need for these systems to be dependable is also increasing. However, programming dependable systems is difficult, especially when parallelism is involved. This is due in part to the fact that very few high-level programming languages support both fault-tolerance and parallel programming. This dissertation addresses this problem by presenting FT-Linda, a high-level language for programming fault-tolerant parallel programs. FT-Linda is based on Linda, a language for programming parallel applications whose most notable feature is a distributed shared memory called tuple space. FT-Linda extends Linda by providing support to allow a program to tolerate failures in the underlying computing platform. The distinguishing features of FT-Linda are stable tuple spaces and atomic execution of multiple tuple space operations. The former is a type of stable storage in which tuple values are guaranteed to persist across failures, while the latter allows collections of tuple operations to be executed in an all-or-nothing fashion despite failures and concurrency. Example FT-Linda programs are given for both dependable systems and parallel applications. The design and implementation of FT-Linda are presented in detail. The key technique used is the replicated state machine approach to constructing fault-tolerant distributed programs. Here, tuple space is replicated to provide failure resilience, and the replicas are sent a message describing the atomic sequence of tuple space operations to perform. This strategy allows an efficient implementation in which only a single multicast message is needed for each atomic sequence of tuple space operations. An implementation of FT-Linda for a network of workstations is also described. FT-Linda is being implemented using Consul, a communication substrate that supports fault-tolerant distributed programming. 
Consul is built in turn with the x-kernel, an operating system kernel that provides support for composing network protocols. Each of the components of the implementation has been built and tested.
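
The replicated state machine idea described in this abstract can be sketched roughly as follows. This is a toy model, not FT-Linda's syntax or implementation: the `Replica` class, the `multicast` helper, and the `("out", ...)` / `("in", ...)` operation encoding are all hypothetical, `in` is simplified to an exact-match, non-blocking withdrawal, and message delivery is modelled as a plain loop over replicas.

```python
from collections import Counter


class Replica:
    """One copy of the tuple space, stored as a multiset of tuples."""

    def __init__(self):
        self.space = Counter()

    def apply_atomic(self, ops):
        """Apply a whole sequence of operations, or none of it."""
        staged = self.space.copy()          # work on a private copy
        for kind, tup in ops:
            if kind == "out":               # deposit a tuple
                staged[tup] += 1
            elif kind == "in":              # withdraw an exact-match tuple
                if staged[tup] == 0:
                    return False            # abort: the live space is untouched
                staged[tup] -= 1
            else:
                raise ValueError(f"unknown operation: {kind}")
        self.space = +staged                # commit, dropping zero counts
        return True


def multicast(replicas, ops):
    """Deliver one message describing the atomic sequence to every replica."""
    results = [replica.apply_atomic(ops) for replica in replicas]
    if len(set(results)) != 1:
        raise RuntimeError("replicas diverged")
    return results[0]


replicas = [Replica() for _ in range(3)]
multicast(replicas, [("out", ("job", 1))])
committed = multicast(replicas, [("in", ("job", 1)), ("out", ("done", 1))])
```

Because every replica starts from the same state and applies the same deterministic sequence, all copies either commit or abort together, which is why one message per atomic sequence is enough in this simplified model.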

17

Gallagher, William Lynn. "Fault tolerant multipliers and dividers using time shared triple modular redundancy." Digital version accessible at:, 1999. http://wwwlib.umi.com/cr/utexas/main.

18

Vokkaarne, Vijay. "A Fault Tolerant Mobile IP based on Ring Protocol." [Gainesville, Fla.] : University of Florida, 2002. http://purl.fcla.edu/fcla/etd/UFE1001192.

19

Thavamani, Sudha. "Fault tolerant control of a ship propulsion system." Diss., Online access via UMI:, 2006.

20

Gilbar, Thomas Christopher. "Fault tolerant and integrated token ring network." FIU Digital Commons, 1993. https://digitalcommons.fiu.edu/etd/3935.

Abstract:

This thesis is a study of communication protocols (token ring, FDDI, and ISDN), microcontrollers (68HC 1EVB), and fault tolerance schemes. One of the major weaknesses of the token ring network is that if a single station fails, the entire system fails. A scheme involving a combination of hardware and timer interrupts in the software has been designed and implemented which deals with this risk. Software and protocols have been designed and applied to the network to reduce the chance of bit faults in communications. ISDN frame format proved to be exceptional in its capacity to carry echoed data and a large variety of tokens which could be used by the stations to test the data. By its very nature, the token ring supplied another major fault detection device by allowing the data to be returned and tested at its source. The resulting network was successful.

21

Nixon, Ian Michael. "The automatic synthesis of fault tolerant and fault secure VLSI systems." Thesis, University of Edinburgh, 1988. http://hdl.handle.net/1842/6637.

Abstract:

This thesis investigates the design of fault tolerant and fault secure (FTFS) systems within the framework of silicon compilation. Automatic design modification is used to introduce FTFS characteristics into a design. A taxonomy of FTFS techniques is introduced and is used to identify a number of features which an "automatic design for FTFS" system should exhibit. A silicon compilation system, Chip Churn 2 (CC2), has been implemented and has been used to demonstrate the feasibility of automatic design of FTFS systems. The CC2 system provides a design language, simulation facilities and a back-end able to produce CMOS VLSI designs. A number of FTFS design methods have been implemented within the CC2 environment; these methods range from triple modular redundancy to concurrent parity code checking. The FTFS design methods can be applied automatically to general designs in order to realise them as FTFS systems. A number of example designs are presented; these are used to illustrate the FTFS modification techniques which have been implemented. Area results for CMOS devices are presented; this allows the modification methods to be compared. A number of problems arising from the methods are highlighted and some solutions suggested.
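
The two ends of the range of methods named in this abstract, triple modular redundancy and parity checking, can be illustrated in software. The sketch below is only a behavioural model with hypothetical function names; it is not the CC2 design language and says nothing about how CC2 realises these schemes in CMOS.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs (masks one faulty replica)."""
    return (a & b) | (a & c) | (b & c)


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity_bit: int) -> bool:
    """Detect (but not correct) any single-bit error in word or parity bit."""
    return even_parity(word) == parity_bit


# One replica delivers a corrupted value; the voter masks the fault.
voted = tmr_vote(0b1011, 0b1011, 0b0011)

# A single flipped bit is detected by the parity check.
stored_parity = even_parity(0b1011)
corrupted_detected = not parity_ok(0b1010, stored_parity)
```

The contrast is the usual one: voting masks a fault at roughly three times the logic plus a voter, while parity only detects it at the cost of one extra bit per word.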

22

Ralph, Scott K. "A constraint-based approach for computing fault tolerant robot programs." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape8/PQDD_0017/NQ46408.pdf.

23

Jackson, Alexander Huw. "Asynchronous embryonics : self-timed biologically-inspired fault-tolerant computing arrays." Thesis, University of York, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.423748.

24

Ferreira, Ronaldo Rodrigues. "The transactional HW/SW stack for fault tolerant embedded computing." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2015. http://hdl.handle.net/10183/114607.

Abstract:

Fault tolerance implementation in embedded systems is challenging because of the physical constraints of area occupation, power dissipation, and energy consumption of these systems. The need for optimizing these three physical constraints while doing computation within the available performance goals and real-time deadlines creates a conundrum that is hard to solve. Classical fault tolerance solutions such as triple and dual modular redundancy are not feasible due to their high power overhead or lack of efficient and deterministic error recovery. Existing techniques, although some of them reduce the power and area overhead, incur heavy performance penalties and most of the time do not assume a feasible fault model. This dissertation introduces the Transactional HW/SW Stack, or simply Stack, to efficiently manage the area, power, fault coverage, and performance conundrum. The Stack introduces a new compilation strategy that assembles programs into Transactional Basic Blocks, together with a novel microprocessor, the TransactiOnal Basic Block Architecture (ToBBA), which provides fine-grained error detection and deterministic error rollback and elimination using the Transactional Basic Blocks (TBBs) both as a container for errors and as a small unit of data checkpointing. Two solutions to sustain the TBB semantics in hardware are introduced: software- and hardware-based. Stack’s area, power, performance, and coverage were evaluated using ToBBA’s hardware implementation model. The Stack attains an error correction coverage of 99.35% with 2.05 power overhead within an area overhead of 2.65. The Stack also presents a performance overhead of 1.33 or 1.54, depending on the hardware model adopted to support the TBB.

25

Hossack, C. J. "Fully interconnected fault tolerant transputer networks using global link adaptors." Thesis, University of Reading, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.283238.

26

Gaughan, Patrick T. "Design and analysis of fault-tolerant pipelined multicomputer networks." Diss., Georgia Institute of Technology, 1994. http://hdl.handle.net/1853/15701.

27

Xue, Zhengyuan, and 薛正远. "Implementation of fault-tolerant quantum computation with superconducting device." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2009. http://hub.hku.hk/bib/B43085465.

28

Osman, Taha Mohammed. "FADI : a fault-tolerant environment for distributed processing systems." Thesis, Nottingham Trent University, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.388867.

29

Xue, Zhengyuan. "Implementation of fault-tolerant quantum computation with superconducting device." Click to view the E-thesis via HKUTO, 2009. http://sunzi.lib.hku.hk/hkuto/record/B43085465.

30

Webster, Paul Thomas. "Fault-Tolerant Logical Operators in Quantum Error-Correcting Codes." Thesis, The University of Sydney, 2021. https://hdl.handle.net/2123/25112.

Abstract:

Performing quantum computing that is robust against noise will require that all operations are fault-tolerant, meaning that they succeed with high probability even if a limited number of errors occur. We address the problem of fault-tolerantly implementing logical operators on quantum error-correcting codes – operators that apply logic gates to information protected by such codes. Specifically, we investigate what classes of logical operators are possible by particular approaches in important types of codes, especially topological stabiliser codes. We also analyse what fundamental limitations constrain the goal of realising fault-tolerant quantum computing by such implementations and how these limitations can be overcome. We begin by presenting necessary background theory on quantum computing, quantum error-correcting codes and fault tolerance. We then specifically consider the approach to fault tolerance of locality-preserving logical operators in topological stabiliser codes. We present a method for determining the set of such operators admitted by a wide range of such codes and apply this method to important examples such as surface codes and colour codes. Next, we consider the alternative approach of implementing logical operators in topological stabiliser codes with defects, especially by the technique of braiding. We show that such approaches are fundamentally limited, but that effective schemes can nonetheless be constructed, both within these limitations and by circumventing them. We then consider fault tolerance in a more general context. We prove a highly general no-go theorem in this context, applicable to a wide range of stabiliser codes. We also show that this proof illuminates how it can be circumvented and provides perspective on a range of fault-tolerant schemes. Finally, we conclude by reviewing how these results collectively address our research questions and suggesting future work.

31

Mancini, Luigi Vincenzo. "Reliability issues in the design of distributed object-based architectures." Thesis, University of Newcastle Upon Tyne, 1989. http://hdl.handle.net/10443/2057.

Abstract:

This thesis is aimed at enhancing the existing set of techniques for building distributed systems, specifically from the point of view of fault-tolerant computing. Reliability is of fundamental importance in the design and operation of distributed systems, as an increasing number of computers are employed in the automation of various essential services. In the past decade, much research effort has been concerned with the object-based methodology for the design and implementation of reliable distributed systems. This thesis describes three contributions to this effort. First, it is shown that object-based programming features can in fact be introduced into procedural languages provided that these languages are endowed with certain facilities. Then, work is discussed which illustrates the relationship between distributed object-based architectures and an apparently different form of distributed architectures based on processes. This work puts the notion of object-based architectures into a new perspective, which shows that the object-based philosophy and the process-based philosophy are the dual of each other. Finally, an important aspect of the design of an object-based distributed architecture is investigated, that of automatic garbage collection. A distributed garbage collection scheme is described that handles fault tolerance by an extension of the technique commonly employed to detect unwanted computations in distributed architectures. The scheme proposed can also be seen as yet a further illustration of the link between object-based and process-based architectures.

32

Clements, N. Scott. "Fault tolerance control of complex dynamical systems." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/15515.

33

Batchu, Rajanikanth Reddy. "Incorporating fault-tolerant features into message-passing middleware." Master's thesis, Mississippi State : Mississippi State University, 2003. http://library.msstate.edu/etd/show.asp?etd=etd-04072003-215052.

34

Fayyaz, Muhammad. "Task oriented fault-tolerant distributed computing for use on board spacecraft." Thesis, University of Leicester, 2016. http://hdl.handle.net/2381/36268.

Abstract:

Current and future space missions demand highly reliable, High Performance Embedded Computing (HPEC). The review of the literature has shown that no single solution at present addresses both HPEC and reliability efficiently. Furthermore, there is no suitable method of assessing performance for such a scheme. In this thesis a novel cooperative task-oriented fault-tolerant distributed computing (FTDC) architecture is proposed, which caters for high performance and reliability in systems on board spacecraft. In a nutshell, the architecture comprises two types of nodes, a computing node and an input-output node, interfaced together through a high-speed network with bus topology. To detect faults in the nodes, a fault management scheme specifically designed to support the cooperative task-oriented distributed computing concept is proposed and employed, which is referred to as Adaptive Middleware for Fault-Tolerance (AMFT). AMFT is implemented as a separate hardware block and operates in parallel with the processing unit within the computing node. A set of metrics is designed and mathematical models of availability and reliability are developed, which are used to evaluate the proposed distributed computing architecture and fault management scheme. As a new development, extending the current state of the art, the proposed fault-tolerant distributed architecture has been subjected to a rigorous assessment through hardware implementation. Implementation approaches at two levels were adopted to provide a proof of concept: a board level and a Multiprocessor System-on-Chip (MPSoC) level. Both distributed computing system implementations were evaluated for functional validity and performance. To examine the FTDC architecture performance under a realistic space related distributed computing scenario a case-study application, representing a satellite Attitude and Orbit Control System (AOCS), was developed. The AOCS application was selected because it features time-critical task execution, in which system failure and reconfiguration time must be kept minimal. Based on the case-study application, it was demonstrated that the FTDC architecture is capable of fully meeting the desired requirements by timely migrating tasks to functional nodes and keeping rollback of task states minimal, which proves the advantages of the adopted cooperative distributed approach for use on board spacecraft.

35

Bhaduri, Debayan. "Design and Analysis of Defect- and Fault-tolerant Nano-Computing Systems." Diss., Virginia Tech, 2007. http://hdl.handle.net/10919/26453.

Abstract:

The steady downscaling of CMOS technology has led to the development of devices with nanometer dimensions. Contemporaneously, maturity in technologies such as chemical self-assembly and DNA scaffolding has influenced the rapid development of non-CMOS nanodevices including vertical carbon nanotube (CNT) transistors and molecular switches. One main problem in manufacturing defect-free nanodevices, both CMOS and non-CMOS, is the inherent variability in nanoscale fabrication processes. Compared to current CMOS devices, nanodevices are also more susceptible to signal noise and thermal perturbations. One approach for developing robust digital systems from such unreliable nanodevices is to introduce defect- and fault-tolerance at the architecture level. Structurally redundant architectures, reconfigurable architectures and architectures that are a hybrid of the previous two have been proposed as potential defect- and fault-tolerant nanoscale architectures. Hence, the design of reliable nanoscale digital systems will require detailed architectural exploration. In this dissertation, we develop probabilistic methodologies and CAD tools to expedite the exploration of defect- and fault-tolerant architectures. These methodologies and tools will provide nanoscale system designers with the capability to carry out trade-off analysis in terms of area, delay, redundancy and reliability. During execution, the next state of a digital system is only dependent on the present state and the digital signals propagate in discrete time. Hence, we have used Markov processes to analyze the reliability of nanoscale digital architectures. Discrete Time Markov Chains (DTMCs) have been used to analyze logic architectures and Markov Decision processes (MDPs) have been used to analyze memory architectures. 
Since structurally redundant and reconfigurable nanoarchitectures may consist of millions of nanodevices, we have applied state space partitioning techniques and Belief propagation to scale these techniques. We have developed three toolsets based on these Markovian techniques. One of these toolsets has been specifically developed for the architectural exploration of molecular logic systems. The toolset can generate defect maps for isolating defective nanodevices and provide capabilities to organize structurally redundant fault-tolerant architectures with the non-defective devices. Design trade-offs for each of these architectures can be computed in terms of signal delay, area, redundancy and reliability. Another tool called HMAN (Hybrid Memory Analyzer) has been developed for analyzing molecular memory systems. Besides analyzing reliability-redundancy trade-offs using MDPs, HMAN provides a very accurate redundancy-delay trade-off analysis using HSPICE. SETRA (Scalable, Extensible Tool for Reliability Analysis) has been specifically designed for analyzing nanoscale CMOS logic architectures with DTMCs. SETRA also integrates well with current industry-standard CAD tools. It has been shown that multimodal computational models capture the operation of emerging nanoscale devices such as vertical CNT transistors, instead of the bimodal Boolean computational model that has been used to understand the operation of current electronic devices. We have extended an existing multimodal computational model based on Markov Random Fields (MRFs) for analyzing structurally redundant and reconfigurable architectures. Hence, this dissertation develops multiple probabilistic methodologies and tools for performing nanoscale architectural exploration. It also looks at different defect- and fault-tolerant architectures and explores different nanotechnologies.
Ph. D.
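The abstract's use of Discrete Time Markov Chains for reliability analysis can be illustrated with a small sketch. This is not SETRA, HMAN, or any tool from the dissertation; it assumes a hypothetical triple-modular-redundant (TMR) unit in which each working module fails independently with probability `p` per time step and a majority voter masks a single failure. The function names and the three-state model are assumptions made for illustration only.

```python
def tmr_transition_matrix(p):
    """One-step DTMC for a hypothetical TMR unit.

    States: 0 = all three modules good, 1 = exactly one module failed,
    2 = two or more failed (system failed, absorbing).
    """
    q = 1.0 - p
    return [
        [q**3, 3 * p * q**2, 1.0 - q**3 - 3 * p * q**2],
        [0.0, q**2, 1.0 - q**2],
        [0.0, 0.0, 1.0],
    ]


def step(dist, matrix):
    """Advance a state distribution by one time step (row vector times matrix)."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def reliability(p, steps):
    """Probability that the voter still sees a correct majority after `steps` steps."""
    dist = [1.0, 0.0, 0.0]  # start with all modules good
    matrix = tmr_transition_matrix(p)
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist[0] + dist[1]
```

Calling `reliability(p, steps)` for a range of `p` values gives the kind of reliability-versus-redundancy curve the abstract refers to, although a real nanoscale architecture would need a far larger state space, which is why the dissertation turns to state space partitioning.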

36

Mishra, Shivakant. "Consul: A communication substrate for fault-tolerant distributed programs." Diss., The University of Arizona, 1992. http://hdl.handle.net/10150/185824.

Abstract:

As human dependence on computing technology increases, so does the need for computer system dependability. This dissertation introduces Consul, a communication substrate designed to help improve system dependability by providing a platform for building fault-tolerant, distributed systems based on the replicated state machine approach. The key issues in this approach--ensuring replica consistency and reintegrating recovering replicas--are addressed in Consul by providing abstractions called fault-tolerant services. These include a broadcast service to deliver messages to a collection of processes reliably and in some consistent order, a membership service to maintain a consistent system-wide view of which processes are functioning and which have failed, and a recovery service to recover a failed process. Fault-tolerant services are implemented in Consul by a unified collection of protocols that provide support for managing communication, redundancy, failures, and recovery in a distributed system. At the heart of Consul is Psync, a protocol that provides for multicast communication based on a context graph that explicitly records the partial (or causal) order of messages. This graph also serves as the basis for novel algorithms used in the ordering, membership, and recovery protocols. The ordering protocol combines the semantics of the operations encoded in messages with the partial order provided by Psync to increase the concurrency of the application. Similarly, the membership protocol exploits the partial ordering to allow different processes to conclude that a failure has occurred at different times relative to the sequence of messages received, thereby reducing the amount of synchronization required. The recovery protocol combines checkpointing with the replay of messages stored in the context graph to recover the state of a failed process. 
Moreover, this collection of protocols is implemented in a highly configurable manner, thus allowing a system builder to easily tailor an instance of Consul from this collection of building-block protocols. Consul is built in the x-Kernel and executes standalone on a collection of Sun 3 workstations. Initial testing and performance studies have been done using two applications: a replicated directory and a distributed wordgame. These studies show that the semantic-based order is more efficient than a total order in many situations, and that the overhead imposed by the checkpointing, membership, and recovery protocols is insignificant.
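The idea of a context graph that records the partial (causal) order of messages can be sketched in a few lines. The class below is a toy model and not Psync or Consul code: the class name, method names, and the rule of delivering a message only once all of its direct predecessors have been delivered are assumptions chosen to illustrate the concept.

```python
class ContextGraph:
    """Toy context graph: each message names the messages it directly follows."""

    def __init__(self):
        self.preds = {}      # delivered message id -> set of direct predecessors
        self.delivered = []  # message ids in the order they were delivered
        self._pending = {}   # received but not yet deliverable messages

    def receive(self, msg_id, preds=()):
        """Buffer a message, then deliver every message whose predecessors are in."""
        self._pending[msg_id] = set(preds)
        progress = True
        while progress:
            progress = False
            for mid in sorted(self._pending):
                if all(p in self.preds for p in self._pending[mid]):
                    self.preds[mid] = self._pending.pop(mid)
                    self.delivered.append(mid)
                    progress = True

    def precedes(self, a, b):
        """True if a is an ancestor of b, i.e. a causally precedes b."""
        stack, seen = list(self.preds.get(b, ())), set()
        while stack:
            cur = stack.pop()
            if cur == a:
                return True
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.preds.get(cur, ()))
        return False

    def concurrent(self, a, b):
        """True if neither message causally precedes the other."""
        return a != b and not self.precedes(a, b) and not self.precedes(b, a)
```

Messages that are `concurrent` in this graph are the ones an ordering protocol is free to apply in different orders at different replicas when the operations they carry commute, which is the source of the extra concurrency the abstract describes.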

37

Parameswaran, Rupa. "Investigation of precision versus fault tolerance in voting algorithms." Thesis, Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/13536.

38

Blum, Daniel Ryan. "VLSI implementation of cross-parity and modified dice fault tolerant schemes." Online access for everyone, 2004. http://www.dissertations.wsu.edu/Thesis/Spring2004/d%5Fblum%5F043004.pdf.

39

Subramaniyan, Rajagopal. "Gossip-based failure detection and consensus for terascale computing." [Gainesville, Fla.] : University of Florida, 2003. http://purl.fcla.edu/fcla/etd/UFE0000799.

40

Tadepalli, Sriram Satish. "GEMS: A Fault Tolerant Grid Job Management System." Thesis, Virginia Tech, 2003. http://hdl.handle.net/10919/9661.

Abstract:

Grid environments are inherently unstable: resources join and leave the environment without any prior notification. Application fault detection, checkpointing, and restart are therefore of foremost importance in Grid environments. The need for fault tolerance is especially acute for large parallel applications, since the failure rate grows with the number of processors and the duration of the computation. A Grid job management system hides the heterogeneity of the Grid and the complexity of the Grid protocols from the user. The user submits a job to the Grid job management system, which finds the appropriate resource, submits the job, and transfers the output files to the user upon job completion. However, current Grid job management systems do not detect application failures. The goal of this research is to develop a Grid job management system that can efficiently detect application failures. Failed jobs are either restarted on the same resource or migrated to another resource and restarted there. The research also aims to identify the role of local resource managers in the fault detection and migration of Grid applications.
Master of Science

41

何偉康 and Wai-hong Ho. "Performance and fault-tolerance studies of wormhole routers in 2D meshes." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1997. http://hub.hku.hk/bib/B31214125.

42

Ho, Wai-hong. "Performance and fault-tolerance studies of wormhole routers in 2D meshes /." Hong Kong : University of Hong Kong, 1997. http://sunzi.lib.hku.hk/hkuto/record.jsp?B19685737.

43

Chen, Changgui. "A Reactive system model for building fault-tolerant distributed applications." Deakin University. School of Computing and Mathematics, 2001. http://tux.lib.deakin.edu.au./adt-VDU/public/adt-VDU20050915.134208.

Abstract:

The development of fault-tolerant computing systems is a very difficult task. Two reasons contribute to this difficulty. The first is that, in normal practice, fault-tolerant computing policies and mechanisms are deeply embedded in most application programs, so that these programs cannot cope with changes in environments, policies and mechanisms. These factors may change frequently in a distributed environment, especially a heterogeneous one. Therefore, in order to develop better fault-tolerant systems that can cope with constant changes in environments and user requirements, it is essential to separate the fault-tolerant computing policies and mechanisms in application programs. The second is that, although a number of techniques have been proposed for the construction of reliable and fault-tolerant computing systems, and many computer systems are being developed to tolerate various hardware and software failures, most of these systems are intended for specific application areas, since it is extremely difficult to develop systems that can be used for general-purpose fault-tolerant computing. The motivation of this thesis is based on these two aspects. The focus of the thesis is on developing a model based on reactive system concepts for building better fault-tolerant computing applications. Reactive system concepts are an attractive paradigm for system design, development and maintenance because they separate policies from mechanisms. The emphasis of the model is on providing a flexible system architecture for general-purpose fault-tolerant application development, and the model can be applied in many specific applications. With this reactive system model, we can separate fault-tolerant computing policies and mechanisms in the applications, so that the development and maintenance of fault-tolerant computing systems can be made easier.

44

Ebert, Dean A. "Design and development of a configurable fault-tolerant processor (CFTP) for space applications." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2003. http://library.nps.navy.mil/uhtbin/hyperion-image/03Jun%5FEbert.pdf.

Abstract:

Thesis (M.S. in Electrical Engineering)--Naval Postgraduate School, June 2003.
Thesis advisor(s): Herschel H. Loomis, Alan A. Ross. Includes bibliographical references (p. 219-224). Also available online.

45

Bakhshi, Valojerdi Zeinab. "Persistent Fault-Tolerant Storage at the Fog Layer." Licentiate thesis, Mälardalens högskola, Inbyggda system, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-55680.

Abstract:

Clouds are powerful computer centers that provide computing and storage facilities that can be remotely accessed. The flexibility and cost-efficiency offered by clouds have made them very popular for business and web applications. The use of clouds is now being extended to safety-critical applications such as factories. However, cloud services do not provide time predictability which creates a hassle for such time-sensitive applications. Moreover, delays in the data communication between clouds and the devices the clouds control are unpredictable. Therefore, to increase predictability an intermediate layer between devices and the cloud is introduced. This layer, the Fog layer, aims to provide computational resources closer to the edge of the network. However, the fog computing paradigm relies on resource-constrained nodes, creating new potential challenges in resource management, scalability, and reliability. Solutions such as lightweight virtualization technologies can be leveraged for solving the dichotomy between performance and reliability in fog computing. In this context, container-based virtualization is a key technology providing lightweight virtualization for cloud computing that can be applied in fog computing as well. Such container-based technologies provide fault tolerance mechanisms that improve the reliability and availability of application execution. By the study of a robotic use-case, we have realized that persistent data storage for stateful applications at the fog layer is particularly important. In addition, we identified the need to enhance the current container orchestration solution to fit fog applications executing in container-based architectures. In this thesis, we identify open challenges in achieving dependable fog platforms. Among these, we focus particularly on scalable, lightweight virtualization, auto-recovery, and re-integration solutions after failures in fog applications and nodes. 
We implement a testbed to deploy our use-case on a container-based fog platform and investigate the fulfillment of key dependability requirements. We enhance the architecture and identify the lack of persistent storage for stateful applications as an important impediment for the execution of control applications. We propose a solution for persistent fault-tolerant storage at the fog layer, which dissociates storage from applications to reduce application load and separates the concern of distributed storage. Our solution includes a replicated data structure supported by a consensus protocol that ensures distributed data consistency and fault tolerance in case of node failures. Finally, we use the UPPAAL verification tool to model and verify the fault tolerance and consistency of our solution.

46

Payne, John C. "Fault tolerant computing testbed : a tool for the analysis of hardware and software fault handling techniques /." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1998. http://handle.dtic.mil/100.2/ADA359579.

Abstract:

Thesis (M.S. in Electrical Engineering)--Naval Postgraduate School, December 1998.
Thesis advisor(s): Alan A. Ross. Includes bibliographical references (p. 169). Also available online.

47

Huang, Dijiang Medhi Deepankar. "Many-to-many secure group communication and its applications." Diss., UMK access, 2004.

Abstract:

Thesis (Ph. D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2004.
"A dissertation in computer networking and telecommunication networking." Advisor: Deep Medhi. Typescript. Vita. Title from "catalog record" of the print edition. Description based on contents viewed Feb. 24, 2006. Includes bibliographical references (leaves 140-147). Online version of the print edition.

48

Crick, David Alan. "Measures of inexact diagnosability." Diss., Georgia Institute of Technology, 1990. http://hdl.handle.net/1853/9216.

49

Bharthipudi, Saraswati. "Comparison of numerical result checking mechanisms for FFT computations under faults." Available online, Georgia Institute of Technology, 2004:, 2003. http://etd.gatech.edu/theses/available/etd-12172003-171912/unrestricted/Saraswati%5FBharthipudi%5F2002%5F05.pdf.

Abstract:

Thesis (M.S.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2004.
Dr. Feodor Vainstein, Committee Member; Dr. Doug Blough, Committee Chair; Dr. David Schimmel, Committee Member. Includes bibliographical references (leaves 71-75).

50

Hay, Karen June. "A proof methodology for verification of real-time and fault-tolerance properties of distributed programs." Diss., The University of Arizona, 1993. http://hdl.handle.net/10150/186261.

Abstract:

From the early days of programming, the dependability of software has been a concern. The development of distributed systems that must respond in real time and continue to function correctly in spite of hardware failure has increased the concern while making the task of ensuring dependability more complex. This dissertation presents a technique for improving confidence in software designed to execute on a distributed system of fail-stop processors. The methodology presented is based on a temporal logic augmented with time intervals and probability distributions. A temporal logic augmented with time intervals, Bounded Time Temporal Logic (BTTL), supports the specification and verification of real-time properties such as, "The program will poll the sensor every t to T time units." Analogously, a temporal logic augmented with probability distributions, Probabilistic Bounded Time Temporal Logic (PBTTL), supports reasoning about fault-tolerant properties such as, "The program will complete with probability less than or equal to p", and a combination of these properties such as, "The program will complete within t and T time units with probability less than or equal to p." The syntax and semantics of the two logics, BTTL and PBTTL, are carefully developed. This includes development of a program state model, state transition model, message passing system model and failure system model. An axiomatic program model is then presented and used for the development of a set of inference rules. The inference rules are designed to simplify use of the logic for reasoning about typical programming language constructs and commonly occurring programming scenarios. 
In addition to offering a systematic approach for verifying typical behaviors, the inference rules are intended to support the derivation of formulas expressing timing and probabilistic relationships between the execution times and probabilities of individual statements, groups of statements, message passing and failure recovery. Use of the methodology is demonstrated in examples of varying complexity, including five real-time examples and four combined real-time and fault-tolerant examples.
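The example property quoted in the abstract, "The program will poll the sensor every t to T time units," can be made concrete with a small check over a recorded trace. This is only a runtime test of one observed execution, not the proof methodology of the dissertation, which reasons about all executions; the function name and the representation of a trace as a list of poll timestamps are assumptions.

```python
def polls_within_bounds(timestamps, t, T):
    """Check that consecutive polls in one trace are between t and T time units apart.

    `timestamps` is an increasing list of times at which the sensor was polled.
    A trace with fewer than two polls has no gaps and trivially satisfies the bound.
    """
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return all(t <= gap <= T for gap in gaps)
```

A trace that passes this check is consistent with the bounded-time property, but only a proof in a logic such as BTTL can establish that every possible execution satisfies it.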
