Dissertations / Theses on the topic 'Fault-tolerance computing'
Consult the top 50 dissertations / theses for your research on the topic 'Fault-tolerance computing.'
Mugwar, Bader. "Fault tolerance : a new method to detect fault in computing systems." Virtual Press, 1986. http://liblink.bsu.edu/uhtbin/catkey/450654.
Sullivan, John F. "Network fault tolerance system." Link to electronic thesis, 2000. http://www.wpi.edu/Pubs/ETD/Available/etd-0501100-125656.
Wagealla, Waleed. "Reliable mobile agents for distributed computing." Thesis, Nottingham Trent University, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.272441.
Pierce, Evelyn Tumlin. "Self-adjusting quorum systems for Byzantine fault tolerance /." Full text (PDF) from UMI/Dissertation Abstracts International, 2000. http://wwwlib.umi.com/cr/utexas/fullcit?p3004357.
Hall, Stephen. "An integrated fault tolerance framework for service oriented computing." Thesis, Lancaster University, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.547982.
Clements, N. Scott. "Fault tolerance control of complex dynamical systems." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/15515.
Damani, Om Prakash. "Optimistic protocols for fault-tolerance in distributed systems /." Digital version accessible at:, 1999. http://wwwlib.umi.com/cr/utexas/main.
Snodgrass, Joshua D. "Low-power fault tolerance for spacecraft FPGA-based numerical computing." Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2006. http://library.nps.navy.mil/uhtbin/hyperion/06Sep%5FSnodgrass%5FPhD.pdf.
Dissertation Advisor(s): Herschel H. Loomis. "September 2006." Includes bibliographical references (p. 217-224). Also available in print.
Hunt, Robert D. "New software-based fault tolerance methods for high performance computing." Thesis, University of Bristol, 2015. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.683389.
Rao, Sriram S. "Egida : a toolkit for low-overhead fault-tolerance /." Digital version accessible at:, 1999. http://wwwlib.umi.com/cr/utexas/main.
Parameswaran, Rupa. "Investigation of precision versus fault tolerance in voting algorithms." Thesis, Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/13536.
Bazzi, Rida Adnan. "Automatically increasing fault tolerance in distributed systems." Diss., Georgia Institute of Technology, 1994. http://hdl.handle.net/1853/8133.
Klonowska, Kamilla. "Theoretical aspects on performance bounds and fault tolerance in parallel computing /." Karlskrona : Department of Systems and Software Engineering, School of Engineering, Blekinge Institute of Technology, 2007. http://www.bth.se/fou/Forskinfo.nsf/allfirst2/a46ebb190dfb7caec12573a700356d59?OpenDocument.
Yi, Byungho. "Faults and fault-tolerance in distributed computing systems : the election problem." Diss., Georgia Institute of Technology, 1994. http://hdl.handle.net/1853/8312.
Stewart, Robert. "Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell." Thesis, Heriot-Watt University, 2013. http://hdl.handle.net/10399/2834.
Bicer, Tekin. "Supporting Fault Tolerance and Dynamic Load Balancing in FREERIDE-G." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1267638588.
Roy, Amitabha. "Symmetry breaking and fault tolerance in boolean satisfiability /." view abstract or download file of text, 2001. http://wwwlib.umi.com/cr/uoregon/fullcit?p3024528.
Typescript. Includes vita and abstract. Includes bibliographical references (leaves 124-127). Also available for download via the World Wide Web; free to University of Oregon users.
Nguyen, Anthony. "Database system architecture for fault tolerance and disaster recovery." [Denver, Colo.] : Regis University, 2009. http://adr.coalliance.org/codr/fez/view/codr:152.
何偉康 and Wai-hong Ho. "Performance and fault-tolerance studies of wormhole routers in 2D meshes." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1997. http://hub.hku.hk/bib/B31214125.
Ho, Wai-hong. "Performance and fault-tolerance studies of wormhole routers in 2D meshes /." Hong Kong : University of Hong Kong, 1997. http://sunzi.lib.hku.hk/hkuto/record.jsp?B19685737.
Tarafdar, Ashis. "Software fault tolerance in distributed systems using controlled re-execution /." Digital version accessible at:, 2000. http://wwwlib.umi.com/cr/utexas/main.
Arechiga, Austin Podoll. "Sensitivity of Feedforward Neural Networks to Harsh Computing Environments." Thesis, Virginia Tech, 2018. http://hdl.handle.net/10919/84527.
Master of Science
Soria-Rodriguez, Pedro. "Multicast-Based Interactive-Group Object-Replication For Fault Tolerance." Digital WPI, 1999. https://digitalcommons.wpi.edu/etd-theses/1069.
Villamayor, Leguizamón Jorge Luis. "Fault tolerance configuration and management for HPC applications using RADIC architecture." Doctoral thesis, Universitat Autònoma de Barcelona, 2018. http://hdl.handle.net/10803/666057.
High Performance Computing (HPC) systems continue to grow exponentially in component count and density in order to deliver the computational power that applications demand. At the same time, cloud computing is becoming popular, as key features such as scalability, pay-per-use and availability continue to evolve. It is also becoming a competitive platform for running parallel HPC applications due to the increasing performance of virtualized, highly available instances. However, increasing the number of components to build larger systems tends to increase the frequency of failures in both cluster and cloud environments. Nowadays, HPC systems have a failure rate of around 1000 per year, which means a failure roughly every 8 to 9 hours. Most parallel distributed applications are built on top of a Message Passing Interface (MPI) implementation. MPI implementations follow default fail-stop semantics, which abort the execution when a host fails in a cluster. In this case, the application owner needs to restart the execution, which affects the wall clock time and also the cost, since computing resources must be acquired for longer periods of time. Fault Tolerance (FT) techniques need to be applied to MPI parallel executions in both cluster and cloud environments. With FT techniques, high availability is ensured for parallel applications. Some FT solutions require administrator privileges to install them on the cluster nodes. Moreover, when failures appear, human intervention is required to recover the application. A solution that minimizes user and administrator intervention is preferred. A contribution of this thesis is a Fault Tolerance Manager (FTM) for coordinated checkpointing, which provides the application's users with automatic recovery from failures when computing nodes are lost. 
It takes advantage of node-local storage to save checkpoints, and it distributes copies of them across all the computation nodes, avoiding the bottleneck of a central stable storage. We also leverage the FTM to support uncoordinated and semi-coordinated rollback recovery protocols. In this contribution, FTM is implemented in the application layer. Furthermore, a dynamic resource controller is added to the FTM, which monitors the resource usage of FT protection and performs actions to maintain an acceptable level of protection. Another contribution addresses the configuration of FT protection and recovery tasks. Two models are introduced. The First Protection Point (FPP) model determines the point at which to start FT protection, reducing total execution time including failures by removing unnecessary checkpoints. The second model improves the FT resource configuration for the recovery task. Regarding cloud environments, we propose Resilience as a Service (RaaS), a fault-tolerant framework for HPC applications, which uses FTM. RaaS provides clouds with a highly available, distributed and scalable fault-tolerant service. It redesigns traditional HPC protection and recovery mechanisms to natively leverage cloud capabilities and their multiple alternatives for implementing FT tasks. To summarize, this thesis contributes a multi-platform resilience manager, suitable for traditional bare-metal clusters and clouds (public and private). The presented solution provides FT in an automatic, distributed and transparent manner at the application and user levels, according to user, application, and runtime requirements. It gives the users critical FT information, allowing them to trade off cost and protection while keeping the mean time to repair within acceptable ranges. Several experimental environments, such as bare-metal clusters and clouds (public and private) running different parallel applications, were used during the experimental validation. 
The experiments verify the functionality of the contributions and the improvements they bring. They also show that the Mean Time To Repair is bounded within acceptable ranges.
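The failure-rate arithmetic in the abstract above can be checked with a short Python sketch. It converts the quoted 1000 failures per year into a mean time between failures (MTBF), then applies Young's first-order approximation for a checkpoint interval. Young's formula is a standard textbook estimate and is not the FPP model proposed in the thesis; the 5-minute checkpoint cost is a purely hypothetical value chosen for illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def mtbf_hours(failures_per_year: float) -> float:
    """Mean time between failures, in hours, for a yearly failure count."""
    return HOURS_PER_YEAR / failures_per_year


def young_interval_hours(checkpoint_cost_hours: float, mtbf: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF).

    This is a generic estimate, not the model from the thesis.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf)


# 1000 failures per year is the figure quoted in the abstract.
mtbf = mtbf_hours(1000)

# Hypothetical assumption: writing one checkpoint costs 5 minutes.
interval = young_interval_hours(5 / 60, mtbf)

print(f"MTBF: {mtbf:.2f} h, suggested checkpoint interval: {interval:.2f} h")
```

With these inputs the MTBF is 8.76 hours, which is why the abstract speaks of a failure every 8 to 9 hours. The interval estimate only holds when the checkpoint cost is small compared with the MTBF.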
Varghese, Blesson. "Swarm-array computing : a swarm robotics inspired approach to achieve automated fault tolerance in high-performance computing systems." Thesis, University of Reading, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.559260.
Celik, Yasin. "FEASIBILITY STUDIES OF STATISTIC MULTIPLEXED COMPUTING." Diss., Temple University Libraries, 2018. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/511914.
Ph.D.
In 2012, when Professor Shi introduced me to the concept of Statistic Multiplexed Computing (SMC), I was skeptical. It contradicted everything I had learned and heard about distributed and parallel computing. However, I did believe that unhandled failures in any application will negatively impact its scalability. For that reason, I agreed to take on the feasibility study of SMC for practical applications. After more than six years of research and experimentation, it became clear to me that the most widely believed misconception is “either performance or reliability” when upscaling a distributed application. This misconception was the result of the direct use of hop-by-hop communication protocols in distributed application construction. Terminology: A hop-by-hop data protocol is a two-sided reliable lossless data communication protocol for transmitting data between a sender and a receiver. A crash of either the sender or the receiver causes data loss. Examples: MPI, RPC, RMI, OpenMP. An end-to-end data protocol is a single-sided reliable lossless data communication protocol for transmitting data between application programs. All processors, networks and storage available at runtime are automatically dispatched for best-effort support of reliable communication, regardless of transient and permanent device failures. Examples: HDFS, Blockchain, Fabric and SMC. An active end-to-end data protocol is a single-sided reliable lossless data communication protocol for transmitting data and automatically synchronizing application programs. Example: SMC (AnkaCom, AnkaStore (this dissertation)). Unlike hop-by-hop protocols, the use of an end-to-end protocol forms an application-dependent overlay network. An overlay network for distributed and parallel computing applications, such as Blockchain, has been proven to defy the “common wisdom” for two important distributed computing challenges: a) Extreme scale computing without single-point failures is practically feasible. 
Thus, all transaction or data losses can be eliminated. b) Extreme scale synchronized transaction replication is practically feasible. Thus, the CAP conjecture and theorem become irrelevant. Unlike passive overlay networks, such as HDFS and Blockchain, this dissertation study proves that an active overlay network can deliver higher performance, higher reliability and security at the same time as the application scales up. Although application-level security is not part of this dissertation, it is easy to see that application-level end-to-end protocols will fundamentally eliminate “man-in-the-middle” attacks. This will nullify many well-known attacks. With the zero-single-point-failure and zero-impact synchronous replication features, SMC applications are naturally resistant to DDoS and ransomware attacks. This dissertation explores practical implementations of the SMC concept for compute intensive (CI) and data intensive (DI) applications. This defense will disclose the details of CI and DI runtime implementations and results of inductive computational experiments. The computational environments include the NSF Chameleon bare-metal HPC cloud and Temple’s TCloud cluster.
Temple University--Theses
Luckow, André. "A dependable middleware for enhancing the fault tolerance of distributed computations in grid environments." Aachen Shaker, 2009. http://d-nb.info/1002791081/04.
Morten, Andrew J. "An accurate analytical framework for computing fault-tolerance thresholds using the [[7,1,3]] quantum code." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/35052.
Includes bibliographical references (p. 141-143).
In studies of the threshold for fault-tolerant quantum error-correction, it is generally assumed that the noise channel at all levels of error-correction is the depolarizing channel. The effects of this assumption on the threshold result are unknown. We address this problem by calculating the effective noise channel at all levels of error-correction specifically for the Steane [[7,1,3]] code, and we recalculate the threshold using the new noise channels. We present a detailed analytical framework for these calculations and run numerical simulations for comparison. We find that only X and Z failures occur with significant probability in the effective noise channel at higher levels of error-correction. We calculate that when changes in the noise channel are accounted for, the value of the threshold for the Steane [[7,1,3]] code increases by about 30 percent, from .00030 to .00039, when memory failures occur with one tenth the probability of all other failures. Furthermore, our analytical model provides a framework for calculating thresholds for systems where the initial noise channel is very different from the depolarizing channel, such as is the case for ion trap quantum computation.
by Andrew J. Morten.
S.B.
Hay, Karen June. "A proof methodology for verification of real-time and fault-tolerance properties of distributed programs." Diss., The University of Arizona, 1993. http://hdl.handle.net/10150/186261.
Alfawair, Mai. "A framework for evolving grid computing systems." Thesis, De Montfort University, 2009. http://hdl.handle.net/2086/3423.
Kwon, Young Woo. "Effective Fusion and Separation of Distribution, Fault-Tolerance, and Energy-Efficiency Concerns." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/49386.
Ph. D.
Stainer, Julien. "Computability Abstractions for Fault-tolerant Asynchronous Distributed Computing." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S054/document.
This thesis studies computability in systems composed of multiple computers exchanging messages or sharing memory. The considered models take into account the possible failure of some of these computers, as well as variations in time and heterogeneity of their execution speeds. The presented results essentially consider agreement problems, systems prone to partitioning and failure detectors. The document establishes relations between known iterated models and the concept of failure detector and presents a hierarchy of agreement problems spanning from k-set agreement to s-simultaneous consensus. It also introduces a new universal construction based on s-simultaneous consensus objects and a family of iterated models allowing several processes to run in isolation.
Shoker, Ali. "Byzantine fault tolerance from static selection to dynamic switching." Toulouse 3, 2012. http://thesesups.ups-tlse.fr/1924/.
Byzantine Fault Tolerance (BFT) is becoming crucial with the revolution of online applications and due to the increasing number of innovations in computer technologies. Although dozens of BFT protocols have been introduced in the previous decade, their adoption by practitioners has been disappointing. To some extent, this indicates that existing protocols are perhaps not yet convincing or satisfactory. The problem is that researchers are still trying to establish 'the best protocol' using traditional methods, e.g., through designing new protocols. However, theoretical and experimental analyses demonstrate that it is hard to achieve one-size-fits-all BFT protocols. Indeed, we believe that looking for smarter tactics like 'fasten fragile sticks with a rope to achieve a solid stick' is necessary to circumvent the issue. In this thesis, we introduce the first BFT selection model and algorithm that automate and simplify the election process of the 'preferred' BFT protocol among a set of candidate ones. The selection mechanism operates in three modes: Static, Dynamic, and Heuristic. For the two latter modes, we present a novel BFT system, called Adapt, that reacts to any potential changes in the system conditions and switches dynamically between existing BFT protocols, i.e., seeking adaptation. The Static mode allows BFT users to choose a single BFT protocol only once. This is quite useful in Web Services and Clouds where BFT can be sold as a service (and signed in the SLA contract). This mode is basically designed for systems whose states do not fluctuate much. In this mode, an evaluation process is in charge of matching the user preferences against the profiles of the nominated BFT protocols, considering both reliability and performance. The elected protocol is the one that achieves the highest evaluation score. The mechanism is well automated via mathematical matrices, and produces selections that are reasonable and close to reality. 
Some systems, however, may experience fluctuating conditions, like variable contention or message payloads. In this case, the static mode will not be efficient since a chosen protocol might not fit the new conditions. The Dynamic mode solves this issue. Adapt combines a collection of BFT protocols and switches between them, thus adapting to the changes of the underlying system state. Consequently, the 'preferred' protocol is always polled for each system state. This yields an optimal quality of service, i.e., reliability and performance. Adapt monitors the system state through its Event System, and uses a Support Vector Regression method to conduct run-time predictions for the performance of the protocols (e.g., throughput, latency, etc.). Adapt also operates in a Heuristic mode. Using predefined heuristics, this mode optimizes user preferences to improve the selection process. The evaluation of our approach shows that selecting the 'preferred' protocol is automated and close to reality in the static mode. In the Dynamic mode, Adapt always achieves the optimal performance among available protocols. The evaluation demonstrates that the overall system performance can be improved significantly too. Other cases show that it is not always worthwhile to switch between protocols. This is made possible through conducting predictions with high accuracy, which can reach more than 98% in many cases. Finally, the thesis shows that Adapt can be made smarter through the use of heuristics.
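The static selection step described in the abstract above (match user preferences against protocol profiles, then elect the highest score) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the protocol names, the three criteria, the profile values and the weighted-sum scoring are hypothetical and are not taken from the thesis, which only states that the evaluation uses matrices.

```python
from typing import Dict, List


def score_protocols(profiles: Dict[str, List[float]],
                    weights: List[float]) -> Dict[str, float]:
    """Weighted sum of each protocol's profile row against user preferences."""
    scores = {}
    for name, row in profiles.items():
        if len(row) != len(weights):
            raise ValueError(f"profile of {name} does not match the criteria")
        scores[name] = sum(value * weight for value, weight in zip(row, weights))
    return scores


def select_protocol(profiles: Dict[str, List[float]],
                    weights: List[float]) -> str:
    """Elect the protocol with the highest evaluation score."""
    scores = score_protocols(profiles, weights)
    return max(scores, key=scores.get)


# Hypothetical profiles; columns are [reliability, throughput, latency score],
# each normalized to the range 0..1 where higher is better.
PROFILES = {
    "protocol_a": [0.9, 0.4, 0.5],
    "protocol_b": [0.6, 0.9, 0.7],
    "protocol_c": [0.7, 0.6, 0.9],
}

reliability_first = select_protocol(PROFILES, [0.7, 0.2, 0.1])
performance_first = select_protocol(PROFILES, [0.2, 0.5, 0.3])
print(reliability_first, performance_first)
```

A user who weights reliability most heavily is matched with a different protocol than one who weights performance, which is the behaviour the Static mode is meant to automate.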
Kurt, Mehmet Can. "Fault-tolerant Programming Models and Computing Frameworks." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1437390499.
Jeffery, Casey Miles. "Performance analysis of dynamic sparing and error correction techniques for fault tolerance in nanoscale memory structures." [Gainesville, Fla.] : University of Florida, 2004. http://purl.fcla.edu/fcla/etd/UFE0007163.
Tadepalli, Sriram Satish. "GEMS: A Fault Tolerant Grid Job Management System." Thesis, Virginia Tech, 2003. http://hdl.handle.net/10919/9661.
Master of Science
Schöll, Alexander [Verfasser], and Hans-Joachim [Akademischer Betreuer] Wunderlich. "Efficient fault tolerance for selected scientific computing algorithms on heterogeneous and approximate computer architectures / Alexander Schöll ; Betreuer: Hans-Joachim Wunderlich." Stuttgart : Universitätsbibliothek der Universität Stuttgart, 2018. http://d-nb.info/1164013211/34.
Bakhshi, Valojerdi Zeinab. "Persistent Fault-Tolerant Storage at the Fog Layer." Licentiate thesis, Mälardalens högskola, Inbyggda system, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-55680.
Raja, Chandrasekar Raghunath. "Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1417733721.
Gheorghiu, Alexandru. "Robust verification of quantum computation." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31542.
Full textSilva, Jaquilino Lopes. "A distributed platform for the volunteer execution of workflows on a local area network." Master's thesis, Faculdade de Ciências e Tecnologia, 2014. http://hdl.handle.net/10362/13102.
Albatroz Engineering has developed a framework for overhead power line inspection data acquisition and analysis, which includes hardware and software. The framework’s software components include inspection data analysis and reporting tools, commonly known as the PLMI2 application/platform. In PLMI2, the analysis of overhead power line maintenance inspection data consists of a sequence of Automatic Tasks (ATs) interleaved with Manual Tasks (MTs). An AT consists of a set of algorithms that receives one or more datasets as input, processes them and returns new datasets. In turn, an MT enables human supervisors (also known as line inspection operators) to correct, improve and validate the results of ATs. ATs run faster than MTs: in the overall work cycle, ATs take less than 10% of total processing time, but still take a few minutes. There are data flow dependencies among tasks, which can be modelled with a workflow, and even if MTs are omitted from this workflow, it is possible to carry out the sequence of ATs, postponing MTs. In fact, if the computing cost and waiting time are negligible, it may be advantageous to run ATs earlier in the workflow, prior to validation. To address this opportunity, Albatroz Engineering has invested in a new procedure to stream the data through all ATs fully unattended. Considering these scenarios, it could be useful to have a system capable of detecting available workstations at a given instant and subsequently distributing the ATs to them. In this way, operators could schedule the execution of future ATs for a given set of inspection data while they are performing MTs on another. The requirements of the system to implement fall within the field of Volunteer Computing Systems, and we will address some of the challenges posed by these kinds of systems, namely host volatility and failures. 
Volunteer Computing is a type of distributed computing which exploits idle CPU cycles from computing resources donated by volunteers and connected through the Internet/Intranet to compute large-scale simulations. This thesis proposes and designs a new distributed task scheduling system in the context of Volunteer Computing Systems, able to schedule the ATs of PLMI2 and exploit idle CPU cycles from workstations within the company’s local area network (LAN) to accelerate the data analysis, while being aware of data flow interdependencies. To evaluate the proposed system, a prototype has been implemented, and the simulation results have shown that it is scalable and supports fault-tolerant task execution by employing a rescheduling mechanism.
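The combination of data flow dependencies and rescheduling mentioned in the abstract above can be sketched in a few lines of Python. This is a sequential toy simulation under stated assumptions, not the scheduler built in the thesis: the task names, workstation names, round-robin host choice and the `execute` callback are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_workflow(deps: Dict[str, Set[str]],
                 hosts: List[str],
                 execute: Callable[[str, str], bool]) -> List[Tuple[str, str, bool]]:
    """Run tasks in dependency order and reschedule a task when its host fails.

    deps maps each task to the set of tasks whose output it needs.
    execute(task, host) returns True on success and False on host failure.
    Returns a log of (task, host, succeeded) attempts.
    """
    done: Set[str] = set()
    pending: Set[str] = set(deps)
    log: List[Tuple[str, str, bool]] = []
    next_host = 0
    while pending:
        # A task is ready once every task it depends on has completed.
        ready = sorted(t for t in pending if deps[t] <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for task in ready:
            for attempt in range(len(hosts)):
                host = hosts[(next_host + attempt) % len(hosts)]
                ok = execute(task, host)
                log.append((task, host, ok))
                if ok:
                    done.add(task)
                    pending.discard(task)
                    next_host = (next_host + attempt + 1) % len(hosts)
                    break
                # On failure the loop continues, rescheduling on the next host.
            else:
                raise RuntimeError(f"task {task} failed on every host")
    return log


# Hypothetical workflow of four automatic tasks with a diamond dependency.
DEPS = {"at1": set(), "at2": {"at1"}, "at3": {"at1"}, "at4": {"at2", "at3"}}
HOSTS = ["ws1", "ws2", "ws3"]

# Assumption for the demo: workstation ws2 is unavailable for the whole run.
log = run_workflow(DEPS, HOSTS, lambda task, host: host != "ws2")
completed = [task for task, _, ok in log if ok]
print(completed)
```

Every task assigned to the unavailable workstation is retried elsewhere, so the workflow still completes in an order that respects the dependencies.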
Guo, Yan. "Fault-tolerant resource allocation of an airborne network." Diss., Online access via UMI:, 2007.
Includes bibliographical references.
Stoicescu, Miruna. "Architecting Resilient Computing Systems : a Component-Based Approach." Thesis, Toulouse, INPT, 2013. http://www.theses.fr/2013INPT0120/document.
Evolution during service life is mandatory, particularly for long-lived systems. Dependable systems, which continuously deliver trustworthy services, must evolve to accommodate changes, e.g., new fault tolerance requirements or variations in available resources. The addition of this evolutionary dimension to dependability leads to the notion of resilient computing. Among the various aspects of resilience, we focus on adaptivity. Dependability relies on fault tolerant computing at runtime, applications being augmented with fault tolerance mechanisms (FTMs). As such, on-line adaptation of FTMs is a key challenge towards resilience. In related work, on-line adaptation of FTMs is most often performed in a preprogrammed manner or consists in tuning some parameters. Besides, FTMs are replaced monolithically. All the envisaged FTMs must be known at design time and deployed from the beginning. However, dynamics occur along multiple dimensions and developing a system for the worst-case scenario is impossible. According to runtime observations, new FTMs can be developed off-line but integrated on-line. We denote this ability as agile adaptation, as opposed to the preprogrammed one. In this thesis, we present an approach for developing flexible fault-tolerant systems in which FTMs can be adapted at runtime in an agile manner through fine-grained modifications that minimize the impact on the initial architecture. We first propose a classification of a set of existing FTMs based on criteria such as fault model, application characteristics and necessary resources. Next, we analyze these FTMs and extract a generic execution scheme which pinpoints the common parts and the variable features between them. Then, we demonstrate the use of state-of-the-art tools and concepts from the field of software engineering, such as component-based software engineering and reflective component-based middleware, for developing a library of fine-grained adaptive FTMs. 
We evaluate the agility of the approach and illustrate its usability through two examples of integration of the library: first, in a design-driven development process for applications in pervasive computing and, second, in a toolkit for developing applications for WSNs.
Zhan, Zhiyuan. "Meeting Data Sharing Needs of Heterogeneous Distributed Users." Diss., Georgia Institute of Technology, 2007. http://hdl.handle.net/1853/14598.
Jeganathan, Nithyananda Siva. "A CONTROLLER AREA NETWORK LAYER FOR RECONFIGURABLE EMBEDDED SYSTEMS." UKnowledge, 2007. http://uknowledge.uky.edu/gradschool_theses/484.
Viana, Antonio Eduardo Bernardes. "Uma Abordagem Autonômica para Tolerância a Falhas na Execução de Aplicações em Desktop Grids." Universidade Federal do Maranhão, 2011. http://tedebc.ufma.br:8080/jspui/handle/tede/479.
Computer grids are characterized by the high dynamism of their execution environment, the heterogeneity of resources and applications, and the requirement for high scalability. These features make tasks such as configuration, maintenance and recovery of failed applications quite challenging, and it is becoming increasingly difficult for human agents alone to perform them. The autonomic computing paradigm denotes computer systems capable of changing their behavior dynamically in response to changes in the execution environment. To achieve this, the software is generally organized following the MAPE-K (Monitoring, Analysis, Planning, Execution and Knowledge) model, in which autonomic managers perform the activities of sensing the execution environment, analyzing the context, and planning and executing dynamic reconfiguration actions, based on shared knowledge about the controlled system. In this work we present an autonomic mechanism based on the MAPE-K model to provide fault tolerance for applications running on computer grids. It is capable of monitoring the execution environment and, based on the evaluation of the collected data, deciding which reconfiguration actions should be applied to the fault tolerance mechanism in order to keep the system in balance with the goals of minimizing the average completion time of applications and providing a high success rate in completing their tasks. This work also describes the performance evaluation of the proposed autonomic mechanism, carried out using simulation techniques that took into account several scenarios typical of opportunistic desktop grid environments.
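The MAPE-K organization named in the abstract above can be sketched as one control-loop step in Python. This is a minimal illustration under stated assumptions: the choice of replica count as the tunable fault tolerance parameter, the failure-rate thresholds and all names are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Knowledge:
    """Shared knowledge about the controlled system (hypothetical fields)."""
    replicas: int = 1
    max_replicas: int = 4
    high_failure_rate: float = 0.2
    low_failure_rate: float = 0.05
    history: List[float] = field(default_factory=list)


def monitor(k: Knowledge, completed: int, failed: int) -> float:
    """Sense the environment: observed task failure rate in the last window."""
    total = completed + failed
    rate = failed / total if total else 0.0
    k.history.append(rate)
    return rate


def analyze(k: Knowledge, rate: float) -> str:
    """Classify the current context against the thresholds in the knowledge."""
    if rate > k.high_failure_rate:
        return "under_protected"
    if rate < k.low_failure_rate:
        return "over_protected"
    return "balanced"


def plan(k: Knowledge, symptom: str) -> Optional[int]:
    """Choose a reconfiguration action, or None when nothing should change."""
    if symptom == "under_protected" and k.replicas < k.max_replicas:
        return k.replicas + 1
    if symptom == "over_protected" and k.replicas > 1:
        return k.replicas - 1
    return None


def execute(k: Knowledge, new_replicas: Optional[int]) -> None:
    """Apply the planned action to the fault tolerance mechanism."""
    if new_replicas is not None:
        k.replicas = new_replicas


def mape_k_step(k: Knowledge, completed: int, failed: int) -> int:
    """One Monitor, Analyze, Plan, Execute pass over the shared Knowledge."""
    rate = monitor(k, completed, failed)
    execute(k, plan(k, analyze(k, rate)))
    return k.replicas


knowledge = Knowledge()
trace = [mape_k_step(knowledge, c, f) for c, f in [(7, 3), (9, 1), (100, 0), (100, 0)]]
print(trace)
```

Raising replication when failures are frequent favours the success rate, while lowering it when failures are rare favours completion time, which mirrors the two goals the abstract asks the mechanism to balance.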
Rao, Shrisha. "Safety and hazard analysis in concurrent systems." Diss., University of Iowa, 2005. http://ir.uiowa.edu/etd/106.
Karl, Holger. "Responsive Execution of Parallel Programs in Distributed Computing Environments." Doctoral thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II, 1999. http://dx.doi.org/10.18452/14455.
Clusters of standard workstations have been shown to be an attractive environment for parallel computing. However, there remain unsolved problems to make them suitable to some application scenarios. One of these problems is a dependable and timely program execution: There are many applications in which a program should be successfully completed at a predictable point of time. Mechanisms to combine the properties of both dependable and timely execution of parallel programs in distributed computing environments are the main objective of this dissertation. Addressing these properties requires a joint metric for dependability and timeliness. Responsiveness is such a metric; it is refined for the purposes of this work. As a case study, Calypso and Charlotte, two parallel programming systems, are analyzed and their shortcomings on several abstraction levels with regard to responsiveness are identified. Solutions for them are presented and generalized, resulting in widely applicable mechanisms for (parallel) responsive services. Specifically, these solutions are: 1) a responsiveness analysis of Calypso's eager scheduling (a mechanism for load balancing and fault masking), 2) ameliorating a single point of failure by a responsiveness analysis of checkpointing and by a standard interface-based system for replication of legacy software, 3) managing resources in a way suitable for parallel programs, and 4) using semantical information about the communication pattern of a program to improve its performance. All proposed mechanisms can be combined and are suitable for use in standard environments. It is shown by analysis and experiments that these mechanisms improve the responsiveness of eligible applications.
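Eager scheduling, which the abstract above describes as a mechanism for load balancing and fault masking, can be illustrated with a small round-based simulation in Python. The sketch assumes the usual formulation of the idea: idle workers keep picking up unfinished tasks, including tasks already handed to another worker, and the first result to arrive is kept. The worker and task names, the round model and the tie-breaking rule are hypothetical simplifications and are not taken from Calypso itself.

```python
from typing import Dict, List, Set, Tuple


def eager_schedule(tasks: List[str],
                   workers: List[str],
                   crashed: Set[str]) -> Tuple[Dict[str, str], int]:
    """Simulate eager scheduling in synchronous rounds.

    Each round, every worker takes the unfinished task that has been handed
    out the fewest times so far. Crashed workers never report a result, so
    their tasks are simply picked up again by others in a later round.
    Returns (task -> worker whose result was kept, number of rounds).
    """
    results: Dict[str, str] = {}
    handed_out = {task: 0 for task in tasks}
    rounds = 0
    while len(results) < len(tasks):
        if all(worker in crashed for worker in workers):
            raise RuntimeError("no live worker left to finish the tasks")
        rounds += 1
        unfinished = [task for task in tasks if task not in results]
        picked = []
        for worker in workers:
            task = min(unfinished, key=lambda t: handed_out[t])
            handed_out[task] += 1
            picked.append((worker, task))
        for worker, task in picked:
            # Tasks are assumed idempotent, so duplicate results are dropped.
            if worker not in crashed and task not in results:
                results[task] = worker
    return results, rounds


# Hypothetical run: three tasks, three workers, and w1 has silently crashed.
results, rounds = eager_schedule(["t0", "t1", "t2"], ["w0", "w1", "w2"], {"w1"})
print(results, rounds)
```

The crash of one worker is masked without any failure detector: the task it held stays unfinished, so a live worker re-executes it in the next round.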
Mohammed, Bashir. "A Framework for Efficient Management of Fault Tolerance in Cloud Data Centres and High-Performance Computing Systems: An Investigation and Performance analysis of a Cloud Based Virtual Machine Success and Failure Rate in a typical Cloud Computing Environment and Prediction Methods." Thesis, University of Bradford, 2019. http://hdl.handle.net/10454/17400.
Lemos, Fernando Tarlá Cardoso. "Uma arquitetura otimizada para a detecção de falhas em grades computacionais." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-19072013-115312/.
In distributed platforms, fault detection is an essential requirement for a wide range of fault tolerance techniques, such as restoring the state of distributed applications with checkpointing and message logging. However, fault detection often depends on reliable communication between the processing nodes and the fault detection modules. Direct communication between the nodes and detection modules is often impossible in hierarchical grid computing platforms. The physical distance between the institutions and resources available on the grid, and thus the need for long-distance networks to connect them, is another factor that makes direct fault detection in computer grids a challenge. This thesis presents a fault detection architecture for distributed platforms, optimized for use in hierarchical grids and thus taking into account their restrictions and requirements. The architecture, named GFDA (Grid Fault Detection Architecture), is structured as fault detection modules for faults that affect the computing nodes available on the grid, detection modules for faults that affect the distributed applications, and modules that perform the collection, processing and forwarding of the fault and recovery notifications generated by the detection modules. This thesis presents implementation details, an evaluation of the correctness of the designed architecture, and results obtained through the deployment of parts of the architecture in a simulated cluster that uses virtual machines to simulate computing nodes. Techniques to optimize the quality of the fault detection service are proposed. The results obtained with these techniques are compared to the results obtained with traditional approaches. Positive results were obtained even under adverse connectivity conditions by using techniques such as the processing of fault and recovery notifications and the introduction of redundant information in the messages exchanged between the detection modules.
It is concluded that the GFDA architecture contributes to the establishment of a viable solution for fault detection in a hierarchical grid computing platform that presents connectivity restrictions between the nodes.