2023, IJIRIS:: AM Publications
https://doi.org/10.26562/ijiris.2023.v0903.03…
With the increasing adoption of cloud computing, ensuring high availability and fault tolerance has become paramount for organizations. Amazon Web Services (AWS) offers a robust infrastructure for hosting applications, but it requires careful architectural design and implementation to achieve desired levels of availability and fault tolerance. This research paper explores two innovative concepts, namely Cloud Fractal and Decentralized Replication and Orchestration, and their application in achieving high availability and fault tolerance in AWS. We present a comprehensive analysis of these concepts and provide practical guidelines for their implementation in real-world scenarios. Our findings demonstrate the effectiveness of Cloud Fractal and Decentralized Replication and Orchestration in enhancing the reliability and resilience of AWS deployments.
AJIT-e Online Academic Journal of Information Technology, 2016
Cloud computing has recently become an attractive topic due to its ability to offer information technology solutions through virtual machines as on-demand services for sharing and consuming resources over the Internet. As a result of the rapid development of such services, fault tolerance in the cloud has become a major concern, together with reliability, availability and dependability, which are more critical for this new service type. This work investigates techniques and means of making cloud services fault tolerant and of keeping cloud customers' systems/enterprises that execute over the cloud safe from failures. Failures in cloud-enabled services should be expected to occur; hence they should be handled. Implementing fault tolerance strategies helps guarantee business continuity, avoid financial loss, recover systems from failures, and provide disaster recovery as well. The specific focus is to explore scenarios of avoiding/recovering from failures through redundancy, checkpoint and replication. Commercial IaaS providers such as Amazon's AWS and Google's GCE are taken as examples of how providers protect their infrastructure from failures; in this way a robust architecture with the fault tolerance property could be built for a system/enterprise. Hence, general conceptual steps with fault tolerance considerations have been proposed.
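As an illustration of the checkpoint idea named in this abstract, the following is a minimal sketch and not the authors' method. It assumes a long-running job whose progress can be captured in a small JSON-serializable state; the function names (`save_checkpoint`, `load_checkpoint`, `sum_with_checkpoints`) and the `fail_at` parameter used to simulate a crash are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # replaces the previous checkpoint in one step


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def sum_with_checkpoints(values, path, every=2, fail_at=None):
    """Sum values, saving progress every `every` items.

    `fail_at` is a hypothetical hook that simulates a crash at a given index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(values)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += values[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        try:
            sum_with_checkpoints([1, 2, 3, 4, 5], ckpt, fail_at=3)
        except RuntimeError:
            pass  # the job "crashed"; the last checkpoint survives on disk
        # The restarted job resumes from the checkpoint instead of from zero.
        print(sum_with_checkpoints([1, 2, 3, 4, 5], ckpt))
```

The restart redoes only the work performed after the last saved checkpoint, which is the trade-off behind the checkpoint interval: frequent checkpoints cost more I/O but lose less work on failure.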
Based on the pay-as-you-go strategy, cloud computing platforms are spreading very rapidly. One of the main characteristics of cloud computing is the splitting into many layers. From a technical point of view, most cloud computing platforms exploit virtualization, which implies that they are split into 3 layers: hosts, virtual machines and applications. From an administration point of view, they are split into 2 layers: the cloud provider who manages the hosting center and the customer who manages his application in the cloud. This structuring of the cloud makes it difficult to implement effective management policies. This paper focuses on fault tolerance in cloud computing platforms and more precisely on autonomic repair in case of faults. It discusses the implications of this splitting for the implementation of fault tolerance. In most current approaches, fault tolerance is handled exclusively by the provider or by the customer, which leads to partial or inefficient solutions. Solutions that involve collaboration between the provider and the customer are much more promising. We illustrate this discussion with experiments where exclusive and collaborative fault tolerance solutions are implemented in an autonomic cloud infrastructure that we prototyped.
Academic journal of Nawroz University, 2024
Ensuring system availability and reliability is crucial in the quickly developing field of cloud computing. The importance of fault tolerance in cloud infrastructure systems grows as organizations become more reliant on it to support their critical operations. The purpose of this article is to investigate the intricate realm of cloud computing and distributed systems. Specifically, the paper investigates the numerous forms of cloud computing, fault tolerance methods, and frameworks that enable cloud services to be robust and durable. Cloud computing has transformed the way in which organizations and individuals access and administer computing resources. The paper discusses several deployment options, including public, private, hybrid, and multicloud environments, which provide organizations with the advantages of flexibility, scalability, and cost-effectiveness. The inherent flexibility of cloud computing renders it well-suited for a diverse range of applications, spanning from the hosting of websites to the execution of intricate data analytics processes. Despite its tremendous benefits, cloud computing encounters substantial obstacles, including the need to maintain uninterrupted service in the face of hardware failures, network outages, or software errors. The critical importance of fault tolerance in this situation cannot be overstated, as it plays a pivotal role in maintaining the dependability and availability of the system. The primary objective of this study is to examine the use of distributed systems as a means to augment fault tolerance within the realm of cloud computing and distributed systems. Distributed systems offer an optimal approach for addressing difficulties related to fault tolerance, owing to their intrinsic capability to divide workloads and data over several nodes.
This approach utilizes redundancy, replication, and the ability to recover seamlessly from disturbances, hence enhancing the resilience and resource efficiency of cloud services. This research reviews novel techniques and frameworks that utilize distributed systems to create fault-tolerant cloud computing architectures, emphasizing their substantial influence on the cloud computing domain. In conclusion, this research report includes a comparative analysis table that encompasses twenty preceding works.
IEEE Transactions on Cloud Computing, 2016
The large-scale utilization of cloud computing services for hosting industrial/enterprise applications has led to the emergence of cloud service reliability as an important issue for both cloud service providers and users. To enhance cloud service reliability, two types of fault tolerance schemes, reactive and proactive, have been proposed. Existing schemes rarely consider the problem of coordination among multiple virtual machines (VMs) that jointly complete a parallel application. Without VM coordination, the parallel application execution results will be incorrect. To overcome this problem, we first propose an initial virtual cluster allocation algorithm according to the VM characteristics to reduce the total network resource consumption and total energy consumption in the data center. Then, we model CPU temperature to anticipate a deteriorating physical machine (PM). We migrate VMs from a detected deteriorating PM to some optimal PMs. Finally, the selection of the optimal target PMs is modeled as an optimization problem that is solved using an improved particle swarm optimization algorithm. We evaluate our approach against five related approaches in terms of the overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications. Experimental results demonstrate the efficiency and effectiveness of our approach.
Index Terms: Cloud data center, cloud service reliability, fault tolerance (FT), particle swarm optimization (PSO), virtual cluster
1 INTRODUCTION
Cloud computing is widely adopted in current professional and personal environments. It employs several existing technologies and concepts, such as virtual servers and data centers, and gives them a new perspective [1]. Furthermore, it enables users and businesses to not only use applications without installing them on their machines but also access resources on any computer via the Internet [2].
With its pay-per-use business model for customers, cloud computing shifts the capital investment risk for under- or overprovisioning to cloud providers. Therefore, several leading technology companies, such as Google, Amazon, IBM, and Microsoft, operate large-scale cloud data centers around the world. With the growing popularity of cloud computing, modern cloud data centers are employing tens of thousands of physical machines (PMs) networked via hundreds of routers/switches that communicate and coordinate to deliver highly reliable cloud computing services. Although the failure probability of a single device/link might be low [3], it is magnified across all the devices/links hosted in a cloud data center owing to the problem of coordination of PMs. Moreover, multiple fault sources (e.g., software, human errors, and hardware) are the norm rather than the exception [4]. Thus, downtime is common and seriously affects the service level of cloud computing [5]. Therefore, enhancing cloud service reliability is a critical issue that requires immediate attention. Over the past few years, numerous fault tolerance (FT) approaches have been proposed to enhance cloud service reliability [6], [7]. It is well known that FT consists of fault detection, backup, and failure recovery, and nearly all FT approaches are based on the use of redundancy. Currently, two basic mechanisms, namely, replication and checkpointing, are widely adopted. In the replication mechanism, the same task is synchronously or asynchronously handled on several virtual machines (VMs) [8], [9], [10]. This mechanism ensures that at least one replica is able to complete the task on time. Nevertheless, because of its high implementation cost, the replication mechanism is more suitable for real-time or critical cloud services.
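Two points in the paragraph above can be made concrete with a small sketch: how a low per-device failure probability is magnified across many devices, and how replication returns a result as long as at least one replica completes. This is an illustrative sketch only, not the algorithm of the paper. It assumes independent failures, and the names `any_failure_probability` and `run_replicated`, as well as the thread-based stand-in for VMs, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def any_failure_probability(p, n):
    """Probability that at least one of n devices fails.

    Assumes each device fails independently with probability p.
    """
    return 1.0 - (1.0 - p) ** n


def run_replicated(task, replica_ids):
    """Run the same task on every replica and return the first success.

    Threads stand in for VMs here; `task` receives the replica id.
    Raises RuntimeError if every replica fails.
    """
    if not replica_ids:
        raise ValueError("at least one replica is required")
    with ThreadPoolExecutor(max_workers=len(replica_ids)) as pool:
        futures = [pool.submit(task, rid) for rid in replica_ids]
        for future in as_completed(futures):
            if future.exception() is None:
                return future.result()
    raise RuntimeError("all replicas failed")


if __name__ == "__main__":
    # A 0.1% per-device failure probability across 10,000 devices.
    print(any_failure_probability(0.001, 10_000))

    def flaky_task(replica_id):
        # Hypothetical task: only replica 2 is healthy.
        if replica_id != 2:
            raise RuntimeError(f"replica {replica_id} is down")
        return f"result from replica {replica_id}"

    print(run_replicated(flaky_task, [0, 1, 2]))
```

Every replica runs the full task, so the work is paid for once per replica, which matches the high implementation cost the paragraph attributes to replication.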
The checkpointing mechanism is categorized into two main types: independent checkpoint mechanisms that only consider a whole application to perform on a VM, and coordinated checkpoint mechanisms that consider multiple VMs (i.e., a virtual cluster) to jointly execute parallel applications [11], [12], [13], [14], [15], [16]. The two types of mechanisms periodically save the
Cloud computing technologies are being used aggressively these days to enable the use of shared resources. However, the confidentiality and availability of the data stored in the cloud are still serious problems. In a cloud, several kinds of faults occur that hamper the continuous availability of service to the end customer. Faults can be hardware, software or network related, and the infrastructure installed in the cloud is affected by all of them. The infrastructure supported in the cloud must remain available to clients even while faults occur, so that service is continuous. In this paper, architectural models are proposed through which infrastructure-related services are made available to clients even during the occurrence of faults, making the entire process of cloud computing reliable and effective.
Int. J. Inf. Syst. Model. Des., 2020
Fault tolerance is the most imperative issue in the cloud for providing reliable services. Inherent vulnerability to failure hampers the performance and reliability of cloud services. Hence, to achieve reliability, fault tolerance becomes a mandatory feature, one that is hard to implement due to the dynamic infrastructure and complex interdependencies. Numerous fault tolerance techniques have been developed in the literature to address the challenges of cloud reliability. A recent research survey presented in this paper attempts to integrate the different fault tolerance architectures. This study presents a critical research review of various existing fault tolerance techniques to improve service reliability, availability, and application execution in the cloud. A comparative analysis of the reviewed framework systems, based on critical metrics such as failure prediction, detection strategy, failure history, VM placement, and limitations, is also included in the paper. This review i...
International Journal of Cloud Applications and Computing, 2018
With mission-critical web applications and resources being hosted in cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance for availability and reliability has increased. The high priority now is ensuring a fault-tolerant environment that can keep the systems up and running. To minimize the impact of downtime or accessibility failure due to systems, network devices or hardware, the expectation is that such failures are anticipated and handled proactively in a fast, intelligent way. This article discusses a fault tolerance system for cloud computing environments and analyzes whether it is effective for cloud environments.
Journal of King Saud University - Computer and Information Sciences, 2018
Cloud computing has brought about a transformation in the delivery model of information technology from a product to a service. It has enabled the availability of various software, platforms and infrastructural resources as scalable services on demand over the internet. However, the performance of cloud computing services is hampered by their inherent vulnerability to failures owing to the scale at which they operate. It is possible to utilize cloud computing services to their maximum potential only if the performance-related issues of reliability, availability, and throughput are handled effectively by cloud service providers. Therefore, fault tolerance becomes a critical requirement for achieving high performance in cloud computing. This paper presents a comprehensive overview of fault tolerance-related issues in cloud computing, emphasizing the significant concepts, architectural details, and state-of-the-art techniques and methods. The objective is to provide insights into the existing fault tolerance approaches as well as the challenges that have yet to be overcome. The survey enumerates a few promising techniques that may be used for efficient solutions and also identifies important research directions in this area.
With mission-critical web applications and resources being hosted in cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance for availability and reliability has increased. The high priority now is ensuring a fault-tolerant environment that can keep the systems up and running. To minimize the impact of downtime or accessibility failure due to systems, network devices or hardware, the expectation is that such failures are anticipated and handled proactively in a fast, intelligent way. In this paper, fault tolerance for cloud computing environments is analyzed to determine whether it is effective for cloud environments.
2012 IEEE International Systems Conference SysCon 2012, 2012
Fault tolerance, reliability and resilience in Cloud Computing are of paramount importance to ensure continuous operation and correct results, even in the presence of a given maximum number of faulty components. Most existing research and implementations focus on architecture-specific solutions to introduce fault tolerance. This implies that users must tailor their applications by taking into account environment-specific fault tolerance features. Such a need results in non-transparent and inflexible Cloud environments that require too much effort from developers and users. This paper introduces an innovative perspective on creating and managing fault tolerance that hides the implementation details of the reliability techniques from the users by means of a dedicated service layer. This allows users to specify and apply the desired level of fault tolerance without requiring any knowledge about its implementation.
Journal of Systems Architecture, 2019
IEEE Systems Journal, 2013
Failure Free Cloud Computing Architectures, 2022
16th Int'l Conf. Computer and Information Technology, 2014
Resilience Assessment and Evaluation of Computing Systems, 2012
International Journal of Trend in Scientific Research and Development, 2019
Bulletin of Electrical Engineering and Informatics, 2024