2012 IEEE International Conference on Communications, 2012
Modern day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at delivering highly reliable cloud computing services. Although offering equal reliability to all users benefits everyone at the same time, users may find such an approach either inadequate or too expensive to fit their individual requirements, which may vary dramatically. In this paper, we propose a novel method for providing reliability as an elastic and on-demand service. Our scheme makes use of peer-to-peer checkpointing and allows user reliability levels to be jointly optimized based on an assessment of their individual requirements and total available resources in the data center. We show that the joint optimization can be efficiently solved by a distributed algorithm using dual decomposition. The solution improves resource utilization and presents an additional source of revenue to data center operators. Our validation results suggest a significant improvement of reliability over existing schemes.
IEEE Transactions on Parallel and Distributed Systems, 2016
Modern day data centers coordinate hundreds of thousands of heterogeneous tasks and aim at delivering highly reliable cloud computing services. Although offering equal reliability to all users benefits everyone at the same time, users may find such an approach either inadequate or too expensive to fit their individual requirements, which may vary dramatically. In this paper, we propose a novel method for providing elastic reliability optimization in cloud computing. Our scheme makes use of peer-to-peer checkpointing and allows user reliability levels to be jointly optimized based on an assessment of their individual requirements and total available resources in the data center. We show that the joint optimization can be efficiently solved by a distributed algorithm using dual decomposition. The solution improves resource utilization and presents an additional source of revenue to data center operators. Our validation results suggest a significant improvement of reliability over existing schemes.
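This abstract states that the joint reliability optimization is solved by a distributed algorithm using dual decomposition. As a rough illustration of that idea only, and not the authors' actual formulation, the sketch below assumes each user i has a logarithmic utility over its reliability level r_i and a per-unit checkpointing cost a_i, with total checkpointing capacity C; a single dual price coordinates the per-user subproblems.

```python
# Minimal dual-decomposition sketch (illustrative only, not the paper's model).
# Assumed problem: maximize sum_i w_i*log(1 + r_i) subject to sum_i a_i*r_i <= C,
# where r_i is user i's reliability level, a_i its checkpointing cost per unit
# of reliability, and C the data center's total checkpointing capacity.

def solve_joint_reliability(users, C, steps=500, step_size=0.01):
    """users: list of (weight w_i, cost a_i) tuples; returns levels and price."""
    lam = 1.0  # dual variable (price) on the shared capacity constraint
    r = [0.0] * len(users)
    for _ in range(steps):
        # Each user solves its own subproblem: max_r  w*log(1+r) - lam*a*r,
        # which has the closed form r = max(0, w/(lam*a) - 1).
        r = [max(0.0, w / (lam * a) - 1.0) for (w, a) in users]
        # Price update via projected subgradient on the coupling constraint.
        usage = sum(a * ri for (_, a), ri in zip(users, r))
        lam = max(1e-6, lam + step_size * (usage - C))
    return r, lam

if __name__ == "__main__":
    users = [(1.0, 1.0), (2.0, 1.5), (0.5, 0.8)]  # hypothetical weights/costs
    levels, price = solve_joint_reliability(users, C=3.0)
    print("reliability levels:", [round(x, 3) for x in levels], "price:", round(price, 3))
```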
IEEE Transactions on Cloud Computing, 2016
The large-scale utilization of cloud computing services for hosting industrial/enterprise applications has led to the emergence of cloud service reliability as an important issue for both cloud service providers and users. To enhance cloud service reliability, two types of fault tolerance schemes, reactive and proactive, have been proposed. Existing schemes rarely consider the problem of coordination among multiple virtual machines (VMs) that jointly complete a parallel application. Without VM coordination, the parallel application execution results will be incorrect. To overcome this problem, we first propose an initial virtual cluster allocation algorithm according to the VM characteristics to reduce the total network resource consumption and total energy consumption in the data center. Then, we model CPU temperature to anticipate a deteriorating physical machine (PM). We migrate VMs from a detected deteriorating PM to some optimal PMs. Finally, the selection of the optimal target PMs is modeled as an optimization problem that is solved using an improved particle swarm optimization algorithm. We evaluate our approach against five related approaches in terms of the overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications. Experimental results demonstrate the efficiency and effectiveness of our approach.

Index Terms: Cloud data center, cloud service reliability, fault tolerance (FT), particle swarm optimization (PSO), virtual cluster

1 INTRODUCTION

Cloud computing is widely adopted in current professional and personal environments. It employs several existing technologies and concepts, such as virtual servers and data centers, and gives them a new perspective [1]. Furthermore, it enables users and businesses to not only use applications without installing them on their machines but also access resources on any computer via the Internet [2]. With its pay-per-use business model for customers, cloud computing shifts the capital investment risk for under- or over-provisioning to cloud providers. Therefore, several leading technology companies, such as Google, Amazon, IBM, and Microsoft, operate large-scale cloud data centers around the world. With the growing popularity of cloud computing, modern cloud data centers are employing tens of thousands of physical machines (PMs) networked via hundreds of routers/switches that communicate and coordinate to deliver highly reliable cloud computing services. Although the failure probability of a single device/link might be low [3], it is magnified across all the devices/links hosted in a cloud data center owing to the problem of coordination of PMs. Moreover, multiple fault sources (e.g., software, human errors, and hardware) are the norm rather than the exception [4]. Thus, downtime is common and seriously affects the service level of cloud computing [5]. Therefore, enhancing cloud service reliability is a critical issue that requires immediate attention.

Over the past few years, numerous fault tolerance (FT) approaches have been proposed to enhance cloud service reliability [6], [7]. It is well known that FT consists of fault detection, backup, and failure recovery, and nearly all FT approaches are based on the use of redundancy. Currently, two basic mechanisms, namely, replication and checkpointing, are widely adopted. In the replication mechanism, the same task is synchronously or asynchronously handled on several virtual machines (VMs) [8], [9], [10]. This mechanism ensures that at least one replica is able to complete the task on time. Nevertheless, because of its high implementation cost, the replication mechanism is more suitable for real-time or critical cloud services. The checkpointing mechanism is categorized into two main types: independent checkpoint mechanisms that consider a whole application executing on a single VM, and coordinated checkpoint mechanisms that consider multiple VMs (i.e., a virtual cluster) jointly executing parallel applications [11], [12], [13], [14], [15], [16]. The two types of mechanisms periodically save the …
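The abstract models target-PM selection as an optimization problem solved with an improved particle swarm optimization algorithm. The snippet below is only a generic PSO sketch under simplifying assumptions, not the authors' improved variant: each particle encodes one (rounded) candidate PM index per VM to migrate, and `cost` is a hypothetical placeholder for the paper's network-plus-energy objective.

```python
# Generic PSO sketch for choosing target PMs for migrating VMs (illustrative).
import random

def pso_place(num_vms, num_pms, cost, particles=20, iters=100):
    dim = num_vms
    # Particle positions are real-valued PM indices in [0, num_pms - 1].
    pos = [[random.uniform(0, num_pms - 1) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [cost([round(x) for x in p]) for p in pos]
    g = min(range(particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(num_pms - 1, max(0.0, pos[i][d] + vel[i][d]))
            val = cost([round(x) for x in pos[i]])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return [round(x) for x in gbest], gbest_val

if __name__ == "__main__":
    # Hypothetical toy cost: prefer spreading 4 VMs across distinct, low-index PMs.
    toy_cost = lambda placement: len(placement) - len(set(placement)) + 0.01 * sum(placement)
    print(pso_place(num_vms=4, num_pms=6, cost=toy_cost))
```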
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016
In large-scale data centers, a single fault can simultaneously cause correlated failures of several physical machines and the tasks running on them. Such correlated failures can severely damage the reliability of a service or a job. This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, and on jobs that run multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job's reliability in the presence of correlated failures. In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim being to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability. We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job's reliability.
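As a rough companion to the reliability model described in this abstract, the sketch below estimates by Monte Carlo simulation the probability that at least k of a job's replicated tasks survive when tasks share failure domains (e.g., racks behind one power unit). The failure probabilities and placements are hypothetical, and the paper's analytical approximation is not reproduced here.

```python
# Hedged sketch of job reliability under correlated failures (not the paper's
# exact model). A domain fails with probability p_domain, taking all its tasks
# down; a task on a surviving domain still fails independently with p_task.
# The job succeeds if at least k task replicas finish.
import random

def job_reliability(placement, p_domain, p_task, k, trials=100_000):
    """placement: list of failure-domain ids, one entry per task replica."""
    domains = set(placement)
    successes = 0
    for _ in range(trials):
        failed_domains = {d for d in domains if random.random() < p_domain}
        alive = sum(1 for d in placement
                    if d not in failed_domains and random.random() >= p_task)
        if alive >= k:
            successes += 1
    return successes / trials

if __name__ == "__main__":
    # 6 replicas spread over 3 racks vs. packed into 1 rack (hypothetical numbers).
    print("spread:", job_reliability([0, 0, 1, 1, 2, 2], 0.01, 0.05, k=4))
    print("packed:", job_reliability([0, 0, 0, 0, 0, 0], 0.01, 0.05, k=4))
```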
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13, 2013
In this paper, we aim to optimize fault-tolerance techniques based on a checkpointing/restart mechanism in the context of cloud computing. Our contribution is threefold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic, making no assumption about the failure probability distribution, but also simple to apply in practice. (2) We design an adaptive algorithm to optimize the checkpointing effect with respect to various costs, such as checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is well suited to Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
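The baseline this abstract compares against is Young's formula, the classical first-order approximation of the optimal checkpoint interval, tau ≈ sqrt(2*C*M), where C is the time to write one checkpoint and M is the mean time between failures. The snippet below evaluates only that well-known baseline with hypothetical numbers; the paper's own improved formula is not reproduced here.

```python
# Young's first-order approximation of the optimal checkpoint interval.
# checkpoint_cost and mtbf are in seconds; the values below are hypothetical.
import math

def young_interval(checkpoint_cost, mtbf):
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

if __name__ == "__main__":
    tau = young_interval(checkpoint_cost=30.0, mtbf=6 * 3600.0)
    print(f"checkpoint roughly every {tau / 60:.1f} minutes")
```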
2010 IEEE 3rd International Conference on Cloud Computing, 2010
Recently introduced spot instances in the Amazon Elastic Compute Cloud (EC2) offer lower resource costs in exchange for reduced reliability; these instances can be revoked abruptly due to price and demand fluctuations. Mechanisms and tools that deal with the cost-reliability trade-offs under this schema are of great value for users seeking to lessen their costs while maintaining high reliability. We study how one such mechanism, namely checkpointing, can be used to minimize the cost and volatility of resource provisioning. Based on the real price history of EC2 spot instances, we compare several adaptive checkpointing schemes in terms of monetary costs and improvement of job completion times. Trace-based simulations show that our approach can significantly reduce both the price and the task completion times.
Lecture Notes in Computer Science, 2011
Cloud computing is a computing paradigm in which users can rent as many computing resources from service providers as they require. A spot instance in cloud computing helps a user utilize resources at lower cost, even though it is unreliable. When a user performs tasks with unreliable spot instances, failures inevitably delay task completion and cause a serious deterioration in the QoS experienced by users. Therefore, we propose a price-history-based checkpointing scheme to avoid delaying task completion. The proposed checkpointing scheme reduces the number of checkpoint trials and improves the performance of task execution. The simulation results show that our scheme outperforms existing checkpointing schemes in reducing both the number of checkpoint trials and the total cost per spot instance for the user's bid.
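As an illustration of the general idea of price-history-based checkpointing, not the paper's actual scheme, the sketch below triggers a checkpoint only when the recent spot-price trend suggests the price may soon exceed the user's bid. The window size, margin, and prices are hypothetical.

```python
# Illustrative price-history-based checkpoint trigger for spot instances.
# Assumption: checkpoint only when revocation looks likely, i.e., when the
# current price is near the bid or the recent trend would push it past the bid.
def should_checkpoint(price_history, bid, window=10, margin=0.9):
    """price_history: chronological list of observed spot prices."""
    if len(price_history) < 2:
        return False
    recent = price_history[-window:]
    current = recent[-1]
    trend = current - recent[0]          # rising prices -> higher revocation risk
    near_bid = current >= margin * bid   # price already close to the bid
    return near_bid or (trend > 0 and current + trend >= bid)

if __name__ == "__main__":
    history = [0.031, 0.033, 0.038, 0.044, 0.052]  # hypothetical $/hour prices
    print(should_checkpoint(history, bid=0.06))
```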
International Journal of Cloud Applications and Computing, 2011
Cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. Failures of any type are common in current datacenters, partly due to the sheer number of nodes. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources and threatens user expectations; the most fundamental user expectation is, of course, that an application finishes correctly regardless of faults in the nodes on which it runs. This paper proposes a fault-tolerant architecture for cloud computing that uses an adaptive checkpoint mechanism to ensure that a running task can finish correctly in spite of faults in the nodes on which it is running. The proposed fault-tolerant architecture is both transparent and scalable.
Cloud computing infrastructure encompasses many design challenges. Dealing with unreliability is one of the most important, since a variety of services must be provided to a variety of clients. In this paper, we present a model for the reliability assessment of cloud infrastructures (computing nodes, mostly virtual machines). This reliability assessment mechanism supports scheduling on cloud infrastructure and enables fault tolerance on the basis of the reliability values acquired during assessment. In our model, every compute instance (a virtual machine in PaaS or a physical processing node in IaaS) has a reliability value associated with it. The system assesses reliability for different types of applications, with different mechanisms for general applications and for real-time applications; for real-time applications, we use time-based reliability assessment algorithms. All the algorithms are m...
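For the time-based assessment mentioned above, a minimal sketch is given below as an illustrative stand-in, not the paper's algorithms. It assumes failures on a node arrive roughly as a Poisson process whose rate is estimated from the node's failure history, so the chance of completing a real-time task of duration t without interruption is exp(-rate * t).

```python
# Minimal time-based reliability estimate for a compute instance (illustrative).
import math

def node_reliability(failure_count, observed_hours, task_hours):
    rate = failure_count / observed_hours   # estimated failures per hour
    return math.exp(-rate * task_hours)     # probability of finishing the task

if __name__ == "__main__":
    # Hypothetical node: 3 failures over 1000 observed hours, task needs 5 hours.
    print(round(node_reliability(3, 1000.0, 5.0), 4))
```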
A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.
Resilience Assessment and Evaluation of Computing Systems, 2012
Cloud Computing is a novel paradigm for providing data center resources as on-demand services in a pay-as-you-go manner. It promises significant cost savings by making it possible to consolidate workloads and share infrastructure resources among multiple applications, resulting in higher cost- and energy-efficiency. However, these benefits come at the cost of increased system complexity and dynamicity, posing new challenges in providing service dependability and resilience for applications running in a Cloud environment. At the same time, the virtualization of physical resources, inherent in Cloud Computing, provides new opportunities for novel dependability and quality-of-service management techniques that can potentially improve system resilience. In this chapter, we first discuss in detail the challenges and opportunities introduced by the Cloud Computing paradigm. We then provide a review of the state-of-the-art on dependability and resilience management in Cloud environments, and conclude with an overview of emerging research directions.