What is it about?
Cloud providers have recently begun offering their unused resources as transient instances. Amazon sells idle cloud resources as spot instances, priced through an auction-based market mechanism, at reduced cost but without any availability guarantee. Dynamically and autonomously managing cloud resources so that user applications run with greater reliability on cheaper spot instances therefore remains an open problem. In this context, we propose a fault-tolerant multi-agent architecture that acts as middleware between cloud providers and users. It mediates access to a wide range of heterogeneous resources and provides a resilient application execution environment with a dynamic, flexible fault-tolerance mechanism based on adaptive checkpointing. Our architecture combines a case-based reasoning model with a survival analysis model to predict failure events and to refine fault-tolerance plans with adequate parameters, increasing reliability while optimizing total execution time and cost. We evaluated the proposed architecture with real historical data on Amazon EC2 price changes, comprising approximately 21 million records, from which we generated 1,362,816 scenarios stored in our case knowledge database. Predictions of the time to revocation achieved a high level of accuracy (98%), with a gain of up to 74.48% in total execution time and a lower total cost compared with other approaches in the literature.
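To make the survival analysis idea concrete, the following is a minimal sketch and not the model used in the paper. It assumes a Kaplan-Meier estimator (a standard survival analysis technique, chosen here for illustration) applied to hypothetical instance lifetimes, where `observed` is `False` for instances that were still running when observation ended. The function names `kaplan_meier` and `safe_horizon`, and the idea of checkpointing before the returned horizon, are illustrative assumptions.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float],
                 observed: List[bool]) -> List[Tuple[float, float]]:
    """Return the (time, survival probability) steps of a Kaplan-Meier estimate.

    durations: how long each instance ran (hypothetical units, e.g. hours).
    observed:  True if the instance was revoked at that time,
               False if it was still alive when observation stopped (censored).
    """
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        revoked = 0   # revocations observed exactly at time t
        leaving = 0   # all instances leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            revoked += int(data[i][1])
            leaving += 1
            i += 1
        if revoked:
            survival *= 1.0 - revoked / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def safe_horizon(curve: List[Tuple[float, float]],
                 min_survival: float) -> float:
    """Latest revocation time at which estimated survival is still
    at least min_survival; 0.0 if no step qualifies."""
    horizon = 0.0
    for t, s in curve:
        if s >= min_survival:
            horizon = t
        else:
            break
    return horizon


if __name__ == "__main__":
    # Hypothetical lifetimes; the second instance was never seen revoked.
    curve = kaplan_meier([2, 3, 3, 5, 8], [True, False, True, True, True])
    print(curve)
    print(safe_horizon(curve, 0.5))
```

Under these assumptions, a fault-tolerance plan could schedule a checkpoint before the returned horizon, so that saved state exists while the estimated probability of the instance still running remains above the chosen threshold.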
Why is it important?
We propose a fault-tolerant multi-agent architecture that acts as middleware between cloud providers and users. It mediates access to a wide range of heterogeneous resources and provides a resilient application execution environment with a dynamic, flexible fault-tolerance mechanism based on adaptive checkpointing.
Read the Original
This page is a summary of: Towards increasing reliability of Amazon EC2 spot instances with a fault-tolerant multi-agent architecture, Multiagent and Grid Systems, October 2019, IOS Press, DOI: 10.3233/mgs-190312.
This project investigates how agent-based architectures can create a resilient environment on transient servers that carry no availability guarantee, so that trusted services and applications can run on idle Cloud Computing resources. Exploiting idle resources is an efficient way to save energy and money (e.g., reusing unused CPU and memory to provide services and run applications). The BRA2Cloud architecture combines machine learning with a statistical model to predict instance survival time and to help refine fault-tolerance parameters, providing trusted services at a lower monetary cost. The model compiles and analyses historical price change data for Amazon EC2 Spot Instances to predict revocation events. Our agents pursue efficient usage of Spot Instances by providing a novel resilient environment between users and cloud resources, using machine learning to predict revocation events and to select suitable fault-tolerance mechanisms with their respective parameters. This is a key step toward the successful and efficient use of these instances to provide trusted services with minimal interruptions at the lowest price. Experiments indicate that the model can be used under realistic working conditions and makes better use of idle resources.
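As an illustration of how price change history can be turned into revocation observations, here is a minimal sketch under simplifying assumptions that are not taken from the paper: an instance is treated as revoked at the first recorded price change at or after launch whose price exceeds the user's maximum price, and the price in effect before launch is ignored. The function name `time_to_revocation` and the sample data are hypothetical.

```python
from typing import List, Tuple


def time_to_revocation(history: List[Tuple[float, float]],
                       launch_time: float,
                       max_price: float) -> Tuple[float, bool]:
    """Derive one (duration, revoked) observation from a price history.

    history:     (time, price) records sorted by time (hypothetical units).
    launch_time: when the instance is assumed to start.
    max_price:   the most the user is willing to pay.

    Returns the time the instance survived and whether a revocation was
    observed. If the price never exceeds max_price, the duration runs to
    the last record and revoked is False (a censored observation).
    """
    last_time = launch_time
    for t, price in history:
        if t < launch_time:
            continue  # price changes before launch are ignored in this sketch
        last_time = t
        if price > max_price:
            return (t - launch_time, True)
    return (last_time - launch_time, False)


if __name__ == "__main__":
    # Hypothetical price changes: (hour, price in USD per hour).
    history = [(0, 0.10), (1, 0.12), (3, 0.30), (4, 0.11)]
    print(time_to_revocation(history, launch_time=1, max_price=0.20))
    print(time_to_revocation(history, launch_time=0, max_price=0.50))
```

Repeating this over many launch times and maximum prices yields a set of durations, some ending in revocation and some censored, which is the kind of input a survival model expects.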