Innovative practices session 5C: Cloud atlas — Unreliability through massive connectivity

VLSI Test Symposium (2013)

Abstract
The rapid pace of integration, the emergence of low-power, low-cost computing elements, and the ubiquitous, ever-increasing bandwidth of connectivity have given rise to data-center and cloud infrastructures. These infrastructures are beginning to be used on a massive scale across vast geographic boundaries to provide commercial services to businesses such as banking, enterprise computing, online sales, and data mining and processing for targeted marketing, to name a few. Such an infrastructure comprises thousands of compute and storage nodes interconnected by massive network fabrics, each with its own hardware and firmware stacks, topped by layers of software for operating systems, network protocols, schedulers, and application programs. The scale of such an infrastructure has made possible services that were unimaginable only a few years ago, but it carries the downside of severe losses in the event of failure. A system of such scale and risk requires methods to (a) proactively anticipate and protect against impending failures, (b) efficiently, transparently, and quickly detect, diagnose, and correct failures in any software or hardware layer, and (c) automatically adapt itself based on prior failures to prevent future occurrences. Addressing these reliability challenges differs fundamentally from applying traditional reliability techniques. First, a great amount of redundant resources is available in the cloud, from networking to computing and storage nodes, which opens up many reliability approaches that harvest this available redundancy. Second, because of the system's large scale, techniques with high overheads, especially in power, are not acceptable. Consequently, cross-layer approaches that jointly optimize availability and power have recently gained traction. This session addresses these challenges in maintaining reliable service with solutions across the hardware/software stacks. Currently available commercial data-center and cloud infrastructures will be reviewed, along with the relative occurrences of different causes of failure, the degree to which they are anticipated and diagnosed in practice, and their impact on quality of service and infrastructure design. A study on real-time analytics to proactively address failures in a private, secure cloud engaged in domain-specific computations, with streaming inputs received from embedded computing platforms (such as airborne image sources, data streams, or sensors), will be presented next. The session concludes with a discussion of the increased relevance of resiliency features built into individual systems and components (the private cloud) and how the macro public cloud absorbs innovations from this realm.
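To make requirement (a) concrete, the sketch below shows one way real-time analytics over streaming node telemetry might flag an impending failure so that work can be migrated onto the cloud's redundant capacity. This is a minimal illustration, not the method presented in the session; the class, metric choice, and thresholds are all hypothetical.

```python
# Minimal sketch (not from the paper): a sliding-window z-score detector
# over streaming per-node health metrics, illustrating proactive failure
# anticipation. All names and parameters here are hypothetical.
from collections import deque
import math

class NodeHealthMonitor:
    """Flags a node whose latest metric deviates sharply from its own
    recent history -- a stand-in for real-time failure analytics."""

    def __init__(self, window=64, threshold=3.0):
        self.window = window        # samples of history kept per node
        self.threshold = threshold  # z-score above which we raise an alert
        self.history = {}           # node_id -> deque of recent samples

    def observe(self, node_id, value):
        """Ingest one telemetry sample (e.g., correctable-error count,
        fan RPM, request latency); return True if the node looks failing."""
        hist = self.history.setdefault(node_id, deque(maxlen=self.window))
        alert = False
        if len(hist) >= 8:  # require some history before judging
            mean = sum(hist) / len(hist)
            var = sum((x - mean) ** 2 for x in hist) / len(hist)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                alert = True  # candidate for proactive migration or repair
        hist.append(value)
        return alert

# Usage: stream correctable-error counts; a sudden spike flags the node
# so its workload can be moved to redundant nodes before a hard failure.
monitor = NodeHealthMonitor()
for t, errors in enumerate([1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 40]):
    if monitor.observe("node-17", errors):
        print(f"t={t}: node-17 anomalous ({errors} errors) -> migrate workload")
```

Note the design trade-off the abstract highlights: a per-node detector like this is cheap enough in compute and power to run at data-center scale, and it acts by exploiting the redundancy already present in the fabric rather than adding dedicated spare hardware.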
Keywords
macro public cloud, embedded computing platform, storage node, secure cloud, innovative practices session, cloud atlas, cloud infrastructure, large scale, massive connectivity, software stack, private cloud, low cost computing element, enterprise computing