Thoughts on "no reliable cloud"

April 10, 2013

Recently Hendrik Volkmer put up a blog post entitled "There will be no reliable cloud". Part of it was based on a presentation I watched at the last OpenStack summit (I wish I were going to the Portland summit, but alas, it is not to be).

The Cloud Scaling presentation was one I enjoyed and considered thought-provoking. I wrote a few notes on that presentation last year.

No reliable cloud

Here’s a quote from the top of the first post:

Stop wasting your time trying to [find a reliable cloud]. Stop wasting your time (and money) trying to build one. If you find a service provider that claims that they have it: Maybe question their understanding of cloud - and business.

I put that there to remind me of the point of the series of posts, and because it essentially defines the attention-grabbing headline. :)

tl;dr

My thoughts on these posts come down to this:

He’s mostly talking about web-scale applications
A single zone will not be reliable
But we still have to make zones as reasonably reliable as possible (where’s the line?)
We should design reliable applications on top of unreliable zones (but how?)
Contain failure!
HA pairs are probably not the direction to go in to gain reliability
Clustering software often brings in complexity that can destroy reliability gains
Stateless systems are a lot more fun :)
Keep the stateful part of an application or system small

Thinking about reliability in a cloud, especially an OpenStack cloud, is an interesting thought experiment. Fortunately, the OpenStack cloud I help to run, which is the back end for a single application, is mostly stateless, except for machine images, the OpenStack database, and the application database. That is not a lot of stateful information, apart from those darn Windows images that are many tens of times the size of a standard Linux cloud image.

Notes from the part one post

For a short post it sure goes over a lot of information and links!

HA pairs fail catastrophically
HA pairs don’t scale
Classic HA example: NFS + DRBD and clustering, such as Pacemaker…then problems?
HA pairs often end up cheating the CAP theorem
Cluster software causes more system outages than hardware failures or software bugs (this I can attest to, having used clustered LVM)
Distributed systems, e.g. Percona XtraDB Cluster
Availability vs. reliability: HA systems that need to go down for maintenance are a joke
How to build a reliable cloud
A cloud is a distributed system
Use the stateless (from Cloud Scaling presentation) approach for stateless parts
Distributed data stores for the stateful parts (e.g. distributed MySQL, distributed file systems such as Ceph)
But the distributed stateful part is often what fails (e.g. EBS in Amazon)
Notes from blog post comments (notably Randy Bias of Cloud Scaling)
On OpenStack
Move to MySQL Cluster with the NDB engine, running 2-4 MySQL instances and load balancing across them
Or perhaps OpenStack will get rid of the RDBMS and replace it with a K/V store
Even with 1000s of nodes, metadata use in OpenStack is still low; it could be kept in memory and persisted using any appropriate back end
No point in having highly redundant hardware for stateless services
Build reliable applications on unreliable clouds
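
As a rough illustration of "build reliable applications on unreliable clouds", here is a minimal Python sketch of application-level failover across zones. Nothing in it comes from the original posts: the `ZoneDown` exception, the `call_with_failover` helper, and the two fake zone handlers are all hypothetical names made up for the example, and a real application would also need timeouts, health checks, and a way to handle state.

```python
class ZoneDown(Exception):
    """Hypothetical error raised when a zone cannot serve a request."""


def call_with_failover(zones, request):
    """Try each (name, handler) pair in order and return the first success.

    The application, not the infrastructure, absorbs the zone failure.
    """
    errors = []
    for name, handler in zones:
        try:
            return name, handler(request)
        except ZoneDown as exc:
            # Contain the failure: note it and move on to the next zone.
            errors.append((name, str(exc)))
    raise RuntimeError(f"all zones failed: {errors}")


# Two fake, stateless zone handlers (assumptions for illustration only).
def broken_zone(request):
    raise ZoneDown("zone-a unavailable")


def healthy_zone(request):
    return f"ok:{request}"


zones = [("zone-a", broken_zone), ("zone-b", healthy_zone)]
result = call_with_failover(zones, "GET /status")
print(result)
```

This only works so easily because the handlers are stateless; the stateful part is exactly where it gets hard.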

Ok, now on to part two.

Notes from part two

The second post builds on the basic information provided in the first.

Complexity + Scale => Reduced Reliability + Increased Chance of catastrophic failures

Complexity
Complex systems fail catastrophically
Failure domains
OpenStack example
Single controller, single cloud (or zone)
HA setup – two controllers in an HA mode of some kind
Single controller, multiple clouds (or multiple zones)
A single zone is unreliable
If both HA nodes fail, the zone is still unreliable, and HA is more complex
Two zones are two failure domains, which is more reliable than a single HA-enabled zone
(But of course you should make each zone as reliable as possible)
Reliability engineering (aka math)
Reliability engineering matters except when it doesn’t
“The higher the number of dependent components => the lower the overall availability and the bigger the impact of failure”
In a cloud with many nodes, adding the ability for live migration will actually decrease reliability, because all nodes are now tied together
Many reliability calculations come from mechanical engineering, which is very different from software engineering
Many complex systems fail by cascading: a failure starts small and grows until it engulfs the entire system
General approach is to make failure local and contained
Partial failure is desirable
Business side
Software reliability is cheaper
Most web-scale applications consist of a large stateless part and a small stateful piece
It does not make business sense to provide a super-reliable cloud
A single compute node or even zone will never be reliable
Best not to consider virtual machines, such as those in EC2, as servers
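
The reliability-engineering points above can be sketched with some simple arithmetic. This is only a back-of-the-envelope model in Python: the availability figures are made-up numbers, and the formulas assume failures are independent, which is precisely the assumption that cascading failures break ("reliability engineering matters except when it doesn't").

```python
from math import prod


def serial_availability(components):
    """Every dependent component must be up, so availabilities multiply."""
    return prod(components)


def parallel_availability(domains):
    """At least one of several independent failure domains must be up."""
    return 1 - prod(1 - a for a in domains)


# Hypothetical numbers, chosen only for illustration.
one_component = 0.999
chain_of_ten = serial_availability([one_component] * 10)

single_zone = 0.99
two_zones = parallel_availability([single_zone, single_zone])

print(f"10 dependent components: {chain_of_ten:.4f}")
print(f"one zone: {single_zone:.4f}, two independent zones: {two_zones:.4f}")
```

Under these assumptions, chaining more dependent components lowers overall availability, while adding a second independent zone raises it, which matches the argument that two failure domains beat one more complex HA-enabled zone.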

NOTE: There will eventually be a part three post, but as of this writing it’s not up yet.

Conclusion

To me, it boils down to building reliable applications on unreliable clouds, which I think is what a lot of people are doing, and which is the lesson that seems to come up every time AWS fails.

The first issue that pops into my mind, though, is the RDBMS and how to replicate data between zones, which is often a network concern. Actually, replicating any data between zones could be a problem, which is why, I’m guessing, he’s (perhaps) suggesting that the stateful pieces be kept small.
