Recently Hendrik Volkmer put up a blog post entitled "There will be no reliable cloud". Part of it was based on a presentation I watched at the last OpenStack summit (I wish I were going to the Portland summit, but alas it is not to be).
The Cloud Scaling presentation was one I enjoyed and considered thought-provoking. I wrote a few notes on that presentation last year.
Here's a quote from the top of the first post:
Stop wasting your time trying to [find a reliable cloud]. Stop wasting your time (and money) trying to build one. If you find a service provider that claims that they have it: Maybe question their understanding of cloud - and business.
I put that there to remind me of the point of the series of posts, and because it essentially defines the attention-grabbing headline. :)
My thoughts on these posts come down to this:
Thinking about reliability in a cloud, especially an OpenStack cloud, is an interesting thought experiment. Fortunately, the OpenStack cloud I help to run, which is the back-end for a single application, is actually mostly stateless--except for machine images, the OpenStack database, and the application database. That's not a lot of stateful information, except for those darn Windows images that are many tens of times the size of a standard Linux cloud image.
For a short post it sure goes over a lot of information and links!
* Availability vs reliability
** HA systems that need to go down for maintenance are a joke
* How to build a reliable cloud
Ok, now on to part two.
The second post builds on the basic information provided in the first.
Complexity + Scale => Reduced Reliability + Increased Chance of catastrophic failures
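To put a rough number on that line for myself, here is a minimal sketch. It assumes (my assumption, not something from the posts) that a system only works when every one of its components is up, and that components fail independently. The function name and the 99.9% figure are made up for illustration.

```python
def serial_availability(component_availability: float, n_components: int) -> float:
    """Availability of a system that needs ALL n components up at once.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return component_availability ** n_components


# Each component is "three nines" available; watch the total drop as the
# number of components the system depends on grows.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} components -> {serial_availability(0.999, n):.4f}")
```

Under those assumptions, even very reliable parts add up to a system that is down a lot once there are enough of them, which is how I read the scale argument.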
NOTE: There will eventually be a part three post, but as of this writing it's not up yet.
To me, it boils down to building reliable applications on unreliable clouds, which I think is what a lot of people are doing, and which seems to come up every time AWS fails.
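As a rough illustration of what "reliable application on an unreliable cloud" can mean in practice, here is a small sketch of trying the same request against several zones in turn. Everything in it is hypothetical: the zone names, `call_with_failover`, and `flaky_request` are mine, and a real application would also need timeouts, backoff, and somewhere sensible to keep its state.

```python
from typing import Callable, Sequence


class AllZonesFailed(Exception):
    """Raised when no zone could serve the request."""


def call_with_failover(zones: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors = []
    for zone in zones:
        try:
            return request(zone)
        except Exception as exc:  # treat any failure as "this zone is unusable"
            errors.append((zone, exc))
    raise AllZonesFailed(errors)


def flaky_request(zone: str) -> str:
    """Hypothetical stand-in for a real call; pretends zone-a is down."""
    if zone == "zone-a":
        raise ConnectionError("zone-a is down")
    return f"served from {zone}"


result = call_with_failover(["zone-a", "zone-b"], flaky_request)
print(result)
```

This only works cleanly when the request does not depend on data that lives in one zone, which is where the stateful pieces come back to bite.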
The first issue that pops into my mind, though, is relational databases and how to replicate their data between zones, which is often limited by the network. Actually, replicating any data between zones could be a problem, which is, I'm guessing, why he's (perhaps) suggesting keeping the stateful pieces small.