From chaos, a greater understanding of resilience


Welcome to Protocol Cloud, your comprehensive roundup of everything you need to know about the week in cloud and enterprise software. This week: Finding the meaning in chaos, the story of Kelsey Hightower, and how the BBC uses the cloud.

The fault in our stars

A little peek behind the Protocol Cloud scenes: I write this newsletter on Tuesday mornings, holding it open until the last second in case anything big happens, but most weeks, it's done and dusted by noon-ish Pacific Time. That means today's, which you are seeing on the morning of Wednesday, Nov. 4, was written on one side of a chasm.

There didn't seem to be any point in waiting to see what happened on the most pivotal U.S. Election Day of any of our lifetimes, given that the outcome might be in doubt for days or weeks to come. Likewise, there didn't seem to be any point in trying to sum up what that outcome might mean for cloud computing, a concept that I would pay real dollars to hear either man up for election yesterday describe in a sentence or two.

But there was something cloudy that popped into my head while thinking about this historic week: chaos engineering.

  • Born at Netflix, chaos engineering is the deliberate introduction of worst-case-scenario problems into cloud computing infrastructure.
  • The idea is to understand — within a controlled environment — how systems fail when they encounter stress they weren't designed for, such as a sudden outage in a cloud computing region.
  • It takes the notion of "hoping for the best, but preparing for the worst" to its logical end: designing your systems with durability in mind by understanding how they will react to shocks that can be impossible to predict, in hopes the whole system won't fall apart.
  • Gremlin, a startup founded by ex-Netflix and Amazon engineers, has raised $26 million in funding to help companies employ this concept in their own applications.
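To make the idea concrete, here is a minimal sketch of a fault-injection experiment in Python. It is not how Netflix or Gremlin implement chaos engineering; `RegionOutage`, `inject_faults`, `fetch_profile` and `run_experiment` are hypothetical names invented for illustration, and the 30% failure rate is an arbitrary assumption.

```python
import random


class RegionOutage(Exception):
    """Simulated failure of a (hypothetical) cloud region."""


def inject_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise RegionOutage instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RegionOutage("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_profile(user_id):
    """Stand-in for a real call to a backend service."""
    return {"user_id": user_id, "name": f"user-{user_id}"}


def run_experiment(call, requests):
    """Send requests through call and count how many succeed or fail."""
    results = {"ok": 0, "failed": 0}
    for user_id in range(requests):
        try:
            call(user_id)
            results["ok"] += 1
        except RegionOutage:
            results["failed"] += 1
    return results


# A seeded generator keeps the experiment repeatable (assumed rate: 30%).
rng = random.Random(42)
flaky_fetch = inject_faults(fetch_profile, failure_rate=0.3, rng=rng)
report = run_experiment(flaky_fetch, requests=1000)
print(report)
```

The point of the exercise is the report, not the breakage: running it in a controlled setting shows how often callers see errors before a real outage does.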

Modern web infrastructure is amazingly complex, and while lots of people get paid lots of money to design it in resilient ways that have been refined over years of hard-earned lessons, things still break all the time.

  • Once you accept that systems will fail, understanding failure, rather than desperately trying to prevent failure, becomes the priority.
  • And once you understand failure, reacting to it produces a stronger system than trying to prevent the inevitable. Failure will happen, so you should do everything you can to make sure it happens in a predictable way that can be dealt with without breaking the entire system.
  • "What you'd really like to do is choose between two things other than collapsing in a heap," said Adrian Cockcroft, the former Netflix engineer who helped develop the theory of chaos engineering and is now vice president of cloud architecture strategy at AWS, in a 2018 speech describing how most applications fall down.
  • One of those two things is to have apps fail gracefully, so the user understands what just happened and still has a working computer, and the other is to acknowledge that while the problem might cause a subpar app experience, 80% of its functionality is probably good enough.
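The "fail gracefully" option can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anything Cockcroft or AWS published: `load_catalog`, `load_recommendations` and `render_home_page` are hypothetical functions, and the recommendation service is hard-coded to be down so the fallback path is visible.

```python
def load_recommendations(user_id):
    """Hypothetical non-critical dependency; here it is always down."""
    raise TimeoutError("recommendation service unavailable")


def load_catalog(user_id):
    """Hypothetical critical dependency that is still healthy."""
    return ["title-a", "title-b", "title-c"]


def render_home_page(user_id):
    """Build the page, degrading instead of failing when an optional part breaks."""
    page = {"catalog": load_catalog(user_id), "degraded": False}
    try:
        page["recommendations"] = load_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        # Serve an empty fallback and tell the user what happened.
        page["recommendations"] = []
        page["degraded"] = True
        page["notice"] = "Personalized picks are temporarily unavailable."
    return page


page = render_home_page(user_id=7)
print(page)
```

The user still gets the catalog and a plain explanation of what is missing, which is the "most of the functionality is good enough" outcome rather than a collapse.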

This is a cultural — not technological — shift in thinking inside organizations, much as DevOps was a decade ago. Until cloud vendors, end users, and partners figure out how to make this world a little simpler and a little easier, the best way to prevent systemic catastrophe is to recognize how systems will fail.

  • There is a metaphor here.



This Week On Protocol

Decision time: If we know who the president of the United States will be in January when you receive this newsletter, you can find all of Protocol's Election 2020 coverage here. If we don't know who the president of the United States will be in January when you receive this newsletter, you can find all of Protocol's Election 2020 coverage here.

Getting better: Kelsey Hightower is a special person in the cloud and enterprise computing world, and he's had quite the journey from managing an Atlanta McDonald's before he could drive a car to becoming one of Google Cloud's most valuable employees. Check out our profile of his life and career before you end up working for him one day.

Cloudy dollars: During the third quarter, the giants of cloud computing continued to shrug off the economic effects of the pandemic that have wrecked so many other businesses. It's unclear how long that will last, but for now it seems like enterprise vendors that are behind on their cloud strategies are the ones suffering the most.

Five Questions For...

Mai-Lan Tomsen Bukovec, global vice president of storage, AWS

What was your first tech job?

My father was a career U.S. Foreign Service officer, and so I grew up in U.S. embassies in a number of different countries. My first tech job was data entry using a Wang computer in the Press and Cultural section of the U.S. Embassy in Beijing, China.

What's the best piece of advice you could give to someone starting their first tech job?

Be relentlessly curious about how the customer uses what you are building and understand if you are solving the problem that the customer wants you to solve.

Pick one piece of consumer or business software (that isn't sold by your company) that you can't live without.

I use my Garmin watch and heart rate monitor tracker almost every day. Conditioning matters for boxing and martial arts, which I have practiced for many years. Heart rate target zones and recovery rates help me train smarter.

What was the first computer that made you realize the power of computing and connectivity?

I joined the U.S. Peace Corps in 1994 after I graduated from college. I lived in a village in northern Mali, West Africa. It took many dusty, hot hours to make my way via public transportation from my village in the northern Mopti region to Bamako, the capital city of Mali. The contrast between living in a village with no running water or electricity and reading email on a screen in an office in Bamako a day or two later brought home the power of connectivity in a way that I hadn't experienced as a college student with easy access to computers.

What will be the biggest challenge for cloud computing over the coming decade?

People and culture. It's resistance to change and fear of the unknown. Often, it comes down to how organizations lead through change. The big difference between organizations that talk about moving to the cloud and those that actually do it comes down to a few simple elements: leadership commitment and organizational execution.

Thanks for reading — see you next week.