Founding Uber SRE

This is my personal story of starting the SRE organization at Uber. If you want advice rather than reminiscence, take a look at Trunk and Branches Model and Productivity in the age of hypergrowth.

After I left SocialCode in 2014, I spent a month interviewing at a handful of companies, trying to figure out what to do next. I was torn between two different paths: (1) leading engineering at a very small startup, or (2) taking a much smaller role at a fast-growing company, with the expectation that growth would create opportunity. After some consideration, I took the latter path and joined Uber.

I forget my exact interview timeline at Uber, but it was roughly along the lines of “interview on Tuesday, offer on Wednesday, and start the following Monday.” (This wasn’t the fastest turnaround I ever saw at Uber, quite far from it! I once gave an offer to someone on a Friday around 5:00, who gave notice and started two days later on Monday.) The role I was hired into was “DevOps Manager”, which I was initially quite nervous about, as I wasn’t quite certain what DevOps meant. Waiting over the weekend to start, I anxiously read The Phoenix Project in an attempt to reverse engineer what I’d just been hired to do.

As best I can remember, there were four teams doing what might be classically described as “infrastructure” when I started. The Data and Maps engineering teams were also grouped under the Infrastructure organization, but they were pretty focused on other work, so I’m going to elide them as a matter of narrative convenience.

The infrastructure teams were (1) Developer Tools team, (2) an Infrastructure Engineering team laser focused on outpacing PostgreSQL indexes’ appetite for disk space, (3) a team in Denmark primarily focused on deployment tooling, and (4) the InfraOps team that did everything else. We were about twenty people within an engineering organization of about two hundred, rapidly on its way to two thousand.

I joined that last team, InfraOps. We were five people, and responsible for maintaining most of the company’s servers (Uber did not use any cloud providers at that point), the routing infrastructure (mostly HAProxy), Kafka, most databases, service provisioning (in the “service-oriented-architecture” sense of the word), some security stuff (there was only one dedicated security engineer at the time), Puppet, Clusto, and the various knots of software that fell between the organizational cushions.

As I started, I took stock of the team’s challenges and narrowed them down to a handful of areas to urgently focus on: keeping the site up (things routinely went wrong most Friday afternoons due to accelerating load, and folks’ phones routinely ran out of power due to PagerDuty pings during their 12-hour on-call shifts), scaling Kafka, preparing to migrate out of our first datacenter before Uber’s Halloween traffic spike (we’d decided not to purchase any additional capacity in our current data center, so this was a hard deadline), and supporting incoming requests to provision services. In my second and third weeks, I wrote up a strategy for each, and it briefly felt like we had a clear path forward.

Those plans were–to the best of my extremely objective and in no way self-interested memory–most excellent plans. Such good plans, that my manager would later confide that he had initially been concerned that I was going to walk in and immediately solve the problems he’d been struggling with. Of course, it would soon become clear that those very excellent plans worked best on paper.

Our first step was setting up a service cookbook: a YAML file describing work we did for other teams, and a Python Flask application that compiled that file into a UI that automated those requests where possible and otherwise generated well-structured tickets that included the necessary information to complete the task. Our ticket system at the time, Phabricator, didn’t support adding required fields for different types of requests, so the service cookbook provided a lot of value by simply ensuring the right information was included with each request. Even more importantly, the service cookbook gave us something we’d been missing: data on the incoming requests.
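
To make the shape of this concrete, here is a minimal sketch of how a cookbook entry might drive ticket generation. The request types, field names, and structure are hypothetical illustrations, not Uber’s actual cookbook; the real cookbook was a YAML file rendered by a Flask app, while this sketch uses a plain dict and skips the UI to stay self-contained.

```python
# Hypothetical sketch of a "service cookbook": each entry describes a
# request type the team handled, which fields the requester must supply,
# and whether the request could be automated. The actual cookbook was a
# YAML file compiled by a Flask app; a dict keeps this example runnable.
COOKBOOK = {
    "provision-service": {
        "description": "Provision a new service",
        "required_fields": ["service_name", "owning_team", "port"],
        "automated": False,
    },
    "kafka-topic": {
        "description": "Create a new Kafka topic",
        "required_fields": ["topic_name", "retention_hours"],
        "automated": True,
    },
}

def build_ticket(request_type: str, fields: dict) -> str:
    """Validate a request against the cookbook and render a
    well-structured ticket body, raising if required fields are missing."""
    entry = COOKBOOK[request_type]
    missing = [f for f in entry["required_fields"] if f not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    lines = [f"Request: {entry['description']}"]
    lines += [f"{key}: {fields[key]}" for key in entry["required_fields"]]
    return "\n".join(lines)
```

The validation step is the whole trick: because Phabricator couldn’t enforce required fields per request type, doing that enforcement before the ticket was filed was where the value came from.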

As usage data accumulated, it soon became clear that we were spending most of our time either responding to critical production fires–not the sort of thing we could ignore–or provisioning new services.

Service provisioning

Working in Uber’s monolith had become difficult: migrations were risky, deploys were slow, debugging incidents was challenging because each deployment included many teams’ commits, and so on. The decision had been made to deprecate the monolith, and teams were fleeing as quickly as they could into new services.

Service provisioning was a haphazard process, following a long runbook and requiring the sequenced deployment of Puppet updates on at least three different tiers of hosts, in addition to twiddling a number of error-prone bits in Clusto (similar to HashiCorp’s Consul but predating it by some years; much of Uber’s initial infrastructure technology stack was ported over from Digg by an early engineer, Jeremy). After those steps, you’d inevitably start debugging what you, or the team requesting the new service, did wrong. Altogether, you could easily spend half a day provisioning a new service and then another week going back and forth with the service’s owning team to get the final pieces working properly.

We were falling behind a bit more each week on our service provisioning backlog, and we couldn’t add more folks to the provisioning efforts without taking them off our other urgent initiatives. So we focused on our best remaining tool, automation.

We didn’t have many folks to dedicate to service provisioning, so initially it was just me doing it part time and an engineer who joined the same day I did, Xiaojian. That wasn’t enough, so we soon hired a second engineer, Woody, to work on this full time. Woody spent so much time on service provisioning that someone was once confused to learn that Woody was a person and not a HipChat bot.

Our service cookbook gave us an interface to intercept incoming service provisioning requests, and we kept expanding on the tooling until folks could fully provision without any support from the team. Some of the interim steps were quite awkward! I particularly remember one stage where we auto-generated the Puppet configuration for each new service but still needed the requesting engineer to actually merge the change themselves. Folks were quite confused when requesting a new service ended with them pasting changes into our Puppet repository and hoping it didn’t break anything.
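
That interim step amounted to simple template expansion. Here is a rough sketch of the idea; the resource type, parameters, and template shape are invented for illustration, not Uber’s actual Puppet code.

```python
# Hypothetical sketch of auto-generating a Puppet stanza for a new
# service. The `uber_service` resource and its parameters are invented
# for illustration; the point is that the output was a config change
# the requesting engineer still had to merge by hand.
PUPPET_TEMPLATE = """\
uber_service {{ '{name}':
  owner => '{team}',
  port  => {port},
}}
"""

def render_puppet_config(name: str, team: str, port: int) -> str:
    """Expand the service template into a Puppet stanza ready to paste."""
    return PUPPET_TEMPLATE.format(name=name, team=team, port=port)
```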

Another interesting sub-challenge was service discovery and routing. We did service discovery through HAProxy, where you’d route to a service within your current environment by connecting on localhost to the service’s statically allocated port, which HAProxy would then route to an instance running somewhere in that environment. HAProxy was configured on each server individually when Puppet ran, generally once every hour on each server. (The hourly Puppet runs were splayed over twenty minutes, so you had about five minutes between an invalid Puppet change starting to propagate and the entire fleet crashing. We got good at pausing rollouts under duress, but we also crashed a lot of stuff.) To provision a new service, you needed to reserve a globally unique port. If you reused a port, you’d inevitably cause an outage. Ports were initially recorded in a wiki page somewhere, although many weren’t recorded anywhere. To automate this step of service provisioning, we moved from manual assignment to automated assignment (and later away from the horrors of static port assignment). It’s a good example of the many early choices that we had to urgently unwind to automate the full process.
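
The automated-assignment step can be sketched roughly like this. The registry structure and port range here are assumptions for illustration; the real registry of reservations lived in Clusto.

```python
# Hypothetical sketch of automated static port assignment. The real
# source of truth was Clusto; an in-memory dict keyed by port stands
# in for it here. The range is an assumption, not Uber's actual one.
PORT_RANGE = range(10000, 20000)

def assign_port(registry: dict, service_name: str) -> int:
    """Reserve the next free globally unique port for a service.

    Reusing a port would collide in HAProxy's localhost routing and
    cause an outage, so assignment checks every existing reservation.
    """
    if service_name in registry.values():
        raise ValueError(f"{service_name} already has a port")
    for port in PORT_RANGE:
        if port not in registry:
            registry[port] = service_name
            return port
    raise RuntimeError("port range exhausted")
```

Centralizing assignment like this eliminated the wiki-page bookkeeping and the unrecorded reservations, though the deeper fix was moving off static ports entirely.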

These problems weren’t by any means the consequence of “bad choices” in the original infrastructure setup. Instead, like most technical decisions, the approaches simply didn’t survive intact through several orders of magnitude of growth. Many of the challenges we encountered are almost unheard of today, but most of the tools that address them simply didn’t exist as proven options when Uber’s infrastructure was first set up. (As a later example, Uber ended up running Mesos rather than Kubernetes, because Kubernetes was then an early project without much proven usage at scale.)

As we solved each of these problems, service provisioning got faster and less error-prone. Within 18 months we scaled from 15 services to over 2,000, with every single one of those services provisioned by either this team or the platform they built. The team peaked at approximately four folks working on this particular problem, and the vast majority of those services were provisioned through the platform. Far from the initial cross-team, week-long slog, service provisioning became easy enough that at one point every onboarding engineer would provision a new service on their first day.

Headcount

In addition to our work on automation, we ramped up hiring. The first challenge I encountered was that the team rejected every candidate. In my manager interview I ended up doing the traveling salesman problem in Python on a whiteboard, and that was basically par for the interview process. There were no standardized questions, no rubric for evaluating candidates, and the overall structure varied for every hiring manager.

Today, I would start by rolling out those common practices, but at that point in my career, I’d never seen or heard of them. So instead I’d sit with my recruiting partner and debug why each loop failed to generate an offer. For candidates that I thought could help us survive the onslaught of incoming work, I’d go argue with the team about why we needed to hire them. Eventually they got tired of having the same argument with me and started agreeing to hire some of the candidates.

As someone who’d hired maybe six engineers before joining Uber, this was a real baptism by fire. During my first year at Uber, I would often do ten phone screens along with four or five interviews each week. I can vividly remember sitting in one of those glass rooms after three back-to-back phone screens one day, struggling to distinguish between the candidates in my notes. We did so much interviewing that we developed a process of proactively rotating good interviewers out of interviewing when they started burning out.

We also had both hilarious and horrific hiring experiences as a result of little oversight or standardization. We rejected one candidate, whose referrer convinced us to interview them again six months later, at which point we promptly rejected them again. Undeterred, another team insisted on interviewing the candidate a third time. I reached out to that team’s manager to dissuade them, but they insisted and extended an offer. The candidate accepted, started at Uber, then a few hours into their first day accepted a second offer at Twitter instead (conveniently, one block away). In the end, they interviewed more times at Uber than hours they spent working there.

Hiring at this rate, along with frequent production fires and a steady stream of frustrated partner teams, was without a doubt the hardest I have ever worked. It was all-consuming. On the plus side, it did work. We hired quite a few folks, and within a year we’d grown from five to about forty folks, driven entirely by external hiring.

Founding SRE

As our team grew and our automation improved, our plan indicated circumstances would improve. But they didn’t. We remained deeply underwater. Our approach was working, but it wasn’t working fast enough. The engineering organization was doubling around us every six months, and every few months brought a new, urgent project on the scale of, say, provisioning two new data centers in China.

What else could we do to dig out from underneath this workload? If growing human and software capacity wasn’t enough, then the only option we could figure out was to do less work. So obvious! Any approach to doing less had to satisfy two specific requirements. First, we needed to retain our ability to staff the unpredictable yet inevitable production problems that come with doubling load every six months. Second, we had to provide sufficient support to other engineering teams such that they could make forward progress on their critical, independent goals.

There’s a memetic belief that the solution to doing less is to just force your manager to prioritize your workload, but I’ve only seen this strategy work in rather top-down organizations. It’s also a surefire way to demote yourself from leader to manager. I was, admittedly, too obsessed with our team overcoming this challenge ourselves to think about asking for help. Ultimately we came up with another approach: could we solve these constraints by changing the shape of our organization?

It was that question that sparked the creation of Uber’s SRE organization.

We had the staffing to scale our infrastructure if we could protect them from incoming work requests, so we focused on how to protect those teams from interrupts while simultaneously giving requesting teams enough support to make forward progress. The structure was effectively:

  1. Our three or four top organizational partners would get dedicated SRE support. The partner teams would explicitly prioritize the workload of the SREs supporting them. We defined “top” by number of requests, not by impact.
  2. Everyone else would get support from the team building the self-service platform. They’d prioritize work on the principle of accelerating automation.
  3. The rest of infrastructure would support specific areas of scalability and operation, e.g. networking and Kafka. They’d mostly prioritize scalability, usability, and efficiency.

As we started rolling this out, we had a typical Uber problem: we didn’t have anyone to staff the SRE team. However, it wasn’t long until we hired Uber’s first SRE, Rick, Uber’s first SRE manager, Eamon, and hiring ramped up from there.

My motivational speech to Rick when he joined wasn’t very good, something along the lines of, “You’re really smart and experienced. You need to help this partner team accomplish their work without making any requests to the wider infrastructure team. Don’t wait for permission, just go do stuff. Good luck!” Entirely to Rick’s credit, somehow this worked out.

Soon thereafter, another engineer was hired to join Rick working with the same partner team. Then two more engineers were hired to support the next partner team on the list. The two after that supported a third team, and the plan started to come together. We continued this model, adding more SREs to embed directly into top partner teams until we ran out of sufficiently large partner teams to support.

The partner teams, finally having the ability to align infrastructure requests with their work, mostly stopped escalating. Even though we hadn’t given them the level of support they wanted, we had given them explicit control and exposure to the tradeoffs their teams’ asks created, many of which they’d been unaware of in the earlier model.

As our SRE rollout stabilized, it felt like we’d finally solved the problem.

This SRE model wasn’t perfect. It thrived with collaborative partner teams and struggled with others, but it solved the problem it was designed to: our partner teams were making predictable progress, and we were able to simultaneously prioritize scalability work. Most importantly, it solved this problem without any global top-down direction on priorities, instead relying on partner teams to locally resolve their priorities.

SRE v2

Uber’s wider infrastructure organization had an organic approach to problem solving. It was fundamentally a bottoms-up organization with local strategies but no common overarching strategy, which created quite a few points of friction. As the organization grew larger, a search started to fill my direct manager’s role, and eventually we hired Uber’s next head of infrastructure.

The new head of infrastructure had a very specific way of structuring architecture and infrastructure organizations based on their previous experience. Within a few weeks they kicked off a project to rearchitect Uber’s technology and infrastructure stack. They also had a specific view of how SRE should work, and within a couple of months hired a previous colleague to take over leading the SRE team.

As is often the case after a new leader joins to reimplement their previous setup, important context about how things worked got glossed over. I also did a poor job of explaining the specific problem we’d solved with Uber’s SRE organization. To this day, I believe these new leaders had a fundamental misunderstanding of what problem Uber’s SRE organization was solving (enabling bottoms-up organization with highly fluid priorities that weren’t resolved into a single priority list anywhere to function successfully), viewing it instead from the lens of their previous experience. As a result, SRE quickly shifted from an embedded model to a very different model.

The name was unchanged, but it was soon a very different team. That unchanged name concealed the beginning of a significant leadership and cultural shift, one that would culminate in Susan Rigetti’s Reflecting On One Very, Very Strange Year at Uber. A few months later, my fifth boss in two years was out the door, and I followed shortly thereafter.

The SRE organization remained. The SRE organization that we’d built was gone.

Reflection

When recounting stories like this one or the Digg V4 launch, there’s always a tug towards presenting yourself in the best possible way, but I try to recount these things somewhat faithfully: if I could redo my time, I’d do many things differently. Uber was a breakout role for me, and I wouldn’t have succeeded in my subsequent work without everything I learned first at Uber. There are also few things I’m more proud of than what we accomplished together. To the folks I worked with and learned from at Uber: thank you, you changed my life for the better.

That’s not to say the experience was entirely good. Working at that era of Uber extracted a toll. Many others paid a much worse price; I paid one too. I was in way over my head as a leader, and I struggled with it. On a work trip ramping up the Uber Lithuania office, my partner of seven years called to let me know they’d moved out. Writing had been my favorite hobby, but I gave up on writing during my time there. Some chapters enrich our lives without being something we’d repeat, and Uber is certainly one of those chapters for me.

Hi folks. I'm Will, aka @lethain.
