Google Cloud Platform Blog
Lessons from a Google App Engine SRE on how to serve over 100 billion requests per day
Wednesday, April 6, 2016
In this blog post we caught up with Chris Jones, a Site Reliability Engineer on Google App Engine for the past three-and-a-half years and SRE at Google for almost 9 years, to find out more about running production systems at Google. Chris is also one of the editors of
Site Reliability Engineering: How Google Runs Production Systems
, published by O’Reilly and available today.
Google App Engine
serves over 100 billion requests per day. You might have heard about how our Site Reliability Engineers, or SREs, make this happen. It’s a little bit of magic, but mostly about applying the principles of computer science and engineering to the design and development of computing systems —generally very large, distributed ones.
Site Reliability Engineering is a set of engineering approaches to technology that lets us or anyone run better production systems. It went on to inform the concept of DevOps for the wider IT community. It’s interesting because it’s a relatively straightforward way of improving performance and reliability at planet-scale, but can be just as useful for any company for say, rolling out Windows desktops. Done right, SRE techniques can increase the effectiveness of operating any computing service.
Q
: Chris, tell us how many SREs operate App Engine and at what scale?
CJ
: We have millions of apps on App Engine serving over 100 billion requests per day supported by dozens of SREs.
Q
: How do we do that with so few people?
CJ
: SRE is an engineering approach to operating large-scale distributed computing services. Making systems highly standardized is critical. This means all systems work in similar ways to each other, which means fewer people are needed to operate them since there are fewer complexities to understand and deal with.
Automation is also important: our turn-up processes to build new capacity or scale load balancing are automated so that we can scale these processes nicely with computers, rather than with more people. If you put a human on a process that’s boring and repetitive, you’ll notice errors creeping up. Computers’ response times to failures are also much faster than ours. In the time it takes us to notice the error the computer has already moved the traffic to another data center, keeping the service up and running. It’s better to have people do things people are good at and computers do things computers are good at.
Q
: What are some of the other approaches behind the SRE model?
CJ
: Because there are SRE teams working with many of Google’s services, we’re able to extend the principle of standardization across products: SRE-built tools originally used for deploying a new version of Gmail, for instance, might be generalized to cover more situations. This means that each team doesn’t need to build its own way to deploy updates. This ensures that every product gets the benefit of improvements to the tools, which leads to better tooling for the whole organization.
In addition, the combination of software engineering and systems engineering knowledge in SRE often leads to solutions that synthesize the best of both backgrounds. Google’s software network load balancer,
Maglev
, is an example — and it’s the underlying technology for the
Google Cloud Load Balancer
.
Q
: How do these approaches impact App Engine and our customers running on App Engine?
CJ
: Here’s a story that illustrates it pretty well. In the summer of 2013 we moved all of App Engine’s US region from one side of the country to the other. The move incurred no downtime to our customers.
Q
: How?
CJ
: We shut down one App Engine cluster, and as designed, the apps running on it automatically moved to the remaining clusters. We had created a copy of the US region’s
High Replication Datastore
in the destination data center ahead of time so that those applications’ data (and there were petabytes of it!) was already in place; changes to the Datastore were automatically replicated in near real-time so that it was consistently up to date. When it was time to turn on App Engine in the new location, apps assigned to that cluster automatically migrated from their backup clusters and had all their data already in place. We then repeated the process with the remaining clusters until we were done.
Advance preparation, combined with extensive testing and contingency plans, meant that we were ready when things went
slightly wrong
and were able to minimize the impact on customers. And of course, we put together an internal postmortem — another key part of how SRE works — to understand what went wrong and how to fix it for the future, without pointing fingers.
Q
: Very cool. How can we find out more about SRE?
CJ
: Sure. If you’re interested in learning more about how Site Reliability Engineering works at Google, including the lessons we learned along the way, check out this
website
, the
new book
and we’ll also be at
SREcon
this week (April 7-8) giving various talks on this topic.
-
Posted by Jo Maitland, Managing Editor, Google Cloud Platform
No comments :
Post a Comment
Don't Miss Next '17
Use promo code NEXT1720 to save $300 off general admission
REGISTER NOW
Free Trial
GCP Blogs
Big Data & Machine Learning
Kubernetes
GCP Japan Blog
Labels
Announcements
56
Big Data & Machine Learning
91
Compute
156
Containers & Kubernetes
36
CRE
7
Customers
90
Developer Tools & Insights
80
Events
34
Infrastructure
24
Management Tools
39
Networking
18
Open Source
105
Partners
63
Pricing
24
Security & Identity
23
Solutions
16
Stackdriver
19
Storage & Databases
111
Weekly Roundups
16
Archive
2017
Feb
Jan
2016
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2015
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2014
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2013
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2012
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2011
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2010
Dec
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2009
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2008
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Feed
Subscribe by email
Technical questions? Check us out on
Stack Overflow
.
Subscribe to
our monthly newsletter
.
Google
on
Follow @googlecloud
Follow
Follow
No comments :
Post a Comment