Editor's Note: You can now use Zipkin tracers with Stackdriver Trace. Go here to get started.



Part of the promise of the Google Cloud Platform is that it gives developers access to the same tools and technologies that we use to run at Google-scale. As the evolution of our Dapper distributed tracing system, Stackdriver Trace is one of those tools, letting developers analyze application latency and quickly isolate the causes of poor performance. While it was initially focused on Google App Engine projects, Stackdriver Trace also supports applications running on virtual machines or containers via instrumentation libraries for Node.js, Java, and Go (Ruby and .Net support will be available soon), and also through an API. Trace is available at no charge for all projects, and our instrumentation libraries are all open source with permissive licenses.



Another popular distributed tracing system is Zipkin, which Twitter open-sourced in 2012. Zipkin provides a plethora of instrumentation libraries for capturing traces from applications, as well as a backend system for storing and presenting traces through a web interface. Zipkin is widely used; in addition to Twitter, Yelp and Salesforce are major contributors to the project, and organizations around the world use it to view and diagnose the performance of their distributed services.



Zipkin users have been asking for interoperability with Stackdriver Trace, so today we’re releasing a Zipkin server that allows Zipkin-compatible clients to send traces to Stackdriver Trace for analysis.



This will be useful for two groups of people: developers whose applications are written in a language or framework that Stackdriver Trace doesn’t officially support, and owners of applications that are currently instrumented with Zipkin who want access to Stackdriver Trace’s advanced analysis tools. We’re releasing this code open source on GitHub with a permissive license, as well as a container image for quick set-up.



As described above, the new Stackdriver Trace Zipkin Connector is a drop-in replacement for an existing Zipkin backend and continues to use the same Zipkin-compatible tracers. You no longer need to set up, manage or maintain a Zipkin backend. Alternatively, you can run the new collector on each service that's instrumented with Zipkin tracers.



There are currently Zipkin clients available for Java, .Net, Node.js, Python, Ruby and Go, with built-in integration to a variety of popular web frameworks.




Setup Instructions


Read the Using Stackdriver with Zipkin Collector guide to configure and collect trace data from your distributed tracer. If you're not already using a tracer client, you can find one in a list of the most popular Zipkin tracers.




FAQ


Q: What does this announcement mean if I’ve been wanting to use Stackdriver Trace but it doesn’t yet support my language?



If a Zipkin tracer supports your chosen language and framework, you can now use Stackdriver Trace by having the tracer library send its data to the Stackdriver Trace Zipkin Collector.



Q: What does this announcement mean if I currently use Zipkin?



You’re welcome to set up the Stackdriver Trace Zipkin server and use it in conjunction with, or as a replacement for, your existing Zipkin backend. In addition to displaying traces, Stackdriver Trace includes advanced analysis tools like Insights and Latency Reports that work with trace data collected from Zipkin tracers. And because Stackdriver Trace is hosted by Google, you won't need to maintain your own backend services for trace collection and display.



Latency reports are available to all Stackdriver Trace customers



Q: What are the limitations of using the Stackdriver Trace Zipkin Collector?

This release has two known limitations:


  1. Zipkin tracers must support the correct Zipkin time and duration semantics.

  2. Zipkin tracers and the Stackdriver Trace instrumentation libraries can’t append spans to the same traces, meaning that traces that are captured in one library won’t contain spans for services instrumented in the other type of library. For example:





    In this example, requests made to the Node.js web application will be traced with the Zipkin library and sent to Stackdriver Trace. However, these traces do not contain spans generated within the API application or for the RPC calls that it makes to the Database. This is because Zipkin and Stackdriver Trace use different formats for propagating trace context between services.



    For this reason we recommend that projects wanting to use Stackdriver Trace either exclusively use Zipkin-compatible tracers along with the Zipkin Connector, or use instrumentation libraries that work natively with Stackdriver Trace (like the official Node.js, Java or Go libraries).



Q: Will this work as a full Zipkin server?



No, as the initial release only supports write operations. Let us know if you think that adding read operations would be useful, or submit a pull request through GitHub.



Q: How much does Stackdriver Trace cost?



You can use Zipkin with Stackdriver Trace at no cost.



Q: Can I use Stackdriver Trace to analyze my AWS, on-premises, or hybrid applications or is it strictly for services running on Google Cloud Platform?



Several projects already do this today! Stackdriver Trace will analyze all data submitted through its API, regardless of where the instrumented service is hosted, including traces and spans collected from the Stackdriver Trace instrumentation libraries or through the Stackdriver Trace Zipkin Connector.




Wrapping up


We here on the Stackdriver team would like to send out a huge thank you to Adrian Cole of the Zipkin open source project. Adrian’s enthusiasm, technical assistance, design feedback and help with the release process have been invaluable. We hope to expand this collaboration with Zipkin and other open source projects in the future. A huge shout out is also due to the developers on the Stackdriver team who developed this feature.



Like the Stackdriver Trace instrumentation libraries, the Zipkin Connector has been published on GitHub under the Apache license. Feel free to file issues there or submit pull requests for proposed changes.






2016 is winding down, and we wanted to take this chance to thank you, our loyal readers, and wish you happy holidays. As a little gift to you, here’s a poem, courtesy of Mary Koes, a product manager on the Stackdriver team channeling the Clement Clarke Moore classic.



Twas the night before Christmas and all through the Cloud

Not a creature was deploying; it wasn't allowed.

The servers were all hosted in GCP or AWS

And Stackdriver was monitoring them so no one was stressed.





The engineers were nestled all snug in their beds

While visions of dashboards danced in their heads.

When then from my nightstand, there arose such a clatter,

I silenced my phone and checked what was the matter.





Elevated error rates and latency through the roof?

At this rate our error budget soon would go poof!

The Director OOO, the CTO on vacation,

Who would I find still manning their workstation?





Dutifully, I opened the incident channel on Slack

And couldn't believe when someone answered back.

SClaus was the user name of this tireless engineer.

I wasn't aware that this guy even worked here.





He wrote, "Wait while I check your Stackdriver yule Logs . . .

Yep, it seems the errors are all coming from your blogs."

Then in Error Reporting, he found the root cause

"Quota is updated. All fixed. :-)" typed SClaus.





Who this merry DevOps elf was, I never shall know.

For before we did our postmortem, away did he go.

Just before vanishing, he took time to write,

"Merry monitoring to all and to all a silent night!"



Happy holidays everyone, and see you in 2017!





Last week, we opened registration for two Google Certified Professional beta exams — Cloud Architect and Data Engineer — to the public. As companies move to the cloud, demand for competent technical professionals has grown, and these certifications can help people and companies connect. In particular, our new Cloud Architect Certification is helpful to businesses making the shift into cloud infrastructure and platform as a service.



A Google Certified Professional - Cloud Architect enables organizations to leverage Google Cloud technologies through an understanding of cloud architecture, Google Cloud Platform and its users, and has demonstrated the ability to design, develop and manage dynamic solutions that are robust, scalable and highly available.



At its core, the Cloud Architect Certification strives to support our company-wide mission to build the most open cloud for all. If you look at the Google Cloud Architect Certification Exam Guide, you’ll see that a Cloud Architect should also be experienced in microservices and multi-tiered distributed applications that span multi-cloud or hybrid environments. We want Cloud Architects to be able to complement existing on-premises infrastructure with cloud services, even if they’re not all on our cloud.



To earn your certification, you must successfully pass our Cloud Architect exam. We invite you to take our beta exam by January 18, 2017, and qualify for these additional benefits:


  • Save 40% on the cost of certification.

  • Prove early adoption by claiming a low certificate number if you pass.

  • Get exclusive access to the Certification Lounge at Google Cloud Next ’17 if you pass.


Ready? Register now and we’ll see you in the cloud!





Technology makes more sense when you map it out. That’s why we now have icons and sample architectural diagrams for Google Cloud Platform (GCP) available to download. Using these icons, developers, architects and partners can represent complex cloud designs in white papers, datasheets, presentations and other technical content.



The icons are available in a wide variety of formats, and can be mixed and matched with icons from other cloud and infrastructure providers, to accurately represent hybrid- and multi-cloud configurations. There are icons representing GCP products, diagram elements, services, etc. View them below and at cloud.google.com/icons.


We'll update these files as we launch more products, so please check back.



To give you a flavor, below is one of more than 50 sample diagrams in Slides and Powerpoint. No need to start each diagram from scratch!





Happy diagramming!





A critical part of creating a great cloud application is making sure it runs today, tomorrow and every day thereafter. Google Stackdriver offers industrial-strength logging, monitoring and error reporting tools for Windows and .NET, so that your applications are consistently available. And companies of all sizes, such as Khan Academy and Wix, are already using Stackdriver to simplify ops.



With Stackdriver Logging and Stackdriver Monitoring, Google Cloud Platform (GCP) now has several excellent tools for .NET developers to stay on top of what's happening with their applications: a Logging agent and client library, a Monitoring agent and a Stackdriver Diagnostics library for error reporting. Let's take a look at these new options available for .NET developers deploying and running applications on GCP.




Logging agent




Google Compute Engine virtual machines (VMs) running .NET applications can now automatically collect request and application logs. This is similar to the logging information provided by VMs running in Google App Engine standard and flexible environments. To start logging to Stackdriver, install the Logging agent on your Compute Engine VMs, following these instructions. To confirm things are working, look for a test log entry that reads textPayload: "Successfully sent to Google Cloud Logging API" in the Stackdriver Logs Viewer.



Once the Logging agent is installed in a VM, it starts emitting logs, and you'll have a "log's-eye view" of what's happening via auto-generated logs that reflect the events collected by Windows Event Viewer. No matter how many VMs your application requires, the Logs Viewer provides a consolidated view of the Windows logs being generated across your application.






Monitoring agent




Automated logging of warnings and errors from your apps is just the beginning. Monitoring also lets you track specific metrics about your Windows VMs and receive an alert when they cross a predefined threshold. For example, imagine you want to know when a Windows VM's memory usage exceeds 80%. Enter the Monitoring agent, an optional agent for your Windows VMs that collects CPU and memory utilization, pagefile and volume usage metrics for Monitoring. If the VM is running Microsoft IIS or SQL Server, the agent also collects metrics from those services. See the Metrics List page for the full list of metrics it can collect, including metrics from third-party apps, and follow these installation instructions to install it.



Once the Monitoring agent is up and running, it's time to explore the real power of monitoring: alerting! You can create a policy to alert you when a specific threshold value is crossed. For example, here's how to create a policy that sends a notification when a VM's CPU utilization stays above 80% for more than 15 minutes:



Step 1. Add a metric threshold condition. From the Monitoring main menu select "Alerting > Create a policy." Click "Add Condition." Select a condition type and appropriate threshold.






Step 2. Complete the details of the alerting policy. Under "Notification" enter an optional email address to receive alerts via email. Add any other details to the optional "Documentation" field. Finally, name the policy and click "Save Policy."



After creating a monitoring policy, you'll see the policy details page along with the status of any incidents:



To monitor web servers, Monitoring has a built-in "Uptime check" alert that continuously pings your VM over HTTP, HTTPS or TCP at a custom interval, helping you ensure that your web server is responding and serving pages as expected.



Here's how to create an Uptime check that pings the webserver at the specified hostname every 5 minutes:


  1. From the Monitoring dashboard click "Create Check" under "Uptime checks."

  2. Enter the details for the new Uptime check including Name, Check Type, Resource Type, Hostname and Path and specify how often to run the Uptime check under the "Check every" field.

  3. Click "Save."




The new Uptime checks page lists the geographic locations from where the checks are being run along with a status indicator:






Logging custom events for .NET Applications




Not only can you monitor resources, but you can also log important events specific to your application. "Google.Cloud.Logging.V2" is a beta .NET client library for Logging that provides an easy way to generate custom event logs using Stackdriver integration with Log4Net.



Step 1: Add the Logging client's Nuget packages to your Visual Studio project.



Right click your solution in Visual Studio and choose "Manage Nuget packages for solution." In the Visual Studio NuGet user interface, check the "Include prerelease" box, search for the package named "Google.Cloud.Logging.V2" and install it. Then install the "Google.Cloud.Logging.Log4Net" package in the same way.



Step 2: Add a Log4Net XML configuration section to your web application's Web.config file containing the following code:



<configuration>
  <configSections>
    <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net" />
  </configSections>
  <log4net>
    <appender name="CloudLogger" type="Google.Cloud.Logging.Log4Net.GoogleStackdriverAppender,Google.Cloud.Logging.Log4Net">
      <layout type="log4net.Layout.PatternLayout">
        <conversionPattern value="%-4timestamp [%thread] %-5level %logger %ndc - %message" />
      </layout>
      <projectId value="YOUR-PROJECT-ID" />
      <logId value="mySampleLog" />
    </appender>
    <root>
      <level value="ALL" />
      <appender-ref ref="CloudLogger" />
    </root>
  </log4net>
</configuration>







Step 3: Configure Log4Net to use Logging by adding the following line of code to your application's Global.asax.cs file:

log4net.Config.XmlConfigurator.Configure();

The Application_Start() method in Global.asax.cs should then look like this:



protected void Application_Start()
{
    GlobalConfiguration.Configure(WebApiConfig.Register);

    // Configure log4net to use Stackdriver logging from the XML configuration file.
    log4net.Config.XmlConfigurator.Configure();
}



Step 4: Add this statement to your application code to include the client libraries:

using log4net;



Step 5: To write logs that will appear in the Stackdriver Logs Viewer, add the following code to your application:



// Retrieve a logger for this context.
ILog log = LogManager.GetLogger(typeof(WebApiConfig));

// Log some information to Google Stackdriver Logging.
log.Info("Hello World.");



Once you build and run this code, you'll get log entries that look like this:



See the "How-To" documentation for installing and using the Logging client Nuget package for .NET applications.




Error Reporting for .NET Applications




Even if your VMs are running perfectly, your application may encounter runtime exceptions due to things like unexpected usage patterns. Good news! We recently released the beta Stackdriver Diagnostics ASP.NET NuGet package for Compute Engine VMs running .NET. With it, all exception errors from your application are automatically logged to Error Reporting.



Step 1: Enable the Error Reporting API.



Step 2: Right-click your solution in Visual Studio, choose "Manage Nuget packages for solution."

Check the "Include prerelease" checkbox. Search for the package named "Google.Cloud.Diagnostics.AspNet" and then install the package.



Step 3: Add the library to your application code:

using Google.Cloud.Diagnostics.AspNet;



Step 4: Add the following code to the "Register" method of your .NET web app:

public static void Register(HttpConfiguration config)
{
    string projectId = "YOUR-PROJECT-ID";
    string serviceName = "NAME-OF-YOUR-SERVICE";
    string version = "VERSION-OF-YOUR-SERVICE";

    // Add a catch-all for uncaught exceptions.
    config.Services.Add(typeof(IExceptionLogger),
        ErrorReportingExceptionLogger.Create(projectId, serviceName, version));
}





Here's an example of the exceptions you'll see in Error Reporting:







Click on an exception to see its details:



See the "How-To" documentation for installing and using the Stackdriver Diagnostics ASP.NET NuGet package for .NET applications.




Try it out




Now that you know how easy it is to log, monitor and enable error reporting for .NET applications on Google Cloud, go ahead and deploy a .NET application to Google Cloud for yourself. Next install the Logging and Monitoring agents on your VM(s) and add the Stackdriver Diagnostics and Logging client packages to your application. You can rest easier knowing that you're logging exactly what's going on with your application and that you'll be notified whenever something goes bump in the night.




From product news to behind-the-scenes stories to tips and tricks, we covered a lot of ground on the Google Cloud Platform (GCP) blog this year. Here are the most popular posts from 2016.







  1. Google supercharges machine learning tasks with TPU custom chip - A look inside our custom ASIC built specifically for machine learning. This chip fast-forwards technology seven years into the future. 


    Tensor Processing Unit board


  2. Bringing Pokemon Go to life - Niantic’s augmented reality game uses more than a dozen Google Cloud services to delight and physically exert millions of Pokemon chasers across the globe.




  3. New undersea cable expands capacity for Google APAC customers and users - Together with Facebook, Pacific Light Data Communication and TE SubCom, we’re building the first direct submarine cable system between Los Angeles and Hong Kong.



  4. Introducing Cloud Natural Language API, Speech API open beta and our West Coast Region expansion - Now anyone can use machine learning models to process unstructured data or to convert speech to text. We also announced the opening of our Oregon Cloud Region (us-west1).




  5. Google to acquire Apigee - Apigee, an API management provider, helps developers integrate with outside apps and services. (Our acquisition of cloud-based software buyer and seller, Orbitera, also made big news this year.)




  6. Top 5 GCP NEXT breakout sessions on YouTube (so far) - From Site Reliability Engineering (SRE) and container management to building smart apps and analyzing 25 billion stock market events in an hour, Google presenters kept the NEXT reel rolling. (Don’t forget to sign up for Google Cloud Next 2017, which is just around the corner!)




  7. Advancing enterprise database workloads on Google Cloud Platform - Announcing that our fully managed database services Cloud SQL, Cloud Bigtable and Cloud Datastore are all generally available, plus Microsoft SQL Server images for Google Compute Engine.




  8. Google Cloud machine learning family grows with new API, editions and pricing - The new Cloud Jobs API makes it easier to fill open positions, and GPUs spike compute power for certain jobs. Also included: custom TPUs in Cloud Vision API, Cloud Translation API premium and general availability of Cloud Natural Language API.




  9. Google Cloud Platform sets a course for new horizons - In one day, we announced eight new Google Cloud regions, BigQuery support for Standard SQL and Customer Reliability Engineering (CRE), a support model in which Google engineers work directly with customer operations teams.




  10. Finding Pete’s Dragon with Cloud Vision API - Learn how Disney used machine learning to create a “digital experience” that lets kids search for Pete’s friend Elliot on their mobile and desktop screens.



  11. Top 10 GCP sessions from Google I/O 2016 - How do you develop a Node.js backend for an iOS and Android based game? What about a real-time game with Firebase? How do you build a smart RasPI bot with Cloud Vision API? You'll find the answers to these and many other burning questions in these sessions.




  12. Spotify chooses Google Cloud Platform to power its data infrastructure - As Spotify’s user base grew to more than 75 million, it moved its backend from a homegrown infrastructure to a scalable and reliable public cloud.




Thank you for staying up to speed on GCP happenings on our blog. We look forward to much more activity in 2017, and invite you to join in on the action if you haven't already. Happy holidays!





Editor's note: Just because something is a good problem to have, doesn't mean it's not a problem. In this latest installment of the CRE life lessons series, we learn about techniques that the Google Site Reliability Engineering team uses to handle too much of a good thing (traffic) with grace, and how you can apply them to your own code running on Google Cloud Platform (GCP).



In our last installment in this series, we talked about how to prevent an accidental DDoS from your own code. In this post, we'll talk about what to do when you have the problem everybody hopes for: the success disaster.



The most painful kind of software failure is the "success disaster." This happens when your application consistently gets more traffic than you can sustainably handle. While you scramble to add capacity, your users may start to get the idea that it’s not worth the effort to use your system and eventually leave for something else.



What makes this the worst sort of failure is that nobody thinks it will happen to them while simultaneously hoping it does. It’s an entirely avoidable problem. Embrace the practice of load shedding and spare yourself the pain of this regret. Load shedding is a technique that allows your system to serve nominal capacity, regardless of how much traffic is being sent to it, in order to maintain availability. To do this, you'll need to throw away some requests and make clients retry.


Procrustean load shedding


You may recall, Poseidon’s son Procrustes had a very, um, one-size-fits-all approach to accommodating his overnight guests. In its simplest form, load shedding can be a bit like that too: observe some easily obtained local measure like CPU load, memory utilization or request queue length, and when this load number crosses a predetermined “safe” level as established by load testing, drop a fraction of incoming traffic to bring the load back to safe levels. For example, the system may drop the first n of each 10 requests where n starts at 1, ramps up as system load stays high, and drops gradually as the load returns to safe levels.



For example, here’s a Python method that processes a new request while keeping the load under a hard limit of 45 units and pushing down towards a soft limit of 25 units:



def addRequest(self, r):
    HARD_QUOTA = 45
    SOFT_QUOTA = 25
    STEPS = 10

    divisor = (HARD_QUOTA - SOFT_QUOTA) / STEPS

    self.received += 1
    self.req_modulus = (self.req_modulus + 1) % STEPS

    # Are we overloaded?
    load = self.getLoad()

    # Become progressively more likely to reject requests
    # once load > soft quota; reject everything once load
    # hits hard limit.
    threshold = int((HARD_QUOTA - load) / divisor)

    if self.req_modulus < threshold:
        # We're not too loaded
        self.active_requests.append(r)
        self.accepted += 1
    else:
        self.rejected += 1
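
The method above assumes some surrounding state (a getLoad() method, an active_requests queue and a few counters) that isn't shown. Purely for illustration, here's a self-contained harness that wraps the same logic in a hypothetical LoadShedder class with a toy load model; it's a sketch, not the system used to produce the graph below:

import random

class LoadShedder:
    """Hypothetical wrapper: each queued request counts as one unit of load."""

    def __init__(self):
        self.active_requests = []
        self.received = 0
        self.accepted = 0
        self.rejected = 0
        self.req_modulus = 0

    def getLoad(self):
        # Toy load model: load equals the number of queued requests.
        return len(self.active_requests)

    def addRequest(self, r):
        # Same logic as the method shown above.
        HARD_QUOTA = 45
        SOFT_QUOTA = 25
        STEPS = 10
        divisor = (HARD_QUOTA - SOFT_QUOTA) / STEPS

        self.received += 1
        self.req_modulus = (self.req_modulus + 1) % STEPS

        load = self.getLoad()
        threshold = int((HARD_QUOTA - load) / divisor)

        if self.req_modulus < threshold:
            self.active_requests.append(r)
            self.accepted += 1
        else:
            self.rejected += 1

if __name__ == "__main__":
    shedder = LoadShedder()
    for tick in range(300):
        for _ in range(random.randint(1, 5)):   # bursty arrivals
            shedder.addRequest(object())
        del shedder.active_requests[:2]         # toy completion: 2 requests per tick
    print(shedder.received, shedder.accepted, shedder.rejected)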





When you feed a varying load into this system, you get the behavior seen below:



In the modeled system, requests expire after a fixed time and are of varying cost. At a normal request rate (0-2 requests/sec) the system runs comfortably within limits. When requests double at t=30, load shedding kicks in; we start to see a rise in rejected and expired requests, but the load is kept under the hard limit. Rejected requests are more common than expired ones, which is what we want, as expired requests consume system resources for no utility. Once the request rate returns to normal at t=90, new rejected and expired requests stop. Between t=120 and t=150 there's a 50% rise in requests, which reactivates load shedding but at a lower rate.



This kind of load shedding is simple to implement and is definitely better than having no load shedding at all, but it also has at least one very serious drawback: it assumes that all types of requests and clients are equal. In our experience, this is seldom true. If 95% of your online store’s requests are people paging through your catalog, and 5% are actual purchase requests, wouldn’t you want to prioritize the latter? A Procrustean approach to load shedding won’t help you with this.



Fortunately there are alternatives.


Ranking requests for criticality and cost


Before you can safely throw away less valuable work (i.e., drop requests, refuse connections, etc.) you first have to rank the relative importance of each request. That means figuring out what a request costs.



Every request has two costs:


  1. The cost to perform the work (the direct cost)

  2. The cost to not perform the work (the opportunity cost)


Direct cost is usually expressed in terms of a finite computing resource like CPU, RAM or network bandwidth. In our experience, however, this most usually resolves to CPU, as RAM is often already over-provisioned relative to CPU. (Networks can sometimes be the scarce resource, but normally only for specialty cases.)



Opportunity cost, on the other hand, is a little trickier to calculate. How do you measure the cost of not doing something? It’s tempting to try to express it in terms of dollars but that’s usually an oversimplification. Dollars of revenue are not the same as dollars of profit. One might be vitally more important to your business than another. With that in mind, here are two rules to remember when thinking about this:


  1. Denominate your costs in terms of your scarcest resource. If CPU is the scarcest thing in your system then use that to express all of your costs. If it’s revenue or profit then use that. At Google, for example, we sometimes use engineering hours as a measure of cost because we perceive engineering time as more scarce than dollars.

  2. Get everyone to agree on the units before you start ranking request types. Different parts of your business will have different views of the costs of dropping traffic. The ads team might value the dollars in lost revenue for not serving a piece of content while your marketing team might value the total number of users that can simultaneously access your application. The UX team, on the other hand, might think that latency is the most important thing since laggy UIs make users grumpy. The point is that this all gets settled by deciding on the denominating units first!


Once you decide how to measure the costs of dropped work then you can stack-rank the requests to shed. This is known as establishing your criticality. The more critical traffic gets prioritized ahead of the less critical traffic.



Of course, even this has its nuances. For example, some load shedding systems are designed to minimize the aggregate opportunity cost in the system, while others consider how the opportunity costs and direct costs relate to each other (known as weighted or scaled costs).



It’s almost never possible to know either the direct or opportunity cost of a specific query at runtime. Even if you could know, it’s likely that the computational overhead of calculating it in-line for every request would seriously reduce your serving capacity. Instead, you should establish a few criticality buckets or classes for your known request types. This way you can more easily classify each request into one of the buckets and use that to stack-rank their priorities. (Those of you with networking backgrounds will recognize this as a key component of Quality of Service (QoS) systems.)
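
To make that concrete, here's a minimal Python sketch of how criticality buckets and the resulting shed order might be represented. The bucket names are illustrative assumptions, not an established taxonomy:

from enum import IntEnum

class Criticality(IntEnum):
    """Smaller value = less critical = shed first. Names are illustrative."""
    NON_CRITICAL_RETRYABLE = 0   # e.g., background batch uploads
    SHEDDABLE = 1                # e.g., catalog browsing
    CRITICAL = 2                 # e.g., checkout / purchase requests

def shed_order(requests):
    """Stack-rank queued requests so the least critical are dropped first.
    Assumes each request object carries a .criticality attribute."""
    return sorted(requests, key=lambda req: req.criticality)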




Setting criticality




For load shedding to work best, your system should determine the criticality bucket of a request as early as possible, and as cheaply as possible, based on your criteria. Some common examples of how to determine criticality (see the sketch after this list) include:


  • An explicit field in the request specifying the bucket.

  • Bucketing by the hostname, which lets you "black-hole" low-priority traffic in overload situations by using DNS to point to a sacrificial server. This is a big hammer, but occasionally life-saving because it can stop requests from reaching your overloaded service in the first place.

  • The URL path, which is fairly cheap to check though does require some extra processing by your front-end service.

  • User ID, and whether it belongs to a specific group, e.g., "paying customers" (highest), "logged in users" (medium-high), "logged-out users" (medium-low), "known robot accounts" (lowest). This allows the most precise bucketing, but is more expensive to check for each request.
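
Tying these signals together, a front end might classify each request with a few cheap checks and fall back to a default bucket. The header name, paths and user attributes below are hypothetical, and the sketch reuses the Criticality enum from the earlier example:

def classify_request(headers, path, user=None):
    """Map cheap request signals to a criticality bucket (illustrative only)."""
    # 1. Explicit field set by a trusted client.
    explicit = headers.get("x-request-criticality")
    if explicit in Criticality.__members__:
        return Criticality[explicit]

    # 2. URL path: background/batch endpoints are cheap to spot.
    if path.startswith("/batch/") or path.startswith("/upload/background"):
        return Criticality.NON_CRITICAL_RETRYABLE

    # 3. User group: most precise, but may require a costlier lookup.
    if user is not None:
        if user.is_known_robot:
            return Criticality.NON_CRITICAL_RETRYABLE
        if user.is_paying_customer:
            return Criticality.CRITICAL

    return Criticality.SHEDDABLE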


At Google, we often classify batch operations (for example, background photo uploads) as "non-critical retryable." This signals that a request is not directly user-facing and that the user generally doesn't mind if the handling is delayed several minutes, or even an hour. In this case, the system can easily drop the request and tell the client to re-attempt the upload later. As long as the retry interval is quite large, the overall volume of retries remains low, while still allowing clients to resume uploading once the system load crisis is over.



We’ve had several painful experiences where a rogue client was using the same hostname as many good clients, making it impossible to block the rogue without affecting the good clients as well. Now, in situations where a public HTTP-based infrastructure service serves many different kinds of clients, every type of client accesses the service through its own hostname. This allows us to isolate all traffic from a badly-behaved client and route it to more distant servers with spare capacity. While this may increase latency for these bad clients, it spares other client types. Alternatively we can designate a subset of servers to handle the bad client traffic as best as they can, accepting that they'll likely become overloaded, and keep traffic from other clients away from those hosts. There’s also the last-ditch option of simply black-holing the bad traffic.




Criticality changes over time


Opportunity costs seldom follow a straight line, and what’s critical now, might not be later. Over time, a request might move from one criticality bucket to another. Take for example, loading your front page.



At first, the request to load your front page is very valuable because it’s serving important content (perhaps ads) to your user. After a certain amount of waiting, say 2 seconds, the user will probably abandon the slow page and go someplace else. That means from 0.0 second until 1.9 seconds the request to load your front page might be in your highest criticality bucket. Once it hits 2.0 seconds, however, you might as well drop it to the lowest bucket (or cancel it altogether), because the user probably isn’t there anymore.



For this reason, a great source of load that you can shed cheaply is requests that are exceeding their response deadlines, as established by user interface data and design. The better tuned your deadlines, the cheaper this will be.
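
As a rough sketch of that idea (the 2-second budget is just the example figure above, and the class shape is illustrative), you can stamp each request with a deadline on arrival and shed anything that has already blown it before doing further work:

import time

FRONT_PAGE_DEADLINE_S = 2.0  # example budget from the text; tune from UX data

class Request:
    def __init__(self, payload, deadline_s=FRONT_PAGE_DEADLINE_S):
        self.payload = payload
        self.arrival = time.monotonic()
        self.deadline = self.arrival + deadline_s

    def expired(self):
        return time.monotonic() > self.deadline

def drain_queue(queue, handle):
    """Process queued requests, shedding any whose deadline has passed."""
    for req in queue:
        if req.expired():
            continue  # the user has likely given up; don't waste capacity
        handle(req)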




Soft quotas vs. hard quotas




Suppose your system has a total serving capacity of 1,000,000 QPS and you average 10,000 simultaneous users at peak. In order to protect yourself from particularly demanding users you decide to cap each client at 100 QPS. This cap is called a quota.



The problem, of course, with giving each client a hard quota of 100 QPS is that when you have fewer than 10,000 clients hitting your backends, you have idle capacity. Wasted capacity can never be recovered (at least without the aid of a time machine), so you should avoid that at all costs. An important principle we follow inside Google is work conservation, which can be stated as: clients who have exceeded their quotas should not be throttled if the system has remaining capacity.



In our example, the 100 QPS quota is a soft quota because it shouldn't necessarily be enforced if the system can tolerate the extra load. A hard quota, on the other hand, is a limitation that cannot ever be exceeded under any circumstances. Hard quotas exist to protect your infrastructure, while soft quotas exist to help you manage finite resources "equitably," however that's defined in your business.



This brings us to another important consideration: fairness. When the system runs out of capacity then the clients who are most over their quotas should be the first to be throttled. If user X is sending 150 QPS to the system and user Y is sending 500 QPS, it might be unfair to squash user X until user Y has had 350 QPS load-shed.
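
Here's a minimal sketch of a work-conserving soft quota that also respects that fairness rule: nobody is throttled while spare capacity remains, and once the system is full, the clients furthest over quota are shed first. The numbers and class shape are illustrative, not a description of Google's implementation:

from collections import defaultdict

GLOBAL_CAPACITY_QPS = 1000000
SOFT_QUOTA_QPS = 100

class QuotaThrottler:
    """Work-conserving soft quota; per-client QPS measurement is assumed to
    come from an external metrics pipeline."""

    def __init__(self):
        self.client_qps = defaultdict(float)

    def admit(self, client_id):
        total_qps = sum(self.client_qps.values())
        if total_qps < GLOBAL_CAPACITY_QPS:
            return True  # work conservation: never waste idle capacity

        # At capacity: throttle the clients that are most over quota first.
        overage = self.client_qps[client_id] - SOFT_QUOTA_QPS
        worst = max(qps - SOFT_QUOTA_QPS for qps in self.client_qps.values())
        return overage < worst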




Optimistic and pessimistic throttling




Having decided which traffic to throttle, you still need to decide how to throttle. Your two basic choices are optimistic and pessimistic throttling.



Optimistic throttling just means that you don’t start dropping traffic until you reach global capacity. When this happens, the load shedding system starts working its way through requests, beginning with the least important items and working back up the stack until things are healthy again. The advantage to this approach is that it’s pretty easy to implement and relatively computationally “cheap” because you’re not reacting until you get close to your global limit.



The downside of optimistic throttling, however, is that you'll spike over your global maximum while you start shedding load. Most users will only experience this momentary overload in the form of slightly higher latency, and for this reason, this is our recommended approach for a majority of systems.



If you do choose to go down the optimistic throttling path, it’s super important to thoroughly test your logic. With this approach, there’s a risk that your active load shedding may break due to a coding error in one of your binary releases, and you may not notice it for several weeks until you hit a peak that triggers load shedding and your servers start to segfault. Not that this has ever happened to us . . . ;-)



Pessimistic throttling, on the other hand, assumes that you may not exceed your global maximum under any circumstance, not even for a very short period of time. This is a more computationally expensive approach because the load management system has to continually compute (and recompute) quotas and other limits and transmit them throughout your system. This almost always means that you never quite serve up to your global capacity. And even when you do, the additional computational load eats into capacity that would otherwise be serving capacity. For these reasons, pessimistic throttling is more difficult and costlier to implement and maintain.




Throttling as a signal




If you're the owner of a system that has started to throttle some of its traffic, what does that tell you?



The naive interpretation is that you have a problem that you need to fix, and the simplest approach is to add more capacity in the system. For example, you could turn up more servers, or add resources to the ones you already have. However, if you’re spending 20% more to keep 20% more servers up and running, but the extra capacity is only used for a few minutes at peak every day, this isn’t a good use of resources.



Instead, look at the effects of throttling in terms of user experience and revenue. Are real users seeing errors or service degradation as a result? If so, what fraction of active users are affected? How many revenue-related requests are being throttled? How much is this costing you, compared to the cost of providing extra compute resource to serve those requests?



In many cases, if the system is only throttling non-interactive retryable requests, then your system is probably working as intended. As long as the throttling period is not prolonged and the retries are completing within your processing SLO there’s no real reason to spend more money to serve them more promptly. That said, if your service is throttling traffic for 12 hours every day, it may be time to do something about its capacity.



Analyzing the impact of throttling should be relatively easy to perform because you’re already classifying your requests into buckets, while monitoring tools can show you what fraction of each bucket’s requests was throttled.




Case study




Google once ran a service with many millions of mobile clients that cached state on users' mobile devices about images that were incrementally uploaded (in the background) to a backend storage service. The service was designed to handle peak global traffic, plus an additional margin, with the assumption that two serving locations could be unavailable at any one time. The service also handled a significant amount of interactive (user-facing) traffic.



We identified this service as a candidate for load shedding, and implemented it by marking requests with a new "request priority" field, with values ranging from "critical user-facing" to "non-critical background" (background uploads). The service was set to automatically shed requests once it reached its predetermined maximum capacity, starting with the lowest priority and working its way up.



Two days after the load shedding code made it to production, a new app release was pushed to the clients that had the unfortunate side effect of resetting the record of which data had already been uploaded. This bug made all the clients try to connect to our service at once to re-upload all their data. You can see the upload failure rate in the following graph:



This is not a graph you want to see if you’re the SRE on call. But the system continued to serve traffic correctly. Load shedding saved it from becoming overloaded by dropping nearly half of all background upload requests, while the remaining clients patiently backed off and retried again later. After a couple of hours, we turned up enough additional capacity to handle the load, the clients uploaded their data, and things went back to normal. (The short spike in server errors is an artefact of the way we disabled the throttling once the new capacity was in place.) In short, load shedding provided defense-in-depth against an irreversible coding bug.




Wrapping up




We all want to build systems whose popularity exceeds our wildest dreams. In thinking about those cases, however, we too often dismiss them by saying "that's a problem I'd like to have!" In our experience, these are only problems you want to have until you have them. Then they're just problems, and painful ones at that.



Reliability is your most important feature and you want your application to be insanely popular. Load shedding is a cheap way to design with that success in mind. Build it in early and you’ll spare yourself the agony of pondering what might have been.



This is our last CRE post of 2016. We hope all of you have a wonderful holiday season and thank you for the wonderful comments and suggestions. We’ll see you again in the new year. Until then: May your queries flow and your pagers stay silent . . .





Today, we’re bringing the latest Kubernetes 1.5 release to Google Cloud Platform (GCP) customers. In addition to the full slate of features available in Kubernetes, Google Container Engine brings a simplified user experience for cross-cloud federation, support for running stateful applications and automated maintenance of your clusters.



Highlights of this Container Engine release include:




  • Auto-upgrade and auto-repair for nodes simplify on-going management of your clusters

  • Simplified cross-cloud federation with support for the new "kubefed" tool

  • Automated scaling for key cluster add-ons, ensuring improved uptime for critical cluster services

  • StatefulSets (originally called PetSets) in beta, enabling you to run stateful workloads on Container Engine

  • HIPAA compliance allowing you to run HIPAA regulated workloads in containers (after agreement to Google Cloud’s standard Business Associate Agreement).




The adoption of Kubernetes and growth of the community has propelled it to be one of the fastest and most active open source projects, and that growth is mirrored in the accelerating usage of Container Engine. By using the fully managed services, companies can focus on delivering value for their customers, rather than on maintaining their infrastructure. Some recent customer highlights include:




  • GroupBy uses Container Engine to support continuous delivery of new commerce application capabilities for their customers, including retailers such as The Container Store, Urban Outfitters and CVS Health.







"Google Container Engine provides us with the openness, stability and scalability we need to manage and orchestrate our Docker containers. This year, our customers flourished during Black Friday and Cyber Monday with zero outages, downtime or interruptions in service thanks, in part, to Google Container Engine." - Will Warren, Chief Technology Officer at GroupBy.




  • MightyTV ported their workloads to Container Engine to power their video recommendation engine, reducing their cost by 33% compared to running on traditional virtual machines. Additionally, they were able to remove a third-party monitoring and logging service and let go of maintaining Kubernetes on their own.






If you'd like to help shape the future of Kubernetes, the core technology Container Engine is built on, join the open Kubernetes community and participate via the kubernetes-users mailing list or chat with us on the kubernetes-users Slack channel.



Finally, if you'd like to try Kubernetes or GCP, it's super easy to get started with one-click Kubernetes cluster creation with Container Engine. Sign up for a free trial here.



Thank you for your support!





From the beginning, our goal for Google Cloud Platform has been to build the most open cloud for all developers and businesses alike, and make it easy for them to build and run great software. A big part of this is being an active member of the open source community and working directly with developers where they are, whether they’re at an emerging startup or a large enterprise.





Today we're pleased to announce that Google has joined the Cloud Foundry Foundation as a Gold member to further our commitment to these goals.






Building on success


We've done a lot of work with the Cloud Foundry community this year, including the delivery of the BOSH Google CPI release, enabling the deployment of Cloud Foundry on GCP, and the recent release of the Open Service Broker API. These efforts have led to additional development and integration with tools like Google Stackdriver for hybrid monitoring, and custom service brokers for eight of our GCP services:




Collaborating with customers and partners as we’ve worked on these projects made the decision to join the Cloud Foundry Foundation simple. It's an energized community with vast enterprise adoption, and the technical collaboration has been remarkable between the various teams.






What’s next


Joining the Cloud Foundry Foundation allows us to be even more engaged and collaborative with the entire Cloud Foundry ecosystem. And as we enter 2017, we look forward to even more integrations and more innovations between Google, the Cloud Foundry Foundation and our joint communities.







Cloud load balancers are a key part of building resilient and highly elastic services, allowing you to think less about infrastructure and focus more on your applications. But the applications themselves are evolving: they are becoming highly distributed and made up of multiple tiers. Many of these are delivered as internal-only microservices. That’s why we’re excited to announce the general availability of Internal Load Balancing, a fully-managed flavor of Google Cloud Load Balancing that enables you to build scalable and highly available internal services for internal client instances without requiring the load balancers to be exposed to the Internet.



In the past year, we’ve described our Global Load Balancing and Network Load Balancing technologies in detail at the GCP NEXT and NSDI conferences. More recently, we revealed that GCP customer Niantic deployed HTTP(S) LB, alongside Google Container Engine and other GCP technologies, to scale its wildly popular game Pokémon GO. Today Internal Load Balancing joins our arsenal of cloud load balancers to deliver the scale, performance and availability your private services need.



Internal Load Balancing architecture

When we present Cloud Load Balancing, one of the questions we get is “Where is the load balancer?” and then “How many connections per second can it support?”



Similar to our HTTP(S) Load Balancer and Network Load Balancer, Internal Load Balancing is neither a hardware appliance nor an instance-based solution. It is software-defined load balancing delivered using Andromeda, Google’s network virtualization stack.






As a result, your internal load balancer is “everywhere” you need it in your virtual network, but “nowhere” as a choke-point in this network. By virtue of this architecture, Internal Load Balancing can support as many connections per second as needed since there is no load balancer in the path between your client and backend instances.



Internal Load Balancing features

Internal Load Balancing enables you to distribute internal client traffic across backends running private services. In the example below, client instance (192.168.1.1) in Subnet 1 connects to the Internal Load Balancing IP (10.240.0.200) and gets load balanced to a backend instance (10.240.0.2) in Subnet 2.






With Internal Load Balancing, you can:



  • Configure a private RFC1918 load-balancing IP from within your virtual network;

  • Load balance across instances in multiple availability zones within a region;

  • Configure session affinity to ensure that traffic from a client is load balanced to the same backend instance;

  • Configure high-fidelity TCP, SSL(TLS), HTTP or HTTPS health checks;

  • Get instant scaling for your backend instances with no pre-warming; and

  • Get all the benefits of a fully managed load balancing service. You no longer have to worry about load balancer availability or the load balancer being a choke point.



Configuring Internal Load Balancing

You can configure Internal Load Balancing using the REST API, the gcloud CLI or Google Cloud Console. Click here to learn more about configuring Internal Load Balancing.



The (use) case for Internal Load Balancing

Internal Load Balancing delivers private RFC 1918 load balancing within your virtual private networks in GCP. Let’s walk through three interesting ILB use cases:





1. Scaling your internal services



In a typical microservices architecture, you deliver availability and scale for each service by deploying multiple instances of it and using an internal load balancer to distribute traffic across these instances. Internal Load Balancing does this, and also autoscales instances to handle increases in traffic to your service.






2. Building multi-tier applications on GCP



Internal Load Balancing is a critical component for building n-tier apps. For example, you can deploy HTTP(S) Load Balancing as the global web front-end load balancer across web-tier instances. You can then deploy the application server instances (represented as the Internal Tier below) behind a regional internal load balancer and send requests from your web tier instances to it.






3. Delivering high availability and scale for virtual appliances



Traditionally, high availability (HA) for hardware appliances is modeled as an active-standby or active-active set-up where two (or sometimes more) such devices exchange heart beats and state information across a dedicated, physical synchronization link in a Layer 2 network.



This model no longer works for cloud-based virtual appliances such as firewalls, because you do not have access to the physical hardware. Layer 2 based high availability doesn’t work either because public cloud virtual networks are typically Layer 3 networks. More importantly, cloud apps store shared state outside of the application for durability, etc., so it is possible to eliminate traditional session state synchronization.






Considering all of these factors, a high availability model that works well on Google Cloud Platform is deploying virtual appliance instances behind Internal Load Balancing. Internal Load Balancing performs health checks on your virtual appliance instances, distributes traffic across the healthy ones, and scales the number of instances up or down based on traffic.



What’s next for Internal Load Balancing

We have a number of exciting Internal Load Balancing features coming soon, including service discovery using DNS, load balancing traffic from on-prem clients across a VPN to backends behind an internal load balancer, and support for regional instance groups.



Until then, we hope you will give Internal Load Balancing a spin. Start with the tutorial, read the documentation and deploy it on GCP. We look forward to your feedback!





Google Cloud’s guiding philosophy is to enable what’s next, and gaming is one industry that’s constantly pushing what’s possible with technical innovation. At Google, we are no stranger to these advancements, from AlphaGo’s machine learning breakthrough to Pokemon GO’s achievements in scaling and mapping on GCP.



We are always seeking new partners who share our enthusiasm for innovation, and today we are announcing a partnership with Improbable, a company focused on building large-scale, complex online worlds through their distributed operating system, SpatialOS. As part of the partnership, Improbable is launching the SpatialOS Games Innovation Program, which provides game developers with credits to access Improbable’s technology powered by GCP and the freedom to get creative and experiment with what’s possible up until they launch the game. Today, game developers can join the SpatialOS open alpha, and start to prototype, test and deploy games to the cloud. The program will fully launch in Q1 2017, along with the SpatialOS beta.



SpatialOS allows game developers to create simulations of great scale (a single, highly detailed world can span hundreds of square miles), great complexity (millions of entities governed by realistic physics) and huge populations (thousands of players sharing the same world). These exciting new games are possible with SpatialOS plus the scalability, reliability and openness of GCP, including the use of Google Cloud Datastore’s fully managed NoSQL database and Google Compute Engine’s internal network, instance uptime, live migration and provisioning speed.







Bossa Studios is already using SpatialOS and GCP to build Worlds Adrift, a 3D massively multiplayer game set to launch in early 2017. In Worlds Adrift, thousands of players share a single world of floating islands that currently cover more than 1000km². Players form alliances, build sky-ships and become scavengers, explorers, heroes or pirates in an open, interactive world. They can steal ships and scavenge wrecks while the islands’ flora and fauna can flourish and decline over time.




A collision of two fully customized ships flying through the procedurally generated and persistent universe of Worlds Adrift. Read about the game’s origin story and technical details of its physics.



We see many opportunities for GCP to support developers building next-generation games and look forward to what game studios large and small will create out of our partnership with Improbable. To join the SpatialOS open alpha or learn more about the developer program visit SpatialOS.com.










Google Cloud Platform offers a range of services and APIs supported by an impressive backend infrastructure. But to benefit from the power and capabilities of our APIs, you as a developer also need a great client-side experience: client libraries you’ll actually want to use, that are well documented, and that are easy to access.



That’s why we are announcing today the beta release of the new Google Cloud Client Libraries for four of our cloud services: BigQuery, Google Cloud Datastore, Stackdriver Logging, and Google Cloud Storage. These libraries are idiomatic, well-documented, open-source, and cover seven server-side languages: C#, Go, Java, Node.js, PHP, Python, and Ruby. Most importantly, this new family of libraries is for GCP specifically and provides a consistent experience as you use each of these four services.


Finding client libraries fast


We want to make it easy for you to discover client libraries on cloud.google.com, so we updated our product documentation pages with a prominent client library section for each of these four products. Here’s what you can see in the left-hand navigation bar of the BigQuery documentation APIs & Reference section:









Click on the Client Libraries link to see the new Client Libraries page and select the language of your choice to learn how to install the library:







Right underneath the installation section, there’s a sample that shows how to make an API call. Set up auth using a single command, copy-paste the sample code and replace your variables, and you’ll be up and running in no time.









Lower in the page, you can find the links to access the library’s GitHub repo, ask a question on StackOverflow, or navigate to the client library reference for your specific language:








Client libraries you’ll want to use


The new Google Cloud Client Libraries were built with usability in mind from day one. We strive to make the libraries idiomatic and include the usage patterns you expect from your programming language, so you feel right at home when you code against them.



They also include plenty of samples. Each client library reference now includes a code example for every language and every API method, showing you how to work with the API and best practices. For instance, the Node.js client library reference for BigQuery displays the following code with the createDataset method:
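
(The Node.js snippet itself lives in the reference documentation.) As a rough illustration of the same operation in another supported language, here's a minimal Python sketch using the google-cloud-bigquery package; the API surface shown is the current one and may differ slightly from the beta described in this post:

from google.cloud import bigquery

# Credentials are picked up from the environment (for example, gcloud auth).
client = bigquery.Client(project="your-project-id")

# Roughly equivalent to the createDataset call in the Node.js reference.
dataset = client.create_dataset("my_new_dataset")
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))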





Furthermore, the product documentation on cloud.google.com for each of the four APIs contains many how-to guides with targeted samples for all our supported languages. For example, here is the code for learning how to stream data into BigQuery:
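
The guide shows that snippet in each supported language; as a hedged Python sketch of a streaming insert with google-cloud-bigquery (method names from the current package, which may differ from the beta), it looks roughly like this:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# The destination table must already exist with a matching schema.
table_id = "your-project-id.my_new_dataset.events"
rows = [
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "logout"},
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print("Streaming insert reported errors: {}".format(errors))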














Next steps


This is just the beginning for Google Cloud Client Libraries. Our first job is to make the libraries for these four APIs generally available. We’ll also add support for more APIs, improve our documentation across the board, and keep adding more samples.



We take developer experience seriously and want to hear from you. Feel free to file issues on GitHub in one of our client library repositories or ask questions on StackOverflow.



Happy Coding!