Last week on CRE life lessons, we discussed how to come up with a precise numerical target for system availability. We term this target the Service Level Objective (SLO) of our system. Any discussion we have in future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.



We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load balancing between the two.
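
To make this concrete, here's a minimal sketch in Node.js of checking a probe-based SLI against an SLO target. The probe counts and the 99.9% objective below are hypothetical; in practice, the counts would come from your monitoring system.

// Sketch: compare a probe-based SLI against an SLO target.
// Counts are hypothetical; real numbers come from monitoring.
const probesSucceeded = 100423;
const probesTotal = 100515;
const sloTarget = 0.999; // 99.9% availability objective

const sli = probesSucceeded / probesTotal; // the indicator
console.log(`SLI: ${(sli * 100).toFixed(3)}%`);
console.log(sli >= sloTarget
  ? 'Within SLO.'
  : 'Out of SLO: consider reliability work, such as a second instance.');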




Why have an SLO at all?


Suppose that we decide that running our aforementioned Shakespeare service against a formally defined SLO is too rigid for our tastes; we decide to throw the SLO out of the window and make the service “as available as is reasonable.” This makes things easier, no? You simply don’t mind if the system goes down for an hour now and then. Indeed, perhaps downtime is normal during a new release and the attending stop-and-restart.



Unfortunately for you, customers don't know that. All they see is that Shakespeare searches that were previously succeeding have suddenly started to return errors. They raise a high-priority ticket with support, who confirm the elevated error rate and escalate to you. Your on-call engineer investigates, confirms this is a known issue, and responds to the customer with "this happens now and again, you don't have to escalate." Without an SLO, your team has no principled way of saying what level of downtime is acceptable; there's no way to measure whether or not this is a significant issue with the service, and you cannot terminate the escalation early with "the Shakespeare search service is currently operating within SLO." As our colleague Perry Lorier likes to say, "if you have no SLOs, toil is your job."




The SLO you run at becomes the SLO everyone expects




A common pattern is to start your system off at a low SLO, because that's easy to meet: you don't want to run a 24/7 rotation, and your initial customers are OK with a few hours of downtime, so you target at least 99% availability, which allows up to 1.68 hours of downtime per week. But in fact, your system is fairly resilient and for six months operates at 99.99% availability, down for only a few minutes per month.



But then one week, something breaks in your system and it's down for a few hours. All hell breaks loose. Customers page your on-call engineers, complaining that your system has been returning 500s for hours. These pages go unnoticed because on-call leaves their pagers on their desks overnight; after all, your SLO only specifies support during office hours.



The problem is, customers have become accustomed to your service being always available. They've started to build it into their business systems on the assumption that it's always available. When it's been continually available for six months and then goes down for a few hours, something is clearly seriously wrong. Your excessive availability has become a problem because now it's the expectation. Thus the expression, "An SLO is a target from above and below": don't make your system very reliable if you don't intend to commit to it being that reliable.



Within Google, we implement periodic downtime in some services to prevent a service from being overly available. In the SRE Book, our colleague Marc Alvidrez tells a story about our internal lock service, Chubby. Then there's the set of test front-end servers for internal services to use in testing, allowing those services to be accessible externally. These front-end servers are convenient but are explicitly not intended for use by real services; they have a one-business-day support SLA, and so can be down for 48 hours before the support team is even obligated to think about fixing them. Over time, experimental services that used those front-ends started to become critical; when we finally had a few hours of downtime on the front-ends, it caused widespread consternation.



Now we run a quarterly planned-downtime exercise with these front-ends. The front-end owners send out a warning, then block all services on the front-ends except for a small whitelist. They keep this up for several hours, or until a major problem with the blockage appears; the blockage can be quickly reversed in that case. At the end of the exercise the front-end owners receive a list of services that use the front-ends inappropriately, and work with the service owners to move them to somewhere more suitable. This downtime exercise keeps the front-end availability suitably low, and detects inappropriate dependencies in time to get them fixed.




Your SLA is not your SLO




At Google, we distinguish between a Service-Level Agreement (SLA) and a Service-Level Objective (SLO). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they'll push hard to keep it within SLA.



Because of this, and because of the principle that availability shouldn't be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the SLO.



For example, with our Shakespeare search service, we might decide to provide it as an API to paying customers, where a customer pays us $10K per month for the right to send up to one million searches per day. Now that money is involved, we need to specify in the contract how available they can expect the service to be, and what happens if we breach that agreement. We might say that we'll provide the service at a minimum of 99% availability, following the definition of successful queries given previously. If the service drops below 99% availability in a month, we'll refund $2K; if it drops below 80%, we'll refund $5K.
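
As a sketch of how those contract terms might be encoded (the refund tiers mirror the example above; the function itself is ours, for illustration only):

// Hypothetical refund schedule for the Shakespeare API SLA:
// below 99% monthly availability -> $2K refund; below 80% -> $5K.
function monthlyRefundUSD(availability) {
  if (availability < 0.80) return 5000;
  if (availability < 0.99) return 2000;
  return 0;
}

console.log(monthlyRefundUSD(0.995)); // 0: within SLA
console.log(monthlyRefundUSD(0.95));  // 2000: breached the 99% tier
console.log(monthlyRefundUSD(0.75));  // 5000: breached the 80% tier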



If you have an SLA that's different from your SLO, as it almost always is, it's important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system's availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You'll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries (we might not mind dropping queries from non-paying users if we have to start load shedding, but we really care about any query from the paying customer that we fail to handle properly). That's another benefit of establishing an SLA: it's an unambiguous way to prioritize traffic.



When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, suppose that you give each of three major customers (whose traffic dominates your service) a quota of one million queries per day. One of your customers releases a buggy version of their mobile client, and issues two million queries per day for two days before they revert the change. Over a 30-day period you’ve issued approximately 90 million good responses, and two million errors; that gives you a 97.8% success rate. You probably don’t want to give all your customers a refund as a result of this; two customers had all their queries succeed, and the customer for whom two million out of 32 million queries were rejected brought this upon themselves. So perhaps you should exclude all “out of quota” response codes from your SLA accounting.
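
One way to express that exclusion, sketched below with hypothetical response categories: treat "out of quota" rejections as neither successes nor failures when computing the SLA success rate.

// Sketch: SLA accounting that excludes out-of-quota rejections.
// Counts mirror the example above: 90M good responses, 2M rejections
// caused by a customer exceeding its quota.
const counts = {
  ok: 90e6,         // successful responses: count toward the SLA
  serverError: 0,   // our failures: count against the SLA
  outOfQuota: 2e6,  // customer exceeded quota: excluded entirely
};

const slaEligible = counts.ok + counts.serverError;
const successRate = counts.ok / slaEligible;
console.log(`SLA success rate: ${(successRate * 100).toFixed(1)}%`); // 100.0%
// Without the exclusion: 90e6 / 92e6 = 97.8%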



On the other hand, suppose you accidentally push an empty quota specification file to your service before going home for the evening. All customers receive a default 1000 queries per day quota. Your three top customers get served constant "out of quota" errors for 12 hours until you notice the problem when you come into work in the morning, and revert the change. You're now showing 1.5 million rejected queries out of 90 million for the month, a 98.3% success rate. This is all your fault: excluding these "out of quota" errors and reporting 100% success for the remaining 88.5 million queries would miss the point of the SLA entirely.




Conclusion




SLIs, SLOs and SLAs aren’t just useful abstractions. Without them you cannot know if your system is reliable, available, or even useful. If they don’t tie explicitly back to your business objectives then you have no idea if the choices you make are helping or hurting your business. You also can’t make honest promises to your customers.



If you’re building a system from scratch, make sure that SLIs, SLOs and SLAs are part of your system requirements. If you already have a production system but don’t have them clearly defined then that’s your highest priority work.



To summarize:


  • If you want to have a reliable service, you must first define “reliability.” In most cases that actually translates to availability.

  • If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries; these will form the basis of your SLIs.

  • The more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with, and state that as your Service Level Objective (SLO).

  • Without an SLO, your team and your stakeholders cannot make principled judgements about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development).

  • If you’re charging your customers money you'll probably need an SLA, and it should be a little bit looser than your SLO.




As an SRE (or DevOps professional), it's your responsibility to understand how your systems serve the business in meeting those objectives, and, as much as possible, to control for risks that threaten the high-level objective. Any measure of system availability that ignores business objectives is worse than worthless, because it obscures the actual availability and leads to all sorts of dangerous scenarios, false senses of security and failure.



For those of you who wrote us thoughtful comments and questions from our last article, we hope this post has been helpful. Keep the feedback coming!



N.B. Google Cloud Next '17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai, and other luminaries for three days of keynotes, code labs, certification programs, and over 200 technical sessions. And for the first time ever, Next '17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.






[Editor’s note: Today we hear from Agosto, a Google Cloud Premier Partner that has been building products and delivering services on Google Cloud Platform (GCP) since 2012, including Internet of Things applications. Read on to learn about Agosto’s work to build an MQTT service broker for Google Cloud Pub/Sub, and how you can incorporate it into your own IoT applications.]



One of our key practice areas is Internet of Things (IoT). Using the many components of GCP, we’ve helped customers rapidly move their ideas from product concept to launch.



Along the way, we evaluated several IoT platforms and repeatedly came to the conclusion that we’d be better off staying on the GCP stack than a single IoT platform with costly licensing hooks and closed-source practices. Our clients also like being able to build scalable, functional prototypes using pre-existing and standard reference architectures and tools.



One of the many challenges we faced along the way was picking an efficient transport for two-way messaging between “things” and GCP. After evaluating a number of emerging and mature protocols, we settled on Message Queuing Telemetry Transport (MQTT). Originated in 1999 by Andy Stanford-Clark and Arlen Nipper, MQTT is now an ISO standard; it's lightweight, has solid documentation and has tens of thousands of production deployments. Furthermore, many existing pre-IoT or “machine to machine” projects already use MQTT as their transport from embedded device to the back office. With MQTT, we've been able to increase velocity and reduce complexity for our IoT products and services.



MQTT is a great transport protocol, but it can be challenging to manage at scale, particularly when it comes to scaling message storage and delivery systems. As one of the earliest Google partners to develop a set of reusable tools, reference architectures and methods for accelerating IoT products to market, we’ve been impressed with Google Cloud Pub/Sub, a durable, low-latency and scalable service for handling many-to-many asynchronous messaging. But Cloud Pub/Sub uses HTTPS to transfer data. Over numerous small requests, all those HTTP headers add up to a lot of extra data: a no-go when you’re dealing with a constrained device that communicates over a mobile network, and where you pay for each byte in mobile data charges, battery usage, or both.



We needed to bridge the gap between IoT-connected devices and Cloud Pub/Sub, and began investigating ways to connect MQTT to Cloud Pub/Sub using and extending RabbitMQ.



After initial load tests showed this approach was viable, Google asked Agosto to develop an open-source, highly performant MQTT connection broker that integrates with Cloud Pub/Sub. With low network overhead (Agosto has seen up to 10x less compared to HTTPS in scenarios we've tested) and high throughput, MQTT is a natural fit for many scenarios.



The resulting message broker integrates messaging between connected devices using an MQTT client and Cloud Pub/Sub; RabbitMQ performs the protocol conversion for two-way messaging between the device and Cloud Pub/Sub. This means administrators of the RabbitMQ compute infrastructure don't have to concern themselves with managing the durability of the data, or scaling storage.
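
For a sense of what the device side looks like, here's a minimal publish using the open-source mqtt.js client. The broker host, topic and payload below are placeholders for illustration; consult the gcp-iot-adapter documentation for the actual connection details.

// Sketch: publish one telemetry reading to an MQTT broker that
// bridges to Cloud Pub/Sub. Host and topic are placeholders.
const mqtt = require('mqtt');

const client = mqtt.connect('mqtt://broker.example.com:1883');

client.on('connect', () => {
  const payload = JSON.stringify({ deviceId: 'sensor-42', tempC: 21.5 });
  client.publish('devices/sensor-42/telemetry', payload, () => client.end());
});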



Our message broker can support both small and very large GCP projects. For example, with smaller projects and IoT prototypes, you can rapidly deploy a single node of Agosto’s MQTT to Pub/Sub Connection Broker supporting up to 120,000 messages per minute for as little as $25/month for the compute costs. Larger production deployments with load-balanced brokers can support millions of concurrent connections and much higher throughput.



Download the broker, follow the instructions and learn more about leveraging MQTT and GCP for your IoT project.

GitHub: https://github.com/Agosto/gcp-iot-adapter



And if you're looking for a more customized implementation of our MQTT to Pub/Sub Connection broker, visit our website to learn more about our offerings.




Eclipse is one of the most popular IDEs for Java developers. Today, we're launching the beta version of Cloud Tools for Eclipse, a plugin that extends Eclipse to Google Cloud Platform (GCP). Based on Google Cloud SDK, the initial feature set targets App Engine standard environment, including support for creating applications, running and debugging them inside the IDE with the Eclipse Web Tools Platform tooling and deploying them to production.



You may be wondering how this plugin relates to the Google Plugin for Eclipse, which was launched in 2009. The older plugin is focused on a broader set of technologies than just GCP. Moreover, support for the Eclipse Web Tools Platform and Maven is spotty at best. Moving forward, we'll invest in building more cloud-related tooling in Cloud Tools for Eclipse.



Cloud Tools for Eclipse is available for Eclipse 4.5 (Mars) and Eclipse 4.6 (Neon) and can be installed through the Eclipse Update Manager. The plugin source code is available on GitHub, and we welcome contributions and reports of issues from the community.



First, install the Cloud Tools for Eclipse plugin. To verify that the plugin has installed correctly, launch Eclipse and look at the bottom right-hand side of the window -- you should see a Google “G” icon. Click on this icon to log in to your Google account.



Now we'll demonstrate how to create and deploy a simple Maven-based "Hello World" App Engine standard environment application. First, create a new App Engine project from Cloud Console. (If this is your first time using GCP, we recommend signing up for our Free Trial first.) When you see this card, click Create a project:



You should then land on the following cards:



Every GCP project has a unique project ID. You’ll need this string later, so let’s grab that. On the left hand nav, click on Home and copy the project ID as shown below.





Now that you have an App Engine project, you're ready to deploy a simple Hello World application. Open Eclipse and click on File > New > Project and type “Maven-based Google” in the Wizards section, then select the following:



Fill in the Maven group ID and artifact ID and click Next:



In the next page, select the Hello World template and click Finish.



Now, right-click on your project in the Project Explorer and select Run As > App Engine. You should shortly see your application running locally on localhost. In the output terminal in Eclipse, the correct URL is hyperlinked.



Once you've finished running the application locally, you can deploy it to the cloud. Right-click on your application in the Eclipse Project Explorer and select Deploy to App Engine Standard. You'll see the following dialog if you're logging in for the first time. Click on the Account drop-down and proceed with the web browser UI to link the plugin for your GCP Account.



Once signed in, enter the Project ID of the application you created in Cloud Console and leave the rest as is. This is the ID you wrote down earlier.



Click Deploy to upload the finished project to App Engine. Status updates appear in the Eclipse console as files are uploaded. When the deployment finishes, the URL of the deployed application is shown in the Eclipse console. That’s it!



You can check the status of your application in the Cloud Console by heading to the App Engine tab and clicking on Instances to see the underlying infrastructure of your application.



We'll continue to add support for more GCP services to the plugin, so stay tuned for update notifications in the IDE. If you have specific feature requests, please submit them in the GitHub issue tracker.



To learn more about Java on GCP, visit the GCP Java developers portal, where you can find all the information you need to run your Java applications on GCP.



Happy Coding!



P.S. IntelliJ users, see here for the Cloud Tools for IntelliJ plugin.





In our last installment of the CRE life lessons series, we discussed how to survive a "success disaster" with load-shedding techniques. We got a lot of great feedback from that post, including several questions about how to tie measurements to business objectives. So, in this post, we decided to go back to first principles, and investigate what “success” means in the first place, and how to know if your system is “succeeding” at all.



A prerequisite to success is availability. A system that's unavailable cannot perform its function and will fail by default. But what is "availability"? We must define our terms:



Availability defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future. Sometimes availability is measured by using a count of requests rather than time directly. In either case, the structure of the formula is the same: successful units / total units. For example, you might measure uptime / (uptime + downtime), or successful requests / (successful requests + failed requests). Regardless of the particular unit used, the result is a percentage like 99.9% or 99.999%, sometimes referred to as “three nines” or “five nines.”
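
Expressed as code, both forms reduce to the same "successful units / total units" structure; here's a sketch with made-up inputs:

// Sketch: the two common availability formulas, hypothetical inputs.

// Time-based: uptime / (uptime + downtime), in minutes over 30 days
const uptimeMin = 43156.8;
const downtimeMin = 43.2;
console.log(uptimeMin / (uptimeMin + downtimeMin)); // 0.999, "three nines"

// Request-based: successes / (successes + failures)
const okRequests = 999000;
const failedRequests = 1000;
console.log(okRequests / (okRequests + failedRequests)); // 0.999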



Achieving high availability is best approached by focusing on the unsuccessful component (e.g., downtime or failed requests). Taking a time-based availability metric as an example: given a fixed period of time (e.g., 30 days, 43200 minutes) and an availability target of 99.9% (three nines), simple arithmetic shows that the system must not be down for more than 43.2 minutes over the 30 days. This 43.2 minute figure provides a very concrete target to plan around, and is often referred to as the error budget. If you exceed 43.2 minutes of downtime over 30 days, you'll not meet your availability goal.



Two further concepts are often used to help understand and plan the error budget:



Mean Time Between Failures (MTBF): total uptime / # of failures. This is the average time between failures.



Mean Time to Repair (MTTR): total downtime / # of failures. This is the average time taken to recover from a failure.



These metrics can be computed historically (e.g., over the past three months, or the past year) and combined as (Total Period / MTBF) * MTTR to give an expected downtime value. Continuing with the above example, if the historical MTBF is calculated to be 10 days, and the historical MTTR is calculated to be 20 minutes, then you would expect to see 60 minutes of downtime ((30 days / 10 days) * 20 minutes), clearly outside the 43.2-minute error budget for a three-nines availability target. To meet the target would require increasing the MTBF (say, to a failure every 20 days) or decreasing the MTTR (say, to 10 minutes), or a combination of both.
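
The arithmetic in the last two paragraphs is easy to script; here's a sketch using the same example figures:

// Sketch: error budget vs. expected downtime, using the figures
// above (30-day period, 99.9% target, MTBF 10 days, MTTR 20 min).
const periodDays = 30;
const periodMinutes = periodDays * 24 * 60;             // 43200
const sloTarget = 0.999;
const errorBudgetMin = periodMinutes * (1 - sloTarget); // 43.2 minutes

const mtbfDays = 10;    // historical mean time between failures
const mttrMinutes = 20; // historical mean time to repair
const expectedDowntimeMin = (periodDays / mtbfDays) * mttrMinutes; // 60

console.log(`Error budget: ${errorBudgetMin.toFixed(1)} min`);
console.log(`Expected downtime: ${expectedDowntimeMin} min`);
console.log(expectedDowntimeMin > errorBudgetMin
  ? 'Over budget: increase MTBF and/or decrease MTTR.'
  : 'Within budget.');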



Keeping the concepts of error budget, MTBF and MTTR in mind when defining an availability target helps to provide justification for why the target is set where it is. Rather than simply describing the target as a fixed number of nines, it's possible to relate the numeric target to the user experience in terms of total allowable downtime, frequency and duration of failure.



Next, we'll look at how to ensure this focus on user experience is maintained when measuring availability.



Measuring availability




How do you know whether a system is available? Consider a fictitious "Shakespeare" service, which allows users to find mentions of a particular word or phrase in Shakespeare’s texts. This is a canonical example, used frequently within Google for training purposes, and mentioned throughout the SRE book.



Let's try working the scientific method to determine the availability of the hypothetical Shakespeare system.


  1. Question: how often is the system available?

  2. Observation: when you visit shakespeare.com, you normally get back the "200 OK" status code and an HTML blob. Very rarely, you see a 500 Internal Server error or a connection failure.

  3. Hypothesis: if "availability" is the percentage of requests per day that return 200 OK, the system will be 99.9% available.

  4. Measure: "tail" the response logs of the Shakespeare service’s web servers and dump them into a logs-processing system.

  5. Analyze: take a daily availability measurement as the percentage of 200 OK responses vs. the total number of requests.

  6. Interpret: after seven days, the minimum daily availability observed is 99.7%.
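
Here's a sketch of the analysis step, with hypothetical log-derived counts:

// Sketch: daily availability as the fraction of 200 OK responses,
// per the analysis step above. Counts are hypothetical.
const dailyCounts = [
  { day: '2017-01-09', ok: 86390, total: 86400 },
  { day: '2017-01-10', ok: 997, total: 1000 },
];

for (const { day, ok, total } of dailyCounts) {
  console.log(`${day}: ${((ok / total) * 100).toFixed(1)}% available`);
}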




Happily, you report these availability numbers to your boss (Dave), and go home. A job well done.



The next day Dave draws your attention to the support forum. Users are complaining that all their searches at shakespeare.com return no results. Dave asks why the availability dashboard shows 99.7% availability for the last day, when there clearly is a problem.



You check the logs and notice that the web server has received just 1000 requests in the last 24 hours, and they're all 200 OKs except for three 500s. Given that you expect at least 100 queries per second, that explains why users are complaining in the forums, although the dashboard looks fine.



You've made the classic mistake of basing your definition of availability on a measurement that does not match user expectations or business objectives.




Redefining availability in terms of the user experience with black-box monitoring




After fixing the critical issue (a typo in a configuration file) that prevented the Shakespeare frontend service from reaching the backend, we take a step back to think about what it means for our system to be available.



If the "rate of 200 OK logs for shakespeare.com" is not an appropriate availability measurement, then how should we measure availability?



Dave wants to understand the availability as observed by users. When does the user feel that shakespeare.com is available? After some lively back-and-forth, we agree that the system is available when a user can visit shakespeare.com, enter a query and get a result for that query within five seconds, 100% of the time.



So you write a black-box "prober" (black-box because it makes no assumptions about the implementation of the Shakespeare service; see the SRE Book, Chapter 6) to emulate a full range of client devices (mobile, desktop). For each type of client, you visit shakespeare.com, enter the query "to be or not to be," and check that the result contains the expected link to Hamlet. You run the prober for a week, then recalculate the minimum daily availability measure: 80% of queries return Hamlet within five seconds, 18% of queries take longer, 1% time out and another 1% return errors. A full 20% of queries fail our definition of availability!
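
A heavily simplified sketch of one such probe follows. A real prober emulates many client types and runs continuously; the URL and five-second budget mirror the agreed definition, and the result-checking logic here is illustrative.

// Sketch: a single black-box probe of the availability definition:
// "enter a query and get a result within five seconds."
const https = require('https');

function probe() {
  const started = Date.now();
  const url = 'https://shakespeare.com/search?q=to+be+or+not+to+be';
  https.get(url, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => {
      const elapsedMs = Date.now() - started;
      const success = res.statusCode === 200 &&
          body.includes('Hamlet') &&
          elapsedMs <= 5000;
      console.log(success ? 'probe ok' : 'probe failed', `${elapsedMs}ms`);
    });
  }).on('error', () => console.log('probe failed: connection error'));
}

probe();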




Choosing an availability target according to business goals




After getting over his shock, Dave asks a simple question: “Why can't we have 100% returning within 5 seconds?”



You explain all the usual reasons why: power outages, fiber cuts, etc. After an hour or so, Dave is willing to admit that 100% query response in under five seconds is truly impossible.



Which leads Dave to ask, “What availability can we have, then?”



You turn the question around on him: “What availability is required for us to meet our business goals?”



Dave's eyes light up. The business has set a revenue target of $25 million per year, and we make on average $0.01 per query result. At 100 queries per second * 31,536,000 seconds per year * 80% success rate * $0.01 per query, we'll earn $25.23 million. In other words, even with a 20% failure rate, we'll still hit our revenue targets!



Still, a 20% failure rate is pretty ugly. Even if we think we'll meet our revenue targets, it's not a good user experience and we might have some attrition as a result. Should we fix it, and if so, what should our availability objective be?




Evaluating cost/benefit tradeoffs, opportunity costs




Suppose the rate of queries returning in greater than five seconds can be reduced to 0.5% if an engineer works on the problem for six months. How should we decide whether or not to do this?



We can start by estimating how much the 20% failure rate is going to cost us in missed revenue (accounting for users who give up on retrying) over the life of the product. We know roughly how much it will cost to fix the problem. Naively, we may decide that since the revenue lost due to the error rate exceeds the cost of fixing the issue, then we should fix it.



But this ignores a crucial factor: the opportunity cost of fixing the problem. What other things could an engineer have done with that time instead?



Hypothetically, there’s a new search algorithm that increases the relevance of Shakespeare search results, and putting it into production might drive a 20% increase in search traffic, even as availability remains constant. This increase in traffic could easily offset any lost revenue due to poor availability.



An oft-heard SRE saying is that you should “design a system to be as available as is required, but not much more.” At Google, when designing a system, we generally target a given availability figure (e.g., 99.9%), rather than particular MTBF or MTTR figures. Once we’ve achieved that availability metric, we optimize our operations for "fast fix," e.g., MTTR over MTBF, accepting that failure is inevitable, and that “spes consilium non est” (Hope is not a strategy). SREs are often able to mitigate the user visible impact of huge problems in minutes, allowing our engineering teams to achieve high development velocity, while simultaneously earning Google a reputation for great availability.



Ultimately, the tradeoff made between availability and development velocity belongs to the business. Precisely defining availability in product terms allows us to have a principled discussion and to make choices we can be proud of.








Google Cloud uses the open-source KVM hypervisor that has been validated by scores of researchers as the foundation of Google Compute Engine and Google Container Engine, and invests in additional security hardening and protection based on our research and testing experience. Then we contribute back our changes to the KVM project, benefiting the overall open-source community.



What follows is a list of the main ways we security-harden KVM to help improve the safety and security of your applications.




  1. Proactive vulnerability search: There are multiple layers of security and isolation built into Google’s KVM (Kernel-based Virtual Machine), and we’re always working to strengthen them. Google’s cloud security staff includes some of the world’s foremost experts in KVM security, and has uncovered multiple vulnerabilities in KVM, Xen and VMware hypervisors over the years. The Google team has historically found and fixed nine vulnerabilities in KVM. During the same time period, the open source community discovered zero vulnerabilities in KVM that impacted Google Cloud Platform (GCP).



  2. Reduced attack surface area: Google has helped to improve KVM security by removing unused components (e.g., a legacy mouse driver and interrupt controllers) and limiting the set of emulated instructions. This presents a reduced attack and patch surface area for potential adversaries to exploit. We also modify the remaining components for enhanced security.



  3. Non-QEMU implementation: Google does not use QEMU, the user-space virtual machine monitor and hardware emulator. Instead, we wrote our own user-space virtual machine monitor, which has the following security advantages over QEMU:



    Simple host and guest architecture support matrix. QEMU supports a large matrix of host and guest architectures, along with different modes and devices that significantly increase complexity. Because we support a single architecture and a relatively small number of devices, our emulator is much simpler. We don’t currently support cross-architecture host/guest combinations, which helps avoid additional complexity and potential exploits.

    Strong emphasis on simplicity and testability. Google’s virtual machine monitor is composed of individual components, and unit testing leads to fewer bugs in a complex system. QEMU code lacks unit tests and has many interdependencies that would make unit testing extremely difficult.

    No history of security problems. QEMU, by contrast, has a long track record of security bugs, such as VENOM, and it's unclear what vulnerabilities may still be lurking in the code.



  4. Boot and Jobs communication: The code provenance processes that we implement help ensure that machines boot to a known good state. Each KVM host uses a peer-to-peer cryptographic key-sharing system with the jobs running on that host, helping to make sure that all communication between jobs running on the host is explicitly authenticated and authorized.



  5. Code Provenance: We run a custom binary and configuration verification system that was developed and integrated with our development processes to track what source code is running in KVM, how it was built, how it was configured and how it was deployed. We verify code integrity at every level, from the boot-loader to KVM to the customers’ guest VMs.



  6. Rapid and graceful vulnerability response: We've defined strict internal SLAs and processes to patch KVM in the event of a critical security vulnerability. However, in the three years since we released Compute Engine in beta, our KVM implementation has required zero critical security patches. Non-KVM vulnerabilities are rapidly patched through Google's internal infrastructure to help maximize security protection and meet all applicable compliance requirements, and are typically resolved without impact to customers. We notify customers of updates as required by contractual and legal obligations.



  7. Carefully controlled releases: We have stringent rollout policies and processes for KVM updates driven by compliance requirements and Google Cloud security controls. Only a small team of Google employees has access to the KVM build system and release management control.




There’s a lot more to learn about KVM security at Google. Click the links below for more information.






And of course, KVM is just one infrastructure component used to build Google Cloud. We take security very seriously, and hope you’ll entrust your workloads to us.  


FAQ


Should I worry about side channel attacks?



We rarely see attempted side channel attacks: attacks based on information gained from the physical implementation of a cryptosystem (such as timing or memory access patterns), rather than on brute force or theoretical weaknesses in the algorithms. A large shared infrastructure the size of Compute Engine makes such attacks very impractical. To mount one, the target VM and the attacker VM have to be colocated on the same physical host, and for any practical attack the attacker has to have some ability to induce execution of the cryptosystem being targeted. One common use for side channel attacks is against cryptographic keys. Side channel attacks that leak information are usually addressed quickly by cryptographic library developers, so we recommend that Google Cloud customers ensure that their cryptographic libraries are supported and always up to date.



What about Venom? 



Venom affects QEMU. Compute Engine and Container Engine are unaffected because neither uses QEMU.



What about Rowhammer? 



The Google Project Zero team led the way in discovering practical Rowhammer attacks against client platforms. Google production machines use double refresh rate to reduce errors, and ECC RAM that detects and corrects Rowhammer-induced errors. 1-bit errors are automatically corrected, and 2-bit errors are detected and cause any potentially offending guest VMs to be terminated. Alerts are generated for any projects that cause an unusual number of Rowhammer errors. Undetectable 3-bit errors are theoretically possible, but extremely improbable. A Rowhammer attack would cause a very large number of alerts for 2-bit and 3-bit errors and would be detected.



A recent paper describes a way to mount a Rowhammer attack using a KSM KVM module. KSM, the Linux implementation of memory de-duplication, uses a kernel thread that periodically scans memory to find memory pages with the same contents mapped from multiple VMs that are candidates for merging. Memory “de-duping” with KSM can help to locate the area to “hammer” the physical transistors underlying those bits of data, and can target the identical bits on someone else’s VM running on the same physical host. Compute Engine and Container Engine are not vulnerable to this kind of attack, since they do not use KSM. However, if a similar attack is attempted via a different mechanism, we have mitigations in place to detect it.



What is Google doing to reduce the impact of KVM vulnerabilities? 



We have evaluated the sources of vulnerabilities discovered to date within KVM. Most of the vulnerabilities have been in code areas that remain in the kernel for historical reasons, but that can now be removed without a significant performance impact when running modern operating systems on modern hardware. We’re working on relocating this in-kernel emulation functionality outside of the kernel.



How does the Google security team identify KVM vulnerabilities in their early stage? 



We have built an extensive set of proprietary fuzzing tools for KVM. We also do a thorough code review looking specifically for security issues each time we adopt a new feature or version of KVM. As a result, we've found many vulnerabilities in KVM over the past three years. About half of our discoveries come from code review and about half come from our fuzzers.





Google Cloud Platform (GCP) customers need an easy way to centrally manage and control GCP resources, projects and billing accounts that belong to their organization. As companies grow, it becomes progressively difficult to keep track of an ever-increasing number of projects, created by multiple users, with different access control policies and linked to a variety of billing instruments. Google Cloud Resource Manager allows you to group resource containers under the Organization resource, providing full visibility, centralized ownership and unified management of your company’s assets on GCP.



The Organization resource is now automatically assigned to all GCP users who have G Suite accounts, without any additional steps on their part. All you need to do is create a project within your company’s domain to unlock the Organization resource and all its benefits!



Since it was introduced in October 2016, hundreds of customers have successfully deployed Cloud Resource Manager’s Organization resource, and have provided positive feedback.

"At Qubit, we love the flexibility of GCP resource containers including Organizations and Projects. We use the Organization resource to maintain centralized visibility of our projects and GCP IAM policies to ensure consistent access controls throughout the company. This gives our developers the capabilities they need to put security at the forefront throughout our migration to the cloud."  Laurie Clark-Michalek, Infrastructure Engineer at Qubit.

Understanding the Cloud Resource Manager Organization resource

The Cloud Resource Manager Organization resource is the root of the GCP resource hierarchy and is a critical component for all enterprise use cases, from social media to financial services, from gaming to e-commerce, to name a few. Here are a few benefits offered by the Organization resource:

  • Tie ownership of GCP projects to your company, so they remain available when a user leaves the organization.

  • Allow GCP admins to define IAM policies that apply horizontally across the entire organization.

  • Provide central visibility and control over billing for effective cost allocation and reporting.

  • Enable new policies and features for improved security.



The diagram below illustrates the GCP resource hierarchy and its link with the G Suite account.

G Suite, our set of intelligent productivity apps, is currently a prerequisite to access the Cloud Resource Manager Organization resource in GCP. It represents your company by providing ownership, lifecycle control, identities and a recovery mechanism. If you don’t already have a G Suite account, you can sign up to obtain one here. (You can request a GCP account that does not require G Suite to use the Cloud Resource Manager Organization resource. For more information, contact your sales representative.)





Getting started with the Cloud Resource Manager Organization resource



Unlocking the benefits of the Cloud Resource Manager Organization resource is easy; it's automatically provisioned for your organization the first time a GCP user in your domain creates a GCP project or billing account. The Organization resource display name is automatically synchronized with your G Suite organization name and is visible in the Cloud Console UI picker, as shown in the picture below. The Organization resource is also accessible via gcloud and the Cloud Resource Manager API.

Because of the ownership and lifecycle implications explained above, the G Suite super admin is granted full control over GCP by default. Usually, different departments in an organization manage G Suite and GCP. Thus, the first and most important step for the G Suite super admin overseeing a GCP account is to identify and assign the IAM Organization Admin role to the relevant users in their domain. Once assigned, the Organization Admins can manage IAM policies, project ownership and billing centrally, via Cloud Console, gcloud or the Cloud Resource Manager API.



All new GCP projects and billing accounts will belong to the Cloud Resource Manager Organization resource by default, and it’s easy to migrate existing GCP Projects there too. Existing projects that have not migrated under the Organization resource are visible under the “No Organization” hierarchy.



How to manage your Cloud Resource Manager Organization resource with gcloud



The following script summarizes the steps to start using the Cloud Resource Manager Organization resource.



# Query your Organization ID
> gcloud organizations list
DISPLAY_NAME    ID         DIRECTORY_CUSTOMER_ID
MyOrganization  123456789  C03ryezon

# Access Organization details
> gcloud organizations describe [ORGANIZATION_ID]
creationTime: '2016-11-15T04:42:33.042Z'
displayName: MyOrganization
lifecycleState: ACTIVE
name: organizations/123456789
owner:
  directoryCustomerId: C03ryezon

# How to assign the Organization Admin role
# (requires Organization Admin or Super Admin permissions)
> gcloud organizations add-iam-policy-binding [ORGANIZATION_ID] \
  --member=[MEMBER_ID] --role=roles/resourcemanager.organizationAdmin

# How to migrate an existing project into the Organization
> gcloud alpha projects move [PROJECT_ID] --organization [ORGANIZATION_ID]

# How to list all projects in the Organization
> gcloud projects list --filter 'parent.id=[ORGANIZATION_ID] AND parent.type=organization'




What’s next



The Cloud Resource Manager Organization resource is the root of the GCP hierarchy and is key to centralized control, management and improving security. By assigning the CRM Organization resource to all G Suite users, we're setting the stage for more innovation. Stay tuned for new capabilities that rely on the Cloud Resource Manager Organization resource as they become available in 2017. And for a deep dive into the Cloud Resource Manager and the latest in GCP security, join us at a security bootcamp at Next ’17 in San Francisco this March.





IT organizations want to realize the cost and speed benefits of cloud, but can’t afford to throw away years of investment in tools, talent and governance processes they’ve built on-prem. Hybrid models of application management have emerged as a way to get the best of both worlds.



Development and test (dev/test) environments help teams create different environments to support the development, testing, staging and production of enterprise applications. Working with CloudBolt Software, we’ve prepared a full tutorial guide that describes how to quickly provision these environments in a self-service capacity, while maintaining full control over governance and policies required by enterprise IT.



CloudBolt isn’t just limited to dev/test workloads, but anything your team runs on VMs. As a cloud management platform that integrates your on-prem virtualization and private cloud resources with the public cloud, CloudBolt serves as a bridge between your existing infrastructure and Google Cloud Platform (GCP). Developers within your organization can provision the resources they need through an intuitive self-service portal, while IT maintains full control over how these provisioned environments are configured, helping them reap the cost and agility benefits of GCP using the development tools and processes they’ve built up over the years. It’s also an elegant way to rein in VM sprawl, helping organizations manage the ad-hoc environments that spring up with new projects. CloudBolt even provides a way to automatically scan and discover VMs in both on-prem and cloud environments.



Teams can get started immediately with this self-service tutorial. Or join us for our upcoming webinar featuring CloudBolt’s CTO Bernard Sanders and Google’s Product Management lead for Developer Tools on January 26th. Don’t hesitate to reach out to us to explore which enterprise workloads make the most sense for your cloud initiatives.





Google Cloud Audit Logging helps you to determine who did what, where and when on Google Cloud Platform (GCP). This fall, Cloud Audit Logging became generally available for a number of products. Today, we’re significantly expanding the set of products integrated with Cloud Audit Logging:

The above integrations are all currently in beta.



We’re also pleased to announce that audit logging for Google Cloud Dataflow, Stackdriver Debugger and Stackdriver Logging is now generally available.



Cloud Audit Logging provides log streams for each integrated product. The primary log stream is the admin activity log that contains entries for actions that modify the service, individual resources or associated metadata. Some services also generate a data access log that contains entries for actions that read metadata as well as API calls that access or modify user-provided data managed by the service. Right now only Google BigQuery generates a data access log, but that will change soon.



Interacting with audit logs in Cloud Console

You can see a high-level overview of all your audit logs on the Cloud Console Activity page. Click on any entry to display a detailed view of that event, as shown below.



By default, data access logs are not displayed in this feed. To enable them from the Filter configuration panel, select the “Data Access” field under Categories. (Please note, you also need to have the Private Logs Viewer IAM permission in order to see data access logs). You can also filter the results displayed in the feed by user, resource type and date/time.



Interacting with audit logs in Stackdriver

You can also interact with the audit logs just like any other log in the Stackdriver Logs Viewer. With Logs Viewer, you can filter or perform free text search on the logs, as well as select logs by resource type and log name (“activity” for the admin activity logs and “data_access” for the data access logs).



Here are some log entries in their JSON format, with a few important fields highlighted.

In addition to viewing your logs, you can also export them to Cloud Storage for long-term archival, to BigQuery for analysis, and/or Google Cloud Pub/Sub for integration with other tools. Check out this tutorial on how to export your BigQuery audit logs back into BigQuery to analyze your BigQuery spending over a specified period of time.

"Google Cloud Audit Logs couldn't be simpler to use; exported to BigQuery it provides us with a powerful way to monitor all our applications from one place.Darren Cibis, Shine Solutions

Partner integrations

We understand that there are many tools for log analysis out there. For that reason, we’ve partnered with companies like Splunk, Netskope, and Tenable Network Security. If you don’t see your preferred provider on our partners page, let us know and we can try to make it happen.



Alerting using Stackdriver logs-based metrics

Stackdriver Logging provides the ability to create logs-based metrics that can be monitored and used to trigger Stackdriver alerting policies. Here’s an example of how to set up your metrics and policies to generate an alert every time an IAM policy is changed.



The first step is to go to the Logs Viewer and create a filter that describes the logs for which you want to be alerted. Be sure that the scope of the filter is set correctly to search the logs corresponding to the resource in which you are interested. In this case, let’s generate an alert whenever a call to SetIamPolicy is made.
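
For example, an advanced filter along these lines matches IAM policy changes (you may want to scope it further to a particular resource type or project):

protoPayload.methodName="SetIamPolicy"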



Once you're satisfied that the filter captures the correct events, create a logs-based metric by clicking on the "Create Metric" option at the top of the screen.



Now, choose a name and description for the metric and click "Create Metric." You should then receive a confirmation that the metric was saved.

Next, select “Logs-based Metrics” from the side panel. You should see your new metric listed there under “User Defined Metrics.” Click on the dots to the right of your metric and choose "Create alert from metric."



Now, create a condition to trigger an alert if any log entries match the previously specified filter. To do that, set the threshold to "above 0" in order to catch this occurrence. Logs-based metrics count the number of entries seen per minute. With that in mind, set the duration to one minute as the duration specifies how long this per-minute rate needs to be sustained in order to trigger an alert. For example, if the duration were set to five minutes, there would have to be at least one alert per minute for a five-minute period in order to trigger the alert.



Finally, choose “Save Condition” and specify the desired notification mechanisms (e.g., email, SMS, PagerDuty, etc.). You can test the alerting policy by giving yourself a new permission via the IAM console.



Responding to audit logs using Cloud Functions



Cloud Functions is a lightweight, event-based, asynchronous compute solution that allows you to execute small, single-purpose functions in response to events such as specific log entries. Cloud functions are written in JavaScript, execute in a standard Node.js environment, and can be triggered by events from Cloud Storage or Cloud Pub/Sub. In this case, we'll trigger cloud functions when logs are exported to a Cloud Pub/Sub topic. Cloud Functions is currently in alpha; please sign up to request enablement for your project.



Let’s look at firewall rules as an example. Whenever a firewall rule is created, modified or deleted, a Compute Engine audit log entry is written. The firewall configuration information is captured in the request field of the audit log entry. The following function inspects the configuration of a new firewall rule and deletes it if that configuration is of concern (in this case, if it opens up any port besides port 22). This function could easily be extended to look at update operations as well.



/**
 * Copyright 2017 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

'use strict';

// Triggered by audit log entries exported to a Cloud Pub/Sub topic.
exports.processFirewallAuditLogs = (event) => {
  // The Pub/Sub message payload is the base64-encoded audit log entry.
  const msg = JSON.parse(Buffer.from(event.data.data, 'base64').toString());
  const logEntry = msg.protoPayload;
  if (logEntry &&
      logEntry.request &&
      logEntry.methodName === 'v1.compute.firewalls.insert') {
    let cancelFirewall = false;
    const allowed = logEntry.request.alloweds;
    if (allowed) {
      // Flag the rule if it opens any port other than 22.
      for (let key in allowed) {
        const entry = allowed[key];
        for (let port in entry.ports) {
          if (parseInt(entry.ports[port], 10) !== 22) {
            cancelFirewall = true;
            break;
          }
        }
      }
    }
    if (cancelFirewall) {
      // Delete the offending firewall rule by its short name.
      const resourceArray = logEntry.resourceName.split('/');
      const resourceName = resourceArray[resourceArray.length - 1];
      const compute = require('@google-cloud/compute')();
      return compute.firewall(resourceName).delete();
    }
  }
  return true;
};


As the function above uses the gcloud Node.js module, be sure to include that as a dependency in the package.json file that accompanies the index.js file specifying your source code:

{
  "name": "audit-log-monitoring",
  "version": "1.0.0",
  "description": "monitor my audit logs",
  "main": "index.js",
  "dependencies": {
    "@google-cloud/compute": "^0.4.1"
  }
}


In the image below, you can see what happened to a new firewall rule (“bad-idea-firewall”) that did not meet the acceptable criteria as determined by the cloud function. It's important to note that this cloud function is not applied retroactively, so existing firewall rules that allow traffic on ports 80 and 443 are preserved.



This is just one example of many showing how you can leverage the power of Cloud Functions to respond to changes on GCP.



Conclusion



Cloud Audit Logging offers enterprises a simple way to track activity in applications built on top of GCP, and integrate logs with monitoring and logs analysis tools. To learn more and get trained on audit logging as well as the latest in GCP security, sign up for a Google Cloud Next ‘17 technical bootcamp in San Francisco this March.





Google Stackdriver Monitoring allows users to create charts and alerts on monitoring metrics gathered across their Google Cloud Platform (GCP) and Amazon Web Services environments. Stackdriver users who want to drill deeper into their monitoring data can use Cloud Datalab, an easy-to-use tool for large-scale data exploration, analysis and visualization. Based on Jupyter (formerly IPython), Cloud Datalab gives you access to a thriving ecosystem, including Google BigQuery and Google Cloud Storage, plus many statistics and machine learning packages, including TensorFlow. We include notebooks with detailed tutorials to help you get started with your Stackdriver data, and the vibrant Jupyter community is a great source for more published notebooks and tips.



Libraries from the Jupyter community open up a variety of visualization options. For example, a heatmap is a compact representation of data, often used to visually highlight patterns. With a few lines of code included in the sample notebook, Getting Started.ipynb, we can visualize utilization across different instances to look for opportunities to reduce spend.
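
To give a flavor of what that looks like, here's a minimal sketch of such a heatmap using synthetic per-instance utilization data. In Datalab, the DataFrame would be populated from a Stackdriver Monitoring query rather than random numbers, and the instance names here are made up:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic CPU utilization: one row per instance, one column per hour.
hours = pd.date_range('2017-02-01', periods=24, freq='H')
instances = ['web-1', 'web-2', 'worker-1', 'worker-2']
util = pd.DataFrame(np.random.rand(len(instances), len(hours)),
                    index=instances, columns=hours)

# Render the matrix as a heatmap; hot cells flag busy instances, while
# large cool regions suggest capacity that could be trimmed.
fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(util.values, aspect='auto', cmap='YlOrRd', vmin=0, vmax=1)
ax.set_yticks(range(len(instances)))
ax.set_yticklabels(instances)
ax.set_xlabel('hour of day')
fig.colorbar(im, label='CPU utilization')
plt.show()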



The Datalab environment also makes it possible to do advanced analytics. For example, in the included notebook, Time-shifted data.ipynb, we walk through shifting the data by day to compare today's metrics against historical data. This powerful analysis lets you identify anomalies in your system metrics at a glance by visualizing how they deviate from their historical values.
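
At its core, the technique is a time-shift plus a join. Here's a minimal pandas sketch, with a synthetic series standing in for the Stackdriver query results the notebook uses:

import pandas as pd

# Hourly CPU utilization for the past week (synthetic stand-in data).
idx = pd.date_range('2017-02-01', periods=7 * 24, freq='H')
cpu = pd.Series(range(len(idx)), index=idx, dtype=float)

# Re-label last week's samples seven days forward so they line up with
# today's timestamps, then compare the two series point by point.
week_ago = cpu.shift(7, freq='D')
both = pd.concat({'today': cpu, 'week_ago': week_ago}, axis=1)
deviation = (both['today'] - both['week_ago']).abs()  # anomaly signal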


Compare today’s CPU utilization to the weekly average by zone





Stackdriver metrics, viewed with Cloud Datalab




Get started




The first step is to sign up for a 30-day free trial of Stackdriver Premium, which can monitor workloads on GCP and AWS. It takes two minutes to set up. Next, set up Cloud Datalab, which can be easily configured to run on Docker with this Quickstart. Sample code and notebooks for exploring trends in your data, analyzing group performance and heat map visualizations are included in the Datalab container.



Let us know what you think, and we’ll do our best to address your feedback and make analysis of your monitoring data even simpler for you.






Trust in the cloud is paramount to any business that is thinking about using it to power critical applications, deliver new customer experiences and house its most sensitive data. Today, we're issuing a white paper by our security team that details how security is designed into our infrastructure from the ground up.



Google Cloud’s global infrastructure provides security through the entire information processing lifecycle. This infrastructure provides secure deployment of services, secure storage of data with end-user privacy safeguards, secure communication between services, secure and private communication with customers over the internet, and safe operation by administrators.



Google uses this infrastructure to build its internet services, including both consumer services such as Search, Gmail, and Photos, and our Google Cloud enterprise services.



The paper describes the security of this infrastructure in progressive layers starting from the physical security of our data centers, continuing on to how the hardware and software that underlie the infrastructure are secured, and finally, describing the technical constraints and processes in place to support operational security.



In a final section, the paper highlights how our public cloud infrastructure, Google Cloud Platform (GCP), benefits from the security of the underlying infrastructure. We take Google Compute Engine as an example service and describe in detail the service-specific security improvements that we build on top of the infrastructure.



For more information, please take a look at the paper: https://cloud.google.com/security/security-design



We're also pleased to announce the addition of regular, security-focused content on this blog under the Security & Identity label, which includes posts on topics like virtual machine security, identity and access management, platform integrity and the practical applications of encryption. Watch this space!





Google has long supported efforts to encrypt customer data on the internet, including using HTTPS everywhere. In the enterprise space, we're pleased to broaden the continuum of encryption options available on Google Cloud Platform (GCP) with Cloud Key Management Service (KMS), now in beta in select countries.


"With the launch of Cloud KMS, Google has addressed the full continuum of encryption and key management use cases for GCP customers. Cloud KMS fills a gap by providing customers with the ability to manage their encryption keys in a multi-tenant cloud service, without the need to maintain an on-premise key management system or HSM.” Garrett Bekker, Principal Security Analyst at 451 Research

Customers in regulated industries, such as financial services and healthcare, value hosted key management services for the ease of use and peace of mind that they provide. Cloud KMS offers a cloud-based root of trust that you can monitor and audit. As an alternative to custom-built or ad-hoc key management systems, which are difficult to scale and maintain, Cloud KMS makes it easy to keep your keys safe.



With Cloud KMS, you can manage symmetric encryption keys in a cloud-hosted solution, whether they’re used to protect data stored in GCP or in another environment. You can create, use, rotate and destroy keys via our Cloud KMS API, including as part of a secret management or envelope encryption solution. It’s directly integrated with Cloud Identity and Access Management and Cloud Audit Logging, for greater control over your keys.
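
To make the envelope encryption idea concrete, here's a hedged sketch in Python: a locally generated data encryption key (DEK) protects the payload, and Cloud KMS encrypts only the DEK. The project, key ring and key names are hypothetical and token acquisition is elided; the :encrypt method shown is part of the Cloud KMS REST API (v1-style path shown here; check the current API reference for the exact version):

import base64
import os
import requests
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Hypothetical key resource name; the key ring and key must already exist.
KEY = ('projects/my-project/locations/global/'
       'keyRings/my-ring/cryptoKeys/my-key')

def encrypt_envelope(plaintext, access_token):
    # 1. Generate a local data encryption key (DEK) and encrypt the payload.
    dek = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, plaintext, None)

    # 2. Wrap the DEK with the key encryption key held in Cloud KMS.
    resp = requests.post(
        'https://cloudkms.googleapis.com/v1/%s:encrypt' % KEY,
        headers={'Authorization': 'Bearer ' + access_token},
        json={'plaintext': base64.b64encode(dek).decode()})
    resp.raise_for_status()

    # 3. Store the wrapped DEK alongside the ciphertext; the data cannot
    #    be read again without a KMS :decrypt call to unwrap the DEK.
    return {'wrapped_dek': resp.json()['ciphertext'],
            'nonce': base64.b64encode(nonce).decode(),
            'data': base64.b64encode(ciphertext).decode()}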



Forward-thinking cloud companies must lead by example and follow best practices. For example, Ravelin, a fraud detection provider, encrypts small secrets, such as configurations and authentication credentials, that are needed as part of customer transactions, and uses separate keys to ensure that each customer's data is cryptographically isolated. Ravelin also encrypts secrets used for internal systems and automated processes.


“Google is transparent about how it does its encryption by default, and Cloud KMS makes it easy to implement best practices. Features like automatic key rotation let us rotate our keys frequently with zero overhead and stay in line with our internal compliance demands. Cloud KMS’ low latency allows us to use it for frequently performed operations. This allows us to expand the scope of the data we choose to encrypt from sensitive data to operational data that does not need to be indexed.” - Leonard Austin, CTO at Ravelin

At launch, Cloud KMS uses the Advanced Encryption Standard (AES) in Galois/Counter Mode (GCM), the same encryption used internally at Google to encrypt data in Google Cloud Storage. This AES GCM implementation comes from the BoringSSL library, which Google maintains and continually checks for weaknesses using several tools, including ones similar to the recently open-sourced cryptographic test suite Project Wycheproof.




The GCP encryption continuum




With the introduction of Cloud KMS, GCP now offers a full range of encryption key management options, allowing you to choose the right security solution for your use case based on the nature of your data (e.g., financial, personal health, private individual, military, government, or otherwise confidential or sensitive data) and on whether you want to store keys in the cloud or on-premises.



By default, Cloud Storage manages server-side encryption keys on your behalf. If you prefer to manage your cloud-based keys yourself, select "Cloud Key Management Service"; if you’d like to manage keys on-premises, select "Customer Supplied Encryption Keys" (available for Google Cloud Storage and Google Compute Engine). See the diagram below for a use case decision tree:








Your data is yours


While we’re on the topic of data protection and data privacy, it might be useful to point out how we think about GCP customer data: Google will not access or use GCP customer data except as necessary to provide the GCP services. You can learn more about our encryption policy by reading our whitepaper, “Encryption at Rest in Google Cloud Platform.”



Safe computing!







Today we’re sharing the first episode of our Pivotal Cloud Foundry on Google Cloud Platform (GCP) mini video series, featuring engineers from the Pivotal and Google Cloud Graphite teams who've been working hard to make this open-source platform run great on GCP. Google’s Cloud Graphite team works exclusively on open source projects in collaboration with project maintainers and customers. We’ll have more videos and blog posts this year, just like this one, highlighting that work.



In 2016 we began working with Pivotal, and announced back in October that customers can deploy and operate Pivotal Cloud Foundry on GCP. Thanks to this partnership, companies in industries like manufacturing, healthcare and retail can accelerate their digital transformation and run cloud-native applications on the same kind of infrastructure that powers Gmail, YouTube, Google Maps and more.


“The chemistry between the two engineering teams was remarkable, as if we had been working together for years. The Cloud Foundry community is already benefiting from this work. It’s simple to deploy Cloud Foundry atop Google’s infrastructure, and developers can easily extend their apps with Google’s analytics and machine learning services. We look forward to working with Google in the future to advance our shared vision for multi-cloud choice and flexibility.” - Joshua McKenty, Head of Platform Ecosystem, Pivotal

Specifically, together with Pivotal, we have:


  • Brought BOSH to GCP, adding support for Google’s global networking and load balancer, quick VM boot times, live migration and preemptible VM pricing

  • Built a service broker to let Cloud Foundry developers easily use Google services such as Google BigQuery, Google Cloud SQL and Google Cloud Machine Learning in their apps

  • Developed the stackdriver-tools BOSH release to give operators and developers access to health and diagnostics information in Stackdriver Logging and Stackdriver Monitoring






In the first episode of the video series, Dan Wendorf of Pivotal and I talk about deploying BOSH and Cloud Foundry to GCP, using the tutorial you can follow along with on GitHub.



Join us on YouTube to watch other episodes that will cover topics like setting up and consuming Google services with our Service Broker, or how to monitor and debug Cloud Foundry applications with Stackdriver. Just follow Google Cloud on YouTube, or @GoogleCloud on Twitter to find out when new videos are published. And stay tuned for more blog posts and videos about the work we’re doing with Puppet, Chef, HashiCorp, Red Hat, SaltStack and others.





If you work in cloud security, you might be planning a trip to San Francisco next month for the RSA Conference. If so, please stop by our San Francisco office for a series of 20 security talks. Our office is a 12-minute walk up Howard Street from Moscone Center, where the RSA Conference is held.



Google Cloud takes security seriously, and we’re excited to share more about some of the interesting and difficult problems we’re working on day-to-day. In our security talks, you’ll hear about our efforts in cloud identity, vulnerability trends from Project Zero, DDoS mitigation, container security and more!



See below for the full agenda of security talks we’ll be hosting. To learn more and RSVP, visit https://cloudplatformonline.com/rsa





We’re also excited that Googlers will be giving talks at the RSA Conference itself.


Hope to see you at RSA!