Google Cloud Platform gets a performance boost today with the much-anticipated public beta of NVIDIA Tesla K80 GPUs. You can now spin up NVIDIA GPU-based VMs in three GCP regions (us-east1, asia-east1 and europe-west1) using the gcloud command-line tool. Support for creating GPU VMs from the Cloud Console arrives next week.
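If you'd rather script the VM creation, here's a minimal sketch using the Compute Engine API's Python client as an alternative to the gcloud command the post describes. The project, zone, machine type and image below are placeholders, and since GPUs are in beta, the 'beta' API version may be required.

```python
# Sketch (not from the original post): create a VM with one K80 attached,
# using the Compute Engine API Python client. All names are placeholders.
import googleapiclient.discovery

compute = googleapiclient.discovery.build('compute', 'beta')  # GPUs are beta

project = 'my-project'   # placeholder
zone = 'us-east1-d'      # a zone in one of the three GPU regions

config = {
    'name': 'gpu-instance',
    'machineType': f'zones/{zone}/machineTypes/n1-standard-8',
    'disks': [{
        'boot': True,
        'initializeParams': {
            'sourceImage': 'projects/debian-cloud/global/images/family/debian-8',
        },
    }],
    'networkInterfaces': [{'network': 'global/networks/default'}],
    # Attach one K80 GPU; valid counts are 1, 2, 4 or 8.
    'guestAccelerators': [{
        'acceleratorType': f'zones/{zone}/acceleratorTypes/nvidia-tesla-k80',
        'acceleratorCount': 1,
    }],
    # GPU VMs can't live-migrate, so they must terminate for host maintenance.
    'scheduling': {'onHostMaintenance': 'TERMINATE', 'automaticRestart': True},
}

op = compute.instances().insert(project=project, zone=zone, body=config).execute()
print(op['name'])
```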



If you need extra computational power for deep learning, you can attach up to eight GPUs (4 K80 boards) to any custom Google Compute Engine virtual machine. GPUs can accelerate many types of computing and analysis, including video and image transcoding, seismic analysis, molecular modeling, genomics, computational finance, simulations, high-performance data analysis, computational chemistry, fluid dynamics and visualization.






NVIDIA K80 GPU Accelerator Board



Rather than constructing a GPU cluster in your own datacenter, just add GPUs to virtual machines running in our cloud. GPUs on Google Compute Engine are attached directly to the VM, providing bare-metal performance. Each NVIDIA GPU in a K80 has 2,496 stream processors with 12 GB of GDDR5 memory. You can shape your instances for optimal performance by flexibly attaching 1, 2, 4 or 8 NVIDIA GPUs to custom machine shapes.






Google Cloud supports as many as 8 GPUs attached to custom VMs, allowing you to optimize the performance of your applications.



These instances support popular machine learning and deep learning frameworks such as TensorFlow, Theano, Torch, MXNet and Caffe, as well as NVIDIA’s popular CUDA software for building GPU-accelerated applications.


Pricing


Like the rest of our infrastructure, the GPUs are priced competitively and are billed per minute (10-minute minimum). In the US, each K80 GPU attached to a VM is priced at $0.700 per hour per GPU; in Asia and Europe, $0.770 per hour per GPU. As always, you only pay for what you use. This frees you to spin up a large cluster of GPU machines for rapid deep learning and machine learning training with zero capital investment.
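To make the per-minute billing concrete: at the US rate, a single K80 GPU used for just the 10-minute minimum costs about $0.12 (one sixth of $0.700), while an 8-GPU configuration running for 10 hours comes to 8 × 10 × $0.700 = $56.00 in GPU charges, with the VM itself billed separately as usual.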






Supercharge machine learning


The new Google Cloud GPUs are tightly integrated with Google Cloud Machine Learning (Cloud ML), helping you slash the time it takes to train machine learning models at scale using the TensorFlow framework. Now, instead of taking several days to train an image classifier on a large image dataset on a single machine, you can run distributed training with multiple GPU workers on Cloud ML, dramatically shortening your development cycle and letting you iterate quickly on the model.



Cloud ML is a fully managed service that provides an end-to-end training and prediction workflow, integrating with cloud computing tools such as Google Cloud Dataflow, Google BigQuery, Google Cloud Storage and Google Cloud Datalab.



Start small and train a TensorFlow model locally on a small dataset. Then, kick off a larger Cloud ML training job against a full dataset in the cloud to take advantage of the scale and performance of Google Cloud GPUs. For more on Cloud ML, please see the Quickstart guide to get started, or this document to dive into using GPUs.
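As an illustration of that workflow, here's a minimal sketch of the "start small locally" step, written against the TensorFlow 1.x-era API that was current at the time; the toy data and model are purely illustrative. The same code, packaged as a Python module, is what you'd later submit as a Cloud ML training job.

```python
# Minimal local TensorFlow training sketch (illustrative): fit y = Wx + b
# on synthetic data before scaling the same code up on Cloud ML with GPUs.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(100).astype(np.float32)
y_train = 3.0 * x_train + 1.0        # ground truth: W = 3, b = 1

W = tf.Variable(0.0)
b = tf.Variable(0.0)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

loss = tf.reduce_mean(tf.square(W * x + b - y))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op, feed_dict={x: x_train, y: y_train})
    print(sess.run([W, b]))          # should approach [3.0, 1.0]
```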




Next steps


Register for Cloud NEXT, sign up for the Cloud ML Bootcamp and learn how to supercharge performance using GPUs in the cloud. You can use the gcloud command-line tool to create a VM today and start experimenting with TensorFlow-accelerated machine learning. Detailed documentation is available on our website.








Here at Google Cloud, we employ a small army of developer advocates, DAs for short, who are out on the front lines at conferences, on customer premises and on social media, explaining our technologies and communicating back to people like me and our product teams about your needs as members of the development community.



DAs take the responsibility of advocating for developers seriously, and have spent time poring over the extensive Google Cloud Next '17 session catalog, bookmarking the talks that will benefit you. To wit:


  • If you’re a developer working in Ruby, you know to turn to Aja Hammerly for all things Ruby/Google Cloud Platform (GCP)-related. Aja’s top pick for Rubyists at Next is Google Cloud Platform <3 Ruby with Google Developer Program Engineer Remi Taylor, but there are other noteworthy mentions on her personal blog.

  • Mete Atamel is your go-to DA for all things Windows on GCP. Selfishly, his top Next session is his own about running ASP.NET apps on GCP, but he has plenty more suggestions for you to choose from.

  • Groovy nut Guillaume Laforge is going to be one busy guy at Next, jumping between sessions about PaaS, serverless and containers, to name a few. Here’s his full list of must-see sessions.

  • If you’re a game developer, let Mark Mandel be your guide. Besides co-presenting with Rob Whitehead, CTO of Improbable, Mark has bookmarked sessions about location-based gaming, using GPUs and game analytics. Mosey on over to his personal blog for the full list.

  • In the past year, Google Apps Script has opened the door to building amazing customizations for G Suite, our communication and collaboration platform. In this G Suite Developers blog post, Wesley Chun walks you through some of the cool Apps Script sessions, as well as sessions about App Maker and some nifty G Suite APIs. 

  • Want to attend sessions that teach you about our machine learning services? That’s where you’ll find our hands-on ML expert Sara Robinson, who in addition to recommending her favorite Next sessions, also examines her talk from last year’s event using Cloud Natural Language API. 


For my part, I’m really looking forward to Day 3, which we’re modeling after my favorite open source conferences thanks to Sarah Novotny’s leadership. We’ll have a carefully assembled set of open talks on Kubernetes, TensorFlow and Apache Beam that cover the technologies, how to contribute, the ecosystems around them and small group discussions with the developers. For a full list of keynotes, bootcamps and breakout sessions, check out the schedule and reserve your spot.





At Waze, our mission of “outsmarting traffic, together” forces us to be mindful of one of our users’ most precious possessions: their time. Our cloud-based service saves users time by helping them find the optimal route based on crowdsourced data.



But a cloud service is only as good as it is available. At Waze, we use multiple cloud providers to improve the resiliency of our production systems. By running an active-active architecture across Google Cloud Platform (GCP) and AWS, we’re in a better position to survive a DNS DDoS attack, a regional failure, or even a global failure of an entire cloud provider.



Sometimes, though, a bug in routing or ETA calculations makes it to production undetected, and we need the ability to roll that code back or fix it as fast as possible; velocity is key. That’s easier said than done in a multi-cloud environment. For example, our realtime routing service spans more than 80 autoscaling groups/managed instance groups across two cloud providers and multiple regions.



This is where continuous delivery helps out. Specifically, we use Spinnaker, an open source, continuous delivery platform for releasing software changes with high velocity and confidence. Spinnaker has handled 100% of our production deployments for the past year, regardless of target platform.




Spinnaker and continuous delivery FTW


Large-scale cloud deployments can be complex. Fortunately, Spinnaker abstracts many of the particulars of each cloud provider, allowing our developers to focus on making Waze even better for our users instead of concerning themselves with the low-level details of multiple cloud providers. All the while, we’re able to maintain important continuous delivery concepts like canaries, immutable infrastructure and fast rollbacks.



Jenkins is a first-class citizen in Spinnaker, so once code is committed to git and Jenkins builds a package, that same package triggers the main deployment pipeline for that particular microservice. That pipeline bakes the package into an immutable machine image on multiple cloud providers in parallel and then runs any automated testing stages. The deployment proceeds to staging using a blue/green deployment strategy, and finally to production, without having to get deep into the details of each platform. Note that Spinnaker automatically resolves the correct image IDs per cloud provider, so that each cloud's deployment processes happen automatically and correctly without the need for any configuration.



Example of a multi-cloud pipeline




*Jenkins icon from Jenkins project.




Multi-Cloud blue/green deployment using Spinnaker

Thanks to Spinnaker, developers can focus on developing business logic, rather than becoming experts on each cloud platform. Teams can track the lifecycle of a deployment using several notification mechanisms including email, Slack and SMS, allowing them to coordinate handoffs between developer and QA teams. Support for tools like canary analysis and fast rollbacks allows developers to make informed decisions about the state of their deployment. Since Spinnaker is designed from the ground up to be a self-service tool, developers can do all of this with minimal involvement from the Ops team.



At Waze, we strive to release new features and bug fixes to our users as quickly as possible. Spinnaker allows us to do just that while also helping keep multi-cloud deployments and rollbacks simple, easy and reliable.





If this sounds like something your organization would benefit from, check out Spinnaker. And don't miss our talks at GCP Next 2017.








Google Cloud acquired Orbitera last fall, and already we’ve hit a key milestone: completing the migration of the multi-cloud commerce platform from AWS to Google Container Engine, our managed Kubernetes environment.



This is a real testament to the maturity of Kubernetes and Container Engine, which, in less than two years, has emerged as the most popular managed container orchestration service on the market, powering Google Cloud’s own services as well as customers such as Niantic with Pokémon GO.



Founded in 2012, Orbitera originally built its service on an extensive array of AWS services including EC2, S3, RDS and Redshift, and weighed the pros and cons of migrating to various Google Cloud compute platforms: Google Compute Engine, Google App Engine or Container Engine. Ultimately, migrating to Container Engine presented Orbitera with the best opportunity to modernize its service and move from a monolithic architecture running on virtual machines to a microservices-based architecture based on containers.



The resulting service allows Orbitera ISV partners to easily sell their software in the cloud, by managing the testing, provisioning and ongoing billing management of their applications across an array of public cloud providers.



Running on Container Engine is proving to be a DevOps management boon for the Orbitera service. With Container Engine, Orbitera now runs in multiple zones with three replicas in each zone, for better availability. Container Engine also makes it easy to scale component microservices up and down on demand as well as deploy to new regions or zones. Operators roll out new application builds regularly to individual Kubernetes pods, and easily roll them back if the new build behaves unexpectedly.



Meanwhile, as a native Google Cloud service, Container Engine integrates with multiple Google Cloud Platform (GCP) services such as Cloud SQL, load balancers and Google Stackdriver, whose alerts and dashboards allow Orbitera to respond to issues more quickly and more efficiently.



Going forward, running on Container Engine positions Orbitera to take advantage of more modular microservices and APIs and to rapidly build out new services and capabilities for customers. That’s a win for enterprises that depend on Orbitera to provide a seamless, consistent and easy way to manage software running across multiple clouds, including Amazon Web Services and Microsoft Azure.



Stay tuned for a technical deep dive from the engineering team that performed the migration, where they'll share lessons learned, and tips and tricks for performing a successful migration to Container Engine yourself. In the meantime, if you have questions about Container Engine, you can find us on our Slack channel.









Today, we’re excited to announce the public beta for Cloud Spanner, a globally distributed relational database service that lets customers have their cake and eat it too: ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.



When building cloud applications, database administrators and developers have been forced to choose between traditional databases that guarantee transactional consistency, or NoSQL databases that offer simple, horizontal scaling and data distribution. Cloud Spanner breaks that dichotomy, offering both of these critical capabilities in a single, fully managed service.


“Cloud Spanner presents tremendous value for our customers, who are retailers, manufacturers and wholesale distributors around the world. With its ease of provisioning and scalability, it will accelerate our ability to bring cloud-based omni-channel supply chain solutions to our users around the world.” - John Sarvari, Group Vice President of Technology, JDA

JDA, a retail and supply chain software leader, has used Google Cloud Platform (GCP) as the basis of its new application development and delivery since 2015 and was an early user of Cloud Spanner. The company saw its potential to handle the explosion of data coming from new information sources such as IoT, while providing the consistency and high availability needed when using this data.



Cloud Spanner rounds out our portfolio of database services on GCP, alongside Cloud SQL, Cloud Datastore and Cloud Bigtable.



As a managed service, Cloud Spanner provides key benefits to DBAs:


  • Focus on your application logic instead of spending valuable time managing hardware and software

  • Scale out your RDBMS solutions without complex sharding or clustering

  • Gain horizontal scaling without migration from relational to NoSQL databases

  • Maintain high availability and protect against disaster without needing to engineer a complex replication and failover infrastructure

  • Gain integrated security with data-layer encryption, identity and access management and audit logging




With Cloud Spanner, your database scales up and down as needed, and you'll only pay for what you use. It features a simple pricing model that charges for compute node-hours, actual storage consumption (no pre-provisioning) and external network access.



Cloud Spanner keeps application development simple by supporting standard tools and languages in a familiar relational database environment. It’s ideal for operational workloads supported by traditional relational databases, including inventory management, financial transactions and control systems, that are outgrowing those systems. It supports distributed transactions, schemas and DDL statements, SQL queries and JDBC drivers and offers client libraries for the most popular languages, including Java, Go, Python and Node.js.
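To give a flavor of those client libraries, here's a brief sketch using the Python library; the instance, database, table and column names are placeholders, not part of the announcement.

```python
# Sketch: a strongly consistent read and an ACID read-write transaction
# with the google-cloud-spanner Python client. Names are placeholders.
from google.cloud import spanner

client = spanner.Client()
database = client.instance('test-instance').database('inventory-db')

# Strongly consistent SQL query with a typed parameter.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        'SELECT SKU, Quantity FROM Inventory WHERE Quantity < @threshold',
        params={'threshold': 10},
        param_types={'threshold': spanner.param_types.INT64})
    for row in rows:
        print(row)

# Read-write transaction; Spanner retries the function if it aborts.
def restock(transaction):
    transaction.update(table='Inventory',
                       columns=('SKU', 'Quantity'),
                       values=[('A1', 110)])

database.run_in_transaction(restock)
```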




More Cloud Spanner customers share feedback


Quizlet, an online learning tool that supports more than 20 million students and teachers each month, uses MySQL as its primary database; database performance and stability are critical to the business. But with users growing at roughly 50% a year, Quizlet has been forced to scale its database many times to handle this load. By splitting tables into their own databases (vertical sharding) and moving query load to replicas, it’s been able to increase query capacity, but this technique is quickly reaching its limits, as the tables themselves are outgrowing what a single MySQL shard can support. In its search for a more scalable architecture, Quizlet discovered Cloud Spanner, which will allow it to easily scale its relational database and simplify its application:


“Based on our experience and performance testing, Cloud Spanner is the most compelling option we’ve seen to power a high-scale relational query workload. It has the performance and scalability of a NoSQL database, but can execute SQL, so it’s a viable alternative to sharded MySQL. It’s an impressive technology and could dramatically simplify how we manage our databases.” - Peter Bakkum, Platform Lead, Quizlet


The history of Spanner


For decades, developers have relied on traditional databases with a relational data model and SQL semantics to build applications that meet business needs. Meanwhile, NoSQL solutions emerged that were great for scale and fast, efficient data processing, but they didn’t meet the need for strong consistency. Faced with these two sub-optimal choices, which customers still grapple with today, a team of systems researchers and engineers at Google set out in 2007 to develop a globally distributed database that could bridge this gap. In 2012, we published the Spanner research paper that described many of these innovations. The result was a database that offers the best of both worlds:












(click to enlarge)



Remarkably, Cloud Spanner achieves this combination of features without violating the CAP Theorem. To understand how, read this post by the author of the CAP Theorem and Google Vice President of Infrastructure, Eric Brewer.



Over the years, we’ve battle-tested Spanner internally with hundreds of different applications and petabytes of data across data centers around the world. At Google, Spanner supports tens of millions of queries per second and runs some of our most critical services, including AdWords and Google Play.



If you have a MySQL or PostgreSQL system that's bursting at the seams, or are struggling with hand-rolled transactions on top of an eventually-consistent database, Cloud Spanner could be the solution you're looking for. Visit the Cloud Spanner page to learn more and get started building applications on our next-generation database service.





Building systems that manage globally distributed data, provide data consistency and are also highly available is really hard. The beauty of the cloud is that someone else can build that for you.



The CAP theorem says that a database can only have two of the three following desirable properties:



  • C: consistency, which implies a single value for shared data

  • A: 100% availability, for both reads and updates

  • P: tolerance to network partitions



This leads to three kinds of systems: CA, CP and AP, based on what letter you leave out. Designers are not entitled to two of the three, and many systems have zero or one of the properties.



For distributed systems over a “wide area,” it's generally viewed that partitions are inevitable, although not necessarily common. If you believe that partitions are inevitable, any distributed system must be prepared to forfeit either consistency (AP) or availability (CP), which is not a choice anyone wants to make. In fact, the original point of the CAP theorem was to get designers to take this tradeoff seriously. But there are two important caveats: First, you only need to forfeit consistency or availability during an actual partition, and even then there are many mitigations. Second, the actual theorem is about 100% availability; a more interesting discussion is about the tradeoffs involved to achieve realistic high availability.



Spanner joins Google Cloud

Today, Google is releasing Cloud Spanner for use by Google Cloud Platform (GCP) customers. Spanner is Google’s highly available, global SQL database. It manages replicated data at great scale, both in terms of size of data and volume of transactions. It assigns globally consistent real-time timestamps to every datum written to it, and clients can do globally consistent reads across the entire database without locking.



In terms of CAP, Spanner claims to be both consistent and highly available despite operating over a wide area, which many find surprising or even unlikely. The claim thus merits some discussion. Does this mean that Spanner is a CA system as defined by CAP? The short answer is “no” technically, but “yes” in effect and its users can and do assume CA.



The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.



However, no system provides 100% availability, so the pragmatic question is whether or not Spanner delivers availability that is so high that most users don't worry about its outages. For example, given there are many sources of outages for an application, if Spanner is an insignificant contributor to its downtime, then users are correct to not worry about it.



In practice, we find that Spanner does meet this bar, with more than five 9s of availability (less than one failure in 10^5). Given this, the target for multi-region Cloud Spanner will be right at five 9s, as it has some additional new pieces that will be higher risk for a while.



Inside Spanner 



The next question is, how is Spanner able to achieve this?



There are several factors, but the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.



It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.



Building systems that can manage data that spans the globe, provide data consistency and are also highly available is possible; it’s just really hard. The beauty of the cloud is that someone else can build that for you, and you can focus on innovation core to your service or application.



Next steps



For a significantly deeper dive into the details, see the white paper also released today. It covers Spanner, consistency and availability in depth (including new data). It also looks at the role played by Google’s TrueTime system, which provides a globally synchronized clock. We intend to release TrueTime for direct use by Cloud customers in the future.



Furthermore, look for the addition of new Cloud Spanner-related sessions at Google Cloud Next ‘17 in San Francisco next month. Register soon, because seats are limited.





Today we're excited to announce general availability and full support for Google Cloud Endpoints, a truly distributed API gateway. It features a server-local proxy (the Extensible Service Proxy) and is built on the same services that Google uses to power its own APIs. For developers building applications and microservices on Google Cloud Platform (GCP), Cloud Endpoints is the best-suited modern API gateway that helps secure and monitor their APIs.



APIs are a critical part of mobile apps, modern web applications and microservices. With the increased focus on APIs comes increased responsibility: the top features you need in order to take care of your API are authorization, monitoring and logging. In other words, “Help make my API safer” and “Tell me how my API is doing.” And, above all: “Help make sure it is highly performant!”





Cloud Endpoints helps you with all of that. Through integrations with Firebase and Auth0, Cloud Endpoints authenticates each call to your API, so you know who's using your mobile and web apps. Cloud Endpoints also validates service-to-service calls, helping to keep your microservices more secure. You can create API keys via the Google Cloud Console, just like we do for APIs such as the Google Translation API and the Google Maps APIs. It logs API calls to Stackdriver Logging and displays a monitoring dashboard in the console, giving you critical status information about the health and performance of your API.



Cloud Endpoints is tightly integrated with the GCP ecosystem and is easy to use, especially when running containerized workloads. The API proxy is built into the Google App Engine flexible environment and can be added to any Kubernetes or Google Container Engine deployment with a couple of lines of YAML. Deployment occurs through gcloud, and monitoring and logging live in the Cloud Console. The proxy functionality is also built into the Endpoints Frameworks (see below) for use on the App Engine standard environment.



Cloud Endpoints is ideal for GCP customers who need a fast, scalable API gateway. Enterprise customers can also take advantage of Apigee Edge, an industry-leading API management platform that we acquired this fall, for full-featured API management that works across on-premises, cloud and hybrid deployment models.



Endpoints has been in beta for three months and our early adopters are already doing amazing things with it. We’ve seen people going from initial evaluation to production in days, and scalability has been terrific: one customer’s API peaked at over 11,000 requests per second, and another customer served nearly 50 million requests in a day.


“When migrating our workloads to Google Cloud Platform, we needed to securely communicate between multiple data centres. Traditional methods like firewalls and ad hoc authentication were unsustainable, quickly leading to a jumbled mess of ACLs. Endpoints, on the other hand, gives us a standardised authentication system, backed by Google's security pedigree.” - Laurie Clark-Michalek, Infrastructure Engineer at Qubit


Frameworks


For developers working on App Engine standard environment, we're also announcing general availability and full support for our Java and Python frameworks. The Endpoints Frameworks are available to help developers quickly get started serving an API from App Engine. Think of the Endpoints Frameworks as lightweight alternatives to Python Flask or Java Jersey. The Endpoints Frameworks come with built-in integration to the Google Service Control API, meaning that back-ends built with the Endpoints Frameworks do not need to run behind the Extensible Service Proxy.
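For a sense of what that looks like in practice, here's a minimal Python sketch in the spirit of the framework's standard echo sample; the API name and message fields are illustrative.

```python
# Minimal Endpoints Frameworks (Python) API for App Engine standard.
# The framework serves the API directly; no Extensible Service Proxy needed.
import endpoints
from protorpc import messages, remote


class EchoMessage(messages.Message):
    content = messages.StringField(1)


@endpoints.api(name='echo', version='v1')
class EchoApi(remote.Service):
    @endpoints.method(EchoMessage, EchoMessage,
                      path='echo', http_method='POST', name='echo')
    def echo(self, request):
        # Routing, validation and (optionally) API keys, auth,
        # monitoring and logging are handled by the framework.
        return EchoMessage(content=request.content)


# WSGI application referenced from app.yaml.
api = endpoints.api_server([EchoApi])
```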




Pricing


We want everyone to be able to benefit from Cloud Endpoints, so we created a free tier: your first two million API calls per month are available at no charge. Once your service is popular enough to go beyond the free tier, pricing is a simple $3.00 per million requests.
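To make that concrete: an API serving 10 million calls in a month pays nothing for the first two million and 8 × $3.00 = $24.00 for the remaining eight million.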



The Endpoints Frameworks can optionally expose all of the features of the Cloud Endpoints API gateway. The Endpoints Frameworks can be used at no cost, but the Endpoints API gateway features turned on with the Frameworks (API keys, monitoring and logging) are charged at the standard rate.



Go ahead, read the documentation, try our walkthroughs for App Engine (standard or flexible environment), Container Engine or Compute Engine and join our Google Cloud Endpoints Google Group. To hear more about Google’s vision on comprehensive API technologies for enterprise customers and developers, including more about Apigee Edge, please join us next month at Google Cloud Next '17 in San Francisco.





So many sessions, so little time. Google Cloud Next '17, taking place next month, features over 200 breakout sessions, many of them geared directly at security professionals. If you only have time on your schedule for a few security breakouts, here are the ones you can’t afford to miss.



For a foundation in how we help secure Google Cloud Platform (GCP), and for a peek at the various security threats Google grapples with day in and day out, check out “Lessons learned from securing both Google and Google Cloud customers.” Here, Andy Chang, Google Senior Product Manager, will discuss the various layers of Google security, its security team and what it’s learned from preventing, detecting and responding to cyber attacks over the years.



Now that you better understand our security features, learn how to attach your on-prem environment to Google Cloud via Virtual Private Cloud. In “How to create a secure, private environment in the cloud and on-prem with Google Cloud Virtual Private Clouds,” Ines Envid, Google Product Manager, and Neha Pattan, Software Engineer, will show you how to build a sandbox to run your cloud workloads alongside on-prem applications, as well as how to integrate with GCP’s machine learning, big data and storage services.



More and more, building cloud applications means building mobile applications. In “Security first for a mobile first strategy,” director of Android security Adrian Ludwig discusses the multiple layers of protection that the Android platform provides to help keep business and personal information safe.



We do our part on the backend, but it’s up to you to write quality apps. In “Designing secure UX into your products,” Google senior developer advocate Mandy Waite discusses best practices you should follow when building apps and services, plus how Google protects against threats like malware and phishing attacks.



At a fundamental level, for many of our customers, keeping their business safe is rooted in protecting email. In “Trends in data security,” Gilad Golan, Google Director for Security and Data Protection, and Nicolas Lidzborski, Staff Software Engineer, describe our latest innovations in email security, and how you can apply them to your organization.



As an added bonus, we’re also offering a full-day security bootcamp before the show. Register now to reserve your spot, and see you at NEXT!





Google recently launched GPUs on Google Cloud Platform (GCP), which will allow customers to leverage this hardware for highly parallel workloads. These GPUs are connected to our cloud machines via a variety of PCIe switches, and that required us to have a deep understanding of PCIe security.



Securing PCIe devices requires overcoming some inherent challenges. For instance, GPUs have become far more complex in the past few decades, opening up new avenues for attack. Since GPUs are designed to directly access system memory, and since hardware has historically been considered trusted, it's difficult to ensure that all the settings meant to keep a GPU contained are configured accurately, and difficult to verify that such settings even work. And since GPU manufacturers don't make the source code or binaries available for the GPU's main processes, we can't examine those to gain more confidence. You can read more about the challenges presented by the PCI and PCIe specs here.



Given the risk of malicious behavior from compromised PCIe devices, Google needed a plan for combating these types of attacks, especially in a world of cloud services and publicly available virtual machines. Our approach has been to focus on mitigation: ensuring that compromised PCIe devices can’t jeopardize the security of the rest of the computer.



Fuzzing to the rescue



A key weapon in our arsenal is fuzzing, a testing technique that uses invalid, unexpected or random inputs to expose irregular behavior, such as memory leaks, crashes, or undocumented functionality. The hardware fuzzer we built directly tests the behavior of the PCIe switches used by our cloud GPUs.



After our initial research into the PCIe spec, we prepared a list of edge cases and device behaviors that didn’t have clearly defined outcomes. We wanted to test these behaviors on real hardware, and we also wanted to find out whether real hardware implemented the well-defined parts of the spec properly. Hardware bugs are actually quite common, but many security professionals assume their absence, simply trusting the manufacturer. At Google, we want to verify every layer of the stack, including hardware.



Our plan called for a fuzzer that was highly specialized, and designed to be effective against the production configurations we use in our cloud hardware. We use a variety of GPU and switch combinations on our machines, so we set up some programmable network interface controllers (NICs) in similar configurations to simulate GPU memory accesses.



Our fuzzer used those NICs to aggressively hammer the port directly upstream from each NIC, as well as any other accessible ports in the network, with a variety of memory reads and writes. These operations included a mixture of targeted attacks, randomness and "lucky numbers" that tend to cause problems on many hardware architectures. We wanted to detect changes to the configuration of any port as a result of the fuzzing, particularly the port's secondary and subordinate bus numbers. PCIe networks with Source Validation enabled are governed primarily by these bus numbers, which dictate where packets can and cannot go. Being able to reconfigure a port's secondary or subordinate bus numbers could give you access to parts of the PCIe network that should be forbidden.
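Google's fuzzer itself isn't public, but the detection step described above is easy to illustrate: on Linux, a bridge port's secondary and subordinate bus numbers live at fixed offsets (0x19 and 0x1A) of its type-1 configuration header, so a watcher can poll them for unexpected changes. A rough sketch, with a placeholder device address:

```python
# Illustrative sketch only (not Google's internal tooling): watch a PCIe
# bridge port's secondary/subordinate bus numbers for unexpected changes.
import time

BRIDGE_CONFIG = '/sys/bus/pci/devices/0000:00:1c.0/config'  # placeholder

def bus_numbers(path):
    with open(path, 'rb') as f:
        cfg = f.read(0x1B)
    # PCI type-1 (bridge) header: secondary bus at 0x19, subordinate at 0x1A.
    return cfg[0x19], cfg[0x1A]

baseline = bus_numbers(BRIDGE_CONFIG)
while True:
    current = bus_numbers(BRIDGE_CONFIG)
    if current != baseline:
        print(f'bus numbers changed: {baseline} -> {current}')
    time.sleep(1)
```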



Our security team reviewed any suspicious memory reads or writes to determine whether they represented security vulnerabilities, and adjusted either the fuzzer or our PCIe settings accordingly.



We discovered some curiosities. For instance, on one incorrect configuration, some undocumented debug registers on the switch were incorrectly exposed to downstream devices, which we discovered could cause serious malfunctioning of the switch under certain access patterns. If a device can cause out-of-spec behavior in the switch it’s connected to, it may be able to cause insecure routing, which would compromise the entire network. The value of fuzzing is its ability to find vulnerabilities in undocumented and undefined areas, outside the normal set of behaviors and operations defined in the spec. But by the end of the process, we had determined a minimum set of ACS features necessary to securely run GPUs in the cloud.




Let's check out those memory mappings too




When you make use of a GPU on a local computer through the root OS, it has direct memory access to the computer’s memory. This is very fast and straightforward. However, that model doesn't work in a virtualized environment like Google Compute Engine.



When a virtual machine is initialized, a set of page tables maps the guest's physical memory to the host's physical memory, but the GPU has no way to know about those mappings, and thus will attempt to write to the wrong places. This is where the input-output memory management unit (IOMMU) comes in. The IOMMU maintains its own page table, translating GPU accesses into DRAM/MMIO reads and writes. It's implemented in hardware, which reduces the remapping overhead.



This means the IOMMU is performing a pretty delicate operation. It’s mapping its own I/O virtual addresses into host physical addresses. We wanted to verify that the IOMMU was functioning correctly, and ensure that it was enabled any time a device may be running untrusted code, so that there would be no opportunity for unfiltered accesses.
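On a stock Linux host, one quick way to check part of this (a sketch, not Google's production tooling) is that the kernel exposes per-device isolation groups under /sys/kernel/iommu_groups whenever DMA remapping is active:

```python
# Sketch: if the IOMMU is enabled, the kernel populates iommu_groups.
import os

path = '/sys/kernel/iommu_groups'
groups = os.listdir(path) if os.path.isdir(path) else []
if groups:
    print(f'IOMMU active: {len(groups)} isolation groups')
else:
    print('no IOMMU groups found; DMA remapping may be disabled')
```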



Furthermore, there were features of the IOMMU that we didn't want, like compatibility interrupts. This is a type of interrupt that exists to support older Intel platforms that lack the interrupt-remapping capabilities that the IOMMU gives you. They're not necessary for modern hardware, and leaving them enabled allows guests to trigger unexpected MSIs, machine reboots, and host crashes.



The most interesting challenge here is protecting against PCIe's Address Translation Services (ATS). Using this feature, any device can claim it's using an address that's already been translated, and thus bypass IOMMU translation. For trusted devices, this is a useful performance improvement. For untrusted devices, this is a big security threat. ATS could allow a compromised device to ignore the IOMMU and write to places it shouldn't have access to.



Luckily, there's an ACS setting that can disable ATS for any given device. Thus, we disabled compatibility interrupts, disabled ATS, and had a separate fuzzer attempt to access memory outside the range specifically mapped to it. After some aggressive testing we determined that the IOMMU worked as advertised and could not be bypassed by a malicious device.




Conclusions




Beyond simply verifying our hardware in a test environment, we wanted to make sure our hardware remains secure in all of production. Misconfigurations are likely the biggest source of major outages in production environments, and it's a similar story with security vulnerabilities. Since ACS and IOMMU can be enabled or disabled at multiple layers of the stack, potentially varying between kernel versions, the default settings of the device, or other seemingly minor tweaks, we would be remiss to rely solely on isolated unit tests to verify these settings. So we developed tooling to monitor the ACS and IOMMU settings in production, allowing any misconfiguration of the system to be quickly detected and rolled back.



As much as possible, it's good practice not to trust hardware without first verifying that it works correctly, and our targeted attacks and robust fuzzing allowed us to settle on a list of ACS settings that let us share GPUs with cloud users securely. As a result, we can provide GPUs to our customers with a high degree of confidence in the security of the underlying system. Stay tuned for more posts that detail how we implement security at Google Cloud.





Google Developers Codelabs provide guided coding exercises to get hands-on experience with a wide range of topics such as Android Wear, Firebase and Web. Google Cloud Platform (GCP) has its own section, with codelabs for Google Compute Engine, Google App Engine, Kubernetes and many more.



We’re always working to create new content, and I’m happy to announce that we now have new codelabs for running Windows and .NET apps on GCP, with their own dedicated page. Here’s an overview to help you get started.





First, if you’re a .NET developer, you probably love and use Visual Studio daily. Install and use Cloud Tools for Visual Studio teaches you how to install and use our GCP plugin for Visual Studio.



If you're a traditional ASP.NET developer writing apps for Windows Server, Deploy Windows Server with ASP.NET Framework to Compute Engine is the first codelab you should try. It teaches you how to deploy a Windows Server with ASP.NET Framework on Compute Engine.



Once you have your Windows Server deployed, you can try Deploy ASP.NET app to Windows Server on Compute Engine. It shows you how to take a simple ASP.NET app and publish it to your Windows Server from Visual Studio. These two codelabs provide a good understanding of traditional ASP.NET development and deployment on GCP.



If you've already made the switch to ASP.NET Core, the new multi-platform version of ASP.NET, then start with Build and launch an ASP.NET Core app from Google Cloud Shell to learn how to build and test a basic ASP.NET Core app from Cloud Shell. The whole codelab can be done inside your browser, which is pretty cool!



Afterwards, you can take this app and either deploy to App Engine or to Kubernetes on Google Container Engine. App Engine is definitely the easier path, and Deploy an ASP.NET Core app to App Engine can show you the way. If you want to tackle Kubernetes, you can follow Deploy ASP.NET Core app to Kubernetes on Container Engine to create a Kubernetes cluster of ASP.NET Core pods.



Regardless of where you deploy your app, you need to manage it, and we have a codelab on PowerShell to help with that: Install and use Cloud Tools for PowerShell teaches the use of our PowerShell cmdlets to access and manage GCP resources via PowerShell scripts.



I hope this gives you a good overview of where to start with Windows and .NET codelabs on GCP. We'll be adding more to our dedicated page for Windows and .NET, so be sure to check back regularly.





With 200-plus sessions to choose from at Google Cloud Next ‘17 on March 8 - 10, there’s a little bit of something for everyone. But if you’re an application developer coming to the show, here are a few sessions in particular that I recommend you check out.



The most popular application development platform on Google Cloud Platform (GCP) is Java. If that describes your shop, be sure to check out "Power your Java workloads on Google Cloud Platform," with Amir Rouzrokh, Product Manager for all things Java on GCP. Amir will show attendees how to deploy a Spring Boot application to GCP, plus how to use Cloud Tools for IntelliJ to troubleshoot production problems.



In the past year, we’ve also made big strides supporting Microsoft platforms like ASP.NET on GCP. For a taste, check out Google Developer Advocate Mete Atamel’s talk “Take your ASP.NET apps to the next level with Google Cloud,” where he’ll cover how to migrate an ASP.NET app to GCP, how to work with our PowerShell cmdlets and Visual Studio plugins, and how to tie into advanced GCP services like Google Cloud Storage, Cloud Pub/Sub and our Machine Learning APIs. Then there’s "Running .NET and containers in Google Cloud Platform" with Jon Skeet and Chris Smith, who will show you the next generation of OSS, cross-platform .NET Core apps running in containers in Google App Engine and in Kubernetes. (And if that's still not enough, you can always sign up for the full-day Windows on GCP bootcamp before the show.)



Speaking of App Engine, here’s your chance to learn all about App Engine flexible environment, our next-generation PaaS offering. In "You can run that on App Engine?," Product Manager Justin Beckwith shows you how to easily build production-scale web apps for an expanded variety of application patterns.



We’re also excited to talk more about Apigee, the API management platform we acquired in the fall. At “Using Apigee Edge to create and publish APIs that developers love,” Greg Brail, Principal Software Engineer, and Prithpal Bhogil, GCP Sales Engineer, will walk developers through how to use Apigee Edge, along with best practices for building developer-friendly APIs.



Newcomers to GCP may also enjoy Google Cloud Product Manager Omar Ayoub’s session, "Developing made easy on Google Cloud Platform", where we’ll provide an overview of all the different libraries, IDE and framework integrations and other tools for developing applications on GCP.



But the hottest application development topic at Next '17 is arguably Google Cloud Functions, our event-based computing platform that we announced in alpha last year. For an introduction to Cloud Functions, there’s "Building serverless applications with Google Cloud Functions" with Product Manager Jason Polites. Mobile developers should also consider "Google Cloud Functions and Firebase", marrying our mobile backend as a service offering with Cloud Functions’ lightweight, asynchronous compute.



Of course, that’s just the tip of the iceberg when it comes to application development sessions. Be sure to check out the full session catalog, and register sooner rather than later to secure your spot in the most coveted sessions and bootcamps.





Our goal at Google Cloud Platform (GCP) is to be the best enterprise cloud environment. Throughout 2016, we worked hard to ensure that Windows developers and IT administrators would feel right at home when they came to GCP: whether it’s building an ASP.NET application with their favorite tools like Visual Studio and PowerShell, or deploying the latest version of Windows Server onto Google Compute Engine.



Continuing our work in providing great infrastructure for enterprises running Windows, we’re pleased to announce pre-configured images for Microsoft SQL Server Enterprise and Windows Server Core on Compute Engine. High-availability and disaster recovery are top of mind for our larger customers, so we’re also announcing support for SQL Server AlwaysOn Availability Groups and persistent disk snapshots integrated with Volume Shadow Copy Service (VSS) on Windows Server. Finally, all of our Windows Server images are now enabled with Windows Remote Management support, including our Windows Server Core 2016 and 2012 R2 images.




SQL Server Enterprise Edition images on GCE




You can now launch Compute Engine VMs with Microsoft SQL Server Enterprise Edition pre-installed, and pay by the minute for SQL Server Enterprise and Windows Server licenses. Customers can also choose to bring their own licenses for SQL Server Enterprise.



We now support pre-configured images for the following versions in Beta:




  • SQL Server Enterprise 2016

  • SQL Server Enterprise 2014

  • SQL Server Enterprise 2012 







Supported SQL Server images available on Compute Engine (click to enlarge)



SQL Server Enterprise targets mission-critical workloads by supporting more cores, higher memory and important enterprise features, including:




  • In-memory tables and indexes

  • Row-level security and encryption for data at rest or in motion

  • Multiple read-only replicas for integrated HA/DR and read scale-out

  • Business intelligence and rich visualizations on all platforms, including mobile

  • In-database advanced analytics with R






Combined with Google’s world-class infrastructure, SQL Server instances running on Compute Engine benefit from price-to-performance advantages, highly customizable VM sizes and state-of-the-art networking and security capabilities. With automatic sustained-use discounts, and the prospect of retiring hardware and its associated maintenance, customers can achieve total costs lower than those of other cloud providers.



To get started, learn how to create SQL Server instances easily on Google Compute Engine.








High-availability and disaster recovery for SQL Server VMs




Mission-critical SQL Server workloads require support for high-availability and disaster recovery. To achieve this, GCP supports Windows Server Failover Clustering (WSFC) and SQL Server AlwaysOn Availability Groups. AlwaysOn Availability Groups is SQL Server’s flagship HA/DR solution, allowing you to configure replicas for automatic failover in case of failure. These replicas can be readable, allowing you to offload read workloads and backups.



Compute Engine users can now configure AlwaysOn Availability Groups. This includes configuring replicas on VMs in different isolated zones as described in these instructions.
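As a hypothetical illustration of what readable replicas buy an application (the listener name, database and credentials below are placeholders, not from the documentation): a client can connect to the availability group listener with read-only intent so reporting queries are routed to a secondary replica.

```python
# Hypothetical sketch: read-intent connection to an AlwaysOn Availability
# Group listener, so read workloads land on a readable secondary.
# Requires pyodbc and a SQL Server ODBC driver; all names are placeholders.
import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 13 for SQL Server};'
    'SERVER=tcp:ag-listener.example.internal,1433;'
    'DATABASE=appdb;UID=report_user;PWD=example-password;'
    'ApplicationIntent=ReadOnly;'   # route to a readable secondary
    'MultiSubnetFailover=Yes;'      # recommended for WSFC listeners
)
for row in conn.cursor().execute('SELECT COUNT(*) FROM dbo.Orders'):
    print(row[0])
```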




A highly available SQL Server reference architecture using Windows Server Failover Clustering and SQL Server AlwaysOn Availability Groups (click to enlarge)




Better backups with VSS-integrated persistent disk snapshots for Windows VMs




Being able to take snapshots in coordination with the Volume Shadow Copy Service ensures that you get application-consistent snapshots for persistent disks attached to an instance running Windows, without having to shut it down. This feature is useful when you want to take a consistent backup for VSS-enabled applications like SQL Server and Exchange Server without affecting the workload running on the VMs.



To get started with VSS-enabled persistent disk snapshots, select Snapshots on the Compute Engine page of the Cloud Console. There you'll see a new checkbox on the disk snapshot creation page that lets you specify whether a snapshot should be VSS-enabled.




(click to enlarge)



This feature can also be invoked via the gcloud SDK and API, following these instructions.
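As a sketch of the API route (names below are placeholders): the disks.createSnapshot method accepts a guestFlush flag that asks the Windows guest to quiesce through VSS before the snapshot is taken.

```python
# Sketch: request a VSS-coordinated (application-consistent) snapshot via
# the Compute Engine API Python client. Project/zone/disk are placeholders.
import googleapiclient.discovery

compute = googleapiclient.discovery.build('compute', 'v1')

op = compute.disks().createSnapshot(
    project='my-project',
    zone='us-central1-a',
    disk='sql-server-data-disk',
    guestFlush=True,                      # VSS-enabled snapshot
    body={'name': 'sql-data-vss-snapshot'},
).execute()
print(op['name'])
```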




Looking ahead




GCP’s expanded support for SQL Server images and high availability is our latest effort to improve Windows support on Compute Engine and to build an industry-leading cloud environment for enterprise Windows. Last year we expanded our list of pre-configured images to include SQL Server Standard, SQL Server Web and Windows Server 2016, and announced comprehensive .NET developer solutions, including a .NET client library for all GCP APIs through NuGet. We have lots more in store for the rest of 2017!



For more resources on Windows Server and Microsoft SQL Server on GCP, check out cloud.google.com/windows and cloud.google.com/sql-server. And for hands-on training on how to deploy and manage Windows and SQL Server workloads on GCP, come to the GCP NEXT ‘17 Windows Bootcamp. Finally, if you need help migrating your Windows workloads, don’t hesitate to contact us. We’re eager to hear your feedback!