Google Cloud Platform Blog

Google Cloud Dataproc: Making Spark and Hadoop Easier, Faster, and Cheaper

Wednesday, September 23, 2015

Working with large datasets requires powerful tools, but too often those tools add new layers of complexity. To use your data efficiently, you need to minimize the time from data capture to insight. But concerns about deployment, scaling, monitoring, utilization, and cost can get in the way of what matters most: your data. With more data being generated each day, you have less time to peel back the layers of complexity around the tools you rely on for success. We think using powerful data tools should be as easy as 1-2-3.

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data. In the time it takes you to read this blog post, you can have a Spark or Hadoop cluster created, configured, and ready to work for you.

Cloud Dataproc minimizes the time you spend on administration and management

Compared with traditional, on-premises products and competing cloud services, Cloud Dataproc has a number of unique advantages for clusters of three to hundreds of nodes:

Low-cost. Cloud Dataproc is priced at only 1 cent per virtual CPU in your cluster per hour, on top of the other Cloud Platform resources you use. In addition to this low price, Cloud Dataproc clusters can include preemptible instances that have lower compute prices, reducing your costs even further. Instead of rounding your usage up to the nearest hour, Cloud Dataproc charges you only for what you actually use, with minute-by-minute billing and a low, ten-minute minimum billing period.

Super fast. Without using Cloud Dataproc, it can take anywhere from 5 to 30 minutes to create Spark and Hadoop clusters on-premises or through IaaS providers. By comparison, Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less, on average. This means you can spend less time waiting for clusters and more hands-on time working with your data.

Integrated. Cloud Dataproc has built-in integration with other Google Cloud Platform services, such as BigQuery, Google Cloud Storage, Google Cloud Bigtable, Google Cloud Logging, and Google Cloud Monitoring, so you have more than just a Spark or Hadoop cluster—you have a complete data platform. For example, you can use Cloud Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting.

Managed. Use Spark and Hadoop clusters without the assistance of an administrator or special software. You can easily interact with clusters and Spark or Hadoop jobs through the Google Developers Console, the Google Cloud SDK, or the Cloud Dataproc REST API. When you’re done with a cluster, you can simply turn it off so you don’t spend money on an idle cluster. You won’t need to worry about losing data, because Cloud Dataproc is integrated with Cloud Storage, BigQuery, and Cloud Bigtable, so data you keep in those services remains available after the cluster is turned off.

Simple and familiar. You don’t need to learn new tools or APIs to use Cloud Dataproc, making it easy to move existing projects into Cloud Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated, so you can be productive faster. Today, we are launching with clusters that have Spark 1.5 and Hadoop 2.7.1.
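To make the pricing under "Low-cost" concrete, here is a minimal Python sketch of the arithmetic. It uses only the figures stated above (1 cent per virtual CPU per hour, minute-by-minute billing, a ten-minute minimum). The function names are hypothetical, the rounding of partial minutes up to a whole minute is an assumption, and the hourly-rounding function is only an illustrative comparison at the same rate, not any provider's actual pricing. The result covers the Cloud Dataproc fee alone, not the Compute Engine or other Cloud Platform resources the cluster uses.

```python
import math

# Figures taken from the post.
DATAPROC_CENTS_PER_VCPU_HOUR = 1  # 1 cent per virtual CPU per hour
MINIMUM_BILLED_MINUTES = 10       # ten-minute minimum billing period


def dataproc_fee_cents(vcpus: int, runtime_minutes: float) -> float:
    """Estimate the Cloud Dataproc fee, in cents, for one cluster run.

    Assumption (not stated in the post): a partial minute is rounded
    up to the next whole minute.
    """
    if vcpus < 0 or runtime_minutes < 0:
        raise ValueError("vcpus and runtime_minutes must be non-negative")
    billed_minutes = max(MINIMUM_BILLED_MINUTES, math.ceil(runtime_minutes))
    return vcpus * billed_minutes * DATAPROC_CENTS_PER_VCPU_HOUR / 60


def hourly_rounded_fee_cents(vcpus: int, runtime_minutes: float) -> float:
    """Hypothetical comparison: the same rate, but usage rounded up to whole hours."""
    if vcpus < 0 or runtime_minutes < 0:
        raise ValueError("vcpus and runtime_minutes must be non-negative")
    billed_hours = max(1, math.ceil(runtime_minutes / 60))
    return vcpus * billed_hours * DATAPROC_CENTS_PER_VCPU_HOUR


if __name__ == "__main__":
    # Example: a cluster with 24 virtual CPUs in total, running for 25 minutes.
    print(dataproc_fee_cents(24, 25))        # per-minute billing
    print(hourly_rounded_fee_cents(24, 25))  # rounded up to one full hour
```

For a 24-vCPU cluster that runs for 25 minutes, the per-minute calculation bills 25 minutes, while the hourly-rounded comparison bills a full hour; a run shorter than ten minutes is billed as ten minutes.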

Cloud Dataproc joins a rich set of cloud technologies focused on faster speed, robust features, and lower costs. With Cloud Platform you have access to:

Awesome infrastructure including Google Compute Engine, Cloud Storage, and Google Cloud Networking.

Cloud Dataproc, which builds on this infrastructure to let you use Spark and Hadoop more easily, faster, and at a lower cost. Since Cloud Dataproc is built on Cloud Platform, you have instant access to solid-state drives (SSD) and preemptible virtual machines.

Next-generation data processing and analytics services in Google Cloud Platform, powered by Google-native technologies, including BigQuery, Google Cloud Dataflow, and Google Cloud Pub/Sub, all of which you can combine with Cloud Dataproc.

Today we’re releasing Google Cloud Dataproc as a beta service. Cloud Dataproc gives you anytime access to super-fast, simple yet powerful, managed Spark and Hadoop clusters. Since you only pay for what you use with minute-by-minute billing, you won’t break the bank in the process. We look forward to seeing how you find creative, innovative, and productive ways to use Cloud Dataproc. To learn more about Cloud Dataproc, visit the Cloud Dataproc site, review our getting started guide, or submit your questions and feedback on Stack Overflow.

- Posted by James Malone, Product Manager