Google Cloud Platform Blog
Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service
Thursday, June 26, 2014
In today's world, information is being generated at an incredible rate. However, unlocking insights from large datasets can be cumbersome and costly, even for experts.
It doesn’t have to be that way. Yesterday, at Google I/O, you got a sneak peek of Google Cloud Dataflow, the latest step in our effort to make data and analytics accessible to everyone. You can use Cloud Dataflow:
for data integration and preparation (e.g. readying data for interactive SQL analysis in BigQuery)
to examine a real-time stream of events for significant patterns and activities
to implement advanced, multi-step processing pipelines to extract deep insight from datasets of any size
In these cases and many others, you use Cloud Dataflow’s data-centric model to easily express your data processing pipeline, monitor its execution, and get actionable insights from your data, free from the burden of deploying clusters, tuning configuration parameters, and optimizing resource usage. Just focus on your application, and leave the management, tuning, sweat and tears to Cloud Dataflow.
Cloud Dataflow is based on a highly efficient and popular model used internally at Google, which evolved from MapReduce and successor technologies like Flume and MillWheel. The underlying service is language-agnostic. Our first SDK is for Java, and allows you to write your entire pipeline in a single program using intuitive Cloud Dataflow constructs to express application semantics.
Cloud Dataflow represents all datasets, irrespective of size, uniformly via PCollections (“parallel collections”). A PCollection might be an in-memory collection, read from files on Cloud Storage, queried from a BigQuery table, read as a stream from a Pub/Sub topic, or calculated on demand by your custom code.
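To make that concrete, here is a minimal sketch of how a program might construct PCollections from a few of these sources with the Java SDK. It follows the early Dataflow Java SDK’s constructs (Create, TextIO, BigQueryIO); exact class names and signatures may differ in the released SDK, and the bucket and table names are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class SourcesSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // An in-memory collection, handy for tests and small inputs.
    PCollection<String> inMemory =
        p.apply(Create.of("apple", "banana", "cherry"));

    // Lines read from text files on Cloud Storage (placeholder bucket).
    PCollection<String> fromFiles =
        p.apply(TextIO.Read.from("gs://my-bucket/logs/*.txt"));

    // Rows queried from a BigQuery table (placeholder table spec).
    PCollection<TableRow> fromBigQuery =
        p.apply(BigQueryIO.Read.from("my-project:my_dataset.my_table"));

    // A Pub/Sub topic would similarly yield an unbounded PCollection
    // (via PubsubIO) when the pipeline runs in streaming mode.

    p.run();
  }
}
```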
Because PCollections can be arbitrarily large, Cloud Dataflow includes a rich library of PTransforms (“parallel transforms”), which you can customize with your own application logic. For example, ParDo (“parallel do”) runs your code over each element in a PCollection independently (like both the Map and Reduce functions in MapReduce or WHERE in SQL), and GroupByKey takes a PCollection of key-value pairs and groups together all pairs with the same key (like the Shuffle step of MapReduce or GROUP BY and JOIN in SQL). In addition, anyone can define new custom transformations by composing other transformations -- this extensibility lets you write reusable building blocks which can be shared across programs. Cloud Dataflow provides a starter set of these composed transforms out of the box, including Count, Top, and Mean.
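As an illustration, here is a sketch of a classic word-count pipeline that chains an element-wise ParDo with the composed Count transform (which builds on GroupByKey under the hood). It follows the early Java SDK’s idioms; exact signatures may differ in the released SDK, and the input and output paths are placeholders:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"))
     // ParDo: run custom code over each line independently.
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) {
             c.output(word);
           }
         }
       }
     }))
     // Count: a reusable composed transform built on GroupByKey.
     .apply(Count.<String>perElement())
     // Format each (word, count) pair as a line of text.
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output/wordcounts"));

    p.run();
  }
}
```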
Writing in this modular, high-level style naturally leads to pipelines that make multiple logical passes over the same data. Cloud Dataflow automatically optimizes your data-centric pipeline code by collapsing multiple logical passes into a single execution pass. However, this doesn't turn the system into a black box: Cloud Dataflow’s monitoring UI uses the building-block concept to show you the pipeline as you wrote it, not as the system chooses to execute it.
The same Cloud Dataflow pipeline may run in different ways, depending on the data sources. As you start designing or debugging, you can run against data local to your development environment. When you’re ready to scale up to real data, that same pipeline can run in parallel batch mode against data in Cloud Storage or in distributed real-time processing mode against data coming in via a Pub/Sub topic. This flexibility makes it trivial to transition between different stages in the application development lifecycle: to develop and test applications, to adapt an existing batch pipeline to track time-sensitive trends, or to fix a bug in a real-time pipeline and backfill the historical results.
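As a sketch of how that switch might look in practice with the Java SDK, the program below reads the runner choice from command-line flags. The flag and runner names (DirectPipelineRunner, DataflowPipelineRunner) follow the early SDK and may differ in the released version:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class RunnerChoiceSketch {
  public static void main(String[] args) {
    // With no flags, the SDK defaults to the local DirectPipelineRunner,
    // which executes the pipeline in-process against local or test data.
    //
    // To run the same program as a parallel batch job on the managed
    // service, pass flags such as:
    //   --runner=DataflowPipelineRunner
    //   --project=my-project
    //   --stagingLocation=gs://my-bucket/staging
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline p = Pipeline.create(options);
    // ... apply the same reads and transforms as before ...
    p.run();
  }
}
```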
When you use Cloud Dataflow, you can focus solely on your application logic and let us handle everything else. You should not have to choose between scalability, ease of management and a simple coding model. With Cloud Dataflow, you can have it all.
If you’d like to be notified of future updates about Cloud Dataflow, please join our Google Group.
-Posted by Frances Perry, Software Engineer