Google Cloud Platform Blog
Easily run Dataflow Big Data pipelines anywhere, thanks to Cloudera
Tuesday, January 20, 2015
Big data processing can take place in many contexts. Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost. The best deployment option often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs — from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.
But in all these cases, what remains true is that you need an easy-to-use, powerful and flexible programming model that makes developers productive. And no one wants to have to rewrite their algorithm for a specific runtime.
We believe the
Dataflow programming model
, based on years of experience at Google, can provide maximum developer productivity and seamless portability. That's why in December we
open sourced the Cloud Dataflow SDK
, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing. This allows the same program to execute either in stream or batch mode.
Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. There are currently three runners available to allow Dataflow programs to execute in different environments:
Direct Pipeline
: The “Direct Pipeline” runner executes the program on the local machine.
Google Cloud Dataflow
: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can
apply here
.
Spark
: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the
Cloudera Labs effort
and is available in
this GitHub repo
. You can find out more about Dataflow and the Spark runner from Cloudera’s Josh Wills in this
blog post
.
We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises. Please join us – whether by using the
Dataflow SDK
(deploying via one of the three runners listed above) for your own data processing pipelines, or by creating a new Dataflow runner for your favorite big data runtime.
-Posted by William Vambenepe, Product Manager
No comments :
Post a Comment
Don't Miss Next '17
Use promo code NEXT1720 to save $300 off general admission
REGISTER NOW
Free Trial
GCP Blogs
Big Data & Machine Learning
Kubernetes
GCP Japan Blog
Labels
Announcements
56
Big Data & Machine Learning
91
Compute
156
Containers & Kubernetes
36
CRE
7
Customers
90
Developer Tools & Insights
80
Events
34
Infrastructure
24
Management Tools
39
Networking
18
Open Source
105
Partners
63
Pricing
24
Security & Identity
23
Solutions
16
Stackdriver
19
Storage & Databases
111
Weekly Roundups
16
Archive
2017
Feb
Jan
2016
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2015
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2014
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2013
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2012
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2011
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2010
Dec
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2009
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2008
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Feed
Subscribe by email
Technical questions? Check us out on
Stack Overflow
.
Subscribe to
our monthly newsletter
.
Google
on
Follow @googlecloud
Follow
Follow
No comments :
Post a Comment