Google Cloud Platform Blog

Easily run Dataflow Big Data pipelines anywhere, thanks to Cloudera

Tuesday, January 20, 2015

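Google has open sourced the Cloud Dataflow SDK, which implements the Dataflow programming model. A program written with the SDK can be executed by any of the following runners: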

Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.

Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can apply here.

Spark: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available in this GitHub repo. You can find out more about Dataflow and the Spark runner from Cloudera’s Josh Wills in this blog post.
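
To make the runner idea concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Cloud Dataflow SDK or the Spark runner: `Pipeline`, `Runner`, `LocalRunner`, and `ParallelRunner` are hypothetical names invented for this illustration, and the parallel runner is only a stand-in for a cluster-backed one. The point it shows is that the pipeline is defined once and the choice of runner decides where and how it executes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RunnerSketch {

    /** Toy pipeline (hypothetical, not the SDK class): an ordered list of per-element transforms. */
    static final class Pipeline {
        private final List<Function<String, String>> steps = new ArrayList<>();

        /** Appends a transform and returns this pipeline so calls can be chained. */
        Pipeline apply(Function<String, String> step) {
            steps.add(step);
            return this;
        }

        /** Applies every transform, in order, to a single element. */
        String process(String element) {
            String out = element;
            for (Function<String, String> step : steps) {
                out = step.apply(out);
            }
            return out;
        }
    }

    /** A runner decides how a pipeline is executed; the pipeline itself does not change. */
    interface Runner {
        List<String> run(Pipeline pipeline, List<String> input);
    }

    /** Executes the pipeline sequentially on the calling thread. */
    static final class LocalRunner implements Runner {
        public List<String> run(Pipeline pipeline, List<String> input) {
            List<String> out = new ArrayList<>();
            for (String element : input) {
                out.add(pipeline.process(element));
            }
            return out;
        }
    }

    /** Stand-in for a distributed runner: processes elements with a parallel stream. */
    static final class ParallelRunner implements Runner {
        public List<String> run(Pipeline pipeline, List<String> input) {
            // Collectors.toList() keeps the input order for an ordered source.
            return input.parallelStream()
                        .map(pipeline::process)
                        .collect(Collectors.toList());
        }
    }

    /** The pipeline is built once, with no reference to any runner. */
    static Pipeline buildPipeline() {
        return new Pipeline()
                .apply(String::trim)
                .apply(String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" spark ", "dataflow ", " direct");
        Pipeline pipeline = buildPipeline();

        // Same pipeline, two different execution strategies.
        System.out.println(new LocalRunner().run(pipeline, input));
        System.out.println(new ParallelRunner().run(pipeline, input));
    }
}
```

Both calls in `main` print the same list, because only the execution strategy differs. In the real SDK the runner is selected through the pipeline's configuration rather than by calling a runner object directly as this toy does, so treat the shape above as an analogy rather than the actual API.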