Google Cloud Platform Blog
BigQuery in Practice - Loading Data Sets that are Terabytes and Beyond
Friday, January 31, 2014
We all know the story of David and Goliath. But did you know that King Saul prepared David for battle by fully arming him? He put a coat of armor and a bronze helmet on him, and gave David his sword. David tried walking around in them but they didn't feel right to him. In the end, he decided to carry five small stones and a sling instead. These were the tools that he used to fight off lions as a shepherd boy. We know what the outcome was. David showed us that picking the right tools and using them well is one of the keys to success.
Let's suppose you are tasked to start a Big Data project. You decide to use
Google BigQuery
because:
Its hosting model allows you to quickly run your data analysis without having to set up a costly computing infrastructure.
The interactive speed allows your analysts to quickly validate their hypotheses.
To get started though, your Goliath is to load multi-terabytes of data into BigQuery. The technical article,
BigQuery in practice - Loading Data Sets that are Terabytes and Beyond
, is intended for IT Professionals and Data Architects who are planning to deploy large data sets to Google BigQuery. When dealing with multiple terabytes to petabytes of data, managing the ingestion process (uploading, failure recovery, and cost and quota management) becomes paramount.
Just as David showed us the importance of using the right tools effectively, the paper presents various options and considerations to help you decide on the optimal solution. It follows the common ingestion workflow depicted in the following diagram and discusses the tools you can use during each stage: from uploading the data to Google Cloud Storage and running your Extract, Transform, and Load (ETL) pipelines, to loading the data into BigQuery.
Scenarios for data ingestion into BigQuery
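To make the workflow concrete, here is a minimal sketch of the staging and loading stages driven from Scala's process API. It assumes the gsutil and bq command-line tools are installed and authenticated; the bucket, dataset, table, file, and schema names are all hypothetical placeholders.

```scala
import scala.sys.process._

// A minimal sketch of the ingestion workflow: stage the data in Cloud Storage,
// then load it into BigQuery. All names below are hypothetical.
object IngestSketch extends App {
  // Stage 1: upload the source file to Google Cloud Storage.
  val upload = Seq("gsutil", "cp", "sales.csv", "gs://my-bucket/staging/sales.csv").!

  // Stage 2: run any ETL over the staged data here (omitted).

  // Stage 3: load the staged file into a BigQuery table.
  val load = Seq("bq", "load", "--source_format=CSV",
    "mydataset.sales", "gs://my-bucket/staging/sales.csv",
    "sale_date:TIMESTAMP,region:STRING,amount:FLOAT").!

  require(upload == 0 && load == 0, "a stage failed and should be retried")
}
```

At terabyte scale the same three stages apply, but the paper's considerations (parallel uploads, retry on failure, and load-quota management) determine how each stage is implemented.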
When dealing with large data sets, the right implementation can save hours or days, while an improper design may mean weeks of re-work. David was so successful that King Saul gave him a high rank in the army. Similarly, we are here to help you use Google Cloud Platform successfully so that your Big Data project achieves the same level of success.
- Posted by Wally Yau, Cloud Solutions Architect
Announcing the Google Cloud Platform Developer Challenge 2013 Winners
Thursday, January 30, 2014
About 5 months ago, developers from across the globe witnessed the launch of the
Google Cloud Developer Challenge
. This contest invited developers to build locally relevant web applications on
Google App Engine
that solve real world problems and “WOW” the world through the deft application of Google APIs like the
Google+
,
YouTube
and
Maps
APIs.
The competition launched on September 4th, 2013, and the first round of submissions closed on November 22nd, 2013. The challenge had two categories for submissions:
Enterprise/Small Business Solutions, Education, Not for Profit
Social/Personal Productivity, Games/Fun
Submissions were sent in from all six challenge regions across the world:
Latin America
Sub Saharan Africa
South East Asia
Middle East & North Africa
India
Rest of the World
Participants chose the category that best described their submissions.
After receiving
434 submissions
in the first round, it took
80 judges from 26 countries worldwide
working with judges from Google to select
98 finalists
who would go on to compete for the grand prize in their regions. The finalists had over 3 weeks to improve their apps and resubmit for the final judging round carried out by Googlers from around the world.
It is therefore a great pleasure to announce the 12 winners of the 2013 edition of the Google Cloud Developer Challenge. You can find the details of the 12 winning apps
here
. We would like to thank all the developers and judges from around the world who helped make the contest a memorable one.
-Posted by Chukwuemeka Afigbo, Program Manager
Large Akka Cluster on Google Compute Engine
Wednesday, January 22, 2014
Our guest blog post today comes from Patrik Nordwall, Senior Software Engineer at Typesafe. His areas of expertise include Scala, Akka, and how to build reactive applications.
Akka
is a toolkit developed by
Typesafe
and based on the
Actor Model
for building highly concurrent, distributed, and fault-tolerant applications on the Java Virtual Machine. We have been working together with the Google Cloud Platform team to test our Akka Cluster on
Google Compute Engine
. The goal was to push our toolkit to its limits and gain insight into how to scale systems on this platform. The results are jaw-dropping: we reached 2400 nodes, and started up a 1000-node cluster in just over four minutes. We quickly learned that Akka on Compute Engine is a great combination for elastic application deployments.
Our impression of Google Compute Engine is that everything just works, which is more than you typically expect from an IaaS. Google Compute Engine is easy to operate, its tools and APIs are easy to understand, it features great stability, and the speed of starting new instances is outstanding, allowing us to SSH into them only 10 seconds after spawning them.
Running a 2400 Akka Nodes Cluster
The test was performed by first starting 1500 instances in batches of 20 every 3 minutes. Each instance was hosting one JVM running a
small test application
(see below for details), which joined the cluster.
Akka Cluster
is a decentralized peer-to-peer cluster that uses a gossip protocol to spread membership changes. Joining a node involves two consecutive state changes that are spread to all members in the cluster, one after the other. We measured the time from initiating the join until all nodes had seen the new node as a full member.
As can be seen in the above chart, it typically takes 15 to 30 seconds to add 20 nodes. Note that this is the duration until the cluster has a consistent view of the new member, without any central coordination. Nodes were added slowly—stretching the process over a total period of four hours—to also verify the cluster’s stability over time with repeated membership changes.
The time to join increases with the size of the cluster, but not drastically; the theoretical expectation would be logarithmic behavior. For our implementation this holds only up to 500 nodes because we gradually reduce the bias of gossip peer selection when the cluster grows beyond that size.
During periods without cluster membership changes the network traffic of 1500 nodes amounted to around 8 kB/s aggregated average input and output on each node. The average CPU utilization across all nodes was around 10%.
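For a sense of how the join-time measurement above can be taken, each node can subscribe to the cluster’s membership events and record when it sees a member reach the Up state; comparing those timestamps across nodes yields the convergence time. A minimal sketch against the Akka 2.3 cluster API (this illustrative actor is hypothetical, not the actual test application):

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{CurrentClusterState, MemberUp}

// Logs when this node sees another member become a full member (Up).
class MemberUpTimer extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit = cluster.subscribe(self, classOf[MemberUp])
  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case state: CurrentClusterState =>
      log.info("Snapshot on subscribe: {} members", state.members.size)
    case MemberUp(member) =>
      log.info("{} seen Up at {} ms", member.address, System.currentTimeMillis)
  }
}
```

Started with system.actorOf(Props[MemberUpTimer]) on every node, the logged timestamps bound the time between a join and full, cluster-wide convergence.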
1500 nodes is a fantastic result and, to be honest, far beyond our expectations. Our instance quota was 1500 instances, but why not continue and add more JVMs on the running instances until it breaks? This worked up to 2400 nodes, but after that the Akka cluster finally broke down:
when adding more nodes we observed long garbage collection pauses, and many nodes were marked as unreachable and did not come back. This was a limit of the Akka cluster software, not of Google Compute Engine itself. To our current knowledge this is not a hard limit, which means that we will eventually overcome it.
Starting Up 1000 Nodes
In the first test the nodes were added slowly, in part to verify the cluster’s stability. Bulk additions are a more typical use case when starting up a fresh cluster, so we also tested how long it took to get an Akka cluster running across 1000 Google Compute Engine instances, hosting one cluster node each.
It took 4 minutes and 15 seconds from starting the script until all 1000 Akka cluster members were reported as full members and seen by all other nodes.
That measurement also includes the time it takes to start the actual Google Compute Engine instances, which is just mind-boggling: cloud elasticity taken to its extreme. Google Compute Engine is a perfect environment for Akka Cluster, which can scale up and down automatically as nodes are spun up or shut down.
Testing Details
The revision of the test application used in the
akka/apps
repository is
f86ba8db18
. It uses snapshot 2.3-20131025-230950 of Akka. For information on how to run Akka on Compute Engine, you can
read my post
on Typesafe.com.
We used the Oracle JVM, Java version 1.7.0_40, with a 1538 MB heap and ParallelGC.
The Google Compute Engine instances used in this test were of type n1-standard-2 in zone europe-west1-a, with Debian Linux (debian-7-wheezy-v20130816). This instance type is priced at $0.228/h, has 7.5 GB of memory, and provides 5.5 GCEU.
We would like to send a big thank you to Google for making it possible to conduct these tests. Overall, it made Akka a better product. Google Compute Engine is an impressive infrastructure as a service, with the stability, performance and elasticity that you need for high-end Akka systems.
-Contributed by Patrik Nordwall, Senior Software Engineer, Typesafe
Learn about Permissions on Google Cloud Platform
Tuesday, January 21, 2014
Do your co-workers ask you “How should I set up Google Cloud Platform projects for my developers?” Have you wondered about the difference between the Project Id, the Project Number and the App Id? Do you know what a service account is and why you need one? Find the answers to these and many other questions in a
newly published guide to understanding permissions, projects and accounts on Google Cloud Platform
.
Especially if you are just getting started, and are still sorting out the various concepts and terminology, this is the guide for you. The article includes explanations, definitions, best practices and links to the relevant documentation for more details. It’s a good place to start when learning to use Cloud Platform.
- Posted by Jeff Peck, Cloud Solutions Technical Account Manager
Performance advantages of the new Google Cloud Storage Connector for Hadoop
Wednesday, January 15, 2014
Our guest blog post today comes from Mike Wendt, R&D Associate Manager at Accenture Technology Labs, who recently published a study detailing the real-world performance advantages of Hadoop on Google Compute Engine. His team utilized the recently launched Google Cloud Storage Connector for Hadoop and observed significant performance improvements over HDFS on local filesystems.
Hadoop clusters tend to be deployed on bare metal; however, they are increasingly deployed on cloud environments such as Google Compute Engine. Benefits such as pay-per-use pricing, scalability, and performance tuning make the cloud a practical option for Hadoop deployments. At
Accenture Technology Labs
, we were interested in proving the value of cloud over bare metal, and devised a method for a price-performance-ratio comparison of a bare-metal Hadoop cluster with cloud-based Hadoop clusters at a matched total-cost-of-ownership level.
As part of this effort, we created the Accenture Data Platform Benchmark, comprising three real-world MapReduce applications, to benchmark execution times on each platform. Collaborating with Google engineers, we chose Google Compute Engine for our most recent version of this study and leveraged the
Google Cloud Storage Connector for Hadoop
within our experiments. For a look at the detailed report, you can find it
here
.
Original experiment setup
In conducting our experiments we used Google Compute Engine instances with local disks and streaming MapReduce jobs to copy input/output data to/from the local HDFS within our Hadoop clusters.
Figure 1. Data-flow model using input and output copies to local disk HDFS
As shown in Figure 1, this data-flow method provided the data we needed for our benchmarks, at the cost of increased total execution time from the additional input-copy and output-copy phases. Beyond increasing execution times, this data-flow model also resulted in more complex code for launching and managing experiments: our testbench scripts had to be modified to include the necessary copies, which required extra debugging and testing time to ensure the scripts were correct.
Modified experiment setup using Google Cloud Storage Connector for Hadoop
During our initial testing, Google engineers approached us and offered an opportunity to use the Google Cloud Storage Connector for Hadoop (“the connector”) before its general release. We were eager to use the connector, since it would simplify our scripted workflows and improve data movement. Integrating the connector into our workflows was straightforward and required minimal effort. Once the connector was configured, we were able to change our data-flow model, removing the need for copies and allowing us to read input data directly from Google Cloud Storage and write output data directly back to it.
Figure 2. Data-flow model using Google Cloud Storage Connector for Hadoop
Figure 2 shows the MapReduce job(s) accessing input data directly and writing output data directly to Google Cloud Storage, all without additional copies via streaming MapReduce jobs. Not only did the connector reduce our execution times by removing the input and output copy phases, but the ability to access the input data from Google Cloud Storage delivered unexpected performance benefits. We saw further decreases in our execution times due to the high availability of the input data compared to traditional HDFS access, as detailed in the “Results of experiments” section.
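To give a flavor of that integration, the sketch below registers the connector’s file system implementation for gs:// URIs and reads an input directory straight from Google Cloud Storage. The fs.gs.impl property and class name follow the connector’s documentation; the project ID, bucket, and path are hypothetical placeholders.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GcsAccessSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Map the gs:// scheme to the connector's FileSystem implementation.
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.gs.project.id", "my-project") // hypothetical project ID

    // List input files directly from Cloud Storage; no copy into local HDFS.
    val fs = FileSystem.get(new URI("gs://my-bucket/"), conf)
    fs.listStatus(new Path("gs://my-bucket/input/")).foreach(s => println(s.getPath))
  }
}
```

With the scheme registered this way, existing MapReduce code can take gs:// paths anywhere it previously took hdfs:// ones, which is what removed the copy phases from our testbench scripts.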
Results of experiments
The results of our experiments with the three workloads of the Accenture Data Platform Benchmark are detailed below. For each workload we examine the data access pattern of the workload and the performance of the Google Cloud Storage Connector for Hadoop.
Recommendation engine
In Figure 3, we can see that using the connector resulted in an average execution time savings of 24.4 percent when compared to “Local Disk HDFS (w/ copy times).” Of the savings, 86.0 percent (or 21.0 percent overall) came from removing the need for input and output data copies, and 14.0 percent (or 3.4 percent overall) came from using the connector to retrieve input and save output data. The performance impact of the connector is not as strong as in the other workloads because this workload uses a smaller dataset.
The opportunity for speed-up using the connector depends on the amount of reads and writes to Google Cloud Storage. Because the workload reads input data only in the first job and writes output data only in the last of the ten cascaded jobs, there is limited opportunity to improve the execution time using the connector. Also, the relatively small dataset (5 GB) for the recommendation engine can be processed more quickly on the Google Compute Engine instances, and results in less data that needs to be moved between Google Cloud Storage and the cluster.
Figure 3. Recommendation engine execution times
Sessionization
The sessionization workload rearranges a large dataset (24 TB uncompressed; ~675 GB compressed) in a single MapReduce job. This CPU- and memory-intensive workload benefited significantly from the use of the Google Cloud Storage Connector for Hadoop. Using the connector resulted in an average execution time savings of 26.2 percent when compared to “Local Disk HDFS (w/ copy times).” Of the savings, 25.6 percent (or 6.7 percent overall) came from removing the need for input and output data copies, and 74.4 percent (or 19.5 percent overall) came from using the connector, as shown in Figure 4. This large speed-up from the connector is thanks to the nature of the workload as a single MapReduce job. Overhead with the NameNode, and data-locality issues such as streaming data to other nodes for processing, can be avoided by using the connector to supply all nodes with data equally and evenly. This shows that even with remote storage, data-locality concerns can be overcome by using Google Cloud Storage and the provided connector, with better results than traditional local-disk HDFS.
Figure 4. Sessionization execution times
Document clustering
Similar speed-ups from the connector were observed with the document-clustering workload. Using the connector resulted in an average execution time savings of 20.6 percent when compared to “Local Disk HDFS (w/ copy times).” Of the savings, 26.8 percent (or 5.5 percent overall) came from removing the need for input and output data copies, and 73.2 percent (or 15.0 percent overall) came from using the connector, as shown in Figure 5. Owing to the large amount of data processed by the first MapReduce job of the document-clustering workload (~31,000 files totaling 3 TB), the connector is able to transfer this data to the nodes much faster, resulting in the speed-up when compared to local-disk HDFS.
Figure 5. Document clustering execution times
Conclusion
From our study, we can see that remote storage powered by the Google Cloud Storage Connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads, to varying degrees based on their access patterns. Workloads like sessionization and document clustering read input data from 14,800 and about 31,000 files, respectively, and see the largest improvements because the files are accessible from every node in the cluster. Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that hold the data either to process the file themselves or to transfer it to an available node for processing.
In comparison, the recommendation engine workload has only one input file of 5 GB. With remote storage and the connector, we still see a performance increase in reading this large file, because it is not split into small 64 MB or 128 MB chunks that must be streamed from multiple nodes in the cluster to the nodes processing them. Although this performance increase is not as large as in the other workloads (14.0 percent, compared with 73.2 to 74.4 percent for the other workloads), we can still see the value of using remote storage to provide faster access and greater availability of data when compared with the HDFS data-locality model. This availability of remote storage at the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.
Additionally, our
recent study
reinforced our original findings. First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so large that the cloud’s virtualization-layer overhead is a worthy investment, as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming, and thus requires automated tuning tools.
Extending our original findings, we were able to observe the performance impact of data locality and remote storage within the cloud. While counterintuitive, our experiments show that using remote storage to make data highly available outperforms local-disk HDFS relying on data locality. The performance benefits we saw with the Google Cloud Storage Connector for Hadoop can be realized with other MapReduce applications, and help to accelerate the adoption of Hadoop deployments on Google Compute Engine.
-Contributed by Mike Wendt, R&D Associate Manager, Accenture Technology Labs
Easier, faster and lower cost Big Data processing with the Google Cloud Storage Connector for Hadoop
Tuesday, January 14, 2014
Google Compute Engine VMs provide a
fast
and reliable way to run
Apache Hadoop
. Today, we’re making it easier to run
Hadoop on Google Cloud Platform
with the Preview release of the
Google Cloud Storage connector for Hadoop
that lets you focus on your data processing logic instead of on managing a cluster and file system.
Diagram of Hadoop on Google Cloud Platform. HDFS and the NameNode are optional when storing data in Google Cloud Storage
In the 10 years since we first introduced
Google File System (GFS)
— the basis for Hadoop Distributed File System (HDFS) — Google has continued to improve our storage system for large data processing. The latest iteration is
Colossus
.
Today’s launch puts that storage system to work for Hadoop. Using a simple connector library, Hadoop can now run directly against
Google Cloud Storage
— an object store built on Colossus. That means you benefit from Google’s expertise in large data processing.
Here are a few other benefits of running Hadoop with Google Cloud Storage:
Compatibility:
The Google Cloud Storage Connector for Hadoop is code-compatible with Hadoop; just change the URL to point to your data (see the sketch after this list).
Quick startup:
Your data is ready to process. You don’t have to wait minutes or more while your data is copied over to HDFS and the NameNode comes out of safe mode, and you don’t have to pay for the VM time for that data copying either.
Greater availability and scalability:
Google Cloud Storage is globally replicated and has higher availability than HDFS because it’s independent of the compute nodes and the NameNode. If the VMs are turned down (or, cloud forbid, crash), your data lives on.
Lower costs:
Save on storage and compute: storage, because there’s no need to maintain two copies of your data, one for backups and one for running Hadoop; compute, because you don’t need to keep VMs going just to serve data. And with per-minute billing, you can run Hadoop jobs faster on more cores and know your costs aren’t getting rounded up to a whole hour.
No storage management overhead:
Whereas HDFS requires routine maintenance -- like file system checks, rebalancing, upgrades, rollbacks and NameNode restarts -- Google Cloud Storage just works. Your data is safe and consistent with no extra effort.
Interoperability:
By keeping your data in Google Cloud Storage, you can benefit from all of the other Google services that already play nicely together.
Performance:
Google’s infrastructure delivers high performance from Google Cloud Storage that’s comparable to HDFS -- without the overhead and maintenance.
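Here is what the compatibility point looks like in practice: a hedged sketch of a minimal job driver whose input and output simply point at gs:// URIs instead of hdfs:// ones. The bucket and paths are hypothetical, and the mapper and reducer are omitted.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object GcsJobSketch {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "wordcount-on-gcs")
    job.setJarByClass(GcsJobSketch.getClass)
    // Previously: hdfs://namenode/data/input and hdfs://namenode/data/output.
    FileInputFormat.addInputPath(job, new Path("gs://my-bucket/data/input"))
    FileOutputFormat.setOutputPath(job, new Path("gs://my-bucket/data/output"))
    // job.setMapperClass(...) and job.setReducerClass(...) as in any Hadoop job.
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```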
To see the benefits for yourself, give Hadoop on Google Cloud Platform a try by following the
simple tutorial
.
We would love to hear your
feedback and ideas
on how to make Hadoop and MapReduce run even better on Google Cloud Platform.
-Posted by Jonathan Bingham, Product Manager
A better way to explore and learn on GitHub
Monday, January 13, 2014
Almost one year ago, Google Cloud Platform
launched our GitHub organization,
with repositories ranging from tutorials to samples to utilities. This is where developers could find all resources relating to the platform, and get started developing quickly. We started with 36 repositories, with lofty plans to add more over time in response to requests from you, our developers. Many product releases, feature launches, and one logo redesign later, we are now up to 123 repositories illustrating how to use all parts of our platform!
Despite
some clever naming schemes
, it was becoming difficult to find exactly the code that you wanted amongst all of our repositories. Idly browsing through over 100 options wasn’t productive. The repository names gave you an idea of what stacks they used, but not what problems they solved.
Today, we are making it easier to browse our repositories and search for sample code with our landing page at
googlecloudplatform.github.io
. Whether you want to find all Compute Engine resources, locate all samples that are available in your particular stack, or find examples that fit your particular area of interest, you can find it with the new GitHub page. We’ll be rotating the repositories in the featured section, so make sure to wander that way from time to time.
We are very committed to open source at
Google Cloud Platform
. Please let us know what kinds of samples and tools you’d like to see from the team. We’re looking forward to many more commits ahead!
-Posted by Julia Ferraioli, Developer Advocate
Multisite hosting using Protocol Forwarding, Compute Engine’s new forwarding capability
Thursday, January 9, 2014
Serving 1 million requests per second using Compute Engine
sure is awesome. But what about scaling in the other direction and serving more than one website from the same instance? A handy new feature called
Protocol Forwarding
makes this possible on
Compute Engine
. It enables you to attach multiple Internet IP addresses to an instance. This way you can configure your webserver to host content that belongs to different sites, even with SSL.
Protocol Forwarding is a new addition to Compute Engine's growing collection of forwarding capabilities (including
Layer-3 Load Balancing
). In addition to attaching multiple Internet IP addresses to a single instance without network address translation (NAT), it can also forward select IP protocols to the virtual machine instances. The supported protocols are TCP, UDP, SCTP and IPSec (ESP/AH). A sharp-eyed reader can tell that, when used together with
advanced routing
, Protocol Forwarding simplifies the virtual private network (VPN) setup that connects your outside servers to your Compute Engine network.
Protocol Forwarding and Load Balancing share many of the same underpinnings, including the same top-level API resource called
Forwarding Rules
. The best way to learn how to use it is to follow the
quickstart guide
. Upon completion, you should have a web server on a single instance hosting three different websites with three different Internet IP addresses.
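For a flavor of what the quickstart walks through, the sketch below wraps an existing VM in a target instance and then attaches one forwarding rule per external IP address, driving the gcloud CLI from Scala’s process API. All resource names, the zone, and the region are hypothetical placeholders, and the flags follow the current gcloud reference rather than any era-specific tooling.

```scala
import scala.sys.process._

object ForwardingSketch extends App {
  // Wrap an existing VM in a target instance that forwarding rules can reference.
  Seq("gcloud", "compute", "target-instances", "create", "www-target",
      "--instance", "www-vm", "--zone", "us-central1-a").!

  // One forwarding rule per external IP address; each rule receives its own
  // ephemeral address unless an --address is specified.
  for (rule <- Seq("www-rule-1", "www-rule-2", "www-rule-3")) {
    Seq("gcloud", "compute", "forwarding-rules", "create", rule,
        "--region", "us-central1", "--ip-protocol", "TCP", "--ports", "80",
        "--target-instance", "www-target",
        "--target-instance-zone", "us-central1-a").!
  }
}
```

Each rule’s address then serves as one of the websites’ public IPs, and the webserver on the instance distinguishes sites by the destination address.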
Pricing for Protocol Forwarding is the same as for Layer-3 Load Balancing. In fact, you can enjoy both features at no cost until the end of January 2014. Please read more about
pricing and promotion details
.
On behalf of the Google Cloud Platform team, I wish everyone a productive and rewarding 2014!
-Posted by Gary Ling, Product Manager