We all know the story of David and Goliath. But did you know that King Saul prepared David for battle by fully arming him? He put a coat of armor and a bronze helmet on him, and gave David his sword. David tried walking around in them but they didn't feel right to him. In the end, he decided to carry five small stones and a sling instead. These were the tools that he used to fight off lions as a shepherd boy. We know what the outcome was. David showed us that picking the right tools and using them well is one of the keys to success.



Let's suppose you are tasked with starting a Big Data project. You decide to use Google BigQuery because:


  • Its hosting model allows you to quickly run your data analysis without having to set up a costly computing infrastructure.

  • The interactive speed allows your analysts to quickly validate their hypotheses about the data.


To get started, though, your Goliath is loading multiple terabytes of data into BigQuery. The technical article, BigQuery in practice - Loading Data Sets that are Terabytes and Beyond, is intended for IT Professionals and Data Architects who are planning to deploy large data sets to Google BigQuery. When dealing with terabytes to petabytes of data, managing the processing of that data, including uploading, failure recovery, cost, and quota management, becomes paramount.



Just as David showed us the importance of using the right tools effectively, the article presents various options and considerations to help you decide on the optimal solution. It follows the common ingestion workflow as depicted in the following diagram and discusses the tools that you can use during each stage, from uploading the data to Google Cloud Storage, through running your Extract, Transform, and Load (ETL) pipelines, to loading the data into BigQuery.




Scenarios for data ingestion into BigQuery
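
As a minimal sketch of that final loading stage, the example below uses the google-cloud-bigquery Java client library to load CSV files from a Cloud Storage bucket into a table. The dataset, table, and bucket names are hypothetical, and the article may well describe different tooling (such as the bq command-line tool or the REST API directly); treat this as one illustrative way to issue the load job rather than the article's prescribed method.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadFromGcs {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical dataset, table, and bucket names.
    TableId tableId = TableId.of("my_dataset", "events");
    String sourceUri = "gs://my-ingest-bucket/events/*.csv";

    LoadJobConfiguration loadConfig =
        LoadJobConfiguration.newBuilder(tableId, sourceUri)
            .setFormatOptions(FormatOptions.csv())
            .setAutodetect(true)  // let BigQuery infer the schema for this sketch
            .build();

    // Submit the load job and block until it finishes.
    Job job = bigquery.create(JobInfo.of(loadConfig));
    job = job.waitFor();

    if (job.getStatus().getError() != null) {
      System.err.println("Load failed: " + job.getStatus().getError());
    } else {
      System.out.println("Loaded " + sourceUri + " into " + tableId);
    }
  }
}

Loading from Cloud Storage, rather than streaming from your own machines, is also what keeps retries and failure recovery manageable at the multi-terabyte scale the article is concerned with.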

When dealing with large data sets, the correct implementation can mean savings of hours or days, while an improper design may mean weeks of rework. David was so successful that King Saul gave him a high rank in the army. Similarly, we are here to help you use Google Cloud Platform successfully so that your Big Data project achieves the same level of success.



- Posted by Wally Yau, Cloud Solutions Architect

About 5 months ago developers from across the globe witnessed the launch of the Google Cloud Developer Challenge. This contest invited developers to build locally relevant web applications built on Google App Engine that solve real world problems and “WOW” the world through the deft application of Google APIs like the Google+, YouTube and Maps APIs.



The competition launched on September 4th, 2013, and the first round of submissions closed on November 22nd, 2013. The challenge had two submission categories:


  • Enterprise/Small Business Solutions, Education, Not for Profit

  • Social/Personal Productivity/Games/Fun




Submissions were sent in from all six challenge regions across the world:


  • Latin America

  • Sub Saharan Africa

  • South East Asia

  • Middle East & North Africa

  • India

  • Rest of the World




Participants chose the category that best described their submissions.



After receiving 434 submissions in the first round, it took 80 judges from 26 countries, working with judges from Google, to select the 98 finalists who would go on to compete for the grand prize in their regions. The finalists had over 3 weeks to improve their apps and resubmit them for the final judging round, carried out by Googlers from around the world.



It is therefore a great pleasure to announce the 12 winners of the 2013 edition of the Google Cloud Developer Challenge. You can find the details of the 12 winning apps here. We would like to thank all the developers and judges from around the world who helped make the contest a memorable one.



-Posted by Chukwuemeka Afigbo, Program Manager

Our guest blog post today comes from Patrik Nordwall, Senior Software Engineer at Typesafe. His areas of expertise include Scala, Akka, and how to build reactive applications.



Akka is a toolkit developed by Typesafe and based on the Actor Model for building highly concurrent, distributed, and fault-tolerant applications on the Java Virtual Machine. We have been working together with the Google Cloud Platform team to test our Akka Cluster on Google Compute Engine. The goal was to push our toolkit to its limits and gain insight into how to scale systems on this platform. The results are jaw-dropping: reaching 2400 nodes, as well as starting up a 1000-node cluster in just over four minutes. We quickly learned that Akka on Compute Engine is a great combination for elastic application deployments.



Our impression of Google Compute Engine is that everything just works, which is more than you typically expect from an IaaS. Google Compute Engine is easy to operate, its tools and APIs are easy to understand, it offers great stability, and the speed of starting new instances is outstanding: we could SSH into them only 10 seconds after spawning them.



Running a 2400-Node Akka Cluster

The test was performed by first starting 1500 instances in batches of 20 every 3 minutes. Each instance hosted one JVM running a small test application (see below for details), which joined the cluster. Akka Cluster is a decentralized peer-to-peer cluster that uses a gossip protocol to spread membership changes. Joining a node involves two consecutive state changes that are spread to all members in the cluster one after the other. We measured the time from initiating the join until all nodes had seen the new node as a full member.





As can be seen in the above chart, it typically takes 15 to 30 seconds to add 20 nodes. Note that this is the duration until the cluster has a consistent view of the new member, without any central coordination. Nodes were added slowly—stretching the process over a total period of four hours—to also verify the cluster’s stability over time with repeated membership changes.
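
For readers curious what “seen as a full member by all nodes” looks like from application code, here is a minimal sketch of a cluster membership listener using Akka's classic Java API. The benchmark itself used a Scala test application on an Akka 2.3 snapshot, so this is illustrative rather than the actual measurement code; the actor system name is arbitrary, and the cluster provider and seed nodes are assumed to be configured in application.conf.

import akka.actor.AbstractActor;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.ClusterEvent.MemberUp;

// Logs every node that reaches full membership, as seen from this node.
public class MemberUpListener extends AbstractActor {
  private final Cluster cluster = Cluster.get(getContext().getSystem());

  @Override
  public void preStart() {
    // Deliver the current membership as events, then all subsequent MemberUp events.
    cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(), MemberUp.class);
  }

  @Override
  public void postStop() {
    cluster.unsubscribe(getSelf());
  }

  @Override
  public Receive createReceive() {
    return receiveBuilder()
        .match(MemberUp.class,
            up -> System.out.println("Member is up: " + up.member().address()))
        .build();
  }

  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("benchmark");
    system.actorOf(Props.create(MemberUpListener.class), "memberUpListener");
  }
}

Timestamping these events on every node and comparing them with the time the join was initiated gives the kind of convergence measurement plotted above.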



The time to join increases with the size of the cluster, but not drastically; the theoretical expectation would be logarithmic behavior. For our implementation this holds only up to 500 nodes because we gradually reduce the bias of gossip peer selection when the cluster grows beyond that size.



During periods without cluster membership changes the network traffic of 1500 nodes amounted to around 8 kB/s aggregated average input and output on each node. The average CPU utilization across all nodes was around 10%.



1500 nodes is a fantastic result and, to be honest, far beyond our expectations. Our instance quota was 1500 instances, but why not continue and add more JVMs on the running instances until it breaks? This worked up to 2400 nodes, but after that the Akka cluster finally broke down: when adding more nodes we observed long garbage collection pauses, and many nodes were marked as unreachable and did not come back. This was a limit of the Akka cluster software, not of Google Compute Engine itself. To our current knowledge this is not a hard limit, which means that we will eventually overcome it.



Starting Up 1000 Nodes

In the first test the nodes were added slowly, to also verify the cluster’s stability. Bulk additions are a more typical use case when starting up a fresh cluster. We also tested how long it took to get an Akka cluster running across 1000 Google Compute Engine instances, hosting one cluster node each.



It took 4 minutes and 15 seconds from starting the script until all 1000 Akka cluster members were reported as full members and seen by all other nodes.



That measurement also includes the time it takes to start the actual Google Compute Engine instances, which is just mind-boggling: cloud elasticity taken to its extreme. Google Compute Engine is a perfect environment for Akka Cluster, which can scale up and down automatically as nodes are spun up or shut down.



Testing Details

The test application used is revision f86ba8db18 of the akka/apps repository. It uses Akka snapshot 2.3-20131025-230950. For information on how to run Akka on Compute Engine, you can read my post on Typesafe.com.



We used the Oracle JVM, Java version 1.7.0_40 with 1538 MB heap and ParallelGC.



The Google Compute Engine instances used in this test were of type n1-standard-2 in zone europe-west1-a, with Debian Linux (debian-7-wheezy-v20130816). This instance type is priced at $0.228/h, has 7.5 GB of memory, and provides 5.5 GCEU.



We would like to send a big thank you to Google for making it possible to conduct these tests. Overall, it made Akka a better product. Google Compute Engine is an impressive infrastructure as a service, with the stability, performance and elasticity that you need for high-end Akka systems.



-Contributed by Patrik Nordwall, Senior Software Engineer, Typesafe

Do your co-workers ask you “How should I set up Google Cloud Platform projects for my developers?” Have you wondered about the difference between the Project Id, the Project Number and the App Id? Do you know what a service account is and why you need one? Find the answers to these and many other questions in a newly published guide to understanding permissions, projects and accounts on Google Cloud Platform.



Especially if you are just getting started, and are still sorting out the various concepts and terminology, this is the guide for you. The article includes explanations, definitions, best practices and links to the relevant documentation for more details. It’s a good place to start when learning to use Cloud Platform.



- Posted by Jeff Peck, Cloud Solutions Technical Account Manager

Our guest blog post today comes from Mike Wendt, R&D Associate Manager at Accenture Technology Labs, who recently published a study detailing the real world performance advantages of Hadoop on Google Compute Engine. His team utilized the recently launched Google Cloud Storage Connector for Hadoop and observed significant performance improvements over HDFS on local filesystems.



Hadoop clusters tend to be deployed on bare metal; however, they are increasingly deployed in cloud environments such as Google Compute Engine. Benefits such as pay-per-use pricing, scalability, and performance tuning make the cloud a practical option for Hadoop deployments. At Accenture Technology Labs, we were interested in proving the value of cloud over bare metal, and we devised a method for a price-performance-ratio comparison of a bare-metal Hadoop cluster with cloud-based Hadoop clusters at a matched total-cost-of-ownership level.



As part of this effort we created the Accenture Data Platform Benchmark, comprising three real-world MapReduce applications, to benchmark execution times on each platform. Collaborating with Google engineers, we chose Google Compute Engine for the most recent version of this study and leveraged the Google Cloud Storage Connector for Hadoop in our experiments. You can find the detailed report here.



Original experiment setup

In conducting our experiments we used Google Compute Engine instances with local disks and streaming MapReduce jobs to copy input/output data to/from the local HDFS within our Hadoop clusters.




Figure 1. Data-flow model using input and output copies to local disk HDFS

As shown in Figure 1, this data-flow method provided the data we needed for our benchmarks at the cost of increased total execution time, due to the additional input-copy and output-copy phases. In addition to increasing execution times, this data-flow model also resulted in more complex code for launching and managing experiments. We had to modify our testbench scripts to include the necessary copies, and spend extra debugging and testing time to ensure the scripts were correct.



Modified experiment setup using Google Cloud Storage Connector for Hadoop

During our initial testing, Google engineers approached us and offered an opportunity to use the Google Cloud Storage Connector for Hadoop (“the connector”) before its general release. We were eager to use the connector, since it would simplify our scripted workflows and improve data movement. Integrating the connector into our workflows was straightforward and required minimal effort. Once the connector was configured, we were able to change our data-flow model, removing the need for copies and allowing us to read input data directly from Google Cloud Storage and write output data directly back to it.




Figure 2. Data-flow model using Google Cloud Storage Connector for Hadoop

Figure 2 shows the MapReduce job(s) reading input data from and writing output data directly to Google Cloud Storage, all without additional copies via streaming MapReduce jobs. Not only did the connector reduce our execution times by removing the input and output copy phases, but the ability to access the input data from Google Cloud Storage also yielded unexpected performance benefits. We saw further decreases in our execution times due to the high availability of the input data compared to traditional HDFS access, as detailed in the “Result of experiments” section.
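
To give a rough idea of what that change looks like in practice, the sketch below is a bare Hadoop job driver whose input and output point at gs:// paths instead of HDFS. The bucket names are hypothetical and the real benchmark applications are considerably more involved; the point is only that, once the connector is installed and configured on the cluster, Cloud Storage paths can be used wherever HDFS paths were used before, so the separate copy-in and copy-out streaming jobs disappear.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Skeleton driver that reads from and writes to Google Cloud Storage directly.
public class GcsDirectDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "sessionization-sketch");
    job.setJarByClass(GcsDirectDriver.class);

    // The real mapper/reducer classes of the benchmark would be set here, e.g.:
    // job.setMapperClass(SessionMapper.class);   // hypothetical
    // job.setReducerClass(SessionReducer.class); // hypothetical

    // gs:// paths replace the hdfs:// paths that previously required copy jobs.
    FileInputFormat.addInputPath(job, new Path("gs://benchmark-data/input/"));
    FileOutputFormat.setOutputPath(job, new Path("gs://benchmark-data/output/"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}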



Result of experiments

The results of our experiments with the three workloads of the Accenture Data Platform Benchmark are detailed below. For each workload, we examine its data access pattern and the performance of the Google Cloud Storage Connector for Hadoop.



Recommendation engine

In Figure 3, we can see that using the connector resulted in an average execution time savings of 24.4 percent when compared to “Local Disk HDFS (w/ copy times).” Of the savings, 86.0 percent (or 21.0 percent overall) came from removing the need for input and output data copies, and 14.0 percent (or 3.4 percent overall) came from using the connector to retrieve input and save output data. The performance impact of the connector is not as strong as in the other workloads because this workload uses a smaller dataset.



The opportunity for speed-up using the connector depends on the amount of reads and writes to Google Cloud Storage. Because the workload reads input data only in the first job and writes output data only in the last of the ten cascaded jobs, there is limited opportunity to improve the execution time using the connector. Also, the relatively small dataset (5 GB) for the recommendation engine can be processed more quickly on the Google Compute Engine instances, and it results in less data that needs to be moved between Google Cloud Storage and the cluster.




Figure 3. Recommendation engine execution times

Sessionization

The sessionization workload rearranges a large dataset (24 TB uncompressed; ~675 GB compressed) in a single MapReduce job. This CPU- and memory-intensive workload benefited significantly from the use of the Google Cloud Storage Connector for Hadoop. Using the connector resulted in an average execution time savings of 26.2 percent when compared to “Local Disk HDFS (w/ copy times).” Of the savings, 25.6 percent (or 6.7 percent overall) came from removing the need for input and output data copies, and 74.4 percent (or 19.5 percent overall) came from using the connector, as shown in Figure 4. This large speed-up from the connector is thanks to the nature of the workload as a single MapReduce job. Overhead with the NameNode and data locality issues, such as streaming data to other nodes for processing, can be avoided by using the connector to supply all nodes with data equally and evenly. This shows that even with remote storage, data locality concerns can be overcome by using Google Cloud Storage and the provided connector to achieve better results than traditional local disk HDFS.




Figure 4. Sessionization execution times

Document clustering

Similar speed-ups from the connector were observed with the document-clustering workload. Using the connector resulted in an average execution time savings of 20.6 percent when compared to “Local Disk HDFS (w/ copy times).” Of the savings, 26.8 percent (or 5.5 percent overall) came from removing the need for input and output data copies, and 73.2 percent (or 15.0 percent overall) came from using the connector, as shown in Figure 5. Owing to the large amount of data processed by the first MapReduce job of the document-clustering workload (~31,000 files totaling 3 TB), the connector is able to transfer this data to the nodes much faster, resulting in the speed-up when compared to local disk HDFS.




Figure 5. Document clustering execution times

Conclusion

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. Workloads like sessionization and document clustering read input data from 14,800 and about 31,000 files, respectively, and see the largest improvements because the files are accessible from every node in the cluster. Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.



In comparison, the recommendation engine workload has only one input file of 5 GB. With remote storage and the connector, we still see a performance increase in reading this large file, because it is not split into several small 64 MB or 128 MB chunks that must be streamed from multiple nodes in the cluster to the nodes processing the chunks of the file. Although this performance increase is not as large as for the other workloads (14.0 percent compared with 73.2 to 74.4 percent), we can still see the value of using remote storage to provide faster access and greater availability of data when compared with the HDFS data locality model. This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.



Additionally, our recent study reinforced our original findings. First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.



Extending our original findings, we were able to observe the performance impact of data locality and remote storage within the cloud. Counterintuitive as it may seem, our experiments show that using remote storage to make data highly available outperforms local disk HDFS relying on data locality. The performance benefits we saw with the Google Cloud Storage Connector for Hadoop can be realized with other MapReduce applications and help to accelerate the adoption of Hadoop deployments on Google Compute Engine.



-Contributed by Mike Wendt, R&D Associate Manager, Accenture Technology Labs

Google Compute Engine VMs provide a fast and reliable way to run Apache Hadoop. Today, we’re making it easier to run Hadoop on Google Cloud Platform with the Preview release of the Google Cloud Storage connector for Hadoop that lets you focus on your data processing logic instead of on managing a cluster and file system.




Diagram of Hadoop on Google Cloud Platform. HDFS and the NameNode are optional when storing data in Google Cloud Storage

In the 10 years since we first introduced Google File System (GFS) — the basis for Hadoop Distributed File System (HDFS) — Google has continued to improve our storage system for large data processing. The latest iteration is Colossus.



Today’s launch brings that expertise to Hadoop users. Using a simple connector library, Hadoop can now run directly against Google Cloud Storage — an object store built on Colossus. That means you benefit from Google’s expertise in large data processing.



Here are a few other benefits of running Hadoop with Google Cloud Storage:


  • Compatibility: The Google Cloud Storage connector for Hadoop is code-compatible with Hadoop. Just change the URL to point to your data; a sketch follows this list.

  • Quick startup: Your data is ready to process. You don’t have to wait minutes or more while your data is copied over to HDFS and the NameNode comes out of safe mode, and you don’t have to pay for the VM time spent copying data, either.

  • Greater availability and scalability: Google Cloud Storage is globally replicated and has higher availability than HDFS because it’s independent of the compute nodes and the NameNode. If the VMs are turned down (or, cloud forbid, crash) your data lives on.

  • Lower costs: Save on storage and compute: storage, because there’s no need to maintain two copies of your data, one for backups and one for running Hadoop; compute, because you don’t need to keep VMs going just to serve data. And with per-minute billing, you can run Hadoop jobs faster on more cores and know your costs aren’t getting rounded up to a whole hour.

  • No storage management overhead: Whereas HDFS requires routine maintenance -- like file system checks, rebalancing, upgrades, rollbacks and NameNode restarts -- Google Cloud Storage just works. Your data is safe and consistent with no extra effort.

  • Interoperability: By keeping your data in Google Cloud Storage, you can benefit from all of the other Google services that already play nicely together.

  • Performance: Google’s infrastructure delivers high performance from Google Cloud Storage that’s comparable to HDFS -- without the overhead and maintenance.
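
To make the “just change the URL” point concrete, here is a small sketch that registers the connector behind the gs:// scheme and lists objects in a bucket through the ordinary Hadoop FileSystem API. The bucket and project names are hypothetical, the property keys and implementation class reflect the connector's commonly documented configuration (often set cluster-wide in core-site.xml rather than in code), and authentication settings such as a service account are omitted, so verify the details against the connector documentation for your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists Cloud Storage objects through Hadoop's FileSystem abstraction.
public class GcsListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Register the connector as the implementation behind gs:// URLs.
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
    conf.set("fs.gs.project.id", "my-project-id");  // hypothetical project

    // Existing Hadoop code only needs a new URL once the scheme resolves.
    FileSystem fs = new Path("gs://my-bucket/").getFileSystem(conf);
    for (FileStatus status : fs.listStatus(new Path("gs://my-bucket/input/"))) {
      System.out.println(status.getPath());
    }
  }
}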


To see the benefits for yourself, give Hadoop on Google Cloud Platform a try by following the simple tutorial.



We would love to hear your feedback and ideas on how to make Hadoop and MapReduce run even better on Google Cloud Platform.



-Posted by Jonathan Bingham, Product Manager

Almost one year ago, Google Cloud Platform launched our GitHub organization, with repositories ranging from tutorials to samples to utilities. This is where developers could find all resources relating to the platform, and get started developing quickly. We started with 36 repositories, with lofty plans to add more over time in response to requests from you, our developers. Many product releases, feature launches, and one logo redesign later, we are now up to 123 repositories illustrating how to use all parts of our platform!



Despite some clever naming schemes, it was becoming difficult to find exactly the code that you wanted amongst all of our repositories. Idly browsing through over 100 options wasn’t productive. The repository names gave you an idea of what stacks they used, but not what problems they solved.



Today, we are making it easier to browse our repositories and search for sample code with our landing page at googlecloudplatform.github.io. Whether you want to find all Compute Engine resources, locate all samples that are available in your particular stack, or find examples that fit your particular area of interest, you can find it with the new GitHub page. We’ll be rotating the repositories in the featured section, so make sure to wander that way from time to time.



We are very committed to open source at Google Cloud Platform. Please let us know what kind of samples and tools you’d like to see from the team. We’re looking forward to many more commits ahead!



-Posted by Julia Ferraioli, Developer Advocate

Serving 1 million requests per second using Compute Engine sure is awesome. But what about scaling in the other direction and serving more than one website from the same instance? A handy new feature called Protocol Forwarding makes this possible on Compute Engine. It enables you to attach multiple Internet IP addresses to an instance. This way you can configure your web server to host content that belongs to different sites, even with SSL.



Protocol Forwarding is a new addition to Compute Engine's growing collection of forwarding capabilities (including Layer-3 Load Balancing). In addition to attaching multiple Internet IP addresses to a single instance without network address translation (NAT), it can also forward selected IP protocols to your virtual machine instances. The supported protocols are TCP, UDP, SCTP, and IPSec (ESP/AH). A sharp-eyed reader will notice that, when used together with advanced routing, Protocol Forwarding simplifies the virtual private network (VPN) setup that connects your outside servers to your Compute Engine network.



Protocol Forwarding and Load Balancing share many of the same underpinnings, including the same top-level API resource called Forwarding Rules. The best way to learn how to use it is to follow the quickstart guide. Upon completion, you should have a web server on a single instance hosting three different websites with three different Internet IP addresses.
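
For readers who prefer to script the setup rather than follow the quickstart by hand, the sketch below shows roughly how a target instance and a forwarding rule could be created with the generated Java API client for Compute Engine. The project, zone, region, and resource names are hypothetical, the quickstart uses its own tooling, and the client-library method and field names here are our best understanding of the API rather than an excerpt from the guide, so check them against the Compute Engine API reference before relying on them.

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.compute.Compute;
import com.google.api.services.compute.ComputeScopes;
import com.google.api.services.compute.model.ForwardingRule;
import com.google.api.services.compute.model.TargetInstance;

import java.util.Collections;

public class ProtocolForwardingSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical identifiers; replace with your own project, zone, and region.
    String project = "my-project";
    String zone = "us-central1-a";
    String region = "us-central1";

    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(Collections.singleton(ComputeScopes.COMPUTE));
    Compute compute = new Compute.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("protocol-forwarding-sketch")
        .build();

    // A target instance wraps an existing VM so forwarding rules can point at it.
    TargetInstance target = new TargetInstance()
        .setName("www-target")
        .setInstance(String.format(
            "https://www.googleapis.com/compute/v1/projects/%s/zones/%s/instances/%s",
            project, zone, "www-instance"));
    compute.targetInstances().insert(project, zone, target).execute();

    // A forwarding rule gives the instance an additional external IP for TCP traffic.
    ForwardingRule rule = new ForwardingRule()
        .setName("www-site-a")
        .setIPProtocol("TCP")
        .setTarget(String.format(
            "https://www.googleapis.com/compute/v1/projects/%s/zones/%s/targetInstances/%s",
            project, zone, "www-target"));
    compute.forwardingRules().insert(project, region, rule).execute();
  }
}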



Pricing for Protocol Forwarding is the same as for Layer-3 Load Balancing. In fact, you can enjoy both features at no cost until the end of January 2014. Please read more about pricing and promotion details.



On behalf of the Google Cloud Platform team, I wish everyone a productive and rewarding 2014!



-Posted by Gary Ling, Product Manager