Leverage Google Cloud Platform features for your Hadoop deployments
Thursday, February 27, 2014
Accenture recently demonstrated that Google Compute Engine is an excellent place to run Hadoop, both for performance and for price. They covered familiar HDFS-based deployments as well as the performance and cost advantages of using the Google Cloud Storage connector.
If you use Apache Hive with Hadoop, we want to show you another great feature of the Google Cloud Platform that will save you time and money. The default Hive installation uses a local single-session database as the Hive Metastore to store Hive table and column names, column types, and so on. This local database is fine for single-user/single-session usage, but to use Hive with multiple concurrent connections, you need to install and maintain a relational database such as MySQL.
Google Cloud SQL is managed MySQL in the cloud, which now supports the MySQL wire protocol, so you can connect to Google Cloud SQL using common tools from anywhere. Why not use it to run your MySQL database in the cloud too, and save yourself the time and cost of having to install and maintain MySQL? Let Google take care of it for you.
We provide a sample application to help you get started running Apache Hive and Pig on Google Compute Engine. The README includes detailed instructions for using Cloud SQL for the Hive Metastore.
The rest of this blog post hits the high points for what you need to do to use Google Cloud SQL for the Hive Metastore.
Create a Google Cloud SQL instance
In the Google Cloud Console, create a Google Cloud SQL instance. Make sure to:
- Choose the Compute Engine zone of the Compute Engine instance running Hive as the preferred location for the Cloud SQL instance.
- Select "Assign an IP Address" to the Cloud SQL instance.
- Add the external IP address of the Compute Engine instance as an Authorized IP Address. Be sure to append "/32" to the IP address entered in this field.
- Define a root password for MySQL (it's good practice).
Install MySQL client and MySQL JDBC Driver
Connect to the Google Compute Engine instance that is running Hive and install both the MySQL client and the MySQL JDBC driver. Add the JDBC driver JAR file to Hive's CLASSPATH. The simplest method is to copy the file into the hive/lib/ directory.
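As a rough sketch (assuming a Debian-based Compute Engine image and that you have already downloaded the MySQL Connector/J archive), the installation comes down to a couple of commands; the JAR file name depends on the Connector/J version you download:

    # Install the MySQL client (assumes a Debian-based image)
    sudo apt-get update
    sudo apt-get install -y mysql-client

    # Put the Connector/J driver JAR on Hive's CLASSPATH by copying it
    # into the hive/lib/ directory
    cp mysql-connector-java-*-bin.jar hive/lib/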
Create the hivemeta database and the user to access it
Use the mysql command-line tool to do the following (a sample session is sketched after this list):
- Connect to the Google Cloud SQL instance.
- Create a database called hivemeta (Hive requires the database to use latin1 character encoding, which you'll need to specify explicitly).
- Create the database user (for example hdpuser).
- Issue grants on the hivemeta database to the database user.
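Here is a sketch of that session; cloud-sql-ip-addr stands for the Cloud SQL instance's IP address, and the hdpuser password shown is a placeholder you should replace with your own:

    # From the Compute Engine instance, connect to Cloud SQL as root
    # (enter the MySQL root password when prompted)
    mysql --host=cloud-sql-ip-addr --user=root --password

    -- Then, at the mysql> prompt:
    CREATE DATABASE hivemeta CHARACTER SET latin1;
    CREATE USER 'hdpuser'@'%' IDENTIFIED BY 'password';
    GRANT ALL PRIVILEGES ON hivemeta.* TO 'hdpuser'@'%';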
Configure Hive to use Cloud SQL
Update your hive/conf/hive-site.xml file to connect to the Google Cloud SQL database. Add the following configuration, replacing cloud-sql-ip-addr with the assigned IP address of the Cloud SQL instance, and replacing password with the hdpuser database user password:
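The configuration comes down to Hive's standard JDO connection properties; a minimal sketch, using the placeholder values described above, looks like this:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://cloud-sql-ip-addr:3306/hivemeta</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hdpuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>password</value>
    </property>

Restart Hive (or the Hive metastore service, if you run one) so the new settings take effect.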
Hive is now configured to use the Google Cloud SQL database as its metastore; there's no need for you to maintain a MySQL installation.
Other software that uses MySQL can similarly use Google Cloud SQL. Since Cloud SQL is MySQL, you won't need to migrate code: just establish connectivity and you are off and running.
-Posted by Matt Bookman, Cloud Solutions Architect