Google Cloud Platform Blog

Research Project: AppScale at University of California, Santa Barbara

Friday, October 22, 2010

The following post is a guest post by Chris Bunch, a Computer Science Ph.D. student at the University of California, Santa Barbara. He is one of the student leads on the AppScale project, an open source Google App Engine compatible hosting solution led by Professor Chandra Krintz. Chris has developed and maintained AppScale as a research project over the last two years with fellow student lead Navraj Chohan and others.

Over here at the UCSB Racelab, we've complained endlessly about finding a web framework we actually could use. For a long time we thought we just wouldn't be able to find it - many were so-so, or good only after a substantial learning curve. So imagine our surprise back in April 2008 when we heard about what we thought would be just-another-web-framework provided by Google in the Python version of App Engine. But after giving it a try, we were smitten. We finally found a web framework that (1) we could actually use on non-trivial projects and (2) we could teach in nine-week classes without having students lose half their time to the idiosyncrasies of the programming language involved or the web framework itself. Furthermore, the minimalistic APIs make it simple to get work done: it did for us exactly what we needed and nothing else.

Yet as researchers and hackers-at-heart there was one thing that we really wanted to do with App Engine that we couldn't do: run it on a whole bunch of our machines and tinker with it. A similarly-minded hacker named Chris Anderson had released AppDrop, a modified version of the App Engine SDK that hooked up to PostgreSQL and ran in Amazon EC2, but only on a single machine. So after much discussion, we came up with the following short list of things we wanted to do with App Engine:

We wanted to run it on our own virtual machines or those running in Eucalyptus or Amazon EC2 in order to investigate how we can optimally harness cloud infrastructures in our cloud platform.

Tons of new datastores have emerged as part of the "NoSQL" movement, and we needed a mechanism to evaluate their performance, alongside traditional databases such as MySQL, under controlled experiments. We also needed a platform that makes it possible to add new data storage mechanisms, so that when developers tout the features of their new datastore, we can download it and evaluate it under the same conditions as other datastores.

One of the reasons we love Google App Engine is the simple set of APIs provided, but we also wanted to use that as a starting point where we could add new APIs and control the environment in which they run.

We love that Google App Engine "just works". You don't know where it's running and how it's running, but you can see that it is running, and we wanted whatever we developed to do the same. We wanted to develop something that automatically deployed your App Engine app and configured everything for you. Expert users should be able to have more control over the system, but the system should be able to handle your app from the moment you deploy it to the moment you tear it down.

It had to be open-source - just like how we wanted something to tinker with and run experiments on, we wanted it to be something that you could tinker with too. We wanted you to be able to add in support for a database you're interested in and see how it performs, and we wanted you to be able to add in APIs that you think would be interesting to have an easy-to-use web framework interact with.
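
As a rough illustration of the "add in support for a database and see how it performs" goal above, here is a minimal sketch of a pluggable datastore interface with a shared workload. Every name in it (Datastore, InMemoryStore, run_workload) is hypothetical and chosen for this example; it is not AppScale's actual interface, and the in-memory backend only stands in for a real database client.

```python
import time
from abc import ABC, abstractmethod


class Datastore(ABC):
    """Hypothetical minimal interface that every backend would implement."""

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def delete(self, key): ...


class InMemoryStore(Datastore):
    """Stand-in backend for illustration; a real one would wrap a database client."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        self._rows.pop(key, None)


def run_workload(store, n=1000):
    """Run the same put/get/delete workload against any backend and time it."""
    began = time.perf_counter()
    for i in range(n):
        store.put(f"key{i}", i)
    # Count how many values read back match what was written.
    hits = sum(1 for i in range(n) if store.get(f"key{i}") == i)
    for i in range(n):
        store.delete(f"key{i}")
    return {"hits": hits, "seconds": time.perf_counter() - began}
```

Because the workload only talks to the interface, comparing two backends under this sketch would mean passing each one to run_workload and reading the timings side by side.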

So with that in mind, we created AppScale, an open-source cloud platform for Google App Engine applications. Here's how we did it:

We took the standard three-tier web deployment approach and clearly segmented each tier into a specific component in the system: an AppLoadBalancer routes users to their applications, an AppServer runs the user's App Engine app, and an AppDB handles database interactions. Each has a clearly defined role in the system and is controlled by an AppController, a daemon that runs on each machine, monitors each component, and controls the specific order in which services are started. It writes all the configuration files for each service and coordinates services with the other AppControllers in the deployment. For those interested, we detail the specifics of the original AppScale implementation in this paper.
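
To make the idea of a controller that starts services in a specific order more concrete, here is a minimal sketch using Python's standard graphlib module. The component names come from the description above, but the dependency map (database first, then app server, then load balancer) and the helper names start_order and start_all are assumptions made for this example, not the AppController's real logic.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be running first.
# The component names come from the post; the ordering itself is an assumption.
DEPENDS_ON = {
    "AppDB": set(),
    "AppServer": {"AppDB"},
    "AppLoadBalancer": {"AppServer"},
}


def start_order(depends_on):
    """Return component names in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


def start_all(depends_on, start):
    """Call start(name) for each component, dependencies first."""
    started = []
    for name in start_order(depends_on):
        start(name)
        started.append(name)
    return started


if __name__ == "__main__":
    start_all(DEPENDS_ON, lambda name: print("starting", name))
```

Expressing the order as a dependency map rather than a hard-coded list means a new service can be added by declaring what it needs, which is one plausible way to keep start-up predictable as components change.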
We also wanted to embody the principle of "standing on the shoulders of giants", and as such, we employ open-source software wherever appropriate. Our AppLoadBalancer employs the nginx web server as well as the haproxy load balancer to ensure high performance. Our Memcache API implementation uses memcached under the hood, while our MapReduce API uses Apache Hadoop, which we added to give App Engine users running over AppScale the ability to run Hadoop MapReduce jobs from within their web applications.

Because we were able to keep the database support abstracted away from the other components in the system, we were able to add support for nine different data storage solutions within AppScale: HBase, Hypertable, MySQL, Cassandra, Voldemort, MongoDB, MemcacheDB, Scalaris, and SimpleDB. Many of these databases have seen interest in recent years, but they vary greatly and have been hard to measure under comparable conditions. To give a few examples, they vary in the query languages they provide, their topologies (e.g., master / slave, peer-to-peer), data consistency policies, and end-user library interfaces. This has made it non-trivial for the community to objectively determine scenarios in which one database performs better or worse than another and investigate why, but under AppScale, deploying all these databases is done automatically with no interaction from the user. And because AppScale is open-source, if a developer doesn't like the particular interface we use for a database, they can improve on it and give back to the community. We've used AppScale internally to evaluate the performance of Google App Engine applications on these datastores, and we have developed an App Engine app, Active Cloud DB, that exposes a RESTful API that developers can use to access these datastores from any programming language or web framework.

Finally, the most important lesson we learned was the value of incremental development. Our core development team fluctuates between two and three developers, so from the first meeting we had, we knew that our very first release couldn't support every App Engine API, nor could it run nine databases seamlessly. Therefore, we started off with support for the two BigTable clones, HBase and Hypertable, as well as support for just the Datastore API, the URL Fetch API, and the Users API within App Engine. From there, we learned what datastores people actually wanted to see support for as well as what APIs people wanted to use. We were also able to add APIs that let App Engine apps deployed to AppScale run virtual machines under the EC2 API and run computation under the MapReduce API.

But developing AppScale was certainly not a cakewalk for us. Over the course of the last two years, five major issues (some technical and some not) have arisen within the project:

Writing software that works without knowing ahead of time how many machines will be in the system proved initially to be difficult to grasp, but in many cases we were able to reduce the number of variations that could occur and use that to provide some predictability with respect to how we configure and deploy databases and applications.

We couldn't assume that the AppScale administrator has access to DNS; without it, a number of APIs and features are extremely difficult to implement. Load balancing is much more difficult, and many APIs that are tied to host names must be tied to one machine in the system, else they don't work properly. VLAN tagging shows some promise to alleviate these problems, but right now is far from being deployed inexpensively and easily.

The source code for the Java version of App Engine isn't publicly available, so we had to spend a lot of time decompiling the SDK, modifying it to use our database and our API implementations instead of the SDK implementations, and recompiling it. All of these were non-trivial and greatly added to the time it took for us to deploy a version of AppScale with Java App Engine support.

Not all users want a pre-built virtual machine image, so ensuring that building the AppScale environment was done right every time was a top priority. We had to limit ourselves to Ubuntu Jaunty for many releases, and only recently were we able to expand to include Karmic and Lucid, which still make up a microcosm of the distributions available in the Linux world. Adding the ability to install AppScale via apt-get in these specific Linux distributions has also been a crucial step in making sure that users could easily and quickly install AppScale for use.

Both undergraduates and graduate students here at UCSB have done projects involving AppScale, which means that the number and experience levels of developers working on AppScale are completely unpredictable at any given moment in time. Oftentimes the projects they work on are only tangentially related to features that users want, and the time scales on which they are available to work are vastly different from what most software engineers are used to.

All of these problems are greatly exacerbated by only having a two-to-three person core developer team, but this also makes the AppScale project particularly interesting to work on. Despite having worked on AppScale for two years, there are still tons of interesting problems to work on and we still love the Python App Engine web framework as much as we did when we first picked it up. And of course, AppScale is open-source, under the New BSD License, so feel free to download it and tinker around like we have! Check out AppScale at:
http://appscale.cs.ucsb.edu

http://code.google.com/p/appscale

New App Engine SDK 1.3.8 includes New Admin Tools and Performance Improvements

Thursday, October 14, 2010

Today, we’re releasing version 1.3.8 of the App Engine SDK. Whether you’re a Java or a Python developer, this release includes several exciting new features for improving monitoring, performance, and maintenance tasks.

Instances Console

This release includes a new page in the Admin Console, called the Instances page. This page allows you to view information about all server instances currently in use by your application. This information can be useful in debugging your application and also understanding its performance characteristics. There’s no configuration needed for this feature. Just click the “Instances” link on the left hand navigation of the Admin Console to see Average QPS, latency, and memory for an instance.

Screenshot of the instances page of the Admin Console

Task Queue Improvements

This release also has a couple of new Task Queue features: First, the maximum bucket size that you can specify during queue configuration is now 100, up from 50. Second, we’ve added a new "Run Now" button to the Task Queues section of the Admin Console that enables developers to run a task immediately. This can be very helpful for debugging your tasks in production.

Builtins Directives

This release contains a new feature for Python apps: builtin handlers that allow you to quickly and easily enable standard functionality in your application without adding additional code to your codebase. The libraries available today are remote_api, appstats, and the datastore_admin feature (see below). For example, to use the remote_api with your application, simply add the following to your app.yaml file:

builtins:
- remote_api: on

If you are already using the remote api endpoint in your app, you can choose to remove the entry in the handlers section of your app.yaml and use the above directive instead to simplify your app.yaml file.

Support for builtin handlers is not yet available for Java applications, but will be available in an upcoming release.

Delete all (or a part) of your application’s data

Note: this feature is currently only available by default for Python; see the note below for ways to use it with Java applications.

Today, we are releasing an experimental addition to the admin console which provides a simple UI for deleting all entities, or all entities of a given kind, in your datastore. To enable this functionality, simply enable the following builtin in your app.yaml file:

builtins:
- datastore_admin: on

Adding these lines to app.yaml enables the “Datastore Admin” page in your app’s Admin Console, where you can see all of the entity types you are able to delete:

Screenshot of the datastore delete builtin UI

Be aware that these deletes are issued by your application (you can read about how the handler works by looking at this code file in the SDK). For this reason, your application will use resources, most significantly CPU, for the deletions you issue, and that usage will count towards your application’s daily resource budget.

Datastore delete is currently available only with the Python runtime. Java applications, however, can still take advantage of this feature by creating a non-default Python application version that enables Datastore Admin in the app.yaml. Native support for Java will be included in an upcoming release.

Python Pre-compilation on by Default

Finally, the Python pre-compilation feature we announced in 1.3.5 is now turned on by default for all new Python application uploads using the 1.3.8 SDK. If you wish to disable this feature, just specify the flag --no-precompilation on the appcfg.py command line when uploading your app.

This release also contains a few more small features and bug fixes. You can read about the full release in our release notes in Python and Java. As always, your feedback in the forums is appreciated (and had a significant influence on this release!).

Posted by the App Engine Team