As the summer heat descends on the Northern Hemisphere, we thought we’d release our newest App Engine version with some changes that are sure to keep you playing around in the cool, air-conditioned indoors (hey, you don’t want your computer to overheat, right?).



Production Changes



  • Adjustable Scheduler Parameters - As we previously discussed, we are introducing two scheduler knobs (okay, they actually look like sliders) that let you control some of the parameters that influence how many instances run your application. Starting today, you can set the minimum pending latency and the maximum number of idle instances for your application.


Datastore Changes



  • Advanced Query Planning - We are removing the need for exploding indexes and reducing the custom index requirements for many queries. The SDK will suggest better indexes in several cases and an upcoming article will describe what further optimizations are possible.

  • Namespaced Datastore Stats - Now, in addition to getting overall datastore stats, we are providing a new option to query datastore stats per namespace.


Task Queue Changes



  • New Task Queue details page - We’ve revamped the Task Queue details page in the Administration Console to provide more information about the tasks being run. You can now see the headers included in the enqueued task, the payload, and information from previous task runs.

  • 1MB Pull Task Size - It’s our belief that there is only one way for size limits to go - and that’s up! So with this release we’ve increased the maximum size of a pull task to 1MB.

  • Pull queue lease modification - We’ve introduced a new method for Pull Queues that allows you to extend the lease on existing tasks if the initial lease on the task was insufficient.
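
To make the lease-extension idea concrete, here is a minimal, self-contained sketch of the behavior. It is a toy in-memory model, not the App Engine Task Queue API: the class `PullQueueSketch` and its methods `add`, `lease`, `extend_lease`, and `is_leased` are hypothetical names chosen for illustration, and the sketch assumes that extending a lease resets it to a new duration measured from the current time. Check the Task Queue documentation for the actual method name and semantics.

```python
import time


class PullQueueSketch:
    """Toy in-memory pull queue (hypothetical; NOT the App Engine API)."""

    def __init__(self, clock=time.time):
        self._clock = clock        # injectable clock so behavior can be tested
        self._tasks = {}           # task name -> payload
        self._lease_expiry = {}    # task name -> time at which the lease ends

    def _expiry(self, name):
        # Tasks that were never leased count as "expired long ago".
        return self._lease_expiry.get(name, float("-inf"))

    def add(self, name, payload):
        self._tasks[name] = payload

    def lease(self, name, lease_seconds):
        """Lease a task, unless another worker still holds an active lease."""
        now = self._clock()
        if self._expiry(name) > now:
            raise RuntimeError("task %r is already leased" % name)
        self._lease_expiry[name] = now + lease_seconds
        return self._tasks[name]

    def extend_lease(self, name, lease_seconds):
        """Reset an active lease to `lease_seconds` from now (assumed semantics)."""
        now = self._clock()
        if self._expiry(name) <= now:
            raise RuntimeError("lease on %r has expired" % name)
        self._lease_expiry[name] = now + lease_seconds

    def is_leased(self, name):
        return self._expiry(name) > self._clock()
```

A worker that leases a task for 60 seconds and realizes at the 50-second mark that it needs more time could call `extend_lease(name, 60)`. In this model the task then stays invisible to other workers until 110 seconds after the original lease began.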


Lastly, we have some exciting news related to the experimental Go runtime. While it remains experimental, starting with 1.5.2, all HRD apps will have access to the Go runtime in production.


As always, there are also some small features and bug fixes, the full list of which can be found in our release notes (Python, Java). We look forward to your feedback and questions in our forum.
































Summary




Last week, we posted about a limited outage on July 14, 2011. Now that our internal postmortem is complete, we thought you would also like to get more detail about what went wrong and what we are going to do to ensure this doesn't happen again.






Root Cause and Analysis




The main lesson we learned is that our live traffic testing needs to improve: a relatively minor bug triggered a corner case for some of our customers. The bug was in a new release of the infrastructure in the App Engine Java execution environment. During development, testing, and qualification, this bug was essentially hidden from view because it only manifested itself under specific load patterns. During the outage, requests to affected applications failed with errors when traffic was routed to affected instances. Application logs would have shown that affected instances experienced high latency or high error rates, or were not reachable from the Internet. Letting the live traffic testing run longer could have caught this.




In order for live traffic testing to work properly, we need to improve our monitoring as well. In this case, having more points from which to do black box monitoring would have helped immensely. We are currently working on much broader monitoring for App Engine and will be integrating more extensive black box testing in upcoming quarters.




Once again, we’d like to point out that we could have done a much better job of communicating issues to all of you. While we strive to strike a balance between letting you know about major issues and not bothering you about day-to-day operations, we clearly should have communicated this incident to you sooner. Rest assured you’ll be better informed of issues in the future.







Timeline




July 14, 2011 - 11:30 AM US/Pacific - The new Java execution environment is released to production.




July 14, 2011 - 5:00-6:00 PM US/Pacific - The previously scheduled Master/Slave read-only maintenance period takes place.




July 14, 2011 - 8:00-9:30 PM US/Pacific - Monitoring shows error rates and latency for Java applications using the Master/Slave datastore are slowly increasing across the entire system. Investigation reveals that the new Java execution environment is malfunctioning.




July 14, 2011 - 9:30 PM US/Pacific - Rollback of the Java execution environment to the previous version begins. Latency and error rates begin to fall.




July 14, 2011 - 11:30 PM US/Pacific - Rollback of the Java execution environment to the previous version completes. Java Master/Slave applications are functioning normally.






Remediation





  • Faster notification on our status site and downtime-notify mailing list

  • More live traffic stress tests for new releases

  • Better black box monitoring to detect small impacts more quickly





[Edit] Clarification: no HR datastore apps were affected. Overall, the outage resulted in a 1.9% error rate, affecting approximately 0.005% of all App Engine traffic at peak.
















On July 14, 2011, beginning at 7 PM US/Pacific time (PDT/GMT-7), a subset of Java App Engine applications was affected by a service outage that gradually increased in magnitude. At 9:30 PM US/Pacific, repair work commenced and began to reduce the effect of the outage; by 11:30 PM US/Pacific, the repair work was complete, restoring normal service to all Java App Engine applications.




During this period, affected applications would have experienced high latency and error rates. This outage occurred shortly after a scheduled maintenance period; however, the outage was not related to the maintenance work.




Overall reliability, quick return to service, and fast, accurate communication to our customers are some of the core goals of Google App Engine's service offering. While we restored service relatively quickly, it's clear to us that we fell short in prompt communication of status updates. We apologize for this, and we'll look at our procedures to improve our performance in this area.




In the meantime, we have a preliminary understanding of the outage, and we are continuing our investigation to ensure that we have fully repaired the root cause. We will publish a detailed postmortem once we have concluded our research. Thanks again for your patience and understanding.



[Edit] Clarification: no HR datastore apps were affected. Overall, the outage resulted in a 1.9% error rate, affecting approximately 0.005% of all App Engine traffic at peak.






Posted by Wesley Chun, Google App Engine team

You may have heard the news that 2011 is Google’s biggest hiring year yet. And the App Engine team is looking for a few great Software Engineers to join us in San Francisco to code, collaborate on innovative ideas for platform computing, and get burritos with us every Friday.







The App Engine team in San Francisco.


We think we have some of the best and most enthusiastic developers out there, so we thought we’d ask you to come work with us. The App Engine team needs talented developers to help build the platform by developing the features on our road map and issue tracker, as well as that amazing feature idea you have that will revolutionize App Engine for our users.


If you are a backend engineer who is experienced with C++ or Java, lives and breathes distributed systems, and is obsessed with scalability, please visit our jobs website to submit your resume along with a cover letter describing your experience with App Engine. We’ll be in touch if we think your experience matches our open positions.


Thanks for continuing to help make App Engine a great platform!