cloud-test: Stress Testing with Energyworx

Google Cloud Platform Blog


Friday, August 28, 2015

Founded in 2012, Energyworx offers big data aggregation and analytics cloud-software services for the energy and utilities industry. Their products and services include grid optimization and reliability, meter-data management, consumer engagement, energy trading and environmental-impact reduction. They are based in the Netherlands. To learn more, visit www.energyworx.org

Getting all cloudy gives you a tremendous amount: agility, scalability, cost savings and more. The scales weigh heavily in favor of embracing cloud goodness. However, on the other side of that scale, getting all cloudy means giving up a degree of control. You don’t control the infrastructure and, in certain cases, you don’t know the implementation behind APIs you rely on. This is especially true of managed services such as databases and message queues, and those APIs and their associated SLAs are central to the operation of your systems. There’s nothing surprising, bad or wrong about this situation; as stated above, the cloud has far more pros than cons. But as engineers whose reputations (and whose need for a night’s sleep uninterrupted by a 3am wake-up call) rely on the stability and scalability of the systems we build, what do we do? We follow the age-old maxim, trust but verify, and we verify by testing!

Testing comes in many forms, but broadly there are two types: functional testing and stress testing. Functional tests check for correctness. When I register for your service, does my email address get encrypted and correctly persisted? Stress tests check for robustness. Does your service handle 100,000 users registering in the fifteen minutes after it’s mentioned in the news? As an aside, I was tempted as I wrote this post to phrase everything in terms of “we all know this…” and “of course we all do that…” when it comes to testing, because we do all know it’s a good thing to do and we all do it to one extent or another. But the number of scalability problems that good engineers run into is proof that the importance of stress testing isn’t a universally held truth, or at least not a universally practiced one. The remainder of this post focuses on a set of best practices we distilled from a stress testing exercise we did on Google Cloud Platform with Energyworx as part of their go-live.

Energyworx and Google Cloud Platform used the existing Energyworx REST APIs together with Grinder to stress test the system. Grinder allows the calls to the REST APIs to be scaled up and down as required, depending on the type and degree of stress to be applied. Test scenarios were based around scaling the number of smart meters uploading data, the amount of work performed by the meters and the physical locations of the meters. For example, we knew a single meter worked correctly, so let’s try several hundred thousand meters working at the same time, or let’s have meters running in Europe access the system in the US, or let’s have thousands of meters do an end-of-day upload at the same time. Following the best practices below, Energyworx ran extended 200-core tests for approximately $10 a time and proved that their system was ready for millions of meters flooding the grid daily with billions of values. We were right, and the Energyworx launch went off without a hitch. Stress testing is a blast…
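To make the idea of scaling one scenario up concrete, here is a minimal sketch in plain Python. It is not the Grinder setup Energyworx used: upload_reading is a hypothetical stand-in that only sleeps, and run_scenario, the worker count and the meter counts are all illustrative assumptions. In a real test the stub would be replaced by a call to the REST API under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload_reading(meter_id):
    """Hypothetical stand-in for one REST upload; swap in a real HTTP call."""
    start = time.perf_counter()
    time.sleep(0.001)  # pretend network and server time
    return meter_id, time.perf_counter() - start


def run_scenario(num_meters, concurrency):
    """Fire one upload per simulated meter, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(upload_reading, range(num_meters)))
    latencies = sorted(latency for _, latency in results)
    return {
        "uploads": len(results),
        "median_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }


if __name__ == "__main__":
    # Same scenario, first with a handful of meters, then with many more.
    for meters in (10, 200):
        print(meters, run_scenario(meters, concurrency=20))
```

The point of the shape is that the scenario stays the same while the number of simulated meters and the concurrency are just parameters, which is what makes it cheap to go from one meter to several hundred thousand.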

The first best practice is to use Google Cloud Platform to provide the resources for stress testing. Simulating hundreds of thousands of smart meters (or users, or game sessions, or other stimuli) takes resources, and Google Cloud Platform allows you to spin these up on demand, in very little time, and pay for them by the minute. That’s a great deal for stress testing.
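As a rough illustration of why per-minute billing suits stress testing, the arithmetic can be sketched as below. The function name and the price passed to it are assumptions made for the example; the price is a placeholder, not a quoted Google Cloud Platform rate, so substitute the current rate for your machine type.

```python
def stress_run_cost(cores, minutes, price_per_core_hour):
    """Cost of a run billed by the minute: core-hours used times the hourly price."""
    core_hours = cores * minutes / 60
    return core_hours * price_per_core_hour


if __name__ == "__main__":
    # PLACEHOLDER price per core-hour, chosen only for illustration.
    assumed_price = 0.05
    for minutes in (15, 60):
        cost = stress_run_cost(200, minutes, assumed_price)
        print(f"200 cores for {minutes} min: ${cost:.2f}")
```

Because you pay only for the minutes the load generators run, a short test costs proportionally less than a long one.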

The second best practice is to use stress testing to probe the behavior of your system and of the infrastructure and services it relies upon. Systems are often complex, with different tiers and services interacting, and it can be tough to predict how they will behave under stress. Be creative with your scenarios and you’ll learn a lot about your system’s behavior.

The third best practice is to test the rate of change of the load you apply as well as the maximum load. It’s great to know your system can handle a load of 100K transactions per second, but it’s still not a useful system if it can only reach that load in increases of 10K each minute over 10 minutes, when a single news article from the right expert can bring you that much traffic in the web equivalent of the blink of an eye.
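The difference between the two load shapes can be sketched as a pair of target-rate functions. The names stepped_ramp, spike and seconds_to_peak are hypothetical, and the five-second rise time for the spike is an assumption picked only to stand in for "the blink of an eye". Both profiles reach the same 100K peak, and what differs is how fast they get there.

```python
def stepped_ramp(t_s, peak=100_000, step=10_000, step_every_s=60):
    """Target rate at second t_s when load rises by `step` every `step_every_s` seconds."""
    steps_taken = t_s // step_every_s + 1
    return min(peak, steps_taken * step)


def spike(t_s, peak=100_000, rise_s=5):
    """Target rate at second t_s when load climbs linearly to `peak` within `rise_s` seconds."""
    return min(peak, peak * t_s // rise_s)


def seconds_to_peak(profile, peak, duration_s):
    """First whole second at which the profile reaches `peak`, or None if it never does."""
    for t in range(duration_s + 1):
        if profile(t) >= peak:
            return t
    return None


if __name__ == "__main__":
    for name, profile in (("stepped ramp", stepped_ramp), ("spike", spike)):
        print(name, "reaches peak after", seconds_to_peak(profile, 100_000, 600), "s")
```

A load generator driven by both profiles exercises the same maximum load, but only the spike tells you whether the system can scale up as quickly as real traffic can arrive.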

The fourth best practice is to test regularly. If you release each Friday and fix bugs on demand, you don’t need to stress test every time you release, but you should stress test the entire system every 2-4 weeks to ensure that performance is not degrading over time.
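One way to turn "performance is not degrading over time" into a check is to compare each run against a stored baseline. The sketch below is an assumption about how that might look, not something from the Energyworx exercise: the 95th percentile, the 10% tolerance and the function names are all illustrative choices.

```python
def percentile(samples, pct):
    """Nearest-rank percentile (integer pct, 1-100) of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division on integers
    return ordered[rank - 1]


def has_degraded(baseline_ms, current_ms, pct=95, tolerance=0.10):
    """True when the current percentile latency exceeds the baseline by more than `tolerance`."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(current_ms, pct) > limit


if __name__ == "__main__":
    baseline = list(range(1, 101))           # made-up latencies in ms from an earlier run
    slower = [ms * 1.2 for ms in baseline]   # the same run, 20% slower across the board
    print("unchanged run degraded:", has_degraded(baseline, baseline))
    print("slower run degraded:", has_degraded(baseline, slower))
```

Keeping the latency samples from each periodic stress test makes it possible to spot a slow drift that no single release would reveal.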

- Posted by Corrie Elston, Solutions Architect