Understanding Cloud Pricing Part 5 - NoSQL Databases
Monday, August 31, 2015
We’ve had a lot of great responses and feedback (keep ‘em coming!) about our cloud pricing posts (Local SSDs, Virtual Machines, Data Warehouses) and today we’re back to talk about running NoSQL databases in the cloud. Specifically, we want to give you the information you need to understand how to estimate the cost of running NoSQL workloads on Google Cloud Platform.
NoSQL Databases
The NoSQL database market has grown rapidly over the last few years, and NoSQL databases have been instrumental in solving many distributed data and scaling challenges, opening the door to new and innovative applications and solutions. “NoSQL” is an umbrella term that encompasses any data store that fits the notion of “not only SQL”. Many products offer a high degree of tunability around the standard relational database concepts of atomicity, consistency, isolation, and durability (see ACID for more information) and the distributed systems concepts of consistency, availability, and partition tolerance (see CAP theorem for more information). Every NoSQL database also offers something different in how data is modeled and stored, including, but not limited to, JSON document, key-value, wide-column, and blob storage.
As expected, there are several self-managed options available, such as MongoDB, Apache Cassandra, Riak, Apache CouchDB, Couchbase, and many more. Today we’re going to focus on how to estimate pricing when running MongoDB. MongoDB is a document-based, highly scalable NoSQL database that provides dynamic JSON schemas along with a powerful query language. There are a variety of use cases for MongoDB, such as a 360-degree view of the customer, real-time analytics, Internet of Things applications, and content management (to name a few).
However, when looking at the pricing data for MongoDB, we noticed something interesting. We had planned a separate blog post to talk about pricing Cassandra on Google Cloud Platform as well. But the hardware (virtual or real) requirements are very similar and neither requires a license to be purchased, so the costs are very similar. It didn’t make sense to have another post stating more or less the same thing with only the name of the database replaced, so we are going to include Cassandra here as well.
Cassandra, unlike MongoDB, is a wide-column store (often described as a partitioned key-value store). Cassandra was written at Facebook, with much of the data model inspired by Google's Bigtable white paper and the availability design inspired by Amazon's Dynamo white paper. Cassandra was designed for high availability, performance, and tunable consistency. Cassandra has no leader or master node; instead, all the nodes in a cluster form a ring, where data is replicated a configurable number of times. Availability comes from having a masterless cluster storing your data; tunable consistency comes from choosing how many replicas must respond before a read or write is considered complete. Cassandra and MongoDB are two of the NoSQL databases we most often see our customers using.
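To make "tunable consistency" a little more concrete, here is a minimal Python sketch of the replica arithmetic commonly used with Dynamo-style stores. The function names are hypothetical helpers for illustration and are not part of any Cassandra driver or API. The sketch assumes the usual rule of thumb that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor, write_replicas, read_replicas):
    """True when every read set must share at least one replica with every
    write set, i.e. read_replicas + write_replicas > replication_factor."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3  # illustrative replication factor, not a recommendation
    q = quorum(rf)
    # Quorum reads combined with quorum writes overlap on at least one replica.
    print("quorum size:", q)
    print("quorum/quorum overlaps:", reads_overlap_writes(rf, q, q))
    # Reading and writing a single replica is cheaper but may return stale data.
    print("one/one overlaps:", reads_overlap_writes(rf, 1, 1))
```

The trade-off is the one described above: asking more replicas to respond costs latency and availability, while asking fewer risks reading stale data.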
Starting Point
So how do you estimate pricing given multiple use cases and different possible query and traffic patterns? To get started with MongoDB, we’re going to narrow the scope a bit and estimate the costs of the resources used in existing benchmarks. Several benchmarks have been published about MongoDB performance, and we’ll focus on two of them: one published by MongoDB and another from United Software Associates. Both benchmarks reach roughly the same throughput and latency conclusions, so this is a reasonable model to build upon.
While the benchmarks from United Software Associates used a single MongoDB node for testing, the benchmarks published by MongoDB used a 3-node replica set. Replica sets are a redundant, highly available deployment of MongoDB, and they are strongly recommended as the minimum for all production workloads. The smallest possible replica set consists of three nodes, each configured with matching specifications, so we’ll include that configuration in our pricing breakdown below. The on-prem reference hardware specs used in the benchmarks were as follows (MongoDB, like most databases, tends to favor more RAM and storage IOPS where possible):
Benchmark | MongoDB | United Software Associates |
CPU | Dual 10-core Xeon 3.0 GHz | Dual 6-core Xeon 3.06 GHz |
RAM | 128 GB | 96 GB |
Storage | 2 x 960 GB SSD | 2 x 960 GB SSD |
Monthly Price (single node) | Unavailable** | |
Monthly Price (3-node replica set) | Unavailable** | |
Now, if we map that back to Google Compute Engine instances and storage offerings, we get the following two closely matching configurations, along with pricing:
Instance Type | n1-highmem-16 | n1-standard-32 |
CPU | 16 Xeon vCPU | 32 Xeon vCPU |
RAM | 104 GB | 120 GB |
Storage | 4 x 375 GB Local SSD | 4 x 375 GB Local SSD |
Monthly Price (single node) | $843.60 | $1,146.10 |
Monthly Price (3-node replica set) | | |
Monthly Price Difference | 44% | 24% |
Annual Savings vs. On-Premise | $24,530.88 | $13,640.40 |
The cost breakdown above shows the pricing for a single node and for a 3-node replica set, which is a typical production deployment of MongoDB as stated above. We selected Local SSD for the storage layer in order to support the IOPS required for the throughput metrics achieved in the benchmark reports. As shown in this disk type comparison, Local SSD can support up to 280,000 write IOPS per instance. Local SSD is ephemeral storage, meaning that its lifecycle is tied to the virtual machine to which it is attached, which is another reason why we chose to estimate pricing for the highly available MongoDB 3-node replica set option. Finally, the prices shown above include Google Cloud Platform sustained use discounts, which total about a 30% discount over the course of a full month.
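The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough illustration, not an official calculator. The prices and savings are copied from the table above. The tier schedule is our assumption about how sustained use discounts were structured for these machine types (each successive quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate). The on-prem baseline is a back-calculation that assumes the annual savings cover the 3-node replica set over 12 months; it is not a published figure.

```python
# ASSUMPTION: incremental sustained use tiers, as
# (share of the month, fraction of the base rate charged for that share).
SUSTAINED_USE_TIERS = [(0.25, 1.00), (0.25, 0.80), (0.25, 0.60), (0.25, 0.40)]

# Single-node monthly prices from the table (discount already included).
SINGLE_NODE_MONTHLY = {"n1-highmem-16": 843.60, "n1-standard-32": 1146.10}
# Annual savings figures from the table.
ANNUAL_SAVINGS = {"n1-highmem-16": 24530.88, "n1-standard-32": 13640.40}
REPLICA_SET_NODES = 3


def billed_fraction(usage_fraction):
    """Share of the undiscounted full-month price that is billed when an
    instance runs for usage_fraction (0.0 to 1.0) of the month."""
    remaining = usage_fraction
    billed = 0.0
    for width, rate in SUSTAINED_USE_TIERS:
        used = min(remaining, width)
        billed += used * rate
        remaining -= used
    return billed


def replica_set_monthly(instance_type, nodes=REPLICA_SET_NODES):
    """Monthly price of a replica set built from identical nodes."""
    return round(SINGLE_NODE_MONTHLY[instance_type] * nodes, 2)


def implied_on_prem_monthly(instance_type):
    """Back-calculated on-prem baseline. ASSUMES the annual savings apply
    to the 3-node replica set over 12 months."""
    monthly_savings = ANNUAL_SAVINGS[instance_type] / 12
    return round(replica_set_monthly(instance_type) + monthly_savings, 2)


if __name__ == "__main__":
    print(f"full-month discount: {1 - billed_fraction(1.0):.0%}")
    for name in SINGLE_NODE_MONTHLY:
        print(name, replica_set_monthly(name), implied_on_prem_monthly(name))
```

Under these assumptions a full month of usage is billed at 70% of the base rate, which matches the roughly 30% discount mentioned above, and both instance types back out to an on-prem baseline of roughly $4,575 a month, which is consistent with the 44% and 24% differences in the table.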
The pricing for Cassandra is very similar to MongoDB. Both benefit from Local SSD in terms of performance, and the trade-off between more memory (n1-highmem-16) and more compute (n1-standard-32) is the type of choice that DBAs will have to make when designing a typical Cassandra cluster. Of course, this is just guidance on pricing to get you started; you won't know what's best for your application until you actually run tests yourself.
Running Your Own Tests
As with any benchmark, your mileage may vary when testing your particular workloads. Isolated tests run during benchmarks don’t always equate to real-world performance, so it is important that you run your own tests and assess read-write performance for a workload that closely matches your usage. Take a look at PerfKit and use it to profile your own proposed deployments, including mixing and matching workloads or worker counts.
Pricing NoSQL workloads can be somewhat challenging, but hopefully we’ve given you a way to get started in estimating your costs. If you’re interested in learning more about compute and storage on Google Cloud Platform, check out Google Compute Engine or take a look at the documentation. Feedback is always welcome, so if you’ve got comments or questions, don’t hesitate to let us know in the comments.
We’ve gotten a lot of great feedback about this post, and we wanted to let you know that we will also be posting about cloud pricing for Google Cloud Platform's managed NoSQL options in the near future. In forthcoming blog posts, we’ll talk about how to understand the pricing around Google Cloud Bigtable and Google Cloud Datastore and compare those to other popular managed offerings. Thanks for the questions and comments, keep ‘em coming!
- Posted by Sandeep Parikh and Peter-Mark Verwoerd, Solutions Architects
* - Price was taken from a configure-to-order bare metal server at Softlayer
** - Configuration was unavailable to estimate the monthly price