I'm learning to love NoSQL databases for their ability to scale and replicate, but it's tough to beat Postgres for projects whose scaling requirements are more modest. The fact is, JOINs are handy. And so are all the wonderful bells and whistles that a mature relational database can offer, like transactions and multiple indexing strategies.
Read on if you want to know how far you can scale with a traditional SQL database like Postgres.
The scaling options for applications that use Postgres to store and query data are as follows, in the order you'll want to try them to prevent your system from falling over when it gets busy.
I start with a caching strategy because you should start every project with a good one. It's that kind of important. Every SQL database that you might find yourself using is the product of the considerable effort of a team of very smart individuals. With that said, if you ask it the same question twice in a row, it's likely not smart enough to determine whether the answer will be different the second time without doing all the work it did the first time over again.
Your caching strategy should provide some way of ensuring that a run of redundant queries is reduced to a single database query, and that subsequent identical queries return the answer without consulting the database at all.
There are several ways of accomplishing this, but one of the simplest is to compute a hash of the SQL query itself and store the result of the query, keyed by that hash, in a key/value store like Redis. Then you access the query response data through a function that first checks the key/value store for the hash, returning the cached data if present and, if not, returning freshly-queried data with which it also primes the cache. In this way, multiple identical queries only ever result in a single database access.
Of course, you need to intelligently expire the cache entries. You might just invalidate every key every 5 minutes, or you might find a way to invalidate certain specific keys when any underlying data in the database is changed. However you do it, the idea is to prevent the data in the cache from becoming stale.
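The hash-and-check pattern above can be sketched in a few lines. This is a minimal illustration, not a production cache: a plain dict with timestamps stands in for Redis, the 5-minute TTL matches the blanket-expiry approach mentioned above, and `run_query` is a hypothetical placeholder for whatever function actually talks to the database.

```python
import hashlib
import time

# Stand-in for a Redis connection: a dict mapping cache keys
# to (expiry_timestamp, result) pairs.
_cache = {}
TTL_SECONDS = 300  # blanket expiry: entries go stale after 5 minutes

def cache_key(sql, params=()):
    """Hash the query text and its parameters into a stable cache key."""
    payload = sql + "|" + repr(params)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_query(sql, params, run_query):
    """Return a cached result if present and fresh; otherwise run the
    query, prime the cache, and return the fresh result."""
    key = cache_key(sql, params)
    entry = _cache.get(key)
    if entry is not None and entry[0] > time.time():
        return entry[1]  # cache hit: no database access
    result = run_query(sql, params)  # cache miss: hit the database once
    _cache[key] = (time.time() + TTL_SECONDS, result)
    return result
```

With real Redis you would get the expiry for free by writing the entry with a TTL (e.g. `SET key value EX 300`) instead of tracking timestamps yourself.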
The point of all this is to find a way to use computationally cheap methods of query data retrieval whenever possible, so you can conserve expensive database resources. It will quite likely involve some application code to get things working as they should (and this is a great place to insist on a high level of test coverage), but nothing else you'll do will have as much bang for your buck as replacing unnecessarily complex queries with simple key/value lookups.
This is all about ROI. It is often smarter to spend an extra couple hundred dollars every month for the next few years than it is to pay someone like me to find a way to make things faster.
Databases thrive on memory and make a lot of disk accesses, so switching to a machine with more CPU, more RAM, and faster storage is the most direct upgrade available. It can get expensive, especially in a cloud environment, but it's quick and easy, and it's guaranteed to work in a way that software fixes seldom are. If you want your database to be faster, you can simply make sure it's running on a faster machine.
A couple of simple and relatively inexpensive things to do when you're operating at a smaller scale and within a tight budget: 1) make sure your provider is using SSDs rather than spinning-rust hard drives, and 2) increase the RAM available to the database.
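When you do upgrade the hardware, it helps to tell Postgres about it. The settings below are a sketch of commonly recommended starting points for a hypothetical machine with 16 GB of RAM and SSD storage; the right values depend entirely on your workload.

```ini
# postgresql.conf -- illustrative values, tune to your own workload
shared_buffers = 4GB          # commonly sized around 25% of system RAM
effective_cache_size = 12GB   # hint to the planner about how much RAM
                              # the OS has available for disk caching
random_page_cost = 1.1        # the default of 4.0 assumes spinning disks;
                              # on SSDs random reads are nearly as cheap
                              # as sequential ones
```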
At the end of the day, it comes down to writing bigger checks, but it doesn't have to be drastic. If a 2- or 3-fold increase in performance would make all the difference you're looking for, that can almost always be achieved more economically at the AWS console.
So, if you've done a good job of caching and you've run out of ways to have the database run on faster and faster machines, your next step is to start building clusters.
This can be tricky and, if you're at this stage, you should probably be doing a good bit of homework on your options. But one easy strategy is to separate reads from writes, so that your application reads from the database at one network location and writes to it at another.
Because systems usually spend a lot more time SELECTing than INSERTing and UPDATING, and because it's easier to cluster in a Master/Slaves configuration than a Master/Master one, arranging to write exclusively to the Master node and to read exclusively from a pool of Slave nodes becomes an easy way to enjoy what is effectively horizontal scaling using a relatively simple and consistent clustering model.
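A minimal sketch of that read/write split at the application layer might look like the following. The DSNs are hypothetical placeholders, and a real router would make the decision per transaction (and worry about replication lag) rather than naively inspecting each statement, but the shape of the idea is this:

```python
import itertools

# Hypothetical connection strings: one writable master, a pool of
# read-only slave replicas.
MASTER_DSN = "postgresql://master.db.internal/app"
SLAVE_DSNS = [
    "postgresql://slave1.db.internal/app",
    "postgresql://slave2.db.internal/app",
]
_slave_cycle = itertools.cycle(SLAVE_DSNS)

def route(sql):
    """Pick a DSN for a statement: reads are round-robined across
    the slave pool, everything else goes to the master."""
    if sql.lstrip().upper().startswith("SELECT"):
        return next(_slave_cycle)
    return MASTER_DSN
```

Many Postgres drivers and proxies can do this routing for you, which is usually preferable to hand-rolling it; the sketch is only meant to show where the split happens.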
You can imagine that, eventually, your system might reach a point where it can no longer handle the volume of writes coming in or the speed of your network has placed a limit on the number of read nodes the master/slave configuration can support. When that happens, the next step is to look at sharding the database somehow.
Effectively, sharding distributes the load across multiple servers, in much the same way that the welcome tables at voting precincts have one line for last names beginning with A-M and another for names beginning with N-Z. Depending on the database, there may be a way to handle this automatically, but it is also possible to do this sort of thing with application code.
If your database gives you a way to shard automatically, that is almost certainly a better route than coming up with a custom solution, and there are definitely projects which enable automatic sharding in Postgres (Citus is a well-known example).
Personally, I've always managed to walk away from the edge before any of this was necessary.
In my opinion, the added administrative effort and computational costs associated with sharding a SQL database like Postgres make almost every other option look infinitely more attractive. If the idea starts coming up at meetings, or in terrible dreams at night, it's time to look at splitting the database and moving some of its data to separate databases in separate clusters. If relocating a set of tables won't result in trying to join tables that no longer co-exist on the same machine, then those tables can be relocated. Or maybe the application code needs to be examined to see if there isn't a way to reduce the load on that side of things. Anything, really, because things get a lot more complicated from here on out.
Frankly, this is the point at which a rethink of your data strategy is probably in order and you might think about a storage layer based on something like Cassandra or Elasticsearch.
In future articles, I'll explore alternative databases, how they manage scaling concerns, and how they force developers to rethink their solutions to problems.