Sharding refers to the breaking up of our data into chunks (shards). Sharding benefits by allowing us to scale our database for large scale systems.
Sharding can be implemented at the application level where the code will determine which database shard is to be read or written to.
There are many ways to implement sharding. Let's look at some common ways.
In this method, we select a specific unique column and hash it. The result hash can be used to determine a shard where the read or write operation needs to go into.
For example, if we are trying to access User A's profile which has an user id,
user123, then we firstly hash this. Let's say we obtain a simplisitic value of 100. We can now use the modulo operation on this value based on the number of shards we have. If we have 7 shards, then 100 % 7 = 2. Our read operation will hit Shard 2.
This simple hash-based sharding helps us to consistently direct our entries to the respective shards.
In range-based sharding, we create ranges for our entries. For example, all customers with names starting with A to L will go into Shard 1 and the rest goes into Shard 2.
We will just keep a hash table where we have this logic. When a request comes, we check the name of our customer and determine the respective shard we need to hit.
Sharding is helpful for us to horizontally scale our database (scaling out). This is in contrast to vertically scaling (scaling up) where we upgrade our existing instance of the database by adding more computing power in the form of RAM, CPU etc.
However, vertical scaling is limited. There is only a certain limit for your upgrades and it becomes increasingly expensive to keep upgrading vertically. Horizontal scaling solves this by allowing you to freely scale as per what the database requirements are and the setup is flexible too.
Sharding helps us to reduce latency in our queries. Instead of searching all rows, now we can focus our query search on a specific shard where there are fewer rows. This means we get our results back more quickly.
Sharding can help us mitigate outages. In the event of an outage, an single database may cause the entire application to fail. On a sharded database, outages may only affect a specific shard and hence only a portion of the application and users will be impacted.
Sharding is a complex task that needs to be done with perfection. The last issue we want to have is to deal with lost data or corruption. This will impact the consistency of the database which is one of the ACID properties.
Following this, complexity also arises in the form of management. Managing a single database is much easier than managing data across multiple shards.
A database hotspot occurs when we start to overload a single shard. For example, if we are sharding a database of attractions based on the cities, then certain tourist spots in certain cities may be more popular. Hence, certain shards will be overloaded in terms of read and writes which makes our database prone to crashes or failures.
Sharding is a decision which needs to be made carefully. Once you implement sharding, it is going to be hard to revert your database back to one single piece. There are going to be multiple rows and columns you need to merge and your backups, partitions etc. may all be out of sync. This is also economically expensive, requiring time and effort.
While some databases support sharding out-of-the-box (e.g. MongoDB), other popular ones like PostgreSQL requires manual effort to implement sharding.
One of the scenarios we need to plan for is what if we need to increase the number of shards in future? Would the current implementation allow for that easily?
Such questions are important as the last thing we want is to get stuck on further scaling out our app!
Sharding is a powerful way to scale, but it does come with its own set of complexities and issues. Hence, sharding should usually be used after you have exhausted some of the primary ways of scaling such as:
Vertical scaling is still useful for us and can be considered as a first solution. If just a simple upgrade of the RAM or CPU solves your performance issues, then why shard? Sharding should come in when vertical scaling involves too much cost or does not solve a long term issue.
If our database is read heavy, then the simplest and first approach to improve performance is to leverage a cache-based mechanism. We can temporaily store frequently requested data into a cache and save our database from duplicate reads.
We can use database replica sets to solve for read-heavy performance. The replica sets mirror the primary database server and with the help of a load balancer, we can direct the queries to various replicas instead of the same single database.
An added advantage here is of course redundancy which helps us to mitigate outages as replica sets can take over the failed primary database server.
Another way to improve performance is to optimise our queries to be more efficient. If you have sharding, but your queries involve complex joins then you are not really solving the root of the issue.