Working with Sharding in MongoDB

MongoDB is a document-oriented database that can manage huge amounts of data. Storing such a large volume of data is a challenge in itself: problems begin to arise when a single machine can no longer hold the data or can no longer provide acceptable read and write throughput.
To deal with this problem, MongoDB provides a feature called sharding, which we discuss below.

What is sharding in MongoDB?

According to mongodb.org, sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.

Why should we use sharding in MongoDB?

  • In replication, all writes go to the primary (master) node
  • Latency-sensitive queries still go to the primary
  • A single replica set is limited to 12 members (50 since MongoDB 3.0)
  • Memory cannot be large enough when the active data set is big
  • The local disk is not big enough
  • Vertical scaling is too expensive

When you look at the problems above, you can see that they come up regularly. To solve them, database systems have two basic approaches: vertical scaling and sharding.

Vertical scaling:
Vertical scaling adds more CPU and storage resources to increase capacity. This method has its own limitations. High-performance systems with large numbers of CPUs and large amounts of RAM are disproportionately more expensive than smaller systems. In addition, cloud-based providers may only offer smaller instances. As a result, there is a practical maximum for vertical scaling.

Sharding:
Sharding or horizontal scaling divides the data set and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively, the shards make up a single logical database. Sharding supports high throughput and large data sets:

Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally.

Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows.

The diagram below illustrates this:

[Diagram: sharding]
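
To make the effect of adding shards concrete, here is a minimal sketch in Python. It is not how MongoDB places documents (MongoDB partitions a collection into chunks by shard key, as described later); it simply assigns hypothetical keys to shards with a stable hash and a modulo, which is enough to show that each shard holds less data as the number of shards grows. The names shard_for and documents_per_shard are made up for this illustration.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def documents_per_shard(keys, num_shards):
    """Count how many keys land on each shard."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


# Hypothetical data set: 10,000 documents keyed by a user name.
keys = [f"user{i}" for i in range(10_000)]

for n in (1, 2, 4, 8):
    per_shard = documents_per_shard(keys, n)
    print(n, "shard(s): largest shard holds", max(per_shard), "documents")
```

The total number of documents stays the same in every run; only the share held by each shard shrinks.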

Sharding in MongoDB:

MongoDB supports sharding through the configuration of sharded clusters. A sharded cluster consists of the following components: shards, query routers, and config servers.

Shards:
MongoDB uses shards to store the data. To provide high availability and data consistency, each shard is a replica set.

Query Routers:
Query routers (mongos instances) interface with client applications and direct operations to the appropriate shard or shards. A sharded cluster can contain more than one query router to divide the client request load. A client sends requests to one query router.

Config server:
Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards.

The diagram below shows how these components fit together.

[Diagram: sharding in MongoDB]
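
The way a query router uses the config servers' metadata can be sketched roughly as a range lookup. The Python below is only an illustration under assumptions: the chunk boundaries, the shard names, and the functions target_shard and target_shards_for_range are hypothetical, and the real metadata format is more involved. Each chunk covers a half-open range of shard key values and is owned by one shard.

```python
import bisect

# Hypothetical metadata: chunk i covers [CHUNK_LOWER_BOUNDS[i], CHUNK_LOWER_BOUNDS[i + 1])
# and the last chunk is open-ended. Lower bounds must be sorted.
CHUNK_LOWER_BOUNDS = [float("-inf"), 100, 200, 300]
CHUNK_SHARDS = ["shardA", "shardB", "shardA", "shardC"]


def target_shard(shard_key_value):
    """Return the shard that owns the chunk containing this shard key value."""
    index = bisect.bisect_right(CHUNK_LOWER_BOUNDS, shard_key_value) - 1
    return CHUNK_SHARDS[index]


def target_shards_for_range(low, high):
    """Return the shards owning chunks that overlap [low, high), with low < high."""
    first = bisect.bisect_right(CHUNK_LOWER_BOUNDS, low) - 1
    last = bisect.bisect_left(CHUNK_LOWER_BOUNDS, high) - 1
    return sorted(set(CHUNK_SHARDS[first:last + 1]))


print(target_shard(50))                   # key in the first chunk
print(target_shards_for_range(150, 250))  # range spanning two chunks
```

A lookup by a single key touches one shard, while a range query may need to visit several. A query that does not include the shard key at all cannot be targeted this way and has to be sent to every shard.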

Data Partitioning:

In a sharded cluster, a collection is broken into chunks that are distributed across the shards. But how does MongoDB break the collection into parts and keep them evenly spread? MongoDB maintains a balanced cluster using two background processes: splitting and the balancer.

Splitting:
Splitting is a background process that keeps chunks from growing too large. When a chunk grows beyond a specified size limit, MongoDB splits it in half. Inserts and updates trigger splits. A split is an efficient metadata change: no data is migrated and no shard is affected.
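
The splitting idea can be sketched in a few lines of Python. This is a simplified model, not MongoDB's implementation: the Chunk class, the insert and split functions, and the MAX_DOCS_PER_CHUNK limit are hypothetical, a document count stands in for the real size limit, and shard key values are assumed to be distinct. Note that both halves stay on the same shard, which is why a split moves no data.

```python
from dataclasses import dataclass, field

MAX_DOCS_PER_CHUNK = 4  # hypothetical limit; stands in for a size limit in bytes


@dataclass
class Chunk:
    lower: float   # inclusive lower bound of the shard key range
    upper: float   # exclusive upper bound of the shard key range
    shard: str     # shard that currently owns the chunk
    keys: list = field(default_factory=list)


def split(chunk):
    """Split a chunk at its median key; both halves stay on the same shard."""
    middle = chunk.keys[len(chunk.keys) // 2]
    if middle == chunk.keys[0]:
        return [chunk]  # no usable split point (duplicate keys); leave as is
    left = Chunk(chunk.lower, middle, chunk.shard, [k for k in chunk.keys if k < middle])
    right = Chunk(middle, chunk.upper, chunk.shard, [k for k in chunk.keys if k >= middle])
    return [left, right]


def insert(chunks, key):
    """Insert a key into the chunk covering it, splitting the chunk if it grows too large."""
    for i, chunk in enumerate(chunks):
        if chunk.lower <= key < chunk.upper:
            chunk.keys.append(key)
            chunk.keys.sort()
            if len(chunk.keys) > MAX_DOCS_PER_CHUNK:
                chunks[i:i + 1] = split(chunk)
            return
    raise ValueError(f"no chunk covers key {key!r}")


# Start with one chunk covering the whole key space on a single shard.
chunks = [Chunk(float("-inf"), float("inf"), "shardA")]
for key in [10, 20, 30, 40, 50]:
    insert(chunks, key)

for chunk in chunks:
    print(chunk.lower, chunk.upper, chunk.shard, chunk.keys)
```

The fifth insert pushes the chunk over the limit, so it is replaced by two chunks that meet at the median key.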

Balancing:
The balancer is a background process that manages chunk migrations. In the MongoDB releases this description is based on, the balancer can run from any of the query routers in a cluster (since MongoDB 3.4 it runs on the primary of the config server replica set). When the distribution of a sharded collection in a cluster is uneven, the balancer migrates chunks from the shard that has the largest number of chunks to the shard with the fewest chunks until the collection is balanced. If there is an error during the migration, the balancer aborts the process, leaving the chunk unchanged on the origin shard. MongoDB removes the chunk’s data from the origin shard after the migration completes successfully.
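
The balancing loop described above can be sketched as follows. This is a toy model under assumptions: the balance function, the shard and chunk names, and the threshold of 2 are hypothetical, and the real balancer applies its own thresholds and moves actual data rather than list entries.

```python
def balance(chunks_by_shard, threshold=2):
    """Move chunks from the fullest shard to the emptiest one until they are even.

    chunks_by_shard maps a shard name to a list of chunk ids and is modified
    in place. Returns the list of (chunk, origin, destination) migrations.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2, or the loop never settles")
    migrations = []
    while True:
        donor = max(chunks_by_shard, key=lambda s: len(chunks_by_shard[s]))
        recipient = min(chunks_by_shard, key=lambda s: len(chunks_by_shard[s]))
        gap = len(chunks_by_shard[donor]) - len(chunks_by_shard[recipient])
        if gap < threshold:
            return migrations  # distribution is even enough
        chunk = chunks_by_shard[donor].pop()
        chunks_by_shard[recipient].append(chunk)
        migrations.append((chunk, donor, recipient))


# Hypothetical cluster with an uneven distribution.
cluster = {
    "shardA": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "shardB": ["c7"],
    "shardC": [],
}
moves = balance(cluster)
for chunk, origin, destination in moves:
    print(f"moved {chunk} from {origin} to {destination}")
print({shard: len(chunks) for shard, chunks in cluster.items()})
```

Each pass moves a single chunk, so the cluster converges gradually instead of shuffling everything at once.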

The shards manage data migrations as a background operation between an origin shard and a destination shard. During a migration, the destination shard is sent all the current documents in the chunk from the origin shard. After this, the destination shard captures and applies all changes made to the data during the migration process. Finally, the metadata on the config servers regarding the location of the chunk is updated.
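
The migration steps above can be walked through with plain dictionaries. This is a sketch under assumptions, not MongoDB's protocol: migrate_chunk, its parameters, and the simulated failure flag are hypothetical, shards are modeled as nested dictionaries, and the metadata dictionary stands in for the config servers.

```python
def migrate_chunk(chunk_id, destination, shards, metadata,
                  writes_during_copy=(), fail_during_copy=False):
    """Move one chunk to the destination shard; return True on success.

    shards:   shard name -> {chunk id -> {document id -> document}}
    metadata: chunk id -> name of the shard that owns the chunk
    """
    origin = metadata[chunk_id]

    # Step 1: the destination receives all current documents in the chunk.
    staged = dict(shards[origin][chunk_id])
    if fail_during_copy:
        # Abort: nothing was committed, so the chunk stays unchanged on the origin.
        return False

    # Step 2: writes made during the migration still land on the origin and
    # are then captured and applied by the destination.
    for doc_id, document in writes_during_copy:
        shards[origin][chunk_id][doc_id] = document
        staged[doc_id] = document

    # Step 3: the chunk's location is updated in the metadata.
    shards[destination][chunk_id] = staged
    metadata[chunk_id] = destination

    # Step 4: the origin removes its copy only after the migration succeeded.
    del shards[origin][chunk_id]
    return True


shards = {"shardA": {"chunk1": {1: "a", 2: "b"}}, "shardB": {}}
metadata = {"chunk1": "shardA"}

ok = migrate_chunk("chunk1", "shardB", shards, metadata, writes_during_copy=[(3, "c")])
print(ok, metadata, shards)
```

The ordering matters: the origin keeps serving the chunk until the metadata points at the destination, so a failed migration leaves the cluster exactly as it was.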

Beyond the sharding process described above, you can also add or remove a shard. This is trickier than it sounds: adding an empty shard leaves the cluster unbalanced until data is migrated to it from the other shards. The same goes for removal: the data on the shard being removed must first be migrated to and balanced across the remaining shards. Only after this process completes can the shard be removed.
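
Removing a shard can be pictured as draining it first. The Python sketch below is illustrative only: drain_shard and the shard and chunk names are hypothetical, and in a real cluster the draining is carried out by the balancer after the shard has been marked for removal (for example with the removeShard command).

```python
def drain_shard(chunks_by_shard, shard_to_remove):
    """Move every chunk off one shard onto the least-loaded remaining shards.

    chunks_by_shard maps a shard name to a list of chunk ids and is modified
    in place. The drained shard is dropped from the mapping at the end.
    """
    remaining = [s for s in chunks_by_shard if s != shard_to_remove]
    if not remaining:
        raise ValueError("cannot remove the last shard in the cluster")
    while chunks_by_shard[shard_to_remove]:
        chunk = chunks_by_shard[shard_to_remove].pop()
        # Send each chunk to whichever remaining shard currently has the fewest.
        target = min(remaining, key=lambda s: len(chunks_by_shard[s]))
        chunks_by_shard[target].append(chunk)
    # Only an empty shard can be removed.
    del chunks_by_shard[shard_to_remove]
    return chunks_by_shard


cluster = {
    "shardA": ["c1", "c2"],
    "shardB": ["c3", "c4"],
    "shardC": ["c5", "c6"],
}
drain_shard(cluster, "shardC")
print(cluster)
```

Adding a shard is the mirror image: the new shard starts with no chunks, and the balancer gradually migrates chunks to it from the existing shards.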
