Database Partitioning and Load Balancer

What is a Load Balancer ?

Suppose we have 'n' servers and many requests coming in from clients, and we want to distribute these requests evenly among the servers. This is where a Load Balancer (LB) comes into the picture. The Load Balancer sits between the clients and the servers. It also tracks the status of every server while distributing requests: if a server is not available, the LB stops sending requests to that server.

Why is Load Balancing important ?

By distributing load among all the available servers, a load balancer improves the application's stability and responsiveness. It reduces the load on each individual server and prevents any one server from becoming a single point of failure.

Health Checks -

Load Balancers perform Health Checks at regular intervals: the LB pings each server and observes the response it sends back. If a server fails to respond a defined number of times, it is removed from the pool and no requests/traffic are forwarded to it.
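The bookkeeping behind a health check can be sketched as follows. This is a minimal illustration, not how any particular load balancer is implemented: the class name `HealthChecker`, the server names and the threshold of 3 consecutive failures are all assumptions, and the actual ping is replaced by a boolean passed in by the caller. It also assumes that a successful response resets the failure count.

```python
from dataclasses import dataclass, field


@dataclass
class HealthChecker:
    """Tracks consecutive failed health checks per server (hypothetical helper)."""

    servers: list[str]
    max_failures: int = 3  # assumed threshold; the text only says "a defined number of times"
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, server: str, responded: bool) -> None:
        # A successful response resets the counter; a missed one increments it.
        self.failures[server] = 0 if responded else self.failures.get(server, 0) + 1

    def healthy_pool(self) -> list[str]:
        # Servers that reached the failure threshold are left out of the pool.
        return [s for s in self.servers if self.failures.get(s, 0) < self.max_failures]


checker = HealthChecker(servers=["app-1", "app-2", "app-3"])
for _ in range(3):
    checker.record("app-2", responded=False)  # app-2 misses three checks in a row
checker.record("app-1", responded=True)
print(checker.healthy_pool())  # ['app-1', 'app-3']
```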

How does a Load Balancer decide where to send the next request ?

Several algorithms are available for configuring an LB, including Least Connection, Least Response Time, Round Robin (the default for AWS Application Load Balancers) and IP Hash.
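Three of these algorithms are small enough to sketch in a few lines. The server names and function names below are hypothetical, and the use of SHA-256 for IP Hash is just one reasonable choice; real load balancers differ in the details.

```python
import hashlib
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]  # hypothetical server names

# Round Robin: hand out servers in a fixed, rotating order.
_rotation = cycle(servers)


def round_robin() -> str:
    return next(_rotation)


# Least Connection: pick the server with the fewest active connections.
def least_connections(active: dict[str, int]) -> str:
    return min(active, key=active.get)


# IP Hash: the same client IP always maps to the same server.
def ip_hash(client_ip: str) -> str:
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]


print([round_robin() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
print(least_connections({"app-1": 12, "app-2": 4, "app-3": 9}))  # app-2
print(ip_hash("203.0.113.7") == ip_hash("203.0.113.7"))  # True
```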

Database Partitioning :

It's a technique to break a big database into smaller parts. When our database grows we have 2 options: either Scale Up or Scale Out. But beyond a certain point we cannot scale up a single system/server any further.

For example, suppose we have scaled up our system to 1TB of RAM and cannot add any more RAM to it. From that point we can add another system with the same configuration and scale out to support more data.

Benefits of Database Partitioning-

The database is easier to manage.
It can give better performance.
Load is spread more evenly across servers.

Types of Database Partitioning-

1) Horizontal Partitioning / Data Sharding-

In this type of partitioning we put different rows into different tables. To decide which row is stored in which table, we select one key. For example, table A stores rows with an ID less than 10K and table B stores rows with an ID of 10K or above. This is also called Range-Based Partitioning.

Problem with Horizontal Partitioning / Data Sharding - In the example above, table A has a fixed range (IDs below 10K), but table B has no upper limit, so every row with an ID of 10K or above is stored in table B. As the data grows, table B keeps growing while table A does not, which results in unbalanced servers. Selecting the range boundaries for the key therefore plays a crucial role in Horizontal Partitioning / Data Sharding.
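The range lookup and the resulting imbalance can be illustrated with a short sketch. The 10K boundary comes from the example; the names `table_A` and `table_B`, the choice to place ID 10K itself in table B, and the sample IDs are assumptions made for illustration.

```python
from bisect import bisect_right

# Exclusive upper bound of each closed range; the last table is open-ended.
BOUNDARIES = [10_000]
TABLES = ["table_A", "table_B"]  # hypothetical table names


def table_for(row_id: int) -> str:
    # bisect_right counts how many boundaries are <= row_id,
    # which is the index of the range the ID falls into.
    return TABLES[bisect_right(BOUNDARIES, row_id)]


# 100 sample IDs spread evenly between 0 and 50K.
counts = {name: 0 for name in TABLES}
for row_id in range(0, 50_000, 500):
    counts[table_for(row_id)] += 1

print(counts)  # {'table_A': 20, 'table_B': 80}
```

With IDs that keep growing past the last boundary, the open-ended table absorbs every new row, which is the imbalance described above.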

2) Key or Hash-Based Partitioning -

In this approach, we apply a hash function to some key attribute of the object we are storing, and the result gives us the partition number. The benefit of this method is its simplicity. For example, suppose we have 4 servers, numbered 0 to 3, and we are storing user data. Applying our hash function to the user ID (ID % 4) gives us the number of the server where the record is stored. So if ID = 3, then 3 % 4 = 3, and the record is stored on server 3.
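The ID % 4 example translates directly into code. The function name `server_for` is hypothetical, and servers are assumed to be numbered from 0.

```python
NUM_SERVERS = 4  # from the example above


def server_for(user_id: int) -> int:
    # The remainder is always in the range 0..NUM_SERVERS-1.
    return user_id % NUM_SERVERS


for user_id in (3, 4, 10, 4001):
    print(user_id, "->", server_for(user_id))
# 3 -> 3
# 4 -> 0
# 10 -> 2
# 4001 -> 1
```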

Problem with Key or Hash-Based Partitioning-

The fundamental problem with this approach is that the hash function depends on a fixed number of servers. If we add more servers in the future, the hash function changes too, which requires redistributing the data and may mean downtime for the service, which is not desirable. To counter this problem we can use Consistent Hashing.
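To make the cost concrete, the sketch below counts how many of 10,000 keys change servers when a fifth server is added, first with ID % N and then with a simple consistent-hash ring. The ring is an illustrative implementation only: the `HashRing` class, the use of SHA-256, the 100 virtual nodes per server and the `db-0` to `db-4` names are all assumptions.

```python
import hashlib
from bisect import bisect_right


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers: list[str], vnodes: int = 100) -> None:
        # Each server is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key: str) -> str:
        # A key belongs to the first point clockwise from its hash,
        # wrapping around to the start of the ring at the end.
        idx = bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(10_000)]
before = HashRing(["db-0", "db-1", "db-2", "db-3"])
after = HashRing(["db-0", "db-1", "db-2", "db-3", "db-4"])

moved_modulo = sum(i % 4 != i % 5 for i in range(10_000))
moved_ring = sum(before.server_for(k) != after.server_for(k) for k in keys)

print(f"modulo: {moved_modulo} of 10000 keys move")  # 8000
print(f"consistent hashing: {moved_ring} of 10000 keys move")
```

With modulo hashing, 80% of the keys land on a different server. With the ring, only the keys taken over by the new server move, which should be roughly one fifth of them here.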

3) Vertical Partitioning Or Feature-Based Partitioning -

In this method, we separate our data based on its type and save each type to its own server. For example, if we are building an application like Instagram, we'll save user photos on one DB server, user profile info on a different DB server, and so on. This method is very simple to understand and easy to implement. However, there is one issue with this approach: if one of the DB servers ends up storing much more data than it can handle, we need to partition that DB server further.
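In application code, feature-based routing can be as small as a lookup table. The feature names and server names below are hypothetical, chosen to mirror the Instagram example.

```python
# Hypothetical mapping from feature to the DB server that owns its data.
FEATURE_TO_SERVER = {
    "photos": "photos-db",
    "profiles": "profiles-db",
}


def server_for(feature: str) -> str:
    try:
        return FEATURE_TO_SERVER[feature]
    except KeyError:
        raise ValueError(f"no DB server configured for feature {feature!r}") from None


print(server_for("photos"))    # photos-db
print(server_for("profiles"))  # profiles-db
```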

4) Directory-Based Partitioning Or Map-Based Partitioning -

In this type of partitioning we create a map table that stores, for each record, the DB server it is stored on. The upside of this method is that we can simply query the lookup server to find out which DB server holds a record. Even after redistributing data across servers, we only need to change the mapping in the lookup server. The first downside of this approach is that if the map table itself grows very large, we may need to partition it too, which results in Cascading Distribution. The second is that a single map table/server becomes a single point of failure, and replicating it requires separate infrastructure.
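The lookup server can be pictured as a simple key-to-server map. The sketch below keeps that map in memory; the `ShardDirectory` class, its method names and the record and server names are hypothetical, and a real deployment would keep the map in its own replicated store.

```python
class ShardDirectory:
    """In-memory stand-in for the lookup server (hypothetical)."""

    def __init__(self) -> None:
        self._location: dict[str, str] = {}

    def assign(self, record_id: str, server: str) -> None:
        self._location[record_id] = server

    def locate(self, record_id: str) -> str:
        return self._location[record_id]

    def move_all(self, from_server: str, to_server: str) -> int:
        # After redistributing data, only the mapping has to change.
        moved = 0
        for record_id, server in self._location.items():
            if server == from_server:
                self._location[record_id] = to_server
                moved += 1
        return moved


directory = ShardDirectory()
directory.assign("user-1", "db-0")
directory.assign("user-2", "db-1")
directory.assign("user-3", "db-0")

moved = directory.move_all("db-0", "db-2")
print(moved, directory.locate("user-1"))  # 2 db-2
```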

Side-effects of Data Partitioning-

1) Joins and Denormalization -

Normally we use joins in an RDBMS, but after partitioning we cannot use joins directly because our data is distributed across different servers. One option is to denormalize the data so that the queries no longer need a join. If we still want to join data that lives on different database servers, the join has to be performed across servers, and most RDBMSs do not support this, so we have to implement it in application code. As a result, the application becomes more complex.
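A join done in application code might look like the sketch below: fetch rows from each server separately, then combine them with an in-memory hash join. The two lists stand in for query results from two different DB servers, and the table layout, field names and values are invented for illustration.

```python
# Hypothetical query results: users live on one server, orders on another.
users_shard = [
    {"user_id": 1, "name": "Asha"},
    {"user_id": 2, "name": "Ben"},
]
orders_shard = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 2, "total": 40.0},
    {"order_id": 102, "user_id": 1, "total": 15.0},
]


def join_orders_with_users(users: list[dict], orders: list[dict]) -> list[dict]:
    # Hash join: index one side by the join key, then probe it with the other.
    users_by_id = {user["user_id"]: user for user in users}
    return [
        {**order, "name": users_by_id[order["user_id"]]["name"]}
        for order in orders
        if order["user_id"] in users_by_id  # inner join: skip unmatched orders
    ]


joined = join_orders_with_users(users_shard, orders_shard)
for row in joined:
    print(row["order_id"], row["name"], row["total"])
# 100 Asha 25.0
# 101 Ben 40.0
# 102 Asha 15.0
```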

2) Referential Integrity -

In an RDBMS we use foreign key constraints to keep references between tables valid. Once the data is partitioned across servers, this referential integrity is lost, because most RDBMSs do not support foreign key constraints across shards. Referential integrity then has to be enforced in application code.

3) Rebalancing -

After partitioning, the data distribution may turn out not to be uniform, so that one of the servers carries more load than the others. In that case we may need to partition that DB server again and move data between servers, which is the rebalancing problem.
