
Consistent Hashing

I have been meaning to write this blog for some time now but somehow kept postponing it (read: procrastinating...). I read about the technique of consistent hashing a while back and was mesmerized by its elegance. I recently read about it again in this brilliant blog and thought of expressing it in my own words for posterity. So let us begin.

Hashing

I won't talk too much about hashing since it is a very basic computer science concept. In short, it means mapping one object to another object, or more generally, mapping a key to a value, where their types don't matter. Most commonly, the mapping is from a string to an integer.

Many different hash functions exist, differing in how they randomize the mapping from keys to values. We are going to consider a simple use case: say you have N different servers and a number of keys that you want to distribute among those servers. How do you do it?

Well, a simple strategy is to have a function that applies some logic to map each key to an integer value. So f(key) = some integer value, say val. Since there are N servers to distribute the keys across, we can take val % N, which gives us an integer in [0, N - 1], and send the key to the corresponding server.
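
Here is a minimal sketch of that strategy in Python. The function name assign_server is my own, and MD5 is an arbitrary but stable choice of hash function (Python's built-in hash() is salted per process, so it isn't stable across runs):

```python
import hashlib

def assign_server(key: str, num_servers: int) -> int:
    """Map a key to a server index in [0, num_servers - 1]."""
    # f(key) = val: hash the key to a big integer (MD5 is just an
    # illustrative choice of stable hash function).
    val = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return val % num_servers  # val % N picks the server

print(assign_server("John", 3))  # prints 0, 1 or 2
```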
This works out pretty well if the number of servers (i.e. N) remains constant. However, when dealing with distributed systems, assuming that servers won't go down is pretty unreasonable. Also, in many cases you may want to add more servers to handle more traffic, so the number of servers can go up as well as down.
If we use the same approach (taking % N) as above, each time the number of servers changes we will have to re-hash all the keys with the new number of servers. This leads to a lot of unnecessary moving around of keys across the servers, and it also breaks any local caching present on the servers. This is actually one of the problems faced by distributed databases.
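To see how bad this gets, here is a quick experiment (reusing the hypothetical assign_server from the sketch above) that counts how many keys change servers when a fourth server is added to three existing ones:

```python
keys = [f"key-{i}" for i in range(10_000)]

before = {k: assign_server(k, 3) for k in keys}  # 3 servers
after  = {k: assign_server(k, 4) for k in keys}  # a 4th server is added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed servers")  # roughly 75%
```

With hash-then-mod, going from 3 to 4 servers relocates about three quarters of the keys, even though only a quarter of them actually need to live on the new server.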
Hence we want a solution where, if the number of servers changes, we don't have to re-hash a large number of keys. This is where consistent hashing comes into the picture.

Consistent Hashing

The concept of consistent hashing is very straightforward, and it has to do with circles. Basically, if we can map our keys as well as our servers onto a ring, consistent hashing claims that we can solve the re-hashing problem when servers are added or removed.

Let us assume that we have three servers: A, B and C. Also say we have five keys: John, Steve, Bill, Kate and Jane. For consistent hashing, we need to map all these entities onto a ring. A simple way to do that is to define a function f such that f(key) = theta, where 0 <= theta < 360. Theta can then be treated as an angle and therefore mapped onto a circle.
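
A toy version of such an f in Python might look like the following. The name to_angle is hypothetical, and reducing the hash modulo 360 is just to match the degrees framing above; real implementations keep the full hash range to avoid collisions:

```python
import hashlib

def to_angle(name: str) -> int:
    """f(name) = theta in [0, 360): a position on the ring."""
    digest = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return digest % 360

# Both servers and keys are placed on the same ring.
for name in ["A", "B", "C", "John", "Steve", "Bill", "Kate", "Jane"]:
    print(f"{name:>5} -> {to_angle(name):3d} degrees")
```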

Applying the above strategy, here is how the keys look after the mapping:

[Diagram: servers A, B and C and the five keys placed around the ring]

And once everything is mapped, we define a simple rule: a key belongs to the server closest to it on the ring in the anticlockwise (or clockwise, it doesn't matter as long as you are consistent) direction. Based on this rule, it is easy to see that John maps to server C, Bill maps to B, Jane maps to A, and so on.
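
This lookup is typically implemented by keeping the server positions sorted and binary-searching for the first server at or past the key's position. Below is a minimal sketch, assuming the clockwise (increasing position) variant of the rule; position and lookup are illustrative names, and the full 2**32 hash range is used instead of degrees:

```python
import bisect
import hashlib

def position(name: str) -> int:
    """Hash a key or server name to a point on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % (2**32)

# A sorted list of (position, server) pairs represents the ring.
ring = sorted((position(s), s) for s in ["A", "B", "C"])

def lookup(key: str) -> str:
    """Walk clockwise from the key's position to the first server."""
    idx = bisect.bisect_left(ring, (position(key),)) % len(ring)  # wrap past the top
    return ring[idx][1]

for key in ["John", "Steve", "Bill", "Kate", "Jane"]:
    print(key, "->", lookup(key))
```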

Why is this beneficial, you ask? Well, let's say a server goes down, say server B in the diagram above. How many keys will be shifted? Only those that were mapped to B, which in this case is just one: Bill will now map to server C (the closest server anticlockwise in the new setup) instead of B.
This was a small example, but even with a large number of keys, given a good distribution, the number of keys that have to be shifted is very small compared to our %N approach.
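
Continuing the sketch above, removing a server is just deleting its entries from the sorted list; only the keys that were sitting on it get new homes:

```python
# Server B goes down: drop its point(s) from the ring.
ring = [entry for entry in ring if entry[1] != "B"]

# Keys that were on B move to the next server along the ring;
# every other key's assignment is unchanged.
for key in ["John", "Steve", "Bill", "Kate", "Jane"]:
    print(key, "->", lookup(key))
```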

Moreover, this mapping-on-a-ring strategy allows us to take server capacity into consideration in a very cool way. Let's say some servers are bigger than others and can handle more keys. All we need to do is create more copies of those servers on the ring. So, for example, if server A can handle twice the load of server B, we can create two copies of A, say A0 and A1, and map both of them onto the ring. It is easy to see that with a random distribution, the number of keys mapped to A (via A0 and A1) will be about twice as many as those mapped to B.
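
These copies are usually called virtual nodes. A minimal way to build such a weighted ring, reusing the hypothetical position function from the earlier sketch (build_ring is my own name), could be:

```python
def build_ring(capacities: dict[str, int]) -> list[tuple[int, str]]:
    """Place `weight` virtual copies of each server on the ring."""
    points = []
    for server, weight in capacities.items():
        for i in range(weight):
            points.append((position(f"{server}{i}"), server))  # A0, A1, ...
    return sorted(points)

# A can handle twice the load of B and C, so it gets two points.
ring = build_ring({"A": 2, "B": 1, "C": 1})
```

In practice each server gets many virtual nodes (often hundreds), both to smooth out the randomness of the hash and to allow fine-grained weighting.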

Use Case

Consistent hashing is an incredibly useful technique, used across various distributed systems. Known examples of consistent hashing use include (from Wikipedia):
  • Couchbase automated data partitioning 
  • Partitioning component of Amazon's storage system Dynamo
  • Data partitioning in Apache Cassandra
  • Data partitioning in Voldemort
  • Akka's consistent hashing router
  • Riak, a distributed key-value database
  • Akamai content delivery network
  • Discord chat application

I hope this blog gives you a good understanding of consistent hashing and why it is so useful. Whenever you have to hash things in a distributed system, always consider whether this technique can be applied.
