Streaming Algorithms - Lossy Counting

We had a discussion recently in our office about one of the famous streaming algorithms, known as lossy counting. I decided to write down my understanding of it, focusing on the space complexity discussion and proving why it works in logarithmic space. This is the original paper introducing the algorithm, and most of the information in this blog is (hopefully) a simplified version of the parts of that paper relating to lossy counting.

Problem Statement


There are use cases where you receive a continuous stream of data and want to know which items in that stream exceed a certain frequency threshold. The stream could be really big, so you can't store it all in memory. Moreover, you want to answer queries in real time. So here are the challenges this problem imposes:
1. Need to process the data in a single pass (time constraint)
2. Limited memory
3. Large volume of data arriving in real time

The above issues warrant the need for a smart counting algorithm. As mentioned in the paper, data stream mining to identify frequent events and patterns can be performed by applying one of two algorithms: Lossy Counting and Sticky Sampling. Today we are going to look at the lossy counting algorithm.

Lossy Counting


Some notations first:
s - support threshold, s ∈ (0, 1)
ε - error parameter, ε ∈ (0, 1), ε << s
N - number of elements seen in the stream so far

There could be a query at any point of time to produce the list of items along with their estimated frequencies. The answer returned by the algorithm will have the following guarantees:

1. All items whose true frequency exceeds sN will be in the output (no false negatives).
2. No item whose true frequency is less than (s - ε)N will be in the output.
3. The estimated frequency of an item will be less than its true frequency by at most εN.
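
To make these guarantees concrete: with s = 10%, ε = 1%, and N = 10,000 elements seen so far, every item occurring more than 1,000 times is guaranteed to be in the output, no item occurring fewer than 900 times will appear, and every reported count undercounts the true count by at most 100.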

Now let's see what the lossy counting algorithm is and how it achieves the above three guarantees.

The most important aspect of the lossy counting algorithm is that it divides the incoming stream of data into buckets of width w = ceil(1/ε). Buckets are labeled with ids starting from 1, and the current bucket id is denoted by bcur. Also, for an element e, the true frequency is denoted by fe and the estimated frequency by f.
The items are maintained in a data structure D which has entries of the form (e, f, Δ), where Δ is the maximum possible error in f (the estimated frequency of the element e).
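
To make this concrete, here is a minimal sketch in Python of how this state could be represented (the names follow the notation above; the concrete value of ε is only an illustration, and none of this is code from the paper):

import math

epsilon = 0.001              # error parameter ε, must satisfy ε << s
w = math.ceil(1 / epsilon)   # bucket width w = ceil(1/ε)

# D maps an element e to its entry (f, delta), where
#   f     is the estimated frequency of e, and
#   delta is the maximum possible error in f.
D = {}

n = 0       # N: number of elements seen in the stream so far
b_cur = 1   # bcur: id of the current bucket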

With these definitions in place, here is the algorithm:

Step 1: We get an element e from the stream. Check if it is present in D. If it is, increment its frequency f by 1. If not, insert a new entry (e, 1, bcur - 1) into D.
Step 2: At every bucket boundary, delete all entries which have f + Δ <= bcur.
Step 3: If there is a query to return the frequent elements, return all entries in D with f >= (s - ε)N (this ensures the 2nd guarantee of the algorithm).
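
Putting the three steps together, here is a compact Python sketch that continues the setup above (a toy illustration of the algorithm as I understand it, not the paper's reference implementation):

def insert(e):
    # Step 1: process the next element e from the stream.
    global n, b_cur
    n += 1
    b_cur = math.ceil(n / w)
    if e in D:
        f, delta = D[e]
        D[e] = (f + 1, delta)
    else:
        D[e] = (1, b_cur - 1)
    # Step 2: at a bucket boundary, prune entries with f + delta <= b_cur.
    if n % w == 0:
        for item in list(D):
            f, delta = D[item]
            if f + delta <= b_cur:
                del D[item]

def query(s):
    # Step 3: report every element whose estimate clears (s - ε)N.
    return [e for e, (f, _) in D.items() if f >= (s - epsilon) * n]

Feeding the stream through insert one element at a time and calling query(0.01) at any point would then return, among others, every item whose true frequency exceeds 1% of the elements seen so far, per the first guarantee.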

Now, for a proof of why this works, you can refer to section 4.2 of the paper. It contains a few lemmas with simple proofs which shouldn't be too hard to understand.
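
In my own words, the two facts driving those lemmas are: whenever an entry (e, f, Δ) is in D, we have f <= fe <= f + Δ, and Δ <= bcur - 1 <= εN. Together they say the estimate never overshoots the true frequency and never undershoots it by more than εN (the 3rd guarantee), so any item with fe > sN has f > (s - ε)N and passes the filter in step 3 (the 1st guarantee).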

I mainly want to focus on the space complexity of this algorithm: the data structure D never holds more than (1/ε)log(εN) entries. The math involved in the proof is a little more complicated and not very intuitive, hence I would like to present the proof here:

[Images of the handwritten proof]
It was difficult to include mathematical equations in the blog, hence I have uploaded images of the proof written out on paper. It might be a little hard to read. Sorry about that.
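
For anyone who can't make out the images, here is my reconstruction of the argument, following section 4.2 of the paper (written in LaTeX notation since the blog can't render equations; any slip in the constants is mine):

Let $B = b_{cur}$ and, for $i = 1, \dots, B$, let $d_i$ be the number of entries $(e, f, \Delta)$ in $D$ with $B - \Delta = i$. Such an entry was inserted in bucket $B - i + 1$ and, for $i \ge 2$, survived the pruning at the end of bucket $B - 1$, which forces $f + \Delta > B - 1$, i.e. $f \ge i$ (for $i = 1$ the entry was inserted in the current bucket and trivially has $f \ge 1$). All of these occurrences of $e$ fall inside the last $i$ buckets, which contain at most $iw$ elements in total. Summing over all entries with $B - \Delta \le i$:

\[ \sum_{j=1}^{i} j\, d_j \le i\, w \qquad \text{for every } i = 1, \dots, B. \tag{*} \]

An induction on $i$ (base case $d_1 \le w$; the inductive step combines $(*)$ for $i$ with the hypothesis for smaller prefixes) turns $(*)$ into

\[ \sum_{j=1}^{i} d_j \le \sum_{j=1}^{i} \frac{w}{j}. \]

Taking $i = B$ and using the harmonic bound $\sum_{j=1}^{B} 1/j \le \ln B + 1$ together with $w = \lceil 1/\varepsilon \rceil$ and $B \le \varepsilon N + 1$:

\[ |D| = \sum_{j=1}^{B} d_j \le w \sum_{j=1}^{B} \frac{1}{j} \le \frac{1}{\varepsilon} \log(\varepsilon N) + O\!\left(\frac{1}{\varepsilon}\right), \]

which is logarithmic in the stream length $N$ for a fixed $\varepsilon$.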

But I hope this gives you some confidence in the lossy counting algorithm. I would highly recommend reading the paper. It is written very clearly and should be easy to understand, hopefully even more so after reading this blog.
