We had a discussion recently in our office about one of the famous streaming algorithms, lossy counting. I decided to write down my understanding of it, focusing on the space complexity discussion and on why it works in logarithmic space. This is the original paper introducing the algorithm, and most of the information in this blog is (hopefully) a simplified version of the parts of that paper relating to lossy counting.
Problem Statement
There are use cases where you receive a continuous stream of data and want to know which items in that stream occur more often than a certain frequency threshold. The stream can be very large, so you can't store it all in memory, and you want to answer queries in real time. So here are the challenges this problem imposes:
1. Need to do it in a single pass (time constraint)
2. Limited memory
3. Large volume of data in real-time
The above issues warrant a smart counting algorithm. As the paper mentions, data stream mining to identify events & patterns can be performed with two algorithms: Lossy Counting and Sticky Sampling. Today we are going to look at the lossy counting algorithm.
Lossy Counting
Some notations first:
s - support threshold, s ∈ (0, 1)
ε - error parameter, ε ∈ (0, 1), with ε << s
N - number of elements seen in the stream so far
There could be a query at any point in time to produce a list of items along with their estimated frequencies. The answer returned by the algorithm has the following guarantees:
1. All items whose true frequency exceeds sN will be in the output (no false negatives)
2. No items with true frequency less than (s - ε)N will be in the output.
3. Estimated frequency of the item will be less than the true frequency by at most εN.
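To make these guarantees concrete, here is a tiny illustration with made-up numbers (these particular values of s, ε, and N are just for this example, not from the paper):

```python
# Illustrative parameters: 1% support, 0.1% error, one million stream elements.
s, eps, N = 0.01, 0.001, 1_000_000

print(s * N)          # 10000.0 -> every item occurring more than 10,000 times is reported
print((s - eps) * N)  # 9000.0  -> no item occurring fewer than 9,000 times is reported
print(eps * N)        # 1000.0  -> reported counts undercount the truth by at most 1,000
```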
Now let's see what the lossy counting algorithm is and how it achieves the above three guarantees.
The most important aspect of the lossy counting algorithm is that it divides the incoming stream of data into buckets of width w = ceil(1/ε); for example, ε = 0.01 gives a bucket width of 100. Buckets are labeled with ids starting from 1, and the current bucket id is denoted by bcur. For an element e, the true frequency is denoted by fe and the estimated frequency by f.
The items are maintained in a data structure D with entries of the form (e, f, Δ), where Δ is the maximum possible error in f (the estimated frequency of element e).
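In code, one straightforward way to hold D is a hash map keyed by the element; the dict layout below is my own choice for this sketch, not something prescribed by the paper:

```python
# D maps each tracked element e to its (estimated frequency f, maximum error delta).
D = {}

# Example: a new element arriving while we are processing bucket bcur = 3
b_cur = 3
D["apple"] = (1, b_cur - 1)   # entry (e="apple", f=1, delta=2)
```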
Here is the final algorithm now:
Step 1: We get an element e from the stream. Check if it is present in D. If it is, increment its frequency f by 1. If not, insert an entry (e, 1, bcur - 1) into D.
Step 2: At every bucket boundary, delete all entries which have f + Δ <= bcur.
Step 3: If there is a query to return the frequent elements, return all elements in D with f >= (s - ε)N. (Since f never overestimates the true frequency, this threshold ensures the 2nd guarantee; and because f underestimates by at most εN, it also delivers the 1st.) A code sketch of the whole procedure follows below.
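Putting the three steps together, here is a minimal sketch of the algorithm in Python. The class and method names are my own, and the dictionary-based representation of D is just one convenient choice; the paper does not prescribe a particular layout:

```python
import math

class LossyCounter:
    """Minimal sketch of lossy counting (my own naming, not from the paper)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width w = ceil(1/epsilon)
        self.n = 0                            # N: elements seen so far
        self.entries = {}                     # D: element -> (f, delta)

    def add(self, e):
        self.n += 1
        b_cur = math.ceil(self.n / self.width)   # current bucket id, starting at 1

        # Step 1: increment if present, otherwise insert (e, 1, b_cur - 1).
        if e in self.entries:
            f, delta = self.entries[e]
            self.entries[e] = (f + 1, delta)
        else:
            self.entries[e] = (1, b_cur - 1)

        # Step 2: at a bucket boundary, prune entries with f + delta <= b_cur.
        if self.n % self.width == 0:
            self.entries = {
                item: (f, delta)
                for item, (f, delta) in self.entries.items()
                if f + delta > b_cur
            }

    def query(self, s):
        # Step 3: report every element whose estimated frequency f >= (s - epsilon) * N.
        threshold = (s - self.epsilon) * self.n
        return {item: f for item, (f, delta) in self.entries.items() if f >= threshold}
```

Usage is then just `counter = LossyCounter(epsilon=0.001)`, a call to `counter.add(item)` for each incoming element, and `counter.query(s=0.01)` whenever an answer is needed. Rebuilding the dict at every bucket boundary keeps the sketch short; a production implementation would more likely delete the pruned entries in place.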
Now for a proof of why this works, you can refer to section 4.2 of the paper. There are a few lemmas with simple proofs which shouldn't be too hard to understand.
I mainly want to focus on the space complexity of this algorithm. The math involved in that proof is a little more complicated and not very intuitive, hence I would like to present the proof here:
It was difficult to include mathematical equations in the blog, hence I have uploaded images of the proof written out on paper. It might be a little hard to read. Sorry about that.
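In case the images are hard to read, the statement being proved (as best I recall the paper's theorem, up to the base of the logarithm) is that the number of entries in D is bounded by

```latex
\[
  |D| \;\le\; \frac{1}{\varepsilon}\,\log(\varepsilon N)
\]
```

which for a fixed ε grows only logarithmically in N, which is exactly the "logarithmic space" claim from the introduction.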
But I hope this gives you some confidence in the lossy counting algorithm. I would highly recommend reading the paper; it is written very clearly and should be easy to understand, hopefully even more so after reading this blog.