
Streaming Algorithms - Lossy Counting

We recently had a discussion in our office about one of the famous streaming algorithms, lossy counting. I decided to write down my understanding of it, focusing on the space complexity discussion and on proving why it works in logarithmic space. The algorithm was introduced in the paper "Approximate Frequency Counts over Data Streams" by Manku and Motwani, and most of what follows is (hopefully) a simplified account of the parts of that paper relating to lossy counting.

Problem Statement


There are use cases where you receive a continuous stream of data and want to know which items in that stream exceed a certain frequency threshold. The stream could be really big, so you can't store it in memory. Moreover, you want to answer queries in real time. So here are the challenges this problem imposes:
1. A single pass over the data (time constraint)
2. Limited memory
3. A large volume of data arriving in real time

The above constraints call for a smart counting algorithm. As the paper describes, data stream mining to identify frequent events and patterns can be performed with two algorithms: Lossy Counting and Sticky Sampling. Today we are going to look at the lossy counting algorithm.

Lossy Counting


Some notations first:
s - support threshold, s ∈ (0, 1)
ε - error parameter, ε ∈ (0, 1), ε ≪ s
N - number of elements seen in the stream so far

A query can arrive at any point in time asking for a list of items along with their estimated frequencies. The answer returned by the algorithm comes with the following guarantees:

1. All items whose true frequency exceeds sN will be in the output (no false negatives).
2. No item with true frequency less than (s - ε)N will be in the output.
3. The estimated frequency of an item is less than its true frequency by at most εN.

Now let's see what the lossy counting algorithm is and how it achieves these three guarantees.

The most important aspect of the lossy counting algorithm is that it divides the incoming stream into buckets of width w = ceil(1/ε). Buckets are labeled with ids starting from 1, and the current bucket id is denoted bcur. For an element e, the true frequency is denoted fe and the estimated frequency f.
The items are maintained in a data structure D holding entries of the form (e, f, Δ), where Δ is the maximum possible error in f (the estimated frequency of element e).
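As a concrete sketch (the paper doesn't prescribe a representation; the dictionary layout and names here are my own choice), D can simply map each element to its (f, Δ) pair:

```python
# A minimal sketch of the data structure D: each tracked element maps
# to its estimated frequency f and its maximum possible error delta.
D = {}

def insert(e, b_cur):
    """Insert an element first seen while bucket b_cur is current.

    A new entry gets frequency 1 and error delta = b_cur - 1, since the
    element could have occurred up to once per earlier bucket without
    being tracked.
    """
    D[e] = {"f": 1, "delta": b_cur - 1}

insert("apple", b_cur=3)
```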

Here is the algorithm:

Step 1: When an element e arrives in the stream, check whether it is present in D. If it is, increment its frequency f by 1. If not, insert a new entry (e, 1, bcur - 1) into D.
Step 2: At every bucket boundary, delete all entries with f + Δ <= bcur.
Step 3: When queried for the frequent items, return all entries in D with f >= (s - ε)N. (This ensures the second guarantee of the algorithm.)
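Putting the three steps together, here is a small Python sketch of the algorithm. The class and method names are my own; the paper only specifies the logic, not an implementation:

```python
import math

class LossyCounter:
    """Sketch of lossy counting with error parameter epsilon."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.w = math.ceil(1 / epsilon)  # bucket width w = ceil(1/eps)
        self.D = {}                      # element e -> [f, delta]
        self.n = 0                       # N, elements seen so far

    def add(self, e):
        self.n += 1
        b_cur = math.ceil(self.n / self.w)   # current bucket id
        if e in self.D:
            self.D[e][0] += 1                # Step 1: increment f
        else:
            self.D[e] = [1, b_cur - 1]       # Step 1: insert (e, 1, bcur - 1)
        if self.n % self.w == 0:             # Step 2: at a bucket boundary,
            doomed = [k for k, (f, d) in self.D.items() if f + d <= b_cur]
            for k in doomed:                 # prune entries with f + delta <= bcur
                del self.D[k]

    def query(self, s):
        # Step 3: return all items with f >= (s - eps) * N
        return [e for e, (f, _) in self.D.items()
                if f >= (s - self.epsilon) * self.n]
```

For example, with ε = 0.1 (so w = 10), a stream of fifty copies of 'a' followed by fifty distinct one-off items leaves only 'a' in D: every singleton gets pruned at the next bucket boundary, and a query with s = 0.2 returns just 'a'.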

For a proof of why this works, refer to Section 4.2 of the paper. It consists of a few lemmas with short proofs which shouldn't be too hard to follow.

I mainly want to focus on the space complexity of this algorithm. The math involved in that proof is a bit more involved and not very intuitive, so I would like to present it here:

[Images of the handwritten proof of the space bound.]

It was difficult to include mathematical equations in the blog, so I have uploaded images of the proof written out on paper. They might be a little hard to read; sorry about that.
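For readers who can't make out the images, here is a typeset sketch of the standard argument (following Section 4.2 of the paper) that D holds at most (1/ε)·log(εN) entries; constants from the ceilings are glossed over:

```latex
% Sketch of the space bound |D| <= (1/eps) log(eps N).
Let $B = b_{cur}$ and let $d_i$ denote the number of entries in $D$ that
were inserted in bucket $B - i + 1$, i.e.\ entries that have survived $i$
buckets. Such an entry has $\Delta = B - i$, and surviving the deletion at
the boundary of bucket $B - 1$ requires $f + \Delta > B - 1$, hence
$f \ge i$. All $f$ of its occurrences fall within the last $i$ buckets,
which contain at most $iw$ elements in total, so
\[
  \sum_{j=1}^{i} j\, d_j \;\le\; i w
  \qquad \text{for } i = 1, \dots, B.
\]
By induction on $i$, these constraints imply
\[
  \sum_{j=1}^{i} d_j \;\le\; \sum_{j=1}^{i} \frac{w}{j},
\]
and therefore
\[
  |D| \;=\; \sum_{j=1}^{B} d_j
       \;\le\; \sum_{j=1}^{B} \frac{w}{j}
       \;\le\; \frac{1}{\varepsilon} \log(\varepsilon N),
\]
using the harmonic-sum bound together with $w = \lceil 1/\varepsilon \rceil$
and $B = \lceil N/w \rceil \approx \varepsilon N$.
```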

But I hope this gives you some confidence in the lossy counting algorithm. I would highly recommend reading the paper; it is written very clearly and should be easy to understand, hopefully even more so after reading this post.
