
Streaming Algorithms - Lossy Counting

We had a discussion recently in our office about one of the famous streaming algorithms, known as lossy counting. I decided to write down my understanding of it, focusing on the space complexity discussion and proving why it works in logarithmic space. This is the original paper introducing the algorithm, and most of the information discussed in this blog is (hopefully) a simplified version of the parts of that paper relating to lossy counting.

Problem Statement


There are use cases where you receive a continuous stream of data and want to know which items in that stream occur more often than a certain threshold. The data stream could be really big, so you can't store it in memory. Moreover, you want to answer queries in real time. So here are the challenges this problem imposes:
1. Need to process the stream in a single pass (time constraint)
2. Limited memory
3. Large volume of data arriving in real time

These issues warrant the need for a smart counting algorithm. As mentioned in the paper, data stream mining to identify events and patterns can be performed by applying two algorithms: Lossy Counting and Sticky Sampling. Today we are going to look at the lossy counting algorithm.

Lossy Counting


Some notations first:
s - support threshold, s ∈ (0, 1)
ε - error parameter, ε ∈ (0, 1), with ε << s
N - number of elements seen in the stream so far

A query may arrive at any point in time asking for a list of items along with their estimated frequencies. The answer returned by the algorithm will have the following guarantees:

1. All items whose true frequency exceeds sN will be in the output (no false negatives).
2. No item with true frequency less than (s - ε)N will be in the output.
3. The estimated frequency of an item will be less than its true frequency by at most εN.
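
To make these guarantees concrete with some made-up numbers: take s = 0.01 and ε = 0.001, and suppose N = 1,000,000 elements have been seen so far. The output then contains every item that truly occurred more than sN = 10,000 times, contains no item that occurred fewer than (s - ε)N = 9,000 times, and every reported count is within εN = 1,000 of the true count.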

Now let's see what the lossy counting algorithm is and how it achieves the above three guarantees.

The most important aspect of the lossy counting algorithm is that it divides the incoming stream of data into buckets of width w = ceil(1/ε). Buckets are labeled with ids starting from 1, and the current bucket id is denoted by bcur. Also, for an element e, the true frequency is denoted by fe and the estimated frequency by f.
The items are maintained in a data structure D which has entries of the form (e, f, Δ), where f is the estimated frequency of the element e and Δ is the maximum possible error in f.

Here is the final algorithm, with a small code sketch after the steps:

Step 1: We get an element e from the stream. Check if it is present in D. If it is, increment its frequency f by 1. If not, insert an entry (e, 1, bcur - 1) into D.
Step 2: At every bucket boundary, delete all entries which have f + Δ <= bcur.
Step 3: If there is a query to return the frequent elements, return all elements in D with f >= (s - ε)N. (This ensures the 2nd guarantee of the algorithm.)
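
To make the steps concrete, here is a minimal Python sketch of the algorithm as described above. The class and method names (LossyCounter, add, query) are my own and not from the paper; it is only meant to illustrate the bookkeeping, not to be a production implementation.

import math

class LossyCounter:
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.w = math.ceil(1.0 / epsilon)   # bucket width w = ceil(1/ε)
        self.n = 0                          # N, number of elements seen so far
        self.entries = {}                   # D: element e -> (f, Δ)

    def add(self, e):
        self.n += 1
        b_cur = (self.n + self.w - 1) // self.w   # current bucket id, ceil(N/w)
        if e in self.entries:                     # Step 1: increment f ...
            f, delta = self.entries[e]
            self.entries[e] = (f + 1, delta)
        else:                                     # ... or insert (e, 1, bcur - 1)
            self.entries[e] = (1, b_cur - 1)
        if self.n % self.w == 0:                  # Step 2: at the bucket boundary,
            self.entries = {x: (f, d) for x, (f, d) in self.entries.items()
                            if f + d > b_cur}     # drop entries with f + Δ <= bcur

    def query(self, s):
        # Step 3: return all elements with f >= (s - ε)N
        return [e for e, (f, _) in self.entries.items()
                if f >= (s - self.epsilon) * self.n]

Usage would look something like: counter = LossyCounter(epsilon=0.001), then counter.add(item) for every item in the stream, and counter.query(s=0.01) whenever the list of frequent items is needed.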

Now for a proof of why this works, you can refer to Section 4.2 of the paper. There are a few lemmas with simple proofs which shouldn't be too hard to understand.

I mainly want to focus on the space complexity of this algorithm. The math involved in that proof is a bit more complicated and not very intuitive, hence I would like to present the proof here:

[Images of the handwritten proof go here.]
It was difficult to include mathematical equations in the blog, hence I have uploaded images of the proof written out on paper. They might be a little hard to read. Sorry about that.
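
In case the images are hard to make out, here is a rough text version of the argument, essentially following Theorem 4.2 of the paper (so treat this as a sketch rather than the full proof):

Let B = bcur be the current bucket id, and for i = 1, ..., B let di be the number of entries in D that were inserted in bucket B - i + 1, i.e., entries with Δ = B - i.

For such an entry to have survived every deletion phase so far, its element must have been counted at least i times since it was inserted, and all of those occurrences lie in the last i buckets. Occurrences belonging to different entries are distinct positions in the stream, and the last j buckets contain at most j*w elements in total, so for every j = 1, ..., B:

    1*d1 + 2*d2 + ... + j*dj <= j*w

From this family of inequalities it follows (by induction on j, as done in the paper) that:

    d1 + d2 + ... + dj <= w/1 + w/2 + ... + w/j

Taking j = B, the total number of entries in D is:

    |D| = d1 + d2 + ... + dB <= w * (1 + 1/2 + ... + 1/B) ≈ w * ln(B)

Since w = ceil(1/ε) and B ≈ N/w ≈ εN, this gives |D| <= (1/ε) * log(εN): the space used by lossy counting grows only logarithmically with the stream length N, which is the bound stated in the paper.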

But I hope this gives you some confidence about the lossy counting algorithm. I would highly recommend reading the paper. It is written very clearly and should be easy to understand, hopefully even more so after reading this blog.
