

Showing posts from November, 2020

De-duping at scale!

Today I am going to talk about a project I worked on at my organisation, Sumo Logic. We have a microservice that collects data from the cloud. One of its most prominent use cases is collecting data from customers' S3 buckets. I have written another blog post on how we made data discovery faster in order to reduce ingestion lag. You can check it out here.

After we solved the data discovery issue, we realised there was another issue limiting our scalability, and the stakes this time were even higher. I can't describe the complete architecture of the microservice, as I would probably be violating some NDA. I will only talk about the small section of it that was our main bottleneck.

Context

We use the AWS List API to list the objects in the S3 bucket. Of course, we obtain the necessary permissions from the customer to be able to do that. So we have a scanner thread running per source which lists the buc