Episode 3:

Turning All Data at Grofers into Live Event Streams

August 27, 2020

In this week’s episode of The Data Stack Show, Kostas Pardalis connects with Satyam Krishna, a data engineer at Grofers, India’s largest low-price online supermarket. Grofers boasts a network of more than 5,000 partner stores, a user base with three million iOS and Android app downloads, and an efficient supply chain that allows it to deliver more than 25 million products to customers every month.

Notes:

Satyam offers insights into how he helped build the data engineering function at Grofers, how they developed a robust data stack, how they’re turning production databases into live event streams using Change Data Capture, how Grofers’ internal customers consume data, and how the company made adjustments due to the pandemic.

Topics of discussion included:

  • Satyam moving from a developer to a data engineer (2:43)
  • Describing Grofers’ data stack and data lake (6:41)
  • Who is consuming data inside the company and what are some of their common uses specific to Grofers? (12:03)
  • What are the biggest issues day-to-day as a data engineer? (18:21)
  • COVID’s impact on business practices and the data stack (21:28)
  • The big problem of data discoverability and metadata cataloging (27:44)
  • Completely changing architecture to something that can scale up (33:16)

Satyam leads the consumer engineering team at Grofers and has been with the company for six years. He was the third engineer on staff and initially was a mobile engineer but shifted to data engineering two years ago, allowing him to get more of a 360-view of the company. “I wanted to look at the product from all angles,” he said about the transition. “I had spent a good enough time building that consumer application, but I wanted to see how the users interact with it and what’s the data around it. That always excited me to look at how we are getting the conversions and how the different metrics are getting tracked.”

With the shift from managing a mobile application to data engineering for internal tooling, Satyam noticed a completely different challenge. “Once you start building internal tools, you’re building for your stakeholders and you get the feedback much faster (than with the typical consumer feedback loop).”

Transcription:

Collecting and storing data

Grofers is based on AWS services and uses Redshift as its primary warehouse. Ingestion occurs through both batch and streaming jobs, with Airflow serving as the orchestration layer that manages them. They use a Hudi lake for all their source data and run a Spark cluster to process the different event data and a lot of their ML/AI workflows. “Our people query these data marts on Redshift primarily using an open source tool called Redash, and we also have a self-serve analytics use case where we use Tableau.”
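As a rough sketch of that orchestration layer, an Airflow batch job of the kind described might be wired up like this; the DAG id, schedule, and task bodies are illustrative, not Grofers’ actual pipeline.

```python
# Hypothetical Airflow DAG sketching batch orchestration of the kind described.
# DAG id, schedule, and task bodies are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    """Placeholder: pull a daily snapshot from a source Postgres database."""


def load_mart():
    """Placeholder: build and load a data mart into Redshift."""


with DAG(
    dag_id="daily_orders_mart",        # invented name
    start_date=datetime(2020, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_mart", python_callable=load_mart)

    extract >> load  # the load task runs only after extraction succeeds
```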

Because Grofers is a transactional business, their source databases are mostly Postgres and MySQL. They have a microservice architecture with services like cart, orders, and last-mile delivery, plus different e-commerce components such as a delivery app, a picker app, and a shopper app. Many of those services have their own independent databases. “We capture the replication logs from these databases and we use Debezium, which captures those CDC changes and dumps them into a Kafka stream,” he said. “From Kafka we dump it into a raw location in S3, where we process it and convert it into a Hudi lake.”
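For a sense of what that last hop can look like in practice, here is a minimal PySpark sketch that upserts raw CDC output from S3 into a Hudi table. The bucket paths, key fields, and table name are hypothetical; this shows the general Hudi write pattern, not Grofers’ actual job.

```python
# Hypothetical PySpark job: upsert raw Debezium CDC rows from S3 into a Hudi
# table. Paths, key fields, and the table name are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cdc_to_hudi")
    # Hudi requires its Spark bundle on the classpath and the Kryo serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Raw CDC dump written by the Kafka -> S3 stage (hypothetical location).
raw = spark.read.json("s3://raw-cdc/orders/2020/08/27/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # keep latest row
    "hoodie.datasource.write.operation": "upsert",
}

# Upserting (rather than appending) is what lets the lake mirror the source
# table as rows are inserted and updated upstream.
raw.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://data-lake/hudi/orders/"
)
```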

Satyam added, “We use RudderStack to basically capture all of our impression needs on our consumer application.”
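Grofers’ capture happens in their mobile apps, but as a generic illustration, a tracked impression event sent through RudderStack’s Python SDK might look like the sketch below; the write key, data plane URL, event name, and properties are placeholders.

```python
# Illustrative only: a RudderStack track call via the Python SDK. Grofers'
# capture runs in their mobile SDKs; keys, URLs, and fields are placeholders.
import rudder_analytics

rudder_analytics.write_key = "WRITE_KEY"  # placeholder
rudder_analytics.data_plane_url = "https://example.dataplane.rudderstack.com"

# One product-impression event of the kind aggregated downstream.
rudder_analytics.track(
    "anonymous-user-123",
    "Product Viewed",
    {"product_id": "sku-42", "screen": "home_feed"},
)
rudder_analytics.flush()  # send queued events before the script exits
```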

When asked how the volume of event data compares with the transactional data collected, Satyam observed that, no question, the volume of event data is much higher. “We capture around five to six billion events every month,” he said. “Whereas our transactional data, we must be generating some terabytes of data every month. The scale of the event data is on a completely different level.”

For transactional data they use CDC, while for event data they take the raw dumps that vendors land in S3 and convert them into compressed files, partitioned by date, that can be queried through Redshift Spectrum. “We don’t even keep it in Redshift because of the size of the data.”
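As a hedged illustration of that pattern, the snippet below defines an external table over date-partitioned files in S3 and registers one day’s partition, so Redshift can query the events via Spectrum without loading them. The schema, table, columns, bucket, and the Parquet format are assumptions for illustration.

```python
# Hypothetical illustration of the Spectrum pattern described above: event
# files stay in S3, and Redshift queries them through an external table.
# Assumes an external schema named "spectrum" already exists.
import psycopg2

conn = psycopg2.connect(
    "host=redshift-cluster.example dbname=analytics user=etl password=..."
)
conn.autocommit = True  # external-table DDL cannot run inside a transaction
cur = conn.cursor()

# External table over date-partitioned Parquet files in S3 (invented names).
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.app_events (
        anonymous_id VARCHAR,
        event_name   VARCHAR,
        properties   VARCHAR
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3://event-lake/app_events/';
""")

# Partitions are registered per day as new data lands.
cur.execute("""
    ALTER TABLE spectrum.app_events
    ADD IF NOT EXISTS PARTITION (event_date = '2020-08-27')
    LOCATION 's3://event-lake/app_events/event_date=2020-08-27/';
""")
```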

Challenges faced

Satyam identified some common use cases inside the company and some of the challenges that accompany them. Inside Grofers, a central engineering team manages the data product and the data platforms, but each team has its own data analysts, a deliberately decentralized setup. He offered an example of how event data are used: “When they have to test rolling out a feature, they want to see how the users are using it, they are running an A/B test and they want to see the conversions happening on it.” Other examples he noted were sponsored product recommendations and personalized homepage feeds that draw on past purchases and other consumer behavior.
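As a toy illustration of that A/B use case, the sketch below computes a per-variant conversion rate from impression and order events; the column names, event names, and data are all invented, and real analyses of this kind would run as queries against Redshift.

```python
# Toy example of the A/B conversion check described above, using pandas.
# Column names, event names, and rows are invented for illustration.
import pandas as pd

events = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3, 4],
        "variant": ["A", "A", "A", "B", "B", "B"],
        "event":   ["feature_viewed", "order_placed", "feature_viewed",
                    "feature_viewed", "order_placed", "feature_viewed"],
    }
)

# Conversion = share of exposed users in each variant who went on to order.
exposed = events[events["event"] == "feature_viewed"].groupby("variant")["user_id"].nunique()
converted = events[events["event"] == "order_placed"].groupby("variant")["user_id"].nunique()
print((converted / exposed).fillna(0))  # conversion rate per variant
```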

With a decentralized data analysis structure, Satyam noted that one of the challenges revolves around data discoverability. “That brings in a lot of repetition of data and at times, people don’t know if they’re looking at the right data or not. I would say data discoverability is a big problem.” But through data cataloging practices, they are working on solving this problem. “That can basically help you know, ‘What does this table mean? Who created it and when was it last updated?’” This process involves building a system that creates alerts, which in turn inspires more confidence in the data. “You get alerted internally rather than someone else reaching out to you,” he said. “So it brings in more confidence that the data that you’re looking at is right. If you start losing confidence in the data, you can never trust your metrics. And then for the business decisions that you’re taking, it gets more and more difficult to be sure you’re actually making a positive impact.”
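Here is a minimal sketch of the kind of internal alert he describes: a check on whether a mart has refreshed on schedule, with the owners notified before anyone else notices. The metadata query, table name, refresh cadence, and webhook are all hypothetical; production catalogs track this alongside ownership and lineage.

```python
# Minimal sketch of an internal "is this table stale?" alert of the kind
# described above. Table, cadence, and webhook are hypothetical placeholders.
from datetime import datetime, timedelta

import psycopg2
import requests

MAX_AGE = timedelta(hours=24)  # expected refresh cadence (assumed)
WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder alert channel

conn = psycopg2.connect(
    "host=redshift-cluster.example dbname=analytics user=etl password=..."
)
cur = conn.cursor()
# Assumes pipelines stamp a last_updated column on each mart (illustrative).
cur.execute("SELECT MAX(last_updated) FROM marts.daily_orders;")
(last_updated,) = cur.fetchone()

if datetime.utcnow() - last_updated > MAX_AGE:
    # Alert the owning team before a stakeholder discovers stale numbers.
    requests.post(WEBHOOK, json={"text": "marts.daily_orders has not refreshed in 24h"})
```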

Of course, Grofers also faced challenges in light of the COVID pandemic. “I think one of the challenges we faced was more in terms of the communication internally in the team,” he said. “If you’re locked down for months and cooped up in a room, you want to ensure that that energy gets released. We figured out games that we can play in our typical meetings and at the end of the day.”

The company also had to determine how to scale up delivery services to meet demand while factoring in the increased cost of delivery. “Ground operations were impacted massively.” He recounted questions they had to ask themselves: “How do we ensure the safety and security of our customers? How do we reduce touchpoints, and how are we able to serve all the customers?”

He said their response was to batch their products better and clump deliveries, so that there were fewer delivery routes and more orders delivered per trip. “We added capabilities in our application around how you can edit your order multiple times, which was not even supported before,” he said. “That was one of the biggest changes for us, and there were a lot of new needs around reporting. Once you change your business model, you also want to track metrics at a much, much faster pace. There were a lot of new real-time reporting needs that came with that situation.”

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.