Episode 184:

Kafka Streams and Operationalizing Event Driven Applications with Apurva Mehta of Responsive

April 3, 2024

This week on The Data Stack Show, Eric and Kostas chat with Apurva Mehta, Co-Founder and CEO of Responsive, about event-driven applications and the necessary infrastructure. Apurva shares his journey from LinkedIn to Confluent and eventually founding Responsive, focusing on managing event-driven applications in the cloud. The discussion covers the definition of event-driven applications, the significance of latency and state in event processing, and the evolution of Kafka and Kafka Streams. They also explore the challenges of managing Kafka in production, the developer experience with Kafka Streams, and the operational complexities of running distributed stateful applications. Apurva highlights Responsive’s approach to simplifying the management of these applications, the potential for innovation in event-driven architectures, and more.

Notes:

Highlights from this week’s conversation include:

  • Apurva’s background in streaming technology (0:48)
  • Developer experience and Kafka Streams (2:47)
  • Motivation to bootstrap a startup (4:09)
  • Meeting the Confluent founders and early work at Confluent (6:59)
  • Projects at Confluent and transition to engineering management (10:34)
  • Overview of Responsive and event-driven applications (12:55)
  • Defining event-driven applications (15:33)
  • Importance of latency and state in event-driven applications (18:54)
  • Low Latency and Stateful Processing (21:52)
  • In-Memory Storage and Evolution of Kafka (25:02)
  • Motivation for KSQL and Kafka Streams (29:46)
  • Category Creation and Database-like Interface (34:33)
  • Developer Experience with Kafka and Kafka Streams (38:50)
  • Kafka Streams Functionality and Operational Challenges (41:44)
  • Metrics and Tuning Configurations (43:33)
  • Architecture and Decoupling in Kafka Streams (45:39)
  • State Storage and Transition from RocksDB (47:48)
  • Future of Event-Driven Architectures (56:30)
  • Final thoughts and takeaways (57:36)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here with Apurva from Responsive, and we are so excited to chat about many topics today: event-based applications and all the infrastructure, both underneath and on top of them, to run them. So Apurva, thank you so much for joining us on the show today.

Apurva Mehta 00:44
Great, thank you for having me. Excited to be here.

Eric Dodds 00:48
Give us just your brief background. You’ve spent a lot of time in the world of streaming, and you’re a multi-time entrepreneur, but give us the story. How did you get into the world of streaming?

Apurva Mehta 00:58
Yeah, thanks. So, I mean, I was at LinkedIn, you know. Like, I’ve been at LinkedIn, Confluent, and now I’m doing Responsive. But I first got exposed to streaming at LinkedIn, you know, the Samza jobs back in the day. And, you know, through LinkedIn, got to know the founders of Confluent, joined Confluent, worked on Kafka and KSQL and Kafka Streams for years, basically. And, you know, Responsive continues that line; we’re building a platform to manage these event-driven applications in the cloud. And yep, so it began at LinkedIn, I would say, 2013.

Kostas Pardalis 01:34
That’s some profile, okay. I have a couple of, actually, I have many questions that I’d like to ask you; hopefully we’ll have the time to go through all of that stuff. But some of the high-level topics that I’m really interested in: first of all, I’d love to hear from you, like, what is Kafka, as a technology, or as a protocol, or as a standard maybe, and what’s happening with, like, the products that are built around it. And the next thing that is, like, very interesting, and inspired from what you’re doing today, is developer experience and, like, things around Kafka Streams, right? Like, what it means to build with, why we want to build things with Kafka Streams, right? And what can be done better for the engineers out there that are using Kafka Streams and Kafka, obviously, to work with that. So those are, like, three of the things that I’d love to get deeper into. But how about you? What are some things that you’re excited about discussing today?

Apurva Mehta 02:47
Yeah, all of that sounds good, and those are all topics I’m very passionate about. I think it’s good to tell the story of, like, what you said: Kafka as a tech standard, as a technology, the differentiation there. I will also say, talking a bit about this world on top of Kafka, right? Like, stream processing, I think, is a very overloaded word; many use cases, many different ways to slice that space up, right? Many different technologies in that space. You know, I think talking about the space itself, and, you know, sharing my point of view on how to think about it, and then maybe getting into Kafka Streams, and, like, how the technologies differ, with a focus on Kafka Streams, I think, would be a very interesting, I think an educational, conversation. So I’m excited about doing that. Yeah.

Kostas Pardalis 03:36
100%. And let’s do it, what do you think, Eric?

Eric Dodds 03:40
Yeah, let’s dig in. This is gonna be great. All right. Well, you gave us a brief introduction about your time at LinkedIn, and, you know, how you sort of got into streaming and how you met some of the Confluent people. I want to dig into that story, especially the beginnings at LinkedIn. But when we were chatting before the show, you talked about bootstrapping your own startup, and I just wanted to ask you a couple of questions. I know we have, you know, several entrepreneurs who are on the show. What motivated you to bootstrap your own app and company?

Apurva Mehta 04:20
Way back in 2011. So before that I had started at Yahoo, actually, in the Cloud Platform team, you know; 2008 was my first job. Yeah, and I think it was almost three years at Yahoo. And then, you know, I was at that age, right? Like, I think I have always been a startup person, right? I always wanted to do it. And back then the Apple App Store was new, right? Everyone was writing an app, right? Cloud services were new, right? Yeah. So it was like the thing to do, you know. And even back then there was a big thing about bootstrapping companies being a very great way to have a good life, and, you know, I was also a very young, impressionable graduate. So, I mean, this is what you do, right? Like, I saved some money from Yahoo, I had some experience building stuff at a good company. And, you know, I also played music, and, you know, this music recording, sharing music recordings, annotating music recordings: there were no products for that, right? And I said, why don’t we build something that’s cloud native, mobile native, that allows, you know, groups to kind of share recordings with each other and discuss rehearsals. And it was definitely a problem I had, personally. And that was also standard advice: solve a problem you personally have. That’s a great way to start a company. So yeah, that’s kind of the genesis.

Eric Dodds 05:37
Yeah, very cool. And well, what instrument or instruments do you play?

Apurva Mehta 05:42
I stopped since my son was born several years ago, but I played the tabla for a long time; it’s Indian drums, Indian classical drums. So I played that for 10-12 years, pretty intensively. I built an app to help with my lessons and my practices and rehearsals also. So yeah. I hope to get back into it. Yeah.

Eric Dodds 06:05
Great. Okay. Just one more question. When’s the last time you went back and looked at the code you wrote for that first app that you built?

Apurva Mehta 06:13
Oh, many years? I don’t know. I guess once I went to LinkedIn, you know, family, like, life starts becoming very busy. Probably when I stopped, I never looked at it again. Yeah.

Eric Dodds 06:30
The reason I asked was, I was looking for a file the other day, and I ran across this old Projects folder, and I was like, whoa, you know, this has, like, 10-year-old stuff in it. And you look at it and you’re like, wow, that seemed so awesome at the time. And okay, so LinkedIn. So tell us about the meeting. How did you meet the Confluent founders at LinkedIn? And, you know, it was somewhat related to the work that you were doing there, but give us a little bit deeper peek into that story?

Apurva Mehta 06:59
Yeah, so at LinkedIn, I did two projects. One was on a big graph database. So, like, LinkedIn has, I don’t know if it’s still true, but back then it was, you know, this one in-memory graph database that serves all your connections. Basically, it’s a very fast, low-latency lookup of, you know, who your connections are. And that is used by the whole website to do all sorts of things, right? Like, can you even see someone’s profile? In your search results, what’s the ranking? Right? Like, there’s a lot of calls made to it. It used to be, and I can’t speak to what it is today, but basically, the statistic back then was that for every call to linkedin.com, there would be 500 calls to this graph database, through the whole service tree. And often they did stack up, so, like, you know, it was not very efficient, I would say: the bottom layer would call, and then layers up, the upper layers would also call. Because it was, like, a massively distributed microservices architecture, which, you know, obviously no one actually optimized top to bottom. But anyway, the net effect is that if there was any latency at the bottom layer, it could basically halt linkedin.com. Like, basically, there would be cascading timeouts through the service stack, and linkedin.com could be down. Like, I’m talking microseconds: if the microsecond lookup went into, like, a millisecond, LinkedIn went down. And this is true; there was a Saturday back in 2013 when LinkedIn was down for eight hours. Wow. And the reason was linkedin.com still just wouldn’t load. And the reason was this graph database was slightly slow. And so it was a P0 for the company to figure it out. And I started digging into it. And ultimately, it led deep into, basically, the configuration of Linux on LinkedIn’s hardware, you know, in the data center.
So they had what you call multi-socket systems. And basically, the memory banks were kind of divided across the two CPU sockets, and it was set up to keep each bank pinned to its socket. So if you have a thread running on one socket, it cannot read memory from the other. So if, by chance, you had 64 gigs of memory, that’s what we had at that time in RAM, and one bank of 32 was filled, it was paging out data because of the setting, some NUMA setting, I forgot what it’s called. But basically, it was a BIOS configuration plus a Linux kernel configuration that caused us to use no more than half the memory. So we had tons of memory, but we could only be using half. And because data was paged out, now your in-memory lookup is no longer an in-memory lookup; it’s a disk lookup. That’s a latency spike, and that kind of cascades up, and that is the root cause of the latency. I wrote a blog post about this. We can put it in the show notes; I can send it to you. But basically, having discovered that, it also affected Kafka, because Kafka is an in-memory system that, like, expects, for performance, to not be hitting disk, and it was also getting paged out, right? It was a problem for them too, and this helped them and other teams. And basically, because of this, I kind of got to know the founders of Confluent. And since then, you know, we’ve been in touch. And that’s how Confluent happened later, I would say.

Eric Dodds 10:16
Very cool. And just tell us a little bit, you said there were a couple of things you worked on at Confluent. Can you talk about, you know, whether those were areas of interest for you? You joined Confluent pretty early, so maybe they were just, you know, sort of key areas of product development. But tell us a little bit about that.

Apurva Mehta 10:34
Yeah, I started at Confluent in the summer of 2016. And I think the first project was to add transactions. Basically, there’s exactly-once semantics where, you know, when you produce a message to Kafka, it is persisted only once, right? And then a transactional system on top of that where, you know, you can also consume, like, a batch of messages exactly one time. So I can commit a batch of messages across multiple topics as a unit, and then make sure only the committed messages are consumed. So that was, like, a major upgrade to the Kafka protocol and Kafka’s capabilities. So I worked on that project. There were three of us; it was probably the most fun I’ve had as an engineer, because two really good engineers, and it’s a really hard problem. Streaming exactly-once is, I would say, very hard from an engineering perspective. We actually made the performance better, or at least no worse, even though you had transactions, because we actually optimized so much else to get the performance there. So it was a great engineering thing. And I think it was just a lot of fun doing it with that team, back in the small, like, one room in Palo Alto, you know, the whole company. It was a lot of fun. So, yeah, so that was my first project. Then I moved on to the KSQL team, which had basically just launched around the time we launched transactions. It was actually very popularly received; like, it’s streaming SQL on Kafka. And the company wanted to invest more, so I moved to the KSQL team. And then I did a technical project to add joins to KSQL. And then Confluent was growing really fast, right? So there were only two or three engineering managers for, like, 50-60 engineers at some point in 2018. And so they asked, did I want to manage the KSQL team? I said, sure, why not?
And then since then, I kind of became a manager, you know, of KSQL and Kafka Streams. That, I would say, is what I was doing at Confluent the last few years: learning how to be an engineering manager. Yeah.
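The exactly-once transactions Apurva describes can be sketched very loosely in a few lines. This is a toy in-memory simulation of the idea only: records are produced under a transaction id, and a commit or abort marker decides whether consumers reading in "read committed" mode ever see them. The class and method names here are illustrative, not the real Kafka client API.

```python
# Toy simulation of Kafka-style transactional semantics.
# Illustrative names only; not the real Kafka producer/consumer API.

class TxnLog:
    """An append-only log where records carry a transaction id, plus
    commit/abort markers, loosely like Kafka's transactional topics."""

    def __init__(self):
        self.records = []   # (txn_id, payload) data records, in order
        self.markers = {}   # txn_id -> "commit" | "abort"

    def produce(self, txn_id, payload):
        self.records.append((txn_id, payload))

    def commit(self, txn_id):
        self.markers[txn_id] = "commit"

    def abort(self, txn_id):
        self.markers[txn_id] = "abort"

    def read_committed(self):
        # A read_committed consumer only ever sees records whose
        # transaction ended with a commit marker.
        return [p for (t, p) in self.records
                if self.markers.get(t) == "commit"]

log = TxnLog()
log.produce("txn-1", "order-created")
log.produce("txn-1", "payment-captured")
log.commit("txn-1")             # both records become visible atomically

log.produce("txn-2", "order-created")
log.abort("txn-2")              # e.g. the producer crashed mid-transaction

print(log.read_committed())     # ['order-created', 'payment-captured']
```

The point of the sketch is the atomicity: either both of txn-1’s records are consumed or neither is, which is the "commit a batch across topics as a unit" property described above.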

Eric Dodds 12:38
And then you started Responsive. So can you just give us a quick overview? Because it obviously works with Kafka as a first-class citizen, but it’s a layer on top for building event-based applications?

Apurva Mehta 12:55
Yeah, so, yeah. I think this idea came about in 2022. You know, I mean, what we saw at Confluent was, you know, these developers building all sorts of truly mission-critical apps. You know, there’s a real abstraction there, and we can probably get into that in the show; I guess that’s a big topic of conversation. But essentially, maybe I’ll keep it short for now, and then we can just dig into it, you know, as we go. But basically, yeah, what I saw at Confluent was this category of, you know, these kinds of back-end applications that are not in the direct path. Like, you’re not loading a website and sending a request to an application that’s serving from a database; basically, you’re starting off something in the back. Like, it could be fulfilling an order. It could be, you know, fulfilling a prescription; like, there’s actually a pharmacy that uses Kafka Streams to fulfill prescriptions. It could be, you know, things in your warehouse logistics, very common. Like, all the applications that make sure things get to the right place at the right time, in general, are a very, you know, good fit for these event-driven patterns, back-end event-driven patterns, right? You know, something happened, you have to react to it, you have to keep state around what’s happening to make the right decision. And this has to be programmed, right? So there are application developers building these systems. And I thought that was tremendous. It was actually a very pleasant surprise, like, so many of these apps exist, whether they’re directly written with producers and consumers or written with Kafka Streams or whatever. But my thing was, looking at it, it’s a massive market: the pattern, the architecture, makes a lot of sense for a lot of people, and Kafka, the data source, is there.
But there’s no real focus on, you know, making that really easy and good to do, right? So that was Responsive: like, can we, in the cloud, assuming that Kafka is ubiquitous, you know, become the platform, you know, the true foundation for writing these apps, and make it scalable, easy to test, easy to run, you know, all of it. So that’s kind of it. I mean, for me, the opportunity is there, and Responsive was kind of founded to capitalize and kind of add value in that space.

Eric Dodds 15:10
Yeah, it makes total sense. Well, I know I want to dig into that, and Kostas probably wants to dig into it even more, just around, you know, sort of operationalizing event-driven apps. But before we dig into the specifics, can we just talk briefly about what you consider an event-driven app? Right? I mean, that can mean a ton of different things. Like you said, there is a sense in which it’s ubiquitous, right? If there’s any sort of process happening in the real world that needs to be represented digitally, to accomplish some, you know, process or logistical flow within an organization, you know, those are technically sort of all potentially event-based applications. But what do you think about that? Are there parameters around how you would define it? Do you think there are a lot of different definitions?

Apurva Mehta 16:01
Yeah, I think, I mean, a very reductive definition would be, right, if you have an application, like, and this probably describes all applications, so it’s kind of irrelevant, but basically, ultimately, it’s more like a pattern, right? Like, are you writing some functions which are basically waiting for triggers to execute, and, you know, potentially triggering other functions downstream of them, and in totality, you know, achieving some outcome, right? Like, basically, it’s kind of like, reductively, functional programming: eventually it’s just function invocations, and a chain of them achieves an outcome. I think, in the context of what I said about Responsive, these are, I would say, more back-end. Basically, you know, if you think of it, even a web server is event driven: you’re waiting for a request to come in, and you’re responding; the request is the event and the response is, you know, the output. And in a service mesh, a service-oriented architecture, there’s a stack of these services calling each other, and, like, that’s basically synchronous, and event driven, right? But I would say that asynchronous event driven is kind of the space on top of Kafka, right? Like, where, you know, you logged an event to Kafka, and now that event is being propagated to many apps, and they’re all reacting to it in their own time, in some sense, and doing different things with it. Right. So, yeah, I would say there are these asynchronous, back-end, event-driven apps, you know, that have basically a source like Kafka, right? I don’t know if that’s a useful categorization, but that’s really the categorization in which, for example, Responsive operates: that space of, like, you have these apps that are built off of these event sources like Kafka, and we are focusing on that area, basically, if that makes sense.
I’m happy to have a discussion, honestly, like, how do you all see it? Right? And, yeah, you know, there’s another way to see it, which is that there are different things people do even within that categorization, right? Like, you could do some sort of streaming analytics, streaming ETL; you can actually build, like, these, you know, true, you know, kind of workflow-like things on top of it. But, you know, there are so many ways to slice that, too.
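The asynchronous pattern Apurva describes, where handlers react to events in their own time and may emit further events downstream, can be sketched with a tiny in-memory event bus. Everything here (the EventBus class, the event names) is illustrative, not any real framework’s API.

```python
# Minimal sketch of an asynchronous event-driven handler chain:
# handlers subscribe to event types, react, and may emit new events.

from collections import defaultdict, deque

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)   # event type -> handlers
        self.queue = deque()                # pending events, FIFO

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def emit(self, event_type, payload):
        # Emitting never blocks on downstream processing; the event
        # is just appended to the queue, like logging to a topic.
        self.queue.append((event_type, payload))

    def run(self):
        while self.queue:
            event_type, payload = self.queue.popleft()
            for handler in self.handlers[event_type]:
                handler(payload)

bus = EventBus()
audit = []

# A chain of reactions: order placed -> charge card -> ship order.
bus.subscribe("order_placed", lambda o: bus.emit("charge_card", o))
bus.subscribe("charge_card", lambda o: bus.emit("ship_order", o))
bus.subscribe("ship_order", lambda o: audit.append(f"shipped {o}"))

bus.emit("order_placed", "order-42")
bus.run()
print(audit)   # ['shipped order-42']
```

Each function only knows the event it reacts to and the event it emits, which is the decoupling that distinguishes this from a synchronous call stack.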

Eric Dodds 18:08
Yeah, and I’m interested in Kostas’ opinion as well. One question I have for both of you, and this may be a very oversimplified, you know, way to look at it, or even an unfair question. But Apurva, you said, you know, functional programming, if you’re extremely reductive, is essentially an event-driven application. But one thing that’s interesting about some of the use cases that you mentioned around logistics and other things is that there’s a timestamp that represents something happening in the physical world, right? Or, like, an input from something outside of the application creating a trigger. How important do you think that is in terms of defining an event-driven application? And Kostas, would love your thoughts as well.

Kostas Pardalis 18:53
I think, and I’m going to talk, like, from my experience with, like, hearing people talking about batch and stream processing, processing primarily, like, analytical workloads, right? Where it kind of feels like it’s, like, the Holy Grail of merging the two, right? It’s like, there are people saying, you know what, like, batch is just a subset of streaming at the end, and if we can make everything streaming, why would we need batch, right? I think, and let’s forget marketing here, and, like, trying to, you know, sell new products and all these things, and go back to the fundamentals. I think there’s, like, a very small set of dimensions that really matter here. And I think one is the latency that is associated with, let’s say, the use case that you have. I mean, how fast do you have to respond to something? Yeah, because, sure, like, in the general case, let’s say, if everything was free in this universe, if I could have an instant answer to, like, any question I have, it would be great. But our universe does not work like that; there are trade-offs, right? So there are use cases where low latency matters. Let’s say I’m swiping my credit card, and someone has to go and do, like, I don’t know, like, a credit check, or go and do something like fraud detection, right? Sure. By the way, I would ask people, next time they pay online for something and they put in their credit card, like, just count the time it takes, and they will see that, like, there is latency there sometimes, right? It’s not like loading the Google landing page, for example, right? And we are okay with that. We are happy with this trade-off, right? Like, to wait a little bit longer, but make sure that we are safe, like, no one’s going to steal our identity or, like, our credit. And the other thing is state. And what I mean by state is, like, how much, and how complicated, the information that we need to act upon has to be, right?
If, for example, we needed to go and keep track of, like, every possible state, like, every possible interaction that the user has done in the past, like, a couple of years, right? It’s just impossible to be like, hey, I’m going to process these on the fly and make sure that we have it back in, like, sub-second latency, right? So this is, at least, the mental model that I have built when I’m thinking about, like, trying to figure out what is needed out there, and, at the end, where are the right trade-offs, because there are always trade-offs made. I don’t know, like, if you agree with that mental model, or if you would add something to or remove something from it?

Apurva Mehta 21:53
Yeah, I think that’s a good way of thinking about it. I think essentially what you’re saying is, if you need low-latency, sophisticated, stateful processing, you know, back-end event-driven apps probably are a good fit. Like, fraud detection is the most common example, right? Like, it’s complicated to get it right, and it needs to be done with low latency. So having, like, you know, something that’s running, that reacts to your swipe and, you know, detects fraud before pinging back that it’s good to go, that’s a perfect example of an event-driven architecture being a good use, right? And, I don’t know if you know, but Walmart has a talk about this; they use Kafka Streams when you check out on all the walmart.com, jet.com properties. It’s sitting there, you know, evaluating your stuff for fraud, giving you recommendations while you’re on the shopping cart. Like, all of those are very low latency, right? You know, Kafka Streams is part of that thing. Anyway, that’s a good categorization, I think. To Eric’s question, yeah, you mentioned the timestamp in the message. I think one important property outside of latency is also replayability. Like, generally, unless you’re actually responding to a request, like, I do a click, and in the case of the credit card, you know, you go to another screen, and you’re waiting, and then it, you know, pings back and you’re good to go, kind of thing. In that case there’s a very synchronous aspect to it, but you can wait. But otherwise, you generally don’t like clicking a button and waiting, right? And so, I think, you’re never going to use Kafka-based systems to serve, like, your click, right? Except for these very complicated fraud kinds of use cases.
But I think there’s a lot where you don’t actually need to ping back, and then actually logging an event to something like Kafka, which makes it durable, and then writing back ends to kind of process them, actually has many nice properties, right? Like, even when you talk about batch versus streaming: streaming is incremental processing; it’s actually cheaper over time to do streaming, right, because you’re not reprocessing the same thing again and again, by definition. And on top of that, when you have a durable event written to a log, there’s auditability to it. There’s, you know, a lot there; and, you know, if you find a bug in your app, you can replay the events, potentially, if you build your system right, and redo the output. These are all opportunities you don’t have with a synchronous request-response, because once the response is gone, it’s gone. But with events, if you actually build an event-driven system right, you actually have very nice properties around, you know, making it better, right? Like, it’s actually a better way to build. I’m obviously biased, but I think that’s a very nuanced aspect of, you know, when to choose this architecture, right? You know, you might even want to do it for reasons outside of pure latency reasons. Like, you know, you could be more efficient; it’s more operable, it’s more auditable; you can eventually get better correctness over time, right? It’s also complicated, but the architecture lends itself to those properties.
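The replayability property Apurva describes, fixing a bug and recomputing state from the same durable log, can be shown with a tiny example. The log, the event shapes, and the processing functions below are all made up for illustration; the point is only that a durable event log lets you rerun corrected logic, which a request/response system cannot.

```python
# Sketch of replayability: because events are durable, a bug in the
# processing logic can be fixed and the log replayed to rebuild the
# correct state from scratch.

events = [("deposit", 100), ("deposit", 50), ("withdraw", 30)]

def replay(log, apply_fn):
    """Fold the whole event log through a processing function."""
    balance = 0
    for kind, amount in log:
        balance = apply_fn(balance, kind, amount)
    return balance

def buggy(balance, kind, amount):
    # Bug: withdrawals are silently ignored.
    return balance + amount if kind == "deposit" else balance

def fixed(balance, kind, amount):
    # Corrected logic: withdrawals subtract from the balance.
    return balance + amount if kind == "deposit" else balance - amount

print(replay(events, buggy))   # 150 -- the wrong answer shipped first
print(replay(events, fixed))   # 120 -- recomputed from the same log
```

With a synchronous request/response system the original inputs are gone once the responses are sent; here the log itself is the source of truth, so the output can always be redone.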

Kostas Pardalis 25:00
Yeah, I have a question about that. So that’s great, right? Like, being able to, let’s say, keep track of all the interactions or changes, like, the state out there, and being able to replay, and even do things like time travel. Like, let’s say I want to see how my user looked, like, five hours ago, whatever. It’s great. It’s amazing, right? But my question here is, like, you mentioned that a system like Kafka is an in-memory system, right? And what you were describing here is, like, a system that will infinitely grow in terms of, like, the data it has to store, right? Like, as we create more interactions, we have to keep them there and make sure we can go back and, like, access all that. So with a system that has to live in memory to be performant, and also provide those low-latency, let’s say, guarantees that everyone is looking for, like Kafka, how can you also make sure that you can store everything in there, right? How did it work from the beginning, and if there’s, like, an evolution in how Kafka is treating this, I’d love to hear the story.

Apurva Mehta 26:16
Yeah, that’s a great question. I think this is a super interesting one. So, I mean, I did say Kafka, you know, expects to be in memory, right? I think that was the design, and I think it’s still true, right? Like, it used the Linux page cache, you know, very heavily to kind of make sure the segments were in memory, so that almost every read is just hitting memory. They also had optimizations where you could copy from the page cache to the network buffer directly, so it doesn’t really go through the JVM, and all that stuff. So it was highly optimized for low-latency, efficient scale, you know, originally. And it’s still, to a large extent, true that a lot of traffic in Kafka is kind of metrics, right? Like, logging. Like, at LinkedIn, that was what it was used for. One of the big original use cases was the entire monitoring of linkedin.com: metrics were logged to Kafka and then served on the monitoring dashboards and whatever, right? So I think there are use cases where you have, you know, high-volume writes and a large number of reads of recent data that expect super low latency, where it has to be in memory, right? And that’s kind of a class of use cases, right? So you have a high volume of writes coming in, and then there’s a notion of fan-out: for one write, how many reads of that event are there, right? Like, is it five? Is it 10? Is it 20? There’s a notion of how soon after the write the reading is happening, right? Is it very soon after? Like, are you tailing the end of the log, basically, right? Many consumers tailing the end of the log. And Kafka, I would say, is optimized for that: you have a high volume of writes, and many readers tailing the log, right? And a lot of systems are built with that pattern, you know, and it is highly optimized for that, right? And I think that’s a very good use of Kafka.
It’s kind of, you know, the metrics and logging use cases. And then maybe there are some apps that need, you know, that kind of transactional data, which is generally not that high volume; that’s generally the point, right? Like, you know, transactional data: every click on the website is not the same as every checkout, right? It’s a completely different order of magnitude. So Kafka started with that. And I think over time, you know, now, like, they have added compacted topics. So what that means is that for every key that you write, you can just keep the latest version of that key. So you can significantly reduce the data you store. You know, then they have tiered storage now that tiers off older data to S3, but you can still read it through the same protocol; it’s just the latency will be slower, right? So I think with things like compacted topics and tiering, you could keep it as the log, you know, the system of record for the evolution of the data in your company, right? And that opens up different use cases. And this is kind of what you’re starting to get into, right? Like, Kafka is amazingly not that old, right, as systems go; Confluent itself is going to be 10 years old this year. Like, it’s not that old, you know, for a new category of data systems. So I would say the original was these high-volume, high-fan-out, low-latency use cases around, you know, metrics distribution and that kind of thing, which is still very popular. And then you have more of these transactional application use cases, systems-of-record use cases, that are actually there; like, banks, like, many really sophisticated organizations use Kafka in that capacity, and it is evolving in that direction, right? So both those things are true, and the latter are newer, I would say, relatively speaking.
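The compacted-topics idea Apurva mentions, keeping only the latest version of each key, is easy to simulate. This is a toy sketch of the retention rule only, not Kafka’s actual compaction machinery (which runs asynchronously on log segments); the function and data below are illustrative.

```python
# Sketch of Kafka-style log compaction: for each key, only the most
# recent value needs to be retained, so a changelog can serve as a
# compact system of record.

def compact(log):
    """Keep only the latest record per key (first-seen key order)."""
    latest = {}
    for key, value in log:      # later writes overwrite earlier ones
        latest[key] = value
    return list(latest.items())

changelog = [
    ("user-1", {"name": "Ada"}),
    ("user-2", {"name": "Grace"}),
    ("user-1", {"name": "Ada L."}),   # update supersedes the first record
]

print(compact(changelog))
# [('user-1', {'name': 'Ada L.'}), ('user-2', {'name': 'Grace'})]
```

However many updates a key receives, the compacted log holds one record per key, which is why a changelog topic can stay bounded even as the stream of updates grows without limit.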

Kostas Pardalis 29:45
Yeah. And you said, okay, it started like that. And then you mentioned two things that were built on top of that, right? One was KSQL and the other was Kafka Streams. So first of all, what was the reasoning behind building those? I mean, the initial system was, let’s say, very resilient and performant, almost like a queue. I know it’s not a queue, but it’s something where you write data and then you have consumers on the other side that can read really fast. Why get into systems that are much more complicated in terms of the logic you can build? An example is SQL itself: it’s even a completely different model, the relational model, a different way to interact with the data, instead of having events one after the other in there. So what was the motivation to build these, and where are the systems today? How are they used, and what has stuck?

Apurva Mehta 31:02
Yeah, that’s a good question. So, by the way, just chronologically: Kafka Streams at Confluent came first, launched in 2015 or ’16, and then KSQL was launched in 2017. KSQL is actually built on Kafka Streams. But I would say the motivation for either came even before that. If you look at the early internet companies that did open source, like Twitter and LinkedIn (I can speak to LinkedIn), there was Kafka, and the same group, roughly, who did Kafka also built Samza, which is a stream processor. And what were the motivating use cases of Samza? For example, we used it to ingest into the graph database. We were getting the exhaust as a change log (someone created a connection on LinkedIn, and that’s a log entry in Kafka), and then there was a Samza job that could take that and write it in a format that could be indexed in the graph database. It’s kind of an ETL job. Same for search: search ingestion was built on it eventually, after I left. You need to build a live index of recent activity. The common thing is that if I connect with you, the first thing I do is search for you, and you want that result to show up, not some random person with the same name. That connection needs to be indexed, and indexed quickly. That was a Samza job. So there is this class of sophisticated processing you need to do on these event streams, for things like real-time ETL, real-time fraud detection, real-time analytics kinds of use cases. That’s the motivation, basically. In principle you could try to do it with just plain consumers, right?
But then there’s state management, fault tolerance, load distribution, right? These are highly elastic, scalable, stateful operations. If you do it on your own, you have to build the scaling, the load distribution, the failure detection, the state management yourself. Samza, and Kafka Streams, and KSQL, and others are all trying to solve exactly those problems (load management, liveness detection, fault tolerance protocols, state management with all those properties) and give it to you in a consumable API, so that as a developer you’re just thinking about processing events: I get an event, I need to process it, I have state available while processing that event, and everything else just works. That’s the goal of those systems, and that’s the motivation for them to exist. There’s a huge class of use cases that are stateful, need scale, need high availability, that people want to build on these event streams, and you need technology for that. It’s not something you could do on your own, or at least not something most teams could do well.
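The developer experience he describes, "I get an event, I process it, I have state available," can be sketched in a few lines of Python (a toy; the real Kafka Streams API is a Java DSL, and the framework, not a plain dict, is what makes this state fault tolerant and partitioned):

```python
# Minimal sketch of stateful event processing: each event does a
# read-modify-write against per-key state (here, a running count of
# connection events per member). This per-key state is exactly what
# frameworks like Samza or Kafka Streams manage for you, with fault
# tolerance, rebalancing, and changelog-backed durability.

from collections import defaultdict

state = defaultdict(int)  # stand-in for the framework's state store

def process(event):
    # One event in, a state read-modify-write, one result out.
    state[event["member"]] += 1
    return {"member": event["member"],
            "connections": state[event["member"]]}

events = [{"member": "alice"}, {"member": "bob"}, {"member": "alice"}]
results = [process(e) for e in events]
print(results[-1])  # {'member': 'alice', 'connections': 2}
```

The point of the frameworks is that the `state` dict here survives node failures and moves with its partition during rebalances, none of which the application code has to handle.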

Kostas Pardalis 34:07
Yeah, that makes sense. And, okay, so you said KSQL was built on top of Kafka Streams, right? So why put SQL there as the interface, when you already have something more primitive? Primitive is not a bad thing here; what I’m saying is, it’s just a different API at the end, a way to describe the business logic that you want to execute on top of the data.

Apurva Mehta 34:32
I mean, I can’t speak definitively. I joined that project after it was already launched in public preview, so I wasn’t there for the genesis of those conversations. But in general, I can say that this whole category of event streaming... honestly, I’m proud to have been at Confluent, because they kind of created this category. This idea of data in motion is now something people understand as different from data at rest. It’s hard to imagine, but before then there was no category; this whole space didn’t exist. And they created a category of these event-driven, data-in-motion streaming systems that are distinct from your request-response, data-at-rest database systems. What I’m trying to say is that category creation is hard. Right? If you’re selling a database, and this is something one of the Confluent co-founders says, you’re not creating a new category; you just have to compete on why your database is better than their database. People understand they need a database, and "do they need your database" is a much simpler question to answer, in many cases, as a company selling something. If you’re selling an event streaming system: what is that? Do I need it? Is it like streaming video? Is it like streaming a live event online? People don’t know the word, right? So I think the idea was that if you can make it look like a database, you have expanded the number of people who can get what you’re doing.
It expanded the people you can reach; it’s basically the best way to grow adoption, which is why KSQL happened. Many more people tried KSQL than would have tried stream processing otherwise. Again, the caveat is that I wasn’t there for the genesis conversations for KSQL specifically, but I would say pretty confidently that was the idea: can you broaden the market by making it look like a database, giving it to people in a language they already know, SQL? That would be the main reason to do

Kostas Pardalis 36:55
it. Yeah, 100%. And I think that’s a good way for me to ask the next question, because one of the reasons people try to introduce, let’s say, new APIs for interacting with a system is that you want to provide a different experience to the user, right? By doing that you might improve the experience they have and also expand the user base, as you said; there are obviously many more people who can write SQL than people who can write Kafka Streams, and that makes sense. So with that in mind, I’d like to ask you, and I’m also talking as a person who has built products on top of Kafka: at my first company, Blendo, Kafka was actually a very core part of the platform that we built there. And, okay, back then (I’m talking about, like, 2014, 2015) managing and running Kafka in production was not the easiest thing, right? It was a system that was promising a lot, that could deliver a lot, but you really had to know a lot about the system itself in order to manage it properly. And obviously that makes it hard for more people to work with the technology. So how have you seen, let’s say, the developer experience evolving around Kafka over these years? And could you share a little bit about what makes it hard to work with Kafka, and what are some ways of solving these things and abstracting them away?

Apurva Mehta 38:50
So, do you mean Kafka? Kafka Streams? Or both?

Kostas Pardalis 38:53
Actually, I’d definitely like to start from Kafka, because Kafka Streams is built on top of Kafka, right? So I’m sure there are things that are inherited from there.

Apurva Mehta 39:05
They’re very different, but yeah, we can start with Kafka. So Kafka, for sure: these are massively stateful brokers, and it was built for a company like LinkedIn, which has very good engineers. It’s very hard to massage all of that into something you can just take and run. Beyond a certain scale and beyond a certain level of mission criticality, you have to learn a ton of stuff. There’s so much inside Kafka, so many internals, so much tuning you have to learn to run it well. So it’s not surprising it was problematic for most people in 2014. I think now there are so many good managed services. We use Confluent Cloud at our company, and honestly, we don’t have to care; it basically mostly just works on the back end. And that’s a huge thing: you can start from basically zero, pay for what you use, pay little when you’re low volume, and it keeps scaling with you. You basically just think about the protocol, and it’s not going to break on you from a performance perspective. So today, in 2024, it’s a completely different world from 2014. There are many managed services for Kafka offering different tiers of service: Amazon basically runs it on some hosts for you, though you still have to learn a lot, and on the other end of the spectrum you have fully managed service offerings, with many people competing in that space. And if you’re running it on your own, there are so many other companies doing management tooling and observability tooling; there are at least five companies doing that.
You know, many more people actually know Kafka now. There are courses and training and certifications for Kafka operators, and you have Strimzi and these other Kubernetes operators that you can run on OpenShift or whatever. So there’s so much around it that, regardless of where you are in terms of your requirements, you probably won’t have as much of a problem as you did in 2014. It’s come a long way. I’m sure there’s a lot left to do, but there are so many more options now.

Kostas Pardalis 41:24
And what about Kafka Streams? What does it mean to operate Kafka Streams, and what’s the developer experience there? And what’s the space for improvement, right? What can be done to make Kafka Streams a much better experience for the developer?

Apurva Mehta 41:44
First, I don’t think we’ve talked about what it actually is. Kafka Streams is a library that ships with Kafka; it’s open source, part of the Apache Kafka project. It gives you these really rich APIs: you can write a function that reacts to events, and it has state that you can build up over time that is maintained for you. You take this library, build your app, deploy the app, and Kafka Streams can scale you out to new nodes, scale you in when you shrink, detect node failures, rebalance the work onto other nodes, and maintain state across all of that; it moves data around for you. And if you think about it, it’s just a library running in your app. You run your app like any other app, but it does all this stuff for you on the back end. That’s what the library is built for, and it’s great. People love it for that flexibility: you literally are an application team running your app, you own your pager, you have your tools, and you deploy it like the other services you operate. People love it because it allows them to do that, as opposed to submitting a job to someone else’s cluster, which is a very different mindset. But then, coming to your question (that was the intro to Kafka Streams), the problem is that because it’s a library, you as an application team are basically running a sharded, distributed, replicated database if you have a stateful application. It’s kind of like what you had to do with Kafka in 2014: you have to learn about all the internals of how it works to solve problems when you hit them, and you will hit them.
And with default Kafka Streams, you have to collect the metrics somehow, via JMX, MBeans, whatever; you have to figure out which of the many metrics you actually care about, and which fine-tuning configurations you should use. You have to collect the logs at the right level for the right classes. And you do all of that so you can solve problems if you hit them in production with your stateful application. For most companies, when they start, it’s super nice. It’s great, it works. And then when it doesn’t work... I’ve been on support calls where they didn’t have metrics, they didn’t have logs, but they needed it to work right now. You’re very unprepared a lot of the time, because it’s so magical to start with. It hides a lot of complexity, but it can only hide so much. You would not deploy an application today with a co-located database, certainly not a co-located distributed database, run it in production, and expect it to work. But really, that’s what people are doing with Kafka Streams, and that is the root cause of a lot of issues. So what we are doing, for example, is allowing people to have their cake and eat it too: keep the form factor (it’s still a library running in your application, keep all the great things), but delegate all the hard things to us, so that you don’t have to solve them. We give you clean SLAs, clean interfaces, and that is a big step forward for operations. You’re running it like you would other production services: you’re separating state and compute, and you’re letting a team of experts run the distributed stateful stuff for you, behind clean SLAs. I would say that’s the biggest problem. This coupling is great to start with.
It’s horrible when you hit the limits of it, and I think that’s the biggest problem to solve. If you solve it well, more people will write more of these apps, because they are really compelling: there’s a lot you can do with them, and it’s really easy to start. So that’s the answer to your question, and that’s the reason we focus on those operational problems.

Kostas Pardalis 45:39
And how do you do this decoupling? I mean, I’m sure it’s a hard problem, but can you take us through, let’s say, the architecture you had to build in the end to achieve this decoupling, along with the overall additional benefits you talked about?

Apurva Mehta 46:02
I mean, I would say, first of all, we have the experts at our company. A lot of our people built the system, so we know how to do it. The other big important thing is that Kafka Streams was always written as this kind of application framework, meant to link into your app so you could write your app any way you want. It was also built so that the underlying components are very modular: the state store is an interface, so you can implement whatever state store you want; you can implement any assignment logic you want; you can implement whatever client you want. Basically, you can plug in everything at the bottom. It was built so that what we’re really doing would be possible. Obviously, that was a design concept and the implementation is much harder, but it was always designed with that in mind, and we’re taking advantage of that design, plugging into the points that are already clean to plug into by design. So you actually have to rewrite zero application logic. Our customers have moved in ten minutes: completely swapped state, completely swapped management, without needing to change anything. So it’s both that it was designed like that and that we know the system, and the two things combined allow us to deliver this. It’s basically Kafka Streams; nothing changes in the application.

Kostas Pardalis 47:19
Yeah, so I was going through, at a very high level, the documentation of Responsive, and (correct me if I’m wrong, I’m not an expert in Kafka Streams) because we are talking here about managing the state, Kafka Streams has to store the state somewhere, and if I’m not wrong, the state is stored in a RocksDB instance, right? One of the things that you change with Responsive is the datastore where the state is stored: you have Scylla there, and you also mentioned MongoDB. Can you tell us a little bit about that? Why did you decide to move away from RocksDB to these systems, and what’s next after that? Because, again, if I understood correctly from what I’ve read, you’re also working on building your own store. Tell us a little bit about that. And then I have one last question on this topic, but I don’t want to put two main questions together.

Apurva Mehta 48:33
Yeah, no. So you’re right, it’s RocksDB by default: state is materialized in RocksDB. The reason to move, again, is that if you’re running a sharded, replicated RocksDB in the context of the application, it hurts operations a lot. If you lose a node, you basically have to re-read from Kafka topics all the state to materialize a new RocksDB instance, and that takes away elasticity. There are a lot of solutions in the protocols on top to help you manage that, but they complicate those protocols, and often you get vicious loops: you start trying to restore state, but the group keeps rebalancing, and you stay in an infinite restore loop unless you have the right tunings. There’s a lot of complexity with doing it that way. And honestly, most app teams don’t want to run a stateful app; they want to run stateless apps that auto-scale. So you have to remove storage. It’s almost a precondition for good operations, in my opinion, and that’s a strong view we have as a company as well. So that’s the answer to why we moved off RocksDB: it’s just operationally really hard once you hit a certain scale. We have heard so many stories of people who won’t touch the system because no one is confident they can fix it when it breaks, and that kind of situation is not appealing. So we solve that by removing local state, with Scylla, Mongo, and eventually our own store. The thing is, you still need a transactional store. Kafka, as I mentioned, we did transactions there, and Kafka Streams does allow you to do read-modify-write sequences on events, which is an extremely powerful primitive for writing correct applications. So your store needs to be transactional, too.
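The read-modify-write requirement mentioned above can be sketched as a toy in Python (in-memory only; a real implementation would use Kafka transactions or a transactional remote store, and the function and key names here are illustrative):

```python
# Toy read-modify-write: read the current state for a key, apply an
# event, and commit the new state all-or-nothing, so a crash between
# read and write cannot leave the state half-updated. The staged-copy
# "commit" stands in for a real transaction protocol.

def apply_event(store, key, delta):
    staged = dict(store)            # stage writes for an atomic commit
    current = staged.get(key, 0)    # read
    updated = current + delta       # modify
    staged[key] = updated           # write (staged, not yet visible)
    store.clear()
    store.update(staged)            # "commit" the staged snapshot
    return updated

store = {}
apply_event(store, "acct-1", 100)
balance = apply_event(store, "acct-1", -30)
print(balance)  # 70
```

In the real systems being discussed, the transactional guarantee must also cover the consumer offsets and the produced output, which is what makes the semantics (and the performance work) hard.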
So that’s why we had to pick these kinds of stores: you can’t just dump into something and be naive about it. Transactions are a big requirement, and the ability to do transactions the right way with the right performance is another problem; the semantics have to be there, and the performance also has to be there. So we’ve done a lot of work to make these transactional systems, Mongo and Scylla, work with Kafka Streams with good performance. Excellent performance, I would say. I think the other thing is also just form factor. Mongo especially is everywhere: all the big companies use it, they already have contracts, and many of them have a requirement that data can’t leave their data center, can’t leave their network. If you bring your own Mongo, it’s already blessed, so you have very little security risk. From an adoption perspective, especially because Kafka Streams is used in most major enterprises, which are very locked down, being able to say "bring your own Mongo and we will make it work really well" is extremely good from a procurement and security perspective. That’s often a blocker for startups: instead of "give us your data," it’s "you keep your data, and we make it work for you." So that’s the other reason to go with these vendors. And then, coming to your question about our own store: these systems are also not fully optimized for this.
We already have a system that works. In default Kafka Streams you have RocksDB on EBS, typically, if you’re on AWS, replicated over the changelog. We could pull that out and replicate over S3, like some companies are doing with Kafka itself by tiering it to S3, with a really cheap intermediate cache to serve quick reads and quick buffered writes. That makes it cheaper than the default, with essentially infinite scalability because you have S3, and no management. That’s the long term, and it will further open up the market for us: you can have really high-state, high-scale apps without worrying about cost, with great operations. That’s the reason to do it, but it’s still a long way off. Honestly, I think Scylla and Mongo work really well, and for a lot of people building transactional apps with strict data governance and data residency requirements, it’s much easier to just use something that’s already blessed in the company.

Kostas Pardalis 52:42
Makes sense. And the final part of the question: why not Kafka itself? You touched a little bit on the why, but I just want to make it more explicit, because, as you say, you built transactional semantics on top of Kafka. So why not just use Kafka itself and simplify the architecture, and store the state there too?

Apurva Mehta 53:08
Kafka is basically a log, and Kafka Streams by default does store the state in Kafka, and then rebuilds it into RocksDB when a new node comes online or whatever; it’s called the restore process. The whole problem is that if you want real elasticity, this process of rebuilding state, or even sometimes re-downloading from S3, takes time if it’s a lot of state. Having any kind of local state that needs to be built on a local disk before you can do any processing can be a big problem, especially at scale. If it takes a long time to rebuild from Kafka, you’re waiting a long time to scale out, and that can be deadly for some people. I mean, it’s debatable whether that’s the right architecture if you have a really mission-critical, highly available app, especially if you need elasticity: just the time waiting to build up a terabyte of state from Kafka could be a day. Could you wait a day? I think that’s really the question. And people do it, honestly. Many companies use it successfully; they’ve configured it and they can work with it. But most people hit problems. And I think having the choice is great.
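The restore process described here, replaying a changelog to rebuild local state before a node can process anything, can be sketched like this (a toy model, not the actual Kafka Streams restore logic):

```python
# Toy restore: a new node rebuilds its local state store by replaying
# the changelog topic from the beginning. A record with a None value
# is a tombstone that deletes the key. Restore time grows with
# changelog size, which is the scale-out delay being discussed.

def restore_from_changelog(changelog):
    store = {}
    for key, value in changelog:   # replay every record, in order
        if value is None:
            store.pop(key, None)   # tombstone removes the key
        else:
            store[key] = value     # later records overwrite earlier ones
    return store

changelog = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", None)]
print(restore_from_changelog(changelog))  # {'k1': 'v3'}
```

A remote state store sidesteps the replay entirely: a new node just starts reading and writing the shared store, which is the elasticity argument being made.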

Kostas Pardalis 54:19
Makes sense. Okay, cool. I’ll give the microphone back to Eric. So, Eric, all yours again. I have a feeling we need to arrange another one of these discussions, because I have some very interesting questions that were triggered by all the things we talked about with Apurva. I’m really looking forward to having you back again. But, Eric, all yours.

Eric Dodds 54:43
Yeah, I agree. Well, we’re right at the buzzer, as Brooks likes to say. But, you know, your knowledge of streaming systems is so deep; it’s really incredible. And I can’t help but ask: if you had to go solve some other technical problem that didn’t have anything to do with streaming, if you sort of had a blank check to go solve a problem in a different area, what interests you outside of streaming? What do you think about when you’re not thinking about streaming? Or is that even possible after so many years? Do you think in streams now?

Apurva Mehta 55:27
I would say the latter is true. If it’s not streaming and not this, it’s my kids, my family. Or a book.

Eric Dodds 55:35
But honestly, you see everything as a Kafka topic?

Apurva Mehta 55:38
No, I mean, it’s not that I don’t think there are so many great things to do. But for me, I’ve genuinely been in this space for so long. I would say even at Confluent it took a couple of years to get into the mindset. And once you’re in the mindset of these event-driven architectures, this durable log that’s the source of truth of all the data in your whole company (that’s the vision of Confluent; the "central nervous system" is another thing they say), I just think there’s so much that can be done in this space. The apps of the future, the kinds of things you could do with them, how fast you could deliver great new capabilities: I think we’re in very early stages, and I fundamentally believe that. And that’s genuinely exciting. There are hard technical problems, which is exciting for me as an engineer and technologist, and many of them are unsolved. And then from a business and use-case perspective, there’s so much you could do to grow the market and innovate on pricing and packaging. The problem is that nobody makes money in stream processing. I shouldn’t say that, there are companies in the space, but it’s largely true, and I think there’s a lot of value there too. So if you think about it: hard technical problems, and a clear market, because people are doing things; it’s just hard to do it and capture the value. Honestly, it’s perfect. And I have a background in it; it’s not like I’m just coming off the street after reading a book. It’s a deep excitement. So maybe it’s a disappointing answer, but no, I’m pretty excited about this.

Eric Dodds 57:21
Yeah, I think deep excitement is a really good way to describe it, and that’s very palpable just talking with you. So we’re very excited for what you’re building, and thank you again for teaching us so much during your time with us on the show.

Apurva Mehta 57:35
Thanks so much for having me. It was a lot of fun. I hope I didn’t take too long to answer. But yeah, happy to do this again if you want to continue the conversation.

Eric Dodds 57:46
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.