Episode 01:

Data Council Week: The Evolution of Stream Processing With Eric Sammer of Decodable

April 23, 2023

This week on The Data Stack Show, we have a special edition as we recorded a series of bonus episodes live at Data Council in Austin, Texas. In this first interview, Brooks and Kostas chat with Eric Sammer, the Founder and CEO of Decodable. During the episode, the group discusses real-time data, stream processing, the complexities of the modern data stack, exciting announcements at Decodable, and more.

Notes:

Highlights from this week’s conversation include:

  • Eric’s journey to becoming CEO of Decodable (0:20)
  • Does real time matter? (2:12)
  • Differences in stream processing systems (7:57)
  • Processing in motion (13:04)
  • Why haven’t there been more open source projects around CDC? (20:34)
  • The Decodable experience and future focuses for the company (24:31)
  • Streaming processing and data lakes (32:54)
  • Data flow processing technologies of today (39:01)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Brooks Patterson 00:24
All right, we are in person at Data Council Austin, where we're able to sit down with Eric Sammer. He's the CEO of Decodable. I'm Brooks, and I'm filling in for Eric, who was in a biking accident. He's fine, but wasn't able to make it to the conference. So I'm coming out from behind the curtain here and excited to chat with Eric. We've got Kostas here, of course. But Eric, to get started, could you just kind of give us your background and what led to you founding Decodable?

Eric Sammer 00:55
Yeah, absolutely. First of all, thanks for having me. It's always a pleasure when I get a chance to talk about this stuff. So my background: I've been doing this now for 25-something years, really focusing on data infrastructure. I lovingly refer to myself as an infrastructure monkey; while people are doing fancy math and cool stuff with data, I'm moving bytes around, you know, flipping zeros to ones. So I spent a lot of time working on things like SQL query engines and stream processing infrastructure, which really took up the last decade or so of my life. I built one or two systems internally, mostly for marketing and advertising applications. And then sometime around late 2009, 2010, I wound up being an early employee at Cloudera and spent like four years working on sort of the first generation of big data stuff, and then wound up creating a company that was eventually acquired by Splunk, and spent a bunch of time there working on real-time infrastructure, stream processing, and just cloud platforms in general for observability data. And then, about two years ago, I broke out and started Decodable, which is a stream processing platform delivered as a cloud service, and we can get into the details if that's interesting. It's really focused on being able to acquire data from all of the fun and interesting sources, process that data in real time, and get it into all the right destination systems in the format that's ready for use in analytics.

Brooks Patterson 02:35
Cool. One thing we were chatting about just before we hit record, that you kind of brought up, is just the idea of: does real time matter? Could you unpack that for us and just talk about what you mean there? There are different camps who would probably argue different things.

Eric Sammer 02:53
Yeah, I mean, you know, I think people probably break down into three groups. And by real time, specifically, I mean low latency, sub-second availability of data, either for analytics or for driving online applications and systems and those kinds of things. There's one group who fully understand it, know exactly what they're talking about, and have a strong opinion about it. There's a group of people who say, well, it depends on the use case: some use cases demand real time, some use cases don't. And then there's a third group of people who say nothing really matters, real time is never important, and those kinds of things. And, you know, selection bias, of course, but we talk to the first and second group most of the time, and I would say the biggest thing that we hear from some people is, my use case doesn't require real time. And the interesting thing there is that, at some level, I don't disagree. The thing I would point out is that if you asked three years ago whether or not you needed to know exactly whether your food had been picked up from the restaurant, and where it was in between the restaurant and your house, everybody would have gone, who really cares? And then COVID hit, and now everybody fully expects up-to-the-second visibility into where their fried chicken is, right? So what winds up happening is the use case, I would argue, doesn't require real time until someone decides to do it and changes the expectation. And I think companies like GrubHub, or Netflix, or YouTube content recommendations, or any of these other things have changed the expectations, and that as a result now either saves them money or generates revenue. And one use case for me is, I don't know about you guys, but I don't have a lot of loyalty to retailers around certain things. So if I need a mop, I don't care where I buy it. I care that it's in stock, and I care who can get it to me fastest. And if that's, hypothetically, Amazon, Walmart, Target, you know what I mean, I will get it from any one of them. So I care about inventory being up to date, I care about who has the lowest price. And all these things are things that are responsive to inventory arriving at a loading dock, or dynamic pricing logic adjusting prices based on competitive sales and those kinds of things. So my argument is everything's real time, either in potential or, you know, something that winds up being real time because a competitor has driven it in that direction. And I'm sort of interested if you guys agree with that or not, but that's my take on the world. Yeah.

Brooks Patterson 06:20
That’s great. Do you agree?

Kostas Pardalis 06:24
Yeah, I do agree. I mean, I think the reason is because we have all this infrastructure out there and all this technology being built. I don't think it's just because, you know, geeks want to have the equivalent of a fast car, and that's why they built Kafka, right? At the same time, I think the problem with real time is that real time is a very relative term, and the semantics vary a lot. So if you ask a marketer what real time is, and you ask someone who is responsible for fraud detection, you're probably going to get a difference, not only in the definition of what real time is, but also in the importance of real time, right? Like, if my campaign runs, let's say, five minutes later, okay, not a huge problem, I think, although I will probably be frustrated because I have to wait. But if someone catches fraud, I don't know, like a day after, that's not fun, right? But let's talk a little bit more about technology. If we remember, some of the first real-time processing pieces of infrastructure came out of companies like Twitter, which was the definition of real time back then, right? Technology like Samza... what was the name of it? Twitter had a platform, right?

Eric Sammer 07:56
Yes. So LinkedIn had Samza. Twitter initially had Storm, and then they built Heron, which was another one. And then there was Spark Streaming that came out with the Spark ecosystem, and Apache Flink. So there have been a couple of these things that have grown up over time.

Kostas Pardalis 08:16
Yeah. And I wouldn’t like to talk about these and Delta Lake compared with something like Kafka to understand, like, what’s the difference between Kafka and a system like Flink or sonza? Right?

Eric Sammer 08:30
Yeah, absolutely. So let's pick apart the sort of Kafka ecosystem for a second. There are really four main components or projects that people talk about when they talk about Kafka, maybe even five. One is the actual messaging broker itself, right? And that's the part that I think of as Kafka. Then there's Kafka Streams, which is the Java library for actually doing stream processing, and KSQL, which is the SQL interface built on top of Kafka Streams. Then there's Kafka Connect, which is the connector layer. And then there's the schema registry. Some of these things are under Apache licenses, some of these things are under the Confluent Community License, if I remember correctly. So when I think about Kafka, I think of the broker proper, which is really just about pub/sub messaging or eventing, really just the transport of data, with no real processing capabilities beyond moving it from A to B. And so the processing systems that we're talking about, Storm, Samza, Kafka Streams, KSQL, Flink, which is the one that I'm probably most familiar with since that's what we're based on at Decodable, and various other systems like that, run on top of those Kafka topics, right? Many of them support not just Kafka but Kafka-like systems, including some of the cloud providers' offerings like Kinesis and GCP Pub/Sub and those kinds of things.
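
To make that distinction concrete, here is a minimal sketch of what working against the broker alone looks like: a plain Java client that reads from one topic and writes to another. The broker handles the transport; anything resembling processing is whatever you write by hand in the loop. The broker address and topic names are illustrative placeholders.

```java
// A bare Kafka client doing a trivial pass-through: the broker only moves
// bytes from topic to topic; any real logic has to be hand-written here.
// Broker address and topic names ("orders", "orders-copy") are made up.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PassThrough {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "pass-through");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // No state, windows, or exactly-once semantics here unless we build them.
                    producer.send(new ProducerRecord<>("orders-copy", record.key(), record.value()));
                }
            }
        }
    }
}
```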

Kostas Pardalis 10:04
Okay, but what about the processing, right? I would argue, let's say, and let's forget KSQL, Kafka Streams, and all that stuff, that, okay, I have a producer, I have a consumer, I can provide business logic there. I can do processing right on top of Kafka. What's the difference between that and having a system like Flink?

Eric Sammer 10:25
Yeah. So, in general, you could argue that anything that writes to or reads from a Kafka topic is effectively doing stream processing at some level. It might just be doing minimal transformation, it might be doing sophisticated transformation and those kinds of things. I think the difference is that the stream processing frameworks are just that: they are frameworks, right? So they're going to give you a bunch of capabilities, including an execution engine, typically one that's optimized and sort of understands things like predicate analysis, aggregation operations, and window functions, and all these other kinds of things. They typically also understand schemas, or events, serialization and deserialization. They typically understand state management: where am I in the stream, what happens when I fail, and how do I recover, to achieve either at-least-once or exactly-once processing of data, you know, getting rid of duplicates and those kinds of things, or not producing them to begin with. And also some higher-order concepts like a notion of event time and watermarking, and all of these other sorts of more sophisticated things that help achieve correctness when processing data. So in that sense, you should think about stream processing systems the same way you would think about a database, not that they necessarily work the same way, but rather than just having files on disk and reinventing Postgres on top of that, it behooves you to take advantage of the fact that people have put in a lot of work to get the correctness and the processing right. Does that make sense?
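
As an illustration of the capabilities Eric lists (event time, watermarks, windows, keyed state, and checkpoint-based recovery), here is a rough sketch of a Flink job that sums order amounts per customer in one-minute event-time windows. The event type, field names, values, and job name are hypothetical, not anything from the episode.

```java
// A sketch of what a framework like Flink provides out of the box:
// event time, watermarks, windowing, keyed state, and checkpoint-based recovery.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class OrdersPerMinute {

    // Hypothetical event type (public fields + no-arg constructor so Flink treats it as a POJO).
    public static class Order {
        public String customerId;
        public double amount;
        public long eventTimeMillis;

        public Order() {}

        public Order(String customerId, double amount, long eventTimeMillis) {
            this.customerId = customerId;
            this.amount = amount;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what gives the "where am I, how do I recover" state management.
        env.enableCheckpointing(60_000);

        env.fromElements(                                  // stand-in for a Kafka source
                new Order("c1", 20.0, 1_000L),
                new Order("c1", 22.5, 30_000L))
            // Event time + watermarks: tolerate events arriving up to 30 seconds out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((order, ts) -> order.eventTimeMillis))
            .keyBy(order -> order.customerId)
            // One-minute tumbling windows computed on event time, not arrival time.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum("amount")
            .print();

        env.execute("orders-per-minute");
    }
}
```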

Kostas Pardalis 12:12
Makes absolute sense. But I have a follow-up question on that. The way that you describe it, the way I visualize it, is that I have data in motion, and I'm applying aggregations and data processing while the data is still in motion. A couple of years ago, let's say a bit after 2015 or so, we started hearing a lot about the concept of ELT instead of ETL. Because, okay, what you're describing sounds more like ETL, right? You extract the data, and while the data is still in motion, I transform the data, and then I'm going to do something with whatever I produce from there, right? But then we had this whole concept of, you don't have to do that anymore. Let's just extract and load the data, and after the data is loaded, you can go in, with extreme scale, and process the data. Assuming, let's say, I have Kafka there, latencies are low, theoretically at least, so I can get close to, let's say, real time, and in some cases, let's say I have something like Pinot or ClickHouse, I can have real time, yes. So, what's the difference there? Why do we still need to have these complicated systems? Because they are complicated, right? Like, Samza and these kinds of things are not easy to go out and use to do this processing in motion. Yeah.

Eric Sammer 13:45
I mean, this is a really interesting question. I think it's a philosophical debate. So, you know, you're right. If you look at this through the lens of being, for instance, a Snowflake user, from your perspective you have many sources of data, you want to get them into Snowflake, you want to do your processing there, and why on earth would you ever do any kind of transformation beforehand? So, a couple of things. One thing that comes up is that a lot of the quote-unquote ELT tools do this under the hood: they are actually doing things like mapping data types, so they are doing processing, but it's de minimis processing, not what I'd think of as business logic processing. And someone explained it to me that the thing about ELT is not that it doesn't do transformation, it does; it's that the majority of the business logic is pushed to the target systems. And that definition made sense to me. So it's actually ETLT, right? There are two T's in there, which is okay. Now, when it becomes interesting: a couple of things. One, you're actually using your costly CPU to do the processing, and if you do that, you know there are latency characteristics and those kinds of things. But I actually think the more interesting angle on this is that if you zoom out and you think about the other places that data wants to go, you start to go, okay, so it's going to go to S3, it's also going to go to Snowflake, it's also going to go to ClickHouse, or Pinot, or Rockset, or Druid, or wherever it's going to end up. It's also going to go back into operational systems like Elasticsearch, so that you can provide online product search, or Algolia, or whatever people are using these days. It might also get cooked in various ways and go through a bunch of microservices. And so it's not so much that you want to push all of your business logic in the world into the stream; it's that you want to have the capability to do impedance matching between all those systems. Some of them aren't allowed to have PII data. Some of them don't want certain records. Some of them need quality fixed before it lands in those systems where you can't do updates and mutations and those kinds of things. And so I would think about stream processing, and I use networking as an example, like packet mangling on a network: stream processing is the equivalent of your load balancer, right? It allows you to do some amount of processing before the packets land in the target system. And I think when you think about it from a holistic perspective, you kind of go, oh, then it actually makes sense, because you're not tightly binding the schemas and the structure of the data between the source system and the target system. And one of the biggest challenges that I hear is that if you're doing ELT into a system like Snowflake, and somebody makes a schema-incompatible change, you've broken your target system, and you're very tightly coupled to those operational systems. So I think that when you start talking about data contracts and larger organizations, being able to do these things and paper over those problems, I think stream processing is one way you can start to cut into that.
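
To make the "impedance matching" idea concrete, here is a rough sketch in Flink SQL, driven from Java, of one raw stream being shaped differently for two destinations: an analytics copy with the email hashed, and an operational search copy limited to completed orders. The table names, fields, and connector options are hypothetical placeholders, not Decodable's actual configuration.

```java
// Illustrative sketch of routing one stream to two destinations with
// different light transformations applied before the data lands.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ImpedanceMatching {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: raw order events arriving on a Kafka topic.
        tEnv.executeSql(
            "CREATE TABLE orders_raw (" +
            "  order_id STRING, email STRING, amount DOUBLE, status STRING" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'format' = 'json', 'scan.startup.mode' = 'latest-offset')");

        // Destination 1: analytics copy, PII masked before it lands.
        tEnv.executeSql(
            "CREATE TABLE orders_analytics (" +
            "  order_id STRING, email_hash STRING, amount DOUBLE, status STRING" +
            ") WITH ('connector' = 'print')"); // stand-in for a warehouse sink

        // Destination 2: operational search copy, only completed orders.
        tEnv.executeSql(
            "CREATE TABLE orders_search (" +
            "  order_id STRING, amount DOUBLE" +
            ") WITH ('connector' = 'print')"); // stand-in for e.g. Elasticsearch

        tEnv.executeSql(
            "INSERT INTO orders_analytics " +
            "SELECT order_id, MD5(email) AS email_hash, amount, status FROM orders_raw");
        tEnv.executeSql(
            "INSERT INTO orders_search " +
            "SELECT order_id, amount FROM orders_raw WHERE status = 'COMPLETED'");
    }
}
```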

Kostas Pardalis 17:14
Yeah, 100%. And I think I ask this question a lot because I totally agree with you, and I also get why people might wonder about these things. And I think there's always a huge gap between what theoretically can be achieved and what, in the wild, in practice, is happening, right? And usually that's where engineering comes in, right? That's why we need engineers, that's why we engineer these systems. There are always trade-offs, and each one of them has unique trade-offs. Like, yeah, sure, why not just use only ClickHouse, right, and do everything? Theoretically, you should be able to do it. But have you tried to do a lot of joins there, for example? Or what it is like to change the schema on something like Pinot? There are always trade-offs, and that's why there is, at the end, wisdom in the industry. It's not like these things exist just because, you know, crazy VCs and founders like to push their agendas and stuff. That's what I'm usually saying. But I would like to go one step before the processing, because I know that another important component of Decodable has to do with CDC. And this is one of these interesting things where everyone kind of talks about it and says it's important, like it's a very good idea, all these things. At the same time, if you think about it, outside of Debezium, I don't think there's any other mature framework, at least, that you can attach to a relational database, like an OLTP database, and turn it into a stream, right?

Eric Sammer 19:14
I think not in the open source world. Certainly there are a bunch of commercial systems that have been around for a very long time in various forms. I think GoldenGate is probably one of the more well known; HVR, which was acquired by Fivetran, does this kind of thing. So there are those kinds of things, but in the open source world, I don't think people actually have a great sense of what adoption looks like these days. And I think Airbyte actually uses Debezium. It does, yeah. Okay. So, Debezium is the one that I know best, and the one we know best at Decodable, because again, we're based in part on that. But I think you're right. I think one thing that is interesting is not just the lower latency to getting the changes. There's this whole host of applications, especially on the operational side of the business, versus the analytical side of the business, that can use change data capture data effectively as triggers to kick off a bunch of really interesting stuff. You know, we were talking earlier about inventory getting updated: maybe you want to make only things that are in stock searchable, and you want to play with search relevance, for instance for an e-commerce site, based on inventory. So that's the kind of thing. Or marketing campaigns: when PlayStation 5s come back in stock, I want to alert everybody who has one on their wish list, right? Those are the kinds of things that I think we can enable with CDC beyond just database replication, which is a core use case, of course.

Kostas Pardalis 20:59
Yeah. Why do you think we haven't seen more open source projects around CDC?

Eric Sammer 21:06
Because it's really hard. Every database system implements exposure of the binlog or the transaction log in a different way, and some of them don't at all; there really hasn't been a single good way of exposing this. So Postgres, MySQL, Oracle, Mongo, they all have different, database-specific substrates for those kinds of things. And I think it just takes a special kind of person to commit themselves to going and solving that kind of problem. You know, we are very lucky to have Gunnar Morling, who was the project lead on Debezium at Red Hat for a long time, at Decodable. So Gunnar spends a lot of time thinking about these kinds of things, to his credit. I don't want to say it's thankless, because I think people appreciate it, but it is really hard, you know, it is really hard.

Kostas Pardalis 22:01
Yeah, it makes sense. One thing that I always found interesting, both in a good way and in a bad way, about Debezium is, let's say, how tied it is to Kafka, right? It is a project that, technically, you have to run with Kafka Connect, at least, and the moment you decide to not have Kafka there, you stop being very happy with it, right? And I'm asking you, not because, okay, obviously I'm not a committer, and neither are you, on Debezium, but you work with it, right? It is part of your stack. Do you see this changing, and ultimately, why does it have to be so tied to Kafka?

Eric Sammer 22:50
Yeah, I mean, you're absolutely right. There are multiple layers there in the implementation, and so, even internally inside of Decodable, we wind up using Debezium without Kafka in certain places, for certain use cases, more as a library, you know, to access certain things. It's definitely tricky. You've got to know the internals, quite frankly. And again, I'm not an expert in what's happening in the community on this, so please take it with a grain of salt, but my understanding is that there's a long-term feature request inside of the Debezium community to support running without Kafka there. I think this is a trap that open source projects fall into. There's always this, well, why don't we make it a configurable thing, which explodes the complexity of these projects pretty significantly. My sense is that the upside is you could potentially remove the Kafka dependency; the downside is that it only makes things more complicated. I mean, this is a plug, but one of the things that we focus on is just making Debezium less complicated, and Flink is part of that for us as well. So if you don't know or care about Flink and Kafka and Debezium, we try to create a platform where you can define a connection to Postgres and get the result in Pinot or in Kafka or in Kinesis or in any other system that we wind up supporting, without having to deal with the guts of this stuff. So to some degree, that's the value, or part of the value, that we deliver there.
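
For the curious, the "Debezium as a library" approach Eric mentions is roughly what Debezium's embedded engine API looks like. The sketch below is illustrative: connection details are placeholders, and some property names (for example, topic.prefix versus database.server.name) vary across Debezium versions.

```java
// A sketch of Debezium's embedded engine: running the Postgres connector as a
// library, with no Kafka or Kafka Connect cluster involved.
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedCdc {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("name", "embedded-postgres-cdc");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        // Offsets are tracked locally instead of in a Kafka topic.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");
        props.setProperty("offset.flush.interval.ms", "10000");
        props.setProperty("database.hostname", "localhost");   // placeholder connection details
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "postgres");
        props.setProperty("database.dbname", "shop");
        props.setProperty("plugin.name", "pgoutput");
        props.setProperty("topic.prefix", "shop");

        // Each change event is handed to this callback instead of being produced to Kafka.
        try (DebeziumEngine<ChangeEvent<String, String>> engine =
                 DebeziumEngine.create(Json.class)
                     .using(props)
                     .notifying(event -> System.out.println(event.value()))
                     .build()) {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine);
            Thread.sleep(60_000); // let the engine run for a minute in this sketch
        }
    }
}
```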

Kostas Pardalis 24:33
So, next provocative question. You were essentially saying that you have three pieces of technology that you're using as part of Decodable: Debezium, Kafka, and Flink. Each one of them is an operational nightmare.

Eric Sammer 24:47
Yes, that is not controversial. I’ll assume that that is true.

Kostas Pardalis 24:52
Whenever I had to do any of these, okay, it wasn't fun, let's put it this way. You need, I don't know, a special type of person who enjoys working with these things. Yes. So I'm scared. Why would I come to Decodable when I know that all these complexities are there? Why would I do that?

Eric Sammer 25:15
I mean, that's what pays the bills at Decodable. The reason people come to us is because they want the capabilities, but they don't want the operational overhead. And so, you know, Flink alone has a couple of hundred configuration parameters, if I remember correctly. It's sizable. Our goal is to make that disappear. So we try to offer what I think is the right user experience, which is largely serverless. You can give us a bunch of connection information for your database, or a SQL query, and you don't have to know that it's Debezium and Flink and all these other kinds of things under the hood. If you do care, we give you the right trapdoors to, you know, give us a Flink job, if that's what you want and you don't want to give us SQL or something like that, and we'll handle that. But it's funny, because there's just this Goldilocks zone: if it's so complicated that people don't want to adopt the technology at all, no matter how much a vendor paves over it, that's a problem, and if it's so easy that no one needs us, that's a problem too, right? So, obviously, that said, I do think we always want to make it easier, and we do spend some time upstream trying to do some work there to make this stuff easier to use. But the reality is that all of the options, all of the "well, I don't want to use S3 as my state store, I want to use this other thing," all that pluggability, all that optionality, makes it more like a toolkit for stream processing and less like a solution for stream processing. And so there's value in that, but that cuts both ways, right? And so, I don't know, I know I'm biased, but I like to think that we solve this problem for people. But you're right. I mean, it's a real concern, the complexity of any disaggregated system, and I think there have been some good discussions about disaggregation and the modern data stack and those kinds of things. It generates complexity.

Kostas Pardalis 27:21
Yeah, of course, 100%. And, okay, I know there have also been some very interesting announcements about the product lately. And you mentioned the modern data stack, and I know that one of these has to do with dbt. So would you like to share a little bit more about some interesting things that are happening with the product right now?

Eric Sammer 27:43
Yeah, the two kinds of users that we see at Decodable are data engineers, who are ingesting data or sort of making it ready for ML pipelines and analytics, stuff like that, and then application developers who build more, like, online applications, real-time applications, on the same underlying tech stack. So for the data engineers out there, what we wanted to do is allow people who know Snowflake, dbt, and Airflow to be productive stream processing people without having to take on the Debezium, Kafka, Flink stack. And so for them, we announced earlier today support for a dbt adapter. We now support dbt: you can use dbt to build your stream processing jobs in SQL with the same toolset and the workflow that you know. And the other thing that we're super excited about, that we announced today, is first-class support for Snowflake's Snowpipe Streaming API. Now, without spinning up a warehouse, you can ingest data in real time into Snowflake, with no S3 bucket configured, with no SNS queues, with no IAM policy stuff. Just tell us the data warehouse name, and we will ingest. And it turns out that Snowflake has made this incredibly cost effective. So you're not paying for warehouse time; there's a small amount of money that you end up paying in terms of credits, but it is substantially more cost effective to ingest data into Snowflake. And it shows up in real time, which is incredibly interesting.
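
For a sense of the mechanism being wrapped here, below is a rough sketch of what writing rows directly with Snowflake's Snowpipe Streaming Java SDK (snowflake-ingest-sdk) looks like: rows go into a channel opened against a table, with no stage, queue, or warehouse in the path. The connection properties, object names, and row shape are placeholders, and this is the raw SDK, not Decodable's integration.

```java
// Rough sketch of Snowpipe Streaming ingestion via Snowflake's Java SDK.
// Account details, object names, and the row contents are placeholders.
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class StreamToSnowflake {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("url", "https://myaccount.snowflakecomputing.com:443"); // placeholder account
        props.put("user", "INGEST_USER");
        props.put("private_key", "<private key PEM body>");               // key-pair auth
        props.put("role", "INGEST_ROLE");

        try (SnowflakeStreamingIngestClient client =
                 SnowflakeStreamingIngestClientFactory.builder("example_client")
                     .setProperties(props)
                     .build()) {

            // A channel is opened against a specific table; rows stream into it directly.
            SnowflakeStreamingIngestChannel channel = client.openChannel(
                OpenChannelRequest.builder("orders_channel")
                    .setDBName("ANALYTICS")
                    .setSchemaName("PUBLIC")
                    .setTableName("ORDERS")
                    .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                    .build());

            Map<String, Object> row = new HashMap<>();
            row.put("ORDER_ID", "o-123");
            row.put("AMOUNT", 42.5);
            // The offset token lets the writer resume without duplicating rows after a restart.
            channel.insertRow(row, "offset-1");

            channel.close().get();
        }
    }
}
```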

Kostas Pardalis 29:26
So why do you say real time? Because the last time that I worked with Snowpipe, I think the end-to-end, and when I say end-to-end I mean from the moment the event hits Snowpipe to when you see it in a view, when it gets materialized inside your data warehouse, we were talking about a span of like two or three minutes. Has something changed with Snowflake?

Eric Sammer 29:51
Yeah, so this is the Snowpipe Streaming API, because you're actually writing... I don't know the implementation; my understanding is that you're basically writing into Snowflake's internal formats there, so you're skipping a lot of the batch load steps. And so we've seen on the order of, you know, seconds and less, even less than a second, I think. So you can actually run SELECT statements and watch records change. Yeah, it's incredible.

Kostas Pardalis 30:20
Oh, yeah, because I think the previous, at least, implementation, the first implementation of Snowpipe, was more of a micro-batching architecture. It was still using, let's say, S3 under the hood to stage the data, but it was obviously optimized in a way to reduce the latency there as much as possible. But again, when you have S3 in there, you add another layer of latency; you cannot avoid that, right? So that's very interesting. That's something I should definitely also try for myself. I knew that Snowflake was working on the streaming capabilities that they had, so it's very interesting to see what they've done. And I'm also looking forward to seeing what Spark and Databricks are going to be doing on that, because I think Spark Streaming is showing its age. Sorry to say that, but I don't think people really love working with that thing. It's very inflexible; it's really hard. So I'm very curious to see what the response is going to be.

Eric Sammer 31:28
Yeah, you know, I've had the privilege of working with some of the people who are now working on Spark Streaming, or Structured Streaming, it's hard for me to say Structured Streaming, and those kinds of things. I'll say this, and I won't claim to be an expert on the internals of Spark: they have really smart people working on this. We, of course, are super biased. We think that Flink has effectively won; out of all of the open source projects, it has the most robust and sort of battle-tested stream processing engine. But I'm interested to see what the team at Databricks does; they have a fantastic team over there. And my guess is that it's actually going to be very hard to make the kinds of changes that I think they need to make without breaking anyone who's already using it today. And I think that's going to create a challenge for them.

Kostas Pardalis 32:30
Yeah, definitely, I think it's going to be interesting, because it also has to do with the fundamental concepts behind Spark itself and how it gets the guarantees it has with micro-batching and all that stuff. So it will be very interesting to see what they come up with, and they will definitely come up with something. And, for example, okay, even if you don't love every feature they have, Databricks is pretty good, actually. It doesn't have the simplicity that Snowflake has, but at the same time it's very robust, in both performance and also the capabilities that it gives you. And okay, Databricks will always be a more configurable product than Snowflake; they have completely different product philosophies, right, in terms of the user experience. It makes sense. One other thing I wanted to ask you about is data lakes specifically, and I have two reasons for that. One is because streaming data and data lakes, okay, they make a ton of sense together, because usually streams provide the volume of data that makes a data lake viable. That's one of the reasons. The other reason is because all the table formats, like Delta, for example, and Iceberg, are also working on that. I'm not so sure about Hudi, but I'm pretty sure they also have something similar. There is a concept of CDC there, right? They propagate changes, so when you do something with a table, you can have a feed to listen to these changes, which is kind of interesting to see happening in systems that are supposed to be slower moving, by definition, in a way. So what's your experience so far, in terms of stream processing and data lakes, both consuming from them and pushing data into them?

Eric Sammer 34:24
Yeah, you know, it's funny. We use three words: continuous, streaming, and real time. And I actually think that continuous processing is what Hudi and Iceberg and Delta Lake, and I think Delta Live Tables specifically with Databricks, are all trending towards, and I actually think that's a positive thing, right? It's really about the propagation of change through the dependent downstream processes. I think that, on the whole, this will increasingly remove the need for sort of out-of-band processing on a lot of these kinds of things, which is, I think, a net positive, and I think anything that simplifies the lives of data engineers is a good thing. It's just way too complicated to do relatively simple stuff. I think continuous processing, and better primitives for continuous processing, are going to be seen in the data lake. But I agree that these things are compatible, because again, I view the data lake as one destination for streaming data. And again, that's my bias, because a lot of our customers have these online systems as well that need different cuts of data and stuff like that. So I think that this is actually a natural continuum of this change-based, or continuous change-based, processing that can now extend into the data lake, which historically has been immutable, which has always been complicated, even as far back as the Hadoop days, HDFS and stuff.

Kostas Pardalis 36:02
Yeah, 100%. All right, one last question from me. So, we talked at the beginning of our conversation about the history behind the stream processing frameworks, right? And they go back to pretty much the same era as Hadoop. Now, since then, with this whole big data movement, we've seen companies, you know, IPO: Kafka became Confluent, which had an IPO. Okay, Databricks, let's say they haven't IPO'd, but...

Eric Sammer 36:37
Let’s say they are on their way.

Kostas Pardalis 36:39
Yeah, we have Snowflake. There's a lot of, let's say, value created from data-related technologies. But we haven't really seen any of these stream processing frameworks creating a company that is, you know, the Snowflake or the Kafka, or the Confluent, of that category. Why do you think this is the case, especially after seeing how much the industry is investing in them? Because we are talking about super complex distributed systems that someone has to build, and there are many of them; it's not just Flink.

Eric Sammer 37:17
Yeah, I think it's a really good question. A couple of things. Well, one, to their credit, I would say that Confluent has done some of this right. Even though we probably overlap with them a little bit, we like to partner with them, I think more closely than we do sometimes. But to their credit, they are probably the closest thing to a publicly traded company that is based on that kind of capability, though not a pure stream processing kind of solution. You're absolutely right. I think a couple of things. One, I really think that the use cases have finally caught up to the technology. In a lot of cases, even back in like 2015, people weren't as bullish on what they could get out of it; I don't think that people fully understood what they could get out of lower latency, you know, sort of higher throughput data, on the processing side. I think people are starting to get that now: if you're a logistics company, or you see GrubHub, or you see FedEx, you start to get it. So I think that's been part of it. I also think that the tooling was nowhere near as mature and as sophisticated as it is today. We talked about a bunch of different systems, and I think each generation of, at least, the open source stream processing systems has gotten incrementally higher throughput, higher performance, but also just more stable, more correct under failures, easier to reason about, and, quite frankly, got SQL support. And as much as we malign SQL, people know it, and it works, and people get it. And I think that gaining SQL support is actually a big accelerator for stream processing in general.

Kostas Pardalis 39:11
That's super cool. Okay, one more question. I have to ask it as a follow-up, so sorry, I cannot not do that. So, talking about stream processing, we also have a new family of technologies based on streaming dataflow, the dataflow family of processing out there. Materialize is one of them, but there are more. What's your take on that, and how is this different compared to something like Flink?

Eric Sammer 39:47
I mean, I think the answer is that there are a lot of differences in implementation, for sure. Timely dataflow, and what the Materialize team has done there, I mean, it's actually really interesting and exciting technology. I think that they are tackling the next thing; at least, from my vantage point, they're trying to make streaming more intuitive by attacking some of the consistency stuff that tends to crop up in stream processing. I think it's really interesting. I think that tack is probably, and again, I'm biased, I sort of have to put that on the table as a disclaimer every time I start to say something, but I do think it is less mature than things like Flink and all these other kinds of things. So, warts and all, I think that Flink is incredibly robust and sort of well understood in these kinds of cases. But anything that pushes stream processing forward, I think, is a good thing. And so differential dataflow and timely dataflow are exciting projects. I think it'll be really interesting to see what Materialize and other companies do with that technology. I still think it has a little bit of a way to go, but, you know, I think that's a place where I'm sure Frank over at Materialize disagrees with me. So it's an interesting conversation to have. And in fact, here at the conference, there is a presentation on dataflow, so it'll be interesting.

Kostas Pardalis 41:18
Yep, 100%. All right, that's all from me for today. We should conclude before I come up with more questions. These are great. I love these.

Brooks Patterson 41:29
We have gone past the buzzer. I think this is because you have so many great questions. But Eric, before we sign off here, if folks listening want to find out more about Decodable, where should they go?

Eric Sammer 41:41
Yeah, they should go to decodable.co, and they can sign up for a free account and get started right away. There's a free tier there that allows people to get up and running with both the Flink APIs and SQL.

Brooks Patterson 41:54
Awesome. Thanks so much for coming on.

Eric Sammer 41:57
Guys. Thank you so much for having me. It was a real pleasure.

Eric Dodds 42:00
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.