Episode 163:

Simplifying Real-Time Streaming with David Yaffe and Johnny Graettinger of Estuary

November 8, 2023

This week on The Data Stack Show, Eric and Kostas chat with David Yaffe and Johnny Graettinger, Co-Founders of Estuary, a company building the next generation of real-time data integration solutions. During the episode, David and Johnny discuss streaming technology, the challenges of real-time streaming, and the importance of low-latency, high-scale data processing in the ad tech industry. They delve into the reasons behind building Estuary, its unique approach to decoupling storage and compute, and its focus on real-time updates and scalability. They also touch on the complexities of state management in streaming applications, data mesh, the impact of streaming on data processing and orchestration, and more.

Notes:

Highlights from this week’s conversation include:

  • Johnny and David’s background in working together (1:56)
  • The background story of Estuary (4:15)
  • The challenges of ad tech and the need for low latency (5:44)
  • Use cases for moving data at scale (10:35)
  • Real-time data replication methods (11:54)
  • Challenges with Kafka and the birth of Gazette (13:54)
  • Comparing Kafka and Gazette (20:22)
  • The importance of existing streaming tools (22:28)
  • Challenges of managing Kafka and the need for a different approach (23:40)
  • The role of compaction in streaming applications (26:54)
  • The challenge of relaxing state management (34:01)
  • Replication and the problem of data synchronization (36:48)
  • Incremental backfills and risk-free production databases (46:03)
  • Estuary as a Platform and Connectors (47:45)
  • The challenges of real-time streaming (57:56)
  • Orchestration in real-time streaming (1:00:51)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. This week’s recording is with Johnny and Dave from Estuary. And I think this is going to be a really fun conversation. It’s a topic that we’ve actually covered quite a bit on the show, which is streaming. You know, in particular, real-time streaming. But this is really in the context of, I think, what you use streaming for, and we really dig into the Kafka side of the conversation, which we haven’t covered in depth a ton. Part of the Estuary story is really reacting to real-time streaming needs, evaluating Kafka, and seeing some pretty severe shortcomings, which is why they built Estuary. Now, what’s really interesting to me is, in many ways, they don’t talk about Estuary as a streaming service. They kind of talk about it almost as real-time ETL, which is fascinating. There’s some open source technology under the hood. And this is really, I think, going to be an interesting conversation, because streaming is obviously a hot topic, and there are multiple technologies, so it is really interesting to see what the Estuary team has built.

Kostas Pardalis 01:37
It was a very fascinating conversation, actually, for many different reasons. First of all, it was pretty technical, and not only in terms of talking about Estuary itself. We actually had a very deep dive into Kafka — how Kafka is built, and some of the issues there that Estuary is addressing from the perspective of the architecture of the system. For example, we were talking about how compute and storage in Kafka are very tied together, how this has been changed with something like Gazette, what this means in terms of managing the system, and what types of use cases it enables. So we had a very interesting architectural conversation around this type of system. Anyone who is interested in understanding better how Kafka and these types of streaming systems work should definitely listen to that. And then we also talked a lot about some important concepts like CDC, right, and why CDC is important, how we use it and how they implemented it — because the standard out there is pretty much using something like Debezium, but the folks at Estuary actually implemented everything from scratch, and they have some really good reasons why they did that, and they talk through those things. Amazing people, both Johnny and Dave, with very deep expertise in this type of technology. We had an amazing conversation, ranging from the technical side of things up to the business side of things. So I think everyone should listen to them. And hopefully, we’re going to have them again in the future, because I don’t think one hour was enough to go through all the different topics when it comes to streaming.

Eric Dodds 03:38
I totally agree. Well, let’s dig in and talk about streaming with the Estuary team. Hey, Johnny and Dave, welcome to The Data Stack Show. We are so excited to chat about lots of things. So thanks for giving us some of your time.

Johnny Graettinger 03:53
Hey, thanks for having us.

David Yaffe 03:54
Thanks for having us.

Eric Dodds 03:56
All right, we’ll start where we always do. You have a great history of working together on multiple different things — doing multiple startups, selling different companies. So I’ll let you choose who wants to start with the background story.

David Yaffe 04:15
I’ll go for it. So I’m Dave Yaffe, co-founder of Estuary. I’ve been working with Johnny for about 15 years; we started working together back in probably 2008, 2009, at an ad tech company. We ended up building a platform called Invite Media, which was bought by Google. The interesting thing about ad tech is that it really cares about low latency and very high scale. So we built a lot of technology around those use cases, which helped us meet customer needs. We then built a platform called Arbor, which was a data platform that facilitated transactions between publishers and advertisers. And that required, you know, very low latency, so we built a streaming system to facilitate that. And we ended up selling that to a public company back in 2016. The streaming system we built was called Gazette, and that was a pretty novel piece of technology, which Johnny will dive a lot deeper into. But we cared about low latency because when someone’s purchasing an item on a website, they are expressing interest in a product, and the key is to get back in front of them really quickly, in order to use that data as efficiently as possible. So we wanted to make sure we were able to do that. We built that system, and now we’ve been applying it for the past four years at Estuary to the world of data infrastructure.

Johnny Graettinger 05:44
Yep. So hi, I’m Johnny. As Dave mentioned, we’ve been working together for quite a while now. And yeah, whatever your feelings may be on the advertising ecosystem — in some respects I’m, like, recovering from the ad tech ecosystem — as an engineer in the space, there are pretty fascinating data challenges, just because of the speed at which you’re having to make decisions, which is typically in the 100-millisecond timeframe, and the amount of data that you need to bring to bear to make those decisions, because you’re consulting various indices of data to figure out whether to show an advertisement and how much to pay for it, and that kind of thing. And from a data management perspective, too, because you’re talking about gobs of this stuff coming in, you need to connect it together, build various graphs that allow you to understand how all these data points fit together, project information through it, and get it back out again as quickly as possible. So there are a lot of very interesting engineering challenges in wrangling this constant influx of data into useful data products and doing it really quickly. As Dave mentioned, we ended up building some technology to help us do that at the last company we built, called Arbor, and we open sourced it as a project called Gazette. We are now continuing to build on that project today with our current company, Estuary, which focuses on essentially making it really easy to connect the various systems that you have that contain data that you care about. So whether that’s an OLTP database, or a pub/sub system, or even Google Sheets, or various SaaS APIs — being able to connect that data to where you want it to be, which could be another database, or an analytics warehouse, or, you know, an Elasticsearch cluster. There are generally all kinds of places where you have data, and we make it a lot easier to move it around and transform it along the way.

Eric Dodds 07:42
Very cool. So I have — this is just a personal curiosity about ad tech, because I’m somewhat familiar, though definitely not as intimately as each of you. But it sounds like the APIs for delivery of ads were pretty robust, because you’re talking about collecting this data, determining, you know, someone’s browsing products on the site, you want to retarget them. And so you built Gazette to facilitate collecting that data and syncing it into these other systems. But then you also have all these mechanisms to deliver those ads. I mean, to some extent, even if you can move the data with extremely low latency, you’re limited by the API that you’re delivering it to, right?

David Yaffe 08:28
I’ll go quickly; I’m sure Johnny will add a lot of context. But there’s a thing in the advertising industry called real-time bidding, right? And real-time bidding is effectively a protocol which enables publishers, or exchanges of publishers, to share opportunities to partake in an auction. And what they’ll do is they’ll have an auction, which goes out to tens, or hundreds, or even more different potential buyers, and they’ll say, here’s the opportunity, do you want to be part of it? That’s a protocol that existed back in 2007, 2008, and it still exists today. It’s how almost all of the ads that you see online are transacted. And when I left that space in 2014, it was something that happened 16 million times a second across the United States. So it’s really, really robust.

Johnny Graettinger 09:24
Yeah. And some of the data challenges are, when you’re building a demand-side platform, you’re basically operating a bidder that’s trying to make a decision. There’s information that comes in on that real-time bidding request, and you’re using that information to consult various indices to figure out whether or not you’re going to make a purchase decision and how much you’re going to pay for it. So those auxiliary indexes of data that you’re using to make those decisions need to be up to date and ready to go. So it’s not just handling the requests coming in — it’s also handling all of the data that you’re using to make decisions in terms of knowing how to respond to that request.

Eric Dodds 10:09
So, I’d love to hear about a couple more of those types of use cases. I mean, that makes a ton of sense, right, where you need to move massive amounts of data in near real time so that people can get these highly contextualized ads. What are some other use cases where people are using Gazette or Estuary to facilitate moving data at that scale?

Johnny Graettinger 10:34
Yeah, scaling down a little bit, because of course most companies are not in the ad tech space — that was mostly just giving our journey a little bit in terms of how we got into this space and why we do what we do. But another example to reach for is that a lot of companies have a database, like an OLTP database, that they’re using to manage their business data and process transactions. You very often want to wire up caches to that. Another use case that we see a lot is search: you want to enable fancier forms of search, whether that’s Elasticsearch or, you know, vector databases like Pinecone, which are really just a variant of semantic search. So you have all of these different systems that you might want to use to power search for your product, or whatever it is that you do that’s your business domain, and you don’t necessarily want all of that just going through your one database. So a fairly common bread-and-butter use case is taking the data from your primary database, pulling it out, transforming it a little bit, and making it useful in these various backends for search, like Pinecone or Elastic and others. And they are a caching problem, so it’s important that they be up to date and reflect what’s in the primary database.

Eric Dodds 11:54
Yep, that makes total sense. And, I mean, a number of ways that people solve this currently come to mind, but what are the patterns that you see? So if you’re not using, you know, Gazette or Estuary, what are people doing? Is it just a much slower sort of database replication process? I mean, a lot of people would even use Kafka to just ingest the change log and do some sort of transformation so that they can populate some other downstream system. What are the popular ways that you see people do this today?

David Yaffe 12:28
Yes — if you truly care about real time, it’s probably Kafka plus Debezium, plus engineers who manually manage a lot of that stuff, because, yep, there is some manual aspect to it. And if you don’t care about real time, which is also common, you might be using a system like Fivetran or Stitch to sync those assets together. Sometimes you’ll see people who just query their database and, you know, simply do it all in house without using something like Debezium. So there are lots of different possibilities for how you can sync that data. Our goal is to make it as easy to work with streaming data that’s truly updated in real time as it is to use any of those other methods. So really, it’s just configuration: get it from your source, get it to your destination; if you want to do transformations, we try to make it as simple as SQL, and make the whole thing much more point-and-click and scalable than it would be otherwise.

Eric Dodds 13:26
Yep. Okay. And I have to ask this question, because, I mean, I don’t know, maybe our listeners aren’t asking this, but it’s in my head. I have to believe that you looked at using, or used, Kafka to solve some of these problems early in your journey, pre-Gazette, right? I mean, there weren’t a lot of other systems that I think would be able to handle the type of scale that you were describing in the ad tech industry. Is that true?

Johnny Graettinger 13:54
Yeah. So the decision to start Gazette was made back in 2014. And the central reason — I was evaluating Kafka at the time, and we were building this business that focused on data within the advertising ecosystem. We knew that latency was really important for the business; it was important that we be able to capture this stuff and transform it and move it as quickly as possible. Kafka was the obvious game in town for doing that, and that was the case in 2014 as well. But there’s one architectural limitation of Kafka that I just couldn’t get past, which is the central reason for building Gazette: the coupling of compute and storage that Kafka does. When you have Kafka brokers, you are writing data into Kafka topics, and those Kafka brokers are basically managing that data on the disks of the Kafka brokers. As you expand or grow the cluster, it’s basically moving that data around, but you’re fundamentally limited to the disks that are attached to your Kafka brokers. But even worse than that, there’s this tension that exists between Kafka and downstream usages of Kafka. Your brokers are responsible for accepting the real-time writes that are coming into your system — basically the data that’s coming in that you cannot lose. The number one most important role of those brokers is getting that data down, getting it replicated, making sure you’re not going to lose it, so that it’s available. But at the same time, you have these downstream use cases that want to pull from these topics of data and do something with it — process that data, turn it into some kind of data product. And if you imagine for a moment that you’re bringing up a new application — you’re a startup, perhaps in the ad tech space, and you’re bringing up some kind of new consumer of data that’s got to backfill over all the historical data that you might have seen so far — you want that application to slurp up data as quickly as possible from your streaming system, because you want to process all that backlog of historical data: weeks of data, months, maybe even years of data. There’s an inherent tension where that application wants to read data as quickly as possible from the same brokers and the same disks that are also responsible for the real-time writes coming into the system that you cannot lose and want to be fast. So that coupling of storage and compute is the primary reason for starting Gazette — it solves that problem by decoupling the two. What Gazette ends up doing differently is that the brokers are really just focused on the near real-time stuff, like the writes that have just happened. They’re focused on getting that replicated, getting it down, making sure it’s not lost. And then very quickly after that, they persist it into cloud storage. Cloud storage is actually the durable representation of that data — that’s where it actually lives. After, you know, the near moment has passed and it’s now in the historical record, it lives only in cloud storage. And what’s really neat about that is that you can basically stand up a new streaming application that is going to process months of historical data and then catch up and stream in real time once it processes that backlog.
And when that new application comes up, rather than needing to go hammer the brokers and pull as much data through them as it can, what it’s able to do instead is basically just ask the brokers occasionally, hey, where do I go find the next gigabyte of data, and then go read it directly out of cloud storage, and then seamlessly transition to real-time streaming as it gets to the present. So that’s a capability that we built out in Gazette and leaned on a lot for building all kinds of downstream applications that consume quite a bit of data. And obviously, that decoupling — storing things in cloud storage — means that there’s really no retention limit on how much of the historical record you can keep. In a sense — I had one person tell me this — it let us be fearless when building architecture and figuring out how to put these pieces together, because we really just didn’t have to worry about whether a new downstream use case was going to bring down production.
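[Editor’s note: to make the read pattern Johnny describes concrete, here is a minimal Go sketch of a backfill that reads historical fragments directly from cloud storage and then hands off to a live tail at the broker. Every type and method name here is hypothetical, invented for illustration — this is not Gazette’s actual client API.]

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Fragment points at a contiguous span of the log persisted in cloud storage.
type Fragment struct {
	Begin, End int64  // byte offsets within the journal
	StoreURL   string // object (e.g. an S3/GCS URL) holding those bytes
}

// fakeBroker stands in for the broker's fragment-metadata API.
type fakeBroker struct{ frags []Fragment }

// NextFragment returns the persisted fragment covering offset, if any.
func (b *fakeBroker) NextFragment(offset int64) (Fragment, bool) {
	for _, f := range b.frags {
		if offset >= f.Begin && offset < f.End {
			return f, true
		}
	}
	return Fragment{}, false
}

// fetchFromCloudStorage would be a plain GET against the object store;
// stubbed here so the sketch runs.
func fetchFromCloudStorage(f Fragment) io.Reader {
	return strings.NewReader(fmt.Sprintf("<bytes %d..%d of %s>", f.Begin, f.End, f.StoreURL))
}

func main() {
	broker := &fakeBroker{frags: []Fragment{
		{Begin: 0, End: 1 << 30, StoreURL: "s3://bucket/journal/0000"},
		{Begin: 1 << 30, End: 2 << 30, StoreURL: "s3://bucket/journal/4000"},
	}}
	var offset int64
	// Backfill: historical reads never touch broker disks, only the store.
	for {
		frag, ok := broker.NextFragment(offset)
		if !ok {
			break // caught up to the persisted horizon
		}
		data, _ := io.ReadAll(fetchFromCloudStorage(frag))
		fmt.Println("processed", string(data))
		offset = frag.End
	}
	fmt.Println("backfill complete; switch to a live tail from the broker at offset", offset)
}
```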

Eric Dodds 18:28
like one more question for me, but actually, it cost us. I’m sure you have a ton of questions on the Gazette. So I want to hear your questions in the Gazette. Did you when you started building Gazette? Did you know you were going to open source it? Or was it just, you know, you were trying to solve your own problem? How did the open source story for Gazette come about?

Johnny Graettinger 18:51
Yeah, it was always the intention. But when we were going through the acquisition process with our last company, we kind of made sure of it. Honestly, it was a negotiated thing with our acquirer: as we were going through that conversation and the diligence process, we wanted it to be open sourced, and they were good enough to let us do that and follow through.

Eric Dodds 19:16
Very cool. Yeah. If you develop technology inside of a company, getting it actually open sourced, especially through an acquisition, certainly seems like one of the hardest parts. So, yeah, Kostas, I’m interested in your questions.

Kostas Pardalis 19:32
Yeah, it’s very interesting, actually. While I was hearing all this, I had a question: why would anyone use Kafka today, if Gazette is out there, right? It sounds like a little bit of a provocative question, but the reason I’m asking is because when you’re engineering these systems at that scale, with, let’s say, the operations that are supposed to be getting done there, there are always trade-offs, right? Kafka made some trade-offs, and I’m pretty sure that you’ve also made some different trade-offs in delivering the technology. And I think learning about these trade-offs is very important for any engineer, because it gives you a very deep insight into why the system works the way it works, and then we can also make a connection with the problems that are getting solved, right? Because different problems require different trade-offs. So why would I, today, go and use Kafka, or Redpanda, or any other similar solution that we hear about out there, instead of something like Estuary?

David Yaffe 20:50
Yeah. So, just from a purely business standpoint, the way I look at Kafka is that it’s the protocol more than anything else at this point, right? Of course, there’s an open source project and a whole bunch of stuff around it, but the protocol is the thing that really matters. Redpanda is a totally new implementation which happens to use the Kafka protocol. And Gazette currently does not use the Kafka protocol, but soon it will — well, it will have the option to support the Kafka protocol as well. So you could think of it as Kafka, just with some different architectural decisions that make it function differently. From my point of view, from a business standpoint, what you really want is the Kafka protocol. That’s all you really care about, right? Because then you have all the integrations that they build, you have the ability to use Kafka Connect and the whole ecosystem that sits around it. So that’s my point of view. But Johnny, what would you say?

Johnny Graettinger 21:42
Yeah, I think a lot of it is the ecosystem. When we were starting Estuary, we spent a bunch of time surveying — we had just finished our stay at the acquiring company and we were leaving, Gazette had been open sourced, and we were figuring out what’s next. The obvious thing is, okay, if you squint and look at it from a distance, there are a lot of similarities with Kafka — should we make a go at promoting this as an alternative to Kafka? So we spent a bunch of time surveying Kafka users within various companies, and there are a couple of answers that came out of that. One is that, of course, the ecosystem matters a lot. The existing Kafka Connect ecosystem matters; the existing applications that already speak this protocol matter. And if you’re not using Kafka today, maybe you’re using something like Kinesis, or Google Pub/Sub, or tools like that. There are all kinds of different tools in the streaming space — why does anybody really need another streaming protocol? And honestly, I can’t necessarily say that they do. I’m not here promoting Gazette and saying that you should go pick this up and use it today — one major limitation is that we only ever implemented clients in the Go programming language. So in some respects, our realization was that what we think the industry really needs is more of an up-leveling. Focusing on the streaming broker is the wrong place to be focusing; we think of it more as an implementation detail, honestly, of the real problem, which is: I’ve got these different systems that I want to connect together, and I want to describe what I want to have happen, and then I want you to go do it for me. So yes, we happen to use Gazette under the hood, but it’s an implementation detail, and honestly, it doesn’t even really matter that much from our customers’ perspectives. But there’s another answer here too. I vividly remember conversations we had with teams that were managing Kafka within large companies. I remember one engineer telling us about the work they had done to impose various quota constraints on other teams, so that those teams weren’t overloading their Kafka brokers and they were able to keep things nice and stable — and they were presenting this as a good thing. All the while, listening to this conversation, I’m thinking: okay, so you pushed this problem down onto the rest of your organization and limited your other teams’ abilities to actually use this infrastructure. But here you have an entire team for whom, you know, this is their job, it’s their livelihood — managing Kafka is what they do. So going in and telling them “you’re doing the wrong thing” is not a great message to be delivering. That kind of turned us off from trying to directly take on Kafka and introduce something different that’s roughly the same shape and fits into the same box. So those are some high-level answers. In terms of engineering trade-offs, honestly, there aren’t that many. The central one probably is that Gazette manages immutable logs.
So one key difference between Gazette and Kafka is that Gazette is writing immutable logs, where you can add new content to a log, and that log can be as big as you want. It’s backed by cloud storage, so you really don’t have to worry about its size, and you can read it very efficiently because you’re pulling that log directly out of cloud storage. But it does not do what are called compactions out of the box — the broker is not responsible for compacting that log over time, where Kafka is. So if you have multiple instances of a key, Kafka will compact prior versions of that key away. Now, we can get into why that actually doesn’t matter that much as the next follow-up question, but I don’t want to spend too much time on this one answer.

Kostas Pardalis 26:07
Yeah, actually, let’s do that. And let’s reverse the question: let me ask you why it is important to do compaction. I’m thinking, okay, if I were an engineer working on Kafka, implementing compaction doesn’t sound like the easiest thing to do, right? Especially with the throughputs that you have to go and maintain there — it’s definitely going to add complexity to the management of the system. So why did they decide to do that? What’s the use case for compaction out there? And as a follow-up to that: why, from the Estuary point of view, is it not that important to do the compaction?

Johnny Graettinger 26:54
Yeah, I would say compaction is very important, but there are different ways to do it. I’m going to get a little bit technical in this answer — I hope that’s okay. Let’s say you’re building a streaming application, and very often streaming applications need to manage state. If you’re trying to do a stateful join or something like that, you’ve got two different streams of data, and you need to index one side of that stream so that you can query that index and look up values when you see instances of the other side of that stream. So you need some kind of stateful database in order to do that. In the Kafka Streams ecosystem, that’s typically RocksDB. It’s also RocksDB within Gazette — if you’re building an application on top of Gazette, you’re generally using RocksDB to do it; we use RocksDB under the hood as well. Now, Kafka brokers are doing compactions of Kafka topics: when you write multiple instances of a key, the broker is eventually pruning out older versions of that key. And that’s important from a disk-usage standpoint — you don’t want that topic to be unbounded in size. But the kind of funny thing that happens is, when you’re building the stateful streaming application that’s using RocksDB, and you’re consuming from one topic stream and indexing those values, RocksDB is also doing compaction. RocksDB is what’s called a log-structured merge tree. Part of the way it works is that it’s managing an append-only tree of files that represent the contents of the database, and as you write new content, it’s taking some of those files, compacting them together, writing new files, and then deleting files. So one of the insights we had is: RocksDB is already doing this compaction — why are we doing compaction twice, in these two different places? What if we instead leveraged the compaction work that RocksDB is already doing, harnessed that, and then we don’t really need to worry about this problem within the broker?
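[Editor’s note: a toy Go sketch of the insight above — the broker’s log stays immutable and append-only, while the consumer’s own state store (RocksDB in practice; a plain map standing in for it here) keeps only the latest value per key, so compaction falls out of the store rather than the broker. Illustrative only.]

```go
package main

import "fmt"

// Change is one event appended to the immutable log.
type Change struct {
	Key, Value string
}

func main() {
	// The log only ever grows; multiple versions of a key coexist in it.
	log := []Change{
		{"user:1", "v1"},
		{"user:2", "v1"},
		{"user:1", "v2"}, // supersedes the first entry, but the log keeps both
	}

	// The stateful consumer indexes the log by key. Keeping only the latest
	// value per key is exactly what broker-side compaction achieves in Kafka;
	// here it falls out of the state store's own upsert-and-compact behavior.
	index := map[string]string{}
	for _, c := range log {
		index[c.Key] = c.Value
	}

	fmt.Println(index) // map[user:1:v2 user:2:v1] -- latest value per key
}
```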

Kostas Pardalis 29:02
That’s interesting. So, okay, a follow-up question to that, because you mentioned the term “state” there, and I want to ask about it. Confluent had their conference pretty recently, right, and there was also the acquisition that they did a couple of months ago, I think, of a company that is working on Flink. If state management is somehow part of the system itself, why do we also need a system like Flink that does, again, processing, right? The main value that it adds, from my understanding at least, is that it is a stateful streaming query engine. So there’s some kind of state management in so many different places, and I think it’s hard for people to understand why we need all these different complicated systems that have to work with each other in order to achieve something. So I’d love to learn a little bit more about that.

Johnny Graettinger 30:18
Yeah, well, I’m tempted to tease this apart a little bit within the Kafka ecosystem, because we’re talking about a couple of different things here. We have Kafka Streams, and Kafka Streams can have as its backend a state management layer — a database for indexing state — and that can be RocksDB, for example. But Flink is actually managing an entirely separate state. When you have state within Flink, what it’s doing is typically persisting checkpoints of state to cloud storage, and then recovering those checkpoints of state from cloud storage. So they’re completely separate state stores. And this is a little confusing — the streaming ecosystem is confusing — because state ends up being a very hard problem, and it’s a hard problem to solve well and scale well, so we see a bunch of different competing solutions for how best to do it. I’m not sure that’s directly answering your question, honestly, but I just wanted to articulate a little bit how these different flavors of tools are managing state within a streaming context. One of the trade-offs, I guess, is that for really low latency, the way Kafka Streams does it is going to be faster than Flink, because Flink basically needs to accrue a window of data, then run a transaction where it’s computing all of its effects and state updates, then snapshot that and persist it into cloud storage, and then release the results. So there’s a little more latency involved in the process when you’re using Flink, for that reason.
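[Editor’s note: a rough Go sketch of the latency trade-off Johnny describes — a Flink-style engine accrues a window of input, applies its effects to state transactionally, persists a checkpoint, and only then releases results, so output waits on the checkpoint. The window, state, and checkpoint here are toy stand-ins, not Flink’s actual API.]

```go
package main

import "fmt"

type event struct {
	key    string
	amount int
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	input := []event{{"a", 1}, {"b", 2}, {"a", 3}, {"b", 4}}
	state := map[string]int{} // running sums per key

	const window = 2 // events per transaction, standing in for a time window
	for i := 0; i < len(input); i += window {
		batch := input[i:minInt(i+window, len(input))]

		// 1. Apply the whole window's effects to state.
		for _, e := range batch {
			state[e.key] += e.amount
		}
		// 2. Persist a durable snapshot of the state (stubbed as a print) ...
		fmt.Println("checkpoint persisted:", state)
		// 3. ... and only then release the window's results downstream.
		fmt.Println("results released for window ending at", i+len(batch))
	}
}
```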

Kostas Pardalis 31:57
Makes sense. So, you mentioned that with Estuary there is also processing that someone can do with the system, right? How is that different — if it is different — from using Kafka Streams or Flink compared to Estuary?

Johnny Graettinger 32:16
Yeah, so a major thing that we are doing, just from a vision standpoint: when you work with existing tools within the streaming ecosystem, whether that’s Kafka Streams, or Flink, or Spark, and you’re trying to do some kind of stateful computation — where you need, you know, a join or something where you need state in order to facilitate that computation — all of these frameworks basically require that you adapt your modeling of state into their mechanism for persisting that state and restoring that state. From a vision standpoint, where we’re trying to get to is that you, as a user, as an author of these transformations, should not have to do that. You should be able to bring your chosen tool to bear in terms of managing state, whether that’s a Python script that’s keeping something in memory — the same way that you would write it today, where you’re just slurping in something from a file, computing something in memory, and then spitting out an output. Why can’t you do that in a streaming context as well? So from a vision standpoint, what we’re really after is basically Unix processes and files and pipes, scaled up to the cloud, where you can bring your own program, which can manage state however it likes, and run that in the context of a transformation within the platform we’re building. We’re not quite there — there’s more to build before we fully realize that vision — but the major goal we have is to break that tight coupling between user-defined state and the way that the framework needs that state in order to manage its persistence.

Kostas Pardalis 34:00
Yeah, that’s super interesting, because I would assume that, okay, imposing a framework on how to manage state allows the system designer to reason about how to effectively do the state management, right? When you relax that, many things can go wrong, especially when someone brings their own Python code or something else. So I’ll be super excited to see how this works, because it’s a great opportunity for this platform to let all these different people go and write code for it without having to know deeply about all the intricacies of these distributed systems. I think that’s super important if we want to bring more people into the streaming ecosystem. But it sounds like a very hard problem, right? From an engineering perspective, a very hard problem. So I’d love to see how you do that as you realize your vision.

Johnny Graettinger 35:04
I’ll also drop this nugget: the VM ecosystem and the ability to snapshot what processes are doing has gotten a lot better, and you can think about incorporating that into an overall streaming transactional model.

Kostas Pardalis 35:21
Yeah, that’s interesting. I mean, I’ve looked into these micro-VM kinds of systems, like Firecracker and some others, that are really good at doing the snapshotting, and even at moving the snapshots around. My question always was: okay, getting snapshots of the VM is one thing, but if this VM has, I don’t know, a couple of gigabytes in its memory that need to be captured, it will take time, right? That’s latency that is added there; you can’t avoid that. So it’s very interesting to see how, in the context of stream processing, this might be different than, let’s say, in an OLAP system, where you might have large parts of the state on one node and need to move that around. But anyway, that’s a very interesting conversation we can have at some point. Going back: you mentioned at the beginning the term “caching.” You were describing a use case where you have all these different systems — you have Elasticsearch, you have Redis, you can have Pinecone, you might have Redshift, whatever — and you use something like Estuary to replicate the state of an input system, let’s say, to all these different systems. Now, it’s commonly said that one of the hardest problems is cache invalidation, right? Figuring out where the state between two systems is actually not in sync, and doing something about that. And I would assume that each system has different requirements around that. How do you see this problem from your point of view, as the system that needs to be able to support that? What have you seen in the industry out there in terms of problems around that, and how do you solve the problem?

Johnny Graettinger 37:26
Yeah, in some respects this is an organizational problem. Companies have to decide: what are the authoritative systems for the datasets that we care about? Where are we making changes — what system are we issuing writes to, essentially? And I’m not sure there’s any tool out there that’s going to be able to make that decision for you. But we can build tools that will support the decisions that you make. So, you know, the primary answers are basically around discovery. There’s a kind of interesting concept that’s grown up in parallel to what we’ve been doing, which is the data mesh concept. For those who aren’t aware of it, data mesh is fundamentally the idea of trying to engender a more self-service internal ecosystem within a company: you have different systems where data lives, and democratized access to discover where that data is, discover the various data products that exist, and use them for different downstream purposes to make different downstream data products. This concept from ThoughtWorks is very much in parallel to our own thinking and what we were building at the time, and it’s something we strongly agree with. A big part of this is: it’s one thing to build the pipe that’s going to move data, but if you can’t communicate what those pipes are and what they’re doing to the broader cohort of users within your organization, that’s only a partial solution. So part of the answer is just building good tools for people to understand what the data flows are, where data is coming from, how it was built — what went into building a data product, where it came from, et cetera.

Kostas Pardalis 39:15
Yeah, I love that you brought up data mesh, because that triggered an interesting question, I think. Data mesh has a very interesting concept: okay, you have data that lives in all these different places, and you should be able to reason about the place where the data is, right? So there are, I would say, two approaches to realizing that technically. One is to actually replicate the state that is needed between the different systems — something that I would assume you can do with something like Estuary, right? I have my production OLTP database, some part of the state lives there, and I replicate it to my OLAP system so I can go and do the OLAP things that I need. But there’s also the concept of, instead of moving the data around, moving the logic around — doing federation, actually, right? So instead of trying to move the data and solving the hard problem of replicating the state and reasoning about the state in a distributed manner, let’s just ask the questions directly at the source. I have my own reasons here, and pros and cons for each one of them, but I’d love to hear your take on that.

Johnny Graettinger 40:41
Two immediate thoughts. One is, of course, if you’re talking about your production database, you might not want your analytical query load going to your production database. So that’s problem number one. But another problem is, you might not care about the data as it is right now. A really common use case that we see is — because we’re doing change data capture from databases, we have essentially this immutable log of all the individual changes that have been made to your tables over time. That can certainly be used to maintain a materialized view of the current state of the table, and that’s what we call a standard materialization. But there are actually a couple of other use cases that fall out of this, too. Another thing people commonly want is auditing — an audit table of what changes have been made to this operational table over time. That’s something you can get essentially as a byproduct of having an immutable log of all the change events in your database. So we see customers who do both: they want the history, but they also want the real-time view, in different systems. And yet another one is point-in-time recovery, where I might care about what the state of this table was last Thursday. Asking the current table is not going to tell me that, but I can ask the immutable log of change events, and I can basically rebuild that view as it existed at some previous point in time. So usually, if you can just issue a query to the current table, great — but there are often all these kinds of related data products that you want that the operational table can’t tell you.
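[Editor’s note: a small Go sketch of the point-in-time use case Johnny mentions — with an immutable log of change events, the table’s state as of any past moment can be rebuilt by replaying events up to that timestamp. The event shape is invented for illustration.]

```go
package main

import "fmt"

// changeEvent is one entry in the immutable change log, ordered by time.
type changeEvent struct {
	ts    int64  // event time, e.g. unix seconds
	op    string // "upsert" or "delete"
	key   string
	value string
}

// stateAsOf replays the log, stopping at the requested timestamp.
func stateAsOf(log []changeEvent, asOf int64) map[string]string {
	state := map[string]string{}
	for _, e := range log {
		if e.ts > asOf {
			break // everything after the requested moment is ignored
		}
		if e.op == "delete" {
			delete(state, e.key)
		} else {
			state[e.key] = e.value
		}
	}
	return state
}

func main() {
	log := []changeEvent{
		{100, "upsert", "order:1", "pending"},
		{200, "upsert", "order:1", "shipped"},
		{300, "delete", "order:1", ""},
	}
	fmt.Println(stateAsOf(log, 250)) // map[order:1:shipped] -- "last Thursday"
	fmt.Println(stateAsOf(log, 350)) // map[] -- the current state
}
```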

Kostas Pardalis 42:27
Yeah, it makes total sense. And by the way, being able to time travel is especially important for many reasons — it’s a very common thing, especially with feature engineering, to be honest: being able to calculate a feature at different snapshots in time for the same user. It arises as a problem in many different use cases, which makes it super interesting. Alright, cool. So you mentioned CDC, and it was mentioned that with Kafka you have the ecosystem there, together with Debezium, right, from Red Hat. I always wondered why we haven’t seen more attempts in open source at something like Debezium, to be honest, or why these things are not even more standardized in a way. So I’d love to hear your thoughts there. First of all, do you use something like Debezium? How do you do it?

Johnny Graettinger 43:32
We wanted to. Yeah, we tried to, and we were dragged kicking and screaming into basically building our own implementation from soup to nuts. And there are a couple of reasons why. The biggest, probably, is that our focus is on being able to handle very large databases and do it efficiently — databases that are terabytes in size. One of the big challenges with Debezium, and a lot of naive implementations of change data capture from a database, is that it’s very easy to grab the ongoing feed of change events from your database, but it’s very hard to get that tightly married to the historical backfill of the existing rows in your table. Because your database is not keeping around an infinite write-ahead log where it’s got all its change events — it’s constantly compacting it — you really need to both process the write-ahead log and also query the table for all of the existing state, and it’s important to do it correctly as well. It’s very easy to implement this incorrectly, where you’re getting logically inconsistent results. The naive solution for change data capture — and certainly at the time we were starting and looking at this problem, that’s what Debezium was doing — is basically starting a write-ahead log reader but not actually reading it. You create this write-ahead log reader, and you’re effectively pinning the write-ahead log. Then you start scanning all the contents of the table that you want to backfill — you’re basically issuing a SELECT * for the table — and once you’ve read all of that, you begin reading the write-ahead log again. But when you start getting into larger databases, this has some major issues. First of all, you have to think about what happens if that SELECT * statement crashes partway through — you might need to reissue it a bunch of times and restart some of that work. But another, even bigger issue is that when you have created that write-ahead log reader but you’re not actually consuming from it, what the database is doing under the hood is basically saying: well, someone wants to read this, so I can’t delete these write-ahead log segments. And what commonly happens is that people’s disks fill up on their production database, which is not good. So one of the goals we had from very early on was: we want to do really incremental backfills, where we are always reading the write-ahead log and never putting the production database at risk, while also being capable of doing many multi-terabyte backfills. There’s a paper out of Netflix, actually, called DBLog, and we essentially looked at that, took some ideas from it, and ran with it.
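[Editor’s note: a compressed Go sketch of the DBLog-style idea Johnny describes — keep draining the write-ahead log so the database never pins WAL segments, and interleave small, resumable chunks of the table scan, reconciling scanned rows against changes already seen in the log. Real DBLog uses watermarks written into the log for this reconciliation; the key-tracking below is a simplification, not Estuary’s actual connector code.]

```go
package main

import "fmt"

type row struct {
	id  int
	val string
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Pretend table with existing rows to backfill.
	table := []row{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	// Pretend WAL: batches of changes arriving while the backfill runs.
	wal := [][]row{{{2, "b2"}}, {{4, "d2"}}}

	output := map[int]string{}
	seenInWAL := map[int]bool{} // keys already updated by the live log

	const chunk = 2
	for i, walBatch := 0, 0; i < len(table); i += chunk {
		// 1. Always drain the live log first, so WAL segments are never pinned.
		if walBatch < len(wal) {
			for _, r := range wal[walBatch] {
				output[r.id] = r.val
				seenInWAL[r.id] = true
			}
			walBatch++
		}
		// 2. Then scan one small, resumable chunk of the table. A scanned row
		//    must not clobber a newer value already seen in the log.
		for _, r := range table[i:minInt(i+chunk, len(table))] {
			if !seenInWAL[r.id] {
				output[r.id] = r.val
			}
		}
	}
	fmt.Println(output) // map[1:a 2:b2 3:c 4:d2]
}
```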

Kostas Pardalis 46:39
Yeah, that’s really interesting, because we had the author of that on the podcast at some point. And I think Debezium, in a more recent version, has implemented something similar with watermarking — or where you can do the snapshot in parallel — I think it was, again, inspired by the same publication. But that would have been my next question, because I’ve seen that in practice, and it is a problem with how CDC has been implemented, both in open source systems and in products like Fivetran, for example: you have these huge databases that you need to replicate the state from before you can start consuming the change log, and it’s always a hard problem to solve. So, okay: is this something that’s open source? Is it part of Estuary? Is it a separate project? How are the CDC parts packaged?

Johnny Graettinger 47:43
Yeah, so we model Estuary as essentially two things. There’s the platform itself, for running these data flows, and then there are the connectors, which are really applications that sit on top of the platform for talking to different systems — for doing things like CDC, and for doing materializations, which push data into systems and keep it up to date there, as well as transformations. So the fundamental architecture is that we have the platform, which facilitates data movement between these pieces, and then these applications, which are basically Docker images that speak a protocol over standard in and standard out for doing streaming captures and streaming materializations. The short answer is yes, we have these sources available on GitHub. We package them as Docker images that speak a capture protocol — we ended up, kicking and screaming again, designing our own protocol for captures and materializations of data. So that part is a little bit bespoke, but it’s a fairly accessible protocol, and they are available.
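[Editor’s note: a minimal Go sketch of the connector shape Johnny describes — a program, packaged as a Docker image, that speaks a JSON protocol over standard in and standard out. The message shapes here are invented for illustration; Estuary’s real capture protocol differs.]

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// request is a hypothetical message from the platform, e.g. "start capturing".
type request struct {
	Action string          `json:"action"`
	Config json.RawMessage `json:"config,omitempty"`
}

// document is a hypothetical captured record emitted back to the platform.
type document struct {
	Collection string `json:"collection"`
	Doc        any    `json:"doc"`
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := json.NewEncoder(os.Stdout)

	// The platform writes one JSON request per line on stdin; the connector
	// answers with JSON documents on stdout. Logs belong on stderr.
	for in.Scan() {
		var req request
		if err := json.Unmarshal(in.Bytes(), &req); err != nil {
			fmt.Fprintln(os.Stderr, "bad request:", err)
			continue
		}
		if req.Action == "capture" {
			// A real connector would connect to the source system here and
			// stream documents until told to stop; we emit one and move on.
			_ = out.Encode(document{Collection: "demo/rows", Doc: map[string]int{"x": 1}})
		}
	}
}
```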

Kostas Pardalis 48:52
Okay, that’s super interesting. Okay, one last question from me, and then I’ll give the microphone back to Eric. You mentioned materialization and stream processing. In the streaming landscape, there is also this new model of computation that’s based on timely and differential dataflow. I think the most prominent product using this kind of approach is Materialize. How is it different from using something like Estuary, right? If you can go and do, let’s say, these incremental materializations — is there space for both solutions? Are there different use cases? Help us understand. I’m asking these questions because I feel like the streaming landscape is always complex. There are some very heavy buzzwords, like differential dataflow, and people get confused — not everyone is, you know, a systems engineer who wants to get that deep into how things work, right? But it is part of how they have to interact with these systems, and it’s important to understand why they exist, why they are out there, and what they are trying to solve. So I’d love to hear from you whether there is some competitive element there, or whether they’re just complementary solutions in the end.

David Yaffe 50:28
Yeah, we actually work with Materialize a little bit — at least, we’re trying to do more and more. From a team standpoint, you know, Materialize does not focus on connectors; they focus on streaming SQL and making that a really first-class thing, and also trying to up-level it and make it exactly the same protocol as standard SQL. That’s not what we’re doing — that’s a very different problem from what we’re doing. We’re more focused on getting data from your systems, acting as the pipes that can get it into other places. So getting data into Materialize is one of the core things that we want to be doing with them. At a very high level, we don’t see ourselves as competitors at all. But I’m sure Johnny can go into technical detail on the differences between differential dataflow and our specific processing model for streaming SQL.

Johnny Graettinger 51:24
Yeah, my quip answer is: Materialize is a really cool system for streaming SQL-based compute and getting answers out of it, and Flow you can look at as a system that makes all of your other databases real time. In some respects, you can think of Materialize — and another company that comes to mind is RisingWave, which occupies a fairly similar position in the space — as streaming compute engines for doing streaming transformation based around SQL. From a product philosophy perspective, our view is that more is better; we want to meet users where they are, where they want to be, in terms of using whatever sort of streaming compute they would like for computing data products. Another example of this is a company called Bytewax, which has taken the differential dataflow library and wrapped it with Python bindings to make it something you can write Python programs around. We’d like to be able to offer that as a way of doing transformation using Flow. So I think there are a lot of interesting different solutions, and I think this stuff is going to play out for quite a while, because there is no silver bullet for streaming. It’s a legitimately very hard problem, and they’ve got some really interesting takes on how to solve it.

Kostas Pardalis 52:58
Yeah, it’s a great answer, to be honest. And I think it would be nice at some point to get together a few people from the stream processing space who are working on different ways of solving the problem and have a conversation all together, because I think it would be a service to the industry to help people understand why all these products and technologies are being built, and why, in the end, it’s so confusing — because there are good reasons. It’s not like anyone is sowing the confusion here; actually, it’s the opposite. It’s just a very hard problem. And even the term “real time” — the semantics of real time are so different depending on who you ask, right? That makes things hard, too. Anyway, I’d love to have this kind of panel discussion at some point. Maybe we should reach out to the folks at Materialize and try to get you all together to discuss these things. I think it would be super interesting. But I have to stop now. So, Eric, the microphone is yours again.

Eric Dodds 54:10
I’ll just follow up on that. I really appreciate that perspective. I think it’s really easy to look at similar technologies and, you know, just naturally there can be defensiveness, or disagreements about different ways of doing things. But at least from my perspective, I totally agree that more is better, right? I mean, this type of data flow is something that we want to be more broadly accepted by a much larger portion of the market. Which actually brings me to one of my final questions. David, you mentioned something earlier, and I’m interested in both of your perspectives on this. If you want to do real time, you have sort of this set of options. If you want to do sort of batch, you have sort of Fivetran, or the traditional batch ETL flow. What is keeping companies from just moving to this kind of, let’s say, streaming ETL, real-time ETL?

David Yaffe 55:25
Honestly, nothing anymore. In a lot of ways, we’re working with companies that don’t necessarily even have real-time problems — they’re just using us for analytics, because we’ve made it as simple as working with a batch system to execute a data flow. And that’s our goal, really: to take the gap away. Of course, things get more complicated when you start doing streaming transformations, sure. But if you’re just doing, effectively, a point-to-point, or even point-to-many-points, data transfer, they really are as easy as each other at this point. So I don’t think there are really massive differences in the UX there. There used to be. Most of the time, when you’re looking at a streaming system, you’re having to do stuff like managing schema evolution — if you really want to bridge the gap, you have to do stuff like that. That was a core problem for us that we wanted to make sure we tackled really well. But yeah, in general, I think those gaps are essentially closing.

Eric Dodds 56:31
Yeah. And one follow-up — actually, Johnny, hang on, I said I wanted both of your perspectives. So before I ask my follow-up: your thoughts?

Johnny Graettinger 56:40
Yeah, sorry — I was woolgathering there a little bit. So the question is basically, what is holding companies back from —

Eric Dodds 56:51
From doing streaming ETL in real time, you know. Because, I mean, batch is sort of the history of transferring data. That’s — yeah, okay, an aggressive oversimplification. But I have in my mind that there are multiple advantages to moving to more of a streaming architecture. But why aren’t more companies doing it? Maybe they are, and we don’t see it?

Johnny Graettinger 57:16
I think they are, and we don’t acknowledge it. If you are using a tool like Fivetran, and you’re using incremental syncs from your database to an analytics warehouse, and you’re running that, you know, every hour or something like that — guess what, you’re streaming. That is streaming: you are moving incremental data from one place to another and avoiding moving the whole dataset. So that is already streaming; it’s just a matter of what latency is involved. In some respects, what we’re talking about is just dramatically tighter latency while functionally doing the same thing. Yep. So in many respects, streaming is already happening, and people have kind of cobbled it into batch systems — whether that’s Fivetran or another tool where you’re doing an incremental, updating resolution of new data and moving it from one place to another. And then you have tools like dbt, where you’re building incremental models, where you’re using a cursor to figure out how to go update some downstream data product based on just the new data. So people are already doing this; it already exists in the ecosystem. In some respects, we’re just talking about making it faster and simpler, and tightening the latency that’s available.
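[Editor’s note: a toy Go sketch of the point Johnny is making — a cursor-based incremental sync is structurally the same loop as streaming, and only the sleep between rounds differs. All names are illustrative.]

```go
package main

import (
	"fmt"
	"time"
)

type record struct {
	updatedAt int64
	data      string
}

// fetchSince stands in for "SELECT ... WHERE updated_at > $cursor".
func fetchSince(src []record, cursor int64) []record {
	var out []record
	for _, r := range src {
		if r.updatedAt > cursor {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	source := []record{{1, "a"}, {2, "b"}, {3, "c"}}
	var cursor int64

	for round := 0; round < 2; round++ { // two rounds, standing in for "every hour"
		for _, r := range fetchSince(source, cursor) {
			fmt.Println("moved:", r.data)
			cursor = r.updatedAt // advance the cursor past what we've moved
		}
		// Batch "streaming" sleeps an hour between rounds; real streaming
		// sleeps approximately zero. The loop is otherwise identical.
		time.Sleep(10 * time.Millisecond)
		source = append(source, record{updatedAt: cursor + 1, data: "new"})
	}
}
```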

David Yaffe 58:40
One more add-on to that, if you don’t mind. I think, classically, people have been burned by streaming, right? You’ll talk to a whole group of practitioners who hear the word streaming and say, “Oh, no, I don’t want that.” That’s something that we found out when we first started the business. A lot of the time we won’t lead with streaming anymore; we’ll lead with incremental data, low latency. And that actually opens it up to a whole group of people who wouldn’t otherwise be interested, right? If you’re doing analysis, a lot of the time you’re going to think streaming is very expensive, because classically it used to be. It’s not anymore, but that’s something that we definitely heard a lot.

Eric Dodds 59:19
It’s an interesting point, Dave. I think there’s a lot of Kafka baggage there. And, you know, we talked about some of that earlier, where you face this severe trade-off that is actually almost like a DevOps problem that doesn’t even relate to data — like the difficulty of operating the systems. I’ve also heard of a lot of companies trying to move to event-driven architectures in terms of the way that they’re building their stuff, and that, you know, can take you down a rabbit hole. One question I do have, though — and Brooks, I’m sorry, I’m going long, but this is me punishing Kostas by asking more interesting questions that he’s not here to enjoy. Kostas had to drop off for another meeting; for our listeners, that’s the context. But how does this impact orchestration? Right? Because when you think about streaming, one of the interesting things, when you come from a world where, let’s say, you’re running multiple batch jobs, you have multiple downstream dependencies. And just because of the nature of data, you require an orchestration layer, because these jobs take different amounts of time to run, they need to be sequenced differently, and the downstream dependencies, obviously, are impacted. But when you move to a real-time streaming setup, at least on the surface, it seems like it solves some of the chronological orchestration problems.

David Yaffe 1:00:51
Yeah, for sure. I get asked routinely, when I’m demoing our product, “how do I orchestrate this?” And the answer is: you don’t, right? It’s your data — when it’s available, it goes there. But yeah, it’s something that takes a little while to understand. And it’s also complex when it comes to transformations, because if you’re thinking about transformations in terms of events, it’s a little bit of a different thought process than thinking in terms of whole datasets that you’re transforming.

Eric Dodds 1:01:23
Makes total sense, Johnny, anything to add on the orchestration side?

Johnny Graettinger 1:01:27
Yeah, that’s fundamentally right. It’s not necessarily either-or. In our world and product, you don’t orchestrate it, as Dave said. Users will sometimes use us to do some pre-transformation to reduce their costs within the analytics warehouse — they still want to work with it in Snowflake, but they’re using us to do some deduplication, or to pre-join some data that they commonly want to access in a joined fashion. And then they might process it from there downstream with dbt. So you can marry these together, and for quite good reason.

Eric Dodds 1:02:03
Yep. Makes total sense. Okay, last question. Sorry, for going long Brooks. Where did the name estuary come from? I mean, I know from a geographical standpoint, what it means but is there any relation?

Johnny Graettinger 1:02:18
Yeah, fundamentally, you know, we’re in streaming data, though we’ve kind of de-emphasized it and no longer lead with streaming quite so much. But evocatively, an estuary is where streams meet — it’s where the ocean and rivers are meeting together. So there’s some evocative imagery that we’re going after there. Otherwise, it’s a name.

David Yaffe 1:02:46
I love it. Yeah — real-time data, and you have your data lake where it’s all dumping. We are effectively a real-time data lake; that’s what Gazette enables. So that’s what the idea was. But yeah, it’s lost on a lot of people.

Eric Dodds 1:03:01
I like it. It’s calming, you know — an estuary seems like a place you want to be. I like it. Cool. Well, thank you for letting me throw a couple of additional questions at you. You satisfied my curiosity around orchestration. And we’d love to have you on a streaming panel and have you back on the show.

David Yaffe 1:03:18
Awesome. Thanks for having us. Thanks for having us.

Eric Dodds 1:03:21
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.