Episode 06:

Data Council Week: All About Debezium and Change Data Capture with Gunnar Morling of Decodable

April 27, 2023

This week on The Data Stack Show, we have a special edition as we recorded a series of bonus episodes live at Data Council in Austin, Texas. In this episode, Brooks and Kostas chat with Gunnar Morling, Senior Staff Software Engineer at Decodable. During the episode, Gunnar talks about his time at Red Hat, including spearheading the Debezium project, a platform for change data capture (CDC). Topics in this conversation include the patterns and processes of Debezium, how to build a diverse system while incorporating common interfaces, open-source CDC projects, and more.

Notes:

Highlights from this week’s conversation include:

  • Gunnar’s background in data (0:32)
  • Setting the vision in early days of Red Hat and spearheading Debezium (6:20)
  • Replication of data in Debezium (9:47)
  • The patterns and processes of Debezium (16:21)
  • Debezium working with Kafka (19:03)
  • Building a diverse system while incorporating common interfaces (24:09)
  • The importance of documentation in open-source projects (27:59)
  • Debezium’s vision moving forward (31:32)
  • Why aren’t there more open-source CDC solutions? (34:35)
  • Connecting with Gunnar (37:27)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Brooks Patterson 00:23
All right, we’re back here at Data Council. If you’re following along, you already know Eric couldn’t make it to the conference. So I’m Brooks, filling in for Eric while he is out, and we’ve got Kostas here, obviously, as well. And we just sat down with Gunnar Morling. He’s a Senior Staff Software Engineer at Decodable, and I’m extremely excited to chat with him today. We chatted with Eric, the CEO, earlier this week, so we’re excited to kind of continue digging in. And to kick us off, well, I guess, welcome to the show. Awesome. Yeah. Thank you so much. Yeah, absolutely. I would love to just hear about your background, where you started, and kind of your path to where you are today and Decodable.

Gunnar Morling 01:08
Oh, yeah, sure. Let’s do this. So yes, I mean, I joined Decodable just a few months back, in November last year. So it’s still pretty new for me. Before that, I had been at Red Hat for 10 years, exactly up to the day. And you could explain my tenure at Red Hat as divided into two parts. For the last five years I was working on Debezium, which is a tool for change data capture. I was the project lead for Debezium, and I guess we will talk about that in more depth. So that’s what I did for the last five years. And before that, I was working on different parts of the Hibernate project, in parallel. So I was mostly working on Bean Validation; I led the Bean Validation 2.0 specification. We also had this fancy exploration of taking Hibernate and object-relational mapping and applying it to NoSQL stores. That’s an interesting episode; as I’m saying this, the project doesn’t exist any longer, but you know, I learned lots of stuff back then. So yeah, that’s what I did before joining Decodable, pretty much. Cool.

Brooks Patterson 02:11
Tell us a little bit about what drove your decision to leave Red Hat after 10 years, to the day? That’s amazing. From what it sounds like when we were talking before we started recording here, there were some personal reasons, but also some of your technical interests kind of drove you to the next thing.

Gunnar Morling 02:28
Right. Right, exactly. It’s a combination of the two things. And, you know, I really enjoyed my time at Red Hat. It’s a great company, I still have many friends there. So it’s a place I would recommend, really, to everyone to work. So it’s a great company. Why I left was, well, I was working on Debezium for approximately five years, and I felt, you know, I wanted to do something new again. Not because I didn’t like the project any longer; I still think it’s an exciting project, there’s tons of applications for it, but for me it was a bit, okay, I’ve been working on it for quite some time, and I want to do something new. So this was one motivation. And then, well, Debezium is about ingesting change events out of databases, like Postgres or MySQL, into Kafka, typically. And of course, people don’t do this just for its own sake, right? They don’t want data just in Kafka; they want to take it elsewhere. They want to take their data into Snowflake, into Elasticsearch, into other databases, to do query use cases, to do full-text search. And Debezium itself didn’t have, or doesn’t have, a good answer to that, and it’s just not the scope of the project. It’s a CDC platform; it doesn’t concern itself with taking data out of Kafka into other platforms. But still, you need this as part of your overall data platform. And I felt, okay, I want to look at this data journey, let’s say, really end to end. There’s the Debezium part, which takes data into Kafka, but I also want to help people with taking the data out of Kafka and into other systems. And this is exactly what Decodable does, amongst other things. So this was interesting to me. And then of course, there’s this notion of processing data, right? Like filtering data, changing types of data, changing date formats, this kind of stuff, doing stateful operations like joins and aggregations, and so on, which all is done by Flink. And while I was recommending to people all the time, when they came to me and asked about this, to use something like Flink or Kafka Streams, because it allows you to do those things, at Decodable I now have the opportunity to work on this and, well, provide people with a platform which does all that in a managed way. So that’s it in terms of, you know, the work motivation. And then there’s the other driver. I was at Red Hat for 10 years, and the company grew a lot over that time. When I left it was more than 20,000 employees, and when I started, I believe it was around 3,000, so it grew almost 7x. And this changes things, you know; the processes changed. It is what it is; I mean, it’s not a startup, obviously. I wanted to have this experience, I wanted to go to a small place and see, okay, how is it when everybody really is pulling in one direction? You make quick decisions, you see, okay, this flies or this doesn’t fly, and you don’t, you know, think about stuff for six months only to realize it doesn’t fly. So that’s why I wanted to go to a small company.

Brooks Patterson 05:35
Makes sense. Cool. Kostas, I know you’re excited to dig into Debezium, so take it away.

Kostas Pardalis 05:42
Yeah, sure. So how did you start working on Debezium?

Gunnar Morling 05:47
Right? Yes, that’s a good question. And it kind of was a coincidence, I have to say, if I want to be totally honest. Because, as I mentioned, I was working on Hibernate stuff before, and again, I feel like I have this five-year attention span; after five years, it feels like I need to do something new. And it was back then when I was looking for something new. I didn’t feel like leaving Red Hat; I wanted to explore something new within Red Hat. And then the original founder (I’m not the founder of Debezium) left Red Hat. So Randall, who had created the project, he went to Confluent. And so this project lead role was open. And yeah, you know, I was there, there was an open project lead role, so things came together and I picked it up.

Kostas Pardalis 06:42
No, that’s great. And let’s go through a little bit of the history of Debezium, right? Like, when was Debezium first

Gunnar Morling 06:51
Published, right. So I took over in 2017, and by then, I believe, it was roughly one year old. So it started in 2016. And it was a very small team back then. There was Randall, the original project lead, and there was another engineer, Horia. So the two worked on the project, and, you know, the pitch did really well at Red Hat; they made a very good case why Red Hat should sponsor this project. And then, well, things happen: Randall left, and independently Horia, the other engineer, left the next month. So we went from two people who knew about the project, who could work on it, to having nobody. This was, of course, a challenge. And then I came in, and thankfully Randall, I mean, he left, but you know, I had his email, and he was very helpful, so I could reach out to him. So I came in, and then another engineer, a good friend of mine, also came in, so it was the two of us who ran the project. And then it was a little bit like a startup, even within Red Hat, because back then there wasn’t even a product around it; it was just a plain upstream community project. So we worked on new features and on connectors, and we worked on creating brand awareness. So I went to conferences, did blog posts, told people how to use it, why you would use it, all this kind of stuff. And of course we constantly made the case for getting new engineers onto the project. I believe after a year or so we got another engineer, actually somebody from the Hibernate team I could convince to move over. And then, you know, we grew, and at some point it actually became part of the Red Hat product portfolio. What they usually do there is this duality of having an upstream community project, like Debezium, and a commercially supported product offering, where customers can get a subscription and they get, you know, support for that. So this happened at some point. And of course this took it to the next level, right: a support organization, proper professional documentation, product management, all this kind of stuff. So then it really took off. Yeah. 100%.

Kostas Pardalis 09:08
And what was the initial motivation behind starting Debezium? Like, why did these two folks decide to start building Debezium?

Gunnar Morling 09:19
That’s a very good question. So Randall was working, actually, and we touched on this a little bit earlier, on a Red Hat project back then which was called Teiid, which was a data virtualization product. You know, it’s like a federated query engine. And he was working on that originally. And I believe he realized the need, because they had the notion of materialized views in Teiid, to have some sort of trigger for updating those views. And I believe this was his core motivation. And then he started it, and yeah, it was quite well received quite quickly. Actually, a community formed around it, people started to use it, they had tons of use cases for it. And it was, yeah, quite popular from the beginning. Yeah,

Kostas Pardalis 10:09
absolutely. And what are, like, the most common use cases around Debezium? Like, how do people use it?

Gunnar Morling 10:19
Alright. So I would say the biggest category of use cases is what you could call replication in the widest sense. So taking data out of an operational database, as soon as it changes, into other kinds of data stores. Typically, because you have specific requirements, you want to take your data into a data warehouse like Snowflake, so you can do analytics, or into Apache Pinot, so you can do real-time analytics and show dashboards. Or maybe you want to take your data from, I don’t know, maybe a commercial database in production, like Oracle, which has, you know, some licensing implications to it; maybe you want to have a copy of the data in Postgres, so you can query it on the side and do some sort of testing. So you want to take this data across database vendor boundaries. So that’s, you know, that’s all replication. And I would say something like updating a search index, I would also consider that a replication use case. Because if you want to do full-text search on your data, typically you cannot do it as well on a relational database; rather, you want to use something like Elasticsearch or OpenSearch. And you want, of course, this index to be fresh, to be up to date, so that when you do your full-text search, it gives you current search results. So that’s something which people do a lot: feeding search indexes, and updating or invalidating caches, if they have, like, you know, cached versions of the data, to invalidate that after the data changes. This comes up a lot. And then I would say there’s another big category in the context of microservices. So propagating data, exchanging data between microservices, maybe moving from a monolith to microservices, things like that. There are different patterns in that space, like the outbox pattern or the strangler fig pattern, which people use, and people, you know, benefit, then again, from CDC for those kinds of things.

Kostas Pardalis 12:16
Makes sense. So why would someone, let’s take the replication use case, right, why would someone like to use a CDC pattern to do the replication instead of going and executing SQL queries, right? Get the updates, pull them out, and then replicate to the other side? Or, there’s another option there: just create a replica, right? Have, let’s say, a replica of Postgres and use that as a read-only replica where you go and, like, do your analytics, right?

Gunnar Morling 12:49
So I mean, you totally can do all those things, and I would recommend doing them when they make sense. So if you have the replica set up, I mean, I totally can see where you want to have read replicas of your Postgres database, so you can have replicas closer to the user. And in particular, if your query requirements are satisfied by Postgres, that sounds very reasonable to do, right? So I’m all for that. But sometimes you have other query requirements, right? You don’t want to run this kind of query in Postgres; you want to run it maybe in a data warehouse, or maybe in a search index, or maybe you have a use case where you benefit from graph queries, like, you know, Neo4j Cypher queries, this kind of stuff. So you bridge, essentially, the kinds of databases, and this is where you would use this rather than the built-in replication mechanisms. Also, if you want to just cross vendor boundaries, right, so if you want to go from Oracle to Postgres, that would make sense. I hope that makes sense in terms of the replication. Now, you ask, why wouldn’t I just go and query for changed data, right? And that also can be a valid approach. I always differentiate between log-based CDC, which is what Debezium does, and we can dive into what that means, versus query-based CDC, which is what you describe. And there are a few key differences between them. One of them is, well, if you do this query-based approach, how often do you run this query, and what does it mean for your data freshness? If you run this query every hour, well, your data might be stale for one hour, right, which, again, depending on the use case, may or may not be acceptable, versus the log-based CDC approach, which gives you a very low latency: two- to three-digit milliseconds, maybe seconds, end to end. I just mentioned in my talk there are Debezium users who take data from their operational MySQL clusters into Google BigQuery for analytical purposes, and they have end-to-end latency within less than two seconds. So, you know, there is really fresh data in there because of this system. Now, you could run this query every two seconds on your MySQL database, but it probably would kill the performance of the database; it would be too much overhead. And still, no matter how often you were to run this query, you would not be sure whether you missed any changes between two of those polling runs. In the extreme case, something gets inserted and something gets deleted within two seconds, even if you were to poll every two seconds, and then you would never know about this record, right? Depending on what you want to do, this might not be good enough. Maybe you want to use this for building an audit log about all your data, which is another use case; this must be complete, right? So you really want to be sure you see all the updates. So that’s one implication of the query-based approach. Also, you need to define how to actually identify your changed records. So you need to have some sort of column which tells you, okay, that’s the last-update timestamp. So it’s a bit invasive on how you model your data schema, whereas the log-based approach can capture changes from any table without this impact. And lastly, yes, deleting data, that’s another interesting thing. If you delete something from a table, you cannot get it with a polling-based approach, right? Because it’s just gone.
Unless you were to do something like a soft delete, exactly. With the log-based approach, this goes to the transaction log of the database, and all the changes are appended to the transaction log; a delete is appended to this log as well, and then Debezium will be able to get it from there.
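To make the contrast concrete, here is a minimal sketch of the query-based polling approach described above, in Java with plain JDBC. The table and column names (orders, updated_at), the connection details, and the one-minute interval are all hypothetical, chosen just to show the shape of the approach and its blind spots:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

// Query-based CDC sketch: poll for rows whose last-update timestamp is
// newer than a high-water mark, then advance the mark.
public class PollingCdcSketch {
    public static void main(String[] args) throws Exception {
        Instant highWaterMark = Instant.EPOCH;
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/appdb", "app", "secret")) {
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, status, updated_at FROM orders "
                                + "WHERE updated_at > ? ORDER BY updated_at")) {
                    ps.setTimestamp(1, Timestamp.from(highWaterMark));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // Emit the changed row downstream (stdout here).
                            System.out.printf("row %d is now %s%n",
                                    rs.getLong("id"), rs.getString("status"));
                            highWaterMark = rs.getTimestamp("updated_at").toInstant();
                        }
                    }
                }
                // The blind spots from the conversation: a row inserted and
                // deleted between two polls is never seen, hard deletes leave
                // no trace at all, and a shorter interval just shifts query
                // load onto the operational database.
                Thread.sleep(60_000);
            }
        }
    }
}
```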

Kostas Pardalis 16:45
Yeah. So, okay, let’s say that Debezium is a system that turns state into a stream of changes, right? And that’s what we propagate when we’re using something like Debezium. How do we go back to the state again? Because we have to recreate the state, right? And I know that this is not something Debezium does, right? But it is part of the workflow, right? Like, on the other side, someone needs to go there and be like, okay, what’s the current state the source database has? So what does this process look like? What kind of patterns have you seen there? How hard is it?

Gunnar Morling 17:24
Right, right, right. Yes. So how does that work? I would say this depends a little bit on the way you use Debezium and how you deploy it. I already mentioned this, but Debezium, at least initially, was very closely associated with Apache Kafka, so people used Kafka. There is a side project of Kafka which is called Kafka Connect, which is a runtime and development framework for connectors, and Debezium still is based on Kafka Connect; Kafka Connect will run the Debezium connectors for taking data into Kafka. And if you do this, and this is one of the ways you could use Debezium, then you would use a sink connector for Kafka Connect. There’s a very rich ecosystem of connectors. So you would use, say, a JDBC sink connector, which subscribes to those topics, maybe applies some sort of transformation, and puts the data into a sink database. So that’s what you would do with Kafka and Kafka Connect. Now, there are other ways you could use Debezium, and one which is very interesting is what we call the embedded engine. In that case, you use it as a Java library within your JVM-based application. And this gives you very much flexibility in terms of how you want to react to those change events. Essentially, it’s just a callback method: whenever a change event comes in, this callback method will be invoked, and you can do with those change events whatever you want. And this is what integrators of Debezium into other platforms typically use. One example would be Apache Flink: there is a side project of Flink which is called Flink CDC. They take the Debezium connectors and other CDC connectors to ingest change events directly into Flink, so you don’t need to run Kafka. It’s just, you know, in-process in Flink: you ingest those changes, and then you can go about processing the data there.
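For a feel of the callback style described here, below is a minimal sketch of the embedded engine, assuming Debezium 2.x, a Postgres source, and file-based offset storage; the connection values are placeholders:

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedEngineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "embedded-engine");
        props.setProperty("connector.class",
                "io.debezium.connector.postgresql.PostgresConnector");
        // Where the engine remembers how far it has read in the log.
        props.setProperty("offset.storage",
                "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "appdb");
        props.setProperty("topic.prefix", "app");

        // The callback: invoked once per change event, no Kafka involved.
        DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        .notifying(event -> System.out.println(event.value()))
                        .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // runs until engine.close() is called
    }
}
```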

Kostas Pardalis 19:27
Okay. And, oh, that’s interesting. So, like, what if you don’t have Kafka, right? How do you handle the delivery semantics? Because one of the things that you get with Kafka is that you have some very specific, strong guarantees, right, about the events. And even if, downstream, let’s say, your consumer fails or whatever, the data is going to remain in the topic, right? And most importantly, you’re not going to put any pressure on the source database. Right. Because, like, okay, I think I remember at some point I started, like, playing around with Debezium, and I set up Postgres, enabled, like, logical replication, had my slots, blah, blah, blah. I stopped consuming, and it ran

Gunnar Morling 20:21
out of disk space? Yes.

Kostas Pardalis 20:25
I was like, wow, I’m not even using the database. And the reason is, like, because there’s a log there that someone needs to consume so it gets freed, right? And that’s a very important thing operationally. Yes. You don’t want your production database to run out of space in any way. So how does it work with, like, no Kafka in between, right?

Gunnar Morling 20:50
Right. So yes, that’s a specific challenge. In this case it’s particularly caused, by the way, by how RDS productizes Postgres. So what happens there is, you can have a Postgres database on, like, one physical host, and there can be multiple logical databases on the same physical host. And the transaction log is shared across all those logical databases; there’s one transaction log for the entire Postgres instance. But those replication slots which you mentioned, which are essentially the handle for a connector like Debezium to go there and extract changes from the log, are specific to one of those logical databases. And now, what’s happening on RDS in particular is they have, like, an internal database which you cannot access, for administrative purposes or whatever, and in there they do, like, heartbeat changes every five minutes or so. So there’s a number of changes which happen in a database which you cannot access; maybe you don’t even know about it, to be honest. And now you come and set up your Debezium connector on another database, like your own logical database, and you want to stream changes there. And as long as there are changes coming in, this is all fine; it will just work. The challenge, and I suppose this is the situation you ran into, is if you don’t receive any changes for your own database, or if you just stop your connector, then this replication slot which you set up cannot progress. It cannot go to the database and say, this consumer has made progress up to this point, so you can discard any previous portions of the transaction log. That’s a common stumbling stone. And what you can do there is, well, if you just have natural traffic in your own database, then it will be fine: the connector progresses, and every now and then it will go to the database and say, okay, this is the offset I acknowledge, and the database is free to discard any older log state. And if you need to account for the situation that your connector is, you know, not receiving changes for quite some time, then the Debezium connector can actually induce some artificial changes into the database itself. So you can configure, like, a specific statement you want to run, just like a heartbeat, every few minutes, and this will solve this particular problem.
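In configuration terms, the heartbeat described here corresponds to two connector properties. A minimal sketch follows; the debezium_heartbeat table is hypothetical, a small single-row table you would create so the heartbeat statement has something cheap to touch:

```java
import java.util.Properties;

// Heartbeat settings for a Debezium Postgres connector, per the discussion
// above. These are added to the same Properties object used to configure
// the connector (embedded engine and Kafka Connect alike).
public class HeartbeatConfigSketch {
    static Properties heartbeatProps() {
        Properties props = new Properties();
        // Emit a heartbeat, and thereby confirm offsets, every 5 minutes.
        props.setProperty("heartbeat.interval.ms", "300000");
        // Induce an artificial change in our own logical database so the
        // replication slot can advance even with no application traffic.
        props.setProperty("heartbeat.action.query",
                "UPDATE debezium_heartbeat SET last_beat = now()");
        return props;
    }
}
```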

Kostas Pardalis 23:20
Okay, that makes a lot of sense. But still, how do you deal with delivery semantics when Kafka is not there? Right, right.

Gunnar Morling 23:29
Yes, I mean, then it depends quite a bit on the specific connectors you use, for instance, with Flink. So let’s say on the sink side in Flink, with no Kafka there, if you use a transactional sink, you could even do exactly-once semantics, actually, because this would be like a transactional writer, and Flink would make sure that, you know, if you crash or whatever, no events would be duplicated or sent out another time. But if you use a non-transactional sink, then you probably would have, again, at-least-once semantics, which means it could happen, in particular if there’s, like, an unclean shutdown, that you would see some events another time. Consumers in that scenario need to be idempotent, to be ready to receive a change event another time and then, like, discard it. Yeah, this kind of scenario.
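As a rough illustration of the idempotent consumer mentioned above, here is a minimal sketch. In practice, the set of seen event IDs would live in the sink store (for example, a processed-events table updated in the same transaction as the write) rather than in memory:

```java
import java.util.HashSet;
import java.util.Set;

// Idempotent handling for at-least-once delivery: remember which change
// events have already been applied and silently drop redeliveries.
public class IdempotentConsumerSketch {
    private final Set<String> appliedEventIds = new HashSet<>();

    public void onEvent(String eventId, Runnable applyToSink) {
        // add() returns false if the ID was already present, i.e. this is a
        // duplicate delivered again after an unclean shutdown or restart.
        if (appliedEventIds.add(eventId)) {
            applyToSink.run();
        }
    }
}
```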

Kostas Pardalis 24:38
Yeah, makes a lot of sense. So, okay, Debezium is a kind of middleware, right? Like, it is a piece of software that needs to interoperate, right, with at least, like, a couple of different database systems as its input. And, okay, replication was always, like, a very esoteric thing. Yes. I remember, for example, replication using, like, the binlog in MySQL was available for a long time. Yeah. But Postgres got logical replication

Gunnar Morling 25:10
Only later, yeah.

Kostas Pardalis 25:12
It’s been much more recent. Yes. Before that, they had, like, the binary replication thing, and, okay, like, how do you interpret this, blah, blah, blah. So how do you build a system that needs to interoperate, in a common way, and create a common interface, with such a diverse and esoteric

Gunnar Morling 25:36
part of the database? Right, right. Yes, I mean, it’s a challenge, right? It’s exactly the challenge which the Debezium team has: all those databases have different interfaces, different formats, different APIs for how you would go and extract changes. For instance, in the case of the Cassandra connector, it actually reads the log files from the database machine; in the case of Postgres, it’s the logical replication mechanism; in the case of Oracle, you query some specific views. So it’s different for all the databases. And yes, it requires dedicated engineering for each of those connectors. It requires at least one engineer on the team to dive into the specifics of that particular connector, to make sure we understand how this works, to understand all the subtleties, stuff like what you described on RDS. This takes a while to figure out, right? And to realize, okay, this is the situation and this is how we can mitigate it. So it’s, I would say, not trivial to do. And this is also why the team is always conscious about adding more connectors, because it means a maintenance effort. You need to have the people around to understand and work with this codebase. And in particular, it just happens every now and then that people from the community come and suggest they would provide a new connector. They say, hey, there’s this database, and we want to build a Debezium connector for it; would you be interested in making this part of the Debezium project? And on first thought that sounds interesting: you would think, hey, we can get support for a new database. But, and it’s a mistake which, you know, you can easily make, you need to think about the maintenance implications of this. Those people, who usually have very good intentions in providing you with this connector, will they be able to maintain it, at least for some time, right? I mean, I realize nobody can tell what’s happening five years from now, so I wouldn’t expect anybody to promise they will maintain this for five years. But there must be some common understanding, if somebody contributes a connector, that at least for a reasonable amount of time they are around and work on this connector, maintaining it. For instance, this recently just happened with Google Cloud Spanner, which is something I’m really excited about: the team at Google who work on Cloud Spanner, their distributed SQL database, decided to build a CDC connector based on the Debezium framework, and they published it as an open source project under the Debezium umbrella. So for me, it’s just very interesting to see how this has become sort of the de facto standard in this space. Yeah.

Kostas Pardalis 28:24
I’ve seen that problem, like, with Trino.

Gunnar Morling 28:29
Yes, yeah. You have connectors. It’s the

Kostas Pardalis 28:31
same thing. Yeah. And what I think many people, and especially, like, new people in the open source community, don’t realize is that when you have a piece of software that is really used, like, in critical parts of many very large infrastructures out there, you can’t just decommission this thing. Yes. It’s like, when something gets into the codebase, digging it

Gunnar Morling 28:58
out is painful. Yeah, it hurts. Exactly.

Kostas Pardalis 29:02
So if there’s no plan to maintain, like, a connector, it’s hard for, okay, an open source project that always has, like, limited resources, right? Like, even if you have a company behind you, there are going to be limited resources.

Gunnar Morling 29:16
Obviously, different priorities. Yeah. And

Kostas Pardalis 29:20
without commitments, yeah. So, like, it makes a lot of sense for the committers, like the maintainers of this project, to be saying no. Yeah, exactly. Because at the end, making, like, the wrong choice can really hurt the project and, like, the community at the end. Yeah,

Gunnar Morling 29:39
totally. And I mean, it also comes down to things like testing, right? So something like Cloud Spanner, I mean, you need to have an instance of that which you can run your test suite against. For Postgres or all those open source databases, I can run them locally. Yeah, in Docker, so that’s not a problem. But for that, you know, the Debezium project can’t really pay for a Cloud Spanner testing infrastructure; they need to provide it. So this also needs to be part of the continuous integration. Yeah.

Kostas Pardalis 30:06
That’s yet another example, like, with Trino. Like, yeah, we have a connector for BigQuery, or for Redshift, right? How do you test it every time you run? You need, like, a Redshift cluster to run against, right? Yeah. And open source projects are not exactly, you know, making money. So it is hard; like, the operations around, like, what we in a company take for granted in terms of, like, CI/CD or testing infrastructure, all that stuff, are at least an order of magnitude harder, like, in an open source project. And people should understand: it’s not like people don’t appreciate your efforts to contribute. Yeah. But contribution to the project is more than the

Gunnar Morling 30:56
code. Exactly. Yeah, totally, totally. It’s

Kostas Pardalis 30:58
maintaining it. Like, it’s more than the engineering part of writing the code that’s, like, important. That’s, like, a topic on its own. Like, at some point, it would be nice to get together some, like, people from open source to talk about

Gunnar Morling 31:10
that. Yeah, totally. I mean, there’s also the question of documentation, right? And helping people. You know, let’s say there’s an esoteric database; people use it and they run into issues. Who helps them with that? Because the core team likely cannot be experts on a gazillion different databases.

Kostas Pardalis 31:31
So, okay, Debezium has been around for a while, and, as you said, it’s becoming, let’s say, like a specification, in a way.

Gunnar Morling 31:41
Yeah, I mean, people rely on the format. Flink exposes a change event format. ScyllaDB, as another example, implemented a CDC connector, which they maintain on their end, but it’s, again, based on the Debezium framework, and they also expose the same change event format.

Kostas Pardalis 31:58
Yeah. At the same time, I have to say, as a user, that it still feels like Debezium is a piece of software that’s tightly integrated with the Kafka ecosystem. Yes. And, obviously, I don’t have any problem with Kafka. But what I want to ask is: what’s the future of Debezium? Like, what’s the vision there? How do you see the project moving forward? Because it seems that it’s becoming more and more important, like, more horizontal, right? So how do you handle that? How do you move forward?

Gunnar Morling 32:34
Right? Yeah, I mean, just to reiterate, I’m not the current project lead anymore, right? So I don’t make the roadmap. But I would say there is an effort to evolve more into an end-to-end platform. One thing which the team works on right now is actually a JDBC sink connector, because right now you would use Debezium together with JDBC sink connectors from other vendors, which, you know, you need to configure in the right way to make sure they’re all compatible. So having a Debezium JDBC sink connector definitely will help people use it more easily and set up end-to-end flows based on Kafka. Now, you mentioned it’s tied to Kafka. That’s true; well, you know, there’s a strong Kafka association. But then there actually is a component, and I think this will gain in importance, which is called Debezium Server. And Debezium Server, in terms of your overall architecture, has the same role as Kafka Connect: it’s the runtime for the connectors, but it gives you connectivity with Kinesis, Google Cloud Pub/Sub, Apache Pulsar, Pravega, all kinds of data streaming platforms. So people also can use Debezium with things other than Kafka, right? Because, I mean, I like Kafka as well, I love it, but, you know, maybe people are deeply into the AWS ecosystem, they use Kinesis, and they still should be able to benefit from CDC. So that always was the mission: we want to give you the best CDC platform, no matter which data streaming platform you’re on. So that’s something which is happening. Then, of course, what I also observe is there are actually quite a few services that take Debezium and provide managed service offerings around it. Red Hat is doing that; you know, in addition to the classical on-prem product, there’s a managed connector service. But then, of course, it’s also startups like Decodable, and there are a few others, who take this and provide you with a very cohesive end-to-end platform, which also adds the notion of processing to it, which makes it very easy to use, I feel. So I think, I mean, that’s not so much about Debezium the core project, but I feel like more and more people are going to consume Debezium that way, you know, and run it like that, because then they don’t have to operate it

Kostas Pardalis 35:00
themselves. Yeah, yeah. 100%. Alright, one last question for me about Debezium. So CDC has been around as, like, a pattern for quite a while, right? But based on my experience, outside of Debezium I haven’t seen much, like, any other open source projects that can be used in some kind of production. Why do you think that this is the case? And do you think that this is going to change? Like, do you think we’re going to see more implementations?

Gunnar Morling 35:35
So I would say, I mean, there do exist other open source CDC solutions, but usually for particular databases. For instance, there is Maxwell’s Daemon from the Zendesk team, which is a CDC solution for MySQL, just for that database, and I suppose there are others for Postgres and for other particular databases. Right now, I’m not aware of any open source CDC solution which really has this intention to be a CDC platform for all kinds of databases, all the popular databases. So I don’t see anything coming up like this, or I’m not aware of anything, let’s say. There are a few interesting developments, though. For instance, Netflix has their own internal CDC solution, and at some point they kind of indicated they would open source it, but so far they haven’t followed up on this, and this has been quite a while ago. So I don’t think it’s going to change now. But what they actually did, and this is why I’m mentioning this, is they published a research paper about their snapshotting algorithm. So this is about taking an initial snapshot of your data and putting it into your streaming platform. And they wrote this paper to describe this, which is very interesting, because, you know, it allows you to re-bootstrap tables, to run snapshots of multiple tables in parallel, this kind of stuff. And Debezium actually implemented this solution from the Netflix guys. So yeah, that’s where I see it going right now.

Kostas Pardalis 37:05
Okay. That’s awesome. I hope to have you back on the show anytime really soon. And I would love to do something more, to be honest, with people who are, let’s say, veterans in open source, to communicate some of these things that people maybe don’t understand right now, or find a little bit obscure, or sometimes even a little bit rude. But I think it’s going to be very useful for anyone who is considering contributing.

Gunnar Morling 37:37
It would be my pleasure, for sure.

Brooks Patterson 37:38
Let’s do that. Yeah, I’ll work on putting that together. That’d be great. Gunnar, yeah, thanks so much. This has been a fascinating conversation. You’re clearly the authority on Debezium and very active online. If folks want to follow along, how can they connect with you?

Gunnar Morling 37:58
So they can follow me on Twitter; I’m @gunnarmorling on Twitter. I’m on LinkedIn; it’s my name there on LinkedIn. And they also can write to gunnar@decodable.co if they want to talk about Flink and Decodable, maybe. So yeah, different ways to reach out to

Brooks Patterson 38:17
me. Cool. And what about Decodable, if I want to learn more?

Gunnar Morling 38:22
If you want to learn more about Decodable, you totally should go to the Decodable website. There is a free tier, a free trial, which you can use to get your hands onto the product. You also could go to our YouTube channel; there are a few interesting recordings there. And, kind of a sneak preview: I’m going to do a new series on our YouTube channel called Data Streaming Quick Tips. So you also can watch out for that on the Decodable YouTube channel.

Brooks Patterson 38:49
Awesome. Cool. Well, look up Gunnar on Twitter, and then gunnar@decodable.co, that’s decodable dot C-O, if you want to reach out by email. And then check out the YouTube channel; it sounds like some exciting things are coming up.

Gunnar Morling 39:12
Awesome. Yeah, totally. Thank you so much.

Brooks Patterson 39:14
Yeah, thanks for coming on the show. My pleasure. Thanks.

Eric Dodds 39:18
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.