On this week’s episode of The Data Stack Show, Eric and Kostas have Meroxa back on the show, this time talking with co-founder and CTO Ali Hamidi and developer advocate Taron Foxworth. Together they discuss uses and implementations of change data capture, formulating open CDC guidelines, and debate the use of reverse ETL.
Highlights from this week’s episode include:
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack the CDP for developers. You can learn more at RudderStack.com.
Eric Dodds 00:23
All right, welcome back. We are doing something on this episode that we love, which is talking with companies we talked with a long time ago. In the podcast world, a long time ago is maybe six months or so, which for us is about a season. So one of our early conversations in season one was with a company called Meroxa, and they have built some really interesting tooling around CDC. We talked with DeVaris, one of the founders. And today we get to talk with the other founder, Ali, and a dev evangelist named Taron Foxworth from Meroxa. And they, I think, recently raised some money, and have built lots of really interesting stuff since we talked with DeVaris.
So really excited to have them back on the show. One thing that I’m interested in, and I want to get a little bit practical here, especially for our audience, one of the questions I’m going to ask is, where do you start with implementing CDC in your stack? It’s useful in so many ways. It’s such an interesting technology. But it’s one of those things where you can kind of initially think about it, and you’re like, Oh, that’s kind of interesting. But then you see one use case, and you start to think of a bunch of other use cases. And so I want to ask them, where do they see their users and customers start in terms of implementing CDC? And maybe even what does it replace? Kostas, how about you?
Kostas Pardalis 01:49
Yeah. First of all, I want to see what happened this almost one year since we spoke with DeVaris. And one year for a startup is like a huge amount of time. And it seems like they are doing pretty well. I mean, as you said, Eric, very recently, they raised their Series A. So one of the things that I absolutely want to ask them is what happened in this past year. And also, I think that just like a couple of weeks ago, they also released their product publicly. So I want to see the difference between now and then. That’s one thing.
Kostas Pardalis 02:23
And the other thing is, of course, we are going to discuss in a much, much more technical depth about CDC. And I’m pretty sure that we are going to have many questions about how it can be implemented, why it is important, what is hard around the implementation, and any other technical information that we can get from Ali.
Eric Dodds 02:43
Let’s jump in and start the conversation. All right, Ali and Fox, welcome to the show. We are so excited to talk with Meroxa again, we had DeVaris on the show six or eight months ago, I think, and so much has happened at Meroxa since then. And we’re just glad to have you here.
Ali Hamidi 03:03
Thanks. Thanks for having us.
Taron Foxworth 03:05
Yeah, I’m so excited to talk with you today.
Eric Dodds 03:07
Okay, we have a lot to cover, because Kostas and I are just sort of fascinated with CDC and the way that it fits into the stack. But before we go there, we talked with DeVaris, one of the founders. Could you just talk a little bit about each of your roles at Meroxa, and maybe a little bit about what you were doing before you came to Meroxa?
Ali Hamidi 03:27
Yeah, so I’m Ali, Ali Hamidi, and I’m the CTO and the other co-founder at Meroxa. And so before starting Meroxa with DeVaris, I was a lead engineer at Heroku, which is part of Salesforce, specifically working on the Heroku Data team handling Heroku Kafka, which was the managed Kafka offering. But before that, you know, I’ve always been working in and around the data space, and did a ton of work around data engineering in the past.
Taron Foxworth 03:53
And I’ll go next. Hi, everyone. My name is Taron Foxworth. I also go by Fox at Meroxa. I am the head of developer advocacy. I spend most of my time now building material that helps customers understand data engineering and Meroxa itself. I also work a lot with our customers actually understanding how they’re using Meroxa and also trying to learn from them as much as possible. In the past I ran evangelism and education for an IoT platform. That’s kind of where I really jumped into this data conversation because you know, IoT generates a bunch of data, a bunch of sources. Then I joined Meroxa back in February to really dive into this data engineering world. And it’s been such a blast so far.
Eric Dodds 04:35
Very cool. Well, I think starting out it’d be really good. Just to remind our audience, we have many, many new listeners, since the last time we talked with Meroxa. Could you just give us an overview of the platform and a little bit about why CDC?
Ali Hamidi 04:53
Yeah, sure. So Meroxa is essentially a real-time data engineering managed platform. So essentially, it makes it very easy for you to integrate with data sources, pull data from one place, transform it, and then place it into another place in the format that you want. And so a big part of that for us is really the focus on CDC, change data capture. And you know, it’s been around for a while, but only recently really gained a lot of interest and a lot of attention. And so really, the value of CDC is, rather than taking a snapshot of what your source record or database looks like at the time of, you know, making that request, CDC gives you the list of changes that are occurring in your database. So for example, if you’re looking at the CDC stream within Postgres, anytime a record is created, updated or deleted, you’re getting that event, and it basically describes the operation. And so it gives you a really sort of rich view of what exactly is happening on the upstream source, rather than just, okay, this is the end result of what happened. It gives you sort of the story of what happened and it inserts that sort of temporal aspect.
Eric Dodds 06:04
There are so many uses for CDC. I’d love to know is Meroxa focusing on a particular type of use case or particular type of data as you’ve built the platform out?
Ali Hamidi 06:20
Yeah, so we kind of see CDC as … I kind of have an answer in two parts to that question. So one of the things that, you know, led us to focus on CDC is really, we were trying to look at the areas where we can add the most value, and really apply our expertise and sort of the experience that the team has, and sort of generate the most value for customers. And so one of the areas that we looked at is that setting up CDC pipelines, CDC connectors, has always been really difficult for customers. And, you know, having spoken to lots of customers, difficult CDC projects can take upwards of, you know, 6-12 months, sometimes longer. And it’s just an inherently difficult project to get off the ground. And so really that’s one of the areas we thought, okay, we can apply our expertise and our automation and our sort of technical skills to make that easier. And so the goal of the platform, sort of the IP in the platform, is really doing the heavy lifting when it comes to setting up these connectors. And so CDC seems like a natural place for us to focus on inherently because, you know, it’s very difficult for people to do. And so if we can make it very easy, then there’s value in that. We also sort of view CDC as the superset of data integration, in the sense that you can create the snapshot view of your data from the CDC stream. But you can’t really go the other way, you can’t sort of create data where there isn’t any. You can sort of compact the CDC stream into what the end result should be. And so if you’re starting from this richer, more granular stream of changes, then essentially, any use case that is covered by the traditional ETL or ELT approach can also be supported by the CDC approach. But it also unlocks sort of new things.
And so a very contrived example, but I think one that kind of explains where the value and the addition of the temporal data is: if you look at sort of an e-commerce use case, where you’re tracking what’s happening in shopping carts, then you know, whenever someone adds something to the cart, you could potentially, it’s a very naive approach, but you could represent that shopping cart as a record in a database. And then when someone adds something, you know, you increment the number of items to two. And so that would actually trigger an event that would land in the stream. Whereas if you’re looking at just the snapshot, then whenever you happen to look at it, that would be the number. And so if someone adds something, then adds something else so that’s two of them, and then removes something, that’s all data that’s potentially valuable. And that would land in the CDC stream of what exactly the user did. Whereas if you’re just looking at the snapshot, it’s, you know, the end result. And so if I added 10 things, and then I dropped it to only one, and that’s what I purchased, then later when the snapshot happens, you’d only see the one thing. You wouldn’t see the intermediate steps that I went through. So it’s a very contrived example. But I think it demonstrates the idea of the additional sort of rich data that you’re potentially leaving on the table by not using the CDC stream.
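To make the shopping cart point concrete, here is a minimal sketch of compacting a change stream into a snapshot. The event tuples, the `compact` function, and the `cart-1` key are hypothetical and only for illustration; they are not Meroxa's format.

```python
from typing import Dict, List, Tuple

# Hypothetical change events for one cart record: (operation, cart_id, item_count).
Event = Tuple[str, str, int]


def compact(events: List[Event]) -> Dict[str, int]:
    """Fold a change stream into the latest state per key (a snapshot)."""
    state: Dict[str, int] = {}
    for op, key, count in events:
        if op == "delete":
            state.pop(key, None)
        else:
            # "create" and "update" both overwrite with the latest value.
            state[key] = count
    return state


# The shopper adds one item, goes up to ten, then drops back to one.
events: List[Event] = [
    ("create", "cart-1", 1),
    ("update", "cart-1", 10),
    ("update", "cart-1", 1),
]

snapshot = compact(events)
print(snapshot)     # {'cart-1': 1}
print(len(events))  # 3
```

The snapshot can always be derived from the stream, but the three intermediate events cannot be recovered from the snapshot, which is the sense in which the stream is the richer of the two.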
Kostas Pardalis 09:25
Ali, I have a question. So usually, when CDC gets into conversation it’s in two main use cases. One is I think you also mentioned ELT and ETL. And the other one is as part of a microservice architecture where you use CDC to feed data like different micro services. Do you see Meroxa so far being used more on one or the other?
Ali Hamidi 09:53
So we I mean, traditionally the approach that we’ve been pushing and kind of marketing for is the more traditional ELT use case, mainly because I think that’s easier to understand. And it’s more sort of common for people to kind of wrap their minds around. But the structure and architecture of the Meroxa platform is that essentially, the way it works is when you create a source connector, you’re pulling data from some database, say a Postgres through CDC, and it’s actually landing in an intermediate, so specifically Kafka, that’s managed by the Meroxa platform.
Ali Hamidi 10:27
And so this is where, you know, the second use case, or the, I’m not sure if I wanna use the term data mesh, because I feel like it’s pretty loaded and has a lot of baggage. But essentially, the application use case or microservices use case would sort of fall into place. Because these events, these change events, are actually landing in an intermediate that is being managed by us, but that a customer also has access to. And so you know, what we typically see is customers will come for the sort of easier, low-hanging-fruit use case of ETL or ELT, but then almost immediately realize that, oh, actually, once I have this change stream in this intermediate that I can easily tap into, now I can leverage it for other things. And so we have some features that kind of make that easier. An example of that is you can generate a gRPC endpoint that points directly into this change stream. And so you can have a gRPC client that receives those changes in real time. And so that kind of falls into the microservices use case pretty well. But it is the same infrastructure. And that’s kind of the key for us. We view Meroxa as being sort of a core part of data infrastructure. And so we want to make it very easy for you to get your data out of wherever it is, and place it into an intermediate, specifically Kafka, that you can then hang connectors off and kind of peek into and really leverage that data for whatever use you have.
Kostas Pardalis 11:45
Yeah, yep, that’s super interesting. So a follow-up question about ETL and ELT. So CDC has, let’s say, kind of a limitation, which is not a limitation of CDC itself, it’s more about the interface that you have with a database: when you first establish a connection with the replication log, you have access to very specific data sets in the past, right, you don’t actually have access to all the data that the whole state of the database has. And usually, when you’re doing ETL, you first want to replicate this whole state and then keep updating the state using a CDC kind of API. So how do you deal with that at Meroxa?
Ali Hamidi 12:29
Yeah, so the tooling that we use, we obviously, you know, we like to say we stand on the shoulders of giants and leverage a lot of open source tools. And so the tooling that we use, you know, depends on which data source you’re connecting to. So say if you’re using Postgres, we’re likely to provision the Debezium connector, sort of behind the scenes, and that actually supports the idea of creating a snapshot first. And so it will basically take the entire current state and push that into the stream. And then once it’s caught up, it will start pushing the changes by consuming the replication log. And so you do get both. You get the full initial state as a snapshot, and then you get the changes once that initial snapshot is done. So that’s kind of how we address that use case.
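The snapshot-then-stream behavior described above can be sketched roughly as follows. This is a simplified illustration, not Debezium's implementation: the `snapshot_then_stream` function, the in-memory `rows` and `log` lists, and the integer `position` field are all hypothetical. The one borrowed detail is the operation code `r`, which Debezium uses for rows read during a snapshot, alongside `c`, `u`, and `d` for creates, updates, and deletes.

```python
from typing import Dict, Iterator, List


def snapshot_then_stream(
    rows: List[Dict], log: List[Dict], snapshot_position: int
) -> Iterator[Dict]:
    """Emit one read event per existing row, then replay later log entries."""
    # Phase 1: push the entire current state into the stream.
    for row in rows:
        yield {"op": "r", "after": row}
    # Phase 2: once caught up, emit only changes made after the snapshot.
    for entry in log:
        if entry["position"] > snapshot_position:
            yield {"op": entry["op"], "after": entry["after"]}


# Table state at snapshot time (log position 5), plus the replication log.
rows = [{"id": 1, "name": "a"}]
log = [
    {"position": 5, "op": "c", "after": {"id": 1, "name": "a"}},  # already in snapshot
    {"position": 6, "op": "u", "after": {"id": 1, "name": "b"}},  # happened afterwards
]

stream = list(snapshot_then_stream(rows, log, snapshot_position=5))
print([e["op"] for e in stream])  # ['r', 'u']
```

Recording the log position at snapshot time is what lets the consumer avoid both gaps and duplicates when it switches from the snapshot to the change stream.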
Kostas Pardalis 13:13
Okay, that sounds interesting. So the initial snapshot, it’s something that you capture through a JDBC connection.
Ali Hamidi 13:18
Yeah.
Kostas Pardalis 13:19
Okay. Okay. That’s, that’s clear. That’s interesting. Yeah, it makes total sense. Because, you know, you need the initial state and validating the state. So yeah.
Ali Hamidi 13:30
And that’s supported across all of the data sources that we natively support. So whether it’s, you know, CDC through Mongo, or MySQL, or Postgres, they all work in a similar way. Do the initial snapshot, and then once that’s caught up, we actually start the CDC stream.
Kostas Pardalis 13:46
Super nice. So I know that your product became publicly available, like pretty recently, you were like in a closed beta for a while now, do you want to share with us a little bit about what to expect when I first sign up on the product? Some very cool features that you have included there? And I know I’m adding too many questions now. But also, like if there’s something coming up, like in the next couple of weeks, or something that you are very excited about?
Ali Hamidi 14:14
Sure. Yeah. So we launched and we made the platform publicly available about a month ago, and it’s available at Meroxa.com. You can sign up, and we have a generous free tier. Our pricing model is based on events processed. And so you can go in and create an account, you can leverage the dashboard or the CLI. And as I mentioned, really, the IP is making it very easy to get data out of a data source and making it very easy to input data. And so an example of that is, I mentioned the CDC streams with Postgres, but the platform will sort of gracefully degrade its mechanism for pulling data out of Postgres depending on where it’s being hosted, what version it’s running, what permissions the user has, and that kind of thing. The command or the process for the customers is uniform, it’s basically the same. And that also extends across the data sources. So you type the same command, whether you’re talking to Postgres running on RDS with CDC enabled, or you’re talking to Mongo on MongoDB Atlas. It’s basically the same command, same UX. And that’s really our edge, I guess. In terms of sort of features, really, what we’re pitching is the user experience. We’re trying to make it very, very easy to set up these pipelines and get data flowing. And that’s really where a lot of our attention has been focused.
Kostas Pardalis 15:35
That’s super interesting. And can you tell us a little bit more about the user experience? Like how do we interact? Like, for example, and I guess the product is intended mainly towards engineers, right? So is it like the whole interaction through the UI only? Do you provide an API that programmatically someone can like, create and destroy and update like CDC pipelines? What are your thoughts around that? What is like, let’s say, in your mind, also, as an engineer, right, like, the best possible experience that an engineer can have from a product like this?
Ali Hamidi 16:10
Yeah, so from my perspective, you know, being on the user side of this sort of work for many years, really, I felt most at home working through a CLI or some kind of infrastructure automation. I’d love to use something like Terraform or a similar tool to kind of set up these pipelines. For us right now, we’ve launched with the CLI. So we have the Meroxa CLI, which has full parity with the dashboard. And we have the UI itself, which is the dashboard, so you can sort of visually go in and create these pipelines. We haven’t quite yet made a public API available, but it’s something that we’re definitely interested in and working towards. We’re just not quite there yet. And certainly, you know, I’m a huge fan of Terraform. And the idea of infrastructure as code, I think, is great. And it’s something that we definitely need to address. And that’s something that, you know, we’re looking forward to addressing in the future. But yeah, CLI right now, and a dashboard through the UI. And there is full parity between the two. Typically, the way you interact with it is you’d introduce resources to the platform. And so you’d add, you know, add Postgres, give it a name, add Redshift, give it a name, and then create a pipeline and create a connection to Postgres. The platform reaches out, inspects Postgres, figures out the best way to get data out, and starts pouring it into an intermediate Kafka. And then you kind of peek into that and say, Okay, take that stream, and write it into Redshift now. And the rest is handled by the platform.
Kostas Pardalis 17:40
That’s super interesting. By the way, I think we also have to mention that pretty recently, you also raised another financing round, Is this correct?
Ali Hamidi 17:49
Yeah. Yeah, we raised a pretty sizable Series A. We closed sort of towards the end of last year, but recently announced it, with Drive Capital leading our Series A. It’s been, you know, super amazing working with them and the rest of our investors. And yeah, so you know, that enabled us to accelerate the growth of the team, really build out our engineering team and sort of the other supporting resources. So we went from about eight people last October to 27 as of today.
Kostas Pardalis 18:22
Oh, that’s great. That’s amazing. That’s a really nice growth rate. And I’m pretty sure you are still hiring. So yeah, everyone of our listeners out there, they want to work in a pretty amazing technology and be part of an amazing team. I think they should reach out to you.
Ali Hamidi 18:38
For sure. For sure. We’re always hiring, always looking for back end engineers, front end engineers. Yeah. If you’re interested in the data space, then we’d love to hear from you.
Kostas Pardalis 18:46
That’s cool. All right. So let’s chat a little bit more about CDC and the use cases around CDC. So based on your experience, so far, what are the most common, let’s say, sources and destinations for CDC? And why also? Like why do you think that people are mainly interested in it at this point, and the maturity that the product technology has right now, are interested in this?
Ali Hamidi 19:11
Yeah, so at least from our point of view, and what we’ve seen and what customers are telling us, the most common data sources would be Postgres, MySQL, MongoDB, SQL Server. Really, the operational databases are the things that are backing these sort of common applications and APIs. That tends to be what people are asking us for. And so I think the reasoning behind that is really, that’s where the most value comes out. So you mentioned earlier, you know, the two different paths for CDC use, the one being ELT, and one being the microservices sort of application type use case. And I think there’s a really nice sort of appealing aspect of saying, Well, I don’t need to change any of my upstream application. If all of the changes are happening in the database, I can just kind of look into that stream, and radiate that information across my infrastructure and start taking advantage of it. And so I think that’s why, you know, most of the use cases, most of the requests are really around operational data sources.
Kostas Pardalis 20:14
It’s interesting. Can you share a little bit of your experience with the different data sources? Which one do you think it’s like the easiest to work with in terms of CDC and which ones are the most difficult ones?
Ali Hamidi 20:25
Mainly because of my time at Heroku, Heroku was very famously, very strongly associated with Postgres. I’d argue that Heroku Postgres was probably the first solid sort of production grade Postgres offering that was available as a managed Postgres. And I think Postgres as a product itself is incredible. I think it’s really great to work with and its development has been super fast-paced, but always very stable. And I think the way that they have implemented, sort of replication has made it very, very useful for building out CDC on top of, and so that’s, I think, personally, that’s where I would kind of lean towards. I think to get like the premium CDC experience, Postgres is probably the best right now. I know that MongoDB has done a ton of work with their Streaming API, and sort of done stuff there to make that super easy too. But yeah, just for simplicity, and getting things up and running, Postgres is great for CDC. Mainly, because it leverages the built in replication mechanism.
Ali Hamidi 21:27
That being said, one of the things that we sort of continually see, and this is probably a good time to bring up the initiative that, you know, we’re trying to work on amongst some partners and sort of industry peers: CDC itself has come a long way in terms of what it does and interest and where it can be applied. But I think there’s room for us to kind of agree, as a community, as a collection of experts that work in the field, on potentially some guidelines to make interoperability better. And so you have different companies building out, you know, CDC mechanisms, whether it’s someone building CDC natively into their product like CockroachDB, or someone like the Debezium team at Red Hat who are building these CDC connectors. I think there’s definitely an opportunity for us to sort of sit around a table and agree on, all right, if I want to provide a great CDC experience, I want to enable interoperability. So maybe I want to use, you know, Debezium on one end, and I want to pour that CDC stream into CockroachDB. Let us agree on at least a style of communication, like some kind of common ground between us, so that we can make this interoperability possible and make it easier for customers to really make use of that.
Ali Hamidi 22:45
And so, one of the things that we’ve been talking about, and I’ll let Fox kind of talk a little bit more about the initiative in general, but we’re basically partnering up with some of our sort of industry partners to push the idea of an open CDC initiative, essentially, to kind of agree on what it looks like to implement CDC and support CDC and what it looks like to support it well.
Kostas Pardalis 23:09
Well, that’s super interesting. Yeah. I’d love to hear more about what’s the state of open CDC right now?
Taron Foxworth 23:16
Yeah, so I’d love to hop in here. This has been so informative. I’ve just been sitting here clapping my hands, soaking in knowledge and all that information about CDC. But open CDC is really, I think, an initiative that’s going to drive a lot of activity and community just around CDC in general, because like Ali mentioned, there are multiple ways you can actually start to capture this data, like Debezium, for example, leverages Postgres logical replication, to actually keep track of all the changes that are occurring. And the nice thing there is you get changes for every insert operation, update operation, delete operation. But there’s also other mechanisms of CDC as well, like, for example, one connection type is polling. Like you can constantly ask the database to look for a primary key increment. So when you know a new ID has come in, that’s a new entry or looking at a field may say updated at. So with all these different mechanisms of actually tracking the changes, some consistent format around systems around, okay, well, if you have a CDC event, you should be able to track here’s what snapshots look like, here’s what creates look like, here’s what updates look like, here’s what deletes look like. And what we can start to do is offer some consistency amongst these systems. So that CDC producers, and CDC consumers all agree on, you know, what they should be producing and consuming. And then that just leads to a great foundation for kind of all the things that Ali was talking about, just the secret sauces of CDC, whether that be replicating data, all the way to building microservices that actually leverage these events in an event-driven architecture type of way. So right now in terms of open CDC, we’re putting together these standards and this specification. So be on the lookout for something more official soon. 
But if you have any ideas or something, we would love to hear from you and love to work with you on this initiative to make sure that this is something that’s really great for the CDC community.
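The polling mechanism Fox mentions, tracking a field such as an updated-at timestamp, can be sketched with Python's built-in `sqlite3` module. The `users` table, its columns, and the `poll_changes` helper are hypothetical, chosen only to keep the example self-contained; integer timestamps stand in for real ones.

```python
import sqlite3


def poll_changes(conn: sqlite3.Connection, last_seen: int):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_last_seen = rows[-1][2] if rows else last_seen
    return rows, new_last_seen


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO users VALUES (1, 'ada', 100)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 105)")

# First poll sees both rows.
first, mark = poll_changes(conn, 0)

# An update is picked up on the next poll, but a delete leaves no row to find.
conn.execute("UPDATE users SET name = 'bobby', updated_at = 110 WHERE id = 2")
conn.execute("DELETE FROM users WHERE id = 1")
second, mark = poll_changes(conn, mark)

print(first)   # [(1, 'ada', 100), (2, 'bob', 105)]
print(second)  # [(2, 'bobby', 110)]
```

The second poll shows one reason a shared event format matters: a polling producer cannot report the delete at all, whereas a log-based producer would emit a delete event for row 1.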
Kostas Pardalis 25:20
Yeah, that’s amazing, guys. I hope this is going to work out in the end. And obviously, anyone who is listening to this and is involved one way or another in this kind of CDC project, I think they should reach out to you. I mean, outside of Meroxa right now, are there any other partners that you have that are part of this initiative?
Taron Foxworth 25:44
Yeah, one big one is Debezium itself. We talked with the lead maintainer of the Debezium project, because I think Debezium as just a project in general has been so influential in terms of CDC, and their format, that JSON specification, it includes things like the schema that is being tracked from the database, and the events and the operation, and things like the transaction number of the database transaction. And in the case of logical replication, right, the actual WAL line they would be reading from. So they have been one group that we’ve been working with, and Materialize is another. So Materialize, they’re a streaming database. And CDC is really important for them, because as soon as you’re streaming changes and calculating information, that system is very important for how they consume the data, and then produce that back out in a meaningful way. So I think, you know, working with the different types. So when you look at CDC in general, you might have actual products, such as Postgres, producing a CDC stream. But you also have CDC services, say, like Meroxa, that are actually consuming them and getting you to do something useful. So I think there’s different types of players and companies that we can begin to work with. But those are a couple that we’ve been having some really awesome conversations with so far.
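For readers who have not seen one, a change event in the style Fox describes might look roughly like the dictionary below. It is a simplified, hand-written approximation of a Debezium-style envelope (before and after images, an operation code, and source metadata such as a transaction ID and log position), not output captured from a real connector; the `orders` table, the values, and the `describe` helper are hypothetical.

```python
from typing import Dict

# Simplified, illustrative event: real connector output carries more fields
# (and optionally an embedded schema section).
event: Dict = {
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "shipped"},
    "source": {
        "connector": "postgresql",
        "table": "orders",
        "txId": 1234,     # database transaction number
        "lsn": 24023128,  # position in the write-ahead log (WAL)
    },
    "op": "u",
    "ts_ms": 1620000000000,
}

OPERATIONS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}


def describe(event: Dict) -> str:
    """Summarize an event: the operation, the table, and which fields changed."""
    before = event["before"] or {}
    after = event["after"] or {}
    changed = sorted(k for k in after if before.get(k) != after[k])
    op = OPERATIONS[event["op"]]
    return f"{op} on {event['source']['table']}, changed fields: {changed}"


print(describe(event))  # update on orders, changed fields: ['status']
```

Carrying both the before and after images is what lets a consumer work out what changed without querying the source database again.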
Kostas Pardalis 27:05
That’s super interesting. Fox, do you see value in having these conversations also with the cloud providers, for example? The reason I’m asking is because so far, the way that I’ve seen it with products that are trying to do ETL from, like, Postgres, and SQL, and MySQL, depending on the cloud provider and the version of the database, you might be able to perform CDC or not, right? So there is no unified experience, at least across the different cloud providers out there. Do you think it makes sense for them also to be part of this initiative?
Ali Hamidi 27:41
Yeah, I mean, I think it definitely makes sense. I know, we want to try to get as many people on board as possible. And, you know, some of the ideas that we’ve been talking about is, how can we classify the, I don’t want to say compliance, because I feel compliance is too strong, like the idea of, we don’t necessarily want to enforce a standard, but some kind of categorization of like, good, better best of, like, if you are planning to, to leverage CDC, like this is, you know, a really good experience, or this is like the best possible experience where you get all of the operations you want, it’s very clear, you get the before and after of the event, you get everything you need.
Ali Hamidi 28:18
So yeah, I think from my point of view, you know, the more people that are involved, the more people that adopt it, the more people that are kind of, you know, following our guidelines, the better it will be, and the more likely we’ll have successful interoperability. And so I can definitely imagine a world where these bigger cloud providers are not necessarily changing their formats to match it, but at least, you know, if you’re going to build something new or integrate something, then why not build against some sort of commonly accepted guidelines that, you know, benefit everyone?
Kostas Pardalis 28:55
That’s great. I think you’re after something big here, guys, so I really wish you the best of luck with this. And also from our side as RudderStack, I think it would be great to have a conversation about that and see how we can also help with this initiative. We should chat more about it. All right, so some questions about CDC again, and the experience around using CDC, right? You are providing a solution. It runs on the cloud, Meroxa is connecting to the database system of your customer, and this data ends up on a Kafka topic. And from there, from what I understand, it can be consumed as a stream using different APIs. What are, let’s say, the expectations in terms of latency that a user should have using CDC in general and Meroxa in particular?
Ali Hamidi 29:56
So with CDC, it’s very much dependent on how it’s implemented, right? So, you know, I mentioned previously that one of the things that we do is we sort of degrade gracefully, in terms of what is possible. And so if you point Meroxa at a Postgres instance that’s running on RDS that, you know, has the right permissions and logical replication and everything, then latency is incredibly low, because it’s basically building on the same mechanism that is used for replication. And so if you had a standby database, typically that’s potentially less than a second behind, you know, milliseconds behind in terms of replication. And so we’re seeing that data in real time, at the same time as all the other standbys are. And so the answer on latency is that it can also be sub-second. But that’s like the best case. I mentioned with open CDC, like good, better, best; this would be the best tier where you’re really getting low latency, high throughput, sort of low resource impact. But you know, the end to end is obviously very variable, because once it’s in Kafka, Kafka is very, you know, famously high throughput and low latency as well. So that tends not to be the limiting factor. What tends to be the limiting factor is what you do with that data. If you’re kind of tapping into the stream directly, and using something like the gRPC endpoint, you know, the feature that we have, then you could potentially also get it, you know, sub-second, and see all of those changes that are happening on the database. If you move down to something different, like, maybe you’re running a Postgres that’s very restrictive, you’ve given us a user that has very limited permissions, and we aren’t able to plug into the logical replication slot, then we kind of fall back to JDBC polling. And so then you’re looking at, you know, a worst case scenario of the polling time, plus whatever time it takes for us to write it into Kafka.
And potentially if you’re writing it out to something else, like S3, or something that is inherently batch based for writes, then you’re kind of incurring that additional time penalty. But typically, what we see end to end is still pretty low, like single digit seconds is quite common.
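To make the polling fallback Ali describes a little more concrete, here is a minimal sketch of query-based change capture. It is not Meroxa's implementation: the `users` table, the integer `updated_at` column, and the `poll_changes` helper are all hypothetical, and an in-memory SQLite database stands in for Postgres over JDBC.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows modified after `last_seen`, plus the new cursor value.

    Hypothetical helper: assumes every write bumps an integer `updated_at`.
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_cursor = rows[-1][2] if rows else last_seen
    return rows, new_cursor


# In-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO users VALUES (1, 'ada', 100)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 105)")

# First poll picks up everything written so far.
first_batch, cursor = poll_changes(conn, last_seen=0)

# A later write is only seen on the next poll.
conn.execute("UPDATE users SET name = 'ada l.', updated_at = 110 WHERE id = 1")
second_batch, cursor = poll_changes(conn, cursor)
```

A change made just after a poll waits roughly one full polling interval before it is noticed, which is why the worst case is the polling time plus the write into Kafka. Polling of this kind also cannot see deleted rows or intermediate states between two polls, which is part of why the log-based path is the "best" tier.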
Kostas Pardalis 32:04
That’s interesting. Do you see practical workloads where you also have to take the initial snapshots? Do you see issues there in terms of catching up with the data as it is generated using CDC?
Ali Hamidi 32:17
Yeah, that’s kind of an area where I think there’s definitely room for improvement, both in the way we handle things and in tools in general. The initial snapshots can often be very, very large. So obviously, if you, you know, use something like Meroxa right at the beginning, it’s great, because you don’t have that much data. But if you come in pretty late in the game, and you have terabytes of data, then that’s terabytes we have to pull in before we can start doing the CDC stream. And so I think there’s room for improvement in terms of the tooling, you know, being able to do it in parallel, or being able to do things like that would be great. And I know, you know, we’re working on things internally, and also sort of the upstream providers, like, you know, Debezium and other teams are also working on things like allowing, you know, incremental snapshots and being able to take snapshots on demand and stuff like that. So I think there’s definitely room for improvement. You know, I’d love for us to be able to, like, seed a snapshot, maybe be able to, like, preemptively load from historical data, and then build on top of it rather than only take the snapshot ourselves, and stuff like that. So, yeah, I think there’s still definitely room for improvement there.
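The snapshot-then-stream handoff described here can be sketched roughly as follows. This is an illustrative simplification, not how Meroxa or Debezium implement it: the `bootstrap` function, the integer `position` field, and the event shape are assumptions made up for the example.

```python
def bootstrap(snapshot_rows, snapshot_position, change_log):
    """Build state from a snapshot, then apply only later changes.

    `snapshot_position` is the (hypothetical) log position at which the
    snapshot was taken; changes at or before it are already reflected.
    """
    state = {row["id"]: row for row in snapshot_rows}
    for change in change_log:
        if change["position"] <= snapshot_position:
            continue  # already included in the snapshot
        if change["op"] == "delete":
            state.pop(change["id"], None)
        else:  # insert or update: keep the latest version of the row
            state[change["id"]] = change["row"]
    return state


snapshot = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
log = [
    {"position": 9, "op": "update", "id": 1, "row": {"id": 1, "name": "stale"}},
    {"position": 11, "op": "update", "id": 2, "row": {"id": 2, "name": "bobby"}},
    {"position": 12, "op": "delete", "id": 1},
    {"position": 13, "op": "insert", "id": 3, "row": {"id": 3, "name": "cy"}},
]

current = bootstrap(snapshot, snapshot_position=10, change_log=log)
```

With terabytes of existing data, the first step dominates: no change after the snapshot position can be applied until the snapshot itself has been loaded, which is the catch-up problem being discussed.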
Kostas Pardalis 33:30
Yeah, that’s super interesting. One more question. CDC is traditionally considered something that is related to database systems, like a transactional database system, something like MongoDB, something like Postgres, et cetera. Do you see CDC becoming something more generic, let’s say, as a pattern, and also including other types of sources?
Ali Hamidi 34:03
Yeah. I think, you know, if you kind of squint your eyes a little bit, the CDC event is just an event that describes something that happened to the database, right? And so it’s really no different to evented systems, if you were building out an application and you kind of emit an event from your application that’s describing a state change. So really, it’s the equivalent in functionality or in semantics. Here is an event, you know, a state change that your database is experiencing, versus here is a state change that your application is experiencing. And so our goal or our belief is that really, if we can provide a uniform experience across the two of them, then this, you know, may not necessarily be called CDC, because, you know, evented systems as a term has been around for a while. There’s no reason you couldn’t, you know, plug into any kind of SaaS application or your own custom application that’s triggering these events, and no reason they shouldn’t be treated uniformly with the CDC events, if you just consider each one a state change of some sort.
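One way to picture the "uniform experience" idea is a single envelope that both database changes and application events are mapped into. The sketch below is purely illustrative: `StateChange` and its fields are hypothetical and are not the open CDC format discussed on the show.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateChange:
    """Hypothetical common envelope for any kind of state change."""
    source: str               # where the change came from
    entity: str               # table name or domain object
    operation: str            # e.g. "update", "refunded"
    before: Optional[dict]    # prior state, if known
    after: Optional[dict]     # new state or event payload


def from_database_row_change(table, op, before, after):
    """Wrap a row-level database change."""
    return StateChange("postgres", table, op, before, after)


def from_application_event(name, payload):
    """Wrap an application event named like 'orders.refunded'."""
    entity, _, operation = name.partition(".")
    return StateChange("app", entity, operation, None, payload)


def describe(change):
    """A consumer that does not care where the change originated."""
    return f"{change.source}:{change.entity}:{change.operation}"


events = [
    from_database_row_change(
        "orders", "update",
        {"id": 7, "status": "new"}, {"id": 7, "status": "paid"},
    ),
    from_application_event("orders.refunded", {"id": 7}),
]
summaries = [describe(e) for e in events]
```

The downstream consumer (`describe` here) handles both kinds of event through the same interface, which is the point being made about treating them as state changes of some sort.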
Kostas Pardalis 35:08
Yeah, absolutely. I think the first example that comes to mind related to that is Salesforce. Salesforce lets you subscribe to changes; actually, they call it CDC, to be honest. I don’t know how well it works, but it’s a very good example of CDC as an interface with a SaaS application, right? So yeah, I’d love to see more of this happening out there. I think that as platforms embrace this kind of way to subscribe to changes and catch up, things will become much, much better in terms of integrating tools. So yeah, that’s interesting.
Kostas Pardalis 35:47
Ali, something else about that. Recently, there’s a lot of hype around what is called reverse ETL. So there we have the case of actually pulling data out of the data warehouse and pushing this data into different applications on the cloud. Traditional data warehouses are not built in a way that, you know, emits changes, or even allows for many concurrent queries; it’s a completely different type of technology. Regardless of that, though, we see examples like Snowflake, right? Snowflake, from what I’ve seen recently, has a way where you can track changes, right? Yeah, it’s not exactly CDC, but it’s close to CDC, right? Do you see CDC potentially playing a role in these kinds of applications too?
Ali Hamidi 36:39
I don’t know. I think that the jury’s still out on reverse ETL. I feel like my initial reaction to sort of the whole idea of reverse ETL is, it’s kind of a fix for potentially the wrong problem, I think. The reason, you know, people want reverse ETL is because you’re kind of following this ELT idea of dump everything raw into your data warehouse, clean it up, process it, put it in a state that is useful for my other applications. And then now I want to take the data out and kind of plug it into my other components. But I feel like that’s kind of too far downstream for us. My thinking on the subject is really, if, you know, ETL in real time was good enough, if we provided the right kind of tooling, the right kind of APIs, the right kind of interface to do that kind of transformation in real time on the platform, in a way that is, you know, manageable and sustainable, then it kind of removes the need for dumping everything raw into a data warehouse, doing the processing, and then doing the reverse ETL. So an example of this is, you know, because we’re putting everything in Kafka, Kafka has, you know, retention, and so we could plug in a connector and say, Okay, take the last two weeks’ worth of data, apply this processing, you know, summarize it in this way, do the stream processing, and then take those results, and write them into my application. But it also lets you do things like, well, you know, maybe the transformation was wrong, let me rewind and try again with a different transformation. And so I think that the task for us is really to build that tooling to kind of make the idea of reverse ETL almost unnecessary, by trying to build better tooling. I feel like ELT and reverse ETL are really a result of having funky ETL tools, or tools that really didn’t meet the needs or, you know, weren’t really usable enough or performant enough to achieve that.
So we’ve kind of gone extreme in the other direction of saying, just get everything raw into your data warehouse, and then we’ll figure it out later. And so that’s inherently not real time. And our focus is very much on real time. And so if we can provide the right tooling and do it upfront and do it on the platform, I think it should hopefully, if we do it well enough, negate the sort of need for having reverse ETL.
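The "rewind and try again with a different transformation" idea can be sketched with a plain Python list standing in for a retained Kafka topic. Everything here is an assumption for illustration: the record shape, the `replay` helper, and the two transforms are hypothetical, and a real consumer would seek to an offset or timestamp on the topic rather than loop over a list.

```python
from collections import defaultdict

# Stand-in for a retained topic: records stay available after being read.
retained_log = [
    {"offset": 0, "user": "a", "amount": 10},
    {"offset": 1, "user": "b", "amount": 5},
    {"offset": 2, "user": "a", "amount": 7},
    {"offset": 3, "user": "b", "amount": 1},
]


def replay(log, from_offset, transform):
    """Re-read the log from `from_offset` and aggregate transform output."""
    totals = defaultdict(int)
    for record in log:
        if record["offset"] < from_offset:
            continue  # outside the window we chose to reprocess
        key, value = transform(record)
        totals[key] += value
    return dict(totals)


def total_spend(record):
    return record["user"], record["amount"]


def purchase_count(record):
    return record["user"], 1


# First attempt: summarize spend per user over the retained window.
first_attempt = replay(retained_log, from_offset=1, transform=total_spend)

# Decide the summary was wrong, rewind, and apply a different transform.
second_attempt = replay(retained_log, from_offset=1, transform=purchase_count)
```

Because the source records are retained rather than consumed destructively, the second attempt needs no re-extraction from the operational database; only the transformation changes.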
Kostas Pardalis 39:06
That’s a very interesting perspective. What do you think, Eric, about this?
Eric Dodds 39:10
Well, I was actually going to ask a question. I was going to follow up with a question. So I’m so glad you asked Kostas. I think before we get … the reverse ETL topic is one that we love to talk about and debate about on the show. But I think first it would be interesting, both for me and our audience, just to hear what are the most common parts of the stack that are replaced with Meroxa when someone adopts the products? Or is it generally sort of a net new pipeline? I think it’d be interesting to know about that, and then I can give my thoughts on reverse ETL.
Ali Hamidi 39:45
Yeah, so we, I mean, we don’t necessarily try to go in and replace, like, you don’t need to replace anything. Typically, the path for using Meroxa is to deploy us in parallel. And so we’ll tap directly into your operational database and start streaming the data into our intermediate Kafka. And then you can start leveraging Kafka and the streaming data in Kafka to build out new applications or, you know, pour it into your data warehouse or whatever it is that you want. And so, you know, we use the term data infrastructure, and try to position it more as, you know, we don’t view Meroxa as a point product; it’s not a point to point connection, really. What we’re trying to do is get your data out from the various data sources that it resides in and put it into a flexible real time intermediate that you can then sort of tap into and leverage for other things.
Eric Dodds 40:40
Yeah, absolutely. Makes sense. And I think, you know, the reverse ETL question is interesting, because it sort of crosses a number of different types of technologies that are connected into the stack. So I think the first thing that came to my mind, Kostas, when this subject came up was that the tip of the spear tends to be marketing and sales SaaS tooling, right? When you think about, you know, sort of data pipelines, whether it’s your traditional ETL, cloud data pipelines, or, you know, event streaming type tooling, etc. It tends to … so the demands of marketing and sales to get data that’s going to help them sort of drive leads and revenue, etc., tend to create a huge amount of demand. And so the first round of reverse ETL tools, I think, is really focused on those, right? You’re trying to get, you know, sort of audiences out of your warehouse into marketing platforms and ad platforms, and enrich data from your warehouse into your sales tools so your salespeople have better insight. But I think Ali what you … it’s been such an interesting conversation, because the idea around sort of streaming data in and out is much, much larger than just sort of those point solutions. And so I think it’ll be fascinating to see how the space evolves, especially as technologies like Meroxa become more and more common and we discover all the different use cases. It strikes me as one of those tools, even throughout this conversation, where you sort of get an immediate use case. And then you think about all of these other interesting ways that it could be useful as well. Right? Which is so interesting.
Ali Hamidi 42:20
Yeah, for sure. That’s something that we see pretty frequently. Customers will come with a particular use case in mind; the most common one is sort of operational data into your data warehouse. But once they have that data flowing, then they have this sort of real time stream of events coming from their operational database that includes every change that their database has seen. Then they kind of almost immediately go, well, now that I have this, I can do these other things. Like maybe I’ll tap into the same stream, transform it, and keep my Elasticsearch cluster up to date in real time while I’m at it. And so then, like, once you do that, then you’re like, Oh, well, actually, I can use this to, you know, make a clone of a webhook that hits my, you know, my partner company whenever this particular thing happens, because now I don’t need to change my infrastructure, I don’t need to sort of custom instrument anything, I’m just looking at the raw events, and I can kind of tap into it and really leverage that. Yeah, so one of the things that you mentioned, like the reverse ETL idea of, you know, enriching data for use in marketing, I think, we don’t currently have this functionality. But you know, I just want to kind of hear the thoughts of the audience and you: imagine you jumped forward some amount of time, and if we, or someone like us, can make it super easy to do cross stream joins and enrich data in real time, then do you really need to pour your data into a data warehouse, and then pull data from Salesforce, and then pull data from Zendesk, and then, like, join them across all of the tables, and then write them out into something else? Whereas if you were able to do it in real time, you know, by doing, you know, stream joins and hitting third party APIs to enrich those records and create a flat record that you can then plug straight into Salesforce?
You know, I think it would be hard to argue that, you know, I can’t imagine anyone saying, you know, what, this real time is just way too fast. I wish it was taking several hours like, this is just too responsive. So I feel like the task is not a question of whether anyone would want that. I think that’s clear. It’s whether or not anyone like us or someone else can make it happen in a way that’s easy to use. I think that’s really the task.
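Ali is clear that this is not something Meroxa offered at the time, so the following is only a toy sketch of what a cross stream join that produces a flat, enriched record might look like. The `customers` and `tickets` streams, their fields, and the `enrich` generator are all hypothetical.

```python
def enrich(events):
    """Join each ticket event with the latest known customer record.

    Keeps an in-memory table of customers built from the customer stream
    and emits one flat record per ticket as it arrives.
    """
    customers = {}
    for event in events:
        if event["stream"] == "customers":
            customers[event["id"]] = event  # latest version wins
        elif event["stream"] == "tickets":
            customer = customers.get(event["customer_id"], {})
            yield {
                "ticket_id": event["id"],
                "subject": event["subject"],
                "customer_name": customer.get("name"),
                "plan": customer.get("plan"),
            }


# Interleaved events, in arrival order, from two hypothetical streams.
events = [
    {"stream": "customers", "id": 1, "name": "Ada", "plan": "pro"},
    {"stream": "tickets", "id": 100, "customer_id": 1, "subject": "Login fails"},
    {"stream": "customers", "id": 1, "name": "Ada", "plan": "enterprise"},
    {"stream": "tickets", "id": 101, "customer_id": 1, "subject": "Billing question"},
    {"stream": "tickets", "id": 102, "customer_id": 2, "subject": "Unknown sender"},
]

flat_records = list(enrich(events))
```

Each ticket is enriched with whatever the customer record looked like at the moment the ticket arrived, with no warehouse round trip. A production system would also need to handle out-of-order events, state that outgrows memory, and failures, which is the "easy to use" challenge raised in the conversation.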
Eric Dodds 44:25
Yeah, you know, it’s interesting, in our last episode, or the one before, we kind of mentioned these different phases, right? So you have sort of the introduction of the cloud data warehouse, which sort of spawned this entire crop of pipeline tools, because now all of a sudden, you needed to unify your data, right? And now you have sort of the next round of that where you’re seeing reverse ETL and, you know, sort of different event streaming type solutions. And it’s interesting because a lot of the sort of new technologies spawn new use cases and spawn new technologies. And so I think it is fascinating to think about a future, and this is actually something we’ve been discussing a lot at RudderStack, where Kostas and I work. Currently we live in a phase where there’s heavy implementation of pipelines. And if you imagine a world, which you talked about, Ali, where the use case is the first class citizen and the pipelines are an abstraction that you really don’t even sort of deal with in terms of setting up a point to point connection, I think that’s where things are going. And I think the type of sort of cross stream joins you’re talking about are fascinating, because then you sort of get rid of all of this manual work to create point to point connections, which still, I mean, it’s very powerful to sort of do all of that in a warehouse. But if you can abstract all of that, and just give someone a very easy way to activate a use case, and not have to worry about the pipelines because all that’s happening under the hood, I mean, that opens up so many possibilities, because you get so much time and effort back.
Ali Hamidi 46:15
I mean, for sure, you know, you hit the nail on the head there. That’s really the use case that we’re trying to address. That’s the problem that we’re trying to solve, you know, and that’s the world that we’re trying to head towards.
Eric Dodds 46:26
Very cool. Well, unfortunately, wow. We are actually over time a little bit. This has been such a good conversation. Well, thank you so much for joining us on the show. Audience, please feel free to check out Meroxa at Meroxa.com. And we’ll check in with you maybe in another six or eight months and see where things are at. Thanks again.
Ali Hamidi 46:45
That sounds good. Thank you so much for having us.
Eric Dodds 46:47
Well, Meroxa is just a cool company and now having talked to three people there, they just seem like they attract really great people and great talent. So that’s always a fun conversation. I’m going to follow up on their answer to my initial question. And I thought it was really interesting, some technologies, you know, let’s say you change data warehouses, or you change some sort of major pipeline infrastructure in your company, that can be a pretty significant lift. And it was really cool to me the way that they talked about how their customers are approaching, implementing CDC, and it really was around if you need to make some sort of change or update to some sort of data feed, then you can replace that with Meroxa. And so that’s what they see a lot of companies doing. And I think that makes CDC a lot more accessible as sort of a core piece of the stack, as opposed to going through some sort of major migration. What stuck out to you Kostas?
Kostas Pardalis 47:44
Yeah, two things. Actually, one is about this great initiative they have started, which is open CDC. I’m very interested to see what’s going to come out of this. Just to remind our listeners about it, it’s an initiative that will help standardize the way that CDC works, mainly around messages and how the data is represented, so it will be much easier to use different CDC tools. Anything that is open is always a step forward in the industry; it remains to be seen how the industry and the market are going to perceive that. So that’s a very interesting part of our conversation. The second one was about reverse ETL, and the comment that Ali made that, actually, if you implement CDC and ETL in general in the right way, you don’t really need reverse ETL. It’s a very interesting and a little bit controversial opinion, considering how hot reverse ETL is right now. So again, I’m really curious to see in the future who’s going to be right. So it was a very exciting conversation. And I’m looking forward to chatting again with them in a couple of months.
Eric Dodds 48:55
Sounds great. Well, thanks again for joining us on The Data Stack Show, and we’ll catch you next time.
Eric Dodds 49:02
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.