Episode 149:

Turning Tables Into APIs for Real-time Data Apps, Featuring Matteo Pelati and Vivek Gudapuri of Dozer

August 2, 2023

This week on The Data Stack Show, Eric and Kostas chat with Matteo Pelati and Vivek Gudapuri, Co-Founders of Dozer, a company that helps users turn various data sources into APIs for real-time data access. During the conversation, the group discusses the problems that led to the creation of Dozer and how it bridges the gap between data engineering and application engineering. Topics also include the components and workflow of Dozer, its handling of schema changes, working with event streams, use cases, the importance of reliability and observability in Dozer’s data-to-API solution, and more.

Notes:

Highlights from this week’s conversation include:

Building Dozer: Simplifying Data Sources into APIs (1:13)
Bridging Data Engineering with Application Engineering (4:19)
Turning Data Sources into APIs (7:46)
The cost of caching (12:59)
Challenges with legacy systems (14:30)
Real-time data integration (19:31)
YAML and SQL experience (25:37)
Behind the scenes of Dozer (29:18)
Heavy Workloads and Low Latency (42:00)
Use Cases of Dozer (45:51)
Reliability and storing data from different connectors (51:35)
Importance of observability in serving data to customers (53:24)
Final thoughts and takeaways (56:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. They’ve been helping us put on the show for years, and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake. You should go check it out at rudderstack.com. Welcome back to The Data Stack Show. Kostas, we love talking about real-time stuff, and we have a fascinating company on the show today: Dozer. We’re going to talk with Vivek and Matteo, who both have fascinating backgrounds. They allow you to take a data source, really many types of data sources, you know, from real-time sources like Kafka to a table, maybe in your Snowflake warehouse, and just turn it into an API to get real-time data, which is fascinating. And I want to know what in their experience led them to build Dozer. What problems did they face? I mean, obviously, they’re trying to simplify something, right? Turn a table into an API, you know, sounds very much like marketing, which gives me pause. But if they can actually do it, that’s really cool, and so I want to know why they built it. And then I’m gonna let you ask them how they built it.

Kostas Pardalis 01:51
Yeah, 100%. I think it’s a very interesting space. Because now I think we are reaching the point where, you know, we’ve accumulated all this data into the data warehouse, or the data infrastructure that we have in general, and we are able to create insights from that data. But the question is, what’s next, right? How can we create even more value from this data? And that’s where we start seeing stuff like reverse ETL coming into the picture, or, let’s say, the approach that Dozer is taking: taking this data from the more analytical infrastructure that a company has and turning it back into something that an application developer can use to go and build on, right? Because, I mean, I feel like the first use case that we always think about for data is, well, analytics and BI and reporting. But to be honest, today that’s just a small part of what the industry is doing, right? Or what companies need to do. There’s much more that we can do with it. But there’s a gap there. Obviously, reverse ETL is probably partially addressing this gap, but I don’t think that it’s a solved problem. And I think that’s exactly what companies like Dozer are trying to do, so it’s going to be super interesting to see what this bridge looks like and what kind of technology and tooling is needed to bridge data engineering with application engineering. And that’s what we are going to talk about today, and I’m very excited about it. So let’s go and do it.

Eric Dodds 03:37
All right. Let’s dig in. Vivek, Matteo, welcome to The Data Stack Show. We’re so excited to chat about Dozer and all things data. So thanks for joining us.

Vivek Gudapuri 03:50
Thank you very much. All right.

Eric Dodds 03:53
Well, let’s start where we always do. Vivek, do you want to tell us about your background and kind of what led you to starting Dozer? And feel free to talk about how you know Matteo, of course, as part of that story.

Vivek Gudapuri 04:06
Yeah, so I’ve always been in technology roles. I got to know Matteo at one of the first companies I worked for in Singapore. I’ve been in Singapore for the last 10 to 12 years. It was a psycho back company. And since then, we have been good friends. We always talked about starting something together, and we iterated on several concepts. And this is a problem we came across in our previous experiences, and we’ve solved it in multiple different ways. We’ll explain what that means in a second. Speaking about my personal experience, maybe the last few companies would be a good place to start. Just before this, I was the CTO of a fintech company solving a payment problem in Southeast Asia. Before that, I was involved with a publicly listed company called Gog as a CTO, basically solving logistics in Southeast Asia and Australia. And before that I was involved with patients which had a 200 million exit basically by an appellate company, right? That’s a little about me.

Eric Dodds 05:11
Awesome. Matteo?

Matteo Pelati 05:14
Yeah. So, as Vivek said, we have known each other for about 10 years. We worked together at one of the first companies. I come from a mixed background. I’ve been working in software engineering for the last 20 years, and in data for about the last 10 years. I’ve been jumping around between startups and, mostly, financial institutions. I was part of rails, I was part of DataRobot relatively early, when they were scaling, and I was helping to scale out their product to enterprises. Right after that, I joined DBS Bank, which is the biggest bank in Southeast Asia, and helped to build the entire data platform and the data team, actually, from the ground up. And right after that, before starting Dozer, I was leading the data group and data engineering for Asia Pacific at Goldman Sachs. And yeah, after that, Vivek and I, we had always been iterating on ideas, and we liked the concept of Dozer very much. And we just decided to jump in.

Eric Dodds 06:30
Awesome. Well, before we get into Dozer specifics, you know, it’s really interesting hearing your stories. There’s both a startup background and also a large enterprise background, both in fintech. Does Dozer have roots in sort of fintech-flavored problems? Or is that just a coincidence of the experience that both of you have?

Matteo Pelati 07:01
I think it’s kind of a coincidence. Because Dozer solves a generic data problem, which I happened to face in financial institutions. And Vivek and I happened to meet at a fintech startup. But it’s not something that is specific to fintech at all, actually.

Eric Dodds 07:31
Okay, well, give us a high-level overview. And then let’s talk about the problems that you faced that drove you to actually start building Dozer. But give us an overview of what Dozer is and what it does.

Vivek Gudapuri 07:46
So, Dozer: basically, you point us at any data source or multiple data sources. With a simple configuration in YAML, we can produce APIs in gRPC and REST. And developers can actually put that together in a few minutes and start working on data products right away. The main problem statement is that there is a significant investment in the data world. You know, there are a lot of tools working on ingestion, transformations, etc. But there are not many arrows going out of the data warehouses and data lakes. So typically, companies looking to solve a data serving problem end up building a lot of infrastructure from scratch. And that is what we have done in the past as well. Typically, if you’re working with real time, you will bring in Kafka; you will bring in, for example, Spark infrastructure for scheduled batch jobs; you’re bringing in Elasticsearch or Redis to cache the queries; and you build APIs on top of that. So typically, you have to stitch together various technologies, and it requires a significant amount of time and coordination between several teams, and cost as well, right? This is what we personally faced, and this is what we wanted to productize, so that a small team or a single developer can go all the way from leveraging data to producing APIs, build data products right away, and work on the problem that they care most about. Right? So that’s what Dozer solves.

Eric Dodds 09:17
Makes total sense. Can you give us just an example? Matteo or Vivek, it sounds like you both faced this and solved it using a complex set of technologies. Tell us how you solved it before. And if you can, give us a specific example: you needed to deliver a data product that did X, and what stack did you use to actually deliver that?

Matteo Pelati 09:49
Maybe I can start with the kind of problem that I had, and Vivek can follow on with his problem. The biggest challenge that I faced was when I was at DBS. DBS wanted to build a unified API layer to serve all the banking applications across products and across countries. So we’re talking about core banking, insurance, wealth, etc. And what we wanted to achieve as well was offloading all the source systems. So you can imagine, you have, in this case, tens of different source systems serving different kinds of products. And in order to achieve all that, we had to start building a very complex infrastructure: capturing data from all the source systems, preparing it, caching it, and building an API layer on top of it. Now, saying that seems very simple, but in reality it’s fairly complex work, because we’re talking about a number of systems, we’re talking about capturing everything in real time, and we’re talking about making sure the system is extremely reliable, because this is not just dashboard data; it serves the banking applications directly. And that’s where we realized how much time was spent building the entire plumbing. And that’s how the idea of Dozer came about.

Eric Dodds 11:45
Yeah, interesting. So this is probably a really primitive example, but let’s say you have an account balance that needs to be available in multiple apps, you know, that relate to insurance or something. And so you need an API that essentially makes that balance available within all sorts of applications across the ecosystem.

Matteo Pelati 12:12
Yeah, that’s correct. And especially if you think about when you open your banking app, you see your current account balance, you see your wealth account balance, you see your insurance. These are handled by different systems. And traditionally, you query each and every system to get the data. But that becomes complex, and the load on the source systems is very heavy. So that’s what we wanted to achieve: a unified layer that was much easier for app developers to integrate with and that, at the same time, reduced the complexity and the load on the source systems.

Eric Dodds 12:59
Yep. Yep. I mean, I would guess that you have to make a lot of decisions around cost, because you have to decide, you know, how much of what to cache depending on how often people check something, right? Like mobile more often than web? Were there a lot of trade-offs that you had to make just in terms of caching and the cost of running the queries? Because, I mean, people want to know when money hits their bank account. They need to know that basically when it happens, right?

Matteo Pelati 13:35
To be honest, in terms of cost, talking about data caching, we didn’t have to make so many trade-offs. The cost of running the read load on the source systems was so high that even caching the entire data was cheaper. Fundamentally, what we did was pre-package the data: we pre-aggregated the data and stored it in the cache, which was basically a persistent cache, holding an entire user profile with a part of the transaction history. And this was at a much lower cost anyway than hitting the source system itself.

Eric Dodds 14:21
Interesting. Wow. Okay, so the caching just wasn’t even a big deal because it costs so much at the source to generate it. Wow, okay.

Matteo Pelati 14:31
That’s fine. I mean, we’re talking about legacy systems like mainframes like their call wear, that are not specifically designed for read load. Each read operation is a cost to the company. How many?

Eric Dodds 14:54
Just out of curiosity, how many separate, let’s just say, data engineering tools would you say you used to build the pipeline that served that? I mean, are we talking like three, or like 10? Or

Matteo Pelati 15:12
We’re talking about probably around 10, maybe a little bit less than 10, because there were a bunch of tools that were legacy tools to connect to the source systems. Plus, we had the entire infrastructure, which leveraged Kafka. Plus, obviously, there was a lot of custom code as well. And, you know, when we started implementing, we were not just stitching the pieces together, but we properly defined a full, I would say, now you’d use the term, data contract, so that all the data that was published through the API was fully documented, and the documentation was fully generated. And we had a caching layer. And not just one, but multiple, depending on what kind of lookup.

Eric Dodds 16:20
You’re right. Sure. Yes.

Matteo Pelati 16:23
I mean, it depends on your query pattern. You will choose one or another.

Eric Dodds 16:31
Yeah. Okay. Well, to all of our listeners: next time you’re trying to refresh your banking app, you know, to see your paycheck hit, just know that there’s a lot going on, and the little spinning bar is doing a lot behind the scenes. Vivek, I’d love to just hear, you know, what problem you faced, and then let’s dig into Dozer.

Vivek Gudapuri 16:56
Yeah. So before that, I’ll bring us to a slightly higher level for a second. This problem manifests, as we see it, in multiple different ways. Larger organizations will call it an experience layer, where you’re bringing data from multiple domains and serving certain domain APIs to end users, which could be direct customers, or it could be internal clients that are doing different things. It could be a simple problem if you have stellar microservices and you simply have an API. For example, let’s talk about a use case of user personalization, where you want to have some amount of calculated data about a user, which is coming through a machine learning model or something else. For example, in the case of fintech, which we were talking about earlier, it could be your credit scores, your risk profiles, etc., which are useful. And you’d have a certain amount of master data coming from certain other systems, where you’re thinking about data protection rules, etc., as well. But when it comes to a mobile app, as Matteo was describing, you’re putting all of that together as one user API, for example, and now we have to stitch together data that is coming from multiple systems. And having real-time data in these scenarios becomes very important. Similarly, another use case I would describe is that, I mean, today, a lot of system data will be available in a data warehouse. And data warehouses are typically not suitable for low-latency queries. Let’s say you have millions of users hitting your application: you cannot make all the calls back to a data warehouse. You have to bring that into a cache to serve all these APIs, and that suddenly becomes an entire pipeline to manage. And you’d have to think about real time and, you know, all the caching policies, etc.
So in my experience, we have had to deal with some of these problems where we had a data warehouse in place, and we had to bring information about users and certain profiles in the form of reports, or in the form of embedded personalized experiences for users. For example, in the case of fintech, as I mentioned, it could be risk profiles, spend patterns, whatnot. In the case of logistics, it could be driver locations, it could be a customer’s latest number of items in inventory, etc. Right? There are many things that are supposed to be kept real time, but this data is often coming from multiple different systems. And we still need to serve these APIs at low latency and high throughput.

Eric Dodds 19:31
Yeah, that makes total sense. This is probably a dumb question, but, you know, a lot of the data sources we’re talking about aren’t necessarily real time themselves, right? I mean, of course, with Kafka, or, you know, if you’re running Databricks and a Spark cluster, you can run some of those things in real time. When we think about a data warehouse, is the problem overcoming the limitation that a lot of the data coming into the warehouse is running on a batch job, right? And so you’re gonna get your payments data, you know, every hour or every six hours, whatever. And so the idea is that, okay, you actually have that data in Snowflake or BigQuery or whatever, and you need to make the latest data available in real time without having a complex set of pipelines.

Vivek Gudapuri 20:26
Yeah. So on that note, obviously, warehouses, as you mentioned, can sometimes hold a snapshot of information, which is taken on a certain schedule. Dozer works best in the context of real time when you connect us to the source systems. If you’re connected to a transactional system, we typically take the data via CDC and move that in real time. So we have inserts, deletes, and updates as they’re flowing through from the main transactional system, and we keep information fresh in real time. By real time, I mean, obviously, there’s a bit of data latency; as you know, CDC will also have a little bit of lag. But it is as good as you can get from a data latency standpoint. But if you already have information in a data warehouse, and you want to connect that with your other data streams, that’s something you can do in a very similar fashion from an experience standpoint. So you could basically pull in Snowflake, pull in Postgres, or, in the future, pull in other transactional systems as well. And you can connect them as if you’re writing a simple join query between tables and columns, and that will immediately produce an API.

Eric Dodds 21:33
Interesting. Okay, so I’m just gonna come up with a fake use case here. Let’s say that I have a SaaS app, and someone’s on a free trial. And my app database is running in Postgres. So I have some basic data in Postgres about what features have been used, the status of the person’s trial, whatever it is. And, of course, I want to send messages to that person, or maybe even modify the app or the marketing site using that data. And so, with Dozer, from what I can tell, I can essentially turn that Postgres data into an API, and then just hit the API to grab the data that I need to make the decisions that I want to make, in my React app or however I’m delivering my app. Is that accurate?

Matteo Pelati 22:35
Correct. You can connect to real-time sources and, let’s say, less real-time sources. You can define how you want to combine the data, or even whether you want to pre-aggregate the data, to create the payload of your API, and the API is automatically exposed.

Vivek Gudapuri 23:03
That’s roughly how it works.

Eric Dodds 23:05
Interesting, okay. So when you say pre-combine data, let’s just say I’m running my app database in Postgres, but then the marketing team is collecting a bunch of whatever data they collect, you know, clickstream data, web views, marketing data. Dozer would allow me to actually join that data and make the join available as an API, like an endpoint I can curl?

Matteo Pelati 23:42
That’s correct. So fundamentally, as you mentioned, you have your Postgres database, and let’s say you also have some analytical data coming out of your Snowflake or Delta Lake, and you want to join this data, or even do some additional stuff on top of the join: you want to do some aggregation, you want to do anything. Every time something changes on the source, the change is propagated to Dozer, and Dozer updates the output, stores it in the cache, and makes it available.
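The propagation described here, where a change on one source updates a joined and aggregated result in the cache, can be illustrated with a minimal Python sketch. This is not Dozer’s implementation; the names `users`, `page_views`, `cache`, `on_user_change`, and `on_page_view` are hypothetical stand-ins for an app database, an analytical source, and the cached API payload.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins (assumptions, not Dozer objects):
users = {}                      # user_id -> row from the app database
page_views = defaultdict(int)   # user_id -> running count from analytics
cache = {}                      # user_id -> joined payload an API would serve


def refresh(user_id):
    """Recompute only the cached row affected by a change."""
    if user_id in users:
        cache[user_id] = {
            "user_id": user_id,
            "name": users[user_id]["name"],
            "page_views": page_views[user_id],
        }
    else:
        # Inner-join semantics: no user row means no joined row.
        cache.pop(user_id, None)


def on_user_change(op, user_id, row=None):
    """Handle an insert, update, or delete from the users source."""
    if op == "delete":
        users.pop(user_id, None)
    else:
        users[user_id] = row
    refresh(user_id)


def on_page_view(user_id):
    """Handle one new page-view event from the analytics source."""
    page_views[user_id] += 1
    refresh(user_id)


on_user_change("insert", 1, {"name": "Ada"})
on_page_view(1)
on_page_view(1)
on_user_change("update", 1, {"name": "Ada L."})
print(cache[1])
```

The point of the sketch is that each change touches only the affected output row, so reads never have to re-run the join against the sources.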

Eric Dodds 24:28
That is fascinating. Okay, so, man, I have so many more questions, but I know Kostas has a ton of questions. Let’s just talk it through, and this will probably be a good handoff for Kostas. Say I have a Postgres database, and then I have an analytical database with Snowflake. And then, just to make it even more complicated, let’s say our ML team is working in Databricks or Spark, and so I have some output there, right? And my marketing team wants to personalize a page on the site based on something we know about these people, which needs to combine these three, you know, app database, analytics database, ML database. How do I do that with Dozer? How do I install Dozer? How do I connect the sources? Can you just give us a quick walkthrough? I have those three data sources, and I want to make them an API.

Vivek Gudapuri 25:37
Yeah, so the Dozer experience is mainly driven through YAML and SQL, right? In the YAML, one block would be connections, where you would specify the three connections you mentioned. One block would be the SQL transformations that you want to perform on the source systems. And then you specify the API endpoints, how they are to be exposed, and the indexes that need to be created. And that’s it. That’s all you have to do, and you have APIs available in gRPC and REST, right? That sounds super easy, but I’m sure there are a few more details there in how it happens, right? In the back end, at least. So can you guys tell us a little bit about what happens behind the scenes when I provide these YAML files, and when I provide this SQL and my choice of the API protocol that I want to use?

Matteo Pelati 26:38
Behind the scenes, we have multiple connectors that capture real-time data from databases or data warehouses. So, let’s say, from Postgres we use CDC; from Snowflake we use table streams. Every time there is an event, which can be an insert, a delete, or an update from any of the sources, we capture it. After this data is captured, it goes through the SQL that you have defined. This SQL is fundamentally transformed into a DAG, a directed acyclic graph, and that DAG is executed in real time as the data is in transit. So we keep the state of the output data always up to date in the caching layer. And because we know the output of your SQL query, we can produce the output schema of the API. That’s how we generate the protobuf definitions and the OpenAPI definitions. So, in brief, this is the entire flow of execution from the sources all the way to consumption.
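To make the flow above more concrete, here is a small Python sketch of change events passing through an operator graph into a cache. It is an illustration under stated assumptions, not Dozer’s engine: the operator names, the `(old_row, new_row)` event shape, and the example query in the comment are all hypothetical, and the graph is a simple chain ordered with the standard library’s `graphlib`.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical operator graph for a query such as:
#   SELECT country, COUNT(*) AS signups FROM users WHERE active GROUP BY country
# Each node maps to the set of nodes it depends on.
dag = {
    "source": set(),
    "filter_active": {"source"},
    "count_by_country": {"filter_active"},
    "cache_sink": {"count_by_country"},
}
order = list(TopologicalSorter(dag).static_order())

counts = {}  # aggregation state kept between events
cache = {}   # rows an API layer would read


def source(event):
    # An event is (old_row, new_row): insert = (None, row),
    # delete = (row, None), update = (old, new).
    return event


def filter_active(event):
    old, new = event
    old = old if old and old["active"] else None
    new = new if new and new["active"] else None
    return (old, new) if (old or new) else None


def count_by_country(event):
    old, new = event
    touched = set()
    if old:
        counts[old["country"]] = counts.get(old["country"], 0) - 1
        touched.add(old["country"])
    if new:
        counts[new["country"]] = counts.get(new["country"], 0) + 1
        touched.add(new["country"])
    return {country: counts[country] for country in touched}


def cache_sink(delta):
    for country, n in delta.items():
        if n:
            cache[country] = {"country": country, "signups": n}
        else:
            cache.pop(country, None)
    return delta


OPERATORS = {
    "source": source,
    "filter_active": filter_active,
    "count_by_country": count_by_country,
    "cache_sink": cache_sink,
}


def process(event):
    """Push one change event through the operators in dependency order."""
    value = event
    for name in order:
        value = OPERATORS[name](value)
        if value is None:  # the event was filtered out
            return


process((None, {"id": 1, "country": "SG", "active": True}))    # insert
process((None, {"id": 2, "country": "IT", "active": False}))   # insert, filtered
process(({"id": 1, "country": "SG", "active": True},
         {"id": 1, "country": "IT", "active": True}))          # update
print(cache)
```

Because the output columns (`country`, `signups`) are known from the query, an API schema could be derived from them, which is the idea behind generating protobuf and OpenAPI definitions mentioned above.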

Vivek Gudapuri 28:10
All right, that’s super interesting. And I guess, in a way, you are also a consumer, like a service that relies on data that is coming from a number of other services where you don’t really control the schema, right? So let’s say that, for whatever reason, I drop a column on Snowflake, right? Or even worse, on my production Postgres database. I’m using a drop for a reason, because adding a new column is a bit easier to handle. The other problem is more silent, right? The data will keep coming, and values will be missing or will be all nulls, or something like that. How do you deal with that? Because, again, we are talking about a service on the other side that someone is consuming, right? They are driving a product or an application, and it doesn’t matter if it is internal or external. How do you deal with that? Yeah, that’s actually a really good question. Because this is what typically happens in companies: when you have multiple people working on multiple systems, there needs to be an entire coordination in place for someone to do a schema migration of sorts, right? So this is something we really thought about, and Dozer has an API versioning experience. To create a new API version, you just change the SQL, or, you know, the source schema has changed or the types have changed, and you automatically publish a new API. And with a few commands, you can switch the API to the new version, right? So we actually run two pipelines in parallel and populate both of them, and a developer can simply switch from one version to the second version. Obviously, as you mentioned, if it’s a destructive change and one of the pipelines completely breaks, we have an error notification experience in place, where we will let you know that the pipeline is not working anymore. 
But if it’s a straightforward change, and nothing has to change in terms of schemas, we simply overwrite the version. But if there is a breaking change, we automatically create a parallel version. So from a developer standpoint, you typically work with YAML, you deploy a new pipeline, and it starts to work as the parallel version, and you can simply switch to the parallel version. By the way, this API versioning experience is not part of our open source, because it has a lot to do with infrastructure and not just code. That is coming out in our cloud version, which we are launching soon in beta. Right? But anyone deploying this in a self-hosted manner could also deploy it in a similar fashion. We’ll write about that on our blog. It makes a lot of sense. Okay, so from my experience of working with these systems, you have the database on one side, which represents some kind of state, right? And from this very concrete state, you go to a series of events that actually represent the operations that are applied to the state, right? The reason I’m saying that as an introduction is because one of the, let’s say, tough situations with CDC is going and recreating the state, right? Because you might need, I mean, not the whole state, but part of it. Acting on an individual event is just part of what you usually want to do, right? But this brings some complexity in terms of the SQL that you have to write to do that. And it also has computational complexity: there might be a lot of events coming from CDC, and when we are talking about systems that are operating more like a cache, okay, you always think that there are some constraints in terms of the resources that you have there. 
So how do you deal with these complexities of working with event streams coming from data systems?
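The versioning rule described earlier in this exchange (overwrite the current version on a compatible change, create a parallel version on a breaking one) can be sketched as follows. This is a hypothetical Python illustration, not Dozer’s cloud feature: `is_breaking` and `publish` are made-up names, schemas are plain dictionaries of field name to type, and treating an added field as compatible is an assumption of the sketch.

```python
def is_breaking(old_schema, new_schema):
    """Assumption: a change breaks consumers if any existing field
    was removed or changed type; added fields are compatible."""
    return any(new_schema.get(field) != ftype
               for field, ftype in old_schema.items())


def publish(versions, new_schema):
    """Publish a schema; return the version number now serving it.

    `versions` is a list of schemas where index + 1 is the version number.
    """
    if not versions:
        versions.append(new_schema)
    elif is_breaking(versions[-1], new_schema):
        # Breaking change: keep the old version serving and add a
        # parallel one that consumers can switch to when ready.
        versions.append(new_schema)
    else:
        # Compatible change: overwrite the current version in place.
        versions[-1] = new_schema
    return len(versions)


versions = []
v1 = publish(versions, {"id": "int", "email": "string"})
v_added = publish(versions, {"id": "int", "email": "string", "name": "string"})
v_dropped = publish(versions, {"id": "int", "name": "string"})
print(v1, v_added, v_dropped)
```

Keeping the old version alive until consumers switch is what lets a dropped column surface as a new API version rather than as silently missing values.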

Matteo Pelati 32:43
Yeah, so one technology choice that we made with Dozer is to implement everything in Rust. I mean, that’s something that is happening in a lot of tools in the data engineering space, and we fully believe that this is gonna change a lot. In my experience, you know, when you have to deal with distributed systems and JVM-based tools, which is most of the tools in this space, you add a lot of complexity to your system, sometimes for no particular reason. In some situations, it is really justified to have a fully distributed system because the volume is so big, but in other situations, you don’t really need one. So we said, okay, let’s take a much leaner approach, because a language like Rust allows us to get incredible performance with much more simplicity. And that’s how we followed the implementation in Dozer. I mean, the execution of the pipeline is actually run by a single process. Now, it can be distributed among multiple processes and nodes, and that’s what we are doing in our cloud version. But the open source version is fundamentally a single binary, which is much, much easier to run and manage than a full cluster. That’s the approach we follow in Dozer, and it turns out to be quite useful. I mean, it’s simple to manage, with the performance that you get. And another thing that we noticed when we started experimenting with running Dozer on different machines: you know, with ARM-based cores getting more and more popular, and especially with the large number of cores, if you have an ARM-based machine with like 64 cores, you don’t really need a cluster; you can scale out your computation across multiple cores on the same machine and achieve incredible performance. 
So you can achieve what was not possible before with much, much simpler code, and much simpler infrastructure.

Vivek Gudapuri 35:42
Yeah, that makes sense. So from the user perspective, when the user has to deal with event data and work on recreating, let’s say, the state, how is the experience there? I mean, how does the user go from a stream of events that represent changes that have been applied on a table, let’s say the user table? How is the user of Dozer going to implement the SQL query that takes that and keeps only, let’s say, the new signups, so that only these are exposed through the API? The reason I’m asking is that when someone develops, they need to have access to some data and go through some process to do the actual development. So what’s the experience here? Because it’s a little bit different from a database system, right? In a database system, you have your interface, you have the data, you start writing a query, you see what happens. And at some point, through iterations, you end up with a query. But if you have something like Dozer, how does this experience happen?

So as Matteo mentioned earlier about the high-level infrastructure, Dozer is fundamentally four components: we have connectors, there is a real-time SQL engine running, we have a caching layer, and on top of that, APIs are available. Underneath, once the data crosses the connectors, everything is turned into CDC, which means it’s an insert, a delete, or an update. And the SQL is working on the data as if it’s a simple table with a bunch of columns, right? So if you are connecting to a CDC source, a Postgres database, for example, you’re getting inserts, deletes and updates as they are flowing through. 
So you’re making changes to the database, and you get inserts, deletes and updates in Dozer. But let’s say you’re working with Kafka and, as you mentioned, you’re working with events in that context. You could actually say events would be available as a table in your SQL, and you could write whatever business logic you have on top of the events in SQL. You can combine that with data coming out of Postgres and express it as a series of transformations, and the output of the SQL would be produced as an API. So that’s the experience.

So when I’m connecting Dozer to Postgres, what I see is not a stream of events, even though inserts, deletes, and updates are still coming. I see the table, right? If I connect to the user table, what I see through Dozer is the table itself, right?

That’s right. At the end of it, after you write the SQL, when you call the API, you’d see records as if you’re seeing a table, but you’re not working with events directly.

Matteo Pelati 38:55
On top of that, you see the table, but you don’t see a static table. You see a live table with all the data that is actually changing in real time. So basically, if you do a select, and your select produces, let’s say, 10 rows, and a new row is added in the database, the new row, if it satisfies the condition of your SQL, will suddenly appear in the list. So you see the table, but it’s actually more than a table, because it’s a live table.
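
The live-table idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not Dozer’s actual implementation (which the speakers say is written in Rust): the names `Change`, `LiveTable`, and `demo_live_table` are hypothetical, and the filter stands in for the WHERE clause of a user’s SQL, such as keeping only new signups.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class Change:
    """One CDC operation (hypothetical shape); `new` is None for deletes."""
    op: str                      # "insert", "update" or "delete"
    key: int                     # primary key of the affected row
    new: Optional[Row] = None    # row contents after the change


class LiveTable:
    """Keeps the result of a filtered SELECT up to date as changes arrive."""

    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate
        self.rows: Dict[int, Row] = {}

    def apply(self, change: Change) -> None:
        if change.op == "delete":
            self.rows.pop(change.key, None)
        elif change.op in ("insert", "update"):
            if self.predicate(change.new):
                self.rows[change.key] = change.new
            else:
                # An update can make a row stop matching the filter.
                self.rows.pop(change.key, None)
        else:
            raise ValueError(f"unknown op: {change.op}")

    def query(self) -> List[Row]:
        """What an API call would return right now, ordered by key."""
        return [self.rows[k] for k in sorted(self.rows)]


def demo_live_table() -> List[Row]:
    # Roughly: SELECT * FROM users WHERE status = 'new'
    new_signups = LiveTable(lambda row: row["status"] == "new")
    changes = [
        Change("insert", 1, {"id": 1, "status": "new"}),
        Change("insert", 2, {"id": 2, "status": "active"}),
        Change("insert", 3, {"id": 3, "status": "new"}),
        Change("update", 1, {"id": 1, "status": "active"}),
        Change("delete", 2),
    ]
    for change in changes:
        new_signups.apply(change)
    return new_signups.query()
```

Each change touches only the affected row, so a reader of the result never re-runs the query; the result simply changes under them, which is what makes the table "live".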

Vivek Gudapuri 39:37
And one of the, I mean, historical issues with CDC is how you seed the initial table, right? Because when you connect to the CDC feed for Postgres, it has a limited capacity. You can’t really access the whole table through the CDC feed itself. So how does this happen in Dozer? How do we get access to the whole table? Or is it a decision you made that you only get the updates from the date that you installed the pipeline?

Yeah. So typically with connectors, we have a snapshot, and then a CDC that is continuous. In the case of Postgres, for example, we start a transaction, get the snapshot of the table, and kick off CDC, so you get the initial state and all the updates that come after it.

Okay, that’s interesting. So, enough with the technicalities. I’m asking all these questions because I find these things very fascinating as a way of dealing with data. But it’s also quite challenging, especially if you’re trying to scale it, right? So it’s very interesting to see how a real system handles all these interesting parts and trade-offs. So okay, let’s move towards the use cases. And as we go there, I want to ask something, because Matteo, at the beginning of our recording, used the term read-heavy applications, where you’d usually use a cache layer. That’s the whole idea of having a cache for reads. And I would like to ask you guys: when you were thinking about Dozer and how Dozer is implemented today, are we talking about a system that is primarily trying to serve read-heavy workloads, and is this a big part of the definition of what real time is? Is it more about the latency? Or is it both? Because you don’t necessarily need both, right? 
You can have a low-latency system that processes data with even sub-millisecond latencies, but still you’re not going to have too many reads happening, right? Or too many writes. These are different concepts, throughput and latency. So where does Dozer stand between these parameters of the problem that you’re solving?
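
The snapshot-then-CDC seeding described earlier in this exchange can be sketched as follows. This is a simplified model under stated assumptions, not Dozer’s connector code: the integer `position` stands in for a log offset (such as a Postgres LSN), the snapshot is simulated by replaying the log up to that position rather than by reading a real table inside a transaction, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (position, op, key, value); `position` is an assumed monotonically
# increasing log offset, and `value` is ignored for deletes.
LogEntry = Tuple[int, str, int, str]


def replay(state: Dict[int, str], entries: List[LogEntry]) -> Dict[int, str]:
    """Apply insert/update/delete entries to a copy of `state`."""
    result = dict(state)
    for _, op, key, value in entries:
        if op == "delete":
            result.pop(key, None)
        else:  # "insert" and "update" both set the latest value
            result[key] = value
    return result


def snapshot(log: List[LogEntry], upto: int) -> Dict[int, str]:
    """Stand-in for a consistent table read taken at log position `upto`."""
    return replay({}, [entry for entry in log if entry[0] <= upto])


def seed_then_stream(log: List[LogEntry], snapshot_pos: int) -> Dict[int, str]:
    """Take the initial state, then apply only changes after the snapshot."""
    initial = snapshot(log, snapshot_pos)
    tail = [entry for entry in log if entry[0] > snapshot_pos]
    return replay(initial, tail)


LOG: List[LogEntry] = [
    (1, "insert", 1, "alice"),
    (2, "insert", 2, "bob"),
    (3, "update", 1, "alicia"),
    (4, "delete", 2, ""),
]
```

The point of pinning the snapshot to a log position is that no change is applied twice and none is missed: `seed_then_stream(LOG, 2)` ends in the same state as replaying the whole log from the beginning.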

Matteo Pelati 42:40
Good question, because this links back to the use case that I was mentioning. In the bank, it’s actually both. If you think about the actual banking application, it’s read heavy in terms of cache, because obviously you have a lot of users logging into your app and checking their bank account. And, you know, it’s surprising to see how many users log into their banking application after they do any operation. They withdraw money at the ATM, and they immediately log into the banking app to check if the transaction is correct. That’s really heavy on the cache. But at the same time, because there are these kinds of users, you need low latency on the pipeline. So if I withdraw my money from the ATM, and the source database is updated, obviously I need low latency in the pipeline execution so that I can display the data in real time or near real time to the user. So we try to address both scenarios.

Vivek Gudapuri 44:06
Yeah, just trying to add something on top of that. What we want to do from a Dozer standpoint is to become the de facto standard in the way you think about data serving. In some cases, we are unlocking data for companies. For example, sometimes in enterprises, you don’t have access to a source system that is hidden behind several controls and is sitting in a certain business unit. And to actually make that part of a user experience, you’d have to think about creating so much infrastructure internally, and it’s a challenging project of several months. In some cases, you’re dealing with read scalability, where Postgres is not able to answer those queries anymore. In some cases, you’re talking about creating an entire layer of APIs where you are combining several things and exposing that. So it definitely comes into play at a certain scale of the company. I would not say a company starting today would need Dozer right away. But when you’re thinking about standardizing your read traffic, and about scaling your read infrastructure especially, and your data serving capacity, that’s where it comes in and makes the most sense.

Okay, let’s talk a little bit more about the use cases. We’ve used the banking sector a lot as an example of how a system like this is needed. What are the use cases that you have seen so far? That’s actually, in my opinion, one of the beauties of building a company, because there are always people out there who surprise you with how they use the technology that you are building. So what have you learned so far with Dozer? What have you seen people doing with it?

Yeah, so I would like to describe two use cases that we saw that are very interesting. 
Smaller companies are basically looking to use Dozer because suddenly you get an SDK that you can plug in, you get real-time APIs, and you can start building right away. So saving cost and time and immediately starting to build products, that’s what is appealing at the lower end of the spectrum. And if we talk about enterprises, and we are currently engaged with a few enterprises, this is where unlocking the value of data comes into play, and terms like experience layer, etc., come into play. Today, without naming names, some enterprise companies are dealing with large volumes of data sitting in disparate systems. And they’re currently thinking about creating a large infrastructure, which potentially takes a few months or even a few years, and putting together a large stack and a large team to solve this problem, right? It suddenly becomes a multi-billion-dollar project, and at the end of it, you still don’t know exactly how to build it, because there are too many technical complexities and several key stakeholders involved. So this is where we received some inbound interest, which was very interesting for us: instead of creating all this infrastructure and plumbing code yourself, Dozer can immediately provision a data API back end for you. And you can start to work one API at a time and actually build an entire experience layer for a company. So that’s what we have seen for large enterprises.

Matteo Pelati 47:33
I think there is another interesting use case that came mostly from our open source users. That is where you have, let’s say, a product engineer, a full-stack engineer, who is tasked with integrating data into a consumer-facing application. And this data can be coming from the source system as well as the data warehouse, so the engineer has to deal a lot with the data engineering team. And, you know, there is always friction between product engineering and data engineering over who does what. Dozer actually started to prove very useful in helping these kinds of engineers get the data they want, combine it in the way they want, expose an API, and integrate, without having to go through the entire process of building a pipeline on the data lake, getting approvals to run the pipeline there, etc. So it started to be the, let’s say, last-mile delivery of data for product engineers, bridging the gap between the data engineer and the product engineer.

Vivek Gudapuri 49:01
That’s super interesting. And you used the term experience layer earlier. Can you explain what you mean by experience layer?

Matteo Pelati 49:14
Yeah, experience layer is a term that is typically used in banking, telco, and bigger enterprises. So fundamentally, you have your domain layer. Let’s say, for example, in the banking space, you have your domains, which are fundamentally wealth management, banking, insurance, etc. And these domains build systems that expose data relative to that specific domain. Now, when you build a mobile app, we are mostly talking about a user experience. And at that point, you don’t care about the domain. You care about, for example, giving an overview of your balances, whether it’s in insurance, whether it’s in wealth, whatever it is. That’s the definition of the experience layer: it is the layer that you put in front of your user to better serve the user. I think this is a term that is mostly used in these spaces. But we have seen companies needing something like that even if they don’t call it an experience layer. Maybe they call it something different, but that’s fundamentally what it is.

Vivek Gudapuri 50:45
Yep, yep. Yeah, no, that makes total sense. Okay, one last question from me, and then I’ll give the microphone back to Eric. We have a system that on one side might be driving a user-facing application, and on the other side is consuming data from various data systems that are probably already working at their limits and all that stuff. So I would assume that reliability is important, right? How do you deal with that? And what kind of guarantees can Dozer give when it comes to the reliability of the system?

Yes. So as you rightly mentioned, it’s a difficult problem. We approach reliability in multiple ways, and some of these things are also coming in some of our future versions. The data, as we get it from the sources, we actually store depending on the type of the connector. For example, if the connector can support a replay, we don’t necessarily have to store all the information ourselves. But if the connector doesn’t have a replay mechanism, we have the ability to persist that in a certain Dozer format. So that’s one guarantee: even if a pipeline breaks, we can restart, replay the messages, and recreate the state. On the other end of the spectrum, the API layer, the caching layer, is based on the LMDB database, which is a memory-mapped file. And we can scale the number of APIs on the existing state as it stands in a horizontal fashion. So even if the pipeline breaks, we can still serve the APIs with the existing data as it stands. If the pipeline breaks, for example, you will see a little bit of data latency. But when the pipeline kicks in and you have a new version deployed, we automatically switch to the new version, and you have the API available again, right? 
So with all this, we still guarantee that APIs are not down, while on the data pipeline side we will try to replay the messages and recreate the state. In the cloud version, or in an enterprise deployment, this typically would be a Kubernetes cluster with different types of pods doing different things. And even as the pods go down, we still have a way to maintain the state so that APIs will not go down.
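
The two guarantees described above, rebuilding state by replaying persisted messages and continuing to serve the last good state while that happens, can be sketched like this. It is a toy model under assumptions: a plain dictionary stands in for the memory-mapped LMDB cache, the event shape is invented for the example, and `rebuild`, `ApiCache`, and `demo_failover` are hypothetical names, not Dozer APIs.

```python
from typing import Dict, List, Optional, Tuple

# (op, key, value); an assumed shape for persisted source messages.
Event = Tuple[str, str, Optional[int]]


def rebuild(events: List[Event]) -> Dict[str, int]:
    """Recreate pipeline state from scratch by replaying persisted events."""
    state: Dict[str, int] = {}
    for op, key, value in events:
        if op == "delete":
            state.pop(key, None)
        else:  # "upsert"
            state[key] = value
    return state


class ApiCache:
    """Serves reads from the active version; publishing swaps versions."""

    def __init__(self) -> None:
        self._versions: Dict[int, Dict[str, int]] = {}
        self._active: Optional[int] = None

    def publish(self, version: int, state: Dict[str, int]) -> None:
        # Store the new version fully before making it the active one.
        self._versions[version] = dict(state)
        self._active = version

    def get(self, key: str) -> Optional[int]:
        if self._active is None:
            return None
        return self._versions[self._active].get(key)


def demo_failover() -> Tuple[Optional[int], Optional[int]]:
    persisted = [
        ("upsert", "acct-1", 100),
        ("upsert", "acct-2", 50),
        ("upsert", "acct-1", 80),   # arrives while the pipeline is down
    ]
    cache = ApiCache()
    cache.publish(1, rebuild(persisted[:2]))

    # Pipeline is down: reads still succeed, but the data is stale.
    stale = cache.get("acct-1")

    # Pipeline restarts, replays everything, and a new version goes live.
    cache.publish(2, rebuild(persisted))
    fresh = cache.get("acct-1")
    return stale, fresh
```

In this model a pipeline failure shows up as staleness rather than downtime, which matches the "little bit of data latency" trade-off mentioned in the conversation.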

Matteo Pelati 53:24
Actually, one thing I want to add is that in addition to reliability, one important aspect here is also observability, because you’re serving data to customers. So it’s actually 10 times more critical than any internal dashboard. Going back to DBS, it was not uncommon to get a support call saying, okay, my balance is wrong, why is that? And you need to really understand why the balance is wrong and trace it back. So that’s another important aspect.

All right. That’s all from my side.

Vivek Gudapuri 54:08
Eric, the microphone is yours.

Eric Dodds 54:13
All right. Well, we are really close to the buzzer. But of course, I have to ask: where did the name Dozer come from? I mean, you usually think about a bulldozer just pushing mounds of dirt into big piles. Give us the backstory.

Matteo Pelati 54:34
That’s quite interesting. Okay, so when Dozer started, and we’re almost a year into the journey, we were iterating on the idea, and I stumbled upon an article from Netflix where they were describing a system that was very similar to what I built at DBS. And the system’s name is actually Bulldozer. We kind of got inspiration from that. Obviously, we didn’t want to use the same name, Bulldozer, so we abbreviated it, and it became Dozer. But it actually turned out very, very well, because now the main author of Bulldozer at Netflix is actually helping us out as an advisor in the company.

Eric Dodds 55:42
Oh, wow, cool. It’s,

Matteo Pelati 55:44
it’s actually it’s a,

Eric Dodds 55:47
It’s a nice story. Wonderful. Well, what a great story to end on. Vivek, Matteo, thank you so much for your time. Fascinating. And I’m really excited that technologies like Dozer, I think, are going to enable a lot of companies to actually deploy a lot of real-time use cases, even at a small scale when they’re early on, and then scale to do things like the huge companies that y’all have both worked for. So it’s very exciting to see this democratization, especially in the form of a great developer experience. So congrats, and best of luck, and thanks for spending some time with us on The Data Stack Show.

Thank you. Thanks for having us.

Kostas, this fascinating conversation with Vivek and Matteo from Dozer was so interesting. Take the problem space of trying to turn data into an API. Think about all the data sources that a company has. Their goal is to turn all of those sources into APIs and actually even combine different sources into a single API, which is where things get really interesting. Imagine a production database, an analytical database, an ML database. Being able to combine those into a single API that you can access in real time is absolutely fascinating. I think my biggest takeaway, and we didn’t actually talk about this explicitly, is that they are anticipating what we’re already seeing become a huge movement, which is that data applications and data products are just going to become the norm. Right? Whether you’re serving those to an end user in an application, like the banking application we talked about, where you need an account balance across the mobile app, the web app, the insurance portal, etc., and of course you need that data there. Or you’re personalizing an experience based on demographics or whatever. All of these are data products. And we haven’t talked about that a ton on the show. 
But I really think that’s the way that things are going. And this is really tooling for the teams that are building those data products, whether they’re internal or for the end user. And I think APIs make a ton of sense as the way to enable those data products. So that’s my big takeaway.

Vivek Gudapuri 58:36
I don’t think an application engineer is going to change the way they operate, right? They have their tooling, and they should continue working with the tools they know how to use, the way they already do in application development. That’s where I see the opportunity for tools like Dozer. The same way that a data engineer doesn’t want to get into all the protobufs and, I don’t know, whatever else applications are using to exchange data, an application developer shouldn’t have to get into things like what a Delta table is. Why should they care about that, or about what Snowflake is? What they care about is getting access to the data they need, in the way it has to be shown, so they can build their stuff. And that’s how I think about what is happening. I think it’s primarily a developer tooling problem to be solved. It’s not marketing, it’s not a sales job, it’s not any of those. I mean, there is obviously space for those tools too, but if we want to enable builders, there’s a need to build tooling for engineers to go and build on top of that data. And I think we will see more and more of that happening, even in the reverse ETL tools that we’ve seen coming in the past two years. You see that even with, what’s the name of this one, one of these companies, Hightouch, yes. They started implementing a caching layer on top of, yeah, Snowflake, right.

Eric Dodds 1:00:37
So like an audience cache? Yeah, for sure.

Vivek Gudapuri 1:00:41
But yeah, forget audience and put in any kind of query result, right? What I’m saying is that they started from a marketing use case, but in the end, what they are building right now is interfaces for application developers to go and build on top of data that lives inside the warehouse. And I’m sure we’ll see more and more of that. It’s interesting to see that even Hightouch, which was a company very focused on the marketing use case, at least in my understanding when I saw them when they started, is also moving towards that, which is a good sign. It’s a sign that more exciting technology and developer tooling is coming.

Eric Dodds 1:01:29
Yeah, I agree. I think that we’ve talked a lot on the show over the last two years about the confluence of data engineering and software engineering. And nowhere is this more apparent than in putting an ML model into production, or taking data and delivering it to an application that’s providing an experience for an end user. And so we’ve actually had a lot of conversations around software development principles in data engineering, or vice versa. Tools like Dozer are fascinating because they actually may help create a healthy separation of concerns where there is good specialization. Not that there isn’t good, healthy cross-pollination of skill sets there. But if you have an API that can serve you the data that you need as an application developer, that’s actually better. You can do your job to the best of your ability without having to co-opt other skill sets or deal with a lot of data engineering concerns, and the other way around. So I think it’s super exciting, and an interesting shift since we started the show. So stay tuned if you want more conversations like this and more guests. Lots of exciting stuff coming your way. And we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
