Episode 99:

State of the Data Lakehouse with Vinoth Chandar of Apache Hudi and Onehouse

August 10, 2022

This week on The Data Stack Show, Eric and Kostas chat with Vinoth Chandar, the vice president of Apache Hudi. During the episode, Vinoth discusses all things Data Lakehouse, from a definition to necessary services and everything beyond.



Highlights from this week’s conversation include:

  • Vinoth’s background and career journey (3:08)
  • Defining “data lakehouse” (5:10)
  • Databricks versus lake houses (13:37)
  • The services a lakehouse needs (17:37)
  • How to communicate technical details (26:55)
  • Onehouse’s product vision (31:41)
  • Lakehouse performance versus BigQuery solutions (36:44)
  • How to deliver customer experience equally (40:17)
  • How to start building a lakehouse (44:00)
  • Big tech’s effect on smaller lakehouses (55:33)
  • Skipping the data warehouse (1:04:39)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Kostas, we always talk about getting guests back on the show, and we haven't actually done a great job of that. It's kind of hard with all the scheduling stuff, but we were able to do it. Vinoth, who was one of the creators of Apache Hudi, is coming back on the show. And I am really excited, because last time we talked to him, his project was in stealth mode. I remember before the show he said, we can't talk about what I'm working on. But it is now public. It's called Onehouse, and it's super interesting: a data lakehouse built on Hudi, of course, which isn't a huge surprise. So I'm super excited to learn more about Onehouse and the way they tackle the problem. One thing I want to do: we got a really good explanation from Vinoth last time about the difference between a data warehouse and a data lake, maybe one of the best explanations we've heard. But Onehouse is squarely in the data lakehouse space. So I want to leverage his ability to articulate these deep technical concepts really well and ask what the data lakehouse is, and just get a definition. So that is what I'm going to do. How about you?

Kostas Pardalis 0:22
Yeah, I'll have a hard time, to be honest. Vinoth is one of those guests who is always awesome to chat with on a deeply technical level. But I'm also very interested to hear more about the product and the business they are building, and his whole experience of going from an Apache open source project to trying to build a business on top of it. And lakehouses are also a very interesting, let's say, product category out there. I'd love to hear more about that and how he sees the future. So I'm pretty sure we are going to have a lot to chat about with him.

Eric Dodds 2:24
There’s no question. All right, let’s dive in.

Back to the show. This is your second time joining us on The Data Stack Show and it’s so good to have you back.

Vinoth Chandar 2:33
Yeah, it's fantastic to be back. I think last time it was a very deep, interesting technical conversation, so I look forward to another round of interesting, deep conversations here.

Eric Dodds 2:48
Absolutely. Well, for those of our listeners who missed the first episode, we have to ask you to do your intro again. So you can just give your brief background, and then I'd love for you to finish with what you're doing today at Onehouse, since last time you couldn't talk about it publicly.

Vinoth Chandar 3:07
Yeah, my name is Vinoth, and I've been working on open data infrastructure, this area around databases, for the last 10 years and change. I started my career at Oracle, working on Oracle server data replication, then built a key-value store at LinkedIn during the time when key-value stores were the cool thing that you built. Uber is where Apache Hudi happened. We brought transactions on top of Hadoop back in the day, and what we called transactional data lakes, which I think is a pretty nerdy engineering name, actually became what is now known as the lakehouse architecture. I continue to grow the project at the Apache Software Foundation, where I chair the PMC for Apache Hudi. Right after Uber, I actually spent a good amount of time at Confluent as well. I wasn't working on Kafka; I was working on ksqlDB, if you've heard of that, a streaming database, and Kafka Connect and a bunch of other things. Most recently, and I'm super excited to talk about it, there is Onehouse, which is where my current employment lies. I'm the co-founder at Onehouse, and our goal is to bring managed data lakes, or lakehouses, into existence. We see a world where there are fully managed closed systems, and then there are DIY open systems. And we're trying to build that kind of managed experience on top of open technologies like Apache Hudi.

Eric Dodds 4:54
Love it. Okay, I'd love to kind of set the stage and focus on a term that you mentioned, which is lakehouse. Some of our listeners will be familiar with that; some of them will have seen it in some sort of marketing materials, I'm sure. So I want to ask you for a definition of data lakehouse. But before we go there, could you remind us what the original use case for Hudi was, specifically for transactions on the data lake? What were you facing in that role inside of the company, and why did you need transactions on the data lake?

Vinoth Chandar 5:34
Got it. So here's the thing, we need to go back to 2015, 2016. Uber was growing very fast, we were building out our data platform, and all we had was an on-prem data warehouse at that time. We were hiring fast, we were building new products, and we were collecting high-scale data. So we couldn't fit all this data into our on-prem warehouse; it's not built for that amount of storage. A Hadoop cluster, which is essentially an HDFS cluster, had already been scaled to several hundreds of petabytes, at least, at LinkedIn, Twitter, Facebook, and many other places. So we built our Hadoop cluster, our data lake. And here's where I think we had a very interesting problem. Remember, my previous stint was at LinkedIn, and this was something that we didn't even face at LinkedIn, which is that Uber is a very real-time business. Trips happen, prices change, and there are huge operational aspects to the company. There are 4,000 engineers, and let's say 12,000 people who are operating cities, right, and they all need access to fresh, near-real-time data about what's going on out there. So essentially, what we found was that while we could spin up a Hadoop cluster, dump a bunch of files onto it, bring Spark or something, and write some queries, for some of our core data sets, like the trips, transactions, and the core database tables, we were not able to replicate them very easily onto the data lake. We would suffer multi-hour delays, eight hours or longer, in first ingesting the data and then writing ETL on top of it. It got to a pretty serious level where we couldn't run fraud checks fast enough, so we were actually losing money from fraud. It was a really serious business problem, actually. And we started to look at how to solve this.
And we essentially looked at what we had before that. How were we solving it before the Hadoop cluster? The on-prem warehouse supported transactions and updates, and you could write the kind of merge-style ETLs on top of it that people currently write using dbt on all of these warehouses, right. So essentially we said, that's pretty much it: we need to build that sort of functionality and bring it to the lake, but do it in a way that we retain the scalability, the cost efficiency, all the different advantages of the lake. And that is how Hudi was born. We called it a transactional data lake because, in our minds, what we were doing was introducing basic transactions, building some indexing schemes, and supporting updates and deletes. Your data lake is now mutable, which means you can take a change record from upstream and update the table instead of rewriting the whole thing, right. And that's kind of how Hudi was born. It was pretty early; it came before most of the other contemporary technologies that you see out there.
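To make the "mutable data lake" idea concrete, here is a toy Python sketch of a merge-style upsert: change records from upstream are applied by record key, with the latest version winning. The field names (`id`, `ts`) and the latest-wins rule are illustrative assumptions, not Hudi's actual merge implementation.

```python
def upsert(table: dict, changes: list, key: str = "id", ts: str = "ts") -> dict:
    """Apply upstream change records, keeping the latest version per key."""
    for rec in changes:
        existing = table.get(rec[key])
        # Only overwrite if the incoming record is newer (latest-wins merge).
        if existing is None or rec[ts] >= existing[ts]:
            table[rec[key]] = rec
    return table

# A tiny "trips" table absorbing a change stream instead of being rewritten.
trips = {1: {"id": 1, "fare": 10.0, "ts": 100}}
changed = [
    {"id": 1, "fare": 12.5, "ts": 200},  # update to an existing trip
    {"id": 2, "fare": 7.0, "ts": 150},   # brand-new trip
]
trips = upsert(trips, changed)
```

The point of the sketch is the contrast with append-only lakes: the change record for trip 1 replaces the old row in place rather than forcing a full-table rewrite downstream.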

Eric Dodds 9:03
Love it. Such a great story. I remember you talking about that in the previous episode, and it's so wonderful to hear the genesis story again. So you've kind of already answered a lot of those questions from a historical lens, but with that context, define the data lakehouse, especially through the lens of how you view the world at Onehouse.

Vinoth Chandar 9:28
Yeah, great question. So actually, the one key thing, technology-wise, that the lakehouse adds to the data lake is, as I mentioned, transactions and updates, right? It gives you mutability, so it gives you an impedance match with how you do things on the warehouse, if you look at it that way from a user standpoint. There are two other important aspects, though, and these are mostly there to improve the baseline performance of the data lake compared to a warehouse. One is metadata management. Most warehouses, even the cloud warehouses you see today, have pretty good, fully managed metadata systems where, when you want to execute a query, statistics for different files, columns, and so on are all built, maintained, and organized in a way that queries can be planned very quickly, right. So that is another area where the lakehouse adds technology, because lakes were pretty much just files, and across the individual query engines, the Hive metastore was basically all we had for metadata management. And the Hive metastore never tracked any file-level statistics, really granular, file-level statistics, any of these things. That's one big area. The second, which is where I think we spent quite a lot of time in Hudi and where we're much further along, is what we call table services. If you look at any warehouse, take Snowflake or BigQuery, you'll find a fully managed ingestion service, and you'll find all these different services that do useful things to your table. And they're all self-managing; you don't write code for any of these things. This is why I feel the term "table format" doesn't do justice to what we need to build overall. The table format alone is not enough.
You need a set of services that rival warehouses, services that can provide clustering, data loading, ingestion, all these other things. This is what we focused a lot on in Hudi. So I would say it's all three put together: the storage format, meaning the table format itself accepting updates, deletes, and transactionality, plus a well-optimized metadata layer, plus these well-managed table services. Imagine you take a warehouse and break it horizontally: together, those pieces give you the bottom half of the warehouse today. And then you can fit a query engine, Spark, Flink, Presto, or anything really, on top, right. So that, in my mind, is what a lakehouse should be. And in that sense, even DuckDB can connect to Onehouse. This is what we want to unlock: for people to be able to get this bottom half as a service, while they have the choice to pick and choose their query engines.
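As a rough illustration of how those three layers surface in practice, the dictionary below groups some Apache Hudi Spark-writer options by the layer they belong to. The option names are quoted from memory of Hudi's documentation and may differ across versions, so treat this as a sketch to check against the docs for your Hudi release, not a verified configuration.

```python
# Hudi write options, grouped by the three "bottom half" layers discussed.
hudi_options = {
    # 1. Transactional table format: keyed, upsertable, merge-on-read storage.
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # 2. Metadata layer: Hudi's internal metadata table with file-level stats.
    "hoodie.metadata.enable": "true",
    # 3. Table services: cleaning, compaction, clustering scheduled with writes.
    "hoodie.cleaner.commits.retained": "10",
    "hoodie.compact.inline.max.delta.commits": "5",
    "hoodie.clustering.inline": "true",
}

# With pyspark and the Hudi bundle on the classpath, these would be passed
# to a DataFrame writer, roughly:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```

The point is that the query engine on top (Spark here, but equally Presto, Trino, or Flink) is configured separately from this bottom half.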

Eric Dodds 12:57
Love that. Okay, one more question for me, Kostas, and then just let me set the stage for our listeners. From a marketing standpoint, Databricks has invested a lot in the lakehouse term, which is maybe one of the ways that a lot of our listeners, including me, have become familiar with the term. How do you think about Onehouse in relation to a Databricks flavor of lakehouse? Are they similar? I love the illustration of the bottom half of the warehouse, but help us understand the differences and similarities.

Vinoth Chandar 13:34
Yeah, it's a great question. I think Databricks' articulation of the lakehouse is slightly different, right? If you go from the paper, essentially it is Spark plus Delta Lake. And if you look at Delta Lake, there is an open source version of Delta Lake, and then there's a paid version of Delta Lake. So those are essentially two flavors of the bottom layer that I just mentioned, while their top layer, the super-optimized Spark runtime and now Photon and all of the investments they've put into that, honestly, could apply to other formats as well, right. At the end of the day, all these table formats are querying Parquet files. So surely you can optimize that; I think it's a decoupled problem. And the way they market it is as a full vertical stack against Snowflake, right? That's where I've seen most of their marketing energy being spent so far, and that's probably because Snowflake is one vertical stack, correct? But if you look at the pieces overall, it still kind of aligns with what I described. And here's the biggest thing we see a lot: Hudi and Delta have been around for much longer, supporting mutable workloads and everything in production for like three years now. So we routinely run into this. People like Hudi for how rich its table services ecosystem is, for how vibrant and grassroots open source the community is, or for several key technical differentiators, like concurrency control or indexing and whatnot. But they still want Databricks' Spark. Previously we didn't have to answer that, but as Onehouse, we deeply care about it, because somebody who wants to buy both Onehouse and Databricks should be able to get a really good end-to-end experience.
So even for us, some of the thinking is now very customer-focused that way, I would say. There is a slight difference: we don't believe in one vertical stack. We think this can be accomplished by breaking off the bottom half separately and then fitting in any query engine. Let me just give you some data. Spark, Presto, Trino, Flink, and the other upcoming query engines have, for what it's worth, some 50,000 to 60,000 GitHub stars between them. There is a lot of new query engine innovation that's going to happen. So I think decoupling the data layer from the compute layer, or from the vendor, even at that level, is a good thing overall, we feel.

Eric Dodds 16:39
Yeah, super interesting. It's almost like bring your own interface to the bottom layer, or multiple interfaces, which is super interesting. Okay, Kostas, I could keep going, but please, I'm naturally more interested in what you're going to ask than what I've already asked.

Kostas Pardalis 16:53
Yeah. Oh, well, come on. That's not true. I think you're asking all the interesting questions. I'm boring; I'm just asking for a little bit more technical stuff. That's all. But okay, I have something that I really want to ask you, because you mentioned something. You said that there is a number of services that a lakehouse needs to have in order to rival warehouses. And I really like the word rival, first of all. But can you tell us, you mentioned them already, but let's enumerate these services again so our audience has a much clearer idea of what we are talking about in terms of technical services there.

Vinoth Chandar 17:36
Got it. So let's start with the initial one: you need a service that can ingest data, first of all, right. We built an ingestion system in Hudi like three years ago. This is similar to Auto Loader in Databricks, or Snowpipe in Snowflake; I don't know exactly what each product calls it. So there's an ingestion system that can load the data that lands in cloud storage, or comes from different sources. One of the reasons for it to be aware of the sink is that you can do checkpoint management and many other things very, very easily if this system actually understands that it's Hudi it's writing to. Number two: when you update the table, what happens underneath is that we version files. You create garbage, right: you're writing new versions of files, and somebody needs to clean up the old versions. This is what we call cleaning in Hudi, and what is called vacuuming, I think, in Delta Lake. And you need a service where you can tell it, hey, I want to retain X versions or something, and it can automatically do this for you. That is one. The third thing: failures happen when you're writing to a table, and you have some leftover files, uncommitted data lying around. You need services that can clean that up so that these dead files don't clutter up your tables and things like that. Number four, and this is slightly specific to Hudi: Hudi supports a merge-on-read storage type where we can land data very quickly in a row-based format, or for flexibility in a column-based format, and then later it's compacted, right. And when we say compaction, what we mean is what compaction means in databases like Cassandra or HBase: compacting delta files into a baseline. So you need a service that can do that.
And Hudi's compaction service can, for example, keep compacting even while writes are going on, right. As you can imagine, at Uber, or at TikTok, there is this stream of high-volume data coming in; it's impossible to stop and do OCC, optimistic concurrency control, for big things like that. So you need a service like this. Again, I'm making the case that these services need to be deeply aware of each other, and that is how databases work, right? The other one is the clustering service. In Hudi we implemented record reordering, with space-filling curves like Z-order, not just a linear sort, for clustering. Fundamentally, what a table format metadata layer can do is remove bottlenecks in planning, right. It can store files and their statistics, which are used to plan queries. But at the end of the day, if you look at most warehouses, for high-performance interactive reports and the like, people actually get performance by clustering and playing with the physical layout. In Vertica I think it's called projections; it has different names in different systems. But you shape the actual storage layout to squeeze out performance, right. And then you need some service which can understand the write patterns happening on the table, schedule these clustering operations, execute them, and, if they fail, retry them. So in Hudi, the bulk of the value that we add, we believe, is in this layer: you write to any table, and all of these services will be scheduled and executed automatically, and if they fail, they will be retried. Whereas if you take a very thin table format as the alternative, you need to write all these jobs yourself. And what I've seen from my LinkedIn days, and in the last 10 years living through the Hadoop-versus-cloud era, Hortonworks and all of that, is that everybody focused on the format. But an open format alone doesn't cut it.
Right, that is the painful lesson that we should learn from the rise of the cloud warehouses. We should focus on the standardized services, and they take years to get standardized and hardened at production scale like this. I think this is the main thing that's missing right now, even in the lakehouse marketing from the vendors; I don't see enough emphasis on some of this. I've recently started noticing some content on this; I think Starburst had some recently, but it's a very recent thing that's happened in the last few months. And this is what we've been doing for the last three years.
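The cleaning service Vinoth describes boils down to a retention policy over file versions. Here is a toy Python sketch of that idea; real cleaners like Hudi's work off commit metadata and the table timeline, which this deliberately ignores.

```python
def clean(file_versions: dict, retain: int = 2) -> dict:
    """Keep only the newest `retain` versions of each file group.

    `file_versions` maps a file group id to a list of version numbers;
    everything older than the newest `retain` versions is reclaimable garbage.
    """
    return {fg: sorted(versions)[-retain:] for fg, versions in file_versions.items()}

# Four writes to fg-1 left four versions on disk; retain the last two.
versions = {"fg-1": [1, 2, 3, 4], "fg-2": [7]}
after = clean(versions, retain=2)
```

Keeping a couple of versions (rather than only the latest) is what lets in-flight readers finish against the snapshot they started on before its files disappear.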

Kostas Pardalis 22:22
Okay, so just to make sure that I also understood correctly, right: our foundation is a data lake where we store Parquet or ORC files, let's say Parquet, as that's like the standard. And on top of that, we need a number of services, of which I counted five. I hope I didn't miss anything, but these are at least the most fundamental ones. So we need an ingestion process there. We need some service that prepares the data and makes it available. We have vacuuming, or literally cleaning, taking care of all the versioned files and all the stuff that happens at a low level to make sure we implement concurrency; some kind of garbage collection, let's say, and I'm using garbage collection in a broader sense. Then compaction, which from what I understood is more of a specific feature of Hudi, because you have the columnar and the row-based representations, so at some point you take these two and merge them into one, or something. Is this correct?

Vinoth Chandar 23:39
Yeah, it's correct, though I think it's slightly different. Most of the other projects were written more as file and statistics tracking systems, right? Compaction is not new at all; look at RocksDB, or LSM stores, or anything in the database world. I come from that background. Compaction is more about controlling writes: I want to write a smaller amount now and queue up a lot of these updates to merge later, instead of merging them right away. I think that is the key technical rationale for compaction.
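That rationale can be sketched in a few lines: writes land cheaply in a row-based delta log, and a background pass later folds them into the columnar base file. This is a toy model of merge-on-read, not Hudi's actual file layout.

```python
def compact(base: dict, delta_log: list) -> dict:
    """Fold queued updates into the base file; last write wins per key."""
    merged = dict(base)
    for rec in delta_log:        # delta log is in arrival order
        merged[rec["id"]] = rec  # cheap append earlier, one merge now
    return merged

# Two cheap writes were queued instead of rewriting the base each time.
base = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
delta = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]
base = compact(base, delta)
delta = []  # after compaction, the delta log can be truncated
```

The trade-off he alludes to is exactly the LSM one: each queued update makes writes cheaper but reads slightly more expensive, until compaction restores read performance.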

Kostas Pardalis 24:19
Okay, that makes sense. Is this something similar to what happens when tombstones are used, for example, and then you go and remove the tombstones later so you can actually perform deletions rapidly? Or not?

Vinoth Chandar 24:35
If you read about log-structured merge trees, LSM trees, for example, there's a whole bunch of science around how to balance write cost, read cost, and merge cost. It's a very, very widely adopted database technique, from Google's Bigtable to Cassandra to HBase to RocksDB to LevelDB; they all flex these knobs.

Kostas Pardalis 25:03
Awesome. And then the fifth one has to do with what you called clustering, which is more about how you can optimize, at a lower level, how the data is stored, so you can actually improve performance. Is this correct? Does this have to do with encoding? Give us a little bit more information.

Vinoth Chandar 25:26
Yeah. So clustering changes how you sort, how you actually pack records into files. If you know something about the queries, let's say, for example, you are a SaaS organization, you have thousands of customers, and you're collecting logs from them, and you know that your query patterns are mostly that you query for one customer at a time, then, instead of spreading each customer's data across all the Parquet files in a table or a partition, you can cluster them so that the records belonging to one customer sit in the fewest number of files. Which means that when you query them, you will read the smallest amount of data, right? This can give you 10x, even orders of magnitude, better query performance. And compared to, let's say, file listings, which are a real problem only for very large tables, this fundamentally affects your compute dollars, and it can dramatically reduce cost for your lake.
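A toy sketch of the effect he describes: the same eight log records laid out in arrival order versus clustered by customer, and the number of "files" a single-customer query then has to touch in each case. The two-records-per-file size is an arbitrary illustration.

```python
def cluster_by_customer(records: list, per_file: int = 2) -> list:
    """Sort by customer, then chunk into fixed-size 'files'."""
    ordered = sorted(records, key=lambda r: r["customer"])
    return [ordered[i:i + per_file] for i in range(0, len(ordered), per_file)]

def files_scanned(files: list, customer: str) -> int:
    """How many files a single-customer query must read."""
    return sum(1 for f in files if any(r["customer"] == customer for r in f))

# Eight log records arriving interleaved from customers A and B.
logs = [{"customer": c, "line": i} for i, c in enumerate("ABAB" * 2)]
scattered = [logs[i:i + 2] for i in range(0, len(logs), 2)]  # arrival order
clustered = cluster_by_customer(logs)
```

With the arrival-order layout, a query for customer A touches every file; after clustering, it touches only the files that actually hold A's records, which is where the compute savings come from.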

Kostas Pardalis 26:36
All right, that's amazing. My question, going back to the initial question: these are, let's say, the minimum set of additional services that the data lake needs in order to rival a data warehouse. But there's a big difference that I see here. The difference is that with a data warehouse, I don't really care about all that stuff, right? I don't have to know about all these very technical and interesting details. While in the lakehouse, we have to talk about that stuff. So how do we change that? Because not everyone wants to become a database engineer just to query and store their data.

Vinoth Chandar 27:19
Yeah. Unfortunately, we opened that door when we wanted updates on the data lake, right? Because before that, you were just appending some files to a folder and then collecting statistics on them; conceptually, that's very easy for people to understand. And people in the data lake world have grown up thinking about everything as formats. The moment you introduce updates, you've turned it into a database problem. And if you look at the database world, you don't actually see, and I think I made this statement last time, you don't see CockroachDB, MySQL, everybody saying, let's standardize on one format and then build something on top. It's not a thing, right? Once you turn it into a database problem, the stuff that we talked about, those are the higher-order problems. So, to answer your question of what you do to change this: that, honestly, is the core of why we even started Onehouse to begin with. And this is what I say in lots of places. Lots of people have asked me, and they come to us for enterprise Hudi support or something; that is not what we're trying to build here at all. We're not trying to build an enterprise Hudi company. What we've seen, and you've spoken to Kyle, our head of product, about this in a different episode, is that, technology-wise, the common thing is that it takes six to nine months for data engineers to become database engineers and platform engineers, understand all these concepts, and actually implement them. So what if there existed a fully managed service where you can click four buttons and then you have your whole data lake up and running? And it's open. I think "open" is super overloaded with marketing these days; what we truly care about is interoperable and extensible, right.
So if you have an engineering team, you can go to the project, you can contribute to the project, get a seat at the table on the PMC. Yeah, that exists. And it's interoperable: it works with every open standard, and there is no vendor bias or anything in the project, right. So we need a very foundational technology like that, on top of which we build this managed service. That's how we are thinking about it. I speak to a lot of cloud warehouse users; that's like my day job right now, right? And what I see is that ultimately they realize this. They start with a fully vertical stack because it's fully managed and, like you say, people don't even have to care about it. But then you're signing up for a two-year migration project the day you adopt it. So when you're making the choice, I think fundamentally we need to bring some manageability, because open alone won't cut it. That is what I'm trying to say: open alone is not a key business thing. People, our customers, are looking at how soon they can get their lakehouse up and running, technology aside, and we are focused on that. And I feel that open as the only USP against a closed stack, to take on the warehouses, is not good enough in my mind. Cloudera and Hortonworks tried that and failed.

Kostas Pardalis 30:49
Yeah, it makes sense. Alright, so the experience that someone has with a cloud data warehouse, that's what we are after, right? We want to offer that over a data lake, and that's what Onehouse's mission is, from what I understand. So do you want to spend a little bit more time explaining to us how we can go from these at least five pretty complicated technical concepts and services to an experience where, with a couple of clicks on a cloud dashboard, we can have a lakehouse up and running and get started interacting with it? How does it work? What's your vision for Onehouse from a product perspective?

Vinoth Chandar 31:39
Yeah, so honestly, even detaching myself from it, right, if I had to look around now and pick something today to build a product experience around, I'd still go and pick Hudi, because Hudi already has most of these services. But it's a library. Hudi is a library; you need to adopt it, tweak it. What we learned from some of the initial users we're working with is that a lot of the value starts just by hiding configuration. Comparatively speaking, Hudi exposes a lot of configuration, just like any database: you go to Oracle or MySQL, and the approach is to expose all the configurations, and administrators will pick it up over time and know what to do, right? I think we have to simplify that. For example, don't even show file sizes. Why should you care about what the file size should be? Right now we ask people to go tune that, and Hudi gives them the knobs to do it. So in our experience, a whole bunch of auto-tuning, or intelligent configuration management, is the first ingredient to get there. The second thing, and here I'm specifically talking about Onehouse, is that our team actually has operational experience; we didn't just build it. I've been on call for a roughly 250-petabyte data lake; I've had to wake up in the middle of the night and recover a table, that kind of thing. So far in the data lake world, we've been telling the user to manage the tables themselves, right? Whereas if you look at Snowflake or BigQuery, if a table is corrupted, the user has no control whatsoever; some Snowflake engineer knows what a valid state for the table should be, and you just let them figure out what's going on. So that's the second part: building enough manageability and operability into this product.
But you're taking control away from the user in the name of simplicity and getting started quickly, so we need to build all the operational chops to be able to pull this off. I think this is the hardest part. As Jay Kreps of Confluent says, writing a piece of code is one thing; what's much harder is debugging that thing, and what's harder still is operating that piece of code. And I think this is where my disappointment with all of the marketing that happens in data lake land lies: we focus very little on these operational aspects. It's all super DIY. And then later we also complain that, oh, it's not standardized, blah, blah, blah, and we have to build these stacks that nobody likes to build. That's what Databricks did really well. I actually admire them for what they've done in the last 10 years; they've accomplished a lot.
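One concrete flavor of the auto-tuning Vinoth mentions is small-file handling: rather than asking users to pick file sizes, the writer tops up under-sized files toward a target before creating new ones. Hudi does a form of this during inserts; the target size and bin-packing rule below are made-up illustrations of the idea, not Hudi's actual algorithm.

```python
MB = 1024 * 1024
TARGET_FILE_BYTES = 120 * MB  # hypothetical target; a real system derives this

def assign_inserts(files: list, incoming_bytes: int) -> list:
    """Top up small files toward the target before opening new ones."""
    for f in sorted(files, key=lambda f: f["bytes"]):  # smallest first
        if incoming_bytes <= 0:
            break
        room = TARGET_FILE_BYTES - f["bytes"]
        if room > 0:
            take = min(room, incoming_bytes)
            f["bytes"] += take
            incoming_bytes -= take
    while incoming_bytes > 0:  # spill the remainder into new files
        take = min(TARGET_FILE_BYTES, incoming_bytes)
        files.append({"bytes": take})
        incoming_bytes -= take
    return files

# One 40 MB small file exists; 200 MB of new data arrives.
existing = [{"bytes": 40 * MB}]
result = assign_inserts(existing, 200 * MB)
```

The user never sets a file size; the writer heals small files as a side effect of normal ingestion, which is the kind of knob a managed service can hide entirely.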

Kostas Pardalis 34:42
Absolutely, absolutely. Cool. So we started with auto-tuning and management of configuration in general, simplifying, let's say, the whole setup process for the user, and also abstracting the operations, giving, let's say, a cloud experience, right? Like, there is a team that will stay awake to take care of things when they go wrong, instead of you having to build your own team to do that. Especially for technologies as complicated as these, where it's not easy to know exactly what might go wrong. So I think it makes total sense. And my next question: I think one of the benefits that the cloud warehouses, and all the vertical solutions in general, have is that when something is vertical and you have complete knowledge and control over all the components, you can control the experience exactly as you want, exactly how it's going to be experienced by the user. At the same time, you have much more control over what kind of optimizations to do, right? And we see that with things like BigQuery and Snowflake. So I actually have two questions. One has to do with the experience, but let's keep that for next and start with performance. When you, like Vertica, really control all the components, you can go and say, okay, I'm going to build something like Photon, make all the changes that need to happen across the different components, and make sure I squeeze out every last bit of performance. Where do we stand with the lakehouse architecture when it comes to performance, compared to solutions like Snowflake or even BigQuery?

Vinoth Chandar 36:41
Yeah, it's a great question. So first, for once, I feel like things like Photon could be built on top of the lake. At the end of the day, going back to my previous statement, on the read side, even with the lakehouse and these transactional table formats, all that happens is you're getting some statistics and planning the query; from there, your query performance is dictated by the engine. I think we've already proven that this can be built independent of the lake, in a very decoupled way. And the table services and all these things that we talked about are pretty decoupled from how the query is processed: the engine spins up a cluster and then reads the data, that's it. So in that sense, I don't see a technical limitation to optimizing the stack vertically, the way the warehouses do it, right? But I do see that there are different companies here; there is no single company. Even for us, we routinely work with different query engines and different projects, and each year we take months to learn certain things, and there can be lots of different friction points in terms of how quickly we can move forward. But the performance itself comes from the engine; at least for interactive query performance, a lot of it comes from the engine. With better integration with things like Hudi, or even Onehouse services, we can probably match that experience: you configure clustering in Onehouse while you query on Presto or Trino or something, right? That kind of product experience you can build. But I think there is significant friction across organizational boundaries and across companies, and that's going to slow us down there.

Kostas Pardalis 38:44
Yeah, absolutely. Just to reiterate what you said: there's no intrinsic technical reason for a data lake to be slower than a data warehouse, but when you build the product, and look at how the user experiences the product, things get a little bit more complicated. Just to give an example: let's say I have a setup with Hudi and Trino or Presto, I'm running my queries, and at some point I see a performance regression happening somewhere, right? What do I do? Who do I reach out to, to debug this thing and figure it out? Should I come to you, to Onehouse? Should I go to the Trino community and ask there? Or is it my data engineer doing something stupid out there? Whereas when I do that with Snowflake, or, okay, Google is notoriously known for its support, so forget Google, but at least with Snowflake I'll open a ticket and be like, guys, something's going wrong, go figure it out, right? And that's the other part of the equation, which comes down to the user experience: how can we, as vendors who believe in this unbundled, let's say, database system, or the lakehouse, deliver the same experience to the user, or at least a similar experience?

Vinoth Chandar 40:15
Yeah, I think right now there's a lot of friction. First of all, there is no standard API. Even when we attempted this with the Hive connector, we tried to introduce an abstraction so that you just change the way you're getting the file listing. There aren't even good abstraction points right now across these different engines for us to test and guard against. I think as these get more standardized, though; all three transactional formats have their own connectors now, at least PRs are out or things have landed. Starting with even basic stuff, investing in basic things like testing between these companies, we have some very basic gaps to fill, I would say. Longer term, it's a pretty interesting point that you bring up. I think, at the end of the day, there will be some level of trade-off for the user, where they are consciously choosing: I want the freedom and the flexibility, so I'm willing to go for that, and you have to pick and choose. It's like buying Android versus iPhone: the experience you're getting is going to vary based on the underlying hardware and the manufacturer, and you kind of have to go through that. I feel like even with that, once we nail the basics, it'll get to a manageable level. I don't think it will always be a big problem; it won't be completely eliminated, but I think that's where the lake storage players and the query engines have to work much more closely together than they do today.

Kostas Pardalis 42:08
Yeah, yeah, no, 100%, I agree with that. Obviously, there's a lot of space for improvement out there for all the vendors right now, especially vendors like Onehouse, because you just started the business, right? It's one thing to have an open source project, and a completely different thing to build a cloud product on top of that; there's a lot to be discovered there. And that's something I really admire about people like you: you're also starting something that's completely new as a product category, right? So there's a lot of learning from both sides, both from the customer side and from the vendor side. That takes time, it's very risky, but potentially also super rewarding. But there's always going to be, I think, a trade-off at the end. It's not like, okay, we're going to have, let's say, the Microsoft Access experience with a lakehouse architecture, right? There's going to be some kind of trade-off there. Okay, so let me ask you a question, though, a bit of a personal question that I have. Let's say right now I want to start building a lakehouse, starting with the first service you mentioned: ingestion. Somehow I have to push data in there. How do I do it today with Hudi? Is the only way through the ingestion tool that you have built, or are there other ways? How does this work?

Vinoth Chandar 43:59
Yeah, it's pretty simple, actually. You go to the docs, and if you look at the streaming ingestion section, it's a single command. It has a number of parameters: you say what your source is, what your target is, configure a whole bunch of things, and that single spark-submit command can ingest from Kafka, it can ingest from JDBC sources, it can ingest from S3, from event streams. And then it can also do things like clustering, cleaning, compaction, all of the stuff that I talked about, right? It's almost like running a database in itself. If you just run that one command, it spins up a Spark job, and within that it will self-manage all the resources it needs. If it's not ingesting, it's going to do clustering; if it's not clustering, it's going to do compaction. It even handles resource management. So we made it super, super easy. We had actually built a very similar thing at Uber, and I started writing this tool as a replacement for it in open source. It's gotten so popular that it's used in many, many companies in production, right? And for a lot of those companies, this is their main ingestion service. So as a project, we've tried to make it very easy, because we suffered through all these integration pains when we were at Uber. But in spite of that, I feel the operational overhead is still too high; that's what Onehouse is trying to solve. But yeah, Hudi already makes all of this very easy.
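For readers who want to picture the single-command ingestion Vinoth describes, here is a rough sketch of a Hudi DeltaStreamer invocation. The bucket path, topic, table name, and properties file are hypothetical placeholders, and exact bundle names and flags vary across Hudi releases, so treat this as an illustration rather than a copy-paste recipe:

```shell
# Sketch of a Hudi DeltaStreamer job: one spark-submit command that ingests
# from Kafka and self-manages table services (paths and names are placeholders).
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path s3://my-bucket/hudi/trips \
  --target-table trips \
  --props kafka-source.properties \
  --continuous   # keep running: ingest, then clean/cluster/compact when idle
```

In `--continuous` mode the same job alternates between ingestion and table services, which is the self-managing behavior described above.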

Kostas Pardalis 45:47
Okay, so how would this work with Onehouse? What's the difference there?

Vinoth Chandar 46:00
Yeah, so the thing is, if you look at how you'd build, let's say, S3 event ingestion with SQS, or CDC ingestion into the lake, we usually have a blog which describes an end-to-end architecture. We are platformizing that architecture; it's almost like we're automating the blog that we wrote. You can still run it yourself if you want, and that's actually something that people like. A lot of users are happy that they can start with something managed, so they don't have a long lead time for the data lake. But if for whatever reason they don't like us, they can just turn around, because all the services are in open source, they can even get support from AWS, and they can move off of Onehouse. There are open source go-to-market strategies where the project is just the base and everything else is sold on top of it; we're trying something different with this project. We're trying to hide Hudi as much as possible within our product, because we want to uplevel the experience, and then if for whatever reason we're not adding enough value, you should be able to walk away, because the data is yours, right? Contrast that with the warehouses, where this is a fundamental problem: once you're stuck in the warehouse and you're unhappy with it, you want to migrate the data out, and there's not much you can do about it. So that is actually what we want to change. And like you're saying, as a product, and also as an architectural category, it's something pretty new and experimental. Technology-wise, though, it's pretty proven, right? And to your question around this unbundled stack: whether we like it or not, whether Onehouse exists or not, people are doing this.

People were using the lake even before, right? You were using Parquet, and using Presto or Spark or Hive. That's literally how we started at Uber as well. So this multiple-engines-on-an-open-format kind of thing already existed before. All we're trying to do is build a path for users to get started sooner, and hopefully, as a company, as a product, we add enough value that we can retain them.

Kostas Pardalis 48:29
Yeah. Okay, I'm going to make it a little bit harder for you. Okay? I actually like a challenge. So let's say I'm a data engineer coming from the modern data stack environment, where I'm used to using, let's say, Snowflake, a BI tool, and Fivetran, right? I know that I'm going to have a source, the data is going to be loaded on S3, then a COPY command is going to be executed on Snowflake, the data will get imported into the Snowflake table format, and then I'm able to read it. And all of this happens inside a transaction, so nothing is going to get corrupted, right? Cool. And now my boss says: go build a data lake, because we need to expose the data to the rest of the organization, and it should feel, let's say, the same. So I come to Onehouse. I have these experiences in my mind, right? That's the journey I think of when it comes to loading data and this whole ELT thing. Is this something I can do in the lakehouse in general, first of all? And second, are there things that I can't do today but will be able to do in the future with Onehouse? How do you think of these things and how they should be?

Vinoth Chandar 49:58
Yeah. So, first of all, the experience should be similar to what you're used to in an existing managed service, right? But how we accomplish that in Onehouse can be through us having more upstream partnerships, like, for example, with my previous employer, Confluent. In a lot of scenarios, when people are at the point of thinking about data lakes and everything, they are also thinking, okay, I want to open up event streams to my company, I want to open it up for stream processing. So naturally they'll do something to extract all this data into a big event bus like Kafka or Pulsar, one of these things, right? And the minute you get into that, it's pretty simple; ideally you get probably the same experience whether we run it or whether we partner. But I'm saying that right now we would recommend people rethink how they're doing data streams. Take the CDC you're capturing from Oracle: can you route that into Elasticsearch? No, you can't. You can only send it to one point, which is Snowflake, right? And, forget Hudi, Databricks, everything, that's not what people ultimately want to build as a data architecture. I'm sure you're familiar with data meshes, and we live in a world where there's enough data that there are so many specialty stores. So I think the movement towards streaming data will make this move much easier for something like Onehouse. And as a technology, Hudi is very well positioned to absorb all the streaming data and integrate it very well; Onehouse just has to focus on that problem.

Kostas Pardalis 51:54
Yeah. What I keep from what you said is that, when it comes to the lakehouse story, things will get, let's say, closer to what people are used to from cloud warehouses. But there's also education that needs to happen: people need to understand that there are different ways to do things, and there is value in that, right? It's not just about how easy it is to do something. You also gain, let's say, flexibility and opportunities to optimize your infrastructure, to do more things with your data at the end.

Vinoth Chandar 52:30
I also feel that when users, like the data engineer you talked about, are at the point where they're building a data lake for the company, they actually have a business problem to solve already, and they'll mostly look at it from that lens. For example, it can be stream processing, or data democratization, which is kind of what I just talked about. It could be just that I'm building a new data engineering team, or a data science team. And there are always event logs and data that I can't even ingest into the warehouse anymore; it's not just replicating the same data that already exists in the warehouse, right? I believe a lot more data sits out there on S3 buckets or cloud storage buckets, completely unmanaged. So there's a vast amount of data that is not even getting into warehouses. Now, if you think about it from this lens, I don't think the existing managed solutions are operating at that scale, right? They're not operating at Uber scale: we did tens of millions of trips a day, and for that we were ingesting hundreds of billions of events per day. The difference in the number of events and the data volume is huge. These are things we've done routinely in open source, and we ourselves have actual hands-on experience building for. So scale-wise, it's a very different problem. And when people come to the lake, they usually have one of those cost or scale problems already, and that will shape the experience they expect. But by and large, I think it'll be fine.

Kostas Pardalis 54:10
Yeah, that's an excellent point, and a very fair point too, because I was giving an example, but the behavior someone has with a product can't be taken out of context. There are also the problems someone is trying to solve. And you're absolutely right: when you reach the point where you need a data lake, there are reasons for that. It's not just that you don't like Snowflake.

Last question, and then I'll hand it over to Eric, because I've completely monopolized the conversation, although he's going to be very kind and say, “It was so enjoyable” and blah, blah, blah, all that stuff.

So we have seen lately, both from Google, with the BigLake initiative they announced at some point, and also from Snowflake, with support for Iceberg both as external tables and as a native format, that the data warehouses are also making, let's say, a move towards more openness, embracing the lakehouse, or data lake, paradigm. How do you think this is going to affect Onehouse as a vendor in the space? And how do you think this is going to evolve as part of the data warehouse experience that we've seen so far in the cloud?

Vinoth Chandar 55:32
Yeah. So let's take Snowflake's expansion. The key question I would ask is: how do external tables actually perform? It's one thing to have an integration; it's another thing for them to perform as well as native tables, right? Because internally you might have to read a lot of metadata, and the metadata optimization problems that the transactional formats solve have been solved in a very different way in warehouses. So my feeling is that this is a nice thing where you can actually access the data, but by and large, if people want to make something performance-critical, they're going to move that copy into a native table inside the warehouse. That's what I think already. And I think it's really early right now. It feels like everybody wants to do something against Databricks, everybody wants to say “I have a lakehouse,” or whatever. That's how it feels to me. So we'll see; of course it can also evolve over time. At the end of the day, warehouses are still used for traditional analytics use cases, right? There's much more beyond that that can be unlocked in the kind of model we've been discussing so far. So it'll be interesting to see how broad they want to be; maybe the warehouses just want to nail that core use case. It's not that it won't happen, but if you project it out historically, it may or may not happen, right? The second thing: let's look at this architecture now. Say we have a common format, and all the engines read and write from it, so the same table is written from, say, Snowflake and BigQuery. I haven't seen a use case like that. Why would you do external tables? You do external tables only because you want to do some Spark processing on the same data that you also want to query, right?

And is Spark performance on external tables good enough? What do you expect? I just don't see clarity in these individual use cases, to the level of “for BI, always use X.” I don't see that kind of thing, right? I see way more users caring about something else: I want to keep my data future-proof. Because three or four years ago, nobody talked about Snowflake as the de facto warehouse that you dump everything into; then came the breakthrough. So maybe in the next three years it's something else. I just want to keep my data future-proof, because this data will outlive the vendors and query engines. I see far more companies worried about, and thinking from, that perspective than from “I want this thin layer that I can read from any engine.”

Kostas Pardalis 58:32
That makes sense. Yeah, it's still early, and I think there are at least a couple of interesting years ahead of us. And all this innovation and product development will hopefully be beneficial for the customer. From my point of view, putting on, let's say, my entrepreneurial hat: being a new vendor in this space and seeing these much bigger, well-established vendors investing in something I'm also doing is a good thing. It means there is a market, there is appetite in the market for this kind of stuff. Now, who's going to win? I usually say that it's the smaller vendors that win in this kind of innovation. But we'll see. It's going to be interesting.

Vinoth Chandar 59:31
Yeah. To that point, actually, quickly: think about who actually writes the code in these systems, who's pushing the transactional formats forward. I think that matters more, right? Because those people are the ones closest to the problems and closest to the technology, and that's kind of why I think it reflects in the smaller vendors winning: they're much, much closer, and that's the only thing they focus on. Overall, it's great, by the way, don't get me wrong. It's fantastic that warehouses are now taking external tables super seriously. And I think Redshift deserves a lot of credit for this. They stayed on it, and I didn't see anybody give them credit for Spectrum, or for supporting Hudi two years ago. They deserve a lot of credit for that.

Kostas Pardalis 1:00:27
Yeah, I agree with you. It's a little bit of a shame, because there is some kind of perception that Redshift is, let's say, dead in a way, although Redshift was the first cloud data warehouse out there, and the team there keeps building amazing technology. So people should keep paying attention to them. They're doing a great job.

Vinoth Chandar 1:00:51
Yeah, yeah. And I think this is the marketing-versus-reality thing. As a founder now, I also have the job of helping my team tell apart what works as marketing and what's real, and that's a pretty blurry line. But Redshift, I think, is in the same ballpark as some of the more successful names we talk about, right? And they have tens of thousands of customers. This is where I think we're a bit of an underdog, or disadvantaged: a lot of AWS services, like EMR, are deeply integrated with Hudi, but we hadn't started Onehouse back then, and because the current batch of companies started with marketing, we're not under the marketing spotlight now. But I've seen enough systems come and go to know that, at the end of the day, the technology has to work, and somebody has to operate the system and solve customer problems. So we're pretty hopeful for both the open data lakehouse and Onehouse.

Kostas Pardalis 1:02:15
Awesome, awesome. Eric, it's all yours; we want to hear from you.

Eric Dodds 1:02:32
No, I was laughing about you saying the line between marketing and product reality can be blurry. That is certainly true. One last question for you, and I'm thinking of someone who's maybe thinking through the lakehouse on a practical level. You talked about the genesis of Hudi: you had real-time needs at immense scale. And you mentioned that you have the bottom half of the warehouse and you can run Spark on it, or Presto, Trino, etc. A lot of that tooling, to a lot of our listeners, probably at least hints at scale problems, right? A lot of those technologies developed because of scale problems. One interesting thing, when we talked with Kyle from your team, was that he said his opinion has been changing on whether the lakehouse is practical for companies that aren't at Uber-vast scale. I'd just love your thoughts on that, and maybe I can frame it in the form of a slightly unfair question. Do you think the lakehouse is at a point where a very forward-thinking data team could say, we're just going to skip the data warehouse and go straight for the lakehouse, and it will be able to scale up with us? Or do you still see a lot of companies hitting some of the limitations of a traditional data lake on object storage, and needing a data warehouse for all the transactional, day-to-day practical stuff?

Vinoth Chandar 1:04:35
Yeah, that's a good question, actually, and I think it's a totally fair question. I think we are probably a year or so from that, and I'd cite mostly all the DIY stuff that you still need to do. For example, somebody has to understand Debezium, Postgres, and Kafka just to build a simple Postgres-to-lake ingestion pipeline, right? So there's significant friction. And I know you've spoken to smaller companies who basically know that the warehouse is going to get expensive at scale over time, but today it costs way less to get going. So that's the problem most people start with, and that's where we are starting with the product. From that lens, if you look at the technology, cost-performance-wise, in the grand scheme of things, the lake is much cheaper for running any large data processing. The way I look at the world today: warehouses are, in my opinion, still best in class when it comes to interactive query performance, though the work going into things like Presto and Trino is changing all that. But when you look at data processing, ETL, that is where it gets really expensive. The flip side of scale is cost, right? Large scale also means large cost. So even moderate-scale stuff, and that's probably what Kyle hinted at, even simple stuff: instead of spending 100,000 bucks, you can probably spend 30K on a lake and get a similar kind of experience. I think that is opening up, and it's not possible without the cloud. The cloud is what drove the proliferation of all these different awesome engines, and that's acting as a catalyst for driving this. So I don't see it as just a scale problem anymore. And the onboarding story is interesting too.

When we started Hudi at Uber, for a year and a half or so it was just an interesting nerdy project that engineers at Uber built; that's why you don't see a launch or anything. Not a lot of people had that kind of scale with updates and that kind of thing back then. But right now, with time and data volumes exploding, what we see routinely surprises me: much smaller companies, at the scale they have. Like, oh wow, okay, you have a two-terabyte partition? I did not expect that. So my own view has also been evolving. For a while, to be honest, I myself thought like that, that it was a high-scale problem. But then, in the community, when I saw the scale at which people were doing things, that changed. I literally met an airline data tracking company: they bring in some tens of terabytes every day. You wouldn't have even heard of them; they track all the flight data across all the airlines in the US. And they were able to get something up and running, because they can't send this data into a warehouse, so they have a lake-based solution. So there is also organic data volume growth that is pushing people.
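As one illustration of the DIY glue Vinoth mentions, a "simple" Postgres-to-lake pipeline typically starts with registering a Debezium connector on Kafka Connect, something like the sketch below. The hostname, credentials, and Connect endpoint are hypothetical placeholders, and property names differ slightly across Debezium versions:

```shell
# Register a Debezium Postgres CDC connector with Kafka Connect
# (all values are placeholders for illustration).
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-cdc",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "db.internal",
      "database.port": "5432",
      "database.user": "cdc_user",
      "database.password": "********",
      "database.dbname": "shop",
      "database.server.name": "shop"
    }
  }'
# And that is only step one: you still need Kafka itself, plus a lake
# ingestion job reading these change topics into tables.
```

Each of these moving parts is something the team has to understand and operate, which is the friction being described.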

Eric Dodds 1:07:55
Yeah, super interesting. I can absolutely see that, let's say in a year's time: you have people who've been working at a larger-scale company, maybe they adopt some sort of lakehouse-flavor technology, then they go to work for a smaller company, and they're like, hey, we can actually do this now, instead of waiting until the bill gets to 100 grand and then having to do a complete re-platforming. Super interesting.

Vinoth Chandar 1:08:25
Yeah, it's going to change a lot in the next three, four years. And I think we've got to get to a point where it's feasible, right? Where, with no cost and no trade-offs to your existing pipelines, you can get started with this thing.

Eric Dodds 1:08:40
Yep. Yeah, it makes total sense. Awesome. Well, I think we went long, but that's because Brooks let me record this time, so we got to break the rules. It's always great; this has been such a great conversation. We learned a ton, as always, and we'll need to have you back for a third-time's-a-charm round on The Data Stack Show.

Vinoth Chandar 1:09:02
Yeah, I'd love to come back. Thanks for all the awesome questions. That's one of the things I really enjoy here: the quality of the questions, and that you push me on the hard stuff. So yeah, this was fun.

Eric Dodds 1:09:19
Well, that’s a very high compliment. So thank you so much, and we’ll talk soon.

I love talking with that guy, Kostas. He just has this really incredible ability to answer questions with a high level of detail but keep the explanation really concise, which is a really challenging skill, and one I have a lot to learn from. I think probably one of the takeaways for me was the conversation right at the end, where he talked about how the market is changing, and when he thinks data lakehouse technology will come down market, and potentially even be adopted instead of a warehouse as the first major operational data store in a company, which is really interesting to think about. At the same time, his point was: well, four years ago, no one thought about Snowflake as, okay, you need a warehouse, you just stand up Snowflake, right? And so he said, in another three years, who knows what could happen? That was just really interesting, and I know I'll be thinking about it a lot this week. How about you?

Kostas Pardalis 1:10:31
Yeah, I agree with you. It was a very interesting point, and also something that remains to be seen: how exactly it's going to happen, and what will happen. There are many different points I'll keep, but one of the things I really enjoyed was the conversation we had about how the lakehouse as an experience, and its performance, and a couple of other parameters we put out there, compares to the experience we have with data warehouses. I liked how pragmatic he was about that, saying that, obviously, things can improve a lot, but there are always trade-offs, right? You're not going to have, let's say, the exact same experience. But at the same time, you're going to have, let's say, more flexibility, or more scalability, or different capabilities that you cannot have right now, and will probably never have, with a vertically integrated solution like a cloud data warehouse. We'll see. It's still early with all these products. But it's always great to talk with people like him, because he gives a very accurate, let's say, prediction of the future.

Eric Dodds 1:11:50
I agree. All right. Well, many, many more great guests coming up. Subscribe if you haven’t, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.