Episode 70:

The Difference Between Data Lakes and Data Warehouses with Vinoth Chandar of Apache Hudi

January 12, 2022

This week on The Data Stack Show, Eric and Kostas chat with Vinoth Chandar, creator of the Hudi project at the Apache Software Foundation. During the episode, Vinoth discusses his experiences building data lakes at companies like LinkedIn, Uber, and Confluent. He also gets into the differences between data lakes and warehouses, and when going open source makes sense.

Notes:


Highlights from this week’s conversation include:

 

  • Vinoth’s career background (3:19)
  • Building a data lake at Uber (6:52)
  • Defining what a data lake is (14:01)
  • How data warehouses differ from data lakes (22:46)
  • When you should utilize an open source solution in your data stack (37:36)
  • Evolving from a data warehouse to a data lake (45:09)
  • Early wins Hudi earned inside of Uber (52:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Automated Transcription – May Contain Errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. Today we’re going to talk with Vinoth, who works on Hudi at the Apache Software Foundation, and we’ll talk all about what Hudi is, if you don’t know what that is. He has an unbelievable resume. I mean, just so impressive. And I’ll give a little bit away here and probably steal the show with my question, Kostas, but we’ve talked with some people who have worked on technology inside of large companies like Netflix that has later been open sourced and made generally available. But we haven’t, at least to my knowledge, talked with someone who was there at the beginning and really started it from the very beginning. And I just want to get that story. What were the challenges they were facing at Uber? Where did the idea come from? And then how did it actually come to life inside the company? So I think that origin story is going to be really cool to hear about Hudi.

Kostas Pardalis 01:23
Yeah, absolutely. And I think it’s probably the first open source project that’s actually an Apache project that we have here. At least, I think so. So that’s going to be interesting, because this is also important: it’s one thing to open source a project on GitHub, and it’s another thing to have something that’s governed by the Apache Software Foundation. Especially from the governance side, it’s a very different situation. So I think it’s going to be very interesting to chat with him. And Vinoth is a person who has been in the right place at the right time on many occasions when something interesting around data was created. He was at LinkedIn, Uber, Confluent. I think he’s one of the best people out there to talk about what a data lake is, because that’s what Hudi is about. And it’s going to be very interesting to see how he started playing with the idea of building something like a data lake inside Uber, why it was released as open source, and why data lakes are now so important and so hyped. So I’m very excited. I think we’re going to have a very interesting conversation with him.

Eric Dodds 02:43
All right, well, let’s dig in. Vinoth, welcome to the show. We’re so excited to chat with you. Great to be here.

Vinoth Chandar 02:50
Thanks for having me.

Eric Dodds 02:52
Well, in the world of data, I think it’s safe to say that your resume is probably one of the most impressive that I’ve ever seen. So do you want to just give us a quick background on your career path, and what led you to what you’re doing today with Hudi?

Vinoth Chandar 03:10
And first of all, I don’t know if I am deserving of all those kind words. I tend to think of myself more as a one-trick pony who has kept doing databases for over a decade, because that’s the only thing I know. My first job out of college, out of UT Austin, was at Oracle, working on Oracle server data replication and what passed for stream processing back in the day: CQL, Streams, Oracle GoldenGate, data integration, CDC. That’s where I started. Then I moved on to LinkedIn, where I led the Voldemort key-value store. I think most people have forgotten that project by now, but it was like LinkedIn’s Cassandra. It was actually a pretty popular project, and I led that, and we scaled it through all the hyper-growth stages for LinkedIn, from tens of millions of users to hundreds of millions of users. That’s what lasted us through that. Then I moved on to Uber, where I was the third engineer on the data team. We had a Vertica cluster and some Python scripts, and this was Uber at 200 engineers, 100-plus engineers, back in 2014. So I spent almost five years there working on data, and that’s kind of what I was looking for when I left LinkedIn: a blank sheet of paper on which I could actually work hard, try to build something new, make mistakes, and learn. That was the journey I wanted for myself. So Uber gave me that, and we ended up creating Hudi there, and it has been great to actually see how the space has evolved over time, over the last four years. I also did a lot of other infrastructure things at Uber. Uber was one of the first companies to adopt HTTP/3 as it was getting standardized. I still don’t know whether it’s fully standardized. So we ran QUIC, and replaced TCP with a UDP-based transport. So I like to dabble with a lot of infrastructure stuff. I used to work with the database teams there.
So then I left Uber and went to Confluent, where I met some of my old colleagues from LinkedIn and worked on yet another database, ksqlDB, and parts of Kafka storage and Connect. So I’ve generally been around this stream processing, database, data pipeline kind of space for a while. And yeah, I have some time now to actually dedicate full time to Hudi. Hudi was something where I kept growing the community in open source, in the ASF, for almost four years now, and I finally have some time to dedicate to it. I’m enjoying that this year.

Eric Dodds 06:06
Very cool. First of all, I don’t know if, after hearing that, my conclusion would be one-trick pony. But again, I have so many questions. One thing I’m really excited about is that we’ve talked a lot on the show about the trickle-down of technology from organizations like Uber, Netflix, etc., that are solving problems at a scale that just hasn’t been solved before. Some really cool technology emerges from that. But we haven’t been able to talk with someone who was part of the development of that from the beginning. So we’d just love the story of Hudi. What problem were you trying to solve? I think in the age that we live in, it’s sometimes hard to think back to, you know, 2014, and what the infrastructure was like back then. But we’d love to know the toolset you were working with and the problems you were having, and then just tell us about the birth of Hudi inside of Uber.

Vinoth Chandar 07:02
Yeah, I think it’s a fascinating story, actually. So, back in 2014, as you can imagine, Uber was hiring a lot and growing a lot, launching new cities every week, if not every day. We were really in that phase. And if you look at what we had, we had a typical on-prem data warehouse. While Vertica is a really great MPP query engine, we couldn’t really fit all of our data volumes into it. If you look at all the IoT data, or the sensor data, or any large-volume event stream data, these things don’t fit inside that. So we built out a Hadoop data lake. I came from LinkedIn before that, so I knew the runbook very well, what to do: you know, with Kafka you do event streams, you do CDC, change capture, and get the lake up and running. And,

Eric Dodds 08:02
You know, that was familiar territory.

Vinoth Chandar 08:05
That was familiar. There were certain things I wanted to fix at Uber, things we didn’t do at LinkedIn. We wanted to ensure all data is columnar, and never have a mix of, like, JSON and other formats. So we essentially forced the whole company to schematize the data, and built a lot of tooling around that end to end, so you would get a PagerDuty alert if the data was broken. We did a lot of things to ensure that the lake could be operationalized. Within a year, we had a lake where the data was flowing and where we could use Presto for interactive querying, some Spark ETL tools, and Hive, which was still the mainstay for ETL at this point, because Spark was at 1.3 and just coming up, right? The main problem we had was that, as you can imagine, Uber is a pretty real-time business. We had our trip stores upstream, a lot of different databases, and we wanted to take the transactional data, which keeps changing, and reflect that onto the lake. With something like Vertica you already have transactions, updates, merges; you can do these things, and there are indexes. And while the data lake could scale horizontally on compute and storage, it could not do these things. So that led to the creation of Hudi, where we said, hey look, we’re between a rock and a hard place: we can’t fit all this data there, but we don’t have these functionalities here. So we chose to basically bring these database, or data warehouse, transactional functionalities to the lake, and that’s how Hudi was really born. The key differentiator, I would say, from maybe some of the other projects in the space is that right away we had to support three engines. Presto, Spark, and Hive, the three I mentioned, had to work out of the box. And the other thing is, as with every company, we had our raw data, and then we built ETL on the lake after that.
So it’s not sufficient that we just replicate the trip data very quickly by building updates and everything; we also had to build the downstream tables quickly. So we essentially borrowed a lot from stream processing. Having worked on stream processing systems before, we built CDC capabilities, or incremental streams, even into the very first version. The effect was that an upstream data store is up to date in a downstream table on the lake every few minutes, and then you can consume incrementally from that lake table and build more tables downstream. We moved all of the core data flows over to this model, and that gave us 10x, or in some cases, for a given day, we would see 100x kind of improvements over the way that we used to process data before. So fundamentally, Hudi was created around the concept of, okay, yes, we added transactions, updates, deletes, but the bigger picture was that this enables you to process everything incrementally, as opposed to doing big batch processing. That’s kind of how Hudi was born.
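
The incremental model described above can be illustrated with a small Python sketch. This is a toy under stated assumptions, not Hudi’s API: the Table class, changes_since, and incremental_etl are hypothetical names, and an in-memory dict stands in for files on the lake. The point is only that a downstream job reads the records changed since its last checkpoint instead of rescanning the whole upstream table.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy stand-in for a lake table (hypothetical, not Hudi's API)."""
    rows: dict = field(default_factory=dict)     # record key -> latest value
    commits: list = field(default_factory=list)  # (commit_time, {key: value})

    def upsert(self, commit_time, changes):
        """Apply a batch of inserts/updates as one commit."""
        self.rows.update(changes)
        self.commits.append((commit_time, dict(changes)))

    def changes_since(self, checkpoint):
        """Latest value per key across commits newer than `checkpoint`."""
        changed = {}
        for commit_time, changes in sorted(self.commits, key=lambda c: c[0]):
            if commit_time > checkpoint:
                changed.update(changes)
        return changed


def incremental_etl(source, target, checkpoint, transform):
    """Process only what changed upstream; return the new checkpoint."""
    delta = source.changes_since(checkpoint)
    if not delta:
        return checkpoint
    new_checkpoint = max(t for t, _ in source.commits)
    target.upsert(new_checkpoint, {k: transform(v) for k, v in delta.items()})
    return new_checkpoint


# Usage: an upstream "trips" table feeding a derived "fares" table.
trips, fares = Table(), Table()
trips.upsert(1, {"trip-1": 10, "trip-2": 20})       # distances in km
cp = incremental_etl(trips, fares, 0, lambda km: km * 100)
trips.upsert(2, {"trip-2": 25})                     # a late correction
cp = incremental_etl(trips, fares, cp, lambda km: km * 100)
print(fares.rows, cp)
```

The second run touches only trip-2, which is the difference from a batch job that would recompute every fare each time.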

Eric Dodds 11:20
Well, I feel like we could talk for five hours, because I have so many questions. But a quick question that’s less about the technology: what was the size of the team? And how long did it take you to go from the idea, or the definition of the problem, or maybe an early spec, to having an early version of Hudi in production?

Vinoth Chandar 11:47
Yeah. Okay, it’s kind of a funny thing, because I started writing a first draft, at least for the write and transactional side, I think in my second month at Uber. But we didn’t get to it for a year, because we put the business first. We were just trying to build and operationalize the lake, and there were so many other things to build. But finally, we decided to fund the project with three people in, I think, late 2015. And then by mid-to-late 2016, you know, late Q3-ish, we had all of our core ingest tables, at least, running on the project. I think we were only able to do that because we used existing horizontally scalable storage and all the existing batch engines, right? We didn’t write a new server runtime or build a lot of things. We didn’t try to build something like Kudu, for example, which was its own thing, something that we considered back then. And then we open sourced it as a project pretty early, in 2017. So Hudi was the first sort of trailblazer for transactions on a data lake across multiple engines. We mostly wanted to open source it to get more feedback, because we weren’t really sure if it was the right thing back then. I could tell you, you know, that it was super visionary and all that, but we were like, okay, we’re doing something a little bit awkward. At least it felt quite awkward to the people who grew up in the Hadoop space. To me it felt very natural, because I had been working on key-value stores and databases and change capture before that. But there were a lot of bridges to cross before, I think, it became a mainstream thing.

Kostas Pardalis 13:47
Can you give us a definition of what a data lake is?

Vinoth Chandar 13:51
Cool. Okay, so in my mind, at least, I’ll start with this: if you ask most people, a data lake is files on S3 or GCS. That’s the perception that people have. In reality, what we built up, what I would call a data lake architecture, is exactly that: a data lake is basically an architectural pattern for organizing data. You can even build a data lake on an RDBMS if you want. The main idea is that you replicate your operational data raw, keep it simple there, and then you do ETL from there. Over the years, the term has been overloaded with a lot of different constructs: it means Hadoop in some people’s minds, it means S3 in some people’s minds, or Parquet in some people’s minds. But the basic idea, I think, remains, which is, you know, this raw and derived data. And from the impact that we saw at Uber, what I can say is that embracing the architecture has a lot of key benefits. It completely decouples your backend teams, the data producers, from your data consumers. You have this raw data layer now, which they can use to actually figure out what the data problems are, right? Otherwise, a lot of people do transformations in flight, so you have to go back to the source system to re-bootstrap. We had a lot of basic issues around just how we did the data architecture. So that’s how I see a data lake: as an architectural pattern.

Kostas Pardalis 15:39
So that’s a very interesting definition, actually, and probably the most accurate that I have heard so far, because personally I was also still a little bit confused. You can see many different pieces of technology that fall under the umbrella of a data lake without it being very clear what their role in the data lake architecture is. And obviously marketing doesn’t help with that stuff, especially now that we have terms like the lake house, where we’re trying, let’s say, to take the data lake and make it equivalent to the data warehouse, and vice versa. And that’s my next question: what’s the difference between a data warehouse and a data lake in terms of architectural patterns?

Vinoth Chandar 16:28
So let’s actually talk about how the system design of data lakes and data warehouses has typically been, right? That is, I think, what you’re speaking to. If you just go back to how we were doing Vertica or Teradata, you essentially bought some software and installed it on a bunch of servers, right? They had deep coupling between storage and compute, and it’s a fully vertically optimized stack. Having closed file formats, one query engine, and one SQL on top of that data, on a fixed set of servers, means they’re probably still able to squeeze out performance per single core. So that is how your on-prem, traditional data warehouses have been built. Your data lakes, even from the Hadoop era, typically rely on HDFS or cloud storage or some horizontally scalable storage. You decouple the storage, and then, even if we go back to 2014, you can fit, like I mentioned, a Presto or a Spark or a Hive on top of the same data. So the fundamental difference here is that the data and the compute are more loosely coupled in a data lake, and they’re much more strongly coupled in a warehouse, in terms of cross-stack optimizations and how it’s built. The modern cloud warehouses have changed the game: they’ve actually decoupled the storage and the compute, but the format and everything else is still vertical, right? So there’s one format for Snowflake or BigQuery, one SQL, and it just operates in a different way, which gives you a lot of scalability over traditional warehouses. That’s why you see a lot of people saying, okay, you don’t need data lakes, you just need cloud warehouses. If all you’re trying to do in life is just BI, maybe they are right. Cost aside, maybe they’re right.
Now, go to the cloud. While the cloud warehouses have leapfrogged on-prem warehouses and evolved, if you look at data lakes in the cloud, they’re very similar to how they were on-prem. So that’s where we are today, and that’s where the lake house comes in. We did, you know, pioneer transactions on the lake and all that, but we didn’t call it the lake house back then, mostly because I felt, and still feel even today, that the transactional capabilities that we have in Hudi or all these similar technologies are much slower compared to what you would find in a typical full-blown warehouse. So we were a little bit shy about those things, but I think many people weren’t. But that’s kind of what we refer to as a lake house now, right? We brought some of this core data warehouse functionality back to the data lake model, put it on top of an open format like Parquet or ORC, and made it accessible to multiple engines. That’s what a lake house is. It gives you some of the capabilities of the cloud data warehouse, while it still retains the advantages the lake has over a warehouse. For example, it’s much more cost efficient, it’s way cheaper. And if you think you’re eventually going to need machine learning or data science, it’s a more forward-looking way to build: get your data first into some sort of lake house, do your analytics and data science there, and then you can move a portion of your workload into a cloud warehouse. I feel like we will go back to that model in the next few years, because the cloud data warehousing architecture fundamentally doesn’t really suit running large data processing on it. So at least a good chunk of the market, I think, will move towards this model.

Kostas Pardalis 20:53
Yeah, that’s interesting. I remember I was talking recently with some friends at a company where, traditionally, when it comes to data management and data processing, they had two parallel paths. They had a data warehouse that was used for data analytics and BI. And then they also had, let’s say, a data lake, based on Spark on top of S3, which was used by the data science team. What they want to do now is move to, let’s say, Delta Lake, the lake house architecture, so they can merge these two together, and the two teams inside the company don’t use two completely different stacks for their work. So that’s very interesting to hear from you as well, because it aligns with what people are trying to do. You talked about transactions, and having transactions on top of a data lake. Why are transactions important? Why do we need them?

Vinoth Chandar 21:56
Yeah. So let’s look at it through the lens of a use case, right? GDPR. I look back at GDPR, and I can see that it was the one use case that kind of trickled this need down to everyone else. Until then, if you look at the stuff that I talked about, sure, Uber had the business need for faster data, and we did it a certain way, and anybody who does that will get the efficiency gains that we got. But the business drivers for that simply weren’t there before something like GDPR. You need to ingest data, and now you also need a team to go scrub the data and, say, delete people who left your service, right? So you now introduce two teams who want to operate on the same data set or table. That forced updates, deletes, and transactions, and that pretty much is what made this an inevitable transition. If you’re doing a lake, you’re probably going to want to move to one of these newer things now. That’s the main thing, I would say.

Kostas Pardalis 23:12
Okay, that’s very interesting. And like you said at some point, we take something from the database world, which is transactions, and we implement it on top of the file system. But again, the way that we implement transactions is not exactly what you see in a data warehouse, right? So what’s the difference? What is it that we don’t need?

Vinoth Chandar 23:35
So here, I think there are significant differences. People tend to talk about Delta Lake or Hudi as the same kind of thing, because we like to compare things, and it’s easier for us to understand things by comparing them, right? But if you look at the concurrency model, even there they’re completely different in how they are designed. Warehouses do multi-table transactions, for example. On the lake, in the lake house, we’ve been saying, yeah, we can do it, we can probably add multi-table transactions. But in terms of locking, warehouses can do more pessimistic locking. Since they have long-running servers, they can probably do a lot more unique constraint and foreign key validation, the kinds of things that you would expect in a full-blown database, which they’re able to do today. The other key difference is that the current lake house architecture is kind of serverless, right? It’s like a serverless warehouse, if you will, that comes up part by part as needed, on demand. A writer comes up, writes, and then goes away. And then a reader comes and goes away. So there is no long-running process that you can use to do coordination, and that makes for some interesting challenges. If you take a look at Delta Lake, they pretty much do optimistic concurrency control, which basically means that if two writers don’t contend, you’re fine, but otherwise one of them fails. And if you look at the approach we’ve taken in Hudi, we try to serialize everything. We try to resolve conflicts by supporting log-structured, differential data structures: we try to take in the writes and then do collision resolution later on. Because at the end of the day, data lakes are about high-throughput writes, and these transactions are, in database terms, very large transactions.
So you cannot really afford to have one of them fail. Imagine a delete job that ran for eight hours, and it fails now. Then you’ve lost some eight hours of compute and all that cloud cost. So we took a very different approach, I would say, because we were focused a lot more on streaming CDC data and all of those incremental use cases. If you look at Databricks and Delta Lake, they probably have a lot of batch Spark workloads that they run, so they probably don’t get that much concurrency overlap, and maybe OCC works well for them. So just like with databases, where we have Oracle or Postgres or MySQL, I think there are so many technical differences between these projects that we will end up with a bunch of these things, I feel, over time.
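
One way to read the contrast drawn above is the following minimal Python sketch. It is a loose analogy under stated assumptions, not how Delta Lake or Hudi are implemented: OccTable, LogMergeTable, and their methods are hypothetical names, and conflicts are tracked per record key in memory. The first class rejects a writer whose keys overlap with a commit that landed after its snapshot; the second accepts every write into a log and settles overlapping keys when the log is merged.

```python
class ConflictError(Exception):
    """Raised when an optimistic writer loses to a concurrent commit."""


class OccTable:
    """Optimistic concurrency control: validate at commit time, fail on overlap."""

    def __init__(self):
        self.rows = {}
        self.log = []  # one set of touched keys per committed write

    def begin(self):
        # The snapshot is just the number of commits seen so far.
        return len(self.log)

    def commit(self, snapshot, changes):
        for touched in self.log[snapshot:]:
            if touched & set(changes):
                raise ConflictError("keys overlap with a concurrent commit")
        self.rows.update(changes)
        self.log.append(set(changes))


class LogMergeTable:
    """Append every write to a log; resolve overlapping keys at merge time."""

    def __init__(self):
        self.base = {}
        self.log = []  # (commit_time, changes)

    def append(self, commit_time, changes):
        self.log.append((commit_time, dict(changes)))

    def compact(self):
        # Later commit times win for keys written more than once.
        for _, changes in sorted(self.log, key=lambda entry: entry[0]):
            self.base.update(changes)
        self.log.clear()
        return self.base


# Usage: two writers start from the same snapshot and touch the same key.
occ = OccTable()
snap_a, snap_b = occ.begin(), occ.begin()
occ.commit(snap_a, {"user-1": "deleted"})
try:
    occ.commit(snap_b, {"user-1": "updated"})
except ConflictError:
    print("second writer must retry its whole job")

merged = LogMergeTable()
merged.append(1, {"user-1": "updated"})
merged.append(2, {"user-1": "deleted"})
print(merged.compact())
```

With multi-hour write jobs, the retry in the first case is the cost the discussion is pointing at; the second design trades that for extra merge work later.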

Kostas Pardalis 26:27
That makes sense. And this is my last question on transactions: do you see transactions on the data lake getting closer and closer to what we have in databases? Or do you think there is a limit out there beyond which it doesn’t make sense, or that we cannot say yet?

Vinoth Chandar 26:46
I think we can build the same thing. In Hudi, at least, we are actually experimenting with adding a metadata server. If you look at the problem as a data plane and a metadata plane, the data warehouse has servers for both data and metadata. The lake has no servers for either data or metadata today, with the way that things have evolved with Delta Lake or Iceberg, where you stick metadata into a file. I mean, that’s not going to be as performant if you compare it to what, let’s say, Snowflake does, which is keep metadata in a horizontally scalable OLTP database, like FoundationDB, for example. So we are trying to tinker with a model where we have servers for metadata, and we keep the data plane kind of serverless, where Spark jobs should be able to access S3 directly. So that’s one thing we feel will bring it a little bit closer. This, I feel, is the gap in the lake house architecture today. But I like the first aspect you mentioned: do we need to do that? That’s the other part. Unless you are running a really large number of concurrent workloads, today there isn’t a pressing need, right? The lake house vision is just starting up. But if you have to fulfill that vision, I would imagine that you need a full set of capabilities. People should be able to run workloads on a lake house that are as highly concurrent and highly scalable as they would be on a warehouse. So I think there are technical gaps, and a lot of things to be built there in the next couple of years or more going forward.

Kostas Pardalis 28:34
Super interesting. And outside of transactions, what other components from a more traditional database system do you see as also required for a data lake or lake house?

Vinoth Chandar 28:49
So I don’t know if this fits into the lake house model, but at least for Hudi, we actually borrowed a lot from OLTP databases as well, like indexes, for example. We have an interesting problem for CDC, right? Say you have an upstream Oracle or Cassandra, or some OLTP database taking writes, and you have to replicate that downstream to a data lake table. Why are the updates faster on the upstream OLTP table? Because it has indexes and whatnot to update records. So if you have to keep up with an upstream OLTP table, your write on the data lake table has to feel like you’re writing to an OLTP table. So we invested a lot in this. The problem is similar to running a Flink job that reads from a Kafka CDC stream and then updates state. So, essentially, stream processing principles. We borrowed a lot from stream processing and databases and brought that to the data lake as well. And that is, I think, at this point, a pretty unique thing that we’ve been able to achieve. If you look at a lot of Hudi users, they are able to stream a lot of raw data to the lake very quickly, and that’s all possible because of this. But for the core warehousing problem, I think we already have columnar formats; we need to close the loop on transactions and get the usability there. That’s something that we haven’t talked about at all. We talked a lot about technology, but not about usability. How quickly can you build a lake house, versus how easy starting on a warehouse has been all this time? These kinds of things are more important for the lake house vision, I think. But we are trying to add more capabilities on the lake than even a typical warehouse has today.
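
The role an index plays in keeping up with an upstream OLTP table can be sketched in a few lines of Python. This is a toy under stated assumptions, not Hudi’s index implementation: IndexedTable and its fields are hypothetical, each "file" is an in-memory dict, and new keys simply go to the currently smallest file. The idea shown is that a key-to-file index lets an update rewrite one file instead of scanning every file to find the record.

```python
class IndexedTable:
    """Toy lake table split into files, with a record-key -> file-id index."""

    def __init__(self, num_files):
        self.files = [dict() for _ in range(num_files)]
        self.index = {}  # record key -> id of the file holding that key

    def upsert(self, key, value):
        """Route the write via the index; return the id of the file touched."""
        file_id = self.index.get(key)
        if file_id is None:
            # New key: place it in the currently smallest file (ties pick the first).
            file_id = min(range(len(self.files)), key=lambda i: len(self.files[i]))
            self.index[key] = file_id
        self.files[file_id][key] = value
        return file_id

    def get(self, key):
        """Point lookup that reads only the file the index points at."""
        file_id = self.index.get(key)
        return None if file_id is None else self.files[file_id].get(key)


# Usage: replay a small change stream from an upstream database.
table = IndexedTable(num_files=2)
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    touched = table.upsert(key, value)
    print(key, "-> file", touched)
print(table.get("a"))
```

Without the index, the third change would need a scan across files to find where "a" lives, which is the gap between a plain file-based lake and an OLTP-style write path.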

Kostas Pardalis 30:42
Yeah, that makes total sense. And what about the query layer?

Vinoth Chandar 30:46
Yeah, that’s an interesting one. I think today the lay of the land is, if you’re on the lake, you pick Presto, Trino, etc. for a lot of the interactive queries, and you write Spark or Flink or Hive ETLs. I know I’m broadly categorizing, but those are the major things that pop up, right? And the key thing to understand here is that there are a lot of things that we don’t typically even classify as query engines. All the different NLP frameworks, for example: some of them are not even distributed, but they still work on these open data formats. So there’s this larger, more fragmented toolset around the ML, NLP, AI, deep learning space, and that is only going to grow. So I see a future where there will be more and more engines on the lake. And I think the smarter strategy here would be to have a lake strategy again, and keep your data in an open format that you can get support for from many people, and have it be more future-proof. That’s what I think this is inevitably going to lead organizations into.

Kostas Pardalis 32:22
Yeah, I’ll ask something from the completely opposite side of the stack, because we were talking about the query layer. Correct me if I’m wrong, but what I understand is that with the data lake, in the end, your work as the creator of Hudi, for example, was to build, let’s say, a table format on top of some file formats that we already have. Usually we’re talking about Parquet and ORC here, right? Is this correct, the way that I’m understanding it?

Vinoth Chandar 32:51
Yeah, so this table format term, again, doesn’t do justice to at least what Hudi has to offer, for example, right? You need a lot more than a table format. If you look at what a table format is, it’s metadata about your files, right? It’s a means to an end. What I think we built in open source today is a lot of the services that also operate on the data, because without them being open, it doesn’t matter that the format is open, right? You don’t own the services that operate on it, so you’re basically saying, I have to buy from some vendor who will operate the services for me. So this is the gap that I think something like Hudi, which I’m speaking for here, fills. We have compaction and clustering; we have the bottom half of the warehouse, or a lakehouse, or a database, available to you as a runtime, which you can now use to query multiple different file formats. And to your point, yes, it’s mostly analytical storage, right? But if you look at Hudi, there are some use cases that come up where people really don’t want a database, but they want a key-based point lookup on top of S3 data. We support HFile as a base format, for example. HFile is the underlying file format for HBase. It’s really optimized for range reads, and you can do batched point key gets from an HFile. So there are, I think, going to be more and more use cases like this. I can totally imagine how this can be used for, let’s say, hyperparameter tuning or something in a machine learning pipeline, right? So I think there’s a lot more that we probably haven’t built. This space is still nascent, in my opinion. For all the reasons that I’ve been citing, there’s still a lot more work to do here.
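
The key-based lookups described in this answer rely on files that keep their keys sorted. The Python sketch below is a toy stand-in for that idea, not the HFile format or any Hudi API: `SortedKeyFile`, `batch_get`, and `range_scan` are hypothetical names, and the data lives in memory rather than on S3. It shows why sorted keys make both batched point gets and range reads cheap.

```python
import bisect


class SortedKeyFile:
    """Toy sorted key/value file: keys are kept in order so lookups can binary-search."""

    def __init__(self, records: dict):
        items = sorted(records.items())
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def batch_get(self, keys) -> dict:
        """Look up many keys in one call; keys that are absent are simply skipped."""
        found = {}
        for key in sorted(set(keys)):            # visit requested keys in file order
            pos = bisect.bisect_left(self._keys, key)
            if pos < len(self._keys) and self._keys[pos] == key:
                found[key] = self._values[pos]
        return found

    def range_scan(self, start, end) -> list:
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


lookup_file = SortedKeyFile({"b": 2, "a": 1, "d": 4})
hits = lookup_file.batch_get(["d", "a", "z"])     # "z" is not present
window = lookup_file.range_scan("a", "c")
```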

Kostas Pardalis 34:56
Do you see any space left for innovation when it comes to the file formats themselves? Because, okay, we take for granted Parquet or ORC out there, and that’s pretty much what everyone is using, right? Do you see anything changing there? Or do we need something to change there?

Vinoth Chandar 35:18
Yeah, so that’s the thing, right? Oftentimes in open source, and that’s my own pet gripe, I mean, I’ve been in open source for 12 years, I think what wins is whatever is most popular, right? It is a popularity contest in some sense, and it stays that way until it gets swapped out with something new. So for a change to happen at the file format layer, I’m pretty sure that a new, better file format can be written. Google even has Capacitor as the file format underlying BigQuery, right? It is a successor to the Dremel format, which is what Parquet is based on. You can read a blog about it, but they didn’t open source the format this time. So there’s already one out there. It’s more that we’ve standardized on this now, so it’s going to take a while for people to migrate. But I’m pretty sure, with new processors coming out all the time, and there are documented things around CPU efficiency in how you access Parquet, that there’s plenty of room for improvement. I think Parquet was originally designed in an era of mostly on-prem HDFS, right? So you had to care a lot about storage space. But if you now don’t care as much, would you do certain things differently? I haven’t put a lot of thought into it, but I’m pretty sure there’s something better that can come out in the future.

Kostas Pardalis 36:50
That’s super interesting. Cool. So you mentioned open source, so let’s spend some time on that aspect of the data lake. Because, as we said, we have three major technologies out there, and all three of them have some open source presence. I’ll start by asking you why data lakes are open source. We can see open source there, but when we are talking about data warehouses, I think instinctively the first response will be that we don’t have an open source data warehouse, right? Why is that?

Vinoth Chandar 37:26
I honestly feel this all started from the Hadoop data lake era, basically, where I think Cloudera, if my memory serves me right, boldly declared that everything open source is the way to go. And I think I agree. It’s basically been a trend from there, because Spark was open; the major tools that have succeeded have been open, right? And then I think we ended up with the lakes being open and the warehouses being more closed. I don’t know why that is, though I do see that there are advantages in being closed: you can move faster, and you can build more vertically optimized solutions. Historically, databases have been that way. Take RDBMSs: we won’t even talk about something like this for OLTP databases, for example, right? We won’t say, why don’t we have a common table format, and let’s have Spanner and YugabyteDB and CockroachDB all query that format or something. So I don’t find that very weird. I wouldn’t be the person who would say, yeah, it should just be open, otherwise it’s wrong. I don’t think that’s true. To that point, what do databases add? They add a lot of runtime over that format, and at that point you’re not dealing with the format, so it doesn’t matter whether it’s open or not, right? So what I really care about, again, is whether the services are open. Can you cluster a Snowflake table outside of Snowflake? Maybe there is someone who can use AI and super-cluster your tables automagically, some genius who has this one clustering algorithm. Can you use it? You can’t, right? So I think that is the main thing that I would say the lakes bring. And it’s been that way. On the flip side, I feel warehouses do have a better out-of-the-box, easy-to-get-started experience. They’ve made it work at the cost of openness. And on the lake, I would say people still have to build a lake, right? You can use a warehouse, but you have to build a lake. You can’t just download one, or sign up for something and use it, right? You have to go hire a data engineer, hire some people, build the data team, and then they will build a data lake for you. So there are pros and cons to both approaches, I would say. I don’t know which one’s right.

Kostas Pardalis 40:08
Do you think this is going to change for data lakes? Do you see more effort put towards the user experience, let’s say, of these technologies?

Vinoth Chandar 40:18
Yeah, I think there certainly is; at least we are doing it. That’s kind of how we even got started. If you go to Hudi, you will find a full-fledged streaming ingestion service, right? There is a single spark-submit command that you can run, and then the tables get clustered and cleaned and indexed and all this. All the Z-ordering or Hilbert curves, the stuff that is locked away in, say, Databricks or Snowflake, you can find in open source. And we try to give you a toolset where you can actually run it easily. But here’s what I see: even as we make it more and more usable, more and more consumable, there are still the operational aspects of it. I do see people in the community, really talented, driven data engineers, who come in trying to pick up all these database concepts, trying to understand what data clustering is, why we do linear sorting or Hilbert curves. They’re trying to understand all these fundamental database concepts, trying to become platform engineers, trying to run 1,000 tables and manage that entire thing for the company, right? Many of them come out with flying colors; some of them don’t. And in any case, it takes a year or more for people to get through that learning curve. So this is where I wonder whether there is a better model here, where companies should be able to get started as easily as with a warehouse: okay, don’t worry about all of this, just get started with all of these lake technologies. Then later, maybe you want to do it yourself, right? Then you should be able to fork off. What I’m suggesting is pretty much the reverse of what most open source go-to-market is, which is: you build your community, and you keep it bare minimum so that people can use it.
And then you build more advanced stuff on top. But for the lake, with Hudi, we try to make everything easy. The problem is people still need to take it and run it. It’s a non-trivial thing to operate a database as a service, right? Having done that with Voldemort at LinkedIn and ksqlDB on the cloud, I can vouch for that; that much I can talk about with some authenticity. So we should make it easy for people to get started with the lakehouse model, or lake, or whatever. And then at some point your business will grow to where it needs data science, where the business needs ML, right? At that point, you can decide: okay, am I going to be able to hire better engineers than that vendor? You shouldn’t be bottlenecked on the vendor. If you want to move quickly, you should be able to branch out from open source and run your own thing, right? That is, I think, the model that we should build. Unfortunately, what happens in the data lake space today is, you may remember the famous Parquet versus ORC format wars of the Hadoop era, right? Two companies fighting over basically the same two formats. They’re kind of doing the same thing with table formats, which defeats the whole point of the thing having been open to begin with, right? Because most data lake companies are a query engine or a data science stack, and they’re basically going and upselling users: hey, use this format, use that format, including Hudi, right? But the real problem here is that organizations have to go hire the engineers and do the ops, and those engineers have to get every optimization right, for someone high up who signs the check to say, oh yeah, you are better than the warehouse, or you’re now future-proofed, and for the organization to see the benefit. So if we don’t fix this problem this way, it’s not about technology. I think we can fix all the gaps.
But I think this is the problem that I see: the managed aspect of it, and that there’s no easy way to get started. Otherwise, I think the cloud warehouse will remain the entry point, and you build a lake when you’re suffering from cost or openness, or you want a data science team. That’s how it will be if we don’t fix it this way.
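
For readers picking up the clustering concepts mentioned earlier in this answer, here is a minimal Python sketch of Z-ordering (a Morton code). It is an illustration only, not how Hudi or any warehouse implements it: the function names `interleave_bits` and `z_order_sort`, the 16-bit width, and the dict-shaped rows are assumptions. The idea is to interleave the bits of two column values so that rows close in both columns end up close in the sort order, which is what lets file-level statistics skip data on either column.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return z


def z_order_sort(rows, x_col: str, y_col: str) -> list:
    """Sort rows (dicts with non-negative integer columns) by their Z-order value."""
    return sorted(rows, key=lambda r: interleave_bits(r[x_col], r[y_col]))


rows = [
    {"lat": 1, "lon": 1},
    {"lat": 0, "lon": 0},
    {"lat": 0, "lon": 1},
    {"lat": 1, "lon": 0},
]
clustered = z_order_sort(rows, "lat", "lon")
```

A plain linear sort on `lat` then `lon` clusters well only on `lat`; the interleaved key trades a little locality on the first column for locality on both.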

Eric Dodds 44:41
Quick question on that front. We certainly have listeners who are managing complex data lake infrastructure, but I’m thinking about our listeners who maybe started with a warehouse and know that the data lake is inevitable in some way for the organization. To your point, that can be a big step. What are the things they need to be thinking about or even planning for when they’re six months or a year away from needing a larger data lake infrastructure? Are there decisions, or architectures, or even ways they think about data now that will help make that path smoother, even though the tooling isn’t quite there to make it easy for them?

Vinoth Chandar 45:33
So the first thing I would say is, do more of the streaming, event-based, Kafka hub kind of architecture, right? Because it really gives you the ability to get all your data streams into a single firehose, and then you can tee this off to the warehouse or to the lake. You have the flexibility. I would say most people who are on the journey today are using an opaque data integration pipe, which takes data and loads it, let’s say Fivetran, for example. Fivetran and similar tools are really great services. But architecturally, you just can’t tap into the data streams, so you really have to, like, there’s

Eric Dodds 46:15
a capture, like a core data infrastructure pipe that those tools actually feed into for you to actually feed it out into your own

Vinoth Chandar 46:22
data, right? Yeah, switching my hat a little bit: if you look at my life at Confluent, what we were trying to build was, okay, you decouple the source connector and the sink connector. So you get the CDC log from, like, an Oracle database, and then you can feed it to many other systems: a lake and a warehouse. So make sure your data can flow into both, and that you have the optionality to pick which one you want to send where. That’s one. The other thing is, probably start with moving your raw data to the lake. That’s where you have most of the data volume. And since it’s usually in an OLTP schema, not optimized for analytical queries, that’s where you’re probably spending most of your cost on the warehouse as well. So those are really good candidates for you to start with. And then in most scenarios the derived tables you can keep there; they’re more performant in the warehouse, so you can slowly migrate them over, right? And what you need in the meantime is to really push your cloud data warehouse provider for better external table support, because they have no incentive to do that unless you force them to. Technically speaking, what I see in organizations is: okay, I’m using pipeline company X and data warehouse Y. And then if you want to build a lake and offload your raw data, going back to our first question, you’re going to move raw data out. If you do it, all the SQL still has to run correctly, like before, so you can build the derived data. That is where I think there is stickiness and lock-in points for warehouses. Unless the SQL can run in a reasonable amount of time on the lake, this project will fail, right?
So, for example, in Hudi we just added dbt support, so you can get raw data tables in Hudi, and now you can use dbt to transform them. We’ll be working towards more parity, or more standardization; today we are as standard as Spark SQL is, right? So you can use that, and use dbt to do the transformations on the lake. There should be a way for you to move the workloads to the lake seamlessly. Think about those abstractions, whether it’s dbt, or Airflow, or lakehouse-compatible SQL; think through all these things. And if your cloud warehouse provider provided better external table support, then you can keep those queries running even if you offload the raw data to the lake, and you can try Presto or some other lake engine in the meantime as you decide how things are going, right? So it’s not going to be an easy switch. It’s going to take a year, or at least six months, for you to switch over a reasonable amount of data. So planning ahead on all these touch points is what I would advise people to think through first.
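
The "single firehose that you tee off" architecture from this answer can be sketched in a few lines of Python. This is a conceptual toy, not Kafka or Kafka Connect: `ChangeStream`, `attach`, and `publish` are hypothetical names, and the two sinks are plain in-memory lists standing in for a warehouse loader and a lake writer. What it shows is the decoupling: the source publishes once, and sinks can be added later without touching the source.

```python
class ChangeStream:
    """Toy event hub: one source of change events, any number of attached sinks."""

    def __init__(self):
        self._sinks = []

    def attach(self, sink) -> None:
        """Register a callable that will receive every event published afterwards."""
        self._sinks.append(sink)

    def publish(self, event: dict) -> None:
        """Deliver one change event to every attached sink."""
        for sink in self._sinks:
            sink(event)


warehouse_rows = []   # stands in for a warehouse loader
lake_rows = []        # stands in for a lake writer

stream = ChangeStream()
stream.attach(warehouse_rows.append)
stream.publish({"op": "insert", "id": 1})

# The lake is added later; the source and the warehouse sink are unchanged.
stream.attach(lake_rows.append)
stream.publish({"op": "update", "id": 1})
```

With an opaque source-to-warehouse pipe there is no `attach` point, which is the optionality problem described above.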

Eric Dodds 49:32
Sure. I think that’s really helpful, because the question I asked was, what do you need to be thinking about if you’re going from a warehouse-based infrastructure to adding the lake infrastructure? And you would think that the answer is more around the lake, but it’s actually more around the orchestration and the pipelines, and giving yourself option value as it relates to all the various components of the stack that are going to arise from moving towards the lake architecture.

Vinoth Chandar 50:05
Right. I’ve seen many companies, and I categorize them into two buckets. One is, if you don’t do this right, what happens is there is a lake, but no one’s using it. And then over time the data quality declines, and slowly these projects start to fizzle out. The ones that succeed have top-down energy to say, okay, we’re going lake-first, and we are going to revamp the whole thing. In lots of scenarios, for example, the lake comes in when data science comes in. Usually what happens is data scientists show up and say, hey, okay, fine, you want me to improve your app, but give me some events; tell me what’s going on in the app. Then Kafka comes in, and you pump in a lot of events, right? That’s when the data volume spikes, and that’s when people are like, oh yeah, we need a lake, right? This is kind of how that cycle works. Typically, people who start like that have a lot more drive to get it done that way. And the missing puzzle piece there is moving data from the warehouse, replicating the database or SaaS data that you may already have in the cloud warehouse. But those are people who are leaning more towards, I’m going to pay the double cost of warehouse and lake for some time, and then over time I’ll figure out how to move things. And I think this will be the most interesting thing to watch. Because right now, given the performance, lakes are super optimized for running large-scale data science and machine learning workloads, and the warehouses are really optimized for running BI, so I think the BI workload stays there. That’s in essence what we’re saying here. As we build tech, I think maybe more BI goes to the lake. That’s what the rise of Starburst tells you, and the rise of Presto tells you, right? I think it’s very interesting times to be building data. It’s going

Eric Dodds 51:57
to be super fun. It’s going to be super fun. I have one more question for you, and this goes back to where we started with the origin of Hudi. You got running with three engineers in production in a pretty short amount of time for a new technology that’s managing that scale. Was there a feature or optimization in the business that sticks out in your mind as an early win? I’m sure we have users of Hudi, or people who want to use it, among our listeners, and I’d just love for them to hear, as you developed this amazing technology, what was an early win inside of Uber that came directly from it?

Vinoth Chandar 52:42
Yeah, there was a direct dollar value attached to that project at that point, a dollar value that exceeds hundreds of millions of dollars, because we were able to run fraud checks a lot faster, which meant we could report to banks a lot faster. And you can imagine how complex these checks would be; it’s very hard to write those in a streaming sort of way and get it right. But if you have, like, a Hive query basically, or something like that, it’s easier. We needed near-real-time data, not real-time real time, but we needed to be running some checks every hour, for example, as opposed to every 12 hours or every day. And you can imagine, at Uber scale, the amount of rides and everything. The banks typically give you more money back if you report sooner. Again, don’t quote me on this; that is how it was there, and I don’t know how banking rules have changed. But that financial need was the main driver. And then of course Uber is intrinsically real time: it starts raining, it affects our business, right? There’s a huge concert, the traffic changes. So intrinsically the business had real-time needs, and it’s sometimes hard to put a dollar value on that, except that we can count the number of times people wished data was there sooner. But the real tangible dollar value was that all the background things, rider safety, for example, all these background tasks and fraud data processing that we do to make the Uber experience better, can run faster, quicker, more incrementally, that sort of thing. And I at least came in with that mindset, because at LinkedIn, the main thing that we tried to incrementalize was People You May Know, for example. It’s a fairly complex graph algorithm, but we spent a lot of time around it.
Okay, if you and I connected just now, then it’d be cool if I go to LinkedIn and get the recommendation right away. Probably they’ve made it work now; I haven’t kept track. But we were in that mindset: okay, yeah, let’s make all the batch jobs incremental. There is no reason for them to be running full batch and eating up entire clusters, right? So that’s sort of how we went about it.
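
The "make the batch jobs incremental" idea can be shown with a small Python sketch. It is a toy under stated assumptions, not Hudi’s incremental query API: the record layout (`key`, `amount`, `commit`) and the functions `full_recompute`, `changes_since`, and `incremental_update` are hypothetical. A full batch job re-reads every record on each run; the incremental job remembers the last commit it processed and folds in only the records written after it.

```python
def full_recompute(records) -> dict:
    """Full batch: rebuild per-key totals from every record, every run."""
    totals = {}
    for r in records:
        totals[r["key"]] = totals.get(r["key"], 0) + r["amount"]
    return totals


def changes_since(records, last_commit: int) -> list:
    """Records written after `last_commit`. A real table would answer this from
    its commit metadata; this toy version simply filters a list."""
    return [r for r in records if r["commit"] > last_commit]


def incremental_update(totals: dict, records, last_commit: int) -> int:
    """Fold only the new records into `totals`; return the new high-water mark."""
    new_commit = last_commit
    for r in changes_since(records, last_commit):
        totals[r["key"]] = totals.get(r["key"], 0) + r["amount"]
        new_commit = max(new_commit, r["commit"])
    return new_commit


records = [
    {"key": "a", "amount": 1, "commit": 1},
    {"key": "b", "amount": 2, "commit": 1},
    {"key": "a", "amount": 5, "commit": 2},   # arrives after the first run
]

totals = full_recompute(records[:2])           # first run sees commit 1 only
checkpoint = incremental_update(totals, records, last_commit=1)
```

After the incremental run, `totals` matches what a full recompute over all three records would produce, but only one record was processed.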

Eric Dodds 55:08
Amazing. This has been an amazing conversation. I know we could keep going, but we’re at the buzzer. Thank you so much for joining us. I learned an immense amount, and I know our audience did too, so thank you for sharing some of your time with us.

Vinoth Chandar 55:23
Yeah, glad to be here. And these are really deep questions, so thank you for these questions. It also helps me think better. All right. Thanks, everyone.

Eric Dodds 55:35
I’m going to break the rules in this recap: I have two takeaways. One, I love that he called himself a one-trick pony. I think he was very authentic, and that was just hilarious to me. The other one, which we talked about right towards the end of the episode: sometimes you think about the gains from building your own infrastructure and how you calculate ROI on that. Is it engineering time saved, etc.? But he was talking about financial transactions to the tune of hundreds of millions of dollars, which is wild. Those stakes are really, really high, and that was just amazing to me. I wasn’t expecting that magnitude of ROI impact, but it’s massive. So that’s just, man, it’s crazy.

Kostas Pardalis 56:29
Yeah, 100%. I think it was a super interesting conversation. I think we managed to make much clearer what the data lake is and why it is important, what the distinction is with a lakehouse, where things are today and where they are going. And we had a pretty technical conversation, but without getting into too much technical detail. I really enjoyed this conversation, and we definitely need to get him back. I think we have many more questions and much more to discuss. We didn’t have the time, for example, to talk about open source project governance: what’s his experience there, and why is it important?

Eric Dodds 57:17
Yeah. I’d love to hear more about running a project like Hudi within the Apache Foundation. Yeah, that would be so interesting to hear about.

Kostas Pardalis 57:27
100%. I think he was the first guest that had a direct relationship with a data lake technology. There are more out there. Hopefully we will manage to get more on the show to discuss both lakehouses and data lakes and everything. So yeah, I’m really looking forward to having him back on the show again.

Eric Dodds 57:50
We’ll do it. All right. Well, thanks again for joining The Data Stack Show. A lot of great episodes coming up, so make sure to subscribe, and we’ll catch you in the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.