Episode 103:

Everyone Is Invited to the Data Lakehouse with Kyle Weller of Onehouse.ai

September 7, 2022

This week on The Data Stack Show, Eric and Kostas chat with Kyle Weller, the Head of Product at Onehouse. During the episode, Kyle discusses data engineering products, the distance between data powers, and how to get started in the data lake house world.

Notes:

Highlights from this week’s conversation include:

Kyle’s background and career journey (2:38)
Unique challenges in building data engineering products (9:33)
The problem set Databricks resolves (13:46)
About Onehouse (17:15)
From Microsoft to Onehouse (20:59)
Why there’s so much distance between data powers (24:45)
Why the data lake is not enough (30:15)
Who should have a lake house (39:03)
Why we have all three data platforms (43:53)
How to step into the data lake house world (49:48)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show, Kostas. Today we are talking with Kyle from Onehouse. Now, we’ve actually heard about Onehouse before, although we didn’t know the name. We had Vinoth, one of the creators of Apache Hudi, on the show, in one of our best-performing episodes ever. And he told us, I think before the show, and he may have mentioned it on the show, that he was working on something based on Hudi. But they were in stealth mode. And now they are not in stealth mode anymore. The company is called Onehouse. And like I said, it’s built on Hudi. And we’re talking with Kyle from Onehouse. He’s their head of product, and he has a really interesting background: he spent a ton of time at Microsoft and is now establishing Onehouse along with Vinoth. You know, this isn’t going to surprise you at all: I do have a lot of questions about Onehouse. But I kind of want to hear about his experience at Microsoft, because he worked on a lot of things. You know, he worked on the Office suite, he worked on Bing, he worked on sort of the Siri equivalent from Microsoft, Cortana I think it is, and then Azure Databricks. And so I want to hear about his experience at Microsoft. We don’t talk a ton about Microsoft on the show, but it is a huge, huge company with tons of data products. So that’s what I’m gonna ask about. How about you?

Kostas Pardalis 1:50
Yeah, I have plenty of questions to ask about lakehouses and the sort of work that they’re doing at Onehouse. But one of the questions that I definitely want to ask him is how it feels to go from a company like Microsoft to a seed-stage startup. I’m very curious to see how it feels and how it’s going. So, all right, let’s do it.

Eric Dodds 2:22
Kyle, welcome to The Data Stack Show. We’re so excited to have you.

Kyle Weller 2:26
Thanks, Eric. Excited to be here.

Eric Dodds 2:28
All right. Well, let’s start where we always do, give us your background. And then tell us what you do today at Onehouse.

Kyle Weller 2:36
Yeah, great. Thanks. I’ve been in the data space about nine years, and I started that journey at Microsoft, actually. In that time, I built data platforms, data engineering platforms, for large-scale services like Office. I actually joined Microsoft in some interesting times, when they first released Windows 8. But we were also developing the new O365 apps and mobile apps. And there, I was tasked with building a new telemetry system for all of these Office applications. So it was a really fun time. Back in 2013, we had an internal tool built on Hadoop, and we actually had some kind of transactional data lake components in there, a project called Project Baja, if anyone’s interested to search these things back. Back then, in 2013, that was pretty advanced stuff. But yeah, I faced some interesting data engineering challenges building data platforms for Office, then went over and did this for Bing, Microsoft’s search engine. And that was, of course, a much larger scale, more mature kind of data platform. From there, I wanted to consciously drive my career more and more into product building. And so the first step from there was going over to Cortana. This is Microsoft’s digital assistant, like Siri. And there I was driving product growth strategy and measurement strategy. So a lot of data science work, and defining from a business perspective what success means for the product, and tracing that back down into how we measure the product and track its growth. And then I switched over to building true data products themselves. I went over to the Azure Machine Learning world and worked on some interesting components like Python and R execution inside SQL Server. So we had these unique ways that we could do remote compute execution. Think of the persona of data scientists who like to be in Jupyter notebooks, writing all their machine learning development inside Jupyter notebooks.
But if they had data inside SQL Server, usually they’d take a dump, a CSV dump or something, pull it in, sample the data, and run with it that way. We made ways where people can stay in the IDE of their choice and then send, as remote compute execution, the code that they write in Python or R into SQL Server. So it would process the data there, and SQL Server would return results back to their notebooks and things like that. It was super interesting. Yeah, it was pretty fun. That was near 2017. Then shortly after that, I distinctly remember this experience: I was at the Microsoft Build conference, which was in 2018, and I was at the booth. It was a shared booth for Azure Machine Learning and Azure data kinds of components. And Microsoft first announced the preview for Azure Databricks. If folks don’t know, Microsoft and Databricks have a special relationship, where we would take Databricks’ native product and build it with deeper integrations into the Azure backbone stack. And then we’d actually market and sell this as a Microsoft first-party service; we call it Microsoft Azure Databricks. And of course, behind the scenes, it’s Microsoft and Databricks building this together. I was still on the Azure Machine Learning team, but we announced that preview for Azure Databricks, and the booth was flooded, completely flooded, with people that wanted to talk about it. I’d get really funny questions, like, what is a data brick? Just really funny questions people would come ask at the booth. And I’m like, where’s the team? We had, like, one person that helped bootstrap this partnership and integration, these kinds of things. But then shortly after, we were preparing to have the GA release of the Azure Databricks product and needed to staff up a bigger team. And so I jumped onto that team and was there from when we first launched Azure Databricks as a GA service at Microsoft.
And that was a really fun ride. I was a product manager there, a product lead. And I got to see firsthand the growth of this service from making $0 to becoming the fastest-growing analytics service on Azure. We had some really fun things that we went through with the scale challenges and outreach challenges and all kinds of things that come with building a fast-paced, fast-growing product. So I was on the front lines of this emergence of the lakehouse category, as Databricks calls it, and helped, you know, hundreds of different organizations and enterprises modernize their data stacks, their data architecture, onto the lakehouse architecture. I was deep in that domain. And then I bumped into Vinoth Chandar, who I know has been on your show in the past, and he’s the creator of Apache Hudi. Folks on the show may know that Databricks has this open source project called Delta Lake, a very successful, amazing product. And it’s pretty comparable to what Apache Hudi is as well. So of course I’d heard about Hudi, and I was interested to talk to Vinoth. I learned about what his vision was and what he wanted to do, building a company around Apache Hudi. And I decided to jump ship. And that’s where I’m at here at Onehouse today. So I’m head of product at Onehouse. We just emerged out of stealth about three months ago, and we’re building fast and furious.

Eric Dodds 8:21
Awesome. Okay, Kyle, so many questions, especially around Azure Databricks. But I actually want to go back, because you mentioned something about some of the work you did in the context of Office at Microsoft. And what I think is interesting about that is the time when you were there. You know, Office is super interesting; it’s arguably the most influential software in the entire world. Excel is the most used business application in the world. But correct me if I’m wrong, during the time when you were there, there was this big push to get a lot of this sort of locally run software connected to the larger Microsoft ecosystem online, including the Microsoft accounts. And you mentioned the word telemetry. So I can only imagine that there were some really unique challenges in crossing the chasm of locally run software connecting to these online accounts. What were some of the unique things that you experienced trying to build data engineering products and workflows around that?

Kyle Weller 9:26
Yeah, awesome. That’s a really good question. I’m glad you picked up on that as well. It was an interesting time. Looking back on it, I was new in my career and didn’t have a lot of perspective on history and data and the evolution of these kinds of things. But I was in the thick of it right then. Like you said, the family of Office products before that were all local install; there was basically no telemetry at all. If I remember right, I think there was a pop-up that would say, when it crashed, do you want to send this diagnostic log to Microsoft, or something like that. But there was basically no telemetry on these things. And O365 was that move to subscription-based services. So by default, you have an online subscription, and there are things that authenticate and connect to the internet and things like that. We also developed the web apps at this time, and we developed the mobile apps at that time. This is like 2013, 2014; all these cool products are coming out. And so there, we were designing a new telemetry system from the ground up, everything from even the instrumentation SDK. Of course, Office products, if you look at them, have a common, shared ribbon on top. And so we developed these SDKs that would have a lot of shared instrumentation and telemetry inside there. And then we had to devise all of the engineering platforms for ingesting that data, bringing it into the system, and dealing with a lot of late-arriving data. Like you say, devices aren’t always online: laptops are frequently offline, mobile apps, these things. So there were some interesting data challenges with late-arriving data and Lambda architectures. We were building on Hadoop systems back then; we had an internal tool at Microsoft called Cosmos.
And, in the most Microsoft way, we had this special language that you could write. It’s called SCOPE; you can look these things up online. The SCOPE language is a mix between SQL and C#, so you could embed some C# inside your SQL. And then we developed these pipelines to process the data and expose it back out through APIs to all the internal engineering teams that were working on the products, so that we could measure the health of the products: availability, reliability, these kinds of things. But then also business kinds of metrics. We had clickstream usage analysis and error traces. We built some systems where we struggled to get to near real-time, at least when I was there in 2013. But it was certainly a fun time of change and revolution. I felt like I was also on a team where we didn’t have a lot of captains that were doing this for a second time or third time; it felt like we were all doing this for the first time. And so there were really exciting ways that we had to think out of the box. And, of course, we leaned on some other teams. Like I mentioned, I went over to the Bing team, the Microsoft search engine, after that. So of course I would meet with them frequently and interview them: hey, how did you design these things, these tables with thousands of columns and, you know, exabytes and petabytes of data inside these processing systems for Bing? So we learned a lot of things from that, which we carried over to this new telemetry platform for Office.
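
The late-arriving data problem Kyle describes can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Office telemetry system: the class name `HourlyEventCounts`, the `allowed_lag` threshold, and the sample timestamps are all hypothetical. The idea is to count each event in the hour it happened rather than the hour it was received, and to remember which hours were changed by late data so downstream reports can be refreshed.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


class HourlyEventCounts:
    """Counts events per hour of event time and flags hours restated by late data."""

    def __init__(self, allowed_lag: timedelta = timedelta(hours=1)):
        self.allowed_lag = allowed_lag   # hypothetical lateness threshold
        self.counts = defaultdict(int)   # hour bucket -> number of events
        self.restated = set()            # hour buckets changed by late events

    def ingest(self, event_time: datetime, received_time: datetime) -> None:
        bucket = hour_bucket(event_time)  # bucket by when the event happened
        self.counts[bucket] += 1
        if received_time - event_time > self.allowed_lag:
            # The event arrived late, so this hour's earlier totals are stale.
            self.restated.add(bucket)


agg = HourlyEventCounts()
# (event_time, received_time) pairs; the second event comes from an offline laptop.
events = [
    (datetime(2013, 5, 1, 10, 5), datetime(2013, 5, 1, 10, 6)),
    (datetime(2013, 5, 1, 10, 30), datetime(2013, 5, 1, 13, 0)),
    (datetime(2013, 5, 1, 11, 10), datetime(2013, 5, 1, 11, 11)),
]
for event_time, received_time in events:
    agg.ingest(event_time, received_time)

print(dict(agg.counts))
print(agg.restated)
```

Bucketing by event time keeps the hourly numbers correct once the data shows up; the `restated` set is one simple way to tell a batch layer which hours to recompute.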

Super interesting. Thank you for indulging me. I just love hearing some of the historical, sort of insider stories around some of the things that we’re all familiar with. Okay, let’s fast forward to Azure Databricks. So in a very short time, it went from a product announcement to being the fastest-growing analytics product on Azure. So, yeah, help us understand: why did that happen? It sounds like there were a lot of users who just sort of said, yes, this solves such a painful issue for me. What was that sort of problem set?

That’s awesome. Yeah, I think it breaks down into a few different categories. First off, we should give all the credit and kudos to our Databricks partners who built the service, and they had it already available on AWS. So they were bringing an existing product with proven product-market fit, everything like that. And of course, it was the best Spark experience out on the market, which people enjoyed on AWS already, and we brought it to Azure for the first time. So that’s one factor: it’s a product that had legs, that people kind of knew about, that we came and launched on the Azure platform. And, you know, then we married that up with our Microsoft Azure global sales fleet and field team. We have a foot in the door of every enterprise customer out there, and so by marrying an amazing product with a global sales fleet trained to get that in the door with customers, we had this really great partnership. But then, on your question of why it was so successful, what made it take off: I actually did a conference in Florida, somewhere halfway through this journey, and I asked this question to the audience that was there. I said, why do you like Azure Databricks? Shout it out from the crowd: what do you like? And I think there were three simple answers that came out: it’s fast, it’s easy to use, and it’s secure. And I think that ease of use was really important. Looking at other comparable Microsoft products that we had, we had some other services that did offer Spark, open source Apache Spark, but they were not that easy to use and were a little bit cumbersome. So Databricks brought the easiest Spark experience to the market.
And it was just a pleasure to use the product: the collaborative notebooks and everything else you can do for data science and whatnot. That was a big factor. I think another dimension that made this successful is that, you know, every other product that we offer from Microsoft is in the Azure cloud only, and this is actually a unique product that is available on multiple clouds. Some people would feel that they were future-proofing their data environments and data stacks: hey, if I ever needed to leave Azure and go to AWS, I’m using Databricks here, and I can use Databricks over there. And so they felt comfortable picking a product that was available on multiple clouds. I think that was an advantage that we had as well. Does that answer the question?

Eric Dodds 16:34
Yeah, super helpful. Okay, I have one more, and then I know that Kostas has a bunch of interesting questions, especially around the lakehouse. But tell us about Onehouse. I know, of course, Vinoth, who was such a wonderful guest to have on the show, and Hudi. But tell us about Onehouse. You came out of stealth three months ago, so I feel privileged that we get to talk to you such a short time after coming out of stealth.

Kyle Weller 17:04
Yeah, yeah, this is really exciting. And of course, it’s a topic that’s dear to the heart, as I’m head of product here and in the weeds building this thing actively from the ground up right now. So Onehouse, if you want a one-line answer, I would say we are a prebuilt lakehouse foundation for analytics and AI. And if I break that down a little bit further, it sounds like we’ll get into the topics of what a lakehouse is and why it’s important, why it matters, those kinds of things. But what I observed in my time with Azure Databricks, and otherwise working with a wide diversity of different customers out on the market, is that it’s hard to build a data platform on data lakes. Even with amazing products like Databricks, even with amazing technologies like Delta Lake, Apache Hudi, these different things, it still is time-intensive to build these lakes. I would frequently observe customers take six months or longer, with large engineering teams, to operationalize and truly have a production-grade data platform. I built this with Office and other things, so I have firsthand experience too. And so what we plan to do with Onehouse is offer this fully managed experience to have a prebuilt foundation of your data lake and lakehouse. You mentioned Vinoth; he is the founder and CEO of Onehouse, and he is the original creator of Apache Hudi, which he created in 2016 at Uber. Apache Hudi kind of pioneered this new transactional data lake category that we now call the lakehouse. And now at Onehouse, we’re offering automation on top of the open source components that Apache Hudi has to offer. If you read up on Apache Hudi, or go to the docs and see what those different services are, we have things like ingestion services.
And so with Onehouse, we’ll offer managed ingestion points for your data, where we’ll bring it in and stream it in, in real-time and efficient ways. Then there are a lot of table services. When you think of what a lakehouse is, and I think we’ll dive into that category soon, there are a lot of services you want to operate on this data, like clustering, compaction, indexing, cleaning up historical metadata, these kinds of things, and we’ll automate the use of all these services. Then, at Onehouse, we’re not building a query engine. One of the other goals that we have is to try to decouple data infrastructure from query engines. The worldview that I see out there today is that most people build a query engine, since there are a lot of dollars that you can chase after for ETL and query dollars. But then they build a vertically optimized stack down to the lake or the warehouse, wherever the data resides. They build a vertically optimized stack where, hey, it’s going to be the best for this query engine, let’s crank the wheel and make more revenue through our query engine, etc. So what we want to do is make it so that people can very easily stand up a data lake or lakehouse platform and have interoperability, to be able to use the query engine of their choice. I’ve seen that people do want to have mixed-mode compute and use things like Trino, Presto, Spark, Hive, you name it. And we want to be able to provide that flexibility to future-proof your data as well.
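
To make one of the table services Kyle lists a little more concrete, here is a rough Python sketch of how a compaction job might decide which small files to merge. It is an assumption-laden toy, not Apache Hudi’s or Onehouse’s actual algorithm: the function name `plan_compaction` and the `target_mb` and `small_threshold_mb` parameters are hypothetical, and real systems also look at partitions, record keys, and commit timelines.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Group small files into batches that each merge into roughly one target-sized file.

    Returns a list of groups; each group is a list of file sizes (in MB) to merge.
    Files at or above the small-file threshold are left alone.
    """
    small = sorted(size for size in file_sizes_mb if size < small_threshold_mb)
    groups, current, total = [], [], 0
    for size in small:
        # Close the current group once adding this file would exceed the target.
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    # A trailing group with a single file gains nothing from being rewritten.
    if len(current) > 1:
        groups.append(current)
    return groups


# Two large files stay untouched; the small ones are grouped for merging.
plan = plan_compaction([5, 200, 10, 30, 150, 20, 8], target_mb=64)
print(plan)
```

Fewer, larger files generally mean fewer file listings and opens per query, which is a large part of why a lake needs this kind of background housekeeping.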

Eric Dodds 20:37
Love it. All right, Kostas, I’ve been monopolizing. And I heard a couple of words in there that I know are very near and dear to you, things you work on every day, so please jump in.

Kostas Pardalis 20:48
Yeah, I mean, I’ll start with a bit of a more personal question, though, before we go to the lakehouse story. So Kyle, from working in one of the biggest enterprises in the world, which is Microsoft, you moved to a significantly smaller company. How does it feel?

Kyle Weller 21:10
Oh, wow, it’s a big change. It’s a big change. So yeah, in my career journey, there’s a side note where I was volunteering, helping some people do some startup things, start up their own businesses, with this program called Defy that we can talk about later. I was inspired by the startup ecosystem and culture, and then also by being exposed to Azure Databricks. When we started partnering up with Databricks, they were Series D, with a few hundred employees, and I was there for that whole rapid growth journey. So when I started to look at what next opportunity I wanted in my career, I looked around, of course, first inside Microsoft. I was happy, you know, things were good there. But I felt like I’d be bored with any other choice there after experiencing such a fast pace. And so I started to look more and more at startup scale. I didn’t think I would go this small, honestly. I thought I’d land somewhere in the Series C or D kind of range, somewhere in the middle. But when I met Vinoth, you know, I was deep in this lakehouse domain already, with Delta Lake and everything else, and then I met him, and he’s the creator of Hudi. So the parallels were there. And when we started to talk, lightbulb moments were going off in my head: these are incredible market opportunities that I already know, and I felt like this strategy to build this company was once in a lifetime. I was like, let’s do it. Let’s go build. I haven’t been a part of a startup this small before, so I’m learning as we go. Man, it’s a lot of fun.

Kostas Pardalis 22:44
That’s nice. Nice to hear. And can you share an experience that really surprised you? Something that’s obviously different, but that you also didn’t expect?

Kyle Weller 22:58
Yeah, good question. I think like, most of these, I expected when I got here, right, the complete lack of structure, the complete lack of like, guidance and direction, like it’s all on you, it’s all on your shoulders. I think maybe to answer that question and link these together. One thing that excites me, and energizes me about this experience is like feeling that ownership feeling that accountability and feeling that, like, hey, there’s, there’s no one else here that will get it done. Unless I get it done. If I fail, it’s on me. And so feeling that accountability really amps up the energy and gets me excited to come to work gets me excited to lean in and try to build for the future.

Kostas Pardalis 23:50
Yeah, I mean, to be honest, I admire people who are able to go through such a radical change and still be excited, because it is a radical change. It completely changes the way things operate and what kind of mindset you need to have. So, yes, that’s amazing.

Cool. So, okay, one more question about Microsoft. It’s kind of interesting, the story around Microsoft and data infrastructure in general, because Microsoft has, traditionally, a lot of innovation in this space, right? Even if we concentrate only on MSSQL alone, there’s a lot of innovation in database systems and working with data. But for someone who, let’s say, spends most of their time in the modern data stack, we don’t hear much about Azure, we don’t hear much about the Microsoft cloud. We know that it exists, and it’s probably something really big, but it’s also distant. Why is this happening? Why is there so much distance between, on one side, Snowflake, Google, and AWS, and then Microsoft on the other?

Kyle Weller 25:22
Yeah, I probably don’t have the perfect answer, but I’ll give you my take on it. I think there might be two components. One is that Microsoft is hyper-focused on their environment, their lane: let’s get it done for Azure and Azure customers. And we don’t have many cross-cloud plays. Google does great cross-cloud plays with BigQuery and a bunch of other services this way. AWS, of course, is the market leader and has the most market share in terms of cloud computing and things like this. So I think some of it is because of that stay-in-your-lane marketing that Microsoft focuses on. The other half may be because the modern data stack is also pretty new, and it’s evolving. And, you know, building a startup myself, now I see that the first place I’m going to build is AWS, where the largest market share is. Of course, Azure is still, I would say, my favorite cloud, and I want to take my product to Azure as well; I just need the right customer demand mix to take it there. So maybe that’s also where we see some bias in the modern data stack: hey, it’s starting out, and some of these are new products, some of them are mature products, but you’ll see the new ones will probably gravitate towards where they think they will find the most customers, and then they can expand and grow from there. Because if you look at the modern data stack, most of these companies are outside of the cloud-native vendors, and so they’ll want to build multi-cloud products. It makes sense for anyone outside of a cloud vendor to be available in all clouds, but AWS is an easy choice to start from first.

Kostas Pardalis 27:18
Absolutely. Do you think this also has to do with, let’s say, the markets where each one of the cloud vendors focuses a little bit more and has more success? Because, I don’t know, in my mind at least, Microsoft is always about the enterprise. That is what they know really well how to do, they’re comfortable there, and all these things. While on the complete opposite side of the spectrum, you have Google, which is more for the medium-sized and small-sized customer. And then somewhere in between you have AWS. Do you also think that?

Kyle Weller 27:59
Yeah, I think it does have an influence, because if you look at this from the perspective of owning these services or owning these products, you want to hyper-focus your efforts on where you have the most success, where you can drive the most revenue. And if you have these largest enterprises come to Microsoft for these big contracts, where sometimes people combine, and they want a single vendor for Office, as we were talking about, O365, and Azure, with combined spend and a combined relationship, then, yeah, if I own these products, I would focus on where I know I can turn the crank and drive more revenue and dollars that way. So I think that might be a component, too.

Kostas Pardalis 28:42
Yeah, makes a lot of sense. I think it’s probably also one of the reasons that Databricks together with Microsoft on Azure was so successful. You take a platform that is, let’s say, built for the enterprise, and you have a product that ultimately, let’s say, addresses the needs of the enterprise. You put them together and you have success. Obviously, of course, the go-to-market is going to follow and it’s going to work very well, but I think it’s the perfect context to build an integrated product with the two together. So my assumption would be that that was also an important factor in the success of this collaboration. Okay, cool. So let’s talk about lakehouses now. As you said, it’s a new category. The common understanding, I think, is that the lakehouse is something that emerged from the need to work with data lakes and the issues that we have with building and operating data lakes. Can you tell us a little bit about what these challenges are? Why is the data lake not enough? Why do we need something on top of that, or something completely different?

Kyle Weller 30:07
That’s awesome. Yeah, good question. And I would start by pointing people to some resources, so they know it’s not just my opinion. If you go search for Gartner’s latest data management hype cycle, what you’ll see on there is that data lakes are in the pit of what Gartner calls the trough of disillusionment. And, I think I have the chart handy here, even data engineering, side by side with data lakes, is in the trough of disillusionment. And the pieces that are on the rise are DataOps, lakehouse, metadata management, a bunch of different components. If you look at these trends together, it starts to call out and make obvious what the problems are; if you’re not in the space, not living and breathing it and haven’t felt the experiences already, you can learn from this perspective. I mentioned that data lakes are hard to build, but also, when compared to alternatives, they lack many qualities and features, right? A data lake is just a collection of files out on cloud storage, whether that’s S3, ADLS, or GCS, and these files represent data. And then you have to build metadata systems around how to understand what data is in what files, and track these, and then you have to manage the size of the files and how the files are organized. Not to mention access control around the files. All these different components make lakes painful and hard to use. Whereas if you look at an alternative like a data warehouse, you can pick up a data warehouse off the shelf, purchase it, and use it; it’s ready to go. You put data in, and the schemas are managed, these are tables, you can write your SQL. If you’re using a service like Snowflake or otherwise, there are no knobs, no performance tuning, these kinds of things. And it just works.
The data lake, though, you have to go build. And the lakes also miss something: if you’ve studied data warehouses, the lakes are missing a thing called ACID transactions. You can’t process updates or merges or these kinds of things on lakes, because the file systems are immutable file storage. So you don’t have ACID transactions, and you don’t have managed schemas, metadata, these different components. Now, if you flip the tables and look at the advantages of the lake versus the warehouse, because from what I just described, maybe you’d be inclined to pick a warehouse over a lake, there are a lot of advantages in the other direction. The lake is a lot cheaper; the economics, especially when you start to scale, make it a lot cheaper to use. You also can use a variety of structured and unstructured data, and when it comes to machine learning and data science, a lot of this is unstructured data. The warehouse also kind of locks you in to a single vendor, right? You put your data in that warehouse, and then it’s all run on the compute of that vendor, etc. Whereas on a lake, you are open to play in more of the open source ecosystem and have a variety of tools, so you’re more agile and future-proof in that way. So, backing this up to the lakehouse: there are these open source projects, Apache Hudi, which is of course close to our heart at Onehouse, and Delta Lake, which was close to my heart at Databricks, that take the best capabilities of a warehouse and bring them to the lake. And that’s why we call these lakehouse environments. So does that help answer some of that, like, why the lake?
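
Kyle’s point about immutable files and ACID transactions can be illustrated with a toy Python sketch. It is not how Apache Hudi or Delta Lake is implemented; it only assumes the general idea of copy-on-write files plus an ordered commit log, and the `ToyTable` class and its method names are hypothetical. Files are never modified after they are written. An update rewrites the affected file under a new name, and the change becomes visible only when a new snapshot is appended to the log.

```python
class ToyTable:
    """In-memory stand-in for a table stored as immutable files plus a commit log."""

    def __init__(self):
        self.files = {}     # file name -> tuple of rows; written once, never changed
        self.commits = []   # ordered log; each entry lists the files in that snapshot
        self._next_id = 0

    def _write_file(self, rows):
        name = f"part-{self._next_id:04d}"
        self._next_id += 1
        self.files[name] = tuple(rows)
        return name

    def append(self, rows):
        current = self.commits[-1] if self.commits else []
        # Appending one log entry is the single step that publishes the change.
        self.commits.append(current + [self._write_file(rows)])

    def upsert(self, key, row):
        current = self.commits[-1] if self.commits else []
        snapshot, found = [], False
        for name in current:
            rows = self.files[name]
            if any(r[key] == row[key] for r in rows):
                # Copy-on-write: rewrite only the file holding the matching row.
                rewritten = [row if r[key] == row[key] else r for r in rows]
                snapshot.append(self._write_file(rewritten))
                found = True
            else:
                snapshot.append(name)  # untouched files are reused as-is
        if not found:
            snapshot.append(self._write_file([row]))
        self.commits.append(snapshot)

    def read(self, version=-1):
        if not self.commits:
            return []
        return [r for name in self.commits[version] for r in self.files[name]]


table = ToyTable()
table.append([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
table.append([{"id": 3, "v": "c"}])
table.upsert("id", {"id": 2, "v": "B"})

print(table.read())           # latest snapshot
print(table.read(version=1))  # earlier snapshot, still readable
```

Because old files are kept and readers always go through the log, a reader sees either the snapshot before the update or the one after it, never a half-written state. Keeping old snapshots around is also why services such as cleaning and compaction become necessary.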

Kostas Pardalis 34:04
Yeah, absolutely. Absolutely. It kind of feels a little bit like trying to build a database system starting from the file system and going out, right? I mean, a database is more of a monolith. There are very complex architectures behind it, but everything is hidden behind the configuration and the SQL dialect that we are using. But right now, when it comes to the lake house world, you have to know about query engines, about Parquet files or ORC files, and then table formats on top of that. It sounds like a lot of work, right? And I understand that operating a system like this is going to be pretty hard. And that’s where I’m going to, in a way, ask the question again, but from a different angle: what’s so great and so important about the data lake that pushes people into all this effort? Why don’t people, let’s say, just get Snowflake and call it a day, right? Why is it that we cannot make do with just the data warehouse?

Kyle Weller 35:25
Sure, sure. I think I can answer this from the perspective of what I’ve seen in customer journeys, and then maybe with some specific examples of what I’ve seen in the market today. There’s a common pattern, because a warehouse is easier to start with. If you think of a company’s lifecycle, or their data engineering team’s lifecycle, when you only have a few engineers, or you’re just getting started, or maybe your data’s not huge in size, it’s: hey, I’ll just pick up a warehouse, maybe I pick up a combo, like a Fivetran-Snowflake kind of thing, and I start building on this warehouse. But where I’ve seen countless challenges is once people hit growth phases in their data engineering platforms, or they hire more data scientists, and the data scientists are like, hey, I need to train these models, and now I need you to go instrument these apps, like we were talking about with Office, with more events and machine-generated data and things like that. The size of the data increases to an incredible scale, but so does the complexity of the workloads that you want to run on your warehouse. And this is where I see a lot of tumultuous migrations start to happen, where when people scale in their warehouses, they get this cost fatigue. I’ve been talking to a lot of customers and seeing the amount of dollars they spend just on ingesting data into these warehouses, and then subsequently on ETLs and these other things. It’s huge dollars. And what I think people realize is that on a lake, you are able to break apart the different parts of your workloads. If you try to dissect the workload types on the lake, right, you have query up on the top, you have ETL, you have data management, which might include pre-processing or backfilling or GDPR deletions, and you have performance tuning kinds of things, like cleanup of data.
And you have ingestion, right? When you break apart these different types of workloads, not all of them have the best ROI characteristics on one compute platform. So, as I mentioned with these characteristics: for query, depending on the type and the concurrency and the latency requirements, perhaps warehouses are still the best pick. But for exploratory analytics, you might be better off with something like Trino. For ETL kinds of processes, you’re better off with Spark. And not just better off in terms of being more successful, but in actual, big cost savings that you’re able to drive across your platform. With the warehouse, you’re locked into one compute framework. With the others, you can segment these workloads across the board. And this is why I see people have big problems once they get to those scale curves with their data platforms.

Kostas Pardalis 38:36
Yeah, that makes total sense. That’s really interesting, this whole conversation around the workloads. So, okay, you mentioned something about scale. You used this word a couple of times, especially when you were talking about the moment that customers start realizing that it becomes really difficult to scale with just a warehouse. So is the lake house something that’s only relevant to enterprises or big companies? Who should care about it? Who should have one?

Kyle Weller 39:16
Yeah, I think that is kind of where Onehouse comes into play as well. Because right now, if you look at it, it’s still kind of hard to build a lake house, and you need the right ROI characteristics to reach a stage where you decide, hey, I’m going to pour a couple of engineers on this project, it’s going to take us X amount of time, but the economics work out, so yeah, let’s build a lake house, right? What we’re trying to do with Onehouse is flip that model completely and make it just as easy to build a lake house as a Fivetran-Snowflake combo. Click, click, click the button and you’re in. Your data is moved in, it’s all formatted, it’s synced to the catalogs of your choosing. Now you could just have a data analyst come into the picture and start querying this data. Bring a data scientist to the table, and we let them use the compute context that they like. So let me see if I can regroup back to your core question. I see the actual architectural pattern of the lake house as important and viable, I think, to companies of all sizes: this ability to have a single pane of glass, a single centralized place where you can manage your data, govern access controls around this data, and share this data within your organization. The alternative is that I see people build out silos, like, hey, we’ve got a data warehouse for these types of things, and then we’ve got this kind of database for other things, and it’s all kind of mixed. So, to answer your question, I think the lake house is of value to companies of any size. It’s just a tough sell to try to build one sometimes, if you’re getting started from scratch and you look at what the alternatives are.

Kostas Pardalis 41:15
Yeah, makes sense. Do you see a future where the lake house will be living side by side with the warehouse? How do you see these two coexisting in a company?

Kyle Weller 41:29
Yeah, I think they already do coexist today, and I’ve seen a lot of successful cases of coexistence. I think what might be interesting is to hear this from the angle of the emergence of the lake house, too. When I was working on Azure Databricks, I saw these patterns start to emerge with customers. Before we had the term lake house, but when we had Delta Lake and we had the Databricks service, customers were looking for ways to eliminate warehouses from the picture. They had a mixed mode of lakes and warehouses, where they would use warehouses for BI analytics, and do machine learning, ETLs, and everything else on the lakes. But I saw, time and time again, customers trying to eliminate the warehouse and bring these workloads in, and sometimes they were struggling. When you look at BI workloads and the type of concurrency that comes from users in Tableau or Power BI dashboards, you click one button, and that may end up triggering 20 queries that go and execute against your query engine. And you look at Spark, right, Azure Databricks with Spark: that’s all coming to one single driver node. If you study Apache Spark, like the fair scheduling within Spark, these things are not good for managing concurrency at scale. So I saw a lot of customers actually try to build lake houses before we had this lake house thing. The demand signals were very obvious to us. We knew that customers wanted to do this; they were feeling the cost, and they wanted to approach these scenarios. And so that’s where, with Databricks, we solved these challenges by offering SQL endpoints that can load balance between different Spark clusters, and also with the move to serverless, right? You see Databricks making that move right now as well.
And these types of things came at the same time as when we came out and said, hey, this is officially the lake house, right? Now BI queries can be distributed, be scalable, and actually work on the lake in a way comparable to warehouses. That’s where it was big to double down on that.

Kostas Pardalis 43:45
Cool. So one last question from me, and then I’ll hand the microphone back to Eric. You mentioned Delta Lake. You mentioned Hudi, obviously, and we also know that there’s Iceberg out there. So there are three table formats. There is something common in all three of them, and that’s adding ACID guarantees for transactions, which is, let’s say, the minimum requirement for creating a table format. But what’s different? What does each one bring to the table right now that the others do not? Can you help us understand a little bit better why we have three of them out there?

Kyle Weller 44:34
Yeah, sure. I think they were all born in different ecosystems at different times. Hudi was invented in 2016 and came out of Uber, from Vinoth, the CEO of our company. Delta Lake came to market in, I think, 2019, and Iceberg came out in 2019 as well, the same year as Delta Lake, maybe. Iceberg came out of Netflix. And I think at the get-go, if you study Hudi and Iceberg, they were built to solve slightly different challenges but ended up building really overlapping, general solutions, and the same goes for Delta Lake. Now, if you want to look at ways that they’re different, or differentiators, I can talk from Apache Hudi’s perspective first, because this is something that I grilled Vinoth on as well when I first met him. I was sitting in a really great spot, with Azure Databricks growing so fast, and I met Vinoth, and I kind of grilled him: what are you going to do with this Hudi project? How is it going to make it with this gorilla in the room, Delta Lake? And then I started to actually learn about the technical edges that Hudi had. You can go out and study some of the use cases, or people that are talking publicly about how they’re using Hudi and why they chose Hudi. There’s a common pattern that emerges around CDC kinds of workloads, where you need to ingest CDC data into your data lake. Some of that is because there are two write formats that we have with Apache Hudi: copy-on-write and merge-on-read. Merge-on-read uses Avro-based ways to write the data, which we can then asynchronously compact into columnar formats. We have record-level indexing. With Hudi 0.11, we also just released this multimodal index, which is really exciting. Go read about our latest release, 0.11, and the new ways that we’ve extracted another 10 to 30x gains on query performance.
We’ve even switched to using HFile as the metadata file format, so that we can get 10 to 100x performance gains on how you access the metadata and enable data skipping, these kinds of things. Hudi also takes a pretty different stance when it comes to concurrency control. Both of the other projects, I should say, are working with optimistic concurrency control, which is, you know, hope that things don’t collide, and when they do, retry. Whereas with Hudi, we have OCC and MVCC. You’re able to have multiple writers and also have the table services around your data, like when you have to compact the data, cluster the data, or index the data. We can manage all of these through a timeline server and make sure that there are no collisions at all when managing the data. So I’ve seen that when customers do deep evaluations, deep technical studies and benchmark comparisons, these kinds of things, they usually tend to find that Hudi and Delta Lake come out kind of close when it comes to performance. But then when you look at feature sets, Hudi’s got a really exciting bunch of features, and also the roadmap that’s there. Does that help on the question?
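
As a rough illustration of the copy-on-write versus merge-on-read trade-off described above, here is a small Python sketch. It is not Hudi code: the class names are hypothetical, plain dictionaries stand in for columnar base files, a list stands in for row-based log files, and the whole table is treated as a single file (a real system would rewrite only the affected files). Compaction here is an explicit call, whereas the conversation describes it as running asynchronously.

```python
class CopyOnWriteTable:
    """Toy model: every upsert rewrites the base data, so reads are a plain scan."""

    def __init__(self, rows):
        self.base = dict(rows)      # stands in for columnar base files
        self.rows_rewritten = 0     # rough proxy for write cost

    def upsert(self, changes):
        new_base = dict(self.base)
        new_base.update(changes)
        self.rows_rewritten += len(new_base)  # the whole "file" is rewritten
        self.base = new_base

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    """Toy model: upserts append to a log; reads merge the log over the base."""

    def __init__(self, rows):
        self.base = dict(rows)      # stands in for columnar base files
        self.log = []               # stands in for row-based delta log files
        self.rows_rewritten = 0

    def upsert(self, changes):
        self.log.append(dict(changes))  # cheap append, no base rewrite

    def read(self):
        merged = dict(self.base)
        for delta in self.log:          # apply deltas in commit order
            merged.update(delta)
        return merged

    def compact(self):
        """Fold the log into a new base; reads get cheap again."""
        self.base = self.read()
        self.rows_rewritten += len(self.base)
        self.log = []


if __name__ == "__main__":
    rows = {i: f"v{i}" for i in range(1000)}
    cow, mor = CopyOnWriteTable(rows), MergeOnReadTable(rows)
    for key in (1, 2, 3):               # three small CDC-style updates
        cow.upsert({key: "changed"})
        mor.upsert({key: "changed"})
    print(cow.rows_rewritten, mor.rows_rewritten)
    mor.compact()
    print(mor.rows_rewritten)
```

Under these assumptions, frequent small updates are cheap to write with merge-on-read and the rewrite cost is paid once at compaction, while copy-on-write pays a rewrite on every update but keeps reads simple.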

Kostas Pardalis 48:10
Yeah, absolutely. Absolutely. All right. I mean, I think we need to have at least one more episode, a lot more to be honest. Maybe we should arrange to have both you and Vinoth on at the same time.

Kyle Weller 48:25
That’d be a fun combo.

Kostas Pardalis 48:26
Yeah, absolutely. So we should do that. So I’ll give the microphone back to Eric. Eric, it seems from the conversation that there are times in life where being an optimist is not always good, at least when it comes to guarantees. Yeah.

Eric Dodds 48:48
I love it. Okay, just one last question, because we’re close to the buzzer here, Kyle. You said that the lake house format is really for companies of all sizes, right? Let’s speak to a listener who maybe is living in a warehouse-only world, or maybe they have a whole separate data lake infrastructure and then a whole separate warehouse infrastructure. You talked about those performing different functions. How do you begin to think about a world where you are working with a data lake house? Maybe that’s in your future. What are the different modes of thought, workflows, etc., that you need to be thinking about?

Kyle Weller 49:44
Yep, that’s a good question. Let’s start from the angle of someone that’s got a lake, or they have a mixed mode with lake and warehouse. The lake house fits in really nicely to complement warehouses: you can make it a central place for ETL, BI, machine learning, these kinds of things. Then, when you need true enterprise-scale BI, with lots and lots of internal users and dashboards and a lot of concurrency, you can push aggregated data and more cleaned-up data into data warehouses and use them as more of a serving layer. That’s when you’re starting from that angle. When you’re starting from the angle that you described, hey, you’re a big data warehouse user and you don’t have a lake anywhere in the picture, how or where would a lake house come in? I would say, look at where you’re spending the most money in your warehouse and identify the workload patterns that are costing the most, whether that’s maybe some ETL process, some kind of cleansing, or business aggregates or things like that. Find those, then run a POC on the lake, see how much it costs you, and just see the amount of savings that you’re able to get there. For people who are looking at how they can take everything that they built in the warehouse and come to the lake as well, I think dbt is actually a really interesting choice there, where if your logic is all written inside dbt, all your logic is pretty portable across these systems. So if people are on the cusp of getting into a warehouse, and they’re like, hey, I might be thinking about a lake later, but I still need to build a warehouse, perhaps put a layer on top that makes you more agile and portable for the future. That’s just a suggestion.

Eric Dodds 51:51
Love it. Awesome. All right. Well, Brooks is telling us that we are at time. Kyle, this has been such a wonderful conversation. I learned so much. And it was a real pleasure to have you on the show.

Kyle Weller 52:03
Thanks, Eric. Thanks, Kostas. Appreciate it.

Eric Dodds 52:05
As always, a super fascinating conversation. You know, I think one of my takeaways, Kostas, is that I don’t know if I’ve heard the opinion stated so clearly that the lake house architecture is for companies of all sizes. The underlying context in many of those conversations is enterprise use cases, high-scale use cases. And I mean, cost was certainly a subject that came up on the show. But it was really interesting. That really stuck out to me. And so maybe we are seeing the beginnings of a migration to a new architecture, or at least the very early stages of that, and, I don’t know, maybe Onehouse is the company that’ll make that happen.

Kostas Pardalis 52:58
Yeah, yeah. I mean, I think I’ve said this a few times, on our show or at least in private conversations between the two of us: one way we should be looking at what the future will look like in the data space, ecosystem, sector, whatever you want to call it, is to see what the enterprises are doing. Things start from there, and then they go down. It’s pretty much the opposite of what was happening with SaaS, where you would go and innovate with small and medium-sized businesses and then go upmarket. Here, things are actually happening in the opposite direction. I think one of the reasons that people do not use lake houses as much is exactly that you need to have a lot of expertise and infrastructure, in terms of human resources, to go and do that. You need data engineers and specialized systems engineers who can go and deal with building and maintaining a lake house. And I think that’s exactly where the opportunity lies for companies like Onehouse, or even Iceberg with Tabular, the company behind it, and Delta Lake with Databricks, right? The market can come up with products that will make it much, much easier to build these systems. Because I think that, in the end, what the lake house delivers is a platform that is flexible enough to accommodate, in the most optimal way, all the different workloads that a company might have. And nowadays it’s not just one workload anymore. So I think that’s what is happening there, and I think we will hear more and more about this new category over the next couple of months at least.

Eric Dodds 54:59
I agree. All right. Well, thank you for joining us on The Data Stack Show. Tell a friend about the show if you haven’t already, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
