Episode 127:

The Anatomy of a Data Lakehouse with Alex Merced of Dremio

February 22, 2023

This week on The Data Stack Show, Eric and Kostas chat with Alex Merced, Developer Advocate at Dremio. During the episode, Alex talks about how his experience opening and managing comic book shops has impacted his journey with data. The conversation also covers all things data lakehouses, how Dremio is solving pain points for users, what it means to be a developer advocate, and more.

Notes:

Highlights from this week’s conversation include:

  • Alex’s background in the data space (2:41)
  • Comics and Pop Culture Blending with Finance training (5:20)
  • What is a data lakehouse? (7:36)
  • What is Dremio solving for users? (11:21)
  • Essential components of a data lakehouse (16:35)
  • Difference between on-prem and cloud experiences (33:53)
  • What does it mean to be a developer advocate? (41:31)
  • Final thoughts and takeaways (49:02)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:04
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. This episode is going to be exciting. I'm actually excited to hear the questions you ask, just because you have a lot of experience in the space. We're going to have Alex from Dremio on the show, and actually, we've been working on getting Dremio on the show for a while. They've been around for quite some time, and they do some really interesting things on the data lake, and have recently made a huge push on data lakehouse architecture. So that's really interesting. And then Alex is an interesting guy; he has a lot of education in his background, so of course I'm going to ask about that. But I want to ask what his definition of a lakehouse is. We've had some really good definitions from other people on the show who invented lakehouse technologies. And Dremio is in an interesting place, in that they enable lakehouse functionality on top of other tooling. So that's what I'm going to ask. But yeah, I'm so curious to know what you're going to ask.

Kostas Pardalis 01:27
Yeah, I think having someone like Alex from Dremio is a very good opportunity to go through all the different components and technologies that are needed to build and maintain a lakehouse. Because Dremio is a technology that enables all the different components to work together and gives you, let's say, an experience similar to a warehouse, but on top of a much more open architecture. So that's something I'd definitely like to go through with him: see how all these things work together and how Dremio works with them. And also talk a little bit about the future, and what is missing from the lakehouse to become, let's say, something equal in terms of the experience and the capabilities of a warehouse, right? So yeah, we'll start with that and we'll see. I'm sure more things will come up.

Eric Dodds 02:29
Indeed. Well, let's dig in. Let me do my one job. I did it. Yay. All right, welcome back to The Data Stack Show. Alex, so excited to have you on. We've been trying to make this happen for quite some time, and it's a privilege to have you here.

Alex Merced 02:50
Thank you. It’s a privilege to be on. I mean, I’m very excited to be on the show, very excited to talk about data very, except for just talk and just be part of the data fabric of the show. God, I love it. Love the

Eric Dodds 03:03
energy. Well, let’s start where we always do give us your background, because we have some finance, we have some education. I mean, very interesting. So tell us where you came from.

Alex Merced 03:14
Yeah, you know, life has taken me to a lot of different places and given me a lot of experiences, which I feel has given me a lot of perspective that's made it fun to talk about things. But basically, the story starts back when I was younger. Like many kids, I was really into video games, so I wanted to be a video game developer, so I went to college for computer science, but eventually changed my major. Because of a variety of different events in life, I made a shift into marketing and popular culture, which led me to start a chain of comic book stores. But shortly after that, after I graduated, I went to New York City, where I was actually working in training in finance. So I learned a lot about speaking in public and producing clear communication from that training job, but I also got to experience the finance side, which is a very data-heavy industry, and learned the importance of real-time data when you're talking about stock prices and things like that, and how much it matters to how everything works. That gave me an appreciation for this stuff. But basically, around 2018, 2019, I was ready to move outside of New York City, which meant it was time for a career change. I had always dabbled a lot with code and technology, so it felt like: this is what I do for fun, why not make it what I do all the time? So I made that shift, first into full stack web development, and was completely enthralled. I ended up not just coding but putting out a lot of content around coding; I have thousands of videos on YouTube about coding in pretty much any language you can think of. But eventually I wanted to combine the skills I have from all my walks of life: training, marketing, coding, technology, public speaking. And developer advocacy seemed like the right path. On top of that, I found myself constantly playing with different data technologies in my free time, learning how to go deeper with Mongo, Neo4j, different databases. So I sort of targeted the database space, and I discovered things like the data lakehouse and Dremio. I got the privilege of becoming the first developer advocate at Dremio, and got to combine all my interests into one day-to-day thing. So I just live and breathe this stuff nowadays, because I find it exciting. And that's how I got to where I am.

Eric Dodds 05:29
Ah, very cool. Okay, tons to talk about, both around developer advocacy, because I think Kostas and I are both very curious about that, and Dremio. One thing I'd love to hear about, though, and this may be getting too close to the specific questions around developer advocacy, but studying pop culture, then running a chain of comic book stores, and then going into finance training is such an interesting series of steps. I'd love to know: was there anything from studying pop culture and working in the world of comics that you really carried with you into finance training? Because most people would think of those two worlds as completely separate. And maybe they were, but hearing about your background and the way you like to combine learnings from different spaces, it seems like there may be a connection.

Alex Merced 06:23
Yeah, one of my skills in life has always been to notice patterns, which is a great skill to have when you're writing code. The bottom line is, in doing all these different things, you notice a lot of things are the same; a lot of the things you need to do to be successful are the same, and you start picking up on those patterns. So the patterns I learned when studying things like cultural studies, where you learn what different cultural works mean to people, the meanings they can take on, and how you can use that to structure communication, carried into me starting my comic book store, where I also learned a bunch of entrepreneurial skills and a lot of marketing techniques. That was really early days for online marketing, so it wasn't what it is today; it was starting a message board and trying to build a community old school. Then, taking that, when I got into finance, I ended up learning about all these financial things and about that industry, but at the same time I brought the ability to communicate and organize myself that I picked up from those other things, and was able to take what's typically a really complicated thing to teach in finance and teach it in a more entertaining, palatable way. That's something I now repeat in technology: I don't necessarily always deliver my explanations in the most technical, highbrow way. I try to speak in a way that's accessible, so anyone could talk to me for five minutes and be like, yeah, I kind of get what a lakehouse is, that's pretty cool. And I think that's what this wide journey and the experiences I've had have really helped me bring to the table.

Eric Dodds 07:56
Yeah. Why don't we put that to the test? Can you explain the data lakehouse? We've had some good explanations, you know, from Vinoth, who created Hudi, and several other people in the data lakehouse space, but this is a huge area of importance for Dremio. So can you level set us: how do you view the data lakehouse, and what's your working definition of it?

Alex Merced 08:23
Got it. Okay. I would say, at the core, the whole idea of a data lakehouse is just saying: hey, I have data warehouses, which have certain pros and cons, and I have a data lake, which has certain pros and cons. Can we create something in the middle that has the pros of both? When I think of a data warehouse: hey, I've got this nice enclosed place for my data, it's going to give me really nice performance, a really nice user experience, and make it really easy to work with my data. But with my data lake, I have a place where I can store a bunch of my data at a much lower cost point, and it's much more open to use with different tools. Well, I'd like all those things in one place. So how can we make it so I can have all my data in the data lake, but still get the performance and ease of use of the data warehouse? That's essentially the premise. Now, how you architect that, how you make that happen, everyone's got their own story to tell, and at Dremio we definitely have ours. But the key component, speaking of which, since you mentioned Hudi, is going to be the table format. Because whatever tool you're using, it's those table formats, Apache Iceberg, Apache Hudi, and Delta Lake, that really enable those tools to give you the performance and the access. And then each tool can provide you the ease of use. That's where Dremio really specializes: anything that makes it difficult to use a data lake as the center of your data world, Dremio tries to address. So you can think of Dremio as a UI that makes it really easy for anyone to go query the data. But beyond that, when it comes to governance and controls, Dremio has a nice semantic layer that makes it really easy to organize your data across a bunch of different sources and control the actual access to them, so you can meet your regulatory needs and whatnot. And when it comes to things like migration, especially now that we have the cloud product, but even with the software product: if you had an on-prem data stack and you want to start moving toward the cloud, Dremio software works with your cloud and it works with your on-prem data. In that case, you create one unified layer, so people who are working with the data don't have to notice the migration. They're just accessing the data from Dremio, and they don't even realize the data is being moved from on-prem to cloud, which makes migration to the cloud much easier for companies. There are also other benefits Dremio provides, again trying to make the data lake easier and also more performant. For one, Dremio really leverages things like Apache Arrow and Apache Iceberg from top to bottom, but it also has features like the columnar cloud cache, which makes using cloud storage faster, and data reflections, which is the real secret sauce of Dremio. The way I like to think about it, and it's a little more complicated than this, is automated materialization. Normally, in a database, you could create a materialized view, a sort of mini copy of your table, to make certain things faster.
The problem is, if I'm querying and I want to take advantage of that materialized view, I actually have to know it exists and say: okay, query that, not this. With Dremio, you have reflections, and if you turn on reflections, and a reflection can speed up a particular query on many different datasets, you won't have to think about it; the engine is aware that it exists. That, one, makes it easier to make things faster, but it also makes it easier for people to take advantage of them.
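To make the reflection idea concrete, here is a minimal, self-contained sketch of the general pattern Alex describes: a planner that silently substitutes a covering materialization for the base table. This is illustrative only, not Dremio's actual implementation, and all names in it are hypothetical.

```python
# Toy model of "automated materialization": the planner checks whether a
# registered materialization covers a query and rewrites the scan to use it,
# so the user never has to know the materialization exists.

class Reflection:
    def __init__(self, name, source_table, columns):
        self.name = name                  # name of the materialized copy
        self.source_table = source_table  # table it was built from
        self.columns = set(columns)       # columns it materializes

class Planner:
    def __init__(self):
        self.reflections = []

    def register(self, reflection):
        self.reflections.append(reflection)

    def plan_scan(self, table, wanted_columns):
        """Return the dataset to scan: a covering reflection if one exists,
        otherwise the base table itself."""
        for r in self.reflections:
            if r.source_table == table and set(wanted_columns) <= r.columns:
                return r.name             # query is transparently accelerated
        return table                      # fall back to the raw table

planner = Planner()
planner.register(Reflection("orders_by_day", "orders", ["order_date", "total"]))

print(planner.plan_scan("orders", ["order_date", "total"]))  # orders_by_day
print(planner.plan_scan("orders", ["customer_id"]))          # orders
```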

Eric Dodds 11:42
Yep, super interesting. Can you describe for us the state of a company before they adopt Dremio? What does their architecture look like, and how are they trying to solve some of the problems that Dremio solves? I think that would help me and our listeners understand: okay, what state are companies in before they adopt Dremio?

Alex Merced 12:08
Got it, okay. There's a variety of different possibilities, so it's hard to say "this is the one way," because there are so many. But I think one of the most compelling stories is definitely the migration story. You're a company that wants to use the cloud more; you want to move your data to S3 or Azure, you want to get onto the cloud. The problem is, you have tools that work with your on-prem data, and then there are tools you want to use with your cloud data, and now you have all your consumers having to learn different sets of tools. There's all this migration friction. Well, Dremio creates that unified interface, so it makes it easy to first set up Dremio with your on-prem data, get everyone used to using it, and then start migrating the data over to something like S3 or Azure, and they don't even notice it. That makes that kind of migration easier. But you also see use cases where people just had a really big data warehouse bill that wasn't working for them, and by using Dremio, they're able to access that data in their data lake. With all those performance features they get the performance, and with that UI they get the ease of use, which makes it easier to move more and more of that work off the data warehouse and cut their costs by a significant portion. So, bottom line: if you have a big data warehouse footprint you'd like to make smaller, Dremio is worth looking into. If you have an on-prem data lake that you'd like to move to the cloud, Dremio is worth looking into. Or if you just have an on-prem data lake that you like but you want to get more juice out of it, Dremio is going to provide that to you, because it gives you better performance on the data lake; it's probably one of the best on-prem tools there is right now. Generally, if you're using a data lake and you want to use that data lake more, Dremio is going to have some sort of solution.

Eric Dodds 13:47
Yeah. And we’re just so curious to know,

13:52
you know, we were chatting before cloud was fairly recent, you know, in the history of the company it is fairly recent, for dremio, you know, in the last year or so. And so, having a company that, you know, is largely built on and has been extremely successful with on prem. Can you just describe being inside of a dremio? Like, what has the mindset shift been? And what’s that been, like, you know, sort of focusing on Cloud, having spent so much time and effort on prem. And I know that migration story is a big part of that, but just interested to know, that’s probably something that, you know, some of our listeners may know, migrating from on prem to cloud, you know, from a basic infrastructure standpoint, but you doing that as a product is really interesting.

Alex Merced 14:38
Yeah, so essentially you have two overarching products at Dremio. You have Dremio software, where you create your own cluster that runs Dremio software, and that can access data on the cloud and data on-prem. So that was already being used for those kinds of migrations. But over the last year, what we released is Dremio Cloud. Instead of having to set up your own Dremio cluster and all of that, you can, in a few minutes, just sign up for Dremio Cloud and have a free account. It's free of licensing costs; basically, the only cost would be the cost of any instances you spin up to run a query. Outside of that, the account is free, and if you want to use our catalog, Dremio Arctic, that's free. Sometimes I'll just open it up and run some queries with Spark running in a Docker container on my computer against my Arctic catalog, and again, that's a zero-cost operation. So basically, it makes it easier to get that Dremio experience. Dremio made it easier to use the data lakehouse, and Dremio Cloud made it easier to use Dremio. It's always about that journey of trying to make things easy, and those are the two key things. But either way, whether you're using software or cloud, it's open: you can connect all your data sources, work with your data, and also just work the way you've been working. You're not locked into doing anything the Dremio way; you have ways to take that data and use it elsewhere. And that's another thing a lot of people really like about Dremio: they don't have to learn a new way of doing things, they can generally make their existing workflow work.
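As a rough sketch of the workflow Alex mentions, running Spark against a Nessie-backed (Arctic-style) Iceberg catalog, a PySpark session is typically configured along these lines. The configuration keys follow the Iceberg and Nessie documentation, but the endpoint, catalog name, table names, and warehouse path are assumptions to adapt to your own setup.

```python
# Hedged sketch: pointing a PySpark session at a Nessie-backed Iceberg
# catalog. The URI, catalog name, and warehouse location are placeholders;
# check the Nessie/Iceberg docs for the exact jars and properties you need.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nessie-demo")
    # Register an Iceberg catalog named "arctic" backed by Nessie
    .config("spark.sql.catalog.arctic", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.arctic.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.arctic.uri",
            "https://nessie.example.com/api/v1")     # hypothetical endpoint
    .config("spark.sql.catalog.arctic.ref", "main")  # branch to read/write
    .config("spark.sql.catalog.arctic.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Query an Iceberg table through the catalog; names are illustrative.
spark.sql("SELECT * FROM arctic.sales.orders LIMIT 10").show()
```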

Eric Dodds 16:14
Yep, super interesting. I mean, I know that building for on-prem versus a pure cloud SaaS product is very different, but thinking about it through the lens of making things easier, taking patterns that existed, making those easier, and delivering them as SaaS without the infrastructure burden, makes a ton of sense. Well, I have a ton more questions. Kostas, please jump in, because I won't finish this one on time with Brooks out, I'm sure.

Kostas Pardalis 16:47
And feel free to interrupt me if you have any questions you want to ask. So Alex, let's start with the basics, and let's talk about the data lake. In your opinion, what does it take to build a data lake, and where does Dremio fit in this

Alex Merced 17:13
architecture? Got it. Bottom line, to build a data lake, it's just a matter of having some way to store your data, whether that's an on-prem Hadoop cluster or object storage like S3, Azure, or Google Cloud: somewhere to store the data, and a way to get it there, meaning the ETL pipelines that take your data from your OLTP sources, or whatever other sources you may have, and move it to that storage area. Then the next step is: what do you do with it once it's there? That's where things start to get more interesting, because before, you could read the data; you had tools that allowed you to do things like ad hoc queries, and that was all fine and good. The problem is when you want to do big updates, deletes, things like that. That's where we start crossing the line from data lake to data lakehouse, with things like Apache Iceberg, Hudi, and Delta Lake. Where Dremio comes in is that there are all these pieces you need to put together. You might want Apache Iceberg as your table format, so you can treat all your data like a table and do deletes and updates. You may want to leverage things like Apache Arrow, so you can communicate with your data faster, because there's less serialization between sources. And there are things like Project Nessie, which lets you take those Apache Iceberg tables and use Git-like semantics: branch the table, then merge changes into the table, so you can isolate changes the same way we would with code. All these things are really nice, but by themselves they can be a lot of work to set up and put together. With Dremio Cloud, you have two tools. You have Dremio Sonar, the query engine, which makes it easy to connect my different sources, whether it's cloud storage or databases like Postgres or MySQL, join the data together, accelerate that data using reflections, and have governance and permissions, all in a very easy-to-use way on my data lake. Then you have Dremio Arctic, a newer product that's still in preview, which gives you that Project Nessie catalog as a service. That allows you to have one catalog that you can connect to with any tool: you can connect to Project Nessie with Presto, with Trino (I think there's a pull request for Project Nessie support in Trino), with Flink, with Spark. That catalog lets you connect whatever tool you want and work with your data. And Dremio provides a UI that lets you manage it: observe who's making what changes to your data, when they made them, and what branch they made them to. Again, all the benefits of that isolation, from a nice place with an easy setup, because literally setting up the Dremio Arctic catalog is: you sign up, you say "make one," it exists, and you just connect. So its role really is to make the patterns that make up a data lakehouse more practical and easier. Bottom line, it becomes a gateway to saying: I don't need the data warehouse, I'll just do it here.
But I can still bring in all those other tools, because Dremio always tries to adopt as many formats as possible, as many sources as possible, and to be open to connecting to as many tools as possible, so that you're not locked into anything.

Kostas Pardalis 20:22
you, that’s like a couple of different things. That’s, of course, like for people who are working with data, like they’re probably like, okay, known terminology. But that’s not necessarily true, like for who listens to the podcast. So let’s dive in a little bit like, more either, like some of, in my opinion, like fundamental pieces that you mentioned. And if I forget anything, please feel free to add these. So let’s start. First of all, you talked about it yet, right? Like do you have, let’s say, we have like a mechanism that is going to our transactional database, the ones that we use for our products, pulls all the data out and goes to a file system. It doesn’t matter if it’s like S3, or your local laptop, whatever you gave, still, like a file system. And you stored the data there, again, now, from that, to being able to query the data, and query the data at scale. And when I say at scale, I don’t mean like at scale in terms of petabytes, but on the scale of the organization to make it available to everyone, there is a lot of work that needs to happen. It’s and let’s start like with the first, which is how this data is stored on the file system. It’s not like, you just throw up, grab some stuff out there, and the query engine will figure it out and make it available. So there are like four months out, right. And before we even go to the table format, we have the file formats, we have our C, we have Parquet. So what are let’s say other like some specific requirements, that driving your cars in terms of like how files work, how the data has to be, like stored on the file system, or it can be anything,

Alex Merced 22:15
Dremio is going to have the best experience when you're using Parquet files. I think there's support for ORC; I'm not sure about Avro. But the bottom line is, when you're using Dremio, the reflections feature, for example, is going to materialize your data into Parquet. So let's say I have a Postgres database, and I'm joining it with some other table I have somewhere else. If I just join them, and this is always the issue with plain data virtualization, every time I want to look at that join, it's running one query in Postgres and another query for the other table, and they may have differing performance. But with reflections, I can turn reflections on, run that query, take the result, and materialize it in Parquet. The next time I look at that join, it's performant. So Parquet really is at the bottom layer. Going back to that foundation level of building the data lakehouse: the first step is to land your data in a format like Parquet that's built for analytics, because Parquet offers you lots of benefits. One is that, instead of having all the data laid out flat, it organizes it into row groups, and the row groups have metadata. A query engine like Dremio can scan that file and say: okay, do I need to scan this row group? If not, let me skip to the next one, and really have those more efficient query patterns. But once you have all the files, well, my table might be 100 Parquet files or 1,000 Parquet files. How does an engine know that these 1,000 Parquet files are a table? That's where the table format comes in. You store the data in Parquet, so you get those nice, easy-to-scan files, and you have the table format so the engine can recognize those files as a table. And above that, you need engines that can read the metadata from Iceberg and also know how to read Parquet files, drilling into those two layers to get the best performance possible.
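Alex's point about row groups is easy to see for yourself with PyArrow, which can read a Parquet file's footer metadata without touching the data pages. This is a small runnable sketch; the file name and column are arbitrary examples.

```python
# Write a small Parquet file with several row groups, then read back only the
# footer metadata and inspect per-row-group min/max statistics: the same
# stats an engine uses to decide whether a row group can be skipped.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"order_id": list(range(1, 3001))})
pq.write_table(table, "orders.parquet", row_group_size=1000)

meta = pq.ParquetFile("orders.parquet").metadata
print(f"{meta.num_rows} rows in {meta.num_row_groups} row groups")

for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(0).statistics
    print(f"row group {rg}: min={stats.min} max={stats.max}")
    # For `WHERE order_id > 2500`, an engine can skip any row group whose
    # max is <= 2500 without reading a single data page.
```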

Kostas Pardalis 24:08
Cool. So you did something great here: you moved on to the next fundamental piece of a data lake or lakehouse, which is the table format. So we have Parquet, which is the serialization format, how we write the data and store it on disk, and then we have the table format, which organizes the files into tables. What do these table formats bring to the user, beyond going out there and saying "all right, this table consists of 1,000 files and you can find them over there"? There are also other things these formats provide, right? Like transactions. What else do Iceberg, Delta, and Hudi bring to the end user?

Alex Merced 24:56
So, for a table format, the main goal is recognizing tables, and more. Because before, you could recognize what a table was with Hive, but Hive did it based on a directory: you said, hey, this folder is a table, and whatever files were in that folder were the table. That was great at the time, but it also had a lot of things it couldn't do, particularly when it comes to safe updates, deletes, things like that. So the modern table formats, Iceberg, Hudi, Delta, their goal is to solve that and say: hey, we need some other way to declare that these files make up table X, and then also provide supplemental information so engines can query that table efficiently. If you look at Apache Iceberg, it does this through a metadata tree, and by going through that tree, the engine can whittle things away and say: okay, there are 1,000 files here, but once it works its way through the metadata, there are really only 30 files to scan. And all that whittling is done via the metadata. If you look at Delta Lake, it works through a series of log files. You have log file zero, which is the initial state of the table, and then, kind of like Git diffs, each log file says: okay, here are the changes to which files make up the table since the last log. So when you scan the table, there's metadata in there, and some indexes it can use to do what's called data skipping. All three of them are trying to skip data you don't want to scan, because if you scan less data, you speed up the query without having to spend more on compute. That's always the name of the game: go faster without spending more. Then you have Hudi. Hudi works more with a timeline system, where every change is done on a timeline; it was initially built to facilitate streaming. In more recent versions, they've made the metadata table the default: the engine reads the stats stored in a metadata table kept alongside your table, and then plans the query around that. With Iceberg, I think those stats are more built into the intrinsics of how it works, because the pattern literally is: if I'm a query engine, what happens in Apache Iceberg is you have something called a catalog, which could be that Dremio Arctic catalog I talked about earlier, or something else, and it says: hey, here's the table you asked for, here's where you can find its metadata. The engine goes through each layer. The first layer says: okay, this is what the table looks like. The second layer says: this is what the snapshot you're trying to query looks like. The third layer says: okay, these are the groups of files you may need to scan, with some additional metadata on those individual files. Then the query engine can say: that file I don't need, this file I do, that one I don't, and at the end it only has to scan the files it absolutely needs.
That's really what the table format is doing: saying, hey, not only are these 1,000 files the table, but I'm going to give you the information so that, even though there are 1,000 files in the table, here are the ones you actually need to scan. That's how you get that performance.

Kostas Pardalis 28:00
Awesome. And then you mentioned something else, which is the catalog. That's also quite important. So what does a catalog do?

Alex Merced 28:07
Got it. A catalog, particularly when it comes to Apache Iceberg, is essentially, well, remember the old JCPenney catalog? I could see what the inventory of the store was and say: this is what I want to order for Christmas. Same thing when it comes to a table format: the catalog tells me what tables are available and gives me the information I need to access those tables. It's basically the layer between the engine and the table format that lets the engine do a few things. First, it needs to know where the table exists; that's what the catalog does. Then it needs to know which files are part of the table; that's what the table format provides. And then it needs some metadata on the data in each individual file to fine-tune its scan; that's what the Parquet file format does. Each layer gives the engine a little more information to get to that eventual scan without having to scan every row in every file every time. With Iceberg, you have to have a catalog; it's built into how it works, and that's why it's able to decouple from the directory approach. Hive had that directory approach, and with Delta Lake and Hudi you still very much have it, where this particular folder is the table; it just has some additional metadata to help wade through that. But with Apache Iceberg, your files can be all over the place and still be part of your Apache Iceberg table, as long as the metadata lists them. That creates some really interesting possibilities, particularly with migration. If I want to migrate my Parquet files from, say, a Delta Lake table to an Apache Iceberg table, I don't necessarily have to rewrite data files into a particular folder. I can just run an operation that says: okay, these are the Parquet files that make up the current state of my table, write some Apache Iceberg metadata. You've rewritten nothing; all you did was write some new metadata, and your table has migrated. That's, to me, one of the really cool differences between Apache Iceberg and some of the other formats. But the catalog is the facilitator; that's why you need the catalog. Because otherwise, how is the engine going to know where all these files are if it can't figure out where the initial metadata file is? That necessity for a catalog is what allows that decoupling to really be a thing.
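A toy sketch of the catalog's job as described here: map a table name to the location of its current metadata file, and commit changes by atomically swapping that pointer, which is also why the migration Alex describes can be pure metadata. The sketch is illustrative, not any real catalog's API; all names and paths are hypothetical.

```python
# Toy catalog: table name -> pointer to the current metadata file. An engine
# asks the catalog where the table's metadata lives, then walks the metadata
# to find data files. Committing (or migrating) a table is just swapping the
# pointer to a new metadata file; no data files are rewritten.

class Catalog:
    def __init__(self):
        self._tables = {}   # "db.table" -> metadata file location

    def register(self, name, metadata_path):
        self._tables[name] = metadata_path

    def current_metadata(self, name):
        return self._tables[name]

    def commit(self, name, expected, new_metadata_path):
        # Compare-and-swap: only commit if no one else committed first.
        if self._tables.get(name) != expected:
            raise RuntimeError("concurrent commit detected; retry")
        self._tables[name] = new_metadata_path

catalog = Catalog()
catalog.register("sales.orders", "s3://lake/orders/metadata/v1.json")

old = catalog.current_metadata("sales.orders")
# "Migrate" or update: write v2 metadata listing the same Parquet files,
# then atomically repoint the catalog at it.
catalog.commit("sales.orders", old, "s3://lake/orders/metadata/v2.json")
print(catalog.current_metadata("sales.orders"))
```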

Kostas Pardalis 30:19
And, okay, let’s talk a little bit more about dremio. Now, let’s say we want to build a data lake or lake house. And we need all these components, that’s you mentioned, right? Do I like to breed my own terms here? Or is it something that like, I just signed up on the cloud version of dremio. And like Dromio, can take care of like, all the different components that I need to build my

Alex Merced 30:45
lakehouse? It can go both ways. If you don't already have a data lakehouse, you can just open up a new account and connect wherever your data currently is. If you have a Postgres or MySQL database on your transactional side that has all your data, and you want to start moving it over to a data lakehouse, you can connect it and start moving the data incrementally. You won't even have to think about it: the data is already being stored in Iceberg tables, stored as Parquet files, and if you're using the Dremio Arctic catalog, that functionality is built in. All those pieces are going to be there without you thinking about the configuration or deployment of any of it. But if you're already set up the way you like: if you have Parquet tables that are not Iceberg tables and you want to use them, you can use them; if you have a Delta Lake you want to scan, you can do that. Dremio allows you to keep the choices you've already made, but it makes very sensible, easy-to-use choices for you if you're building with Dremio from the get-go. So it depends on where you're coming from, but it always tries to meet you where you are.

Kostas Pardalis 32:07
I don’t stop that much about companies that, you know, they might have started a year ago, like a data lake initiative. Because, again, people need to understand that we might invent new words for things, but like things exist for quite some time, or like pretty much since the hadoo. Seemed like Hadoop came out. They’re like Hadoop, like Daedalic at the end, like it is like a file system or you go and like, you can store all your files there, then you can use MapReduce to give it a query if you want. Yeah, it’s super primitive. It’s not, doesn’t have like the stuff that we have today. But there are companies who started from back then, and they are still evolving their infrastructure. Right. So what are like the, let’s say, the broad themes that you have seen out there that like they are like common.

Alex Merced 32:58
That's the hardest thing, because the problem is that, up until recently, there wasn't much of a standard. Over the last several years you've seen some standards rise up, like a lot of the stuff we just talked about, Parquet and whatnot. But before then, there really wasn't a standard way. Maybe Hive was pretty ubiquitous, so that's probably one of the few things I do see consistently. When I take a look at many different customer stories, or potential customer stories, they vary quite widely. And I think that's why this space is so interesting right now, because you're starting to see a movement toward more standardization and toward what those patterns are going to be. But you see everything: people literally treating a database as a data lake, or a Hadoop cluster; people moving all their data into a data warehouse; people doing some weird hybrid between Hadoop and cloud for file storage, for different use cases or different departments. Almost every customer story I've heard up to this point has been different from the last. So it's hard to say; I can't think of one particular thing you see over and over again consistently.

Kostas Pardalis 34:13
Do you think there's something that's very different if you compare, let's say, on-prem setups with cloud setups?

Alex Merced 34:22
I would say the big difference nowadays is that if you're on the cloud, you're going to have a lot more of the newer tools available to you, since everyone is gearing toward cloud. That's one of the nice things about Dremio: it's a newer, more modern tool that still very much makes sure it can cater to and take care of people who are on-prem. So you have that benefit, and it's consistent as far as the experience goes; at least from the end consumer's perspective, you're going to have the same experience whether you're on cloud or on-prem. That's what it brings to the table. But I guess the big consideration just depends on what tools are going to be available to you, and that gap will keep changing: the on-prem side will continue to shrink, and the cloud side will continue to grow.

Kostas Pardalis 35:03
Yeah, that makes sense. And you mentioned at the beginning, when you were chatting with Eric, how the lakehouse became a thing: you take the data lake, you have the data warehouse, and you try to create a hybrid there, right? So what do you think is currently missing from the lakehouse, let's say, to realize the dream behind this hybrid?

Alex Merced 35:30
I think the standardization of the catalog. A few years ago, the question was what the industry would standardize on for its table format, but over the last year you've seen a lot of coalescing around certain formats. The next thing will be the catalog. Because the way every tool generally interacts with your data, regardless of table format, regardless of file format, is through the catalog. So if you need different catalogs for every tool, you're still running into interop issues. And this is where Project Nessie, I think, is going to be really important, because it offers a catalog that's built to be a catalog in the modern era; that's its purpose, versus a lot of the things we use as catalogs nowadays. With Iceberg, you have a choice: you can use a database as your catalog, you can use Glue as your catalog, but none of those tools were really built to be that kind of catalog in the way Project Nessie was. Nessie gives you these extra features, the branching, to do a lot of new operations, and also to control governance across tools. That'll be part of it too: being able to set rules on different branches and whatnot, so that if I connect to the same table from Dremio and from Trino, I get the same access rules. That's going to be really key, because it gives you one place where people can control access to their data across all their tools. That's what's nice about the Dremio Arctic service: it makes it easier to adopt Project Nessie, and most tools can already connect to it, with more on the way. Once you start seeing people standardize on a catalog, it becomes easier for tools to just focus on supporting the table format and the file format, because they're not supporting 50 different catalogs. Again, the more variety, the harder it is to give full support to anything. So as we standardize at each of those levels, that's when you're going to see the data lakehouse reach its next level, and it's already at a pretty insane level now. When you think about where we were just a few years ago and what you can do now with this technology, it's amazing. But think a few years from now, when more people are using the same catalog, the same table format, the same file format: the level of support that tools will provide is going to be amazing. And again, you'll have that promise of openness, where I can switch between tools and there's no vendor lock-in. To me, that's the next step, and Dremio Arctic is going to help provide that step: an open catalog that lets you use whatever tool you want, access the data you want, and control how your data is accessed from one place. Okay, and this is the Nessie project you mentioned, right? Yeah, Project Nessie is an open source project, and Arctic is sort of the Nessie-as-a-service product.
But beyond that, it's not just a service that provides you the catalog; it also gives you a nice UI and automated optimization features, so you can have your tables optimized as you'd like. There are other features coming down the road, but at the core you're getting this catalog, and you can connect to it using things like Presto, Flink, Spark, Dremio Sonar, and, as I said, there's a pull request on Trino to have that as well. So you'll be able to use all your major data lakehouse tools with it, and that'll continue to grow from there. The benefit, again, is the Git-like semantics, and the real use cases there are threefold. Isolation: if I'm doing ETL work, I might want to do some auditing first, so I can ETL that data into a branch and not merge it until I've done my verification and validation. Multi-table transactions: say I want to update three tables that get joined regularly; instead of updating them one at a time and running the risk of broken joins, I can update them on a branch and merge when I'm done. And reproducibility: if I want to create a branch that isolates data at a point in time for, say, an ML model, so I can keep testing against consistent data. All of these become much more possible, much easier.
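A self-contained sketch of the branch-then-merge pattern described above, covering the isolation and multi-table-transaction use cases. It models catalog-level branching in the spirit of Project Nessie, but it is not Nessie's actual API, and all names are illustrative.

```python
# Toy model of catalog-level branching: a branch is a named snapshot of *all*
# table pointers, so several tables can be updated in isolation and exposed
# to readers in one atomic merge.
import copy

class BranchingCatalog:
    def __init__(self):
        self.branches = {"main": {}}    # branch -> {table: metadata pointer}

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = copy.deepcopy(self.branches[from_branch])

    def commit(self, branch, table, metadata):
        self.branches[branch][table] = metadata

    def merge(self, source, target="main"):
        # Readers on `target` see all of the source branch's changes at once.
        self.branches[target].update(self.branches[source])

cat = BranchingCatalog()
cat.commit("main", "orders", "orders-v1")
cat.commit("main", "customers", "customers-v1")

# ETL lands on a branch; readers of main are unaffected until the merge.
cat.create_branch("etl")
cat.commit("etl", "orders", "orders-v2")
cat.commit("etl", "customers", "customers-v2")
print(cat.branches["main"]["orders"])    # orders-v1 (readers see old state)

cat.merge("etl")                          # multi-table change lands atomically
print(cat.branches["main"]["orders"])    # orders-v2
```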

Kostas Pardalis 39:27
That’s super cool. All right. And that was like you mentioned, the catalog is like what is missing right now. And project nurses trying to fill this gap but like how far away we are from silly things getting right and is it like a technological issue that makes it like, let’s say, slower as a brokerage? Or is it also a matter of like the current state of the industry and having like all these, like different stakeholders where each one is building their own catalog? And of course, they won’t promote their own catalog. Like, I think Databricks it’s pretty recent that they introduced their own, which is closed source also, like, it’s not even, like possible like to consume it outside like that. Right. So what’s your take on that?

Alex Merced 40:18
Yeah, I mean, that’s inevitable. I mean, like, that’s one nice thing about like, again, Apache Iceberg does support multiple catalogs. So I mean, like, Snowflake had just recently had Iceberg support, and they created their own catalog, and now they have a pull request that kind of adds support to Iceberg out of the box for that. And that’s just gonna, that’s going to happen, you have people who have tried to continue grading, and that’s one of the nice things about like an Apache Iceberg, they do have this new thing called the Apache rest catalog, or the rest, which basically, it’s like a standard API. So basically, if anyone wants to build a catalog, you can just follow this REST API, open spec. And then basically, you Iceberg would automatically work with that catalog. Theoretically, everyone follows that spec, then it doesn’t matter, even though you wouldn’t even have to standardize on the catalog, and you’d still be able to use it everywhere. So you have technologies like that. So I do think, right now, again, what you’re starting to see first is gonna be the standardization of the table format, because that’s very common, which catalogs people will choose from. And then once you start seeing much more standardization on the table format, you’ll see much then you’ll see like that battle, which catalogs up that table, I do think this year is gonna be an interesting year, mainly just because there’s a lot of interesting things that will be coming down the pipeline this year regarding catalogs, on different levels. So as much as I can say, but the bottom line is like, I do think that the catalog conversation will be a big conversation this year. All right. It’s super interesting.

Kostas Pardalis 41:43
All right, it's super interesting. Okay, one last question from me, because I want to leave some room for Eric to follow up with any additional questions he has. The last question is about developer advocacy, right? I'd love to hear from you what it means to be a developer advocate for something that is technical but also has many moving parts. When we're talking about data, we just spent all this time talking about table formats, file formats, catalogs, query engines, materialization. It's so many different things, and you have so many different technologies that you need to orchestrate all together, right? That's very different from advocating for, say, a JavaScript library for the front end, and I'm not saying that's not complicated, but the scope of the technology itself is much more narrow compared to something like the lakehouse. So what does it mean? What's unique about what you're doing, and the value that advocacy brings

Alex Merced 42:59
to the industry? Got it. Okay, so first, let's just start with developer advocacy as a thing. It's been really interesting: when I first discovered that this role existed, I realized it was tailor-made for me, because there are certain skill sets you need. At the end of the day, the hope with a developer advocate is that you're sort of the cross between, basically, if you took a PM and someone from a marketing team and mushed them together, that's ideally what you want: someone who understands the product enough to communicate its value with conviction and authority, but who also understands the marketing side, the idea that, hey, you want people to make a choice. To be a good developer advocate, you need a few things. One, you need technical knowledge: you need to know the space, know the technology, and know technology in general. But you also have to be a good communicator, which is why I think having a history in education was really helpful. You also need conviction, because you can't advocate for something you don't believe in; you have to believe in whatever you're the developer advocate for. That's why I was excited to be at Dremio: it's such an exciting product at a very exciting time. I think the most exciting part is the state of the industry; it's such a moment of flux between so many competing technologies, which makes it that much more interesting and exciting to be on the front lines of. And you also have to be a content creator, because getting the word out there and being in front of people requires you to speak at meetups, do podcasts, make videos, and find any clever way to get in front of people, to speak at that more technical level, and also to create example code and useful tools. It goes beyond just saying: hey, this is what we do, and this is why you should use it. It's really about being able to empathize with people, to hear their experiences and their stories and get it, because you understand them on the technical level, but you also understand the pain at a different level. And it's difficult; I can imagine it must be a difficult position to hire for, because usually you can find people who are good communicators, and you can find people who are really technical, but finding both in one person can be really tricky. That's another reason I'm very grateful I've had such a weird backstory that took me through so many different experiences, and why I've loved doing what I do, because the position really is tailored to the life story that I've had.

Eric Dodds 45:39
Yeah, well, I think it speaks a lot to you, because you find joy both in understanding the deep technical stuff and in the process of trying to condense it down. Throughout the show, it's been wonderful to hear you use examples; you'll say, "admittedly, this is oversimplified, but I like to think of it as XYZ," and it's very clear that you have a deep love of both the technology and the best way to communicate it. A question I'm interested in, especially relative to your excitement around this technology, and we've talked about this a little before on the show when the subject of the data lakehouse has come up, is when you think wide market adoption will happen. To put a little more detail on that question: there are certain characteristics that make the data lakehouse make a lot of sense at really large scale, say, enterprise scale. A couple of examples: moving off on-prem, making the on-prem experience better, dealing with significant warehouse bills at scale. But at the same time, there are a lot of really interesting things about this that apply down market as well, right? I certainly foresee some companies, even small, early-stage ones, adopting a lakehouse architecture from the outset, just so they can have a glide path toward scale that doesn't require retooling. Now, that's not to say there isn't still a huge market where it makes sense to just adopt a warehouse, or query your Postgres directly, or whatever. But I'm interested in what you're seeing out there, from the Dremio perspective, about companies adopting this way earlier, versus maybe ten years ago, when companies moved toward a lakehouse architecture because of enterprise-specific issues.

Alex Merced 47:42
Got it. Yeah, actually, this is something I've dug into, because I do a podcast called Datanation, and I did an episode specifically on this, where I argue that companies should adopt the lakehouse earlier. Usually, the thing that would impede you is the cost of having a lot of big data infrastructure early on, which is really expensive and complicated. But especially with Dremio Cloud, it's easy and cheap: getting Dremio Cloud is literally just signing up. You use it when you need to, and if you don't, you still have your account. So if you're a small company and you're thinking, hey, I'm going to get to a point where I might want to start hiring a data analyst, and maybe right now you have everything saved in spreadsheets, or you have everything in a Postgres table, you can still connect those, and when you hire your data analyst, they can start working directly from there. Then, as you scale, your workflow isn't going to have to change when you get to that point, because people are already using the tool you're going to be using. You just shift how your data is managed on the back end, and your consumers never notice the difference as you grow.

Eric Dodds 48:49
Yeah, super interesting. All right, well, we are at the buzzer. Alex, this has been an absolutely fascinating conversation. I've learned a ton, and we're really thankful you gave us some time to join us on the show.

Alex Merced 49:03
Thank you for having me. And I just recommend everyone out there go follow me on Twitter at @amdatalakehouse. You can also add me on LinkedIn, check out my podcast, Datanation, and also, at Dremio, we're starting a new weekly webcast called Gnarly Data Waves, where I'll be hosting, and we're going to have a lot of interesting people come talk. So come check us out.

Eric Dodds 49:18
Awesome. Thank you so much. One thing that struck me was the emphasis on openness, which I guess makes sense for a tool like Dremio, where they need to enable multiple technologies. A lot of times, you'll hear technology companies be a lot more opinionated, like, "we are doubling down on this file format because of these really strong convictions." It was really interesting to hear Alex say, you know, it probably works best with Parquet, but you can query a bunch of other stuff with it, and that'll work; it may not be the most ideal experience, but I appreciated that openness. It seems like that's a core value of the platform, at least as we heard from Alex, and I thought that was really neat. And honestly, I think it's probably pretty wise of them. Obviously, a lot of their customers are well served by the Parquet format, but the fact that they seem to be building toward openness is probably pretty wise for them as a company as well.

Kostas Pardalis 50:27
Yeah, 100%. I mean, I don't think you can be in, let's say, the space of the lakehouse or the data lake without being open. I think that's the whole point; that's how the data lake started as a concept, compared to a data warehouse, where you have the opposite: a closed architecture, where a central authority optimizes every decision and has total control. And the data lake is the opposite of that: okay, here are all the tools, figure out how to put them together and optimize them for your own use case, right? So obviously, there are pros and cons there. I have to say, though, that openness is a little easier in this industry, primarily because the things you have to support are not that many. If you compare the number of front-end frameworks we have to how many file formats we have for columnar data, there's no comparison, right? And there's a reason for that: it's a different type of problem, with a more limited set of solutions, so openness is something that's easier to achieve and maintain here. But that doesn't mean it's easy, right? The technology is one thing; productizing it is another thing entirely. So yeah, it's very interesting.

Kostas Pardalis 52:15
I really want to keep in mind what Alex said about the catalogs and the importance of catalogs, and that this year is going to be important; we'll hear a lot about that. And hopefully we can have him on again in a couple of months and see how things are progressing, not just for Dremio, but for the whole industry in general.

Eric Dodds 52:38
We will have him back on. Thank you again for joining The Data Stack Show. Subscribe if you haven't, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.