This week on The Data Stack Show, Eric and Kostas chat with Ryan Blue, the Co-Founder and CEO of Tabular, and also creator of Iceberg and former Cloudera and Netflix employee. During the episode, Ryan discusses the challenges of managing large-scale data and the development of Iceberg, a new table format. He explains Iceberg’s benefits, such as automatic partitioning and improved metadata management, which simplify data engineers’ tasks and enhance query performance. The conversation covers the importance of atomicity in analytics systems, the scalability of Iceberg, and the trade-offs in mixed workload environments. Additionally, Ryan addresses the differences in cloud object storage performance and the integration of security and access controls into distributed file systems. He also touches on recent Iceberg updates, including Python and Rust support, the anticipation of view support in the upcoming release, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:04
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, you, we’ve talked a lot about databases, database technology. You know, it’s been a common theme on the show. But today, we’re gonna dig really deep into that world. That high scale. So Ryan blue is our guest. He helps create Iceberg, which is not part of the Apache Foundation. And it’s going to be a great story. I mean, I am really interested in hearing the background of the challenges that they face at Netflix, you know, where this was originally developed, and then it’s above my paygrade. But I am really interested, if you would be willing to ask him about file formats, because that is actually another interesting thing that we haven’t covered in great detail. I mean, we’ve done it here or there. But, you know, that’s a huge topic when it comes to Iceberg. And we think about data lakes. So that’s another topic that I’ve been thinking about, just as it relates to all of Ryan’s experience. So hopefully, I didn’t steal your thunder on the files. Question. But what do you want to ask about?
Kostas Pardalis 01:45
Yeah, I mean, first of all, I know that like most people, when they think about Ryan, they think of Iceberg. But what is like, I think extremely interesting is that Ryan has been around for a very long time, he has been part of building some of like, very foundational pieces of technology that we are using today, like things like Avro, Parquet. And obviously, like that, table formats, like Iceberg is. So outside of anything technical, that we will be talking about with him. One of the things that we’ll spend quite some time with is like, do a little bit of history, like, Why think why things actually happened the way that they happened with him. And it’s like, in my opinion, super interesting about how when it comes to data processing, there are actually two parallel tracks of development that happened in the past like 1015 years. One, which is coming primarily, like from the database folks that were building database systems. And another one is like coming actually, from people that were primarily distributed systems people. And that’s where things like MapReduce game stuff like Hadoop, and like all these big data, technologies that we are talking about, and we will see that there are like some very interesting comments, and points that are made of like how we were invented some things or we did some things like differently, why this happens. And Rand gives, like a very interesting perspective into the evolution of these systems and how they happened and why. And outside of that, we’ll talk a lot about file formats, which is also why a difficult topic or care, for example, has been out for a while. There are a lot of conversations like, we need to update it. There are some new things coming out these days. So I think it’s like a very good time to do something like a refresher on what file formats aren’t for storing data, and how they differ between them, and how they differ from table formats like Iceberg, right? And on top of that, we’ll also talk a little bit about how tabular is compounding and also about some other, like, really interesting things that are happening right now in space. So make sure you listen to the episode because it is very interesting. Brian has a lot to share. And we have a lot to learn from him.
Eric Dodds 04:28
Great. Well, let’s dig in and talk about Iceberg. And all the other fun stuff.
Kostas Pardalis 04:34
Let’s do it. Ryan, hello. It’s very nice to have you here. Again, we had like a panel a couple of weeks ago, together, but now we’ll have the opportunity to be just the two of us actually today and talk about many interesting things like from table formats, to security and anything that has to do with data lakes out there. But before we get to that, let’s do a quick introduction to a little bit about you. Like, how do you get into that stuff? Why you decided to work on these things, and what you’re doing today.
Ryan Blue 05:14
Thank you. And thanks for having me back on the show. It’s fun to have these conversations and do all these things. Yeah, so I accidentally fell into the big data area. I was not a fan of databases or database systems. When I had college classes. I was really more of a distributed systems, you know, kind of engineer. And as I was doing distributed systems in the government, I got more and more into big data because everything was a distributed systems problem at the time with the early days of Hadoop. So I eventually worked at Cloudera and then moved on to Netflix. And that was because I was working on the Parquet project. So I was on the Parquet and Avro sort of data formats team at Cloudera. And then we use them very heavily at Netflix, because we needed, obviously, a whole lot better performance, just reading bytes off disk.
Kostas Pardalis 06:21
Okay, that’s great. You should park it in Davao. And I have to ask you, what’s the difference between the two? And why wouldn’t?
Ryan Blue 06:31
Well, so Avro is a record oriented format. So you sort of blend together all of the columns, and you keep chunks of records together, individual records together. Of course, Parquet is a columnar format. So you reorganize and store all of column A and all of column B, et cetera. And the reason that you do that is so that you can basically keep the columns separate. If you only need columns, A and E, you can read them effectively as large chunks, which is like I was saying what we were interested in at Netflix, because what we didn’t want to do was, you know, read a ton of data, every single column, every single value from S3, what we wanted to do was read just the pieces that we need, because, you know, you’re talking a very different bandwidths and latency profile for all of those requests. When you’re working with S3 Instead of HDFS, it’s also just generally good practice to store your analytic data in a columnar format. Yeah,
Kostas Pardalis 07:40
So in a way you win, like from distributed systems to not just databases, but to the very, very like, core part of the storage of that, like how we lay out data out there. So yeah,
Ryan Blue 07:55
Well, that was actually an accident. I joined a different team. And then at Cloudera, and then like three months in, my team took over data formats. And that was pretty much all of what we did. And I’d done, like specifications for API’s and things in the past. So I was pretty familiar with like, you know, bytes on disk, or supporting something forever. Those sorts of things that definitely came in handy when we were talking about Parquet and Avro. Yeah.
Kostas Pardalis 08:26
100%. Okay, so you joined Netflix and moved away from it’s the first to S3 And what happened there, like what were like, the new things that your shows are that were like different, and hopefully interesting compared to like, the stuff that happened to me, as
Ryan Blue 08:47
well. So we had all the same problems, they were just 10 times worse, because the latencies were 10 times higher. So you know, the hive table format that we were using, at the time, had to list a whole bunch of directories to figure out what files are in those directories. And that was okay, when you were talking seven milliseconds to the name node. When we move to S3, and it’s 70 milliseconds, all of a sudden, you’re talking like 1015 minutes of planning time for regular queries. And it was, it was awful. So we started band aiding, and you know, sort of patching all these solutions. So we parallelized the file listing in SPARC. But then you’re still running up against things like, well, the listing might be inconsistent. So like, you’re not only checking S3, you’re also checking some source of truth and DynamoDB. That was our sempere project. And we just kept creating these sort of workarounds and band aids. We actually had our own implementation of the S3 file system that was based on the Presto one and It was funny because we would turn off certain checks, you know, to meet the requirements of a file system, you have to know that, you know, when you list a directory that it’s not a file, or vice versa. And we would turn off those checks. Because, you know, if you need to check if something’s a directory in S3, you actually need to check if the object exists? And you know, say, Hey, I’m a directory, the underscore dollar folder dollar paths. But you also have to do a listing to determine if it’s a directory and not a file. And you have to have some behavior to say, like, hey, if it is a file and directory, what do we do? So it was all just sort of bonkers. And we said, you know, what, we’re just going to turn off those checks. And our jobs ran a whole lot faster, because we actually treated S3, like an object store instead of trying to put this facade of a file system on it. Yeah, yeah,
Kostas Pardalis 10:59
kinda reminds me a little bit of like, the other thing that’s professional che, like in database classes that, you know, you have to bypass the operating system, in a way like to start working with that stuff. So I think we see that like, on a different scale, but you mentioned like latency, I was like, okay, like one of the main reasons behind that, like, the frustration and like the problem that we were trying to solve there. But can you like, elaborate a little bit more on what latency means? And work from seven millisecond to 70 millisecond? Like, does at the end to the user. And why this is becoming like a problem, because like, so many milliseconds on its own doesn’t like it will be like, okay, like, whatever, like, who understands, like the number 70 milliseconds, right. So what is happening there that, you know, like, Snowball down, like, big problem at the end?
Ryan Blue 11:54
Yeah, I mean, it’s just the minimum amount of time that any request to a service takes. So that’s what I think of as latency. Of course, like, you could have requests that take much longer, right. But there’s a certain bottleneck is there’s some minimum that you can’t do better than Yeah. And in the case of the hive tables that we were using, there were just a lot of those requests. So for everything we’d need to check, like, hey, is this directory, a file, and that was part of listing, right. And like I said, you can go in and make some assumptions and turn off this or that. But you know, if you’re making multiple requests to S3 for every, quote, unquote, directory that you need to list, then you’re talking like, that adds up to maybe 200, maybe 300 milliseconds. And if every single partition takes a substantial part of a second, and your listing, say 1000. Like, that’s a bad situation. And when you want your queries to ideally take, you know, a few seconds, that planning time was just completely, you know, there was just a huge problem that we couldn’t walk around. And I think there were also enough problems with the other aspects of this, you know. So the fundamental problem is the tables, we’re keeping the state of the table in both the file system and a meta store, the meta store says, Hey, here’s a list of directories you need to go look at. And the file system or S3 object store was responsible for what is in those directories. So you’ve got the latency problem where multiple requests stack up, and you’re doing 1000s of operations, which also, by the way, turns out to be super expensive. Just in S3 request costs alone, if you have this small files problem, you can actually have request costs that exceed your storage costs. Yeah, so like, don’t get into that. That’s a future topic. But another thing that was a problem here was that hive can only push filters to the partition level. So you can say, Okay, well, I can select these partitions, but then you don’t have any information about which files in that partition match your query. Yeah. And that was another order of magnitude, if not two orders of magnitude improvement. When we said hey, let’s keep more metadata about what’s in those files. So that we can select individual files and not these like, you know, several 100 megabyte groups or get or more large groupings of files.
Kostas Pardalis 14:54
Okay. questions like Why? Why we have like the separation We’d like the metadata, why would you then keep like, just like everything on the meta store, right? And we like to have some stuff like the meta store, and then some, like on the file system? Because I would assume that like, the meta store is optimized for accessing metadata, right? Like the file system is a file system. So what was behind these decision boxes and how do you like to design things like that?
Ryan Blue 15:24
I think that the file is the hive table format, really, and when I say table format, I’m referring to this space of things that we sort of labeled as, like, how do I keep track of which files are in my table? In a way, like table formats have always existed? You know, certainly, Oracle, and MySQL and Postgres have some way of baking their storage down to files. But we never see your care, what that is, it’s only really come to light in this Hadoop era, where we did it sort of naively and to answer your actual question, I think that the problem was that we didn’t design it, right? Like me, we were all distributed systems engineers. And so we went for something simple that seemed to like, you know, approximate a table if you squint. And what we had was, okay, we’ll just lay out our files and directories, directories have some property that we can filter by, and then we’ll figure it out from there. And you know, we’re using commodity hardware. So who cares? If you can’t, you know, select individual files, I’m sure it’ll be fine. It just was not well thought through, because I think everyone was trying to move so fast in those early days, a dupe was like a Cambrian explosion of projects. And, you know, in the early days of Hive, they were working on both, you know, the storage representation, as well as an entire SQL engine. And I think the sequel engine was the more interesting and harder problem at the time. And we sort of let the table sit there. And it did evolve. So you know, you’re also asked, why did we not just put everything in the meta store? metaphor was sort of an afterthought as well, right? We hadn’t suddenly 1000s of partitions and thought, oh, it’s really slow to go list and discover the partitions. Yeah. And we sort of need this ability to swap partitions at times. And so like, the meta store was bolted on after the fact. And it just reflected this already existing convention that we’re going to use directories to track what’s in our table.
Kostas Pardalis 17:43
That makes total sense. Okay. And what’s next? What happened next? So you are facing, you’re trying to scale the infrastructure there, obviously, like, Okay, your Netflix that I’m Netflix probably is also going through like an explosion and girls who are in this type of companies and explosion of growth means exporting data. So what do you do? Like how do you mean, there are only so many options that you can turn off on the file system, right? What do you do next?
Ryan Blue 18:12
Well, it was like, we’re running out of bandits. Right. We also realized that the techniques and sort of band aid approaches that we took were all specific to Netflix. So we had this massive development cost of like, every time a new spark version came out, we had to update it. Right, we had to port our changes to that new spark version, and it was a giant pain. And we also hadn’t even begun to fix some of the other problems. So problems like users expect SQL behavior, we’re getting a lot of data engineers with backgrounds in Vertica or Teradata, or these other like data warehouse systems. And we would have to retrain them. And that retraining was not like, Hey, here’s a list of the things you can’t do, including, right from presto, renamed columns and things like that. Which, like, can you imagine if you’re a data engineer, and on your first day of work, they hand you a list of like, Don’t rename columns, don’t write from these systems, like, but we couldn’t actually attack any of those problems, either, because we were spending all our time on band aids and things like that. So what we realized was that these were all the same problem, fundamentally, was that our format for keeping track of what is in a table is too simplistic. We didn’t have the metadata to do better filtering for faster queries. We were treating S3 Very badly and not using it how it’s intended to be. In addition, bringing in problems like eventual consistency with listings that could affect and just completely destroy query results. And then we had all these usability challenges, you know, our workarounds, were restrictive, and you couldn’t write from presto, and things like that. And we realized these were all the same problem. It was that our view of what makes a table was way too simple. And we needed to replace that. And that we could actually fix a lot of things. If we did. As I mentioned earlier, the small files problem. Yep. And another thing that we were hitting was, you know, if you don’t have the ability to change your table, atomically without corrupting the results that someone else sees, you know, another one of those minor usability problems, you actually can’t fix those small files problems. So you’re talking about assigning a person to make sure that data was written correctly the first time, and it creates a massive amount of work for data engineers, that the table representation was too simplistic. So you know, part of what we wanted to do in addition was like, go fix those situations automatically. There was no reason why your database shouldn’t be maintaining your table data. It was just that we didn’t trust ourselves to do it for some reason. Yeah.
Kostas Pardalis 21:17
Yeah. Yeah, I understand. So what’s Okay, so tell us a little bit about the journey there, like, how you start to try and like to solve this problem and not the band aid weight, right? And how did you end up with what the solution will look like, right?
Ryan Blue 21:36
Well, so what we knew we wanted to do was track state somehow, right? Basically track all of the tape and all of the files in a given table, we wanted to do that, and solve these sort of table level concerns. And we’d been accumulating this list I mentioned, working on the Parquet project. And we kept having people come to the Parquet project with problems that were not file level concerns. But there were table level concerns. Like, hey, I’ve got two files with different schemas. And I don’t know what the table schema is, can you guys, you know, build a merge schema tool that tells me what the table schema is? Or like, well, if one file has column E, and the other file doesn’t? Was it added? Or was it deleted? You really don’t know. And so you know, we kept having to say, like, this isn’t something that Parquet can solve. Yeah. So we grabbed a whole bunch of those problems. We said, Okay, we know, we want to track metadata at the file level, so that we can do better pruning, so that we can make modifications like fine grained modifications, that file list and do things like atomic commits and things like that. And so, you know, we basically designed for those constraints, and now built essentially a metadata format, with a very simple commit protocol to make all that happen.
Kostas Pardalis 23:03
And how was that implemented? Insights? Because you’re describing an environment that ‘s quite complex, right? Like there are many tools out there mentioned, presto, you mentioned hive, you mentioned spark, and maybe others. And suddenly, you come in and you say, hey, like, we’re doing like to model data that are important for these systems, right? Because if these systems ignore these metadata, then what’s the point? So there is, in my mind, at least, and correct me if I’m wrong, there is like, problem here on how you start a project like that. Because it’s not just like, laying out like a specification out there, or even like implementing a way to store and manage this metadata, you also have to integrate that with a number of tools. Right? So how did this happen? How did it work inside? like Netflix, I guess, like initially? And then what’s the story after that, because we have like an open source project that came out of that, right? That was a little bit like this journey.
Ryan Blue 24:07
So luckily, and I think the reason why this made sense for Netflix as an investment was that so much of our time was spent delivering new spark versions, or delivering new Presto versions, or things like that, where we thought that we could long term reduce the amount of time to release a new version, because we wouldn’t have to port all of our band aid fixes over to that new version, and update them and then test that it all worked. Like we wanted a system where we could just plug in using the existing API’s or in that didn’t actually work out in Spark, we had to replace the API and spark to actually plug in. But like, you know, Trino and presto were called presto, at the time. Trino had a very good API that we could cleanly plug into. So At least there was that body of time that we could either spend constantly maintaining, or building something new to reduce that maintenance work. So that gave us the essential ability to spend or invest in this area. And from there, we decided, you know, what we needed, because we had multiple different projects that were all going to interact with this thing, we needed a core library, the core library is what an engine will use to essentially say, Hey, give me data matching this, here’s a filter, give me the date, the files back. So we built that library, and then the ability to commit in that library. And then that’s what we integrated into these other systems. Luckily, we had, because we’d been porting our own sort of hacky solution, from version to version, we had a lot of knowledge of the inner workings of those projects. So with one core library, we were able to integrate that fairly easily into the projects.
Kostas Pardalis 26:05
All right, and that’s what became Iceberg. Correct. Exactly. I’d love to share with you, if you remember, like that first, first version of Iceberg, right? Like the first thing that you put out there, like what were the features that Iceberg had in this first version? Like, what is the reason I’m asking for that is because, okay, I’m sure, like many things have, you know, like, evolved, sort of like the years. But it’s always interesting, like to see these first choices that I think are making, because they also in a way, like say, what was like the big problem at that time? Or what? It couldn’t be solved in a timely manner? Right? Yeah. So can you tell us a little bit about that? Okay,
Ryan Blue 26:50
so I apologize if I get any of this wrong, because this is from memory. But I knew that I wanted to get rid of manual partitioning, because so many of these things that inspired Iceberg features, were constant problems that we had people coming to us and asking about. And we knew that if we were replacing this bottom layer in our software stack, we needed to get it right, because we were only going to do it once. So we included a number of other things in the initial version that we thought were very important. So first of all, was that hidden partitioning, don’t make users need to know the physical layout of the table in order to effectively query it. Same thing with schema evolution across multiple formats. So we said, if we’re going to get, you know, fix this rename problem, or fix the ability, you know, the, if you drop and add a column, you resurrect zombie data, right, we knew we had to fix those things. So we had schema evolution, hidden partitioning, and metadata file, essentially file level, metadata tracking, and atomic. Still, the same, like atomic swap that we use for commits that we do today was all in the initial version. What was missing was, we didn’t yet have the metadata pushed down. So we didn’t have column level metadata in there at first, we added that very quickly afterwards, for that file level pruning, because that was actually a huge win. Because going to users in Netflix, we said, hey, we can solve a lot of your problems. And they said, That’s awesome. I don’t really care about, you know, not being able to rename columns. And we’re like, but that’s a correctness issue that you spent a week last month fixing. And they were like, Yeah, but it sounds hard to migrate my tables. So we ended up going in and building some of these like, bigger features, in order to entice people to move over. Because if we were just replacing hive tables with something that had the same performance profile as a hive table, but was safe, people just didn’t see the need to move over. So we pretty quickly added the early use cases were around scale, you know, going and getting to scale that we couldn’t before, especially in planning time. Our early slides talked about taking a job that took, I think, 10 minutes to plan even in parallel and making it run in 45 seconds. You know, and I think it took more than 24 hours to run because we couldn’t do that file level pruning. Yeah, we were reading every single file and A huge set of partitions, and there was just no need, you could get that down by 100x. So that was really cool. The other was actually thanks to hidden partitioning, what we could do was we built a Flink sink, that would write directly into event time partitioning, rather than processing time partitioning. And that was huge for our users, because they always want to deal with downstream event time, no one wants to deal with processing time, because then you’ve got to have you know, multiple filters, you got to say, Okay, well, I’m willing to take this window of when the data may have arrived. But I’m really looking for this data, you know, in terms of like, when it actually occurred. It’s a massive pain. And if you can deliver data and partition it from the get go in event time, and then maybe use automatic compaction and things to take care of the small files problem that relieves enormous burden on data engineers for actually the way they think about and manage that data.
Kostas Pardalis 31:11
You mentioned partitioning, like a couple of times already. And I mean, it is like a big part of working with, like, lake architecture. Can you tell us a little bit again, you mentioned, like you knew that like one of the first things that should be delivered, there was the automatic partition thing, right? And change the way that people interact with partitioning the data, those a little bit more about, like how it happened before, from the perspective of the user, right? Like the data engineer was like working with the data. They’re like, how they were doing it like before, and what changed after Iceberg? wasn’t the greatest there.
Ryan Blue 31:49
Yeah. So it was very primitive, before. partitioning is the idea that you just, you know, essentially going back to that earlier definition of like, database, split across multiple directories. A partition is a directory. And so you would, we would break down data into these directories based on special columns called partition columns. Yeah. So imagine you have data flowing into this table, you need to have a column for how you want to break it up by directory. So you might take the processing time. And then you might derive a date from that processing time and store all the records for that date. In the same directory, right. Now, there are a number of problems with the way that actually happened. So that whole process is the same as what Iceberg does. The problem. The difference is that Iceberg does that automatically. We say, Oh, we know the relationship between your layout and this directory structure. And the original timestamp field this is coming from in hive, it was all manual. So you needed to supply the date for any given record. And that sounds easy. I can derive a date from a timestamp. But it’s actually not. Our, I mean, first of all, like, well, the originally, typed hive didn’t have date types. So you would derive a string or at Netflix, we were still using what we call the date end, which is basically the integer you know, 2013 0611, right, something like that. And so you’d have to derive these things yourself and put data in there. Which means if you do that wrong, you are putting data in the wrong place, and no one will ever find it. Yeah. So if you use the wrong timezone, write if you use the wrong date format, and all of these things could just be silently wrong. Yeah. Not only that, users that were reading the table needed to know that they were querying by this other special date call. And not just the original timestamp. Because if they say, Oh, well, I’m looking for event time between, you know, t zero and T one. They go do that, but hive has no idea how to turn that into a partition selection. Yeah. And so you, they would have to say, Okay, well, I’m looking for this range of timestamps, and you can find them in this set of days. And that translation again, is like, are you using the right right timezone to derive the days for your start and end time? And you also get really wacky queries. If you go down to our level partitioning, because you’ve got to say, Okay, well, this partition column between here and here and then on the First day between, you know, 11am and midnight, and on the last day between zero, midnight and 11am. It was just a total mess. And if you get any of that wrong, you get a full table scan and partitioning doesn’t help you at all. Yeah, so this was just full of user errors. Yeah. And a lot of places built up libraries that they would use for doing this, right, like Netflix had data and libraries. And it wasn’t awful. But it was a huge tax on people. And what we said and Iceberg was like, let’s just make all that automatic, you should figure a table with something that says, hey, you know, chunk up this timestamp, and today, size ranges, and use that. And then, if you have any filter whatsoever on that timestamp column, we can bake that down into the partition ranges that we need. And we can also then use the additional metadata of profile column ranges on your timestamp column for that purpose. That’s what we ended up doing. And I think it was really helpful because it freed people up to no longer worry about those low level details. In retrospect, when we look back on what we’ve done, it’s really restoring the table abstraction from SQL. It’s saying like, hey, users of tables don’t need to think about how this breaks down into files. Yep. And that frees us up as the people managing those files and managing the table data to say, like, hey, we can have real database functions on this table data, we can have automatic compaction, we can build these things. And it’s not going to screw up users who, for some reason, look under that abstraction layer and want to do specific things. So it makes data engineers a lot more productive, and means that we can do more as a data platform.
Kostas Pardalis 37:06
Yeah, yeah. 100%. Okay. And you mentioned, we talked a little bit about how you integrate these with, like, different query engines out there, right, like, so there was like a library that was built. But outside of the library, like, how do you build a table format? Like a software project? Right? I know, it sounds almost like a silly question that I’m asking, but what does it mean? Like, we know, if we’re gonna build a web application, we kind of have a mental model of what that means, right? But when you’re starting, like to build a table format, what does this mean in terms of like, building it as a software artifact? Right? Like, what are the choices that you make? How do you represent these, how do you store them? Right? Tell us a little bit more about that. Not about the integration with the libraries, like obviously, they’re like, API’s will have to be built there like to access the mental data, but more about, let’s say, the format itself, like how you as like, the designer SaaS, like, the system, you approach, designing it and implementing it,
Ryan Blue 38:14
The first stage was just rapid iteration. So we bid off this problem area of, we know, we need to track individual file level, we know we need atomic changes to that file level. And then it’s about solving the next challenge. And the next. So we initially said, Okay, we’re gonna have a file that tracks every file on the table. And then we want some way to atomically swap from one file to the next. And that’s essentially the basic design, right? And then you layer on new things that help you, you know, gain performance. So if you think of this, conceptually, it is a big list of files. That’s great. But you don’t actually want to rewrite the entire list of files for a table every single time. Yeah, so the next insight is, okay, well, we’re not going to have one single manifest file for everything in a table, we’re going to have a set of manifests that have different lifespans. So older data goes in one, you know, medium, age data goes in, and another new data, it goes in another and so on. And you basically have the ability to swap out certain files to create different versions of the table that gives you lower write amplification, right? So that was one of the very early things that we ran into, it was like, Okay, we’re not going to be able to rewrite all the table metadata every single time. And in fact, like some of the, you know, multi petabyte tables have like 10s of gigabytes of metadata, so it really is not practical. And then you say, Okay, well, how do we keep track of these manifest files? Well, we need something that holds their metadata, right? And so you introduce this manifest list level, where you can actually keep partition ranges and metadata in that list. So when we’re planning a query, we don’t even need to go to every single metadata file in the table, we can go and say, okay, which Metaflow metadata files do we need to read? Read those metadata files to get the data files. And then the layer above that is just where do we track all the current versions of the table? And what do we use for basically an atomic swap? Because the commit protocol is super simple. It is a pointer to the current table state. And you can swap that pointer, if you say, Hey, I’m replacing version four with version five, the meta store swaps V four for V five, and you’re done. If someone else gets there first, then you need to rewrite and essentially retry that commit operation. So it’s all very simple solving, you know, one problem after another. And we just iterated on it quickly. I think the original library took four or five months to write, and then it’s largely unchanged from there. I shouldn’t say largely unchanged, like the design is very similar. Yes, yeah.
Kostas Pardalis 41:34
Yeah. 100%. And, okay, let’s, I want to ask you a little bit of a, like, atomicity, because you mentioned it a couple of times. And it sounds important, right? Like, obviously, you don’t want someone else to be writing and to be making a mess there. Right. But how big of a problem is in a system? That it’s not, let’s say an oil TP system, right? Like something that is primarily for writing and updating data, convert, like to a system, that’s probably reading, right? Like we write, the whole idea behind all AP systems is that we’re doing things like moving the data, and preparing the data in a way that’s going to be very efficient to read. So we can go there. And when we query the data, be like really performant, right? And, like, do large amounts of data. So it’s as dominant as reading blockchain-like operations. So in what scenarios like atomicity, when we write it’s like becoming important, and how much of let’s say, a similar scale of a problem is compared to the transactional databases, right?
Ryan Blue 42:44
So out of necessity is always important. Because you want to make sure your changes either are entirely applied, or not at all, yeah, and fail. Because I always compare this to cutting and pasting between spreadsheets. Yeah. Right. It’s, it’s, you know, something that you’re going to do, you’re going to grab a ton of rows, and you’re going to copy from here and paste in the other. Well, what if that paste operation came back and said, Hey, we got like, 350 out of 400 rows, but we can’t tell you which ones. Right? You’d be like, what? And not only that, what if that spreadsheet were driving some dashboard that a C suite person is currently looking at? Yep. Right? Clearly, none of that is okay. All right, you don’t want to have to figure out and reconcile which changes were made and which were not. And you also don’t want some important query to be lying while you’re doing that reconciliation. So atomic changes are always important. Now, in the analytic space, the changes are larger and, and fewer. So you might have hundreds or 1000s, or even millions of rows changing in a single operation. And you’re also you don’t need the state of any given row immediately. So in a transactional system, I might make, you know, several updates within milliseconds of one year, and I always want it to be up to date. And so your architecture includes, you know, in memory caching of changes, a write ahead log to ensure durability, non like, a lot of things to make sure small changes are always up to date. In the analytic world, though, because you’re willing to take that trade off. You’re willing to say, okay, like writes can be slower, but I want reeds to be super fast. You know, I’m willing to spend that time to prepare the data. You can get away with a lot more and actually in Iceberg And these shared data formats, we go even further and say like, there can be no runtime infrastructure other than a catalog. So it’s not like you have something maintaining a right ahead log on a local disk. Yeah. Like, you write it all into S3, you swap that pointer over to, you know, make it live. And then that is your table state. And basically, the only thing that that knows about is the meta store that tracks the current table pointer or current table metadata. And then all the readers go directly to S3. And that allows us to scale well beyond like, what a meta store, the hive meta store.
Kostas Pardalis 45:42
Okay, that’s a great point, actually, because I think like, in my mind, like, automatically, without like, competing writers, but it’s not just that, like, there’s many more things that are happening, like, as you said, like you start writing something that might fail. What do you do in that case, like, you don’t even like think.is, just bad stage, right? Like, you need to keep track of that. And like, make sure that either you write everything or you don’t write. So that’s okay, I get that when it comes to liking the analytical use cases where you have, let’s say, like a GUI engineer that’s gonna like to query the data with their active analytics and stuff like that. But then, usually, in these environments, we also have ETL, right? Where the workloads are more mixed between, like reading and writing. And the reason I’m asking for that is because you said, like, okay, we can make an assumption here and the trade off is that, okay, we can sacrifice a little bit of writing, right, but we want to make sure that what we write is correct. And at the same time, when we read, we’re like, super, super fast, right? But when we start having, let’s say, more mixed workloads, where like writing is also like, important, because when you are ingesting data, for example, like you might have like to write a lot of data out there and be like, more performance than like, the right and how do we manage that? If it is a problem, say to me, like, you know, what, like, not not like, we can write data like from Kafka to like Iceberg and vary by very like, Boston, everything’s like, great, we don’t have issues there. But is there a difference? Like the trade offs when we are also starting to consider writing? And how do we deal with that impulse of like, designing? Right?
Ryan Blue 47:29
So it’s all about the amplification of that, right? You know, how much do I have to rewrite in order to express this change? In Iceberg, we use a Git like model, where everything is a full file, a full tree structure of your table data. And, and like, you can go read that independently, as opposed to like an SVN type model, where you’re just accumulating diffs, against the last version. And they have to sort of reconcile those over time, right amplification and means like, you know, if I want to change a single row, while I’ve gotta rewrite that data file, and then I’ve got to rewrite the manifest that tracks it. And then I’ve got to rewrite the manifest list to swap the new manifest for the old one. And I’ve got like, so there is quite a bit of rewriting and write amplification going on for a single change. Generally, that’s okay, if you’re coming in with a bulk or batch workload. And I include a micro batch here. So streaming data from Kafka and writing it into an Iceberg table works extremely well, because you create, you know, say, 100 files and commit them all at once. And the cost of that commit is, you know, you’re getting potentially millions of rows in every commit. And so that’s what makes the analytic world really work, is getting a lot of work done in a single commit. And the trade off is really that you can use this underneath so many different engines, because they’re not sharing anything. The minute you introduce that runtime infrastructure, that right ahead log or something like that, other than like a caching tier, I’m not talking about that. Yeah. But if you’re talking about having a write ahead log so that very fine grained commits can be durable, and you know, served, then you’re creating a bottleneck, right? The beauty of Iceberg is that the scalability is the scalability of the underlying object store. And S3 does pretty well. You see this in a lot of databases where, you know, say you have a Trino cluster sitting there. And, you know, you had some infrastructure underneath that Trino cluster to be able to, you know, satisfy queries and do essentially this fine. grained changes to your data really quickly, assuming that exists. Well, if you have 1000 processes all start up and hit that at once you’ve got this Thundering Herd Problem where it just takes down, you have not scaled up your cluster to the point where it can handle that workload. Whereas if you’re talking Iceberg, just sitting in S3, you can do any of that. You can scale up to 10 presto clusters, at the same time that you have a Hadoop cluster running 1000 Spark workloads, like none of that actually affects the scalability of the system.
Kostas Pardalis 50:38
Okay, that makes a lot of sense. And, okay, I have to ask you that. So you mentioned S3, right? Like, scales, like really well, and like all that stuff. But there’s not only S3 out there, like pretty much every cloud provider has like the equivalent of S3. And okay, it’s okay to be politically correct. And say like everyone’s doing, like grades rights and swap, Microsoft has like also doing great and like what they should be has also doing great, but have you seen because you’re like, in a very unique position like to have to understand very well in war, like with block store ads, they’re like to go and deliver, let’s say, then, the performance. Do you see? Have you experienced, like the President said, like, also, like any anecdotal information that you find interesting to share about, like, the differences between how these systems like our operating are built, because the API’s are pretty much the same? Like, they’re not like, crazy difference? I would say, but there’s huge, like, technology investment behind that to make it work. Right. So I’m sure they are interesting. hobbling.
Ryan Blue 51:47
Okay, so to me, I’m a very, like, first principles in design kind of person. So to me, an object store is an object store. Yeah, right. Like, they are fundamentally scalable, because of the design, which comes down to like, you’ve got a block store, with some sort of, you know, distributed index on top of it. And so you’ve got a key based index, right, and that’s a key value store. And then you’ve got a block store, that’s also a key value store. And those things work very well at scale as well as distributed systems, because you can always know from metadata who was responsible for either this key or, you know, this block. So yeah, those two key value stores that make up an object store, like that makes a lot of sense to me. As a scalability thing, I think where you get into the performance differences and trade offs, is when you talk about the guarantees on those indexes. Right? So this is why S3 didn’t have a consistent listing for the longest time. Yeah. Right. And this is why, you know, reading after writing consistency, and, sort of like the guarantees are hard. Because, you know, you have this key value store, you distribute state to nodes, and then you have a problem of how do we keep those nodes in sync? Yeah. And what if you hit the wrong node, you know, and, of course, you’re gonna have duplication and replicas and sort of that problem. And then it’s how well can we manage the trade offs of keeping these things in sync? And also, you know, parallelizing, in order to have that workload scale up, and I think that’s where you see different performance trade offs. So back in the day, S3 didn’t have a consistent listing, but Azure Storage did. And I looked at that and didn’t think, Oh, hey, that’s great. I’m like, okay, so Azure is saying, I’m going to wait for my listing request, until like, everything is reconciled. Yeah. And in S3, I’m not. So I should expect lower latency on the S3, and higher latency on the Azure call. And, you know, you can choose whichever one for design or convenience. But I really see like, I want to keep with the underlying trade off that they’re making on our behalf. Yeah. And, you know, I think that’s kind of where we ended up as everyone said, Hey, consistent listing is conceptually important enough that we’re waiting, we’re willing to pay that latency penalty.
Kostas Pardalis 54:46
Yep. Yeah, that makes total sense. Cool. All right. So we’re talking about tables here. So we’re talking about database systems in a way in the database systems like they are more data more management systems, right? Like, it’s not just a file system at the end. Not that the file system doesn’t have controls there to manage, let’s say like how people access or like how they access the data and like what you can do with the data. But definitely like a file system is much more simplistic like a system converts to a database, when it comes to how you interact with the assets that are exposed through the system, right? And there’s a lot of, I mean, people that are coming, like from databases, they are obviously exposed to a very rich environment of like, how can you say, hey, you know, what, like, Ryan can have access to this table, but Microsoft cannot, right? Or maybe you can have controls even on let’s say, the columns themselves. And like, all that stuff, or what I can do, can I create, or I cannot create, right? Obviously, something like S3, has some controls. But these controls, first of all, have been designed in completely different ways for different users, not people who are querying data, right? I would find it very interesting. If anyone wants to query data, they’re like, Oh, let me create a role here in IAM to make sure that, you know, the appropriate person, right?
Ryan Blue 56:15
Not only is that like usability, just very foreign to a database consumer. But it’s also intended for a completely different audience, right? The IAM policy is generally managed by an administrator or something like that, like, I mean, I would not want that level of complexity placed on individual database users just trying to give one another access to the dataset they just created.
Kostas Pardalis 56:44
I would present and like, if anyone hasn’t tried to create an im like, bowl the shelves there and see how it works, I would suggest he go and see if it’s experience. And that’s, it says a lot about the people who are taking care of WhatsApp, like, I have a huge appreciation for, like the people managing these things like they are complex for a good reason, right? Like when I’m not saying that, like something wrong with them, but they are a very different audience. So how do we, let’s say, start adding what has been most like a standard thing when it comes to database management systems out there, when it comes to security, access controls, and all that stuff. But now, we are building on top of a file system, a distributed file system, and we want at some point to expose these things, this kind of functionality so that we can show these users? How does it work, if it works, how it will, in the past? And how it is today, right? Because I’m sure that if we take the Hadoop days, and compare them with like, today, probably there’s like a gap there. So it would be great to also share like a little bit of the history to be honest, because you’ve seen that so it will be good, like for people like to get like a idea of like how kind of a wild west it was in the past when it gets like who has access to wealth, like on the data there?
Ryan Blue 58:08
Yeah, well, I mean, the original Hadoop days, and I guess to some extent today, because it translated into object stores. It was very much like, like the hive table format, where we’d already exposed individual files and what’s underneath this table level abstraction to users. And so we wanted to have both sorts of security models. And we ended up with something that was just horribly confusing, because some people wanted to just access files underneath these locations directly. And some people wanted to treat everything like tables. And I think you want only one abstraction level for security. So that was a huge problem in the Hadoop and Ranger days, where you had to have separate policies for whether you’re coming in through HDFS, or you’re coming in through a, you know, a table abstraction. And it just got really messy. Not only that, the world got a whole lot harder in the Hadoop ecosystem, because the assumption in the Hadoop ecosystem is that you’re using multiple engines. And these engines don’t really know about one another. So you’ve got, you know, Hive, presto, Trino, sparge pig at one point, like all these different Flink all interacting with the same tables at the same time. And that was something that, you know, technologically Iceberg fixed the ability to work with those tables at the same time without corrupting results and things like that. But the security model was never something that Iceberg, you know, wanted to wait into, because that’s super messy. You know, The solution was you locked down files and Ranger you locked down tables and Ranger and you sort of hope that those things are consistent. It’s not a great strategy, in my opinion, that was all then moved into S3, and you know, the data lake world when Hadoop sort of gateway to these object stores. But we still have generally a problem where you have multiple security models in mind. And what we see, probably the most common, is that you lock down Trino with Ranger, and you leave the spark wide open through IAM policy. And you’re like, Well, if you have access to run Spark jobs, you just have access to everything, or main B, you lock it down by bucket. But then any physical access policy and IAM is very inflexible. Because it’s not like you can just go make a programmatic change to say, Oh, hey, you know, swap these permissions, or do this other thing, because you might have hundreds or 1000s or millions of objects in that object store that you need to swap the policy on. And not only that, like, how do you translate policy from a table level policy? Like, you know, you can read this table down to those objects? There? It’s a massive ugly Pro. Yeah. Okay.
Kostas Pardalis 1:01:33
I also have a feeling like from, like, seeing the industry how, like, it operates out there, and especially like, okay, vendors that are building companies, right, that’s access controls. And like, security, in general is also like a way to create, like, a location and like, kind of keep the customer in and, like, monetize more on that. And I think that’s kind of creates also, not the right environment, like to find the solution at the end. Because no one is like, in a way, like incentivized, like to cooperate with the other entities that they go and work on the data. Right? I might be wrong, I’d love to hear your opinion on that. But because I’ve seen it, right. Everyone says like, okay, let’s say we have, like Databricks with Parkins, they have like, their way of doing things, right? You have Starburst with Trino. Again, like they have their ways of like, providing like access control charts or and like controlling like the data and all these things, but all of them like at the end and look for a good reason for like, they’re like bad doctors or whatever. Like they have an incentive to try and keep the user in their own like, like a gated environment, like for these kinds of features, which are extremely important also, like in the enterprise, right? Like, okay, if we’re talking like a small company, like maybe they don’t care that much. But if you have like 1000s and 1000s of users, and you’re a publicly traded company, yeah, like access controls, like to date is like an important thing. Right? And so, how do you see and because you’re like, also, like, in a very unique position, as like, was the Iceberg in tabular also like to be, you know, like interacting with all these actors. You know, creating this industry was like your feeling of like, how things are like working today? And like, where are they going? Right, like, because you mentioned like, there is a gap. So where do you see things going?
Ryan Blue 1:03:38
So I think we are just now coming to the realization that we have to share things. Like before, traditionally, databases have been stored and computed glued together. And you do things a certain way. Right? Storage was never shared, until we got to hive and hive did it so poorly that we didn’t really have time to look at the problems other than, like, correctness issues all the time. Now we have projects like Iceberg that solve that core problem. And now we’re looking at, I think, big questions about how database databases are going to be restructured in the next decade. So I put this very centrally in that bucket of problems. We previously saw, security is something that happens at the query layer, then we parse it, we figure out what it needs to access, we check all those accesses before moving on. And that, you know, doing this at the top level means that I need that in Trino. And I need that in Flink and spark and Snowflake and whatever else touches my data. And not all those things can be trusted by spark and flow. Think they accept user code. So like, we can’t just make security decisions and process? Yeah. The other two are very opinionated about, like, what they support. And you know what permissions mean? And, you know, things like, select, that’s easy. Things like who gets what access when you create a table? That’s really hard. Yeah. And so I look at this as really a restructuring of given now that we have this shared storage layer. How do we secure data? And to me, the clear answer is that storage needs to take on data security. Right? It needs to take on access controls, because what users want is a common set of identities, a common set of access controls defined on the table or the storage level, and implemented everywhere. Yeah. So I really think that and this is one reason why tabular, my company, built an access control system that handles all of those concerns and sort of plugs in nicely to other things. I think that now I’m going to be one of the biggest concerns moving forward. Because I don’t think that with GDPR and CCPA, and, you know, other new regulations on data coming online, that we’re going to be able to just give blanket access to anyone that can write a Spark job
Kostas Pardalis 1:06:29
anymore. Yeah. Yeah. No, that makes sense. So, okay. You say, like the responsibility moves to the storage layer. Makes sense, when you want to have all these like multiple different engines like going and working there. And I get how things can work with a system that it’s, let’s say, more relational, right, in terms of creating things like access control and all that stuff. But as you mentioned, like Flink and spark, we are talking about systems here that you can literally write and execute any type of code, right? How do you deal with that? When you have to be agnostic, right? Because now you are the storage, you have to control the access there. But you have different, let’s say ways like that people like to interact with the data. So how do you solve this problem?
Ryan Blue 1:07:22
Well, so for tabular, we are primarily coming from a structured or relational model. So when you load a tabular table, we provide you with temporary access credentials, to be able to access the files underneath that table with the permissions that you had when you loaded it. So you know, if you have read permissions, but write permissions will give you credentials that can be used against S3. With that. Now, that also has a lot of assumptions baked into it about how we manage the files, where they go, you know, how we break down tables, and so like that it’s a whole system inside the catalog to make sure that the access controls are not violated when we hand out credentials. And but that’s how we map our logical table level permission system to the physical file access, in tabular. When you look more broadly, outside of just the tables, and that structure, then you’d need some strategy, some similar strategy for doing that, right, you need to be able to delegate permission to that underlying storage system. And that delegation naturally has to make assumptions about where you’re going to be placing files, and where you know, where files exist that you have read access to.
Kostas Pardalis 1:08:55
That makes sense. Okay, so we have like to, like potential personas here interacting with like tabular, right? One is, let’s say the actual user, like someone was creating the data through a system that they don’t even care what the system is, as long as they can access their data, right. And the other one is like a vendor, like someone, let’s say Trino writes, a one like to add, like support to, let’s say, delegate access control, like to like Tabular? What’s the experience of each one of them when they are doing their work, and they are relying on things like the tabular catalog, like to access and like monads access control?
Ryan Blue 1:09:39
Well, so to the user level, it should be seamless. Now, what we do is there’s some administrator setup, and this is on the side of the person running Trino. To establish a chain of trust, right, we need to delegate trust to Trino so that it can and represent its identity or preferably that user’s identity to tabular so that when Trino goes and loads a table, we say, Oh, this is Trino acting on behalf of Joe, right. And we send a credential back that says, Hey, Joe can access this table, here’s a credential for doing it. And then we trust Trino to do that, that responsibly. So the user just knows that someone has set that up, they’ve logged into Trino, and have provided their identity to Trino. And the chain of trust goes from there. The administrator, ideally, I think, gets that setup through an OAuth exchange. Right? So hey, our ideal is a screen in Starburst or some other Trino provider that says, you know, connected, tabular, you click on it, and tabular shows you something just like a whatever app would like to access your Google contacts, we show you one that has, you know, hey, we’re Tabular. Trino wants to be able to, you know, access your tabular catalog, and say, Okay, that sounds good. And if you’re an administrator in tabular, you can set up that access and, you know, establish that chain of trust that allows tabular to know, this is a trusted Trino system that is set up by an administrator, who in the end, owns the data in tabular. Yeah, makes
Kostas Pardalis 1:11:35
sense. And how it works, I show the window, right? Like, I’m the creator of Trino ology. And now I want to extend my access control, like the infrastructure that I have in my product, right? Like my system, I support Tabular. What’s the process? There’s like, an API like? Like, it’s the case that someone is like, taken like the grading, like, how does it work? Well, so
Ryan Blue 1:12:06
The open source Iceberg project has a catalog protocol that all of the engines should be using to talk to catalogs. And also, this is a standard protocol, much like the hive thrift protocol only. Again, we wanted to make sure that it was well designed, rather than being something that through ubiquity may have a standard, we wanted to design it as a standard. So we carefully designed this with the ability to pass multiple forms of authentication through. So if I’ve given you a token, you can put that in an authorization header, you can also use AWS native sig v4, which is a way of taking your AWS identity and signing requests with that, and then passing it on. So you have an identity passed through that catalog API. And then we have the data sharing or access to underlying data, also documented and passed through that rest catalog API. So you say, Hey, here’s my identity token, you know, load this table, tabular, then says, Yep, I can see that you have access to this table, I’m gonna generate, you know, the underlying data credential for you, and pass that back to you, then you can use that. And you might be, you know, Trino acting on behalf of you, or you might be just you with a Python process. And that all works in this model.
Kostas Pardalis 1:13:39
Okay, well, that’s great. And when you say someone, like, through, like, toddler can have like, like access controls, I assume that someone can control what tables a user can have access to. What else is there? Like? Is it possible to define, let’s say, which columns someone does not have access to? Or how you define basis, like role based or like, you know, it’s a whole bottomless rabbit hole there of house keys things can be managed. So what is supported there? Like what a user should expect, like from Tabular? On that front? Yeah.
Ryan Blue 1:14:19
So this is where, you know, actually, if you’re going with just SQL access controls, you start running into some pretty hairy situations, because column level access is not real standard. So what we do is we have a system of labeling columns and sensitivity. So you create a label and you create a policy for that label that says, don’t show this unless you have Select Access on the label, or maybe mask it if you don’t have Select Access on a label. And then you can label columns throughout your warehouse, and we’ll apply that policy. Now. We apply that policy and In a couple of different scenarios are ways in a trusted system, right? If Trino comes to us and says, Hey, I am Trino, I am trusted, we can send back something that says, Well, you can access on behalf of this user columns A, B, and D, and Column C, and we trust Trino is going to enforce that. Yeah. So that is something that we’re standardizing now to be able in the rest protocol to be able to make those policy decisions and communicate them through the rest protocol for trusted endpoints. Of course, for places like Spark, how do you do that? We have two modes. So one is you deny access to the table entirely, because any data file has data that you can’t see. And therefore, if you can’t see all the columns, we deny access, you can see metadata, you can’t see data files, that works. Another mode is actually just more permissive, where you can just lie to spark. Oh, no, it sounds crazy. And this is not real security. But this is essentially taking care of the accidental leak, the accidental data leak rather than the malicious data leak, because yes, if I give you a key that can access a data file that has data, you can’t see, that’s not real security, because you can grab that key out of memory, and then go download the file yourself. On the other hand, what most people are interested in is avoiding accidental data leaks. And for that, we simply say this column doesn’t exist. When you load that table in SPARC, the column is not listed, and therefore Spark has no idea how to project it yet, or use that column. It’s just going to ignore it. And that is pseudo security. But it’s what a lot of people actually want in practice. Yeah. So I think it’s important that you get to choose as an administrator, whether you want to fail all queries, or just have this pseudo security. And if you want real column level security, then you go through an engine that is actually trusted to provide
Kostas Pardalis 1:17:21
it. Yeah, that makes total sense. That’s super interesting. All right. So we’re close to the end here. And I just want to ask you, to share with us, a few things that you’re excited about the future, like both Bob like Iceberg, Bob Muller and the industry in general. Right? Yeah. So yeah, what’s exciting Ryan these days?
Ryan Blue 1:17:47
Well, I do want to highlight a couple of releases that just went out this week. So Python, Iceberg just had a release, in which we finally got the right support. So you can append or overwrite append to or overwrite on partition tables, which is, you know, really great. We expect, like, providing these tools to the Python world will really help that integrate with the rest of the data ecosystem inside of companies. And the same sort of thing. The first release of Iceberg rust just went out this week. It’s early, it’s more, it probably is not directly useful. But it’s really cool to see the interaction with rest catalogs, and, you know, basic ability to work with Iceberg metadata and read it. And we expect that to actually mature quite quickly. So I’m really excited about those. I’m also excited about the upcoming Iceberg release with view support. I mentioned restructuring the database world, now that we have shared storage and security is one thing that also moves down to the data layer. But so do definitions of things like views. Views are really critical as just shorthand for like, oh, apply this filter or, you know, remove these columns or, you know, apply some transformation. And they’re, they’re very useful. So Iceberg is now trying to standardize how we express views and can use them across engines. So that is, and that gets to the kind of IR conversation that we had. The last time I was on the podcast. That’s our use case around an intermediate representation for sequel logical plans.
Kostas Pardalis 1:19:44
What’s the status of that project right now? In terms of figuring out, let’s say, how to be implemented because views are very interesting, as you said, like very important, but also very hearty, right? Like, there are systems out there where, like a view, like figuring out semantics and how they can be equivalent between different systems, right? Like all these things, they can become very hard. And yeah, and it’s always like the edge cases. But the question is like, how many edge cases like how long you’d like these long tails out there? Right? So what’s the kind of stage of that? And what are we looking into Right? Like, are you building your own hierarchy? Like, for example, you’re considering doing that like defining a way of representing, let’s say, intermediate representation of a query? Are you evaluating some projects? And what’s the collaboration also, like with the engines, because at the end, the engines are going to be executing these right and possible at the end? So
Ryan Blue 1:20:51
That’s a little bit more about that. So that is, I think, the big question, right now, what we’ve done is, we’ve allowed the ability to create multiple representations of a view that are mandated to produce the same thing. So right now you can create a view in the 3d API that has a sequel for Spark and SQL for Trino. And Trino, will use the Trino SQL spark will use the Spark SQL and you have compatibility that way. You could also have Spark, say, use the Trino SQL and just parse it and try to rely on the fact that it is a standard language. That doesn’t work for a lot of cases. But there are a lot of cases it does work for, for things like very simple transformations. Things like, you know, filtering, or sorry, yeah, running a filter statement, projecting columns, it’s pretty simple. Even stuff like Union all. So if you have a use case, like, you know, I’ve got fast moving data coming over here and slow moving data in this table, I can union those tables together. That works really nicely. But we do want to get to this state where there’s some other intermediate representation that has, you know, really high fidelity to, you know, essentially translate between SQL plans and this intermediate representation. So Trino is responsible for their translation SPARC is responsible for their translation. And we all agree on what the intermediate representation says. That is further out, we’re taking a look at IR projects now. And we ideally do not want to have one in the Iceberg project. That would be very difficult. But even things like substrate, which I think is the most well known one out there, is difficult to work with, because it expresses literally everything. So even things like Boolean expressions are custom functions that are defined in YAML files that you can swap out. So things like equality. Pulled in a definition of that from somewhere, and then it’s a little too expressive. So we’ll need to look into exactly how to make use of those types of projects. And of course, like you were saying, the community side is tough, because it’s not that Iceberg makes this decision. And everyone follows along without complaining that we’ve got to convince everyone that, you know, the representation that we are choosing is reasonable and should be supported in Trino. And spark and Flink and whatever other commercial engines are looking at it. That’s going to be difficult, but at the same time, like super valuable, if we can get it
Kostas Pardalis 1:24:05
right. Yeah. 100% I’m especially like for practitioners out there like okay, like the vendors, like different kinds of beast to tackle and like, convince them why this is important at the end, I think the market should enforce that. But yeah, like hearing like people having to migrate, like high views, like in big systems and moving to other systems. Like it’s a lot. And it’s a hard problem. And it isn’t Benjamin on the end of like the progress, right, like people get stuck in systems that they cannot move out of them just because of like the amount of resources that are needed to go and like migrates right, like, and that’s
Ryan Blue 1:24:45
What’s so encouraging about this new world, though, right? In a world where we have shared storage. Yeah, the migration costs dropped to zero. Yeah, right. And well, not zero, but like it’s a step function. and down the. And then if we’re able to move things like security and access controls and views, you know, more and more things are going to be useful across engines. That’s a radically different world. And it’s really encouraging. I love that you talked about practitioners, because that’s really who we’re trying to serve with all of this, you know, people who spend their days, you know, making sure that the copy over to Redshift or Snowflake is in sync with the original data. And like, are we securing, locking down these two things with separate permissions models, even though they’re 99%, the same permissions model, they’re separate, and they’re stored separately, and they’re maintained separately, are those things in sync. And I think with the new design here, we can really do away with a lot of those challenges. And just annoyances, I think we’re moving to a world where we have centralized components, and specialized components, and storage views, encryption, security, and access controls, those are all going to be centralized components. And that’s a really exciting world to be in, hopefully, then we have a simple OAuth process to plug in new specialized components, like your Trino is and your sparks and your snowflakes. And that is a really cool data infrastructure world.
Kostas Pardalis 1:26:35
I 100%, I think like this, that’s also like, the only way that we can, you know, like, starts accelerating, like how we build value here, because the more we turn these people, the practitioners into, like, operators instead of builders, which is like how it was in the past. And for good reason back then, right? Like, the less we give them, like the space to go and create value. And I do think like, especially if you compare our data practitioners and compare them, like two application engineers, and see the difference of like, the tooling, what one or the other has, and like how much they can focus on Hey, now I’m building a new feature. Okay, well, we’ll have to do it but even that, like it’s much more streamlined as if he’s like, in a good data war, like, that’s what we should go after. Like, that’s the experience, like, the petitioners should also have out there not spending 80-90% of their time, like trying to figure Oh, shit, did I copy the stable IP rights? Like from one system to the other? No, that’s not what humans should be doing. Like, they’re much more interesting things. And we need that, like, if we’re talking about like, all these AI, and all these things out there, like how all these things are going to happen without data that are consistent that at the minimum, right?
Ryan Blue 1:27:53
Yes, we have to make practitioners productive, just with data in order to get to these other things. Because if there are people spending their 40 hours a week, just looking at all the access controls in sync. Oh, no.
Kostas Pardalis 1:28:13
100% out of a cent. And that’s what I think it will be like, these next couple of years, like really exciting for anyone who’s like working in this space, I think it’s going to be like a very interesting place to be in, crazy opportunities like to build very interesting things.
Ryan Blue 1:28:27
I agree. It’s an exciting time, because we’re seeing this fundamental shift and architecture. I don’t think we’ve ever seen anything like the shift to centralization for storage and security and things. Yep, yep.
Kostas Pardalis 1:28:43
100%. So, Ryan, thank you so much. But we’re a little bit over the time here, but it was like, that’s an amazing conversation thing to show mods. And I can’t wait to call you back and continue the conversation. I think there is more talk about looking forward to doing it again soon.
Ryan Blue 1:29:00
Thanks for having me on. It was a lot of fun.
Eric Dodds 1:29:03
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month’s shows, a sneak peak at upcoming episodes, and curated links from Eric, John, & show guests. Follow on our Substack below.