Episode 175:

The Parts, Pieces, and Future of Composable Data Systems, Featuring Wes McKinney, Pedro Pedreira, Chris Riccomini, and Ryan Blue

January 31, 2024

This week on The Data Stack Show, Eric and Kostas chat with a panel of experts: Wes McKinney (Co-Founder, Voltron Data), Ryan Blue (Co-Founder and CEO, Tabular), Chris Riccomini (Seed Investor, Various Startups), and Pedro Pedreira (Software Engineer, Meta) share their thoughts on composable data stacks. During the conversation, the group discusses the importance of open standards and APIs for efficient interoperability in data management systems, the evolution of data workloads, the need for specialization, and the challenges in building composable components. The conversation also covers the significance of an intermediate representation (IR) for decoupling the layers of data systems, the complexities of data types, and the desire for more secure data sharing methods. The panelists explore the evolution of open standards and the trade-offs between composable and monolithic systems, and express excitement about new data infrastructure projects and technologies, modular execution engines, new query interfaces, standardizing policy decisions across different data management platforms, and more.

Notes:

Highlights from this week’s conversation include:

  • Introduction of the panel (0:05)
  • Defining composable data stack (5:22)
  • Components of a composable data stack (7:49)
  • Challenges and incentives for composable components (10:37)
  • Specialization and modularity in data workloads (13:05)
  • Organic evolution of composable systems (17:50)
  • Efficiency and common layers in data management systems (22:09)
  • The IR and Data Computation (23:00)
  • Components of the Storage Layer (26:16)
  • Decoupling Language and Execution (29:42)
  • Apache Calcite and Modular Frontend (36:46)
  • Data Types and Coercion (39:27)
  • Describing Data Sets and Schema (42:00)
  • Open Standards and Frontiers (46:22)
  • Challenges of standardizing APIs (48:15)
  • Trade-offs in building composable systems (54:04)
  • Evolution of data system composability (56:32)
  • Exciting new projects in data systems (1:01:57)
  • Final thoughts and takeaways (1:17:25)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome to The Data Stack Show. We have a truly incredible panel here to discuss the topic of composable data stacks, and so many topics to cover today. So let’s get right into introductions. And I’m just going to do it in the order that it shows up on my screen. Chris, do you want to start out by giving us a quick background and intro?

Chris Riccomini 00:47
Sure, yeah. My name is Christopher Riccomini. I have spent the last 20 years of my career mostly at two companies: LinkedIn, where I spent a lot of time on streaming and stream processing and was the author of Apache Samza, which was an early stream processing system kind of similar to Flink, and most recently a company called WePay, which was acquired by JPMorgan Chase, where I ran our payments infrastructure, data infrastructure, and data engineering teams for a stretch of time. I’ve also written a book for new software engineers, kind of a handbook, because I was tired of saying the same thing in one-on-ones over and over again. I’ve been involved in open source. I was a mentor for the Airflow project and helped guide it through the Apache Incubator. I also do a little bit of investing, and that’s where I spend a chunk of my time now. And yeah, I write a little newsletter on all things systems infrastructure. That’s me in a nutshell.

Eric Dodds 01:37
Very cool. Wes, you’re up.

Wes McKinney 01:40
Yeah, I’m Wes McKinney. I’m a serial open source project creator and open source software developer. I co-created a number of popular open source libraries: pandas and Ibis for Python, and Apache Arrow, a kind of in-memory data infrastructure layer that’s very relevant to the topic of today’s show. I’ve been involved in a bunch of companies, most recently as a co-founder of Voltron Data, building accelerated computing software for the composable data stack, and Posit, the data science platform company for R and Python. I am the author of the book Python for Data Analysis, a popular reference book for the Python data science stack. And I also do a fair bit of angel investing in and around next-generation data infrastructure startups.

Eric Dodds 02:40
Very cool. Ryan, you’re next on my screen.

Ryan Blue 02:44
Oh, thanks. I’m Ryan Blue. I’m the co-creator of Apache Iceberg, which is one of the open table formats that I think is slowly but steadily making a big change to the way we architect, you know, big data systems, especially in object stores. I’m also a co-founder of Tabular, where we sell an Iceberg-based architecture that has, you know, security and data management services baked in. I left Netflix to found Tabular, and at Netflix, we were on the open source big data team. So I got to work on Parquet and Iceberg and replace the read and write paths in Spark and various other things.

Eric Dodds 03:37
Very cool. And Pedro.

Pedro Pedreira 03:40
Alright, hello, everyone. I’m happy to be here once again. I’m Pedro Pedreira, a software engineer who has been at Meta for a little bit over 10 years, always involved in projects around data infrastructure, a little bit closer to analytic engines and log processing engines. So I’ve spent most of my career developing databases and data processing engines, and in the last five years I started getting a little closer to this idea of composability: how can we make the development of those engines more efficient? So we started working on a variety of projects related to this space. One of the projects that got a little more visibility in the industry was Velox, which we recently open sourced, built on this idea of making execution more composable for data management systems. At Meta, I work with a variety of teams, mostly on the large warehouse compute engines, like Presto and Spark. So this data processing area for analytics, developing efficient query engines, is sort of my thing.

Eric Dodds 04:44
Very cool. All right. Well, I just want to dive right into it. Wes, I’m going to point the first question at you and then have the rest of the panel weigh in with agreements, disagreements, or comments. So let’s try to define what a composable data stack is. The term composability has been thrown around a lot, and there are even companies co-opting the term for marketing purposes, which always creates a lot of confusion out there in the marketplace. But can you give us a definition of what a composable data stack means to you?

Wes McKinney 05:22
Yeah, so, it’s a project or collection of projects that serves to address a data processing need, but where the component systems are built using common open source standards that lend themselves to efficient or simple interoperability. So the different pieces that you assemble to create the ultimate solution for your data platform can be put together without the developer having to write nearly so much glue or custom code to fit the pieces together. Those points of contact between different components of the stack are based on well-defined open standards that are agreed upon and shared amongst the different component systems.

Eric Dodds 06:11
Any additions or disagreements from the crew?

Pedro Pedreira 06:17
Yeah, I think maybe just adding to what Wes said, there are maybe two different aspects. One is this idea of having open APIs and standards through which different components can communicate, but there is also the idea of using common components, right? So at least how we see this internally is: how can we factor out common components between engines as libraries, and how can we define common APIs and common standards for them to communicate? So if you look at the industry or the projects in this area, there are usually those two things: one is just defining the API and the standard, and the other is actually implementing things that do something with those standards. So there’s this idea of providing components that communicate via common APIs and making sure that they’re somewhat interchangeable.
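
As a rough illustration of the two aspects Pedro mentions, a shared interface and interchangeable components behind it, here is a minimal Python sketch. The `Storage` protocol, both implementations, and `count_rows` are hypothetical names invented for this example; they do not come from any project discussed in the episode.

```python
import csv
import io
from typing import Iterable, Protocol


class Storage(Protocol):
    """Hypothetical shared interface that any storage component could implement."""

    def read(self, table: str) -> Iterable[dict]: ...


class InMemoryStorage:
    """First implementation: tables kept as lists of dicts."""

    def __init__(self, tables):
        self._tables = tables

    def read(self, table):
        return list(self._tables[table])


class CsvTextStorage:
    """Second implementation: tables kept as CSV text with a header row."""

    def __init__(self, csv_by_table):
        self._csv_by_table = csv_by_table

    def read(self, table):
        return list(csv.DictReader(io.StringIO(self._csv_by_table[table])))


def count_rows(storage: Storage, table: str) -> int:
    """An 'engine' written only against the interface, not a concrete component."""
    return sum(1 for _ in storage.read(table))
```

Because `count_rows` depends only on the interface, either storage component can be swapped in without changing the caller, which is the interchangeability being described.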

Kostas Pardalis 07:07
I’d like to add a question here, because we are talking about APIs and libraries and all these things, but what are these projects? Let’s say we had to define a minimum set of APIs that we can compose data systems with today. What would that be? And I’ll start with Wes, because I think a lot of that stuff started with Arrow defining, let’s say, the vocabulary that we use today. So what is the least we need to start with to do that?

Wes McKinney 07:50
Yeah, I mean, the easiest way to think about it, and the way that I often explain it to people, is to think about all of the different layers of a traditional database system. So historically, you’d have database companies like Oracle that would create these vertically integrated systems that are responsible for implementing every layer of the stack: data storage, metadata management, physical query execution, query planning, query optimization, and then, all the way to the user level, the user API, which would generally be SQL. And so if you think about this vertically integrated stack of technologies, and you start to think about the logical components of the system, you can start to ask, okay, are there pieces of the stack which could be peeled off and turned into reusable components? And if you want to have a reusable component for, let’s just say, storage, you start thinking about designing open source file formats, or open source metadata and dataset management. But you need to think about those interface points between the different systems. If you want to turn something into a reusable component, designing the API or interface for hooking that into other systems, so that it is reusable and well documented, is a lot of engineering. And one of the reasons why, historically, data system developers didn’t do this is that the engineering work to make systems composable, or to make the components of a vertically integrated system separable and reusable, is a lot more difficult.

Kostas Pardalis 09:40
Ryan, what’s your take on it?

Ryan Blue 09:43
I think it’s pretty funny. You’re right, it is a lot harder. But I think the Hadoop world taught us that you don’t actually have to do that work. Right? Like Hive tables: no one ever did that work. It was just unsafe, and sometimes you clobbered the result that someone else was getting. And, you know, we lived with unsafe transactions in the storage layer for a really long time. But it was still super useful, because a lot of the time you only had one person changing a table at a time, and you were reading historical data, and it just sort of worked. So I think we actually backed into the storage layer, at least into making it more reliable and having the behavior and guarantees that we wanted to have.
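
To make the clobbering problem concrete, here is a minimal sketch of one common way a table layer can make concurrent writers safe: an optimistic commit that succeeds only if the table has not changed since it was read. This is an assumption chosen for illustration, not a description of how Hive or any specific table format is implemented, and `TableCatalog`, `CommitConflict`, and `append_file` are hypothetical names.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class TableCatalog:
    """Hypothetical catalog tracking one table as (version, data files)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._files = ()

    def current(self):
        with self._lock:
            return self._version, self._files

    def commit(self, expected_version, new_files):
        # Compare-and-swap: accept the write only if nobody committed
        # since the caller read the table state.
        with self._lock:
            if self._version != expected_version:
                raise CommitConflict("table changed since it was read")
            self._version += 1
            self._files = tuple(new_files)


def append_file(catalog, file_name, retries=3):
    """Append a file, re-reading the table state and retrying on conflict."""
    for _ in range(retries):
        version, files = catalog.current()
        try:
            catalog.commit(version, files + (file_name,))
            return
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")
```

A writer holding a stale version gets a `CommitConflict` instead of silently overwriting someone else’s result, which is the guarantee that was missing in the situation Ryan describes.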

Pedro Pedreira 10:35
Yeah, I think I would even go further than what Wes was saying. I’d say that most companies don’t even have the right incentives to invest in composable components, right? Because, like we said, developing components is a lot more expensive. There’s a lot more thought about what the APIs are, it’s a separate project, and if it’s open source, there’s a cost of maintaining the open source community. So if you’re developing a single engine, it’s a lot more efficient for you to just write it as a small monolith, because you have full control of it. It’s a lot easier to evolve, and it’s a lot easier to control the direction the different features and the architecture are going. But actually thinking through what the right APIs are, working as a community, aligning on how this should work, and all of that is a lot more expensive, right? So I think that’s why, historically, most companies just focused on developing the system for the particular workload they had in mind. I think where this breaks is if you’re a company that actually needs to maintain too many of those systems. Then it starts to make sense economically to say, okay, let’s actually see what we can share between those things. And I think this, in addition to open source getting to a point where a lot of those components are already available and already pretty high quality, is why we’re getting to this inflection point where people are actually rethinking their strategy around proprietary monolithic software and thinking a little more about composability, open source, and open standards.

Kostas Pardalis 12:03
Now, that makes a lot of sense. But quick question here, and I want your take on it first, Pedro, and then I want to ask the rest of the folks, because I want the perspective of someone who works at a hyperscale company like Meta, but I also want to figure out how that reflects on the rest of the world out there, because not everyone is Meta, right? So you talk about this inflection point: at some point you need this modularity, and it emerges as a need. Can you tell us a little bit more about how you experienced this? Because I’m pretty sure there was some kind of evolution, right? There was Hive, then we started having the rest of the systems, and then there is the point where you even have to take out the execution engine itself and make it a module in its own right, with Velox. Tell us a little bit more about how this happens inside a company like Meta.

Pedro Pedreira 13:04
Yeah, sure. I think a lot of that just comes from the fact that data workloads are always evolving, right? First you want to execute large MapReduce jobs, then you want to execute SQL queries, then there is stream processing, and there’s log analytics, and there’s transactional, and there’s machine learning. There are a lot of different types of data workloads. And the fact is just that we cannot build a single engine to support all of them. So this drives what we call specialization. What we end up doing is developing a single engine to support each slice of this workload. So we have one engine that supports really large ETL SQL-like queries, another one for interactive dashboards, a series of engines for transactional workloads, a stream processing engine, and now a training engine that can feed PyTorch and keep the GPUs busy. So because we have so many data workloads, it drove this requirement of specialization. I think the problem is that those things were done a lot more organically than intentionally, right? When there’s a new workload emerging, people just go create a team and start developing a new engine from scratch. And then you get to a point where we have 20 of those. And if you look closely at them, they are not the same, but a lot of the components are very similar. So specifically, to your question around execution, if you look at the execution of all those engines, they’re very similar, and not just across different analytic engines. Even if you look at something like a stream processing engine, it’s not exactly the same, but the way you define functions, execute expressions, and do joins is all very similar.
So that’s how we started this idea of, okay, let’s first look at execution and see what the common parts are and what we can factor out as a library, and then just integrate and reuse those libraries within those engines. This is how we created Velox, which is where we are unifying execution. And what we saw is that the more we talked to other companies and to the community, the more people were really, really interested in that, because developing those things is a very expensive project, right? It costs you hundreds of engineers, and it takes you 10 years. So if you can actually join forces with a much larger community, or just reuse an open source project that already does all those things in a very efficient manner, it saves you a lot of effort. So this is how we got into this idea for execution. But there are similar projects targeted at other parts of the stack.

Kostas Pardalis 15:34
Okay, that makes sense. Chris, I think you have something to offer here.

Chris Riccomini 15:38
Yeah, I more or less agree with what Pedro was saying. I think the key word there was the organic aspect of this. And I think Ryan called this out as well, looking back to the early days with HDFS and stuff. I think the big evolution, of which S3 is just a continuation, is the separation of storage and compute. And, you know, I think Pedro is focused much more on the query engine aspect of it. I think that is probably a symptom of being at Meta, which is a very large company. But the alternative storyline that I think people go through is they get their data on S3, and they’re like, okay, I need to query it. And then, like Ryan said, well, there are no ACLs. So now you need some kind of ACL thing on top of the query engine, and then you need a data catalog or some form of information schema. And so, very organically, you start building out these components. But it’s kind of piecemeal, because initially you just wanted to query your logs, right? And then you start getting streaming data or OLTP data in there, or you start adding stuff over time. So my journey has been more one of, maybe I’m not sure whether it’s horizontal or vertical, but not so much going across a bunch of different query engines as starting to add more and more of the features that a normal database would have. Especially being in FinTech most recently, you take security very seriously, and data discovery is a whole thing there. So I think that’s another point of view on how this stuff evolved.

Kostas Pardalis 17:14
Yeah, that makes sense. Chris, I want to ask you, because you also have experience, let’s say, not only with the path where the data is stored at rest and gets processed, but also with the capturing phase of data, the delivery of the data. Do you see this concept of composability that we are talking about, which, let’s say, comes a little bit more from data warehousing or OLAP systems, also being part of the systems in front, systems like Kafka, or even OLTP? What’s your take on that?

Chris Riccomini 17:50
Yeah, absolutely. I mean, taking streaming, for example, when I was working on Samza, and I think LinkedIn was like this as well, streaming and that sort of nearline ecosystem was pretty adjacent to batch. And so many of the streaming query engines can also do batch processing on top of HDFS or S3. That’s a very concrete example. Going even farther upstream, there are OLTP databases that are experimenting with bridging the gap as well, whether that’s something like Materialize, which is sort of pseudo-OLAP, or something like Neon, which has a tiered storage layer that includes S3, with its persistence in the write-ahead log. So definitely, you can look in any direction and see disaggregation and components being overlapped or shared.

Kostas Pardalis 18:39
Yeah, it makes sense. Ryan, I think you want to add something?

Ryan Blue 18:44
Oh, yeah, I was just gonna say that. I think that Pedro and Chris are kind of coming at this from two opposite ends, like, I come from more Chris’s perspective, where we had a whole bunch of open source engines, and we needed them to work on the same data sets. And we needed to have this sort of architecture that all plays nicely with streaming and batch and ad hoc queries. And you know, anything from a Python script to Snowflake or Redshift. Whereas I think Pedro’s perspective is kind of fun, because he’s coming at it from like, how do we build engines and share components, like the optimizer? Which is a lot of fun as well. And you know, Wes, as well, where, you know, can we have really high, high bandwidth transfer of data between those components within the engine itself? So I think there are like, two separate ends of this conversation.

Kostas Pardalis 19:44
Yeah, yeah. We have more of, let’s say, the user side of things and the builder side of things. Wes, I think you also want to add something on the previous point.

Wes McKinney 19:55
Yeah, I mean, I think it’s interesting. I think the way that we arrived at even this concept of a composable data stack or composable data systems was a little bit organic. When I got involved with what became Apache Arrow, I needed to define basically an ad hoc table format, an in-memory data representation for data frames or tables, so that I could hook pandas up to systems in the Hadoop ecosystem. And there were many other systems that had defined in-memory tabular columnar formats, either for transferring data, for example between Apache Hive and clients of Apache Hive, or many database systems had built in-memory columnar formats that were essentially implementation details of their execution engine. And they had no interest in exposing those memory formats to the outside world. And so as I found myself starting to create an ad hoc solution to the problem that I had, which was connecting Python to these other systems, it was only at that moment that there was a collective realization that we should try to create some piece of technology that could be used in a variety of different circumstances for solving that problem, rather than creating yet another ad hoc solution that’s incompatible with every other solution to that problem. And so I think, as time has gone on, people find themselves reinventing the same wheels. And then, finally, if you have the bandwidth or the motivation to build a software project, or an open source project, or an internal corporate project that is more reusable or more composable,
and you have the experience to do it the right way, then you do it. I think that’s what’s caused this to happen now, as opposed to 10 or 15 years ago, when the open source ecosystem was comparatively a lot more nascent and emerging, whereas it’s a lot more mature and mainstream now.

Kostas Pardalis 22:05
That makes a lot of sense. Pedro, you wanted to add something, so please.

Pedro Pedreira 22:09
Yeah, no, I think just quickly addressing Ryan’s point, I think it makes sense. When we started looking at this space, it was from a practical perspective, right? How can we be more efficient as an organization? How can we do a little better from a software development perspective? But as we made progress, we tried to get a little more scientific about it as well. So essentially, if you set aside how the engines are developed today, and the standards and components we have, and just think about what the different layers are, what are the common layers between every data management system? So we defined an architecture saying that every single data management system has a language layer, which essentially takes something from a user. Sometimes it’s a SQL statement, sometimes it’s, I don’t know, PySpark or pandas, or something non-SQL, but you take this input. Then there’s another component, which is just how you represent the computation, right? You take the user input and you create an IR. Substrait is one project targeting standardizing the IR. But if you look at every single system, from analytics to transactional to data ingestion to machine learning, anything, you have a language that translates to an IR. There’s a series of transformations that you do on this IR, for metadata, resolving views, ACLs, security, all sorts of things. And at some point, you get to an IR that is ready for execution. It goes through an optimizer. Every single engine has one, though sometimes the optimizer just doesn’t do anything. But there’s a system that takes this IR and generates an IR ready for execution. Then there is some code or some component that can actually execute this IR given a host and given resources, which is a little more of what we’re targeting with Velox.
And then you can move further. There are a lot more details. There’s the environment where you run those things, which is the runtime, and it goes from MapReduce to Spark to, right now, I heard that Redshift can run on serverless architectures. It’s just this environment that we call the runtime. So we defined those layers, and we see that if you look at every single data management system today, they are all composed of those layers. And of course, those layers are not completely aligned between them, and they don’t use open standards. So there is a discussion of which projects exactly address each one of those components and what the right APIs are. But if you look at all those engines, they kind of follow the same stack. So this is the mental model we have internally. Like I said, Velox addresses the execution part, but we also have some other efforts on the language part, on the IR, and even on a kind of common optimizer.
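
The layering Pedro describes (language, IR, optimizer, execution) can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the mini query language, the `Scan` and `Filter` nodes, and the function names are invented for illustration and are not the API of Velox, Substrait, or any other project.

```python
from dataclasses import dataclass


# IR: a tiny, engine-neutral description of a computation (hypothetical nodes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    source: object
    column: str
    min_value: int


# Language layer: turn user input into IR.
def parse(query: str):
    """Parse the toy syntax 'table' or 'table where col >= n'."""
    parts = query.split()
    plan = Scan(parts[0])
    if len(parts) == 5 and parts[1] == "where" and parts[3] == ">=":
        plan = Filter(plan, parts[2], int(parts[4]))
    return plan


# Optimizer: IR in, IR out.
def optimize(plan):
    """Collapse stacked filters on the same column into a single filter."""
    if isinstance(plan, Filter):
        source = optimize(plan.source)
        if isinstance(source, Filter) and source.column == plan.column:
            bound = max(plan.min_value, source.min_value)
            return Filter(source.source, plan.column, bound)
        return Filter(source, plan.column, plan.min_value)
    return plan


# Execution: runs IR against data and knows nothing about the query language.
def execute(plan, tables):
    if isinstance(plan, Scan):
        return list(tables[plan.table])
    if isinstance(plan, Filter):
        rows = execute(plan.source, tables)
        return [row for row in rows if row[plan.column] >= plan.min_value]
    raise TypeError(f"unknown IR node: {plan!r}")
```

A call such as `execute(optimize(parse("events where id >= 5")), tables)` passes through all three layers, and any one of them could be replaced as long as it keeps producing or consuming the same IR nodes.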

Kostas Pardalis 24:42
Makes sense. Ryan, I also want to ask you something based on this mental model that Pedro described: where does Iceberg sit? But you also wanted to add something.

Ryan Blue 24:47
I was initially going to say that our experience creating Iceberg was largely like Wes’s, where we sort of backed into it by saying, how do we make people more efficient? How do we make these things work together without stomping on one another’s results and things like that? And what we ended up with was kind of like what Pedro was talking about. We said, hey, what are the existing concepts for this space that we should reuse? How do people expect it to work? And I actually really liked Pedro’s breakdown of the different layers, right? The language layer, the IR, the optimization layer, the execution layer, and then the, I guess, environment sort of layer. I forget what you called that one. And then underneath that, I think, is storage, and that’s where Iceberg fits in, which is weird and orthogonal. And another thing that interests me here is what is moving between layers. So, you know, security is traditionally done at that very top point, where you understand what the user is trying to do and have enough information to say whether or not they can do it. But if you have multiple different systems, right, if you’re talking about a streaming system, or maybe a Python process, or some SQL warehouse or other system, and you need all of those things to have coherent and similar policy, you actually have to move that down, right? You have to move it beneath all of those very different engines and actually into the storage layer. So composability is really causing a lot of change and friction in the ecosystem right now.

Kostas Pardalis 26:54
So would you say that, let’s say access controls is another component of the stack we are talking about,

Ryan Blue 27:01
I would probably add access controls. I think someone mentioned views, and a reusable IR or a view-type concept is definitely there. I would also add the catalog as another reusable component of the storage layer, and how we talk to catalogs. Everything has been talking the Hive Thrift protocol for so long that we really need something to replace it, and Iceberg is coming out with a REST protocol to try and do that. So there are a lot of fairly niche components, even within that storage layer.

Kostas Pardalis 27:45
Chris, I think you want to add something?

Chris Riccomini 27:47
Yeah, I wanted to add another one on top of the stuff that Ryan’s been talking about. This is something I’ve spent a lot of time thinking about, and that’s the data model, and really, you know, data description. One thing I didn’t mention in the introduction is that I have an open source project I’ve been hacking around on for a year, trying to unify the nearline, offline, and online data models. So this is thinking about, you know, how do you represent an integer? How do you represent a date? And that’s something that goes all the way down to the Parquet layer. Parquet has kind of punted on what the data model should be. It’s very simplistic, and then there’s logical stuff. And then on top of that, you start compounding things all the way up to the language layer. So it kind of runs the gamut. And it’s something that I don’t think we’ve fully nailed. It’d be nice to just say, oh, we’re going to use Postgres, or we’re just going to use Hive, or we’re just going to use DuckDB’s format, or whatever. But inevitably, what we seem to end up with is a lot of coercion and munging. I think Arrow has wrestled with this a lot. I was talking before we started recording about their Schema Flatbuffer, which is a really good reference if you want to look at an attempt to model what data looks like across a bunch of different systems. It’s super non-trivial. So that’s another one I’d like to throw in there that I would love to see more progress on. I’ll stop there.
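
A small sketch of the kind of coercion Chris is describing, assuming a neutral integer type that has to be mapped onto two target systems with different native widths. The systems, their limits, and the `coerce` function are hypothetical and only illustrate where a mapping stops being lossless.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntType:
    """Neutral description of an integer column (hypothetical)."""
    bits: int
    signed: bool = True


WIDTHS = (8, 16, 32, 64)

# Illustrative limits only; not taken from any real system's documentation.
TARGET_LIMITS = {
    "system_a": {"max_bits": 64, "unsigned": False},
    "system_b": {"max_bits": 32, "unsigned": True},
}


def coerce(int_type, target):
    """Map a neutral integer onto a target system.

    Returns (target_type, lossless), where lossless is False when some
    source values cannot be stored in the chosen target type.
    """
    limits = TARGET_LIMITS[target]
    keep_unsigned = (not int_type.signed) and limits["unsigned"]
    # An unsigned value stored in a signed type needs one extra bit.
    if int_type.signed or keep_unsigned:
        needed = int_type.bits
    else:
        needed = int_type.bits + 1
    fitting = [w for w in WIDTHS if needed <= w <= limits["max_bits"]]
    if fitting:
        return IntType(fitting[0], signed=not keep_unsigned), True
    return IntType(limits["max_bits"], signed=not keep_unsigned), False
```

In this toy model an unsigned 32-bit column widens safely to a signed 64-bit column in `system_a`, while an unsigned 64-bit column has nowhere lossless to go, which is the sort of edge that a shared data model has to decide on explicitly.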

Kostas Pardalis 29:08
That makes sense. Pedro, you also want to add something.

Pedro Pedreira 29:12
Yeah, no, I think, to the points that Chris and Ryan just raised, our current model at least is that language and execution should be decoupled, and they should communicate via one API. This API would probably be the IR. That essentially means that anything related to the data model, like SQL or non-SQL, all of that should be resolved in the language layer, in addition to some metadata operations, security, resolving ACLs, checking whether users can access particular columns. All of that should be encapsulated in a way that is orthogonal to execution, and they should communicate via an IR. So even if you want to express graph computation, then this IR should have nodes that express general graph execution primitives, and all of that. So in a way, all those things should be decoupled from execution, and execution should only take this IR as input. Of course there are, I think, security details, like you need to carry a security token to make sure that you can actually pull the data from storage. But all the logic of checking whether people have access to columns, resolving ACLs, privacy checks, all that stuff should be decoupled from the IR. It doesn’t mean that it necessarily should be part of the SQL parser or language library. It could be that you have many language libraries that generate an IR, and then there’s some processing that happens on this IR, so that those ACL checks and privacy checks can be decoupled, can be orthogonal to whether users are expressing things using SQL or non-SQL. But we see all of that as being orthogonal to execution. So execution should just assume that the computation it is asked to execute has already been checked and is safe: let me actually go and execute it.
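
One way to picture this decoupling is a check that runs as a pass over the IR, so it behaves the same regardless of which front end produced the plan. The sketch below assumes a single hypothetical `Project` node, two toy front ends, and a simple grants table; none of these names come from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    """Hypothetical IR node: read the given columns from a table."""
    table: str
    columns: tuple


class AccessDenied(Exception):
    pass


def from_sql_like(text):
    """Toy front end for the syntax 'select a,b from t'."""
    _, cols, _, table = text.split()
    return Project(table, tuple(cols.split(",")))


def from_dataframe_like(table, *columns):
    """Toy front end for a dataframe-style call."""
    return Project(table, tuple(columns))


def check_access(plan, user, grants):
    """IR pass: return the plan unchanged if every column is granted."""
    allowed = grants.get((user, plan.table), set())
    denied = [c for c in plan.columns if c not in allowed]
    if denied:
        raise AccessDenied(f"{user} may not read {plan.table}.{denied}")
    return plan


def execute(plan, tables):
    """Execution trusts that the plan it receives was already checked."""
    return [{c: row[c] for c in plan.columns} for row in tables[plan.table]]
```

Both front ends produce the same `Project` node, so `check_access` never needs to know whether the request started as SQL-like text or as a dataframe-style call, and `execute` never needs to know about grants at all.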

Kostas Pardalis 30:55
Chris, please go on. I think you want to make a comment here.

Chris Riccomini 31:00
Oh, I actually just wanted to ask, and I think Pedro would be the best person: can you define what you mean by IR? I think that’s something that maybe not everybody intuitively knows if they’ve not been, you know, knee deep in databases for a long time.

Pedro Pedreira 31:13
Yeah, no, I think that’s a good point. IR, intermediate representation, is a term that we sort of borrowed from compilers. Essentially, it’s this idea of having some intermediate data structure that can represent your computations in a way that you can execute without, you know, ambiguity. In most query engines, that basically means the physical query plan, but we just call this IR, because that’s kind of the term used in compilers for the same idea of decoupling front end and back end.
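
A minimal sketch of the idea Pedro describes here, under stated assumptions: the node types (`Scan`, `Filter`), the `check_access` pass, and the `execute` function are all hypothetical names made up for illustration and are not taken from Substrait, Velox, or any real engine. The point is only that an access check can run over the IR itself, and that the execution layer sees nothing but the IR.

```python
from dataclasses import dataclass


# Hypothetical IR node types, invented for this sketch.
@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    min_value: int  # keep rows where row[column] >= min_value


def columns_used(plan):
    """Collect every column the plan reads."""
    if isinstance(plan, Scan):
        return set(plan.columns)
    return columns_used(plan.child) | {plan.column}


def check_access(plan, allowed_columns):
    """ACL pass over the IR; it does not care which front end produced the plan."""
    denied = columns_used(plan) - set(allowed_columns)
    if denied:
        raise PermissionError(f"no access to columns: {sorted(denied)}")
    return plan


def execute(plan, tables):
    """Execution layer: takes only the IR as input, never query text or user identity."""
    if isinstance(plan, Scan):
        return [{c: row[c] for c in plan.columns} for row in tables[plan.table]]
    rows = execute(plan.child, tables)
    return [r for r in rows if r[plan.column] >= plan.min_value]


# Any front end (SQL, a dataframe API, ...) could have produced this plan.
tables = {"events": [{"user": "a", "n": 1, "secret": 9},
                     {"user": "b", "n": 5, "secret": 7}]}
plan = Filter(Scan("events", ("user", "n")), "n", 3)
result = execute(check_access(plan, {"user", "n"}), tables)
```

In this sketch a plan that touches the `secret` column is rejected by `check_access` before `execute` ever sees it, which is the separation of concerns being argued for.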

Ryan Blue 31:42
So I completely agree. I think IR, like Substrait or similar projects, is one area that I’m most excited about, because it is super useful, right? Being able to exchange query plans basically gives you views. Being able to, you know, pass off something from any language, whether that is SQL or hopefully something eventually better, you know, that is all really cool. But I think one aspect that I want to bring in here is that it’s probably not enough. There’s always going to be that guy who doesn’t want to use a DataFrame API in Python to do his processing; he wants to jump into Python code. And people have attempted this with, like, taking Java bytecode and translating it into SQL operators, and it’s a gigantic mess. So you have to have either some willingness to use a language that produces IR, or the rest of the components in the stack actually need to support everything with an even stronger orthogonality. So I think that, when it comes to building at least a storage layer, the storage layer doesn’t actually get to assume that you’re going to use IR and an optimizer or any particular execution, right? We need to be able to secure the data no matter what you’re using, we need to, you know, be able to give you that data no matter what you’re using, and have very well defined protocols and standards at that level at least.

Kostas Pardalis 33:32
Ryan, one question though, here, and then I’ll give the microphone to Wes, because I think he can also add a lot to what you mentioned. I mean, I understand what you’re saying about storage and the concerns that come from that. But there is one thing, and that connects with what Chris was also talking about at some point, the type systems and the data types and the model itself, which in the end goes down to storage too, right? This data somehow needs to be represented, and it has to be possible to serialize and deserialize whatever types you have there. And this is something that goes through, let’s say, all the different components. So how do you deal with that? Because what I hear so far, and sorry for that, is that storage says, oh, well, we can stay away from that. Pedro says, oh, these things will be resolved by, you know, the front end parts, like the parser and whoever generates the IR. It’s like a hot potato that you throw to someone else, and at some point we have to deal with it. So there was a reason that this whole thing was a monolith, right? And I think that’s what we’re seeing here: these APIs, in the end, and communicating with the openness that we want to have, are not that easy.

Ryan Blue 34:59
I think that goes to Chris’s point, right, which is, if Iceberg and Arrow and our IR all used similar type systems, then we would be a whole lot better off. I do not doubt that if Wes and I had agreed 10 years ago on the set of types that we would support, it would be a whole lot easier. And that’s why Substrait, when they started that project, took a look at all the type systems out there and said, we’re only going to allow types in if they’re supported in, like, two or more large open source projects. I think one of them was Arrow, one of them was Iceberg, you know, I think Spark and some others. So that is definitely a problem where we could use a more coherent standard. But let me also explain and argue for the, you know, fracturing here. It is the way it is because there’s a huge trade-off, and dealing with that trade-off is why we have so many different approaches. I think Arrow takes the side of the trade-off to be more expressive, and says, hey, if you want to use, you know, 2, 4, or 8 bytes for that type, you can go ahead and do that. Whereas on the Iceberg side, we’re trying to keep the spec very small to, you know, make it easy for people to implement. And there’s just a fundamental trade-off there, and you’ve got to strike the right balance. Anyway, sorry, I’ve talked for a long time.

Kostas Pardalis 36:40
Wes, you wanted to add something?

Wes McKinney 36:44
Yeah. I mean, on this, you know, discussion of IRs and the relationship between the front end of a data system and the back end, I think one of the earliest and, you know, probably most successful systems in the composable data stack is Apache Calcite, which was created by Julian Hyde. The idea was, it’s, you know, the database front end as a Java library. So it does SQL parsing, query optimization, you know, query planning, and it can emit an optimized query plan on the other end, which can be used for physical execution, or a logical query plan that can be turned into a physical plan and then executed. I think that Calcite was really important in terms of socializing this idea of a modular front end that can take responsibility for those parts of building a database system or a data warehouse that are, you know, traditionally something that system developers want to have a lot of control over. I think Substrait is interesting because it provides for standardizing on what that thing is that your parser, optimizer, and query planner emit. And that’s something that historically was not standardized. So when people wanted to use Calcite, they’d have a bunch of Java code using Calcite, and then they’d have, you know, a bridge between Calcite and their system to go from the logical query plan into execution. And so it’s definitely been a journey to get to where we are now. And obviously, down in the weeds, we have the issue of the data types and trying to agree on, you know, all the data types that we’re going to support at all these different layers, which creates a lot of complexity.

Kostas Pardalis 38:45
Yeah, before I give the microphone to Pedro, because he wants to add something here, I do have to ask the two of you, Wes and Ryan: why couldn’t you agree on the types 10 years ago?

Ryan Blue 39:05
10 years ago, we were still screwing up timestamps. I think I was correct.

Pedro Pedreira 39:15
The answer to that is because there was a lack of composability, which means that people are implementing the same thing over and over in slightly different ways.

Kostas Pardalis 39:23
That makes sense. Pedro, you want to add something? Please go ahead.

Pedro Pedreira 39:27
Yeah, no, just adding to the conversation about data types. I think it is complicated because we are talking about different things, right? There are many different levels of data types. I think there are at least three discussions. There are the storage data types, which are the things that the storage and the file format actually understand, usually things like integers, floats, strings, and Booleans, really primitive data types. Then there are kind of logical data types that the execution can understand, which are things like maybe timestamps, though sometimes the storage also understands timestamps. So there are logical data types that the execution can understand. But there may also be user-defined data types, which are kind of higher level types that users can define. I think some examples are things like, sometimes when people are defining IDs, they don’t want them to be just integers. They actually want a higher level data type that just maps to an integer but adds some level of semantics, right? So I think there are different levels, and those different levels have different trade-offs. Some of them are easier to extend, some of them are more efficient than others. But I think that’s why we have this model where some things should be resolved in the language. Things like resolving user IDs into integers, those are things that, again, should be resolved in the language and should be transparent to the execution. Then there are the types that the execution should understand, so we can efficiently process those things. Things like, for example, defining functions that are based on those types only work if the execution understands those types. And then there are also types that need to be understood by the storage, which is a lot more about storage efficiency and size, and is related to encoding. So there are kind of different levels.
So it depends on which types exactly we’re talking about, but for anything related to a more logical type or the data model, again, I would say that all those things should be resolved on the language layer and then just captured in the IR.
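
The three levels Pedro lists can be sketched roughly as follows. This is an illustrative assumption, not how any particular engine does it: the user-defined names (`UserId`, `EventTime`), the logical type names, and the `lower_schema` helper are all hypothetical, and the storage names are only loosely modeled on primitive file-format types.

```python
# Level 3 (user-defined) -> level 2 (logical); resolved in the language layer.
# Hypothetical type names, for illustration only.
USER_DEFINED = {"UserId": "int64", "EventTime": "timestamp_micros"}

# Level 2 (logical, understood by execution) -> level 1 (storage primitive).
LOGICAL_TO_STORAGE = {
    "int64": "INT64",
    "timestamp_micros": "INT64",
    "float64": "DOUBLE",
    "string": "BYTE_ARRAY",
}


def resolve_to_logical(type_name):
    """Language layer: strip user-defined semantics down to a logical type."""
    return USER_DEFINED.get(type_name, type_name)


def lower_schema(schema):
    """Map each column to (logical type, storage type); reject unknown types."""
    lowered = {}
    for column, type_name in schema.items():
        logical = resolve_to_logical(type_name)
        if logical not in LOGICAL_TO_STORAGE:
            raise TypeError(f"unknown type for column {column!r}: {type_name}")
        lowered[column] = (logical, LOGICAL_TO_STORAGE[logical])
    return lowered


lowered = lower_schema({"id": "UserId", "ts": "EventTime", "name": "string"})
```

Here execution would only ever see `int64` for the `id` column, while storage sees a plain 64-bit integer for both `id` and `ts`; the `UserId` semantics stay in the language layer.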

Kostas Pardalis 41:13
Okay, that makes sense. One last question here about types, and then I’ll shut up about types, I promise. I want to ask Chris, because Chris can make this connection with the part we’re talking about, which is the database systems, but there are also applications out there, right? There are application developers, there are people out there who generate the data that we store and then process, and in many cases these people don’t necessarily have or need to understand what’s going on with the data processing infrastructure that we have. But somehow they’re feeding us with the data. So Chris, when it comes to the type systems that we are talking about, or the formats that we are using to represent the information and move it around, from your perspective, is there something else that has to be added to, let’s say, solve this problem end to end?

Chris Riccomini 42:19
Something else to be added? I think, in my mind, and sort of my intention with Recap, was to have something like Substrait or an IR, but more specific to the metadata layer, which is a way to describe, in the abstract, the data that is flowing across the online, nearline, and offline worlds. That would account for a large amount of the coercion that we see now. I think there’s always going to be a little bit of coercion, because to not have type coercion, you essentially need something that looks a lot more like CUE, which is this sort of very academic language project that is essentially not usable for the average engineer. It’s just too complicated. And so, to Ryan’s point about the complexity around all this stuff, you can’t make it too complicated for these application developers to use. And then as soon as you try and make it a little more simple, you end up with some form of coercion. But I think the thing that I would like is some common way to describe the datasets across the different stacks and layers. The closest instantiation of this that I’ve seen, aside from Recap, is actually what Arrow does. They essentially have two different layers. One is the schema FlatBuffer layer that describes, the way Pedro was talking about, a very specific, like, here are the bytes: you can have a float, and a float can be 16, 32, 64, 128, right? But most developers don’t want to say something like float 16, 32, or whatever. So what ends up happening on top of that is an actual implementation that gives you Decimal128 as an actual type. And so there are sort of two tiers to it. But, you know, for better or for worse, that schema stuff is mostly wrapped up in and used by Arrow. And I would like to take that out of Arrow and use it across all the systems, so that you can sort of portably move the data description around from one vertical to the next. That’s sort of the area that I would like to see improved.

Kostas Pardalis 44:16
Ryan, you want to add something here?

Ryan Blue 44:20
Chris’s description here just triggered something in my head, which is I have a little soapbox about losing structure of data and constantly coercing types. And I really think that one of the promising factors or you know, promising aspects of composable data systems is like stop losing structure, you know, share tables instead of sharing CSV, right? It’s always pretty ridiculous that I’m dropping CSV or JSON, right? I’m destroying the structure of my data set in order to push it over to you. So like, I think that it will actually get better. Hopefully, when we have, you know, more ability to share the actual representation and do so securely, and some of these other things mature. But I also entirely agree that we need to get to the point where we have some idea of a format that can actually do that exchange as well.
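
Ryan's point about losing structure can be seen in a small sketch using only the Python standard library. The column names and the `SCHEMA` mapping are hypothetical; the takeaway is that after a CSV round trip every value is a string, and the receiver has to know (or guess) the schema out of band to coerce it back.

```python
import csv
import io
from datetime import date

# A typed record on the sending side (illustrative data).
rows = [{"id": 7, "signup": date(2024, 1, 31), "score": 1.5}]

# Push it through CSV, as if sharing a file with another team.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "signup", "score"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)

# The receiving side gets back strings only; the types are gone.
round_tripped = list(csv.DictReader(buf))

# To recover the structure, the receiver needs a schema from somewhere else.
# SCHEMA is a hypothetical out-of-band agreement, not part of the CSV.
SCHEMA = {"id": int, "signup": date.fromisoformat, "score": float}
restored = [{k: SCHEMA[k](v) for k, v in row.items()} for row in round_tripped]
```

Sharing a table with its schema attached avoids the `SCHEMA` step entirely, which is the "stop losing structure" argument in code form.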

Kostas Pardalis 45:28
Okay, I mean, enough on types from me.

Eric Dodds 45:32
I’ve monopolized the conversation. We’ve been so good. So good. Yeah, I think, you know, just hearing the conversation has been so fun, because one thing I’ve heard multiple times is, I mean, we’ve come so far, and then also, like, well, yeah, you know, that is a problem. And I think the common thread through all of that seems to be this desire for open standards. And there are different areas of the composable stack, you know, where that seems to be a big need. So I’d just love to hear where we have come in the last several years as far as open standards go, and then, you know, what are the frontiers that are most important? Wes, maybe we can start with you. Give us a brief history, sort of your view of where we’ve come.

Kostas Pardalis 46:21
Yeah.

Wes McKinney 46:22
I think, I mean, you know, the main thing that came out of the Hadoop ecosystem was open standards for file formats. So basically, the foundations of what, you know, we now call open data lakes. We ended up with, of course, multiple competing standards, so Parquet files and ORC files, and some other open standards like Avro. For developing, you know, RPC, kind of client-server protocols, you have things like Thrift and Protobuf, which have been widely adopted for building client-server interfaces. I think in the last 10 years, moving up the stack into in-memory data transfer, like, you know, in-process memory transfer or inter-process memory transfer with Arrow, that was a hard-won battle, but it’s been great to see that achieve, you know, wide adoption. I think interoperability and open standards for computing, at more of the computing layer, is more of an emerging thing that’s starting to happen, where historically there really wasn’t very much. And so I think we’ve gone from an era of these limited open standards for data interchange and data storage to starting to think about more of the runtime, you know, what happens inside of processes, rather than just how we, you know, store data at rest or move data on the wire.

Eric Dodds 48:09
Makes total sense. Any other thoughts from the rest of the panel?

Pedro Pedreira 48:15
Yeah, I think maybe just adding to what Wes said. Again, my mental model is that there are two things. One is defining what the APIs and standards are, and the second thing is actually having implementations for those things, right? And they don’t really go hand in hand. Sometimes there is no standard, and sometimes there is a standard but there are multiple implementations and they’re not compatible, or the opposite might also be true. But to your question of what the open standards and APIs are, if we follow the model I presented and go around the stack, starting from the storage layer, usually the storage layer is just some sort of block API, right? So this is, you know, already pretty well understood, usually having some notion of file handle, offset, and bytes. So this is just how you pull blocks. And then there is this idea of how you interpret what those blobs of data mean, right? And then, like Wes mentioned, I think Parquet and ORC are kind of well understood, even though the implementations are all over the place. You can have many Parquet reader and writer implementations, and they’re not necessarily compatible, but there is a standard. So you start from decoding those things, then there’s how you represent those things in memory, which is what I think Apache Arrow defines really well. So if you need to represent this columnar dataset in memory, how do you lay out this thing in memory? I think Arrow solves that problem. Then there’s the question of, if you need to process this data and apply functions, apply, you know, different operators, what are the relational semantics you follow? We really think that is another discussion. So if you look at Spark, Spark has a certain semantic which kind of loosely follows ANSI SQL. Presto has a different semantic, and MySQL, Postgres, you name it; there are probably like 50 different semantics.
None of them are compatible with each other. They all sort of look the same, they have similar functions, but they’re never compatible. So there’s this idea of what the standard is for the semantics of your operations and the functions that you call. Then if you go up another layer, there’s a discussion on how you represent this computation. So how do you express that, well, you need to scan this data, then you need to sort it, and you need to apply a filter, and then you join, and you need to shuffle this? This is what Substrait is meant to do: essentially having an open standard for representing this computation. Then if you go up, there is a discussion on what the APIs are, how users represent this computation, right? Which is where we have SQL, which is probably the worst standard of all of those. It’s very loosely defined, there are just so many different implementations, and they’re never compatible with each other. There’s also the discussion of non-SQL APIs. So we have pandas, you have PySpark, they’re all non-SQL, but still, there is no standard. So let’s say that this is probably, you know, the highest level. On top of that, there might be even higher levels of how developers interact with those things. Maybe you can have an ORM, or a different sort of abstraction that actually maps into a non-SQL API or maps to IR. So there might be some other APIs on top of that, but I don’t think there are any real industry standards on those. So at least that’s my mental model. If you go across the stack, for some of those areas there exist some standards, but they are not as strongly defined as we would like them to be.

Eric Dodds 51:36
Yeah, would you say, I mean, it kind of sounds like, roughly, not perfectly, but, you know, sort of from bottom to top as you described it, it is sort of a sliding scale of maturity, right? Like the stuff at the top is, you know, sort of least representative of open standards. Would you say that’s sort of generally true? Or what do you think about that?

Pedro Pedreira 51:55
I think not necessarily. I think it depends on which projects have won the battle of people actually adopting them in practice. I’m not sure if there’s a correlation with how deep or how high up in the hierarchy they are. And sorry, there’s one that Ryan obviously mentioned and that I forgot to mention, which is a table API, right, which sits somewhere between the storage layer and processing. I think that’s a big one, which is why we have Iceberg. But you also have Hudi, we also have Delta Lake, we have Meta Lake inside Meta. So that’s another example of, you know, there is an open standard, but it’s not, you know, 100% adopted everywhere, which is also, I mean, good, but not great.

Chris Riccomini 52:34
Yeah, I was just going to answer, I think, your question on where things are the most mature, and I really think it has a lot to do with who wins in a given space, right? I think you look at what drives a lot of things, like the Arrow API, or data frames, or whatever, and everything is having to integrate with that, right? And so I think, as there is, you know, theoretically one winner, we can all dream of one winner in the storage layer, then that will, you know, sort of solidify what that protocol is going to look like. But the more people you have competing, the more chaotic it is, and the harder it is. What really drives a lot of the APIs is actually sort of organic: whoever wins gets to say what the API looks like. So in that regard, I don’t think it’s bottom up the way you describe. I think it’s kind of middle out. I think the API layer in the middle is really driving a lot and forcing people to just fit into that, a lot more than

Eric Dodds 53:38
Yep, it makes total sense. One, one thing I’d love to discuss is we kind of already got into data types, you know, which is certainly, you know, a trade off, I would say, when you think about composability, as compared with, you know, sort of a singular monolithic system. What are some of the other trade offs? Like? How would you define some of the other trade offs of that composable system? And Pedro, maybe we can start with you?

Kostas Pardalis 54:04
I think that’s interesting, right?

Pedro Pedreira 54:07
I think some of the discussions we usually have with people working in this space are about what we mentioned before: in a lot of cases, it’s harder to build a composable system than it is to build a monolith, right? So sometimes, if you’re just optimizing for speed, and you just want to build a prototype and have an MVP with a customer running on top of it as fast as you can, it’s easier to just prototype and create a new monolithic system. Where we think this fails is that it is usually easy to create a first version. So you can run a very simple query that supports this workload, but then a few months from now, you need to support a new operator and you need a new function. And then as this thing grows, I think it kind of slows down, and in the long run it just doesn’t pay off, right? But it’s a lot harder when you start something to ask, should we actually spend a few months engaging with the Arrow community, understanding how Arrow works or understanding how Velox works? And if you need to make any changes, you need to engage with the community. It’s a lot easier, in a lot of ways, to just kind of fork all those things and make sure you can move a lot faster. So I think this is one of the obvious trade-offs. There’s also this kind of bias developers have, and it’s something we elaborate on in the paper as well, that we like to think that we can do things better, right? So I’m not going to use Arrow, because if I need to create this, I’m probably going to write it better. We see this pattern over and over, especially with more experienced engineers: people who don’t want to reuse something because they think they can just do it better. And in some cases they can, but in a lot of cases it’s also not true. There’s also this part where people prefer to write their own code rather than understand other people’s code, right?
So instead of, again, spending a month understanding Velox, I’m just going to go and create something that I fully understand in a few weeks, and then six months later, when you leave the team, the next engineer has the same problem. I think there are a lot of those kinds of fallacies that we hear over and over. Some of them are kind of fair, like time to market. I do feel like that’s true, but you usually end up paying the price in the long run. Again, like I mentioned, we elaborate on some of that in the paper as the reasons why composability hasn’t happened before. It’s just because there are a lot of those internal biases that engineers have, and some of them are kind of driven by business needs.

Kostas Pardalis 56:26
I’m sorry, Eric, I want to ask Pedro something here, because you mentioned why composability didn’t happen earlier. And actually, this is a question that I have about data management systems, because data management systems are very complex, but they are not the only complex systems we have out there, right? We have operating systems, and composability in operating systems has been there, I think, for a very long time, right? Same thing also, in a little bit of a different way, with compilers. We see the separation between having the front end, and LLVM, and other systems on the back end, all these things. But why, in database systems, did it take us that long to get to the point of appreciating and actually implementing this composability? And this is a question to all of you, obviously, because all of you have very extensive experience with that. So please.

Ryan Blue 57:19
I think that part of this comes down to, you know, commercial interests, which I think is a big part of the data industry, right? At least, where we sit at the storage layer. Storage provides opportunities that the execution layer can take advantage of, for better performance. And if you can control the opportunities, and you control the execution layer, like you can make something that just fits together really nicely and has excellent performance. And at least so far in the storage world, it has not been a thing to get your storage from another vendor. Like that is, you know, a really weird thing that is happening now. That Databricks and Snowflake or you know, choose your other vendors Redshift can share the same datasets. And it comes down to who controls that data set, who controls the opportunities that are presented to the other execution layers, like the world gets really weird in this case. And I think that, you know, part of it is just how we’ve historically architected these systems, right? I think of the Hadoop sort of experiment as this Cambrian explosion of, you know, data projects that questioned the orthodoxy of how we were building data warehouses, and that led to, you know, pretty primitive separation of compute and storage that we then, you know, matured and eventually got to this point where, yeah, you can, using projects like Iceberg, you can safely share the storage underneath these execution engines. And that’s what is really, like, pretty weird right now. But all throughout our history, we have not been able to actually share storage, share it reliably, share it with high performance, and things like that. So I think that, you know, the business model of all those companies that have never had to share storage and are built around like, hey, we sit on all your data and you know, lease it back to you for compute dollars. You know, that has been a very powerful driver in the other direction.

Kostas Pardalis 59:50
That makes sense. Pedro, you want to add something?

Pedro Pedreira 59:52
Yeah, I think maybe adding to what Ryan said and addressing your question of why we think composability is so important for data systems, but why it isn’t such a big thing for compilers and operating systems, for example. I see maybe a lot of that as just driven by variety, right? So how many widely used C++ compilers can you name? And how many operating systems can you name? And how many data systems can you name? So there is a lot more, I would say, even wasted engineering effort in redesigning and redeveloping those things for data systems than there is in operating systems and compilers. I think a lot of that is just because the APIs of data systems are actually user facing, right? Users interact directly with databases, so users’ needs and requirements are evolving a lot faster than the requirements for operating systems and compilers. The APIs of those systems are a lot more stable; they don’t evolve as fast. So I think those systems are also a lot more mature, right? There’s maybe less incentive to make them composable, because there are only a handful of implementations of those. But data systems are a place where you literally have hundreds, probably thousands of different engines that all have some degree of repetition. So I think there’s a lot more incentive to say, okay, let’s actually see what the right libraries are that we can use to accelerate those engines, especially because workloads have been evolving, and they’re going to continue evolving in the future, right? The workloads we had 10 years ago are very different from the ones we have today, and they’re probably going to be very different from the workloads we’re going to have five years from now. So it’s not just about making the engines we have more efficient; we need to be more efficient as a community in how we adapt data systems as user workloads evolve.

Kostas Pardalis 1:01:38
So, Eric, back to you, and sorry for, like, interrupting again.

Eric Dodds 1:01:42
Oh, yeah. Well, actually, we’re fairly close to the buzzer here. We’ve got a couple more minutes. And so one thing that I would love to hear from each of you is what new projects you’re most excited about. You know, we’ve talked a lot about the history of open standards and projects that, you know, have pretty wide adoption. But I’d just love to know what excites you as you sort of look at the newest stuff out there. So, Ryan, why don’t we start with you.

Ryan Blue 1:02:17
I’m pretty excited about IR and projects like Substrait. I’m also excited about some of the other, you know, more composable or newer APIs in this layer. Iceberg just added views, and we are also standardizing catalog interaction through a REST protocol, you know, really trying to make sure that everything that goes into that storage layer has an open standard and spec around it. And I think that is going to really open up not just composable things, but things towards the modular end of the spectrum, where stuff just fits together nicely, and you can take any database product off the shelf and say, hey, my data is over here, go talk to it. So I’m pretty excited about how easy it will be to build not just the systems, but actual data architectures based on, you know, these ideas.

Eric Dodds 1:03:21
Very cool. Chris, you’re next on my screen.

Chris Riccomini 1:03:26
Yeah, I think I’m going to call out one that is sort of obscure. There’s a fellow over at LinkedIn working on a project called Hoptimator, sort of playing around in this space. They essentially use Calcite to build a query engine that can query across a bunch of different systems and kind of create this single view of streaming queries and also data warehousing queries. And they’ve done some really interesting experiments to plug it into Kubernetes and stuff. So that’s something that I think is really fascinating. It essentially uses Kubernetes as a metadata layer, and then it has a JDBC implementation that will allow you to basically do a join between Postgres and Materialize, or between, you know, data that’s in Iceberg and data that’s in MySQL. It’s really an experimental project, and it’s super interesting. And the guy that’s working on it, Ryanne Dolan, really has a lot of interesting ideas. So that’s the one I want to call out.

Eric Dodds 1:04:33
Very cool. All right, Kostas. You’re next on the screen. Did you think I was gonna ask you? Oh, me? Yeah.

Kostas Pardalis 1:04:44
Oh, that’s a good question. I’ll probably mention something that is not so much in the context we’ve been talking about, but more of an extended context. I’m very excited about virtualization technologies, like gVisor, for example, and Firecracker, and how these can change the way that we build these systems. One of the really hard problems, in my opinion, when you build solutions and products in that space is how you build multi-tenancy and how you deliver it in a way that, as a vendor, you can build margins, and at the same time provide the isolation guarantees that people need for their data. So this interaction between virtualization and how you build systems on top of it is something that I find extremely interesting. Some companies, like Modal, for example, are experimenting with and using gVisor. It’s very interesting. Anything that takes the data systems and delivers them to the users out there in ways that are, let’s say, a little bit closer to what we’ve seen with applications is super, super interesting, I think. And we see it a little bit more in the OLTP space, like the Neon database and vendors like these that are a little bit more ahead of the curve, let’s say, compared to the OLAP systems. But I think there’s a lot of opportunity for the OLAP systems also to exploit these technologies and build some amazing stuff. That’s what I would say.

Eric Dodds 1:06:36
All right, Pedro.

Pedro Pedreira 1:06:38
Yeah, no, I think there are a lot of open source projects that come to mind. But I think specifically in this composability area, I would say, both personally and just to make a quick plug here: Velox. I think that’s the project closest to my heart. We’re making really good progress. We’re getting to a point where more than 20 different companies are engaging with us and helping develop it, and we have more than 200 developers. So we’re making some really quick progress. It’s integrated into Presto, integrated into Spark, and we’re seeing, like, two to three x efficiency wins on those systems, which is huge, right? So this is the project I’m very closely involved in, so it’s super close to my heart. We are also working with hardware vendors to add support for hardware acceleration in kind of a transparent manner. So I do feel like it’s going to become even more popular than it is today. There’s also a discussion on file formats, right? I think that today Apache Parquet is probably the biggest one, but I think there’s a consensus in the community that it’s already getting closer to the end of its life. So there’s a discussion of what’s next, what comes after Parquet, what this format looks like, and how we actually, you know, create this format and create a community around it. So I would say that we’re probably going to see more projects specifically in this area of file formats soon. I think, going up the stack, Substrait was something that was super interesting. Actually having a standardized way of expressing computation across engines was, I think, a super interesting proposition, even though I think the actual adoption in existing systems has been a little slower. I think there’s also a discussion that, well, from a business perspective, why would you invest in using Substrait instead of your own thing? 
Like, what is the value? I think for Velox, there was a clear value of actually making your system cheaper, right? So I think for Substrait, maybe there’s still discussion on how exactly we frame this, and how exactly we increase, you know, the value of this project for larger companies. But I think in general, like a lot of other projects, they are super interesting to me.

Eric Dodds 1:08:42
And Wes, bring us home. Yeah.

Wes McKinney 1:08:45
I mean, like Pedro, I’m really excited about, you know, the progress in modular, composable execution engines, you know, Velox and DuckDB being two prime examples. Another one, DataFusion, is a Rust-based query engine in the Arrow ecosystem. And so in my company, Voltron Data, you know, basically we’re building a system, maybe inspired by the name of the company, to be able to take advantage of the best-of-breed modular execution engines. And so I think we’ll see more and more systems built to take advantage of this trend as time goes on. Rather than, you know, building your own execution engine, you can just use the best available engines for your particular hardware, you know, SIMD architecture, or whatever characteristics of your data system. Another thing that I’m really excited about, real quick, is something we haven’t really talked much about, which is the language API or the language front end for interacting with these systems. I think there’s kind of an awakening, and for a long time people have been awake to it, that SQL is kind of awful. And so there’s a bit of a movement to define new query interfaces that do away with some of the awfulness of SQL, but also can hide the complexity of supporting different dialects. So a couple of projects there are pretty interesting. Malloy, created at Google by the former Looker people. There’s another project called PRQL, “prequel.” That’s very cool. You know, I created a project called Ibis for Python, which is like a DataFrame API for databases that generates many different SQL dialects, you know, under the hood. There’s a pretty good-sized team working on Ibis now. Thomas Neumann and Viktor Leis from TU Munich have a new project called SaneQL and a paper at CIDR 2024 discussing the many ways in which SQL is awful and proposing a new way of writing relational queries. 
So I think, you know, we’ve been very focused on building these, you know, viable systems for solving these problems. And I think we’ll be able to move on to more like, hey, well, how do we make people even more productive? And that includes all the way down to the code that they’re writing to interact with these systems. User productivity is going to become an increasingly important focus area, I think, for the coming, you know, few years.

Eric Dodds 1:11:22
Yeah, absolutely. Ryan had mentioned that earlier. And I was like, Ooh, should I jump in and go down that rabbit hole?

Kostas Pardalis 1:11:30
One question about the things you all mentioned. We haven’t heard from anyone about anything that is happening when it comes to optimizers, right, which are, like, big parts of these systems. At the end, that’s the part that actually takes the computations, optimizes them, right, makes them efficient, and does all the magic there. Is there anything interesting happening in that space? Or would you like to see anything happening there?

Pedro Pedreira 1:12:00
I’ll go. I think, for me, in our team, we have been deeply discussing this. Basically, using the same ideas of having more modular execution, can we have a more modular optimizer? I think the first reaction that people have is that, well, it’s unthinkable, the optimizer is very specific to how the engine works. But if you actually stop and think about this, there are ways to do it. So we actually have someone on the team prototyping some of those ideas. I think where this is stuck is basically on prioritization, right? Like, how much value does that provide, and why would people invest in this versus other things right now? So I think that’s where things are. I think from a more, like, academic and scientific perspective, it’s super interesting. It’s something that I would love to spend some more time on, but I see maybe less business value in investing in this versus investing in things like, you know, common table formats, faster execution, and better languages. So I do feel like this is going to happen at some point. And as we started looking at this space, we saw there were many other partners who were interested in this. I just think it’s more a matter of what the incentives are.

Kostas Pardalis 1:13:06
Makes sense. Anyone else who would like to add something about optimizers here?

Ryan Blue 1:13:11
I think it depends on what type of optimizer, right? Cost-based optimization is definitely more tied to the engine. We could probably share a whole bunch of rule-based optimizations and things like that. And I actually think that sort of work is going to happen as we coalesce around an intermediate representation, right? Like, someone’s going to write a good optimizer for Substrait at some point, or we’ll be able to translate to and from it easily. And, you know, we’ll just have that ability, I think. But then it’s all those designs around, well, now I need to incorporate how the engines actually act and what the real costs are. That gets pretty hairy. Yeah.

Pedro Pedreira 1:14:01
Just to quickly add to this, I guess that was essentially the discussion we had: are there actually common things that we can extract from the optimizer? And just adding more color to what I mentioned, we saw that at least all the optimizers have this: okay, how do you define what physical capabilities you have? How do you cost those capabilities? I think it was a discussion on providing the right APIs so that engines can actually say, well, I support merge joins, index joins, and hash joins, and those are the costs. The part of actually exploring the plan space and costing those things, all of that part is very, very common. So I think the idea was just, you know, how do we define the right APIs so that they could be reused across engines. But again, I think it’s something that is just stuck on this part of why should we fund this? And maybe that’s something we will look into in the future.
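
[Editor’s illustration] The idea Pedro describes, engines declaring which physical join algorithms they support and how to cost them while a shared component explores the options, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not an API from Velox or any other project: `EngineCapabilities`, `cheapest_join`, the engine names, and the cost formulas are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical cost function: (left_rows, right_rows) -> estimated cost.
CostFn = Callable[[int, int], float]


@dataclass
class EngineCapabilities:
    """What an engine tells a shared optimizer about itself (hypothetical)."""
    name: str
    join_costs: Dict[str, CostFn]  # join algorithm name -> cost function


def cheapest_join(engine: EngineCapabilities,
                  left_rows: int,
                  right_rows: int) -> Tuple[str, float]:
    """Engine-independent part: cost every declared algorithm, pick the cheapest."""
    if not engine.join_costs:
        raise ValueError(f"{engine.name} declares no join algorithms")
    costs = {algo: fn(left_rows, right_rows)
             for algo, fn in engine.join_costs.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Two made-up engines with made-up cost models.
engine_a = EngineCapabilities("engine_a", {
    "hash_join": lambda l, r: float(l + r),
    "merge_join": lambda l, r: l * math.log2(max(l, 2)) + r * math.log2(max(r, 2)),
})
engine_b = EngineCapabilities("engine_b", {
    "index_join": lambda l, r: min(l, r) * math.log2(max(l, r, 2)),
    "hash_join": lambda l, r: 2.0 * (l + r),
})

print(cheapest_join(engine_a, 1_000, 1_000))
print(cheapest_join(engine_b, 10, 1_000_000))
```

The selection logic in `cheapest_join` never looks inside an engine; only the declared capabilities and cost functions differ, which is the part the discussion suggests could be reused across engines.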

Kostas Pardalis 1:14:51
Yeah, it makes sense. Okay, one last thing, and that’s a question specifically for Ryan: what is going to substitute Apache Ranger, for the enterprise at least? Come on, someone has to go and fix this.

Ryan Blue 1:15:11
I think it’s solving the wrong problem. So I’m going to answer a different question that I think is hopefully relevant. I am not bullish on sharing policy or policy representation, because across a lot of different systems, the edges have very different behaviors. So, like, a future table permission, like you have in Tabular, versus an inherited database-level permission, or something like that, right? Like, what do you do with objects as they go into and out of existence? I don’t think that sharing policy is the right path, because we would have to come up with a union of all the policy choices and capabilities out there. I think sharing policy decisions is the right path. So what we’re incorporating into the Iceberg catalog REST API, or that protocol, is the ability to say this user has the ability to read this table, but not columns x and y. And the benefit there is it doesn’t matter how you store that. It doesn’t matter if you’re using ABAC or RBAC or, you know, whatever you want to use for that internal representation. The catalog just tells you the decision. It says this user can read it, but not these things, or they can’t read this at all, or something. And so I think that is going to be the best way to standardize an exchange, which is essentially: you have to pass through the user or context information necessary for making these decisions, and then the decision comes back in an easily represented way, because it’s a concrete decision from the policy, rather than storing the policy and all of the ways you could interpret that policy.
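
[Editor’s illustration] The distinction Ryan draws, exchanging policy decisions rather than policy representations, can be sketched roughly as follows. This is a minimal sketch under stated assumptions and is not the Iceberg REST catalog protocol: `ReadDecision`, `RoleBasedCatalog`, `authorize_read`, and `readable_columns` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Protocol, Set, Tuple


@dataclass(frozen=True)
class ReadDecision:
    """The concrete answer an engine receives (hypothetical shape)."""
    allowed: bool
    hidden_columns: FrozenSet[str] = frozenset()


class Catalog(Protocol):
    """Engines depend only on this call, not on how policy is stored."""
    def authorize_read(self, user: str, table: str) -> ReadDecision: ...


class RoleBasedCatalog:
    """One possible internal representation (RBAC); invisible to the engine."""

    def __init__(self,
                 user_roles: Dict[str, str],
                 role_grants: Dict[Tuple[str, str], Set[str]]) -> None:
        self._user_roles = user_roles
        # (role, table) -> columns that role may NOT read
        self._role_grants = role_grants

    def authorize_read(self, user: str, table: str) -> ReadDecision:
        role = self._user_roles.get(user)
        hidden = self._role_grants.get((role, table))
        if hidden is None:
            return ReadDecision(allowed=False)
        return ReadDecision(allowed=True, hidden_columns=frozenset(hidden))


def readable_columns(catalog: Catalog, user: str, table: str,
                     columns: List[str]) -> List[str]:
    """Engine side: apply the decision without interpreting any policy."""
    decision = catalog.authorize_read(user, table)
    if not decision.allowed:
        return []
    return [c for c in columns if c not in decision.hidden_columns]


catalog = RoleBasedCatalog(
    user_roles={"alice": "analyst"},
    role_grants={("analyst", "orders"): {"x", "y"}},
)
print(readable_columns(catalog, "alice", "orders", ["id", "x", "y", "total"]))
print(readable_columns(catalog, "bob", "orders", ["id", "x", "y", "total"]))
```

An attribute-based catalog could replace `RoleBasedCatalog` without any change to `readable_columns`, since the engine only ever sees the returned decision.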

Kostas Pardalis 1:17:12
Yeah, I think we need to have a dedicated episode just talking about that stuff, to be honest. But that’s for another time; the theme today was different. Eric, back to you.

Eric Dodds 1:17:24
Yeah, look, we have a couple episodes here. We probably should talk about data types. We should probably talk about the death of SQL and access policies, so we can line up a couple more episodes here. Gentlemen, thank you so much for joining us. This has been so helpful. We’ve learned a ton, and I know our listeners have as well. So thank you for giving us some of your time. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.