Episode 166:

Data Processing Fundamentals and Building a Unified Execution Engine Featuring Pedro Pedreira of Meta

November 29, 2023

This week on The Data Stack Show, Eric and Kostas chat with Pedro Pedreira, a Software Engineer at Meta (formerly Facebook). During the conversation, Pedro discusses the intricacies of data infrastructure and the concept of composability. He delves into the specifics of data latency, explaining the spectrum of requirements from low-latency dashboards to high-latency batch queries. Pedro also discusses the concept of vectorization in database operations and the benefits it brings. The conversation shifts to the development and usage of certain libraries and the goal of converging dialects in data computation. Don’t miss this great episode!


Highlights from this week’s conversation include:

  • The concept of composability at a lower level of data infrastructure (1:28)
  • New architectures and components that allow developers to build databases (3:44)
  • Pedro’s background and experience in data infrastructure (6:18)
  • The Spectrum of Latency and Analytics (12:59)
  • Different Query Engines for Different Use Cases (16:32)
  • Vectorized vs Code Gen Data Processing (19:33)
  • Vectorization and Code Generation (21:21)
  • Examples of Vectorized Engines (24:33)
  • Rewriting Execution Engine in C++ (27:22)
  • Different Organization of Presto and Spark (33:17)
  • Arrow and its Extensions (37:15)
  • The similarities between analytics and ML (44:33)
  • Offline feature engineering and data preprocessing for training (48:00)
  • Dialect and semantic differences in using Velox for different engines (50:01)
  • The convergence of dialects (52:23)
  • Challenges of Substrait and semantics (53:18)
  • Future plans for Velox (58:09)
  • The discussion on evolving Parquet (1:03:38)
  • The integration of the relational model and the tensor model (1:07:29)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Kostas, this week’s conversation is with Pedro from Meta, and wow, what a lot to talk about. He works on Velox, an execution engine that does a ton of stuff inside of Meta, and his work on it and its usage inside of Meta is fascinating. But Pedro’s really an expert in so many different things: databases, the architecture of data infrastructure, so much to talk about. The thing that I’m really interested in is that this concept of composable has been a marketing term in the data space as it relates to, let’s say, higher-level vendors that you would purchase to handle data in an ingestion pipeline or an egress pipeline, right? But Pedro has a really interesting perspective on this concept of composable at a lower level of data infrastructure, and the execution engine is really the foundation for that. So I think this can actually help us cut through some of the marketing noise that the vendors are creating with higher-level tooling and help us understand, at the infrastructure level, what does composable mean? So I think that would be an awesome subject to cover.

Kostas Pardalis 01:59
Yeah, I think you put it the right way, because I think the difference here, when we’re talking about composability, is what level of abstraction we’re talking about. The vendors you’re talking about are talking more about composability of, let’s say, features, or the functionality that the user wants. The composability we were discussing with Pedro is a little bit more fundamental, and has to do more with how software systems are architected. And I’m sure the people that listen to us know the value of being able to build a software system that has some kind of separation of concerns between its modules: you get much more agility and flexibility in building and updating, you can have people that are dedicated to the different parts, and in general you have something that scales much more easily in terms of building it, right? We’re not talking about processing scalability here. Now, traditionally, this was not the case with database systems, right? Database systems were kind of big monoliths in that way, and for some very good reasons; it has a lot to do with how hard it is to build such a system. So the composability that we’ll be discussing with Pedro is more about that: how there are some new architectures and some new components coming out that actually allow a developer to pick different libraries and build databases, or data processing systems in general. And that’s a very important thing in the industry, because traditionally, building database systems has been extremely hard; there was almost zero concept of reusability of libraries or software, and you pretty much had to do everything from scratch.
And that made the whole process of building these systems really hard, also from a vendor point of view, right? So we are entering a new kind of era when it comes to these systems. With technologies like Arrow, for example, and Velox, we start seeing some of the fundamental components that you find in every system out there provided as a library, in a way that you can take and integrate to build your own system, right? So this is one of the things we’re going to talk about when we’re talking about composability with him. But there’s much, much more: we’re actually going to talk a lot about some very basic and important concepts when it comes to data processing. And by the way, Pedro has been working for 10 years on Meta’s data infrastructure, so he has seen a lot of how many things have changed over that time and what they were building. So we’re going to talk a lot about evolution, about things that 10 years ago were innovative and today need to be rethought. And that’s a hint, because he’s going to announce some very interesting things, some changes and updates to some very important systems out there. So Velox is a very interesting project, but it also involves some amazingly experienced people. We start with Pedro today; everyone should listen and enjoy and take a glimpse of what is coming. And hopefully we’re going to have him back, along with more people related to these technologies, to talk more about this stuff in the future.

Eric Dodds 05:49
All right, well, let’s dig in and chat with Pedro.

Kostas Pardalis 05:53
Hello everyone, and welcome again to another episode of The Data Stack Show. I am here today with a very special guest, Pedro from Meta. And we are going to be talking about some very interesting things that have to do with databases. But first of all, Pedro, welcome. It’s really nice to have you here. Tell us a few things about yourself.

Pedro Pedreira 06:18
Yeah, definitely. First of all, I’m super excited to be here; thanks for the invitation. Just to introduce myself, I’m Pedro Pedreira. I’m from Brazil, and I’ve been working at Facebook, well, formerly Facebook, now Meta, for exactly 10 years now. We call this the Metaversary, so my 10th Metaversary was a couple of weeks ago. I’ve always focused on data infrastructure and software engineering, developing all kinds of systems around query processing, all sorts of things around data. I spent about five or six years working on a project called Cubrick, which was something that I started inside Meta based on some of the ideas I was researching in my PhD. The idea was to create a new database very focused on very low-latency analytic queries. We had a special way to index data and partition it into very small containers, and you could use that to speed up your queries, especially if they were highly filtered. So I spent, like I said, maybe five to six years working on that, which was really cool: taking the things I was researching and turning them into an actual product inside Meta. It was something that at some point ran many queries.

Pedro Pedreira 07:34
There was this other thing being developed in parallel, Presto, which was also a really great piece of technology and was making good progress. At some point we started seeing that the two were getting closer and closer, to the point where we eventually brought the teams together and started merging the technology. And that’s when some of the ideas around Velox, and actually unifying the execution engines of all those different compute engines through reusable libraries, started. So that’s how we started working on Velox, and I’ve been doing that for the last three years or so. My main work is on Velox inside Meta, but I also work pretty closely with some of the other compute engine teams: I work very closely with the Presto team, with the Spark team, with graph analytics. My work is software engineering around all those analytics engines. And I also work very closely with the machine learning infrastructure people, with real-time data, all things related to data.

Kostas Pardalis 08:28
Okay, that’s super cool. Before we start talking about more recent, let’s say Silicon Valley kinds of things, I’d like to ask you about growing up in Brazil and deciding to get into databases. And the reason I’m asking is because I also wasn’t born and raised in the United States; I was in Greece, I went to a technical school there, and there were many different options to go and focus on. What brought you, when you were in Brazil, studying, to focus on databases? What was the thing? I remember most of my friends, for example, when we were at the electrical and computer engineering school, everyone wanted to build a 3D engine at the beginning, right? Well, back then; I’m a little bit old. We all wanted to do something like that. Databases would come later; it was a very specific, small group of people getting into that stuff. So tell me about that: how did you get exposed to it and get excited about it?

Pedro Pedreira 09:53
In college, I started looking at operating systems source code, really going down into the Linux kernel to understand how all those things work: memory management, how to manage resources, things like that. And I spent some time looking at distributed systems as well. I think the first thing that really caught my eye about databases is the breadth of things that you get to interact with, right? Because if you’re building databases, you need to understand operating systems, you need to be a great software engineer, but there are also compilers, languages, and the optimizer. There are just so many things that you need to understand if you actually want to go down the database route. That was just fascinating to me. I started out doing research on transaction processing, and then somehow it drifted to more of the analytical side: multi-dimensional indexing, indexing in databases. The more I learned about how databases work, the more I got, like I said, fascinated by how complex those things are. And that’s why it was really interesting: you can go as deep as you want into compilers inside databases, or into hardware, going down to how the hardware works, how prefetching and caching work, because you need a really good understanding of those to build a database. I think that was just something that always caught my eye, and that’s how I got into it.

Kostas Pardalis 11:18
Oh yeah, that makes a lot of sense. Okay. Another question, which I’ll ask in an attempt to clarify things for our audience, because it is one of those things where the semantics always vary depending on who you talk with, and it has to do with latency when it comes to interacting with data. So we have streaming processing, we have what some people call real-time databases, we have interactive analytics; we have many different terms that are used for different systems, and they all have to do with the latency of the query. And it’s really interesting, because you mentioned Presto, and Presto is a system that, if you compare it with something like Spark, for example, is much more interactive. I mean, look at how Spark can be today, but if we go back to when these things started, the whole idea, one guiding principle of the design, was that the queries we are running need to respond in a timely manner, right? We’re not talking about queries that we are going to run for hours. But then you mentioned Cubrick, and you’re talking about even lower latency. Help us understand a little bit how time and latency relate to databases, and try to create some categories that are as distinct as possible, to communicate this to our audience out there.

Pedro Pedreira 13:01
Yeah, let me take a stab at it. Like you’re saying, if we’re looking at analytic queries, there’s definitely a spectrum. It starts at interactive analytics; sometimes we call it last-mile analytics, which is the kind of thing you need to serve with really low latency. Imagine serving user-facing dashboards: they have really high QPS, and you’re expecting very low latency. To build systems focused on that, you need to take different trade-offs. You cannot expect to scan petabytes of data for those queries; there’s probably some pre-computation, some caching required, and even in how you propagate your query between workers you need to be very careful. So we usually say there’s this last-mile analytics that relies on a lot of caching and pre-computation. Then there is more interactive analytics, sometimes called exploration or ad hoc, where in some ways you’re exploring the data. You know your data set, and maybe you try to join different tables to get some insight, but it’s not something where you’ve already found what you need and built a dashboard; you have a few questions that you want answered. It’s a spot in between that’s a little more exploratory, but it still needs to execute with low latency because there’s a human expecting to see the result, right? So it goes from this serving, in a way, to this exploration. And on the other side, there are the really high-latency batch queries, the ETL jobs: okay, I need to prepare my data, so I need to join these ten-petabyte tables with another five-petabyte table, generate some tables, and then those tables are going to be used to serve something else. And this is only talking about analytics, right?
So it goes from serving dashboards, to human exploration that’s a little more ad hoc, to large batch processing and ETL. At least inside Meta, a lot of the batch workloads just run inside Spark, because it’s a system that is really good at handling large queries; there’s good support for fault tolerance and durability, so it’s a very reliable system. As you move to the more interactive part, you see systems like Presto, and we also have things that are focused on even higher QPS, serving really low-latency queries. I think that’s the spectrum. Again, this is only talking about analytics; if you talk about transactional workloads, or ML, or real-time data, that’s another world. But that would be maybe a high-level classification.

Kostas Pardalis 15:30
No, that makes total sense. And just to help people map these three categories to some products out there: you have systems like Pinot, or Druid, or ClickHouse, and I would assume, and correct me if I’m wrong, these are the systems that are closer to what Cubrick was supposed to be doing, correct? Then you have systems like Trino, like Presto, or like Snowflake, that are more for longer-running queries, but still queries where there is a human waiting in front of the screen, right? And then you have all the batch stuff that might take hours, or maybe even days, and there you have Spark. And now, okay, I’m going to ask a very naive question, but I have a good reason to ask it like that: why do we need different query engines? Can’t we have just one and cover all the different use cases?

Pedro Pedreira 16:33
I think that’s a good question. It goes down to that whole “one size does not fit all” discussion that started, I think, maybe 20 years ago at this point. But essentially, depending on the capabilities you need from your engine, you assemble your query engine in different ways. For example, if you need something very low latency, you probably want to propagate your query to fewer servers; you want to have things pre-calculated. Maybe a good example is fault handling: if your query is really short-lived and something fails in the middle of it, it’s probably cheaper to just restart the query than to try to save the intermediate state, right? That’s very different from running a query that takes an entire day; if something fails there, you need ways to restart just that part of the query. So different systems, depending on the capabilities, depending on your requirements, need to be organized and assembled in different ways. There are always discussions about HTAP systems and systems that could magically cover most of this, and there are some products that are even kind of successful in the industry. But at least for us, at scale, we don’t see anything that can work at our scale and still fit all those requirements. I think the very interesting question, and that’s how we got to Velox, is that you do have all those different engines, but if you look at them, how different are they actually, right?
A good example, and where we started, is if you look at Presto and Spark, like you mentioned: Presto is definitely more focused on lower-latency, more ad hoc queries, while Spark is for those really large batch jobs. But if you look at the code that actually executes the database operations, it’s the same: there is nothing different about the way you execute hash joins, or the way you execute expressions. Of course, one of them is vectorized and the other one is a little closer to codegen, but essentially the semantics of those operations are always the same. So the discussion we brought was that, yes, the engines need to be different, because the way you organize your query execution requirements is different, but the code you execute can be shared, or at least a lot of it can be shared. And that’s how we got into Velox.
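The composability idea Pedro is describing, different engines orchestrating queries differently while sharing one library of execution operators, can be sketched in a toy form. Everything below (the operator names, the two "engines") is a hypothetical illustration, not the Velox API:

```python
# Toy sketch of composable execution: two engines with different
# orchestration share one library of execution operators, so the
# filter/project (or, in a real engine, hash join and expression
# evaluation) code exists exactly once.

# --- shared "execution library" ---
def op_filter(batch, predicate):
    return [row for row in batch if predicate(row)]

def op_project(batch, fn):
    return [fn(row) for row in batch]

# --- an "interactive" engine: runs the plan eagerly on one batch ---
def interactive_engine(batch, plan):
    for op, arg in plan:
        batch = op(batch, arg)
    return batch

# --- a "batch" engine: same shared operators, staged over splits ---
def batch_engine(splits, plan):
    out = []
    for split in splits:  # e.g., one split per file or partition
        out.extend(interactive_engine(split, plan))
    return out

plan = [(op_filter, lambda r: r % 2 == 0), (op_project, lambda r: r * 10)]
print(interactive_engine([1, 2, 3, 4], plan))   # → [20, 40]
print(batch_engine([[1, 2], [3, 4]], plan))     # → [20, 40]
```

The point of the sketch is only the division of labor: the engines differ in how they schedule and distribute work, not in what the operators compute.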

Kostas Pardalis 18:48
All right, that makes a lot of sense. You mentioned an interesting distinction in terminology here: vectorized versus codegen, right? Tell us a little bit more about that. The reason I’m asking is because I’m aware that these are the two distinct, let’s say, main categories of how an execution engine operates, or is architected. But I’m not sure that many people know about that; they might think of vectors and think of many different things, or of codegen and, again, many different things. So let’s get a little bit more into that, and that is going to give us the stepping stone to getting to Velox. So what’s the difference between the two?

Pedro Pedreira 19:34
Yeah, it’s essentially how you process your data. When we talk about vectorization, we usually assume that you take entire batches of data, so you’re not executing operations against every single row. You’d have a batch of, say, 10,000 rows, and then you apply one operation at a time over this entire batch. There are a lot of trade-offs, but it’s usually a little better suited to CPUs, because memory prefetching works better and your memory locality is better. So by executing those operations against a batch, you can amortize a lot of the cost. I think where things get tricky is that if you cannot batch records, or create batches that are large enough, then in a lot of cases you have a bunch of operations that you want to execute against a single record, and the overhead you have with vectorization really adds up. So what we see is that for analytics, where you’re usually processing a lot of data, vectorization is a better candidate, just because you always have large batches: the overhead of finding the operation you need to execute on that batch gets amortized, because the batch is large enough. But in cases where your batches are smaller, and I think there are probably some disagreements in the community on exactly where this line is drawn, but usually in things that are a little more transactional, where you have lots of operations that you want to apply over one record or a few records, then codegen is a better solution. You essentially take all the code and, at execution time, you generate some assembly code that can execute those operations more efficiently, right? And when we started these things, there was a lot of discussion on what would be the best methodology to pursue.
In Velox, all the algorithms are vectorized, but we have been experimenting with codegen as well. One of the tricky parts of codegen is that there is some delay, some overhead, in generating this code at execution time. So you always have some gains, but then the question is, does the compilation time offset them? In our experience, at least, vectorization was usually still the clear winner. Beyond that, there’s the complexity aspect of generating code at runtime: it can make things trickier to debug, and it may make things harder to develop. There has been some recent work in the community suggesting that codegen might not be as terrible as it seems, on how to make it easier to develop codegen-based engines, but our experience was that for analytics, vectorization was still the clear winner. And if you look at the industry, most of the engines focused on analytics do vectorization, which is this idea of just executing the same operation over large batches of data.
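As a rough illustration of the amortization point (a toy sketch in Python, not how Velox or any real engine implements it): a tiny interpreter can evaluate the same expression row-at-a-time or one operation at a time over a whole batch. The vectorized version pays the operator-dispatch cost once per operation, instead of once per row per operation.

```python
# Hypothetical expression "(x + 1) * 2 > 10" as a list of operations.
OPS = [("add", 1), ("mul", 2), ("gt", 10)]

def apply_op(op, arg, value):
    # Interpreter dispatch: the cost we want to amortize.
    if op == "add":
        return value + arg
    if op == "mul":
        return value * arg
    if op == "gt":
        return value > arg
    raise ValueError(op)

def eval_row_at_a_time(rows):
    # Dispatch happens len(rows) * len(OPS) times.
    out = []
    for v in rows:
        for op, arg in OPS:
            v = apply_op(op, arg, v)
        out.append(v)
    return out

def eval_vectorized(batch):
    # Dispatch happens len(OPS) times total; each inner loop is a
    # tight pass over the whole batch (good locality/prefetching in
    # a real columnar engine).
    values = list(batch)
    for op, arg in OPS:
        values = [apply_op(op, arg, v) for v in values]
    return values

rows = [1, 4, 5, 9]
print(eval_row_at_a_time(rows))  # → [False, False, True, True]
print(eval_vectorized(rows))     # → [False, False, True, True]
```

Both produce the same result; the difference Pedro describes is purely in how often the "what operation am I running?" decision is made and how the data is traversed.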

Kostas Pardalis 22:20
That makes a lot of sense. I’d like to make something clear here, because I think people might also think that when we’re talking about vectorization, it’s always connected to the specific instruction sets that some CPUs have, like AVX-512 or something like that. But actually, vectorization is much more fundamental than that. Yes, you can get acceleration because of these SIMD instruction sets, but you get benefits anyway, even without them. It’s a whole architecture; it’s not just having special hardware that can accelerate some specific things, correct? I just want to make sure we don’t confuse people more; there’s a lot of confusion out there, and I’d like to make things clear.

Pedro Pedreira 23:06
No, exactly. I think that’s a great point: when we talk about vectorization, it doesn’t necessarily mean SIMD instructions. The point is to design your database operations in a way that is better suited to the hardware. In many of those cases you do end up using SIMD, because the way you lay down your data, the way you organize your data structures, means SIMD can be leveraged to execute those things efficiently. There is always a discussion on whether you do this explicitly, or you let the compiler implicitly discover that a pattern can be translated into SIMD instructions. But like you said, it’s a little higher level. For example, if you are executing hash joins, where you do lookups in hash tables: do you take one record and do the entire lookup in the hash table, or do you first calculate all the hashes for the batch and then do all the lookups? It’s more about how, at a higher level, you organize the operations you’re executing. SIMD is part of that, but it’s not only that.
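The hash-join probe pattern Pedro describes can be sketched as follows. This is a hypothetical toy, not engine code: in Python the two-pass structure brings no speedup, but in a real columnar engine the separate hashing pass is what enables SIMD hashing and prefetching of hash-table buckets.

```python
# Build side: a small hash table keyed by hash(key).
def build_table(rows):
    # rows: (key, payload) pairs from the build side of the join
    table = {}
    for key, payload in rows:
        table.setdefault(hash(key), []).append((key, payload))
    return table

def probe_batch(table, keys):
    # Pass 1: compute all hashes for the probe batch up front
    # (in a real engine, a vectorizable/SIMD-friendly loop).
    hashes = [hash(k) for k in keys]
    # Pass 2: probe with the precomputed hashes (a real engine
    # could prefetch the table buckets here).
    out = []
    for k, h in zip(keys, hashes):
        for key, payload in table.get(h, ()):
            if key == k:
                out.append((k, payload))
    return out

table = build_table([("a", 1), ("b", 2)])
print(probe_batch(table, ["b", "c", "a"]))  # → [('b', 2), ('a', 1)]
```

The alternative, "take one record, hash it, probe, repeat", interleaves the two passes; the batched form is the higher-level reorganization Pedro means, independent of whether SIMD instructions are actually emitted.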

Kostas Pardalis 24:04
Yeah, 100%. And okay, before we get more into Velox: can you give us some examples? Velox, obviously, is vectorized, but give us an example of a well-known database system that is vectorized, so people can relate to it, and, on the other hand, a popular system that is codegen, which people might not know about.

Pedro Pedreira 24:34
I think Presto is a good example of a vectorized engine. The vectorization that Presto does and the one Velox does are not exactly the same, but the paradigm is the same. Photon too: at least based on the public information we have from the paper, it seems like everything is vectorized.

Pedro Pedreira 24:56
Snowflake is also vectorized, I believe, based on the people who created it and the systems they worked on before, but there’s not as much public information. I would say that most of the engines focused on analytics are essentially vectorized. The codegen systems essentially come from the German universities; TUM is kind of the bigger shop for codegen. The Umbra system, I think, is the newest system they have been working on, which is all based on codegen. Before that, they had the HyPer system, which I think was at some point acquired by Tableau, now part of Salesforce. That was also one of the really successful codegen systems. I think those are maybe the best examples. I know that Redshift has some codegen as well.

Kostas Pardalis 25:45
Yeah, TUM is the first one that comes to my mind too. I think DuckDB is also vectorized; DuckDB, for sure, is vectorized. And I had the impression about Spark, I mean without Photon, that it was codegen, but I might be wrong here. With Photon it’s vectorized, but the old engine is codegen, right?

Pedro Pedreira 26:12
I think you’re right, though I’m not sure, because I know that we have some internal changes to Spark as well, and I don’t know if the codegen is something that we added or improved. But the public version, the Java Spark code, is based on codegen. It’s not vectorized; it’s codegen.

Kostas Pardalis 26:28
Yeah, which kind of makes sense, because vectorization in the JVM space, I think, is a little bit more challenging to achieve in some cases. Okay, and a quick question here, to get into Velox. Velox, by the way, is written in C++, right? So it diverges a lot from the paradigm of systems like Spark and Presto, which were JVM-based systems. Why is that? That’s the first question. And the second question is, how do you bridge the two? Because, sure, it sounds interesting to be able to take Presto or take Spark, swap in an execution engine that is written in C++, and get the gains that you might have. But how trivial is it to do that, right?

Pedro Pedreira 27:22
Yeah, that’s a good question. So when we started Velox, I think there were maybe two different discussions, like one of them is what you mentioned, just how we will write that in which language. So that actually started inside the Presto team. So we have a couple of people just looking at accelerating presto. Right? So you know, what are the optimizations you can do on the Java stack? And then I think at some point like this, the idea came up of, if we actually rewrite those things in C++, like, would they be more efficient. So at that time, there was some smaller benchmarks for micro benchmarks, just to kind of compare how efficiently like if you just look at a very hot loops, right, and I think one of the things are the areas where we started was essentially table scans, because that’s where most of the CPU was going. So if you look at the main loops inside table scan, and you just kind of map those things into C++, or what we call native code, how much faster that would be, I think we got some good data that at least for those smaller micro benchmarks, there was some performance gains, I think, was already a few weeks from 4x to 10x, depending on what you saw, there was a good performance gain by just kind of moving things from Java to C++, and other Java lovers will probably yell a bit. There’s, of course, there’s a lot of kinds of things that you can stretch the JVM to do things even more efficient, but then you kind of start losing a lot of kind of good benefits that the JVM and Java some of the guarantees that the language users can kind of strap Java in awkward ways. And then it can bridge some of that. But then I think in the end, like it just came up that simpler writing those things in C++ in the first place was just kind of a better idea performance wise. So that was one aspect of the discussion that the second one was that we didn’t want to do, or when we started writing some of the ideas that we didn’t want to do on a per agent basis. 
So usually, if you’re looking at just Presto or just Spark, that’s one discussion, but we have 10 to 20 different engines, right? So you don’t want to have to do that on a per-engine basis. I think that’s when the idea came up: if we need to rewrite those things from Java to C++, there’s a lot of work there, almost 10 years of work. But if we want to do that, we want to do it once and then reuse it across all the different engines, not just Presto and Spark. We want to use it in stream processing. We use it in machine learning, for pre-processing, for training, for serving and inference. We want to reuse it for data ingestion. We want to reuse it in as many verticals and different use cases as we can, and writing it in C++ just makes this a lot easier. At least for us, the whole ML stack is all C++, so it’s not like you could reuse a Java library in that world. So I think C++ was just the better choice at that point. And to your second question, about how we plug those things in: it’s interesting, it also depends on exactly how you’re integrating, or maybe how deep you want the integration with the engine to be. Specifically for Presto, there was a discussion on whether you just replace the execution code with C++, and in that world you need to have some sort of inter-process communication, or you need to use JNI. And there were questions about how much complexity JNI would add. There were also discussions about memory management, like how much memory you give to the C++ side of things and how much memory the JVM would need. So what we ended up deciding is that, specifically for Presto, we remove Java completely from the workers.
So essentially, Presto has this two-tiered architecture where you have one or a few coordinators that take your query, parse it, optimize it, and generate query fragments, but then the execution of those fragments is done on worker nodes. What we do is that on those worker nodes we completely replace the Java worker, so there’s no Java running there. Most of the execution of the operations is done by Velox, but there is also a thin C++ layer that essentially implements the HTTP methods that communicate with the coordinator. That’s the project we call Prestissimo. Prestissimo is essentially this layer around Velox. So that’s how we do it specifically for Presto, and that’s how we go between languages, from Java to C++, via a kind of HTTP REST interface.
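The two-tier split Pedro describes can be sketched roughly as follows. This is a toy model in Python, not the actual Presto or Prestissimo code; the class names, the single-fragment plan, and the `execute` method are all invented for illustration.

```python
# Toy model of Presto's two-tier architecture: a coordinator parses and
# optimizes the query into plan fragments, then ships each fragment to
# worker nodes for execution. In Prestissimo, the worker side is C++
# (Velox plus a thin HTTP layer); here both sides are mocked in Python.

from dataclasses import dataclass

@dataclass
class PlanFragment:
    fragment_id: int
    operators: list  # e.g. ["TableScan", "Filter", "PartialAgg"]

class Coordinator:
    """Stands in for the Java coordinator: parse, optimize, fragment."""
    def plan(self, sql):
        # A real optimizer produces a DAG of fragments; we fake one.
        return [PlanFragment(0, ["TableScan", "Filter", "PartialAgg"])]

class NativeWorker:
    """Stands in for a Prestissimo worker: in reality it receives the
    fragment over HTTP REST and runs it with the Velox engine."""
    def execute(self, fragment):
        # Pretend each operator ran and report what was executed.
        return {"fragment": fragment.fragment_id, "ran": fragment.operators}

coordinator, worker = Coordinator(), NativeWorker()
results = [worker.execute(f) for f in coordinator.plan("SELECT ...")]
```

The point of the split is that the coordinator never executes data operations itself, so swapping the worker’s implementation language does not disturb the parsing and planning layer.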

Kostas Pardalis 31:43
And is there, like, Spark for example? Okay, the reason I’m asking is that I’ve also seen the project Gluten, which is, I mean, Gluten is more generic, but it tries to create a layer there for different execution engines, like Spark. And I also have the impression, at least from the Photon folks, that they were talking about JNI and the context switching that happens there, and what kind of overhead this adds. Do you gain at the end, or do you lose? Because, okay, traditionally for Java to communicate with other systems you use JNI, right? It just makes sense to do that. But if you take the pain of going and writing the system in C++ for the performance benefits, you don’t really want to lose any of those benefits, right? So based on your experience, what have you learned around that?

Pedro Pedreira 32:47
Yeah, I think that’s a good question. I think the JNI discussion was less about performance. The assumption is that you cross the API boundary very, very rarely, right? The idea is that you would go from the Java world into the C++ world and cross the JNI boundary just once, and then all the heavy lifting operations would happen inside C++. Another way of saying that is that we wouldn’t transfer data between C++ and Java; all the heavy operations stay on the C++ side. So I think the discussion was a little less about performance and more about simplifying the stack and resource management. Another big difference is just how Presto and Spark are organized. Presto is essentially a service, right? You have a few workers and those workers are always running; it’s a long-lived process, and you share this process among different queries. So you want to be very conscious about how much memory you have on the box, how much memory that process is able to use, and then inside this process you share memory and memory pools among different queries. On the Spark side it’s different: Spark is more like a batch job dispatching system. When executing a query, you allocate containers on different hosts, and at least the way we use it internally, that allocation lives while your query is executing, and at that point you remove it, right? So being able to efficiently share resources and share memory is much more important on the Presto side than on the Spark side. With that said, on the Spark side we also went through a few iterations on how those things would be integrated. There was a first try internally, I think at least two years ago, doing something similar to Gluten, but instead of doing it via JNI, just using a separate process.
So essentially, the Spark driver would create a new C++ process and then communicate using some sort of IDL, some sort of inter-process communication.

Kostas Pardalis 34:50
We went from that mode to

Pedro Pedreira 34:53
a kind of completely different project that we can maybe talk about in a little more detail, but essentially the idea was taking the same Presto frontend and the Presto worker code and executing them on top of the Spark runtime, which is something we call Presto-on-Spark. I think that was our strategy because the integration work would be a little less, right? We just needed to integrate Velox into Presto, and then if you were reusing the Presto engine on top of Spark, the integration path would be easier. So that’s how we do things internally. And then in the community, as I mentioned, Intel started and created this project called Gluten, which is in a way similar to Photon, and that uses JNI to communicate with Velox, and the API is based on Substrait and Arrow. So that one is closer, but we don’t use it internally. We’ve been working closely with Intel, though, and I think a good way to look at it is that Gluten is sort of a JNI wrapper around Velox that can integrate it into Spark, but the project is generic enough that you can reuse it in other Java engines as well, like Trino, anything that uses Java.

Kostas Pardalis 36:07
Yeah, that’s interesting. Okay, what about Arrow? Because one of the, let’s say, promises of Arrow is that, okay, we agree upon the format for how to represent things in memory, right? And this can be shared among different processes and all these things, which obviously sounds very interesting, because when you’re working with a lot of data, the last thing that you want to do is copy data around, right? That’s always a big overhead. So where does Arrow fit? I know that Velox, for example, is, I think, inspired by Arrow in the way the memory layout is designed, but it’s not part of it, right? It’s not exactly implementing Arrow. So how do you see the two projects working together? And what’s your prediction for the future, in terms of the promises of Arrow and the role of Arrow in the industry?

Pedro Pedreira 37:17
That’s a good question, and there’s some background to that as well. Like I said, Arrow is this columnar data layout that is meant to be a standard in the industry, and the whole idea is that systems would reuse that layout and you could move data zero-copy across engines. When we started Velox, of course, the idea was to just reuse that same layout, because it would make it a lot easier to integrate with different engines. What we started seeing is that there were a few different places where, if we made some small changes to that layout, we could implement execution primitives more efficiently. So there was a lot of discussion on whether we would stick to the layout at the cost of maybe losing performance in some cases, or whether we would extend Arrow and implement our own specific memory layout in some cases, just so we could execute some operations more efficiently. And because we really cared about efficiency, we decided to go with this kind of extension. That’s how we created what we usually call Velox vectors. It’s essentially the same thing as Arrow in most cases; the only differences are that how we represent strings is slightly different, and there are some differences in how we represent, sorry, arrays and maps. We also added some more encodings that, at least at the time, were not available in Arrow, like constant encoding, RLE, and frame-of-reference. So we decided to extend it in a way that deviates from the Arrow standard. That’s how Velox started, and all data represented in memory follows the Velox vector model. Then, because many engines integrating with Velox would like to produce and consume Arrow, there’s some sort of conversion.
There’s a layer that can do this conversion, which, like I said, in most cases is zero-copy. But if you hit one of those cases where we represent things differently than in Arrow, then a copy is needed.

Kostas Pardalis 39:15
Right, so this is how the project started.

Pedro Pedreira 39:20
Then, as Velox was gaining adoption and gaining visibility, we also started working with the Arrow community, just to understand those differences and see if that’s something they’d be interested in, extending the standard as well. We started working with a company called Voltron Data that was also interested in Velox, and they do a lot of good work in the Arrow community. So what we ended up doing was proposing to the Arrow community to add those Velox extensions to the specification. At some point we were able to convince them that those things are actually useful, and it’s not just Velox that needs them; some of the other compute engines also see things in a similar way. Some of those changes are already adopted and available in the new version of Arrow. There is one last difference we are still having some discussions about, but hopefully it will make it into the next version of Arrow. So, long story short, when we started it was sort of a fork, or what we called an Arrow++ version, but then we worked with the community to contribute those changes and hopefully align. The idea is, of course, that we can ship data back and forth with Arrow without any performance loss, making everything zero-copy.
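The encodings Pedro mentions can be illustrated with a small sketch. These classes are invented for illustration and are not the Velox or Arrow APIs; the point is that constant and dictionary vectors let an engine keep and process far less data than a flat layout when values repeat.

```python
# Sketch of the encoding idea behind Velox vectors: besides a flat,
# Arrow-style layout, encodings such as constant and dictionary vectors
# let operations run over small backing data instead of every row.
# These classes are illustrative, not the actual Velox API.

class FlatVector:
    """One materialized value per row (roughly the Arrow-style layout)."""
    def __init__(self, values):
        self.values = values
    def value_at(self, i):
        return self.values[i]

class ConstantVector:
    """All rows share one value; stores it once regardless of row count."""
    def __init__(self, value, size):
        self.value, self.size = value, size
    def value_at(self, i):
        return self.value

class DictionaryVector:
    """Rows are small indices into a dictionary of distinct values."""
    def __init__(self, dictionary, indices):
        self.dictionary, self.indices = dictionary, indices
    def value_at(self, i):
        return self.dictionary[self.indices[i]]

# A million-row constant vector stores one value, not a million copies.
country = ConstantVector("US", 1_000_000)
# Low-cardinality strings are stored once and referenced by index.
status = DictionaryVector(["ok", "error"], [0, 0, 1, 0])
```

An expression over `country` can evaluate the shared value once and reuse the result for every row, which is the kind of execution-primitive win that motivated extending the layout.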

Kostas Pardalis 40:32
Okay, that’s awesome. So you mentioned a couple of folks out there who are using Velox. Who else is using Velox outside of Meta? It is a pretty active open source project right now; if someone goes there, you can see a lot of contributions happening from people outside of Meta as well. So what’s the current status of the open source project?

Pedro Pedreira 40:58
That’s a good question. I think that’s one area where we are actually really happy with how much attention and visibility we’re getting. Part of our assumption with open-sourcing Velox was that the audience is very limited; it’s not something that end users will download and use. In that way it’s very different from DuckDB, which must have millions of users, because data scientists and data engineers use that directly. Our audience is just the developers who actually write compute engines. So we thought the group of people who would be interested in this would be very limited. But regardless, we saw very quick growth in attention, and different companies in the industry were interested in the project. They would start using it, benchmarking it against what they had internally, and then eventually help us. There is a list of companies; one of the first that joined us really early on, and has been helping specifically from the Presto perspective, is Ahana. They have a team of people working with us really closely, and they recently got acquired by IBM, so now IBM is one of the companies that has been helping us develop and use Velox, specifically in the Presto context, for a while. Intel was another company that joined us very early on; they’re interested a little more from the hardware perspective, and part of that, of course, is making sure this library runs efficiently on Intel hardware. They have a lot of work on the Spark side as well; they’re the company that created Gluten. Specifically around Gluten, there are a lot of other companies who joined afterward and help develop it and are using it internally. So we’ve heard from Intel, Alibaba, Tencent, Pinterest, and a large number of other companies who have been looking into it as well.
We have also been chatting with Microsoft about Gluten and some other usages of Velox. We’ve heard from people at most tech companies at this point, with very few exceptions; some of them are already actively using it internally and helping contribute code, and some of them are just asking questions and trying to benchmark things. But I think we’ve heard from most of the large tech companies in some shape or form, which is really interesting and really validating for our work.

Kostas Pardalis 43:17
Yeah, 100%. I think it’s very rewarding, especially for someone who’s leading a project like this. I have two questions, and I have to ask them because I don’t want to forget them. The first one: you mentioned at some point that Velox is being used not just for analytical workloads, but for streaming and for ML. And I’d like to take ML as an example, just because it’s the hot thing right now, right? So, what’s the difference? And the reason I’m asking what the difference is, is because I also want to understand what they have in common, right? Because if you can share such complicated components, like Velox, like the execution engine that does the processing of the data, between ML and analytics, it means that while they’re different, there are also some things that are common. So I’d love to hear about that from you. Please, help us learn.

Pedro Pedreira 44:36
Like I said, when we started Velox, we started learning and talking to many other teams, and that’s when we started realizing that the similarities between engines developed for all those different use cases are much higher than what we expected. The assumption had been that these were different things doing very particular jobs in very particular ways, but what we found was that it was basically the same thing developed by different groups of people. You can see that within analytics, but you see it even more clearly between analytics and ML. It just happens that data management systems for analytics were developed by database people, and the data part of ML was developed by ML engineers. Of course, there’s a lot that is very particular to ML, like how you do training, all the bulk synchronous processing you do there, all the hardware acceleration, but there’s also the data part of ML. There were maybe two main use cases we’ve been looking at. One of them is what they call data pre-processing. Basically, when you want to start your training process, you need to feed it lots of data, and usually this data is stored in some sort of storage; for us, we pull data from the same warehouse. So there’s this logic that’s usually not nearly as complex as large analytical queries; it’s not like you’re doing aggregations or joining different datasets, but you’re still doing a table scan, you’re decoding files, sometimes you do a bit of filtering, and there’s always some expression evaluation, and you execute some UDFs.
This part is actually very similar; essentially, there is nothing different about how you decode those files, how you open Parquet or ORC files, how you execute expressions, the UDF APIs that machine learning users use, all the types we provide to them. All those parts were very similar. There was also a discussion about how much we could actually reuse, and what we ended up seeing is that it’s just more of the same. We had a different library, also a C++ library, that was heavily used for this specific use case, for pulling this data during training and feeding it into the model. Since the implementation of those things is very similar, this is one of the areas where we started converging on Velox. The second part was around feature engineering, which is essentially how you take your raw data and transform it into features. There are a few different parts to that. On offline systems, essentially, you have your internal warehouse tables with the raw data; how do you run the ETL processes that actually generate features that can be used for training? We also do a similar thing on the real-time side; there’s a real-time system that does feature engineering. And there is also the part of serving your features. Essentially, when a user needs to see an ad, you get the input data and you need to transform it really fast, at really low latency, into your features, so you can actually do the inference on the model. Again, those last-mile transformations you need to do from the raw data into features, it’s essentially expression evaluation, sometimes UDFs. So that’s the part we’re converging.
And again, the more we looked into it, the more we saw that it’s essentially more of the same, just implemented by different groups of people with slightly different semantics. But it’s the same thing. Yeah.
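The "last mile" point above can be made concrete with a small sketch. The feature functions below (`bucketize`, `log1p_scale`) are hypothetical examples, not Meta's actual features or the Velox UDF API; the point is that the same expression evaluation can run in an offline ETL job over warehouse rows or at serving time over a single request.

```python
# Sketch of feature transformation as plain expression evaluation plus
# UDFs, the same primitive an analytics engine already has. The feature
# functions here are invented for illustration.

import math

def bucketize(value, boundaries):
    """UDF: map a numeric value to the index of its bucket."""
    for i, boundary in enumerate(boundaries):
        if value < boundary:
            return i
    return len(boundaries)

def log1p_scale(value):
    """UDF: compress heavy-tailed counts with log(1 + x)."""
    return math.log1p(value)

def featurize(row):
    # One implementation of the transform, usable in two contexts:
    # batch ETL over many rows, or low-latency serving over one row.
    return {
        "age_bucket": bucketize(row["age"], [18, 30, 50]),
        "spend_log": log1p_scale(row["spend"]),
    }

batch = [{"age": 25, "spend": 0}, {"age": 52, "spend": 99}]
features = [featurize(r) for r in batch]
```

Sharing one implementation between offline and serving paths also removes a classic source of training/serving skew, since both sides compute features with identical semantics.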

Kostas Pardalis 47:59
What’s the difference there? I’m really curious. What’s the difference between the offline feature engineering and the data pre-processing for training that you mentioned? Because, okay, what I understand is that data pre-processing is much more, let’s say, serialization and deserialization; it doesn’t necessarily require typical ETL stuff, like joins and things like these, while feature engineering might be more complicated in the type of processing it does.

Pedro Pedreira 48:34
I think on the feature engineering side, you can imagine that offline it essentially runs on top of a traditional SQL analytic compute engine. It’s basically the idea of doing large SQL transformations; sometimes you’re joining things or executing expressions, but basically you’re taking all this raw data and creating features. Data pre-processing is more: you already have those features created, and you’re pulling this data and feeding it into the PyTorch model or whatever model you’re trying to train. So you can see these are slightly different parts of the pipeline, but the complexity and the types of operations you execute are also different, right? When you create those features, there can be a lot more complexity: you’re shipping data to different places, doing shuffles and moving lots of data around. While on the training side, you just want to be able to very efficiently pull this data and feed it into the trainers, very fast.

Kostas Pardalis 49:33
Makes sense. And you mentioned UDFs and some other stuff, which relates to something I wondered about with Velox. If someone goes to the documentation of Velox, they will see that there are two groups of functions there: one for Presto and one for Spark, right? And you will see the same functions, in a way, implemented twice, right? Why is this happening? Why do you need something like that?

Pedro Pedreira 50:01
Yeah, that’s a very good question, and that’s where we get to the whole dialect and semantics discussion. Basically, when we created Velox, there was this discussion of which specific dialect we should follow, right? Because it’s not just that the code base is the same; for example, which data types are supported? And when there is an overflow, what’s the behavior? Different engines do that in different ways, right? So if you follow the behavior of one engine and you try to integrate Velox into a different engine, you’re essentially changing the semantics. Exactly how you handle operations, which functions you provide, which data types you provide, was a big thing. What we ended up deciding was that, because we need to use Velox across different engines, Velox needs to somehow be dialect-agnostic. So what we ended up doing is that the core library, the core part of Velox, we try to make as engine-independent as we can, and then you have extensions that you can add to follow one specific dialect. So to what you mentioned: if you’re using Velox for Presto, you’re going to use the Velox framework and add all the Presto functions, and that would be a Velox version that follows the Presto dialect. But if you try to use that in Spark, it won’t work, right? Because Spark is expecting different functions with different semantics. So there is also a set of Spark functions that you can use. If you’re compiling Velox for Gluten, for example, you’re going to use Velox with the Spark SQL functions. The other interesting part is that it lets you mix and match those different things, so we could potentially have a Spark version of Velox running within Gluten but also exposing Presto functions for backward compatibility. It enables you to do these sorts of things.
I think that’s one of the parts where we saw this is actually pretty tricky, right? Because the code is the same, but each engine has its own quirks and its own SQL dialect, and no two databases have the exact same one. Even if all of them support SQL and all of them claim to comply with the SQL standard, they’re never compatible; they’re always different. There’s also a discussion about what we end up doing internally at Meta, because that’s also a problem we really care about, right? If you’re a data user and you need to interact with five or six different systems, and each system has a different SQL dialect with different functions and different semantics, it makes your life a lot harder. So as a separate goal, we also have this idea of a converging dialect: it would be a lot easier if we had one SQL dialect that people could use to express their computation, and that would just work on top of all those different systems. And not just for SQL; potentially, if you’re creating your feature engineering pipelines, or if you’re starting to train a model, ideally the functions, the data types, and the semantics exposed are exactly the same as in your Presto or Spark queries. It just makes users’ lives a lot more efficient, right? This is something that doesn’t necessarily come from Velox, but Velox allows you to do it: because we have the same library with the same functions and the same semantics being used in all those different engines, we can make the user experience more aligned, which then makes data users and data scientists a lot more efficient. That said, it’s a lot harder than it looks, because to make changes on engines there are all sorts of concerns with backwards compatibility, but it’s something that, longer term, we’re also looking at.
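The dialect mechanism Pedro describes can be sketched as swappable function registries. The overflow behaviors below are simplified illustrations, not the actual Velox code: assume one dialect raises on 64-bit overflow while the other wraps like a Java long (roughly the Presto versus non-ANSI Spark behavior for bigint addition).

```python
# Sketch of hosting multiple SQL dialects in one engine by swapping
# function registries, as Velox does with its Presto and Spark function
# packages. Semantics here are simplified for illustration.

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

def presto_plus(a, b):
    """Assumed 'presto' dialect: raise on 64-bit overflow."""
    result = a + b
    if result > INT64_MAX or result < INT64_MIN:
        raise ArithmeticError("bigint addition overflow")
    return result

def spark_plus(a, b):
    """Assumed 'spark' dialect (ANSI off): wrap like two's-complement."""
    return (a + b - INT64_MIN) % 2**64 + INT64_MIN

REGISTRIES = {
    "presto": {"plus": presto_plus},
    "spark": {"plus": spark_plus},
}

def evaluate(dialect, fn, *args):
    """Look up the function in the chosen dialect's registry and run it."""
    return REGISTRIES[dialect][fn](*args)

# Same expression, different dialect, different semantics.
overflowed = False
try:
    evaluate("presto", "plus", INT64_MAX, 1)
except ArithmeticError:
    overflowed = True
assert overflowed
assert evaluate("spark", "plus", INT64_MAX, 1) == INT64_MIN
```

The core engine stays dialect-agnostic; only the registry you install decides which behavior you get, which is also what makes mixing dialects in one binary possible.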

Kostas Pardalis 53:29
I think that touches on something that’s always confused me a little bit with Substrait. Because I was like, okay, yeah, sure, you can take a plan and serialize it in Substrait, and then, let’s say, you have a plugin on Spark that gets that and executes it. But at the end, the semantics of Spark might be slightly different from, say, Presto, for example. So does Substrait, at the end, provide the interoperability layer that we would like to have, or is it limited by how the semantics are implemented in each system? Because Substrait cannot go and implement those; that’s not its purpose, right? So what do you think about that? Because depending on whom I talk with, it’s very interesting: if I talk with database researchers, they say it doesn’t make sense, why use Substrait, just implement the dialect you want from scratch anyway if you want to do it right. And then people in the industry are much more pragmatic; they try to use it. So what’s your take, specifically on Substrait? And I understand that, okay, it might also be political in a way, because there are communities involved, but I think it’s interesting for the audience to hear the people involved share opinions, because these are hard problems at the end, right? There’s no easy solution.

Pedro Pedreira 55:08
No, that’s a super hard problem. And in a way, as you rightly pointed out, it’s a sort of controversial discussion; depending on whom you ask, different people might have different views on it. Because of the work on dialects and because of the work with Arrow, it’s essentially the same community, right? So we’ve been talking to the Substrait community as well about how they’re planning to handle that. I think the whole problem space, essentially, is that if you want to create a unified IR for systems, which is my understanding of what Substrait is meant to do, you need to be able to express the computation without room for interpretation. It needs to be something that captures every single detail of your execution. Maybe a good example is what I mentioned before: if you have overflows, what’s the behavior? Do you throw an exception, do you wrap around? That’s something that needs to be captured by the IR. Other things as well: if you have arrays, do they start at zero or at one? Those are just toy examples, but what we started seeing in practice, as we reimplemented some of those functions, is that there are so many really complicated details. Essentially, every single function has all sorts of corner-case behaviors, there are bugs, so if you want a bug-by-bug compatible implementation of those things, it’s really hard. My understanding is that the Substrait project’s perspective is that they’re trying to capture all those details inside the IR itself, right? They don’t have anything around execution, but the IR would have enough information for you to implement things in a way that the execution is 100% correct.
There are questions about whether this is doable or not. There’s also the fact that an IR that captures every single detail is only as useful as the engine’s ability to actually provide all those different knobs. I think it’s an open question. My personal perspective is that taking an IR that was generated by one system and executing it on a different system, other than for maybe hello-world examples, for a complicated query with large expressions, functions, and UDFs, I can hardly see really working in practice. But let’s see. We have this discussion with the Substrait community, and they have people actually trying to enumerate all those differences. It’s an open question; personally, I’m a little bit skeptical, but let’s see.
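The semantic-knob problem Pedro describes can be made concrete with a toy plan. The IR shape below is invented for illustration and is not Substrait's actual format; it just shows why a portable plan must carry options like the array indexing base, or two engines will disagree on the same input.

```python
# Toy illustration of why a portable IR must pin down semantic details:
# here, whether array indexing is 1-based (Presto-style) or 0-based.
# This dict format is invented for the example; it is not Substrait.

def element_at(arr, i, base):
    """Index an array under an explicit base (1-based vs 0-based)."""
    return arr[i - base]

def run(plan, arr, i):
    # An executor must honor the options carried in the plan; if the IR
    # omitted them, each engine would fall back to its own default and
    # the "same" plan would yield different answers.
    opts = plan["options"]
    return element_at(arr, i, base=opts["array_base"])

one_based = {"op": "element_at", "options": {"array_base": 1}}
zero_based = {"op": "element_at", "options": {"array_base": 0}}

# Identical call sites, different declared semantics, different results.
assert run(one_based, ["a", "b", "c"], 1) == "a"
assert run(zero_based, ["a", "b", "c"], 1) == "b"
```

Multiply this by overflow rules, null handling, collation, and every corner case of every function, and you get the enumeration effort Pedro says the Substrait community is attempting.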

Kostas Pardalis 57:37
Yeah, that makes sense. Okay, so let’s talk a little bit about what’s next for Velox. I mean, it’s in a mature state; it’s being used by Meta at, okay, the scale of Meta, plus other companies outside Meta, and it’s actively developed, right? So what’s next? What are some exciting things about Velox that you’d like to share with the audience out there?

Pedro Pedreira 58:09
Yeah, that’s another interesting part. There are a few things that come to mind: things we’re working on right now, things that have just started, and what we’re looking at in the future. One part, which I think is more tactical, is just the open source side. There’s a lot of discussion there just because of how fast the project is growing and how much interest we’re getting from the community: how do we build the open source mechanics, the open source governance, how do we formalize a lot of those things. This is something we’ve been actively working on, not just inside Meta but talking to our partners: discussions on how the community operates, decision making. Essentially, open source governance is never an easy problem, right? So in the short term this is something we’re looking at, of course besides just continuing to add features and optimizations, the things we’ve been doing for a while. The second part, where we’ve started exploring and where some of the early results are super promising, is this idea that now that you have all of your different engines using Velox, using the same unified library, it gives you a really nice framework to evolve what you have underneath, right? Hardware is evolving; it’s not just SIMD anymore. You have other sorts of accelerators, you have GPUs and FPGAs, and you want to optimize your code to make better use of hardware as it evolves. You couldn’t really do that before, because you had to do it once per engine. But now that we’ve unified that and you have a single library, this framework is really compelling. So we started investigating what it would take, essentially what frameworks and APIs we need to provide in Velox to be able to leverage GPUs, FPGAs, and other accelerators.
So there’s a lot of exciting work in that direction. We’ve been partnering with some hardware vendors as well. For hardware vendors it’s also a very compelling platform: they just integrate their accelerator with Velox, and suddenly it can execute Spark queries, Presto queries, stream processing, and whatnot. So we’re seeing a lot of interest in the project from the hardware vendor perspective, and we have some really nice initial results on just how much of your query plan you can delegate and execute efficiently on, for example, GPUs. We’ve also been working with companies that design accelerators based on FPGAs for now, and we’re seeing a lot of good traction and initial results. So that’s one area we’re really investing in. Of course we care about this at Meta, but it’s also something where we see a lot of interest from the community, both from hardware vendors and from cloud vendors who want to enable different types of hardware. So that’s another one we’ve been working on. Lastly, there’s also a discussion we started that’s related to Velox, but maybe orthogonal in a few ways, which is the evolution of file formats. There’s a lot of discussion in the research community; at VLDB a couple of months ago there were a few papers comparing Parquet and ORC, pointing out some of the flaws, and discussing whether we can do better. Not just flaws in the design, but also how the use cases and the access patterns to those files have changed, since both formats were developed many years ago. So there’s a lot of discussion about what we could change. Internally, inside Meta, we have a project called Alpha, which is a new file format that is a little more focused on ML workloads.
So we’ve also been partnering with people from both academia and industry, trying to understand: is there anything better? What comes after Parquet? Like I mentioned, it’s a little orthogonal to Velox, but it’s related in the sense that Velox needs to be able to encode and decode those files as well.

Kostas Pardalis 1:01:57
Yeah, 100%. I mean, at the end of the day, a big part of executing analytical workloads is reading from these files, right? And usually you read a lot of data. I remember that paper you mentioned, it was very interesting to see how much is left on the table, in terms of being able to saturate S3, for example, when you’re reading data from there. So there’s a lot that can be done there, for sure. But okay, Parquet has been around for a long time; it’s very foundational to whatever we do out there. How do you see this actually happening, how can we move toward updating Parquet in a way that is not going to break things, and in some cases also, let’s say, allow use cases that are not compatible with the initial design of Parquet? I have something very specific in mind here, because you mentioned ML. One of the interesting things with ML, when it comes to storage, is that the columnar idea, where you take the whole column and operate on it column-wise, still exists, but there are also cases where you need more than that. You might need point queries, or you might need different points in time for the features that you have, for example. All these things in ML and AI are important, and Parquet was not designed for that stuff; it didn’t exist 12-15 years ago. So how do you do that? How do you iterate on these things when they’re so foundational, right?

Pedro Pedreira 1:03:44
Yeah, absolutely, that’s the interesting question. There are basically two ways of going about it. We can try to evolve Parquet by just making backward-compatible changes; personally, I’m still kind of skeptical. I know there was also Parquet v2, which has been an interesting process, but for a lot of the changes we think we might need to make, I’m not sure we can make them in a backwards-compatible way. So there’s a discussion on whether we can actually extend things and keep them somehow Parquet-compatible, or whether we would need a completely different file format. On the ML side, the first thing we saw that just breaks is that in all of those ML workloads you have thousands, even tens of thousands, of features, right? So if you want to map those things to your Parquet file, there are all sorts of inefficiencies, because it was not designed to have that many streams in parallel. That is one of the things we’ve been looking at, and like I mentioned, this Alpha file format we have addresses specific pain points. But there are also discussions on hardware acceleration. If we assume that GPUs are becoming even more pervasive, a lot of this computation will be offloaded to GPUs at some point, which means the table scan will be delegated to GPUs, and then your Parquet decoding will also be executed on GPUs. One of the things we’ve been seeing is that Parquet was also not designed with that use case in mind, and there are a lot of inefficiencies that make it not as efficient as it could be when you map things to GPUs, or even to things like SIMD. A lot of the encoding algorithms present in Parquet are not easily parallelizable.
There’s a lot of discussion on: if we had to design a new file format given those new requirements, supporting many different streams, but also things that can be executed efficiently not just on SIMD and vectorized hardware but also on accelerators, what would this format be? I think what we’re seeing is that it could be fairly different from Parquet. It’s still an open question, and a super interesting research problem; we’ll see. But yeah, I think that’s another very interesting area, and we see a lot of interest from basically all the partners and all the different companies we reach out to on this problem. The common sentiment is: Parquet is great, but we can probably do better at this point.

Kostas Pardalis 1:06:00
All right, one last question for you, because we kept discussing and kept going back to new hardware and new workloads. One of the things I find very interesting, and I’m thinking a lot about it, and I’d love to hear what you think about this: when it comes to working with data, we have two very distinct, let’s say, models of computation, right? We have the relational model that we are using, with all these operators that we have in the database system, including Velox, right, which is a huge foundation of what we have achieved as an industry, from when it was introduced, 40 or 50 years ago, until today. And then we have all this new stuff, AI and ML, where of course we are also using different hardware, and we have the tensor model of processing data. I mean, okay, both are processing data, but they’re not exactly the same thing, right? But we need both, in a way; we need somehow to bring these two together, right? Do you see a path towards that without having to abandon one of the two? What do you think about this problem?

Pedro Pedreira 1:07:29
Yeah. So that’s an excellent question. I don’t know if I have a very good answer, but what I would say is that if you look at the operations, they’re all data operations. Some of them were implemented by database people, like vectorization, joins, and expression evaluation, and a lot of them, matrix multiplications and the like, were implemented by AI people. So philosophically, in a way, you’re getting data, shuffling it, and executing transformations; it’s just that the data layout of those things is very different. One way to think about this is that all the data we have in the warehouse, and all the database computations that run on top of it, essentially run on Arrow, while the ML stuff runs on top of tensors. So the way we’re thinking about this, at least for now, is less about unifying those two worlds and more about having efficient ways to pull this Arrow data and transform it into tensors. There’s also a discussion about whether we should add support for tensors in Velox as well, so you could actually support tensors as a data type, and then potentially expose tensors inside Spark or inside Presto as data types too. How much and how deep this integration goes is an interesting open question. If we were to redesign and rewrite the entire stack from scratch, we would probably get something that looks very different. But the fact that we have these two very mature worlds, one coming from PyTorch and another one coming from analytics, essentially databases, makes it harder. I know there’s some very interesting work from Microsoft actually doing the opposite, trying to reuse ML infrastructure to run databases, and I know they’ve been getting some pretty interesting results. Let’s see how this thing unfolds.
But for us, at least, having good ways to interoperate data between those two worlds, getting data from analytics, transforming it into tensors, and sending that to PyTorch, is probably what we care about most in the short term. But I do see more potential for that kind of integration. Like I said, I think these are things that were created by different people and ended up being unnecessarily siloed.

Kostas Pardalis 1:09:46
100%. Okay. Unfortunately, we don’t have more time; I’d love to continue. I think we could be talking for another hour at least. Hopefully we’ll have the opportunity to have you again in the future. There are so many things happening around Velox, and I think Velox as an open source project can give us a very interesting view of what is happening in the industry right now. That’s why I’m so excited when I have people like you: it’s not just the technology, it’s also getting a glimpse of what is coming in general, right? So hopefully we will have you again in the future. Thank you so much. I’ve learned a lot, and I’m sure our audience is also going to learn a lot. If someone wants to know more about Velox, or wants to play around with Velox, or, let’s say, would like to get their hands on it, what should they do?

Pedro Pedreira 1:10:45
So yeah, first of all, thanks for the invitation. It was a great chat; I’d love to just nerd out about databases. I’m super excited to be here. If anyone in the audience is interested in any of this, about Velox, we usually just ask folks to reach out on the GitHub repo: go there, create a GitHub issue, and some of us will reach out. There’s also a Slack channel we use to communicate with the community, so just send me or someone from the community an email and we can get you added. Just reach out if you have any questions, if you’re planning to use it, or if you want to learn more about something. We’re always pretty responsive on GitHub and Slack. Send us a message, send me an email, feel free to reach out. Thanks again for having me.

Kostas Pardalis 1:11:32
It was a lot of fun. Yeah. Likewise.

Eric Dodds 1:11:37
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.