This week on The Data Stack Show, Eric and Kostas chat with Aditya Parameswaran, Associate Professor at UC Berkeley & Co-Founder of Ponder. During the episode, Aditya discusses the zoo of data languages including a 101 on Pandas, why builders should be adapting to users, exploring what Ponder is solving in the data space, interesting theories on the way things should operate in the industry, and more.
Highlights from this week’s conversation include:
Read more of Aditya’s work:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, we have a great topic today. We're actually going to try to bridge the gap between pandas, a Python library used for data science and machine learning use cases, mainly locally, and the data warehouse, which was really surprising to me when I first heard it. But Aditya, who started this company called Ponder and is also a professor at Berkeley, is a really fascinating guy. So I think maybe I want to start out by doing a 101 on pandas. I think it's something that, you know, a lot of people in and around data are familiar with, but I think it would be great to just do a 101 on pandas. And then I'm going to do a two-for-one here in the intro, which is unfair. I'm going to ask about, you know, another transition: covering the gap from pandas to the warehouse. I can't wait. It's gonna be awesome. Absolutely. Okay, I
Kostas Pardalis 01:28
have, like, maybe a few things that I'd love to ask him. First of all, there's Modin itself, which is a pretty popular open source project. It has, like, almost 9,000 stars on GitHub, for example. But outside of, like, the project itself and, let's say, how it works — like what the secret sauce behind it is and why it is important — what I really would like to talk about with him is why there is such a gap, let's say, between data practitioners in terms of tooling, especially if we compare what data engineering tooling is like compared to the AI/ML tooling out there. Why is this happening? Why, at the end of the day, is SQL alone not enough? And talk a little bit to understand better, like, what causes that, and what can we do to bridge this gap? And I think we have the right person to do that, because at Berkeley, outside of, like, the core, let's say, database research, a lot of the work that has been done has to do with how people interact with data products and can be more productive with data. So there's a lot of wisdom there that can be shared, like, first of all with us, and obviously, like, with our audience. So let's go do that.
Eric Dodds 03:01
Let's do it. Aditya, welcome to The Data Stack Show. We're so excited to chat with you.
Aditya Parameswaran 03:06
Thank you so much for having me. I’m excited to chat.
Eric Dodds 03:09
All right, lots to talk about with Ponder. But give us your background. So you've done a lot of data stuff. But you've also taught a lot of people about data in the university setting. So just give us your background and kind of the brief story of what led to Ponder. Yeah,
Aditya Parameswaran 03:26
so I'm a database guy, I guess. Or at least a data guy. I've been part of the data community for over a decade now. I got my PhD at Stanford, spent a year at MIT, then I became a faculty member. I was at the University of Illinois for a few years and then returned to the Bay Area. And so I'm a professor at the University of California, Berkeley. I do research on data systems, broadly defined. And the work that we do is trying to make data systems better, right? And the way we do it is to try to look at the problems people currently have with data tooling and try to make it more usable, more efficient, more intelligent, and so on. And in the pursuit of that research, we've explored a bunch of different tools, a bunch of different topics. At some point, we started looking at data science tooling, and specifically pandas. And we realized that a lot of people love pandas, and they found pandas to be incredibly useful. And we can dive into what pandas is and why it's so cool and so useful. But they were having problems with it. They were having problems at scale. They were having problems sort of getting stuck on how to use it and so on. And so we picked up a couple of open source projects towards making pandas better, one of which is this tool called Modin, which is a drop-in replacement for pandas. And this was led by one of the PhD students I was advising. This led to Modin becoming an open source project broadly used and adopted by a lot of folks. And so we said, hey, how do we amplify the impact even more? So why don't we go ahead and found a company around this? And so that's what we did about two years ago. And Ponder is the result of that. Ponder is the company behind open source Modin, and we've sort of pivoted in our trajectory from open source Modin to other products. But I'm happy to dive into why we did that and where we're going.
Eric Dodds 05:36
Yeah, absolutely. Well, I'd love to dig into the pandas 101 stuff, because I think just orienting us to, like, you know, how pandas is used and where it fits into the data science ecosystem and the Python ecosystem is really helpful. But can you just give us the brief rundown of what Ponder does and the problem that it solves before you sort of go back to the roots?
Aditya Parameswaran 06:01
Totally. So what Ponder does is it allows you to run data science at scale, directly in your data warehouse. So that's the headline. What does that mean? What that means is, for example, take pandas, which is a popular data science tool. You can use pandas as is — use the same API, use the same scripts that you've used and curated over many years — and now all of that runs inside your data warehouse, completely transparent to the user. So the user doesn't need to know where it's running; it's just now running in your data warehouse. And basically, you get all of the benefits that come with that. So obviously, data warehouses are incredibly scalable. They are incredibly reliable. There are security guarantees, governance guarantees, provenance, all of that good stuff — all for free, because now the execution engine is the data warehouse. So that's what Ponder is doing.
Eric Dodds 07:03
Love it. Okay, so let's rewind and talk about pandas. So I know a lot of our listeners are probably familiar with pandas, but we love digging down to the 101 and providing context. So maybe what would be helpful is, can you paint a picture of sort of, maybe, like a vanilla Python flow? And then why do people love pandas compared with sort of, you know, maybe more vanilla, you know, libraries or flows, broadly?
Aditya Parameswaran 07:36
So what is pandas? Pandas is, basically — I would call it the language of data science. It's the Swiss Army knife for data science. It's used for everything ranging from data cleaning, data extraction, data transformation, analysis, visualization, and modeling. It's basically doing everything in the data science world that you would need. And it's an incredibly complex API — more than 700-plus functions, each of which has maybe thousands of parameter combinations, right? Like, it's a really incredibly complex, incredibly rich API that has been lovingly curated and improved upon over the course of many years, right? So it's, like, the result of many years of love and attention from the open source community to build something that's just, like, super useful for the data science and AI community. It is just the first tool that you would go to if you wanted to get the job done on any kind of data transformation task that you have.
Kostas Pardalis 08:55
Right, like, it is the
Aditya Parameswaran 08:57
library, right? And so, like, literally anything that you would want to do from a data transformation standpoint, a data analysis and visualization standpoint, or a data cleaning standpoint, you would use pandas. It basically incorporates ideas and APIs from the spreadsheet universe. It incorporates ideas and APIs from the relational database universe. It incorporates ideas and APIs from the linear algebra universe, which is why it's a great fit for the ML and AI side of the equation. It's used by, I think, around 25% of all software developers. And remember, that includes a lot of web developers and so on, right? So it's, like, very popular compared to a lot of other tools out there. It's, I think, seven or eight times as popular as Spark — just to put that in perspective, when we are talking about the data science community relative to the data engineering community. It's almost as popular as SQL. SQL is also an incredibly popular language, and pandas is very close to SQL.
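To make that point concrete, here's a minimal sketch (with made-up data, not from the episode) of how a single pandas DataFrame supports relational-, spreadsheet-, and linear-algebra-style operations side by side:

```python
import pandas as pd

# Made-up sales data for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [10, 12, 7, 9],
})

# Relational-style: roughly SELECT city, SUM(sales) ... GROUP BY city.
by_city = df.groupby("city")["sales"].sum()

# Spreadsheet-style: pivot years out into columns.
wide = df.pivot(index="city", columns="year", values="sales")

# Linear-algebra-style: transpose the pivoted table in one call.
flipped = wide.T
```

All three idioms operate on the same object with the same API, which is part of why pandas serves as a common denominator across these universes.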
Eric Dodds 10:05
Yep. Fascinating. Okay, so we're sold on pandas. Like, if you're going to do data science and ML work, you know, it's the library — install pandas, and it'll make your life easier. Yeah. But there are problems, obviously, or Ponder wouldn't exist. And I'd love to approach this from the user experience standpoint. I know you've done a lot of research actually on sort of user experience, and you're a database guy. So can you walk us through, like, what is it like to use pandas? And where does it fail for the user in terms of the things that Ponder solves?
Aditya Parameswaran 10:50
Pandas, while it's an incredibly rich, incredibly intuitive, incredibly expressive, convenient API — and if you would like to learn more about why, and why pandas over something like SQL for the purpose of data science, I have, like, a four-part blog series that I can share if you folks want to
Eric Dodds 11:10
put it in the show notes. Brooks, we'll have him put it in the show notes on the show's site. Yeah,
Aditya Parameswaran 11:16
yeah. So coming back to what happens when people try to use pandas: if you try to use pandas at scale, you basically will run into issues. Why is that? Pandas is single threaded. So even if you try to operate on a large dataset on a beefy machine, you're gonna get pretty much the same performance as on a not-so-beefy machine, because it doesn't take advantage of multiple cores. And it is very inefficient with memory. It does most of the processing of data in memory, as opposed to on disk, so you're limited by the amount of memory that you have, and it keeps making redundant copies. What often ends up happening is you'll be midway through your workflow, you run out of memory, and it will crash, right? And finally, it has no optimization built in. Every operator runs by itself; it doesn't take into account the fact that if there are multiple operators chained together, maybe it can reorder them to make the whole thing run faster. That's not something that pandas does. So to recap: pandas is bad with memory, it doesn't do any optimization, and it doesn't take advantage of multiple cores. All of this means that if you're trying to run pandas on datasets that are more than a few hundred thousand rows, or a few million rows, depending on the type of task that you're trying to accomplish, you're stuck, right? It'll just break down; you won't be able to get your job done. So what we've seen, at least in terms of workarounds for people in practice — there are a few workarounds. One is you operate on a sample of your data in pandas, and then convince someone, or maybe yourself, to translate that workload to run in Spark or in SQL so that you can run it at scale. So that's one approach. The other approach, like I said, is you just run it on a sample and then be satisfied with that sample, right?
And the insights that you get with that sample — hopefully, you've taken a random enough sample that those insights translate to the entire dataset. But if not, you're toast. The third approach, which is what we are adopting at Ponder, is to say, hey, we keep the pandas workflow as is, and we will run it at scale for you. You don't have to do this translation back and forth to a language that allows you to operate at scale. You can just stay with pandas, and there's no need to interact with the data engineer who will do that translation for you. There's no need for these long feedback loops. There was a company that we spoke to early in the days of Ponder who said it took them six months to translate a pandas workflow into PySpark. Six months. I think this is on the higher end; the more typical numbers that we've heard are like three to four weeks of translation cost from pandas to a big data framework. And so our value proposition is that we'll just do it for you, right? Like, it will be automatic. It'll be out of the box. You don't need to worry about it.
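The sampling workaround described above can be sketched like this — a toy illustration with synthetic data, not Ponder's approach. The point is the gap between the sample estimate and the full-dataset truth, which you silently accept:

```python
import numpy as np
import pandas as pd

# Pretend this is a dataset too big to process comfortably in pandas.
rng = np.random.default_rng(0)
big = pd.DataFrame({"value": rng.normal(loc=5.0, size=1_000_000)})

# Workaround: develop against a random sample and hope the insight
# (here, the mean) generalizes to the full dataset.
sample = big.sample(n=10_000, random_state=0)
estimate = sample["value"].mean()
truth = big["value"].mean()
gap = abs(estimate - truth)  # sampling error you never see in practice
```

With a well-behaved distribution the gap is tiny; with skewed or clustered real-world data, "hopefully you've taken a random enough sample" is doing a lot of work.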
Eric Dodds 14:20
Yeah. So is it an oversimplification, as you were talking, I thought, Okay, well, I mean, it seems simple, like pandas is really good for local development on smaller sample data sets, right? I mean, that seems to be sort of the developer experience that pandas has optimized for. Is that an oversimplification or is that the main sort of mode of usage that you see?
Aditya Parameswaran 14:44
It is the main mode of usage that we see, and it is because of pandas' limitations that that is the main mode of execution and development that we see. And it's useful to decouple two aspects, right? One is pandas the API, and the other is pandas the execution engine. Today, we conflate the two and say it's one thing, right? Pandas is broken. But pandas is not broken — in fact, the API is great. People love the API. People are saying, hey, the API is expressive, it's rich, it's convenient. All of that is great. But the execution engine is broken. And so with Modin, what we are saying is, why do we conflate the two? Let's not fault the API for the execution engine. Let's throw out the execution engine. We will be the execution engine, and we will keep the benefits of the API. And so that's the shift. In fact, Modin, when we started out — the open source project that Ponder was built out of — was pandas on a distributed computing engine. It ran on Ray and Dask, which are both distributed computing engines in the Python universe. And that became incredibly popular. And then we migrated to data warehouses, because people were like, hey, I don't want to manage Ray and Dask clusters in addition to managing the data warehouse. Can you just help me run pandas on data warehouses directly? And so we said, you know, hey, let's go build that. And that's what led us to build the Ponder product.
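The "drop-in replacement" idea is that only the import line changes; everything downstream keeps the pandas API. A sketch of that — the Modin swap is shown as a comment, since it assumes Modin is installed, and the code below runs on plain pandas:

```python
# The Modin pitch: change one import, keep the rest of the script.
# import modin.pandas as pd   # <- drop-in swap, assuming Modin is installed
import pandas as pd           # plain pandas behaves identically here

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Everything below is unchanged pandas API. Under Modin, these calls
# would be dispatched to a backend like Ray or Dask instead of the
# single-threaded pandas engine.
total = df["a"].sum() + df["b"].max()
```

This is the decoupling being described: the API stays fixed while the execution engine underneath is swappable.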
Eric Dodds 16:17
Yep. Okay. I want to ask about the data warehouse aspect, because usually you don't start there when you're thinking about, you know, Python and machine learning use cases. So I want to dig into that. But first, can we dig into a little bit more of the challenge of translating pandas to what you called more of a big data platform? Right? So you develop something in pandas locally — because of the limitations, you're doing it on sample data. And then, of course, it's like, okay, well, we need to run a model in production, we need to serve features, it needs to make recommendations in our app, or whatever you're doing. Can you describe the user experience problems with going from pandas to, like, okay, I need to migrate this to PySpark and, like, run production on, you know, on Spark to actually deliver this stuff?
Aditya Parameswaran 17:14
Yeah. So why is it so hard to do this, right? I quoted three weeks, four weeks, six months to do this translation from pandas to PySpark or to SQL. Why is it so hard? The answer is, it's because pandas has a different data model and API than the relational universe, or even the Spark universe. And what are these differences? So first, pandas has an ordered data model. It actually assumes an order for the rows; both the relational universe and the Spark universe don't assume order. And people can rely on the order, which is part of what makes pandas so intuitive. Then there are these notions of row and column indexes. So you can label the rows and refer to them using these row indexes — again, a super convenient feature that is not available
in relational databases or Spark. And then there are a bunch of different API-centric operations that are very easy to do in pandas but really hard to do in SQL: linear-algebra-style operations, spreadsheet-style operations. If you want to do things like pivots, that's very easy to do in pandas and very hard to do in a database, although some databases are starting to build in some of that capability. Linear algebra is extremely hard. If you want to do a transpose in a database, you're toast, right? In pandas, you can do a transpose, you can multiply matrices together, you can do things like that. So all of these conveniences in terms of the data model and the API — if you need to hand-translate all of that into something like PySpark or SQL to get the same effects, it's actually a really hard problem. Even from a research angle, it's a really hard problem to get your head around, which is why it took us a while to sort of build this out, right? This is why it originated as research at Berkeley, and then we developed a product.
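A small sketch (made-up numbers) of the pandas conveniences listed above that have no direct SQL counterpart — reliance on row order, label-based row lookup, and a one-line transpose:

```python
import pandas as pd

# Quarterly figures (made up), with labeled rows -- a row index has no
# direct equivalent in the relational model.
df = pd.DataFrame(
    {"q1": [1.0, 2.0], "q2": [3.0, 4.0]},
    index=["north", "south"],
)

# Order-dependent: difference from the previous row. This relies on row
# order, which relational tables do not guarantee without an ORDER BY
# and window-function machinery.
delta = df["q1"].diff()

# Label-based lookup via the row index.
south_q2 = df.loc["south", "q2"]

# Transpose is a one-liner in pandas; SQL has no general equivalent.
flipped = df.T
```

Each of these lines is trivial in pandas; faithfully reproducing their semantics in SQL is exactly the translation cost the episode is describing.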
Eric Dodds 19:22
Yeah, that’s fascinating. I mean, it almost seems like you develop a feature in pandas. And so you, and you can get there fast with pandas because of all of these ergonomics that are really nice at the API. And so you know, what the end state is, and you kind of know, like, the basic data model that I almost envision it as like, you know, what you want to say, you know, you know, the meaning of a statement that you want to make, but if you translate that to a different language, you actually have to know like, the context of the culture and, you know, conjugation and you know, all of these different things that like make it actually pretty difficult to get the same end result. Even though you know what that is like, you’d have to do a bunch of background work to sort of migrate all of this underpinning to sort of produce the same concept.
Aditya Parameswaran 20:13
Absolutely, I think that's a great analogy. And I think part of this translation process is that, because it's usually done manually, it also leads to bugs, right? And those bugs are not discovered until you run it in production. And then you're like, hey, there's a bug. And then you go back to the person who wrote the script, and they're like, I actually meant something else. And then you repeat this process, and each time it takes three to four weeks of translation. It's, like, lost in translation. That's what happens.
Eric Dodds 20:46
And just to put a sharp point on it: when we say translation, what we're talking about is you produce things with pandas — say you are using linear algebra or doing something like matrix math or whatever — but then you have to reproduce that in SQL. Especially if you assume ordering in pandas, and that's taken care of automatically, to reproduce that in SQL — that's what we're talking about, right? So you're writing potentially thousands of lines of SQL that reproduce, like, you know, linear algebra or matrix math with assumed order. You basically have to hand roll the — you know, anything time-related in SQL is crazy anyway — you're hand rolling all of that. Is that the translation?
Aditya Parameswaran 21:24
Yep. Like, often we'll have a single line — ten characters in pandas — that translates to 400 lines in SQL, right? Yeah, so you're absolutely right. And we handle all of the things like indexes, or the fact that the API is much wider in pandas than in SQL. All of that is handled transparently for the user.
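As a flavor of that asymmetry — not the specific 400-line case mentioned here — a single pandas call like `describe()` bundles aggregates that would each need their own expression per column in SQL:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

# One short pandas line computes count, mean, std, min, quartiles,
# and max for every numeric column at once...
summary = df.describe()

# ...whereas the SQL equivalent needs a separate aggregate expression
# per statistic per column, and percentiles in particular are verbose
# in many dialects.
mean_x = summary.loc["mean", "x"]
```

Multiply that by index handling, ordering assumptions, and the width of the pandas API, and the hand-translation cost quoted in the episode becomes plausible.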
Eric Dodds 21:47
Yep. Okay. Last question from me — I always say that, and it's usually a lie, but last question from me, because I know Kostas has a bunch, and I actually want to hear Kostas ask them. This is a question for Kostas as well. But let's talk about the data warehouse, because when you think about ML workflows — I mean, the default, okay, so pandas locally, for sure. But when you think about running ML stuff in production mode, the standard, you know, the standard reference point for most people is you're running Spark, you know, on some sort of Databricks, you know, data-lake-flavored infrastructure. So let's talk about the warehouse with Ponder, because you're talking about a flow that, you know, is maybe not a default reference point, right? Like, I'm taking pandas work and I'm putting it onto a data warehouse. I mean, I'm thinking about the compute implications, the, you know, data structure implications. Can you talk about that? I mean, this is feedback that you got from users of Modin. But walk us through that, because you don't think, okay, I'm going to take my pandas and dump it, you know, straight into Snowflake or something like that.
Aditya Parameswaran 23:03
Yeah, so maybe a bit of history is relevant. So we've seen the worlds of data science and ML, on the one hand, and data analytics and data warehouses, on the other hand, sort of evolve separately over the past several decades, right? So there's the Python-centric approach, like pandas — and pandas doesn't exist in isolation. There's also NumPy, scikit-learn, what have you, right, and the visualization libraries like Matplotlib and Seaborn, etc. So this is all in the data scientist's toolbox. And then there's the analytics universe: traditionally OLAP-centric databases, which then over the last decade or so also incorporated Spark to deal with somewhat semi-structured, nested, JSON-type data, but which still caters largely to the data engineering persona rather than the data science persona. And data engineers are pretty comfortable with Spark, a solid framework that allows you to process data at scale with slightly more flexible operators than SQL, but SQL is still the lingua franca for the analytics universe. And so these two worlds have evolved separately. There have been several attempts to bridge the divide between the data science and AI world and the database world. One of those attempts has been to add Python-style UDFs to databases, which didn't end up being super popular — it adds more constraints. It's not something that has a lot of adoption. I mean, it's useful, but it's not as broadly adopted as the data science stack. There have also been attempts to add ML primitives to databases — again, not a lot of adoption. But broadly, I think these two universes have stayed separate, partly because it's so hard to map the data science primitives to the data warehouse primitives. The relational world is very different from the data science world. And that's what our research unlocked, right?
So our research is saying, hey, we can take the pandas API, we can distill it down to a small set of core operators, which can then be mapped to the relational operators. That research unlocked the possibility to then say, hey, now that we've done this for pandas, you can do it for other data science libraries as well — the ML libraries or the visualization libraries. All of them can now be pushed into the data warehouse. And so our belief is that the libraries are great, the APIs are great. The execution engines don't need to be tightly coupled with the APIs, and the execution engines could be anything. It could be databases or data warehouses. It could be distributed computing engines, like Dask and Ray. It could even be Spark, right? All of these are just execution engines. You shouldn't need to know the details of Ray or Dask, or Spark or SQL, in order to be able to use your favorite data science or AI or visualization library. That's our belief.
Eric Dodds 26:25
I love it. All right, Kostas. I've been monopolizing, and I could keep going. But of course, I need to hand over the mic.
Kostas Pardalis 26:34
That's good. Eric, you're asking some incredibly interesting questions, and Aditya is giving some amazingly interesting answers. I don't know, maybe you should continue asking the questions. Maybe I'm not needed here. But I'll do my best to also ask interesting questions. So, Aditya — I was listening all this time as you were talking about, like, pandas and the APIs. I love the whole focus around the user experience, or the developer experience, if you want to be more precise, which is great. And we have many things to talk about there. But before we do that: we have pandas, we have Koalas, we have bears out there — like a whole zoo of, let's say, different frameworks that somehow relate to pandas, right? Because we also have to keep in mind, like, historically, okay, as we said, pandas has been built over a long time, right? It's not something new. What is the difference between all these? And what's your take on what sparked the need to build these new, let's say, iterations of what pandas is trying to do, given all the work that has been put into pandas over the years? Yeah. So
Aditya Parameswaran 28:02
I think, again, it's useful to talk a little bit about history here. I am a database professor; I keep bringing up lessons from database history, so my apologies for that. It's very hard to convince people to change a language that they're comfortable with, right? So if you are used to SQL, and all of your work is in SQL, then it's hard for people to come in and be like, hey, you should change all of your existing workflows and use this other language or library. Likewise, I've seen a similar groundswell of adoption for pandas over the years — like I said, like one in four developers use pandas. And pandas is not going anywhere, just like SQL isn't going anywhere. Pandas is not going anywhere anytime soon. Even after a decade of adoption, Spark is one-fifth that of pandas, right? So I think APIs come and go. There have been many attempts at building better data APIs, but it's an uphill battle to convince users to be like, hey, you know what, you should switch to my better data API, right? And yes, that better data API may solve some ergonomic issues for the user — a slightly better developer experience, slightly better ergonomics, slightly simpler, not as complex as pandas, maybe slightly more performant — but people are not going to adopt it for those reasons alone. Because, you and I, we speak different languages. You're very comfortable in your language; I'm very comfortable in my language. We have a language in common, which is English. I'm not going to switch and learn an entirely new language just because you asked me to. I'd much rather deal with the limitations of the language that I speak if there are tools that will help me deal with them — which is what we're trying to do with pandas: keep speaking the language you want to speak, and we'll just make it better.
I think that's a much more compelling value proposition than starting from ground zero and convincing people to adopt an entirely new flavor, right? Like, for example, you brought up Koalas — I'll respond on that in a second. But Polars is another one. It's a commendable effort, and a commendable amount of adoption that they've gotten in a short amount of time, but still minuscule compared to pandas' adoption. And I think there are a couple of reasons. One: if you use pandas, you're used to pandas, and you're not likely to switch to Polars. And now with tools like Modin and Ponder, you don't necessarily need to switch to Polars to get performance benefits, right? The other piece is, Polars is not trying to support arbitrarily large datasets the way we do with the data warehouse, where you can just scale up the backend. Polars is trying to get the most mileage from a single machine, as opposed to a data warehouse. So there are differences in approach. And I should say that I feel the pain of anyone developing data query languages — I've developed a bunch of them over the years. And what I've found repeatedly is that it's a lot of effort — a lot of blood, sweat, and tears goes into it. But often it's still very hard to compete with the likes of pandas and SQL and Spark — and even Spark is, like, one-fifth the adoption of pandas — still very hard to compete with those, because people have just gotten used to them.
Kostas Pardalis 31:38
Yeah, yeah. No, we're sold on this. And, okay, you mentioned Koalas. I also mentioned Koalas, because I would like to make the connection there with Spark, right? Because Spark — especially if you come from, let's say, the more analytical space — you think of Spark as, you know, the de facto tool when you want to work with Python, right? Even data warehouses, like Snowflake with Snowpark and all the work that they have done there. Still, I think whenever we put Python together with data, everyone's mind goes, like, directly to Spark, right? And there is PySpark, but before that there was Koalas. So, like, what was Koalas, and why was it created, and how does it compare to pandas?
Aditya Parameswaran 32:34
Yeah, so Koalas was actually an early attempt at bringing the pandas API to Spark. And I'd like to think that it was inspired by Modin, because it started after Modin — but maybe they came to this realization independently. Either way, I think it was an effort to bring the pandas API to the Spark universe. We have a more detailed comparison with Modin, but Koalas is very tied to Spark, and it doesn't quite support the same API as pandas does — it supports a subset of it. And due to the fact that it's very tied to Spark, it doesn't actually get as much performance as we would like, because it has to deal with Spark's idiosyncrasies. But yeah, it is addressing a similar problem as Ponder is, in that it's saying, hey, you continue to use the pandas API on your existing data infrastructure — in this case, Spark. So it is addressing a similar problem as Ponder is from that perspective. Yeah. Your other question, on sort of PySpark, Koalas, and Snowpark bridging the Python universe and the analytics universe — I think that is spot on. Let's leave aside Koalas for now. But PySpark and Snowpark definitely bring the benefits of the Python-centric universe to the analytics universe. However, they're telling people to rewrite their workflows in things like Snowpark or PySpark. And that's usually not something that data scientists do; it's something that data engineers will do. So these are tools targeted at data engineers rather than data scientists — different personas.
Kostas Pardalis 34:26
Yeah, and my next question is: why is this persona so different? I mean, okay, we have Spark, with Koalas as an attempt to bring a Pythonic API to the data, right? Why do we need PySpark? Why do we need this whole concept of the DataFrame? Why another iteration to satisfy, let's say, this different persona, the data engineering persona? Why do ML and data engineering have to be so different in the end? I mean, if you abstract enough, it's the same thing: you have pipelines, you have data that needs to go through different stages of processing to end up with a dataset that is going to be the input to something, right? It's not that much different. So why is this happening?
Aditya Parameswaran 35:24
Yeah, so that's a really insightful question. As someone who is schooled in the database universe, my initial instinct getting into all of this was: why should we have all of these? Ultimately it's just data transformations; everyone should just write SQL, right? I think we underestimate the challenges facing folks who don't have a computer science background and are getting into things like data science, ML, and AI. They don't know enough about distributed systems, don't know enough about computer science, but are trying to get useful work done with data. A lot of these folks have taken a data science or coding bootcamp. They know enough Python, they know enough pandas, and that's about all they learn in those bootcamps. Telling them, hey, can you write all of your workloads in Spark, can you write all of your workloads in Snowpark, is such a heavy lift, because their mental models can't accommodate thinking about distributed systems, managing a cluster, and rewriting workloads they are comfortable writing in Python or pandas into something like PySpark or Snowpark. That is simply out of the question for them. Imagine, for example, you're a biologist, and you've just learned enough Python data science to do some genomic data analysis. Now you're being told: hey, you know what, you need a deep understanding of distributed systems and databases just to get your job done. It's crazy. In fact, one of the origin stories for Modin was a group of genomics researchers who did all of their work in Python and pandas.
But they were actually writing a Spark job just to generate a sample of their data that they could then analyze in pandas. All they ever learned of Spark was how to extract a sample so they could play around with it the way they wanted in pandas. It's just such a heavy lift for a lot of these folks. Expecting them to say, you know what, here's another tool, it's all data transformation under the hood, why don't you learn it? That may be easy for you and me, easy for other folks schooled in computer science, easy for folks schooled in IT systems, but it's a really heavy lift for someone in finance who's coming from the Excel world and has done just enough Python and pandas to get by. Telling them to retranslate everything is a heavy lift. Biology, finance, insurance: there are all these industries that are becoming data rich. They now have people who know a little bit of programming and a little bit of data science and want to get insights from data, but they can't, because they get stuck.
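The gap in that genomics story can be made concrete. The sketch below is hypothetical (the file names, sizes, and the commented Spark job are illustrative, not from the actual project), but it shows how small the task was in the language the researchers already knew, versus the detour they had to learn.

```python
import pandas as pd

# Stand-in for a dataset small enough to play with locally.
df = pd.DataFrame({"gene_id": range(1000), "expression": range(1000)})

# What the researchers actually wanted: one line of pandas.
sample = df.sample(n=50, random_state=0)

# What they had to learn Spark for (roughly), just to produce
# that sample from data too big for one machine:
#     spark_df = spark.read.parquet("genomes.parquet")
#     spark_df.sample(fraction=0.0001).toPandas()
print(len(sample))  # 50
```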
Kostas Pardalis 38:18
Yeah, yeah. 100%. I think that's a great point you're making, and I want to get into the discussion of what our industry should be doing more of, because I have a feeling we take many things for granted. We live in our own bubble, where it's like, oh yeah, distributed systems, sure, that's fine, whatever. But not everyone is interested in that stuff, right? People have other things they have to do, and they want to use tools they are comfortable with, which comes down to the user experience and the developer experience. And it's a very good point, because we see, especially lately, so many new languages coming up. We also have a very strong memory in this industry, I think; just look at all the SQL flavors out there, and how difficult it is to move from one SQL system to a different one. We're not even talking about anything else, right? And we like to think, oh, my language is better, you should be using that. Yeah, sure, okay, whatever. And then you go to market and crash, because people have better things to do than believe in your dream, right? So I think it's very good to talk about this stuff on the show, to communicate to whoever is thinking of going and building something new out there: think about the user and where they come from. It's more on the builder to adapt to the user than for the builder to assume the user is going to adapt to them, right? Exactly, exactly.
Aditya Parameswaran 40:11
We as tool builders have a tendency to build for ourselves. It's like computer scientists building tools for computer scientists, or systems builders building tools for systems builders. Not everyone is like you, and not everyone is happy with a tool that was chucked over the wall at them, right? Nobody's going to say, hey, you sent me this new tool that forces me to abandon my existing workflows and all my existing scripts, sure, I'll use it. People don't do that. Yes, they might because it's a fad for a short period of time, and yes, you might convince some people, but eventually they revert back to the old practices, right? So again, a lesson from history: in the early 2010s, there was a vast number of NoSQL systems that came up, some addressing OLTP issues, some addressing OLAP analytics issues. There were so many of them. And now, if you think about it, of all those languages invented expressly to deal with the challenges of big data, so to speak, the only things that have survived the test of time are Spark and SQL, and Spark had to support SQL, right? It's not just PySpark. And again, I feel like I'm constantly peddling my blog posts, but I recently wrote a post responding to that "big data is dead" post that was put out recently, and I argued that big data is actually not dead; a lot of organizations still have big data. And I traced a little of the history of NoSQL systems and how we are really reliving its lessons: ultimately, we have to adapt to our users rather than expecting the users to adapt to us.
Kostas Pardalis 42:07
100%, 100%. I think another good example of that is how many companies have died trying to kill spreadsheets. It's as simple as that. Yeah, exactly. Okay, so a follow-up question. We still have this split, right? Even in mature data organizations, you have the ML teams, who have their tooling, using pandas or whatever they want to use. And you also have a data engineering organization, and obviously they use different tools, right? And in many cases you see this ends up duplicating both effort and infrastructure. So how can we bridge that, while at the same time respecting what each persona wants? Because in the same way that the data scientists want their pandas, the data engineers want their Spark, right? So, yeah, how are we going to do that?
Aditya Parameswaran 43:19
Yeah, I think this is a really interesting question, and it's going to see continued evolution over the next five to ten years, as these personas become a little more blurry and we see a little more consolidation in the data stack. What we see right now is very simple: you have your data warehouses on the one hand, you often have a Spark cluster as well, or maybe just one of the two in certain cases. And then your data scientists and ML and AI folks will often just pull out as much data as they can fit in memory and go off to work on it using their Jupyter notebooks and Python and pandas. That's what is happening right now. So there are often multiple sorts of infrastructure: beefy machines that the data scientists are working on, maybe a Spark cluster, and also a data warehouse. And there is a convergence between the Spark universe and the warehousing universe, and we are seeing that happening both from
Aditya Parameswaran 44:34
traditional data warehousing companies like Snowflake or BigQuery or what have you, as well as from Databricks. I think they are converging. We still see data scientists pulling data out and operating on it on their laptops or their beefy machines. That's still happening, and that's part of what Ponder is trying to address: hey, you don't need to move possibly big data onto laptops and beefy machines just because your data scientist likes to operate on their laptop. That shouldn't be happening. You should keep all of that in a heavily governed environment, which is your data warehouse, and allow your data scientists to operate on it directly, right? Of course, there are challenges in completing this vision. Can you map every single data science library a person wants to use to a data warehouse setting? We don't know yet; we are making progress on that front. But the ideal universe I see in the future is one where everyone interacts with a common set of infrastructure, be it a data warehouse, be it Spark, what have you, just one infrastructure, but people use their preferred languages to interact with it. And then there's middleware that translates between the various languages. So if you're using PySpark, maybe you directly access your infrastructure using PySpark. But you could also use pandas, you could also use visualization libraries, you could also use BI tools. In all of these cases, there is an intermediate layer that understands whatever language or API is being invoked and can translate it to the infrastructure, right? It acts as a middle person and does the translation for you. That is the ideal universe I would like us to move toward, because it would simplify a lot of work.
Why do we have to manage three different infrastructures just because one doesn't speak pandas, or one doesn't speak Spark? You should have just one infrastructure that is heavily managed, heavily provisioned, reliable, fault tolerant, and scalable, and then users can use it in whatever way they want.
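The middleware idea Aditya sketches can be illustrated with a deliberately tiny translator: accept a pandas-style operation and emit SQL the warehouse runs, so the data never leaves the governed environment. Every name and the function itself below are hypothetical illustrations; Ponder's and Modin's real translation layers are far more general than this.

```python
def pandas_filter_groupby_to_sql(table: str, filter_col: str,
                                 threshold: float, group_col: str,
                                 agg_col: str) -> str:
    """Translate the pandas pattern
        df[df[filter_col] > threshold].groupby(group_col)[agg_col].sum()
    into one SQL statement that runs where the data already lives."""
    return (
        f"SELECT {group_col}, SUM({agg_col}) AS {agg_col} "
        f"FROM {table} "
        f"WHERE {filter_col} > {threshold} "
        f"GROUP BY {group_col}"
    )

sql = pandas_filter_groupby_to_sql("orders", "amount", 100, "region", "amount")
print(sql)
# SELECT region, SUM(amount) AS amount FROM orders WHERE amount > 100 GROUP BY region
```

A real middleware layer has to handle the whole pandas API, including operations with no clean relational equivalent, which is exactly the hard part discussed next.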
Kostas Pardalis 46:39
How far away are we right now from this reality of having this middleware, or, let's say, from having in the data space infrastructure similar to what we have in compilers? Because that's what I visualize in my head when you talk about this middleware. It's like the front end of a compiler, which can parse, let's say, SQL or PySpark or whatever, and then on the back end you have something like LLVM, for example, that can target different targets, right? In one case, of course, the target is hardware, but it could be a database system. I'm asking you this both from your experience in industry and in academia, because I feel it's also an academic question; it's not just a matter of someone sitting down and writing the code. So how far away are we from that, and what will it take to get there? Yeah, I think
Aditya Parameswaran 47:41
from an industry perspective... actually, let me start with the academic perspective. What we've shown with Modin is that it's possible, right? These worlds don't need to be separate. You can translate from a data-science-friendly API to a relational API; that is feasible, and it's doable in a performant way. And we'll continue to do this for a few other data science libraries like NumPy and matplotlib and so on. So it is doable; the proof points are there. Now, what are the challenges, and how long will it take us to get there? Part of the challenge is that there is a long tail of data science libraries, a very large number of them. I think Ponder, for example, can help you get to the ones that are very popularly used and adopted, so pandas, NumPy, scikit-learn, and so on; that's how we are addressing this, mapping them to the popular backends as of now, the Snowflakes and the BigQuerys of the world, and possibly in the future Spark, or what have you. So Ponder can help with the heavily adopted libraries and the heavily adopted databases, but there's a very long tail of databases and a very long tail of libraries. That, I think, has to come from a combination of open source contributions, because people are invested in having their libraries operate on data at scale, and from database vendors saying, hey, I want people to be able to run pandas at scale on my database, be it MongoDB or CockroachDB or whatever bespoke database you have, so I will help you build those integrations. Filling out that long tail is going to take a while. But my guess is we could get to 80 to 90% of the popular interfaces and backends relatively soon.
Kostas Pardalis 49:41
That's super interesting. Personally, I'm also very interested in that problem. Alright, that's all from my side. I think we're getting closer to the buzzer, as we usually say, so I'll give up the microphone.
Kostas Pardalis 50:02
As I said, yeah, I'll give the mic back to Eric. From my side, I think we need more time; we need to spend more time specifically talking about the user experience. I think it's something that's missing so much in the data industry, let's call it that. And there's a lot to learn, especially by seeing what happens in other disciplines of software engineering. Look at what front-end engineers experience today, what tooling they have, what progress has been made there, and compare it to what is happening in data. I think there's a lot of value in hearing and highlighting your opinion and having that conversation with you. So hopefully we can have you on again in the future.
Eric Dodds 50:58
All right. Just one last question for me, because we're close to the end here. I'm going to ask about your research, just because you've done so much interesting work: what is one of the most surprising discoveries you've made in your research, something you completely did not expect?
Aditya Parameswaran 51:19
Maybe I'm going to repeat the theme we've heard multiple times over the course of the podcast, but I think the most important lesson I have learned from research is: don't ignore the user. Literally, don't ignore the user. Don't ignore their workflows, don't ignore their ergonomics, don't ignore their practices, don't ignore the bigger organizational contexts they fit into. Early on in my research, I had a tendency to chuck tools over the wall at users, and people would not adopt them. It's been a sobering lesson over the course of the last decade: we have to go where users are, build tools for them, and learn about their limitations, rather than simply saying, here's another nice new shiny tool. Because often the infrastructure investment in managing a new shiny tool isn't worth the added benefits you get from it. So yeah, that's been a sobering lesson. But it's been great to work very closely with users, both at Berkeley and at the company, to learn about their challenges and build around those, as opposed to building new tools from scratch. And that's been quite rewarding too, because the users feel the pain, and when they say, hey, you solved it for me, that's a different level of excitement and enrichment altogether.
Eric Dodds 53:02
I love it. Wise words for all the work we do. So, Aditya, thank you so much for joining us on the show and giving us some of your time. It's been wonderful. It's been
Aditya Parameswaran 53:13
a pleasure. Thank you so much for having me. And I’d love to be back.
Eric Dodds 53:19
Kostas, I love how practical that conversation with Aditya from Ponder was. I mean, we dug into the specifics of pandas, but then we also got pretty theoretical; he has some really interesting ideas about the way things should operate. And I think my big takeaway, and this is actually from your conversation with him, so I'm going to unashamedly steal your thunder, is this idea, though he didn't necessarily use these terms, of almost artificial scarcity around execution engines. He envisions a world where the ergonomics are such that you can use whatever language you want and plug it into whatever execution engine makes sense for your business. Currently there are a lot of limitations based on the languages and how they're tied to different execution engines, but if you remove those barriers, it becomes really interesting to think about. And your point about compilers was really interesting. When you think about going from pandas to, say, a Snowflake warehouse, that doesn't seem like a normal mode of operation based on typical ML workflows when you're going into production. But I love the vision, so that was really exciting, and I'll think about it a lot, because I think that is where things probably should go.
Kostas Pardalis 54:49
Yeah. Yeah, for me it was a very fascinating conversation. That whole focus around the user experience and the developer experience, the opportunities that exist there to build and deliver value, and all the insights from Aditya around that stuff, it was amazing. And also, I think he's the second professor we've had on the show, after Andy from CMU. And it's really nice to have these, okay, kind of unicorns: very academic people who are also building commercial products. Maybe we should have them both on together.
Eric Dodds 55:39
On the show, we should do a panel, because I think that's interesting. They really have distilled a huge legacy of academic research into things that are very practical in the market. I think that's a great way to describe it.
Kostas Pardalis 55:55
And they've gone further: they started their own companies. So it's not like they're academics who have it all in theory only; they have seen how the sausage is made, which I think makes them even more interesting. And, I don't know, it's also the personalities. Having the two of them on together would be interesting.
Eric Dodds 56:18
We need to do it. Yeah. When someone starts out by saying, you know, "I'm a database guy," it's probably going to be a good conversation. And it just sounds cool: "I'm a database guy."
Kostas Pardalis 56:29
Yeah, yeah, absolutely. There were so many sides to this conversation, and I hope we'll spend more time with him, to go even deeper into working with data systems at scale and seeing all these things from the user perspective, not just the technology.
Eric Dodds 56:54
Absolutely. Well, thanks for joining The Data Stack Show. Subscribe if you haven't, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe in your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.