This week on The Data Stack Show, Eric and Kostas chat with Tony Wang, Graduate Research Assistant (PhD) at Stanford University. During the episode, Tony discusses his journey from China to studying electrical and hardware engineering at MIT, his transition to data processing systems for his Ph.D., and the academic-industry connection. Tony shares insights on cloud data processing, the limitations of academic hardware projects compared to industry giants like NVIDIA, and the potential for software innovation in academia. He also delves into his current research focus on time series data management, the challenges of integrating different data systems, the goal of improving data processing efficiency, the sales aspect of his research, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We have Tony Wang on The Data Stack Show today. Tony, we have a lot to talk about: academia, the data industry, different kinds of selling, and some cool data stuff in general. But we’ll start where we always do. Give us an overview of your background.
Tony Wang 00:47
Yeah. I’m a PhD student at Stanford University, one of the few people today still studying data systems and databases. Before that, I was at MIT for four years, studying mostly electrical engineering and hardware engineering. Before that, I came to the US from China when I was 16. I went to a private boarding school up in New Hampshire. I love to ski, and I bike a lot, so California has been pretty good for both. It’s one of the rare areas where you can drive for four or five hours, hopefully no more, and ski, and also have decent weather year round when you’re not skiing.
Eric Dodds 01:30
Yeah. And it’s a great place for database research, so you check all of your boxes.
Tony Wang 01:37
Parts of California are great for database research, to be sure.
Kostas Pardalis 01:40
Yes. Yeah. And that’s one of the reasons that I’m really excited to have Tony here today. Eric, I think it’s the first time that we have someone who is actually pursuing a PhD. We have had many people who have successfully done their PhDs and are starting companies, but someone who’s in the process of the PhD, I think this is the first time. So I’m super excited to talk about what it feels like to do research, and to learn, of course, what the state of the art is right now, what academia is interested in and, most importantly, what the connection is between that and the industry out there. Because there is a continuum, right? The things that happen at universities, especially with stuff like databases, have an impact on the systems that we build tomorrow. Super excited to chat about that. What about you, Tony? What would you like to talk about today?
Tony Wang 02:41
Sure. I can talk about that. I can also talk about the stuff I’m working on, and my thoughts on different data processing systems and what I hope will become more popular in the future, from a technical perspective, although I know that data products are also driven by other aspects that I have less insight on.
Kostas Pardalis 03:08
Yep, sounds good. So what do you think, Eric? Should we go and do it?
Eric Dodds 03:11
Let’s do it. I can’t wait. All right. Well, this is a really exciting episode for us, because you are in the midst of getting your PhD, and I don’t think we’ve had anyone on the show who is actively in a PhD program. So we want to learn all about it. You’re doing some really interesting research on data systems, so let’s just start there. Can you tell us what your main area of study and focus is? You’re close to the end, right? Are you finalizing your thesis?
Tony Wang 03:50
I would hope so. Yeah. So I mostly work on data processing systems, mostly around cloud data processing: quickly processing data in the data lakes that people use today, like Apache Iceberg or Delta Lake, or even just buckets of Parquet files, which is unfortunately still way too common.
Eric Dodds 04:17
Yeah, we had someone on the show recently, and the discussion was, when are we going to move on from Parquet? So how did you decide that you wanted to do a PhD? I mean, data lakes are obviously very popular in industry and very widely used, but it’s not every day that you meet someone who’s actually studying them at a PhD level. So how did you end up going down that track?
Tony Wang 04:46
Or how did I start the PhD? How I ended up at Stanford in particular goes back to when I was trying to decide what I was going to do after college. I mostly worked on hardware, like Verilog and FPGAs and GPUs, low-level CUDA programming. So after that I applied to some jobs at Nvidia, and then decided, well, maybe I’ll pursue hardware research at Stanford University, where some of the best hardware research is being done. Yeah, I turned down my job offer at Nvidia. In retrospect, maybe that was a
Eric Dodds 05:24
decision to say if they were offering options, you know, but,
Tony Wang 05:30
You know, I got my offer in March 2020, when the stock was at its lowest point from COVID. So I took as much money as I personally had, and I bought Nvidia stock. And then I decided to just go do my PhD program. But halfway into the first year of my PhD program, I realized, why am I in academia doing this stuff? I looked back at the people I talked to at Nvidia, and I realized that NVIDIA is just going to dominate the hardware industry, and the cool stuff in hardware is in industry. I think it’s very hard for people in academia to move the needle on the state of the art in hardware.
Eric Dodds 06:15
Oh, interesting. Can you describe why that is a little bit? Just to make sure: you’re saying you knew from your work studying hardware that NVIDIA was going to be the big player in the market, and that was absolutely what was happening?
Tony Wang 06:30
To give some unfiltered thoughts, and I won’t name anybody: I talked to people at Nvidia, I talked to people at AMD, I talked to people at Intel. The people at Nvidia were truly excited to be there. There was a level of excitement that I could not discern from people at the other companies. And Nvidia is like a software-based company. It’s really hard for a hardware company to actually get a software-driven culture, because at other companies, maybe the company was started by hardware engineers, the founders are hardware engineers, so those people get more say, and software gets maybe neglected and looked upon as something that’s easy and not real. But at Nvidia, I think it’s really incredible how Jensen and the leadership team are able to foster a culture where five out of six engineers are software engineers, and build this amazing software stack, right? And that’s what really dooms academic hardware projects, because there are several aspects. One is that your project just cannot possibly try to tape out at a very competitive node today, like five or seven nanometers or whatever. So you might have an amazing design that works at 28 nanometers or whatever, but you might miss problems that would occur if you were trying to do it at a more competitive technology node. And the other aspect is the software, right? You could build some hardware, but to get people to use it, there’s a long way between Python code and your hardware. Now, of course, there’s definitely value in academic research, in new hardware designs and stuff like that, that might inspire people in industry to pursue certain architectural decisions. But I was more on the side of trying to do something that can actually be used.
And that is, unfortunately, not where I found that I should be focusing my time.
Eric Dodds 08:40
Yeah, interesting. And maybe I’m drawing the wrong conclusion here, but if academia is not really driving innovation on the hardware side, it does sound like, for data processing systems for example, there is a lot of innovation being driven in academia, and that’s why you pursued that path.
Tony Wang 09:03
Well, it’s funny, because in data processing systems there are also very entrenched players like Snowflake and Oracle. But with software, the barrier to entry for building a system that’s actually useful, I think, is a bit lower. You see academic projects like DuckDB, for example, gaining huge traction in the industry. And that really started as an academic project, right? They did not have the resources of, say, AWS Redshift or Snowflake or something. Just a couple of guys in the Netherlands. And there’s also this project called Polars, which is like a Rust rewrite of pandas. And that was really started by one guy. One person, with maybe tens of thousands of hours of coding, could really try to displace one of the most popular data analytics libraries out there, pandas, right? So that’s a testament that a dedicated developer can really move the needle there, and in what people use in the real world. Yeah.
Eric Dodds 10:15
Okay. This is so interesting, and I have a million questions, so I promise I won’t steal the mic for the entire episode. But, Tony, what does a typical week look like for you? I know that’s a difficult question, because it probably changes. I know that some of our audience have certainly done a lot of post-secondary study, but a lot of them probably haven’t, and so we don’t really know what it’s like to be a PhD student studying data systems. So can you just give us a glimpse into that?
Tony Wang 10:47
I’m very much on the applied side, and I know that for people on the theoretical side, their days are actually a bit different. I wouldn’t say it’s that different from working at a regular job, because you show up and you try to program. Well, it’s maybe a bit easier, because you have fewer meetings, calls, and code reviews. Because, yeah, there are no code reviews. You can write whatever you want, but whatever.
Eric Dodds 11:22
If you wrote your unit tests.
Tony Wang 11:25
That’s, yeah. I mean, most academic projects only work on the five benchmarks they’re writing about in their paper, and not much else. So you just have to get your code into that state. But if you actually want your code to be used elsewhere, it has to go beyond that, and that’s typically not within the purview of academia.
Eric Dodds 11:45
Yeah, that makes sense. Now, as you mentioned, you’re on the applied side, but there are also people, your peers, who are working more on the theoretical side, which sounds like a spectrum. But you’re both studying data systems. Can you describe that spectrum to us? What does it look like to be more on the theoretical side?
Tony Wang 12:04
So on the theoretical side, well, when I say the theoretical side, people who are actually there might think they’re more on the applied side. Again, it’s all a matter of perspective. There are people at Berkeley, for example, working on distributed programming paradigms, like the Hydroflow project, which tries to revolutionize how you do core programming and stuff like that. So these kinds of paradigm shifts are theoretical work, I would say. And they would probably spend more time working on programming languages, deciding on language specifications, doing some proofs, maybe, to make sure that things work. The last time I did a proof was in a class at MIT, so it’s seven years ago. Yeah.
Eric Dodds 13:04
Yeah. Now, one thing that really struck me when we were chatting before we hit record was that I had made this assumption that being in industry, for example trying to run a data infrastructure company, is wildly different from what you do. And your response was, well, not really, I still have to do a lot of sales. As a PhD student, can you explain that concept to us? It was just so interesting to hear you talk about that.
Tony Wang 13:43
A lot of a PhD student’s time is spent writing, reviewing, and rebutting papers, trying to change your writing or your pitch. Most people will tell you that writing an academic publication is like telling a story, which is not too different from what a lot of salespeople have told me about sales. You have to say how your system has novelty, how your system is better than all the other systems out there and worthy of publication. And there are people who review your papers and tell you if they think your system has hit those goals.
Eric Dodds 14:27
Yeah, that’s really interesting. Could you describe the audience on the other end? In industry, you’re trying to get someone to buy your technology. It’s similar, but how is the audience different, and what are the different audiences in the PhD world for what you do?
Tony Wang 14:49
Yeah, so it’s very much split by discipline, like the machine learning discipline and the systems discipline. When you submit your paper, it typically goes through a review process where it’s assigned to three or four or five other professors, or even graduate students, who are hopefully versed in the research area that the paper is purported to be on, whether that is true or not. There’s double-blind review, where you don’t know who your reviewers are and the reviewers don’t know who you are. There’s single-blind review, where you don’t know who your reviewers are, and the reviewers know exactly who you are. Now, I’m not saying one is better than the other. Then there’s open review, where everybody knows who the counterpart is. So the academic review process is this huge scheme that people have been experimenting with over the years. But recently there have been problems, because in all the disciplines there’s been a huge influx of papers. If you look at the number of papers submitted to these conferences over the past 10 or 20 years, it’s just been growing exponentially. So there’s a huge strain on the review process. And as a result, a lot of my peers in machine learning, for example, might just get shitty reviewers for their papers. For example, a master’s student could be reviewing a professor’s work and would just post reviews that are completely incoherent, even at top conferences like NeurIPS. Now, that is obviously a downside. But nobody has figured out how to do better than this kind of review system, so I guess there are a lot of plus sides to the review system as well.
Eric Dodds 16:43
Yeah, that makes sense. No?
Kostas Pardalis 16:46
How much?
Eric Dodds 16:48
How much does knowing the audience that you need to sell to influence where you choose to focus your study? Or do you still feel like you have a lot of freedom to just pursue what you’re interested in?
Tony Wang 17:02
Oh, absolutely. In academia, there is a culture of novelty. You’re absolutely trying to do something novel that people have not done before. I think this is good, because maybe the point of academia is to do that, but it also limits the kind of work that people can do, right? For example, Polars would not be a good academic project. It would be very hard to publish that anywhere, because it’s not really using novel ideas, because then you’re kind of
Eric Dodds 17:37
You’re just rewriting pandas in Rust. I mean, it’s obviously awesome, very powerful, but when you say it like that, okay, that’s what it is.
Tony Wang 17:46
That’s exactly what the reviewer is going to say to reject this paper. So it limits the scope of the projects that people like me can do, and that can be very limiting. But on the other hand, it does encourage very risky ideas that might not have a good practical implementation at this moment. But somebody at Redshift or Snowflake might read the paper and be like, hey, I know exactly how to use this, and it could actually lead to significant impact in other places, right? Yeah.
Eric Dodds 18:24
Yeah. Just out of curiosity, I know you’ve written several papers. How long does it take you to write a paper that you feel great about submitting for review?
Tony Wang 18:37
A very long time. Writing a paper is a time-consuming process. Yeah.
Eric Dodds 18:44
So, like a month, or like nine months?
Tony Wang 18:49
Like, at least a week of intensive writing. Well, I mean, hopefully the work you put into writing the benchmarks or writing your actual system should take more than that. As for writing the paper, I think I spend too little time writing my papers. But people are going to say you can never spend too much time writing your papers. And if you think about it, that’s actually a weird perspective, right? Because you’re spending all this time on presentation, when you should maybe just be writing more unit tests to make sure your system works beyond the five cases in the benchmarks. It’s all a trade-off. And for the proportion you have for sales versus engineering in real organizations, people can make similar arguments, right? So,
Eric Dodds 19:43
Yeah, it makes total sense. I’m interested to know, and I know Kostas has a bunch of questions on the technical side, but as you’ve pursued your research throughout the PhD program, have there been any surprising discoveries that you weren’t expecting?
Kostas Pardalis 19:59
An aha moment that you had during your research.
Eric Dodds 20:06
That’s a much better way to put it. Thank you, Kostas. An aha moment.
Kostas Pardalis 20:09
That’s why I’m here.
Eric Dodds 20:11
I just sent
Kostas Pardalis 20:13
that guy, you know, like the aha moment.
Tony Wang 20:18
Yeah, I mean, that’s also a difference between doing applied research versus theoretical research, right? When you’re doing proofs or whatever, back when I was still doing that, there are definitely aha moments where you’re like, oh yeah, I could just prove it this way or that way. But I think applied research is rather a sequence of smaller moments. You can kind of see the project in your head, you can see where it’s going, and you have a pretty good understanding of what it’s going to come out to in the end. And you are incrementally improving your intermediate steps. To give you an example: when I was working on full-text indexing for Iceberg, or for logs in Parquet files, I definitely had some ideas at the beginning of how you could use a specific kind of index to speed up substring queries on terabytes of Parquet files, and have the index be only 1% or 0.1% of the total file size. But then the index has problems, like maybe slow access time. And then gradually you look more and more at your index structure, and it just becomes kind of obvious what you should do once you have spent enough time looking at the algorithm. So it’s not really an aha moment, because once you’ve looked long enough at the problem, everything just becomes kind of straightforward. And then it becomes kind of hard to present that in papers, because it’s straightforward.
Eric Dodds 22:10
Yeah, that solution makes total sense. Yeah,
Tony Wang 22:13
So I think the art of selling papers, which I have definitely not yet mastered, is how to present things that are maybe straightforward in retrospect in an exciting fashion that caters to people who have not thought a lot about this problem.
Eric Dodds 22:35
Yeah, that’s super interesting. All right, Kostas, I have to hand the mic over, or I’m just gonna keep asking questions.
Kostas Pardalis 22:42
That’s okay. I think the conversation is super, super interesting, to be honest. Okay, let’s talk a little bit about what you’re doing now. And let’s start with Stanford. You’re part of a lab there, I guess. There is a structure in academia, right? So tell us a little bit more about that. What’s the goal of, let’s say, your team there, the lab? And how do things work in labs?
Tony Wang 23:09
Academic labs are run very differently. It really depends on the professor. Some professors are very hands-on, and some professors are very hands-off. I have a very hands-off professor, fortunately, so he gives me great freedom in what I can do in my projects. I know other professors who might even write code for students’ projects, or tell the student exactly what to do in their projects. My professor is not like that, at least. So in my lab, different people might be working on different things that they find interesting, with different industry partners, potentially. Some people in my lab are working with Nvidia. I work with some other industry partners that are trying to use my stuff. So really, it’s driven by you: what projects are you interested in?
Kostas Pardalis 24:02
Yeah. Is there always a connection with the industry out there?
Tony Wang 24:06
No, you don’t have to work on something that’s going to be useful to industry now.
Kostas Pardalis 24:15
So what is the value that the industry brings to you as someone who’s doing academic work?
Tony Wang 24:21
Well, it kind of keeps you grounded in real problems. You might have this awesome idea of how you can do something that people don’t really care about, and then it’s hard to justify why I have to go through all the motions of writing this paper if the system I’m going to build is not useful.
Kostas Pardalis 24:43
Yeah, that makes sense. So okay, tell us a little bit more about what you’re doing now. You said you were at MIT, you were more into hardware, and somehow you ended up doing research around things like data processing or data storage. So tell us more about that. But first of all, how did you make this decision to move from hardware into, let’s say, more of what we can do with the hardware that we already have?
Tony Wang 25:10
Yeah, so as I mentioned earlier, I think it’s hard to do needle-moving work in hardware, and it’s easier to build real systems that can provide real value to actual people if you’re doing some kind of software research.
Kostas Pardalis 25:30
But, okay, but why data? Like why? Like,
Tony Wang 25:34
it’s fun,
Kostas Pardalis 25:37
We can mean many things when we say data. But why what you’re doing now, compared to, I don’t know, training models or doing AI or whatever else?
Tony Wang 25:49
So I used to work on speeding up natural language processing models. In the first year of my PhD program, I took a leave and tried to start a startup. I talked to hundreds of potential customers, directors of machine learning or data science, and pitched that I could make TensorFlow models 5 to 10% faster by speeding up matrix multiply. I had some code that beat Intel MKL, which is Intel’s way of multiplying matrices, by 5 to 10% on some matrix sizes, which I was extremely proud of as an academic achievement. But then I talked to these guys, and they were like, yeah, you know, the slowest part of us doing inference is getting this metadata from DynamoDB. That takes 200 milliseconds or something like that, whereas the matrix multiply in the TensorFlow model takes microseconds if you’re doing it right. So this was a really eye-opening experience, and it also really forced me to talk to potential customers to understand use cases before I start working on research projects today. That was kind of the starting point for why I wanted to go into data. I was like, yeah, maybe I can help you process data more efficiently and stuff like that, and I just gradually found data more interesting. And that’s where I spend most of my time today.
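To put rough numbers on the point above, here is a minimal back-of-the-envelope sketch using Amdahl’s law. The 200 millisecond metadata fetch comes from the conversation. The 100 microsecond matrix-multiply time, the 10% kernel speedup, and the function name `end_to_end_speedup` are assumptions chosen only for illustration.

```python
# Back-of-the-envelope check of the latency argument (Amdahl's law).
# The 200 ms metadata fetch is the figure quoted in the conversation.
# The 100 microsecond matmul time and the 10% kernel speedup are assumptions.

def end_to_end_speedup(fetch_s: float, compute_s: float, compute_speedup: float) -> float:
    """Return the overall speedup when only the compute part gets faster."""
    before = fetch_s + compute_s
    after = fetch_s + compute_s / compute_speedup
    return before / after

FETCH_S = 200e-3        # metadata lookup, about 200 ms
MATMUL_S = 100e-6       # assumed: matrix multiply, about 100 microseconds
KERNEL_SPEEDUP = 1.10   # assumed: a kernel that is 10% faster

overall = end_to_end_speedup(FETCH_S, MATMUL_S, KERNEL_SPEEDUP)
print(f"overall speedup: {overall:.6f}x")
```

Under these assumed numbers the request as a whole gets faster by less than a hundredth of a percent, which is why the customers cared more about the data fetch than about the kernel.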
Kostas Pardalis 27:25
Yeah, so okay, we found an aha moment. Here, I think, right? I guess.
Tony Wang 27:30
Yes. Okay.
Kostas Pardalis 27:33
So tell us more about what you’re doing today. Right? What’s your focus in your research?
Tony Wang 27:40
Well, I’m mostly focused on time series data management. To take a step back: I think for business data and customer data and generic data management, people are moving to Parquet files, Delta Lake, Iceberg, whatever. Those work really well, and you’re able to build all kinds of differentiated applications and dashboards on top of the same data layer. Now, in time series data management, that is still not the case. People are using Prometheus with Thanos as a scaling solution, or Loki, or Elasticsearch with something like UltraWarm or a cold tier or whatever you call it to spill to S3, and then maybe some other completely different system to manage their traces. So I just think that we could probably make Apache Iceberg and Delta Lake work for these time series monitoring use cases, and store metrics and logs at high scale and still be able to do the things that Elasticsearch can do. Now, there are a lot of promising recent projects, Quickwit for example, that claim huge performance benefits over Elasticsearch, right? But the problem is still the storage format. I really want to be able to store things like logs and metrics in Parquet files and Apache Iceberg, and still be able to power the use cases that people might want from Prometheus and Elasticsearch.
Kostas Pardalis 29:10
Okay, and why is there this divergence between these two data-related problems, let’s say? Why did we end up having systems that are so different: the Parquet world on one side, with the OLAP systems, and then the time series systems like Prometheus and the rest you talked about? Why did we end up with this reality?
Tony Wang 29:42
So I guess, first of all, the Parquet world cannot efficiently support the use cases of Prometheus and Elasticsearch. For example, if you store all your logs in Parquet files and you try to do a substring query or some kind of text search, there is no other way than to scan all your logs and run the regex in Spark or whatever. That is horribly inefficient compared to Elasticsearch, where there is an inverted index that can answer the question in milliseconds. Now for Prometheus, I think it’s more of an issue of data modeling. In Prometheus, you have the notion of time series, and each time series has tags. If you try to store those in Parquet files, it’s not clear how to have the Prometheus data model translate over to the tabular data model of the Parquet world. What would be the columns? How would the columns be clustered? And how do you get the kind of performance that Prometheus can have? And of course, that’s just the data model. There is also this big component of real time. Prometheus and Elasticsearch were first invented, and are probably still largely used, as real-time systems, where ingested data can be used in real time. Now, how does this translate over to the Parquet world? Maybe you have some ClickHouse instance that is running, and that spills to Iceberg or Delta for longer-term storage, or something like that. But I do believe that there has got to be a bridge there. You should be able to do things like run SQL across all your business data as well as your telemetry data, and be able to join those sources and try to debug your issues or things like that.
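The contrast between scanning every log line and consulting an inverted index can be sketched in a few lines. This is a toy, token-level index over an in-memory list, not how Elasticsearch or any Parquet engine is implemented. The sample log lines and the names `scan_search`, `build_inverted_index`, and `index_search` are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical log lines standing in for rows stored in Parquet files.
logs = [
    "pod-a connection timeout to db",
    "pod-b request ok",
    "pod-a request ok",
    "pod-c disk full error",
]

def scan_search(lines, pattern):
    """Full scan: apply the regex to every line, as a scan-based engine would."""
    rx = re.compile(pattern)
    return [i for i, line in enumerate(lines) if rx.search(line)]

def build_inverted_index(lines):
    """Map each whitespace-separated token to the sorted line numbers containing it."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for token in line.split():
            index[token].add(i)
    return {tok: sorted(ids) for tok, ids in index.items()}

def index_search(index, token):
    """Look up one token directly, without touching the log lines."""
    return index.get(token, [])

index = build_inverted_index(logs)
print("scan :", scan_search(logs, r"timeout"))
print("index:", index_search(index, "timeout"))
```

The scan does work proportional to the total log volume on every query, while the lookup does work proportional to the number of matches. Note that a token index like this one cannot answer arbitrary substring queries; that needs a different index structure, which this sketch does not attempt.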
Kostas Pardalis 31:42
Okay, let’s talk a little bit more about the data modeling part. You mentioned how data is modeled in the time series world. Why is this different compared to what you do in an OLAP system with tabular data?
Tony Wang 32:01
Well, think about data modeling, right? Prometheus really is a system where there are time series chunks, and they are tagged by string tags. You should think about your data as these chunks with tags, and you can quickly access a particular chunk. Now, if you think about translating to the tabular world, you could think, maybe I have a couple of columns, right? One column would be the timestamp, another column would be the tag, and then another column would be the value. So you could do it like this. But then what should you sort your tables by? If you sort your tables by the timestamp, you would have good ingest performance, because the new data would just be appends, but then quickly retrieving all the data corresponding to a particular tag would be very slow. So maybe you should sort your data based on tags, right? But then ingest becomes a problem, because your new data gets written as super small files across a bunch of different partitions. So what are you going to do? Ultimately, I think that the Prometheus data model could be implemented on top of Parquet files, and in fact I’ve done that as part of my research projects and internships. And I do believe that it’s possible to do this with a particularly good tabular data model and maybe some external indices. Yeah.
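The sort-order trade-off described above can be made concrete with a small simulation. This is a sketch under stated assumptions: lists of tuples stand in for Parquet files, the table has the three columns mentioned (timestamp, tag, value), and the sizes, tag names, and helper names are all hypothetical.

```python
from itertools import islice

def make_rows(num_tags, points_per_tag):
    """Generate (timestamp, tag, value) rows in arrival order."""
    return [(ts, f"pod-{t}", float(ts))
            for ts in range(points_per_tag) for t in range(num_tags)]

def split_into_files(rows, rows_per_file):
    """Cut a sorted table into fixed-size 'files' (stand-ins for Parquet files)."""
    it = iter(rows)
    files = []
    while chunk := list(islice(it, rows_per_file)):
        files.append(chunk)
    return files

def files_touched_by_tag(files, tag):
    """Count the files that hold at least one row for the given tag."""
    return sum(1 for f in files if any(row[1] == tag for row in f))

rows = make_rows(num_tags=100, points_per_tag=50)  # 5,000 rows in total
by_time = split_into_files(sorted(rows, key=lambda r: (r[0], r[1])), 500)
by_tag = split_into_files(sorted(rows, key=lambda r: (r[1], r[0])), 500)

print("files read for one tag, time-sorted:", files_touched_by_tag(by_time, "pod-7"))
print("files read for one tag, tag-sorted: ", files_touched_by_tag(by_tag, "pod-7"))

# One new data point per tag arrives. In the tag-sorted layout it lands in
# every file's key range, while the time-sorted layout would simply append.
new_batch = make_rows(num_tags=100, points_per_tag=1)
tags_per_file = [{r[1] for r in f} for f in by_tag]
files_hit_by_ingest = sum(
    1 for tags in tags_per_file if any(r[1] in tags for r in new_batch)
)
print("tag-sorted files touched by one ingest batch:", files_hit_by_ingest)
```

With these toy sizes, a single-tag read touches every file in the time-sorted layout and one file in the tag-sorted layout, while a single ingest batch touches every file in the tag-sorted layout. That is the tension between ingest and retrieval that the answer describes.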
Kostas Pardalis 33:35
That's interesting. So how does Prometheus solve that? Is it a storage problem in the end, like how you store the data, or is it the lack of indexing, let's say, in the OLAP world? Because in traditional OLAP systems you can think of, let's say, partitioning or bucketing and stuff like that as a lightweight version of an index, maybe, because you consider what the workload looks like and try to change the layout to make it faster. But we don't have the indexes of traditional OLTP systems, right? So from your point of view, what is causing the problem here?
Tony Wang 34:23
So I think Prometheus is an integrated system. It integrates the real-time part: how it gets the real-time data, separates it somehow into these chunks, and then writes these chunks to storage. But in the Parquet world, you've got to start piecing together different systems and things like that. That's the first thing. Second, what you said about indexing is very interesting, because typically these tags are high cardinality. So for databases like M3DB, there's a great talk by Rob from M3DB that talks about the kind of inverted index, the FSTs, finite state transducers, that they build on the tags to quickly allow retrieval of a particular tag. So this is the problem: if you have a billion Kubernetes pod names that are your tags, how do you actually quickly look up where a particular tag and its corresponding chunks are stored? Integrated systems like M3DB can have an inverted index, similar to Elasticsearch, that can tell you exactly where all the time chunks for a particular key are stored. Whereas in Parquet, if you have a column with a billion potential values, even if they're clustered together, it's pretty hard up front, with no external indices, to figure out which Parquet files your tags are located in without scanning all the headers and footers of all the Parquet files in your data lake or whatever. And if your data happens to be sorted by time, and this tag is actually spread across all of the tables, then you can forget about doing this efficiently, right? But it's not the end of the world. You can definitely build things on top of these Parquet files that can perform similar functionality to what an inverted index could do to speed up this process, right?
So that is actually a lot along the lines of what I'm doing right now for my research.
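To make the contrast concrete, here is a small sketch under stated assumptions: the file names and pod tags are hypothetical, the "footer" is reduced to a min/max range over the tag column, and a plain Python dict stands in for the FST-based inverted index mentioned above (it is not how M3DB implements it). When data is laid out by time, every file's tag range is wide, so min/max pruning eliminates nothing, while an external tag-to-file index points straight at the right file.

```python
# Hypothetical data lake: each file is written in time order, so tags are
# scattered and every file covers a wide range of tag values.
files = {
    "part-0.parquet": ["pod-a", "pod-m", "pod-z"],
    "part-1.parquet": ["pod-b", "pod-k", "pod-y"],
    "part-2.parquet": ["pod-a", "pod-c", "pod-x"],
}


def prune_with_min_max(files, tag):
    """Footer-style pruning: keep files whose [min, max] tag range could hold `tag`."""
    return sorted(name for name, tags in files.items()
                  if min(tags) <= tag <= max(tags))


def build_tag_index(files):
    """External index (a dict standing in for an FST): tag -> files containing it."""
    index = {}
    for name, tags in files.items():
        for tag in tags:
            index.setdefault(tag, set()).add(name)
    return index


def lookup(index, tag):
    """Files that actually contain `tag`, according to the external index."""
    return sorted(index.get(tag, ()))


index = build_tag_index(files)
print("min/max candidates:", prune_with_min_max(files, "pod-k"))
print("index candidates:  ", lookup(index, "pod-k"))
```

In this toy layout the min/max check still has to consider all three files for pod-k, whereas the index returns only part-1.parquet.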
Kostas Pardalis 36:38
Okay, so from what I understand, and correct me if I'm wrong, the solution that you see there is not going and fundamentally changing the format itself, like Parquet, right? It's more about what we can build around Parquet, in terms of metadata and indexes that we can build and implement there, to actually bring Parquet closer to what, let's say, these heavily indexed systems like Elasticsearch do, right? Is this correct? Yeah, yeah. That's interesting. So okay, I have a question. There's a lot of conversation lately out there about Parquet, let's say, showing its age, right? Parquet was created, like, 2008, 2009. I don't know exactly when, but it was at least 10 years ago, right? Very different use cases out there. I mean, obviously, the format was designed primarily for traditional OLAP data warehousing use cases that have very different latency requirements, and even the hardware was so different back then, and all that stuff. So people, and I think this conversation is especially driven by the needs of ML use cases, have started talking about the need for, let's say, upgrading, updating, or maybe substituting Parquet. And there are companies out there that have built stuff, right? You have Meta with this Alpha format, I think it's called. Then you have the work from Google, if I remember correctly, with the Procella system that came out of YouTube, where there is a lot of stuff about how we can complement or change the way that we store data compared to Parquet. And of course, there are also other systems out there right now, like LanceDB, for example, right? They have their own format, and they're trying to accommodate more of, let's say, the use cases around ML.
So how does the stuff you think about fit in this world, where the industry, in a way, is pushing for new formats? They actually want to go pretty low in the stack, let's say, go to the storage layer and rethink the format that we are using there.
Tony Wang 39:18
Let me ask you a very simple question. CSVs are horrible in terms of efficiency. And yet yesterday, I downloaded data from a GitHub repo from Alibaba, and the data format was CSV. Do the Alibaba people know better data formats? Of course. But do they expect their users to? That's the question, right?
Kostas Pardalis 39:45
No, I hear you, I hear you. And actually, to be honest, I find your answer extremely interesting for someone who's coming from a PhD, because your approach is much more pragmatic and product oriented than research oriented, right? And 100%, CSV is there, and it's not going away anytime soon, right? We will still struggle with it. So I get what you're saying, and I think it makes total sense. It is important for building, let's say, a system that you can take out there in the market, right, and actually deliver value. But how can you defend that in research? How do you publish a paper on that? Because, going back to the conversation at the beginning, right, about the novelty,
Eric Dodds 40:35
right? Well, I
Tony Wang 40:38
I mean, hopefully the debates around my research will be around these external indices, which are definitely not part of the Parquet format, and how they can speed up these queries on Parquet files, right? And Parquet is actually not that bad if you know how to use it properly. Parquet gives you huge flexibility in how you can lay out your data. For example, if you want random access into a column, people think it's impossible, but you can just keep the column plain encoded, and then you can randomly access the bytes. You can change the row group sizes to efficiently retrieve smaller chunks of your data. You can change the number of columns you put in the table, alongside the row group size, to tune the file size. You can change the encoding of the columns. You can even use custom encoding algorithms to encode your columns before you put them into Parquet. There are so many things that you can do with Parquet that can improve its performance. Now, it is a question of whether these things are supported by higher-level frameworks like Iceberg or Delta, which have a very opinionated way of how you should be managing these Parquet files for OLAP workloads, right? So if anything, I think Iceberg and Delta should be more flexible in allowing people to tune their own ways of using their Parquet files, rather than us changing the Parquet file format itself. That's what I think. Okay,
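The random-access point can be illustrated with a small sketch. It assumes a column of 8-byte floating-point values stored back to back with no compression, which is the spirit of a plain, fixed-width encoding. It deliberately ignores everything else a real Parquet page carries (page headers, definition levels, compression), and the `read_at` helper is hypothetical.

```python
import struct

VALUE_WIDTH = 8  # bytes per little-endian 64-bit float

# Hypothetical column values, laid out back to back with a fixed width.
values = [3.5, 1.25, 9.0, 4.75, 2.0]
column_bytes = b"".join(struct.pack("<d", v) for v in values)


def read_at(buf, i):
    """Read the i-th value directly by computing its byte offset."""
    (value,) = struct.unpack_from("<d", buf, i * VALUE_WIDTH)
    return value


# Fetch row 3 without decoding rows 0..2.
print(read_at(column_bytes, 3))
```

Because every value has the same width, the offset of row i is simply i times the width, so a reader can issue a small ranged read instead of decoding the whole column. Dictionary or run-length encodings give up that property in exchange for smaller files.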
Kostas Pardalis 42:13
I love that. I really like that. Okay, so let's look a little bit more at the indexing that you're talking about here, right, which is part of your research. So when you're talking about external indices, what does it look like? What is the use case? Because, okay, you can index for many different reasons, and with many different algorithms and all that stuff. But what are you trying to do here with these indices?
Tony Wang 42:37
So you know, Postgres has all these kinds of indices that allow a regular Postgres database to do wonderful things. You can have a JSONB column and build a GIN, G-I-N, generalized inverted index on it. And then you can suddenly do things like JSON path matching, keyword search, and all kinds of amazing things, right? So I look at Parquet the same way. Parquet is your data, and instead of Postgres pages or whatever, you've got Parquet pages. So you should be able to build indices on these Parquet files, indices that do not have to look like Parquet files, that you can efficiently access at query time and that tell you which Parquet pages to read to get your data, right? At a higher level, that would be which row groups to fetch. For example, if you've got a column in that Parquet file that's composed of log messages, you should be able to build a text index on that column that is a lot smaller than the column itself in terms of storage footprint, that still supports efficient access from S3, and that will quickly tell you which row groups in all your Parquet files contain the keyword that you're searching for. Similarly, you should be able to build an index on a JSON type in your Parquet. Keep your JSON as a string in Parquet, and you should be able to build an index on that which, for example, allows you to do Snowflake variant-type querying without Snowflake, right? So you should be able to do all these things, but you just can't. So that's what I'm working on.
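As a rough illustration of the text-index idea, the sketch below builds a keyword index at row-group granularity. Everything here is an assumption for illustration: the file names, the log messages, the simple lowercase tokenizer, and the in-memory dict, which stands in for whatever compact, S3-friendly structure a real index would use.

```python
import re
from collections import defaultdict

# Hypothetical layout: (file name, row group number) -> log messages stored there.
row_groups = {
    ("logs-0.parquet", 0): ["disk full on node-1", "request ok"],
    ("logs-0.parquet", 1): ["timeout calling auth", "request ok"],
    ("logs-1.parquet", 0): ["disk latency high", "timeout calling db"],
}


def tokenize(message):
    """Split a log message into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", message.lower())


def build_index(row_groups):
    """Inverted index: token -> set of (file, row group) locations containing it."""
    index = defaultdict(set)
    for location, messages in row_groups.items():
        for message in messages:
            for token in tokenize(message):
                index[token].add(location)
    return index


def row_groups_for(index, keyword):
    """Row groups worth fetching for `keyword`; everything else can be skipped."""
    return sorted(index.get(keyword.lower(), ()))


index = build_index(row_groups)
print(row_groups_for(index, "timeout"))
```

The index only records locations, not the messages themselves, which is why it can be much smaller than the column it covers. The query engine still reads the selected row groups to produce the actual rows.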
Kostas Pardalis 44:10
And how do you, okay, so let's say you have the storage there, which remains intact as Parquet. And then you bring these new layers on top of it, where you can create these indices and the associated metadata to access the data very efficiently. How do you connect that with the higher levels of the stack, right, like the query engine? As you said, people are already using stuff, and they shouldn't have to move away from that stuff. I agree with you, and that's also true with query engines, right? You might have something like Trino there, or you might have something like Spark, or you might have something, I don't know, whatever. How do you expose these indices in a way that can be, let's say, exploited by these query engines without having to rewrite the query engine? You
Tony Wang 45:00
don't have to rewrite a query engine. So these indices, of course, you could rewrite the query engine and integrate these indices, but you don't have to. This is interesting, because a lot of query engines today look at some metadata before they even run your query. For example, I think BigQuery will tell you how much they think this query is going to cost you before you execute it. So in the same way, the query engine could query this index and rewrite your query in a good way, and dramatically reduce the cost. For example, if you have a text-based inverted index, and you're using Spark or Trino as your query engine, you could query the index first to translate your text query, which would require the query engine to read the entire text column, into a very selective predicate on, maybe, the timestamp, right? So instead of running a query that's like SELECT * WHERE log LIKE '%ARN12345%', you run a query that's like SELECT * WHERE timestamp BETWEEN x AND y, where that's a very small range, and that range is provided by the index. Okay, so you keep your query engine. It's just rewriting your query, right? Sure. But
Kostas Pardalis 46:17
someone has to rewrite the query, right.
Tony Wang 46:22
Like, part of a client library for the index. Okay.
Kostas Pardalis 46:26
Okay. Okay. And then someone has to integrate that as part of the optimizer of, for example, Trino, to do that, no? So
Tony Wang 46:33
it does not change the Trino optimizer, because it just translates very expensive predicates into very cheap predicates. And the Trino optimizer already knows how to do predicate pushdown and all that stuff with very selective, you know, timestamp-based filtering, right?
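The rewrite step discussed above might look roughly like the following client-side sketch. The index shape (keyword to a list of timestamp ranges), the table and column names, and the `rewrite_text_query` function are all hypothetical. One design choice here differs slightly from the spoken description: the original LIKE predicate is kept as a residual filter after the timestamp range, so the result stays correct even if the index returns a range that is wider than strictly necessary.

```python
def rewrite_text_query(table, keyword, index):
    """Turn an expensive text predicate into a selective timestamp predicate.

    `index` is a hypothetical mapping: keyword -> list of (min_ts, max_ts)
    ranges in which the keyword occurs. String formatting is used only to
    keep the sketch short; real code should use parameterized queries.
    """
    ranges = index.get(keyword, [])
    if not ranges:
        # The index says the keyword never occurs, so no data needs scanning.
        return f"SELECT * FROM {table} WHERE FALSE"
    time_filter = " OR ".join(
        f"timestamp BETWEEN {lo} AND {hi}" for lo, hi in ranges
    )
    return (
        f"SELECT * FROM {table} "
        f"WHERE ({time_filter}) AND log LIKE '%{keyword}%'"
    )


# Hypothetical index contents: the keyword appears only in a narrow time range.
text_index = {"arn12345": [(100, 105)]}
print(rewrite_text_query("logs", "arn12345", text_index))
```

The rewritten SQL is ordinary SQL, so an engine such as Trino or Spark can apply its existing predicate pushdown on the timestamp column without any changes to its optimizer.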
Kostas Pardalis 46:49
Okay. Okay. Sounds good. We are close to the end here. I think we could be talking for a couple more hours, and I think we should do it in the future. We have a lot to talk about here. But you also have an open source project out there, Quokka, right? Tell us a little bit more about Quokka. What is it, and why would people be interested in it?
Tony Wang 47:15
So I started Quokka as more of an attempt to bring fault tolerance to a streaming-based query engine like Trino. It's actually a query engine I wrote, with the optimizer, the logical and physical plan optimizers, in Python. And it is faster than Spark on EMR by like two to three times. But I hit some bottlenecks in trying to actually support SQL. It is very hard today to support SQL in your query engine, and there are a lot of efforts out there to do that. In fact, I think just the other day somebody was trying to propose a generic pluggable SQL logical plan optimizer based on DataFusion, which would be good if it works. But yeah, I am trying to integrate some of my newer research into Quokka, and hopefully Quokka can be the first query engine that's natively integrated with these indices built on Parquet files.
Kostas Pardalis 48:15
Okay, that’s awesome. Eric, I’ll give the microphone back to you. Because we can keep talking forever here. But I think I should give you a little bit more time to ask any questions that you might have.
Eric Dodds 48:27
Yeah, well, I think we’re right at the end. Actually, you know, one thing I’ve been thinking about throughout this whole conversation, Tony, is, what are you interested in doing after you’re done with your PhD? I mean, you’re obviously on the applied side. So have you thought much about that? Yeah,
Tony Wang 48:45
I think I might be interested in doing a startup, if I can figure out what to do. It's kind of hard to start a startup these days. It's always been hard. But yeah, or I might work someplace. There are a lot of very cool companies today working on this, like observability tools and things like that. So yeah,
Eric Dodds 49:14
very cool. Well, if you end up starting a company or when you go work at a company, we’d love to have you back on. Tell us about finishing the PhD and going into industry. Yeah, I’m
Tony Wang 49:26
mostly focused on trying to graduate right now.
Eric Dodds 49:30
Heads down. All right. Well, Tony, it's been such a good show. We learned so much. And good luck on selling to your audience here in the final stretch.
Tony Wang 49:39
Yes, yes. Well, I made the sale. So I already know where they’re they decided to buy them. Right. Yeah, since the buy. That’s it. I’m sure that you understand how difficult it is.
Eric Dodds 49:53
Yeah, it's, yeah, closing the deal. Awesome. Well, best of luck, and keep us posted.
Tony Wang 49:59
All right. Thank you very much for your time.
Eric Dodds 50:03
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.