Episode 161:

The Intersection of Generative AI and Data Infrastructure with Chang She of LanceDB

October 25, 2023

This week on The Data Stack Show, Eric and Kostas chat with Chang She, the CEO and Co-Founder of Eto Labs. During the episode, Chang discusses LanceDB, the history and success of Pandas, as well as the challenges of working with new technologies in the data industry. Chang shares his journey and the challenges faced in open sourcing Pandas, the need for new data infrastructure optimized for AI and ML, the future of AI and other data avenues, and more. 


Highlights from this week’s conversation include:

  • Chang’s background and journey with Pandas (6:26)
  • The persisting challenges in data collection and preparation (10:37)
  • The resistance to change in using Python for data workflows (13:05)
  • AI hype and its impact (14:09)
  • The success and evolution of Pandas as a data framework (20:04)
  • The vision for a next-generation data infrastructure (26:48)
  • LanceDB’s file and table format (34:35)
  • Trade-offs in the Lance format (42:45)
  • Introducing the Vector Database (46:30)
  • The split between production and serving databases (51:14)
  • The importance of unstructured data and multimodal use cases (57:01)
  • The potential of generative AI and the balance between value and hype (1:01:34)
  • Changing expectations of interacting with information systems (1:13:53)
  • Final thoughts and takeaways (1:15:32)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, today we’re talking with, really, a legend, I think is probably an appropriate term. Chang She is one of the original co-authors of the pandas library. So, you know, we’re going back to a time before the modern cloud data warehouse when that work started. Absolutely fascinating story. And now he’s working on some pretty incredible tooling around unstructured data. Another fascinating story there, and actually a lot in between. And this isn’t going to surprise you, but I actually want to ask about the pandas story. I want to talk about LanceDB, which is what he’s working on. But the pandas library came out of the financial sector, which is really interesting, and at a time when the technology they were using, we would consider legacy. And now it’s the lingua franca for people worldwide who are doing data science workflows. So the chance to ask him that story, I think, is going to be really exciting. But yeah, you probably have tons of questions about that, but also LanceDB.

Kostas Pardalis 01:45
Yeah, for sure. I mean, first of all, we’re talking about a person who has been building foundational technology for data for many years now. So we definitely would like to have a conversation with him about pandas, and the experience there, because, you know, history tends to repeat itself, right? So I’m sure there are many lessons to learn from what it meant back then to bring pandas to the market and to the community out there. And these lessons are definitely applicable today too, with new technologies, which I think is even more important now, because we’re living in this moment in time where AI and LLMs and all these new technologies around data are coming out, but we’re still trying to figure out the best way to work with them. So that’s definitely something we should start the conversation with. And obviously then talk about LanceDB and see what made him get into building a new way of storing data in a table format, what can happen on top of that, what it means to build a data lake that is, let’s say, AI native, and what it means to build data infrastructure that can support the new use cases and the new technologies around AI and ML. So I think it’s going to be a fascinating conversation. And he’s also an amazing person himself, very humble and very fun to talk with, so there’s going to be a lot to learn from him. So let’s go and do it.

Eric Dodds 03:30
Agreed. Let’s do it. Chang, welcome to The Data Stack Show. It’s really an honor to have you on.

Chang She 03:36
Thank you, Eric. I’m excited to be here.

Eric Dodds 03:39
All right, well, we want to dig into all things LanceDB, but of course, we have to go back in history first. So you started your career as a quant in the finance industry. Give us the overview and the narrative arc, if you will, of what led you to founding Lance?

Chang She 03:58
Yeah, absolutely. So, quite a journey. My name is Chang, Co-Founder of LanceDB. I’ve been building data and machine learning tooling for almost two decades at this point. You know, as you mentioned, I started out my career as a financial quant. Then I got involved in Python open source, and I was one of the original co-authors of the pandas library. After that became popular, I started a company for cloud BI, which got acquired by Cloudera. And then I was VP of Engineering at Tubi TV, where I built a lot of the recommendation systems, MLOps systems, and experimentation systems. And throughout that whole experience, I felt that tooling for tabular data was getting better and better, but when I looked at unstructured data tooling, it was sort of a mess. Tubi TV was a streaming company, so we dealt a lot with images, videos, and other unstructured assets, and any project involving unstructured data always took three to five times as long. My co-founder at the time was working at Cruise, so he saw similar problems, but at an even bigger scale. So we got together and tried to figure out what the problem was. And our conclusion was that the data engineering and data infrastructure for AI was not built on solid ground; everything was optimized for tabular data and systems, you know, a decade old. And once you build on top of this shaky foundation, things start to fall apart a little bit, right? It’s like trying to build a skyscraper on top of a foundation for, like, a three-story condo.

Eric Dodds 05:48
Yep. Makes total sense. Before we dig into Lance, can we hear a little bit of the backstory about pandas? I mean, it’s really interesting to me for a number of reasons. You know, when you think about open source technologies, a lot of times you think they sort of trickle down from the big companies, the ones that have these huge issues at scale. So something that has become as popular as pandas arising out of the financial industry is just interesting. So can you give us a little bit of the backstory there? Yeah.

Chang She 06:27
So we’d have to really go back in time. When I first started working as a quant, this was 2006, and at that time, data scientist wasn’t really a job title. When I graduated, I knew I loved working with data, and at that time, if you liked working with data, you went into quant finance. As a junior analyst, I spent a lot of time on data engineering and data preparation, right? Loading data from the various data vendors that we had, producing reports and data checks, validation, integrating that into our main sort of feature store, which was just a Microsoft SQL Server at the time.

Eric Dodds 07:09
I was gonna ask, was that a feature store? But it was a

Chang She 07:15
feature store also wasn’t a word at that time. But there was a lot to it. The scripts were written in Java, the reports were produced in VB script, and there were a lot of Excel reports flying around. There was no data versioning, there was barely code versioning, and everything was just a huge mess, right? And fast forward a couple of years. One day, my colleague and roommate at the time, Wes McKinney, came up to me and said, Hey, look at this thing I’ve been working on in Python. It was a sort of closed source, proprietary library for data preparation that he built in his group. We worked at the same fund. I sort of immediately fell in love with it. I was like, Oh, this is the best thing ever. And I started using it in my group and trying to push for using that, and also pushing for Python over Java and VB script as the predominant data preparation tool and things like that. And so, you know, initially, there was definitely a lot of pushback: Oh, but Python is not compiled, therefore it’s not safe, or, like, you know, why do we want to use this when we already have a bunch of code written? So it took us a while to get buy-in. And then it also took a while to get the company to agree to actually open source the thing. This was in an era a little bit after the financial crisis. Yeah, at that time, Wall Street, and hedge funds in general, were extremely anti open source; everything was considered sort of secret sauce. There was a lot of unwillingness to open that up, and it took, you know, maybe six months of work from Wes to actually make that happen. And sort of the final trigger was essentially him quitting to start a Ph.D. program, before they relented and said, Okay, fine, we’ll make this open source.

Eric Dodds 09:29
Wow. I mean, you know, working on pandas sort of in the wake of the financial crisis, what a unique experience. One question that comes up as you tell that story that’s really interesting: you’re sort of talking about a period of time before a lot of the tools that are just the lingua franca of anyone working in data today, right? Whether you’re more on the data engineering end of the spectrum or, you know, sort of the ML/AI end of the spectrum: cloud warehouses and data lakes and Python-based workflows, etc. What was really interesting, one thing you said was, you know, “I spent a lot of time on sort of data collection and data preparation.” You actually hear the same phrase today, even though from a tooling standpoint it’s a wildly different landscape and far more advanced than it was back then. Why do you think that is? Because people are saying the same thing, you know, well over a decade later.

Chang She 10:37
Yeah, I think the problems are different today. And maybe this is something that, you know, I think Kostas has lots of thoughts on here as well, given his experience. But in my day, as a junior analyst, the biggest problems were kind of that connection between the external data vendor and us. So for example, a lot of the data was dropped into, like, an FTP, and sometimes it just didn’t arrive on time, and most of these processes were very manual, right? And I think, you know, at that time, dataset sizes were also a lot smaller. So today your problems might be a lot more downstream and have to do a lot more with scale than with these sort of manual connections. I do think that data accuracy and cleanliness is a problem that just hasn’t been solved. And I think a lot of it is just because the data we work with is generated by real-world processes, and by definition, that is just super dirty. And probably a third big factor is, you know, in finance, there was always a very big focus on data quality and data cleanliness. I remember going through the data with a fine-tooth comb to figure out, okay, did we forget to record a stock split, merger, or acquisition? Or does this share price look wrong because there were some data errors? Because the data being wrong has an outsized impact in those use cases. But we could only handle that at small scale at that point. And now, I think, you know, with internet data, if your log or event data is wrong a couple of times out of a billion, it’s not going to affect your processes or your BI dashboards all that much. So I think the problems are different, but that commonality of data being generated by real-world processes is still the same. And I think, at the core, that’s why we still hear those same complaints over and over.
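The kind of fine-tooth-comb check Chang describes can be sketched in a few lines of pandas. This is a hedged illustration with made-up prices and thresholds; a real pipeline would reconcile flagged dates against a corporate-actions feed rather than trust price ratios alone:

```python
import pandas as pd

# Hypothetical daily close prices for one ticker. An unrecorded 2:1 split
# shows up as the price roughly halving overnight with no corporate
# action on file.
prices = pd.Series(
    [100.0, 101.5, 102.0, 51.2, 51.8, 52.0],
    index=pd.date_range("2008-01-02", periods=6, freq="B"),
    name="close",
)

# Day-over-day price ratio; values near 0.5 (or 2.0) are suspicious.
ratio = prices / prices.shift(1)
suspects = ratio[(ratio < 0.6) | (ratio > 1.7)]
print(suspects)  # flags the 2008-01-07 drop from 102.0 to 51.2
```

The thresholds (0.6 and 1.7) are arbitrary here; real checks would be tuned per asset class and combined with split/merger calendars.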

Kostas Pardalis 13:05
Yeah, fascinating. Okay, one more question for me before we dive into Lance. You talked about this paradigm of trying to essentially sell this idea of using Python, and there was resistance to that, which, you know, looking back, it’s like, whoa, that sounds crazy, because going from Java to Python for these kinds of workflows makes sense. But, of course, when you’re in that situation, there are sort of established incumbent processes, frameworks, etc. You know, people can be resistant to change. Do you see a similar paradigm happening today, especially with the recent hype around AI, where there’s a smaller group advocating for a new way to do things, and there’s resistance against that? Is there a sort of modern-day analog to that you see in the industry?

Chang She 14:09
That’s a good question. I mean, yeah, it’s certainly hard to say, because I live in San Francisco now, right? So my bubble is basically all the different small groups of people being very crazy about trying new things that seem crazy. So in my immediate circles, it’s actually hard to say. You know, all I hear is like, oh, have you tried this new thing that came out two days ago and has, you know, a thousand stars on GitHub already? So it’s actually hard for me to say. But I do think that there’s sort of a very big impulse function that makes its way out. So, you know, in the San Francisco, and just in general the Silicon Valley tech bubble, it’s very much like, oh, you know, ChatGPT is so over now, it’s, you know, whatever the latest open source model is. Whereas if you actually go out and talk to some, like, normal people in normal places, they’re like, oh, yeah, I’ve heard about this vaguely, but I don’t really know what it is, because it doesn’t have an impact on my daily life yet.

Eric Dodds 15:34
Yeah. Yep. Super interesting. All right, well, thank you for entertaining my questions about pandas. Tell us about LanceDB. What is it? And what problem does it solve?

Chang She 15:46
Yeah, actually, before we dive into that, Kostas, given your experience, I’d love to hear your take on some of that too, because I feel like you must have very interesting stories there as well.

Kostas Pardalis 16:00
Yeah, I mean, specifically for like the AI, craziness or like,

Chang She 16:07
Well, more about, like, you know, how have the problems in data engineering evolved, right, from when you first started out your career versus now? What are the things that you think people don’t understand about data that they should?

Kostas Pardalis 16:26
Yeah, I mean, I think the best way to understand the evolution of data infrastructure, or, I don’t know, data technology in general, is to observe the changes in the people involved in the lifecycle of data, right? You’ve seen that too, because you were working on these things back then. Around, like, 2008, 2010, there was a wealth of systems coming out, especially from big tech in the Bay Area, like from Twitter, like from LinkedIn. Some of them became very successful systems, like Kafka, for example, right? These systems were coming out from, let’s say, a type of engineer that was primarily the “I build systems” type: someone came to them and was like, oh, we have these huge problems at scale right now and we don’t know how to deal with it, go and figure it out. So the conversation would be more around the primitives of distributed systems and all these things, right? That was their language, actually, because they were systems engineers. But if we fast forward to today and take, like, a typical data engineer, they have nothing to do with that stuff. They are people coming more from the data domain, let’s say, than from the systems engineering domain, right? And that’s inevitable, because as something becomes more and more common, we need more and more people to go and work with that stuff. So we can’t assume that everyone will become, you know, the systems engineer who lives close to bare metal, or whatever, to go and solve problems out there, right? And if we look at the titles and the roles, like the data scientists, the data engineers, the ML engineers, and track the evolution there,
I think this can help us a lot to understand why some technologies are needed, and why Python is important, right? Because, yeah, of course, these people are focusing more on the data, not on the infrastructure. Yes, writing in a dynamic language like Python, when you run it, you might end up with your service breaking, right? But when you work with data, that’s not the case, because you’re experimenting, primarily; you’re not putting something in production that’s going to be a web server, right? So all these things change the way we interact with technology, and, let’s say, the weight of what is important changes, and the developer experience has to change. And that’s, I think, the best indicator at the end of the day of where things are today and where they should go. And that’s a question I want to ask you, actually, about pandas. Why, in your opinion, was pandas so successful in the end? What made it? Because, okay, it’s not like you didn’t have ways to deal with that stuff before, right? It’s not like a new problem appeared out of the blue. So what made pandas, like, the de facto, let’s say, framework for anyone who is more on the data side of things, like data scientists, for example, based on your experience, what you’ve seen out there?

Chang She 20:05
Yeah, absolutely. So I think a lot of this was just making it really easy to deal with real-world data. When we first started out, it was very clear to us that pandas had a lot of value, because we were using it on a daily basis for business-critical processes and systems. But for a stranger, in the beginning, it was actually kind of hard to understand, because at the time there were a couple of different competing projects. And, you know, now pandas 2.0 is also Arrow-based, but in the very beginning, mechanically, pandas was just a thin wrapper around NumPy. And so a lot of the data veterans at the time really dismissed pandas as, oh, this is just a wrapper of convenience functions around NumPy, and I’ll just use NumPy, because I’m smarter than the average data person, and I’ll just code up all this stuff myself. But I think what made pandas successful, and, you know, most of this credit obviously goes to Wes, was that we focused on one vertical, one set of problems at a time, and we just made it, you know, ten times easier to deal with data within that domain. So for people in that domain, it was very clear: oh, if I use pandas, I save a ton of time versus using the alternatives. And then, over time, we got a lot of pull requests and feature requests from the open source community in adjacent domains, and we slowly expanded. And then, finally, the advent of data science and the explosion of data science in popularity made pandas into the popular library that it is today.

Kostas Pardalis 22:13
Yeah, I think you put that very well. You used some very strong terms there that I think describe the situation, not just with pandas, but with every technology out there. Like, you’re talking about veterans, yeah. You’re always going to have people who know how to go down to the low-level internals, or do something crazy there, right? But how many people actually have the time to get to that level of expertise? So I think when you get to the critical mass, where a problem is big enough that it needs mass adoption, then there are things that become more important than, let’s say, the underlying efficiency, and that’s the ease of access to the technology. And that’s what I think pandas also did, right? It’s not like it’s this beautiful, scientifically perfect way of solving problems, but it is pragmatic, and it is something that people adopt as a tool that helps them be more productive, right? It’s a little bit more like a product approach, I would say, to technology, but in the end it is important. And we see that many times with stuff like, say, Snowflake, for example, right? It’s not like we didn’t have data warehouses before, but suddenly data warehouses became much more accessible to business users, because they paid more attention to things like, okay, these people don’t know what vacuuming is; why do they have to vacuum their tables? Why would they learn that thing? Like, even engineers hate vacuuming tables.

Kostas Pardalis 24:00
So I think there exists a lot of value in that. And that’s when things get really exciting, because that’s the point where a technology is ready for mass adoption. That’s my opinion. And I think we are at another critical point when it comes to data, and a lot of that stuff is going to be accelerated because of AI, you know, because data will have much more impact in many more different areas of life. So more data needs to be processed, and more data needs to be prepared and stored and collected and labeled, and all that stuff. So the question today is, yeah, how do we build the next generation of infrastructure for data that is going to get us to 2030 and beyond, the way that, let’s say, systems like Spark, or whatever was built around 2010, 2012, brought us to where we are today, right? And I think LanceDB is one of these solutions out there. So tell us a little bit more about that, like how Lance is changing things, and what gaps it’s filling that the previous generation of data infra had.

Chang She 25:12
Yeah, absolutely. I think when we look at the problems dealing with unstructured data, what we see is that unstructured data, like images and text, is data that’s really hard to ETL, data that’s very hard for tabular data systems to deal with. So you get really bad performance, and what you end up having to do is make multiple copies of the data. You know, one copy might be in a format that’s good for analytics, one copy that’s good for training, and another copy that’s good for debugging, in a different format. And then you end up having different compute systems on top of these different formats, and you have to create this Potemkin workflow on top of that tooling, which makes everything a lot harder, right? You can hide all this mess under the hood for a time, but it’s a very leaky abstraction, and over time it just comes to the fore. And so for us, you know, our goal is to essentially fix that with the right foundation. And I think if you look at the history of data, every new generation of technology has come with data infrastructure that’s optimized for it, right? So you start with something like Oracle, when relational database systems were first coming to the fore and becoming popular. Then when the Internet became popular, you got a lot of JSON data to deal with, and, you know, that’s why NoSQL systems, particularly Mongo, became super popular. Then it was, you know, Spark, Hadoop, and then Snowflake. And I think if you look out, you know, five to ten years down the road, AI is the next big generation of technology, and there needs to be new data infrastructure that’s optimized for AI, right? So that’s the core mission for LanceDB: we’re trying to make a next-generation lakehouse for AI data.

So the idea is that, as you’re managing large-scale unstructured datasets with LanceDB, you’ll be able to analyze, train, evaluate, and serve several times faster, with ten times less development effort, at a fraction of the cost. The first product that we’re putting out there is LanceDB, the vector database, but we’ll have a lot more exciting things to follow as well.

Kostas Pardalis 27:44
Okay, that’s awesome.

Kostas Pardalis 27:47
Okay, let’s do the following. You mentioned some very interesting things, like how you ETL unstructured data, for example, right? So, especially for our audience out there, many of whom have probably never had to deal with this type of data at scale, let’s do the following. Let’s describe a typical pipeline for dealing with this data without LanceDB, like how someone would do it in a data lake or lakehouse today, and then follow up with what LanceDB adds to that, so a data engineer who never had to deal with this until now can get a glimpse of what it means to work with this type of data.

Chang She 28:36
Yeah, so for AI data, versus traditional data, right: with traditional data, a lot of the data generation processes are generating things like JSON or CSV, or a lot of systems are just going straight to Parquet, and your life is kind of, you know, a lot easier then. But with AI data, a lot of the time you’re getting, like, hardware data. So you might be getting, let’s say, a bunch of protobuf data coming off of sensors, which is a time series; to go with that, you’ve got a bunch of images that correspond in time with some of those observations; and then you might have some text, right, that’s produced by a user, or that’s associated with certain products, or something like that. So off the bat, you’ve got this plethora of data in different formats that maybe, you know, either comes in from your API or off of a Kafka stream or something like that. And maybe the first stage is that it gets dumped into some location in S3, and then you end up having to write some code to stitch that together, and some of the metadata gets stored as, like, a Parquet file, maybe, if you’re lucky, or, you know, in some table, and your images are elsewhere, right? And then, maybe if your data engineer is good, they know to convert the protobuf into some sane format for analytics, right? And then you have some JSON metadata for things like debugging and quick access, right? So right off the bat, you have these three pieces of data that you have to coordinate all across your pipeline, and it doesn’t really change. Then when you get to training, you know, a lot of people are using, let’s say, TFRecords. So you have to convert the data, these raw images, from S3 into, you know, TFRecord or some tensor format. And then you go through your training pipeline, and once that comes out, you need to do model eval.

And then, you know, TFRecords and other tensor formats are not that great for that, so you have to convert that back and join it with your metadata, because you need to slice and dice your data to see model evaluation on different subsets of your dataset, or things like that, right? So that’s what the pipeline looks like right now, I think, even before it makes it into production. So most of your effort is spent on managing lots of different pieces of data, trying to match them on some key that may or may not be reliable, and switching between different formats as you go through these different stages, right? With Lance, the earlier you can switch into Lance format, the easier it becomes, because you can store all of that data together. And whether you’re doing scans, or, you know, debugging, where you’re pulling 10 observations out of a million or something like that, Lance still performs very, very well. So once you convert into Lance, a lot of the pipeline down the road becomes simpler. You have one piece of data to deal with. And Lance is integrated with Apache Arrow, so all of your familiar tooling is already compatible with it. And so you can start to treat that messy pile of data as a much more organized table. You know, I love math, and in math you always try to reduce a problem to one that’s previously been solved. And I think Lance is that: we want AI data to look and feel much more like tabular data, and then everything’s a lot easier, and you can apply a lot of the same tools and principles.

Kostas Pardalis 32:39
Yeah, that makes total sense. And actually, it’s one of the things that I think vendors in the MLOps, let’s say, space, not failed at, but maybe made some mistakes with, that ended up creating silos between the ML and data infrastructure. Like, there was a lot of replication of infrastructure just to do ML. And then, of course, data duplication is a very hard problem. I don’t think people realize how hard it is to keep consistent copies of data. It might sound silly, especially to someone who uses technology in a much more casual way, but it is one of the biggest problems. It’s really hard to ensure that your data is always going to be consistent; there are some very strong trade-offs there, right? That’s what we’ve learned from distributed systems, for example. So to me, and that’s what I like also from what you’re saying, it makes sense to enrich or enhance the infrastructure that exists out there and bring the new paradigm to something existing, instead of trying to create something completely separate from it and just ignore what was done so far. So I think, personally at least, it’s a very good decision. But when it comes to Lance, all right, tell us a little bit more of the technical stuff around Lance. It is a table format, I guess; we’re talking about tables here, and it allows you to mix very heterogeneous types of data, so it’s not exactly like the tabular data we’ve had until now. It is based on Parquet, right? Like, is Parquet used behind the scenes? Is this correct, or am I wrong here?

Chang She 34:35
Oh, it's actually not Parquet-based. Lance is actually both a file format and a table format. The issue with Parquet is that the data layout is not optimized for unstructured data and not optimized for random access, which is important here. So we actually wrote our own file format, plus the table format, from scratch. It's written in Rust. Maybe this goes back to Eric's question from earlier: Rust is one of those things that might not be ubiquitous yet, but it's definitely gaining popularity. I think Rust and Python play really well together, and the combination of safety and performance, plus the ease of package management, is something that I think is very unique and pretty amazing. As a developer, it's also very easy to pick up. We actually started out writing Lance in C++, and at the beginning of this year we made the decision, for those same reasons, to switch over to Rust. And you know, we were Rust newbies; we were learning Rust as we were rewriting. Even then, it took about three weeks for me and Lei to rewrite roughly four and a half months of C++ code. And I think, more importantly, we just felt a lot more confident with every release, to be able to say this is not going to segfault if you just look at it wrong. Yeah.

Kostas Pardalis 36:21
100%. And I think you touched on a very interesting point here, which connects to the unbundling conversations we've had, and how these technologies become more and more accessible. I think the big win for the Rust ecosystem is the bindings with Python. It's almost like the front-end/back-end split in application development: having a similar paradigm in systems development is going to be extremely powerful, and we see that already, with so many good tools coming out in the Python ecosystem that are actually developed with, let's say, the backend in Rust. Whoever builds the libraries for the bindings has done an amazing job with Python. But all right, that's awesome, actually, because my next question would have been about how you deal with the columnar nature of Parquet, and you've already answered that it's not Parquet anymore. So my question now is: let's say I have infrastructure already in place, right? I have my Parquet, my Delta Lake; Parquet is the de facto solution out there when it comes to building data lakes. Does it mean that if I want to use Lance, I have to go and convert everything into Lance? How does the migration, or the interoperability between the different storage formats, work out there?

Chang She 38:06
Yeah, so the short answer is yes, you have to migrate your data if it's in Parquet or other formats. Fortunately, this process is very easy; it's literally two lines of code: one to read the existing format into Arrow, and one to write it into Lance. And I think this wider topic is very exciting. Wes actually just published a recent blog post on composable data systems, and I think this is the next big revolution in data systems; I'm very excited about that. Previously, when you were building a database, you had to literally build the whole thing, from the parser to the planner, the execution engine, the storage, the indexing. You had to build everything. Whereas now there are so many components out there that you can innovate in one piece but create a whole system just using open source components that play well together. This is what makes Apache Arrow such an important project, and in my opinion one of the most underrated projects in this whole ecosystem: you don't see it, you don't hear about it, right? You're using the higher-level tools, but projects like Apache Arrow make it ten times easier to build new tools and for these different tools to work well with each other.

Kostas Pardalis 39:30
100%, I totally agree with that. And at some point we should do an episode just talking about Arrow, to be honest, because, as you say, people who work more on the systems side of data know about it, obviously, but I think it's the unsung hero of what's happening right now, because it creates the substrate for building more modular systems over the data. We should do that at some point. All right, so let's go back to Lance. Why did you have to build this new way of storing the data? You mentioned something already about point queries, that columnar systems are not built for that. But there's always this tension: there's the bulk work you need to do, where columnar systems are more performant, and also the point kinds of queries that you need when you serve something, for example. How do you balance these two with Lance?

Chang She 40:44
So the Lance format actually is a columnar file format, but the data is laid out in a way that supports both fast scans and fast point queries. Originally, we designed it because of the pain points that ML engineers voiced around dealing with things like image data. For debugging purposes, or sampling purposes, you often want to get something like the top 100 images spread out across a million images, or 100 million images. With Parquet, you have to read out a lot more data just to get one, so your performance is very bad. So we designed it for that. And the happy accident was, once we designed it that way, we realized we could support really fast random access. Purely on micro-benchmarks, just taking a bunch of rows out of a big dataset, we beat Parquet by about 1,000 times in terms of performance. If you're talking about that order of magnitude of performance improvement, then it makes a lot more sense to start building rich indices on top of the file format. And this is what led to LanceDB, the vector database. On top of the format, we have a vector index that can support vector search and full-text search, we can support SQL, and all the data is also stored together. This is something that I think other vector databases can't do: the actual image storage and other things have to go somewhere else, and now you're back to that complex state of having to manage multiple systems. So for us, it was, I would say, a happy accident that came from a good foundational design choice.

Kostas Pardalis 42:45
Is there some kind of trade-off there? I mean, what is the price that someone has to pay to have this kind of performance and flexibility at the same time?

Chang She 42:58
Definitely. So the trade-off here is that if you want to support fast random access, it's much harder to do data compression. You can't do file-level compression anymore; you have to do either within-block or record-level compression. So if you have pure tabular data, then your file sizes in Lance will be bigger, maybe 30% or 50% bigger than they would be in Parquet. That's the trade-off. Now for AI, let's say you're storing image blobs in this dataset. Those image blobs are compressed at the record level already, so file-level compression doesn't actually matter, and the whole dataset size is dominated by your image column. So for AI, this trade-off makes a lot of sense, because you're not really sacrificing that much.

Kostas Pardalis 44:00
Yeah, it makes sense. It's the classic trade-off between space and time complexity. I think anyone who has done any kind of computer science or computer engineering knows it's one of the most fundamental things: we store more information so we can do things faster, and vice versa. So it depends on what you optimize for. In any case, okay. By the way, Lance, the format, is open source, right? It's something out there; people can go use it, play around, do whatever they want with it. There's also some tooling around it, I guess? Like, you have tools to convert Parquet into Lance. And the opposite: is it also possible to go from Lance back, if you want

Chang She 44:56
it? Yeah. So it's also the same two lines of code: you read it into Arrow and write it into the other format.

Kostas Pardalis 45:02
So what happens when you have, let's say, a Lance file that has images stored inside, and you want to go to Parquet? How is that going to be stored in the Parquet?

Chang She 45:13
Yeah, so right now the storage type is just bytes. For images, it would be bytes, or if you're storing just image URLs, then they'd be plain string columns. So what we're doing is making extension types in Arrow to enrich the ecosystem. Arrow right now does not understand images, videos, or audio, so we're going to start making these image extension types for Arrow that will certainly work well with Lance but can be made to work well with Parquet and other formats as well. That way, top-level tooling can understand, oh, this column of bytes is an image, rather than just, oh, this is a bunch of bytes. And then your visualization tooling, your BI tooling, your data engineering pipelines can make much smarter decisions and inferences based on these things.

Kostas Pardalis 46:12
Yeah. Okay, that makes total sense. And okay, so we talked about, like, the underlying, like technology, which is also like open source, when it comes to the table and the file format. But there’s also a product on top of that right now. So tell us a bit about the product, what is the product?

Chang She 46:30
Yeah, so, you know, I love open source through and through. If money wasn't an object, I'd certainly spend my whole day just working on open source tooling. But it's also very exciting to build a product that the market and folks want and will use. So on top of the Lance format, we're building the LanceDB vector database; that's the first step in our overall AI lakehouse. What makes this vector database different is, one, it's embedded, so you can start in 10 seconds just by pip installing. There's no Docker to mess with, no external services. The format actually makes it a lot more scalable, too: on a single node, I can do billion-scale vector searches within 10 milliseconds. It's also very flexible, because you can store any kind of data that you want, and you can run queries across a bunch of different paradigms. That whole combination makes it a lot easier for our users to get really high-quality retrieval, and it simplifies their production stack. And I think another really big benefit of LanceDB is the ecosystem integration. A lot of people have told me, once they started using it, oh, this is like if pandas and vector databases had a love child. For people who are experimenting, putting data in, doing data preparation, all that, it was just much easier to load data in and out of LanceDB with the existing tooling they had. So again, this goes back to our discussion of how we try to bring new things back into an old paradigm and use the existing tooling to solve these new problems. And I think someone coined the term NewSQL for this new generation of databases; I don't know how I feel about that.
But certainly, I think there's this new generation of databases that have kind of forgotten the painful lessons we've learned over the last decade of data warehousing development. Columnar storage is not a thing in a lot of these new databases. Separation of compute and storage is not a thing in these new databases. And those are things that I think are very much worth doing, especially as you're scaling up. Those are the things we're building into LanceDB and offering for generative AI users that I think are pretty exciting.

Kostas Pardalis 49:28
All right, one last question from me, and then I'll give the microphone back to Eric. And it's related to what you were mentioning right now about database systems. Traditionally, there was the OLAP and the OLTP paradigm, right? And that dichotomy is still relevant today. They serve very different workloads, and that of course dictates many different trade-offs, different people involved, and all these things. So hearing you about LanceDB, I say, oh, that's great: now I can have my embeddings layer there, for example, and I don't need to go to another system to do my filters or whatever with my metadata. But if I want to build an application, I still need another data store, probably a Postgres database, where parts of the business logic live and the state of the application is going to be maintained, right? And AI is one of these things that, yeah, it feels to me like it's much more of a front-end kind of data technology in the end than building pipelines for data warehousing. So how can we bridge this? Because there's still a dichotomy there: I'll still have my Postgres with my application state, and LanceDB is going to have the embeddings and whatever else I need. First of all, do you think it is a problem? And if it is, do you see Lance trying to solve it in the future?

Chang She 51:14
Yeah, that's a really great question. I think there are two sorts of gaps in what you're mentioning. One is the split between the production OLTP, transactional database and the data store that's needed for AI serving. The other is going from development or research into production: what you use in your data lake versus what you use in production. The second one, I think, is a much easier question. For Lance, because we're good for random access as well, you literally can use the same piece of data in your lakehouse and also in production serving. That's pretty exciting to me, because there are very few data technologies that are good enough for both. For the first question, I think there's no absolute best answer. In my experience, I've seen installations where the production transactional store is also the AI feature store and serving store, although I would say that at scale, as companies grow, that tends to be less and less true. A lot of times these AI serving stores that support vector search workloads have much more stringent requirements, and the workloads tend to be much more CPU-intensive. When you mix the two together, you end up creating trouble for both types of workloads. So at scale, a lot of companies find it easier to separate the two. At a small scale, I think it's perfectly fine to have a single store; it simplifies your stack and you keep everything together. Although I think the tooling and the user experience around Postgres for vector search, let's say, is kind of wonky, and the syntax is kind of bad.
And if you want high-quality retrieval, you then have to figure out how to do a full-text search index on your own, and then combine the two, and performance also tends not to be great. So I think the answer is it certainly depends, mostly on your scale and your use cases. If your AI use cases are very light and very small, you can certainly keep your expertise around that production database and just use whatever pgvector and full-text index come with Postgres; that's sufficient. But the larger, more serious production installations tend to be separated, and I think it'll stay that way.

Kostas Pardalis 54:13
Yeah, and correct me if I'm wrong here, but I would assume that the AI workloads on the front-end part are primarily read-type workloads: you probably want to be able to read concurrently and really fast and serve results. Whereas when you're maintaining the state of an application, it's read- and write-heavy, you need transactions, all these things. They're very different sets of trade-offs, so it sounds like it's hard to put them together, at scale at least. All right, one last thing before Eric gets the microphone: how would someone play around with Lance? What are your recommendations on where they should go, both for the technology itself and also the product?

Chang She 55:08
Yeah. So the easiest way to start with LanceDB is just pip install lancedb. And then on our GitHub we have a repository; it's under lancedb/vectordb-recipes. There are about a dozen or so worked examples and notebooks, both in JavaScript and in Python, that you can just step through: building recommender systems, building chatbots with ChatGPT, using the LanceDB integration with LangChain and LlamaIndex, building a host of tools. We'll add to it more and more as time goes on. If you want to find out more about the format itself, go to the lancedb/lance repo; that's the file format. There's a lot of reading material and benchmarks, and if you're familiar with Rust or C++, you can also learn a lot just by going through the Rust core codebase. There are a lot of interesting things we do there.

Kostas Pardalis 56:15
Sounds good. All right, Eric, sorry if we're monopolizing the conversation, but the microphone is yours.

Eric Dodds
Yeah, no, it was absolutely fascinating. Chang, one thing that I'm interested in is the changes you're seeing in the landscape around data itself. So when we think about unstructured data, like images, etc., of course we can think about things like self-driving cars or various AI applications like that. But in an increasingly media-heavy world, do you see unstructured data becoming a much larger proportion of the data that companies are dealing with?

Chang She 57:01
Yeah, absolutely. I mean, I think people have terabytes of photos just on their iPhones these days. I think it's going to become much more important, and the dataset sizes will dominate tabular data. And a lot of the use cases will also become multimodal. If you're in a media-heavy world, where you have lots of images and videos, how do you organize that data? How you query that data also becomes critical. You want to be able to ask your set of a billion images some question in natural language, or using SQL, or something like that. A lot of that is going to rely on extracting features from the images, but also, a lot of times, on embedding the images and using vector search, and a combination of these things. So I think that's going to become a lot more important in the next few years as AI comes in and enterprise data becomes more multimodal. I also think the relationship between data and downstream consumers will change. Before AI and before machine learning, there was very much a waterfall way of designing these pipelines, where you come up with a schema, you load data into that schema, and then you publish it, and downstream consumers are like, okay, I can use this; maybe it's wonky for my use case, but it's what I've got. Now, it's much more important that data and data pipelines stay very close to AI and ML, because the use cases there will determine the kind of schema, the kind of transformations, and the trade-offs that you make with data, much more than before.

Eric Dodds
Yeah, totally agree. And I think one of the things that is going to accelerate this, and I'm really fascinated to see how it plays out: one of the interesting things about AI in general is that it produces large quantities of unstructured data, right? So you essentially have a system that you're building using unstructured data, and that system produces a massive amount of additional unstructured data. You have this loop where, in order to meet the demand for additional AI applications, it's going to require a significant amount of infrastructure for unstructured data.

Chang She 59:46
Yep. Yeah, totally. I mean, especially in generative AI, if you have a million users producing new images, that's gonna... yeah.

Kostas Pardalis 59:55
That's crazy, yeah. Or even just unstructured chat conversations themselves, you know, as an entity. Okay, one last question, because we're right at the buzzer here: where did the name Lance come from?

Chang She 1:00:09
So we were thinking about AI data, unstructured data, being these large, heavy blobs, and how do you deal with that and still be very performant? So we were thinking about things that are fast but can also handle heavy things. And I think we were watching some fantasy movie, I forget the name, and there was a jousting tournament, and so: okay, we're calling it Lance.

Kostas Pardalis 1:00:48
I love it, because it's actually a great analogy. Lances are gigantic, but they're used in fast-motion, high-impact situations. So, yeah.

Chang She 1:01:01
Yeah. So I'd love to ask one question for you guys, too. We spent a lot of time in the last hour talking about what's old and the evolution; I'd love to get your take on what's new as well. Generative AI is obviously the hot thing today, and there's a lot of potential value that we can clearly see; there's also some hype. So, in your opinion, what do you think is the most underrated thing in generative AI, and what's the most overrated thing?

Eric Dodds 1:01:33
Hmm, that's a great question. I think the most underrated things will be the use cases that are not very sexy but will essentially eliminate very low-value human work. Just one example: a friend called me the other day, and they work at a company that processes huge quantities of PDFs in the medical space. Because of the need to discover pieces of information in them, and because the formats are all very disparate and it's very painful, they literally have thousands of people who brute-force this. When you think about it, that doesn't sound very exciting as a use case: oh, well, you can process these and get the information you need, with a very high level of accuracy, from files that are notoriously difficult to work with, in any format and in any order. That doesn't sound great. But what excites me about it is this: if you take all of those people and free them up to be creative with their work, as opposed to doing brute-force, manual looking through PDFs for needle-in-a-haystack information, I think that type of thing, and that's just one example of thousands across industries, has the potential to unlock a lot of human creativity that's currently trapped in pretty low-level work. I think that's really exciting. Probably the most overhyped piece of it, and I haven't fully thought this through, so you're getting this live, is this notion that this is just going to take over people's jobs. And I'll give you a specific example. I think there's certainly potential for that.
But I think the way the media is portraying it is really wide of the mark. One recent example: I was working with a group that was using LLMs to create long-form content around a certain topic to drive SEO. I think a lot of people think, okay, this is just a silver bullet: you give it a prompt, you get something back, it's that easy, so the people who think critically about that content, their jobs are all gone. In reality, what we see on the ground, at least, is that there are kind of two modes. There's one that's very primitive: you give a prompt, you ask for something, and what you get back sounds really good for how simple the input is and how low-effort it is. But when you need to solve for a very particular use case, you actually have to get very good at using these tools, and it's not simple: understanding prompts, understanding all of the knobs that are available in a tool like ChatGPT. And then, what we're finding is that it's actually very useful to use the LLM tool itself to do prompt development for you. So you get this iterative loop of prompts that can produce a prompt that actually gives you the output you want; you're dealing with it on an API level at that point. So I think this notion is overhyped, because to get really good at this, you actually have to be very creative and get really deep into all of the ways to tune it. So I don't know, that's my hot take.

Chang She 1:06:01
You know, the really interesting thing is that it reminds me of the hype cycle around autonomous vehicles. For the last 10 years, every year it was like, oh, fully autonomous vehicles are coming out next year. And this is kind of similar: if your vehicle is autonomous most of the time, in demos it's amazing, but you still have to hire a full-time driver, so that driver isn't losing his job. The field feels very similar here.

Eric Dodds 1:06:32
Yeah, for sure. And I don't know, I'm certainly not trying to suggest that there won't be some sort of massive displacement. I think as adoption grows, a lot of those things will be productized, so certainly I see a future where it gets to that point. But the mass near-term displacement, I don't think, is a reality, because you can't just give it a sentence and get back something that's highly accurate, if you want to go beyond very simple use cases.

Kostas Pardalis 1:07:09
Totally. Yeah. Should I go next, Eric? I'll start with what I find to be a very fascinating and very underrated aspect of AI, which is that it enables a kind of data flywheel. What do I mean by that? Obviously, I'm really into the data stuff, so I look at it from that side. But the reality is that there's a lot of data out there, much more data than we can effectively process today, and a big part of that is because there's a lack of structure around this data. I think LLMs can really help accelerate the process of actually creating datasets, and then products on top of these datasets. For me, that's a very fascinating aspect: just the fact that I can give it a text, and it's not only that I'll get a summary back; it's that I can actually get it into a machine-understandable format, with very specific, predefined semantics, that I can then use with the rest of the technology I have to do things. It's almost like a superpower, right? So I see the value of adding the pictures in there, for example, or the audio files. But when the audio file turns into text, and after that into columns of topics, or tags, or paragraphs, or speakers, that's crazy. Because if you wanted to do that until today, you pretty much had to be somewhere like Meta or Google that could hire thousands of people to go and label that stuff. So that's one of the things I think people out there underestimate when it comes to these systems. I know it's a bit more on the back-end sort of things, but I think that's where the value starts being created.
And when it comes to the hype, one of the things, especially for people who spend a lot of time on Twitter, is seeing all these claims like, oh, now you don't need developers to go and build applications; you can just use GPT and Copilot, go build the full product, and make millions out of that. Okay, I'm sure there will be lots of attempts, but there's no way it works like that. It is an amazing tool for developers, amazing. I think it's the first time in all these years that I've seen a truly innovative new tool for helping developers be more productive. But we're not anywhere close to having a robot developer that builds applications; this thing does not exist. The other thing that people like to forget: everyone says that customer support, for example, is going to be fully automated with AI and robots and all that stuff. But they forget that when people reach out for help, they also like to connect. There's human empathy and human relationships, and these things cannot be automated away in the end. Sure, at some level you can have companions, you can have an AI that you talk to, all that stuff. But in the end, for doing business, companies at any scale know, and they already know, that putting a face in front of the company is important for the business itself. So again, it will make customer success much more productive, and the people there more creative, with more fulfilling jobs in the end. But it's not like suddenly we're going to fire everyone who is solving problems for people and picking up the phone, and we're going to have an API to do that.
And I’m very curious to see what’s going to happen in the creative industries. I think there are very interesting things there, especially in cinema. I think we’re just going to see an explosion of creativity; that’s my feeling. There’s a lot of value, in my opinion. I know that people think, oh, it might be another crypto situation, but I think it’s a very different situation. There’s still a lot of work that has to be done to enable it, but the future looks really interesting.

Chang She 1:12:10
Certainly does. How about you? So, you guys already took my number one answer, so I’ll have to dig a little deeper. I love asking this question, because in every conversation I have, everyone comes up with better answers than me. On the overhyped front, I would say there’s a lot of excitement about autonomous agents, and I think we are at least a year, if not more, away from really making that work well. What I see is that agents really struggle with the last-mile accuracy that’s required for production, and also with performance. If you have a complex question, you have to break it down into subtasks, possibly into a pretty long agent chain, and these chains can start adding up in terms of time. It takes minutes, or things like that, where it just becomes not interactive, and it’s much faster just to build something special-purpose. So this idea that everything is going to become autonomous and we’ll never have to work again: those things are not coming quite yet. In terms of being underrated, I totally agree, I think the less sexy things have the potential to produce a lot of value. One big thing I see is that this is going to change people’s expectations of how they can interact with information systems and knowledge bases. Most websites and applications have a little search feature, and without exception, they all kind of suck, because they’re all based on text and syntactic search. With the popularity of generative AI, our expectations are going to change drastically. Every search box becomes a semantic search box. And any tool that doesn’t live up to that promise in the next year or so is going to have trouble retaining a lot of users.
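The shift Chang describes, from keyword matching to semantic search, amounts to ranking documents by vector similarity to the query. A minimal runnable sketch follows; the `embed` function here is a toy bag-of-words stand-in (real semantics come from a learned embedding model, served out of a vector database such as LanceDB), but the cosine-ranking machinery around it is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Rank documents by vector similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "how to reset your password",
    "pricing plans and billing",
    "connect a data warehouse",
]
print(semantic_search("reset password", docs, k=1))
# → ['how to reset your password']
```

Swapping the toy `embed` for a trained model is what turns this from keyword overlap into genuine semantic retrieval; the ranking loop itself does not change.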
You’re gonna go from “oh, this search sucks, because of course search sucks” to “oh, your search sucks, I’m going to the competitor who launched yesterday and already built it.”

Eric Dodds 1:14:34
Yep. Yeah, I agree. I was gonna say, thinking about the support agent piece of it, I would actually combine those two and say that the first wave of that is going to be better search. So for example, think about documentation: it’s really compelling to have this idea that you have all these docs, and people don’t have to comb through them to find the really specific answer to their question. And the search around docs sucks, because no one has time to redo the indices with every new piece of content. That’s a horribly manual process, right? But at the same time, if you give wrong information in an automated way, relative to documentation, when someone is building an application, or doing something like a critical data workflow that’s going to inform an ML model that does really important things for downstream users, you can’t really get that wrong, right? And so I agree: the first wave of that is not going to be docs going away, replaced by a chatbot that just gives you your answer. It’s going to help you search so much faster and better than you ever have before. Totally. Awesome. Well, Chang, this has been such a fun conversation. Great questions, and thanks for making it a conversation. And yeah, we’d love to have you back on to dig back into all sorts of other fun topics.

Chang She 1:16:07
Thank you, thanks for having me, this was a lot of fun.

Eric Dodds 1:16:09
Kostas, one thing that is amazing about Chang: of course, in addition to the fact that he was a co-author of the pandas library, which is legendary, he has built multiple high-impact technologies and is a multi-time, multi-exit founder building data tooling in the data and MLOps space. All of those things are really incredible. But when you talk with him, if you didn’t know who he was, you would just think he’s one of those really curious, really passionate, really smart founders. You said at the very beginning that he’s humble, and that’s almost an understatement. He would treat anyone on the same level as him, no matter their level of accomplishment or technical expertise. That really stuck out to me. And I also think the other thing that was really great about this episode was that it wasn’t like he came out and said, I have an opinion about the way the world should be, and this is why we’re doing things the LanceDB way. He just had a very calm explanation of the problem, and a really good set of reasoning for why he needed to create a new file format, which is shocking to hear, because, well, Parquet exists, why do this, right? So it sounds really shocking on face value, but his description was really compelling. And the story of how they actually almost backed into creating a vector database, because they invented this file format: just an incredible episode.

Kostas Pardalis 1:18:10
Yeah, I mean, Chang is one of these rare cases where you have both an innovator and a builder. It’s hard to find an innovator, it’s hard to find a builder, and it’s even harder to find someone who combines these two and at the same time is as down-to-earth as he is. I think this episode has pretty much everything. Lessons from the past can be super helpful for understanding how we should approach and solve problems today, and there’s a lot to learn from the story of pandas that is applicable today for everyone who’s trying to build tooling around AI and ML. What I really enjoyed was that it was probably the first time we talked about something I think is very important: how the infrastructure needs to evolve to accommodate these new use cases and actually accelerate innovation with AI and ML, which is still a work in progress. And I think Chang provided some amazing insight into the right directions to do that. He said some very interesting things about not creating silos, and gave a very interesting example from mathematics, where he said that when you have a new problem, you try to reduce it to a known problem, right? And that’s also how we should build technology. An amazing insight, to be honest, and something that builders especially tend to forget; they tend to either replicate or create bloated solutions and all that stuff. So there’s a lot of wisdom in this episode. Anyone who’s a data engineer and wants a glimpse of the future of what it means to work with the next generation of data platforms should definitely tune in and listen to Chang.

Eric Dodds 1:20:07
I agree. Really an incredible episode. Subscribe if you haven’t, you’ll get notified when this episode goes live on your podcast platform of choice, and of course tell a friend. Many exciting guests are coming down the line for you, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.