Episode 159:

What Is a Vector Database? Featuring Bob van Luijt of Weaviate

October 11, 2023

This week on The Data Stack Show, Eric and Kostas chat with Bob van Luijt, the CEO & Co-Founder at Weaviate. During the episode, Bob discusses the technical and business aspects of vector databases, delving into their differences from other types of databases and the opportunities they present. Bob shares his journey and how his love for music relates to his work in machine learning. The conversation also covers the progression of database complexity and the emergence of databases designed for specific data types, limitations of existing databases for vector processing, the importance of simplicity and user-friendliness in the user experience, generative feedback loops, and more.


Highlights from this week’s conversation include:

  • How music impacted Bob's data journey (3:16)
  • Music's relationship with creativity and innovation (11:38)
  • The genesis of Weaviate and the idea of vector databases (14:09)
  • The joy of creation (19:02)
  • OLAP Databases (22:21)
  • The progression of complexity in databases (24:31)
  • Vector database (29:23)
  • Scaling suboptimal algorithms (34:34)
  • The future of vector space representation (35:51)
  • Databases' role in different industries (39:14)
  • The brute force approach to discovery (45:57)
  • Retrieval augmented generation (51:26)
  • How generative models interact with the database (57:55)
  • Final thoughts and takeaways (1:03:20)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds  00:05

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, today we are talking with Bob from Weaviate, and there are so many interesting things about Bob. He's never had a boss, number one. He started building websites when he was 15. He reminds me a lot of you in that respect. And then, fast forward, he has actually built a company around vector databases and embeddings that support AI-type use cases. I mean, what a journey, that's pretty amazing. I want to ask about databases in general, on a spectrum, right? This is sort of a theme, I almost feel like Brooks could write a thesis on databases from The Data Stack Show, because we've talked about every type of database. But we haven't talked about vector databases. We've talked, of course, ad nauseam about OLAP and OLTP workflows, and graphs, we've had a number of graph databases on. But I think this is the first vector database. And so I want to zero in on that and put it in the context of the spectrum from basic SQL OLAP, you know, to graph to vector. So that's what I'm gonna ask. But how about you? I mean, there's so much here. So,


Kostas Pardalis  01:47

Yeah. Okay, there are two main areas of questions. I think the first one is to make clear to our audience what vector databases are, how they relate to the rest of the databases out there, and why we need a different one. And then the other one, which is also, I think, super interesting, is what businesses you can build around them, right? Like, why would someone do it? Why can it even be a sustainable business, and not just a feature of another database? So I think we have the right person to talk to, both on the technical side of things, but also to have a very good conversation around the business of this type of database.


Eric Dodds  02:41

All right, well, let’s dig in and talk with Bob and solve all these problems. And more. Yeah,


Kostas Pardalis  02:48

one last thing: he did have at least one boss, as he said, right? His mother. Right?


Eric Dodds  03:00

That's right. Okay, let's talk with Bob, and we'll hear about Bob's mom being the CEO of Bob's life. Bob, welcome to The Data Stack Show. We're so excited to have you.


Bob van Luijt  03:14

Well, thanks for having me.


Eric Dodds  03:16

All right. Well, you have a really interesting history. So I guess technically, you’ve never had a boss. You know, which is fascinating. So you’ve always been your own boss, which we may have a few questions about. But tell us about how you got into data. Where did you start? And then what are you doing today?


Bob van Luijt  03:34

Yeah, no, that is correct. I mean, if you exclude my mother, I never had a boss. So the


Eric Dodds  03:44

mother is boss or like, Chief Product Officer for your life.


Bob van Luijt  03:48

Exactly. Exactly. Yeah, exactly. Well, more Chief Revenue Officer. So, you know, I'm a millennial, I was born in '85, so that means you know what you got. When I was 15 years old, we had, you know, the internet, just got a connection at home and at school too. So I just started to play around, you know, build websites, that kind of stuff. And at some point, I lived in a small village in the Netherlands, and people said, hey, you know, we need websites for stuff. And I was like, you know, I can build websites. So then I got a gig selling toothbrushes and lighters on a website. I don't want to know anything about the security those websites had back then, but they were basically listings. And they said, like, how much money do you want to make? So I asked my dad, I said, like, how much money? And my dad said, just ask for like 500 bucks. So I said, I'll do it for 500 bucks, and the guy said deal. And I was like, whoa, that's a lot of money. That was exactly my reaction. So I went, as you do, to the Chamber of Commerce and I registered my company. And then, you know, you grow, right? So you learn a lot. And I grew into being more of a software consultant. I did study in between, not CS or anything. I studied music, because it's another passion. But I always kept working in technology. I studied in Boston, I got a grant to study in Boston, and then on the side I was just, yeah, doing remote work avant la lettre. So I was working from Boston for these large companies, and that grew and grew. And then at some point, I was introduced to machine learning. And that kind of changed everything, because then I was like, okay, I'm gonna stop being a freelance consultant, I'm gonna start a company, like a product company. Yeah. So that's it.
So I've been doing that for a long time already.


Eric Dodds  06:00

I love it. Were you studying music in Boston? Yes. Okay, at Berklee? Yes. Okay, and what was your instrument of choice? I mean, I know multi-talented people go to Berklee, but what's your instrument?


Bob van Luijt  06:13

Yeah, so I started with bass guitar. And one of the funny things is that people that you now hear on the radio were at Berklee at the same time that I was. And that is just super exciting. But Berklee has been super important for me in running a business now. A lot of things that I learned there, I was very young, well, I would not say very young, it was like really early 20s when I flew in to Boston, and a lot of things that I learned at Berklee are things that I'm using in building the business today. It was a very important lesson in my life, and if I could go back in history, I would probably do the same thing again. It was just a great, fantastic period in my life. So yeah, that's how I got to that.


Eric Dodds  07:10

Okay, so two questions, because I want to talk about databases and vector databases in particular, but two quick questions on Berklee, because I can't help myself. So you said that you learned a lot of lessons that help you on the business side. Can you give us maybe the top thing? You're an entrepreneur, right? Yes. What did you learn at Berklee that helps you from an entrepreneurial standpoint? And then I have a follow-up question.


Bob van Luijt  07:37

So, two things. This sounds a little bit cliché, but it's really true. If people talk about the American dream, right, I learned at Berklee what that means. People were living it 24/7, they were living that lifestyle. And everybody was dreaming big and working together to become these amazing artists. So that was one thing that I learned there. But another thing that I learned was, if the people listening have, you know, musicians among their friends or family, they know that you need to do a lot besides just playing, right? To promote your music, you need to get it out there, you need to present it online, those kinds of things. And a lot of that, when I started, is very similar to starting an open source project. But instead of code, you're shipping MP3 files, right? The mechanics are very similar. So that was something I learned. And not to go too deep into that, but I have a strong belief in something from a futurist author, Bruce Sterling. He has this talk, I don't know exactly what the name is, but if you Google it, you'll probably find it, where he says: if you want to know what the impact of a technology is on society, you need to look at what the technology is doing to musicians, because it will always happen to the music industry first. And you can fill a whole episode with that discussion, but I think that's true, because the industry is, you know, not a very strong industry, right? There are really people who do this for the love of making art, but technology plays a tremendously important role in that. So it's those kinds of things that I learned there. Now, as I grow older building the business, I keep coming back to my time at Berklee.


Eric Dodds  09:46

Man, that's fascinating. Okay, we should probably do a follow-up episode just on that, because, I mean, music is hard to monetize. You obviously saw that early at Berklee. Yet there are people with a lot of money who figured out how to sort of exploit, you know, things that people are passionate about creating, which sounds, you know, ironically identical to the venture SaaS industry, which is interesting. Okay, so a second follow-up question, and then I want to dig into your current company and vector databases. One thing that I've noticed, even on the podcast, but throughout my career, is that people who study music tend to think with both sides of their brain in a way that's unique. I don't have any science to back that up, but it's something that I've noticed enough that whenever someone's doing software and they're a musician, I'm immediately interested, because I've noticed that pattern over the years. Now, you said you discovered machine learning. What I'd love to know is: did, or does, your study of music influence the way you think about machine learning? Because there is this relationship of structure and pattern within music that is required to create a foundation, but there's sort of unlimited combinations of notes and melody that you can use to create things that are new, right? I mean, they say, you know, maybe we've only discovered five or ten percent of the possible songs that are able to be created in the history of the world, which is actually very interesting. In terms of machine learning, it feels the same way, right? Is that relationship there for you?


Bob van Luijt  11:38

Yes. And that relationship mostly sits in my mind, and let me explain what I mean by that. So when I was very young, I got, you know, interested in things like, you know, Red Hot Chili Peppers, that kind of stuff. So, like, guitar solos, right? I was interested in that, in what was happening there. And then a teacher I asked was like, well, if that's what you like, you might want to listen to the later music of Miles Davis, because there you don't have 30 seconds of guitar solo, you have six minutes. And if you double click on that, you're gonna go, hey, let's see what John Coltrane was doing. And if you double click on that, you get to the classics, and you look at things like Bach, Stravinsky, that kind of stuff, and so on and so forth. So that is how you study these kinds of things, and it gets more and more complex. And there's an aesthetic in that complexity. And every time that I was working on it and I figured something out, I could see it. I have these structures that I can just visualize, and then I see it. And that is the exact same thing that's happening to me with machine learning. When I started with Weaviate, and with these early models, the moment that I figured out what they were doing, I could see it. And everything that I currently work on is like that. I hate it if there's something happening that I don't understand. But when I say I understand it, I can visualize it. And it's the exact same mechanic in my mind. To give you a third example, I'm, as a hobby, very interested in the philosophy of language. I once even gave a TEDx talk about software and language, those kinds of things.
If I read these kinds of books, like, I don't know, I like the work of Wittgenstein, that kind of stuff, then you're reading it and you're not getting it, and then at some point, I can see it. And when I can see it, I understand it. So my point is that the mechanism in my mind is very similar.


Eric Dodds  13:38

Yep, super interesting. Okay, I love it. I mean, I think, guys, we should do an episode just on that, because it is so fun to talk about. Okay, let's get down to business. Weaviate. So you founded this company several years ago, you've been working on it for some time. What is it? What does it do? And then I want to dig into databases in general from there, and learn from you about the progression of databases on the spectrum of complexity.


Bob van Luijt  14:09

Yeah, so I think this is best explained by giving a little bit of history of what I went through. As a consultant, I was working at a big publishing firm, and they hired me to work on something. They were looking at new types of products, what they could do with scientific papers, that kind of stuff. And I was introduced to GloVe, which is a model that produces word embeddings for single words. For people familiar with this, there's this famous calculation where they do king minus man plus woman, and then, in vector space, it moves to the word queen. And I was immediately like, oh, this is exciting, this is cool. So I started to play around with this. And I got this first, fairly simple idea. I said, rather than doing this with individual words, if I take a sentence or a paragraph, and I take all the words from that paragraph and I calculate the centroid, so the center of those vector embeddings, then I can represent that paragraph in vector space. And now I can do semantic search over those kinds of things. That was very early on, that was the idea. And I wasn't sure how to build it yet, how to structure it, those kinds of things. So something very logical that you start to do is you take a database, right? We experimented with different databases, you start trying to store these embeddings in there. Storing is not the problem, retrieving is. So then you get into the situation: how are we going to retrieve that stuff? And so we were experimenting with a very early so-called approximate nearest neighbor algorithm, we can double click on that if you'd like. But I was excited to experiment with it.
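The word-vector arithmetic and centroid trick Bob describes can be sketched in a few lines of plain Python. The three-dimensional "embeddings" below are made-up numbers purely for illustration (real GloVe vectors have 50 to 300 dimensions), but the arithmetic is the same:

```python
import math

# Toy 3-D "word embeddings". The numbers are fabricated so the famous
# analogy works out; real GloVe vectors are learned from corpora.
embeddings = {
    "king":  [0.8, 0.9, 0.1],
    "man":   [0.7, 0.1, 0.0],
    "woman": [0.6, 0.1, 0.9],
    "queen": [0.7, 0.9, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# The famous calculation: king - man + woman lands near queen.
target = [k - m + w for k, m, w in zip(
    embeddings["king"], embeddings["man"], embeddings["woman"])]
nearest = max(embeddings, key=lambda word: cosine(embeddings[word], target))

# Bob's paragraph idea: average (take the centroid of) the word vectors
# to get a single vector representing the whole paragraph.
def centroid(words):
    vecs = [embeddings[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

paragraph_vec = centroid(["king", "queen"])
```

Semantic search then reduces to comparing a query's centroid against stored paragraph centroids with the same cosine function.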
And then my co-founder, Etienne, started to play a very important role, because he said, well, actually, for search, the back-then traditional search, the library that everybody uses is Lucene. Lucene sits in Solr, it's in Elasticsearch, it's in MongoDB, it's in all these databases. And the way that Lucene shards and scales is suboptimal for sharding an approximate nearest neighbor index. So then we were like, hey, wait a second, now we have this semantic search use case, and we know that there's room for a new database. Let's build a new database. And that's how the idea was born to start to work on Weaviate. So in Weaviate, you store a data object, like you would do in any NoSQL database, and you can add a vector embedding to it, right? And that is how it was born and how all these things came together. Back then we did not use the term vector database or vector search engine or anything like that. We were just looking for ways to position it.


Eric Dodds  17:09

What did you call it back then? Just out of curiosity, what did you call it? Because, I mean, vector databases and embeddings have become really popular terms with, you know, the hype in the last year around LLMs. But what did you call it back then? Because you were doing things that were sort of primitives, you know, for what came later? Yeah. A knowledge graph.


Bob van Luijt  17:31

Because you had these data objects floating around in vector space, and you could make links between them. So that was it. But that caused a lot of confusion, because the term knowledge graph was kind of adopted by the semantic web people. And I remember that I went to semantic web conferences presenting Weaviate, and people didn't get the concept of using embeddings. They were like, so you have like a keyword list or something? Or how do you map it? No, I'm using the representations from the ML models. By the way, this is before the Transformers paper, etc., was released, right? So it took a long time for people to get an understanding of what the data type of embeddings is and what you could do with them. We've been talking about this stuff for a long time; probably on YouTube you can find some of these old talks. Don't try to find them. They're probably floating around somewhere.


Eric Dodds  18:26

We won't put them in the show notes. Okay, so one more question. I'm sorry, again, I can't help myself, but before we get into databases: how did you have that much stamina? I mean, you've been doing Weaviate for what, four or five years now? Quite some time, and before embeddings were cool, before the language around vectors was cool. And obviously, you've never had a boss, so you're, of course, paving your own way. But that's a lot of endurance. It seems hard to present at a conference where people don't get it. How did you deal with that? You obviously believe in it enough to continue on, but that's hard.


Bob van Luijt  19:12

Yeah, that's a great question. So one of the things, and this is maybe also related back to what we talked about with art: I can really fall in love with something, right? And then I just put my intellectual claws in, and I just don't let go. And I'm blessed with a very wonderful life, I meet amazing people, and I'm able to build an amazing team. It was not planned; I just go with the flow. So I never saw that as an issue. I'm just enjoying the ride, and it's just amazing. So I appreciate you saying that, but that's not how it feels. It's just, yeah, I'm just going with the flow.


Eric Dodds  20:08

It's interesting. If we tie it back to music, you hear musicians talk about, you know, writing a certain song, right? And they're like, well, how did you do that? And a lot of times, you'll hear musicians describe it as: well, I didn't set out to write a catchy song, I just had something inside of me that I needed to express. And so this is just a process of me expressing what's inside of me, right? It happens to be that it's a song, and it happens to be that it resonated with a lot of other people, or whatever. But that sounds really similar to what you're talking about. Yes, and one


Bob van Luijt  20:51

of the things that I do right now, so I'm in a very fortunate position that I can talk to a lot of young people, right, who are studying or experimenting with things. And one of the things that I'm trying to get across to them is: whatever you do, make something, create something. For example, if I talk at a business school or something, when I talk about software, a lot of people show up, and I say: if you now work at a Starbucks, keep working at the Starbucks and try to build something. Don't get excited by these big companies, you know, making offers. Try to use this time you have to make something, and I don't care what. The talents I guess I have are working with software and music. But if it's cooking, then cook, right? If it's writing, write. If it's branding or design, start designing. Whatever, right, but make something, because life is so much fun if you make stuff, whatever stuff you make. And that is what I've been doing, always, I've been making stuff. And what I'm doing now, company building, that's a form of making that gives a lot of joy. So yeah, that's it.


Eric Dodds  22:21

I love it. But again, we could keep going down that path, so okay, I want to get technical. Can we talk about OLAP databases? So let's start with this: Weaviate is a form of database, and we can talk about the specifics, but I want to go back to basics. Most people who are interacting with a database are interacting with things like, you know, an OLTP or OLAP database, right? Mostly OLAP if you're doing any sort of SQL-based workflow, building analytics or, you know, anything, right? And that runs the world, right? I mean, any KPIs at any company of any size, we're talking about OLAP workflows, right? Yes. Okay. If we think about the spectrum of complexity of databases and use cases, Weaviate, as I understand it, is much further along the spectrum of complexity. And I think there's a step in between, and you correct me if I'm wrong, because you're the expert: a lot of teams, when they're working in OLAP and they have 10,000 lines of SQL and it's getting really crazy, a lot of times they'll say, okay, maybe we need to do graph, which will help us solve some of the relationships that we're trying to represent in SQL. Okay, great. So they move to graph. So I think probably a lot of people are familiar with a graph database, and then we have a vector database. And we actually haven't, on the show, had a specific discussion about vector databases. So can you just help us understand: when you go from sort of OLAP to graph to vector, is that even the right way to think about the progression of complexity in databases and use cases? Can you paint the spectrum of database use cases for us in terms of complexity?


Bob van Luijt  24:32

Yeah, so I can offer a way for people to think about this, right? It's how I think about it. Envision, let's say, a big circle, and in the center of the circle you have databases like Postgres, MySQL, those kinds of things, right? And these kinds of databases are, you could say, general-purpose databases. You can do everything with them. You can make graph connections, in the form of joins. You can store vector embeddings in them nowadays, you can store strings in them, those kinds of things. And for a lot of use cases these databases are fantastic, right? But the people designing these databases make trade-offs in the decisions they make, to support all these cases. And that means that there is a limit, right? So let's take graph as an example, because graph is a great example. If it turns out that your data set is very reliant on these graph connections, then you run into an issue at some point. And we all know, well, maybe not everybody, but there's this term "join hell", right? Yeah. So then you say, well, actually, we've run into that. So now, where we were in that circle in the center, with this core SQL, we move a little bit outside of it, right? We start to move into NoSQL space, it's not SQL anymore. And moving out, you're gonna design something from the ground up that is very good at dealing with these graph structures. If you don't have these graph structures, or you have just a tiny graph, that's fine in the center. But if you want to do something more, you move towards the fringes. And what we often see is that, if you look at the data types you have in the center: relations became graph databases, dates and times became time series databases, search became search engines.
So what we see is, it turns out that the specific data types that you have in these databases, for bigger use cases, kind of ask for their own category, basically. Sure.


Eric Dodds  26:52

Well, as you start to scale, databases emerge that solve these particular problems, for sure. Yep,


Bob van Luijt  26:59

Exactly. So what you start to see with vector databases is exactly the same thing, I think. You have a vector embedding, which is a data type in itself, right? And the uniqueness in these databases is not so much in storing them as in retrieving them fast. But one thing that we started to see, if you look back at history, is that the step that starts from the perspective of the SQL database and then goes out to the fringes is kind of skipped. We can say, okay, we see this new data type, e.g., vector embeddings, let's just start in that category, right? Let's just create that category and work in that category. Because what starts to happen is that these databases, hence the whole NoSQL, not-only-SQL thing, start to have different ways to interact with the database that are very well suited to that specific data type. You want to have different APIs with a time series database than you might want with a vector database, for example. So that is how I visualize it. So yes, everything comes together in the center, but the moment you want to really double down on one of these data types in the SQL database, you're probably better off with a purpose-built database, regardless of whether that's graph, time series, vector, whatever.


Eric Dodds  28:26

Yeah, 100%. Okay, one more question for me, because I've been monopolizing here, and Kostas, I know you have a bunch of questions, but can you just define a vector database? I mean, a graph database, I think, makes sense to a lot of people because, you know, you're creating relationships between nodes and edges, and SQL is, you know, brutal at that scale. And so a graph database is a very logical conclusion if you need to represent social relationships or something like that. So that's logical, I think. But what's a vector database? The graph thing, let's say, is social, right? You move from the center out of Postgres because you needed to represent complex social relationships, and so graph makes sense. What's the thing that pulls a vector out of the center? And can you describe the vector database?


Bob van Luijt  29:23

So a vector database is, in essence, often a search engine. In the majority of cases, it's a type of search engine where the vector embedding is the first-class citizen, right? So the way that the database shards, the way that you scale it, all those kinds of architectural decisions in building a scalable database, they go all the way back to looking at it from the perspective of the vector index that sits at the heart of it. So that's how I would define it: it's a database where the vector is a first-class citizen. And then you have a UX element to that, so the way that developers interact with the database is tailored to those types of use cases.
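Bob's definition, a search engine where the vector embedding is the first-class citizen, can be sketched as a toy store-and-retrieve loop. This is a brute-force sketch for illustration only, not Weaviate's actual implementation: production vector databases replace the linear scan with an approximate nearest neighbor index such as HNSW, and add sharding, replication, and CRUD machinery around it.

```python
import heapq
import math

# Minimal conceptual model of a vector database's retrieval core:
# objects are stored alongside their embeddings, and a query returns
# the objects whose embeddings are closest to the query vector.
class TinyVectorStore:
    def __init__(self):
        self._objects = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self._objects.append((vector, payload))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def search(self, query, k=1):
        # Brute force: score every stored object, keep the k best.
        # A real system would consult an ANN index here instead.
        return heapq.nlargest(
            k, self._objects, key=lambda item: self._cosine(item[0], query))

store = TinyVectorStore()
store.add([1.0, 0.0], {"text": "article about dogs"})
store.add([0.9, 0.1], {"text": "article about puppies"})
store.add([0.0, 1.0], {"text": "article about taxes"})

hits = store.search([1.0, 0.05], k=2)
```

The "UX element" Bob mentions is everything around this core: how embeddings get attached to objects at write time and how the query language exposes vector similarity alongside ordinary filters.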


Kostas Pardalis  30:13

Okay. What's the difference between a vector database and something like Lucene, right? Because I'm old enough to have experienced, let's say, the introduction of inverted indexes, and suddenly we were like, oh my God, we can have so much faster retrieval of data. By the way, I think I'm revealing my age a little bit here. Because Lucene is also about retrieval, right? It's about how we can trade off upfront processing to be able to search very quickly through, like, kind of unstructured data, right? And there's a lot of NLP work that has gone inside this library, right? But we have had this for many years, and we have products and businesses that have actually been built on top of that. We have Algolia, for example. And, okay, obviously another company that came from the Netherlands, if I'm not wrong, which is Elastic, right? Yes. So there is something about the Netherlands there, right? Yeah. So what's the difference? And, by the way, a second question, a follow-up question to that: do they compete? Do they complement each other? Like, how do you see the business case there?


Bob van Luijt  31:42

Thank you, this is an excellent question, because I get this question a lot, so I'm happy that you're asking it, because now I can kind of broadcast the answer. So, a little bit of a preamble before I go into the answer. There are three things that play a role here. One is search algorithms, regardless of whether that's for vectors, like approximate nearest neighbor, or for keywords, like BM25, those kinds of things. So we have the algorithms. Then we have libraries, and these libraries contain one or more of these algorithms. And then you have databases, and a database can be built around a library, but doesn't have to be. What you try to do with a database is offer the functionality that people expect from a database: for example, CRUD support, so create, read, update, delete; backups; storage; if it's transactional, you know, certain guarantees, etc., etc. So those are three distinctly different things. Now, Lucene is a collection of search algorithms mostly tailored around keyword search. In our world of software it's relatively old, and I don't mean that negatively, I mean it in a positive way: an old library that has brought a lot of value to a lot of people. And there's actually an equivalent for ML, and it's called Faiss, which is built by Facebook. So those are two libraries. And what you started to see with Lucene was people saying, hey, we can take Lucene and turn it into a database by adding that layer of functionality around it that makes it a database: Elasticsearch, Solr, and I believe Neo4j uses Lucene features, right? People started to do that. So now a very logical question would be: okay, great, we have a new data type, vector embeddings, let's add it to Lucene, right? That's a very logical thing, and people did that. So if you now see these databases that I just mentioned talk about vector search, they're often Lucene-based, and they use the vector search algorithm that's in Lucene.
So now the question becomes: why then new databases, right? Why not just leverage the same thing? And the answer has to do with this: if you use Lucene at the core of your database for search, you're, you know, bound to the JVM and those kinds of things, and you shard the database in a specific way, right? And it turns out that the algorithms used to scale approximate nearest neighbor search are suboptimal in Lucene. You can even see this in the open source Lucene community, where people are debating this to this day; some people disagree with Lucene doing this at all. And that's why we thought, okay, we believe there's room to build something new, because, of course, a production database needs to shard and replicate and what have you. And that's where that's coming from. So what people will notice is that if they use a Lucene-based search engine or database for really heavy vector processing work, they will run into scalability problems. And that is why you see these new databases. And you had a second question, but I forgot what the second question was. Sorry about that.


Kostas Pardalis  35:27

I also forgot, to be honest, but it’s okay. Let me put it this way: can one beat the other, or how do you see the future? Because if you think about it, let’s take Solr, and I’ll keep Elastic out of the equation here for a very particular reason, which is business oriented; I’d like to come back to that. If we take a system built primarily for information retrieval over shards of unstructured text, someone can argue: why do we need both? Do we need both? That’s the question. Is there a future for these kinds of inverted indexing techniques, or are they going to be abandoned because vector space is just a better representation of the semantics? Or is there a reason to keep them both, even if they’re not in the same system, even as different systems? Let’s not talk about the systems here; it’s more about how complementary the technologies are in terms of the use cases they serve when people use them in products.


Bob van Luijt  36:45

This is an excellent question, and I can actually marry the two things together. So let’s keep Elastic in after all; I’ll bring it back in to answer your question. The thing is, one thing we know when it comes to NLP, and of course we can do this with more than text, but let’s stick with text, is that mixing the two indices works best. Hybrid search yields the best results: you mix the dense and the sparse index together, so the embedding from the model and, for example, a BM25 index. That works best, and the word scale plays an important role here, because especially at scale this starts to matter. Now, what’s interesting, and here I’m taking off my tech hat and putting on my product and business hat: for everybody listening who has the ambition to build their own database company, here’s a little trick, or a thing, that I can share. With the exception of the SQL databases we talked about, the MySQLs and Postgreses of this world, what you start to see is that people position these databases around something they are uniquely good at. Take Elastic as an example: yes, the database is a search engine, and yes, that’s what a lot of people use it for, and it makes a lot of people happy. But what they actually built as a business focuses on observability and cybersecurity. So what I’m doing in my role is asking: if I take the vector database, what’s that unique thing for us? And it turns out we’ve learned what that unique thing is for the vector database; the release of GPT-3 played a very important role in that.
And it comes in the form of something called retrieval augmented generation. We can talk about that if you like. But that is the answer to your question: at some point it goes beyond the purely architectural decision. This database now exists, and it’s structured in a certain way with a certain architecture; what use cases does that enable that are unique to this database? That is how these companies start to grow around the database. So yes, Elastic is a search engine, but a business buyer might say, well, for us it’s just an observability tool, and it plays a tremendously important role there. In the same way, vector databases are starting to gravitate in another direction, the generative AI direction, where they play such a crucial role.


Kostas Pardalis  39:55

Actually, you did an excellent job in answering my next question. That’s why I wanted to keep Elastic outside, for a follow-up question, exactly for that reason: I wanted to say exactly what you said, that Elastic ended up as the product for observability, used primarily for that. And that’s where things get really interesting, when you start building businesses and products on top of the technology itself. My follow-up question would be exactly that: where do you see these embedding databases, or vector databases, whatever we want to call them, going forward? What’s the equivalent of observability for these systems? We will get to that. But before we do, I’d like to go back to high school, I would say, and talk a little bit about the algebra of vectors, cosine similarity, and the basics of how you retrieve information from these systems. Hopefully it will sound extremely familiar to everyone, before we dive into the indexing part, the very sophisticated algorithms that are approximate and all these things. I’d love to hear from you what the basic operation is that actually happens, something pretty much everyone who went through high school knows about, which I find very fascinating, actually.


Bob van Luijt  41:33

Yeah, it’s funny that you ask this question, because I’ve been looking for a lot of metaphors and those kinds of things to explain it, and the question is always how deep you want to go. Maybe this is interesting for the listeners: Stephen Wolfram wrote a very interesting blog post where, in very easy to understand language, he really goes into, hey, how do you create an embedding? He takes Wikipedia pages as an example. He says, for example, if you have to finish the sentence "the cat sat on the...", how do you get to the word "mat"? How do you get there? And then he explains that from distances between words and sentences and those kinds of things. I won’t go into that, because then we’d need another 30 minutes to answer it. But let me take as my starting point what the machine learning model does. Technically speaking, you don’t need a machine learning model for it; it’s just that otherwise you have to do it brute force, and it would take way too long. So you want to predict, and what you’re predicting, in effect, is the embedding: a geometric representation that places these data objects close to each other. Very simply put, if you think about a two-dimensional sheet of paper and you have individual words, then probably the words banana and apple are going to be closer together in that two-dimensional space than the word monkey. Or maybe the word monkey is closer to banana than it is to apple, because somewhere in the text it was trained on there were sentences like, you know, monkeys live there and they like to eat bananas, blah blah blah. So in those sentences the word banana is more closely related to monkey than to apple, for example.
But if you go to the Wikipedia page for fruits, you may see examples of fruits: apples, bananas, etc. And what we also learned was that if we do this in two dimensions, we lose too much context. We can represent it in two dimensions, we can represent it in three dimensions, but it really starts to make sense from about 90 dimensions and up. The smallest representation back in the days, from GloVe, was about 90 dimensions. But geometrically speaking it’s the same, and the distance calculations that are used, cosine similarity, Euclidean distance, those kinds of things, are the same mathematical distance metrics that you would use in a two- or three-dimensional space; you just apply them in a multi-dimensional space, but conceptually it’s the exact same thing. Stephen Wolfram gives a great example in his article where he says: if I want to brute force the calculation for the Wikipedia page for "dog", that’s kind of doable, but it doesn’t really make sense. If I want to do it for the whole of Wikipedia, I need a prediction model. And if I want to build GPT and do it for the whole web, then it’s impossible to do brute force. Theoretically it’s possible, but it’s very impractical. So the models predict where these words fit under the distance metric, and later that evolved from single words to sentences, for example; that’s why you got things like Sentence-BERT, which does it for a full sentence, etc. And that’s how it started to evolve. But in the end there are two types of models. The first type generates a vector embedding, and here we’re not talking about generative AI, not something like GPT; we’re talking about a model that generates these vector embeddings.
And it turns out you can represent text, images, audio, heat maps, what have you, in vector space. And if you store them in a database, you can find similar items. And that’s how we search.
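The distance math Bob describes is the same whether you have 2 dimensions or 900. As a minimal sketch (the 2-D coordinates below are made-up toy numbers for the apple/banana/monkey example, not real model embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D "embeddings": these coordinates are invented for illustration only.
banana = np.array([0.9, 0.2])
apple  = np.array([0.8, 0.1])
monkey = np.array([0.4, 0.9])

sim_banana_apple  = cosine_similarity(banana, apple)
sim_banana_monkey = cosine_similarity(banana, monkey)
```

The same function works unchanged on 300-dimensional vectors; only the geometry gets harder to picture.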


Kostas Pardalis  45:57

Awesome. So okay, one way to do it: let’s say you have a bag of all these vectors, and you get a query. When we say query, it’s not a SQL query, it’s just a textual question a human would write down. You turn that, again, into this vector representation, and then you start finding similarities within the bag of vectors that you have. And you can brute force that, right? Compute a cosine similarity against all of them and choose, let’s say, the best one, or the five best ones, or whatever. So why don’t we just do that for retrieval? At what point should someone who builds an application start considering these approximate algorithms instead? And how does a system like Weaviate do that at scale?


Bob van Luijt  47:00

Yeah, that’s a great question. The brute force way you described is how people do it. To go even a step further: when I started playing with this, that’s how I did it. I did it brute force. So what do you do? Let’s say you store the vector embeddings for, what did we have, apple, banana, and monkey. And you have a semantic search question where you ask, which animal lives in the jungle? You get a vector. If you brute force it, you compare that to apple, and it gives you a distance; then you compare it to banana, which gives you a distance; and you compare it to monkey, which gives you a distance. Now you have three distances, and you basically order them by the shortest distance. Great. That’s a linear scan, so if I add a fourth data object, it takes a little bit longer, and so on. Now, if you have a database for a serious project, you don’t have three data objects; you have 100,000, a million, 10 million; we have users at the billion scale. So imagine you have a production use case for an e-commerce player: you not only have a billion data objects, you also have multiple requests at the same time. You don’t want to do that brute force, because then people could go on holiday and get their search results when they come back. I exaggerate a little bit, but that’s the brute force problem. This is where the academic world started to help, because they invented something called approximate nearest neighbors. These things live in vector space: you place the query, "which animal lives in the jungle", into it, and you look for the nearest neighbors. And these algorithms are lightning fast at retrieving that information from vector space.
You pay with something, though: you pay with an approximation. With brute force you can guarantee 100% accuracy; you can’t do that with the approximate algorithms. So you pay with approximation, but what you get in return is that speed improvement, and now all of a sudden we can build these production systems. And what the database does for you is make sure you can run that production system, rather than just adopting or using the algorithm.
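The linear scan Bob describes can be sketched in a few lines; this is a toy illustration with random vectors, not how a production vector database implements search:

```python
import numpy as np

def brute_force_search(query: np.ndarray, vectors: np.ndarray, k: int = 3):
    """Exact nearest-neighbor search: one distance computation per stored
    vector, so cost grows linearly with the corpus. Fine for thousands of
    objects, impractical at the billion scale discussed above."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                    # cosine similarity to every stored vector
    top = np.argsort(-sims)[:k]    # indices of the k closest neighbors
    return top, sims[top]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64))                # 10k fake 64-dim embeddings
query = corpus[42] + rng.normal(scale=0.01, size=64)  # something near item 42
idx, scores = brute_force_search(query, corpus, k=3)
```

Approximate nearest neighbor indexes such as HNSW replace this full scan with a graph traversal, trading a small amount of recall for much faster queries.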


Kostas Pardalis  49:39

Awesome. Okay, we could talk a lot about that stuff, but I want to spend some time on the use cases and on where these embedding databases are going: what kind of product categories are starting to form there. You mentioned something about RAG, and OpenAI and LLMs. So tell us a little bit more about that. How is Weaviate, or any system like Weaviate, adding value to the user, and in what use case?


Bob van Luijt  50:23

So, and I’ll take off my tech hat again and put on my business hat: the way you can look at this is, if you build a new type of database based on a new data type, e.g. a vector database, you look at the use cases. We are open source, so people start building stuff, and you look at what these people build and you ask them, what are you building? Then, with the answers people give you, you basically put them in a box: are they building a displacement service, or are they doing something completely new? The displacement service for us is, as I like to call it, air quotes, "better search" and "better recommendation". People said, we’re not happy with the keyword search results we’re getting, so we’re going to adopt these machine learning models to do better search. That is, for example, why we got functionality like hybrid search and those kinds of things, because it helps people do better search. But then all of a sudden, and this is really in what I like to call the post-GPT era, people started to do something new. They said, we love the generative models, GPT and similar, but also the open source models, the Cohere models, nowadays the Anthropic models, whatever. But we have one problem: we want to do this with our own data. And so what started to emerge was that people used the vector database to run a semantic search query, return those data objects, and inject them into the prompt. That process is called retrieval augmented generation. And now you can double click on that again, right? You can say, okay, but that’s quite primitive.
It’s a primitive way of injecting that into the prompt; it’s nice, it works, it makes people very happy, but it’s primitive. Now there’s a lot of research happening, and not research in the sense that we need to wait two more years for it; it’s being released now, and you can already see the first signs of it. Researchers have said, let’s not do that; let’s marry the vector store that’s in a vector database with the retrieval in the model. So the model knows that it needs to retrieve, it gets back a bunch of vectors, and it’s able to generate something based on them. So now, from a business perspective, you have a very unique use case for vector databases, one that not only uniquely positions them, but also solves user problems: data privacy, not having to fine-tune your model, and explainability, so you can show that the model generated answer X because it came from document Y in the database. It ticks all those boxes. So, zooming back out, I have my two displacement services, which are great, but this one sits in my new-use-cases box, and it’s the biggest new use case we see: people using the vector database to do really nice, cool gen AI things with their own data.
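The prompt-injection flow Bob calls "primitive RAG" can be sketched as a few steps. In this sketch, `embed`, `vector_search`, and `llm` are hypothetical stand-ins for an embedding model, a vector database query, and a generative model, not any specific API:

```python
def retrieval_augmented_generation(question, embed, vector_search, llm, k=3):
    """Minimal RAG sketch: semantic search over your own data, then inject
    the retrieved objects into the prompt of a generative model."""
    query_vector = embed(question)               # 1. embed the user question
    documents = vector_search(query_vector, k)   # 2. retrieve the k nearest objects
    context = "\n".join(documents)               # 3. inject them into the prompt
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                           # 4. generate a grounded answer
```

Because the answer is generated from retrieved documents, you get the explainability Bob mentions: each answer can be traced back to specific objects in the database.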


Kostas Pardalis  53:53

Okay, so how does this work? I mean, you mentioned that you retrieve some data and you inject it into the prompt; that’s straightforward. You also mentioned there’s research that goes beyond that. Tell us a little bit more about that, because it sounds quite fascinating. What happens there? The user is actually questioning the model directly, and the model goes and retrieves the data from the index, from the vector database? Is the user aware of that? How does this work?


Bob van Luijt  54:34

So first of all, the user is not aware of this, and that’s a great thing. By the way, a quick side note: I think we should not make it so complex for our users that they need to create the embeddings themselves; we need to help them do that, and especially for the RAG use cases we just need to do it for them. So the injection into the prompt is, as I call it, primitive RAG. It’s very straightforward. But here’s something I can point people to that they might find interesting: in the Hugging Face Transformers library there’s a folder called RAG, and there they have a little bit more of what I’d call sophisticated RAG. They use two models: one model to create a vector embedding, to store data in vector space and retrieve it, and another model into which you feed the vector embeddings, and tokens come out the other end to generate an answer. The more efficiently you can marry them, or as we like to say, weave them together with the database, rather than treating them as two separate entities, the better. That means you can, for example, create smaller ML models, because now you don’t have to store all the knowledge in the model; the model just needs language understanding and needs to know, I now have to retrieve something from this. You get real-time use cases, because these databases can update very fast, and you can just keep doing vector search over them. So the closer we can bring them together and marry them, the more efficiently they work.
So now you see, for example, models that no longer output full vector embeddings but, say, binary representations of these embeddings and those kinds of things, and the database can consume that information and provide an answer. Then the UX improves tremendously, because you’ve just married the two things together.


Kostas Pardalis  56:35

Yeah, 100%. I mean, I guess we have the UX, the user experience, and the developer experience, and probably we’ll start working on the model experience, because from what I hear, we are actually building databases that are going to primarily interface with models, not with humans.


Bob van Luijt  56:51

Exactly. And I’m happy you bring this up, because this is a secondary use case that we worked on, and ironically we kind of discovered it before we got to the whole RAG use case. My colleague Connor wrote a beautiful blog post about it, about what we call a generative feedback loop. So what you now start to do: you have a bunch of data in your database, in your vector service, that interacts with a generative model. Everything we’ve been discussing until now is a flow: you have a user query, the model processes it, knows it needs to get something from the database, injects it into the generative model, and something comes out. We go from left to right. But there’s no reason why you can’t feed that back into your database. The use case Connor describes in the blog post is this: he has Airbnb data with things like the name of the host, the price, and the location, but not the description. So what he does is use the RAG approach to say, okay, show me all listings without a description; generate, based on these fields, a description for, say, an elderly couple traveling, or a younger couple; and store that, with a vector embedding, back in the database. Now all of a sudden the database starts to populate itself. You can use it to clean the data as well, those kinds of things. So we now not only have human interaction with the database, but also model interaction with the database. And that’s a second new use case that I’m extremely excited about.
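The generative feedback loop can be sketched as a loop that writes model output back into the store. Here `embed` and `llm` are hypothetical stand-ins for an embedding model and a generative model, and the listing fields loosely follow the Airbnb example above:

```python
def generative_feedback_loop(listings, embed, llm):
    """Fill in missing descriptions by generating them, then store each
    generated description back with a fresh vector embedding, so the
    database populates itself."""
    for listing in listings:
        if listing.get("description"):                 # only fill the gaps
            continue
        prompt = (
            f"Write a short listing description for {listing['name']}, "
            f"priced at {listing['price']} in {listing['location']}."
        )
        description = llm(prompt)                      # model output...
        listing["description"] = description           # ...fed back into the store
        listing["vector"] = embed(description)         # indexed for future searches
    return listings
```

In a real system the write-back would go through the database client rather than mutating dicts in memory, but the shape of the loop is the same: retrieve, generate, store, repeat.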


Kostas Pardalis  58:30

Yeah. Okay, we’re close to the end here, so I’d like to have you back at some point to talk more about this stuff. But there’s one question I have to ask, and I want you to put your business hat back on. As the founder of Weaviate, building this database and entering this new era where, let’s say, Weaviate itself is not going to be exposed to the user but to the model, it sounds like there’s a complement problem here: how do the LLM and the vector database complement each other as products? How do you see this playing out from a business viability perspective? Is this going to be a feature of the LLM? Is this going to be a category that can stand on its own? What’s your take on that?


Bob van Luijt  59:36

This is an excellent question, and as you can imagine, in the role I have in this company it’s one I think about literally daily. So let me try to answer it. I don’t think that pure vector search alone, using it as a traditional search replacement, is going to be the answer. Yes, people want that, and yes, there will be some market for it. But I think that marrying the two together, or again, weaving the two together, is where the big opportunity lies. Now, how users will interact with it, how they will get access to it, and what path they will take to it, we do not know. By the time this episode airs, one of the things you’ll see is that we have a combined offering with AWS, where you use models from SageMaker together with the vector database, intertwined. People press the button and it all comes together. We’re going to learn whether people take that path through the models or through the database, but they need both. So hopefully in a couple of weeks or months I’ll have the answer, but this is all very new and very fresh. I do think that, for us at least, this is the big new step. How do you, as a vector database, stay two steps ahead of what’s happening in the world? I think this is the answer.


Kostas Pardalis  1:01:15

Yep. Yep. Awesome. Okay. Eric, unfortunately, I have to give the microphone back to you.


Eric Dodds  1:01:25

Yes. You know, it’s funny, we sort of split the show in halves. Okay, this has been amazing, but Bob, I actually want to end on a very simple question. When you’ve had a really long week and you don’t want to think about vector databases, or embeddings, or business, and you put a record on the record player, what’s your go-to? Give us the top couple of records you put on the turntable.


Bob van Luijt  1:02:03

So recently, to relax, I’ve been listening a lot to the solo concerts of Keith Jarrett. I love that, and it goes back to what we discussed. For people who don’t know him: he gives a one-hour concert, he sits behind the grand piano, and he just starts improvising, and every night is different. I love going on that kind of journey. I also like a lot of what’s coming out of the scene in LA now, people like Thundercat; I like that a lot. So I would answer with music that takes me on a journey, so I can take my mind off things. That’s what I’m currently listening to.


Eric Dodds  1:02:56

I love it. All right. Well, Bob, it’s been so wonderful to have you on the show. We’ve learned so much, and I think we need to book another session, because obviously we’ve gone over time (sorry, Brooks), but we just had so much to cover. So Bob, thanks for giving us your time, and we’ll have you back on soon.


Bob van Luijt  1:03:16

Thank you so much, and I would love to join you again. Thank you.


Eric Dodds  1:03:20

Kostas, I think one of the things that is actually my takeaway is in many ways a question for you. We’ve had some really prestigious academic personas on the show who have done really significant things; I think about the Materialize team, and there are just some people who have done some really cool things. What was interesting about Bob is that, number one, he studied music. He drew a lot of influence from academia, but he’s self-taught, and he’s building a vector database and dealing with embeddings, which is really interesting to me. So I guess my takeaway question for you is: how do you think about that? Because you, of course, studied a lot of complex engineering in school, but it’s amazing to me that he can study music, apply some of those concepts to very deep engineering and mathematical concepts, and produce an actual product without formal training. That’s pretty amazing. I’m thinking about that, and I think you’re the best person to help me understand it.


Kostas Pardalis  1:04:49

Yeah, I’m glad you asked me. We probably need more people from music going into business. I don’t find it that strange; I think his music studies helped him so much, because at the core of music itself there’s creativity. People who play an instrument, or who try to have a career in music, have a very strong need to express themselves and to create. They are creators; it’s almost the definition of a creator. And it’s a mental creation: music always comes out of something that’s in your mind. So I think there are many similarities with writing code in the end: you start from something very abstract, something that can be represented with math or whatever, and that’s also true for music. At the same time, and that’s something he taught us, he learned a lot about business, because the music industry is brutal if you want to survive in it; it’s the definition of exposing yourself and getting rejected. So I think there are many lessons and many similarities there. The platform is not a keyboard, and writing code is not quite playing an instrument, but in the end they have important things in common, like creativity, and like being able to create something completely new, put it out there, and convince people there’s value in it. That’s one thing. The other thing is that he’s amazingly good at expressing some very deep and complex concepts, which I think is very important for anything that has to do with all this craziness around AI.
And that’s one of the reasons I would encourage anyone to go and listen to him, because I think they’re going to feel much more confident about AI as a technology and how it actually has substance and value. There were also some very interesting conversations about business: new business categories, new product categories that are out there. So please listen to him, and I hope we’re going to have him back again to talk more about it, because we could spend hours with him, for sure.


Eric Dodds  1:07:53

I agree. Well, if you’re interested in vector databases, embeddings, or database history in general, listen to the show, subscribe if you haven’t, and tell a friend. And of course, we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.