Episode 198:

Building AI Search and Customer-Enabled Fine-Tuning with Jesse Clark of Marqo.ai

July 17, 2024

This week on The Data Stack Show, Eric and John chat with Jesse Clark, the Co-Founder & CTO of Marqo.ai. During the episode, Jesse discusses the evolution of AI and machine learning in enhancing search capabilities, particularly in e-commerce. The group explores the concept of vector search and its advantages over traditional keyword-based methods. The conversation also touches on the challenges of searching for specific items, like car parts for Land Cruisers in Australia, due to the complexity of part numbers and interchangeability. They delve into the difficulties of dealing with unstructured data, such as information locked in PDFs and manuals, and how Marqo is developing AI to search and incorporate this data into relevant results. The episode covers the technical aspects of customizing embedding and language models for better search outcomes and the potential of language models to connect different data modalities for advanced search experiences, the future of interfaces, the role of new technology in search experiences, and more.

Notes:

Highlights from this week’s conversation include:

  • Jesse’s background and work in data (0:35)
  • E-commerce Application for Search (1:23)
  • Ph.D. in Physics Experience Then Working in Data (2:27)
  • Early Machine Learning Journey (4:35)
  • Machine Learning at Stitch Fix (7:28)
  • Machine Learning at Amazon (10:39)
  • Myths and Realities of AI (13:49)
  • Bolt-On AI vs. Native AI (17:26)
  • Overview of Marqo (19:46)
  • Product launch and fine-tuning models (23:02)
  • Importance of data quality (25:38)
  • The power of machine learning in search (32:02)
  • Future of domain-specific knowledge and product data (34:08)
  • Unstructured data and AI (37:19)
  • Technical aspects of Marqo’s system (39:42)
  • Challenges of vector search (43:27)
  • Evolution of search technology (48:15)
  • Future of search interfaces (50:43)
  • Final thoughts and takeaways (51:53) 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome back to The Data Stack Show. We are here with Jesse Clark from Marqo. Jesse, welcome to The Data Stack Show. We are super thrilled to talk with you today.

Jesse Clark 00:35
Yeah, great to be here. Thanks so much for having me out.

Eric Dodds 00:38
All right, you have a really long, interesting history, which we're gonna dig into in a minute, but give us the abbreviated version.

Jesse Clark 00:47
A pretty good version, yes. Started out with a physics PhD, looking at very small things, spent about six years in academia, then decided that wasn't for me. Moved into women's fashion doing data science at Stitch Fix, segued into Alexa at Amazon, then robotics and search, and then founded Marqo, which brings me here today.

John Wessel 01:06
Awesome, tons to discuss. Alright, yeah, welcome on, Jesse. So we talked a little bit before the show about the e-commerce application of search, the history of search, and why it's so complicated and messy right now. So what are some topics you're interested in covering?

Jesse Clark 01:22
Yeah, I'm really excited to talk about machine learning and vector search, some of the new capabilities that it really unlocks, and to be really forward-looking as well, because I think we're going to see a large evolution in the way that we search. We've already seen some of that with things like ChatGPT and these kinds of question-answering methods. So, you know, looking forward to it, I think it's really exciting. Yeah,

John Wessel 01:42
awesome. All right. Shall we do that? Let’s do it.

Eric Dodds 01:47
Jesse, so glad to have you on the show. And I have to entertain my curiosity just a little bit before we dive into data stuff and ask you about physics. So you have a PhD in physics; you studied very microscopic things. If my very limited understanding serves, you essentially replicated microscopes for things that are too small to image with a lens, which is insane. So I just have to ask: having a PhD in physics, what was the most surprising thing that you discovered, or the most unexpected thing, in learning so much about physics?

Jesse Clark 02:27
Yeah, that's a great question. And you understood really well what I was doing from that very brief chat. I think the most surprising thing — I did a lot of experimental physics — was just how good you had to become in these adjacent areas: things like, you know, electronics, plumbing. We used to do a lot of these experiments; we lived and died by experimental data, and we'd go to great lengths to collect it. We'd be living in a lead hutch for six days, you know, collecting all this data. We'd have to program robots to collect the data, hook up vacuum pumps, do the electrical work ourselves, but none of this was taught to you. So you had to just work it out. And this was a while ago now; there were far fewer resources on the internet. So there was really no choice but to work it out on the spot.

John Wessel 03:15
Yeah. Theoretical Physics.

Eric Dodds 03:20
Yeah, I mean, my conclusion from this is, like, if I'm going on an adventure and I need a real Renaissance man who can, you know, figure out how to hook up a vacuum pump, I just need to find someone with a doctorate in physics.

John Wessel 03:33
That's who you want if you want your team to survive. Awesome. Well,

Eric Dodds 03:39
We have a ton to cover today: talking about AI, talking about all the challenges of search, and talking about how those two worlds are colliding. But you have been in the depths of machine learning for years now. You're absolutely one of those people who, I would say, was doing machine learning — and, quote unquote, "AI"; we can talk about that term if we want — way before it was cool. Early on at Stitch Fix, you were doing machine learning across a number of disciplines: recommendations, of course, but also, you know, demand forecasting, etc. Actually, why don't we just start there? Can you give us your machine learning journey? After you did your PhD, what types of machine learning did you work on, and where?

Jesse Clark 04:35
Yeah, I think it was quite organic. When I did the PhD, a lot of what you have to do is solve a lot of problems and analyze a lot of data, and that comes down to: you've got to write the programs, you've got to write these algorithms for analysis. And so it becomes this very natural evolution — you automatically start developing your own type of machine learning as part of your PhD. In physics there was a huge amount of talk back in the 2000s about big data. We had huge amounts of data — at the time it was petabytes — and we used to carry suitcases of hard drives back from experiments. We had so much data, and we just didn't know how to analyze a lot of it, really. Everyone talked about this future state where we could get all this information from the data, but at the time we were like, I have no idea how we're going to achieve this. Of course, looking back now, we have so many tools, and big data really is something that can be leveraged. It's amazing to see — it took longer than I think everyone expected, but almost 20 years ago it was such a hot topic, and now, 20 years later, we do have a lot of tools. So yeah, the machine learning kind of happened organically. I didn't realize at the time it was even called machine learning; I just thought these were algorithms we had to write. And it wasn't until I started to look outwards from my own discipline that I realized these are actually very similar things — they're applied in many different ways, and they're really valuable to a lot of other functions outside the core of science and physics.

Eric Dodds 06:01
Do you remember the moment when you realized, okay, I build these algorithms to operate on these zettabytes of, you know, experimental data — what was the point at which you realized, okay, I actually have a machine learning skill here that I could take to industry, outside of academia? Do you remember that moment?

Jesse Clark 06:24
Yeah, I think it was slightly different. I didn't think that I had the skill — it was more like I suddenly realized I lacked a lot of skill. I was in my physics domain and then started to look outside of it, and at first I was like, I know all this stuff, I must be able to apply it, I'm going to be pretty good at this straightaway. Then I started to look at it and realized, oh, hang on a minute — actually, no, I know nothing here; I need to learn a lot more. This was quite a long time ago, very early on in the machine learning journey. But I think it was that realization that there was this huge amount I still needed to learn which really motivated me to cover those gaps. Of course, a lot of the stuff that you learn in fields like physics actually does have counterparts in other fields like machine learning — the terminology is just very different. So once you realize that, you go, oh, hang on — actually, I know more than I thought I did — because you realize what the mappings are between these subjects.

Eric Dodds 07:19
Yeah, it makes total sense. So was Stitch Fix your first sort of industry job in machine learning after you left academia?

Jesse Clark 07:28
Yeah, exactly. Stitch Fix was the first industry job — full-time data science and machine learning. And that was really exciting. One thing that was so amazing as well was just how, in experimental physics, the quality of the data and the quality of the analysis really dictated the outcomes. And it was exactly the same in industry — straightaway you could recognize the same kind of primitives: really thinking about the data, being really careful with it, and thinking about how to actually drive outcomes. Yeah, nice.

John Wessel 08:03
So rumor has it — I don't remember what year it was — that Stitch Fix was so good that some of their recommendation algorithms kind of rivaled, like, Netflix level. At least in the data community, people were super impressed with the insights they were able to extract. How did that come about? What do you think are some of the keys to that data science success? Because most companies that go the data science route and hire data scientists — it doesn't pan out how they want. And it seems like Stitch Fix was very much the exception to that.

Jesse Clark 08:42
Yeah, it was very interesting. I think they worked very hard to build an environment that allowed people to have quite a bit of freedom in terms of exploration, but then once something gained traction, there was this sort of exploitation and putting it into the business. But I think one of the secrets was honestly, like, a bunch of reformed physicists working on these data problems. It was an incredible mix of people: a lot of PhD physicists, computer scientists, neuroscientists, social scientists — a lot of people who had deep expertise in experimentation and data. And I think that was really the key: people were fanatical about this stuff. You just didn't leave anything to chance — it was, you know, wake up in the middle of the night and think, I've missed this piece in my data, coding my ETL; I need to fix that, otherwise my downstream is going to be cooked. So I think it was really that combination of getting people who really loved data, and then giving them the freedom to execute.

John Wessel 09:41
And it sounds like a diversity of backgrounds was helpful there too. I mean, even your very practical, hands-on experience with data in physics is very different from hiring a team of PhD data scientists who all have a uniform background, which I find sometimes doesn't get fully translated into actionable intelligence. So yeah, that's cool.

Jesse Clark 10:05
Yeah, I think one thing that was really noticeable as well was the diversity of backgrounds. Because again, these sorts of problems that crop up — people have seen them before; they might look slightly different in their field, but they can bring a different lens, they've got different tools. And so you get this really good better-together story, where people are able to bring in a lot of these other ideas and solve those problems.

Eric Dodds 10:27
So you moved from Stitch Fix to Amazon, and you did a couple of things at Amazon. Give us the overview: what types of machine learning problems did you solve at Amazon?

Jesse Clark 10:38
Yeah, it was really exciting. When I joined Amazon, I didn't even know what I was going to be doing there. To add a little more color: I was living in California at the time with my wife, and we had just had twins. And so then we decided to take this job with Amazon and move to Seattle — and I didn't even know what I was going to be doing there, so it was a huge leap of faith, really. I worked on a sort of top-secret project; I was basically the number three or four hire on the team. It was a really ambitious, zero-to-one kind of project, which was really exciting. Amazon has a reputation for these projects, and it's certainly true: they really take big bets and try to make remarkable things happen. So it was just great to see how that evolved — being deeply technical and working on the machine learning for this initial project, while staying very connected to the end: what customer problem are we solving, and how do we take this technology — which is quite complex, still nascent, still has a number of rough edges — and make it into something that customers are going to love and buy? That was really interesting, not just from the technical perspective but from this holistic product-development perspective, just seeing that iterative cycle. Then I moved on to robotics after that.
I saw a huge opportunity in terms of greenfield projects, taking on something ambitious again. Fulfillment centers are obviously a huge part of Amazon, and the efficiency they've been able to drive there is quite remarkable. So to be able to have a potential impact there was really, really interesting — developing a lot of the intelligence for robots, all the machine learning models that help them see, basically, and understand the world. And then, after about two and a half years doing that, I also spent some time in retail and shopping — again, starting to think about how we improve these experiences online. There were a whole bunch of different aspects that touched: how do you help people discover things? So yeah, this was super interesting.

Eric Dodds 12:49
I want to zoom out a little bit, because that is such a wealth of experience across a variety of machine learning disciplines — from making a clothing recommendation at Stitch Fix in fashion, to robots, which is crazy, and at a non-trivial scale, obviously. So can we talk about the AI landscape for a little bit? Maybe a good question to start with: what are sort of the big myths? There's so much hype out there — if you had a couple of top things where you're like, these are the things that really bother me when I see these headlines, what are those things for you? Because you are actually, truly experienced in this, literally building AI-focused technology, which we'll talk about.

Jesse Clark 13:49
Yeah, great question. There's a lot, I think, that probably gets me a bit riled up at the moment. A lot of it really is that it's not magic — there's really no silver bullet here. Solving these problems with AI still requires a lot of the same things that have always been required: you need really good data, you need really disciplined approaches, you need really good evaluations, you need to understand what's happening. So just being grounded that this is still a tool — AI is a tool, it's a technology that you can use, and all the usual rules apply. In terms of the things that are a bit misrepresented: the hype has died down a little bit, but particularly when a lot of the large language models started to come out with these incredible skills, people started to talk about these emergent capabilities and whatnot. I think now it's actually become evident that these emergent capabilities aren't so emergent. It's really that people just didn't stare at the data long enough — a lot of this stuff was already in the data, and it was entirely expected. Training these models is not magic. You don't just suddenly get AGI with our current crop of LLMs. It's really what's in the data. It's very much like weight training: you train a muscle, it gets bigger. You train an LLM on data, it gets better at that material. And I think we need to just be really grounded in what these models can do and what we want them to do.

Eric Dodds 15:17
Yeah, that makes total sense. Well, help us understand: if you had to break modern AI down into its components from a technical perspective, could you do that for us? You talked about large language models; there are vector databases giving you RAG applications. What are the main components? When you think about a modern AI application, what are the core pieces of it?

Jesse Clark 15:49
Yeah, another good question. AI, as the name suggests — artificial intelligence — is much more encompassing than something like machine learning, which is much more specific. AI is definitely much more than just an individual component like a model; it's really worth a lot more than the sum of its parts, and that's what these systems are actually able to do. From a subsystems perspective, in modern AI we're sort of centered around deep learning models, or some other machine learning model, which is driving it, but then you have all these additional components around that. Exactly like you said: being able to store and retrieve data — vector databases, vector search. We've seen large language models augmented with the ability to retrieve information, and this has now evolved into general tool use. So now you can have not just a single database, for example, but other functions that can get called out to — maybe it needs to request something — so you've got these systems now integrating a whole bunch of other data management systems, plus serving and inference. A lot of it actually looks very similar to what's happened in software engineering before. It looks super impressive and really powerful, but a lot of the same engineering practices still apply.
And so you've still got all these other components: you've got the serving, you've got the interface, and you've got, obviously, a lot of the sanity checking and the safety around it — particularly with language models, as we've seen with things like prompt injection.
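The retrieval piece Jesse describes — embed the query, find the nearest stored vectors, hand the matching documents to a language model — can be sketched in a few lines of plain Python. This is a toy illustration, not any particular product's API: the three-number "embeddings" are made-up values standing in for real model output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k stored vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

# Toy document embeddings (in practice these come from an embedding model).
docs = ["red denim jeans", "blue cotton shirt", "black leather boots"]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

query_vec = [0.85, 0.15, 0.05]      # pretend embedding of the query "jeans"
hits = top_k(query_vec, doc_vecs, k=2)
context = [docs[i] for i in hits]   # this is what gets stuffed into an LLM prompt
print(context)
```

In a retrieval-augmented (RAG) system, `context` would be concatenated into the prompt so the language model answers from retrieved facts rather than from memory alone; "tool use" generalizes this so the model can also call other functions, not just a vector lookup.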

John Wessel 17:22
Yeah. So there seem to be two mental models here, especially with SaaS. I've got one which I'll call bolt-on AI, which is: I have a product, and then I call the ChatGPT endpoint and, you know, return something, right? Things where they just kind of add on AI as an augment.

Eric Dodds 17:47
Right — yeah, to make your product better.

John Wessel 17:51
Yeah, right — but in essence kind of white-labeling what ChatGPT can do and then pointing it at their product, which is fine — versus what you guys are doing, where it's truly native AI. And I guess: how do you communicate that to people? What are some of the challenges with so much noise in the space? Noisy is probably the best way to say it — it's a very noisy, loud space. How do you communicate that to people?

Jesse Clark 18:19
Yeah, I think you're absolutely right. There's this sort of consuming of AI to build something, and then producing AI as well, and there's a big distinction there. Previously it was much more about producing the AI; there were far fewer capabilities in terms of being able to consume it and use it. Now people are able to integrate it through a single API and suddenly they're "AI-enabled." Certainly the term has become so overloaded that it's very hard to distinguish fact from fiction, but I think time will sort that out. I think we'll move very quickly away from people talking about AI as powering their application. When electricity came about, people probably advertised "electric-powered" cooking or something like that, and it was a big selling point — but now, if you said that, people would look at you quite strangely; it would be odd if you weren't using electricity. I think we'll get to the point where AI becomes so pervasive that it's just expected to be part of things. In terms of how we talk about it, we focus a lot on outcomes and business value. Sometimes the technology in the middle is actually less important than what the actual business value is and what outcomes you can derive. So we try to focus not so much on the technology and the solution, but really on the problem we're actually solving.

Eric Dodds 19:46
Maybe it'd be good for you to just give us an overview of what Marqo does before we dig into search, because you mentioned a couple of things while giving us the overview of the landscape — there's search, there are databases. When you look at Marqo, it has, I think, several components; there are layers to it. And I love it because you all have tried everything in e-commerce search, John, so I'd love for you to talk about why it's a hard problem and hear the history. But yeah, start out with just telling us what Marqo does — and I guess the vector side would be most interesting, because when you hear vector search, you instantly think vector database, and there are a bunch of vector databases out there; not all of them are created equal, obviously. So help us place Marqo correctly in the way that we're thinking about it.

Jesse Clark 20:39
Yeah, absolutely. One way to think of Marqo is as a vector search platform. The reason we're calling it that, and working towards it, is that vector search itself requires much more than just a vector database. Vector databases are built around similarity search: you put in a vector, and then you can find the nearest vectors, and that's effectively your search — it returns, hopefully, the relevant things. However, this is still a very primitive operation, and actually building any kind of search system requires many more components than that; in fact, the search itself requires a whole bunch of additional machine learning components. So with Marqo we're moving beyond just focusing on the similarity piece and thinking more holistically: how do we actually bring this vector search technology to developers so that they can integrate it? Whereas with the current wave of solutions — the vector database — everything else is left as an exercise to the engineer, who now has to implement all of the orchestration, the abstraction layer that handles the machine learning. Once they've got that, they've got their hello-world example. But if you actually have any suitably valuable search bar or application, you then need to really think about: how do you tune the search? How do you develop the models? And that's the third piece. So what we've done with Marqo is really think, okay, if someone's going to actually put this into production and have a search service that drives a lot of business value, they're going to have to cover off on all of these components.
And these are quite different technical domains that require a lot of expertise. So what we're doing is: we have the vector database, we have the abstraction layer, the orchestration, the machine learning, the inference — so people can get started straight away. It's documents in, documents out: you can search with text, you can search with images. And we're now building into this place where you can actually start to fine-tune models and integrate behavioral data — feedback from users — and have a continual system that keeps learning. So you've got this search system which is really performant, covers off all of the components, and just gets better over time.
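The "documents in, documents out" idea — the developer hands over raw documents and plain-text queries, and the platform handles embedding, storage, and nearest-neighbor lookup behind the API — can be illustrated with a toy index. Everything here is a sketch: the bag-of-words `embed` stands in for a real embedding model, and `ToyIndex` is an illustrative class, not Marqo's actual client API.

```python
import math
from collections import Counter

VOCAB = ["jeans", "denim", "shirt", "cotton", "boots", "leather"]

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words over a tiny vocab."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyIndex:
    """Documents in, documents out: embedding happens behind the interface."""
    def __init__(self):
        self.docs, self.vecs = [], []

    def add_documents(self, docs):
        for doc in docs:
            self.docs.append(doc)
            self.vecs.append(embed(doc["title"]))  # caller never sees vectors

    def search(self, query, limit=1):
        qv = embed(query)
        scored = sorted(zip(self.vecs, self.docs),
                        key=lambda pair: cosine(qv, pair[0]),
                        reverse=True)
        return [doc for _, doc in scored[:limit]]

index = ToyIndex()
index.add_documents([{"title": "denim jeans"}, {"title": "cotton shirt"}])
print(index.search("jeans", limit=1))  # best match comes back as a document
```

The point of the abstraction is the shape of the interface: callers pass documents and strings, never vectors, so the embedding model can later be swapped or fine-tuned without changing application code.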

Eric Dodds 22:46
Is that the last part? Because you just had a big product launch. Can you tell us about it specifically? I knew that out of the box you could plug Marqo in and get much more relevant search. But that last part — it sounds like now you're enabling your users to integrate their own first-party data. Is that sort of like bringing your own embeddings? Tell us about the launch; I'd love to hear a little bit more about that.

Jesse Clark 23:14
Yeah, really exciting. Like I mentioned, to get vector search to be really valuable long term, you need a few different components, and the initial Marqo focused on the vector database and the abstraction layer. Like you said, now we've just launched the ability for customers to fine-tune their own models on their own data — to get really domain-specific models which really understand the nuances. I think everyone is familiar with this: if you search for even basic things like jeans on different websites, the notion of what jeans should come back is very different — customers like a particular flavor, a particular style. Being able to actually capture a lot of these nuances is what this new product launch enables customers to do. They can fine-tune the model on their own data so it really understands the language their customers are using. It covers off on, you know, maybe new terms and slang terms; maybe it's multilingual and incorporates multiple languages. All of this can now be learned and then integrated into Marqo to provide much, much better results.
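Fine-tuning an embedding model on behavioral data typically optimizes a contrastive objective — for example, a triplet loss that pulls a query's embedding toward a product the shopper clicked and pushes it away from one they ignored. The sketch below shows only the objective, with made-up toy vectors; it is a generic illustration of the technique, not Marqo's training code.

```python
import math

def dist(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

query   = [1.0, 0.0]   # embedding of a shopper's query (toy values)
clicked = [0.5, 0.5]   # product the shopper engaged with (positive example)
ignored = [0.0, 1.0]   # product the shopper skipped (negative example)

before = triplet_loss(query, clicked, ignored)

# Fine-tuning nudges the positive's embedding toward the query; once it is
# close enough, the loss bottoms out at zero and training pressure stops.
clicked_tuned = [0.9, 0.1]
after = triplet_loss(query, clicked_tuned, ignored)
print(before, after)
```

Run over millions of (query, clicked, ignored) triples, this kind of objective is how site-specific meanings — slang, brand terms, multilingual queries — get baked into the embedding space.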

John Wessel 24:16
So I'm curious, especially with your background at Stitch Fix. I've noticed this trend where people are more concerned about privacy and tracking, and it's getting harder to get first-party data from the data stream. I think Stitch Fix noticed this early too: companies are more likely now to just ask people — quizzes are really popular right now with users. Which is kind of a full-circle thing: for a while we went the other way — let's not ask, let's just track it and see where they clicked — and now we're back to asking.

Eric Dodds 24:50
Yeah, it's just like digitizing the in-store retail interaction, right?

John Wessel 24:53
So I'm interested in maybe some applications there for your technology, where people are going to ask, well, what's in your database? And it's like, well, let's store it in a way that AI can interact with it, and then you have that higher quality, more precise data. Do you think that's important for the future of search, especially with AI?

Jesse Clark 25:14
Yeah, I think the data, again, is incredibly important, if not more important now, because the AI is trained on a particular set of data, a particular style of data. And so what can happen, and this is a long-known problem in machine learning, is you can have these distribution mismatches. If the machine learning models were trained on a particular type of data and now start to see different data, they may behave slightly differently, or a bit worse. So data quality is never going to go away. You've just got to be fanatical about data quality, and that will always pay dividends.

Eric Dodds 25:49
So, John, you talked about the history, and you've looked at multiple solutions. Can you give us a brief overview of the types of solutions you've tried or purchased? I'd also love for you to bring up some of the dreams you had about what you could do with search that were just impossible. And then Jesse, I'd love to hear whether those are things that Marqo addresses.

John Wessel 26:10
Yeah, so my history with search actually goes way back. There's an open source solution called Apache Solr that we used at a previous job to power the search for the app we were building at the time. I was part of the admin team, so we were doing the mappings and trying to run it, and rebuilding indexes in the middle of the night because things broke. It was so much work. Then we moved to Elasticsearch as the next iteration, and it was like, oh, this is nice, this is fine. And then Amazon started hosting that for you, so there's this progression. Then I moved into e-commerce. We were on Shopify, and I looked at Shopify's built-in search. We asked around, talked to people, and nobody uses it, and once I tried it out I understood why nobody uses it. But the discovery process after that was just so surprising. There are some entrenched e-commerce solutions that have been around for a long time that just do e-commerce search, and they said, oh, we're going to add Shopify, because that was the new thing, but they still have that older model. The schemas for a lot of them are very rigid: we need you to put color here and size here, whatever your parameters are. And I mean, we spent thousands of hours cleaning up data to get it into fixed schemas for search.

Eric Dodds 27:46
Did you look at, like, Algolia?

John Wessel 27:48
We actually started with one of those more bespoke vendors, then we moved to Algolia, which was better. It was a little bit more flexible: you can feed Algolia actual behavioral-type data and do some new things. And then a couple of years into it, we found these new AI features, like, oh, this is nice, and we turned them on. One of them was an AI prediction feature, to get insights into what people were typing in versus the results we were returning. And I mean, we had over a million visitors a month; it's not a small-traffic site. And the insights for us, there just wasn't much there. They had synonym recommendations, like somebody would try one thing, and it was just kind of a disappointing experience. And I'm sure it's probably improved well past that in the last couple of years. But I think the biggest disappointment was around discoverability. If I knew a keyword that was the name of the product, it worked. If I knew the part number I was interested in, it worked.

Eric Dodds 29:03
But describe a user problem to me. And then, Jesse, I'd love for you to respond while he's describing user problems. A lot of times the keywords aren't in the name of the product, especially because you were dealing with things like water filtration systems. A lot of times people are describing their problems, right? Like, I need to filter this out of my water. Right? Yeah, describe that.

John Wessel 29:27
The problem, one of the biggest ones in that space, was: does this work with X?

Eric Dodds 29:33
So, like, mapping relationships?

John Wessel 29:37
You're trying to, like, connect this pump to this or to that. And I mean, that was hard. So we built all these data models, like "fits" or "compatible with," and it just ends up being this web of relationships, and it gets really complex. I think that was one of the most difficult problems: people asking, will this work for me? Or you get into materials, or a question like, will this work at this temperature? And the answer wasn't systematically listed out as a property; it might have been buried in the description somewhere. At least for us, the long descriptions had pretty valuable information, but it was basically inaccessible from the search engine.

Eric Dodds 30:37
Jesse, solve our problem here! Yeah, that's right.

Jesse Clark 30:42
Yeah, there's a bit to unpack. One, like you described, the first problem was just the difficulty of using even keyword systems, before having any machine learning involved at all. So certainly one thing we've focused a lot on at Marqo is making the vector search technology really accessible to developers, taking away a lot of that maintenance and back-end stuff that's really hard to manage. Part of the value proposition is that we take care of a lot of that, so the developer experience of getting up and going has been a real core focus. And then, like you say, in terms of all these different problems: keyword search, if you know exactly what you want, is fantastic. It's literally finding the exact same phrase. But a lot of times you don't know what the correct language is, you don't know how to articulate it, maybe it's a question, maybe the answer is buried in the description. What we've seen with these machine learning based techniques, particularly around vector search, is that you can basically define your own relationships in terms of what's similar. So when people start asking these questions, or start querying in very different ways, you can actually start to learn those mappings. It's very flexible about what you actually define as being similar. With search, someone puts in a query and gets back products, and these have natural relationships of similarity. And you can actually learn that similarity from past interactions. It's so powerful, in that you can define what is similar.
There's no canonical "this is similar to that"; you define it through these relationships. And that enables you to ask questions, really anything you want. So it's incredibly powerful.
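As a rough illustration of the mechanics (not Marqo's implementation), vector search embeds both queries and documents into one space and ranks by similarity. The `embed` function below is a toy bag-of-words stand-in; a real learned encoder is what places synonyms and related concepts near each other.

```python
import math

def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words vector
    over a tiny fixed vocabulary. Real systems use learned encoders
    that also map synonyms and paraphrases close together."""
    vocab = ["denim", "jeans", "pants", "slim", "shirt"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

catalog = ["slim denim jeans", "denim pants", "cotton shirt"]
index = [(doc, embed(doc)) for doc in catalog]  # precomputed document vectors

def search(query, k=2):
    """Brute-force nearest-neighbor ranking over the indexed vectors."""
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(search("slim jeans"))
```

Production systems replace the brute-force sort with an approximate nearest-neighbor index, but the retrieval idea is the same.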

John Wessel 32:21
Yeah, one of the unlocks for me, in doing all this search research, is this: a recommendation is a search executed on your behalf, without input from you. Which seems obvious once you think about it, but it is a search problem.

Eric Dodds 32:38
Huh, I never framed it that way, but it makes total sense. Yeah, exactly.

Jesse Clark 32:44
I think not many people have quite realized that search and recommendations are really two sides of the same coin. Especially in e-commerce, when you've got a vague head query, maybe it's just an item of clothing, like a T-shirt, there is actually not one result. It's not like information retrieval, where you're asking a question, what is the atomic weight of gold, for example, which has a very specific answer and might have only one thing that matches. You've got this sort of degeneracy, where there are a lot of potential matches, and so it's like a recommendation problem. But then you segue into the more verbose queries, which might only have one match. So it's this fluid transition between recommendations and search, and I think being able to think about these different queries and what they actually require is definitely the right approach.
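John's framing, that a recommendation is a search executed on your behalf, can be sketched directly: both reduce to the same nearest-neighbor lookup, differing only in where the query vector comes from. This is a hypothetical sketch with made-up item vectors, not any vendor's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Pretend these item vectors came from an embedding model.
items = {
    "blue jeans":  [0.9, 0.1, 0.0],
    "black jeans": [0.8, 0.2, 0.0],
    "white shirt": [0.0, 0.1, 0.9],
}

def nearest(vec, k=2):
    """The one retrieval primitive both features share."""
    ranked = sorted(items, key=lambda name: cosine(vec, items[name]), reverse=True)
    return ranked[:k]

def search(query_vec):
    # Explicit query: the user typed (and we embedded) something.
    return nearest(query_vec)

def recommend(history):
    # Implicit query: average the vectors of recently viewed items.
    dim = len(next(iter(items.values())))
    mean = [sum(items[name][i] for name in history) / len(history)
            for i in range(dim)]
    return nearest(mean)

print(search([1.0, 0.0, 0.0]))
print(recommend(["blue jeans", "black jeans"]))
```

The only difference is the source of the query vector, which is exactly the "fluid transition" between the two problems.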

John Wessel 33:27
Yeah, I'm really curious, especially for, let's call it specialty e-commerce applications, where some domain knowledge is required to purchase the right thing. Car parts would be a good example. How far away do you think companies are from basically combining a knowledgeable model, one that knows about cars and car parts, with the product data they already have, to help people navigate a site? What does that landscape currently look like, and what do you think it will look like a couple of years from now?

Jesse Clark 34:08
Yeah, it's very interesting. I think at the moment we've still got quite early methods in terms of what we can do here in terms of understanding. So at the moment it's still very much systems with different pieces: you might have an embedding model that knows particular things, and maybe you couple that with a language model that knows certain other things. That's the current state of play. And depending on how you want the results to be displayed, you might have a language model on the outside, or maybe just raw results. If you've got ten results and you're asking the model to distill them into an answer for a customer, the language model itself at some point has to actually understand a lot about the domain. Depending on what it is, it can't just summarize; it will actually have to understand the differences. So I think what we'll see, and we're probably already starting to see this, is more end-to-end training, where the embedding models and everything else are really informing each other; these aren't done in complete isolation. Then you can get a system which is domain-specific as a whole, not just in its individual components, because they also feed into each other: the results from one thing feed into the other, and if you have issues in one component, they propagate through the system.
So being able to really optimize end to end, I think, is where we're going: having systems, language models and embedding models, that can be optimized together. And then potentially things like the storage component actually living inside the machine learning models, the large language models. So going forward, the vector database will be much more tightly integrated with the large language model, for example.

John Wessel 35:48
Because, I mean, back to what you were saying earlier, Eric, you basically just want to replicate that highly knowledgeable in-person customer sales rep, like that person you go and talk to at Home Depot who used to be a plumber for 20 years and knows everything about plumbing and can describe a solution: this is what you need. That's what you want. So how far away are we from that?

Eric Dodds 36:15
Yeah, it is interesting. You mentioned car parts. I have a hobby of actually working on Land Cruisers, which are very popular in Australia, and searching for stuff is phenomenally difficult. Because even if you have a base model number, the part interchangeability varies pretty significantly across sets of years, right? So searching for parts is so difficult, and you end up going to these forums and combing through message threads to figure out, is this the right part number for my specific thing? And it really shouldn't be that difficult, because, let's face it, all of that information exists and is actually pretty available. Interestingly enough, it's not a mystery. It's just that, as a human, you have to go through and create these explicit relationships in your own mind, because they haven't been combined from the Reddit forums. Yes. Yeah.

Jesse Clark 37:19
Yeah, I think that's what's most exciting about a lot of the current wave in AI and machine learning: we now have much better methods to use a lot of this data that exists and is really relevant, but is unstructured. Previously, you had to really curate it, you had to explicitly define these relationships. But now, if we know what's relevant, we can start to incorporate that, actually learn from a lot of this other information that exists, and bring it into the search system. That's really powerful, being able to use a lot of this other data that was previously really hard to use.

John Wessel 37:54
One other thing on this topic: we also had a ton of really useful data locked in PDFs from the manufacturer, manuals and how-to guides. How might AI search potentially unlock some of that data?

Jesse Clark 38:15
Yeah, I think there are a few different ways. One of the things we focused on with Marqo was just that problem, that you've got so much data which is unstructured and basically inaccessible. The number is something like 80% of the world's data is unstructured, and it's growing at an exponential rate. One of the ways we thought about Marqo was to think about the invariants. So much has been changing in AI, so many new models, everything's changing. So how do we build a business, and solve problems, based on these invariants? We know unstructured data is a huge amount, we know it's going to keep growing, we know people need to search it and get relevant results. That's one way we've been thinking about the problem of building Marqo. Vector search in particular allows people to search across this unstructured data in ways that were previously impossible. And now, I think, not only can you search across it, like we just discussed with the data that exists in forums, we have methods not just to search it but to actually incorporate it into a domain-specific model and really understand it.

Eric Dodds 39:20
Okay, Jesse, I want to dig into the technical side a little bit, because, as John mentioned, you can have a packaged AI application: you send data in and it sends results back, and it has its own embedding model, its own deep learning model going, whatever. You talked about Marqo as sort of a system, and now with this latest product launch, which is very exciting, you can bring your own first-party data to inform the system. But you mentioned something earlier that I think is really interesting, and I think it would be really helpful, especially for me, but I'd love for the listeners to walk away with a better understanding too: the embedding model, and then the language model, and the things you would want to customize on each of those. What are the separate concerns across the embedding model and the language model that you need to think about? And then how does that relationship work with Marqo?

Jesse Clark 40:24
So I think one thing we've done, particularly with the new product launch, is take quite a holistic approach to the way we build these systems and optimize these models. The first piece is making sure the consideration spent on the embedding side lets us optimize it in a way that actually mirrors what's being used in production, so it's not done in isolation in terms of what data is being used. Because of how the data is structured, you see that people have particular fields: they might have reviews, titles, descriptions. These are often used, and sometimes they're missing as well, so you've actually got to be robust to missing data. I think that's one of the key things. The current paradigm in vector search is that one piece of information gets turned into one vector and you just search over that. But of course that's pretty naive compared to what customers and users actually have: they have multiple bits of data, they might have some fields and not others. So the first piece is really optimizing the models around what the actual customers need and have, their use cases, their data structures, how it will be used in the system. And then also customizing the models in terms of not just the data structures but the outcomes people actually want. What are they actually trying to do? Are they trying to improve a particular aspect of the business? So we can use a lot of that domain-specific data to optimize directly for business outcomes. That's how we're thinking about it quite holistically from an optimization perspective.
And this then plays into the different pieces you mentioned, the LLMs and the embedding models; each of these does different things. The embedding model really needs to understand how to match something that comes in as a query to the information that's in the database, and how to create that relationship in a way that's going to retrieve the relevant results. Then on the outside you've got a large language model, and these are used in many different ways in search, particularly on the input and the output. On the input side, you can use them for attribute extraction or data enrichment, or data cleaning. And then on the output, you can use them to synthesize a set of results and actually reason over them. So depending on what you want to do, each of these things will depend a little bit on what you need from the language model: how much domain knowledge does it need, does it simply need to summarize, or does it need to reason about the results? If you start to go beyond pretty simple summarization and extraction, the language models themselves will have to become somewhat domain specific, so they actually start to understand a lot of the nuance of the field.
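One way to picture the multi-field, missing-data point is a weighted combination of per-field vectors that renormalizes over whichever fields a product actually has. This is a hypothetical sketch of the general technique, not Marqo's actual scheme, and the weights are invented for illustration.

```python
def combine_fields(field_vectors, weights):
    """field_vectors: {"title": [...], "description": [...], ...};
    absent fields are simply missing from the dict. Weights are
    renormalized over the fields that are present, so a product
    without reviews still gets a sensible combined vector."""
    present = {f: w for f, w in weights.items() if f in field_vectors}
    total = sum(present.values())
    dim = len(next(iter(field_vectors.values())))
    combined = [0.0] * dim
    for field, w in present.items():
        for i, x in enumerate(field_vectors[field]):
            combined[i] += (w / total) * x
    return combined

weights = {"title": 0.5, "description": 0.3, "reviews": 0.2}

full = combine_fields(
    {"title": [1.0, 0.0], "description": [0.0, 1.0], "reviews": [1.0, 1.0]},
    weights,
)
no_reviews = combine_fields(
    {"title": [1.0, 0.0], "description": [0.0, 1.0]},
    weights,
)
print(full)        # approximately [0.7, 0.5]
print(no_reviews)  # approximately [0.625, 0.375]
```

The point is that a missing field changes the mixture rather than breaking the vector, which is one simple way to be "robust to missing data."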

Eric Dodds 42:47
Makes total sense. Okay, so I have what may be a funny question. We talked about basic search indexes, where I know what I want, I know the keyword, so I search for the keyword and I get results. That's great. And then we talked about some of these challenges that are much more nuanced and very difficult, where you have these relationships that the user is not going to make explicit, and we need to infer a lot or learn from the inferences we're uncovering. What are the use cases where maybe vector search is not a good fit? Are there cases where you wouldn't necessarily use it?

Jesse Clark 43:26
Yeah, I mean, the very explicit cases. Part numbers are a good example, where you just want the exact match on the part number and nothing else. I think that's a great example.

Eric Dodds 43:39
Or the relationships, the auto parts case. Yeah, exactly.

Jesse Clark 43:48
That's right, where it's like, I know the part number, I just need to find this exact thing. But I think any of the shortcomings of vector search at the moment really come down to the model not being appropriate for the particular use case, which is also why we developed this new product where you can fine-tune embeddings on a custom domain, because that really aligns the model with the user intent. But in the future we're going to see an evolution as well. We're still in the early days of vector search. We've mostly got this single-vector representation of data, but of course that's pretty naive. So it's moving beyond a single representation into multiple representations, and having much more intelligent query models as well. Then you can have the benefits of keyword search too: a part number comes in, the system knows it's a part number and should just return exact matches, versus a question, where it can default to different behavior. So in the future, I think we'll effectively be able to absorb all of these things into vector search, and the system will know and understand exactly what needs to be done.
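The routing behavior Jesse describes can be sketched as a simple classifier in front of two retrieval paths: exact match for part-number-shaped queries, semantic retrieval for everything else. This is a hypothetical illustration (the catalog, pattern, and the mock `vector_search` are made up), not Marqo's query planner.

```python
import re

CATALOG = {
    "90915-YZZD3": "Toyota oil filter",
    "31110-60391": "Land Cruiser clutch disc",
}

# Part-number-shaped: alphanumeric groups joined by hyphens, no spaces.
PART_NUMBER = re.compile(r"^[A-Z0-9]+(-[A-Z0-9]+)+$")

def looks_like_part_number(query):
    return bool(PART_NUMBER.match(query.strip().upper()))

def exact_lookup(query):
    item = CATALOG.get(query.strip().upper())
    return [item] if item else []

def vector_search(query):
    # Stand-in for real embedding-based retrieval.
    return [item for item in CATALOG.values()
            if any(w in item.lower() for w in query.lower().split())]

def search(query):
    """Route part-number-shaped queries to exact match; everything
    else falls through to (mock) semantic retrieval."""
    if looks_like_part_number(query):
        return exact_lookup(query)
    return vector_search(query)

print(search("90915-yzzd3"))              # exact-match path
print(search("clutch for land cruiser"))  # semantic path
```

A learned query-understanding model would replace the regex, but the architecture, classify then route, is the same idea.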

Eric Dodds 44:55
Yeah, I mean, that was kind of a trick question, because I want a world where vector search is everywhere, because if you've experienced it done really well, it's so much better. I'm trying to think of how to describe it. It's one of those experiences where you're just like, yes, this is how it should work. I guess it decreases the mental load so much, and it feels so intuitive, that it just feels natural. I don't know that anyone has done search that well yet, but that's the world I want.

John Wessel 45:32
The thing that got me recently on this topic was using ChatGPT's voice chat, which came out about a year ago, and then going back to using voice search with Siri or Google. It feels so bad now. Because if you use ChatGPT's little voice feature for something you might normally ask, like, "Okay, Google," or Alexa or whatever, it's a markedly different experience in accuracy. That was something that really struck me recently. This was just a year ago, and surprisingly, they haven't really improved those two particular things. I guess that's coming. Yeah.

Eric Dodds 46:20
I mean, what do you think, Jesse? They've got to be feeling the heat, right?

Jesse Clark 46:26
I mean, it's pretty interesting. That's why search is quite an interesting space: there are obviously some incumbents there, and now there's a wave of AI they obviously have to respond to, but they have an existing business model, so they can't really hard pivot at short notice. I think we're seeing some inertia there, and obviously they're having to work out, and make a big bet on, where the future is going. But I think there's also a lot we don't see: they've got a particular business model, they're optimizing for particular things. What you don't understand from a search perspective is what the incentive is behind the results they're providing. Someone who's got web-scale search, they live off ads, they're selling ads in search, so their search results are going to be dominated by those kinds of business objectives. That's what the whole system is. So that's one thing to consider as well: the incentives of the search provider will dictate a lot of how these things are done, whether they deliberately have different results.

John Wessel 47:26
Yeah. And I guess that's why I'm slightly optimistic about voice: it isn't nearly as monetized, right? So maybe they'll innovate there faster. But, yeah.

Eric Dodds 47:35
Yeah, that's such a great point. I mean, just the erosion of search quality due to revenue, that's just what web search is now. Well, we're really close to the buzzer here, Jesse, but I want to ask you: we talked about the hype, we talked about what's overblown, and we got a great breakdown of vector search and all the exciting things there. What other things in the AI space are you personally excited about? I mean, you're building a company in this space, but what excites you as you look at the landscape?

Jesse Clark 48:15
Yeah, I think it's really exciting to see the evolution of large language models, particularly into different modalities. So large language models that are able to, like you said, hear and see, and obviously use natural language, but then using them in a way that they can act as an agent, or a controller, or a critic. You can now actually put these language models inside the system, and they can make evaluations, they can route logic. That's something that's quite powerful and was really hard to achieve otherwise. One of the great things about these models is that they've got this natural language interface. It's kind of lossy, and sometimes you do lose a lot of nuance with it, but it's also incredibly good at gluing together all these different interfaces. So it's this interface layer that allows you to connect audio, language, and video into one thing, and then it can go into a database, for example. Being able to unlock language models with these different sources of data, and then being able to take actions or produce outputs, is really exciting from my perspective. You can think about it from a search perspective: you can have a system that optimizes itself. You can have a language model that knows what good search results would be for your domain, and it can literally get sent off to collect data through search results and optimize the system. Moving in that direction, I think, is incredibly exciting.

Eric Dodds 49:37
Yeah, I feel like we're going to enter a new phase. We talked about the history, going all the way back to open source tools like Solr, then Elastic, then Algolia and all that, and that arc took an entire decade. My sense is that the next decade's worth of advances in search is going to happen in a drastically shorter amount of time, and I think it's because of what you're saying, Jesse.

Jesse Clark 50:05
It's certainly very interesting. I mean, hopefully it does get better rapidly, and everyone's using Marqo, so we can avoid the perennial frustration of searching. It's going to be fascinating to see how it goes. And I'm incredibly excited as well, just to add on to that last question, about the future of interfaces. We've seen this evolve a lot. Vector search is really powerful, and if we can move away from this idea of punching in a couple of keywords and pressing enter, and actually think about how we interface with these search systems, I think the search results and the experiences will be much, much better. So it's very exciting to think about how we can leverage these new experiences with this new technology as well.

Eric Dodds 50:47
Love it. I should have asked this earlier, but where can people go to see Marqo and try it out?

Jesse Clark 50:55
Yeah, it's marqo.ai. We've got live demos on the site, so you can check it out there, and it's on GitHub, open source, Apache 2 licensed, so you can spin it up on your laptop and get going. There's a quick start guide, and you can build your first vector search experience in literally a couple of lines of code, like image search, and really experience those aha moments. Like you said, when you first experience it, it can be quite magical. We've had some customers who, when they first stood up the end-to-end system, started searching with emojis, and all of a sudden the cat emoji returns pictures of cats, and they're like, oh my god, this is amazing. So yeah, head over to marqo.ai, or head over to GitHub, which you can get to from the site. Cool.

Eric Dodds 51:40
All right. Well, Jesse, thank you so much for giving us your time. I learned a ton, and we'd love to have you back sometime in the future.

Jesse Clark 51:47
Thank you very much. That’d be my pleasure.

Eric Dodds 51:49
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.