Episode 177:

AI-Based Data Cleaning, Data Labelling, and Data Enrichment with LLMs Featuring Rishabh Bhargava of refuel

February 14, 2024

This week on The Data Stack Show, Eric and Kostas chat with Rishabh Bhargava, Co-Founder and CEO of refuel. During the episode, the group discusses the evolution of AI, machine learning, and large language models (LLMs). Rish shares his background and the inception of refuel, which focuses on making clean and reliable data accessible for businesses through data cleaning, labeling, and enrichment using LLMs. The conversation explores the impact of LLMs on data quality, the challenges of implementing LLM technology, and the user experience of working with LLMs. They also touch upon the importance of confidence scores in machine learning and the iterative process of model training, a practical use case involving refuel and RudderStack, and more.

Notes:

Highlights from this week’s conversation include:

  • The overview of refuel (0:33)
  • The evolution of AI and LLMs (3:51)
  • Types of LLM models (12:31)
  • Implementing LLM use cases and cost considerations (15:52)
  • User experience and fine-tuning LLM models (21:49)
  • Categorizing search queries (22:44)
  • Creating internal benchmark framework (29:50)
  • Benchmarking and evaluation (35:35)
  • Using refuel for documentation (44:18)
  • The challenges of analytics (46:45)
  • Using customer support ticket data (48:17)
  • The tagging process (50:18)
  • Understanding confidence scores (59:22)
  • Training the model with human feedback (1:02:37)
  • Final thoughts and takeaways (1:05:48)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here on The Data Stack Show with Rish Bhargava. Rish, thank you for giving us some of your time.

Rishabh Bhargava 00:31
Thank you for having me.

Eric Dodds 00:34
All right, well, give us the overview. You are running a company that is in a space that is absolutely insane right now, which is AI and LLMs. But give us a brief background. How did you get into this? And then give us a quick overview of refuel?

Rishabh Bhargava 00:50
Awesome, yeah. So look, I'm currently the CEO and co-founder of refuel, but I've been generally in the space of data, machine learning, and AI for about eight years. I was at grad school at Stanford, studying computer science and doing research in machine learning. Then I spent a few years at a company called Primer AI, where I was an early ML engineer. The problems we were trying to solve back then were, you know, how do you take in the world's unstructured text, allow people to ask any question, and get a two-pager to read, all of these kinds of interesting NLP problems. And then I spent a few years after that solving data infrastructure problems: how do you move terabytes of data from point A to point B, lots of data pipeline stuff. And that led into starting refuel. You know, one of the key reasons why we started was just, how do you make good, clean, reliable data accessible to teams and businesses? That's the genesis of the company, and here we are.

Kostas Pardalis 01:46
And, Eric, you know, I've known Rish since, like, the COVID days. So it's very exciting for me to see the evolution through all this and see him today building in this space. I remember talking almost two and a half years ago about what he was thinking back then, and a lot of it was not like the thing that we have today. So for me, at least, it's very fascinating, because I have the journey of the person in front of me here, and I'm really happy to get into more details about that. So definitely we'd like to chat about that. But also, I think we have the right person here to talk about what it means to build a product and a company in such an explosive environment, where things are changing literally from day to day when it comes to these technologies, like LLMs and AI and machine learning. Just keeping up with the pace, from the perspective of a founder, I think is a very unique experience. So I'd love to also hear about what's happening out there, because he probably has a much better understanding than we do of what is going on with all these technologies, but also how you experienced that while trying to build something, right? And of course, talk about the product itself. What about you? What are some topics that you're really excited to talk about with us today?

Rishabh Bhargava 03:15
You’re gonna be super excited to talk about justice, you know, to the world of generative AI, how quickly it’s evolving. But you know, the cost is right, both of you have spent so much time talking to folks in data and how the world of LLM impacts the world of data, right? How do you get better data cleaner data, all of those fun topics? And, frankly, what does it mean, right? What are the opportunities for businesses and enterprises? When this, as you said, explosive technology is really taking off? So excited to dig into these topics?

Kostas Pardalis 03:43
Yep, I think we’re ready for an amazing episode here. What do you think?

Eric Dodds 03:48
Let’s do it. It’s good. All right, we have so much ground to cover. But I want to first start off with just a little bit of your background. So you have been working in the ML space for quite some time now, I guess, you know, sort of in and around it for close to a decade. And you could say, maybe you were working on, you know, LLM flavored stuff a couple of years ago, before it was cool. Which is pretty awesome. That’s, I would say, a badge of honor. But what changed? What were you doing back then? And what changed in the last couple of years? To where it’s a frenzy, you know, and the billions of dollars that are being poured into it is just crazy.

Rishabh Bhargava 04:31
Yeah, it’s, uh, it’s been such an incredible ride, Eric, you know, just a little bit on my background, you know, post grad school at Stanford, I joined this company called primer. This is about seven years ago at this point. And you know, the problem that we were trying to solve back then was how do you take in the world’s unstructured tax information, taking all of the news articles, social media SEC filings, and then build this, you know, simple interface for users where they can search for anything. And instead of Google style results right here, the 10 links, instead of that, what you get is a two pager to read. Right? So how do you assimilate all this knowledge, and be able to put it together in a form that is easy to consume. And this used to be, this was a really hard problem. This used to be this as many months of effort, maybe years of effort, getting it into a place where it works. And if you compare that to what the world looks like, today, I would bet you this is 10 lines of code today, using, you know, open AI and GPT. Four. So truly something, you know, some meaningful changes have happened here. And you know, I think at a high level, it’s not one thing, it’s many things that have sort of come together, you know, it’s new machinery model architectures that have been developed things like the Transformers models, you know, the data volumes that we were able to collect and gather, you know, that has gone up significantly. Hardware has improved the cost of computers, and it’s marriageable, all of these factors coming together, that today, we have these incredibly powerful models to just understand so much of the world. And you just ask them questions, and you get answers that are pretty good. And it just works. So it’s been an incredible ride these last few years.

Eric Dodds 06:12
Very cool. And give us an overview of where refuel fits into the picture when it comes to AI.

Rishabh Bhargava 06:19
Yeah, so look, at refuel we're building this platform for data cleaning, data labeling, and data enrichment using large language models, and importantly, at better-than-human accuracy, right? And the reason why we look at this way of building the product is, you know, at the end of the day, data is such a crucial asset for companies, right? It's like the lifeblood of making good decisions, training better models, having better insights about how the business is working. But one of the challenges is, people still complain, you know, hey, we're collecting all of this data, but if only I had the right data, I could do X, Y, or Z. And the reason is, working with data is an incredibly manual activity. It's very time consuming, right? People are spending tons of time just looking at individual data points, or they're writing simple rules and heuristics. Doing that work is actually hard, and it's time consuming. And the question that we ask is, you know, with LLMs becoming this powerful, what if the way of working wasn't that we look at data ourselves, write these simple rules, and do that manual work ourselves? What if we were to just write instructions for some large language model to do that work for us? Writing instructions for how work should be done is significantly easier and significantly faster than doing the work itself. And so that is a massive leap in productivity. What we want to build at refuel is being able to do a lot of these data activities, data cleaning, data labeling, data enrichment, where the mode of operation is: as humans, as the experts, we write instructions, but this smart system goes out and does the work for us.

Eric Dodds 08:17
Makes total sense. I have so many questions about refuel, and I'm sure Kostas does too, but before we dive into the specifics, you live and breathe this world of AI and LLMs every day, and so I'd love to pick your brain on where we're at in the industry. And so, you know, one of the things, and maybe a good place to start, would be: I'm interested in what you see as far as implementation of LLM-based technology inside of companies. And I'll give you just a high-level, maybe, prompt, if you will, which seems like an appropriate term. I think everyone who has tried ChatGPT is convinced that this is going to be game changing, right? But that, for many people, is largely a productivity element, right? Or you have these companions, like GitHub's Copilot, right, which again falls into this almost personal or team-level productivity category. Tons and tons of tools out there. But then you have these larger projects within organizations, right? So let's say we want to build, you know, an AI chatbot for support, or we want to adjust the customer journey that we're taking someone on with sort of a next-best-action type approach, right? It seems to me that there's a pretty big gap between those two modes, and there's a huge opportunity in the middle. Is that accurate? What are you seeing out there?

Rishabh Bhargava 10:02
That’s a great way to look at it. Eric, I think you’re absolutely right. Look, folks who have, you know, spent a meaningful amount of time with Chad GDP. Right, like, you know, you go through this experience of like, Oh, my God, it’s magical, right? Like, I asked you to do something. And this poem that I had generated is so perfect, right? You go through these moments of, you know, it works so incredibly well. There’s, as you mentioned, there’s copilot, like applications that are almost plugged into where, you know, that individual is doing their work. And it’s assisting them, it’s, you know, in the most basic form, it’s doing autocomplete. But in no more advanced form, it’s almost offering suggestions of how to be able to rewrite something or just a higher, slightly higher order activity. But there is a jump from going from, you know, that something that assists an individual person accomplish their task 5% 10% better or faster, to how you deploy applications where these large language models are a core key component, at scale, at is at a level where, you know, the team that is actually developing and building this feels like you know, what, our users are not going to be let down. Because the performance and the accuracy and there aren’t going to be hallucinations, and it’s going to scale, there’s a whole set of challenges to deal with to go from the individual use case to something that looks and feels like this is production ready. You know, and I think, you know, as we kind of roll this back a little bit, we’re very early in the cycle, right? The core technologies, you know, the substrate, this, these MLMs, they themselves are changing so rapidly, of course, you know, open AI has been building and, you know, sort of deploying these models for for a while now, Google is, you know, we’re recording on in December. So Google has just announced their next set of models. There’s a few open source models that are now coming out that are competitive, but the substrate itself, the elements themselves, they’re so new, right? Yeah. And so the types of applications that we expect to be built, you know, this is going to be a cycle of, you know, somewhere in the, you know, two to five years, where we truly see a lot of mainstream adoption. But we’re early. But the thing, I think that there’s the interesting thing is, I think there is still an interesting playbook to follow for folks who are experimenting and want to build high quality sort of pipelines that use our labs that are applications that use our lamps. So I think the playbooks that are being built out, but I think in the curve, we’re still kind of early. Yep.

Eric Dodds 12:31
Yeah, that makes total sense. I mean, if you just read the news headlines, and every company that's come out with a survey about adoption of LLMs, you would think that most companies are running something pretty complex in production. You know, I think that's probably a little bit clickbait-y, and maybe even that's generous. But what are you seeing on the ground? What are the most common types of things that companies are trying to use LLMs for, beyond the sort of personal or small-team productivity?

Rishabh Bhargava 13:07
So the way we're seeing the types of applications that are going live today: the first cut that enterprises typically take is, what are the applications that are internal only, right, that have no meaningful impact, or at least no direct impact, on users, but can drive efficiency gains internally? So, for example, if there are 100 documents that need to be reviewed every single week, can we make sure that maybe only 10 need to be reviewed, because 90 of those can be analyzed by some LLM-based system? That's one example. I think a second example that teams are starting to think about is places where they can almost act like a copilot, or almost offer suggestions to the user while the user is using the main application. It's helpful suggestions. I think one of my favorite examples is, let's say you've just captured a video, and something could automatically suggest a title, right? It's a small tweak, but it makes a nice difference to the user, and it doesn't make or break anything. The third thing that we're starting to see, and I think we're still early, but this is where we believe a lot of business value is going to be driven, is, frankly, existing workflows where data is being processed or where data consistently gets reviewed by teams internally, where the goal is: how do we do it more accurately, cheaper, faster, by essentially reducing the amount of human time involved? Right. And these are typically more business critical. The bar for success is going to be a little bit higher, so teams will have to invest a little bit more time and effort getting to the levels of accuracy and reliability that they care about. But those become core, let's say, data pipelines, they become core product features. That's the direction that we're seeing businesses head towards.

Eric Dodds 15:16
Yeah, super interesting. You mentioned efficiency and cost. Can you tell us what you're seeing out there in terms of what it takes to actually implement an LLM use case? You know, it's one of those things that's super easy to start, and then very difficult to, A, understand the infrastructure you need for your use case among all the options out there, and then, B, figure out what it will actually cost to run something at production scale, right? I mean, you can query GPT, and even if you pay for the API, it's pretty cheap to send some queries through, right? But when you start processing hundreds of millions or billions of data points, it can get pretty serious. So how are companies thinking about it?

Rishabh Bhargava 16:12
You know, it’s such an interesting question. In some ways, you know, we look at the, we were seeing it, you know, developing new applications with MLMs, it’s a journey, it’s a journey that you have to go on, where, you know, as with the journey you want, you know, somebody who’s accompanying you, and in this particular case, it’s, it’s one LLM, or like a set of algorithms that you start out with. And typically, you know, the place where people start is, there’s a business problem that I need to solve. And we were discussing, prompting initially, it’s like, can I, wouldn’t it be amazing, if I just wrote down some prompt, and the LM just solved my problem for me, that would be amazing, right. And so that’s where people start. And it turns out that, you know, some, you know, in many use cases it can take you 50% 60% of the way there. And then you have to sort of layer on other techniques, almost from the word of Allah lamps that help you sort of go from that 50 to 60%, to 70 to 80, and progressively higher. And, you know, sometimes it’s easier to think about working with Allah, and not to anthropomorphize some sort of LLM. But sometimes it’s easier to think about LLM as like, you know, like a human companion almost right. Yeah. My favorite analogy here is, sorry, this is a bit of a tangent, right, winding way to kind of talk about how to do this, but bear with me. You know, sometimes it’s easier to think about how to get LLM to do what you want them to do by thinking of what would it take a human to succeed at a test? Okay, let’s say we were to kind of go in for a math test in algebra tomorrow, right? Of course, you know, we’ve taken courses in our past, we could just show up, right and go take the test. But we’d probably get to, you know, 50 to 60%. In terms of how well we do, if you wanted to improve in terms of performance, we would, we would go in with sort of a textbook, right, we’d be treated as like an open book test. And the analogy for you know, that in the world of LLM is things like few sharp fronting where you show the LLM examples of how you want that work to be done. And then the LLM does it better, right? Or you introduce new knowledge, right, which is what your textbook does, right. And so that is the next step that typically, developers take right in terms of improving performance. And then the final thing, you know, if you truly wanted to ace the test, you wouldn’t just show up with a textbook, you’d spend the previous week, actually preparing, right, actually, you know, doing a bunch of problems yourself. And that’s very similar to how fine tuning works, right or training the LLM works. And so typically, the journey of building these LLM applications takes this path where teams will just, you know, they’ll pick it up, LLM, they’ll start prompting, they’ll get somewhere, and then it won’t be enough. And then they’ll start to introduce these new techniques that, you know, folks are developing on how to work with dilemmas, whether it’s a few short prompting or retrieval augmented generation, where you’re introducing new knowledge. And then finally, you’re getting to a place where you’ve collected enough data and you’re training your own models, because that drives the best performance for your application. So that’s the path that teams take from an accuracy perspective. 
But then, of course, you know, you were also, you know, running this in production is not just about accuracy, we have to think about, we have to think about costs, we have to think about latency, we have to think about, you know, where is this deployed? And, you know, I think the nice thing about this ecosystem is that the cost looks something today, but the rate at which costs are going down, right? It’s extremely promising. So we can start, you know, deploying something today, but odds are that in three months or six months time, the same API will just cost 3x Less Right, or there might be an equivalent open source model that is already, you know, as good, but it’s 10x cheaper. So the cost curve is kind of very positive for folks who are starting to work with other labs.
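
As a concrete illustration of the "open-book test" step Rish describes, here is a minimal sketch of zero-shot versus few-shot prompts for a classification task; the categories and example queries are made up for the sketch, and this is not refuel's implementation.

```python
# Illustrative zero-shot vs. few-shot prompts for a classification task.

def zero_shot_prompt(query: str) -> str:
    # "Show up and take the test": instructions only, no examples.
    return (
        "Classify the search query into one of: electronics, apparel, home.\n"
        f"Query: {query}\nCategory:"
    )

def few_shot_prompt(query: str) -> str:
    # "Open-book test": show the model how the work should be done.
    examples = [
        ("wireless noise cancelling headphones", "electronics"),
        ("mens running shoes size 11", "apparel"),
        ("ceramic dinner plate set", "home"),
    ]
    shots = "\n".join(f"Query: {q}\nCategory: {c}" for q, c in examples)
    return (
        "Classify the search query into one of: electronics, apparel, home.\n"
        f"{shots}\nQuery: {query}\nCategory:"
    )

print(few_shot_prompt("stainless steel chef knife"))
```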

Eric Dodds 20:11
Sure. Yeah. You know, it almost feels like cloud storage, when we started to see this extremely precipitous decline in cost, which sort of democratizes it. Alright, Kostas, I can see in your eyes that you have questions on the tip of your tongue, and I want to know what they are.

Kostas Pardalis 20:31
Yeah, I do. So, Rish, let's go through the experience that someone gets with refuel, right? Let's try to unpack it, for two reasons. One is because, obviously, I'm very curious to see how the product itself feels for someone who is new to working with LLMs. Because, as you mentioned with Eric previously, most people's first impression of LLMs comes through something like ChatGPT, which is a very different experience compared to going and fine-tuning a model or building something much more fundamental with these models, right? So I'm sure there's a gap there in terms of the experience, and the industry is probably still trying to figure out what the right way is to interact and be productive with fine-tuning and building with these models. So tell us a little bit about how that happens today. And if you can, also tell us a bit about how it has changed since you started, because it will help us understand what you've learned by building something for the market out there.

Rishabh Bhargava 21:49
Absolutely, Kostas. So look, for the experience, maybe it's easier to take an example, right? Let's say, the type of problem that we're trying to solve: let's say you're an e-commerce company, or a marketplace, and you're trying to understand what people are searching for. Given a list of search queries, what is the thing that they're actually looking for? Is it a specific category of product? Is it a specific product? This is a classic example of a classification or categorization type of task. So the way refuel works is, you point us to wherever your data lives. We can read it from different cloud storage and warehouses, you can do data uploads, and then you pick from one of our templates the type of task you want to accomplish. In this particular case it would be, let's say, categorizing search queries; that's the template that you pick. And the interface of working with refuel, once you've plugged in your data and you've picked the template, is to just write simple natural-language instructions on how you want that categorization to happen. I think that's similar to how exploring or playing around with ChatGPT feels, which is, there's just a text box. And what it's asking you for is: hey, you want to categorize search queries, help us understand what categories you're interested in, and if you were to explain this to another human, what would you write to explain and get that message across? That's the starting point here. So a user will just describe: hey, these are the categories that matter to me, and this is how I want you to categorize. Essentially, the refuel product will start churning through that data, start categorizing every single search query, and then we'll start highlighting the examples that the LLM found confusing. And this is actually a big difference from the simple use of ChatGPT, because LLMs are this incredible piece of technology, but you give them something and they will give you back something, without regard for whether it is correct or not. But if you want to get things right, it is important to know and understand where the LLM is actually confused. And so we'll start highlighting those examples to the user to say, hey, this query and your instructions didn't quite make sense, can you review this? And at that point, for the ones that are confusing, the user can go in and provide very simple thumbs-up, thumbs-down feedback to say, hey, you got this wrong, you got this right. Or they can go and adjust the guidelines a little bit, and iteratively refine how they want this categorization task to be done. And the goal really is that, in the world without LLMs, if you had to do this manually, you'd be doing this categorization every single time for every single one of those search queries. Instead of that, you're maybe having to review 1% of the data points, the ones that are most helpful for the LLM to understand and essentially do that task better in the future.
So that's what the experience of working with it looks and feels like. It's a system that is trying to understand the task that you're setting up, surfacing whatever is confusing, and iteratively getting to something that is going to be extremely accurate. And whenever folks are, let's say, happy with the quality that they're seeing, it's a one-click button, and then you get an endpoint, and you can just go and plug it into production and continue to serve this categorization for real traffic as well. That's the experience of working with the system. And, you know, it's often useful to compare it with how this would be done in the world without LLMs. In the world without LLMs, you're either manually doing it, or you're writing simple rules and then managing rules. But instead, the game with LLMs is: write good instructions, and then give almost thumbs-up, thumbs-down feedback, and that's enough to get the ball rolling and get it to be good. Now, I think the second part of your question was how this has changed and evolved as we've been building this out. There are actually two interesting things there. The first is, for us, the problems that we've been interested in have always remained the same, which is: how do we get better data, cleaner data, in less time, and so forth. So the problem of good, clean datasets always remains the same. I think the interesting changes are what we've learned: frankly, which LLM to pick for a given task, because there are more options available now, and which techniques are available that can almost squeeze the juice out from an accuracy perspective. So we've essentially just learned a lot in terms of how to maneuver these LLMs. Because at the very beginning, a lot of the onus was on the end user to be able to drive the LLM in a particular direction. But at this point, we have seen many of these problems, and we generally understand what you have to do to get the LLMs to work successfully, so that teams are not spending too much time prompt engineering, which, you know, is its own ball of wax. So that's one interesting thing that we've learned. And I think the second thing that we've learned, and I think we're going to see this in the industry as well, is that the future of the industry is not going to look like a single model that is just super capable at every single thing. We are generally headed in a direction where there are going to be different models, some bigger, some smaller, that are capable at individual things. And being able to get there quickly, and manage that process and those systems, becomes an important factor. Because for many reasons, from accuracy to scalability to cost to flexibility, being able to get to that smaller custom model ends up being super important here.

Kostas Pardalis 28:05
Yeah, that makes a lot of sense. Okay, so when someone starts trying to build an application with LLMs, and here we are talking about open-source LLMs, right, models that are open source, there are a couple of things they need to decide upon. One is, which model should I use as the base model to go and train? The other thing is that all these models come in different flavors, which usually has to do with their size, right? You have 7-billion-parameter versions, you have 70-billion-parameter versions, and in the future we're probably going to have even more variations. So when someone starts, and they have a problem in their mind, and they need to start experimenting to figure out what to do, how do they reason about that stuff? First of all, how do I choose between Llama and Mistral? Why would I use one or the other? Because my feeling is that, as you said, there's no one model that does everything, right? So I'm sure that Mistral might be better in some use cases and Llama might be better in some other use cases. But in the end, if you read the literature, all these models are always published with the same benchmarks, right? So it doesn't really help someone decide what's best for their use case. So how should a user reason about that without wasting hours and hours of training to figure out in the end which model is best for their use case?

Rishabh Bhargava 29:50
Yeah, it’s such a it’s such a such an important problem and still so hard to kind of, to wrap up, you know, still so hard to kind of get right you know, In some ways, like, there’s a few kinds of questions that are, that are kind of underneath, there’s a few things that need to be answered here. at a super high level, you know, the goal is, you know, for somebody who’s building the real application is to figure out almost viability, the right thing that we’re trying to do, like, Is this even doable? Is this even possible? Right? And so, if I were in that person’s shoes, right, the first thing that I would do is I would pick a small amount of data, and I would pick the most powerful model that is available. And I would see, can this problem be solved by the most powerful model today? If you know, giving it as much information as possible, try, you know, try to simplify the problem as much as possible. But what do you know, can this problem even be solved by DiggerLand? That’s one thing that I will try. First and foremost, the second thing that, you know, if, you know, if I started to kind of look into open source, you know, the benchmarks that folks publish, it’s, these are very academic benchmarks, they don’t really tell you too much about how well this is going to do on my data, right? Or let’s say my customer support data, right? Like, how is it? How’s Mr. Alden, and all or what is sort of llama going to know about, you know, what my customers care about, it’s hard. So the way to understand open source MLMs. And to start to get a flavor of that, I think would be, first create a small, pick a small data set, that is representative of your data, and the thing that you want to accomplish, can be, you know, a couple of 100 examples, maybe, you know, 1000, examples or so forth. And then almost, you know, if, for example, infrastructure was available to the team, then, you know, use some of the hugging phase and some of these other kinds of frameworks that are available to spin those models up. Although today we’re starting to see sort of a rise of sort of gist inference kind of provider companies that can make this available through an API as well. But I would start playing around with the smaller models, right? Like, can this problem be solved by a 1 billion parameter model, a 7 billion parameter model, right? And just see, like, you know, at what scale, does this problem get solved for me? Because odds are that if you’re truly interested in open source models, and you’re thinking of deploying these open source models into production, you probably don’t want to be deploying the biggest model? Because it’s just a giant pain. Right? So then the question becomes, like, if we do want to solve this problem, what is the smallest model that we can get away with, and there’s a few kinds of architectures and there’s a few kind of, you know, flavors from a few different data providers that there are the right ones to pick at any given moment in time. And, you know, I even kind of, I don’t even want to offer suggestions, because you know, the times from now, when we’re recording this to when this might actually go live, right, there might be new options that are available. Right? So picking something that from one of them, you know, let’s say from meta or Bistro is a good enough starting point. But then trying it out, like the smaller model, and seeing how far that takes us. Almost gives us a good indication of like, for the latencies that we want, and the cost that we want. What is the accuracy? 
That is possible?
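
Here is a minimal sketch of that "try a small model on a small representative sample" step, assuming the Hugging Face transformers library. The model is an NLI-style zero-shot classifier used as a convenient stand-in, and the labels and tickets are invented; the same harness works if you swap in a small generative model behind a text-generation pipeline or an inference API.

```python
# Quick sanity check of a small open-source model on a representative sample
# before committing to anything bigger. Assumes `pip install transformers torch`.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sample = [
    ("my reverse ETL sync to Salesforce keeps failing", "reverse ETL"),
    ("how do I install the JavaScript SDK", "SDK"),
]
labels = ["ETL", "reverse ETL", "SDK"]

correct = 0
for text, expected in sample:
    pred = classifier(text, candidate_labels=labels)["labels"][0]  # top-scoring label
    correct += int(pred == expected)

print(f"accuracy on sample: {correct / len(sample):.2f}")
```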

Kostas Pardalis 33:24
Yep, yep. That makes sense. So from what I hear from you, it almost sounds like the user needs to come up with their own internal benchmark framework, right? They need, somehow, before they start working with the models, to have some kind of taxonomy of what it means for a result to be good or bad, and ideally some way of measuring that. And I don't know if it can be just black and white, like, it's good or bad and that's it; maybe it needs to be something that can be between zero and one. But how can users do that? Because that's always the problem with benchmarks, right? Even in academia, there's a reason benchmarks tend to be so well established and long-lived, and it's not that easy to bring something new; or if you bring something new, usually that's a publication on its own, right? Because figuring out all the nuances of creating something like a benchmark in a representative way, and having a good understanding of what might go wrong with a benchmark, is important. So how does someone who has no idea about benchmarking, but who is a domain expert, do this? Like the person in marketing who wants to go and solve the problem: they are the domain experts, not you, not me, but they've probably never had to think about benchmarks in their lives, or about what a benchmark specifically means for a model. So can you give us a few hints there? I mean, I'm sure there's probably not a complete answer to that; if there were, you'd probably have gone public with your company already. But how can you help your users reason about these things, avoid some common pitfalls, or at least not be scared of going and trying to build this kind of benchmarking framework that they need to guide their work?

Rishabh Bhargava 35:35
It’s a great question, actually, I’ll ask you guys the question like, you know, in one way, in one way I can answer it like I can answer it in the direction of like, a refill actually makes this possible. But I don’t want to just show refills here. So I can also just chat about just generally, how the team should think about it. The answer probably is along the lines of like, there should be tools that do so I’m curious, like, if you guys have a sense on how you’d want this answered here. Yeah,

Kostas Pardalis 36:00
I’ll tell you my opinion. And like, it comes from a person who has, like experience with benchmarks from a little bit of like a different domain, because benchmark is like, one of the more long lasting marketing tools in database systems, with a lot of interesting and spicy things happening there with like, specific clauses, and some of them are like people cannot like bubblies, the names of the vendors and like all that stuff, which indicates like how, even in something that it’s like, so deterministic in a way, as like, building the database system, right, still, like figuring out like, what the right benchmark is, is very, like almost like a knots more than like, you know, science. But what I’ve learned is that no benchmark of theirs from academia, or from the industry either can survive the use case of the user. The user always has some, like, small, unique nuances to them, that can literally render like a benchmark, like completely useless, right and misleading. So it is, in the end, I think, more like a product problem, in my opinion. And I say products, not because there’s no engineering involved, there’s a lot of engineering involved there. But it has to be guided by user inputs like figuring out the right three dots. And I think what we see here compared to building-like systems that are supposed to be completely deterministic, is that this is like a continuous process. It’s part of the product experience itself, like the user needs you. As they create their data sets and all that stuff. They also need to create some kind of benchmark that is uniquely aligned to their problems. Now, how do we do that? I don’t know. It’s something that I think is like a very fascinating problem to solve. And I think something that can deliver, like, tremendous value for whatever, like a vendor , comes up with that. But that’s my take on that. What do you think, Erica? You might have like, you’re more of like, customer side? So you probably have more knowledge than any of us on that. Yeah,

Eric Dodds 38:15
Yeah, I mean, I think, you know, we've done a number of different projects actually trying to leverage this technology. I mean, it's funny, Rish, I think we've followed a little bit of the pathway that you talked about, right? There's the sort of personal productivity, and then there's trying to use it almost like an assistant as part of an existing process. And I think the specificity is really important. Actually, one of the places that I've seen a lot of things go wrong, in my limited view, is: well, we have an LLM, let's just find a problem. And so I think you end up solving problems that don't necessarily exist for the business. So for us, I think one of the key things is defining the specific KPIs that a project like this can actually impact, and describing that ahead of time. At least that's the way that we've approached it.

Rishabh Bhargava 39:35
Makes sense. Yeah, and Kostas, on your question, we could probably do a whole other recording on it, but look, I think benchmarking is a pretty hard problem. Because for every specific customer problem, every specific company, there's so much uniqueness in their data and in how they view the world. You know, in the world of LLMs, the term that gets used is evaluation, which is: on a given dataset, and with a specific metric in mind, how well does this perform? The metric might be as simple as accuracy, right? And accuracy is still the easier case, when there's a clear yes-or-no answer. In many cases there might not be a clear answer, so what the right metric is becomes a hard problem. So benchmarking is hard, and I think there are maybe a couple of things to think through for most teams as they go down this process. The first is: what dataset? What dataset is small enough that they can maybe manually look at it and review it, but still feels representative of their problem and the production traffic that they imagine getting? And of course, that's not going to be a static dataset, so it has to evolve over time as we see more kinds of data points come through. But that's almost question number one: what is the dataset, and how can a good product or a good tool help me find and isolate that dataset from a massive table that might exist in a data warehouse? So that's question number one around benchmarking and evaluation. And I think the second question is: what is the right metric? In some cases, it might be a metric that is more technical, something like accuracy or precision. Sometimes that metric might be more driven by what users care about and what the product team is thinking about, you know, the thing that matters to a user. I'll throw out an example: in the case of applications where data is being generated, did we generate any facts that were not available in the source text? That is a metric you could write down that matters a lot to users. And so then it's a combination of: how is the dataset evolving over time, and what is the metric and the threshold that we think will mean success or failure for this application? It's a combination of those things that teams end up thinking about, and the best teams think about this before a single line of code is written. But sometimes it's hard, right? Sometimes you don't know the bounds of what the technology can offer, or how the dataset might evolve over time. Or sometimes the threshold that somebody sets is just because they heard it from somebody at another company, but it turns out it's not meaningful enough in that particular business. And so you're right, it's a super hard problem, it's very complicated. But I think, with better tools, this will become easier for people. And in many ways, this is one of the most important things to get right, because the more time that gets spent here, the easier some of the infrastructure problems and tooling questions downstream become; which LLMs to use is driven by decisions that are made at this stage of the problem statement.
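
To ground the "dataset plus metric plus threshold" framing, here is a minimal evaluation sketch assuming scikit-learn; the gold labels, predictions, and 90% threshold are illustrative, not any particular team's numbers.

```python
# Minimal evaluation harness: a small labeled set, a metric, and a threshold
# agreed on up front. Assumes `pip install scikit-learn`.
from sklearn.metrics import accuracy_score, precision_score

gold = ["ETL", "reverse ETL", "SDK", "ETL", "SDK"]          # human-reviewed labels
predicted = ["ETL", "reverse ETL", "ETL", "ETL", "SDK"]      # model outputs on the same rows

accuracy = accuracy_score(gold, predicted)
precision = precision_score(gold, predicted, average="macro", zero_division=0)

THRESHOLD = 0.90  # success/failure bar the team agrees on before writing pipeline code
print(f"accuracy={accuracy:.2f} macro_precision={precision:.2f} pass={accuracy >= THRESHOLD}")
```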

Kostas Pardalis 43:14
Yep, yep, 100%. I think this is the right time, since we have the luxury here of having both the vendor in the space, you, and also the user, which is Eric. RudderStack is evaluating some tools that they are trying to build using LLMs, and they are doing that with refuel. I think it's an amazing opportunity to go through this experience with both the person who builds the solution and the person who's trying to solve the problem, and see how it works in the end, with very unique depth and detail. So I'll give it to you, Eric, because you know all the details here. But as a listener of the episode now, I'd love to hear your experience with trying to solve a problem using LLMs, and how that happens by interacting with and using refuel as the product.

Rishabh Bhargava 44:17
Sure. Maybe I’ll get to ask Eric a couple of questions as well about his experience here.

Kostas Pardalis 44:23
Yeah, totally. Refuel the product, live customer feedback, that's what we're doing here.

Eric Dodds 44:31
Yeah. You know, it’s really fascinating. One of the I’ll just go to the high level use case we had actually met Rish we met talking about the show and having you on and as I learned what refuel did, you know, this light bulb kind of went off? And I think I remember asking you in that initial introductory discussion, Hey, would it work for something like this? And you said So we hopped on a call, but cost us that use case is? Well, you know, one of the things that I am responsible for my job is our documentation. And documentation is a really interesting part of a software business, right? There are many different aspects to it. There are many different ways that people use it. Right, they may read documentation to educate themselves about what the product does. But it’s also used very heavily and in large part intended for people who are implementing the product and actively using it. And so, one discussion that we’ve had a lot on the documentation team is how do we define success with the docs, right, and that sort of, you know, the, that sort of comes from a process of quarterly planning, what are the key things that we want to do in the documentation, and one of the things that we discovered was that there’s a lot of low hanging fruit where if you have documentation that’s been developed over a number of years, you know, and you have 1000s of different documents, in your portfolio, there are some that are old and need to be updated, you know, or that were done quickly, and need to be updated, etc. But once you sort of address the things that are objective problems, which, you know, thankfully, you have a lot of customers and customer success, people can sort of point those out for you to do that, you know, provide the feedback there. One of the challenges is, where do you go next in order to improve it, right, because there are obviously opportunities for improvement. But it’s hard to find those out. And analytics themselves are a challenge. Because you can have lots of false negatives, false positives and false negatives. And so I’ll give you just an example of one metric, like time on site. If you have a blog, and you’re analyzing blog traffic time on site, generally you want more time on site, right? Because it means that people are spending a longer time reading the content and engaging with the content. But in documentation, time on site can be a good thing or a bad thing, it can mean that someone is following a detailed process that should take a certain amount of time. And so they’re on the page for, you know, they can be on the page for a long time. But it could also indicate that they don’t understand what they’re reading. And they keep trying things that aren’t working and returning to the documentation. So how do you know? How do you know that’s the case, and there are a number of ways that you can determine that, or attempt to determine that. But one of the things that we thought a lot about was how we can narrow down those problem, or you know, those problem areas or opportunity areas, and how we can hold the docks accountable to some sort of metric that is measurable over time, where we can see sort of true, you know, if we make it if we uncover one of those and then fix it, and then how do we measure that over time going beyond just sort of raw metrics. And one of the richest repositories that we believe is like a compass for this project is our Customer Support Ticket data? Right? 
Because if we can triangulate, you know, if there are enough customer support tickets at a certain time with a certain sentiment or a certain outcome that is aligned to a metric, like time on site, or some other metric, then that will indicate to us whether it’s a good thing or a bad thing, right? And if it’s a bad thing, then we can fix it. And then subsequently, we should see customer support tickets related to that specific documentation or set of documentation decline over time, right. And so that was a high level project, the challenge is the customer support team. So I went to the customer support team and said, hey, you know, this is what we want to do with the documentation. And they loved the idea. But they said, you know, the problem is, we’ve tried to do this before. And it just was, it was untenable, right? I mean, you’re talking about, you know, 1000s 10s of 1000s. I can’t remember what the exact number is. But it’s a lot, right? And so, even if you try to pull a random sample and have a couple of, you know, technical account managers go through and try to label the tickets. There’s all sorts of challenges, right? The first one is you have to decide on a taxonomy. If you want to change that you have to go back and redo all the work. I mean, they basically said, we tried this and it didn’t work. And so that’s when we that literally around that time was when I talked to you Redash And we had that initial conversation and so I said, Hey, we have a ton of unstructured data. And we essentially need to tag it according to categories. And so yeah, that’s been interesting. Actually, that’s been a super interesting project.

Kostas Pardalis 50:18
Okay, and tell us a little bit more about tagging. So, first of all, you mentioned the taxonomy, right? What does the taxonomy look like in this context?

Eric Dodds 50:32
Yeah, that’s a great question. So when you think about tagging data, and I’m not an expert, and tagging data, but for our particular use case, when you think about tagging data, you need to be able to, you need to be able to aggregate and sort the data according to a structure so that you can identify areas where, you know, a certain tag may over index for, you know, tickets that are negative in sentiment, or what, however, you wanted to find that right? I almost think about it, as you know, if you were creating a pivot table in a spreadsheet, how would you structure the columns such that you can create a pivot table with drill downs that would allow you to, you know, to group the results. And one thing that we actually started out with a very simple idea that’s proved to be very helpful, but it’s been trickier than we thought to nail down a taxonomy actually enriched, we haven’t, I don’t think we’ve talked about this since we kicked off the project. So here’s some new information for you. Initially, we just took the navigation of the docks, you know, in the sidebar as our taxonomy, because we thought that would be, even though we actually need to update some of that information architecture, that least we can, we have a consistent starting point that maps tickets, one to one with the actual documentation. The challenge that we face, and actually one of the things that refill has been very helpful with is that the groupings in this, if you just list out all the, you know, essentially the navigation, or even, you know, one or two layers down in the navigation, as the, as essentially, the tags or the labels that you want to use for each ticket, you quickly start to get into what is technically fine for navigation, but practically needs to be grouped differently, if that makes sense. And so a great example would be, you know, something like, the categorization of sources, mobile sources, server side sources, you know, that sort of thing. And you may want to, you know, like for SDKs, or, you know, whatever. They’re just the ways that you practically want to categorize things differently, or group things differently, if that makes sense. Or another good example is like, all of our integrations, you know, we have hundreds of integrations. And, you know, in documentation, they’re just sort of all listed, right? But it actually can be helpful to think about groups of those as like analytics destinations, or marketing destinations, or whatever. And so into like, so what refills allowed us to do is actually test multiple, different taxonomies, which has been really helpful. And so the practical way that we did that was we took a couple 100 tickets as a just random sample. And we wrote a prompt that defined the taxonomy and gave the LLM, an overview of what it’s looking for, you know, related to sort of each, each label that we wanted. And we just tested it, right, and we sort of got the results back, and have been able to modify that over time. Which has been really helpful. And so that was interesting to me. Initially, I thought, we’ll just have a simple taxonomy, it doesn’t matter. But then from a practical standpoint, the output data does really matter for the people who are going to be trying to, you know, sort of use it.

Kostas Pardalis 54:17
And when you say the user is going to use it, is this internal or external? Is this taxonomy primarily interpreted by the customer success folks at RudderStack?

Eric Dodds 54:30
It's both the documentation team and the customer success team, actually.

Kostas Pardalis 54:35
Okay, and how do they use this taxonomy? So, let's say you found the perfect taxonomy and they're using it. There are all these categories, A, B, C, D, whatever, with their label names. What's next? You feed a new ticket in there, and it's mapped to one of the taxonomy categories. How does this work for the user?

Eric Dodds 55:02
Yeah. So I think there were a couple of things that were the initial things we wanted to do, and we're fairly close now. Actually, one of the other things that we've learned is that going through iterations really helps with the level of confidence that the model provides. So, one really nice thing, and I'll actually tell you, one of the things we tried really early on, before we started using refuel, was just wiring the GPT-3 API up to a Google Sheet and dumping in the unstructured data and a list of tags or whatever. But the problem of hallucination is a severe problem in that context, because it's just going to provide you an answer either way. And so one of the things about refuel that was very helpful for us is that you can essentially define a confidence threshold, and it just won't return a label if it doesn't reach that threshold, right? And one of the things that is really nice about that, and I don't know if this was the intention, Rish, is that the percentage of unlabeled tickets is kind of a proxy for how well we're defining the taxonomy and the instructions we're giving it, which is very helpful. Even just this morning, actually, we've been making iterations to this, and we have an extremely high level of confidence across most tickets now, which is really nice. Whereas before, when we were iterating, we had very primitive prompts, I would say, and so maybe you'd get 60 or 70% of the tickets labeled, or something like that. And now we're into the high 90s, which is pretty nice. And so the first step was getting confidence and aligning with the customer success team: let's spot check these and see if this is relevant. And we're now at the point where we're going to run the entire set of unstructured tickets. The first thing we're going to do is take that and do planning around a couple of things. On the documentation side, which I'm closer to, it's identifying the docs that we need to improve, and then setting up a structure to track this on a monthly basis. We'll basically operationalize tickets going into refuel and coming back with labels, and then we'll track over time the quantity of tickets for a particular label or set of labels. So on the documentation side, that's how we'll measure these key updates that we do. And then the customer success team, I think, has a number of ways that they're going to use this. So if you imagine a new customer is onboarding, and they can see the sources and destinations that they're using, or the particular use case that they have, they already know, quantitatively, what to expect, and then the interesting thing for them is, qualitatively, hey, I have a group of tickets, I can browse through a couple hundred tickets related to this problem and figure out, anecdotally, where people ran into problems and at which point in the process, and they can actually update their onboarding processes. Hopefully the documentation helps a lot, but it can only go so far, right? So the customer success team can actually update their processes to say: here's a customer, here's the tech stack that they're running, here are the use cases that they want to implement, and we know ahead of time that we need to watch out for these things.

Kostas Pardalis 58:19
That makes total sense. Just one more question from me, and then I'm done, I'm not going to ask anything more. But it's so interesting. There's a very key piece of information here that Eric talked about, and that's confidence, right? That's something refuel returns, the confidence level of the model in terms of the job it did with the data. But what does this mean? Because it boils down to a number at the end, right? There's a lot going on behind the scenes to get to this number. And it probably has to be interpreted differently depending on many different factors, like, why do we need 0.9 instead of 0.99, or 0.7? Right? I don't know. So tell us a little bit more about what this confidence level is and how people should think about it.

Rishabh Bhargava 59:22
Yeah, great question, Kostas. And Eric, thanks for the story. Honestly, I just loved hearing your thought process and your experience as you went through it. Maybe I'll have a question or two for you in a second. On the confidence bit, it can get pretty technical pretty quickly, right? But the main reason for having rigorous ways of assessing confidence comes back to the fact that LLMs are, you know, text in, text out. They'll produce an answer for you either way. So the question becomes, when do we trust this output? When do we trust this response? I'll tell you a little bit about how we do it internally: we actually have custom models that we've fine-tuned and trained that are purpose-built to produce accurate and reliable confidence scores. And the way to think about and interpret this number, using the support ticket tagging use case Eric was mentioning: let's say with RudderStack tickets, it's either about ETL, or reverse ETL, or it's about the SDK. Confidence is a measure of how likely the output is to be correct. If we say a particular output, let's say reverse ETL, has 90% confidence, the model has a 90% chance of being correct. So the goal is for the confidence score to be calibrated to correctness, if that makes sense. That's the eventual end goal of having these confidence scores. Then, when you get these scores and outputs, you should be able to set a threshold for your specific task and say, hey, I want a threshold of 90% confidence, because that means everything above it is going to be 90% correct or more. So you get that calibrated level. Of course, getting confidence scores to be well calibrated and correct is an ongoing research problem, and something we invest a lot of our technical resources into, but it's absolutely critical to get right, and we prioritize it. Otherwise, being able to rely on these outputs becomes hard. That's how we think about confidence scores.
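Editor's note: a small sketch of what "calibrated to correctness" means in practice, assuming you have held-out ground-truth labels to compare against. This is a generic calibration check, not refuel's internal method.

```python
# Bucket predictions by confidence and compare each bucket's average
# confidence to its observed accuracy. For well-calibrated scores the two
# numbers track each other (e.g. the 0.9 bucket is roughly 90% correct).
# `predictions` is assumed to be a list of (predicted_label, true_label, confidence).

from collections import defaultdict

def calibration_table(predictions, num_buckets=10):
    buckets = defaultdict(list)
    for predicted, actual, confidence in predictions:
        bucket = min(int(confidence * num_buckets), num_buckets - 1)
        buckets[bucket].append((predicted == actual, confidence))

    rows = []
    for bucket in sorted(buckets):
        outcomes = buckets[bucket]
        accuracy = sum(correct for correct, _ in outcomes) / len(outcomes)
        avg_confidence = sum(c for _, c in outcomes) / len(outcomes)
        rows.append((avg_confidence, accuracy, len(outcomes)))
    return rows  # each row: (mean confidence, observed accuracy, count)
```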

Eric Dodds 1:01:43
Yeah, I guess I forgot to add a very important detail, Kostas. The way this works, and there may be more going on under the hood, Rish, I'm sure there's a lot more going on under the hood, but as a user, you can actually go in and look at the individual data points, the individual tickets in our case. You can interact with a ticket and essentially tell the LLM that it actually belongs to this label, or that it was mislabeled, right? So you're basically training the model on the pieces it's not confident about. And it kind of makes sense that initially, especially with a primitive prompt, you get stuff back with a low confidence level, but then it's a human in the loop, essentially. You can go in and literally tag the tickets and interact with them. So let's say we put in a couple hundred tickets, someone goes in and tags 20 or 30 of them, you get through a couple of pages, and then refuel essentially tells you, okay, it's ready to rerun based on this feedback, right? Then the confidence level increases. So you can iterate through that and give the LLM feedback on whether its labels are accurate or not.
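Editor's note: a rough sketch of the human-in-the-loop iteration Eric describes, written with placeholder callables rather than any real product API. The labeling, review, and refinement functions are assumptions standing in for whatever the tool exposes.

```python
# Generic human-in-the-loop labeling loop:
#   label_fn(tickets)  -> list of (ticket, label, confidence)
#   review_fn(items)   -> human-corrected (ticket, label) pairs
#   refine_fn(corrs)   -> feeds the corrections back before the next rerun
# All three callables are placeholders, not a real API.

def iterate_with_feedback(label_fn, review_fn, refine_fn, tickets,
                          threshold=0.90, max_rounds=5, review_batch=30):
    results = []
    for _ in range(max_rounds):
        results = label_fn(tickets)
        # Sort the low-confidence items so a human reviews the shakiest ones first.
        low_confidence = sorted((r for r in results if r[2] < threshold),
                                key=lambda r: r[2])
        if not low_confidence:
            break  # every ticket already clears the confidence threshold
        corrections = review_fn(low_confidence[:review_batch])
        refine_fn(corrections)  # rerun the labels with the feedback incorporated
    return results
```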

Rishabh Bhargava 1:03:17
Exactly, and that's such a good way to put it. The recommendation is to spend a little bit of time on the ones that are less confident, where the model isn't sure. Every single piece of feedback you collect helps the next data point get better. And eventually you get to a place where you just start plugging in new data as it's generated and get high-quality outputs out.

Eric Dodds 1:03:45
You know, one of the other interesting things, now that I'm thinking through all the details, that makes it tricky to use an LLM with unstructured data, and this is about the taxonomy, Kostas: one of the other reasons this has been such an iterative process is that users will often use generic terms, or terms that are different from what you have in, say, the titles of your documentation. So over time we've actually had to adjust the prompt to include these conditions when we notice them in a high-level review. We say SDK, but someone may say JavaScript snippet, or something like that, right? And that is actually pretty difficult. It's very difficult, but the nice thing is it's made that process faster. We've noticed multiple categories where people use terminology that isn't in our documentation and that we don't really use, but that's how they refer to it because they're familiar with a related concept.
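Editor's note: one simple way to fold that kind of customer terminology into a classification prompt, so that, for example, "JavaScript snippet" still maps to an "SDK" category. The category names and synonyms below are made up for illustration and are not RudderStack's actual taxonomy or refuel's prompt format.

```python
# Illustrative prompt builder that adds terminology hints for each category.

SYNONYMS = {
    "SDK": ["JavaScript snippet", "JS tag", "tracking snippet"],
    "Reverse ETL": ["warehouse sync", "data activation"],
}

def build_prompt(ticket_text, categories):
    hints = "\n".join(
        f"- Treat mentions of {', '.join(terms)} as '{category}'."
        for category, terms in SYNONYMS.items()
    )
    return (
        "Classify the support ticket into exactly one of these categories: "
        f"{', '.join(categories)}.\n"
        f"Terminology hints:\n{hints}\n\n"
        f"Ticket:\n{ticket_text}\n"
        "If none of the categories clearly applies, answer 'unsure'."
    )
```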

Kostas Pardalis 1:04:58
That's super interesting. Okay, I think we should make a promise here that in a couple of weeks, as this project progresses, we'll get both the people from refuel and the people from RudderStack who were involved and actually go through the project. I think it's going to be super, super helpful for the people out there. From my perspective, one of the issues with LLM apps right now is that there's so much noise and so much very high-level information. Everything sounds exciting, but when you get into the gory details of trying to implement something in production, things are very different. Hearing from people who actually did it, I think, can drive tremendous value. So if both of you are fine with that, I think we should have an episode dedicated to this, where we go through the use case itself and hear from the people who made it happen.

Eric Dodds 1:06:00
Sure, yeah, that'd be awesome. We'll get the customer success team on, too.

Kostas Pardalis 1:06:04
Alright, I think we're at the buzzer here. What do you think, Eric? That's your part, so I'm giving it to you.

Eric Dodds 1:06:09
Yeah, you stole my line, that was the next best action. Yes, we are at the buzzer. Rish, this has been great, it's been so great. It's been so helpful for orienting us to the LLM space and, you know, getting practical, which I think is really helpful. And congrats on everything you're doing with refuel. That's awesome.

Rishabh Bhargava 1:06:36
Thank you so much. It’s been so fun chatting with the both of you. And yeah, excited for the next time.

Eric Dodds 1:06:42
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.