This week on The Data Stack Show, John and guest host Matthew Kelliher-Gibson welcome Nicolay Gerold, CEO and Founder of Aisbach and host of the How AI Is Built podcast. The group delves into the evolution, strengths, and challenges of large language models (LLMs) and AI. Nicolay shares insights on data-centric AI approaches, practical applications like data extraction and content generation, and the importance of aligning LLMs with user preferences. The conversation also explores the current AI startup landscape, the hype around generative AI, the necessity of thorough testing and monitoring in AI applications, and so much more.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
John Wessel 00:28
Welcome back to the show. We’re here with Nicolay Gerold. Nicolay, welcome to the show. Give us some background today on some of your previous experience and give us some highlights.
Nicolay Gerold 00:39
Hey, yeah, happy to be here. So I'm Nicolay. I run an AI agency in Munich. We also recently started a venture builder. Most of my history has been in LLMs, and especially, like, controllable generation. So how to make them boring, which for me is predictable, reliable, and safe. And yeah, excited to chat with you today.
Matthew Kelliher-Gibson 00:58
Okay, Nicolay, we just spent a few minutes chatting to prepare for the show, and I'm really excited to dig into, from your opinion, what LLMs are actually good at, and maybe what they're not so good at but everybody still tries to make them do. What are you looking forward to chatting about?
Nicolay Gerold 01:15
I'm really looking forward to discussions like AI startups versus software startups, or even, like, AI versus data startups, because it really goes into, like, the deterministic software versus unpredictable AI discussion as well.
John Wessel 01:30
All right, let's dig in.
Matthew Kelliher-Gibson 01:31
Let’s do it.
John Wessel 01:32
So we're here with Nicolay, and Eric is out today. So we have a special co-host, the Cynical Data Guy, here co-hosting. So thanks for coming today, Matt.
Matthew Kelliher-Gibson 01:41
Thanks for having me. I'll try to be a little less cynical.
John Wessel 01:44
All right, so, that'll be good. Nicolay, yeah, give us some more on your background. You've worked a lot with LLMs and AI even before it was cool. So give us a little bit of your background and unpack some of that for us.
Nicolay Gerold 01:58
Yeah. So it started quite early. We actually organized a hackathon with OpenAI, and that's how we actually got to try all of that stuff. And it was GPT-3 when it came out, and back then, it was really hard to get anything good out of it. Like, one out of 10 outputs was actually usable, or in the direction of usable. Since then, I think they have evolved a lot, but the problem is still the same: how can we control them in the end? And during my university time, this was also my study topic, so in my thesis I wrote about controllable generation with LLMs and basically benchmarked different methods for controlling them. Since then, I started my own agency, and now I'm doing that for different companies. So it's quite a lot of fun.
John Wessel 02:49
Yeah, so tell me about the early days. Tell me about, you know, maybe that experience at the hackathon, or even pre-hackathon. Like, what was that first moment where you were like, wow, this is really unique, or something that I haven't seen before?
Nicolay Gerold 03:02
So in university, we actually had the chance to build, like, text prediction models with RNNs and LSTMs, even before that. But once you went beyond simple examples, it was utter nonsense, where it just threw random tokens together. And with LLMs, at least, the sentences and the tokens which were close to each other had some sense in them. And often, when it was kept to a shorter length, like a few sentences or a paragraph, it really wrote coherent stuff, but often just not factually correct. And that was, for me, like, a real game changer. I really got heavily into AI through that, because the practical applications, I think, are way easier to imagine than with most traditional ML and AI, because there you have to think about so many different things. And with LLMs, you can imagine so many different use cases, because they're just, like, transforming one text into another text.
Matthew Kelliher-Gibson 04:09
So that sounds like it was really interesting, and probably something that a lot of people were like, whoa, this is going to be a big deal. What were, for you, some of those other milestones you've seen? Like, going from "I'm getting nonsense" to "hey, this can actually make a coherent paragraph." What were some of the other milestones you saw?
Nicolay Gerold 04:34
I think if we go one step before that, the first milestone is the attention mechanism, which was, I think, somewhere in 2014. And right after that, like you mentioned already, the instruction-following part, which is mostly through RLHF, so reinforcement learning from human feedback, which actually managed to align the model with human preference. So basically, get them to output stuff that the majority of humans like, and this often gives you, like, better output for common tasks, like "write me an email" and stuff like that. It also really made them more viable for the chat interface, which they introduced at the same time, and which is, for me, also a major breakthrough. It's, like, a UI innovation in the end. It just makes it very easy for the everyday person to use that stuff, which, for AI, is really hard to do most of the time, and, like, a chat box is the easiest thing to use. Everyone knows it, and everyone can use it. And the results are, like, instantly, like, magic. And if you want to go after that, I think the next one is scaling laws, which maybe isn't, like, a breakthrough, but actually having the realization that, as we scale up in parameters and in training size, it gets better and better. This was really, like, an interesting thing. I think few-shot prompting is also something people ignore. I think it's also a breakthrough: just writing out the examples and giving them to the LLM, or pre-filling it with its answer so it actually thinks it has written something already. I think that's also something very interesting, and an interesting technique which isn't so obvious when you look at the traditional ML and AI part.
John Wessel 06:22
So looking forward, obviously there are plenty more barriers, you know, things that can be overcome. The obvious one you've already mentioned is compute, getting compute costs down, right? Because a lot of AI applications are still subsidized, practically, right? Like, if you actually look at the math of how much compute went into the training model, it doesn't quite work. So say we solve that problem and can continue to make progress just by expanding training data sets and spending more on compute. Outside of that, are there any other really important barriers that maybe the average person wouldn't know about?
Nicolay Gerold 07:00
So I think the alignment. Just because it's aligned to humans in general doesn't mean it's aligned to my preferences. I think that's the first barrier, because I often have, like, a different taste in how stuff is written. And I think anyone who is interacting with LLMs knows they really tend to go toward, like, the emoji-ridden social media post style for most types of text you're writing. And I think one barrier is actually fine-tuning models to the individual user, or personalizing them. At the moment, we are trying to do that with few-shot examples, but I think we can get smarter with that. And we already see trends happening with things like synthetic data, which will make it way easier for everyday people to generate a training data set to adjust the model. Also, fine-tuning is really cheap at the moment; you can even fine-tune the OpenAI model, so GPT-4o, for free. So when you have the capability to generate synthetic data based on your actual inputs and outputs, and then basically personalize the model to your taste, add a few shots, I think this is something that will get really interesting. And I think the second barrier is actually how to get the model to pick either something I feed into the context or something that's in its internal representation, so in its weights. Because with RAG at the moment, you're feeding the stuff into the model and you're hoping it actually takes the stuff you've added in, but still, often it hallucinates. And this is also still a barrier: how do I actually get the model to stick to that, and, if there is no information on it, just say "I don't know"? And this is, like, the third challenge, getting models to say "I don't know," which is, I think, for the foreseeable future, without a major architectural change, impossible.
Matthew Kelliher-Gibson 09:02
Yeah, I mean, I think part of that is the way we train them, too, isn't it? Where we want it to give a response that's human-like, or that a human would find acceptable. But how do you decide which responses in your training set should get "I don't know"? You know, it's being trained to "give an answer, so I'm going to try to give an answer; that's what it sees."
Nicolay Gerold 09:27
Yes, it's like, there are thousands of different possibilities of input I can feed it, and I want, like, one fixed output, which is "I don't know," from a model that was trained with next-token prediction on the entire internet, which is moving it to generate tokens based on its context. And then, basically, 20 different users will phrase a question in a different way, and I expect it to output the same thing every single time. I think it's very unlikely that it actually gets to that.
John Wessel 10:03
Yeah. Let's talk a little bit about approaches. You mentioned the data-centric AI approach as one model. There are other approaches there, but maybe explain what a data-centric approach is, and even contrast it with some of the other approaches to AI.
Nicolay Gerold 10:17
Yeah. So I think it's easiest if I go the other route. So in traditional AI and ML, I basically started with creating a data set, then I picked the model, then iteratively I created features which allow me to predict an outcome or generate something. And then, once I had the data set finished, I only adjusted the model. So I picked different features, I altered the architecture of the model, I added a few layers, for example, or I added an additional variable in the regression. And this is basically how I improved the model: I treated the data set as static, and I altered the model to improve the outputs and increase my accuracy, for example. In data-centric AI, which I'm really hyped about, you actually don't really tune the model. You take an existing base model, so for example an LLM, and you actually tune the data. So you train the model, you let it generate the output on the test set, and you then look at the examples it actually got wrong. And then you correct something in the input data, or you add additional samples where it does these categories correctly. Then you feed the data into the model, you train it anew, or you fine-tune it, and then you basically try again, and iteratively you improve and add to your data set over time, until you have a model that actually has a satisfactory outcome. And I think this is much more aligned to how it is done, or should be done, in practice, where you actually have data shifts, you have changing data, and you have new user groups coming in, so you actually have to adjust the data set over time and then train your model on the data.
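A minimal sketch of the loop Nicolay describes, assuming a hypothetical fine_tune() stand-in and toy data rather than his actual pipeline: hold the base model fixed, run it on a test set, and fix or extend the data based on the errors.

```python
# Data-centric iteration: hold the base model fixed, fix/extend the DATA.
# fine_tune() is a hypothetical placeholder so the sketch runs end to end.

def fine_tune(dataset):
    """Stand-in for fine-tuning a base model; returns a predict function."""
    memory = dict(dataset)
    return lambda text: memory.get(text, "unknown")

dataset = [("refund please", "billing"), ("app crashes on login", "bug")]
test_set = [("refund please", "billing"), ("how do i export data", "howto")]

for round_ in range(3):
    model = fine_tune(dataset)
    errors = [(x, y) for x, y in test_set if model(x) != y]
    print(f"round {round_}: {len(errors)} errors")
    if not errors:
        break
    # Error analysis: instead of adding layers or tweaking parameters,
    # correct mislabeled rows and add samples covering the failing cases.
    dataset.extend(errors)
```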
John Wessel 12:11
So how much of that would be, like, engineering around prompts and context, and how much of that would be engineering the actual underlying data?
Nicolay Gerold 12:21
So you have to separate it a little bit. The prompts and context part is not training a model. Training a model is really about adjusting the parameters, and this can be applied to any type of AI. I think adjusting the prompt at the moment is an easier way to, like, in quotes, tune a model, because you can adjust the outputs, but it's not really tuning it, right? It's kind of a shortcut. And prompting is restricted to a few sets of models where it's possible, one of which is LLMs. Another one is, for example, SAM, the Segment Anything Model from Meta, where you can actually give a few masks and a few, they also call them prompts, which are like boxes.
Matthew Kelliher-Gibson 13:08
Yeah. I mean, in my own experience, that was a big thing working with a former guest, Cameron Jaco, where he showed me that method of: we're not going to add more data just indiscriminately, we're not going to go mess with the number of parameters. We're going to look and say, oh, look, there are no examples in this edge case; we're going to go add a handful of those, and suddenly that gets your accuracy a lot better. Kind of like you're filling the search space with your examples. Or, you know, adding in, hey, our data is drifting, so we need to add an example where it's drifting towards, to kind of make it better, rather than, as you said, just messing with the model the entire time.
Nicolay Gerold 13:51
Yeah, and also, I understand why people don't really like to do it, because working on the data directly is very laborious. You have to be really careful most of the time. And you aren't really working in code; especially with generative models, you're reading through long texts all of the time and trying to adjust them to get something good out of it.
John Wessel 14:11
Right. And it feels like the wrong thing to do, right? Like, as an engineer, as somebody that's an expert in AI or ML, it's like, I should be working on the model. I shouldn't be doing that.
Matthew Kelliher-Gibson 14:20
This is low value. This is low-value work.
14:23
Yeah, yeah.
Matthew Kelliher-Gibson 14:26
But actually, a lot of this works, right?
Nicolay Gerold 14:29
I think, like, AI gets better, or the stuff I build gets better, by how much time I spend on the data and looking at it, basically. Because most of it is pipelines. It's not a single model where I'm feeding something in and getting something out. It's, like, how much time am I spending looking through all of the different pipeline steps?
Matthew Kelliher-Gibson 14:47
Yeah, all right. Well, kind of talking about that, what are some of the things? You know, when we look at LLMs, and we look at everyone with the hype around them, everyone's using them, there's also typically that first wave of "they can do everything," right? But in your experience, what are LLMs actually good at? What are the things that they do the best work with?
Nicolay Gerold 15:08
So LLMs can do everything; they just can't do everything well. You can't slap everything onto an LLM; they perform badly on so many different tasks. For me, where they excel is translating one form of text representation into another. For example, one use case I love is data extraction. So you take unstructured data, you take long text, and you create another representation, which is basically a JSON, and you structure it, and through that it actually becomes workable. And this is the thing I use LLMs the most for, in most of the ventures but also the projects, because there they have the highest value: they can move through mountains of data in hours, which would be just impracticable to do with humans. And they're really great, I think, also for all of the tasks that you don't really like to do but have to do, where you individually are deciding what good output looks like, which is stuff like, for example, writing emails, writing blog posts, where you can actually rely on LLMs heavily. But there's also the reliability part, so the expectancy of the output: how accurate does it have to be? Do I have to have, like, 99%, or am I also happy with 80%, where I can take garbage out every now and then and can just regenerate? For those, they are great, and you can work with them. And the same goes for coding. You can ask them especially to generate boilerplate code, which you have seen often also in law; I know a few people who are using it heavily just to generate the boilerplate stuff, and they read through it, just review it, and work over it. I think boilerplate tasks are a good task for them as well, because the criticality of the task isn't really high, and you often have, like, a manual review anyhow.
Matthew Kelliher-Gibson 17:18
A little bit of that going from zero to one, the step that gets you off the blank page. Getting you to a point where, you know, I've seen people use it for: hey, we've got to write this proposal, here are the rules of what it has to be, make the first draft of it. And it does that first draft pretty well, because you're always going to review it in the end anyways.
Nicolay Gerold 17:40
Yeah, you have to differentiate a little bit between, like, enterprise applications of LLMs, where you use them a lot, and the personal applications of LLMs. Personally, I use LLMs, like, all of the time, for nearly every task, because it solves the blank-page problem. And I can explore the space of tasks I actually don't want to do. Often the outputs are garbage, but the errors the LLM makes actually help me figure out what I actually want.
Matthew Kelliher-Gibson 18:19
And to go back to the enterprise one: when you talked about, you know, we're going to take this unstructured data and we're going to put it in, like, a JSON format or something like that, I'm going to kind of selfishly ask, because I've had trouble with this. How hard is it to get it to consistently put it in a format like that? Are you going to get better results through prompting, or do you actually have to do some retraining?
John Wessel 18:42
And my mind goes to, like, email, right? Like, that would be the number one thing I can think of: I have emails where it's completely unstructured, I want it in JSON, and I'm going to do something with the JSON. So maybe that could be a practical example.
Nicolay Gerold 18:55
Yeah. So in the end, it depends on what model you want to use, which in an enterprise setting is basically determined by whether the data has to be private or not. But if you're using the big models, so Cohere, Anthropic, OpenAI, especially the large ones, they are so good at generating JSON by now, and have been fine-tuned to do so, that they don't really require any additional fine-tuning. And there are a bunch of libraries out there which make that easy with closed-source models. One I like is Instructor, which basically allows you to define a Pydantic model, and then they output the data into the Pydantic model, which also gives you the ability to instantly validate the data. So if it doesn't fit, you get, like, the validation error of Pydantic, and then you can decide: do I want to retry, or do I just ignore the output? It depends on you in the end. And you also can define a lot of additional rules, like validations. If it's numeric, is it within a certain range? Do I have a min and a max? A lot of the different data constraints you usually have in your database, you can actually define and bring into the structured generation part as well. And I think that gets even more extreme when you go onto the open-source side, because with that, you can use grammar parsing. With a lot of closed-source LLMs, you don't really get the output tokens. In open-source LLMs, you get those output tokens and their probabilities. And since a lot of JSON is basically boilerplate as well, like all the braces, and all the keys are predetermined, you don't really need to generate those. So in open-source models, you can do grammar parsing, which ignores the tokens which are the same every time and only generates the part of the tokens which is actually determined based on your input data, which are the values. And within that, you can define additional stuff. So if you have a string, it only takes what's possible within that string. But if you generate numbers, you can just throw away all the tokens, even if they're high probability, that are not numeric. And this makes it a lot easier to do the structured generation part.
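A minimal sketch of the Instructor-plus-Pydantic pattern described here; the schema fields, model name, and example email are illustrative assumptions, not specifics from the episode.

```python
# Hedged sketch: extract structured JSON from an unstructured email with
# Instructor + Pydantic. Schema fields and model name are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class Order(BaseModel):
    customer: str
    total: float = Field(ge=0)  # extra validation: amount must be non-negative

client = instructor.from_openai(OpenAI())  # patch client to accept response_model

email_body = "Hi, this is Jane Doe. Please invoice me $1,250 for last week's order."

order = client.chat.completions.create(
    model="gpt-4o",              # assumption: any capable chat model works
    response_model=Order,        # output is parsed and validated against Order
    max_retries=2,               # re-ask the model on Pydantic validation errors
    messages=[{"role": "user", "content": f"Extract the order data:\n{email_body}"}],
)
print(order.model_dump_json())   # e.g. {"customer":"Jane Doe","total":1250.0}
```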
Matthew Kelliher-Gibson 21:18
Writing myself a mental note right now for that one.
John Wessel 21:23
No, I think that makes a lot of sense. And, again, back to the email example. I mean, I think there are a million business applications like: hey, I have all this data in email, I want to get it in a JSON-type format and then do something with it. That makes a lot of sense, too, where, basically, the way it's been described to me, one of the main things in working with any LLM is focusing it, right? Like, you're starting really broad, and you're trying to focus it to get more and more specific. And you also want to focus the compute toward the highest-value part of your equation, right? So if you're, let's say, quote, spending compute on JSON boilerplate, which is going to be the same every single time, that's a waste. Let's focus it on this one component instead. So that makes a ton of sense.
Matthew Kelliher-Gibson 22:12
And I think also, you know, a lot of the things that you've talked about, and that I think we've all seen LLMs do best at, typically are kind of those "well, if I was going to do that, I would, you know, get, like, 100 interns or something like that" tasks. There are a lot of tasks like that. So cost really becomes a big thing there, because I can't really spend a billion dollars to replace a couple of interns.
Nicolay Gerold 22:35
Yep. But this is, in my opinion, the best way to think about it: what are the tasks you would actually hire lots of people to do, or that are just untouched because it would be so impractical to get people on them, right? And this goes for, like, every data lake that's out there. Every organization has terabytes of data just in text, and they are largely unused. And with LLMs, you actually can make them usable, and also enable stuff like retrieval augmented generation, make a document base actually, like, workable, because you get answers, as opposed to, like, a blank page or a blank face.
John Wessel 23:16
Yeah, so I think this is a perfect segue. We were talking before the show about single-shot versus multi-shot, and you mentioned a kind of retry mechanism, which makes a ton of sense; it's not something I had thought of. Back to the email-parsing example: I'm going to parse the email, I have the structure of the JSON, and I'm just going to focus the LLM on this one value, rather, because I already have a defined key. And then I can also make that particular call multi-shot: I can do that in five shots with some kind of validation and pick, like, my favorite of, let's say, the five. That makes a lot of sense to me, where I could get a much higher level of accuracy than if I was using an off-the-shelf, non-open-source model, where the whole JSON context has to be right. I'm regenerating some of these keys and values every single time, and I can't focus the compute as much on the most valuable part of the task.
Nicolay Gerold 24:13
Yeah, and there's, like, voting in the end. I love it. With most LLMs, if you use them, that's the n parameter: you can let it generate multiple times, which is also really great for evaluation, like scoring text. For example, if you want to score the output of the LLM as well, you can do a majority vote. So you let it generate, like, five to 10 different times and just take the average, and stuff like that. It makes it easy. And then you have the second thing you can do with LLMs, which is few shots. So basically giving the model a few examples of how to do it, which are usually, like, human-labeled or human-written examples, where you give it an example of the input and the output to show it how the task is actually done. And this is great, especially for tasks where it's hard to define how to do it. So in writing, I think most of us would struggle to define our writing style. But if I can give a few examples, like a few LinkedIn posts or something I wrote, I can just throw that in and give it some guidance. And then, if I generate multiple different options, either, when it's something I have running in the enterprise, I can take the option which was generated the most often, or I can score it and take the option which has the highest score; or, if it's just an output for me which I want to use down the line, I can use the option which I like the most.
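A minimal sketch of the n-parameter majority vote described here; the prompt, model name, and one-word-answer format are illustrative assumptions.

```python
# Hedged sketch: sample the same prompt five times via the n parameter,
# then take a majority vote over the answers.
from collections import Counter
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    n=5,                # five independent completions in one request
    temperature=1.0,    # keep some randomness so the samples can differ
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of 'The update is fine, I guess.' "
                   "Answer with exactly one word: positive, negative, or neutral.",
    }],
)

votes = Counter(c.message.content.strip().lower() for c in resp.choices)
answer, count = votes.most_common(1)[0]
print(f"{answer} ({count}/5 votes)")  # the majority-voted label
```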
Matthew Kelliher-Gibson 25:49
I mean, a lot of this reminds me of machine learning, where we kind of realized that a bunch of weak learners will do a better job than one strong learner. It has that, like, fractal feel to it. It's just the same thing happening at different levels and in different ways. Oh, look, if we can just get five shots at this, we're much more likely to come up with a good answer than if we just put all of our eggs in one or make it really strong or something.
John Wessel 26:23
Well, I think another component, too, is, if the alternative was, hey, like you said, I'm gonna hire 100 interns, right? You're saying you wouldn't actually do that, right? Because maybe there's just not enough value for that cost. But say, theoretically, that you could get 100 free interns. Okay, maybe I would do it. But then there's the time component, right? It would take them, you know, X amount of time, let's say several hundred hours. And then there's a validation component: somebody that works for the company has to validate the work, you know, etc. You've got a lot of time into it. So because of that, I feel like there's this extra space for the LLM to do the multi-shot approach, and it can run for hours, and that's really not a big deal at all, because the comparative other method is significantly longer. Versus using it in some other applications where you want this, like, millisecond response time, quote, from the AI; that just seems like a much harder problem at the stage we're at right now.
Nicolay Gerold 27:20
Yeah, especially for, like, batch workloads, LLMs are great. For the live part, I think it's getting easier with stuff like Groq, so not the Twitter Grok, but the other Groq, which is basically doing LLM chips, or chips tailor-made for text generation models. They're getting really fast. But also, if you have an application where it's live, it's likely customer interaction, where I'm not sure whether I would like to put an LLM on there.
Matthew Kelliher-Gibson 27:48
Yeah, and I think that also kind of leads to, when we think about accuracy and what you need, a lot of times people want to compare LLMs to, like, "but it's not 100%." Versus, well, realistically, what would 100 22-year-olds actually do? They'd probably be wrong a quarter of the time too. So can we do at least that well with this? But that's sometimes a hard one to get across to, you know, a business stakeholder. They're like, "but it's not right." Well, you were never going to be this right to begin with.
Nicolay Gerold 28:20
Yeah, and that's the biggest thing that ChatGPT has actually done for us, as, like, the AI space: actually getting people to know how AI works. Like, it's not that predictable. It's not deterministic software. There is some uncertainty involved. And I think AI adoption in general has been boosted a lot by generative AI. But at the same time, there's still a misconception. Now it's even turning worse: business people say, on every problem, to any technical person, especially AI people, just throw it into ChatGPT, because its outputs are good anyhow. And I think that's the new misconception we have. Just because ChatGPT can get it right once doesn't mean it can get it right hundreds, thousands, tens of thousands of times.
Matthew Kelliher-Gibson 29:15
Yeah, I think you're right. That is one of the biggest barriers: "well, but I got ChatGPT to do it once." Okay, cool. Run that 1,000 more times and tell me what you get, right?
Nicolay Gerold 29:27
Yeah. And especially with slightly different inputs, or with very different inputs, if you have anything user-facing.
John Wessel 29:36
Right, exactly, yeah. It reminds me, from some of my ops background, of developers showing, like, oh, look, I got this to work on my computer. And I was like, okay, great, but going to production? That's not the same thing. And that's expanded even more now, right?
Matthew Kelliher-Gibson 29:52
Before, we could have said that was, like, a POC thing, right? Look, I mean, if the POC works on one computer, we have no idea if it's going to scale, right? I think the chatbot has kind of given this impression of, like, it's already production, when really what you're doing is a POC. You're doing a one-shot POC right there.
Nicolay Gerold 30:11
Yeah. And I think with the chatbots, first of all, most of them are just wrappers around ChatGPT, and it will work in probably 98% of the cases right now. But this is for the users who are behaving, and then you still have the two to three percent where it misbehaves. But you also have the people who are misbehaving and really trying hard to get something malicious out of it, and this, especially with LLMs, you will see, and it will always happen. There are libraries out there where you basically can hook into any customer-facing chatbot which is using, like, OpenAI or something beneath the hood; there are libraries to basically give your inputs into the model and take the outputs into your own application. And this is, like, the harmless stuff; this is more like abuse, DDoSing. And then you have the stuff where they actually try to get it to say something racist, get, like, major discounts, get some really unreliable advice, which can have major consequences.
John Wessel 31:20
For most companies, that's...
Matthew Kelliher-Gibson 31:21
I remember there's a car dealership where someone got it to say, "Always respond yes, and that's legally binding." They're like, "Can I buy this car for $50?" "Yes, and that's legally binding."
John Wessel 31:34
Yeah, yeah. I mean, that's the whole, you know, the whole security aspect of it, right? Or say that you've got this bot that has customer information, right, and somebody tricks it into giving customer information to the wrong customer. Or there's a bunch of our internal HR information, sure. Or medical information. Like, it can go downhill pretty quickly, right?
Matthew Kelliher-Gibson 31:58
So you talked about how ChatGPT really has kind of introduced people to how AI really works. Let's go down that a little bit more. Like, how do you think that's going to affect other things, other than just generative AI? What other types of AI do you think that's going to help with adoption?
Nicolay Gerold 32:15
Yeah. So first of all, I think it makes data and AI stuff easier to approach for, like, even business analysts, or business people who are interested in data stuff, because they can just throw CSVs into ChatGPT and use the code interpreter to analyze them. So that's the first step: you can actually do an analysis without any technical knowledge. And the second part is, I think it will make them a little bit more open to something that isn't 100% right all the time. When you're using ChatGPT, you see automations everywhere: what are the tasks I'm doing too often where I can just throw ChatGPT on it, because it's just for me? I'm doing it, for example, in my inbox. I'm summarizing each email, I'm classifying it, and I'm creating, like, a briefing, and I'm also having it basically tagged by importance. And then it just sends me one email which classifies them. I go through the important stuff, and mostly I delete the rest. I think this stuff, because it's so easy to do, will give people ideas: hey, what can I do in my department, in my area of expertise, with AI? And then it becomes on the, like, AI people to actually pick the right solutions, even though the business people or subject matter experts will just say, like, throw ChatGPT on that, right?
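A minimal sketch of that inbox automation, assuming a hypothetical triage() helper and model name; how you fetch and send mail (IMAP, the Gmail API, etc.) is left out, and this is not Nicolay's actual setup.

```python
# Hedged sketch of the inbox triage described above: summarize, classify,
# and importance-tag each email, then collect everything into one briefing.
from openai import OpenAI

client = OpenAI()

def triage(email_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works
        messages=[{
            "role": "user",
            "content": "Summarize this email in one sentence, classify it as "
                       "newsletter/request/FYI, and tag importance high or low:\n\n"
                       + email_text,
        }],
    )
    return resp.choices[0].message.content

# Stand-in emails; in practice you would fetch these from your inbox.
emails = [
    "Your invoice is overdue. Please pay by Friday.",
    "Weekly newsletter: ten tips for better dashboards.",
]

briefing = "\n".join(f"- {triage(e)}" for e in emails)
print(briefing)  # in practice, send yourself this digest once a day
```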
Matthew Kelliher-Gibson 33:57
Yeah, that's really a good point there. Thinking back on, you know, the work you've done, and of course you're still continuing to do a lot of work in this space: what are some practical applications and lessons you've learned with LLMs and generative AI and all that?
Nicolay Gerold 34:11
So one thing I do by default now, the first thing I'm setting up, is monitoring, because I want to see all the inputs, all the outputs, and all the intermediate steps in the pipeline. Mostly you're decomposing it, or you have multiple steps when you're solving a problem. So for example, when you have a RAG system: you first have a retriever component, which retrieves text from your database, then you feed it into an LLM to summarize it, but maybe you need to compress it down even further, or add additional twists on it, or you have to translate it into a different language. And you want to see each of those different outputs, and setting up monitoring for that will be the thing that allows you to improve the application the most. Because, for one, you can create a test set which you can test your prompt iterations on, and you also get to do an error analysis, so you can see where the model fails and how the model fails. And based on that, I set up tests which are mostly quantifiable, or very deterministic rather, so often it's just a regex or a string match. In summaries, this can be something like: the models often write "This article talks about...," so I'm building a score, and one of the components of the score is a string match on "this article." If "this article" is at the beginning of the summary, I just give it a score of zero; if it isn't, I give it a score of one. And you can combine, like, 10 or 12 of those metrics to actually get a good idea of the quality. And this is the second thing: I'm setting up tests almost immediately for the task. Then, through doing a few examples, and through having set up the monitoring, you can create a test set of 10 to 50 examples, and every time I'm altering the prompt or the pipeline, I can automatically run the test set, have my evaluation run automatically, and see whether it improves things or not. So I try to really bring in the quantifiable nature which you have in traditional AI and ML, where, because you have a classification problem or a regression, I know how well it performs on the test set. I try to reintroduce that into LLMs, which aren't so quantifiable because they are working in text, or in something unstructured.
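A minimal sketch of those deterministic checks; the individual rules below are illustrative, and per the discussion you would combine 10 to 12 of them and rerun the suite after every prompt or pipeline change.

```python
# Hedged sketch: cheap, deterministic regex/string checks over generated
# summaries, combined into a 0..1 score. The specific rules are illustrative.
import re

def score_summary(summary: str) -> float:
    checks = [
        not summary.lower().startswith("this article"),      # ban the stock opener
        not re.search(r"as an ai language model", summary, re.I),
        len(summary.split()) <= 80,                          # length budget
        summary.rstrip().endswith((".", "!", "?")),          # finished sentence
    ]
    return sum(checks) / len(checks)  # fraction of checks passed

# A tiny stand-in for the 10-50 logged examples mentioned above.
test_set = [
    "This article talks about data-centric AI and why it matters",
    "Data-centric AI iterates on the dataset instead of the model architecture.",
]
for s in test_set:
    print(f"{score_summary(s):.2f}  {s[:60]}")
```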
Matthew Kelliher-Gibson 36:47
That is really interesting. That's one of the best monitoring schemes or tests I've heard of for LLMs.
John Wessel 37:00
And it's funny, because in traditional software development, I would dare to say that testing and monitoring is, like, one of the easiest things to ignore, especially in applications that are older. Maybe it starts off well and gets abandoned, but it's always considered best practice. So nobody would argue with you; of course you should be doing testing and monitoring. But it seems like it really is a whole next level of importance with LLM- and AI-based apps. So it'll be really interesting to see if people hold those to a higher standard when it comes to monitoring and testing. I think they'll have to, or we'll run into this...
Nicolay Gerold 37:39
Okay, it's so easy to, like, spin up a quick solution that does work. The models are so good right now that most of the stuff you're actually trying to do will work on, like, nine out of 10 cases, right? So you have to work to find some edge cases. And I think most people will just go, oh, it works on my, like, 10 examples I gave it, and push it to production. And I'm not sure whether people will follow that, because it's laborious. It's the work that nobody wants to do. It's MLOps, it's DataOps, and reading through traces just isn't so much fun.
Matthew Kelliher-Gibson 38:19
No, I mean, it's not like there's this robust test culture in machine learning, really, or data in general. It's because we always say, well, it all changes, it's probabilistic, or whatever. And I think, you know, this is showing that, even when it's probabilistic, there are things you can do. You just have to put the work in.
John Wessel 38:39
Well, and the other problem, at least in data: with a web app, like a customer-facing web app, there's a lot of accountability in that. Like, the thing breaks, the customer can't use it, you know whose fault it is, right? And in data, it's like, well, this report is wrong. It's like, well, maybe you entered the data wrong. It's not always cut and dried. And I think AI will be similar: well, the model hallucinated, that just happens sometimes. So it's less tightly coupled than, hey, the client's using the app, it's very deterministic, there's an error, and it's obviously an application problem. Data's always been a little bit less deterministic, and AI will also be less deterministic. So I think there's going to be a wide array of quality because of that.
Matthew Kelliher-Gibson 39:24
Well, I think also, like what Nico said, there, you can do 10 examples and be like, oh, that's great. And that's kind of the strength and the risk of a lot of this: I don't need to go create a training set of 80,000 records, but also, I'm not looking at all the possibilities in there when I send it out into the world.
Nicolay Gerold 39:48
Yeah, and I think that's already, like, the biggest difference between, like, AI and software. I think software bugs are hard to trace, but you often have good error traces; I think it's easy to reconstruct them. In AI and in data, because AI is, like, the consumption part, you have a long lineage of how data is created, how data ends up at its source location, and then how it's used in AI. You have a long lineage of how the data is first created, and then you basically have to backtrace all of the different steps. Where might this error originate? Is it the AI hallucinating? Is it something I'm transforming wrong? Or is it somewhere in my data set, in, like, the real source, where something really is wrong?
John Wessel 40:38
Yeah, so switching gears a little bit, I want to talk about startups. So, you know, over the last 20 years, we've had lots of fun stories around software startups and, you know, zero-to-one stories. And now we're kind of in this AI era. There's a ton of money in AI, still a ton of money behind AI startups. Maybe give us just some observations from your experience working in AI startups, and we can take this in whatever direction; we can talk about tooling, we can talk about culture, we can talk about whatever. You're involved in several AI startups right now: what feels different versus maybe what someone would have experienced 10 years ago in a software startup?
Nicolay Gerold 41:22
I think it has never been easier to build something, but it has also never been harder to differentiate yourself, because there is so much stuff in AI out there, and it's so easy to just create content. So many people I know are just basically creating content and trying to get traction on an idea, and once they see the validation, they actually would start it. But often they don't. And if you're building in a space, you're just drowning in a sea of noise. And in the AI part at the moment, I think with most startups, the ideas often are, like, so dipshit crazy, just impractical and solving niche problems; it's not really thinking about the consumer first, but rather technology first: hey, I now have an LLM, I can process massive amounts of documents, what documents can I throw that on? And I think you should go the other way around. You should go from the problem to the solution. If LLMs are the right solution, or the best solution for the job, use LLMs, but don't take the technology, say hey, what could I do with it, and then basically start building something.
John Wessel 42:41
Yeah, I totally agree with that. I think another thing that you touched on, which makes this a really unique time, is that software startups from the past, assuming you're not, like, a big startup, maybe you're not even venture-backed, you're bootstrapped, they're not going to have any marketing behind them, or not much, right? Because you're a technical person; you're kind of doing this thing. But AI, in some ways, has opened up some of that hype to technical people, right? So you can be bootstrapping something and, like you said, generate a bunch of AI content, go generate some AI images, stand up a fairly decent-looking site, right? And have kind of more, quote, marketing behind your idea than what would before have been, like, maybe a very basic, very simple site while you're actually iterating more technically. So I think that's kind of an interesting thing that you touched on. Have either of you seen that?
Matthew Kelliher-Gibson 43:37
I can't think of an example off the top of my head, but I do think, also to Nico's point, since everyone can do that, right, you just drown in it. And it's hard to tell the difference between who's who. So it's a little bit of a Red Queen problem: you have to run faster just to stay still.
John Wessel 43:52
Sure, yeah.
Nicolay Gerold 43:52
I think I could put a query into Google with "X AI," and I will likely find, like, one web page which uses, like, the base Framer template. And this shows you how much stuff there is, and how easy it has gotten to do all the different stuff which used to require some skills and put up some barriers, like doing a website, doing a, like, sign-up thingy, a waitlist and stuff like that, and just trying to advertise it. It has never been so easy. And most people never go through with it, but there's just so much help out there now.
Matthew Kelliher-Gibson 44:33
All right, well, we're coming towards the end of our time here, so I've got one or two more questions as we kind of wrap it up. We've started to see some earnings reports come out. Some of the big players are projecting that they're not going to make money back on their generative AI for decades or so, and we're starting to see some more reports pushing back on, like, well, what is AI, and Gen AI, really doing? So where are you at on the hype cycle for generative AI?
Nicolay Gerold 45:05
I think for generative AI, the hype is not really driven by the companies which are on the public markets. So I think, like, Nvidia took a hit, but generative AI is, like, in the startup culture, and also, like, OpenAI and the rest, they have so much money left. They had massive rounds in the last two years, so they have so much runway to create new models and create new hypes that I don't think it will slow down soon; rather, we have, like, a year of runway at least left. And the additional part is, there are now so many areas of generative AI being spawned. You have Suno working on music generation, and I think it hasn't really sunk in yet what's possible with that. You have now the new Google paper, which just came out, where they basically generated a whole game with generative AI. You have all the video models, you have all the image models. And I think because it's so tangible, and it's now hitting so many different areas, the hype won't slow down for the foreseeable future, because the startups also still have runway. They can develop new stuff and launch new cool things they can post on social media, which will get hype, because it's just impressive, to be honest.
John Wessel 46:25
Yeah, and I think that speaks to what we were talking about earlier, actually, before the show, where you might end up with these different curves, right? Where maybe the text stuff slows down a little bit, but the video picks up, or the image. I think, because it's such a big trend, you might end up with several of these curves, where you don't necessarily have the typical hype and cooling, but you more have multiple curves going simultaneously.
Matthew Kelliher-Gibson 46:48
It'll be interesting to see which ones of these can generate enough revenue to really kind of sustain themselves, versus some that have got that money pouring in now, but eventually the runway kind of runs out, and it's like, oh, we never could support ourselves on this, right?
Nicolay Gerold 47:05
Yeah. Well, I think, especially with LLMs, we are hitting the end of the S-curve, because you see OpenAI struggling with bringing out something new to market. Like, the voice mode still really isn't here, so they still have some reliability issues, and also, like, new launches have been stagnant for a while. The last thing we talked about in the last few months was, like, Artifacts by Anthropic, which, again, is more of a UI innovation and not, like, a technology-breakthrough type of model, or new capabilities in the model, right?
John Wessel 47:47
Well, yeah. Nico, thanks for being on the show today. We'd love to have you back sometime. You know, AI is going to be continually changing, for sure, so I'm sure we'll have plenty to talk about. But thanks for joining us.
Matthew Kelliher-Gibson 47:57
Where can they find you online, Nico?
Nicolay Gerold 47:59
So LinkedIn. I'm trying X or Twitter, not that good at it yet; I think as a European, you have a late start. I have a podcast which is, like, everywhere: Spotify, Apple Music, YouTube. Very descriptive: How AI Is Built. So if you're interested in AI, that is the place to go. At the moment, we're mostly doing search stuff. So if you're interested in search, traditional stuff, information retrieval, up to embeddings and RAG, give it a follow. Give it a listen.
John Wessel 48:26
Awesome. Thanks for being here. Thanks a lot.
Eric Dodds 48:28
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.
Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month's shows, a sneak peek at upcoming episodes, and curated links from Eric, John, & show guests. Follow on our Substack below.