Episode 144:

Explaining Features, Embeddings, and the Difference Between ML and AI with Simba Khadder of Featureform

June 28, 2023

This week on The Data Stack Show, Eric and Kostas chat with Simba Khadder, the CEO of Featureform. During the episode, the group discusses feature stores, embeddings, and the impact of new technologies on the data industry. Other topics include the importance of embeddings and vector databases in the data industry, the future of machine learning and its impact on businesses, new technologies in data science and ML ops, and more.


Highlights from this week’s conversation include:

  • Simba’s background in the data space (3:05)
  • Subscription intelligence (6:41)
  • ML and Distributed Systems (9:09)
  • The Brutal Subscription Industry (12:31)
  • Serendipity in Recommender Systems (16:31)
  • Subscription as a Strategy (20:47)
  • Customizing Content for Subscribers (22:19)
  • Creating User Embeddings (25:53)
  • Building Featureform (28:01)
  • Embedding Projections (32:47)
  • Spaces and similarity (35:53)
  • User embeddings and transformer models (38:22)
  • Vector Databases for AI/ML (45:05)
  • Orchestrating Transformations in Featureform (51:00)
  • Impact of new technologies on feature stores (56:17)
  • Embeddings and the future of ML (59:20)
  • The gap between ML and business logic (1:02:26)
  • Final thoughts and takeaways (1:06:37)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, we are going to talk with Simba from Featureform. And, boy, do I have a lot of questions. We actually did a lot of data science stuff last summer: we talked with people building feature stores, we talked with people building ML ops tooling. But Simba actually has a really interesting perspective on the entire spectrum of problems in the space. So I’m going to leave you to talk to him about the technical details, and I’m going to ask about the moment of serendipity. He did a ton of subscription work, and he figured out why people would subscribe to publications, you know, like The New York Times, etc. So I’m going to ask him about that. That’s super interesting to me, because I think machine learning can help us understand a lot about it. But, of course, being a consumer behavior guy, I know it can’t answer everything. So I want to know what he knows about that, and then understand how features relate to it. So what are you going to ask him?

Kostas Pardalis 01:34
Yeah, um, you mentioned that we did a few episodes last summer, a couple of months ago, right, about ML ops and the tools and technologies in this space. But I think we’re living right now in a completely different world in terms of the technology landscape, especially because of all the LLMs and the OpenAIs of the world and all these new technologies, where we’re still trying to figure out how they are going to change things. So I think we have the right person to discuss that with. I’d love to talk with him about more fundamental things: what are embeddings? What are features? Let’s revisit all these terms that we’ve known for a while now and see how they have changed today, because of all the changes that have happened in the past six months in the industry. So yeah, that’s what I’m going to focus on. And because I know Simba a bit, I’m sure it’s going to be a very interesting and captivating discussion. All right,

Eric Dodds 02:49
Let’s dig in. Simba, welcome to The Data Stack Show. So great to have you here.

Simba Khadder 02:56
Thanks for having me.

Eric Dodds 02:58
All right. So give us your background. How did you get into data? And what was the path that led you to Featureform?

Simba Khadder 03:04
I started at Google. I was there for a little while. I can say I am one of the few Googlers who has ever written both PHP and x86 in their time there. Yeah, it’s kind of interesting. I worked on both ends of the spectrum, and I’ve learned that it’s kind of a horseshoe: both ends kind of suck. You kind of want to be in the middle.

Eric Dodds 03:29
Yeah, I’d love to hear about the different teams. Can you tell us, what did you start with, PHP?

Simba Khadder 03:36
They definitely were different projects; I earned my stripes on the x86 side. For PHP, I worked on Cloud Datastore, more on the API side. I worked on a lot of different parts of it, but one thing I worked on was fine-tuning some of the lesser-used APIs, and one of them was the PHP one. I didn’t choose it, it just happened to be me. So I learned more PHP than I ever wanted to, but I can officially say I’ve used it in prod at Google scale, which I don’t think many people can say. Maybe Facebook people can say that too. And on the x86 side, I worked on profiling, on making Google faster. I was really focused on search, which is obviously a big piece of what Google does. So I worked on making search faster, and that’s where I started my career. I’ve always really liked hard technical problems, and I loved what I worked on. I was pretty much fresh out of school, and most of my studies, most of what made me happy up to that point, was distributed systems. Some of the stuff I was touching related to search, so I actually got to interact with some of the ML there. It happened to be the time when TensorFlow was coming out, so I got to see some of the early iterations. And I think what drew me to distributed systems to begin with is how messy it is. Even within ML, I never liked computer vision, because the answer is typically binary, which was boring to me: it is or is not the thing I’m trying to classify.
I really liked that there was never really a right answer; it was always a give and take, and there was a bit of an art to doing it well. And then after I left Google, I had a lot of product ideas. This was also when Google Cloud was coming out; AWS was the behemoth, but with Google Cloud and Azure launching, we were kind of at a tipping point, and a lot of people were like, oh, that’s not gonna happen. I had all these ideas. They were probably bad ideas; I was like 20. But I still wanted to go and learn and continue learning. So I left Google and I started my first company. I learned a lot. I didn’t start with an idea. All I had was a logo, a name, and a co-founder. Yeah, we had nothing. So whenever people are like, oh, I’ll leave when I have a good idea, I’m like, you don’t need one. I didn’t have one.

Eric Dodds 06:37
And you figured it out from there?

Simba Khadder 06:41
Yeah, we just figured it out. And honestly, I feel like there are pros and cons, but it definitely at least forced us to build something real, because we were in search of a problem to solve; we actually had to go find the problem. And we landed on what we called subscription intelligence. We did everything from personalization as a service to helping people with recommendations. But really, the goal wasn’t just recommendations for recommendations’ sake; we were focused on driving subscriptions. There was this movement, and it’s still happening, for B2C products and companies to move from ad-based models to subscription. I think it makes a lot of sense. It’s much more tied to value, because with ads it’s almost like a bait and switch: I’m trying to get you to use this thing, but really I’m just selling you, because you’re the product. Subscription always seemed to make more sense to me. It’s also much less wasteful for certain categories. So anyway, I was helping drive that shift by helping companies who were still treating things in the ad-based way. We worked with publishers and news companies, and obviously some teams had switched, but as a whole they were still taking an ad-based methodology, which worked well for chasing eyeballs. We were trying to help them use data to figure out how to change their strategy to drive more subscriptions, decrease churn, and understand why users subscribe. That was really the whole tagline. My one-line sales pitch was: do you know why your users subscribe? And the answer was almost always, not really.

Eric Dodds 08:21
Yeah, I love it. Okay, so many questions about that moment of subscribing, and a number of things on recommendations. But let’s rewind just a little bit. You got exposed to distributed systems at Google. I would love to hear about the moment or the epiphany of saying, this is a real thing that is going to affect my job and data and all that sort of stuff. When did that happen for you? You went from working on, you know, PHP stuff to realizing, okay, distributed systems, and then ML, are going to be a really big deal?

Simba Khadder 09:10
I think, well, two points. Firstly, at that point distributed systems already were a big deal; that was well established. What I realized was going to become a big deal was ML, and really data more broadly. I don’t even remember if Spark or Hadoop had come out yet, or at least they weren’t as widely used as they’d become. What I really saw was that Google has always been kind of the kingdom of distributed systems; they’ve always been ahead of the curve. They released the MapReduce paper. They released Borg, which was the early internal version of Kubernetes, and it was a beautiful thing. They’ve always done that very well, going all the way back to running on commodity hardware. There are early stories of Google where one of their innovations was to duct-tape on hard drives, so if a hard drive failed, they could literally just rip it off and put another one on without doing anything. And that counted as an innovation. So I would say that, at that point, distributed systems had kind of played out, but I was interested in the problem space. And there was this adjacent problem space, which is ML, which is obviously very different, but it shares a lot of the same characteristics that make the problems really hard. It drew me in for the same reasons distributed systems did, but with an extra kicker. Everyone talks about how ML and AI are going to change the world, but especially seeing it at Google, I really began to understand how every digital interaction, really everything, was going to have ML behind it. An interesting exercise is to count how many models you interact with on a given day, just doing your job.
It’s probably a lot, probably over 100 for an average tech worker. Every time you buy something, that’s fraud detection, plus a handful of recommender models. Every email you get has a marketing model behind it. Google Search alone is hundreds of models that we interact with on a daily basis. Even when you go to the grocery store and buy something, everything in the supply chain has models attached to it. So I saw that, and once it clicked, I realized that, hey, this is not only a space that I find interesting, but a space that’s still ripe for impact. Distributed systems had played out so much that at that point you kind of had to be PhD-level to truly have any impact, and even then you were in the optimization stage of that trend. Either there was going to be a whole paradigm shift, which I don’t think has happened, and I can’t even say what that would look like, though surely it will happen one day; or it was going to stay incremental. ML, on the other hand, was such an unpenetrated space; no one really knew how to do it, and no one did it well. So I think that’s what drove me there.

Eric Dodds 12:31
Okay, this is fascinating to me. So you went to subscriptions, and you worked with media companies, which, you know, as someone with a marketing background, I see as sort of the most brutal bleeding edge of, man, how do you get a bunch of anonymous traffic to convert, with very small margins. So you go from, okay, distributed systems are interesting, ML is going to have a big impact, and then you go right for the bleeding edge, the hardest problem. First question is: why did you go for the hardest problem? I mean, subscription for media is brutal. Is it because you like hard problems?

Simba Khadder 13:23
That was definitely part of it. I think, again, we were in search of a problem, and that’s a huge one. I actually kind of forget why we ended up in media. My co-founder, he’s always had a big tie to the media; he’s always been attracted to it. So I think there was a bit of founder-market fit that pulled us in that direction. And the space was super fascinating and very untouched; not much ML was really being done there. The reason for that, which is maybe why it was hard to build a huge business there, though we ended up being relatively successful, is that you don’t really make a lot of money per user. It’s a pretty brutal industry. So VCs weren’t super stoked when we told them we were selling to the media.

Eric Dodds 14:22
Yeah, a dollar per month, you know, $1 a month to get access to this content.

Simba Khadder 14:29
I mean, another interesting point: I don’t know what the number is now, but we found that, I think it was The Guardian, made something like 50% of their revenue off of print at that point. I don’t know if that’s still true, but there were all these really interesting things we learned about how the industry works and functions. That was really surprising to us. Like, the reason why you go into an airport and see all these news-branded kiosk stores is that the news companies were these big conglomerates, and they were just coming up with new business models. Someone, maybe CBS, maybe CNN, came up with, oh, we’re going to do these kiosks in airports, and that’s going to drive a ton of revenue. So anyway, there were a lot of weird things I learned about the media and how it worked. It was just fun. I got to become an expert in something I never thought I’d be an expert in. And obviously it’s societally very relevant to know a lot about how the media works. But even technically, it was a fascinating problem space.

Eric Dodds 15:41
Yeah, absolutely. And so you solved ML problems on the bleeding edge of the most difficult, low-margin problem space, and that’s obviously very relevant to all sorts of other spaces. When we were talking before the show, you mentioned something called the serendipity moment, and how two users who look very similar can follow the same path, and maybe one subscribes and one doesn’t, and how that’s a very challenging machine learning problem. Can you break that down for us? Start by describing: what is this serendipity moment?

Simba Khadder 16:31
Yeah, the serendipity moment. We’ve all felt it: that feeling when you find something that you weren’t really looking for, but it’s exactly what you wanted in that moment. It’s kind of a little dopamine hit; that’s the serendipity moment. In recommender systems, let’s say you go on Spotify, YouTube, you name it, whatever you use. Let’s say Spotify recommends a song for you, and you’re like, I’ve never heard this song, but sure, I’ll click it. At first you’re kind of skeptical, because it seems off. But then, a little while in, you’re like, this is my new favorite song. That moment is magic. The problem is, and this was known even pre-digital, it’s even in the way a grocery store sets up its aisles: one thing they consider is that serendipity moment, and what is most likely to produce it. You almost have to hit this gray area where it’s not obvious. Like, if you love Red Hot Chili Peppers and I recommend another Red Hot Chili Peppers song from the same album you really like, you might like it, but chances are it’s not serendipitous. You kind of have to be a little bit off target. But if you go too far off target, you just miss. Yeah, it’s a discovery exercise.

Eric Dodds 18:10
Right? Like it’s exposing you to something that you’re likely to like, but that isn’t, like, we know you already like this.

Simba Khadder 18:22
And the hard thing, and why I really got pulled into this space, is that, one, it’s essentially immeasurable. There are papers written about how you can try to measure it, but you can’t really measure the serendipity of a recommendation directly. And it only really makes sense when you’re doing things for humans; with computers, like most things, there are answers. Serendipity is really hard because it’s a human effect. I mean, you could attach something to someone’s head and look at every neuron and try to figure out the serendipity of it, but until we can do that, we’re in this hard place where we have to somehow use behavior to see if we captured serendipity. And it’s just really hard. Like you mentioned, you might have two users who look exactly the same, and if you show one of them a song, they might be like, oh my god, this is my new favorite song, and the other one is like, this sucks, why the hell did you recommend this to me? That’s just what happens. It’s impossible to know until after you do it, because you have incomplete information. And with humans you will always have incomplete information, because you can’t, you know, plug into their brain wiring and make it deterministic.
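The proxies those papers propose can be sketched in a toy way. This is a generic illustration, not the metric Simba's team used: one common formulation scores an item as serendipitous when it is relevant *but* unlike anything in the user's history, so both the obvious recommendation and the random miss score low.

```python
def serendipity_score(relevance: float, history_similarity: float) -> float:
    """One proxy from the recommender-systems literature:
    serendipity ~ relevance x unexpectedness.

    relevance:          estimated probability the user likes the item (0..1)
    history_similarity: max similarity of the item to anything the user
                        has already consumed (0..1)
    """
    unexpectedness = 1.0 - history_similarity
    return relevance * unexpectedness

# The same-album Chili Peppers track: relevant but totally expected.
obvious = serendipity_score(relevance=0.9, history_similarity=0.95)
# A song the user will love from a genre they have never touched.
serendipitous = serendipity_score(relevance=0.8, history_similarity=0.2)
# A random, irrelevant recommendation: unexpected but disliked.
random_miss = serendipity_score(relevance=0.05, history_similarity=0.1)

print(obvious, serendipitous, random_miss)
```

Note how the score captures the "gray area" from the conversation: only the off-target-but-still-relevant item scores high. The catch, as Simba says, is that `relevance` and `history_similarity` are themselves estimates from incomplete behavioral data.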

Eric Dodds 19:47
Yeah, for sure. I mean, it’s like, okay, you recommend a song, and one person just went through a breakup and another just had a good date, right? They look the same until there’s a divergence, and the song is going to mean different things. Okay, I want to dig into Featureform, and feature stores and all that sort of stuff. But one question before we dig into that. As someone who’s building technology in this space, how do you balance the influence of the commercial element? So you mentioned grocery stores, and it’s like, okay, how do you create serendipity? But there are also people bidding for the end cap, you know, to put their cereal in the most prominent spot on the aisle. And marketplaces are like that; of course, that’s a business model for marketplaces, where you can bid for space. How do you consider that as part of a recommender model?

Simba Khadder 20:47
Great question. It actually comes back to why I think subscription is a very good strategy for a lot of these companies: it’s essentially saying, hey, I’m trading value, and I’m averaging it out over a month, or probably a little more time. So it’s more like, you’re buying in once, and once you’re bought in, I don’t have to play games anymore; all I have to do is make sure you’re getting as much value as I can give you to justify your $5 investment or whatever. Sure, you obviously have to balance the cost of goods, and there’s an equation there you could think through, but in general we leaned towards: just assume the costs are costs. If you can get people to subscribe, it’s all worthwhile; the job is to make sure they’re getting value, because when people get value, they subscribe, and they mostly just need that experience.

Eric Dodds 21:50
And then you monitor that experience subsequently, to make sure that they’re getting value?

Simba Khadder 21:54
Well, part of what our product would do is also recommend the types of content a user likes, and that might change the kind of content that gets created. This is why The Information and The Athletic have been very successful: their subscription content is very different from other companies’. They were built subscription-first, and they came to the conclusion that content you can only get there, that no one else has, typically a bit more opinionated, typically a bit longer and more dense, is more likely to drive subscriptions than headline-style content, which is great for getting clicks and views but might not drive subscriptions. So anyway, a lot of things come into play, and the short answer is we were trying to measure value. But how do you measure value? For a recommendation, in games or on Spotify, how do you measure the value of a song recommendation? Well, you can’t; you can only see what the user’s behavior reflects afterwards, right? And then you use that to say, well, that must have been successful: I recommended this song from an artist they’d never heard of before, and now it’s one of their top artists; that must have been a very successful recommendation, I provided value. They’re very likely to stay subscribed because they had that magic moment. So we tried to make that happen. And maybe, going into how we did it a bit: put yourself in the shoes of a data scientist. You open up this giant file. Let’s make it simple: it’s a CSV, and each row says, hey, this user, maybe anonymous, actually very likely anonymous, looked at or listened to this song at this time. And maybe you have some song metadata in a table somewhere. And you just have billions of these rows, a ridiculous number.
And, you know, we were handling 100 million monthly active users. You can’t go one by one. In fact, if you pick 1,000 of them, it still doesn’t matter; most of them are just noise anyway. What became really interesting, and the thing we did that was unique to us, is that it wasn’t like we were just recommending things to the user. We called it subscription intelligence: we had to help people understand why their users subscribe, which means you also had to explain to a human what the recommender was doing. Actually, that’s how the idea came about. We originally were personalization as a service, and I was having trouble closing a few deals. So I reached out to one prospect, actually flew up to New York, and got a beer with him. I brought a contract that was printed out. We just chatted, and then I brought it out, gave him a second to look at me funny, and said, I don’t expect you to sign it. I just want to know: why won’t you sign it? What’s missing? And attribution was the big problem; that’s how I found that out. And then I asked, if we solved that, would you sign? And the answer was literally: forget the recommender system. If you solve that problem, if you help me understand why my users subscribe, I’ll pay you for that, even without the recommender. And what I realized was that the recommender already understood why people subscribed. The way the model was designed was actually to drive more subscriptions; it wasn’t just recommending things people click on. The loss function wasn’t exactly subscription, but it was much more correlated in that direction.
And so we actually had to build this explainability layer on how the recommender worked, but display it not to a data scientist who knew what the model was, but to someone who couldn’t care less what the model was, and still give them value. The way we did that, just real quick, and then I’ll hand it back, was to create these user embeddings, and we can dive into what that means. Generally, we would create these embeddings, which are holistic views of the user, and we would cluster them together into what we called personas. So we’d have some number of personas, and we would then provide more traditional metrics, like, here’s how often they come on, but for each of these personas. And those personas were generated by clustering the recommender system’s holistic view of the user. That was the magic: being able to capture all of those things together. So, yeah, I’ll pause there.
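The persona step described above, clustering user embeddings and then reporting traditional metrics per cluster, can be sketched roughly like this. This is a toy k-means on made-up 2-D "embeddings" (real user embeddings would have hundreds of dimensions, and a production system would use a library implementation):

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, iters=25):
    """Minimal k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all current centroids.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters

# Toy 2-D "user embeddings": two behaviorally distinct groups of listeners.
users = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1),   # persona A
         (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]   # persona B
personas = kmeans(users, k=2)
# Each persona would then get traditional metrics (visit frequency,
# favorite artists, churn rate) computed and reported against it.
print([len(p) for p in personas])  # → [3, 3]
```

The key idea from the conversation survives the simplification: the embedding captures the holistic view, clustering turns millions of unreadable vectors into a handful of personas, and the explainable metrics are computed per persona rather than per user.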

Eric Dodds 26:54
No, I love that. So many questions. But I’ve been monopolizing, so why don’t we do this: can you give us the Featureform pitch, which I should have asked you to do at the beginning? And then I want to hand it off to Kostas to dig into how you actually do that from a technical standpoint.

Simba Khadder 27:17
Yeah. So again, picture me looking at this giant CSV of all these users. Most of what I’m doing is coming up with features. Those features could be, hey, how often did this user come to, let’s say, Spotify? What’s this user’s favorite song in the last seven days? What’s their favorite artist in the last 30 days? We would generate all of these features, and eventually embeddings, which we’ll get into, I’m sure. And being able to do that alone was hard; I’d have to materialize things just to read the data; there’s a lot going on at the very low level. But the worst part, which I’m sure any data scientist listening to this can relate to, is that we would have these Google Docs full of SQL snippets; we would have untitled-118 iPython notebooks that we’d be copying and pasting from. We had no source of truth, no versioning, nothing; it was all ad hoc. We couldn’t, at any point in time, look at a training set and be clear about, hey, which features did we use, and how exactly were they created, and could we do that again? It was all done in such an ad hoc fashion. So we built Featureform to be this kind of framework that sits above the infrastructure. You can still take full advantage of your infra, because, again, we had 100 million MAU, but it allows the data scientists to define, manage, and serve their features, training sets, labels, everything, in a framework. It lets them write SQL, lets them write data frames, lets them write what they’re used to writing, but gives them the scaffolding to put all that together, so we can automate most of the low-level and mundane tasks that aren’t just coming up with new features.
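To make the "scaffolding" idea concrete, here is a hypothetical sketch of that kind of framework. This is not the actual Featureform API; the decorator, registry, and function names are invented for illustration. The point is just that features become named, versioned definitions instead of SQL snippets scattered across notebooks and docs:

```python
from collections import Counter

# Hypothetical registry: maps (feature name, version) -> transformation.
FEATURE_REGISTRY = {}

def feature(name, version):
    """Register a transformation function as a named, versioned feature."""
    def wrap(fn):
        FEATURE_REGISTRY[(name, version)] = fn
        return fn
    return wrap

@feature("favorite_artist", version=1)
def favorite_artist(listens):
    """Most-played artist in the user's listen events. (Windowing to
    'last 30 days' is elided; a real system would handle that.)"""
    return Counter(event["artist"] for event in listens).most_common(1)[0][0]

# A training set can now record exactly which (name, version) pairs built it,
# so "which features did we use, and how were they created?" has an answer.
listens = [{"artist": "Katy Perry"}, {"artist": "Katy Perry"}, {"artist": "RHCP"}]
print(FEATURE_REGISTRY[("favorite_artist", 1)](listens))  # → Katy Perry
```

The versioned lookup is the whole trick: reproducing a training set becomes replaying a fixed set of (name, version) definitions against the data, rather than hunting through untitled notebooks.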

Kostas Pardalis 28:54
All right. So, Simba, you mentioned two very interesting terms, and I would like to ask you about both of them. You mentioned the term feature and the term embedding, right? So why do we need two different terms? First of all, what’s the difference there? And help us understand a little bit: which came first, what are the differences, what are the similarities? Let’s get a little bit deeper into both of them, because I’m pretty sure there’s a lot of confusion around these terms out there.

Simba Khadder 29:32
Yeah. So the first term was feature, and the second was embedding. An embedding is a subtype, a type of feature. So let's first talk about what a feature is. Let's start with how a model works: a model is essentially a black box function. You might be able to understand parts of it, but you can think of it, in this case, as a black box function. It takes signals as inputs and generates an output, a prediction. That's how most models we use work. Now, those inputs, in the Spotify example, might be things like your favorite song in the last three days, or your favorite artist in the last seven days. It might be a variety of different signals. And I like using the word signals, because I think it's a better term; it captures what a feature is. A feature is really just a signal derived from the raw data that you're providing to the model. In some computer vision cases, it might literally be the raw image; the signal is the pixel array. But in most situations, especially with NLP, and especially with any tabular data, like a fraud detection use case or recommender systems, we take a lot of steps with that raw data, crossing our domain knowledge as data scientists with data transformations, to generate signals that we then feed into the model so it can do the best job possible. So that's what a feature is. Let's call those traditional features. The feature pipelines that generate them are things like dataframes or SQL; they're well-understood concepts. Now, an embedding is a very special type of feature. An embedding literally is a vector, and a vector, as the math says, is an n-dimensional point.
And each of its values is just a floating point number. Now, these embeddings have this interesting characteristic where you can embed a lot of different concepts. Let's say, again, we embed users based on behavior. So I have a user embedding which, if I'm Spotify, is maybe this holistic view of who you are as a user, what you like to listen to, trying to capture all the nuances that make you unique. I somehow take all of that and I turn you into a point in space. Now, alone, that means nothing, right? It's like, cool, I have this random vector. You told me this is Kostas, and I trust you, but I don't know what to do with it. That's where the magic comes in: when you have many points. When you're Spotify and you have millions of users, you end up with millions of points, and it's almost like structure forms. If you Google "embedding projections," you can see some of these structures. They typically cluster into a lot of really cool shapes; one of my favorite things is looking at the shapes that form from different types of embeddings. There are all kinds of things that get injected into that n-dimensional space, though typically you'll visualize it as a 3D space. So one thing, and the most obvious one that a lot of people are aware of, is that users who are similar, who have similar music tastes in the Spotify sense, will be close together in space. Their vectors will have very similar values. I want to dive into why that's hard for a second. Let's say I have text embeddings, three pieces of text, where two of them are really close and one is far away.
The common way to vectorize text, if you've ever taken an NLP class, is the first technique you'll learn, called TF-IDF: term frequency, inverse document frequency. That's the number of times a term, a word, shows up in a document, multiplied by the inverse of the document frequency, which is how many documents actually contain that term. It ends up working out so that common terms get a low weight, because they show up all the time, and rare words tend to be weighted really high. So that's one way to vectorize a piece of text, but it's kind of dumb, because it doesn't really understand the words, right? It's just treating each word as an opaque identifier. And that works fine for some things, but it doesn't understand sarcasm. You might have three documents that all have very similar words, but one is sarcastic, and that one is obviously very different. Now, a good embedding, from a good transformer, will actually capture all this nuance and put it into the embedding space. And it's the same with user listening behavior. It's not just, oh, this user loves Katy Perry, and this user also loves Katy Perry, so they're near each other. It's a lot more nuanced than that. Now, the final thing: when I build that initial, traditional feature, I use something like a SQL query. When I build an embedding, or something like an embedding, I typically use a transformer model. It's literally a machine learning model whose whole job is to take features as inputs and generate an embedding. So you embed a concept, which is typically sparse data, into vector space. So anyway, there's a lot, but that's the very long answer to a very short question.
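Simba's description of TF-IDF can be sketched in a few lines. This is a minimal, from-scratch illustration (real projects would typically use something like scikit-learn's `TfidfVectorizer`, which also applies smoothing and normalization); the toy documents are made up.

```python
import math

def tf_idf(docs):
    """Score each word in each document by term frequency times
    inverse document frequency (a common smoothed variant)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    # document frequency: in how many documents does each word appear?
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    vectors = []
    for doc in tokenized:
        vec = {}
        for w in vocab:
            tf = doc.count(w) / len(doc)        # term frequency in this doc
            idf = math.log(n / df[w]) + 1        # rare words score higher
            vec[w] = tf * idf
        vectors.append(vec)
    return vectors

docs = [
    "i love this song",
    "i love this artist",
    "stocks fell sharply today",
]
vecs = tf_idf(docs)
```

Note how "stocks," which appears in only one document, gets a higher weight than "love," which appears in two, even though every word is still just an opaque identifier. That is exactly the limitation Simba points out: the scheme captures word statistics, not meaning.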

Kostas Pardalis 35:53
No, actually, I think you did an amazing job describing what an embedding is, and what the difference is between a feature and an embedding.

I'd like to pick up on some of the terms that you used. You mentioned a high-dimensional space, so there is a concept of space there, and then you have points in that space. For people who have done basic linear algebra, it's not that different from what we were doing with vectors there; we actually use pretty much the same algorithms in the end, like to calculate similarity, right? And I think that's also a big part of the beauty of this whole thing: we can take something so complicated and reduce it to a mathematical structure where basic tools from linear algebra can be used to answer questions about semantics. That's what you were describing: with frequencies alone, there are no semantics; we cannot really capture the difference between the meanings of words, but with embeddings we can. But one of the things that you said: you mentioned, for example, the user embedding, the word embedding, whatever embedding. It seems like we need a different space for each one of these things that we are trying to model, right? It's not like we'll take a model that does word embeddings and use it to generate user embeddings, if I understand correctly, right?
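Kostas's point that embedding similarity is just basic linear algebra can be made concrete. Here is a minimal sketch with made-up four-dimensional "user embeddings" (real ones usually have hundreds of dimensions), using cosine similarity, one of the standard similarity measures:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity between two vectors: values near 1.0
    mean the vectors point the same way; near 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy 4-dimensional "user embeddings"; purely illustrative values
alice = [0.9, 0.1, 0.8, 0.2]
bob   = [0.8, 0.2, 0.9, 0.1]   # similar taste to alice
carol = [0.1, 0.9, 0.1, 0.9]   # very different taste

print(cosine_similarity(alice, bob))    # close to 1.0
print(cosine_similarity(alice, carol))  # much lower
```

This is the sense in which "users with similar tastes are close together": whatever generated the vectors, comparing them reduces to dot products and norms.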

Simba Khadder 37:34
You can, and we did it. We would put user embeddings and item embeddings in the same space, so we could actually use that to find: hey, if I take a user, I can find the N items that are closest to it, and those would be the items with the highest affinity toward that user. So you can embed things into the same space. It can be done. Generic models typically don't do that; they have their own space that they're trained on. But if you own the transformer, it's very much doable to build things into the same space. So I wouldn't say each thing has its own unique space that you can't cross between. If you wanted to, you could, I'm sure, put images and users somehow in the same embedding space, even if I can't think of the details off the top of my head. Another thing: it's not like there's one user space that exists for all users. It comes from the transformers themselves. The way the transformers work is that they are trained. I'll give you an example; I'll tell you how one part of our recommender system would create an embedding. We would have this model, and it would take all these user attributes, just traditional features: what's their favorite thing in the last 30 days, what did they just listen to, their age, whatever other traditional features. We would feed those into this transformer model, and we would train the transformer to solve a surrogate problem. And the surrogate problem is really what defines the latent space. The surrogate problem you're training on is: hey, try to predict what the user is going to listen to next. Which is an impossible problem. There's no way, given those features, that you'll have a model that can guess with 99% accuracy what a user will listen to next. As we talked about, it's just impossible from that space. But by doing that, you learn something.
The way you literally do it is you essentially take the last hidden layer of the network as it does its work, and that's the embedding. That's literally how you create them; those values are the embedding. And there are many different tricks and techniques. One thing you could do is, rather than predicting the chance of clicking, predict how long they will watch the item. So not only which item, but how long they will watch it. That will actually change the embedding space, and, funny enough, changing the embedding space that way will typically result in higher-quality recommendations than a click-based space. So there's a whole lot of science and art. And again, I love the art of machine learning. I love problems where it's creative and fun, not just, hey, is this a hot dog or not. No offense to the computer vision people; I know I've kind of not been talking well about it. But I love recommender systems and embeddings in general because of that art. I'm literally building a huge model, typically, and it's expensive to train, and I'm using it entirely for its last hidden layer, because that's the only thing that's useful to me. I actually don't care about the model itself beyond that.
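The "use the model only for its last hidden layer" trick can be illustrated with a toy two-layer network. Everything here is hypothetical: the weights are random rather than trained on a surrogate task, the dimensions are made up, and in practice this would be a large transformer. The point is just the shape of the idea: the prediction head is discarded, and the hidden-layer activations become the embedding.

```python
import random

random.seed(0)

N_FEATURES, EMBED_DIM, N_ITEMS = 5, 3, 10

# toy weights; in reality these come from training on the surrogate
# task ("predict what the user will listen to next")
W_hidden = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(EMBED_DIM)]
W_output = [[random.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(N_ITEMS)]

def relu(x):
    return max(0.0, x)

def forward(features):
    """Full model: features -> hidden layer -> per-item scores."""
    hidden = [relu(sum(w * f for w, f in zip(row, features))) for row in W_hidden]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in W_output]
    return hidden, scores

def embed_user(features):
    """Throw away the prediction head: the hidden layer IS the embedding."""
    hidden, _ = forward(features)
    return hidden

user_features = [0.2, 0.9, 0.1, 0.4, 0.7]  # "traditional" features in
embedding = embed_user(user_features)       # dense embedding out
```

Changing the surrogate objective (predicted clicks versus predicted watch time, as Simba describes) changes the trained weights, and therefore changes the embedding space, without changing this extraction step at all.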

Kostas Pardalis 41:06
So, to build a model that generates embeddings, we start, again, from features.

Simba Khadder 41:11
Yes, yeah. There's actually a funny thing where you can train embeddings on the fly, as you're training the models that consume them, which is a whole other story, but we did that too. So you can. Anyway, yes, there are a lot of crazy things you can do. But I really want to highlight that embeddings are just a special type of feature. And even in that world, you're still using features; even to create the embeddings, you have to create features. It's a signal, right? As much as we like to imagine these types of models as magic wands that you sprinkle on text so it magically does whatever you want, it doesn't really work that way. Models are using, and trained on, very traditional features to create embeddings. And typically, because these are really cutting-edge models, when you think of something like GPT, you start actually stacking models on top of each other, so you have embeddings that feed into other embeddings. If you're using images and text in GPT-4, which I believe supports that now, I would be very surprised if those things weren't originally processed by different models to be embedded, and then fed into another model to create kind of a holistic embedding of the whole thing. So yeah, embeddings are just specialized features. In fact, with Featureform, a lot of the problems we solve for traditional features, like, hey, how did I create this feature? How is it defined? What version is this? Where is it being used? Who owns it? Governance, lineage, all those traditional problems you'd expect to have with a ton of traditional features, totally happen with embeddings. I can't tell you how many times I've hit these things, like when I built my own vector database, which I've done probably three or four times in my career.
And there'd be all these moments of, man, I actually don't remember how I made these embeddings. I don't remember which model I used, how I trained it, or where that model is, so I actually can't recreate the embeddings, because it's just somewhere in the untitled notebooks I have on my laptop. And so, Featureform, even though now people associate feature stores with traditional ML, was actually built with this kind of new-style ML originally. And we, let's say, cut that stuff out, because it wasn't cool back then. When I was doing embeddings, it wasn't cool; it's cool now. So I was like, cool, let's cut that out, because no one knows what that is, and we'll focus on the traditional stuff people are using today. But we're actually about to release a lot of stuff in that space, stuff we had actually built before we turned it off, and now we're going to turn it back on, which is pretty exciting. I'm very excited about what's ahead of us, actually.

Kostas Pardalis 43:56
Yeah, and we will have the opportunity to talk about that. But before we get there, you mentioned vector databases. Naturally, they come up as a way to interact with and use embeddings; that's how most people hear about them today. So there is this concept of the vector database. As you said, the embedding, from a representation standpoint, is just a vector of floating point numbers, and we need to do operations on that stuff, right? Somehow we get them, and we need to work with them, query them, make comparisons; but that's, let's say, stuff we also do in a traditional database. So my question is: why do we need this new thing called a vector database? And how does it fit into the overall workflow we have when we are working with and building these systems, transformers or whatever kind of ML elements we want to call them?

Simba Khadder 45:05
Yeah, I'll start with the original problem, and then I'll get into how the LLM stuff adds a new flavor to vector databases. As I mentioned, I've built a vector database a few times in my career. We actually released one in Featureform, which is kind of deprecated now, just because there are plenty of other great options you should look at. But here's why I built it. The problem to be solved, originally, before even the vector database part, was that I would have these embeddings, and a very common operation is a nearest neighbor lookup. Again, if I have a user embedding, I want to find the N items closest to it; I just do a nearest neighbor lookup on the vector. Now, the problem is that a nearest neighbor lookup is a very expensive operation. Essentially, the only way you can do it 100% correctly is brute force. So a variety of companies came up with approximate nearest neighbor algorithms. One of the most popular, which is funny, because I think it's kind of lost to time now, was one called Annoy, from Spotify. Get it? ANN: approximate nearest neighbors, oh yeah. It's an approximate nearest neighbor index held in memory. And the problem with that was, it's as if I gave you a B-tree and said, here, this is your database. Well, that's great; you've solved the really hard algorithmic part, but there's all this stuff I have to build around it to actually deploy the thing. It was super common, probably less so now, but for a long time, to see what we honestly did ourselves: we would upload our embedding files into the Docker container alongside the model, and in that container, read the file and build the index at startup time. That's actually how we did it.
And then it made sense, eventually, to create a service that was almost Annoy-as-a-service for us, which became our vector database. I mean, there's more you have to do; you also persist to disk, and there's obviously more than just putting an API on Annoy. But that was one of the key problems I saw. There's another key problem, which is that Annoy isn't doing a lot of things: being able to distribute the search, being able to do filtering, which is a really hard problem that none of the open source indices can do. The proprietary ones, I believe Pinecone, Milvus, Weaviate, and Redis, can do it. But there are a number of hard problems. These are indexes; these are database problems. It's a specialized index, and either you build a database around that specialized index, or you have to fit the specialized index into existing databases. The problem is that the existing algorithms, like the most common one now, HNSW, don't really play well with how databases are architected. So the algorithms have to be tweaked to get a final algorithm with similar quality that also has characteristics, in how you scale it out, similar to what you'd find with a B-tree or whatever. So yeah, anyway, that's a long answer to why vector DBs exist; they're definitely solving a real problem. Now, what remains true for sure is that people will be using embeddings, at least in my view, and the nearest neighbor lookup, the approximate nearest neighbor lookup, is not going away. That's really common; there's no question. But there's a misconception the market has, which I learned recently: people think vector databases are just a place to cache embeddings, which is not true. I mean, you could use them that way.
But at that point, it's just a list; you could put it in Redis, and it doesn't really matter. The thing that makes a vector database special is that index, that nearest neighbor index. So yeah, I'll pass it back on that note.
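The brute-force lookup Simba contrasts with indexes like Annoy or HNSW is easy to write down, which is also why it doesn't scale: every query scans every stored vector. Here is a minimal sketch with made-up item embeddings:

```python
import math

def nearest_neighbors(query, points, k):
    """Exact k-NN by scanning every stored vector (O(n * d) per query).
    Returns the indices of the k closest points by Euclidean distance."""
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p)))
    ranked = sorted(range(len(points)), key=lambda i: dist(points[i]))
    return ranked[:k]

# toy 2-dimensional item embeddings; a user embedding lives in the same space
items = [
    [0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.5, 0.5], [0.85, 0.9],
]
user = [0.12, 0.22]
print(nearest_neighbors(user, items, 2))  # -> [0, 2]
```

Approximate indexes exist precisely to avoid this full scan per query while returning nearly the same neighbors, and the hard database problems Simba lists (distribution, filtering, persistence) all sit around that index.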

Kostas Pardalis 49:25
No, it makes total sense. And how does this fit into a system like Featureform, which is a feature store, right? So we have feature stores, and we have vector databases. There has to be some kind of relationship. I mean, features are everywhere, right? Even someone who has no idea what we've talked about so far would, I think, read the first signal of some relationship there: we have features everywhere, and we even need them to build the embeddings themselves. So how do features work with all of this? Let's stick with Featureform, the product that you built: how is it architected, and how does a vector database fit into it?

Simba Khadder 50:09
Yeah, one thing that makes Featureform unique, and this is not true of literally any other feature store, is how we work: we call ourselves a virtual feature store. The virtual means that we sit on top of your existing infrastructure; we simply turn what you have into a feature store. So we end up being this kind of framework layer that data scientists love to use, but that also allows you to take full advantage of all the infrastructure you have underneath. Now, from our perspective, it's very common for our bigger clients to have some features in Redis, some features in Cassandra, some features in Mongo, whatever; they might have a variety of different places, with some things built in Spark and some built in Snowflake. Featureform works really well in those situations, because it sits on top of all of it, and it provides one unified abstraction to define the features, manage them, and serve them. Now, from our perspective, a vector database is just another kind of online store, what we'd call inference stores. It's a place that stores features, which embeddings happen to be, again, a specific type of, and it has this new operation for lookups, which is the nearest neighbor lookup. So we just need to support both of those operations. The other thing that we do is orchestrate transformations. As a data scientist, you define your transformations in our framework, which, again, will be 99% the same code you'd already write: the same SQL query, the same PySpark or pandas, whatever. And then there's this kind of function wrapper you put it in, and you might give it some metadata, like the name, the version, the description, the owner, and a lot more if you want. And we orchestrate that: you can set a schedule, and we can orchestrate those transformations for you.
And from our perspective, a transformer, say a pre-trained transformer, even an LLM, is just a new type of transformation. From our perspective, it takes an input, which is text, and outputs an embedding, which is just a feature, a special type of feature, like I keep saying. And all we know beyond that is that it's the type of feature you typically want to store in a vector database if you want the nearest neighbor lookup. You don't have to; if you're just doing key-value lookups, you can put it wherever. But if you're specifically doing a nearest neighbor lookup, you put it in a vector database. So Featureform is this workflow tool, a tool that encodes the feature workflow on top of existing infrastructure. The vector database is a tool that provides this new specialized index, which happens to suit embeddings. Transformers are a special type of transformation, which happens to create embeddings, and which happen to be models themselves. And an embedding is a special type of feature that has a lot of the characteristics I touched on. So that's how they relate to each other; there's your embedding graph of all those concepts, and we can look at it.
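The "function wrapper plus metadata" workflow Simba describes can be sketched as a Python decorator. To be clear, this is not Featureform's actual API; the names here (`register_transformation`, `REGISTRY`) and the schedule syntax are made up purely to illustrate the registration pattern:

```python
# Hypothetical registry illustrating the pattern; not Featureform's real API.
REGISTRY = {}

def register_transformation(name, version, description="", schedule=None):
    """Record a transformation function plus its metadata so an
    orchestrator could later look it up, version it, and run it."""
    def decorator(fn):
        REGISTRY[(name, version)] = {
            "fn": fn,
            "description": description,
            "schedule": schedule,
        }
        return fn
    return decorator

@register_transformation(
    name="avg_listens_7d",
    version="v1",
    description="Average daily listens per user over the last 7 days",
    schedule="@daily",
)
def avg_listens_7d(rows):
    # rows: (user_id, daily_listen_count) tuples; the "99% same code"
    # a data scientist would have written anyway
    totals = {}
    for user_id, listens in rows:
        totals.setdefault(user_id, []).append(listens)
    return {u: sum(v) / 7 for u, v in totals.items()}

features = avg_listens_7d([("u1", 7), ("u1", 14), ("u2", 7)])
```

The decorated function still runs as plain Python, while the registry carries the name, version, description, and schedule that a feature store can use for lineage and orchestration.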

Kostas Pardalis 53:09
Absolutely. Yeah, I think we should do that. All right, cool. Whenever we talk about, let's say, more of the infrastructure side of things, I hear you mentioning a lot of the more traditional technologies that have been used; we're talking about Cassandra, Spark, Redis, all these things. But today we also have all this craziness with, let's say, these huge large language models, OpenAI and so on, that can do all these crazy things. Before we get to that, first of all: is there a distinction between ML and AI? I was thinking about this myself; I use these terms in a very mixed way many times. I have a distinction in my mind of what the difference is, but I don't think it's explicit. I'd love to hear your opinion on that, and then we can continue on to how things will change because of all this.

Simba Khadder 54:20
Yeah. So there is a difference in what they mean, but, like most terms, it's really about what they mean in practice. And historically, pre-LLM, honestly, my take was: if you said AI, you didn't know anything about ML, and if you said ML, you knew about ML. AI was the hand-wavy way of saying it, and ML was much more, I guess, concrete; like, I know what that means. And over time, like now, I think people have attributed or associated AI with the foundational models, the LLMs, the GPTs, and ML with what I'm calling traditional ML. And it's all machine learning; whether GPT is intelligence is maybe a better question for another time. But in practice today, I think it's kind of an accepted thing where AI just means that class of model, and ML just means everything else. And I think that's totally fine and fair, because they are different. The way you use them, the way you think about them, the way you interact with them, it's just so different that it's not just ML; it gets its own term. I'd prefer foundational models, but, I mean, AI is fine.

Kostas Pardalis 55:38
Okay. Yeah, that makes sense. I love the answer; very helpful for myself, also, to be honest. All right, so how will things change in the future? First of all, do you see features, or feature stores, and again Featureform specifically, changing their roadmap because of these foundational models? If you reflect on how, a year ago, you were thinking about how Featureform was going to be successful: has that changed because of this? And in what way? How do you see things changing?

Simba Khadder 56:17
Yeah, I think things are changing very rapidly; it's kind of insane how fast. The way I would frame this new class of model is that it's almost like the straw that broke the camel's back. You could argue that there's a model that most data scientists are aware of, BERT, which also has a line of models that came before it, like ELMo, leading up to it, that really brought this new type of transformer into the hands of most data scientists, and gave anyone, quickly and easily, the ability to have state-of-the-art NLP with this kind of generic model. So for me, that was a very magical moment. It came out in November; it must have been about five years ago now.

Simba Khadder 57:11
Now, with GPT, or GPT-3 and then ChatGPT, I think what's happened is that we're using a lot of the same techniques. Obviously much more specialized, much more nuanced, better; we've gotten better and continue to get better. But it's the same, let's call it, category of solution as a BERT. The big difference now is that as an end user, even my grandma looking at it, it feels like it's past that line. It feels real now; it doesn't feel like a project. AI has finally crossed the line where it's now good enough to use in more situations than the older models, like BERT, could be used in. In many situations, it's almost like it's passing a Turing test, or getting there. If you interact with ChatGPT, even though you know you're talking to an AI, it feels good. It feels pretty solid; it's not like, well, I think it's really bad. And I think that's been the big change: it's finally crossed that line. We're finally at the point where a lot of use cases and problem statements that had been unreachable and unattainable suddenly open up. Now, I'm sure you've seen that a lot of the application-layer products built on top of GPT look very similar, right? They're all solving very similar problems. And that's because, in my opinion, prompts are kind of the wrong tool for the job. I made this joke earlier with someone: there's this thing in evolutionary biology where the crab, as an animal, has evolved many different times from many different species. And the joke is that the crab is the global minimum; the best thing evolution can come up with is the crab. It's perfect.
And my joke was that I have the same feeling about SQL: prompts are wrong, and we're going to end up with some SQL-like language to interact with these models. That may happen, but I also think what's much more likely to happen is embeddings, which we've talked about; I talked about how you can run all these operations on them, not just nearest neighbor lookups. I think embeddings are a much more natural intermediary to use and build upon for those much more complicated applications. I think the reason lots of AI applications look the same is that the API of a GPT-based model is so simple: it just takes text and has an output. And all they're doing is coming up with interesting prompts, essentially a template they put together to try to make the model do what they want. And I think that won't last. So what I think will happen is that they will take GPT, and it will still have that interface, but they will likely expose a lot of the transformers they have underneath the hood and allow you to use them for a cost. And then embeddings again become this core piece of ML, which has been true in NLP for years, but I think this makes it true for a lot of different parts of ML, like recommender systems and other places where we've been using them. This, I think, will become much more powerful, and the vector database will be a core piece, with the feature store on top of that. It's still a different workflow, but it's not that different, right? Again, embeddings are just a specialized type of feature, and a transformer is a specialized type of transformation. You can fit them together; it's very common for us to take embeddings and use them as inputs for other models.
Rather than using an identifier, like a user ID, I would just take the user's embedding and put it into the model to do, say, a ranking step. So now I have these really generic embeddings that work super well, and I can feed those in as well. I don't think traditional ML will go away, especially when you start thinking about all the specialized use cases people have, fraud detection, et cetera. You can't just run that through an LLM; it just doesn't really fit that way. But you can use embeddings as your intermediaries, so the LLMs sprinkle back onto your fraud detection to make it better. And that's what I think we'll start to see happen: a joining of the two. And with Featureform, there's still a data science workflow; data scientists haven't gone away. So we remain that workflow layer that data scientists interact with for all of their features, both embeddings and traditional ones.

Kostas Pardalis 1:02:01
Makes a lot of sense. All right, we could continue this conversation for many hours, which I think means that you have to come back, and we will do that. But now I'll pass the microphone back to Eric, because we are close to the buzzer, as you say. I did it again! No, I stole your phrases.

Eric Dodds 1:02:24
Wow. Yeah, you stole my line. That's okay, though. Okay, Simba, this will probably turn into two more questions. I believe what you're saying about the way that things will play out. But in most businesses, the business logic is non-predictive, right? We're talking about basic business logic, say, for example, a key KPI or something; it doesn't rely on ML at all, and it leads to decision making. When you were talking about the bleeding edge of subscription machine learning models, it really seemed like you were knocking on the door of machine learning helping drive business logic. But I still see a gigantic gap there, in that core KPIs are going to drive the business, and machine learning is still really early. ChatGPT is exciting, but there are all these other components, right? So help me understand, from your perspective. I believe there are recommendation models and features that can really help lead the business logic, but it seems like most businesses are going to lag behind. So how do we cover that gap?

Simba Khadder 1:04:00
I think, I don’t know if I will get to cover the gap just because I think it’s a bit like, it’s almost like how old enterprises modernized? And I think that’s kind of been like an age old question, but I’m not gonna send that answer to you. I want to maybe dive into part of the question, I think, which is, I guess I just want to highlight necessarily, but lots of my engineers that picture for us chat GPT a lot and where it’s become really and I use it to I’m like writing a blog post and I was like, God, like trying to think of like, this. Like, I’m like, kind of stuck on this paragraph to be your job I will send to you as Chad GPT-3 Like, hey, like, describe this for me, like, no, that’s not right. But it’s easier for me to like, take something and kind of, yeah, see why it’s wrong than it is for me to just come up with a time space. And so I think what it does is it enables people to do a specific decision like the kind of people who are making decisions as their job. How to do it better, because they have this kind of machine that can track them, which is always right. But it always has an answer. And the answer is always like, if not extremely stupid. Most of the time, it’s usually like I’m directionally right. And I can use that and feed that in as far as Europe, it’s a feature to your own brain, so that you can kind of make the best possible decision. I think that’s what we’ll see happen. I think we’re kind of seeing it happen. And I think that the metrics keys are like keyboards suddenly making decisions. I don’t believe we just thought that there was a 10 what you’re saying, we’re like, oh, yeah, like, rob a job everyone. Like, we can just like, I had podcasts and so many jokes about like, next time, maybe we’ll just have our LLM, like, come on, talk to each other. And we’ll just like, sit back and have a beer. And so we’re not, we’re not there yet. But I do think that we are definitely at a point where it’s like, it is a multiplier effect. And data. 
And data and ML have always been a multiplier effect. Software was this multiplier effect on productivity per person: I write one line of code, and it automatically scales across a hundred million users. Now it’s that, but it compounds. That’s what data does, and what ML does. And then LLMs are this newest layer that takes full advantage of it and maybe creates a new kind of third-order multiplier.

Eric Dodds 1:06:20
Of productivity. Yep, I love it. Well, Simba, this has been a wonderful episode. As Kostas said, there’s so much more to talk about, so we’d love to have you on again. But thanks for giving us some of your time.

Simba Khadder 1:06:34
Of course. Thank you, this was a lot of fun.

Eric Dodds 1:06:37
Well, Kostas, what a fascinating conversation with Simba from Featureform. I feel like the conversation spanned a much larger footprint than just, you know, features, or even MLOps. I mean, we talked about so many different things. But what I’m going to take away is his background in trying to understand how to create a great moment for a user. It’s very clear that it influences the way he thinks about building technology that ultimately materializes into data points. Of course, we can call those features, and there are embeddings and all sorts of technical stuff, but it’s very clear that Simba is building technology that will enable teams to use data points to create really great experiences. And I think that comes from him facing the difficulty of trying to understand why, of millions of visitors, only a handful of people will subscribe. That, to me, was really refreshing, because MLOps is a very difficult space. Feature stores and all of the surrounding technology can be very complicated, and there are a lot of players. But it’s clear that Simba just wants to help people understand how to drive a great experience using a data point that happens to be derived, that happens to rely on a lot of data sources, and that happens to need to be served in a very real-time way. To him, those are consequences.

Kostas Pardalis 1:08:25
Yeah, I mean, Simba is a person who, first of all, has a lot of experience, right? He has been through many different kinds of experiences, many different phases of what we call ML or AI.

Kostas Pardalis 1:08:39
And he has done that in a very, like, production environment, right? So he has seen how you can build actual systems and products and deliver value with all these technologies, which is obviously something very important for him today as he’s building his own company. And I think that’s an incredible advantage he has that we didn’t talk much about. Maybe it’s a topic for another conversation with him: the developer experience, and how all this complicated infrastructure, with all the different technologies we discussed together, can deliver an experience that makes the developer who works with all that stuff more productive. But what I’ll keep from the conversation we had with him is that he gave an amazing description of what features are, what embeddings are, how they relate to each other, how we go from one to the other, how we use them together and, most importantly, how all of this will become some kind of, let’s say, universal API for ML- or AI-driven applications in the near future. I’m not going to say more about that, because I want everyone to listen to him; he describes it much better than I can.

Kostas Pardalis 1:10:22
But there’s a wealth of very interesting information in there about everything that’s happening in the industry today and what will happen over the next couple of months. So, yep.

Eric Dodds 1:10:35
I agree. I think that if you want to learn about features, there’s actually way more in here, and you’ll learn about the future of MLOps and actually operationalizing a lot of this stuff, so definitely take a listen. If you haven’t subscribed, definitely subscribe, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.