This week on The Data Stack Show, we have a special edition as we recorded a series of bonus episodes live at Data Council in Austin, Texas. In this episode, Brooks and Kostas chat with Simba Khadder, Mikiko Bazeley, & Shabnam Mokhtarani, three members of the leadership team at Featureform. During the episode, the group discusses the differences between MLOps and DataOps, what makes Featureform different from other feature stores, learnings from leading the work versus being in the work, and more.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Brooks Patterson 00:25
All right, what’s up everybody? This is Brooks. I am usually behind the scenes on the show, running things for Eric and Kostas. But we are live at Data Council this week. And Eric, unfortunately, he’s fine, but he got in a biking accident and wasn’t able to make it. So I’m filling in for Eric. And we have an awesome group of folks here. We have Shabnam, Mikiko, and Simba from Featureform. And we’re super excited to chat with you guys here, live and in person at Data Council. So we’d love to just quickly go around before we get to chatting. And we can start with you; we’d love to hear a little bit about your background.
Shabnam Mokhtarani 01:03
Yeah, I guess I have an interesting background in tech. I’m currently CEO at Featureform, but I got my first job in tech working at Slack. I joined them around 2015, when they were still pretty small, working in global BizOps. I then decided to blow up my career and pivot to software engineering, and did that for a couple of years before meeting the wonderful Simba. And here I am at Featureform.
Mikiko Bazeley 01:30
Hey, yeah, so I joined Featureform last October as Head of MLOps. So my focus is entirely on, number one, helping users, specifically data scientists and ML platform engineers, develop and deploy the platforms and systems to make sure that production ML is going well for them. A lot of times, that means working with everyone, from early-stage startups to, we have some pretty large international enterprise users now at this point. Prior to joining Featureform, I was actually an ML platform engineer at MailChimp, which was acquired by Intuit in, no, earlier than October of last year. And I’ve held various roles, either as a data scientist, or as a platform engineer, even as an analytics engineer, like, way before the term came into vogue. So I really like working with folks up and down the data science and machine learning value chain. Cool.
Simba Khadder 02:29
Yeah, and they’ve both mentioned that they’re boot campers; I just realized right now that a lot of our execs are out of boot camps. So yeah, myself, I am maybe a little more boring in that I went to UC Santa Cruz for CS. Then I was at Google for not very long. I immediately was like, is this what I’m supposed to do for 40 years? I realized that’s not the life I want for myself and left to start my first company without any idea, like, no product. And that was my last company. It ended up being modestly successful. We were handling over 100 million MAU at our peak, doing personalization for subscriptions. And we built all of this, what we now call MLOps, in-house. And one thing we built, which nowadays we call a feature store, was so pivotal to us. And I realized that there was such an opportunity there. So that kind of became the foundation of what is now Featureform.
Brooks Patterson 03:23
Cool. What’s it been like going from kind of doing the work to helping others do the work? Have there been any surprises there?
Mikiko Bazeley 03:31
Yeah, and the irony, of course, was that last year, when I was at that certain point in time, especially with, you know, the acquisition of MailChimp, that was like the third or fourth acquisition I had gone through at a company, not that company, but just across my career. So I had been kind of figuring out what my options were. I knew I wanted to continue to build stuff, and I knew I wanted to have an impact. Something that I was a little bit frustrated with, and that I kind of sympathize with a lot of enterprise users on, was the feeling of your work not mattering. Like you having no impact on, like, either the overall architecture or the direction of your company. So I was thinking about it like, hey, do I go join a startup? Do I go work as a consultant for, like, Google Cloud services or AWS? And I was famous for saying I am never, ever gonna go work for an MLOps company, I’m certainly not going to become the feature store girl, and, you know, I had to eat my words later on. But it’s been really delightful. I think sometimes in the MLOps or the DataOps space, there’s a little bit of this love-hate relationship with vendors. And I 100% understand that, being on, you know, both sides of the table now. But the reason why I joined Featureform was because I thought it was just such a cool project. We were trying to do really similar stuff over at MailChimp, trying to figure out how we actually make data scientists more successful without mandating this really constrictive, single path to production. And I felt like a lot of the solutions out there, you know, for better or worse, were not quite meeting the gap that Featureform fills.
And so for me, it was like, okay, I’m gonna leave MailChimp to go join a project that I really want to help see grow, not just on its own, but also integrating with other MLOps projects and open source communities, because I feel like we need that unification, frankly.
Brooks Patterson 05:43
Can you unpack the love-hate relationship and kind of speak to it, maybe from one side, like, why is there the hate? And then, now that you’re on the vendor side, how has your perspective changed?
Mikiko Bazeley 05:57
Yeah, absolutely. So, as a practitioner, you know, I am a bootcamp grad, right? The thing that I think is really cool about Featureform, that I only realized recently, is that all of us are UC grads, University of California grads, so we all went to public university, which is fantastic. We also went to, like, the surfer schools, oddly enough, like UC Santa Cruz, where a bunch of folks went; I went to UC San Diego. But we’re all kind of bootstrapping it, you know. And as a practitioner, I felt like a lot of times I had to kind of piece together my own stack. And frankly, it just didn’t feel like a lot of vendors or projects or open source tools were really helping with that. They were really good at just one thing, but then they weren’t thinking about the broader workflow that I was involved in as a data scientist. And then when I became a platform engineer, I’m like, oh, actually, this is very hard, and this can be very complicated. But as a platform engineer, to a certain degree, you have to be informed and be opinionated, and you really have to understand the data science user. And hopefully, that’s kind of the perspective and empathy I can bring to Featureform. We already have a culture of empathy and user centricity, I think, that is very well supported by Shabnam and Simba, especially since Shabnam came from Slack, where they were all about that user-centric experience. But I hope I just bring more of that data scientist flavor to it, and lend awareness of their pain points, for sure. Yeah. That’s really cool.
Simba Khadder 07:30
So all right. Let’s start with,
Kostas Pardalis 07:33
I have a question about MLOps. Okay. I want to ask you, what’s the difference between MLOps and DataOps? And I’m asking this because I think we’ve mentioned both terms so far. So what’s the difference? Why do we need both of them?
Simba Khadder 07:48
Yeah, I like to frame the difference not as between DataOps and MLOps; that’s kind of what the vendors have called themselves. I think the difference is between metrics and features. A metric is something, you know, it’s a metric if you’re using it in a spreadsheet, if you’re using it in slides, if you’re using it in a BI tool. Those are metrics, and metrics have the characteristic that typically there’s a correct answer, right? Like, there is your MRR last month. You might not know what it is, and you might have the wrong number, but somewhere in the space of possible numbers there is the right number. They’re typically relatively slow moving compared to features. You’re not experimenting to try to find the right one. You’re not like, oh, I wonder if I could just frame MRR this way; there’s kind of a right, you know, metric. So the problem space is very different. And the tooling, if you think of something like dbt, it’s all about templating; it’s very much about purposely putting in guardrails while also making it really easy to make forward momentum.
If you think about features, on the other hand, features have the characteristic that they’re used in models, in both training and perhaps in inference, depending on where the model is being used. When you’re doing feature engineering, iteration is much more random, for lack of a better word. You’re trying all kinds of different things; it’s not a straight line. We will do weird transformations like, hey, what’s my MRR, but cut all users that pay us less than 1,000 a month? Why? Because it makes our model better. Why does it make our model better? Who knows; it just does. So there’s a lot more movement and iteration, the characteristics you’d like are different, the use case of the person, like the data scientist doing ML, is inherently different, and their problem set is different. You want a whole different application layer than you’d want for what we call DataOps tools.
Now, I also think that in general, if you look at, let’s say, pick an orchestrator, and you call it a DataOps tool, I think in practice it’s not necessarily cleanly a DataOps tool. I think there’s almost this whole topology that we haven’t figured out yet. Some tools will kind of cross the chasm, but they won’t really be called DataOps tools. There’s almost an analytics ops that’s missing, and its analog, feature ops, which we’re kind of a part of, and there’s almost this generic thing that lives underneath. So I just think we don’t have the topology down. I also think there’s a bit of Conway’s Law. If we go forward five, six years from now, there’ll be some number of tools and startups that still exist, and those tools will be the topology. Whether it’s the best one doesn’t matter; it’s what’s left, you know?
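To make the metric-versus-feature distinction concrete, here is a minimal Python sketch. The `Subscription` type, the function names, and the sample numbers are all hypothetical and chosen only for illustration; they do not come from the episode. The metric has one definition and one right answer, while the feature is a parameterized variant that a data scientist might sweep over to see what helps a model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Subscription:
    user_id: str
    monthly_payment: float  # recurring revenue from this user, per month


def mrr(subs: Iterable[Subscription]) -> float:
    """Metric: monthly recurring revenue. One definition, one right answer."""
    return sum(s.monthly_payment for s in subs)


def mrr_above_threshold(subs: Iterable[Subscription], min_payment: float) -> float:
    """Feature candidate: MRR after cutting users who pay less than min_payment."""
    return sum(s.monthly_payment for s in subs if s.monthly_payment >= min_payment)


# Hypothetical sample data.
subs = [
    Subscription("a", 50.0),
    Subscription("b", 1200.0),
    Subscription("c", 3000.0),
]

# The metric is computed once and reported.
reported_mrr = mrr(subs)

# The feature is iterated on: try several cutoffs and keep whichever
# one makes the model better, even if nobody can say why.
candidates = {cutoff: mrr_above_threshold(subs, cutoff) for cutoff in (0, 100, 1000)}
```

The dictionary of candidates is the part that metric-oriented tooling tends not to anticipate: several slightly different definitions of the same quantity all need to be tracked side by side.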
Kostas Pardalis 10:32
Yeah. Yeah. Makes a lot of sense. And what’s in the MLOps product space, right? We have, obviously, feature stores here, which we are going to talk about, but what else is there that is unique to MLOps, right, and is not a crossover between DataOps and MLOps?
Simba Khadder 10:51
So I’m gonna badly paraphrase Mikiko’s great talk, which I’m sure we’re gonna post soon. There are four stages of the ML lifecycle. There’s data, let’s just call it data: data analysis, feature engineering, feature serving. That’s kind of one stage you’ll be in, the data stage. The next stage we would call the training stage, where you’re very much training the models, you’re hyperparameter tuning, you’re really kind of experimenting and iterating on the model itself. Once you have a trained model, you get to the deployment phase. This is where you take the trained model and deploy it in production; you might do canary tests, you might do all these things we’re used to doing for services. Finally, it’s in production, and the fourth stage, I’d say final, but I guess you kind of go back and forth, is evaluation. In the evaluation stage, you’re taking all the information on how the model’s doing, and you might have to go back into other parts of the process to make the model better. You’re constantly iterating. It’s a constant cycle; the models are never perfect. Maybe good enough, but, you know, those things always exist.
On the MLOps side, we’ve kind of broken it down into, let’s call it, three abstraction layers. I’ll go through all of them; there are actually five. At the very top is what we call the platform layer. The platform layer is what a data scientist interacts with. It goes across everything. And you can think of DataOps: DataOps has no idea what the model is, has no idea what evaluating a model looks like. So on the MLOps side, you need a platform that really understands the ML lifecycle end to end, and it’s one unified layer. Underneath it there are workflow tools. For data, that would be what we call a virtual feature store, which is kind of orchestrating the infrastructure; it’s the application layer that ML data scientists use to interact with their Spark, Redis, you name it. You need your experimentation piece, your experiment tracker: think MLflow, Weights & Biases, Comet. There’s your deployment workflow, which, again, think Spinnaker; it’s kind of like the CD for models. And finally, there’s the evaluation store, which takes the metrics that you’re storing and allows you to evaluate; it’s kind of the Kibana, in DevOps terms. So you almost have the Spinnaker, the Kibana, the data orchestrator. There isn’t really a good analog for training; I guess it would be like a build tool, maybe like the CI. And then underneath that are still the services: you need something that actually trains the model, something that actually serves the model, something that actually stores and transforms data, which could be Spark and Redis, and something that actually collects the logs. So that’s a very long answer to say there’s a very wide space and a lot of problems to be solved. And that’s just one view of it; there are probably 100 vendors that don’t fit into the framework I described that are still very valuable tools.
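The framework described above can be written down as a small lookup table, which may make it easier to see how the four lifecycle stages line up with the workflow tools and their rough DevOps analogs. This is only a sketch of one reading of the answer: the `Stage` enum, the dictionary keys, and the simplification that evaluation always loops back to the data stage are assumptions made here, not something stated in the episode.

```python
from enum import Enum


class Stage(Enum):
    DATA = "data"
    TRAINING = "training"
    DEPLOYMENT = "deployment"
    EVALUATION = "evaluation"


# Workflow-layer tool for each stage, with the DevOps analogy given in the answer.
WORKFLOW_LAYER = {
    Stage.DATA: {"tool": "virtual feature store", "devops_analog": "data orchestrator"},
    Stage.TRAINING: {"tool": "experiment tracker", "devops_analog": "build tool / CI"},
    Stage.DEPLOYMENT: {"tool": "deployment workflow", "devops_analog": "Spinnaker (CD)"},
    Stage.EVALUATION: {"tool": "evaluation store", "devops_analog": "Kibana"},
}


def next_stage(stage: Stage) -> Stage:
    """Advance through the lifecycle, wrapping from evaluation back to data.

    Simplification: in practice evaluation can send you back to any
    earlier stage, not only the first one.
    """
    order = list(Stage)
    return order[(order.index(stage) + 1) % len(order)]
```

The wrap-around in `next_stage` is the point of the "constant cycle" remark: there is no terminal stage.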
Kostas Pardalis 13:51
So what is Featureform?
Simba Khadder 13:55
We call ourselves a virtual feature store. We’re an open source product, so you can go check us out on GitHub. We are a place for data scientists to define, manage, and serve features. To imagine that: if you’re a data scientist listening to this, I bet you right now there is a notebook that you’re using for work that’s called Untitled118.ipynb that you’re copying and pasting from. You have some Google Docs full of useful SQL snippets. You know that there’s one person in the company that knows how a training set was built, and you have to go Slack them tomorrow so that they can remind you where to find, you know, whatever data. There’s so much ad hoc-ness in the process of features. It’s just completely made up. What we’ve figured out, and what I think a lot of people are focused on, is getting the right data infra tools. We know how to handle a lot of data. So what I call the platform problem, like scaling, having low-latency serving, having high throughput, we’ve solved that problem, I think, pretty well. What we have unsolved is making a product that is usable and valuable for data scientists to interact with that layer. So the virtual feature store you can think of as an application layer over your existing data infrastructure, which provides the versioning, access control, governance, a nice API for serving, everything you would need for your features. That whole workflow and orchestration, Featureform does using your existing infra.
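The idea of an application layer that adds definitions, versioning, and a serving API on top of storage you already run can be sketched in a few lines of Python. To be clear, this is not Featureform’s API: `FeatureDef`, `FeatureRegistry`, and their methods are hypothetical names invented for this illustration, and a plain dictionary stands in for an existing online store such as Redis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class FeatureDef:
    name: str
    version: str
    owner: str  # who to ask, instead of hunting for "the one person who knows"
    transform: Callable[[dict], float]


class FeatureRegistry:
    """Hypothetical application layer: it owns definitions, not the storage."""

    def __init__(self, online_store: Dict[Tuple[str, str, str], float]) -> None:
        self._defs: Dict[Tuple[str, str], FeatureDef] = {}
        self._store = online_store  # existing infrastructure, passed in

    def register(self, feature: FeatureDef) -> None:
        key = (feature.name, feature.version)
        if key in self._defs:
            raise ValueError(f"{feature.name}:{feature.version} is already registered")
        self._defs[key] = feature

    def materialize(self, name: str, version: str, entity_id: str, raw_row: dict) -> None:
        """Run the registered transformation and write the result to the store."""
        feature = self._defs[(name, version)]
        self._store[(name, version, entity_id)] = feature.transform(raw_row)

    def serve(self, name: str, version: str, entity_id: str) -> float:
        return self._store[(name, version, entity_id)]


store: Dict[Tuple[str, str, str], float] = {}  # stand-in for an existing store
registry = FeatureRegistry(store)
registry.register(FeatureDef(
    "avg_purchase", "v1", "simba",
    lambda row: row["total_spend"] / row["num_orders"],
))
registry.register(FeatureDef(
    "avg_purchase", "v2", "mikiko",
    lambda row: (row["total_spend"] - row["refunds"]) / row["num_orders"],
))

raw = {"total_spend": 100.0, "num_orders": 4, "refunds": 20.0}
registry.materialize("avg_purchase", "v1", "user_1", raw)
registry.materialize("avg_purchase", "v2", "user_1", raw)
```

Both versions of `avg_purchase` stay addressable by name and version, which is the part that an untitled notebook and a document of SQL snippets cannot give you.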
Kostas Pardalis 15:25
Okay. And feature store is a term that has been around for a while now, in the tech industry at least. So what is the difference between Featureform and what was there before, right, like the other feature stores that are out there? I got a hint from you: it’s virtual. But I think it’d be great to understand what that means, virtual, compared to the existing solutions out there, right?
Simba Khadder 15:54
Yeah. So we’ve broken down feature stores into three categories. We call them the literal, the physical, and the virtual. I’ve talked about the virtual, so I’ll talk about the other two. The literal feature store literally stores features. There’s a handful; I mean, even SageMaker’s feature store and Vertex’s feature store kind of follow this architecture, where you create your features elsewhere, and then you finally store them in the literal feature store. The value prop is that all your features are in one place, and that place can serve for both training and inference. Now, the thing that we believe is missing is that your untitled notebook is still there, and you’re still doing your Spark jobs; all that changes is that rather than writing to Redis and S3, you’re just writing to one place. There’s definitely some value there, but we think it misses the main pain point that data scientists have.
Versus the physical, which kind of does it all. It does what we’re describing, the transformations, the storage, all of it, but it does it on its own infra. So it’s a really heavy tool; they sometimes call themselves a feature engine. This is kind of how we built it at our last company, and you can look at some of the internal things like Michelangelo and Zipline; they kind of follow this. It’s really hard to implement, it kind of replaces existing infrastructure, and it’s really expensive. You have to write all your features in the new DSL that they’ve created. It’s doing everything at the cost of super high adoption costs, sometimes impossibly high adoption costs. It gets impossible if you’re, like, a large bank to get all of your data to go through one place and be processed in one place that some startup is running. So that’s one thing.
The other thing, beyond adoption cost, is there’s this lock-in. For example, if that physical feature store doesn’t do streaming the way you’d like, or can’t handle some transformation you want to do, you’re out of luck. The virtual, just to complete it, and it’s a little bit of a repeat from before, is what would happen if you took that physical feature store, chopped out all the actual processing and storage, and provided nice, clean puzzle pieces for you to plug in your existing infra. And what happens in reality is it’s actually like the physical feature store and much more, because we have customers who are very much like, yeah, we have multiple Spark clusters, we have Redis for this team, and we have Cassandra for that team. You kind of get this data-mesh-y, heterogeneous infrastructure thing, which is more true to form for enterprise, while having one unified application layer. And actually, that’s the future for a lot of this stuff: choosing the right data providers to get the characteristics you need from your platform, and using the best-in-class API. Featureform’s name is actually a nod to Terraform. We kind of have this idea where it’s like, yeah, Terraform isn’t a cloud provider, right? Well, Terraform, or HashiCorp, rather. But Terraform is the best API. Would you ever build Terraform yourself? No. Why would you? It’s become this perfect API that fits well for the use case. So our goal is to build the best possible APIs that your data scientists love to use, while giving you the flexibility to get all the other characteristics you need.
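The "puzzle pieces" idea, one application layer routing to whatever storage each team already runs, can be sketched with a small provider interface. Everything named here is an assumption made for illustration: `OnlineStore`, `InMemoryStore`, and `VirtualFeatureStore` are hypothetical, and the in-memory class merely stands in for real systems such as Redis or Cassandra rather than talking to them.

```python
from typing import Dict, Protocol


class OnlineStore(Protocol):
    """Minimal interface a backing store must offer (an assumption of this sketch)."""

    def get(self, key: str) -> float: ...

    def set(self, key: str, value: float) -> None: ...


class InMemoryStore:
    """Stand-in for a real store such as Redis or Cassandra; keeps data in a dict."""

    def __init__(self) -> None:
        self.data: Dict[str, float] = {}

    def get(self, key: str) -> float:
        return self.data[key]

    def set(self, key: str, value: float) -> None:
        self.data[key] = value


class VirtualFeatureStore:
    """One API for data scientists; storage stays wherever each team keeps it."""

    def __init__(self, providers: Dict[str, OnlineStore]) -> None:
        self._providers = providers
        self._routing: Dict[str, str] = {}  # feature name -> provider name

    def register(self, feature: str, provider: str) -> None:
        if provider not in self._providers:
            raise KeyError(f"unknown provider: {provider}")
        self._routing[feature] = provider

    def _store_for(self, feature: str) -> OnlineStore:
        return self._providers[self._routing[feature]]

    def write(self, feature: str, entity_id: str, value: float) -> None:
        self._store_for(feature).set(f"{feature}:{entity_id}", value)

    def read(self, feature: str, entity_id: str) -> float:
        return self._store_for(feature).get(f"{feature}:{entity_id}")


team_a_redis = InMemoryStore()      # pretend this is team A's Redis
team_b_cassandra = InMemoryStore()  # pretend this is team B's Cassandra
fs = VirtualFeatureStore({
    "team_a_redis": team_a_redis,
    "team_b_cassandra": team_b_cassandra,
})
fs.register("avg_order_value", "team_a_redis")
fs.register("days_since_signup", "team_b_cassandra")
fs.write("avg_order_value", "user_1", 42.5)
fs.write("days_since_signup", "user_1", 120.0)
```

Swapping a provider means changing one registration, not rewriting features in a new DSL, which is the contrast with the physical feature store drawn in the answer.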
Kostas Pardalis 19:06
Yeah, that makes total sense. And one last question from me: you mentioned that features are different from metrics. Do we also need a different storage layer for features, like, are the database systems that we use at the end different, or is it just, let’s say, the same data lakes, the same OLTP databases, or things like Redis caching layers that we use to also store the features?
Simba Khadder 19:33
So, underneath, typically you can plug in whatever you want. We have people who use Postgres; Redis is a very common one, probably the most common one for inference. For offline stores, it could be S3, could be HDFS, could be Snowflake. We think that those companies are amazing at solving those problems, you know, being able to have really low-latency reads or whatever SLAs you need, and balancing costs. That’s what they do all day long. That’s what they think about. So why try to beat them? If we tried to play that game, we’re gonna lose; we’re gonna build the worst Redis. Why build the worst Redis? Let’s just let you plug in Redis. And even managing it, it’s like, yeah, we’re not gonna manage Redis as well as Redis Labs, or as well as the company itself doing it. So, you know, that’s yours. We’ll just do that application layer. One thing I’m going to add, and it’s not what you asked, but I think it’s interesting: one other type of database that we think a lot about is the vector database. That would be
Kostas Pardalis 20:23
my next question, actually: what is a vector database, and how does it differ from a feature store, right?
Simba Khadder 20:28
Yeah. So to understand the vector database, you sort of have to understand this concept of an embedding. You can think of an embedding as, well, it’s literally a vector; it’s like a float array that we treat as a vector. And you actually think of it as a point in that space. So if you have a 3D embedding, it’s literally a point in 3D space. How we create that embedding, I’ll spare you, but how you can think of it is: let’s say I’m building a recommender system. This is literally what we did. I could go and look at Kostas’s buying history, and my buying history, and all of our buying histories, and I have this thing called a transformer that magically takes that and turns it into a vector. That vector is a holistic representation of our buying behavior. I can actually plug it into our models: rather than putting your name or your ID, I can put a vector, and it’s going to be a more dense representation of you. There’s also another interesting characteristic that happens, where users with similar buying behaviors are near each other in latent space. If it’s, again, a 3D embedding, they will literally be close to each other in space. So that’s the unique property: things that are similar are near each other in latent space. What "similar" means depends; for users it’s behavior, and if it’s text, it’s that this text is viewed similarly according to this transformer. And you need something that can quickly get nearest neighbors, like, hey, I want to make a recommendation, I want to go and pull the 10 closest items to this item so I can show similar items here. So it’s an index, and these indexes have existed for a while, because the exact nearest neighbor problem is relatively slow; you kind of have to brute force it. You can get approximate nearest neighbors relatively faster, for sure. And those indices have existed forever; not forever, but for a long time.
I remember building, I’ve built a vector database myself probably four times in my career now. And what was missing was everything else that makes a database a database, right? A database isn’t just an in-memory index. So the vector database takes this index and builds everything else to fit it into being a database. And, yeah, I’ll give one more example where embeddings get really interesting. For YouTube, the way YouTube works is there are two models that do their recommendation. The first model is candidate generation: it will grab the thousand nearest-neighbor videos to your user embedding, and then they’ll take those embeddings and feed them into a second model to do ranking. So they actually use embeddings. This multi-model idea of chaining models together and using embeddings as this kind of intermediary language is the future of ML. It has been for many years, but now I think people are becoming more aware of it. And I think, you know, even prompts and LLMs and all this, eventually it’s all gonna break down to: hey, we have transformers that generate embeddings. Everything is kind of going to look like either embedding to whatever, like images, or image to embedding, but everything falls back into that space.
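The two ideas in this answer, nearest-neighbor lookup over embeddings and a candidate-generation step followed by a ranking step, can be shown with a short Python sketch. It uses exact brute-force search with Euclidean distance, which is the slow baseline mentioned above rather than the approximate indexes a vector database would use. The 3D embeddings, the video names, and the `predicted_watch_time` scores are made up for illustration, and the second stage is a simple lookup standing in for a real ranking model.

```python
import math
from typing import Callable, Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query: Sequence[float],
                      embeddings: Dict[str, Sequence[float]],
                      k: int) -> List[str]:
    """Exact brute-force k-NN: compare the query against every stored vector."""
    by_distance = sorted(embeddings, key=lambda item: euclidean(query, embeddings[item]))
    return by_distance[:k]


def recommend(user_embedding: Sequence[float],
              video_embeddings: Dict[str, Sequence[float]],
              score: Callable[[str], float],
              n_candidates: int,
              n_results: int) -> List[str]:
    """Stage 1: candidate generation by nearest neighbors. Stage 2: rank candidates."""
    candidates = nearest_neighbors(user_embedding, video_embeddings, n_candidates)
    return sorted(candidates, key=score, reverse=True)[:n_results]


# Hypothetical 3D embeddings; real ones have hundreds of dimensions.
videos = {
    "cooking": (0.9, 0.1, 0.0),
    "baking": (0.8, 0.2, 0.1),
    "grilling": (0.7, 0.0, 0.2),
    "chess": (0.0, 0.1, 0.9),
}
user = (1.0, 0.0, 0.0)

# Hypothetical output of a second-stage ranking model.
predicted_watch_time = {"cooking": 0.2, "baking": 0.9, "grilling": 0.5, "chess": 1.0}

closest = nearest_neighbors(user, videos, k=2)
picks = recommend(user, videos, predicted_watch_time.__getitem__, n_candidates=3, n_results=2)
```

Note that "chess" has the highest ranking score but never reaches the second stage, because it is far from the user in embedding space. That filtering is the job of the candidate-generation step.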
Kostas Pardalis 23:35
Yep. Okay, that was one of the best explanations of all these technologies. I think everyone’s a little bit confused lately with all the hype around all these things, but that was amazing. Thank you so much. Brooks, the microphone is back to you again; it’s all yours.
Brooks Patterson 23:56
Yeah. One thing I know I can easily do as I fill in for Eric is say we are at the buzzer, which he always says. But thank you so much for coming and chatting with us. I know it’s a busy conference for y’all. So yeah, thanks for coming and teaching us more about MLOps. One last question before we sign off here: for listeners who want to find out more about Featureform, where can they go? You mentioned GitHub, but where else can they go?
Shabnam Mokhtarani 24:25
So we also have a website, featureform.com. We’re on LinkedIn and on Twitter as FeatureformML. We also have a Slack community that you can find on our site, and we encourage you guys to join and ask any questions. We share updates there, both around the product and webinars, meetups, events, all the great content that the team produces. So yeah, join us.
Brooks Patterson 24:53
Awesome. Yeah, Featureform. Check it out. Thanks for joining The Data Stack Show. Please subscribe to the show. And we’ll catch you next time.
Eric Dodds 25:03
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.