On this week’s episode of The Data Stack Show, Kostas is joined by Willem Pienaar, tech lead at Tecton, to discuss machine learning, features, and feature stores.
Highlights from this week’s episode include:
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show where we talk with data engineers, data teams, data scientists, and the teams and people consuming data products. I’m Eric Dodds.
Kostas Pardalis 00:16
And I’m Kostas Pardalis. Join us each week as we explore the world of data and meet the people shaping it.
Kostas Pardalis 00:24
Welcome to another episode of The Data Stack Show, I’m Kostas, and today I have the pleasure to host Willem Pienaar, tech lead at Tecton in another episode where we will be discussing feature stores, MLOps, and open source. Willem is working in one of the hottest startups right now around feature stores. And he’s also the maintainer of probably one of the best open source feature store solutions out there. So we will have the opportunity to chat with him and dive into what feature stores are, why they are building them, why they are using them, what MLOps is about, and how open source is important in this new wave of technology that is supporting machine learning. Unfortunately, today, Eric is not going to join us. But we will have a good time discussing with Willem and many things to learn from him. So let’s dive in.
Kostas Pardalis 01:17
Welcome, everyone, to another episode of The Data Stack Show. Today I have a very special guest. His name is Willem Pienaar. We are going to be discussing a quite recent development in the data space, which is feature stores, but also everything around data and machine learning. And I’m really excited to have this conversation with him. Welcome, Willem, how are you?
Willem Pienaar 01:45
Thanks, Kostas, I’m great. Thanks for having me on the show.
Kostas Pardalis 01:48
Yeah, of course. So would you like to start by giving us a quick introduction and a little bit of the background story about you?
Willem Pienaar 01:58
Sure. So I can give you a quick background. I’m South African born and raised. I grew up there and studied mechanical and electronic engineering. I built a company while I was a student, networking company, and then sold that. After that in South Africa worked in control systems, engineering, industrial automation. I did that for a few years, and eventually emigrated to Thailand, where I worked kind of as a software engineer/electronics engineer, all the way from electronic devices in manufacturing up to the cloud. We did a lot of work for multinational corporations, and a lot of IoT work, did a lot of work in kind of the jungles of Indonesia, where we built remote sensors and a lot of like streaming data from power plants in the jungle to central control systems and things like that. So I’ve been in and around kind of the engineering space, the data space, kind of vertical solutions for a while. And after working there for a few years, I moved to Singapore, where I joined at the time, kind of like a company that had been deemed a rocket ship that just crossed 1 billion in valuation as an Indonesian company called Gojek.
Kostas Pardalis 03:20
Oh, wow.
Willem Pienaar 03:21
It’s currently a $10 billion company. At the time they were mostly focused on ride hailing as their core product, but today they are a multi-application or multi-product platform. So they do food deliveries, digital payments, lifestyle services, so you can get somebody to come and fix your car. Or you can pay and get air time, like data for your phone, and things like that. So there’s a service for every single need that you have in the day: if you need a motorcycle, taxi, car, delivery, groceries, you get like 17 different services. So I joined that team, and our directive at the start was basically to get ML into production, because they were sitting on mountains of data, just like Uber and Lyft and all of these companies, but they weren’t leveraging that at all. And they had a bunch of data scientists that they’d hired, but those folks couldn’t get into production because the engineers building the products weren’t incentivized to really help them. So I was the engineering lead that helped them build the initial systems, actual ML systems, so not products so much. And our team started off as kind of embedded in the data science team. Eventually we became a platform team. And then we ended up building a lot of data products and data tooling. So at that point, that was about two years into being a coder.
Willem Pienaar 04:50
Our team focused on the end-to-end ML lifecycle. We were only about 10 to 15 folks at the time. The data scientists were like 50 to 60, and so we kind of pivoted towards building tools at large that lots of other teams could use: APIs, UI services, the most scalable approach that we could find. And, you know, some of the things that we worked on were the feature store and model serving and training and schedulers and, you know, versioning, and processing of data and experimentation systems. It was all kind of in our purview, and all ML focused. Yeah. So I was at Gojek for about four years, and I just recently joined Tecton. It’s been a match made in heaven. Tecton is a company that’s focused purely on feature stores. And that was kind of my specialty at Gojek. And, you know, I also led the team that built Feast, which is the feature store that we built with Google at Gojek. And so at Tecton, our focus is primarily to build a world-class feature store. And we have two products that we’re kind of building out there. And we’re trying to build towards a unified vision for them. So that’s kind of the short story in a long form.
Kostas Pardalis 06:10
Well, that’s quite the journey, both geographically speaking and also in terms of your career. That’s amazing. Cool. Can you share a little bit more about Tecton? I mean, you said that Tecton is focusing mainly on building a feature store. Can you tell us a little bit more about the company and the product itself? And then we are going to get into more detail about feature stores in general.
Willem Pienaar 06:37
That’s a good question. So Tecton was founded by the original folks that built the Michelangelo platform at Uber. And I think most people in the data space have heard of that. So that was kind of a seminal, internal, proprietary platform that was built at Uber. And, you know, it was sold as something that democratized machine learning. That’s a very overused term. But it was widely used within Uber to kind of productionize both data and models, and what a lot of people told us is that that system was actually used for a lot of ETA and iteration and development and not just for productionization. But it’s a very famous system. So they left there. So it’s Mike, who was a PM on that project, and Kevin, who was well known as an engineering leader. And they founded Tecton. And I think they started in stealth in 2019. So they had been secretly starting to build a feature store startup, and they’ve grown the team. Prior to me joining there were about 23 people; I think I was the 24th or the 25th person to join. So they’ve got a very advanced … I’d almost go as far as to say it’s the leading feature store right now, at least among those that are publicly available, whether open source or proprietary paid. And it’s a complete end-to-end feature store. And it addresses both enterprises and small startups. It’s not fully open to the public right now. So you need to, obviously, sign up and pay and go through the normal sales channels. But it’s something that we want to get in everybody’s hands in the future. But there are some specific differences between the products, between Tecton and Feast. But we can get into that a bit later.
Kostas Pardalis 08:21
Yeah, absolutely. Quick question. Before we move forwards. You mentioned about being the most advanced feature store right now, like in the market. I mean, my background is mainly on data engineering, to be honest. I’m not a person who has worked in ML, so I know about feature stores, but I haven’t used them extensively myself. So I did a bit of research. And I tried to see what is available out there. And what I’ve seen and noticed is that there are many technologies that are coming from very big corporations, like you mentioned, for example, Michelangelo, I found like, what Airbnb is doing. I think they have Zipline.
Willem Pienaar 09:02
Yes, Zipline is their feature store and Bighead is their ML platform.
Kostas Pardalis 09:06
Yeah, so it looks like every big corporation has pretty much come up with their own architecture. But at least I didn’t manage to find that many open source solutions. Are there open source solutions other than Feast out there?
Willem Pienaar 09:19
So there’s Hopsworks. They’re one that came out more or less at the same time as us. They were kind of a Hadoop-focused one. They had proprietary underlying technologies, like file systems and things. There are smaller ones. I think there’s one called Butterfree that I recently saw that seems a little bit nascent, but they’re coming out, and I believe that some of the proprietary feature stores will be open sourced this year as well. At least there have been some rumors.
Kostas Pardalis 09:53
Yeah. That’s good. That’s exciting. Cool. Okay, so let’s move forward into a little bit of more like technical details. And let’s start with first of all, what is a feature? I mean, we’re talking about feature stores.
Willem Pienaar 10:06
Yeah. I mean, the simplest answer to that is, I mean, I think we have advanced answers, but it’s an input to a model. So it’s literally a data point that is used in a model to make a prediction.
Kostas Pardalis 10:21
And how is I mean, how is it different compared to, let’s say, the typical data types that we have in the database? Like, what’s the difference there? Or is it pretty much the same thing, just like buckets but in a different way?
Willem Pienaar 10:36
I think it’s more of an abstract label that is assigned to specific data, because it depends on the context in which the data is used. You can take raw event data and feed it into a model, and it can be considered a feature. So when it’s fed into the model and it has some kind of influence on the outcome that the model is producing, that’s when it becomes a feature. But in terms of the data types that you’re feeding to the model, it’s almost always integers, or, you know, floats, or binary values. If you’re feeding strings or bytes, often the model then has the capability to, you know, interpret those types, but it’s normally really primitive types that you’re feeding into a model. And in most cases, though not in all cases, these features are only valuable once you’ve aggregated them to some degree. So you look at, say, the number of purchases that a user has made, or some kind of value that allows the model to make a stronger inference about a user or customer or whatever the entity is that you care about, that you’re making the prediction about. So typically a feature is aggregated data, but it can also be raw data.
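As a rough illustration of the aggregation idea described above, here is a minimal Python sketch that turns raw purchase events into per-user numeric features. The event layout and the feature names (`purchase_count`, `total_spend`) are assumptions made for this example; they do not come from the conversation or from any particular feature store.

```python
from collections import defaultdict

# Hypothetical raw purchase events as (user_id, amount) pairs.
raw_events = [
    ("user_1", 12.5),
    ("user_2", 40.0),
    ("user_1", 7.5),
    ("user_1", 30.0),
]


def aggregate_purchase_features(events):
    """Aggregate raw events into numeric features keyed by entity (user)."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for user_id, amount in events:
        counts[user_id] += 1
        totals[user_id] += amount
    return {
        user_id: {
            "purchase_count": counts[user_id],
            "total_spend": totals[user_id],
        }
        for user_id in counts
    }


features = aggregate_purchase_features(raw_events)
print(features["user_1"])
```

The raw amounts could be fed to a model directly, but the per-user count and total are the kind of aggregated values that tend to carry more signal about the entity being predicted on.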
Kostas Pardalis 11:53
But like, the most common case is to have some kind of aggregation, right?
Willem Pienaar 11:58
Yes, yes.
Kostas Pardalis 12:00
That’s interesting. And can you give us a little bit of background around the lifecycle of a feature? As you said, it can be raw data or an aggregation. How do we come up with a feature? How do we start from the raw data that we get, and how do we end up with a feature that we can store in a feature store and use on our models, online or offline for training?
Willem Pienaar 12:23
I’ll give you the non-feature store flow first, because the feature stores are all different. In the non-feature store flow, the user exports some historical data from a warehouse or some lake that the company has organized for them. So they sample the data, and then they, you know, take like 10,000 or 100,000 rows. And then they just process that data, and then they train a model on that, and then look at the model’s performance. And then typically, they would ship that model into production somehow, and then get an engineering team to kind of rewrite those transformations on the real-time event stream or transactional data that’s available in production. And as the systems are transacting, that data is fed to the model, and it can make predictions. And if you look at that flow, you could also productionize it: the training part that the data scientist did at the start could be extended to have more data, and it could be automated through Airflow or some pipelining system. But that’s kind of the high-level flow. And so the feature in that story is, you know, the transformation that’s made on the raw data. And it is fed into the model during training. Often you will log a list of features, as, you know, strings, column names, with the model binary, and you can then reference those same features in production, because all of your models would probably have different lists of features that they’re, you know, kind of referencing. And so the lifecycle continues to production. And then somehow, you need to tie the data sources that you have in production to the list of features that’s saved with that model binary.
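The step of logging a feature list with the model and then selecting the same columns in production can be sketched roughly as follows. This is a minimal, hypothetical example: the JSON metadata format and the function names are assumptions for illustration, not how any specific serving system does it.

```python
import json


def save_model_metadata(feature_names):
    """Training side: record the ordered feature names next to the model artifact."""
    return json.dumps({"features": feature_names})


def build_model_input(metadata_json, production_row):
    """Serving side: pick exactly the logged features, in the logged order."""
    feature_names = json.loads(metadata_json)["features"]
    missing = [name for name in feature_names if name not in production_row]
    if missing:
        # Fail loudly rather than feed the model the wrong columns.
        raise KeyError(f"production data is missing features: {missing}")
    return [production_row[name] for name in feature_names]


metadata = save_model_metadata(["purchase_count", "total_spend"])
row = {"total_spend": 50.0, "purchase_count": 3, "unrelated_column": "x"}
model_input = build_model_input(metadata, row)
print(model_input)
```

Keeping the feature list with the model is what lets the serving side notice a missing or renamed column instead of silently producing a skewed prediction.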
Willem Pienaar 14:04
So your model serving infrastructure needs to know how to select the right columns, and data points in production and feed that to the model. Otherwise, you’re going to have like, you know, a skew or some, you know, if the wrong features are being fed to the model, it’s just going to be an inaccurate prediction. So that’s a typical flow and how the feature stores fit into this. I’m not sure if we want to get into the feature stores, but the life cycle is extended to … it’s kind of split in that the feature store provides two interfaces, one at the training time and one at the serving time and it prevents you from or it removes the need to kind of re-engineer features. And it gives you a kind of unified interface to the same data, same features. We can get into that in a bit.
Willem Pienaar 14:48
But just the final part of the lifecycle. I guess the final place where you would look at the lifecycle of the feature because you’ve made that prediction is you would have an experimentation system that tracks the outcome of the prediction. And if the outcome is good, then you could go back and say, these features are actually predictive, and if the outcome is bad, then you can say, well, maybe these features are the problem, or maybe the model type is the problem. Maybe there’s some intrinsic problem with the kind of way that we frame the problem, the problem domain. But, yeah, so you’d want to have the model itself. And all the logic that you have around it, and the features as part of the collection of artifacts that are associated with an outcome in an experiment. And by experiment, I mean, like, let’s say, for the website, you are A/B testing two models, and those models might be recommending specific products. So you can measure based on user behavior, which model is doing best, and the features are the primary influence there.
Kostas Pardalis 15:48
That’s super interesting. Actually, I find it fascinating. It’s a completely different type of complexity when you’re serving models compared to a software product and how you serve it. You have, again, operations, you have a life cycle, and you have similarities, but at the same time the tools are unique, and that’s the theme I’m getting from you. And the methodologies are different. And I’m really happy that I have you here today to learn more about that.
Kostas Pardalis 16:14
Okay, we talked about what a feature is. And we touched a little bit also on feature stores, so let’s get a little bit more into the feature store itself. You mentioned something about two different phases, one for the training part and one when the model is online. What is a feature store in the end, and how is it different from a database or a data store in general, where we store data? And what are the components there?
Willem Pienaar 16:40
Yeah, this is something I’ve kind of thought about a lot. And the best way I can explain it is that the feature store is an opinionated data system that allows you to operationalize data for machine learning. So it’s a data system meant for machine learning. And it has some unique properties based on the requirements that machine learning models have. So by the way, this definition is not universal, because all feature stores basically are different and people have different opinions about what a feature store should be. But there are some characteristics that make up most feature stores. So the one that I think is extremely important is that a feature store provides a kind of unified, consistent interface for you in the offline and online worlds. So with models, in one part of the life cycle you’re training the model, and in the next you are serving that model in production. That production could be online serving, or it could also be batch scoring, where you have a large batch of data that you want to make predictions on. But an important failure mode that we often see in production systems that don’t have a feature store is that there has to be a re-engineering of features in both environments, because typically there are different teams working in different environments: data scientists working with Python on the offline side, and then engineers with Golang and Java on the production side. And so they end up re-engineering a lot of these features. And that causes drift and problems with models. So the feature store provides a single interface between your model and the data. And so it literally is an API or SDK that allows you to pull data, and it serves the data to your model. And it ensures the quality of the data going to that model. And that fundamentally removes this kind of data drift, concept drift problem, although it depends on the architecture of the feature store, of course.
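The re-engineering problem described above comes from having two implementations of the same feature. A minimal sketch of the alternative, assuming a hypothetical event format, is to keep one feature definition and call it from both the training path and the serving path:

```python
def purchase_count(events):
    """Single shared feature definition: number of purchase events."""
    return sum(1 for event in events if event["type"] == "purchase")


def build_training_row(historical_events):
    # Offline path: used when assembling a training data set.
    return {"purchase_count": purchase_count(historical_events)}


def build_serving_row(live_events):
    # Online path: used when the model is serving predictions.
    return {"purchase_count": purchase_count(live_events)}


events = [{"type": "purchase"}, {"type": "view"}, {"type": "purchase"}]
training_row = build_training_row(events)
serving_row = build_serving_row(events)
print(training_row == serving_row)
```

Because both paths go through the same function, the same input events cannot produce different feature values in training and in production. A real feature store does this across languages and storage systems, which this sketch does not attempt.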
Willem Pienaar 18:37
Another problem that feature stores solve is feature reuse. So it allows you to define consistent definitions of features across those two contexts, between the offline and online world, the streaming and batch world. So you can define a transformation once, and other teams can see that definition and they can consume your features, they can fork that transformation, and then reapply that and create new features. So it allows for collaboration, it allows for reuse. That’s actually one of the biggest problems we had at Gojek: teams were just copying and pasting each other’s code, if they knew about it, but often they were just re-engineering the same features over and over, so they were recreating the same transformations. Now, this aspect is not necessarily unique to a feature store. But it’s something that it’s very uniquely positioned to do, because it really sits at the center of your machine learning. This is essentially the foundation of your machine learning architecture. So the feature store provides that consistent view, and it also provides an abstraction between the model and your data infrastructure. This is also something that we had massive problems with at Gojek, where teams would build training pipelines, and then they would write SQL queries that are basically running before model training. And in production they would have, you know, access to Redis and a lot of connectivity and boilerplate code. So feature stores decouple the process of creating and materializing features from the consumption of them, which in turn makes your models highly portable. So there’s no direct coupling or assumption that certain boilerplate code will be packaged with your model. And so I think those are the kind of key things that make a feature store unique: it’s this consistent view between both environments, and it also provides online serving capabilities. 
So it gives you low-latency access to features in production, and often it also gives you more advanced capabilities, like providing point-in-time guarantees. So it ensures that when you are training a model, the view that the model sees on historical data is accurate, and that it represents the same view that the model will see in the online case. This isn’t always easy to do, because you need to do a lot of kind of fuzzy as-of joins on data in order to ensure that you don’t accidentally leak future data to models. So to drill a little bit into that: as a data scientist, when you’re doing a join of 20 or so tables to produce a training data set, it’s very easy to accidentally access some future data. Like maybe it’s an aggregation that’s over a day, and you think that data that’s stored on today’s timestamp means that it was from the previous day, but actually it’s from the coming day. Now your model can see into the future when you’re training it. But when it actually gets deployed into production, it can’t get that data. And so it’s just wildly inaccurate. So those are subtle little things that trip up a lot of teams when they productionize models, and that a feature store helps with.
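The point-in-time idea described above can be sketched as an as-of lookup: for each training example, take the latest feature value whose timestamp is not later than the example's own timestamp. This is a minimal sketch using only the Python standard library; the integer timestamps and the data are made up for illustration, and a real feature store would do this as a join across many tables.

```python
from bisect import bisect_right

# Hypothetical feature history for one entity, sorted by timestamp:
# (timestamp, feature_value)
feature_history = [
    (1, 10),
    (5, 12),
    (9, 20),
]


def value_as_of(history, event_time):
    """Return the latest feature value with timestamp <= event_time, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no feature value existed yet at event_time
    return history[idx - 1][1]


# Hypothetical labelled events: (event_time, label)
label_events = [(0, 0), (4, 1), (5, 0), (8, 1)]

# Each training row only sees feature values that existed at its event time.
training_rows = [
    (value_as_of(feature_history, event_time), label)
    for event_time, label in label_events
]
print(training_rows)
```

The event at time 8 gets the value written at time 5, not the one written at time 9. A naive join on the nearest timestamp, or on the calendar day, could pick up the later value and leak future data into training.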
Kostas Pardalis 21:49
It’s very interesting. I’ll go back and ask about the feature again, just because I’m trying to make it more clear to myself, to be honest. So if I understand correctly, you want to think about the feature in an abstract way. Because initially, to be honest, when I was thinking about features and reading about them, I was thinking that in the end there is a database somewhere where you have some data stored, which is the result of doing a pre-aggregation, right? But the more we talk, I tend to think that the feature in the end is something much more complex than that. And it encapsulates more information than just the output of a transformation. So is it accurate to say that, in the end, the feature is a piece of code that actually executes the aggregation, or defines the aggregation or the type of processing that you want to do on the data, together with the source, because the data needs to come from somewhere and this cannot be arbitrary, it has to be well defined as part of the feature; the model, of course, that it is associated with; and also the time, right? Because something that we observe today, even if we are talking about the same data source and we use the same aggregation, it doesn’t mean that it’s going to be the same again tomorrow, or that it was the same yesterday. Does what I’m saying make sense?
Willem Pienaar 23:06
To some degree, but I would challenge you on some of that. Are you saying the feature is the definition of all those things?
Kostas Pardalis 23:11
Yes.
Willem Pienaar 23:12
It’s not clear to me how the model is associated with the feature here, or connected? Because normally, a model has a dependency on a range of features. But the feature has no awareness of models that consume it.
Kostas Pardalis 23:25
Okay. Yeah, I was thinking more about the model as being the entity that’s going to consume the feature. So in this sense, it makes sense like to associate with it. But yeah, I get your point now. The feature can live there and you can reuse the feature also with different models, if I understand correctly.
Willem Pienaar 23:42
Yeah, so if you disconnect the model there, you’ve got your input source data, and then you’ve got the transformation. Those are actually the only … that’s all you need to produce a specific feature. I don’t think time would be in the mix there. Because yes, over time, things would change. But if you change the transformation or the source data, then that is the input artifact that is changing in this if you have like a deterministic function that produces a specific feature. So if the input data changes, or if the transformation changes, it’s a new feature, or it’s a new version of the same feature. And feature stores also help you with tracking that. So if you have a feature store that allows for tracking of versions, then if one of those two things changed, there will be a new version of the feature. And interestingly, then, when you consume that feature, if your model has a dependency on an old feature, you’ll consume the old data and old transformation. And if you consume from the new version, it’ll be the new transformation or the new data. Also, I mean, there is an aspect of it that does depend on how you partition your data, like the time element does come in there. So if you’re just doing a refresh of the data, every week or month will be different, right? There’s a seasonality effect in data. So what we typically do is, we just consider those to be, we consider those to be the same feature, but different models. So it depends on, you can be really pedantic about the versioning there. But for refreshing models, it’s typically not that serious, as long as you have the right validation on your source data, and you can make sure that the effects of seasonality are not too wild. Sorry I’m digressing again here, but yeah, I’m completely with you.
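One way to picture the versioning rule described above (a new version whenever the source or the transformation changes) is to derive a version identifier from exactly those two inputs. This is a hypothetical sketch: hashing the definition is an assumption chosen for illustration, not a description of how Feast or Tecton track versions, and the source and transformation strings are made up.

```python
import hashlib


def feature_version(source, transformation):
    """Derive a short version id from the two inputs that define a feature."""
    payload = f"{source}\n{transformation}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


v1 = feature_version("warehouse.purchases", "count over 7 days")
v1_again = feature_version("warehouse.purchases", "count over 7 days")
v2 = feature_version("warehouse.purchases", "count over 30 days")

print(v1 == v1_again)  # same source and transformation, same version
print(v1 == v2)        # changed transformation, different version
```

A model that depends on the first identifier keeps reading the old definition, while a model built against the second gets the new one. Refreshed data under an unchanged definition keeps the same identifier, which matches treating a weekly or monthly refresh as the same feature.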
Kostas Pardalis 25:32
Okay, okay. Thank you so much. Now the feature is much more clear. Sorry, I really find this conversation an amazing opportunity for me to learn more about this stuff. So I might ask some silly questions. I know that there might be some people out there that might be much more advanced and work in this space. But yeah, I’m selfish. All right. So moving a little bit forward, staying on the feature store still. I just want to understand a little bit more how a feature store is architected. What are the components? If you look at it from a software engineering perspective, right, let’s say I want to start building a feature store, what kind of architecture should I expect to see there? And what are the main components of it?
Willem Pienaar 26:16
The traditional feature stores have an offline store. This is a place where you are going to materialize data. So essentially, you’re going to take data from some source, you’re going to use some kind of compute layer (this is another component that a feature store has), some transformation system like Spark, Airflow, it could even be an ELT stack, like a warehouse, and then you’re going to produce data, and then you’re going to store it in the offline store. That store is used by the feature store. And often you have an API, that’s your feature store API, that you query. It’ll then hit the offline store with a query, produce a training data set and export that for you to train your model on. The feature stores also have an online store. And so it will typically have an online API, which you will hit with a query in production. And that will be backed by, let’s say, a Dynamo, a Redis, some kind of low-latency store, key-value in almost all cases. And that store is also populated by these jobs that transform the data. The more advanced feature stores have, you know, some operational components as well. So if you look at Tecton (and a few others have some of these capabilities, but not as advanced as Tecton), it plugs into monitoring systems. It also has on-demand feature transformation services, so you can not just pre-compute features to be served, you can also do a transformation on the fly.
Willem Pienaar 26:25
So sometimes, let’s say you’ve got a driver making a booking on a ride hailing app: you only have their location when they’re making the booking, and you only have the location of the customer when they’re making the booking. So you can’t pre-compute that. But you still need to produce features that are dependent on those input variables. So Tecton has this ability to do on-the-fly feature computation. And you can actually define those transformations ahead of time, but they execute at runtime. So integration with monitoring systems, on-the-fly computation, pre-computed computation, offline store, online store, I’d say those are the primary components. And then the computations are either batch jobs or they’re streaming jobs. So if you’re doing transformations on streams, they’re long lived. And if you’re doing batch, then they’re just running on some schedule, like every day or every hour or something like that. Those are the canonical components of a feature store. But if you were listening to what I was saying earlier about what makes a feature store unique, and if you look at what Feast has implemented, I’d say the only thing that really needs to be there is the online store, and an ability to create training data from your offline data. So that’s kind of the essential complexity.
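The components listed above (an offline store, an online key-value store, jobs that populate both, and on-demand transformations that run at request time) can be sketched in a few lines of in-memory Python. Everything here is hypothetical and for illustration only: the class `TinyFeatureStore`, its method names, and the one-dimensional positions are assumptions, and this is not the API of Feast or Tecton.

```python
class TinyFeatureStore:
    """Illustrative in-memory sketch of the canonical components."""

    def __init__(self):
        self.offline_store = []  # history rows: (entity_id, timestamp, features)
        self.online_store = {}   # entity_id -> latest features (key-value)
        self.on_demand = {}      # feature name -> function of request data

    def materialize(self, entity_id, timestamp, features):
        """Stand-in for a batch or streaming job writing to both stores.

        Assumes calls arrive in timestamp order, so the last write is the latest.
        """
        self.offline_store.append((entity_id, timestamp, dict(features)))
        self.online_store[entity_id] = dict(features)

    def register_on_demand(self, name, fn):
        """Define a transformation ahead of time; it executes at request time."""
        self.on_demand[name] = fn

    def get_online_features(self, entity_id, request):
        """Low-latency path: precomputed values plus on-the-fly features."""
        features = dict(self.online_store.get(entity_id, {}))
        for name, fn in self.on_demand.items():
            features[name] = fn(request)
        return features

    def get_training_rows(self, entity_id):
        """Offline path: full history for building a training data set."""
        return [(ts, f) for eid, ts, f in self.offline_store if eid == entity_id]


store = TinyFeatureStore()
store.materialize("driver_7", 1, {"trips_last_7d": 40})
store.materialize("driver_7", 2, {"trips_last_7d": 42})
store.register_on_demand(
    "pickup_distance",
    lambda req: abs(req["driver_pos"] - req["customer_pos"]),
)
online = store.get_online_features(
    "driver_7", {"driver_pos": 3.0, "customer_pos": 7.5}
)
print(online)
```

The precomputed `trips_last_7d` value comes from the online store, while `pickup_distance` can only be computed once the request carries both locations, which is the booking case described above.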
Kostas Pardalis 29:07
That’s great. Actually, I was checking Feast at some point, and if I’m not mistaken, like in Feast, for example, you don’t have transformations there right?
Willem Pienaar 29:20
Correct. Yeah.
Kostas Pardalis 29:21
And that made me think. I was reading an article at some point where there was some kind of critique around feature stores. And actually, what they were saying is that feature stores are great, but feature stores are also something that needs to evolve as, let’s say, machine learning inside the organization evolves, right? If you start today to try and experiment and come up with some models and all that stuff, getting a full feature store is probably going to be overkill. So you mentioned two things that are the basic requirements to have a feature store, which are the offline training and the online serving of the data. What is the evolution? As the company grows, and as the company starts becoming more and more serious around the ML and the data science teams that they have, how do you also see the feature stores evolving there?
Willem Pienaar 30:12
That’s a great question. So this is something we’ve been thinking about a lot as well. Could a single data scientist use a feature store? Can a two-, three-man team deploy and run a feature store for a single use case? We haven’t found a use case for a single data scientist. But we believe that it’s possible for small teams. Let’s say there’s a company, they’ve got one team, this team has to build one model and get it into production, and they need a system that gives them kind of a structured way to get data into production without engineers being involved. They would deploy a feature store, and they would kind of just use that themselves. When more teams start to depend on or want to use feature stores, like they’re going to get more ML models into production that require features, or when that team iterates on the same ML system but with different iterations of the same model, so the type of model is the same, the problem it’s solving is the same, but they’ve got different variants, and each model needs to be tracked with lots of different features, then it makes sense to kind of double down on the feature store. And, I guess, you’d need a more advanced feature store, depending on whether you’re using Feast or a proprietary solution, as opposed to something you built yourself. But at some point, you can’t just have a Redis and maybe some Airflow scripts that are pushing data into production. You need to have something that’s providing you versioning, providing you tracking of features, you know, battle-tested APIs and things like that. But you can emerge and evolve from a solutions team that’s solving one problem to having that feature store owned by a platform team. That’s, I guess, the next step. 
So it’s a central engineering team that manages the feature store, they do things like provide access control, they make sure that data gets garbage collected in stores, they make sure that SLOs and SLAs are being met, that performance guarantees are being met, that if jobs are failing that they’re going to be the ones fixing that. Then you’ve essentially separated two worlds, right? On the one side, you have data engineers, data scientists, that originally, they were creating data like features, and they were taking their own models into production. And they were doing like end-to-end. But eventually, it becomes two worlds. One is data engineers or data scientists creating features, features that may or may not be used by them, it could be for other teams. And often what I’ve seen at kind of large companies is that analysts are being asked to do this. So they ask analysts to write like SQL, BigQuery SQL, Snowflake, all that stuff. Because analysts are really good at that. It’s efficient. And you create this wealth of like transformations on the one side. And then the feature store is just this layer that productionizes and operationalizes that data. And then on the other side, you have this catalog of features, that you as a user, you can just pick the ones that you want, based on metadata that’s stored on those features, train your model, iterate on that until you’re happy, and then productionize that. But you probably are not going to engineer any features, you might just reuse existing features. So I think that’s kind of like the final point at which you reach the end of your evolution, then it’s mostly about security and access control and scalability and, you know, enterprise functionality, and kind of that’s what Tecton is currently very good at. So Feast is something that is mostly deployed by teams that are more advanced than the single solutions team. Like it’s almost in all cases, a platform team.
But it’s not an enterprise feature store like Tecton.
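Editor’s illustration: the step Willem describes, moving from ad hoc scripts to something that versions and tracks features and lets users pick them by metadata, can be sketched in a few lines of Python. This is only a toy under stated assumptions: `FeatureDefinition`, `FeatureCatalog`, and the feature names are hypothetical and are not Feast or Tecton APIs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical catalog entry; not a Feast or Tecton class."""
    name: str
    version: int
    owner: str
    tags: frozenset = field(default_factory=frozenset)


class FeatureCatalog:
    """Toy in-memory registry that keeps every version of every feature."""

    def __init__(self):
        self._entries = {}  # (name, version) -> FeatureDefinition

    def register(self, name, owner, tags=()):
        # Registering an existing name creates the next version of it.
        version = 1 + max(
            (v for (n, v) in self._entries if n == name), default=0
        )
        definition = FeatureDefinition(name, version, owner, frozenset(tags))
        self._entries[(name, version)] = definition
        return definition

    def latest(self, name):
        versions = [v for (n, v) in self._entries if n == name]
        if not versions:
            raise KeyError(name)
        return self._entries[(name, max(versions))]

    def search(self, tag):
        # Latest version of every feature whose latest version has the tag.
        names = sorted({name for (name, _version) in self._entries})
        latest = [self.latest(name) for name in names]
        return [d for d in latest if tag in d.tags]


catalog = FeatureCatalog()
catalog.register("driver_trip_count_7d", owner="analytics", tags={"driver"})
catalog.register("driver_trip_count_7d", owner="analytics",
                 tags={"driver", "validated"})
catalog.register("customer_avg_order_value", owner="growth", tags={"customer"})

# A model builder picks features by metadata instead of engineering new ones.
selected = [d.name for d in catalog.search("driver")]
```

A real feature store would persist this registry and attach access control and lineage to it; the point here is only the shape of "register, version, discover by metadata".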
Kostas Pardalis 33:59
Fascinating. Yeah, absolutely. Absolutely. And it’s very interesting to hear about that. So feature stores are something like quite new, right? It’s a new concept in terms of technology. You touched on many different parts of it and I assume that there’s also different maturity on these parts today. What parts or components of a feature store do you see where there’s a lot of space for improvement right now? And where do you think like the direction is going to, both from your experience in your previous companies that you were at like, also Feast, but also at Tecton because from what I understand Tecton is also like interacting with a different type, a more enterprise type of company, which usually they have all sorts of different requirements?
Willem Pienaar 34:43
That’s a very good question. This is a very, I think the tricky one here to solve is who you’re addressing. The biggest problem with the feature store today is that it solves many problems because it’s uniquely positioned to solve those problems. And so it becomes this platform that you know, it’s kind of a Frankenstein monster. So I think feature stores will evolve in different directions. And they will be more focused over time. So I think you’ll see kind of a split between feature stores that are more focused on the solution teams and the kind of smaller teams, and then you’ll see ones that are focused on the platforms and enterprises. And their needs are different. So I think that basic problem is already somewhat solved. If you look at Spark transformations, or dbt, it’s not perfect. But there are solutions in creating features. And the kind of focus right now is not so much how do you create features, how do you compute them, how do you store them, and how do you serve them? It’s how do you do everything around that, that kind of discovery, re-use, access? How do you do things like the lineage between features, dependencies? How do you track how models are performing that use features? How do you integrate with, you know, adjacent monitoring systems and data validation and quality systems? Those are kind of the enterprise needs. And then if you look at the lower scale, kind of solution team focus, it’s a little bit more on how do you make it easier to get started with feature stores? How do you make it easy to integrate into existing workflows? How do you make it less kind of overwhelming for teams? And I think all of the feature stores today are still kind of tough to get started. So I bet that if you went to Feast, you didn’t install and run it, you probably just read the docs, because it’s not just a pip install, right?
You have to spin up infrastructure, you need a use case, and you need to do quite a lot to, you know, go end to end with it. So it depends on who you’re kind of targeting the kind of smaller teams, larger teams, the platform teams. But I think the v1 problems are solved, the v2 problems are different for those two, and those are the ones I kind of mentioned earlier.
Kostas Pardalis 37:00
Yeah, that’s, that’s, that’s great. It’s, it’s very, very interesting to hear about the enterprise where it looks like a lot of value in this organization is always around like governance, and all these things that have been addressed, or we’re trying to address also in different spaces but how do they apply, like, specifically in the case of a feature store, which is super interesting to see like the same story, but narrated from the side of a feature store. So last, let’s say a bit of more technical question before we move on and I’d like to discuss a little bit more about Tecton. How does the feature store in general integrate with the rest of the data infrastructure that the company has? As you mentioned, setting up a feature store is not like a simple process usually, because there’s a lot of different components of data infrastructure that you have to deploy there. What are like the main touch points with the rest of the data infrastructure, that a feature store has to have?
Willem Pienaar 37:59
The main touch points are, you have data sources, either batch or streaming, and you have some kind of job runner or compute engine, so like Cloud Dataflow, Kinesis, Spark, something that can run a process that can take data from that source, pull it in, do some transformations, or take transformations, there’s an ETL system, essentially, and then load that into stores, one or more stores. So in the old Feast architecture, you’d pull data from the source and you push to a stream. And from that stream, it will get sunk into, like, online and offline stores. But in the new Feast architecture and the Tecton architecture, what happens is you pull from, let’s say, a batch source, it could be like a warehouse, like Redshift, it can be a bucket. And you can pull from streams, like, you know, Kafka or something like Pub/Sub, and do transformations, and then just push to a single online store or a single offline store. So there’s the compute layer, there’s the two sources, there’s the storage engines, the storage engines may be existing infrastructure. So feature stores, at least the good ones, reuse existing infrastructure, and they don’t create new data islands. And then there’s also integration with operational systems like, you know, if you’ve got a Prometheus, or you’ve got some kind of logging system like Stackdriver or Kibana or ELK Stack, feature stores integrate with those. And because they’re production systems, right, you’re depending on like, literally the business decisions are being made on the fly with this data. So they are critical to have operational excellence on. You need the logs, you need the metrics, you need alerts. So they integrate with all those systems like PagerDuty or, you know, Sentry and all of these kinds of monitoring and metrics systems. And then of course, the kind of critical integration is into the model serving layer. So the model server and the feature server speak to each other.
The models will call out to get features. And this also happens during training. So if there’s a pipeline training a model, then that also calls out to the feature store. And depending on your feature store, it’ll either be deployed to Kubernetes, or it’ll be deployed to kind of like a managed environment. But I’d say most of them actually require Kubernetes these days to run. And if your feature store allows you to train locally, like in your notebook, then typically they load and export data sets through object stores. So that’s also another integration touch point. And then recently, there’s been like, I don’t know if you know of Lyft’s Amundsen?
Kostas Pardalis 40:42
I haven’t heard of it.
Willem Pienaar 40:43
Wait, is it Lyft, or is it another company? But it’s a metadata, it’s a discovery system, kind of like a data discovery, metadata tracking system that you deploy in your organization, and it basically pulls or collects information about all the systems that have data across your org. So that’s something that has been becoming quite popular. DataHub is another one. And they’ve recently integrated with Feast as well. So the integrations between those systems and feature stores are also important.
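Editor’s illustration: the core data flow Willem outlines in this answer (pull from a source, transform, load into an offline and an online store, then read from the offline store for training and the online store for serving) can be sketched as a toy in Python. Everything here is an assumption made for illustration: `RAW_TRIPS`, `transform`, and `ToyFeatureStore` are hypothetical, in-memory stand-ins and do not reflect the Feast or Tecton implementations.

```python
from collections import defaultdict

# Hypothetical batch source: raw (driver_id, event_time, trip_fare) rows.
RAW_TRIPS = [
    ("driver_1", 1, 10.0),
    ("driver_1", 2, 30.0),
    ("driver_2", 1, 8.0),
    ("driver_1", 3, 50.0),
]


def transform(rows):
    """Yield a running average fare per driver, in event-time order."""
    totals = defaultdict(lambda: [0.0, 0])  # driver_id -> [sum, count]
    for driver_id, event_time, fare in sorted(rows, key=lambda r: r[1]):
        totals[driver_id][0] += fare
        totals[driver_id][1] += 1
        total, count = totals[driver_id]
        yield driver_id, event_time, total / count


class ToyFeatureStore:
    def __init__(self):
        self.offline = []  # full history, used to build training sets
        self.online = {}   # latest value per entity, used at serving time

    def materialize(self, feature_rows):
        # Rows arrive in event-time order, so the last write per driver
        # in the online store is the most recent value.
        for driver_id, event_time, value in feature_rows:
            self.offline.append((driver_id, event_time, value))
            self.online[driver_id] = value

    def get_online(self, driver_id):
        """Serving path: what a model server would call per request."""
        return self.online.get(driver_id)

    def get_historical(self, driver_id, as_of):
        """Training path: latest value at or before `as_of`."""
        candidates = [
            (t, v) for (d, t, v) in self.offline
            if d == driver_id and t <= as_of
        ]
        return max(candidates)[1] if candidates else None


store = ToyFeatureStore()
store.materialize(transform(RAW_TRIPS))

serving_value = store.get_online("driver_1")
training_value = store.get_historical("driver_1", as_of=2)
```

In a real deployment the two lists and dicts would be existing infrastructure (a warehouse or bucket offline, a low-latency key-value store online), and the job would run on a compute engine rather than in-process.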
Kostas Pardalis 41:17
Super interesting. I see there are many touch points there. So it requires from what it sounds like quite a lot of effort to set it up and also probably complex like operations around that, which takes me to my next question. Let’s chat a little bit more about Tecton. And more specifically about what it means to productize such complex architectures, especially on the cloud. So how did you manage to do that with Tecton? Can you tell us a little bit more about this?
Willem Pienaar 41:44
Well, I’d love to give you the finer details, but I’ve just joined the team two months ago, so I wasn’t really involved with most of those small things. But I can tell you at a high level how we operate. So there are multiple aspects to it. Tecton today runs as a managed service. We have architected the system in such a way that we can run a single Tecton control plane, basically the brain of operations, and we have a separate data plane, and this data plane can be deployed into a customer’s cloud environment. And so essentially, what this provides is a way for us to horizontally scale out the number of customers that we can support and provide them data locality, like their data doesn’t have to leave their environments. So we have a large engineering team that’s heavily focused on ensuring the reliability and stability and performance, as well as just the functionality that’s available in that system, both from a control and operational standpoint, as well as execution standpoint. So, you know, how do you do computations for the customer? And how can you make that efficient? And how can you save them money? And how can you give them, you know, earlier alerts and warnings? And how do you integrate into the stores that they have and are already using? On the other side, we’ve got product teams. I’m a little bit closer to the product side. So we have a lot of conversations on what is the most intuitive way for users to define features? How do you allow them to specify the configuration that tells the feature store how to operate? Because in the data space, it’s unlike engineering in that you’re not reining in the chaos, right, you’re not reducing complexity. There’s an innate complexity to data. And the more features you create, the more uncertainty and complexity and entropy is introduced into the system. So you kind of want to give them as much structure as you can, while at the same time giving them freedom to, you know, operate.
Like you can’t just say, you can do an average or a min max, right, you have to allow them to write any kind of transformations, bring their own code if they want to, bring their own dependencies, but at the same time, prevent them from you know, taking down a production system and, you know, through accidentally bringing in some sleep function or something.
Willem Pienaar 44:10
So, on the product side, we’re heavily focused on, you know, understanding how the users think and to provide to them. And the great thing about this is that we have like, two worlds here, right. We have Feast, the open source side, and we have Tecton, where we have different customers and different users. So and then finally, it’s just yeah, I mean, we have amazing founders that have seen a lot of great implementations of feature stores like Uber’s Michelangelo, and at other companies, and they’re very well connected. And we have great investors as well with Sequoia and Andreessen Horowitz that really guide us in, you know, in our venture.
Kostas Pardalis 44:49
Yeah, yep. Absolutely. That’s, it’s very important. So, how do Feast and Tecton work together? What’s the vision there, both from your side and also from the Tecton side? Because you joined the company, there you will be working on a product, Tecton. And at the same time, I assume you are going to continue maintaining Feast. So what’s the story behind this?
Willem Pienaar 45:12
Well, that’s a great question. I think that when we started, we started independently. And then at some point, we just realized, like, we were trying to solve the same problem. And we’ll probably be better doing this together. And we have these great two products. So for us, it’s just about figuring out how to build the best feature store. And we believe that, you know, there will be large overlap between these two, but that Feast and Tecton will kind of gravitate towards solving problems for different groups of users, where Feast will be a little bit more for teams that just want to get started quickly and solve specific problems. They’re more of the kind of nascent stage, but if you go to like a large bank, or corporate, something that requires companies or teams that require high scale, or you know, multi-tenancy or advanced access control, then you’re more likely to go towards Tecton. So, for us, we’re still trying to kind of converge these two visions. So we’re working very closely, you know, I’m pretty close on the Feast and Tecton sides and unifying these visions. But I think over the next three to six months, it’ll become much clearer exactly what we are, what we have decided. And that’s as much as I can answer right now, I hope that it was satisfying enough for you.
Kostas Pardalis 46:27
No, that’s good. That’s good. Like, I totally understand. How’s your experience working on an open source project, by the way?
Willem Pienaar 46:34
It’s extremely rewarding. And it’s also kind of draining at some points. So you don’t really have often closed loop feedback. So you only see the tip of the iceberg in users. So like 2% of users will make an issue or give you feedback, but that’ll often be negative. So you really have to kind of have conviction that what you’re doing is right. Luckily, I had to run Feast internally at Gojek for like three years or so or two years at least. So it was very rewarding to work with our customers internally and just get them to use it and make them happy and see how impactful the software is. And so you don’t need to have conviction that an open source project is successful, we just kind of put it out there and if people like it, they like it. If they don’t, they don’t, but it turns out they kind of do. And so for the most part it has been very rewarding. But it can be a lot of work. So it’s best if you’re paid to do it.
Kostas Pardalis 47:30
Yeah, I totally understand how it feels. So we are almost close to our time. And we have many things to discuss, to be honest. I mean, it’s really fascinating, this whole space with feature stores. But one last question. Tecton recently raised quite impressive rounds from some very impressive VCs here in the Silicon Valley. You mentioned some of them already. Can you tell us a little bit about what does this mean both for the company itself, like what excites you about what’s going to happen in the next couple of months? And also what it means about feature stores in general, right. And this market, let’s say that it’s like emerging.
Willem Pienaar 48:14
Yeah, so the market is going to get a lot more competitive. We’ve already seen Amazon release their feature store, not sure if you had a look at that. We believe that, you know, other cloud providers will also bring them out. And so raising that round is kind of a vote of confidence from our investors that, you know, they believe that we are one of the stronger players in this big round. And I think that Tecton is probably the most … it’s the right environment and the right people to build an industry dominating or, you know, the most successful feature store, which is part of the reason why I joined this team. What you can expect to see going forward, I think, in the short term is a lot easier access, a lot more transparency, in terms of our APIs and the functionality that we provide. And we’ll be going towards users a lot more. So previously, we would, I mean, from a technical standpoint, we’re going to be a lot more open towards integrating into existing infrastructure, and reusing existing infrastructure instead of providing a managed service with specific types of infrastructure. So I think we’ve collected a lot of feedback. We’ve got a lot of great customers that have been working really closely with us. And so there’s a lot of things that will be landing in the next couple of months. But I think the thing that I’m the most excited about is, you know, kind of getting more eyes on the product itself and opening up what we’ve been working on.
Kostas Pardalis 49:52
Yeah, and also see how it’s going to work together with Feast because from what I understand from what you said earlier, there are also like things that are going to happen there. Thank you so much. It was a great conversation. I really enjoyed it. I learned a lot. And I really appreciate that. And so yeah I hope to meet again, like in a couple of months and see how things are going and learn more about this wonderful space.
Willem Pienaar 50:14
Definitely, I’m going to take you up on that offer.
Kostas Pardalis 50:16
Thank you so much for your time today.
Willem Pienaar 50:18
Thank you.
Kostas Pardalis 50:20
Thank you, everyone, for joining us in another episode of The Data Stack Show. I hope you enjoyed today’s episode with Willem as much as I did. When we started recording this episode, I had many questions. And I would say even some doubts about the importance of the feature stores. I know many more things right now about them and I truly understand why they are important. And Willem did an amazing job explaining that to us. And I’m really looking forward to having another recording with him in the near future. Willem has many things to share with us about this exciting new world of MLOps. Thank you so much again for listening to our show and we’ll see you on the next episode.