Episode 05:

Data Council Week: The Difference Between Data Platforms and ML Platforms With Michael Del Balso of Tecton

April 26, 2023

This week on The Data Stack Show, we have a special edition as we recorded a series of bonus episodes live at Data Council in Austin, Texas. In this episode, Brooks and Kostas chat with Michael Del Balso, Co-Founder and CEO of Tecton. During the episode, Michael talks about the company’s desire to provide customers the best feature engineering experience in the world. Topics in the conversation include MLops and data platforms, machine learning vs. data pipelines, the most difficult part in the life cycle of prediction, and more.


Highlights from this week’s conversation include:

  • Michael’s journey to co-founding Tecton (0:22)
  • The evolution of MLops and platform teams (3:50)
  • Understanding boundaries between the data platform and the MLops (8:42)
  • Differences in machine learning vs data pipelines (16:58)
  • The systems needed to handle all these types of data (22:22)
  • Developer experience in Tecton (25:15)
  • Automating challenges in ML development (32:30)
  • The most difficult part of the life cycle of prediction (37:24)
  • Exciting new developments at Tecton (39:27)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Brooks Patterson 00:23
All right, we’re back in person at Data Council Austin, and we have got Michael Del Balso. He’s the co-founder and CEO of Tecton, and we're super excited to chat with him. I’m Brooks, again. If you’ve been following along, you’ve heard that Eric did not make it to the conference, so I’m filling in. You’re stuck with me. But I have Kostas here as well. He’s extremely excited to chat with Mike. So Mike, welcome to the show.

Michael Del Balso 00:37
Thanks for having me.

Brooks Patterson 00:39
Yeah. First, could you just tell us about your background and kind of what led to you starting Tecton?

Michael Del Balso 00:45
For sure. So yeah, I’ve been in machine learning for a while now. I actually got into it kind of randomly. I worked at Google, where I began working on the machine learning teams that power the ads auction. So I was a product manager for what I would say was, at that time, probably the best productionized ML system in the world, the one that drove all of the ads revenue there. And that was really cool, and I did that for a bit. That team, actually, is the team that published a pretty famous, foundational MLOps paper called “Machine Learning: The High Interest Credit Card of Technical Debt,” or something like that, I always get the words in the wrong order. It’s a pretty often cited MLOps paper. But then I joined Uber, at a time when there was not really a lot of ML happening at Uber, and I was tasked with helping start the ML team and bringing things from zero to one. So we created the ML infrastructure platform at Uber called Michelangelo. I joined in 2015, and over a couple of years, two and a half years let’s say, we just had a really good run and built some really good platforms. We went from a handful of models in production to literally tens of thousands of models in production making real-time predictions at a huge scale, millions of predictions a second, powering all kinds of the ETA, fraud detection, and pricing stuff that happens at Uber. And going through that and building that stack out, we came up with something called a feature store.
And I published a blog post about Michelangelo a while back, and a lot of people said, hey, you know, we’re trying to solve a lot of these problems that you guys are solving in Michelangelo. And one of the most interesting parts, and the hardest part for us, is all of the data side: the data pipelines that generate the features that power models, all of the real-time serving, the monitoring, the data storage, the online and offline consistency, stuff like that. And so we recognized there’s an industry need for this architecture, and that became the beginning of what we call a feature platform today. So myself and the engineering lead of Michelangelo left to start Tecton, and we’ve been doing that for the past couple of years. At Tecton we think of ourselves as, you know, the company for feature platforms. We see ourselves as the enterprise feature platform company. We sell to the Fortune 500 and some top tech companies, folks that aren’t necessarily like Google or Facebook, who are for sure going to build it themselves kind of thing. But everyone else who’s trying to do real machine learning in production, we hope we can help them out and provide them with the best feature engineering experience in the world.

Brooks Patterson
Cool. Before we hit record you were talking about platform teams and kind of how companies handle build versus buy. Do you want to just speak a little more on that?

Michael Del Balso
Yeah, it’s been interesting seeing the evolution over the past couple of years. When we started Michelangelo in 2015, there just wasn’t a lot of good ML infrastructure or ML tooling. MLOps, I don’t even think, was a term at that time, really. And honestly, the concept of buying just never came up for us. We were never like, oh, maybe there’s a product. We just assumed that there was no product to do what we wanted to do.
And so over time, you know, the industry has grown up effectively, and the offerings on the market have become more compelling. At the same time, in parallel, as the vendor solutions have become somewhat compelling, ML platform teams have grown internally at companies. The need for machine learning, and so the willingness of a company to invest in machine learning and a machine learning platform, has grown. And we often see it as a parallel thing to a data platform team. You’ve probably seen this in a bunch of companies, where the ML platform team is kind of a consumer, a customer, of the data platform team. The data platform team will manage the data warehouse or the data lake, stuff like that, and the ML platform team is a specialization of that, managing a lot of the ML infrastructure on top of the core data platform. And so these ML platform teams at a lot of companies have grown quite a bit. They’ve been building all kinds of cool stuff: managed training, managed model serving, drift detection, feature pipeline management, stuff like that. And I think recently, especially with the economic situation, but even before that a little bit, these teams were ballooning. Sometimes you’re getting to 10, 20, 30 people on an ML platform team, and that’s expensive. And so we’ve begun seeing a lot of, hey, now that there are all these solutions we can buy, why is it really strategic for us to build our own training infrastructure? Should we just buy that? Well, it was strategic before, when it didn’t exist and you needed it to have machine learning in production.
And especially on a lot of the data side of things, the place where we play, it’s the same kind of thing. And so we’re seeing these ML platform teams have this interesting identity crisis today, where they have to think: okay, I thought I came here to invent ML infrastructure. And now their role is a lot more tied to the use cases. Why am I actually here? My team is building a recommender, or someone in my company ultimately needs to make recommendations, or needs to detect fraud. So now it’s a lot less carte blanche, just build whatever cool tools you want, and a lot more driven by: what is the actual need from this business use case, and how do I help that end use case team, that business team, map that back to the right tools, the right stack? They have to be kind of the stewards of the right stack. And I think we’ve seen that become a difference in identity and charter for ML platform teams over the past few years. Obviously, people are at different parts of that journey, but I think that’s a general trend we’re seeing. And it seems like it’s kind of just general across the board with data teams: the business impact is now number one, I’d say. Machine learning stuff, I feel like, has been this particularly sexy thing, where someone could say, I can just go invent this completely greenfield new stack, there are no best practices or established right way to do things. I don’t think this was always a conscious thought for people working on this stuff, but you could see it in people’s attitudes: it’s a cool place where I can just invent cool tech. And sometimes that was a little bit divorced from what the actual business need is, more so than I saw in, for example, just the normal data stack.

Kostas Pardalis 09:03
So we’ve been talking about MLOps and platforms, and I’d like to begin the conversation with helping me understand the boundaries between the data platform and the MLOps or ML platform, right? Where does each one start, and where does it stop? And, most importantly, where are the synergies? Because there are synergies, right? I can’t imagine that you have an ML platform somewhere without also having some sort of pre-existing data platform, right?

Michael Del Balso 09:35
Those are tricky situations. We love it when people have great underlying data platforms, and unfortunately that’s not always the case, right? So I think this is a little bit of a strategy question for MLOps vendors. If you look at analytics vendors, they’ll typically have a bunch of capabilities in their system which are there as optional, to fill in for gaps that you might have in your data stack. Because if you are missing, say, a data observability capability, and it’s really important for your machine learning model but doesn’t really matter that much for the rest of the company, it’s probably realistic that the MLOps vendor is going to add that in one way or another. And they’ll say, hey, you don’t have to use this, but part of it is here in case you need it, and you might use it for your really important business use case. But anyway, sorry, I interrupted you. The question was: what’s the boundary?

Kostas Pardalis 10:34
What’s the boundary? And also, you said something very interesting about observability, for example. In a BI environment, let’s say with the traditional data stack, maybe people don’t care that much about introducing an observability platform, right? But if you do ML, you probably need more, right? So tell us a little bit more about that, because I think for many people it’s hard to understand what the differences are. And, most importantly at the end, what are, let’s say, the synergies between the two platforms?

Michael Del Balso 11:11
Well, okay, so I think there are two dimensions of difference. One dimension is: is this a normal data, BI type of thing that I’m trying to do, or is this a machine learning kind of thing? Machine learning has some special requirements. But a second dimension, which is often correlated with that first one, is: is this an analytical thing I’m doing, where it’s kind of offline, an internal use case, or not? So let’s imagine an analytical machine learning thing; this is a big distinction we make. It could be, hey, my finance team has to forecast how many sales are going to happen next quarter. You run a job, maybe do some machine learning stuff, but if it fails, it’s not a big deal, just press retry and you’re good, right? Whereas it could also be an operational thing, a thing that powers your product, your end user experience. And that’s a pretty different set of engineering requirements, right? You might have a lot of users, so you have to be ready to serve at a crazy level of scale. Or maybe, say it’s a fraud detection situation, you have to make a decision really fast. Someone swipes a credit card, and you have to say, in like 100 milliseconds, or 30 milliseconds, is this acceptable or not acceptable? Or you have some uptime requirements, where something really bad is going to happen for some downstream consumer if you’re not available at a certain level of availability. So that production versus not-production differentiation, I think, is actually a bigger driver of some of the differences you see between an ML stack and a standard data stack. It’s just correlated with doing machine learning versus BI kind of stuff.
And of course, there are examples that contradict this. You can have an internal-only ML application, or you can have production embedded analytics, where you have a dashboard that’s updated in real time for your customers and has nothing to do with machine learning. But in general, that tends to be a pretty correlated distinction. The whole point here, though, is that machine learning often comes with these production requirements, the things I’ve listed, and you can probably list a bunch more. And why would you use machine learning? Why would you go through all that trouble? It’s because these use cases often tend to be pretty valuable to the business. The business is like, hey, preventing fraud is worth so much money to me, so I’m going to really invest in it. Whereas the 101st dashboard in the company, incrementally, maybe that’s not going to merit a 50-person team or something like that. And so that’s why we see different levels of investment, different levels of willingness to pay, and often different stacks for those things as well. So then, coming back to the boundary between machine learning or MLOps tools and the data tools: it often becomes a boundary between production data tools and non-production data tools. But machine learning definitely has its own pieces, and we can go through a bunch of them. If you look at an ML platform, what are the things in it? It breaks into model stuff and data stuff. On the model side: you’ve got to train the model, you’ve got to evaluate the model, you’ve got to serve the model.
And there’s a pretty good ecosystem for that stuff today. You can go and find really nice open source tooling for it, or you can find a vendor solution that will do it all in one. Then, on the data side: first, what even is the data side in a machine learning use case? Your model takes in data, called features, to make predictions, right? You take in some data about your users, about your product, whatever, and hopefully it’s up to date, fresh, and expressive, so you can make high quality predictions. There’s a lot of good information going into your model, so the model can make a good prediction. But that’s a hard data engineering problem in and of itself. So we find that to get a machine learning application into production, it’s not just “let me deploy a model.” It’s “let me deploy the model and a whole bunch of supporting data pipelines” that are often more complicated than the BI pipelines powering a dashboard. And that’s a really big, hard part. The data pipeline thing for machine learning is always on the boundary of: is this a data thing, or is this a machine learning thing? But it tends to be the hardest part. I’m sure everybody on this podcast has said the hard part in machine learning is the data, and all of that kind of stuff. It’s because it is, and that’s actually the layer that we focus on at Tecton. But it’s a lesson we first learned when we were building Michelangelo. First, we started with this model thing, and we were going to all of the different data science teams internally and saying, hey, let’s help you out, let’s help you get surge pricing, or whatever, into production. And we would find that there were a bunch of cool models that worked.
And then we would do a bunch of custom data engineering. And then we would go to the next team (there are 100, 200 different data science teams internally), and we’d be doing the same data engineering things again and again. So we centralized and automated that in the ML platform. And specifically, only ML teams have these needs; we can get into what the specific needs are. But that became the feature store and the feature platform. And so that has become a separate thing from what you have in a traditional data platform. You don’t need a lot of that real-time serving and streaming stuff in the exact same way.
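A lot of what Michael describes as "the same data engineering things again and again" comes down to computing each feature identically for training and for serving. A minimal sketch of that define-once idea, with illustrative names (none of this is Tecton's API):

```python
from statistics import mean

# Hypothetical feature transformation, defined ONCE and reused in both
# the offline (training) path and the online (serving) path, so the model
# sees identically computed values in both places.
def txn_amount_vs_avg(amount, past_amounts):
    """Ratio of this transaction's amount to the user's historical average."""
    if not past_amounts:
        return 1.0
    return amount / mean(past_amounts)

# Offline: build training rows from historical (amount, label) transactions.
def build_training_rows(transactions):
    rows, history = [], []
    for amount, label in transactions:
        rows.append((txn_amount_vs_avg(amount, history), label))
        history.append(amount)
    return rows

# Online: at inference time the SAME function computes the feature.
def serve_feature(amount, user_history):
    return txn_amount_vs_avg(amount, user_history)
```

Because both paths call the same function, the "train one way, serve another way" skew that Michael mentions cannot creep in for this feature.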

Kostas Pardalis 17:16
Yeah. That’s super interesting. So in terms of the data and, let’s say, the pipelines: the basic principle of a pipeline remains the same, right? You have data, and you go through stages that transform the data. But how is this different in the case of ML?

Michael Del Balso 17:34
Machine learning versus not? Yeah, good question. So I’ll call out two big things, and then we can talk about the implications of these differences. One thing is that I have two consumption use cases for my data in machine learning. The data, again, is features. Let’s think about what some features are, as an example. I could be predicting fraud, right? So let’s say I have one feature which is: how large is this transaction that someone’s making right now, and how does it compare to an average transaction? And say I’ve got a bunch of different features like that. So I need to use that data to build my model, to train my machine learning model. That’s consumption scenario one. I’m doing that in Jupyter, plugging it into scikit-learn or PyTorch or whatever, and that’s offline. And then I get a model from that, and I deploy that model, and that model, which is consumption case two, is in production. I need the same data, the same “how big is this transaction compared to the average transaction,” calculated the exact same way and delivered to that model in production in real time. That’s the inference step. So consumption case one is training, consumption case two is inference. The data needs to be consistent across those, more so than in any other data scenario. If you have a dashboard where you’re off by a decimal place, or the format of the numbers is kind of different, or it was one way in your prototype and another way in production, it’s not a big deal. But in machine learning, if there’s any difference in that data, then you basically have undefined behavior for the model.
And then you get this problem (this is the drift problem that people talk about) where you don’t know what your model is going to do, and it can affect behavior in a really bad way. But it’s also a very hard problem to detect and debug. So anyway, that consistency between online and offline is a big problem. The second problem that is pretty unique to machine learning goes back to the training side of things. Say I have 40 features, right? And we have customers that have 4,000 features. I’m trying to give my model examples of what I knew about a customer or a product or whatever at the time I had to make that prediction in the past. So I don’t really care about what that feature’s value is today, right now. I care about: when this purchase was made at, like, 12:31 on Thursday, what was this feature’s value at that time? And imagine I have to do that for every single feature, and then for every single purchase that happens, right? So that’s a complicated thing. We can imagine a bunch of different ways to do it, and it’s not impossible to figure out. But if you’re a data scientist, it’s like, okay, that’s a whole other data engineering thing I have to do. And you should just have a really clean, nice workflow to make that really easy, because you’re trying to do millions of rows and potentially thousands of columns. And then what’s even more tricky here (I’ll say this is challenge number three for these use cases) is that where you’re sourcing data from is not a simple story.
Typically, it’s not just “let me plug into Snowflake and run a query.” For these production fraud models, often it’s: okay, I’m going to run this query against Snowflake, and that gives me, say, the zip code. We expect some profile data, some slow-moving data. Then there might be some data that’s based on streaming values, like how many times has this user logged in in the past five minutes? The model can learn that if it’s 1,000 times, there’s probably something weird, and maybe this is high risk. And then there is another type of feature, which is real time. It’s not streaming, where values are calculated asynchronously; it’s calculated very fresh, based on the data of the transaction that’s coming in. I need to do some operation based on, let’s say, the size of the transaction, or the IP the transaction is issued from. So now we have three different kinds of compute you have to manage, and you have to backfill all of those values through all these points in time in history. So you can see this whole problem just explodes, right? All these different dimensions of the problem. And the point is not to say, hey, you can’t figure out how to do one of these things. It’s just a really terrible workflow for a data scientist who’s just trying to build a fraud model and do their job, really. And so that’s the set of most of the big problems.
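The "what was this feature's value at 12:31 on Thursday" requirement is a point-in-time (as-of) lookup: for each historical prediction event, take the latest feature value at or before that moment, never a future one, to avoid leakage. A toy sketch of that backfill logic, with made-up names and plain Python lists standing in for real feature logs:

```python
import bisect

def value_as_of(feature_log, ts):
    """Return the feature value that was current at time `ts`, or None.

    `feature_log` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in feature_log]
    i = bisect.bisect_right(times, ts)  # first entry strictly after ts
    return feature_log[i - 1][1] if i else None

def build_training_spine(events, feature_logs):
    """events: [(ts, label)]; feature_logs: {name: sorted [(ts, value)]}.

    Produces one training row per event, with every feature evaluated
    as of that event's timestamp.
    """
    rows = []
    for ts, label in events:
        row = {name: value_as_of(log, ts) for name, log in feature_logs.items()}
        row["label"] = label
        rows.append(row)
    return rows
```

A production system has to do this efficiently across millions of events and thousands of features (and across batch, stream, and real-time sources), which is exactly why Michael calls it a whole separate data engineering problem.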

Kostas Pardalis 22:40
Yeah, that’s again very thought-provoking, actually. Two things: one technical question that I have, and then the other one, which is probably the most important, is about the experience that the user, the ML engineer or the data scientist or whoever, has in this case. Let’s start with the technology. For someone who comes, let’s say, from the database systems world: you know that systems tend to get optimized for specific workloads. And as you talk about all these things, I can’t stop seeing the different workloads coming one after the other in my mind’s eye. So my first question is: what kind of data infrastructure, what kind of data system, do you need in order to work with all these different types of data? Time series data, streaming data, slow-moving batch data, graph data, and all that. Especially since, from what I know from banks, when it comes to fraud detection, graph databases are used a lot to find relationships. So taking all these things together, that’s a lot, right? It’s crazy. How do you even keep this thing consistent?

Michael Del Balso 24:09
Yeah, I mean, maybe it would be good to clarify that Tecton is not doing all of that stuff, right? We’re not saying, hey, we are the one system that can be better at each of these things than everybody else. We take an approach of plugging into the best-in-class solution. So what we provide for a data scientist (and we can talk about the experience of using it) is that we let them write their feature code, their feature engineering, in one place. We provide a really nice workflow for them to author, register, share, and manage these features. But then we plug in and send that code to the appropriate underlying infrastructure to run it. This could be a stream processing pipeline. In the real-time case, we actually run Python code so it runs efficiently. Or we often just push down SQL queries to Snowflake, or we’ll kick off a Spark job, or something like that. So it’s not intended to be one master data engine that does everything, but more like a common hub, a common control center for the data scientists, so they can get control of all of the different data flows that power their ML application. Does that answer your question?
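The "common hub" pattern Michael describes, one place to author features with execution pushed down to the right engine, can be sketched roughly like this. The registry, decorator, and engine names are hypothetical stand-ins for illustration, not the real Tecton SDK:

```python
# One registry of feature definitions, each tagged with the engine that
# should execute it; a real system would compile and ship each definition
# to its backend (warehouse SQL, stream processor, real-time service).
REGISTRY = {}

def feature(engine):
    """Decorator that registers a feature function under a target engine."""
    def wrap(fn):
        REGISTRY[fn.__name__] = {"engine": engine, "fn": fn}
        return fn
    return wrap

@feature(engine="warehouse")   # slow-moving profile data, pushed down as SQL
def user_zip_code(user):
    return user["zip"]

@feature(engine="stream")      # windowed aggregate over a login-event stream
def logins_last_5m(user):
    return len(user["recent_logins"])

@feature(engine="realtime")    # computed from the request payload at serve time
def txn_amount(request):
    return request["amount"]

def engines_in_use():
    return sorted({meta["engine"] for meta in REGISTRY.values()})
```

The authoring experience is uniform (plain decorated functions), while the control plane decides where each one actually runs, which is the separation Michael is pointing at.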

Kostas Pardalis 25:30
Yeah, 100%, 100%. And let’s talk more about the experience, right? What does this experience look like, and how can we make it easy for an ML engineer to interact with all these different systems? Because just writing a job for Spark versus executing a SQL query on Snowflake, these are really different things, right? And I’m pretty sure ML engineers and data scientists prefer to focus on other things. So what is the authoring experience for a feature, and how can we help them have a good experience with that?

Michael Del Balso 26:11
So I think, you know, when we look at what our customers are spending their time on, especially on a new use case, like, hey, we’re spinning up fraud model number two, a lot of the project timeline is spent figuring out how to connect to something in the first place and getting that original integration going. So one of the first parts of the experience is getting that integration out of the way ahead of time. This is where the ML platform team comes in: we work with them, before the data scientist or feature engineer who’s building the model even knows anything about the platform, to get all the integrations with the right data sources registered in Tecton. We connect to your warehouse and your streams and your production system and stuff like that. That lets us provide an SDK to the data scientist, who’s now in the mode of: hey, I want to develop a machine learning application, I need a training data set. Okay, I want to write a feature that operates on the stream. I want to write a feature that runs in real time, based on the data my application sends me in real time. I want to write a feature that’s SQL that runs on Snowflake, for example. Well, now there’s one SDK where they can write that code snippet in the same way for each of those different types of compute and register it into a central, you know, feature repository. And it’s all within your Jupyter notebook. It’s literally just writing a Python function that emits either a SQL query or does an operation on a pandas DataFrame or something like that. And you put a little decorator on it, and that tells us, hey, this is a feature view. Okay, pretty straightforward experience.
And then you can say feature view dot run, and we’ll execute it and give you the feature values. So there’s not a crazy amount of magic there. But then you can take all of these feature views, either referring to them by name (the ones that are already in the feature store, that someone else in your company has made) or the ones you just defined live in your notebook, and bring them all together in a list and say, hey, give me the historical training dataset for this: every login attempt that any user did in the past six months, I want backfilled with the feature values for all 400 features. And that’s where a lot of the complexity comes in. For each of these feature types, how do you figure out what the historical value of the feature was? How do you do it efficiently? How do you join it all together efficiently? And then how do you make it really easy to iterate on that whole thing? Normally that would be such an ugly workflow, and we’re all about making it as smooth as possible for the person prototyping their machine learning application. That’s the prototyping stage: we give you back a data frame, and you just train your model on it. And then, once you’re happy with your model, you deploy it. Normally, this is the main thing people get stuck on. Historically they would say, hey, okay, let’s go rebuild all these pipelines in production. It’s the classic throw-it-over-the-wall to engineers who rewrite everything for production. But in the Tecton world, you’ve already registered your pipelines, you’ve already registered your features, so they’re already productionized.
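The decorator-plus-run workflow described above might be sketched like this. `feature_view` and `.run()` here are simplified stand-ins to show the shape of the experience, not the actual Tecton SDK; rows are plain dicts rather than a real DataFrame:

```python
class FeatureView:
    """Wraps a plain transformation function so it can be registered and run."""
    def __init__(self, fn):
        self.name = fn.__name__
        self.fn = fn

    def run(self, rows):
        # Prototyping path: apply the transformation to in-memory rows,
        # returning each row extended with the computed feature column.
        return [{**row, self.name: self.fn(row)} for row in rows]

def feature_view(fn):
    """Decorator marking a function as a feature view."""
    return FeatureView(fn)

@feature_view
def amount_vs_avg(row):
    return row["amount"] / row["avg_amount"]

rows = [{"amount": 400, "avg_amount": 100}]
enriched = amount_vs_avg.run(rows)
# enriched: [{'amount': 400, 'avg_amount': 100, 'amount_vs_avg': 4.0}]
```

The point of the pattern is that the same registered definition serves both the notebook `.run()` path and, in a real platform, the production serving path, so nothing gets rewritten at deploy time.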
And so there’s nothing else to do. Your model in production just makes a call to Tecton and says, hey, I need these features in real time, and those values are already productionized and served in real time. So it speeds up the prototyping stage in the natural Jupyter Notebook workflow, but it also brings the time for the productionization stage down to, like, zero. Because, yeah, that’s just not a step in the Tecton workflow.

Kostas Pardalis 30:08
makes total sense.

Michael Del Balso 30:09
You get what I mean by that? It’s just, like, you don’t have to rewrite it, basically. It’s just

Kostas Pardalis 30:12
Yeah, 100%. And how is, like, the product engineering team getting involved in that? Because, like, you have a model, you expose it through, like, a gRPC endpoint or, like, a REST endpoint, whatever, but at some point these things need to be integrated with the product experience, right? So how does this part work? Because we’ve focused a lot on the data side of things, like, more, you know, esoteric stuff, like with data engineers, ML engineers, and all these things. But at some point, we need to integrate that with the product itself, right? So how does this work?

Michael Del Balso 30:44
100%. And so the same problem that we talked about, around how, in reality, you spend a lot of your time integrating with different sources, applies just as much to figuring out how to connect my ML systems to production, to the end application, as it does to figuring out how to connect to some database originally as a data source. And so, in the Tecton way of doing things, that’s something that’s handled by the ML platform team. When setting up Tecton, your ML platform team connects Tecton to your production systems. What that does is make it easy for the data scientists. Now let’s just think about the flow of building an ML model, independent of the platform team. Hopefully, your ML platform group is not involved in building an ML model, the same way you wouldn’t want your data platform team involved in, you know, every single iteration on a dashboard or some analytical work. And so every machine learning engineer or data scientist who’s iterating, when they productionize, you know, Tecton is already connected to their application, it already exists in their production environment. So it’s just a matter of opening up a new API, a new endpoint, on the Tecton side that can serve that data. We just expose that API, and then their application just has to query a different set of features from Tecton, or a different alias for a group of features. And so, with that integration step, you still have to do the integration upfront, but you don’t have to do it in every single iteration, and that’s where the real speed-up happens. And then the whole point, from, like, a data science manager’s perspective, is: great, my team can iterate so much faster, because there’s not all this data engineering stuff that has to happen in every single iteration.
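The "different alias for a group of features" idea, where the production app queries one endpoint for everything a model needs, can be sketched with plain Python. `OnlineStore` and `FeatureService` are illustrative names, not Tecton’s API; a real system would serve this over the network from a low-latency store.

```python
# Toy sketch of the "app queries the feature platform at request time" pattern.
# Names are hypothetical stand-ins, not Tecton's actual classes.

class OnlineStore:
    """Precomputed feature values, keyed by (feature name, entity id)."""
    def __init__(self):
        self._values = {}

    def put(self, feature, entity_id, value):
        self._values[(feature, entity_id)] = value

    def get(self, feature, entity_id):
        return self._values.get((feature, entity_id))

class FeatureService:
    """An alias for the group of features one model version needs."""
    def __init__(self, store, feature_names):
        self.store = store
        self.feature_names = feature_names

    def get_features(self, entity_id):
        # What the production app calls inside its latency budget.
        return {f: self.store.get(f, entity_id) for f in self.feature_names}

store = OnlineStore()
store.put("failed_logins_1h", "user_42", 3)
store.put("account_age_days", "user_42", 180)

fraud_v2 = FeatureService(store, ["failed_logins_1h", "account_age_days"])
print(fraud_v2.get_features("user_42"))
# {'failed_logins_1h': 3, 'account_age_days': 180}
```

Iterating on the model then means defining a new service (a new alias) over a different feature list, while the application-side integration stays untouched, which is the speed-up being described.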
Yeah, my data scientists can affect what’s happening in production, without going through all of these different steps.

Kostas Pardalis 32:43
Nice. And let’s talk a little bit about inference now. You said at some point that, okay, we’ve trained the model, and now we need to, like, move online, meaning, like, start creating the features that we’re going to feed in to make the predictions, right? And I guess the latency and throughput requirements are probably slightly different, again a completely different workload, right? So how does this work? Let’s say I want to build something like fraud detection, right? So, I mean, as you said, in, like, 30 milliseconds or something like that, you need to make a decision. How does this work? And what are, let’s say, the unique challenges it has compared to, like, more traditional, more analytical kinds of use cases?

Michael Del Balso 33:32
Yeah, so maybe it’s good to start from, like, the most basic form, the more, like, analytical ML use cases. So, you know, let’s go back to that example where I’m the data scientist on the finance team, and I want to, you know, predict sales next quarter. Well, okay, what’s the input data, the input features, I need to make that prediction? Say it all comes from my Snowflake, right? So I can use this pipeline: I can issue a query to Snowflake, maybe it takes a couple seconds, I can wait for that data to come back, and then, you know, run a prediction job, or pass it through scikit-learn’s inference pipeline. So that’s kind of, like, the base case, the most simple thing. Now, when you want to go into production, you want to power your user experience with this in real time, right? Typically, you’ve got to go faster than that. You know, you don’t want your user waiting around for the page to load while you’re figuring out, you know, something. So it’s common to have, let’s say, a time budget of 100 milliseconds, or something like 50 milliseconds, where you’d say, hey, the prediction needs to be 100% ready within 50 milliseconds, because we’ve just got to show the page. We can’t wait around for all the ML stuff to happen, right? And that tends to be a real limiter for what kinds of ML we can do, you know, for what our product will be like. Can we even have the product? Well, if it’s slow, we’re just not going to have it, we’re not going to consider having it, right? So the problem ML teams often have is, how do we do this cool stuff, and how do we do it quickly? And when we come to, you know, the different types of information that they want to pass to their model, the different features, you know, those can depend on systems that are not that fast.
So when, for example, I want to send a query to my data warehouse, I have to wait around for it, right? So there are different ways to approach that, and they differ depending on the underlying data infrastructure you have to interact with. But, like, the super obvious example is, okay, well, let’s run the query ahead of time and then just, like, cache the value, right? And so maybe we run it every day, or maybe we run it every 30 minutes, or something like that. So a very common thing to do is: let’s precompute these values and get them all loaded up, ready to serve really fast. And when you do that, then you have this problem of, like, okay, well, how fresh is this value? Right? Well, if it happens once a day, then maybe it’s, you know, like, 18 hours old when I’m serving the value. And so this is kind of, like, a question that all production ML teams think about all the time. Okay, well, how do we do this tradeoff? How do we make it go faster, but not cost too much money? How do I keep things fresh, but also not be constantly, just, like, querying my warehouse and breaking things, right? And then beyond that type of feature, you have ones where maybe I’m using my streaming data, and so there I may be pre-calculating values as well and caching them. And then there are, like, features that depend on actual real-time data that’s only available when you’re making a prediction, like the example of, what is the user’s IP address, right? You can’t know that ahead of time, so you can’t precompute it ahead of time. So in that case, you have to compute that feature at prediction time, and you need that to go really fast. And so this is another domain where, like, each of these things we can talk about and be like, yeah, we can do that. It’s not impossible to run a query on a schedule and load it up.
But if you’re a data scientist, you just really want one thing that will handle all of this stuff for you. And so that’s what we do: we automate the best practices, we have all the best practices built in. And the kind of, like, knobs that you would really want to tune this stuff with, to trade off between performance and cost, stuff like that, are all built in there, to make it really easy for someone who’s building and going to production, without having to worry about a lot of, like, unnecessary data engineering details behind the scenes.

Kostas Pardalis 37:35
Yeah, it makes a lot of sense. And in this whole, I would say, lifecycle of a prediction, right, from getting the data, creating the features, doing the inference, and serving the user at the end, which part is usually, like, the most time consuming? Is it, like, the feature creation part? Is it the inference itself, like, how long the model takes to run? What usually takes a lot of time, or does it depend?

Michael Del Balso 38:04
You mean the inference path, like, when you’re making a prediction? Yeah, there’s data retrieval, and the data retrieval can really depend. So you could have a piece of feature engineering code that’s quite complicated and has to run in real time, and that’s one of these ones where, you know, it’s just the reality that you can’t have an arbitrarily complex thing run arbitrarily fast, in real time, at a level of cost that’s acceptable. But the challenge tends not to be that. Once you adopt an architecture like this, speed of serving doesn’t tend to be a problem, actually. Like, you know, this is what the online feature store is: as long as we can manage getting fresh values into the online feature store, and we automate all of that and everything, the online feature store is really fast. And, you know, we can use different underlying technologies to power that, depending on the performance characteristics, how often the feature store is updated, what your kind of scale of serving is, and your latency needs, such that we can optimize cost for the customer. But those are kind of solved problems, so data retrieval tends not to be the hard part. The bottleneck is the whole experience of getting the ML application up and running. Does that make sense?

Kostas Pardalis 39:23
It does, absolutely. Yep. So, Brooks, all yours.

Brooks Patterson 39:34
I know you’ve got to get to the next thing, it’s a conference here, and Kostas, yeah, he could keep talking all day. But it’s been so fascinating. One last thing I want to ask before we sign off here: I know you’re just launching some new things at Tecton. Can you give us a quick overview of the launch?

Michael Del Balso
Awesome. Yeah, we just launched what we call Tecton 0.6, maybe, like, a week or so ago. The big thing there is we have almost, like, a completely redesigned development workflow, so that things are way faster for a data scientist to do their feature engineering. You know, we aspire to provide our customers the best feature engineering experience in the world, and we have, like, a totally different level of ease of use in the core workflow, the core, like, loop of: write a feature, test it, and it’s productionized. That’s all done in your notebook now. It’s, like, a super beautiful, elegant experience, and I think people should check that out. And then the second thing I’ll call out from this launch is, one of the things that we see quite a bit is how streaming features are pretty important for a lot of types of production ML use cases. This is, like, an aggregate over a bunch of events, basically, and there’s all kinds: you might say, hey, I want to count how many times someone tried to log in over the past five minutes, 15 minutes, 15 days, whatever. And we have huge upgrades in what kind of freshness you can get from those types of features in Tecton, and the speed that they run at, and the flexibility of these aggregations. So one of the nice things about having, like, one platform to manage the features is that when there are particular common use cases, or types of features that are quite powerful and pretty complicated for people to implement, like a lot of these streaming aggregation things, we can just build special things to speed people up.
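The sliding-window counts described here ("how many times someone tried to log in over the past five minutes, 15 minutes, 15 days") can be sketched over a sorted event stream. This is the naive version for illustration; a production system computes these incrementally at much larger scale.

```python
# Illustrative sliding-window count over a sorted, append-only event stream.
# A real streaming system maintains this incrementally; this is the toy version.
from bisect import bisect_left

def count_in_window(event_times, now, window_s):
    """Count events with timestamp in [now - window_s, now].

    Assumes event_times is sorted ascending and contains no events after `now`.
    """
    start = bisect_left(event_times, now - window_s)  # first event inside window
    return len(event_times) - start

# Seconds at which one user tried to log in.
logins = [10, 70, 130, 290, 300]

print(count_in_window(logins, now=300, window_s=300))  # 5: all attempts
print(count_in_window(logins, now=300, window_s=60))   # 2: only t=290 and t=300
```

The same stream answers every window size on demand, which is why having one platform manage these aggregations lets it optimize the common cases once for everybody.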
And so we’ve got a little bit of magic in Tecton that makes all of these kinds of streaming aggregations super easy for people, and we really upgraded that in this launch, too. We’re seeing our customers love that. So those are the two things I’d call out.

Brooks Patterson
Cool, yeah. So for the data scientists that are listening who are like, man, I’ve got to check this out, where do they go?

Michael Del Balso
tecton.ai. Just sign up for a free trial, or shoot me an email at mike@tecton.ai, and I’d love to chat with you.

Brooks Patterson 42:01
Cool. Well, Mike, thanks so much for your time today. Listeners, thank you for listening. Check out Tecton, and subscribe to the show if you haven’t yet. And we’ll catch you next time.

Eric Dodds 42:20
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.