This week on The Data Stack Show, Eric and Kostas chat with Tristan Zajonc and Willem Pienaar. From timeless truths to progressive changes, they discuss all things ML and how it relates to SQL and real-time data.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome to The Data Stack Show live-stream where we are going to talk all about machine learning with two brilliant minds: Tristan from continual.ai and Willem from Tecton. Kostas, what do you want to ask these two super smart people about ML?
Kostas Pardalis 0:44
Oh, they've both been on the show before, and it's been a while, I think, since we recorded those episodes with them. So I'm very curious to see what has changed in the ML space since we talked with them. And I'm pretty sure many things will come up; they're both very bright and very knowledgeable guys in this space. So I'm sure that we will have surprises. So let's go and find out.
Eric Dodds 1:19
Let’s do it.
Welcome to The Data Stack Show live. This is going to be such a fun conversation. We’ve been excited about this for weeks, maybe even months now. And we’re going to talk all about ML current state future and we have to have the best minds that I know of here to talk about that. Let’s start out with some intros. Tristan, do you want to go first?
Tristan Zajonc 1:48
Hey, I'm Tristan. Glad to be here, and thanks for the invite. Great to be here with Willem, who I've admired for a long time. So my name is Tristan. I'm currently the co-founder of a startup called Continual, where we're building an operational AI platform for the modern data stack. I've been working on ML infrastructure for the last 10 years. I co-founded the early enterprise data science platform called Sense, which was acquired by Cloudera, the big data company behind Hadoop. I spent a good number of years there building their machine learning platform, which they call Cloudera Machine Learning, and got to see all the pluses and minuses of that generation of data infrastructure and machine learning infrastructure. So yeah, really excited to be here to talk about the future of machine learning and machine learning infrastructure.
Eric Dodds 2:35
Willem Pienaar 2:37
Yeah. Hey, Eric, Kostas, it's great to be here. So my background is almost a decade in the software and data space. A few years ago, I spent about four years leading the ML platform team at a company called Gojek, in Singapore, where we really sank our teeth into building a complete end-to-end ML stack for a bunch of different use cases. So we really learned a lot there. And through that process, one of the tools we built, out of many, was Feast, which is a feature store that ultimately was open-sourced and became a little bit popular. About two years ago, I moved over to Tecton, a company that focuses purely on feature stores and has an enterprise offering, where I continue to invest my time on both the open source side and the enterprise side, building out the feature store technology and the whole category.
Eric Dodds 3:32
Great, and I cannot wait to talk later in the conversation about open source, especially as it relates to ML. I think it's a really fascinating topic. I want to kick it off, though, with a question that I think a lot of our listeners have faced in implementing ML at their companies. And the context behind this is that we've had tons of guests on the show who talk about things like overuse of ML or misapplication of ML, where the data science team was throwing ML at any problem that moved and that created problems, etc. Or situations where it's sort of the inverse, right, where we spent a lot of time trying to solve a problem, and ultimately realized that it would have been way better to use machine learning to solve it. And so with all of the experience that you both have building ML infrastructure, both inside of companies and building tooling around ML infrastructure and seeing it live on the ground, I would love your perspective on when it is right to use ML. I know that's a broad question, but what are the conditions? Even for our listeners who have sophisticated ML functions running at their companies, what are some of the signals of a really good use case for ML? So Willem, do you want to start since Tristan did his intro first?
Willem Pienaar 5:21
Yeah, I think for me, there are two classes to this. There are use cases that are well-trodden paths, already established within the space and in the market: you can think of recommendation systems or fraud detection or churn prediction. And there are more experimental, more moonshot projects. Even in my time at Gojek, we saw this a lot: we saw teams that would conjure up totally new use cases that you could never think of and say ML is required here. So typically, I'd say, if you're thinking of introducing ML and you're attacking an existing use case that is already established in the market, you can probably quantify the impact of that before you even start. You know how many users you have, what traffic you have; you can probably get a back-of-the-napkin estimate of what the impact would be, just based on the number of companies and teams that have already built those systems. If you're entering a kind of moonshot space, then I think it is a little bit more dangerous, especially if those spaces already have existing techniques: let's say you're in banking or finance and you can use SQL, you can use R or something else, or simpler techniques. So in those cases, I think it's a little bit different. But I'd say try to quantify the impact ahead of time, and steer clear of the moonshot, newer types of use cases, because there are not that many major ML use cases being discovered every day; a lot of the top ones are already out there.
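Willem's back-of-the-napkin estimate can be literal arithmetic before any model is built. A minimal sketch, where every input (user count, churn rate, expected lift, revenue per user) is a hypothetical assumption you would swap for your own numbers:

```python
# Back-of-the-napkin estimate of a churn model's value before building it.
# All figures below are illustrative assumptions, not benchmarks.

def churn_model_value(monthly_active_users: int,
                      monthly_churn_rate: float,
                      expected_churn_reduction: float,
                      avg_revenue_per_user: float) -> float:
    """Monthly revenue retained if the model-driven intervention works."""
    churned = monthly_active_users * monthly_churn_rate
    saved = churned * expected_churn_reduction
    return saved * avg_revenue_per_user

# Example: 200k MAU, 5% monthly churn, interventions save 10% of churners,
# each user worth $20/month.
value = churn_model_value(200_000, 0.05, 0.10, 20.0)
print(f"${value:,.0f}/month retained")  # $20,000/month retained
```

If that number doesn't comfortably cover the cost of the team and infrastructure, that is exactly the signal Willem describes for steering clear.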
Eric Dodds 7:02
Tristan Zajonc 7:02
Yeah, I would agree with everything Willem said. On the last point, I do think we're entering an era where there may be some additional use cases that people are going to discover with these foundation models, the large language models that are being developed. And so it does feel like there are some early examples of this. If you look at, for instance, GitHub Copilot, what's the revenue from that? It's actually quite significant, very, very fast. But historically, that wasn't the case. There weren't a lot of examples of that over the last five years; there are maybe a few now. It's hard to find those new use cases, so I definitely agree with that. The other thing I would add is that it really is highly contingent on how difficult it is. If you're going to build a large language model from scratch, that's going to be incredibly difficult and you likely shouldn't do it; you're going to spend tens of billions of dollars to do that. On the other hand, if you can use an API and experiment in minutes to see if you can get some interesting results or change your product in an interesting way, absolutely go do that. And I think the same thing applies to all ML use cases: as the difficulty of doing the use case goes down, there's more opportunity for more use cases to be implemented and for the ROI to be positive. And I think both Willem and I are working on systems that are essentially trying to reduce the complexity of doing ML, and that will make the ROI positive in more domains.
Kostas Pardalis 8:36
I have a question. Do you think we can go through some of the most common use cases out there that have traditionally been tackled through ML, and do it in a very pragmatic way? Because most people, especially people who don't work in this area, when they hear about ML and AI, they think about self-driving cars, language models, automatically generated data, and all the very fancy stuff that's really at the state of the art right now in terms of what ML models can do. But machine learning is nothing new. It has been used for a very long time, and there are very concrete use cases with very concrete business value out there. And I think there is value in just going through the most common use cases, so people can relate to that.
Eric Dodds 9:39
And I will add on to that a little bit, just to maybe add a little bit of spice. I've had this theory for a while that you can boil business models down into a fairly durable set. So if you think about something like e-commerce, when you think about ML in the context of a purchase flow, a huge amount is probably already known about that; it's just changing variables. And so it's really interesting to think that some of that stuff is probably already known, like the work doesn't even have to be done. So anyway, just wanted to add that little component on to it.
Willem Pienaar 10:17
Maybe I can jump in. So at least from my perspective, I'm a little biased, because the types of customers we have skew more towards line of business: e-commerce, or potentially banks, or ride-hailing companies, those kinds of customers, not customers doing language models or image or video. And so we're not really focused on self-driving cars or that kind of bleeding edge of the space. What we see a lot of are, to your point, recommendation systems and fraud detection systems, primarily because we're focused so heavily on real-time at Tecton, and from my past experience at Gojek, that's also a focus area for us. So I'd say those are the two big ones. And, of course, churn prediction and optimization are two other big ones. At Gojek, for example, to predict churn for users, we would identify a cluster of users that are high risk, and then we would send out vouchers to them to make sure that they're happy and have higher retention. And then, of course, pricing and personalization of your product towards the customer is another area that is a little bit domain-specific, but also a very common use case that we see, and it can go from batch to real-time, depending on the kind of customer.
Tristan Zajonc 11:46
Yeah, I sometimes think of it as three main categories. One, you're building fundamentally new products and services that are only possible because of the AI that's embedded inside of them: self-driving cars, for example; Alexa is an example of that; Siri is an example of that. These are products that could not exist if you didn't have this underlying capability. That, I would say, is the minority, but it could actually be the most transformative over the next 10 years in terms of what is possible. What we certainly see the most at Continual, and in my previous roles, is really twofold. One is improving existing products and services. For instance, personalization is just a no-brainer for an e-commerce store to do at various parts of the customer journey, from search ranking to sending personalized emails to what's on the homepage, right? That's huge, and it makes a direct revenue impact. At a major hyperscaler like Facebook, there are lots of micro-optimizations that you can do, right? Who exactly do you show? What image, and then what text should you show as part of that image? What friends should you show? And so all those little ML models are feeding into a product experience. Some of them may have relatively small effects, but you're a big enough company that they're worth doing. The third set of use cases, which we see a lot at Continual currently, is the ones around business operations. So if you think about a retailer, or a manufacturing company, they don't have a customer-facing product, but they do have an immense business where there are opportunities to make predictions. And typically, we see two main classes in this category. One is around your customers: things like churn rate, lead scoring, upsell opportunities, more on a batch than a real-time basis.
So, everything to make predictions about your customers. And the other one is around operational use cases: things like inventory forecasting, supply chain optimization, logistics optimization. There's lots and lots of data at a certain scale where doing those optimizations becomes important, particularly when you're in a competitive industry with low margins. And so that's the final set of use cases. I see a lot of those right now.
Kostas Pardalis 13:55
That's awesome. First of all, it's super helpful for me; I'm always trying to enumerate all the different use cases around ML, and what you said makes total sense. But why don't we have, let's say, churn prediction as a service? Or why don't we have, I don't know, recommenders as a service? Why do companies need to go and invest in infrastructure for ML and build all these models on top of that infrastructure? What's the reason that we haven't seen a market like this, and instead lean more towards, let's say, a world where companies have to build their own models, and maybe their own infrastructure?
Willem Pienaar 14:39
So maybe we should take turns on who starts first, but I'll take this one quickly this time. I think there's an aspect of: it is actually happening. We are seeing vertical products for ML being built. We are seeing things like Amazon Personalize, purpose-built for recommender systems, and there are fraud detection vendors out there. And so there are off-the-shelf tools that you can use. There are shortcomings to them, and there's risk to them, because they're typically not completely end to end, and so you have integration pains in some cases with those vendors, but they are finding success, and they do take on a lot of the work that those teams would otherwise have to do. But I think another point or aspect is IP. For ML, a lot of companies see the actual ML system as something that's important, a competitive advantage to them. And so they often don't want to outsource that, because if everybody can just use a vendor, then what's your competitive advantage? With ML, you're basically breaking even on that front. And so they think, okay, we can just invest in this area and leapfrog our competition.
Kostas Pardalis 15:43
Yeah, that’s super interesting. What do you think, Tristan? What’s your take on that?
Tristan Zajonc 15:47
I do think that for product use cases, the IP issue is very real. But for the business operations use cases like churn, there's this question: well, why isn't there a verticalized tool? And I think it's the same reason why horizontal BI tools still exist: the data is so diverse, and the questions that you're going to ask are so subtly different. Almost every customer that we talk to asks us about churn. And then the question is, well, what do you mean by churn, right? Is it churn in the next 30 days? Is it at the end of the contract duration? Is it a dollar-based churn measure? Maybe you have usage-based churn, and you could have expansion and contraction. Maybe you have a premium plan and a basic plan, and you're trying to decide whether there's churn between those. Maybe it's all of these, maybe it's over different time horizons, and your business wants to have all of these different predictions. And so it's very hard for a vertical tool to do that, from an outcome perspective, defining all the outcomes that you want. You'll do churn and you'll want these variations of churn, and you'll want to do lifetime value, then you'll want to do, like, are they a highly active user, then you'll want to do what product they're going to buy next. And those predictions that you're going to want are also going to leverage the same inputs, the same signals. Ultimately, your predictive model is only going to be as good as the data that's flowing into it. And so then the question becomes, okay, well, what data do you have? And you're going to want to integrate all this different data from all these different sources.
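Tristan's point that "churn" has many definitions can be made concrete: the same customer record can be churned under one definition and retained under another. A minimal sketch, where the field names (`last_seen`, `contract_end`, `renewed`) are hypothetical, not any particular product's schema:

```python
from datetime import date, timedelta

def churned_30d(last_seen: date, as_of: date) -> bool:
    """Activity churn: no activity in the 30 days before `as_of`."""
    return (as_of - last_seen) > timedelta(days=30)

def churned_at_contract_end(contract_end: date, renewed: bool, as_of: date) -> bool:
    """Contract churn: contract has expired and was not renewed."""
    return contract_end < as_of and not renewed

as_of = date(2022, 9, 1)
customer = {"last_seen": date(2022, 8, 25),
            "contract_end": date(2022, 7, 31),
            "renewed": False}

# Same customer, two different answers depending on the definition chosen.
print(churned_30d(customer["last_seen"], as_of))                      # False
print(churned_at_contract_end(customer["contract_end"],
                              customer["renewed"], as_of))            # True
```

A vertical churn tool has to pick one of these definitions (or enumerate all of them), which is exactly why a horizontal approach over your own data keeps coming back.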
So for instance, we have customers in the Shopify ecosystem, where you'd think, oh, everybody has standardized data, why can't we standardize LTV models and standardized churn models, or personalization models? But they have other data, right? They have, for instance, a few in-person stores, which are not part of the Shopify ecosystem. And so I think that's where you're seeing this new modern data architecture, where people are, one, trying to integrate and aggregate data inside these cloud data warehouses, and then, on that shared data, build a whole bunch of shared use cases on top. And BI is obviously the first one: you buy a horizontal BI tool. I think ML is very similar, where you can build on and leverage that data that's inside your data warehouse.
Willem Pienaar 17:50
I think that's a very good point. And this impacts the infrastructure and tooling as well. It was worse originally, where you'd have ML tooling that is super horizontal. And even if you're vertical, the vertical tool is also limited in some ways; not limited, but it needs to be tailored to a specific use case. If you take fraud detection, for example, that is a very broad category. There are different kinds of fraud, and it all depends on the data model and the company: is it credit card fraud, or something else like KYC? So yes, I think it is improving, though. As we go from horizontal to vertical, there is definitely a problem of customization, and so it needs to be a lower-level abstraction than what is often produced out there.
Eric Dodds 18:41
One follow-up question on that. I'd love to dig in just a little bit more on the specifics there. What specifically have you seen improve, and how has that changed the process of delivering ML? At what points in the lifecycle of the build have the most significant changes happened?
Tristan Zajonc 19:05
Well, I'll start on this one. My honest opinion is I still think a lot of the ecosystem is horrifying. It reminds me a lot of the Hadoop era. I spent five years of my life in the Hadoop ecosystem, in sort of the 2015 to 2020 timeframe. And it's incredibly powerful. You can do amazing things with it, right? Everybody's excited about it; there's that energy behind it. The same thing applies in ML and MLOps. And that's all true: there's open source, and there's a vibrant ecosystem, all of that. But then it gets to this point where you're like, wow, this is way too complicated. And that happened with the big data ecosystem. Nick Schrock, the CEO of Dagster, has a line where he says we went from an era of big data to big complexity. I sort of feel like the same thing has happened in ML and MLOps. Now, there are two things happening in MLOps which I am excited about. One is I do think there's a rise of really next-generation, best-of-breed tools. Tecton might be one around feature stores, Weights and Biases around experiment tracking. These are good tools that are definitely far better than any alternative that existed in the past. I'm also excited that you're seeing, in the tech companies, next-generation platforms coming out with higher levels of abstraction. Facebook just talked about their internal platform called Looper, which is a declarative, end-to-end, real-time machine learning platform for product decisions. They've radically simplified the interface that engineers need to use to build predictive features into their products, and so they can have hundreds of use cases now that are very rapidly implemented and relatively easy to maintain. And at Continual, we're trying to do similar things.
I still think that if I talk to any person who's doing MLOps, nobody says they love what they have. That's what most of my conversations with people who are in the trenches are like, right? It's like, we get it to work, but it's not totally awesome.
Willem Pienaar 21:13
Yeah, I think the best of breed is there. There are kind of two paradigms in my mind: there's end-to-end and there's best-of-breed. Within end-to-end, you've got the horizontal platforms, like the OG Michelangelo, and I guess Kubeflow is horizontal, and then we have the vertical ones that we just spoke about. And out of all of those, I think they have different trade-offs. The vertical one, as we said, may need to be tailored towards the use case, and those use cases are all subtly different depending on the domain. Then with best-of-breed, you've got a different problem: there's an end-to-end flow into which you're introducing a single component, and as a vendor you can build the perfect component, but the user is left to assemble the end-to-end system, and that's hard. And so what we see in the MLOps space is death by a million cuts, right? You have so many decisions you have to make yourself. How do you do artifact tracking throughout the whole lifecycle, and metadata management? How do you do experimentation? Because you're not just plugging these together like Lego pieces, and that's extremely, extremely difficult, more so in the best-of-breed world. But I do see this divergence and convergence, right? There's a divergence where folks go away and build these tools, and then there's a recognition of the tools that are best of breed, and then you see all these blog posts coming out of, oh, this product works with that product, and there are integrations between them. And more and more, they're getting glued together in a way that makes sense, allows you to chain them, and removes all the decision friction and fatigue that users have to experience today. So yeah, we're in a kind of weird spot in the MLOps space right now, but hopefully we can power through this and get to a kind of consolidated modern ML stack.
Tristan Zajonc 22:52
Yeah, I totally agree with that. There is this tension between those two. There are a lot of startups doing the best-of-breed, narrower products and then thinking about the integrations and trying to solve those integrations. The hyperscalers are saying, oh no, we have these end-to-end platforms. But if you actually look at them, they're a bunch of not-best-of-breed individual things that you still have to glue all together. It'd be one thing if they said, okay, here's a template to do a continually improving recsys-type use case, right, where the model is maintained, the predictions are being made in real-time, the features are maintained, the whole thing is being monitored. If they said, okay, we make that easy for you, then you might say, okay, I'm going to go all in on that end-to-end platform. I think the challenge right now is, if you look at these platforms, they are basically a bunch of different components where it's left as an exercise for the reader to glue them together, and they all have these stack diagrams that look crazily complicated. I'm definitely hopeful that there will be end-to-end approaches that make it very, very easy to implement use cases but don't expose all that complexity to users. I don't think end to end means, okay, we're one vendor that has 10 different products and you put them all together yourself. I think you have to say, what's the end goal that you're trying to achieve? Maybe you have to narrow yourself into a domain, right? Personalization, or real-time machine learning, or sort of continual batch ML against your data warehouse for business use cases. And if you can narrow the scope, maybe you can find the right abstraction that makes the end deliverable easier to achieve successfully, basically.
Kostas Pardalis 24:31
We've been talking about MLOps, and I'd like to hear from you what differentiates ML operations, and the infrastructure being built for it, compared to the rest of the infrastructure that a company has. I mean, Ops is a very big topic. We've been investing in coming up with new tools all the time, and there are some amazing things happening when it comes to DevOps, for example. Even in data, right? Like big data, as you said: we started with big data, it became big complexity to manage, and there's been a lot of improvement there too, for data engineers to simplify operations and work more efficiently. So why is ML different? What do we need? What is missing?
Willem Pienaar 25:23
Well, effectively, you've got a data-driven decision system. And so there's inherent complexity about making decisions in your company that have an impact on your bottom line with a system that is making those decisions based on data, whether it's ML, or a regression model, or whatever it is. And so you can't do something like have a test oracle that just says, okay, this thing is good to go, ship it into production. You never have 100% confidence. And so you need ML-specific infrastructure, like experimentation systems or monitoring systems, so that you can track the outcomes, compare them to predictions, and make those things obvious to your end users. And those areas are still a little bit nascent today. What we see a lot of companies do is identify the P0s, the critical things they have to do to get a model into production, get an API, and maybe get it serving traffic, but they don't have the rest of the story around that: the monitoring and experimentation. And often, this ties back to the problem of when to use ML and when not to. If you didn't quantify this ahead of time, and you didn't perhaps start with a non-machine-learning model that you could A/B test against your machine learning model, then you're almost doomed to fail. So the summary is: there are inherent complexities with ML if you're basing the organization's decisions on data.
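Willem's suggestion of A/B testing a non-ML baseline against the model boils down to two mechanics: stable variant assignment and per-arm outcome comparison. A minimal sketch, where the variant names, experiment id, and retention metric are illustrative assumptions:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "churn-voucher-v1") -> str:
    """Stable 50/50 split: hash the user id so assignment survives restarts."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "ml_model" if int(digest, 16) % 2 == 0 else "rule_baseline"

def retention_by_arm(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Retention rate per arm from (variant, retained) observation pairs."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for variant, retained in outcomes:
        totals[variant] = totals.get(variant, 0) + 1
        hits[variant] = hits.get(variant, 0) + int(retained)
    return {v: hits[v] / totals[v] for v in totals}

# Assignment is deterministic: the same user always lands in the same arm.
print(assign_variant("user-42") == assign_variant("user-42"))  # True
```

If the `ml_model` arm can't beat `rule_baseline` on the business metric, that's the "quantify ahead of time" failure Willem warns about, caught before a full rollout.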
Tristan Zajonc 26:51
Yeah, I agree with that. I do think that right now there are two siloed stacks. There's the machine learning stack, which honestly feels to me more like it's coming out of that Hadoop era, sort of the next step, maybe a little bit more cloudy, but it's kind of got that feel to it. And then there's the more analytics-oriented stack, which is very much centered on SQL and the data warehouse, with a whole ecosystem around that, from job orchestrators to data quality monitoring tools, the whole modern data stack ecosystem of vendors, huge vendors around data observability and monitoring, that, as far as I can tell, haven't looked at the ML monitoring and observability use cases at all. I do think that there will be a convergence of these stacks. I think we will converge onto these hyperscale data platforms; that's where the data is going to primarily live. And I do think there only needs to be one job orchestration system: you don't need two, one for ML and one for the rest of your data engineering, at least if you're going to build all these things yourself. It's interesting with monitoring: is there a convergence there? Because the use cases around ML monitoring are very different, and the traditional data monitoring companies are not building the features that an ML monitoring team would want. It does feel like they're kind of separate, and there are things you would want in one area that you might also want in the other. So I think it'll be interesting to see how they converge.
There are unique challenges, though, to Willem's point: making real-time decisions, where you have real-time features, historical training of predictive models, and non-deterministic outcomes. You don't even know the outcome of what you're doing until you deploy it into production, potentially, right? What the product impact is. The machine learning metric that you can measure during training might not actually be the business metric that you care about, so you have to run an A/B test and do this sort of rollout. On the scale of data products that you can build, I do think machine learning products are the most challenging type, because they really require you to think about all of these concerns, and then the continual, ongoing life cycle of those concerns. It's not a one-and-done type situation.
Willem Pienaar 28:59
Yeah, the point that you raised about the analytics stack and the ML stack is also a very valid one. I mean, it's really clear to me that there's a yearning for simplicity, architecturally, within companies. And so part of the appeal of the modern data stack is that you can just shove everything into your BigQuery, or Snowflake, or other data warehouse or lakehouse, and centralize everything around that: ingestion, transformation, reporting, etc. I think the challenge with ML is, of course, that you're making real-time decisions in a lot of cases. And so there's a kind of philosophical gap there, an operational friction, where you've got warehouses built in a way where perhaps there's no staging/production split, but engineers demand that. And so they are wary of using the data warehouse as this kind of interface or source of truth for production. But at the same time, you're seeing teams ship ETL pipelines with models for batch use cases, perhaps. And so there's a bleed-over between those two, and I think long term you'll see a consolidation there, just because there's a lot of pressure towards having a single system where you store your data, not a bunch of data islands, and maybe one or two ways maximum that you want to transform your data. It's only if you really need, for example, an ETL system, perhaps streaming, or on-demand real-time transformations, that you pull the data out. People want to have a single place where they do something in a single way. And I think a big part of that is education as well. If you've got a workforce that's being taken from traditional roles into analytics or data, and now you're also bolting ML onto that, there's a lot of retooling and reskilling happening, and you don't want to overwhelm your workforce.
Kostas Pardalis 30:45
Yeah, it makes total sense. Do you feel like this convergence has started already?
Willem Pienaar 30:51
Kostas Pardalis 30:53
Is there an example, with the technology out there, that shows this convergence happening?
Willem Pienaar 31:00
We're just seeing companies like Snowflake and BigQuery growing in adoption, and teams rightly starting with tools like dbt, even for machine learning. And then, as they need fresher predictions or lower-latency predictions, they introduce more real-time elements to it.
Tristan Zajonc 31:27
Yeah, I think there absolutely is a convergence toward these hyperscale data platforms that have SQL at the core: Snowflake, BigQuery, and even Databricks, which has a somewhat different heritage, but if you look directionally at where they're going technologically, it's much more into tables and query planners, Delta Lake, all of that, under this lakehouse umbrella. That seems to be the core foundation. Every company we talk to wants to consolidate data for all of their use cases, ML, analytics, the whole business, onto one of these hyperscale data platforms. The challenge with respect to ML then becomes: what are the additional needs that ML, particularly real-time ML, has? There's real-time feature generation, which tends to lead to streaming, so what's your streaming story? There's real-time feature storage and serving, which leads to: what's your key-value, row-oriented store? These data platforms that have so much traction are all built for analytical use cases, so there are technical limits right now that haven't yet been overcome. The obvious ones to me are streaming, real-time serving, and maybe nearest-neighbor vector search for things like personalization, where you need to do approximate nearest-neighbor lookups. These are core bits of production ML infrastructure that you would typically have. The question then is: are those going to be separate systems, where you have a streaming pipeline or a hot cache of data? Or are these core platforms so ambitious that they're going to try to absorb those capabilities and expose them inside the core platform?
So Snowflake recently announced their hybrid tables concept, where you can do fast row lookups that potentially enable some additional real-time serving use cases that might be useful for ML. They're heavily investing in streaming, which might close the gap in terms of real-time feature generation, so that you can consolidate and bring those workloads onto the platform. Databricks is a platform that has some additional flexibility, where you can do some of those things, though it doesn't have real-time serving. So it'll be interesting to see where this all goes: whether it ends up being one core data platform and infrastructure with a whole bunch of workflows on top, with other vendors building the workflows, or whether there are core infrastructure bits that you'll still need to glue together to do production machine learning.
Kostas Pardalis 34:00
So you mentioned something, Willem, that's very interesting. You mentioned that people are starting to use dbt even for ML use cases. And it makes me wonder: there was always this dichotomy between ML and analytical use cases, where we would say, okay, the language of analytics is SQL, and for ML it's Python, and people don't cross this boundary easily. Did you see something changing there? Do you see SQL becoming more important? And why is this happening, if it's happening?
Willem Pienaar 34:38
Tristan Zajonc 35:40
Yeah, another point is that AI and ML are becoming increasingly data-centric. What really matters is the data feeding into your models; the pipelines that come after that are, I would argue, becoming increasingly commodified. There's an argument that ML is basically: what is the set of inputs, how do you model your inputs, your features, for your ML problem, and what is the set of outputs you're trying to predict? In the end state, maybe that's all you really need to provide, and all the other stuff gets hidden from you. If you have a system where the data becomes much, much more important, then all of your work ends up focused on data transformation, and that's really where these data platforms shine. Where they don't shine: if you're going to write a custom TensorFlow or PyTorch model and you need to train it on GPUs, pushing that into the data warehouse makes no sense today, and honestly probably never will. But on the other hand, if all of that stuff is hidden from you, then your job ends up being a data management and data manipulation job. And I think there's no question that SQL, maybe with a little bit of Python UDF here and there, is just such a more manageable way to do your data transformation and data engineering work, including feature engineering work. That's where tools like dbt come in, which put SQL at the core but now increasingly allow even little snippets of Python where necessary as escape hatches: you just get a much simpler system to operate, a much more performant system, and a much easier system to govern and manage. So your IT team wants it as well.
And who’s not going to adopt that?
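The "SQL at the core, Python as the escape hatch" pattern Tristan describes can be sketched in a few lines. This is just an illustrative toy, not any vendor's actual API: the table, column, and function names are invented, and SQLite stands in for the warehouse. Most of the feature logic stays in SQL, with one small Python UDF registered for the piece SQL can't express cleanly.

```python
import sqlite3

def bucket_amount(amount):
    """Python escape hatch: map a raw amount to a coarse feature bucket."""
    if amount < 10:
        return "low"
    if amount < 100:
        return "mid"
    return "high"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("u1", 5.0), ("u1", 250.0), ("u2", 40.0)],
)

# Register the Python function so it can be called from SQL.
conn.create_function("bucket_amount", 1, bucket_amount)

# The bulk of the feature engineering stays in plain SQL.
rows = conn.execute(
    """
    SELECT user_id,
           COUNT(*) AS order_count,
           bucket_amount(MAX(amount)) AS max_amount_bucket
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

print(rows)  # [('u1', 2, 'high'), ('u2', 1, 'mid')]
```

The design point is the one in the conversation: the declarative SQL part is easy to govern and optimize, and the imperative Python part is kept as small as possible.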
Eric Dodds 37:22
When we think about the graduation from the analytics side, a centralized store that's SQL-based, to serving in real time: are you seeing the need for real-time flow out of that? You build on the analytics stack, and then you graduate into the real-time use cases as you prove out value and realize additional opportunities. Number one, are you seeing that happen? And then two, it sounds like there's still a pretty gigantic technological gap: even if you have that foundation really tight in the centralized store, actually moving to serve that stuff in real time is non-trivial if you're just based on the centralized store.
Willem Pienaar 38:20
Tristan, do you want to start?
Tristan Zajonc 38:22
Well, I mean, this is right up your alley. But yeah, there's a huge gap between those two currently. Let me let Willem talk.
Willem Pienaar 38:29
There is a gap there, and a lot of challenges, just because the infrastructure out there is heterogeneous. But what we see is teams starting with the centralized stack, the data warehouse, proving the value of the use case in batch if they can. That's phase one. Phase two is often shipping that data into some kind of production environment, a static copy of the data, or a model, or something derived from it. There's a freshness problem in that case, but you can respond with low latency. But in most cases, you've got a product that's operationally running in real time, with an event stream coupled to it, and that's managed by engineers. So if you really want a real-time system with fresh data, and models that can depend on that data, the value a feature store provides is that it unifies the offline and the online worlds. There's a big technological gap, and part of the problem we're trying to solve with feature stores is how you go from the offline, batch training world into the online, real-time world in a consistent way, because the model needs to move between the two, but teams often have a handover there. So that's one: a technical challenge. Then there's an organizational challenge: you've got analysts creating features, perhaps, or data, and then data scientists improving those features, training a model, and shipping that into production or handing it over to engineers. There's a lot of back and forth between them: how do you actually interface those? The tools, we're hoping, can make that easier for folks.
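The offline/online unification Willem describes can be sketched as a toy feature store. This is a pure-Python illustration of the concept, not Feast's or Tecton's real interface; all class and method names here are invented. The same feature history feeds both a historical (offline) store for training and a key-value (online) store for low-latency serving, with a materialization step copying the latest values across.

```python
from collections import defaultdict

class ToyFeatureStore:
    def __init__(self):
        # Offline: append-only history of (timestamp, features) per entity.
        self.offline = defaultdict(list)
        # Online: latest feature values per entity, for fast serving lookups.
        self.online = {}

    def write_offline(self, entity_id, ts, features):
        self.offline[entity_id].append((ts, features))

    def materialize(self):
        """Copy the latest offline values into the online store."""
        for entity_id, rows in self.offline.items():
            _, latest_features = max(rows, key=lambda r: r[0])
            self.online[entity_id] = latest_features

    def get_online_features(self, entity_id):
        """What the model calls at serving time."""
        return self.online.get(entity_id)

store = ToyFeatureStore()
store.write_offline("user_1", 1, {"txn_count_7d": 3})
store.write_offline("user_1", 2, {"txn_count_7d": 5})
store.materialize()

print(store.get_online_features("user_1"))  # {'txn_count_7d': 5}
```

The point of the pattern is consistency: training reads the full offline history, serving reads the materialized online copy, and both come from one feature definition, so the model sees the same values in both worlds.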
Eric Dodds 40:05
Yeah. Super interesting.
Tristan Zajonc 40:09
Yeah, I think this is one of the big unsolved problems. On one hand, we have this great data foundation, but the real-time serving use cases are just hard to do. There's just a glimmer that maybe they'll eventually be possible on a single platform, with things like Snowflake's hybrid tables, but it's so early there. So you end up with these two worlds, and as soon as you do, you have to do this complicated dance between them, where you're moving data from your batch environment into your online environment. Or maybe you actually want to move data from your online environment into your batch environment; it's not clear which direction you always want to go. People take different approaches: some start with the online and log to the offline, others move batch data up to the online. So all of a sudden you've got a fair amount of complexity, and that's obviously what motivates tooling like feature stores to be built.
Kostas Pardalis 41:03
So guys, what do you feel about this? Is it more of a technology issue, is there technology missing right now, or is it an organizational issue? Because what I hear from Willem is that there's a choreography among many different people, and probably also departments, to make this happen, and there are feedback loops that need to exist that probably include an even broader set of people in the organization. So what do you think is the main challenge right now that the industry needs to address?
Willem Pienaar 41:40
That was two questions, but I'll say this: imagine you're a data scientist, maybe the initiator of the machine learning project in a company. The number of teams you need to interface with to get into production is high. It's not just the team with the API that's going to integrate with you. There's the platform team where you're going to run your training pipelines; maybe there's an ML platform team with something purpose-built, but that's unlikely. There's a team you need to speak to about monitoring for your system, like an SRE team or a DevOps team. There's a security and compliance team, and there's the operations team that actually speaks to the customer. There are so many stakeholders to manage that a lot of data scientists become more like product managers. As an industry, we've not made it easier for them to just get into production. And that's what vendors are trying to do with tools: provide a gateway, a portal, into the solution for each of these groups, so that one person or one group isn't responsible for going and interfacing with everybody. Because, as Tristan was saying, there's essentially a loop: training, serving, prediction, data collection, logging, storage, and then transformation, end to end. And all these teams are involved there. We're trying to make that easier to address through tooling.
Tristan Zajonc 43:10
Yeah. My view is that it's not a recipe for long-term success if every job you get assigned as an individual requires a significant amount of coordination. If you have to talk to all these different teams, hold meetings, and try to understand their systems while they understand yours and your needs, it's just a recipe for things going very, very slowly. And I think there are really two ways to solve that problem. One way is to have extremely well-defined interfaces between these different services, where you don't really need to go talk to the other team to use them. If there's a monitoring system, you just use it; you don't even talk to them, because they've exposed those interfaces to you. That's the Amazon model: small teams and clean interfaces, everything API-first. That's kind of their innovation lever. The other way is to find something more end-to-end, where a single person can do more. But there's only a certain amount of complexity an individual can hold in their mind, so you have to reduce the complexity very, very dramatically. Both of those are challenges. With interfaces, the abstractions are not always obvious; we're very much still evolving, and we're coordinating in part because we're trying to figure out what everybody needs for production ML. Likewise, end to end, it's often hard to find abstractions that don't box the user in. So there's a trade-off there, and that's how I see people navigating it.
Willem Pienaar 44:30
A good example of some of the challenges here: if you've got an Android or iOS app and you're making some kind of prediction, you want to track what action the user takes based on some kind of personalization. Often that requires the mobile team to go and develop some custom logic as part of their mobile application in order to collect the data that ultimately goes back into your experiment. There are all these little subtle areas in which you need to interface with teams just to get end to end. So I think the abstractions are still being cut, and that's the key problem to be solved.
Kostas Pardalis 45:10
Yeah. Do you also see space for a new role? Because, Willem, you mentioned the data scientist turning into kind of a PM in the end, trying to manage all these relationships. Do you think there's a need for that?
Willem Pienaar 45:23
Yeah, I see three roles. There's the research scientist, the person that's taking two years to write the paper, using the data in the company or organization to do that. Then there's the MLE, who is hands-on, goes end to end, builds the thing, gets it to prod, and maybe is even on call for it. And then there's the DS that becomes the kind of product manager, essentially the center point, with all the spokes of the star emerging from them toward the stakeholders, and he or she owns the use case. I don't think there's really a need for a new role, but those are essentially the archetypes we've seen out there in the wild.
Tristan Zajonc 46:05
Yeah, the one place I might differ a little bit: I think that ultimately, for AI and ML to become widely adopted, it needs to be put into the hands of more users. And that includes product engineers. I don't see any reason why a product engineer couldn't build an ML-powered feature in the long term, or why an analytics engineer, who maybe has more of a dbt and SQL background, can't put an ML model into production with no handoff, an in-production model that's continually maintained. In my view, anybody who's building production systems, from an analytics engineer to a data engineer to a machine learning engineer to a data scientist to a product engineer, should be able to do that. I do think the research engineer is separate; they're going to be in the weeds doing things in a bit of a separate universe. And occasionally, for very, very critical systems, maybe it's only the domain of the ML engineer. But I generally think there are more and more use cases and more and more systems that enable somebody who's more of a product engineer. Think about personalization: it feels like a product engineer plus a data engineer should be able to get that job done, if the tools exist. And then I think the ML engineer will love it too, because they'll be able to do different work, or deeper work, or more work, or have more impact. Not every single use case requires the deep infrastructure and ML experience that the ML engineer has.
Willem Pienaar 47:29
I agree. As the industry commoditizes, the engineers will have much higher leverage. And as the data science problems get solved and that whole layer gets commoditized, ultimately the product engineers, or even the product folks, will be the ones building these ML solutions. That's definitely, I'd say, the group that will attack the long tail of use cases. If you look at a Reddit, or maybe a Facebook, your key recommendation system or your key ML use case can always be custom-built; maybe Google search will be custom-built. But there will be a long tail of use cases that product teams can build themselves, using some kind of solution that's perhaps centered around the data warehouse, with abstractions they're already familiar with.
Kostas Pardalis 48:19
Yes, that makes total sense. All right. You've mentioned quite a few times the friction between batch and the real-time, low-latency requirements that ML has. What are the best patterns used right now to bridge this gap, for people to productize these models? How does it work, and what's the state of the art in the space?
Willem Pienaar 48:54
Do you mean for a specific use case or for?
Kostas Pardalis 48:57
Oh, if that changes based on the use case, that becomes even more interesting.
Willem Pienaar 49:01
Well, the truth is, at the foundational level, it's all about the data. That's why we started at Tecton with feature stores, providing a way for you to craft the data and features that power your models. Downstream from that, it's really very specific to your use case, extremely specific. If you're building recommendation systems, it's very different from, as Tristan said earlier, fraud detection. But from a data transformation and organization perspective, feature stores, or tools in that layer of the stack, or even dbt, provide a lot of value. They provide ways to go 70-80% of the way by crafting the features that ultimately power your models. And often you can experiment and see the performance of models offline before you even go live. Depending on the use case, that may not be accurate, and typically you want to go live to really know. But it's very hard to answer that question without diving into specific use cases, and even then it's so different from customer to customer.
Tristan Zajonc 50:14
This is kind of going into the weeds, although I'm curious about Willem's viewpoint on this. I do think there are two patterns I see being adopted across the feature store landscape. One is what you could call an online-first approach. If you look at how Facebook describes its feature store environment, or how YouTube describes theirs, these are massive scales where they're generating tons and tons of data; they don't lack training data. They tend to adopt a more online approach, where you generate online features and then log those features out, and you wait around to collect the training data based on that new feed. You deploy features first to online, then you log them, and then you train your models off of that. Then there's a different approach, which puts more emphasis on the ability to backfill data: generate features, and then generate training data going back in time. That introduces a fair amount of complexity, which tools like Tecton and Feast, and traditional feature stores, have taken on. It's a different architecture with a different set of trade-offs, and which is right probably comes down to your use case. But that's something I'm watching very closely: how those two different architectures unfold.
Willem Pienaar 51:27
Yeah, maybe to focus on the lack-of-data aspect, especially on the real-time side: the log-and-wait approach and the ability to compute or backfill data are completely different architectures, and they have big impacts on your stack. Neither seems to be a silver bullet. With log-and-wait, if you don't have high traffic, a large volume of users, it takes a long time to collect the training data. So if you ship a new feature and need to log it, maybe it takes you two weeks; if you don't have a lot of traffic, maybe it takes you two months. Your iteration speed can be slow. If you're Google, maybe it takes you minutes. On the other side, there's architectural complexity to the traditional feature store architecture, because you have the offline and online worlds. What I'm excited about is technologies like Snowflake and others, where there's real-time ingestion and hybrid tables, and the stream-centric platforms being developed that could potentially consolidate these two worlds. But even today, companies like Tecton also have logging architectures. If you're interfacing with an API for feature values, say you're calling some API to get data about a customer, or a transaction with a credit card company, the only way to deal with that is to log it out for training purposes later. You can't use it ahead of time; you can't query it in bulk offline. So even today, Tecton is kind of a hybrid. But I think over time, the log-and-wait architecture does have a lot of appeal.
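The backfill approach that Tristan and Willem contrast with log-and-wait boils down to a point-in-time join: for each labeled event, pick the latest feature value known strictly before that event's timestamp, so training rows never leak future information. Here is a minimal sketch of that idea; the data, entity names, and helper function are invented for illustration.

```python
from bisect import bisect_left

# Feature history per entity: (timestamp, value) pairs sorted by timestamp.
feature_history = {
    "user_1": [(1, 10), (5, 20), (9, 30)],
}

# Labeled events we want training rows for: (entity, timestamp, label).
events = [("user_1", 4, 0), ("user_1", 6, 1), ("user_1", 9, 1)]

def feature_as_of(entity_id, ts):
    """Latest feature value with timestamp strictly before ts, or None."""
    rows = feature_history.get(entity_id, [])
    timestamps = [t for t, _ in rows]
    i = bisect_left(timestamps, ts)  # count of timestamps < ts
    return rows[i - 1][1] if i > 0 else None

training_rows = [
    (entity, feature_as_of(entity, ts), label) for entity, ts, label in events
]
print(training_rows)  # [('user_1', 10, 0), ('user_1', 20, 1), ('user_1', 20, 1)]
```

Note the last event at timestamp 9 sees the value 20, not 30: the feature written at the same instant is excluded, which is exactly the leakage-avoidance property that makes backfilled training data consistent with what online serving would have seen.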
Eric Dodds 53:14
Super interesting. Well, we're close to time here and I want to leave plenty of time for questions, so please write your questions in. I'll start with one here, which both of you have talked a little bit about in general, and you're both building really cool tooling in the ML space. One of our listeners wanted to know, and maybe just pick one for the sake of time: what's the most exciting thing you're seeing in the ML space, specifically? You don't have to name a tool if you don't want to. As builders of these tools, what excites you most?
Tristan Zajonc 53:56
Well, I could start on this one. Two things. Obviously, what we're building today, which is a declarative approach to operational AI. There's a tremendous need for higher-level abstractions for production machine learning, and I think that's what we're trying to do at Continual. I get super excited when I read about, say, Facebook's system called Looper; you can read the paper, and it's a really exciting example of an end-to-end declarative platform for real-time machine learning. Apple has something called Overton, which is really exciting for more natural language processing use cases. Stitch Fix just had a great blog post on their system. So for me, the way I view it, there's generation one, with Uber's Michelangelo as the canonical OG example, where you have all the different components, which totally makes sense. What's the next step? I think we're starting to see that coming out of both startups and the hyperscalers, who are on to the next thing. The other thing, which I think you can't ignore, is foundation models, large language models, the things that OpenAI is doing. I'm very bullish on this being a new chapter, and it's very unclear what it's going to unlock, but you're even starting to see some commercial successes for use cases. I think it's moving forward at tremendous speed, and not only is it going to unlock a whole new set of use cases, but there's also a whole new set of tooling concerns you're going to have to address. It's unclear what the developer tooling ecosystem and the data management tooling ecosystem are going to look like for these extremely large language models. So I'm really excited by both the use cases and the tooling for large language models.
Willem Pienaar 55:37
Yeah, I want to echo the point that Tristan made around Looper. It's an extremely exciting direction. If you look at what has happened over the last couple of years, large tech companies have innovated, and the market has commoditized those technologies and approaches. What we see from Facebook, or Meta, is a platform that's declarative and very focused on the product engineer. I'm super excited about that: simple abstractions that address ML use cases. It's really about the persona being addressed here, and it seems like we're moving on to the product teams a little bit more. So I think that's the key thing I'm excited about.
Eric Dodds 56:21
Very cool. All right. Well, we're going to try and sneak one more in here, and it's about open source. Open source in software in general is a really interesting topic, but this question is specifically interesting for ML. Is open source even more important in ML? Because ML can have this flavor of ambiguity around it for people who aren't necessarily close to it. How important is open source in ML?
Willem Pienaar 56:58
I think while this industry is a little bit wild-west-y, it's more important, especially because of the abstractions we've spoken about. If an abstraction is not perfect, then you're stuck if you're not using an open-source tool. We see this a lot, for example, in Feast: if you want to use a different database as your backend, how do you do that with a vendor that doesn't support it? With Feast, you can plug in your own backend store. Long term, the jury is still out on whether open source is necessary as the delivery mechanism for the functionality. There are certainly a lot of companies, especially if you look at the modern data stack, that have proven you can solve a whole class of problems with a cloud-based solution. So, as we said earlier, for the long tail of use cases, the jury's still out.
Tristan Zajonc 57:57
I think we're going to see a similar transition to what happened in the data sphere, where, especially for ML infrastructure, the stateful services you need to manage, things are moving toward fully managed services that people can just use to get their jobs done. Once those services become good enough, once people trust the abstractions, the SLAs, and the company itself, it's going to be just so obvious to use these vendors. I think you see that with Weights & Biases and experiment tracking: just go use Weights & Biases, pay 50 bucks a month, and the value you're getting from it is amazing if you're looking for a way to track experiments. It's not an open-source product, but it's very much targeting the ML developer crowd, which you'd think would be the most open-source-friendly. I'm still hopeful that open source remains a huge part of ML, in terms of both open publishing and open libraries, the core libraries behind the ML algorithms. I think those will stay open source for longer, although even there, where I would have said these algorithms would stay open source forever, with these large language models it's becoming a little bit unclear whether the open approach, with Hugging Face as the best example, is going to win, versus the hyperscalers releasing proprietary models that are just going to be so amazing that you'll pay through the nose and use them anyway. And then there'll be enough competition in the marketplace that you're not going to get held hostage, so you'll feel like, hey, no big deal, I can always switch between Google or Microsoft or OpenAI for these large models. So we'll see.
Eric Dodds 59:28
Yeah, super interesting. All right. Well, we are at the buzzer. This has been such a helpful conversation. Tristan, Willem, thank you so much for giving us some of your time. Super helpful for us and for our listeners.
Willem Pienaar 59:42
Thanks for having me.
Tristan Zajonc 59:43
Thanks for having me as well. It was my pleasure.
Eric Dodds 59:46
Kostas, I appreciated so much the honest take that both Tristan and Willem had on the gap between the promise of ML and the reality of MLOps on the ground for people doing the work today. Both of them had very strong feelings that it's still a pretty gnarly space, and there are still a lot of things that are really hard to do. It was just really refreshing, especially since they're both founders of ML products. At one point, Tristan was pretty candid about the state of part of the ecosystem, and I just appreciated that take. I think that's very helpful, not only for our listeners but for us: to realize that there are tons and tons of promises out there, and companies like Continual and Tecton doing really cool stuff, but that we're still in really early innings.
Kostas Pardalis 1:00:59
Yeah, I think there was a wealth of updates on what's going on out there. I think it's good to hear that MLOps, in general, is still an unsolved problem in ML, and I think it makes sense. Usually we first come up with the technology, and then we figure out the operations around it. Obviously there are many similarities with software engineering, but there are always differences too, so we probably need different tooling or different methodologies. ML is only now mature enough to get to the point where we can say, okay, operations is what we should care about, and there's a blend of applications where this is happening right now; that's what I hear from the guys. I also kept some things about what's needed out there, beyond feature stores, the products these guys are building. There are broader needs: even database systems need to come up with more innovation to solve some of these problems, like what we were talking about with serving models and the features for the models, and how new hybrid database systems like Snowflake can help with that. It's all still early, but there's a lot of innovation that needs to happen, and from some angles pretty basic innovation, very deep in the infrastructure, even at the database and storage level. So it's a very exciting space. I would be more than happy to be one of these folks out there building products for the companies in this space. Anyone who's thinking about it, I think they should go and give it a try.
Eric Dodds 1:03:03
Absolutely. And Kostas, we'll build an ML startup name generator, AI-driven of course, to help support you in your mission.
Kostas Pardalis 1:03:14
Eric Dodds 1:03:18
All right. Thanks for joining us for another live stream and we’ll let you know when the next one’s coming out. Catch you on the next show.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.