51: Democratizing AI and ML with Tristan Zajonc of Continual

Joining Eric and Kostas this week on The Data Stack Show is Tristan Zajonc, the cofounder and CEO of Continual. Continual is offering early access to its operational AI layer for cloud data warehouses at continual.ai. 

  • Tristan’s background with Cloudera and the need for continual operational ML and AI (3:15)
  • How the complexity of Continual is hidden behind a simplicity of use (14:48)
  • Focusing on data that lives within a data warehouse (18:43)
  • Understanding features in the ML conversation (22:47)
  • The three layers of Continual (26:11)
  • The importance of SQL to Continual (30:19)
  • Caching layers and the data warehouse centric approach (38:28)
  • Betting on the warehouse being a central component of data stack architecture (43:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription

Eric Dodds 00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at Rudderstack.com.

Welcome back to The Data Stack Show. Today we have Tristan Zajonc; his last name does not sound like it’s spelled, but it is “Zions”; we confirmed with him. And he founded a company, his second company actually, called Continual. And they do really interesting machine learning stuff on top of your existing cloud warehouse, which I think is just going to be a fascinating topic. But one of the questions that I have, which in the parlance of machine learning is probably going to be predictable, is this: when you think about machine learning readily available on top of your existing warehouse, in many ways, that’s almost a democratization of machine learning, which inside of a lot of companies is still really hard to operationalize at scale, just because there’s so many moving parts and pieces. But this is something that Tristan actually saw on the ground, building tools for people to operationalize data science. So I want to ask him why, even though the promise of machine learning is still so exciting, it’s actually still a pretty hard problem when it comes down to the practical implementation. That is my burning question. And I’ve already been talking too much. So Kostas, what’s your question? And then I’ll give you plenty of time on the mic during the show. 

Kostas Pardalis 01:47

Yeah, I want to learn more about the product itself. They have a very interesting approach where they enhance data warehouses with ML capabilities in a way. So I want to see how they do it: what kind of components and what kind of abstractions they have built on top of a data warehouse, what is missing, and how they are dealing with the latency problem, for example. I’ll probably have quite a few technical questions to ask Tristan. And I’ll focus on that.

Eric Dodds 02:17

Yeah, that’s great. Yeah, it’ll be interesting to see. You know, a lot of times with technology like this, you see the introduction of new frameworks, or even languages that are sort of a variation of an existing language. So it’ll be interesting to see if they’re using established paradigms, or if they’re introducing new paradigms that make delivering this easier for them, but may be difficult for users. So without further ado, let’s jump in and talk with Tristan.

Kostas Pardalis 02:42

Let’s do it.

Eric Dodds 02:45

Tristan, welcome to The Data Stack Show. So many things to talk about. And we really appreciate you taking the time. 

Tristan Zajonc 02:52

Thanks so much for having me. It’s a pleasure. 

Eric Dodds 02:55

Okay, so we’re gonna talk lots about ML and AI and hear about Continual. But let’s start out by just telling us your background. You’re a two-time founder, so congratulations. That’s a huge accomplishment. But what led you to Continual? What’s your background? 

Tristan Zajonc 03:15

Yeah, well, that’s a little bit of a long story. Let me see how condensed I can make it. So I’m a statistician by training. I graduated from grad school in the 2012 era, when the rise of the term data science was happening, big data was happening. Of course, the cloud was sort of well underway. At that point, I was trying to figure out what to do next and had the entrepreneurial itch, after just seeing what I perceived to be a missing product in the market around enabling data science within the enterprise. And so in 2013, I founded a company called Sense, which was really one of the first enterprise data science platforms out there. It was targeting code-first data scientists, right? The rise of open source data science tooling was well underway, the rise of the big data ecosystem, Hadoop, Spark, etc., was well underway. And it felt like there needed to be a new statistical computing or data science platform that was geared towards these users, and increasingly it became clear that it needed to be not only geared towards those users, but actually also serve the needs of the enterprise to bring a team of those users together, enable collaboration, enable operationalization, etc. So that was really 2013. We raised a seed round, grew that company, basically to product market fit. And then right before a Series A, ended up getting acquired by Cloudera, the big data platform company, the leading provider of Hadoop. Spent three years at Cloudera, had an amazing time, and that product, Sense, that I built became their Data Science Workbench product; I guess they call it Cloudera Machine Learning now. Unsurprisingly, they realized that the pinnacle application on top of a data platform really is AI/ML, doing predictions. 
Sort of once you’ve stored the data, once you’ve processed it, once you’ve done some basic analytics, you really want to go beyond analytics to predictive analytics or AI/ML and there’s a whole class of users that don’t write Java, they write Python. And they want to enable those users. So I spent three very, very pleasurable years at Cloudera building out their data science platform. And then the entrepreneurial itch started scratching again, and I decided to leave Cloudera about two years ago to found Continual. 

Basically, the problem that I saw at Cloudera really was, there was tremendous buy-in for AI and ML to have a pervasive role across large-scale enterprises or businesses. I was sort of in the CTO for ML role at Cloudera, which was kind of partly outbound, partly inbound, and every customer I talked to bought into this idea of the AI-first, AI-centric enterprise. They could rattle off a dozen use cases or more in a meeting; they would often show me those slides that look like vendor slides where there’s tons and tons of use cases. But they were really all struggling to actually make that vision a reality. And at Cloudera, we offered an incredible portfolio of different products and capabilities just by being a very broad platform. But what I was seeing was just a lot of companies weren’t succeeding. And the reason was just the sheer complexity of actually moving AI/ML from the R&D phase into the operational and production phase, this continual operational phase. So you’ll find us at continual.ai; unsurprisingly, the name is Continual, so it has this idea of continual operational ML and AI. We have a very unique take on that, which I can talk about. But yeah, that was sort of the initial genesis.

Eric Dodds 06:37

I love it. And I want to circle back to why AI and ML are hard, because I think it’s just a helpful topic to discuss, especially with someone who’s actually built tooling around it, because I think you get to experience the problem in a unique way if you’re actually building tools to solve for it. But before we go there, can you just give us a brief, high-level overview of what Continual does?

Tristan Zajonc 07:04

Yeah, no, absolutely. So we like to say Continual is a continual AI and ML platform that sits directly on cloud data warehouses like Snowflake, Redshift, BigQuery, and Azure Synapse. It enables anybody to build predictive models that never stop learning from data. So it has this core recognition that the world is fundamentally changing and that data is continually arriving, that predictions and models need to be continually maintained. And so a typical application would be maintaining a customer churn forecast, an inventory forecast, a shipping arrival time, an out-of-stock event, whether equipment is going to fail. And we’re just building a fundamentally easier way to do that, and the way we accomplish that is we put the data warehouse at the center. So I can talk much more about that. We think that data is increasingly flowing into the data warehouse, and that’s the place where you should build an experience and workflow around it. It will play well with all the rest of the ecosystem. And it will just sort of 10x simplify both the process of building machine learning models but also, equally important, the process of maintaining and iterating on those predictive models that you have. I’m happy to go into more depth. But it’s a platform that’s fundamentally declarative. Like SQL, we try to make AI this process where it’s very data centric: you’re focused on what are the features, the input signals to these predictive models? What are the things that you’re trying to predict, like customer churn? We don’t think there needs to be, in this sort of modern data stack era, any Kubernetes, any containers, any Python pipelines. 80 to 95 percent of the AI/ML use cases that I see within the enterprise, we think, can be solved in this dramatically simpler way. And yeah, that’s what Continual is doing.

Eric Dodds 08:54

I feel like in that two-minute explanation, you gave us enough fodder to do five or six podcast episodes. So much to talk about. Could you just give us a quick answer, because there are so many things? And I think next, maybe we can jump to the warehouse being the center of the stack and talk through modern stack architecture, because I think that’s a really interesting subject. But before we do that, I think a good way to get there with context is to talk about why AI and ML are hard. So you mentioned that you had some tools that you had built at Cloudera. But you noticed that … and it’s a really interesting thing, and AI and ML are … Kostas and I will say the marketing kind of leads the actual practical usage inside of orgs, where it’s the promise of the future. And it is. We all believe that for sure. And we know there’s power there. But when the rubber meets the road, it’s actually pretty hard to operationalize it. Why is that? Could you just hit the top couple of points on what the barriers are that block companies from actually making it a reality and driving value? 

Tristan Zajonc 10:07

I feel there’s a lot of people who are also missing the mark in terms of solving the problem. So there’s some people that think, okay, well, what we need is a notebook that can access compute resources in the cloud, right? So they build a notebook. I mean, it’s a worthwhile tool, right? You can launch a notebook in the cloud and get access to a GPU, right, and that might be solving one particular thing. Or you might have people that are building some easier way or a different interface to actually train a model, right? But training a model actually isn’t that hard, right? Once you hire a data scientist who has some basic skill sets, calling fit in scikit-learn or XGBoost is typically not that hard. Now, then there’s people saying, okay, well, the problem … and I think this is starting to go in the right direction … the problem is really productionization and operationalization.

Now, the naive answer to that is, okay, we’ll put a predictive model inside a container, or something like that, and we’ll have a model deployment platform. But from my experience, all of those really miss the mark if you ask why AI and ML aren’t being successful in the business, and why they aren’t actually being embedded in these business processes so that they can have this impact. And the fundamental problem there really is around the continual nature of ML, right? So there’s very rarely a static model that you can deploy. Even if there is a static model, the data that’s feeding into that model is not static, right? So you have data continually coming in about your customers, about the products that they’re purchasing, about the inventory on your shelves; you’ve got the mileage, how much jiggle there is in your aircraft engine that might have a maintenance issue. And so all of that data that’s coming in is changing, even if the model is not changing. Now, almost certainly, the model is also changing, because the world is fundamentally changing. And then if the model is changing, or the data that’s going into that model is changing, the predictions are changing, right? So you need those updated predictions. So in order to really embed AI/ML experience or insight into a product or into an operational system within an enterprise, you’ve got to think about that continual nature, right? You’ve got to build a workflow. So then the next step is, okay, we recognize that that’s the problem. Well, how do you solve that? And if you go and look at the canonical stack diagram, right, Uber’s Michelangelo platform, they’ve documented their internal ML platform, and if you go and look at that, you see, wow, there’s about seven different distributed systems in this diagram, right? 
It’s all of this crazy pipeline jungle: there’s data storage systems, there’s training systems, there’s monitoring systems, and then, patching it all together, is what at least looks to me like spaghetti DAGs to manage all their training and inference and testing and performance monitoring, and all that sort of thing. 

And so I think that you get to that, and either you don’t have the in-house capability to pull that off, or the ROI ends up not being there, because it becomes so expensive to build and retrain these models that you say, hey, let me go and work on other more pressing problems. And so we think that that’s a solvable problem. We think that there’s a way, just like in the Hadoop ecosystem, which I’m familiar with. We went from the MapReduce era, right, where you wrote all this Java code to do basic analytics to figure out how many customers churned, and now, of course, we just go and open up Snowflake and run a query. The same sort of thing can happen for ML, but it doesn’t just need to be an easier interface; it also needs to be this continual operational system. And that, I think, is the trick, right? It’s not just that somebody puts a prediction statement inside of a SQL statement. That’s really not enough. That’s not solving the core problem. You need to think of an easier way to build and maintain both the model and the predictions. Yeah. So that’s my diagnosis, at least.

Kostas Pardalis 13:52

So Tristan, you mentioned the complexity that someone can see in this architecture, with all the different distributed systems, this jungle of pipelines and all that stuff. How does Continual simplify that? Actually, there are two questions. One is what kind of complexity does Continual expose to the user? And the other is what complexity is hidden, and how do you manage to hide it? Right? So can you share a little bit more about that? Because it’s super interesting. I always find it fascinating. I think one of the reasons that I love technology is that it gives you this opportunity to build something very, very complex in terms of how it operates, because that’s how the world works, but hide it behind a lot of simplicity. Right. And I think that’s very common, what we see with technology, so I’d love to hear more about how you do that with Continual.

Tristan Zajonc 14:48

Yeah, no, I love that analogy. I think it is true that the history of technology is, in many ways, the history of the hierarchy of abstractions, right? Think about programming languages, all the way down to the hardware level. At each layer, there’s another abstraction that hopefully isn’t leaking, and therefore makes building on top of it dramatically simpler. But in terms of continual ML, just step back and think for a moment. So don’t look at all the technology and just pause. What is ML, right? What is a machine learning model, a predictive model, exactly? It’s really nothing more than a function that takes some inputs. So data, and those inputs are often called features, right? So signals or features, they could be about your customers, right? Let’s say that you’re doing a customer churn problem; those things are, well, how does the customer use the product? How much have they used it in the last seven days? So there’s a set of inputs, and then there’s a target. And that target is the thing that you’re trying to predict, so in this case, customer churn. Now, customer churn could have a few different definitions: 30 days, 90 days, 100 days. So you see how quickly, once you go down this path, you have a lot of predictive models, even if you only think of one use case. Then there’s a function between those two things. Now, increasingly, that function is a very, very complicated transformation between the inputs and the prediction task that you’re doing. But the level of abstraction that ideally you should be able to achieve is, hey, manage your input signals, your features, and manage what you’re trying to predict. What’s inside that transform function is going to be learned by machine learning, and really, you shouldn’t have to think too much about it, right? 
That’s not something that feels like an essential complexity that you should have to worry about and manage. So that’s the whole world of automated machine learning, which deals with that transform function and figuring out how to compare a bunch of models and find the best models and the best architectures that will give us the best predictive performance.
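To illustrate the "a model is just a function from features to a target" framing, here is a minimal Python sketch. It is not Continual’s implementation: the feature names (sessions_7d, tickets_30d), the threshold rules, and the tiny holdout set are all made up for illustration, and a real AutoML system would fit and compare far richer models.

```python
from typing import Callable, Dict, List, Tuple

# One row of input signals (features) for a customer; field names are hypothetical.
Features = Dict[str, float]
# A model is just a function: features in, churn prediction out.
Model = Callable[[Features], bool]

def make_threshold_model(feature: str, threshold: float) -> Model:
    """Predict churn when the given feature falls below a threshold."""
    return lambda row: row[feature] < threshold

def accuracy(model: Model, data: List[Tuple[Features, bool]]) -> float:
    """Share of examples where the prediction matches the observed target."""
    hits = sum(1 for row, churned in data if model(row) == churned)
    return hits / len(data)

def pick_best(candidates: Dict[str, Model],
              holdout: List[Tuple[Features, bool]]) -> str:
    """Return the name of the candidate with the highest holdout accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], holdout))

# Made-up labelled examples: (features, did the customer churn?)
holdout = [
    ({"sessions_7d": 0.0, "tickets_30d": 3.0}, True),
    ({"sessions_7d": 1.0, "tickets_30d": 0.0}, True),
    ({"sessions_7d": 9.0, "tickets_30d": 1.0}, False),
    ({"sessions_7d": 5.0, "tickets_30d": 4.0}, False),
]
candidates = {
    "low_usage": make_threshold_model("sessions_7d", 2.0),
    "few_tickets": make_threshold_model("tickets_30d", 1.0),
}
best = pick_best(candidates, holdout)
```

The user-facing surface here is only the features and the target; which candidate wins is decided by the comparison step, which is the part an automated system would own.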

There’s a second dimension to that, which is the operational and continual dimension, which is also what we focus on, which is saying, okay, well, now, if you’re going to operationalize this, you need a way to continually retrain and continually predict. And so that should just be policy, right? How often do you want things to be retrained? How often do you want things to be predicted? So what Continual does is it really gives you a workflow to, one, manage and collaborate around all your features. So you do that with SQL. You say, hey, here’s how I’m going to model my business. Here’s my customers, here’s the features on them. Here’s my products, here’s the features on those. Here’s my stores, here’s the features on those, etc. Then you can manage your prediction targets: what are the things you’re trying to predict? And everything else is automated, right? The process of training models and retraining models, comparing models, the process of maintaining the predictions, we automate all of that. We try to distill that down to this essential complexity. Now, we bet that one way to do that is to have your data in your data warehouse, right? You kind of need to ask, what’s the level of abstraction below you? And what we bet on, and I think this has been an amazing enabler for us, is that the future is the data warehouse, the future is SQL, from a data management perspective and a data transformation perspective. Now all we need is an AI/ML system that’s operational, plays well with that ecosystem, and has a workflow that works for that ecosystem, that user, etc.

Kostas Pardalis 18:12

That’s fascinating. So when we are talking about the use cases that Continual puts forward, we are talking about doing predictions, and machine learning, and AI around using pretty structured data, right? Or usually when we think about ML and AI, the first thing that we think about is image recognition, right, computer vision. Is this something that also can be part of Continual, or is the focus right now mainly on structured and business data?

Tristan Zajonc  18:43

So that’s a great question. In the short to medium term, we are really focused on data that you typically see within an enterprise that lives in a data warehouse, and that tends to be structured data that has a relational dimension to it, right? So customers buy products, etc. And it also very clearly has a temporal dimension. So we have an abstraction that is very tailored towards relational and temporal data, and building both features and predictions on top of that relational, temporal data. Now, what’s very exciting is the model is easily extensible to richer types of data. For instance, we already support text data, right? We can use text data as features. And if you think about computer vision, right, conceptually, computer vision is nothing more than an image type going into a function, and let’s say you’re trying to do classification, well, that would be a class or a category or a Boolean or something as the output type. So that abstraction level still works. It can be even more sophisticated: it can even be that a video comes in on one side and a segmentation video comes out on the other side, and really, you can think of that as a type in and a type out. For us, if you think about the bread and butter workloads that are happening in a data warehouse, that’s probably not the dominant use case; it’s certainly not the dominant use case we see. We do see a ton of text data and the need to leverage and extract information from text data. And we do increasingly see image data, right? So Snowflake, for instance, just announced support for unstructured information, including images, text, PDFs, etc. And a lot of times you want to extract insights from that and then put those insights back into your data warehouse so that you can then query them, right? 
So our belief is the data warehouse is still going to be the place where a lot of that stuff happens. Now, if you’re building an autonomous car, right, that’s processing a whole bunch of real-time streams, no, that’s not going to be that architecture.

Kostas Pardalis  20:42

Yeah, it makes total sense. And all these use cases, with IoT, and machine learning at the edge, and all that stuff, they’re more specialized. But yeah, that’s super, super interesting. And if I understand correctly, okay, and I’m coming more from the world of data engineering, not so much from machine learning, so I’m still learning about that. The way that things work is that, let’s say, we have a data warehouse. So we push our data there; how we collect it doesn’t matter. And from this raw data that we have, the next step is to go and create some features, right? And once we have done that, the next step is to feed these features into a model and train a model. Is this correct, first of all, or am I missing something?

Tristan Zajonc  21:27

Yeah, that’s absolutely correct. Although I would say, just don’t forget the continual part. The end goal that you’re really trying to end up at is that continual process by which you maintain that model, at least on some frequency, weekly, monthly, etc. And it will depend on whether you’re doing real time or continual batch, but let’s say that you’re doing customer churn or something like that, or inventory. You’re almost always continually maintaining that prediction, right?

Kostas Pardalis  21:57

Absolutely. Absolutely. Yeah. I’m talking mainly about, let’s say, the transformation of the data. I’m not thinking that much about the operations, right? And I’m wondering, how do we go from the raw data to the features, right? How do users do that? And let’s take a more concrete example. Let’s say the use case here is churn. Okay. So for a user who is going to start implementing Continual today, let’s say, and assuming that they have all the data in their data warehouse, how can they do the first step, which is going from their raw data to the features? And what do these features look like? Because I hear the word feature a lot, feature stores, all the stuff around MLOps, but at the end, what are these features, right?

Tristan Zajonc  22:47

Yeah, yeah. So a feature is something that you believe, given your business insight, your understanding as a human of the business, is going to be predictive of whatever you’re trying to predict, in this case, churn, right? So a classic feature in this case would be, let’s say you have clickstream data coming through RudderStack and into your data warehouse. You might then say, well, I have a deep insight that activity over the last few days, let’s say seven days, is very important. And so you might want to embed that knowledge, basically your business knowledge, and you would define that as a feature. And you would really want to be able to reuse that feature across all of your downstream use cases, right? So you don’t just have customer churn. You also have something about the other products that they might want to buy, you also have their LTV calculation, lifetime value, you might have net expansion and net contraction, so something that’s maybe not a binary churn metric, or upsell to the premium plan. So what we see is, typically, once you’re dealing with these sales and marketing use cases, you might start with churn, but if it becomes easy to build predictive models, very quickly you go from one model to a dozen in that very narrow domain, even putting aside all the other parts of your business that you could impact, and you use the same features for those downstream use cases. And so one of the benefits of a feature store is the ability to easily reuse your features in multiple applications.
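As a rough illustration of the "activity over the last seven days" feature, here is a minimal Python sketch. In Continual such a feature would be defined in SQL against the warehouse, per the conversation; this sketch only shows the logic. The event shape (customer_id, event_time), the function name events_last_7d, and the sample data are assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Hypothetical clickstream record: (customer_id, event_time).
Event = Tuple[str, datetime]

def events_last_7d(events: Iterable[Event], as_of: datetime) -> Dict[str, int]:
    """Count each customer's events in the 7 days up to and including as_of."""
    window_start = as_of - timedelta(days=7)
    counts: Counter = Counter()
    for customer_id, event_time in events:
        # Only events inside the window contribute; later events are ignored too,
        # so the feature never looks past the as_of timestamp.
        if window_start < event_time <= as_of:
            counts[customer_id] += 1
    return dict(counts)

events = [
    ("c1", datetime(2021, 8, 1, 9, 0)),
    ("c1", datetime(2021, 8, 6, 14, 30)),
    ("c1", datetime(2021, 7, 20, 8, 0)),   # older than 7 days, excluded
    ("c2", datetime(2021, 8, 7, 11, 15)),
]
activity = events_last_7d(events, as_of=datetime(2021, 8, 8))
```

Because the feature is keyed by customer rather than by model, the same counts could feed a churn model, an upsell model, or an LTV model, which is the reuse benefit described above.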

Tristan Zajonc  24:15

Another aspect of that, which I think is maybe less well understood outside the feature store and ML community, is the temporal nature of features. If you’re trying to predict something about customer churn, let’s say customer churn in the next month, right, it’s critical that you have the ability to go back in time and ask yourself, hey, what was that feature a month ago, two months ago, three months ago? So that you can then look at the future ground truth. How does a machine learning model learn? It needs to see some examples of that actually happening, a customer churning. And so the way you do that is you go to your historical data, and you look back in time and you say, okay, well, did the person with these characteristics a year ago then churn in the next 30 days? That’s 11 months ago. And so in your feature store, you need to make sure you define your features in a way that allows you to have this time machine characteristic. Sometimes it’s called point-in-time correctness or a temporal join. You need to be able to go back in time and say, I need to get that feature at that particular point in time.
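The "time machine" lookup can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular feature store implements it: the History layout (per-customer lists of dated values, sorted by date), the function name feature_as_of, and the sample values are all hypothetical.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

# Hypothetical feature history: customer_id -> [(effective_date, value), ...],
# with each list sorted by effective_date.
History = Dict[str, List[Tuple[date, float]]]

def feature_as_of(history: History, customer_id: str, as_of: date) -> Optional[float]:
    """Return the latest value recorded on or before as_of, or None if none exists."""
    rows = history.get(customer_id, [])
    dates = [d for d, _ in rows]
    # bisect_right gives the number of rows dated on or before as_of.
    idx = bisect_right(dates, as_of)
    return rows[idx - 1][1] if idx > 0 else None

history = {
    "c1": [(date(2021, 1, 1), 12.0), (date(2021, 2, 1), 7.0), (date(2021, 3, 1), 1.0)],
}

# Building a training example dated Feb 15: use only what was known by then,
# never the March value, which would leak future information into training.
value_then = feature_as_of(history, "c1", date(2021, 2, 15))
value_before_any = feature_as_of(history, "c1", date(2020, 12, 31))
```

Pairing value_then with whether the customer churned in the 30 days after Feb 15 gives one correctly dated training example.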

Tristan Zajonc  25:16

So what Continual does is it gives you a whole workflow to define those features, and to make sure that you organize them properly, that you attach them to the right entity, your customers, that they have an appropriate time index, and that when you train your models, you get the features for the right point in time so you don’t have data leakage. And so that’s all very important. Now, we also bet, to your question of how you actually define a feature, that you should do that with SQL. Increasingly, SQL really is this incredibly powerful lingua franca. It scales beautifully. Especially when you deal with data at scale, it just becomes this very, very powerful language. And so that’s how we think about that process.

Kostas Pardalis  25:59

So is it accurate to say that one of the things that Continual offers to someone who has a data warehouse is to actually extend the data warehouse with a feature store?

Tristan Zajonc  26:11

Yes, exactly. So we really have three layers to Continual. So one is a feature store, but it is a virtual feature store that sits on top of your data warehouse, right? So we replicate no data into our system. We define, essentially, views, organize the views on top of your existing data, and give you a workflow for that. You can also use the native integration, for instance, with DBT. So if you’re coming from the world of DBT, the data build tool, you can define those features and your target, pretty much your whole model, using DBT. So that’s kind of at the core. And that’s why we say we’re a data-first platform for AI. We really think that’s the most important thing, modeling your business, and that’s where you really need to bring all your expertise to bear.

Tristan Zajonc  26:55

Above that, in terms of training models, we have this declarative AI engine. You can think of it as an AutoML system that has this very flexible ability to pull in data, this temporal, relational data, and make state-of-the-art predictions over time. And then the final thing we have is this continual ML operations aspect. So we don’t just train that model once. It’s not upload a CSV file and get a bunch of models. It’s really about maintaining both the model and the predictions, and giving you visibility on top of it all.

Tristan Zajonc  27:24

That may sound like a lot, but it really is not. Because the only thing that you’re actually doing is you’re really just defining your data, right? The rest of it’s all just happening automatically, kind of on autopilot. And the end result is that you basically get state-of-the-art, continually improving predictions inside your data warehouse, with a workflow that makes it not only easy to build that but also easy to maintain, and also easy to iterate. And our goal is really, imagine a company that has 500 models. What is the system that’s going to be able to do that? Putting aside whether it’s Continual or somebody else, in my view, it’s got to be a high-level declarative system. That’s the only way to manage 500 models. If you go and try to manage 500 models and the continual lifecycle of 500 models using a whole bunch of custom Airflow DAGs that you write, and every single one is a custom script maintained by a data scientist, I mean, that is just not a recipe for success. That’s not the future that I think is possible. It may be the status quo today, that that’s the way we do it. But I think we all need to be striving for some sort of higher-level experience. If we really want AI and ML to become pervasive, there’s got to be some higher-level experiences that we invent as technologists.

Kostas Pardalis  28:39

Yeah. And just to make it a little bit more clear. When you are talking about this declarative language, you’re talking about having an approach when it comes to operationalizing models similar to what Terraform, for example, has done for the cloud infrastructure. Is this correct?

Tristan Zajonc  28:56

Yeah, that’s a fantastic example or analogy. So if you think about managing cloud infrastructure, you manage that now with a declarative approach; you manage it using Terraform. If you think about managing containers, right, you likely manage them using Kubernetes, and you define declaratively, here’s what I want to happen. And then Kubernetes goes and makes it happen. And if the thing fails, and machines fail, it fixes those problems, right, and it maintains the number of replicas that you want. And so yes, we think that our experience is very much tailored to that: you have this configuration, you can push it into the system, and we go make it happen. You can do that in a UI, or you can actually do it in version control, just as you would with Terraform or Kubernetes manifests. The second element, though, is that there is this data element. And so it’s not just a bunch of YAML. There’s also SQL there: in order to define your input features and your output targets, you do that using the language of SQL, and that allows the whole system now to become declarative. So on one hand, you have the data manipulation, and SQL is a declarative language. So we have a declarative language for the necessary data manipulation that you need to organize and model your business, and then the continual operations aspect is declarative in a way. I think that analogy is the best analogy, right, exactly like Terraform.
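
The declarative pattern described here (state what you want, and let the system work out how to get there) can be sketched as a small planning step that compares desired state with actual state. This is a hedged illustration of the general Terraform or Kubernetes style of reconciliation, not Continual's engine: `ModelSpec`, its fields, and `plan` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical declarative description of one model."""
    name: str
    target_sql: str          # SQL that defines the prediction target
    retrain_every_days: int  # how often the system should retrain


def plan(desired: Dict[str, ModelSpec],
         actual: Dict[str, ModelSpec]) -> List[Tuple[str, str]]:
    """Diff desired state against actual state and list the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# What is checked into version control (desired) vs. what is running (actual).
desired = {
    "churn": ModelSpec("churn", "SELECT user_id, churned FROM churn_labels", 7),
    "ltv": ModelSpec("ltv", "SELECT user_id, ltv FROM ltv_labels", 30),
}
actual = {
    "churn": ModelSpec("churn", "SELECT user_id, churned FROM churn_labels", 30),
    "old_model": ModelSpec("old_model", "SELECT 1", 30),
}
actions = plan(desired, actual)
print(actions)
```

The user only edits the desired specs; the diff decides which models to create, update, or retire. That is what makes managing hundreds of models plausible compared with a hand-written script per model.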

Kostas Pardalis  30:19

Interesting. A quick question: you keep mentioning SQL, how important SQL is, and how much you’re betting on SQL. Do you see some kind of limitations in the expressivity that SQL has in terms of creating features, or in the ergonomics of the language? The reason that I’m asking is because a very good example, as we have seen in this space, is DBT, right, which was a project that came to life exactly because of the limitations, more around the ergonomics, that SQL has, right. And DBT came and brought into the game all these best practices and all these nice tools that engineers used to have, and brought these into data. So do you see any kind of limitations with SQL? And if yes, how do you see that we can overcome these?

Tristan Zajonc  31:08

Yes, I mean, so you absolutely need to combine SQL with a workflow around SQL, for instance, DBT, right. So if you just have a bunch of shell scripts lying around with SQL statements in them, that’s not going to be a great way to manage your data. But if you’re trying to model your business, and your data is already in a data warehouse, you should really use the power of SQL. And I think, as you more and more embrace that philosophy, you realize how far it will go. I come from a Python background; I lived and breathed Python and R and all of those tools. In machine learning, there are instances where you kind of think, okay, well, that might be a little bit easier to express in, or to wrap up in, a Python syntax. What’s amazing is, increasingly I just don’t see that. I think that the data that’s coming into machine learning models is more and more raw. So it used to be that you needed to do a tremendous amount of feature engineering. And in our system, you can still do feature engineering to kind of bring your business insight to bear. But increasingly, the model itself is doing some degree of feature engineering internally, which is really just part of the model. And so for instance, if you look at the history of computer vision, right, and even tabular data, increasingly, you can push raw data into those models, raw images, raw tabular data with very little middle pre-processing. And then some of the complicated feature engineering that’s very ML specific, and maybe SQL is not as well suited for, that can happen sort of internal to the model. And so I think that type of feature engineering really doesn’t need to be exposed to the end user. Right? So if you think about the types of features where the business user, or the user, needs to bring their own insights to bear, I have a hard time thinking of where SQL has let me down in terms of that.

Kostas Pardalis  33:01

That makes total sense. I mean, and I think, again, as I said, I’m not coming from ML, but a big part of the success of deep learning is exactly that: the model itself generates optimal features that can help build better models. Because I remember back in the beginning of the 2000s, when we didn’t have deep learning yet, in computer vision, most of the papers that you would see getting published were about what kind of features we can create to make sure that we can tackle a very, very specific niche use case of computer vision a little bit better. And I think part of the revolution with deep learning is exactly that.

Tristan Zajonc  33:41

Yeah, absolutely. I mean, you can’t write an edge detector on an image in SQL, I grant you that. But that’s not what you need to do anymore, right? For the state-of-the-art models, you just need to pass in a raw image. And increasingly, you might even be able to ask something like a question on that image, like, how many cars are there in this image? Right? So you might not know this whole area of visual question answering. So even something kind of as wild as that, if you think about it from a data perspective, it’s really no more than an image coming in, and a column with a question and a column with an answer. And that’s, I mean, that’s just mind-blowing to me. I mean, it’s almost amazing that that’s possible. And I think the overwhelming trajectory is towards that. A lot of models don’t even need data, I mean, increasingly. So if you look at what’s happening with, for instance, OpenAI and their GPT-3 type of work, and of course speech recognition, in many parts of the domain you actually don’t even need to bring any data to bear. There’s no model training; it’s just an API. But what we’ve seen is within the enterprise. So some people ask me, okay, well, is it all just going to move to everything being on an API? Well, the answer there is no, clearly no, because for the customer churn within your business, you have to look at your historical churn patterns. There’s no way you can just predict customer churn, given a user’s demographics, if you don’t have some history there. Same thing with inventory forecasting, predictive maintenance use cases. There’s a set of use cases where fundamentally they’re data driven. They’re driven off the data of your business. And so what we’re doing is really trying to provide the easiest experience for those types of use cases.

Kostas Pardalis  35:11

I keep saying that there are data problems where the business context is very important. You cannot take the churn model that can predict what is happening at DoorDash and use it in Continual, right. It just doesn’t work. It’s a completely different view of the world, right? Because they are dealing with a completely different view of the world.

Tristan Zajonc  35:37

Yeah, no, absolutely. I mean, I think that’s actually why the data warehouse is so powerful. And in terms of a data strategy for companies. I mean, a lot of times people say, Well, isn’t, for instance, all of AI and ML gonna get verticalized. People have said that about BI as well, right. So you have the sales and marketing use cases like churn and you have inventory forecasting use cases. But what I’ve seen is twofold. One, even for those very standard use cases that every business has, the data is very different, right? So of course, the signals that you’re getting from all of your different touchpoints, your websites, your products, I mean, all the things that are sending you data, all of that data is very bespoke to your business. Right?

Tristan Zajonc  36:17

So Strava is very different from DoorDash, but they might still both have churn at the end. They’re trying to predict churn and reduce churn. Actually, this one thing sort of surprised me, even as I’ve worked with more and more companies: even the definition of churn is very bespoke. I was just chatting with a company that had Stripe data, so you think, okay, very standard. But in their world, churn was defined as they have to be 30 days out of sort of … they could cancel their account due to a bad credit card, but if it’s just 15 days, and then they manage to put in a new credit card, it’s not churn. And so the beautiful thing about the data warehouse is it gives you, the data professional, the data scientist, the power to kind of model your business in the ways that are unique, and it has that flexibility, but tries to hide everything else. Yeah. I have a hard time seeing how, for many, many use cases, you can get rid of that. And so no, our bet is that that level of complexity, the ability to model your business, and the need to model your business, is going to persist for most sophisticated data-driven companies.
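
As a rough illustration of how bespoke a churn definition can be, the rule from that anecdote might be written as follows. This is a sketch under assumptions: the transcript only gives the 30-day window and the 15-day counterexample, so the function name, the parameters, and the exact boundary handling (reactivation on or before day 30 counts as not churned) are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def is_churned(lapsed_on: date,
               reactivated_on: Optional[date],
               as_of: date,
               grace_days: int = 30) -> bool:
    """Return True if an account that lapsed on `lapsed_on` counts as churned.

    Assumed rule: the account churns only if it stays lapsed for more than
    `grace_days`; fixing the payment method inside that window is not churn.
    """
    deadline = lapsed_on + timedelta(days=grace_days)
    if reactivated_on is not None and reactivated_on <= deadline:
        return False  # e.g. a new credit card added after 15 days
    return as_of > deadline


# Card failed on June 1 and was replaced 15 days later: not churn.
print(is_churned(date(2021, 6, 1), date(2021, 6, 16), as_of=date(2021, 8, 1)))
# Card failed on June 1 and was never replaced: churn once 30 days have passed.
print(is_churned(date(2021, 6, 1), None, as_of=date(2021, 8, 1)))
```

In a warehouse-centric setup, the same rule would more likely live in SQL next to the billing data; the point is only that the label itself encodes business logic that differs from company to company.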

Kostas Pardalis  37:22

Absolutely. Yeah, I agree with that. I have a bit of a more technical question to ask you. I was going through the architectures of different feature stores. And one of the things that I’ve seen, which exists in pretty much every feature store architecture, is a caching layer to serve the features. And the reason to do that is because low latency, in some use cases, is extremely important, right? Now, data warehouses on the other side, and OLAP in general, were built with a completely different perception of what time is, right. In the past, for example, the data warehouse was built in a way that it could run queries for hours or even days, right. So latency is a completely different thing when we’re talking about data warehouses compared to transactional databases or caching layers. So how do you deal with that when you take a data warehouse centric approach?

Tristan Zajonc  38:28

Yeah, no, that’s a great, great question. Because you’re right, you’re absolutely right. I think there’s widespread recognition that a feature store, or something called the feature store, should be at the center of your data and your ML strategy. And in part, that’s because we see that, hey, that’s one of the most important bits and also where a lot of complexity can come in. The way I think about it, there’s really three parts. So just stepping back, what is a feature store? Maybe not everybody in the audience knows; they might know the term, but what exactly is it? I think it needs to offer three things. So the first is collaboration and sharing of the definitions of features across your business, right. You should not have data scientists duplicating feature definitions; you should have the features properly governed. That’s probably the easiest one to solve, right. And you could probably solve some of that with your existing tools, right, by following DBT best practices. You might have a virtual feature store. But the second one is what’s called point-in-time correctness, which is this idea of a time machine. You really, for training purposes, need to be able to go back in time and reproduce a feature at any point in time, or at least at regular intervals that you’re going to train on. And you actually need to be able to do that on a per-row basis. For every user, you need to potentially go back, so it’s not just Snowflake’s time machine or Databricks’ time machine kind of backup functionality. You actually need to be able to do sort of a temporal join where you get the features at a particular moment in time. You need to do that to construct your training dataset, so that you can then forecast churn into the future without any data leakage.
And the third one, which is what you’re pointing out, and this is only applicable for real-time serving use cases that can’t be pre-materialized, need to be done on a real-time path, and also cannot be passed in from the client, is that you need a way to serve features at low latency. So let’s say you’re doing a search personalization type application, and you have somebody who is typing in a search query. That search query comes in, you find the relevant records, and then typically you want to re-rank them based on maybe the previous clickstream of that user, what they’ve been doing, and their history of actions. And you have a set of features that you need to very, very quickly look up and say, okay, what have they most recently looked at? And did they click on those things, or whatever it is. You need to use those features. And for that, well, typically, today, you need a caching layer on top of the database. And so a lot of the feature store work, if you look at what some of the open source feature stores are doing, is really about trying to find an architecture for that caching.
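
The point-in-time correctness requirement can be sketched as a temporal join: for each training label, take the most recent feature value recorded at or before the label's timestamp, never a later one. The example below is a minimal illustration, not any particular feature store's implementation: SQLite stands in for the warehouse, and the `feature_log` and `labels` tables and their columns are hypothetical.

```python
import sqlite3

# SQLite stands in for the warehouse; tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature_log (user_id INTEGER, feature_ts TEXT, sessions_7d INTEGER);
    INSERT INTO feature_log VALUES
        (1, '2021-01-01', 5),
        (1, '2021-02-01', 2),
        (1, '2021-03-01', 0),
        (2, '2021-02-15', 9);

    CREATE TABLE labels (user_id INTEGER, label_ts TEXT, churned INTEGER);
    INSERT INTO labels VALUES
        (1, '2021-02-10', 0),
        (1, '2021-03-05', 1),
        (2, '2021-02-01', 0);
""")

# Per-row temporal join: each label only sees the latest feature value
# observed at or before its own timestamp, which avoids data leakage.
training_rows = conn.execute("""
    SELECT l.user_id, l.label_ts, l.churned,
           (SELECT f.sessions_7d
              FROM feature_log AS f
             WHERE f.user_id = l.user_id
               AND f.feature_ts <= l.label_ts
             ORDER BY f.feature_ts DESC
             LIMIT 1) AS sessions_7d
      FROM labels AS l
     ORDER BY l.user_id, l.label_ts
""").fetchall()

for row in training_rows:
    print(row)
```

User 2's label predates that user's first feature record, so the feature comes back as NULL (`None` in Python) instead of borrowing a value from the future. That is the behavior that distinguishes this from a plain join on `user_id`.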

Tristan Zajonc  40:57

It’s interesting, we actually started looking at that very closely when we founded Continual. And what we heard, what we saw, was, first of all, data is in the data warehouse, and people want to leave it there to the degree possible. And second, a huge number of use cases can be solved with this sort of continual batch mindset. It’s a tremendously simplifying approach in terms of your architecture. And third, we have a bunch of ideas around how to do that cache. We’re waiting a little bit to see on the real-time use cases. It’ll be interesting to see where the data warehouses themselves go. Some of the cloud platforms are building some capabilities indirectly. There’s emerging data stores, things like Materialize, which have a certain ability to do that. Obviously, there are the real-time databases, things like Rockset. I know Snowflake, for instance, is very focused on high concurrency and low latency. So it’ll be interesting to see how that converges. That’s definitely an open question. The architectural complexity that emerges by trying to maintain consistency between these environments, when in some ways you’d like to just express a SQL statement and have it taken care of for you, seems to me, over time, something that’s going to be eliminated. So I think it’s a very interesting question. I think it’s unclear exactly where it will go: will there be a dual-write system in the cache? Or will we converge towards a data warehouse and new functionalities built directly into the data warehouse? Or even will there be tailor-made data stores that have this characteristic of historical data storage?

Eric Dodds  42:25

It is super interesting. We actually talked with Materialize recently on the show; fascinating conversation, super smart team member there. And then a favorite topic of ours is, what are the warehouses building? And they’ve already advanced in so many different ways, but they’re building in things that are going to make things really interesting. But speaking of the warehouse, Tristan, and we can land the plane on this question, because we’re coming up on time: you’ve mentioned multiple times, and this is a really interesting topic, I think, in general, the importance of the data warehouse in the context of the modern stack. So zooming the lens out from the specifics, could you just tell us why? I mean, you’re betting on the warehouse being a central component of the modern way that companies are architecting their data stacks and all of these different tools. Why are you doing that? And then why do you think the time for that is now? It seems like there’s a new crop of companies that are sort of making this bet. And why is that? Why do you think that’s happening at this point in time?

Tristan Zajonc  43:34

Yeah, so I think sort of, let’s say, twofold. So the first is that, as I’ve spent a decade experiencing data infrastructure, data engineering infrastructure, and machine learning infrastructure, I mean, the number one problem I see is complexity, right? These stacks just get incredibly complicated to manage and to move data between, particularly with any velocity from a developer perspective. And there’s maybe a death by a thousand paper cuts. So this is sort of a historical error. I mean, Hadoop, the knock on the Hadoop ecosystem is complexity. But even putting aside Hadoop, if you use all the raw building blocks of a cloud vendor like AWS, it gets very, very tricky. And the complexity gets very hard, which makes it very costly to build new use cases, very, very costly to maintain them and to iterate on them and build new ones, and then it just compounds and compounds over time. And so I think the big thing about the data warehouse, the first thing, is by putting the data warehouse at the center, by betting on a cloud managed data warehouse that’s elastic and that offers workload isolation, you can have your data scientists going crazy in one isolated cluster on the same shared data. It’s an incredibly liberating experience if you’ve experienced the alternative, if you’ve experienced the complexity of shared compute, of multiple disparate systems where you’re moving between them, of multiple different languages, where you’re moving from MapReduce to SQL to Parquet files to Python to all of that. It’s an incredibly powerful model, this kind of big data ecosystem, but the data warehouse is much, much simpler.

Tristan Zajonc  45:11

And so as I’ve matured, or just as I’ve experienced what happens when you’re dealing with too much complexity, I have a natural affinity towards the simplicity of the data warehouse and the power of the data warehouse. That’s number one. The second one is more towards the ecosystem. There needs to be some common foundation by which products can integrate, and then the ecosystem can develop. And the data warehouse, I think, is now emerging as that; it’s an amazing point where different products can collaborate in a kind of turnkey way, while still allowing the flexibility that you want.

Tristan Zajonc  45:45

So for instance, ingestion: you have RudderStack, you have Fivetran, you have that whole community that’s taking what previously were these crazy Airflow DAGs, from Salesforce into your data warehouse or from your logs into your data warehouse, and they are now making that completely turnkey, landing in the data warehouse. Of course, you can do transformation with DBT, you can do data monitoring with Bigeye and Soda, and I’m sure I’m forgetting some. And you can, of course, have your BI tools. There’s a whole new class of BI tools that, thank God, are actually running the analytics inside the data warehouse, so there’s no data movement. So you can build all your reporting off of that. You can increasingly, again, using tools like RudderStack, Census, or Hightouch, etc., move the data out of the data warehouse and actually make it actionable, kind of weaponize it so you can actually have an impact on your business. And all of these things are doing it in this completely turnkey way. And so I was seeing that full stack emerging, and the genesis of Continual was really saying, wow, there’s not really a way to do operational ML. I mean, you want to do these predictions on top of that stack in that ecosystem.

Tristan Zajonc  46:46

And please don’t tell me I’m gonna kick up a Kubernetes cluster to do that if I’ve embraced that stack. And so yeah, I’m tremendously excited by the ability to drive down complexity, to simplify things. I think if we really want data and AI and ML to be centered and pervasive across the enterprise, you’ve got to have a simple, productive, kind of low-cost, low-manpower approach if you really want it to become pervasive. So that’s why I’m bullish.

Eric Dodds  47:14

Yeah. Yeah. No, I couldn’t agree more. And I think there are a lot of situations or sort of use cases for activating data, where you had ecosystems of tools crop up in order to do those things to your point in a fairly complex way. And then sort of five years later, the warehouse technology is advanced enough to where it’s, oh, well, actually, the most elegant solution was there all along. Because you use your data warehouse, right, and just the ecosystem around it of pipelines to your point. And really, the data warehouse technology itself hadn’t quite gotten to the point where it was elegant. But now it really is. I love the way that you described it as, number one simple. And then number two, the need for a central hub. And it’s such an obvious choice for both of those points.

Tristan Zajonc  48:08

Yeah. And it’s always a journey, right? I mean, I think most technologies, you start, you start complicated, you start low level, then some patterns and abstractions emerge. And then people build fundamentally new, easier experiences. And I think there’s a perpetual search for that. And yeah, I mean, Hadoop and that ecosystem is incredibly powerful. It’s an incredibly powerful technology. If you look at how Facebook runs, a lot of it’s on that sort of technology. And if you need all of that power, all that flexibility, the ability to dive into the code yourself, that’s a great ecosystem to buy into. But there’s also I think, even the Hadoop ecosystem bet on SQL, right? I mean, very quickly in the rise of Hadoop, and the big data ecosystem, Hive became a thing, which is a big data query language, and then it was quickly, well, how do we get faster querying. That was various projects, trying to do faster querying. And then recently, it was segmentation of compute. And then of course, the rise of the cloud kind of disrupted a lot of that architecture. So yeah, there’s a journey there. I think we’re right now at this moment where there’s this convergence on the modern data stack. I really think over the next five years, a huge amount of innovation is going to happen there. And if you kind of buy into that ecosystem, you’re going to be able to freeride on all that innovation that’s happening in a million different quarters.

Eric Dodds  49:22

Yeah, absolutely. One of our previous guests kind of described being able to quickly derive value out of ML as the next phase of the data stack. Analytics is sort of maturing to that point, I mean, as simple as it sounds, to self-serve analytics across the organization. Still, a lot of companies haven’t figured it out, but the technology is now there to where there are known playbooks for how to do that. And ML is going to be the next phase of that. And I really think in a lot of ways that’s true, because once you have the data clean enough to produce really powerful analytics, then it’s okay, well, great. Now let’s really turn the heat up and start optimizing the business in some interesting ways with this data that’s really well suited for machine learning use cases.

Tristan Zajonc  50:10

Yeah, I mean, absolutely. I mean, that’s the thesis for the company. I may need to talk to this person who you talked to. We’ll have to recruit them. Yeah, I think there’s a classic pyramid where, if you look at it, there’s sort of AI and ML at the top. I think there’s still some other things that need to happen. We still need to push into the application domain and make sure that we can handle not just the back-end operations of the business, but the applications themselves. But I’m very bullish on this path. And I’ve seen the simplicity of it now. It kind of makes me very excited.

Eric Dodds  50:41

Very cool. Well, we are at time, but really quickly, Tristan: is there a way, I mean, just hearing about Continual, I keep thinking this is really cool. I want to see it in action. Is there a way for our listeners to check it out and try it? Or what’s the process like there?

Tristan Zajonc  51:01

Absolutely. So I mean, we’re in early access now; we launched about a month, two months ago. So you can go to continual.ai. And you can learn a bunch more about Continual. If you type in your email there, we absolutely will reach out to you within 24 hours and set up a demo. So we can give you a demo. We’re taking early access customers. We typically do a demo and then onboard folks to try it out. We’re hoping to get something out in terms of general availability soon. So stay tuned for that. But yeah, we look forward to hearing from anybody who’s interested. We’ll give a little last plug here. 

Eric Dodds  51:32

Yeah. Cool. And just to confirm all the major warehouses, right?

Tristan Zajonc  51:36

All the major cloud warehouses, yep. All those major cloud data warehouses.

Eric Dodds  51:40

Awesome. Very cool. Well, I definitely encourage the audience to give it a try. Really cool products. And Tristan, really, thank you again for the time. This has been an awesome conversation, and we’d love to have you back on the show sometime soon.

Tristan Zajonc  51:54

Absolutely. My pleasure. Thanks so much.

Eric Dodds  51:57

Well, I think my first big takeaway is that you and Tristan are incredibly smart people. And it was really fun to hear you dig into the tech. But my second one was his excitement about the data warehouse, which has really been a continual theme. I think it actually was a big theme last season, and we’ve heard it in the last couple of shows, about how the cloud data warehouse is just enabling so many different things. When Redshift first came out, I don’t think anyone would have, I mean, I’m sure there were very future-looking people who sort of imagined this world where everything’s connected around the warehouse. But I don’t think a lot of people imagined the sort of things that we’re talking about as far as Continual goes. And so that’s just really cool. And I can’t wait to see how that innovation continues to unfold.

Kostas Pardalis  52:44

Absolutely. I don’t think that the data warehouse we will be talking about a couple of years from now is going to look very similar to what Redshift was when it started in 2012, for example. I found it very interesting that Tristan mentioned at some point Snowflake supporting more unstructured types of data, images, and free text. And yeah, the data warehouse becomes a much broader concept, right? It’s more a data platform in general. And it’s going to fuel many different use cases. And of course, one of the most important ones, from what it seems, is going to be built around AI and ML. So yeah, it’s very fascinating to see what people like Tristan do and the stuff that they’re building, and how they’re augmenting the data warehouse with non-traditional data warehousing capabilities like machine learning. And I’m really looking forward to seeing, a couple of months from now, how the product is going to look. Super interesting for me, very engaging. I mean, that’s something that we see, especially with founders who are very passionate about the products and the technologies they build. And it’s always a lot of fun to talk with them.

Eric Dodds  53:57

Yeah, absolutely. Well, thanks again for joining us on The Data Stack Show. Great set of shows lined up over the next couple of weeks, so make sure to stay tuned, and we will catch you on the next one.

Eric Dodds  54:11

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.