Episode 140:

Stream Processing for Machine Learning with Davor Bonaci of DataStax

May 31, 2023

This week on The Data Stack Show, Eric and Kostas chat with Davor Bonaci, the Executive VP at DataStax. During the episode, Davor discusses his work on stream processing at Google and his journey founding Kaskada. The conversation also covers recommendation engines, how to improve stream and batch processing, what Kaskada is doing to solve key pain points in the space, democratizing and operationalizing ML, and more.

Notes:

Highlights from this week’s conversation include:

  • Davor’s journey from Google and what he was building there (3:32)
  • How work in stream processing changed Davor’s journey (5:10)
  • Analytical predictive models and infrastructure (9:39)
  • How Kaskada serves as a recommendation engine with data (14:05)
  • Kaskada’s user experience as an event processing platform (20:06)
  • Enhancing typical feature store architecture to achieve better results (23:34)
  • What is needed to improve stream and batch processes (27:39)
  • Using another syntax instead of SQL (36:44)
  • DataStax acquiring Kaskada and what will come from that merger (40:24)
  • Operationalizing and democratizing ML (47:54)
  • Final thoughts and takeaways (56:04) 

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, we have another exciting one. Today we’re going to talk about stream processing. We have actually talked about this subject a good bit on the show, but this is pretty interesting, because Davor from Kaskada, which was recently acquired by DataStax (we’ll talk about that a little bit), built a technology that’s really focused on stream processing, specifically for ML use cases, and on closing the gap between the actual building of insights and features and then actually serving those. It’s pretty fascinating. What I’m really interested in is what they saw, or maybe what lack in the market they noticed, that caused them to want to build something new, in large part because you have a lot of really good, high-powered stream processing tools, you have things like feature stores, and you have all sorts of interesting low-latency ways to serve stuff. The pieces are there to actually build and deliver cool stuff, even from a stream. But obviously, it wasn’t sufficient. So that’s what I want to ask about. How about you?

Kostas Pardalis 01:43
Yeah, I think it’s a very interesting case. I mean, Kaskada is a very interesting case, because it is a stream processing engine, but it emerged as the solution to a problem that is very use case specific, and it has to do with machine learning, right? So it’s going to be very interesting. One of the things that I want to talk about is how we go from something like a feature store, which is supposed to be one of the possible solutions out there to ML problems, to something like Kaskada, and why we need something that is, let’s say, more unified, in terms of both the technology but also the experience of the user that uses the solution. That’s one of the things that I’m very interested to discuss. And later on, the journey of getting acquired by DataStax, and the acquisition, but more about the vision. Because if you think about it, it’s actually quite interesting: DataStax is based on Apache Cassandra, which is a 10-year-old technology, right? And then you have something that is super new in terms of the need it serves,

Eric Dodds 03:24
and the technologies it’s built on. Yep.

Kostas Pardalis 03:29
And it’s very interesting to see how these things come together, and why, and what the potential outcomes of this are, right? So it will be very interesting to discuss all these things.

Eric Dodds 03:42
All right, well, let’s dig in and talk with Davor. Davor, welcome to The Data Stack Show.

Davor Bonaci 03:48
Hi, great to be with you.

Eric Dodds 03:51
All right. Well, we have some exciting things to talk about. You’ve had quite the journey over the last couple of months, in terms of acquisitions and open sourcing stuff, which is all really cool. Let’s look back a little bit in history, though, because this isn’t the first time you’ve open sourced technology related to streaming, which is kind of cool, that you’ve been able to do this a couple of times now. That was back when you were at Google. Can you tell us a little bit about Google and what you were doing there, and then what you built and open sourced there?

Davor Bonaci 04:25
Oh, yeah. So when I was at Google, it was the early days of Google Cloud, and we were building a unified programming model for batch and stream processing that ultimately resulted in the Apache Beam project. It was quite a successful project, with a relatively large number of companies around the world using it and contributing to it. And then, a few years later, my co-founder and I left Google to start a company in a similar space. That led to the founding of Kaskada, which really tried to nail the problem of building predictive, behavioral machine learning models from event-based data. We worked on this problem for quite a while, and that resulted in an acquisition by DataStax about three months ago. And yeah, happy to be with you, talking about all of this journey from Google to Kaskada to DataStax and everything in between.

Eric Dodds 05:27
Given that you’ve had such a focus on streaming, I’m interested to know: I mean, you have obviously been a professional software engineer working with data for quite some time now. Did you always have an interest in stream processing? Or is that something that you found at Google and started to work on there?

Davor Bonaci 05:47
I started working on it at Google. I was not looking at stream processing before, I guess, 2013. I basically started looking into it around late 2013. And now, I guess, this year it will be a decade that I have been looking at this problem. Yeah.

Eric Dodds 06:06
A decade of streaming? Well, I mean, I guess it’s interesting to think back on 2013. There were infrastructure limitations that made certain stream processing things pretty difficult, or at a minimum, pretty expensive. So can you tell us: you get into stream processing at Google, you work on Beam, you open source Beam. And then, of course, there are a number of other technologies out there around stream processing, even within Apache. But those weren’t sufficient for what you wanted to do. So why build something new at that point, when there were multiple major players and multiple different architectures running pretty large organizations at scale for stream processing use cases?

Davor Bonaci 06:56
Yeah. So when we started looking at the problem of machine learning, we discovered that neither batch solutions, nor streaming solutions, nor, you know, Beam solved this problem well. If you start thinking about building behavioral machine learning, think about these kinds of recommendation engines, churn prediction models, something about predicting a future action, a future interest, based on what has happened in the past. When you look at the nature of that problem: you have to process historical data, observe feature values, and generate training examples at the right points in time to be able to train the model. That problem looks more like analytics, more like batch, more like historical data processing. And then you have this inference problem, where you want to take real-time data, get the most recent feature vector, give it to the model, and produce a real-time prediction. And so when you look at that problem, it’s not well solved by batch, because you have too big of a latency, and it’s not well solved by a streaming system, because it’s very hard to get this kind of historical component on top of it. So we came to the conclusion that, fundamentally, existing systems are not well built for that. Obviously, other people around the same time were looking at the same solutions, and they found ways of hacking certain things together to solve the problem. And from that work, feature stores, or, you know, common feature stores, came to be. They try to create an online store and an offline store; it’s really a divergent architecture that tries to solve these different use cases on top of the same data. And we are more system builders than people hacking things on top of systems. So we took the problem really deep and designed a system that’s really built for the problem at hand. And the problem at hand, as we see it, is: easily connecting to the data; describing features in an easy way where you can iterate in a place like a notebook, test hypotheses, and test a lot of features very quickly, with immediate backfill-style analysis of features at any point in time; and once you train the model, literally with a click of a button, or by checking features as code into production, being able to compute and serve those features with low latency, all from the same system that is purpose-built for this problem. That’s how Kaskada was born. We found funding for it, we found the team for it, the team built the product, and then we took it to market. And, you know, I guess the rest is history.
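As an aside, the point-in-time gap Davor describes (historical feature values for training versus the latest values for inference) can be sketched in a few lines of pandas; the column names here are hypothetical:

```python
# A minimal pandas sketch of the two halves Davor describes. Training needs
# feature values *as of* each example's timestamp; inference needs the
# *latest* value per entity. All names are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "user": ["a", "a", "b", "a", "b"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-06",
                          "2023-01-20", "2023-01-25"]),
    "purchases": [1, 2, 1, 3, 2],  # running count of purchases per user
}).sort_values("ts")

# Training side: observe each user's purchase count at a label time,
# taking the last event at or before that time (no leakage from the future).
labels = pd.DataFrame({
    "user": ["a", "b"],
    "label_ts": pd.to_datetime(["2023-01-10", "2023-01-10"]),
}).sort_values("label_ts")

training = pd.merge_asof(labels, events,
                         left_on="label_ts", right_on="ts", by="user")
print(training)  # feature values as they were on 2023-01-10

# Inference side: the most recent feature value per user, for serving now.
latest = events.sort_values("ts").groupby("user").last()
print(latest)
```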

Eric Dodds 09:57
Very cool. Can we actually talk about that? You mentioned that there are these sort of two separate problems, right? There’s sort of an analytics-type use case, which is looking historically, and then you have the actual ingestion of the real-time data that allows you to feed the model and actually create an experience, like a hotel recommendation, or a product recommendation or something. So can you describe the way that you saw that materialize in terms of both infrastructure and teams? Were there different teams working on those separate problems? Because a lot of times you’ll see data science working on the model, sort of the analytics, predictive piece, and then it’s a pretty heavy engineering problem to actually grab the feature and serve it in a website or app. Can you describe the common patterns around that breakdown, and how people sort of hack that together?

Davor Bonaci 11:00
Yep, absolutely. So we think that there are two fundamental problems. Problem number one is really finding predictive signals inside of your data. And that is very company and data problem specific, right? If you have a bunch of data coming from your app, think clickstream, tap stream, engagement information coming from the app: that’s a lot of data, and it’s relatively hard to find what is really a predictive signal that tells you what the user might be interested in, whether they’ll buy something, whether they will renew a subscription, what they may be interested in, why they are here, and so on. That’s the problem of finding quality predictive signals out of clickstream, event stream data. That problem becomes harder the messier the data gets, if you are getting it from multiple places, from multiple applications, and schemas and other things evolve over time. So figuring things out there tends to be more of a data modeling, extraction-of-useful-signals problem, and we feel that’s a key part of getting machine learning and AI right. Then there is a different problem, a problem of scale. And that is: once I know the model, once I know what my features are, how do I operate that model at scale, with low latency and good unit economics? That problem gets harder the more scale you have. Those are two problems, and usually two different people are best suited to solve those two different problems. We’ve seen in the data community a lot of talk recently, over the last few years, about the scale side of things. And I think that’s very warranted, because it’s a hard problem, and people pushing the boundaries here tend to work at big companies, typically in the Bay Area, that have really large scale, and they start hitting these problems. I think that’s a really hard and difficult problem to solve. But I just want to make sure that we don’t forget: to get to a really good AI, what most people should do is focus on extracting quality signal. The better the signal is, the more predictive it is, the easier it is to build a model, and the cheaper it is. And it’s actually doing work that is company specific; it’s very leveraged work. Whereas distributed systems are very common and horizontal, and not specific to the company that may be doing it. So we often think about this infrastructure being more horizontal, and it should probably be done in an open source community, with other people that can jointly innovate on it. And then companies really should focus on the quality signal from their data, because that’s really leveraged for them. That’s unique business value to them.

Eric Dodds 14:23
Makes total sense. Okay, so let’s do a breakdown of Kaskada with maybe a sample company. Let’s say I’m a company that sells retail products online, sort of large-scale e-commerce. I have multiple websites, maybe even multiple mobile apps, and I’m probably ingesting some sort of log data from my production databases. So I have multiple different data formats coming in from multiple different sources. And let’s say we have multiple brands, and I want to know, if someone’s purchased these things from this brand, what other products from this other brand could I maybe cross-sell them on, right? What does it look like for my company to implement Kaskada? Who are the people involved, and how do we implement it?

Davor Bonaci 15:23
So what you have described here, if I can generalize a bit, is a recommendation engine, right? And people have been looking at recommendation engines for a while; it’s one of the first use cases of machine learning, and obviously, in many industries, they have been successfully implemented. The interesting thing is, when you look at recommendation engines and their quality, it’s quite interesting what you can find. So let’s start with a few examples here. Let’s say that today you buy a couch. What is the chance that you’re going to buy a couch next week? Well, a basic recommendation model will conclude that if you bought a couch this week, you might buy a couch next week. But we both know that’s not how it works. And so there are some recommendation engines that fail in miserable ways, in this way, without understanding who you are and why you bought it. If you’re a reseller of couches, sure: more couches this week, more couches next week. But if you’re buying for your own home, if you bought the couch this week, maybe you’re interested in a coffee table, but not in another couch. We have to really understand who the customer is and why they are here, to be able to provide good recommendations. That’s key. Sometimes recommendations are totally off, and if you search online, you’ll find examples where people call out the quality of these things when they are not done well. So when you think about how to do this well, it’s about understanding motivation, and deriving signals from interactions on a digital platform to understand why the person is here. It’s what they are searching, not just what they’re buying, how frequently they are searching for something, and then being able to do this quickly, to give them an in-session, personalized experience based on the reason why they are here today. I think that’s key. To do that, you have to focus on the signal coming from their interactions with the app. In every case we looked at, we always find that we can separate somebody buying a couch for themselves and somebody who is a reseller of couches. The interactions on the site tend to look very different, and if you derive the signal out of the event-based data, then the model can latch on to it, learn, and give good predictions that separate one experience from another. That’s what we like to enable customers to do. And most often, once they use our technology, they find things that they had not known about their user base before they started. That’s what we consider success. Once you discover predictive things and a segmentation of your users that was not clear before you started, that is success: you are discovering something about your business and your users from your data, and that makes the company better. That’s what we are all about.

Eric Dodds 18:57
Yeah. So let’s get practical for a second. If I’m implementing Kaskada, I get it set up and running. Are there just endpoints that I point my app and website and production databases at, and it will just ingest them, no matter the schema? Is it as simple as that?

Davor Bonaci 19:26
Yeah. So we obviously want to load data from as many places as possible, and we try to make that as easy as possible. Obviously, we can’t read it from everywhere, but we can read it from the common places that people store data. For some early exploration, we typically suggest people start with Parquet-formatted files with a schema, you know, structured data in Parquet-formatted files stored in some cloud storage type place, perhaps managed by Iceberg or something like that; that is what we usually recommend. But we can read from plenty of places, usually with a few lines of code, just specifying the location, and we can read structured data relatively easily. We do not shine on unstructured data today.
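For a sense of what those "few lines of code" might look like, here is a minimal pandas sketch; the bucket path and column names are hypothetical, and reading from S3 assumes the s3fs package is installed:

```python
# A hypothetical sketch: loading structured, Parquet-formatted event data
# from cloud storage into a dataframe. The path and columns are made up.
import pandas as pd

events = pd.read_parquet(
    "s3://my-bucket/events/",  # hypothetical location; needs s3fs installed
    columns=["user_id", "event_type", "timestamp"],
)
events["timestamp"] = pd.to_datetime(events["timestamp"])
print(events.head())
```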

Eric Dodds 20:22
Yep. Makes total sense. And then, once the data makes it into Kaskada, what’s the user experience like? How am I trying to find the signal in the noise using Kaskada on the platform?

Davor Bonaci 20:37
Yeah. So first, we tell people to use the tools they like. Everything we do today is API first. So you can open a Jupyter notebook or IPython notebook, do one pip install, that’s one line of code, then load the data from somewhere, that’s another line of code. And after that, you can build features, test features, and use all the machine learning libraries that you like: scikit-learn, PyTorch, whatever you like. We generally support the idea that our product is API first, dataframes in and dataframes out, and you can connect it with all the tools that exist in the machine learning ecosystem that practitioners have learned to love over the last couple of years.
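The "dataframes in, dataframes out" loop he describes could be sketched like this; the feature and label columns are hypothetical stand-ins for whatever a feature engine would return:

```python
# A sketch of the notebook loop: a training dataframe goes straight into
# any ML library. Feature and label names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

training_df = pd.DataFrame({
    "sessions_last_30d": [3, 12, 1, 8],
    "purchases_last_30d": [0, 4, 0, 2],
    "churned": [1, 0, 1, 0],
})

features = ["sessions_last_30d", "purchases_last_30d"]
model = LogisticRegression()
model.fit(training_df[features], training_df["churned"])
print(model.predict(training_df[features]))
```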

Eric Dodds 21:32
Yeah, it makes total sense. All right. Well, I’ve been hammering you with questions. Kostas, please jump in here, because I know you probably have a ton of questions yourself.

Kostas Pardalis 21:41
Thank you, Eric. So the first thing I want to ask you: you mentioned a couple of different, let’s say, broader technology categories, like feature stores and feature engines, and then obviously there’s also the whole idea of a stream processing engine. So what is Kaskada, Davor, primarily? Is it a streaming data processing engine, a feature store, or something else?

Davor Bonaci 22:12
It’s hard. We obviously need a label for people to understand, and at Kaskada we call it a feature engine. It’s like a feature store, but really focused on generating features, as opposed to storing and serving them. That’s how we coined the term feature engine, and some other companies have caught on; there is another company, I think, Sumatra, that also tried similar approaches in this space. So we consider ourselves a feature engine: the engine that can help you generate feature values at any point in time, or at the time of now for inference, and so on. Generation of features from underlying raw data, we call that a feature engine. Recently, we open sourced Kaskada’s code, and we started calling it modern, open source event processing, because what we figured out is that what we built actually generalizes to all processing of events, be it batch, be it in streaming mode, be it in any way, shape, or form. So our website talks about modern, open source event processing as our positioning today. That’s more how it naturally evolved, rather than our intention; our intention wasn’t to build a generic event processing system, it’s just that we discovered it sort of by accident, by solving, I guess, the machine learning problem.

Kostas Pardalis 23:48
Yeah, makes total sense. All right. So if someone is looking at, let’s say, a typical feature store architecture, you usually see two main components there, right? You have, let’s say, the offline processing, the batch processing, where you go get all your historical data and use that to build a model, and as part of that you also define the features that you need. And then, of course, you have the online part, which is: when new data comes in, we need to turn it into the features that we have previously defined and use them somehow, right? Usually in feature stores you have different technologies implemented inside, which kind of makes sense, because historically, data processing platforms have focused on one or the other: they are either streaming or batch, but not both. With Kaskada, if I decide to use Kaskada, am I going to have two architectures implemented in one? How does it work?

Davor Bonaci 25:05
Yeah, so this architecture of an online store and an offline store, this is what I think is, you know, a hack, a rationalization around how can I stitch existing systems together to solve the problem. I realized they are not really built for it, so I need to put multiple of them together and use them in different places to try to get the outcome and unit economics that I like, right? So if we look at these two paths: I think streaming systems are really good on this inference path, that is, take the recent data, compute something relatively recent with low latency, and serve the results. These are kind of materialized views on top of event-based data, and I think we have good systems to do that. On the batch side of things, obviously, we have Spark and other systems that can process vast amounts of data. But often we find, when you think about the user experience: if you know which features you want, it’s easy to write a data processing pipeline that computes them. But whenever we talk to ML teams, we often find that what they need is the ability to test hypotheses, to try to find signals that are actually relevant for their use case. You can do that in a batch system and then run a backfill job that populates all possible points in time, for all entities, for all features, but that’s really not great, and most of the values computed will never be used. So we think that the right solution to this problem is to take a feature definition that is described easily, declaratively, and that can easily cross this training gap. It can run in training, without doing a complicated backfill that stores everything at every point in time, but computes features, or training examples, when you need them, generated easily with simple queries, with tiny queries, with complicated data-dependent windows and data-dependent features, and delivers them to training. And literally with a click of a button, it can maintain real-time materialized views over streams for a production use case. So that’s how we view it: it’s just one single architecture, purpose-built to process streams, or event-based data, be it historic, be it real time.
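A toy sketch of the "one single architecture" idea: the same aggregation definition replays historical events for backfill and training, then keeps updating from live events for serving. All names are hypothetical, and a real engine would do this incrementally and distributed:

```python
# A toy illustration: one aggregation that serves both the batch path
# (replaying history) and the streaming path (live updates). Hypothetical.
from collections import defaultdict

class SessionCount:
    """Maintains sessions-per-user as a materialized view over events."""
    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, event):
        if event["type"] == "session_start":
            self.counts[event["user"]] += 1

    def value(self, user):
        return self.counts[user]

view = SessionCount()
history = [{"user": "a", "type": "session_start"},
           {"user": "b", "type": "session_start"},
           {"user": "a", "type": "session_start"}]
for e in history:                                     # batch path: replay history
    view.update(e)
view.update({"user": "b", "type": "session_start"})   # streaming path: live event
print(view.value("a"), view.value("b"))               # -> 2 2
```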

Kostas Pardalis 27:56
Yeah, okay, that makes sense. And what I hear is that building a system like Kaskada is about trying to solve the problem by innovating on, let’s say, two fronts. One is the technology itself, right? Something that can incorporate both the streaming and the batch part of things in one platform. But there’s also the user experience, the developer experience: we need to figure out what’s the right way for our user, in this case an ML engineer or a data scientist, to interact with the data, and help them, guardrail them, into figuring out much faster what’s the signal out of the noise, right? So let’s talk a little bit more about that, because I’m pretty sure people have heard loads in the past couple of years about streaming data, low latencies, high throughput, distributed systems, all that stuff. But this experience part is still very new and still mainly unexplored, to be honest. So, from your experience building Kaskada to deliver this experience: what is needed, and what did you build to address it?

Davor Bonaci 29:28
So I think it’s really important to be able to interact with data in a natural, declarative way, where you can just state the intention of what you are trying to compute, and the underlying system figures out the best way of implementing that. So really high levels of abstraction, where you describe in a natural way what it is you need to compute. Let’s talk about machine learning features. There is a feature definition, and the feature definition can be something as simple as the number of sessions you have had in the last month. A very simple feature: you have one window, a one-month window, and you’re counting the number of logins, that’s probably the number of sessions, in a particular window. Great, we can define that. But then in machine learning use cases, you have more things. One thing is when to observe this feature. Streaming systems make one simple assumption: the only time you are interested in observing this feature is the time of now. What happened three years ago? Well, that’s not a concern for the streaming system. But somebody building machine learning models needs to observe this feature at a specific point in time that matches the model context, that matches how the prediction is being made. And those times happen at different points in time for different users, for different entities. Now, we have to describe what we want in a natural way. So we want to count the number of sessions in the last month, and we need to observe it, you know, 30 days before or after certain events, maybe 30 days after they signed up for the service. Maybe that’s the right point in time to observe that feature; then you have to explain to the system when that time is. And then usually in machine learning, or at least in supervised learning, we have the concept of labeling. So you have to observe something at that point in time, and then move to the future to compute the label of what has actually happened. So that’s how a practitioner thinks about the problem: what’s the feature definition, when should it be observed in a data-dependent way, and how to label that example at some other point in time. Those are the natural abstractions that ML engineers, or any ML practitioner, care about, and these are quite difficult to do in the tabular way that SQL has championed. So what we have is a simple query language that can do these aggregations. A feature definition looks like SQL, like count the number of sessions. But what we really add is powerful time-based semantics that help you describe when the feature is observed and how the training example is labeled, to make it really easy, in a few tiny lines of code, to compute training datasets. The system takes care of the rest. I think that’s the real power that we bring to our community.
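To make the three abstractions concrete (feature definition, observation time, label), here is a minimal pandas sketch of "sessions in the last 30 days, observed 30 days after signup, labeled by what happened afterwards"; all column names and the 60-day label window are hypothetical:

```python
# A pandas sketch of: a feature (sessions in the last 30 days), a
# data-dependent observation time (30 days after signup), and a label
# taken later. All names and windows are hypothetical.
import pandas as pd

sessions = pd.DataFrame({
    "user": ["a", "a", "a", "b"],
    "ts": pd.to_datetime(["2023-01-03", "2023-01-20", "2023-03-01",
                          "2023-01-15"]),
})
signups = pd.DataFrame({
    "user": ["a", "b"],
    "signup": pd.to_datetime(["2023-01-01", "2023-01-10"]),
})

# Observation time: 30 days after each user's signup.
signups["observe_at"] = signups["signup"] + pd.Timedelta(days=30)

rows = []
for _, row in signups.iterrows():
    in_window = sessions[
        (sessions["user"] == row["user"]) &
        (sessions["ts"] > row["observe_at"] - pd.Timedelta(days=30)) &
        (sessions["ts"] <= row["observe_at"])]
    # Label: did the user have any session in the 60 days *after* observing?
    future = sessions[
        (sessions["user"] == row["user"]) &
        (sessions["ts"] > row["observe_at"]) &
        (sessions["ts"] <= row["observe_at"] + pd.Timedelta(days=60))]
    rows.append({"user": row["user"],
                 "sessions_last_30d": len(in_window),
                 "retained": int(len(future) > 0)})

print(pd.DataFrame(rows))
```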

Kostas Pardalis 33:14
All right, that’s super cool. One of the main things SQL always had going for it is that it’s a declarative language, which, by the way, is the definition of a declarative language, that’s the whole point, right? I’m going to describe to you what I want, and you, the database, will go and figure it out and deal with the ugly details. But it was never easy or intuitive, let’s say, to work with time. That’s one part, and some other things are hard too, like anything that has to do with more imperative kinds of programming, like loops and all these things. So can you tell us a little bit more about, let’s say, the new syntax that you figured out is best for working with time? Because obviously, we’re talking about events here, and time is always present, right? Events are pretty much, as I usually tend to say, like time series data, but with more dimensionality, with more metadata.

Davor Bonaci 34:36
That is exactly right.

Kostas Pardalis 34:39
So what are the constructs that are missing, that you bring?

Davor Bonaci 34:42
The most important difference that we bring to our community is the concept of a timeline. When an event happens, it really describes a change, right? You logged in: that increases the number of sessions by one. And so if we want to process this data over time, it’s really about how the feature value changes over time. It’s really a timeline. It’s a graph, right? It’s not a computation at the end of time; it is how the feature value has changed over time. And these events are described at points in time: when the feature value went from 10 to 11. And so our constructs produce timelines. When you say, you know, sum these integers, other systems will tell you, okay, the sum is 50 at the end of time, or the current sum is 42. We don’t just tell you that the current sum is 42, or the total sum is 50. We produce a timeline: the sum has changed this way over the period of time. That is the basic output of primitive operations: you produce a timeline that describes how features have changed over time. And then you have these kinds of time selectors, let’s call them that, time selectors that select when such a feature should be observed, when such a feature should be labeled. So you can manipulate timelines. That’s how I would describe Kaskada: it’s built for manipulating timelines.
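A toy sketch of the timeline idea in pandas, reusing Davor's 42-versus-50 example; the numbers and timestamps are hypothetical:

```python
# Instead of one sum at the end of time, produce the value at every point
# where it changed, then read it at a chosen point. Hypothetical data.
import pandas as pd

events = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-07"]),
    "amount": [10, 32, 8],
})

timeline = events.assign(running_sum=events["amount"].cumsum())
print(timeline[["ts", "running_sum"]])
# 01-01 -> 10, 01-03 -> 42, 01-07 -> 50: "the sum changed this way over time"

# A "time selector" then just reads the timeline at a chosen point:
observe_at = pd.Timestamp("2023-01-05")
print(timeline[timeline["ts"] <= observe_at]["running_sum"].iloc[-1])  # -> 42
```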

Kostas Pardalis 36:29
Okay, that’s super interesting. And you mentioned that the syntax is SQL-like, right?

Davor Bonaci 36:40
Yeah, I mean, it is declarative, so that’s certainly a kind of SQL perspective on things. We don’t have SELECT, star, FROM, WHERE, and these types of keywords in the language.

Kostas Pardalis 36:59
Yeah. So from a usability standpoint: okay, SQL is something pretty much everyone knows, right? If you’ve worked with data, even for a short period of time in your life, you have seen SQL. I’d say it’s up there together with Excel and JavaScript in terms of how widely known the syntax is. Why go after, let’s say, a completely different syntax, instead of enriching standard SQL with new constructs?

Davor Bonaci 37:39
Right. So we have had these debates for a long time. We generally chose to make some changes, as opposed to just adding additions, because if we were only adding additions, certain things would be unnatural and would surprise people. And so we decided that the tabular model that SQL enforces is not the best underlying concept for building these abstractions. On the other hand, yes, it’s a trade-off with some learning curve that Kaskada may introduce. But we think of it this way: these are simple concepts. If you just understand that this is a timeline, and that the definition of what you’re computing is all the same, you’re just selecting where to observe it; if you understand the concepts, these are very tiny snippets. Any time you start using a new product, there is some learning curve. Excel has its own DSL inside it; people have been using Excel, everybody uses Excel, right? This is of that nature: you describe some formula that looks like a few functions and a few selectors. You don’t need to go back to school to do this. You read the documentation, you look at three examples, and you should know what’s going on.

Kostas Pardalis 39:12
Right? Yeah, that makes sense. All right.

Davor Bonaci 39:16
Yeah, it’s like a library, right? You have to understand the constructs, the user model of it, and then you start using it.

Kostas Pardalis 39:26
Makes total sense. What was your experience with... because, okay, we’ve been talking all this time about, let’s say, primarily ML practitioners, people that primarily live in Python land, right? Okay, I mean, if they have to use SQL, they can do it, but let’s say their native language is Python. So what was your experience working with them, with people coming from a very imperative programming kind of environment and getting into declarative?

Davor Bonaci 40:04
Yeah, so we try to merge these worlds. If you go to our website and see the flavor of what we built, it looks like Python. It has a pipe operator. We recognize that the primary programming language for our community is Python, and that most ML libraries are built for Python. And so we try to be as close to Python as we can, and make it super easy to integrate with IPython notebooks. That has been a specific design point all along.

Eric Dodds 40:42
All right. We could keep

Kostas Pardalis 40:45
chatting about this stuff for hours, but there’s also something big that happened lately with Kaskada, which was the acquisition, or the merger, with DataStax. So I’d love to understand why this happened, right? What’s the vision behind merging these two products together? Everyone knows DataStax and Apache Cassandra. I mean, Apache Cassandra has been around for a while, right? It’s not something new, and it’s a database system with very specific use cases. So tell us more about that. What should we expect as the child of this marriage?

Davor Bonaci 41:42
Yep. Absolutely. So obviously, DataStax is rooted in Apache Cassandra. Apache Cassandra is one of the first big data systems that was built. It’s well over a decade old, and it’s still being used by so many companies to store and serve transactional data. Netflix uses it for everything. Uber uses it for everything, and plenty of others. This is a really key storage system, even a decade after it was originally built, and it has been proven time and time again: if you really want to scale, with good unit economics, you go to Cassandra. That has been widely understood. And obviously, DataStax has been a company around Cassandra, helping users adopt it. Over the last few years, DataStax moved into the database-as-a-service market with the launch of Astra, which is a fully managed database-as-a-service product that makes usage of Cassandra easier and cloud native, to support high-growth applications. And so what we’ve been looking at is: what is the real opportunity here? Obviously, databases are not super interesting in 2023; many people see databases as a solved problem. But AI is obviously the interest of most high-growth apps today. And so DataStax’s strategy is to serve smart, high-growth applications for the decades to come. These applications obviously need a really good storage system like Apache Cassandra to store and serve transactional data. But that’s not enough for the apps that are going to be built in the next decade. They need streaming capabilities; they need to compute things from real-time data to serve real-time derived data inside the applications; and they need things like smart predictions, recommendation engines, churn prediction, and many other things that personalize the app experience. So what we are really building here is the best solution for building modern, smart, high-growth applications. You need a storage system, you need a compute system, and you need an AI system, to be able to serve high-growth applications for decades to come.

Kostas Pardalis 44:26
Okay, that’s super exciting. So which parts of this vision are served by Cassandra, and what is Kaskada adding to that? How do they materialize this vision together?

Davor Bonaci 44:42
Yeah. So Cassandra is obviously the storage system that has great unit economics and scales infinitely. Cassandra is the best way to store user-specific information and be able to serve it with low latency. Then we have in our portfolio streaming systems, based mostly on Apache Pulsar, but Kafka compatible, that can ingest data coming from anywhere, coming from high-growth applications. And then we bring Kaskada into the fold, which is really about computing the things that you need for real-time machine learning, which you can then, again, store and serve out of Cassandra. So it’s really about completing the story, completing the picture for serving high-growth applications: you can ingest data, you can store data, and you can manipulate data to compute what you need to be able to build smart, high-growth applications.

Kostas Pardalis 45:48
Yeah, that makes total sense. And just to remind our audience, Kaskada was open sourced recently, so there is a GitHub repo out there with, let’s say, the core engine of Kaskada for event processing. It’s also built on top of some very interesting technologies, like Apache Arrow and Rust. So I think, even if someone doesn’t, let’s say, have to use it in production, just going and seeing how the system is built, the assumptions it makes: it’s a very modern system, and I think it’s going to be an inspiration for many people who want either to use it or to build something like it. So go take a look at it on GitHub, go check kaskada.io, you can get all the links from there. And I think what is important is for you to get feedback from all these people, right? So go ahead, please.

Davor Bonaci 46:58
Yeah, we’d love to engage with folks in the community, listen to their feedback, and obviously advance the state of the art in event processing, particularly for ML use cases. And so we certainly invite everybody to come along, join us, provide comments, and even participate or contribute, as they see fit. Everybody’s welcome.

Kostas Pardalis 47:22
That’s awesome. With Kaskada, is there a requirement for the open source Kaskada to also have Cassandra, or can it be used as a standalone solution?

Davor Bonaci 47:35
It can be totally used standalone. For a quick evaluation, you can do a really simple pip install and play with it on your machine. It requires no connections anywhere and no installation of Cassandra. For trying things out, it’s just a simple pip install; I don’t think it could be easier.

Kostas Pardalis 47:54
Okay, that’s awesome. Eric, all yours again.

Eric Dodds 47:58
Yeah, this has been such a fascinating conversation. And it is exciting to look under the hood: Arrow and Rust and other technologies like that are certainly very exciting, not only for Kostas and me, but I think for our audience too. But Davor, a question for you. When we think about a technology like Kaskada, do you envision it solving the problem of operationalizing ML, closing the gap between those two problems we discussed? Do you envision Kaskada making that problem a lot easier for larger companies? You’ve mentioned a couple of gigantic organizations, and of course, if you’re doing real-time recommendations, you need to be a company of a certain scale, right? You need to have enough data, and you need to have enough engineering resources for that to be worth it; to your point, the unit economics have to work out for your recommendations engine to have positive ROI. But do you think something like Kaskada can actually democratize that process for companies who maybe don’t have multiple different teams who can manage the different parts of this? Have you even seen, with your users or customers, that it actually makes it easier for a single team to build and ship things that maybe would have taken them another couple of years to get to, simply from a resource standpoint, a team standpoint, fragmented infrastructure?

Davor Bonaci 49:37
That is exactly what we hope the impact on our community will be. Nothing that I have talked about is novel in the sense that it couldn’t be built; this is software, anything can be built. We have not invented new math. It’s just about the complexity, and how many people you need, to be able to reliably get to success. People have built real-time recommendations; you can find online posts from companies like Netflix that talk about these problems and how many years it took them to get to the systems they have today. Obviously, businesses like that had a business need to solve the problem, and there was nothing available in the market, so they had to figure out the problem themselves, because it was lucrative for them, it was highly leveraged, so it was worth it for them to solve. I think we are significantly reducing the total cost of ownership of building something like this. It’s becoming much, much cheaper to do it, which means that there are so many more models that can be put into production, because the cost is so much smaller. There are so many more models that have a positive ROI, a lucrative ROI. And that’s what we hope is the ultimate impact, once this gets adopted in larger numbers.

Eric Dodds 51:18
Yeah, it makes total sense. Well, let’s end with maybe some practical advice. If I’m, let’s say, a machine learning engineer, or working in the context of data science, and I want to try this out, would you recommend playing around with trying to build some features in Kaskada that I have maybe already built with my existing system, and just experiencing how much more flexible it is? Or would you suggest maybe starting out with more of an exploratory exercise, trying to find that signal in the noise that you talked about?

Davor Bonaci 52:01
I think that depends who you’re talking to. If you are a person who thinks about the signal, then I’d say, just do a pip install on your laptop and play with it. Consider it success if you discover new things that you did not know an hour ago. That is success: you discover a new predictive signal that was not obvious to you. If you are a person who cares about extracting signals, play, explore, and measure yourself on one thing: what have you learned from your data that you didn’t know before? If you are an engineer who cares about reliability, stability in production, unit economics, what the latency is in production, then I would say the best thing would be to implement, you know, three simple features, check features as code into production, and see how easy it is to populate a feature store that you can then serve from any database, like Cassandra or something else, with a simple API call to get the most recent feature vector. I would say focus on getting to the production part, if that’s what you care about.
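For the serving path he mentions, here is a hedged sketch of fetching the most recent feature vector from Cassandra with the cassandra-driver package; the contact point, keyspace, table, and schema are all hypothetical, and it assumes feature rows are clustered newest-first per entity:

```python
# A hedged sketch: serve the latest feature vector for one entity from
# Cassandra with a single query. All names and the schema are hypothetical;
# assumes a running cluster and rows clustered by timestamp descending.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # hypothetical contact point
session = cluster.connect("features")   # hypothetical keyspace

row = session.execute(
    "SELECT sessions_last_30d, purchases_last_30d "
    "FROM user_features WHERE user_id = %s LIMIT 1",
    ("user-123",),
).one()
print(row.sessions_last_30d, row.purchases_last_30d)
```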

Eric Dodds 53:23
Yep. Makes total sense. Okay, one last question, and this one’s more for maybe the listeners who are early in their career. Maybe they work more on the data engineering or operational side, less on the machine learning side. But they know: okay, I need to familiarize myself with ML, because it’s going to increasingly infiltrate many aspects of data within an organization. So, regardless of technology, you’ve been operationalizing ML for a long time now. Do you have any advice for that person who’s really good on the ops side, but maybe wants to explore the ML side?

Davor Bonaci 54:07
Yeah, so my advice would be: you are really well positioned, and you are in a place that is likely going to be interesting for a long time. We are discovering that data is really powerful, and every company is becoming a data company, seeing how they can leverage the data they have in the best way possible. So you’re really well positioned. If you are on the engineering side, you probably care more about reliability, latency, throughput, unit economics, and so on. And I think here you want to understand the systems the best you can: understand what they are built for. And every single time you are evaluating a system, ask yourself what the system is not built for. What has been sacrificed to achieve the benefits you were just told about? What enabled it to achieve that? What did it ignore? What did it deprioritize? Those types of architectural analysis I wish everybody understood and focused on. So, not running after the next cool thing, but really understanding the trade-offs in the design of different systems, what they are built for, and how to apply them well. Understanding and knowing that, I think, unlocks an engineer’s career, and you start growing and growing. So: understand the systems, and particularly, what is not prioritized to achieve the benefits that people like to talk about.

Eric Dodds 55:59
Such wise advice. Yeah, noticing what’s not there is often much more powerful than simply understanding what’s there. So, wonderful advice. Davor, this has been such a great conversation, and we’re so glad that you gave us some time to come on the show.

Davor Bonaci 56:17
Thank you so much, it was a great conversation. I really enjoyed talking to both of you.

Eric Dodds 56:21
A fascinating conversation with Davor of Kaskada, which was acquired by DataStax, of course. It’s a really interesting story about what they envision in terms of Kaskada being integrated into DataStax, which operates a lot of stuff on top of Cassandra. So lots of cool stuff there, I think, for the future. But Kaskada is also open source, and it does a lot of interesting things in terms of making it easier to not only discover interesting potential features and datasets, but also to deliver and serve those, which is really interesting. One of the things that I thought was fascinating about this conversation was the decision to essentially create a new language as part of the system, because the system in and of itself is capable of doing some really interesting, cool things. But they chose to write a language that, and this is probably a really bad way to describe it, is almost a mix between SQL and Python, right? It’s declarative, but with the flavor of Python, which I thought was fascinating. It really does seem like they’re meeting in the middle of these two worlds, sort of the operational side and more of the statistical side. That was a fascinating approach, and I’m certainly going to be thinking about that one. What stuck out to you?

Kostas Pardalis 57:57
Yeah, 100 percent. There are two things that I keep from this conversation. One has to do with building the technology itself: how big of a problem it is, and why it’s not something that can, let’s say, be solved by just stitching technologies together. You really need to start thinking in first principles and build it in a new way, right? That’s one thing, and that’s, let’s say, driven by innovation in the technology itself. What I found extremely interesting is how important the user experience also is, and what the connection is with what you were saying about the language. The reason they ended up building a new language is because they were trying to figure out what’s the right way for their users, in this case ML engineers, to interact and work with the data, and somehow guardrail them into figuring out what’s the signal out of all this noise out there, right? And exactly as you said, they had to take the good things from all the different paradigms that are out there and put them together in a way that felt native to their user, which is the ML engineer. And the ML engineer lives in Python land. They use Python, you cannot change that; all the libraries are in Python, and no matter how they work with the data, Python will be needed. So it is important to build the right experiences there. And we see that these experiences, the need for these experiences, also drive new innovation, like building new languages on top of all the processing systems that we have. That’s something I think we will see more and more of in the infrastructure space, as we try to democratize access to all these technologies, which is probably something that will be accelerated even further because of all the recent developments with AI, ML, and all that stuff. So yeah, I’m looking forward to chatting again and seeing what comes out of putting Cassandra together with Kaskada.

Eric Dodds 1:00:41
Absolutely. Well, another good one in the books. Thanks for listening to The Data Stack Show, as always, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe in your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.