Episode 112:

Python Native Stream Processing with Zander Matheson of bytewax

November 9, 2022

This week on The Data Stack Show, Eric and Kostas chat with Zander Matheson, the founder of bytewax. During the episode, Zander discusses what makes bytewax unique, the definition of “real-time,” and how bytewax works with other systems.

Notes:

Highlights from this week’s conversation include:

Zander’s background and career journey (2:32)
Introducing bytewax (5:16)
The difference between systems (10:57)
Bytewax’s most common use cases (16:15)
How bytewax integrates with other systems (20:25)
The technology that makes up bytewax (24:31)
Comparing bytewax to other systems (34:17)
What’s next for bytewax (36:31)

Try it out: bytewax.io

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Today we’re going to talk with Zander from bytewax, and Kostas, I love the topic of streaming, which won’t surprise you at all. But this is really interesting. We’ve actually had multiple stream processing technologies on the show. However, bytewax is stream processing in the Python ecosystem, which is really interesting and actually makes a ton of sense at a high level when you think about how prevalent Python is in data. My question is not going to surprise you at all either: I want to know where the motivation came from. Zander was at Heroku and GitHub working as a data scientist, so I’m sure there was some need he saw at those companies that motivated him to start bytewax and do stream processing and some MLOps stuff in Python. That’s what I’m going to ask about. I’m sure you have some technical questions, so tell me what you’re going to dig into.

Kostas Pardalis 1:28
Yeah, I’d love to go deeper into the technical side of things and see how it is built and what technical decisions were made there. The one thing I definitely want to discuss with him is the difference between a stream processing system built in 2022 and the previous generation of streaming engines like Flink, Storm, and Spark Streaming. All these frameworks have been around for a while, so I’m very excited to see what’s different and what’s new.

Eric Dodds 2:08
Totally. All right. Well, let’s dig in and chat with Zander.

Kostas Pardalis 2:11
Let’s do it.

Eric Dodds 2:12
Zander, welcome to The Data Stack Show. We’re super excited to chat.

Zander Matheson 2:16
Thanks. I’m excited to be here. And thanks for having me.

Eric Dodds 2:19
Totally. Okay. Well, tell us about your background. Lots of data science-type stuff. Yeah. Tell us where you came from and what you’re doing today.

Zander Matheson 2:28
Yeah, I’ll give a little bit more of my background. For the last five or six years I’ve been working as a data scientist. I was at GitHub, and before that Heroku. But before that, I had a bit of a mixed path to how I ended up here. I was actually a civil engineer. That’s what I went to school for, and I worked as a civil engineer. Then I went to business school, and after business school I got into working in technology. I worked at a speech technology company in Virginia, doing speech recognition and text-to-speech. I was supposed to do biz dev and sales, but I liked the technology more than I liked doing that, and I was better at working with the technology. So that’s how I ended up where I am today: I got interested right around the time when people started using neural networks for speech recognition. I thought it was so cool, and that’s what led me into the data side of things.

Eric Dodds 3:34
Yeah, very cool. Okay, so I have a question. I don’t know if I’ve asked this on the show in a while, but it always fascinates me when people change careers, which is very common. Are there any lessons from your study or work in civil engineering that you took with you into data science? They’re both technical in their own way, but pretty different.

Zander Matheson 4:08
Yeah. So when I was working as a civil engineer, I was working with hydrology models, which are stochastic models. There’s a similar theoretical basis to both of these things, so there was a takeaway there. In my general education as a civil engineer, we did some programming. It was just part of what we did, but not a lot. So that’s from the technical side of things. But decomposition of problems is a general concept, and being able to analyze things, regardless of the field, is a skill, I guess, that transfers.

Eric Dodds 4:55
Yeah, that’s great. No, I love that. It’s kind of like pipelines, you know, in civil engineering and in data. That’s a great connection there. Okay, well, tell us about bytewax.

Zander Matheson 5:13
Sure. Yeah. So today there’s bytewax the company, and then there’s bytewax the open source project. Bytewax, the open source project, is a Python framework for data processing, and it’s focused on processing streams. Stream processing, or stateful stream processing, is the focus of the framework. You can think of it similarly to Flink, Spark Structured Streaming, or Kafka Streams, in that you can do more complex or advanced analysis on streaming data: aggregations, windowing, et cetera. And it’s Python native, so you can use the Python ecosystem to build these stream processing workflows. With bytewax, those are called dataflows, and that comes from the dataflow paradigm that is underneath bytewax.
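
Editor’s sketch: to make "windowing" and "aggregation" concrete, here is a minimal tumbling-window average in plain Python. It is not the bytewax API. The function name, the `(timestamp, value)` event shape, and the assumption that events arrive in timestamp order are all illustrative choices.

```python
def tumbling_window_averages(events, window_seconds):
    """Yield (window_start, average) for in-order (timestamp, value) events.

    A window is emitted as soon as an event for a later window arrives,
    which is how a streaming system can produce results without waiting
    for the (possibly endless) input to finish.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window is closed: emit it and reset the state.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += value
        count += 1
    if count:
        # Flush the last open window when the (finite) test input ends.
        yield current_start, total / count


events = [(0, 10.0), (4, 20.0), (12, 30.0), (19, 50.0)]
windows = list(tumbling_window_averages(events, window_seconds=10))
print(windows)
```

The running `total` and `count` are the "state" in stateful stream processing: they must survive between events for the aggregation to be correct.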

Eric Dodds 6:21
Tell us about the company as well.

Zander Matheson 6:22
Yeah, so bytewax the company: we started at the beginning of 2021, and we actually started bytewax to bring a hosted machine learning platform to market. We had worked on this problem at GitHub, and a few of us left GitHub to build that. And we had an idea that eventually led to this pivot, where we started building a stream processing framework. The idea was that when you’re running a machine learning model in real time, you pre-process data for features, or join additional data on, then you run the inference with the model, and then you potentially do some post-processing. So we made a framework where you could treat that like a DAG in Python to do these different steps. Then you would just deploy them, and we would scaffold up some infrastructure to run this on Kubernetes, and you could scale it and everything. That was the initial idea. And people started using it to process data in real time without doing any machine learning. It wasn’t really meant for that because it didn’t have state built in. So we ended up scrapping a bunch of that and restarting with a different execution layer underneath.
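
Editor’s sketch: the pre-process, inference, post-process flow described above can be pictured as a chain of Python functions applied to each event. This is a simplified illustration, not the original bytewax platform: the step names, the event fields, and the threshold "model" are hypothetical, and a linear chain is only the simplest kind of DAG.

```python
def preprocess(event):
    # Turn a raw event into model features.
    return {"user": event["user"], "amount_scaled": event["amount"] / 100.0}


def infer(features):
    # Stand-in for a trained model: a simple capped linear score.
    score = min(1.0, features["amount_scaled"] / 10.0)
    return {**features, "score": score}


def postprocess(prediction):
    # Convert the raw score into a decision the caller can act on.
    return {"user": prediction["user"], "flagged": prediction["score"] >= 0.5}


STEPS = [preprocess, infer, postprocess]


def run_pipeline(events, steps=STEPS):
    """Push each event through every step in order and yield the result."""
    for event in events:
        item = event
        for step in steps:
            item = step(item)
        yield item


results = list(run_pipeline([
    {"user": "a", "amount": 120},
    {"user": "b", "amount": 900},
]))
print(results)
```

Note that every step here is stateless, which matches the limitation Zander mentions: nothing is remembered from one event to the next.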

Eric Dodds 7:46
Fascinating. Okay, so let’s dig into that a little bit. So at first, you and several people from GitHub were building a platform for machine learning, which is interesting. We’ve actually heard that on the show a lot, right? We talk to people who work with machine learning and say, tell us about MLOps, and everyone says a lot of things have gotten better, but there’s still so much pain there. So the goal was: there’s MLOps pain, let’s go solve that. And then people just started using it for a dataflow that didn’t include the machine learning part.

Zander Matheson 8:26
Yeah, I mean, that’s a simplification in some sense, but it captures the essence. So four or five years ago, when I was at GitHub, we were working on a machine learning platform. We did a bunch of work to extend Kubernetes and make our own custom operators and resource definitions so that you could run training jobs, do hyperparameter stuff, and do all this MLOps stuff on Kubernetes. We took some of those ideas and built them into a platform, so that smaller businesses without the larger teams to build all that ops infrastructure could just say, you know, bytewax, deploy my workflow, and then we’d handle the rest and give you metrics, alerting, scaling, and stuff like that.

Eric Dodds 9:23
Super interesting. Okay. So the next question: it’s always fascinated me when you build something and then people use it in a way that you didn’t expect. Why do you think users were using it the way that they were? Obviously there’s a component that met an unmet need, and they thought, oh, this is a better way to do this. What were the driving factors, use cases, or technical limitations they were facing that the initial version of bytewax unlocked in terms of the streaming piece?

Zander Matheson 9:58
Yeah. So many of these groups were still going to use some machine learning down the road. But there was not a really good way in Python to do the feature transformations they needed to do in real time against streams, so they were trying to transform data in real time for that. That was it. In Python, there wasn’t an easy way to essentially hook up to Kafka and then use some Python tools, whether it’s NumPy, et cetera, to transform that data. And then you have a feature set that you will serve in real time to the model.

Eric Dodds 10:42
Super interesting. Okay, Kostas, I have 100 more questions but please jump in here because there’s so much to talk about. And I know you have some great questions for him on the technical side.

Kostas Pardalis 10:55
Yeah. You mentioned at the beginning some frameworks for stream processing, like Flink, which is probably the most well-known one. There have been a few others in the past, like Storm from Twitter. There was another one, Samza. There have been quite a few attempts at building stream processing systems, right? What’s the difference between a system like bytewax today, compared to, let’s say, a paradigm that was implemented probably a decade ago? Because all these tools pretty much started to appear around the beginning of the 2010s, right? So what’s the difference there?

Zander Matheson 11:47
Yeah, I mean, I’m not an expert in all those systems, so I could be missing things. But with many of these systems, there’s a series of trade-offs you make for correctness, latency, scalability, et cetera. Many of these systems take different approaches and trade one thing for another, and that may lead to them being dropped and something new coming along. Also, we software engineers are a little bit like trend followers, so some things seem to fall out of trend and then they don’t gain adoption. Certain ecosystems of tools that they were integrated with or worked well with maybe fell by the wayside, new systems came up, and new technologies worked better with those systems. So around 2014 or 2015, Spark was becoming a thing, and people were moving away from some of the different MapReduce frameworks and using Spark for different things. So it seems like there are these trends. And often it’s a trade-off: you make an architectural decision that lets you have better performance, but you’re giving up correctness. Or maybe you add state in a different way, or you persist or checkpoint state and give up some performance. So there are all these trade-offs, and I think that leads to a plethora of different services.

Kostas Pardalis 13:39
Yeah, makes total sense. So say someone today has to choose between Flink, because Flink is quite dominant in stream processing when state is involved, and bytewax. Why would they choose one or the other? Why would I go and work with bytewax instead of Flink?

Zander Matheson 14:04
Yeah, so it comes back to those trade-offs again. Flink is a more mature product. You have SQL you can use with Flink, there are bindings for many different languages, and they’ve built out a bunch of features that bytewax doesn’t have. But the trade-off we are investing in is that it’s a lot more involved to get up and running with Flink for a big group of users that maybe don’t have experience with that whole Java ecosystem, tuning the JVM, et cetera. So our thought is that we’ll make a trade-off of maybe not having the exact same robustness that Flink has today, but we’ll give an experience where it’s a lot easier and quicker to get started for this subset of users, maybe a newer group of users: machine learning engineers, data scientists, or newer data engineers. So that’s the trade-off we’ve been playing with in the ecosystem.

Kostas Pardalis 15:05
Yeah, developer experience is quite important. I mean, everyone who has at some point had to work with systems like this, or maintain these systems, probably has horror stories too. It’s not easy. Okay, it’s easy to take Kafka, for example, get a couple of Docker images, and run them locally. But from that point, taking it to production and keeping it in production is a huge difference in terms of the complexity of all the operations involved. So I totally understand what you mean by the experience of the developer there. And what are the use cases for the users of bytewax today? It sounds like a very interesting story: you started from machine learning, you saw that people were using it for other reasons, and you ended up creating bytewax. What are the most common use cases you see where bytewax is involved?

Zander Matheson 16:12
Yeah. So I can maybe align them to applications: IoT, security, the financial industry. There are many different use cases within those applications. You have anomaly detection for things like fraud or malfunctioning sensors, or you can have just windowed averages. We have someone using bytewax to basically get an average over a window of energy usage. Then they show the user what the energy usage is, and the user can make some decisions based on that. One thing I’ll say about bytewax is that it’s a very generally applicable framework. It is a data processing tool for literally any set of data, which also makes it very difficult sometimes to position it correctly. But there are applications, like I said, in those industries, with various different use cases. And then, coming back to what I was talking about with machine learning and how we ended up with this stream processing framework, that application is around features. Today, when you train a machine learning model, you generate a feature set from some dataset, and you use that to train the model, and you think, sweet, I’m getting great performance, I have these awesome features. Then you need to use that somehow in production. There’s been all this tooling around that: you have Feast, which is an open source thing from Tecton, and many other feature stores. But you don’t really have a good tool for creating those features on the fly that still allows you to use the Python ecosystem. So that’s another big use case: whatever you do to features when you’re working with static data, you have to do that in real time.
And so you’re going to need state, because you’re going to need to know, I don’t know, how many pickups did this Uber driver do in the last 30 minutes? And I’m going to feed that into my model to know if they’re the person to recommend for the next one, or whatever it might be for recommendation engines, personalization, et cetera. So that’s a big group of users, or potential users, I guess.
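
Editor’s sketch: the "pickups by this driver in the last 30 minutes" feature is a per-key sliding-window count. A minimal version in plain Python might look like the following. The class name, the timestamps in seconds, and the assumption that each driver’s events arrive in time order are illustrative, not taken from bytewax.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60


class PickupCounter:
    """Keep per-driver pickup timestamps and count those inside the window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.pickups = defaultdict(deque)  # driver_id -> recent timestamps

    def update(self, driver_id, timestamp):
        times = self.pickups[driver_id]
        times.append(timestamp)
        # Drop pickups that are 30 minutes old or older.
        cutoff = timestamp - self.window_seconds
        while times and times[0] <= cutoff:
            times.popleft()
        return len(times)


counter = PickupCounter()
stream = [("d1", 0), ("d1", 600), ("d2", 900), ("d1", 2000)]
features = [counter.update(driver, ts) for driver, ts in stream]
print(features)
```

The dictionary of deques is exactly the kind of state a stateless pipeline cannot hold, which is why this use case pushed the design toward stateful stream processing.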

Kostas Pardalis 18:35
Yeah, that’s awesome. And when you’re talking about real-time feature generation, what is real-time in terms of time itself? Are we talking about milliseconds? Are we talking about seconds? What are the requirements there?

Zander Matheson 18:52
I don’t know if there’s an actual definition. I think of real-time as sub-second. But for a streaming system doing real-time feature transformation, I assume we would be talking about end-to-end latency: from the moment a user interacts with something to the feature being stored in the database where it can then be served, so some low-latency key-value store or whatever. That would be the total latency, and I think you’re probably going to be over a second in that. I’m not sure, but you might be near a second. So you could possibly be near real-time and not actually real-time.

Kostas Pardalis 19:39
Yeah. So what’s the architecture there? We have the model somewhere, and we need to feed it with features we create in real time. And we have the system where the user interacts, and this data will be stored in something like a database, whatever. But how do we put bytewax together with the existing production environment that we have, and create, let’s say, a system where we can, in almost real time, generate features, feed them to the recommenders or whatever models we have there, and do something with it? How does this work?

Zander Matheson 20:21
Yeah, to make it a little bit more concrete, we work a bit with the folks that are working on the Feast project, so I’ll just speak along that vein. In fact, you can use bytewax as a materialization engine with Feast, so it turns offline features into online features. But anyway, the architecture looks like this: you have some streaming platform, be it Kafka or Redpanda or Pulsar. Upstream of that, you have interactions with your service where you’re collecting the data. You can be ingesting from logs, or you can be collecting telemetry data, whatever, and it’s going into Kafka. Then downstream of Kafka, or the Kafka alternative, would be bytewax. It’s a consumer: it’s listening to Kafka, then it’s doing the transformations, and then it’s writing to the online feature store, which is going to be Redis, or DynamoDB, or Postgres, or MySQL, or something that can be rather low latency. It’s probably also writing out to the offline feature store. That’s your data warehouse, or data lake, or some more analytical storage engine, so that the data can be reused for retraining or training new models, et cetera. I don’t know if I answered the question at all there. But most likely you’ll have some orchestration layer that’s managing some serving things, and that’s where your model is actually going to be loaded. Maybe your model is pickled and stored in an object store, or it’s part of an image. In some way, shape, or form, your model is loaded into memory, and it’s in a service, let’s say in a pod in Kubernetes, and it’s serving traffic. So it’s like a microservice, and a request will be made there. Once that request is made, the feature that has been gathered in real time has to be ready in that low-latency database.
Because that model-serving microservice will reach out for that feature set and use it in conjunction with whatever was just requested to actually make the prediction. You have other things as well on top of that. MLOps has so many layers right now, which is what we’re getting into. But at the very base, you have services with a model loaded into memory, you have some databases, you have a streaming platform, and you have some compute and processing system to process the data, both in real time and in batch.
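
Editor’s sketch: the architecture above can be mimicked in a few lines of plain Python, with in-memory stand-ins for each component. Everything here is an assumption for illustration: the dict plays the role of the online feature store (Redis, DynamoDB, and so on), the list plays the offline store, `consume` plays the stream processor reading from Kafka, and `predict` plays the model-serving microservice. The click-count feature and the threshold rule are hypothetical.

```python
from collections import defaultdict

online_store = {}    # stand-in for a low-latency key-value store
offline_store = []   # stand-in for a warehouse: append-only feature history


def consume(events):
    """Stand-in for the stream processor: compute features, write both stores."""
    counts = defaultdict(int)
    for event in events:
        user = event["user"]
        counts[user] += 1
        feature = {"user": user, "click_count": counts[user]}
        online_store[user] = feature      # latest value, served to the model
        offline_store.append(feature)     # full history, kept for retraining


def predict(request):
    """Stand-in for the model service: fetch the feature, then score."""
    feature = online_store.get(request["user"], {"click_count": 0})
    # Hypothetical rule in place of a real model loaded into memory.
    return {"user": request["user"], "recommend": feature["click_count"] >= 2}


consume([{"user": "u1"}, {"user": "u2"}, {"user": "u1"}])
print(predict({"user": "u1"}), predict({"user": "u2"}))
```

The key design point is the separation of paths: the stream processor writes features ahead of time, so the serving path only does a fast lookup when a request arrives.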

Kostas Pardalis 23:23
Okay, that was super insightful. To be honest, it’s a very common question, and I ask pretty much everyone who has something to do with ML. I think it’s a common question for everyone who doesn’t work in MLOps, because these are very complex environments. It was great to see that, and also to hear how feature stores and different technologies like bytewax interact, as you described. So that was great and very, very helpful.

Okay, let’s get a little bit more into what makes bytewax what it is, the technology that is used. You mentioned something about Timely Dataflow at the beginning, which is a different computation model. What else is in there? It’s Python together with Rust, right? So tell us a little bit more about the choices there, the pros and the cons, and how you ended up making the choices that you made.

Zander Matheson 24:29
Yeah, so I’m going to go back to when we pivoted from the machine learning platform. It was a hosted thing, and we had some frustrations with the hosted thing because it was very difficult to get people to send data into our hosted environment. So we were making the pivot to provide what people wanted, which was the stream processing framework, and to give them the ability to run it in their own environments, so we could have a different go-to-market motion that wasn’t about the hosted platform. We decided, okay, let’s look at our options. And we found Timely Dataflow, and we thought, oh, this is really interesting, because it can run as a single thread on a local machine. So we can give people the ability to write these stream processing workflows in a way that they can run on their local machine, and it can also scale up to many different workers. That was part of the reason why we chose Timely Dataflow. And it’s a cool project because it takes a little different approach in its architecture compared to the things you’re more familiar with, like Spark. Anyway, Timely Dataflow is an awesome library. The person who created it is Frank McSherry. I don’t know if you’ve had him on the podcast. I think his title is something like Chief Science Officer at Materialize, and he was part of the team that started Materialize. Materialize uses Timely Dataflow, but it also uses something called Differential Dataflow. Anyway, Timely Dataflow is a Rust library written by Frank McSherry, and it’s based on a project called Naiad, which was a Microsoft Research project back in the day. Like I said, it’s a Rust library, and we wanted to provide a Python framework.
So we thought, okay, we need to figure out how to make some sort of FFI. It just so happens there’s a library called PyO3, which allows you to marry Rust and Python. So we were able to marry the Timely Dataflow library into this Python framework. In actuality, the execution and runtime is Rust, and it’s Timely Dataflow, and then you have your processing functions that are written in Python. What is really cool about the whole thing is that we have the ability to move different things down to the Rust layer if we want, for performance. For example, we recently released a Kafka input configuration capability, so you don’t have to write all the Kafka consumer code, and we made that in the Rust layer. The thought was, okay, cool, if we do the input and the output in the Rust layer, and we provide some serialization in that Rust layer, in essence you can move data from one place to another without having to serialize it into Python, and you could still interact with it as if it was Python code. You can get some performance benefits there. But yeah, PyO3 is a really cool library. The reason I think it’s really interesting is that, much like how Python leverages C to speed up code, you can use PyO3 in a similar way to speed up your Python code. You can get really insane performance boosts, like 20 to 30x sometimes, which is incredible. And PyO3 is your gateway to that kind of performance.

Kostas Pardalis 28:31
So how does bytewax work when you want to start using the Python ecosystem? Okay, I get it: I can write Python, and this Python will have some bindings to the Rust code, and the Rust code is going to be the more native part that gets executed. But that’s one thing. It’s another thing when you start adding all the different libraries around it, right? How does this work, especially in a streaming environment?

Zander Matheson 29:03
Yeah, so sometimes you have to pass around types. PyO3 allows you to compile Rust and use that from Python. It also allows you to interpret Python from within the Rust execution. Timely Dataflow has this concept of operators, which you might be familiar with from other frameworks, where you have things like map and filter, et cetera. There’s some code around those, and that’s how you control the flow of data through the dataflow. You can pass those operators Python functions. So those Python functions are running as Python: they’re being run by the Python interpreter, and they’re running on the Rust execution layer. You will have to take the hit from serializing data from a Rust type to a Python type and vice versa, potentially. But then you’re using the Python types with the Python code. And again, I don’t want to get people confused with the word native. That’s why many of these libraries will just work, and you’re not constrained to a UDF world, a user-defined function world, where if that library doesn’t exist, you just don’t have access to it.
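
Editor’s sketch: the idea of operators that accept ordinary Python functions can be shown with a toy dataflow written entirely in Python. This is not bytewax or Timely Dataflow (there the operator runtime is Rust); `MiniDataflow` and its methods are hypothetical names. The point it illustrates is that any callable, including functions from existing libraries such as `json.loads` and `math.sqrt`, can be handed to an operator unchanged.

```python
import json
import math


class MiniDataflow:
    """Toy operator chain: each operator stores a user-supplied callable."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def run(self, items):
        for item in items:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    # A filter rejected the item: stop processing it.
                    keep = False
                    break
            if keep:
                yield item


flow = (
    MiniDataflow()
    .map(json.loads)                          # library function as an operator
    .filter(lambda e: e["value"] >= 0)        # plain lambda as a predicate
    .map(lambda e: math.sqrt(e["value"]))
)
output = list(flow.run(['{"value": 9}', '{"value": -1}', '{"value": 16}']))
print(output)
```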

Kostas Pardalis 30:35
Yeah, that makes sense. For example, in Python you have people using frameworks like pandas to work with data. Can I use pandas, interact with a pandas DataFrame, and use that as the way I work with data inside bytewax right now?

Zander Matheson 30:56
Yeah. I mean, that’s one of the great things to use stateful map with. If I’m accumulating data on a per-user basis over a certain amount of time, I can stack that into a DataFrame, and then I can actually pass that DataFrame on to the next step and do whatever compute function there. I could do it in the earlier step, but I could also pass it on. It’s really fun to use it that way, because you get access to all the stuff that you want to use. You don’t have to think about it, because you’ve been using this for x number of years and you know that API.
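
Editor’s sketch: the accumulate-then-pass-on pattern can be illustrated without bytewax or pandas. The `PerUserBatcher` class below is a hypothetical stand-in for a stateful map step. It batches by row count rather than by elapsed time to keep the example short, and it emits plain lists of dicts; in a real pipeline the next step could build a DataFrame from such a batch with `pandas.DataFrame(batch)`.

```python
from collections import defaultdict


class PerUserBatcher:
    """Accumulate rows per user and release them as one batch when full."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.rows = defaultdict(list)  # user -> rows collected so far

    def update(self, user, row):
        """Add a row to the user's state; return a full batch or None."""
        rows = self.rows[user]
        rows.append(row)
        if len(rows) < self.batch_size:
            return None
        del self.rows[user]  # reset this user's state after emitting
        return rows


def mean_amount(batch):
    # The "next step": any computation over the accumulated batch.
    return sum(row["amount"] for row in batch) / len(batch)


batcher = PerUserBatcher(batch_size=2)
events = [("u1", {"amount": 10}), ("u2", {"amount": 5}), ("u1", {"amount": 30})]
summaries = []
for user, row in events:
    batch = batcher.update(user, row)
    if batch is not None:
        summaries.append((user, mean_amount(batch)))
print(summaries)
```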

Kostas Pardalis 31:36
100%. That’s a very, very important point, in my opinion. Inventing a new API and having to educate everyone to use the new API just takes forever, right? It’s not easy. And people choose one or the other for reasons: pandas is being used, and it’s good enough, so why would someone change that now? There are computational limitations, but they’re not related to the API itself. That’s why you see stuff like PySpark, for example, where you have a pandas API, but the back end that does the computation is Spark, so you can scale to a lot of data. Or you have something like bytewax and do stream processing on the back end. But I think, looking back at what we were discussing at the beginning about the previous generation of stream processing environments and the difference in developer experience, that’s a very big differentiator. When you had Storm or Flink or whatever, they were imposing a way of working and an API too; it wasn’t just what was happening with the system behind it, right? Now we see a big shift, I think, with developer productivity tools and all the systems that we’re building, where the system tries to fit into what the developer knows. Let’s take the API that has proven its semantics and that people know how to use, and let’s work behind the scenes and change the things we need to make them better. And I think that’s amazing, in my opinion. I love to see projects doing that.

Zander Matheson 33:24
Yeah, it’s always the magic feeling, right? When the tool meets your expectation for the experience, that’s when you feel the magic feeling of, yeah, this just worked, cool. And that was the feeling I got when bytewax was ready to use and I was making a dataflow and I could just use the Python libraries I wanted to. I always play around with this library called River. It’s a Python machine learning library for online learning, and it’s just so fun because it just works. You just import it, do some anomaly detection, and it just works. It’s really fun.
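
To give a feel for what online anomaly detection inside a stream looks like, here is a minimal sketch in plain Python. It is a stand-in and not River's API: the class `RunningZScore` and the function `flag_anomalies` are hypothetical names, and the technique is a simple running z-score (Welford's algorithm), chosen only because it fits in a few lines. The score-then-learn loop is the part that carries over to online learning in general.

```python
import math


class RunningZScore:
    """Online anomaly scorer based on a running mean and variance.

    Stand-in for an online-learning library; this is not River's API.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def score_one(self, x):
        """Return |z| of x against the points seen so far (0.0 before 2 points)."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        """Update the running statistics with one new observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def flag_anomalies(stream, threshold=3.0):
    """Score each value before learning from it; collect values over the threshold."""
    model = RunningZScore()
    flagged = []
    for x in stream:
        if model.score_one(x) > threshold:
            flagged.append(x)
        model.learn_one(x)
    return flagged


# Made-up sensor readings with one obvious outlier.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 50.0, 10.1]
anomalies = flag_anomalies(readings)
```

Each value is scored before the model learns from it, so the model never sees a point before judging it.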

Kostas Pardalis 34:06
Yeah. That’s cool. All right. I have two more questions before I hand the microphone back to Eric. So the first one: you mentioned Materialize. Materialize is, again, a system that has been built on top of Timely Dataflow, although what they’re building is based on SQL. What’s the difference between a system like bytewax and something like Materialize?

Zander Matheson 34:33
Yeah, so at the very bottom, they’re both using Timely Dataflow. But the Timely Dataflow aspect is more about managing the flow of data, and both companies, Materialize and bytewax, have exposed different ways to interact with that data. The effort behind Materialize is to expose a SQL layer so that it’s more portable. I don’t know if they’re at this point yet, but I think they’re nearly at a point where you can interact with it like it’s a Postgres database. I think you can use psycopg2 and just write queries against it. They had to do a bunch of work between Timely Dataflow and the user interacting with it through SQL. And there’s another project called Differential Dataflow that incorporates Timely and then builds on top of that to handle a lot of the work of converting the SQL queries into what that dataflow would actually look like. So the similarities are at the dataflow level, and then we kind of diverge. But they’ve done really interesting work. I would say you should talk to Frank McSherry, get him on the podcast, because he would be able to speak about it in a more elegant way than I can.

Kostas Pardalis 36:16
Yeah, absolutely. And that’s something that we should definitely do. Okay, and my last question: what’s next? What’s the roadmap for bytewax, and what excites you there?

Zander Matheson 36:30
Yeah, so bytewax is both an open source project and a company, and that company has to be sustainable. So we have to make bytewax into a commercially viable product as well as an open source project. That’s something that will come in the future, not the most immediate. More immediately, it’s just about more adoption. bytewax has many of the things you need to build advanced analysis on streams. And so it’s about getting it out there, getting the word out, building awareness that the project exists, and getting some adoption. That’s the more immediate future. And then we’ll turn to how we can add additional features that make it easier to run bytewax at scale, to integrate bytewax with an existing organization’s infrastructure and other systems, and provide a paid version of bytewax as well. One other piece that’s on the roadmap for the next little bit: we positioned bytewax as a library for doing advanced analysis on streaming data, for machine learning use cases or cybersecurity and stuff like that. We’ve had a lot of people who we’ve spoken to, or who have started to use bytewax, tell us, oh yeah, that’s really interesting, and it’s on our roadmap, but it’s a year out or so, or nine months out. What we need to solve today is we just need to ingest data or move it from one place to another. You could do that with bytewax, but you’re writing the code yourself. So we might experiment a little bit with how we could potentially serve those users as well, and put some effort into making a few different connectors between different sources of data, so you can get started with that really easily, maybe just set some configurations and then deploy it.

Kostas Pardalis 38:37
Interesting, super exciting. I think we have many reasons to record another episode a quarter or two from now, so looking forward to that. Eric, it’s all yours.

Eric Dodds 38:49
Oh, why thank you, Kostas.

Kostas Pardalis 38:51
My pleasure.

Eric Dodds 38:52
I’d actually like to follow up on what you just talked about in terms of different users and the way they’re using bytewax, but also return to something you mentioned about your original mission when you came out of GitHub and started bytewax, which was to sort of democratize certain parts of this ML workflow for smaller companies, right, who didn’t have a whole team to actually manage the infrastructure side of the house when it came to the ML side of it. And one thing that was interesting to me, and this could just be the top things that came to mind, is that a lot of the use cases you mentioned or examples you gave tend to be things that companies at scale are really interested in, but that smaller companies can’t do. So, for example, computing features and pushing them to an in-memory store so that they can be served with second-level, or slightly sub-second or slightly over-second, latency. In the enterprise, that’s a huge use case, right? Because when you’re delivering recommendations like that, changing conversion by 1% can mean huge stuff for the bottom line. But in the spirit of democratization, what does that look like for bytewax now? Is that sort of the basic data pipeline use case? How do you see that? I’d just love to know, do you feel like you’ve retained that original mission, or that component of your mission?

Zander Matheson 40:34
I mean, to some extent, the reason we have a Python framework is some of that democratization. It’s one of the most widely used programming languages, and it is the de facto language of data, if you exclude SQL, whether that’s a programming language or not. But yeah, I think that we’re at a little different layer of companies than we were targeting before, potentially. I’m not 100% sure. Part of it is that the mainstream medium enterprise, or whatever, hasn’t gotten to that level of sophistication where they are actually building models and deploying them and all that. So maybe we were a little bit ahead of where the market was, and we weren’t able to democratize anyone because no one was looking to be democratized. And if we were to look at streaming, I think you’re pointing out something that’s interesting, which is that we’re still focused on more advanced stuff, and we’re still kind of ahead of it in terms of the broader market. The broader market is trying to do more ingestion and some transformation, move data around, and get it into a centralized thing. But what I think we can do there to try and democratize is build infrastructure, tools I guess, that make it easy for people to use dataflows to also do that ingestion part. There’s an open source spotlight thing that DataTalks.Club does, and it comes out next week, and I do an example demo where you use bytewax to take server logs. It was something we did at GitHub. We didn’t have telemetry across our products, and we wanted to know what was going on, so we would just ingest data from web server logs, which would reside there at the load balancer. We’d take that kind of request data and put it in the data warehouse so that we could query it and derive insights. So I’ve made a demo doing that with bytewax and this tool called DuckDB. The spirit of it was democratization. It was like, here’s the lightest-weight thing you can possibly use to take data and understand what’s going on with your product. And yeah, I don’t know, maybe it’s not actually what anyone will use. But hopefully it can help more people use streaming data. That is kind of why we are building bytewax. We want more people to be able to do it.
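
The log-ingestion idea can be sketched in a few lines of Python. This is not the demo mentioned above: the log format (Common Log Format style), the table layout, and the helper names `parse_line` and `load` are assumptions made for illustration, and the standard library's `sqlite3` stands in for DuckDB so the snippet runs without extra installs.

```python
import re
import sqlite3

# Assumed format: Common Log Format-style lines. The real format at a
# load balancer may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one raw log line into a row tuple, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    size = 0 if m["size"] == "-" else int(m["size"])
    return (m["ip"], m["ts"], m["method"], m["path"], int(m["status"]), size)


def load(lines, conn):
    """Parse the lines and insert the valid rows; return how many were loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(ip TEXT, ts TEXT, method TEXT, path TEXT, status INTEGER, size INTEGER)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample lines, including one that should be skipped.
sample = [
    '203.0.113.7 - - [09/Nov/2022:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120',
    '203.0.113.8 - - [09/Nov/2022:10:00:01 +0000] "GET /docs HTTP/1.1" 200 2048',
    '203.0.113.7 - - [09/Nov/2022:10:00:02 +0000] "GET /pricing HTTP/1.1" 500 -',
    'not a log line',
]

conn = sqlite3.connect(":memory:")  # stand-in for a DuckDB database
loaded = load(sample, conn)
hits = conn.execute(
    "SELECT path, COUNT(*) FROM requests GROUP BY path ORDER BY path"
).fetchall()
```

Once the rows are in a queryable store, questions like which pages get the most requests become a single SQL query.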

Eric Dodds 43:27
Yeah, totally. I mean, I would say it sounds like giving people easier access to leverage streams in Python is, in itself, to your point, a big part of that, right? If you think about someone who wants to do something interesting with a Kafka stream, especially if that Kafka stream is managed by a completely different team, that’s really hard to do at a lot of companies, either from a technical standpoint or a cross-functional standpoint or whatever. So yeah, that’s super interesting.

For our listeners, we should have mentioned this earlier, but where can they go to learn more about bytewax or try it out?

Zander Matheson 44:19
Yeah, so we have our website, bytewax.io. There’s the documentation, with a bunch of examples and API documentation, and then the GitHub repo. We have a ton of examples in there as well. So right now those are the two places to go and try it out. You can pip install it and then just run the examples and get a feel for how it works. I have a bunch of blog posts, too, that have tutorials in them, and they’re more positioned for certain use cases. So there’s a cybersecurity one, there’s a level 2 order book from an exchange kind of one, and anomaly detection for IoT. So if you have more of an application in mind, maybe that’s the better place to start. Look there and see if it’s solved, and if not, go dig through the examples and learn how to do it. And obviously, come to our Slack channel. We have a Slack channel that you can join. People jump in there with problems they’re working through, and we’re happy to work through them. I love doing that. I always find it fun to deal with that stuff. So that’ll be a nice way to get started.

Eric Dodds 45:34
Yeah. Oh, love it. Well, Zander, thank you so much for taking the time. Man, we learned so much. It was so fascinating. I loved hearing you and Kostas break down all the technical details. So super fun for me. And yeah, we’d love to have you back on the show as bytewax grows and hear how things are going.

Zander Matheson 45:53
Cool. Yeah. Thank you for having me. I hope we didn’t go too deep down the rabbit hole there.

Eric Dodds 45:58
You can never go too deep down the rabbit hole on The Data Stack Show.

Zander Matheson 46:04
We were definitely deep in the stack at that point. We were at the bottom.

Eric Dodds 46:08
Yeah. Yeah. All right. Well, thanks again.

Zander Matheson 46:14
Cool. Yeah. Thanks, guys. It’s great to chat.

Eric Dodds 46:16
Thank you, Zander. It was great.

Super interesting. One interesting takeaway I have is that it seems like there is a proliferation in the ways that you can do stream processing, which is really interesting and, I think, really good for the industry, right? With Materialize, which we talked about a little bit on the show, you can accomplish something similar with SQL. You have traditional streaming tools like Kafka, right? And with Confluent, you can do interesting things with transformation and stream processing, or whatever. So it seems like, in general, if you look at all these technologies, they’re responding to a big need in the market related to stream processing. And you can access it through different interfaces, right, or developer experiences, depending on what you need. So why do you think that’s true? Stream processing has been around for a long time, but there seem to be a lot of new technologies cropping up around it.

Kostas Pardalis 47:24
To be honest, I think there’s some kind of explosion in the solutions out there that have to do with interacting with data in general. So it’s not just streaming. But I think stream processing was one of these things that was really struggling with developer experience. It’s a very different type of data to work with, and the APIs that existed in the past, and still exist, are very different from what most of the developers out there are used to. So what I find extremely interesting is how much attention to the developer experience there is right now. I can see that we build a new system not just because we have a new, super performant way of doing things, but primarily because we want to make things easier for developers to work with, and more accessible to more developers out there. So I think that’s going to be a very common theme across new products. And if you remember when we had that panel about streaming, there was a discussion about the developer experience. Actually, I think one of the most important topics that all of the panelists talked about was the developer experience. And we see this happening right now. We see new systems coming up that primarily focus on the developer experience.

Eric Dodds 49:01
Totally agree. Well, maybe we should do another streaming panel, because you kind of have a battle of different ecosystems, right, like SQL versus Python, et cetera. So maybe we can have Brooks put together a streaming battle, which would be great. But that’s all the time we have for today. Thank you, as always, for listening. Subscribe if you haven’t, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
