The PRQL: Does Machine Learning Need Its Own Orchestrator? Featuring Sandy Ryza of Dagster

January 2, 2024

In this bonus episode, Eric and Kostas preview their upcoming conversation with Sandy Ryza of Dagster.

Notes:

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show prequel. This is a short bonus episode where we preview the upcoming show. You’ll get to meet our guests and hear about the topics we’re going to cover. If they’re interesting to you, you can catch the full-length show when it drops on Wednesday. We are here with Sandy Ryza from Dagster Labs. Sandy, so excited to chat with you about data ops, workflows, data pipelines, all of the above. Thanks for coming on the show.

Sandy Ryza 00:36
Thanks for having me. Excited to chat with you.

Eric Dodds 00:39
All right, well, give us your background, briefly.

Sandy Ryza 00:43
Yeah, so I’m presently the lead engineer on the Dagster project, and I think we can talk a little bit more later about what the Dagster project is for those who aren’t familiar. Earlier in my career, I had a mix of roles that involved building data infrastructure, building tools that would help data practitioners, and working as a data practitioner and machine learning engineer myself. I started my career at Cloudera. While I was there, I wrote this book, Advanced Analytics with Spark, that taught how to use that particular framework to do machine learning. And then I spent a number of years as a practicing data scientist at Clover Health and Motive, which used to be called KeepTruckin, and also worked in public transit software before finding myself back in the data tooling space at Dagster Labs.

Kostas Pardalis 01:30
That’s awesome, Sandy. And I think we are going to have a lot to talk about. But something that I’m particularly interested in going deeper into is the role of, let’s say, an orchestrator in the lifecycle of data: defining it, why we need it, why it has to be an external tool, right, and not part of the query engine, for example. And also why we currently have such a diverse, let’s say, number of solutions out there, especially when we are considering the more traditional data-related operations and the ML operations. We even see, you know, new orchestrators coming out that are focusing just on the ML side. Why do we need that when we have something that already works for data? I’d love to hear and learn from you why that is, and what it means for the practitioners out there, right? What’s on your mind, though? What would you like to chat about and get deeper into during our conversation?

Sandy Ryza 02:41
Yeah, the topic that you brought up is one that I’ve thought about quite a bit, both from the perspective of being a machine learning engineer and from the perspective of working on tools for machine learning engineers. And, you know, I think we can get into this later, but the fact that I ended up working on a general-purpose orchestrator kind of says a lot about how I view the role of orchestration and data pipelines in the machine learning engineering domain. So I’m really excited to talk about that. I’m also excited to talk about orchestration in general, what it means to build a data pipeline, and the relevance of that to different roles, like data engineers, machine learning engineers, and data scientists.

Kostas Pardalis 03:25
Yeah, that’s awesome. I think we have a lot to talk about. And what do you think? What’s going

Eric Dodds 03:29
Yeah, let’s get to it. All right. That’s a wrap for the prequel. The full-length episode will drop Wednesday morning. Subscribe now so you don’t miss it.