Episode 171:

Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster

January 3, 2024

This week on The Data Stack Show, Eric and Kostas chat with Sandy Ryza, Lead Engineer at Dagster. During the episode, Sandy shares insights on data cleaning, data engineering processes, and the need for improved tools. He introduces Dagster, an orchestrator that focuses on assets like tables, datasets, and machine learning models, and contrasts it with traditional workflow systems. He also explains Dagster’s integration with DBT, while also exploring the changing dynamics in data roles, the impact of modern tooling, the potential for increased creativity in the field, and more. 


Highlights from this week’s conversation include:

  • The role of an orchestrator in the lifecycle of data (1:34)
  • Relevance of orchestration in data pipelines (2:45)
  • Changes around data ops and MLOps (3:37)
  • Data Cleaning (11:42)
  • Overview of Dagster (13:50)
  • Assets vs Tasks in Data Pipeline (19:15)
  • Building a Data Pipeline with Dagster (25:40)
  • Difference between Data Asset and Materialized Dataset (28:28)
  • Defining Lineage and Data Assets in Dagster (29:32)
  • The boundaries of software and organizational structures (37:25)
  • The benefits of a unified orchestration framework (39:56)
  • Orchestration in the development phase (45:29)
  • The emergence of the analytics engineer role (51:53)
  • Fluidity in data pipeline and infrastructure roles (52:40)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here with Sandy Ryza from Dagster Labs. Sandy, so excited to chat with you about data ops, workflows, data pipelines, all of the above. Thanks for coming on the show.

Sandy Ryza 00:40
Thanks for having me. Excited to chat with you.

Eric Dodds 00:42
Alright, well give us your background. Briefly.

Sandy Ryza 00:47
Yeah, so I’m presently the lead engineer on the Dagster project. And I think we can talk a little bit more later about what the Dagster project is, for those who aren’t familiar. Earlier in my career, I had a mix of roles that involved building data infrastructure, building tools that would help data practitioners, and working as a data practitioner and machine learning engineer myself. I started my career at Cloudera. While I was there, I co-wrote a book, Advanced Analytics with Spark, that taught how to use that particular framework to do machine learning. I then spent a number of years as a practicing data scientist at Clover Health and at Motive, which used to be called KeepTruckin, and also worked in public transit software before finding myself back in the data tooling space at Dagster Labs.

Kostas Pardalis 01:34
That’s awesome, Sandy. And I think we’re going to have a lot to talk about. But something that I’m particularly interested in going deeper on is the role of the orchestrator in the lifecycle of data — like, defining it, why we need it, why it has to be an external tool, right, and not part of the query engine, for example. And also why, currently, we have such a diverse, let’s say, number of solutions out there, especially when we consider the more traditional data-related operations versus ML operations. We even see, you know, new orchestrators coming out that are focusing just on the ML side — like, why do we need that when we have something that already works for data? I’d love to hear and learn from you why that is, and what it means for the practitioners out there. What’s on your mind, though? What would you like to chat about and get deeper into during our conversation?

Sandy Ryza 02:45
Yeah, the topic that you brought up is one that I’ve thought about quite a bit, both from this perspective of being a machine learning engineer, and from this perspective of working on tools for machine learning engineers. And, you know, I think we can get into this later. But the fact that I ended up working on a general purpose orchestrator kind of says a lot about how I view the role of orchestration and data pipelines in the machine learning engineering domain. So really excited to talk about that. Excited to also talk about orchestration in general, and what it means to build a data pipeline, and the relevance of that to different roles, like data engineers, machine learning engineers, data scientists.

Kostas Pardalis 03:28
Yeah, that’s awesome. I think we have a lot to talk about. And what do you think?

Eric Dodds 03:34
Let’s get to it. So great to have Dagster back on the podcast after such a short time. All right, well, we have a ton to talk about. And specifically, we want to talk about sort of the intersection of the changes around data ops, MLOps, and that whole space. I mean, there are so many tools and so many opinions out there. So I want to get there, but I want to start by hearing your story, because it’s pretty fascinating. Can you give us an overview of sort of the arc of your career — where you started, and how you ended up back in the place where you started?

Sandy Ryza 04:12
Yeah, my career is a bit of a loop, and I’ll quickly walk you through it. So I started out in data in 2012, which felt like a qualitatively different era of data. This was the era when data scientist was kind of a burgeoning new term — a buzzword, the sexiest job. Big data was the other buzzword, and the entire stack, and a lot of the focus of where the technology was going, was around how we can process these enormous amounts of data. I worked at Cloudera, which was kind of at the heart of that. So I was a contributor to the open source software projects that were at the heart of this big data software stack. One of those was Hadoop MapReduce, which was originally based on these foundational internal Google papers about how to process Google-sized data. And the other one was Spark, which was sort of an improvement upon the original Hadoop framework that made it accessible to a much broader set of people and a much broader set of use cases — for example, machine learning. So I started my career working on these open source software projects that were fundamentally built for data practitioners, like data engineers and data scientists, and became kind of interested in what was on the other side of the API boundary. We were building these systems that could process enormous amounts of data, and it’s like, that’s cool, but it’s very abstract — what value do you actually bring the world by processing these enormous amounts of data? And so I wanted to move up the value chain a little bit and learn about what the world of using these tools looks like. I first did that within Cloudera: we had this internal consulting function, which was sort of like an embedded data science team, and we would go on site to, let’s say, a large telco and help them use these big data tools to understand their users.
But eventually, I ended up working in full-time roles as a machine learning engineer and data person at companies that actually had embedded versions of those functions. One of them was Clover Health, where we were working on health insurance. Another one was KeepTruckin, which is now called Motive, working on technology that helps truck drivers do their jobs. And so, you know, I started out talking about how 2012 felt like a very different era in data, and I think in a way that’s largely because the problems that you would focus on were very different at the time. I think there was this kind of acknowledgment that maybe the role of data had gotten ahead of itself a little bit — that the tools had maybe solved some layer of problems, but there was this other layer of problems that was bigger and scarier on top of that layer. It wasn’t about the size of the data, but about the complexity of the data. From the perspective of Cloudera, it was like, okay, once we make you a tool that can process two terabytes of data in, you know, 25 seconds, then you’ll just take that and make your machine learning model, and you’re done. It’s awesome. You just fit a regression model or, you know, train some trees, and who even needs a data team? But moving to the other side of this — being in these roles where I was actually developing machine learning models, doing analyses, trying to answer questions with data — it became clear that the hardest part of actually doing this job was wrangling and structuring this enormous amount of complexity: starting with data that was, you know — I don’t think you’d say garbage, you’d say very disorganized — and trying to bring some order, not just to the data itself, but to the process that generates and keeps that data up to date. Yep.
And so the consequence of this was that, because doing these basic data tasks was so disorganized and difficult in these jobs, I ended up spending — especially when I was in more lead roles and responsible for making other people on my team productive — an enormous amount of my time just building internal frameworks at these companies to do this job. And, you know, maybe we’ll get to this later, but the biggest way you can improve a machine learning model is getting better data to that model. Yeah. Yep. So, in the roles where I was responsible for building better machine learning models, I was primarily concerned with how I could get better data to these models, and do so in a reliable, repeatable way. And I basically ended up spending a huge chunk of my time building frameworks that would allow me and other people on my team to do that successfully. When it came time to find a new role, around 2020, I sort of felt like: why go into a company and build another internal version of this framework that, you know, might be really useful for that one company, when I could try to build a version of it that is accessible to many different organizations? It ultimately felt like a much higher-leverage thing to do. And I happened to know Nick, the founder of Dagster and of the company, which used to be called Elementl but is now known as Dagster Labs. And I was basically like, this is a problem — you know, I’ve built this system a couple of times before, and I want to do it again, but do it general and right this time. So I talked to Nick and joined the team at Dagster Labs as one of the first six or seven employees, and I’ve basically been working full time on Dagster, this open source software project, since then.

Eric Dodds 10:45
What a story. You know, it really struck me — I love the analogy: you can process two terabytes of data in 25 seconds or whatever, but it’s like, you have this race car, and in order to drive it, you actually have to go build an oil refinery first.

Sandy Ryza 11:07
I think that’s an amazing analogy. Yeah. Love that.

Eric Dodds 11:11
Yeah, that’s super ironic. Okay, so, a couple of questions here. Well, first of all, actually, what I’d love to know is: when did you step back? As a practitioner, you know, going through multiple roles — do you remember the moment, or the project, where you said, wow, I’m seeing a pattern here, because I seem to keep going back and working on this same thing?

Sandy Ryza 11:42
Yeah, so I think that, fundamentally, the way that my brain works is very lazy. And what I mean by that is, I really don’t like to try to hold a bunch of information in my head at one time. I really want to be able to think clearly; I really want some external system to be able to offload that to. So pretty early on in these roles where I was doing data pipelining tasks, I got frustrated with the tooling and found myself trying to at least contribute to it and improve it in minor ways. I think another piece there was talking to a lot of other practicing data scientists at the time. And, you know, there was this refrain of: we’re hired to do machine learning, but all we do is clean the data. I think it took some number of those conversations — I don’t know how many — for me to realize and reframe it in my mind that cleaning the data isn’t this task of drudgery that you have to do before doing the exciting part; it is kind of fundamentally the heart of the machine learning engineer’s job. And, you know, you can think of it as cleaning data, or you can think of it as producing reliable datasets that are generally useful within your organization. So for someone coming from a software perspective — a building perspective — this notion of doing data cleaning as good engineering is this work of structuring: taking these reusable pieces of data and building even more useful and reusable pieces of data on top of them. I found that a very motivating way to think about the work. And I think that probably clicked in my first data role, but then really got reinforced in my later roles.

Eric Dodds 13:50
Okay, super interesting. My next question is actually more related to Dagster. So what I’d love for you to do is tell us, you know, give us an overview: what is Dagster? What does it do? And then I’d love to know how much of what you were building in those practitioner roles — how close was it to the stuff that Dagster does?

Sandy Ryza 14:17
Got it. Okay, so I’m trying to think about what the best angle is to approach this. Okay, so I think, both in my experience and generally in these roles, a pretty common pattern is that you’ll have a set of analysts that aren’t, like, software engineers — not the most technical people, although they’ll have some proficiency with Python or some proficiency with SQL. And you’ll end up with some sort of domain-specific language or internal framework inside of a company that allows those analysts to do their job. It’s not always like this, but if you have a more tech-savvy analyst, or some data engineer who’s responsible for supporting these analysts, they’ll end up building something internally that makes it so the analyst doesn’t have to, you know, spin up a cron process and run Docker every time they want to, let’s say, keep some table up to date. And if you look at these frameworks — thinking about the frameworks at the organizations I was at — they always tend to revolve around tables. And so the fundamental abstraction, when you’re thinking about reproducible work in a data analyst or even machine learning role, is a table, or some sort of dataset. Like, I want to start with this data that we have, that’s maybe not clean or not formatted in the way that’s most useful to me, and then, in the course of my analysis, ideally factor out some sort of cleaner, more useful version of this dataset that, the next time I have to do this analysis, I’ll be able to rely on. At Clover Health, as well as KeepTruckin, that was the natural way that we built our internal tools to make our data scientists productive. And then there’s this interesting mismatch, because that was the natural way for us to think about it as the people in these data roles.
But then you look at the tools that were the orchestrators of the time — and, you know, are still popular orchestrators now, like Airflow, for example — and they’re focused on a totally different set of abstractions, right? With Airflow, you go in and you define a DAG, and a DAG is a set of tasks, and you’re fundamentally thinking about tasks when you operate your data pipeline. So the primary challenge, as someone trying to do this data science and data pipelining work, was translating from this table way of looking at the world, which was very natural to me and the other people I worked with, to this task- and workflow-based world, which was the language that tools like Airflow spoke. And so the internal frameworks that I would end up working on at these companies were basically these translation layers: let me express what I’m trying to do — my data pipeline — in terms of tables, or datasets, or machine learning models, and the relationships between those entities, and then, you know, have some software figure out how to Airflow that for me, and turn it into this world of DAGs and tasks. It was a messy fit. You could, you know, get your pipeline running on a schedule, but there were all these weird translation issues at the border. So when it comes time for someone to debug an error or look at logs, they’re forced to think in terms of these very different abstractions than the ones that are natural to them as data practitioners. So this is a very long-winded way of saying why I was excited about working on an orchestrator like Dagster — the opportunity to build something that thought about assets.
And when I say an asset, I mean a table, a dataset, a machine learning model — any sort of persistent object that captures some sort of understanding of the world. It was the opportunity to think about that as the central abstraction for building the data pipeline, and to allow everything to revolve around that.

Eric Dodds 18:43
Super interesting. Can you talk about, maybe just at a high level to start with, what it means for a system to rely on the concept of assets as opposed to tasks? What are the fundamental differences there, in terms of how the system itself operates? Because, I mean, you can do orchestration with Airflow, and you can do orchestration with Dagster, right, but we’re talking about two fundamentally different approaches.

Sandy Ryza 19:15
That’s right. It permeates in a bunch of different ways — I’m trying to think about the best way to approach it. When you build a workflow using tasks, there’s this fundamentally top-down approach where you have these individual tasks, and then you assemble them into a DAG, and the DAG, you know, is the workflow: it defines the dependencies between those tasks. Whereas when you’re working with assets, it ends up being a fundamentally more distributed approach. So when you define a data asset — which is synonymous with saying, I have a table that I want to create — let’s say there’s this raw data here, all the raw events that come into my system, and I want to create a table called cleaned events or gold events. When I define that table, I define its dependency on the upstream events table. And the way that the entire dependency graph is defined is at the level of individual assets, instead of having to do this top-down approach that involves assembling a set of DAGs. The consequence of thinking that way is you’re not forced to make these tough and often arbitrary decisions about where the nodes in your graph go. A common failure mode for people who build DAG-based data pipelines is they’ll have one enormous, unwieldy DAG, and anytime they make a change, they have to contend with that entire DAG — execute it, or deal with the enormity of it. Or they’ll go the opposite way, and chop up their DAG into tiny little pieces, but then lose the ability to reliably express the relationships between those pieces. When you think about data assets — when you define dependencies in terms of what data do I need to be able to generate this dataset — you sidestep that problem entirely.
A second piece of that is the fundamentally declarative approach that comes when you’re thinking about assets first. When data engineers are questioned by other people in the organization, like management or business stakeholders — maybe "questioned" has too much of an interrogative connotation — but when data practitioners want to communicate with stakeholders about their work, the language they normally communicate in is data assets. And, you know, I found this very true in my own work: when I’m explaining to someone the data pipeline I’m working on, or the thing I’m going to produce for them, the thing that I draw on the whiteboard is the tables that are going to be produced. Yeah, yep. And you see this almost even more clearly with machine learning: I’m going to make this machine learning model, these are the features that are going to go into it, these are the evaluations that are going to come out of it. So naturally, when you’re communicating about what you’re building in your data pipeline, you think in terms of this network of data assets. And the advantage of an orchestrator that thinks primarily about data assets is that that language is the language you use to actually define your pipeline. The consequence is that you have this degree of confidence that the pipeline is actually going to generate that network of data assets, because it’s the language the pipeline is defined in terms of.

Kostas Pardalis 23:12
Makes total sense. Sorry, Eric — could you give us an example, like a concrete example, of a pipeline, and how it could be built using the concept of software-defined assets in Dagster?

Sandy Ryza 23:32
Yeah, I really wish I had the ability to use a visual aid, but I’ll do my best to describe it. So, super basic: let’s say you have a table of raw event data. Let’s say you’re running a website — people come onto your website and click on things, people log in, maybe your website sells something. Yep. So you have these core basic entities that will often come in in some raw form at the beginning of your pipeline. Those might be, let’s say, all the events that happen on your website — clicks, page views, logins. And your role as a data engineering team might be to deliver cleaned-up versions or aggregated versions of this kind of data as core datasets that other people inside your organization can use to build data products or do analyses.

Kostas Pardalis 24:33
Just to make it a little bit more concrete: let’s say we collect these events, right, all the different events, and at the end we want to get to the point where we have, somewhere, the number of signups per month. Right? And I’m giving this as an example because I think it’s very straightforward, and it’s exactly what you’re saying — but the case you’re making is the more general case, right? So let’s say we go from an event in JSON, captured with something like RudderStack and posted to our data warehouse, and we want to end up calculating how many signups we have in these months, right? I mean, I think many people, especially coming from using Airflow or something like that, get, let’s say, the tasks that are needed there. How would we do that with software-defined assets?

Sandy Ryza 25:40
Great question. So with the asset-focused way of building that data pipeline, the first thing you do is think about the nouns. It’s natural to start with where you’re trying to get to. And, you know, I’ll even do this sometimes on a whiteboard: I’ll write out where I’m trying to get to, I’ll write out the data that I have, and then I’ll write out a set of intermediate datasets that will help me get from the data that I have to the data that I’m trying to get to. So, thinking in terms of your specific example, the data that you want to get to is probably a table that has information about these signups — you might even write out the schema of that table. And the data that you’re starting with is, let’s say, this raw, untransformed sequence of JSON blobs representing events; maybe they’re in S3. And let’s say that where you want to get to is this table in Snowflake, so that it’s easy to query from dashboarding tools. Yep. So those are two nodes in your graph. And then you think about: okay, to actually build a reliable signups dataset, what are the subcomponents that I need to have to be able to accurately calculate signups? So, let’s see — maybe one of those subcomponents is the set of times people hit the enter button on my signup form. But I also know that there’s a bunch of internal testing that we do, where people will hit that enter button, and we don’t actually want to count that in our business-facing signup metrics. So it’s important to exclude that internal testing. Now, say we have this separate table somewhere that is a list of all of our internal test users. So to compute this ultimate signups table, we’re going to need to depend on a couple different things. One is going to be this table of internal test users.
One is going to be this list of form submissions for the signup form. Then you think: okay, how do I get the form submissions? Ultimately, that will be derived from my underlying asset at the beginning of my graph, which is the list of JSON blobs that are clicks on the website. And you can draw these out, and basically put arrows connecting any dataset to the datasets that it needs to read in order to generate itself.
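The whiteboard exercise Sandy describes can be sketched in plain Python. This is an illustration of the idea only — the dataset names come from the hypothetical example above, and the code is not Dagster itself:

```python
# A toy sketch of the asset graph from the signups example: each dataset maps
# to the datasets it needs to read in order to generate itself.
asset_deps = {
    "raw_events": [],                                    # JSON blobs in S3
    "form_submissions": ["raw_events"],                  # signup-form hits
    "internal_test_users": [],                           # maintained separately
    "signups": ["form_submissions", "internal_test_users"],
}

def build_order(deps):
    """Return an order in which the assets can be materialized (a topological sort)."""
    order, seen = [], set()

    def visit(asset):
        if asset in seen:
            return
        seen.add(asset)
        for upstream in deps[asset]:
            visit(upstream)        # everything upstream materializes first
        order.append(asset)

    for asset in deps:
        visit(asset)
    return order

print(build_order(asset_deps))
```

The point of the exercise is that once every dataset declares only its own inputs, a valid execution order for the whole pipeline falls out automatically — no one has to assemble a top-down workflow by hand.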

Kostas Pardalis 28:12
And so the asset is the dataset itself — like the materialized result — or is it the concept, let’s say? What’s the difference there between the two?

Sandy Ryza 28:27
Got it. So first of all, just to clarify: when we say data asset in the world of Dagster, the reason we don’t use a word just like table is that we want these to be more flexible. They could be relational data, but they could also be a set of images in S3, or a machine learning model. When we talk about an asset, we’re talking about some sort of object in persistent storage — it doesn’t necessarily need to be in a data warehouse. It could be a table in a data warehouse, it could be a file on a file system, or a model that’s in some sort of model store. That’s what we’re referring to when we refer to a data asset.

Kostas Pardalis 29:05
Okay, okay, that’s great. Cool. So, from what I understand, the way that Dagster works is by actually asking the user to define the lineage — let’s say, the materialization steps that the data has to go through until it delivers the end result, right? So instead of thinking in terms of processing, we’re thinking in terms of outcomes. It’s not, let’s say, the query per se that generates the data; it’s the data itself and how it connects to the previous dataset that was the input that actually generates it. Do I get it right?

Sandy Ryza 30:00
That’s exactly right. And I want to add that you do have to think about processing at some point, because, you know, Dagster isn’t going to read your mind and just figure out what needs to get run in order to build the signups dataset from the events dataset. But when you write out your processing logic, you’re sort of hanging it off of this scaffolding of the data asset graph.

Kostas Pardalis 30:30
And how is the user using Dagster? Is there, like, a DSL? Is it something like annotations that you use to annotate some object in a notebook? How does the user actually go and define the lineage between the data assets?

Sandy Ryza 30:53
Great question. Yeah, so Dagster exposes a Python API that allows you to define your data pipeline. The most straightforward way to define a data pipeline in Dagster is to write a Python file and include a set of asset definitions in that Python file. An asset definition is basically a decorated function. So, for example, if you want to have an asset called signups, you would write a Python function called signups. You would decorate it with Dagster’s asset decorator, just to indicate it’s an asset, and then optionally include metadata about the asset, including dependencies on other assets. And then inside the body of that function, you would include the logic that’s actually required to build this signups table — so, for example, read data from some other table, do some transformations, and then write it out to your storage system.
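To make the shape of that concrete, here is a tiny self-contained stand-in for the pattern Sandy describes — one decorated Python function per asset, with upstream assets inferred from parameter names. It mimics the shape of Dagster’s asset-decorator API, but it is a toy, not the real library, and the pipeline contents are invented:

```python
import inspect

ASSETS = {}

def asset(fn):
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn

def materialize(name, cache=None):
    """Recursively materialize an asset and everything upstream of it."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = {dep: materialize(dep, cache)
                    for dep in inspect.signature(fn).parameters}
        cache[name] = fn(**upstream)
    return cache[name]

# Hypothetical pipeline from the episode's signups example
@asset
def form_submissions():
    return [{"user": "alice"}, {"user": "bob"}, {"user": "test-1"}]

@asset
def internal_test_users():
    return {"test-1"}

@asset
def signups(form_submissions, internal_test_users):
    # Exclude internal test traffic from the business-facing signup count
    return [s for s in form_submissions if s["user"] not in internal_test_users]

print(len(materialize("signups")))  # 2
```

Note that nothing here assembles a workflow top-down: asking for `signups` is enough, because each asset definition already declares what it depends on.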

Kostas Pardalis 31:52
And I would assume this is something that is, let’s say, opaque to the Dagster system itself — it can be SQL, or it can be a DataFrame API used with Spark, or PySpark, or Polars, or whatever. The processing logic itself is not something that Dagster is opinionated about.

Sandy Ryza 32:19
That’s exactly right. The idea is that it’s just a Python function, and you can invoke any computation in any framework from that Python function. A really common thing to do is to invoke DBT. For those who aren’t familiar, DBT is a framework that allows defining tables basically as SQL statements. So let’s say you want to define this signups table: you would create a file called signups.sql, and inside that file you include a select statement that says select blah blah blah from the events table. And Dagster has a DBT integration that will basically digest that DBT table definition, have Dagster understand it, and then, when it comes time to actually execute that node in the graph, will invoke DBT to execute the SQL inside your database.
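For illustration, the file Sandy describes might look something like the sketch below. The model and column names are invented for this example; `{{ ref(...) }}` is DBT’s way of declaring a dependency on another model, which is what lets an orchestrator recover the asset graph from the SQL:

```sql
-- models/signups.sql — a hypothetical DBT model, not from the episode
select
    s.user_id,
    s.submitted_at as signed_up_at
from {{ ref('form_submissions') }} as s
where s.user_id not in (select user_id from {{ ref('internal_test_users') }})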

Kostas Pardalis 33:24
Okay, that’s interesting. Why would someone do it like that, though, and not just use one of them directly — Dagster, or DBT? Why would someone use both systems together?

Sandy Ryza 33:39
Got it. Yeah, so I think there are two directions to think about that question. One is, why wouldn’t you just use Dagster? And the other is, why wouldn’t you just use DBT? Starting with why wouldn’t you just use Dagster: DBT has become a standard for expressing data transformations in SQL, and it has a lot of features that make it really powerful at doing that — for example, you can write macros, and its standard way of specifying data dependencies has become widely accepted as part of the analytics engineering skill set. For Dagster to try to rebuild that would unnecessarily fragment the ecosystem and make it less accessible to the set of users who are already familiar with it. One way of thinking about DBT is even as a set of extensions to the SQL programming language that make it useful for defining data pipelines. So that’s why DBT is a really useful tool, even with Dagster. For the question of why not just use DBT: DBT is very narrowly focused on a particular kind of data transformation in a particular kind of data pipeline. In most organizations, even when a large body of the work they’re doing fits into the DBT framework, often a large body of the work will not fit easily into it. For example, they’ll have steps in their pipeline that do things that are just fundamentally not SQL transformations — maybe they’re moving data between different storage systems, or they’re building machine learning models. And those don’t really make sense to represent inside of DBT. So if you were to use DBT for all your SQL and then something separate for all of your non-SQL stuff, you’d end up in this fragmented world: you wouldn’t have a single consistent view of, or ability to execute, your entire data pipeline. Embedding DBT in Dagster allows you to kind of get the best of both worlds.

Kostas Pardalis 35:51
Okay, that makes sense. Before we get to my next question, you mentioned something interesting about fragmentation. There are plenty of orchestrators out there, right? And one of the ways orchestrators get created is that there’s a use case where, for whatever reason, the existing orchestrators don’t cover the need, and suddenly we end up with yet another orchestrator. I think that’s very common, especially if we compare the ML world and the, let’s say, data processing world. We need some way to differentiate the two, but I think our audience gets what I’m trying to say. So, for example, we have Flyte. It’s an orchestrator; you go to the website and it says “build and deploy data and ML pipelines.” What’s the difference between something like Flyte, which focuses more on the ML side of things, at least from my understanding, and something like Dagster? And why do we end up having all these different orchestration tools? In this case, dbt is also an example of that, right? Because dbt, in a way, is also, as part of the product, kind of an orchestrator. If someone lives only inside SQL, technically they can use only dbt. They don’t need Dagster or Airflow or some other system. And I think that’s very confusing in the end. The practitioners are like, okay, what is going on here? So tell us a little bit more about that, and how you think about it, both as a practitioner and from the Dagster side.

Sandy Ryza 38:03
Yeah, a lot of thoughts there. I think there’s this truism, which I think is true in many cases, that software boundaries end up modeling organizational boundaries. Teams will build software that serves the needs of their team, and if an organization is structured in a certain way, that can lead to two different teams building software that solves very similar problems, but in slightly different and incompatible ways. To make this concrete in the world of data: often, within a data organization or within a company at large, the functions of analytics, machine learning, and data engineering will be organizationally separate. Historically, I think what that has led to is that people within those functions have ended up building, maybe internally at first and then going on to open source or commercialize, tools that are rooted in their understanding of that particular function. Something I’ve encountered working at companies with fairly early data functions is that you end up having to fill a lot of roles, and the software that’s needed to orchestrate in the world of machine learning is actually very similar to the software that’s needed to orchestrate in the world of analytics. So I’ve come to a belief that you don’t really need super specific tooling for a lot of these domains. A lot of the boundaries and silos that are set up are artificial or unnecessary, and not only unnecessary, but actually have a fairly high cost from the perspective of a machine learning team.
You know, as I mentioned earlier, the highest-leverage way you can improve your machine learning model is by feeding it better data and ensuring the data coming into it is clean and correct and accurate. It becomes a lot harder to do that if the underlying processes generating the data use a totally different software stack from the one you’re using. So you actually can reap a lot of benefits by having the converse, de-siloed view of the world, which allows a machine learning person to understand the impact of changes far up in the data pipeline, because their machine learning model is trained using the same orchestration framework that the upstream data asset is built with.

Kostas Pardalis 40:51
That makes sense. But what are the differences, though, between, let’s say, building workflows or trying to orchestrate ML work, versus trying to orchestrate data engineering or analytical work? What are the differences between them?

Sandy Ryza 41:11
One thing that comes up in ML work more than in data engineering or analytical work is that the experimentation and development phase is often a lot more rich and open-ended. In the simplest case, if you’re just building a basic table, you write a SQL query, run it a couple of times, commit it to your repo, and now you have that table running. Ideally, your orchestrator is good enough that it can basically just start updating that table. When you’re working with a machine learning pipeline, there’s often a whole workflow of experimentation that happens. Even if you’ve written essentially the perfect code the first time, you end up needing to tweak parameters and try out your model on different features. So the iterative process is a lot heavier. The compute is often much more heterogeneous as well in the world of machine learning. If you’re able to express your computation in SQL, you can basically just ship it off to Snowflake, or DuckDB, or whatever your database is, and have it execute inside of there. But if you’re dealing with machine learning models, there’s a wide array of Python libraries you could be using, and hardware you might not otherwise have access to, like GPUs. You end up needing a much more flexible execution substrate to orchestrate across. So to sum up, I think there are two larger points to think about when you’re orchestrating machine learning versus, let’s say, analytic data pipelines. One is the iterative, experimentation-based workflow, and the other is this more complex computational environment.
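The heterogeneity Sandy describes can be pictured as a dispatch problem: SQL steps ship off to a warehouse engine, while ML steps may need GPUs or a generic Python worker. A toy dispatcher (all names here are illustrative, not any real orchestrator’s API):

```python
# Toy dispatcher: route each pipeline step to an execution substrate.
# Illustrative names only -- not a real orchestrator's API.

def pick_substrate(step):
    """Decide where a pipeline step should run based on what it needs."""
    if step.get("language") == "sql":
        return "warehouse"        # e.g. Snowflake or DuckDB executes it in-engine
    if step.get("needs_gpu"):
        return "gpu_cluster"      # ML training often needs special hardware
    return "python_worker"        # generic Python execution

steps = [
    {"name": "clean_events", "language": "sql"},
    {"name": "build_features", "language": "python"},
    {"name": "train_model", "language": "python", "needs_gpu": True},
]

for s in steps:
    print(s["name"], "->", pick_substrate(s))
```

An analytics-only pipeline would hit the first branch every time; it is the ML steps that force the orchestrator to support multiple substrates.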

Kostas Pardalis 43:12
Does this interactive, experimentation part happen in production, in the ML world? Or is it a completely separate task? What I’m trying to get at here, what I’m trying to understand: in my mind, at least, the orchestrator is something that gets into

Sandy Ryza 43:35
into like,

Kostas Pardalis 43:37
the process when you actually go into production, right? You have concluded how things should be done, and now you have to deploy something repeatedly and, obviously, in a reliable way, so these things keep happening. Because, yeah, you will experiment with ML, but I would argue that anything to do with software will have a lot of experimentation, right? Even if you’re building, like, a website. It’s part of the nature of the job, in the end, when you build software one way or another. There is this iterative process during development, but that’s how engineering works: you reach a point and you say, hey, okay, now that this is what I want to do, let’s push it into production, right? Is this different with ML? Does ML not have this distinction, so you have to incorporate the orchestrator earlier? Or is the orchestrator something that should be incorporated as part of the development phase, and not only the production phase?

Sandy Ryza 44:49
Got it. Yeah. So in broad strokes, I would be inclined to agree with you, in particular on the point that experimentation is a big part of the software industry in general, and of data engineering as well. A lot of the pieces of the machine learning development pipeline that tend to be presented as unique to machine learning are actually general software development practices, which is part of why I don’t think these require specialized tools. The one area I would want to speak more about, though, is this notion that orchestration should only be part of production. I don’t think people should be replacing their Jupyter notebooks with orchestration. But I do think it’s very powerful to be able to work with an orchestrator in much earlier phases of the data development lifecycle. If you think about an orchestrator abstractly, it’s a system that understands the dependencies between data assets and their upstream data, and is able to execute computation along the lines of those dependencies. That’s a really important function even when you’re early in the experimentation process. For example, if you’re prototyping a change to the logic that generates one of your data assets, it’s often really important to understand the implications of that change, how it affects that data asset and how it affects downstream data assets, far before you decide to commit that change to production.
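Sandy’s example here, understanding what a prototype change touches before committing it, reduces to a reachability question over the asset graph. A minimal sketch (the graph and asset names are made up for illustration):

```python
# Minimal sketch: given an asset dependency graph, find everything
# downstream of a changed asset (i.e., what a prototype change could affect).
# The graph and asset names are hypothetical.

downstream_of = {            # edges point from an asset to its consumers
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["features", "daily_report"],
    "features": ["churn_model"],
}

def affected(changed):
    """All assets transitively downstream of `changed`."""
    out, stack = set(), [changed]
    while stack:
        for child in downstream_of.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

print(sorted(affected("cleaned_events")))
# ['churn_model', 'daily_report', 'features']
```

This is exactly the information a developer wants during experimentation, not just in production: change `cleaned_events` and you know the report and the model both need a look.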

Kostas Pardalis 46:36
Yep, 100%. Yeah. Okay, got it. And from your experience with Dagster, who do you see as the primary user? Is it more the data engineer, or the more traditional data practitioner? Or do you see more people coming from ML? And has there been any change in the trends of who is actually coming to learn about Dagster these days?

Sandy Ryza 47:10
Yeah, so we see a lot of different users; let me try to categorize them in some sort of way. One pattern of Dagster usage is that data platform engineers adopt Dagster to help them organize the computation of a bunch of different functions inside their data organization. So maybe the data platform engineer is supporting a team of analytics engineers, or supporting analytics engineers as well as machine learning engineers, and they want to set up a kind of shared orchestration environment where all of the data assets being produced by these people, who may be a little bit less technical, can be orchestrated in one place. That’s one pattern of Dagster usage. Another pattern is sort of the bread-and-butter data engineering usage. In this case, the person who adopts Dagster is also the one writing the content of the data pipelines. They’re not just facilitating other people’s data pipelines; they’re actually defining data assets in Dagster and writing the logic to move data around or transform that data. And last of all, we see a lot of people doing machine learning using Dagster. In these cases, it’s normally a mixed machine learning and data pipelining function. They’ll be using Dagster to train the machine learning model, but then also to generate all the features that feed into the machine learning model, and then perhaps take that machine learning model and do batch inference with it.

Kostas Pardalis 48:53
Yeah, it makes sense. One last question from me, and then I’ll give the microphone back to Eric. With the emergence of LLMs and, let’s say, AI engineering, not just ML engineering, is there any difference in terms of what is needed to build around LLMs for the existing orchestrators like Dagster? What do you need to do to go and work with LLMs and AI?

Sandy Ryza 49:27
Yeah, it’s interesting. In broad strokes, you still fundamentally have data that you’re feeding in, and data pipelines still exist. There are some differences. For example, feature engineering becomes less important in the world of LLMs, because these models are powerful enough to do some of the thinking that a machine learning engineer would otherwise have needed to do. But at the same time, you have prompting, and you’re moving data through vector databases, so the pipelines you end up creating look very similar; some of the nodes just have slightly different labels. We’ve seen users use Dagster for traditional machine learning as well as for LLMs, and fundamentally the shape of the work is not so different.

Kostas Pardalis 50:14
All right, that’s all from me for now. Eric, sorry for hijacking the conversation here.

Eric Dodds 50:21
No, that was amazing. That was amazing; I learned so much. We have time for one more question here at the end, and I want to ask you about roles and team structure, in a world where the lines between data engineering, ML engineering, MLOps, and data science really blur. I mean, many of the things we’ve talked about today, you could label the conversation a conversation about MLOps or a conversation about data engineering, either way. And you kind of saw this before: dbt, I think, helped coin the phrase “analytics engineer,” where you had analysts who were maybe somewhat literate in SQL or Python, but not actually running pipelines. That started to change, and a lot of analysts learned to run pipelines. And the same with data engineers, who ran pipelines but didn’t necessarily work on the modeling layer. So you had this analytics engineer role emerge that’s a little bit of a hybrid. What do you think is going to happen in the relationship between, traditionally, the ML engineer, or the data scientist, and the data engineer, in that realm?

Sandy Ryza 51:53
Yeah, to your point, it definitely feels like the boundaries between these roles, if they were always blurry, have become very blurry. I feel like in 2015, most data scientists would spend half their time explaining to other people what exactly a data scientist was, or sparring with other people over the definition of a data scientist. Thankfully, those conversations aren’t such a huge part of the job of data science anymore. Maybe that’s because people have just come to accept that it means so many different things, and trying to pin it down is a bit of a fruitless exercise. The way I tend to think about it is that there are spectrums of proficiency that different people have, which maybe eventually end up getting clustered into these different roles. So one axis of proficiency is data modeling, which is tightly related to engaging with the facts of the particular business. And then there are other axes of proficiency that are more about infrastructure: dealing with Kubernetes and different substrates. From what we’ve seen, the boundaries are super fluid, and it really varies from organization to organization how separate the person who thinks about the data pipeline is from the person who thinks about the infrastructure the data pipeline runs on, who writes in Python, who writes purely in SQL. It’s difficult to build a data platform with the assumption that these functions are going to end up totally siloed.

Eric Dodds 53:48
Hmm. Yeah, I think it’s really interesting. And I think the tooling has really helped enable a lot of this change. For example: who writes in Python, who writes in SQL? With a lot of modern tooling, it doesn’t matter. You can have someone writing SQL and someone writing Python, and they can use the same workflow and work on the same dataset, which is incredible. For anyone who’s only familiar with modern tooling, it’s worth remembering how recent that is. So it is pretty cool. Personally, what I’m very excited about is that when you give people much easier access to explore different areas that interest them, they can follow their curiosity without these massive technical walls that would have required a career change to overcome. The tools are making it a lot more fluid, which I think will spark a lot of creativity, which is exciting. Well, Sandy, we’re at time here, but it’s been so great. We learned so much, and you’re doing incredible work at Dagster. So thanks for giving us some of your time.

Sandy Ryza 55:06
Thanks so much for having me on the show.

Eric Dodds 55:10
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.