Episode 148:

Exploring the Intersection of DAGs, ML Code, and Complex Code Bases: An Elegant Solution Unveiled with Stefan Krawczyk of DAGWorks

July 26, 2023

This week on The Data Stack Show, Eric and Kostas chat with Stefan Krawczyk, the Co-Creator of Hamilton and Co-Founder of DAGWorks. During the episode, Stefan shares his journey working in data at Nextdoor, Stitch Fix, and others on his way to founding DAGWorks. The conversation also covers Stefan’s creation of Hamilton, how the framework works with definitions and time-series data, how it improves pipelines, what makes Hamilton an ML-oriented framework, the importance of unit testing, and more.

Notes:

Highlights from this week’s conversation include:

  • Stefan’s background in data (2:39)
  • What is DAGWorks? (3:55)
  • How building point solutions influenced Stefan’s journey (5:03)
  • Solving the tooling problems of self-service at an organization (11:44)
  • Creating Hamilton (15:53)
  • How Hamilton works with definitions and time-series data (19:34)
  • What makes Hamilton an ML oriented framework? (23:39)
  • Navigating the differences between ML teams and other data teams (26:27)
  • Understanding the fundamentals of Hamilton (28:25)
  • Dealing with types and conflicts in programming (33:18)
  • How Hamilton helps improve pipelines and maintaining data (37:11)
  • Why unit testing is important for a data scientist (44:54)
  • The ups and downs of founding a company building a data solution (46:32)
  • Connecting with DAGWorks and trying out Hamilton (50:01)
  • Final thoughts and takeaways (52:46)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. The Data Stack Show is brought to you by RudderStack. They’ve been helping us put on the show for years, and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake. You should go check it out at rudderstack.com today. Welcome back to The Data Stack Show, Kostas. Super excited for the topic today. We’re going to talk with Stefan from DAGWorks. He developed a really interesting technology at Stitch Fix called Hamilton. And, you know, we actually haven’t talked about DAGs a ton on the show — Airflow has kind of come up here and there. And Hamilton’s a fascinating take on this, where you declare functions and it produces a DAG, which makes it much easier to test code, understand code, and actually produce code, which is pretty fascinating. And this is all in the Python ecosystem, ML stuff. It’s very cool. I want to know what led Stefan from originally working on some of these end use cases — building, say, an experimentation platform or framework for testing, and all the data and trimmings that go into that — to going far deeper in the stack and building the platform-level tooling that enables the building of those tools, if that makes sense. To me, that’s a fascinating journey and a very difficult problem to solve from a developer experience standpoint. Excited to hear about his journey. How about you?

Kostas Pardalis 01:54
Yeah. I mean, I definitely want to learn more about Hamilton, the project itself, and the whole journey from coming up with the problem insights at Stitch Fix to ending up with an open source project that is currently the foundation for a company. So that’s definitely something I would like to talk about with Stefan, and to get deeper into what Hamilton is, because these kinds of systems are similar to what dbt also does, right? They have a lot of value, but they also rely a lot on people using and adopting these solutions. So I want to hear from Stefan about that — how we can actually do this and how we can, you know, onboard people until they figure out the value of actually using something like this in their everyday work.

Eric Dodds 03:03
All right, well, let’s dig in and talk about DAGs, Python, and Hamilton.

Kostas Pardalis 03:09
Let’s do it. Stefan, welcome to The Data Stack Show.

Eric Dodds 03:13
So excited to chat with you — so many questions. But first, of course, give us kind of your background and what led you to starting DAGWorks.

Stefan Krawczyk 03:23
Thanks for having me. Yeah, so DAGWorks — I’m the CEO of DAGWorks, DAG for Directed Acyclic Graph — we’re recent YC batch graduates. At a high level, we’re simplifying ETL pipeline management, targeting ML and AI use cases. In terms of my background and how I got here to be CEO of a small startup: I came over to Silicon Valley back in 2007. I did an internship at IBM and then went to grad school at Stanford, where I finished a master’s in computer science, right at the time when I was still classically trained — all the deep learning stuff was just what the PhDs were doing, so I’m still kind of catching up on coursework there. Otherwise, I’ve worked at companies like LinkedIn and Nextdoor, where I was engineer number 13 and built a lot of the initial things. I also went to a small startup that crashed and burned, which was a good time. But otherwise, before starting the company, I was at Stitch Fix for six years, helping data scientists streamline their model productionization efforts.

Eric Dodds 04:25
Love it. And give us — go one click deeper with DAGWorks, right? I think a lot of our listeners are familiar with DAGs in a general sense, but you’re starting a company around it. So can you go one click deeper and tell us, you know, what does the product do?

Stefan Krawczyk 04:44
As a startup, we’re still evolving, but effectively — for the practitioners listening — if you’ve ever inherited an ETL or a pipeline that you were horrified by, or had to come in and debug something you hadn’t written yourself and it’s failing, say, because of upstream data changes, or code changes you weren’t aware of because a teammate made them — we’re essentially trying to solve that problem. Because we feel that you can get things to production pretty easily these days, but really, the problem then becomes how you maintain and manage these over time such that you don’t slow down. And rather than spending six months to rewrite it when someone leaves, there should be a more standard way to maintain, manage, and therefore operate these data and ML pipelines.

Eric Dodds 05:31
Yep, I love it. Well, tons of specific questions there, but let’s rewind just a little bit. So at Nextdoor, you said you were very early, and you built a lot of first things, right? The data warehousing, data lake infrastructure, testing infrastructure for experimentation, et cetera. You were really on the front lines, shipping stuff that was hitting production and producing test results and all that sort of stuff. And now you’re building platform tooling for the people who are going to enable those things. So I would just love to hear about your experience at Nextdoor launching a bunch of those things — did that influence the way you thought about platforms? Because I would guess, and I could be way wrong, that you were building a lot of point solutions that weren’t a platform, and then probably eventually needed platform tooling at scale.

Stefan Krawczyk 06:37
Yeah, a lot to unpack there, so if I get off track, feel free to bring me back in. Before going to Nextdoor I was actually at LinkedIn, where I had the opportunity to see a larger company with a bit of established infrastructure. For example, they had Hadoop clusters, and so all the problems of writing jobs, trying to maintain understanding, debugging, trusting a data set — can I use this to build a better model? And so the draw of Nextdoor was like, hey, it’s small, it’s a social network, they’re going to be building or needing these things — can I build them out there? So that was part of the motivation. And also, I liked building products as much as I liked building the infrastructure side of things. From that perspective, going from zero to one and having a blank canvas is terrifying and exciting at the same time. Back then it was a very different environment than it is now, because now there are a lot of vendors and off-the-shelf solutions, but back then you really had to build most of the things yourself. AWS was just in its infancy — I remember getting a demo of Snowflake when they were just building things out. And so at Nextdoor I got the opportunity to get the keys to AWS, effectively, and to try to solve business problems. The first one, for example, being: we need a data warehouse, because up until that point they were actually running queries off of the production databases. So if you were using the site on a Sunday, things could have been impacted because of the queries — or at least they were getting to the scale where queries off the read replicas were locking them up. And so having seen that — if you have to think of things from first principles and see how the sausage is made, as the expression goes — I think you get a better appreciation for the things you can build on top of, and also how the decisions you make lower down eventually impact you at a higher level. At Nextdoor, one of the things I built out was an A/B testing experimentation system, for example, and then trying to connect that all the way back with things that happen on the website so you can do inference. We made it much easier so that if you wanted to create a change, you could feature flag it, turn it on, and then get some metrics and telemetry. Yeah.

Kostas Pardalis 09:07
And so, yeah, I guess,

Stefan Krawczyk 09:11
In terms of going to a place like Stitch Fix — with a startup stop in between — I think I realized, one, that I am not excited by being given a data set and figuring out what to do with it. I’m actually more excited about building the tools. So I had a great time building the experimentation stuff, and at LinkedIn prototyping content-based recommendation infrastructure as well. And so I realized that my passion is more in helping other people be more successful. Stitch Fix had a lot of different modeling problems — it wasn’t a shop that just wanted to optimize targeted ads, right? It actually had a lot of different problems, and they were hiring a lot of people to solve them.

Eric Dodds 09:58
That’s great. So when you went to Stitch Fix, were you hired specifically to work on platform and tooling stuff?

Stefan Krawczyk 10:06
Yeah, yeah. So one of the reasons why I left Nextdoor was that I realized machine learning wasn’t quite core to the company. I could build things myself, but I wanted to be part of a team to bounce ideas off, and to work at a place where it was a little more core to what they were doing. So I actually went to an NLP-for-enterprise startup for that purpose, where I got to delve into things like, how do you build machine learning models on top of Spark and get them to production? Unfortunately, that was a good roller coaster ride, but the company ran out of money and had to fold. And so then I realized I wanted to build more of these platforms. Stitch Fix, yeah, they were avant-garde at the time in that they were hiring a platform team pretty early to help enable and build out self-service infrastructure for data scientists. For those who don’t know Stitch Fix, it’s a personal styling service — so if you don’t like shopping and picking out your own clothes, it’s a great service for you. But very early on, the data scientists were their own organization — they weren’t attached to marketing or engineering — and they were tasked with building, prototyping, and then engineering the things required to get to production. And they also started hiring a platform team to slowly — rather than data scientists having to do a lot of the engineering work themselves — bring in abstractions and layers to help make that self-service easier. So, for example, the platform team ran Jenkins and the Spark cluster, set up Kafka and the Redshift instance, and so on. I was part of a team that was more focused on, okay, how do you get machine learning out and plug it back into the business? So part of my journey was booting up one team focused on backend model deployment, another on setting up the centralized experimentation infrastructure, and a third on what we called the model lifecycle — end to end, how do we actually speed up getting a model from development to production?

Eric Dodds 12:15
Makes total sense. Now, can we dig into the self-service piece of that a little bit? When you came to Stitch Fix, it sounds like culturally they had committed to enabling more self-service. Can you talk about who specifically in the org needed self-service, and what problems they were facing? What were the bottlenecks that not having tooling for self-service was creating?

Stefan Krawczyk 12:44
Yeah, I want to mention — there’s a pretty reasonable summary of what things were like at the time. My former VP, Jeff Magnuson, wrote a pretty famous post called “Engineers Shouldn’t Write ETL.” If you haven’t seen that post or heard of it, go take a look. Effectively, part of the thesis was that having work thrown over a wall at someone isn’t very happy work for that person, and they’re also kind of disconnected from business value. So the idea at Stitch Fix was, well, can the data scientist — the person who has the idea, and who is also talking with the business partner — take it all the way? At Stitch Fix, each data science team was effectively partnered with some other team: marketing, operations, styling, merchandise, right? And they were trying to help those teams make better decisions. So the thought was: iteration loops are key in terms of machine learning differentiation, so how can we speed up this loop? The easiest way is if the person who’s building it can also take it to production, and then close the loop, iterate, and make better decisions that way. That was really the philosophical thesis. So I want to say it wasn’t necessarily a problem — it was more like, hey, this is the framing, this is how we want to operate — in which case the framing for the platform team was: how can we build capabilities and make it easier for data scientists to get more done without engineering it themselves? But we weren’t on anyone’s critical path. Obviously, if you wanted to use the Spark cluster, you had to use the custom APIs to read and write, but before the platform team came in, people were writing their own tools and solutions. Stitch Fix hired very capable PhDs from various walks of life, some with computer science backgrounds, and some of them knew how to abstract things. So part of it was, you know, competing with data scientists’ in-house abstractions and trying to gain ownership of them as a platform team to better manage them.

Eric Dodds 14:54
I was gonna ask about that. Because, okay, self-service — let’s make this cycle time faster — sounds really great on the surface. But you’re talking about multiple data scientists working for different internal stakeholders who have already built some of their own tooling. Was it challenging? Was there pushback? Or were people generally excited about it? I mean, I know the tool eventually had to prove itself out and get adoption internally, but culturally, what was it like to enter with that sort of mandate, I guess, if you will?

Stefan Krawczyk 15:31
I mean, it was a mixed bag — it depends. It was a very academic-type environment, very much open to suggestion and discussion, with very high-communication paths. There was a weekly forum called “beverage minute” where you could present and talk about things, and that was your forum to disseminate stuff. So people were always eager to learn best practices, right? But people being practically minded, if they had already built something they’d be like, well, I don’t have that problem, why should I use your tool? Why should I bother spending time? Coming from very practical concerns of, you know, what’s in it for me? So that, if anything, was a bit of a challenge: if one team had a bit of a solution but the other teams didn’t, you could get the other teams on board, but that one team would be like, well, I don’t think the opportunity cost is there yet, right?

Eric Dodds 16:20
Yep. Yep. That makes total sense. Okay. So one of the big pieces of work that came from your efforts at Stitch Fix was Hamilton, which is intimately tied to DAGWorks. So can you set the stage for where Hamilton came from inside of Stitch Fix, and maybe the particular flavor of problem it was solving?

Kostas Pardalis 16:43
Yeah, so —

Stefan Krawczyk 16:48
It was built for one data science team — one of the oldest teams there — that basically had a code base that was five or six years old at that point and had gone through a lot of team members. So it wasn’t written or structured by people with a software engineering background. But effectively, they had to forecast things about the business so the business could make operational decisions — they were basically doing time series forecasting. And what is pretty common in time series forecasting is that you are continually adding and updating the code, because things change in the business as time moves on and you need to account for it, right? One way to do that is to use the right kind of inputs or features. So at a high level, the forecast pipeline — the ETL — was pretty simple, pretty standard, only a couple of steps. But the software challenge of adding, maintaining, updating, and changing the code within the steps of that macro pipeline was really what was slowing them down. They were also operationally always under the gun, because they had to provide things that decisions were made on — they had to model different scenarios and such — in which case they weren’t in a position to really fix things themselves, so their manager came to the platform team like, hey, can you help? And what I found was that the macro pipeline wasn’t the challenge; it was the code within the steps that needed to be changed and updated. This is where, yeah, getting to production was easy, but the maintenance aspect — maintaining, changing, updating — was really the struggle. And so with Hamilton — this was a plus of work-from-home Wednesdays; if there were no work-from-home Wednesdays I might not have come up with it — I had a full day to think about this problem, analyzing and looking at their code and what they were effectively trying to do. One of the biggest problems was that they needed to create a data set, a data frame, with thousands of columns, because with time series forecasting it’s very easy for your inputs to be derivatives of other columns. And so the ability to express the transforms — and to be confident that if you change one, you know what’s downstream of it, all those dependencies — was really the problem, because the code base was so big and it wasn’t well structured. So I came up with Hamilton, where effectively I was trying to make it as simple as possible, from a process perspective, that given an output, you can quickly and easily map it back to the code and the definition for it. Hamilton, at a high level, is a micro-framework for describing dataflows, and a dataflow is essentially computation and data movement. This is exactly what they were doing with their process to create this large data frame: given some source data, put it through a bunch of transforms, and create a table.
So Hamilton was created from that problem — the software engineering that was needed. I could dive into more details of how it works, but first let me ask whether I’ve given enough high-level context.

Eric Dodds 20:04
No, that’s super helpful. And one thing I actually want to drill into — because I want to hand the mic off to Kostas in a second to dig into the guts of how Hamilton works — is that we’re talking about time series data, and especially about features. One of the things that’s interesting about Hamilton being, let’s say — and maybe I’m jumping the gun a little bit here — more declarative rather than imperative is that it creates a much more flexible environment, at least from me tinkering around with it, in terms of definitions. Because one of the problems with time series data and definitions is that if a definition changes, which it will, and you have a large code base, it’s not that you can’t get a current view of how that definition looks with your snapshot data — it’s actually going back and re-computing and updating everything historically in order to rerun models and all that sort of stuff. Which is really interesting. Were you thinking a lot about the definition piece with Hamilton, and about making it easier to create definitions that didn’t require updating a hundred different points in the code?

Stefan Krawczyk 21:23
Yeah. So effectively, if you can make it really simple to map an output back to the logic that produces it, then there’s only really one place to change it. One of the problems with the code base as it was before was that there wasn’t a good testing story, there wasn’t a good documentation story, and it was hard to see dependencies between things. And when you updated something, you didn’t know — to your point — how confident you were in what you actually changed or impacted, right? Because everything was effectively in a large script, where you had to run everything to test something. So there was really a lot of energy required to understand changes and impacts. Effectively, by rewriting things as functions — which I’ll dig into — it helps abstract and encapsulate what the dependencies are. So if you are going to make a change, it’s much easier to logically reason about and find, in the codebase, the upstream and downstream dependencies of it. It becomes a far more procedural, methodical way to add, update, and change workflows. Whereas before, whatever software engineering practices you used, you had to take a lot more care and concern when you did that. With Hamilton, the paradigm forces you to do things in a particular way that is particularly beneficial for changing, updating, and maintaining.

Eric Dodds 22:50
Yeah, absolutely. You know, even on teams that really are diligent about best practices with software engineering, it’s amazing, as code bases grow, the amount of tribal knowledge that’s needed to make significant changes. You always end up with a handful of people who know all of the nooks and the crannies, and sort of that one dependency that’s the killer when you push to production without accounting for it.

Stefan Krawczyk 23:22
One thing for the listeners: since your audience is probably familiar with dbt, I want to say Hamilton is very similar, I guess, to what dbt did for SQL. Before dbt, it was a bit of a wild west as to how you maintain and manage your SQL files, how they’re linked together, how you test and document them, right? Hamilton does pretty much the same thing for Python functions — Python transforms. It gives you this very opinionated, structured way, and you end up actually being more productive and able to write and manage more code than you would otherwise, which I think, you know, dbt did for the SQL world. Yeah,

Eric Dodds 23:54
absolutely. All right, Kostas, I’ve been monopolizing, and I know you have a ton of questions about how this works. I do too, please.

Kostas Pardalis 24:02
You can get back in the conversation whenever you want, so don’t be shy. So the first question: what makes Hamilton an ML-oriented framework? Why is it for ML — like writing ETL for ML — and not for something else, right?

Stefan Krawczyk 24:23
I want to say its roots are definitely machine learning oriented. Effectively, what I was describing was a feature engineering problem for time series forecasting, right? Since then, we’ve added to and adjusted Hamilton to operate over any Python object type — it was initially focused on pandas, and now it isn’t. We call it a bit of a Swiss Army knife, in that you can do anything you can model as a DAG, or at least anything for which you would draw a workflow diagram; Hamilton is maybe one of the easiest ways to directly write code that maps to it. But specifically, Python and machine learning are very coupled together, and software engineering practices are hard in machine learning, in which case Hamilton is specifically trying to target the software engineering aspects of things, where I think machine learning and data are least mature. So to loosely answer: its roots are from that, and therefore I think it targets more of those problems, but people have been applying Hamilton to much wider use cases than just machine learning.

Kostas Pardalis 25:29
Yeah, 100%. I always find it very fascinating to hear, like, from practitioners like you about the unique challenges that the ML workloads have compared to any other data workload, right.

Stefan Krawczyk 25:45
I mean, yeah — it’s actually a little less about the workloads and more about team process and the code that defines those things, right? Individuals build models or data artifacts, but teams own them, and you need different practices to make that work. There’s the infrastructure side — how do you do feature engineering over gigabytes of data — but then it’s also, well, how do you actually maintain the definition of the code to ensure that it’s correct, that it can live a long, prosperous life, and that when you leave, someone else can inherit it? Hamilton is kind of starting from that angle first. But definitely, I can see a future where — you can use it on Spark, you can use it in a notebook, you can use it in a web service, anywhere that Python runs. So it definitely has integrations and extensions that also extend out into more of the data processing side.

Kostas Pardalis 26:40
Yep, yep. And okay, so let’s change the question a little bit. Instead of talking about the workloads, let’s talk about the teams — how we build teams. The people on these ML teams might be different than, like, a data engineering team or a data infra team, right? So tell us a little bit more about that — how are things different for an ML team compared to, I don’t know, a BI team, right?

Stefan Krawczyk 27:03
I mean, there’s a bit of nuance here, because depending on whether you’re applying machine learning in an online setting or it’s all in an offline world, there are slightly different SLAs and tolerances. Most data scientists and machine learning engineers I know don’t have computer science backgrounds — I want to say that’s probably almost even true for the data engineers I know as well, right? But effectively, you’re trying to couple data and compute together in a way that yields a statistical model representation — just some bytes in memory — that you then want to ship. How you get there and how you produce it really impacts how the company operates, how the team operates, and the ease and effectiveness with which you can quickly get results. So I want to say, yeah, there’s a lot more focus on — you could say this is where MLOps is trying to become like a DevOps practice, giving you the guiding principles on how to operate and manage things. And then, in terms of how it relates to other things, I actually think machine learning workflows are a bit of a superset of analytics workflows. You actually have the same problems on the analytics side — obviously with slightly different focuses and endpoints — but you’re generally using the same infrastructure, or reusing it a lot of the time, and you generally have to connect with that world as well. So I want to say it’s more of a superset of that, and has therefore slightly different challenges, because the things that you produce are more likely to end up in other places — like online in a web service — versus analytics results, which are only served from a dashboard and looked at.

Kostas Pardalis 28:53
Okay, that’s great. So, you mentioned at some point when we were discussing with Eric that Hamilton is an opinionated way of doing things around ML, right? And you gave, I think, a very good example for people to anchor on: dbt, which came in and put some kind of guardrails on how things should get done, right? Can you take us a little bit through that? What does this mean — how is the world perceived from the lens, from the point of view, of Hamilton? What’s the terminology used — is it data frames? Tell us a little bit about the vocabulary and all these things we should know — the fundamentals of Hamilton.

Stefan Krawczyk 29:42
Sure. So Hamilton’s a micro-framework for describing dataflows — I say micro-framework in that it’s available anywhere that Python runs. It doesn’t contain state, and all it’s doing is really helping you, you could say, orchestrate code. It is not a macro-orchestration system like Airflow, Prefect, or Dagster, which contain state and where you think of tasks as the computational units. With Hamilton, instead, the units are functions. So rather than writing procedural code where you’re assigning, say, a column to a data frame object, in Hamilton you would rewrite that as a function where the column name is the name of the function, and the function’s input arguments declare the dependencies — the other things required as input to compute that column. So there’s this macro versus micro: I call Hamilton a micro-orchestration framework, a micro view of the world, versus macro, which is something like Airflow. You’re writing functions that are declarative, where the function name means something and the function input arguments also declare dependencies. You’re not writing scripts. With Hamilton, you don’t call the functions directly; instead you write driver code. So the other concept is this driver, right? You curate all your functions into Python modules, and modules, you could say, are representations of parts of your DAG. If you think visually, in terms of nodes and edges, the functions are nodes and the edges are the dependencies — what’s required to be passed in. That’s, I guess, the nuts and bolts of Hamilton: you write functions that go into modules, and then you need a driver script that reads those modules to build this DAG representation of the world. That’s the script code that you would then plug in anywhere you run Python. Any clarifications? Are you following along so far?
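For readers who want to see this concretely, here is a minimal sketch of the pattern Stefan describes — a module of transform functions plus a driver script. The module name, function names, and the dict-result adapter are illustrative choices, and the exact driver API may vary slightly between Hamilton versions.

```python
# data_flow.py -- transform functions; each function is a node in the DAG.
# The function name is the output it produces; the argument names declare
# the upstream nodes (or driver-supplied inputs) it depends on.

def cleaned_text(raw_text: str) -> str:
    """Lowercase and strip the raw input."""
    return raw_text.strip().lower()


def word_count(cleaned_text: str) -> int:
    """Number of words in the cleaned text."""
    return len(cleaned_text.split())


# run.py -- the "driver" script: read the module(s), build the DAG, execute it.
from hamilton import base, driver

import data_flow  # the module(s) holding your transform functions

dr = driver.Driver(
    {},          # config dictionary
    data_flow,   # one or more modules defining the DAG
    adapter=base.SimplePythonGraphAdapter(base.DictResult()),  # return a plain dict
)
# Ask only for the outputs you want; Hamilton walks the DAG to compute them.
result = dr.execute(["word_count"], inputs={"raw_text": "  Hello Hamilton  "})
print(result)  # {'word_count': 2}
```

The driver is the piece you would then plug into a notebook, a web service, or an orchestrator task, as Stefan notes.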

Kostas Pardalis 31:47
Yeah, just to make sure that I understand correctly — and consider me a very naive practitioner around this stuff. If I’m going to start developing using Hamilton, I will start thinking in terms of columns, right? I don’t start from the concept of having something like a table or a data frame. So technically, I can create, let’s say, independent columns, and then I can mix and match them to create outputs — data sets, in a way, right?

Stefan Krawczyk 32:25
Yeah — so Hamilton’s roots are in wrangling, say, a pandas data frame, so it’s very easy. Going back to time series data, it’s very easy to think in columns when you’re processing this type of data. And so with Hamilton, you can think of a function as representing a column, and the framework forces you to have only one definition. If you have a column named x, there’s only one place you can have x in your DAG — only one node that can be called x to compute and create it, right? Hamilton forces you to have one declaration of this, where the function name is equivalent to the column name, or an output you can get. And when you write that function, you only declare, through the function arguments, the names of other columns or inputs that are required. So with Hamilton you’re not coupling context when you’re writing these functions, and therefore you’re effectively coming up with, you could say, a column definition or feature definition that is invariant to context. The way that Hamilton stitches things together is through names. So if you have a column named foo that takes in an argument bar, when Hamilton goes to compute foo, it will either look for a function called bar or expect some input called bar to come in.
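In the pandas setting Stefan describes, that looks like one function per column. A short illustrative sketch — the column names are made up, and `spend` and `signups` would either come in as driver inputs or be defined by other functions:

```python
import pandas as pd


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """The only place `spend_per_signup` can be defined in the DAG."""
    return spend / signups


def spend_shift_1(spend: pd.Series) -> pd.Series:
    """Yesterday's spend -- a typical derived time-series column."""
    return spend.shift(1)
```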

Kostas Pardalis 33:45
100%. So okay, we chain together functions, right, and create, let’s say, new columns. And you said the context is not that important — when I define the function, I just link, let’s say, the inputs. But okay, coming again from a little bit more of a traditional programming background: how do you deal with types, for example? How do I avoid having issues with types and conflicts and stuff like that?

Stefan Krawczyk 34:25
Yeah. So it’s pretty lightweight here. When you write a function, it has to declare an output type, and the input arguments also have to be type annotated. So when Hamilton constructs the DAG of how things are chained together, it does a quick check of, hey, do these function types match? You have the flexibility to fuzz them as much as you like, but effectively that check happens at DAG construction. There’s also a brief check on the inputs to the DAG at runtime, to make sure the types match at least the expected input arguments. But otherwise, there’s a bit of an assumption that if you say a function outputs a pandas data frame, it is a pandas data frame. The reason we don’t do anything too strict there is that if you want to reuse your pandas code and run it with, say, pandas on Spark — assuming you meet that same API — to everyone reading the code it looks like a pandas data frame, but underneath it could be a pandas or PySpark data frame wrapped in the pandas API. So effectively, with Hamilton, the DAG enforces types to ensure that functions match, but you have flexibility — if you really want to perturb that, you can write some code to fuzz it up — and otherwise, at runtime there isn’t much of an enforcement check. But if you do really want that, there is also a facility called a check_output annotation that you can add to a function, which does a runtime data quality check for you — you can check the type, the cardinality, or the values of a particular output.
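As a rough illustration of the two checks Stefan mentions — type matching via annotations at DAG construction, plus the optional check_output decorator for runtime data quality. The decorator arguments shown here follow common documented usage, but treat them as an assumption and check the options your Hamilton version supports:

```python
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output


def signups_per_visit(signups: pd.Series, visits: pd.Series) -> pd.Series:
    """Type annotations let Hamilton verify, when it builds the DAG,
    that producers and consumers of `signups` and `visits` agree."""
    return signups / visits


@check_output(data_type=np.float64, range=(0.0, 1.0), importance="warn")
def conversion_rate(signups_per_visit: pd.Series) -> pd.Series:
    """Runtime data-quality check: warn if values fall outside [0, 1]."""
    return signups_per_visit.clip(0.0, 1.0)
```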

Kostas Pardalis 36:02
Okay, that’s cool. And, okay, let’s say I want to start playing around with Hamilton, and I already have some existing environment where I create pipelines and work with my data, right? How do I migrate to Hamilton? What do I have to do?

Stefan Krawczyk 36:22
Yeah, it’s a good question. So Hamilton, as I said, runs anywhere that Python runs. Say you’re using pandas, just for the sake of argument: you can replace however much code you want with Hamilton. So you can slowly change parts of your code base and replace them with Hamilton code. In terms of actually migrating, the easiest approach is to save the input data, save the target output data, and then write the transforms as functions — and as you’re migrating things, check whether the old way and the new way line up or match up. From a practicality and POC perspective, it’s really up to you to scope how big of a chunk you want to move to Hamilton, because all you need to do is pip install the Hamilton library. The only impediment to trying something is really the time to chunk out what code you want to translate to Hamilton. Otherwise, there shouldn’t be any system dependencies at all.
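One way to do the "save the old output and compare" step Stefan suggests — a hedged sketch, where `new_transforms` is a hypothetical module holding your migrated Hamilton functions and the file paths are placeholders:

```python
import pandas as pd
from hamilton import driver

import new_transforms  # hypothetical module of migrated Hamilton functions

# 1. Inputs and outputs captured once from the legacy pipeline.
input_df = pd.read_parquet("saved_inputs.parquet")         # placeholder path
expected = pd.read_parquet("saved_legacy_output.parquet")  # placeholder path

# 2. Run the same inputs through the Hamilton version.
dr = driver.Driver({}, new_transforms)
actual = dr.execute(
    final_vars=list(expected.columns),
    inputs={col: input_df[col] for col in input_df.columns},
)

# 3. Any mismatch points directly at the function(s) still to fix.
pd.testing.assert_frame_equal(expected, actual[expected.columns])
print("Hamilton output matches the legacy pipeline.")
```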

Kostas Pardalis 37:30
Okay, that’s super cool. And you mentioned at the beginning of the conversation that it’s one thing to build something, and a completely different thing to operate and maintain it, right? And that’s where a lot of the pain exists today: following a pipeline, handing these pipelines to a new engineer, trying to figure out what’s going on in there, updating it, improving it — it’s hard. And from my understanding, one of, let’s say, the goals — the vision — of Hamilton is to help with that, and to actually bring the best practices that we have in software engineering to working with data pipelines. So how is this done? Okay, I’ve built it, right — I have my pipeline that builds whatever the input of a service is, a model, say. What’s next? What kind of tooling do I have around Hamilton that helps me, let’s say, go in and debug the pipeline, improve on the pipeline, and, in general, maintain the pipeline?

Stefan Krawczyk 38:41
Yeah, yeah, it’s a good question. So one is, I’m going to claim that a junior data scientist can write Hamilton code and no one’s going to be terrified of inheriting it. Part of what the framework forces you to do is chunk things up into functions. One nice thing about chunking things up into functions is that everything is unit testable — not to say that you have to add unit tests, but if you really want to, you can — and then you also have the function docstring, where you can add more specific documentation. Now, because everything is stitched together by naming, you are also forced to name things slightly more verbosely, so you can pretty much read the function definition and understand things, right? So I just want to set the context of the base level of what Hamilton gives you: effectively, you can think of it as having a senior software engineer in your back pocket without having to hire one. Because you’re decoupling logic, it’s reasonable from day one, you’re forced to create modules, and you have this great testing story. And then one of the facilities built natively into the Hamilton framework is that you can output a graph — a visualization of how everything actually connects, or how a particular execution path looks. So with that as a base, if someone’s coming in to make a change, there isn’t much extra tooling you need to be confident. If someone’s making a change to a particular piece of logic, it’s only a single function, right? You know who’s downstream of that function, because you just need to grep the code base for whoever has that function name as an argument. If you’re adding something, you’re not going to clobber or impact anything, because it’s a very separate thing you’re creating. Similarly, if you’re deleting or removing things, you can easily go through the codebase to find them. So pull requests are a little easier and simpler, because things are chunked in a way that the changes already carry all the context around them — they’re not disparate parts of the code base when a change is made. And therefore, in terms of debugging, because you have this DAG structure, if there’s an issue, it’s pretty methodical to debug. If you see an output that looks funky, it’s very easy to map where the code should be. If the logic in the function looks off, you can test it and write a unit test. If it’s not that, then you look at the function’s arguments, and you effectively know what was run before it. So you can logically step through the code base: okay, if it’s not this, then it’s this, and you can use pdb.set_trace() or debugging output within it, right?
And so this paradigm forces a kind of simplicity — a very structured, standardized way of approaching and debugging stuff — in which case anyone new who comes to the code base doesn’t need to read a wall of text and drink from a firehose. Instead, if they want to see a particular output, they can use the tooling to visualize that particular execution path and then just walk through the code there, on their own or with whoever is handing it off. So I think it really simplifies a lot of the decisions and effectively encourages a lot of the best practices that you would naturally have in a good codebase, to make it easy for someone to come in, update, and maintain.
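The graph output Stefan mentions is exposed on the driver itself. A small sketch — the method names below match recent Hamilton releases and require graphviz installed, so treat them as an assumption to verify against the version you use:

```python
from hamilton import driver

import data_flow  # the same illustrative module of transform functions as above

dr = driver.Driver({}, data_flow)

# Render the whole DAG -- handy when onboarding someone to the code base.
dr.display_all_functions("./full_dag.png")

# Render only the execution path for one output, so a newcomer sees
# exactly which functions feed into it.
dr.visualize_execution(
    ["word_count"],              # outputs of interest
    "./word_count_path.png",     # where to write the rendered graph
    {},                          # render kwargs passed to graphviz
    inputs={"raw_text": ""},     # inputs needed to resolve the path
)
```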

Kostas Pardalis 42:06
And then also — I was browsing the GitHub repo of Hamilton, and there’s a very interesting matrix there that compares the features of Hamilton with other systems. I think it really helps someone understand exactly what Hamilton is. But I want to ask about one of the points: you mention that the code is always unit testable, right? That’s always true for Hamilton, but it’s not for other systems, like dbt, for example, or Feast, or Airflow. Can you elaborate a little bit more on that? Why can we do that with Hamilton, and why can’t we do it with Airflow, for example?

Stefan Krawczyk 42:55
Yeah, yeah. In those systems, you’re given a blank slate of Python — you can write a script, right? And one of the things that’s very easy, and that most people do, is to get as fast as possible from A to B. In the data world, that means loading some data, doing some transforms, and then writing it back out. So if you think of the context you have just coupled together to do that: you’ve made an assumption about where the data is coming from, maybe that it’s of a particular format or type. The logic is now very much coupled to that particular context. Most data scientists cut and paste code rather than refactoring it for reuse, right? And that’s partly because of that coupling of context. And then you’ve also assumed what the outputs are. So you could make that code always testable, but you need to think about it while you’re writing it. You need to structure things in a certain way, because if you couple things, or you write functions that take in certain things, the unit test is a pain — you have to mock different data loaders and APIs to make it work. Whereas with Hamilton, you’re really forced to chunk things separately, or at least anything complex is contained in a single function in a single place. So it is therefore much easier, if you need to write a unit test in Hamilton, to have it be maintainable. In the other context, you have to think about that as you’re writing it, and most people don’t — in which case it becomes a problem of inertia, and people generally add to the codebase to match how it already looks, so the problem just propagates. Unless you find that one person — there’s generally one person at every company who really likes cleaning up code — you find them and they want to do it, but those people are a rarity. For me, I’m more of a “reframe the problem to make problems go away” type of guy. So with Hamilton, you reframe the problem a little bit by writing code in a certain style, and then all these other problems you just don’t have to deal with, because we’ve designed it so that you write code in a way that always makes unit testing and documentation friendliness true. Yeah.

Kostas Pardalis 45:18
And one more question on unit tests. I want to ask this question to you because you mentioned at the beginning — and it’s very true — that many of the practitioners in the ML and data science domain, and that’s also true for many of us data engineers out there, don’t necessarily come from a software engineering background, right? So probably they also haven’t been exposed to unit testing and why unit testing is important. So why is unit testing important for a data scientist?

Stefan Krawczyk 45:52
It’s important if you have particular logic that you want to ensure you have written correctly, and that if someone changes it, they don’t break it inadvertently, right? And I want to say it’s not true that you always need unit tests — not for simple functions. It’s mainly for the things where you really want to enshrine the logic, and also, potentially, to help other people understand: these are the bounds of the logic. A classic example of this at Stitch Fix: you had a survey response to a particular question, and you wanted to transform that survey response into a particular input or output, right? A unit test is a great way to encapsulate and enshrine a bit of that logic, to ensure that if something changes — or the assumptions change — you can easily understand and see whether that test broke.
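Because a Hamilton transform is just a plain Python function, the survey example Stefan gives can be tested with ordinary pytest. The mapping logic below is made up for illustration — it is not Stitch Fix’s actual transform:

```python
import pandas as pd


def satisfaction_score(survey_response: pd.Series) -> pd.Series:
    """Map free-text survey answers to a 1-5 score; unknown answers -> neutral (3)."""
    mapping = {"very unhappy": 1, "unhappy": 2, "neutral": 3, "happy": 4, "very happy": 5}
    return survey_response.str.lower().map(mapping).fillna(3).astype(int)


def test_satisfaction_score_enshrines_the_mapping():
    responses = pd.Series(["Happy", "very unhappy", "no answer"])
    result = satisfaction_score(responses)
    # Known answers map to their scores; unknown answers fall back to 3.
    assert result.tolist() == [4, 1, 3]
```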

Kostas Pardalis 46:42
Cool. So let’s pause a little bit here on Hamilton, because we’ve talked a lot about it, but Hamilton is also the seed of a company that you’ve built, right? I would like to hear from you a little bit about this journey: how things started from within Stitch Fix — as you said, there was a problem there, and we described how you started building Hamilton — to the point of today being the CEO of a company that is building a product and a business on top of the solution. So tell us a bit about this experience: how you decided to do it, the good things around it, whatever made you happy so far. And if you can also share some of the bitter parts of doing this, because I’m sure it’s not easy, that would be awesome.

Stefan Krawczyk 47:40
For the last decade, I’ve been thinking about starting a company. In terms of how DAGWorks got started and the idea for it: we did our build-versus-buy on the platform team at Stitch Fix, so we saw a lot of industry vendors come in, and quite frankly I was like, no, I think we actually have better ideas or assumptions — we could build a better product here. We built most things at Stitch Fix for that reason; we only brought in a few things. Hamilton actually started out more as a branding exercise: of the things my team built, it was the easiest to open source, but it was also, I guess, the most interesting — I do think it’s a pretty different approach to what other people are taking. So part of it was, I think it’s unique, and it just happened to be easier to open source than other things. So we open sourced it, and the reaction from people — I honestly, initially, thought Hamilton was a bit of a cute metaprogramming hack in Python to get it to work, and I wasn’t quite sure whether other people would get the same value out of it. Suffice to say, people did, which was exciting. And then realizing: at Stitch Fix we had a hundred-plus data scientists to deal with, but with open source, wow, you actually have thousands of people you could potentially help and reach. That was invigorating from a personal perspective — just being able to reach more people and help more people. With open source, there’s the challenge of how you actually start a business around it. If you look at other companies — dbt, for example — they didn’t really take off until they were three or four years into open source, right? Hamilton was actually built in 2019; we only open sourced it 18 months ago. I knew it was sticky, because the teams that used it internally at Stitch Fix loved it, but it was exciting to see its adoption grow. So from that perspective: the open source getting adopted, me being excited by helping other people, and having been thinking about companies for the last decade — I thought now was a good time, because I still think I know something people don’t, which is that machine learning tech debt is going to come home to roost in the next few years for all the people who brought machine learning into production and are now feeling the pains of vendor ops, as it’s sometimes called, or of stitching together all these MLOps solutions. Timing, knowing something the market doesn’t, and having a passion for it were roughly the three things that led myself and the other co-creator of Hamilton to start DAGWorks.

Kostas Pardalis 50:25
That’s awesome. And one last quick question from me, before I hand the microphone back to Eric: where can someone learn more about Hamilton and the company?

Stefan Krawczyk 50:38
Yeah, so if you want to try Hamilton, we have a website called tryhamilton.dev. It runs Pyodide — because Hamilton has a small dependency footprint, we can actually load Python up in the browser, and you can play around without having to install anything. Otherwise, the DAGWorks platform that we’re building around Hamilton — you can think of it at a high level as: Hamilton’s the technology, and the DAGWorks platform is a product around it. You can go to dagworks.io, and by the time this releases, I think we should be taking off the beta waitlist. If it’s still there, do sign up and we’ll get you on it quickly; otherwise, hopefully we’ll have more of a self-service means to play around with what we’ve built on top of Hamilton.

Kostas Pardalis 51:21
That’s great. Eric, all yours.

Eric Dodds 51:29
All right, well, I have to ask the question: where did the name Hamilton come from?

Kostas Pardalis 51:33
Good question, sir.

Stefan Krawczyk 51:35
So at Stitch Fix, the team that we were building this for — I was going to say, this was pretty foundational, basically a rewrite of how they write code and how they push things. The team was called the Forecasting, Estimation, and Demand team — the FED team for short. And I had also recently learned more about American history, because the Hamilton musical had come out. So I was like, hey, what’s foundational and associated with the Fed? Well, Alexander Hamilton created the actual Federal Reserve. And so then there were other names, right? But as I started thinking about it more — well, the FED team is also trying to model the business in a way, so there are Hamiltonian physics concepts, right? And then the actual implementation — what we’re doing is graph theory 101, effectively — so from computer science there are also Hamiltonian concepts. It sounded like Hamilton was probably the best name for it, since it helps tie together all these things.

Eric Dodds 52:37
I love it. Well, Stefan, this has been such a wonderful time — we’ve learned so much. And thank you, again, for giving us a little bit of your day to chat about DAGs, Hamilton, Python, open source, and more.

Stefan Krawczyk 52:54
Thanks for having me. It was a good time. In terms of being more succinct in my responses — I think that’s the lesson I’ve learned from this podcast, and I need to work on it a bit more. But otherwise, yeah, much appreciated, and thanks for the conversation.

Eric Dodds 53:08
Anytime, you were great. Kostas, I loved the show, because we covered a variety of topics with Stefan from DAGWorks — and Hamilton. You know, I think one of the most fascinating things about the show to me was that we kind of started out thinking we were going to talk a lot about DAGs, right? Because DAGWorks — the name of the company — is focused on DAGs. But really, what’s interesting is that it’s not necessarily a tool for DAGs like you would think about Airflow. It’s actually a tool for writing clean, testable ML code that produces a DAG. And so the DAG is almost a consequence of an entire methodology, which is Hamilton, which was absolutely fascinating. And so I really appreciated the way that Stefan got at the heart of the problem. It’s not that we need another DAG tool — we actually need a tool that solves the problems of complex, growing code bases at the core. And a DAG is a natural consequence of that, and a way to view the solution, but not the only one. So I think that was my big takeaway. I think it’s a very interesting, elegant solution — or way to approach the problem.

Kostas Pardalis 54:28
Yeah, the DAGs are everywhere with these kinds of problems, right? Anything that’s close to a workflow, or where there is some kind of dependency, there’s always a DAG somewhere. And similarly with Hamilton — the same way that, if you think about dbt, every dbt project is a graph that connects models with each other. The difference, of course, is that dbt lives in the SQL world, and Hamilton lives in the Python world, and it’s also targeting different audiences, right? So at the end, what Hamilton is trying to do is bring the value of, let’s say, the guardrails that a framework like dbt offers to BI and analytics professionals out there to the ML community — because they have the same need, and probably with even deeper complexity compared to, let’s say, the BI world, just because, by nature, ML models and features have deeper dependencies on each other. So it’s very interesting to see how the patterns emerge in different sides of the industry, but at their core they remain the same, right? So yeah, I think everyone should go and take a look at Hamilton. They also have a sandbox-like playground where you can try it online if you want, and they’re building a company on top of it, so any feedback is going to be super useful for the DAGWorks folks. I would encourage everyone to go do it.

Eric Dodds 56:28
Definitely. And while you’re checking out Hamilton at tryhamilton.dev, head over to The Data Stack Show, click on your favorite podcast app, and subscribe to The Data Stack Show. Tell a friend if you haven’t, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.