Episode 158:

The Orchestration Layer as the Data Platform Control Plane With Nick Schrock of Dagster Labs

October 4, 2023

This week on The Data Stack Show, Eric and Kostas chat with Nick Schrock, Founder of Dagster Labs. During the episode, Nick discusses his background at Facebook and his involvement in successful open-source technologies like React and GraphQL. Nick explains the mismatch between the complexity and available tools in the data infrastructure space, leading him to start Dagster Labs. The group also talks about the challenges and fragmentation in the data engineering industry, the need for better abstraction layers, the role of orchestrators, comparing it to GraphQL’s role in product engineering, the importance of data orchestration in the future of data infrastructure and engineering, and more.

Notes:

Highlights from this week’s conversation include:

  • Nick’s background and journey in data (2:28)
  • Founding Dagster Labs (7:50)
  • The evolution of data engineering (12:32)
  • Fragmentation in data infrastructure (15:04)
  • The role of orchestration in data platforms (19:53)
  • The importance of operational tools for data pipelines (25:01)
  • Lessons learned from working with GraphQL (26:19)
  • The role of the orchestrator in data engineering (34:51)
  • The boundaries between data infrastructure and product engineering (37:33)
  • Different orchestrators in the data infrastructure landscape (42:03)
  • The role of MLOps in data engineering (46:04)
  • Data quality and orchestration (51:04)
  • Future of data teams and orchestration (54:27)
  • Final thoughts and takeaways (58:01)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, we have a very exciting guest. We’ve actually had lots of people on the show who have built incredible technologies at the FAANGs or other similarly huge companies, and then have gone on to do really interesting things and found really interesting companies. Similar story today: we’re going to talk with Nick, who worked at Facebook and was actually one of the people behind GraphQL, which is really fascinating. But he started a company called Dagster Labs, originally called Elementl, and they build an orchestration tool with the goal of becoming a control plane for data infrastructure, which is really fascinating. I mean, what a journey. I can’t wait to ask him about it. I think one of the questions that I want to dig into with him is actually basic. We’ve had some orchestration players on the show, and Airflow is obviously a huge incumbent in this space, but I just want to talk about what problem orchestration solves. It’d be fun to define a DAG; I don’t think we’ve done that on the show, which is surprising. And then, just for myself, I want to build a better understanding of the nature of the problem that they’re solving. So that’s probably where I’m gonna start. How about you?

Kostas Pardalis 01:50
Yeah, yeah, this is going to be a very interesting conversation. We have plenty to learn from Nick. I definitely want to ask about GraphQL, first of all. I think it’s interesting. I mean, it’s more common to see people starting a company in one space because they have prior experience in the same space of the industry, right? But GraphQL is not part of the data infrastructure per se, right? It’s not a tool that you typically find there. And I’d love to hear the story behind it: how Nick, from building something as successful as GraphQL, ended up building something in a tiny little bit different space. So that’s one thing that I would definitely like to chat about. And then, I mean, we need to talk about orchestrators as one of the parts of the infrastructure in data. We have Airflow, right, which has been the de facto solution out there. So I’d love to talk about this whole space, the product category of orchestrators, from Airflow to the whole landscape, and see what else exists out there outside of, like, Dagster and Airflow, and why. So yeah, let’s go have this conversation. I’m sure we are going to have an amazing time with him.

Eric Dodds 03:27
Right, great. Let’s do it again. Nick, welcome to The Data Stack Show.

Nick Schrock 03:32
Great to be with you. Thanks for having me.

Eric Dodds 03:34
All right. Well, you have built some really cool things in your time and are building really cool things at Dagster Labs, which is super exciting, but let’s start at the beginning. Where did your career start? And then how did you get into data?

Nick Schrock 03:49
Yeah, so for everyone listening, my name is Nick Schrock. I’m the CTO and Founder of Dagster Labs. And, you know, just kind of my career up until now: I’ll start in 2009. From 2009 to 2017, I worked at Facebook. That was kind of the bulk of my career, increasingly less true over time. While I was there, my time was dominated by building internals and infrastructure for our application developers. So I formed this team called Product Infrastructure, which was infrastructure for the product teams, and our mission was to make our application developers more efficient and productive. We started building internal tools and internal frameworks, but we ended up externalizing that into open source projects. So React came out of that team. I never worked on React myself, but I was kind of adjacent to it, and actually the CEO of Dagster Labs, Pete Hunt, is one of the co-creators of React. And then what I’m personally more associated with is GraphQL; I was one of its co-creators. And, you know, both of those ended up being successful open source technologies, so that was really exciting to be a part of. I left Facebook in 2017 and was figuring out what to do next. I started asking companies inside and outside the Valley what their biggest technical liabilities were. And just over and over, regardless of domain, regardless of the maturity of the company, this notion of data infrastructure kept on coming up as the technical issue that was preventing them from making progress on their business. I remember really distinctly, I was talking to a healthcare company. I kind of had data on the mind, and I was like, okay, tell me about your data problems. I expected them to be talking about HIPAA and all these complicated issues, but what they were talking about was much lower level.
And I remember at one point in the conversation, I was like, wait, so you’re telling me that what, in your mind, is preventing you from making progress in American healthcare is the inability to do a regularized computation on a CSV file? And they’re like, yeah, pretty much. That was kind of the moment where I was like, okay, this is something I should really look at. I was data-infrastructure-adjacent at Facebook, so I knew about the issues, but I didn’t live and breathe that as much as I did the application space. So I really dug in. And what I’d say is, I found the biggest mismatch between the complexity and criticality of a problem domain, and the tools and processes to support that domain, that I’ve ever seen in my entire career. The only thing that came close was full-stack web development in, say, 2008 to 2010, where you had IE6, super immature JavaScript frameworks, just a completely hostile development environment. As a result, there was also this kind of self-defeating engineering culture around it. And then you fast forward 10 years in full-stack web development, and it’s like the entire universe has changed. The tools are amazing, and the quality of software being produced is so much better as well. It was clear to me that that same sort of transformation was necessary in data. And I was really attracted to the orchestration component of that story, because that was the linchpin technology. I think we can go into that a little more. But that’s what started the process of forming Dagster Labs. At the time it was called Elementl; we recently changed the company name. And, you know, I incorporated in 2018 and started to play with some ideas, and then the company really started to take off in 2019, with hiring full-time people and pushing out the project publicly.
And, you know, fast forward, we raised a Series B a few months ago in, you know, a very hostile fundraising environment. And now we’re scaling the company and feeling a ton of momentum. It feels great to really hit an inflection point with the company.

Eric Dodds 07:53
Awesome. And just for, you know, I think most of our listeners are familiar with the concept of orchestration. But tell us about Dagster Labs: what do you make, and what problem do you solve?

Nick Schrock 08:06
Fundamentally, we are the stewards and sponsors of an open source project called Dagster, which is an open source orchestration framework. And then we deliver a commercial product that leverages that framework, called Dagster Cloud. So, orchestration performs a very critical function, and its definition is slightly changing over time. But the base case of orchestration is this: a data platform, to use a fancy term, is effectively just, you have data, you do computation on it, you produce more data, and subsequently you do more computation on that. It’s almost like an assembly line, right? But instead of an assembly line, you have an assembly DAG, a directed acyclic graph. What the orchestrator does, in its most primitive form, is decide when that factory runs and ensure that things run in the right order. And if there’s an error somewhere in that factory, you can retry the step, right? So at a minimum, you’re ordering and scheduling computations in production. But that orchestration layer is a very interesting leverage point for building a more full-featured product. For example, the orchestrator has to know how to order computations, and therefore it has enough information to understand the lineage between any two pieces of data in the system. So, for example, we have integrated data lineage. Likewise, if the orchestrator is aware of your data, then it ends up being a natural place to catalog all the data assets that are produced by your system. So that’s how we really conceive of ourselves. We have to meet users where they are, and people categorize us as an orchestrator, because they’re asking, like, should I use this other system or Dagster? And that’s the way it works. But we really conceive of ourselves as a control plane for a data platform that does orchestration as well as other components of the system.
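
To make the "order, retry, lineage" description above concrete, here is a minimal sketch in plain Python. It is not Dagster's API; the step names, `run_pipeline`, and `upstream_lineage` are hypothetical, and the only library used is the standard-library `graphlib` module (Python 3.9+). It assumes each step is declared with the set of steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "raw_events": set(),
    "cleaned_events": {"raw_events"},
    "daily_summary": {"cleaned_events"},
    "dashboard": {"daily_summary"},
}


def run_pipeline(dependencies, run_step, max_attempts=3):
    """Run every step after its upstream steps, retrying a failing step."""
    completed = []
    for step in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                run_step(step)
                break  # the step succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
        completed.append(step)
    return completed


def upstream_lineage(dependencies, step):
    """Return every step that `step` depends on, directly or transitively."""
    seen = set()
    stack = list(dependencies[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependencies[current])
    return seen
```

The same dependency mapping drives both functions, which is the point Nick makes: once a system knows enough to order the computations, it already knows enough to answer lineage questions such as "what does `daily_summary` depend on?".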

Eric Dodds 10:13
That was great. Super helpful. One quick thing. Can you define DAG, because I think that’s a term that’s thrown around a lot, especially like as an acronym, but actually define, like, directed acyclic graph, just so we level set, because we never want to assume that all of our listeners know what all of these acronyms mean.

Nick Schrock 10:34
Totally. So it’s kind of an obtuse-sounding term that actually describes something extremely intuitive. It stands for directed acyclic graph. So what does that mean? I think the best real-world analogy is a recipe for cooking. Imagine that you’re cooking a recipe. You make a fundamental ingredient, that’s a step, and then you use that ingredient in two other steps. And then at some point, you recombine those two sub-ingredients to, say, put it in the oven or something. That is a directed acyclic graph, because it doesn’t make sense for it to have cycles, right? If, in the end, you took something out of the oven and then restarted the first step, you would never complete the recipe. So, you know, the recipe is kind of the real-world manifestation of a DAG.
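
The recipe analogy can be written down as a small graph. The sketch below is illustrative only: the step names are made up, and it uses Python's standard-library `graphlib` to show both halves of the definition, a valid cooking order for the acyclic recipe and an error as soon as a cycle is introduced.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical recipe: one base ingredient feeds two steps, which are
# recombined before baking. Each step maps to the steps it depends on.
RECIPE = {
    "make_base": set(),
    "make_filling": {"make_base"},
    "make_topping": {"make_base"},
    "bake": {"make_filling", "make_topping"},
}


def cooking_order(recipe):
    """Return one valid order in which the steps can be carried out."""
    return list(TopologicalSorter(recipe).static_order())


def has_cycle(recipe):
    """Return True if the steps can never be completed because of a cycle."""
    try:
        cooking_order(recipe)
    except CycleError:
        return True
    return False
```

Making `make_base` depend on `bake` is the "take it out of the oven and restart the first step" case: `has_cycle` returns `True` for that graph, and no cooking order exists.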

Eric Dodds 11:27
I love it. Okay, you brought up a term when talking about the software engineering industry, and you drew a parallel to data infrastructure: you described it as a hostile environment, right? I mean, yes, the disparate JavaScript frameworks, you know. In some ways, when you think back to that world, you had a set of tooling that was fairly primitive. Why is the world of data infrastructure hostile? What are the similarities? And I guess, maybe to direct the question a little bit: in software engineering, there were some primitive tools, some very early frameworks, et cetera. That’s a little bit of a different problem than complex infrastructure where you’re dealing with fragmentation, right? So what’s the nature of the hostility in data infrastructure?

Nick Schrock 12:30
So why do I think it’s hostile? I think there is a pretty good analogy. If you think about the history of data engineering, you know, it kind of has a lineage that is not software engineering. I mean, historically, it was drag-and-drop tools, Informatica, and all this stuff, and it was kind of thought of as a lower-status job as well. So there’s that lineage. And, you know, I think it’s also because data engineering, in the end, deals with very physical things in the world: you’re moving around files and creating tables in data warehouses. It is actually difficult to synthesize test data and to set up virtual environments where things are more flexible. Software engineering processes really work when things are purely abstraction-based, and you can shim out the right layers and have super fast dev workflows and whatnot. That’s fundamentally a very difficult problem in data. The analogy to web development is that web development wasn’t operating on physical data, but it was scripting the web browser, which was not designed to be a programmable surface area. So it was just this completely hacked system, and there were no good abstractions over the browser. You were left manually testing in the browser. There was no way to run JavaScript code on your laptop without booting up a browser, which is a super heavyweight thing. Things like React constructed the right software abstraction between the application code and the underlying browser, which was this incredibly hostile substrate for software development.
I think the analogy applies to data engineering, where there aren’t good enough abstraction layers between the application and business logic code that data engineers have to write and the underlying concrete storage systems and computational runtimes, which are actually extremely inflexible and hard to deal with. And, you know, an example piece of infrastructure that has gotten popular these days, putting Dagster aside, one that really sticks out and solves some of this problem, is something like DuckDB, which makes it very easy to program against the same runtime on your laptop as well as in the cloud. And you can actually even execute it in the web browser, to tie this all together, which is nuts, right? So technologies that are very developer-centric, with computation that is portable between different environments, are extremely powerful.

Eric Dodds 15:21
Yeah. Yeah, super helpful. Okay, let’s talk about fragmentation really quickly, because I want to dig in on that a little bit. So you have these fragmented systems, and it’s difficult to have a development environment, especially when you’re dealing with these physical things, like you talked about, where the infrastructure is completely different. But when we think about fragmentation in data infrastructure and data engineering, there’s this interesting debate going on out there around whether the industry is going to move towards increased fragmentation, or whether there’s a sort of rebundling happening, because, to your point, solving these problems across these disparate layers is really hard. So is the market going to congeal, and produce more vertically integrated, or I guess horizontally integrated, depending on how you look at it, or both, systems that solve some of these problems in an integrated way? What’s your take on that? And then how does Dagster fit into your take on that?

Nick Schrock 16:33
Yeah, so that’s a great question. I think you said the right words, which is that it’s either going to be vertical or horizontal. Lots of people who talk about this subject have different visions of the end state, but I think everyone agrees that currently the world is too fragmented, and life is too hard for people spinning up data platforms, because they have to cobble together way too many tools. It’s extremely complicated, both technically and in terms of maintaining that many business and support relationships. It’s just not a sustainable situation. So there are kind of two schools of thought, I think. One is that you’re going to pivot back to a world of more vertical integration, the world of Informatica and Oracle, or Microsoft, in the 90s. You know, you pick one stack and use it for everything. That is inflexible and terrible for technological reasons, and you also get vendor lock-in. The modern-day analogy of this is, you’re either going to be a Databricks company or a Snowflake company, or you’re going to be on one of the hyperscalers: Amazon, or, you know, Microsoft just has a new Fabric product, which is another kind of data-platform-in-a-box solution. So that’s the vertically integrated story. And if you’re not going to do vertical integration, and you need to solve this bundling issue, then there has to be some other layer of integration. That’s the horizontal layer. Now, naturally, given our position in the stack, I think the orchestrator is a natural place to assemble all these capabilities that are needed over data platforms like Databricks and Snowflake. And I do think that there is a natural resistance to vertical integration. I think companies instinctively know this, because if you go to any large company, almost all of them are running both Databricks and Snowflake.
Like, no one just takes one. Everyone runs both, because they’re suitable for different workloads. And you also don’t want to bet everything on one vendor and get locked in to that degree. So I think in some ways the natural market resistance is already doing this. Now, Snowflake and Databricks, or whomever, might build enough capabilities, or do it in a tasteful way where customers still feel like it’s composable, so that they eventually get dominance, but I just don’t think that’s fundamentally the way it works. And I don’t want to live in that future. I think vertical integration is boring and sad in some ways. So then it’s, you know, what’s the horizontal layer going to be? Some people think it’s going to be the catalog, right, and that’s the basis of the control point. Some people think, okay, Apache Arrow is the way this works, because you can have portable memory formats that you move between data platforms. I personally, and I mean, it’s my job to say this as the founder of the company, think this orchestration control plane layer is the natural place to put it, because by its nature every practitioner needs to interact with the orchestrator. Anyone who’s putting a data product in production has to put it in some sort of asset graph, because all data comes from somewhere and goes somewhere. So it has to be placed in the context of some sort of orchestration at some point. And the orchestrator is also the system that shells out to and invokes every single computational runtime and touches every storage layer. So I view it as this very natural choke point, or leverage point, that has to exist in any data platform of real scale. And, you know, as for the kind of user experience you want, I know it’s a tortured analogy.
But you really want something that feels like the iPhone, where you have this common set of rules, you have a grid of apps, and they all abide by certain rules, and the ecosystem brings order to the chaos. But within that order, you get a ton of heterogeneity, right? And it is a bit of a tortured analogy, because the iPhone is vertically integrated, but it’s a vertically integrated OS with an App Store, right? So the analogy is imperfect, but I think that’s, for lack of a better term, the vibe you want: some sort of superstructure that brings order to chaos, but within which you can mix and match technologies.

Eric Dodds 21:17
Yeah, that makes total sense. Okay, one more question from me before I hand the mic over to Kostas. In that world, where we think about horizontal integration, do you operate with a set of foundational philosophies or design principles around when and where the orchestration layer enters the picture? Let’s talk about trickle-down, which we started out the chat with, right? Talking about building these things at Facebook, you know, companies like that are way ahead of most of the market, because they’re inventing new technologies that eventually trickle down so that other companies can adopt them. When we think about orchestration, and Dagster in particular, is your view of the world that this is really where you should start building your infrastructure? So you start with an orchestration layer and then augment your stack over time around it? Or is it a situation where you really only need this when you hit a certain level of complexity, or when you have multiple storage layers, or some sort of break point or threshold where orchestration is the right tool for the job?

Nick Schrock 22:45
Typically, I think that orchestration should be one of the first tools that you use in the data platform. There are exceptions. For example, if you only have one data warehouse, and you know you will only ever use dbt, you only use templated SQL, you’ll bring in no other technologies, and you don’t need anything beyond a cron scheduler, and you know that for certain, then yeah, you might not need a full orchestrator.

Eric Dodds 23:13
Snowflake, dbt, Tableau, and you’re done. You know, something like that.

Nick Schrock 23:17
Right. And if you don’t need any automation, that would be another example, where you can literally just do stuff manually, only create things on demand, and you’re comfortable with manual intervention. But for nearly every single data project, I think that orchestration is fundamental and essential. You need to schedule things, you need to order computations, and you typically need to do it across multiple stakeholders and multiple technologies. And by multiple stakeholders, I should probably say multiple roles, because even as a solo practitioner, you often wear different hats depending on what you’re thinking about. Sometimes you’re thinking, I’m building infrastructure for myself, and sometimes, I’m building the actual data pipelines. And, you know, I think we have work to do to educate the marketplace, to convince people that orchestration should be one of the first tools you adopt rather than one of the last. But in reality, the things you do within the orchestrator are so fundamental to building data pipelines that it should be in the picture from day one, right? You’re going to have errors in production, and you need to be able to resolve those as easily as possible. You’re going to need alerting, you’re going to need to schedule things, you’re going to need to order your computations. Those are the basic tools. If you don’t have an orchestrator in place, you’re very quickly left with a Rube Goldberg machine where maybe you have four different hosted SaaS apps with overlapping cron schedules, and you’re debugging issues across multiple tools. I just don’t think it’s tenable.

Eric Dodds 25:03
Yeah. Yeah, it’s super interesting. You know, you mentioned the tortured analogy of the iPhone, but I was just thinking about a dashboard in a car. It’s interesting because it feels like a single thing, but it actually represents a massive amount of complexity and relates to very different parts of the system, right? From braking, to pressures in different pieces of the engine or transmission, you know, all these separate things. But you wouldn’t think about designing a dashboard for a car in a disparate way, right? It represents a related system.

Nick Schrock 25:52
Yeah. Especially for the coherent operational aspects, like being able to go to one place and have a source of truth, the so-called single pane of glass, where you can ask, what’s going on in my system right now? I cannot imagine running a single data pipeline, let alone an entire platform of data pipelines, without that operational tool.

Eric Dodds 26:14
Yep. All right, Kostas, I could keep going, but please, I know you have a ton of questions.

Kostas Pardalis 26:20
Thank you, Eric. All right, Nick, let’s go back to your GraphQL days first, okay, before we get deeper into what you are doing now and talk more about data infrastructure. I have a question, since you described your journey: what have you learned from working with GraphQL, a tool that is primarily used by product engineers to build applications, that you think is also very applicable to what Dagster is doing?

Nick Schrock 27:06
So that’s a great question. I think there are a few lessons. One is that if you can express the problem you’re trying to solve in concepts that make sense to your users and align with their day-to-day experience on the ground, that is enormously powerful. So take GraphQL and what I think was the novel insight there. People are like, oh, why don’t you just use something like SQL? Well, the reason is that SQL is fundamentally tabular, and GraphQL is hierarchical, right? And the reason that’s powerful is that if you’re a front-end developer, the view libraries you’re dealing with are, 99% of the time, hierarchical structures; everything is about nesting elements within each other. The fact that you can express a query language that directly maps to that is extremely powerful, both in terms of just understanding the query language and, maybe most importantly, in building client-side tools that line up those views with that data fetching. It’s just an extremely powerful paradigm. And similarly with Dagster, we really thought from first principles: what are you actually doing when you’re building a data pipeline? What’s the outcome you’re trying to effect in the world? And how do you interface with the stakeholders who care about it? We kind of have this phrase we say around the office, the virtual office, which is, like, no one gives a shit about your pipelines. Right? All they care about is the data assets that they depend on. Pipelines are implementation details. And if you can express it from the developer’s perspective, you can start out with, like, hey, declare the assets you want to exist in the world, and everything else is downstream of that. Then everything lines up better: your own internal tools, the way you communicate with stakeholders, et cetera, et cetera.
So I think one lesson is, really, getting that mental model right. And, you know, at Dagster, this has been a struggle, a challenge, say, over the years to really dial in this language, and we’re still working on it. But getting that right is super important. And the other thing that I learned with GraphQL is about developers. There’s kind of this common trope: you talk to VCs, or you talk to engineers, and there’s a lot of almost contempt for the broader software engineering community. Like, oh, all developers are dumb, and you’re used to only dealing with the top 1% of developers, and that’s the way it’s going to work. And what I found in the GraphQL space is that developers aren’t dumb. Developers know their domain and their business. They are generally quite smart, quite bright and competent, but they are extremely busy. So I think people confuse smart, busy people with uninterested people, and that causes a lot of people to build underpowered tools. With something like GraphQL, you’re still relying on the users to do a lot; they’re building a very complex piece of software under the hood. GraphQL provides this overarching structure that makes sense in their mind, and tools on top of that, but beyond that, developers have to do all sorts of clever things. And I’ve always been pleasantly shocked at how sophisticated the GraphQL community is in terms of building custom tools and whatnot. I think the same thing applies to the data engineering world, where you don’t just want to give out-of-the-box solutions; you also want to provide developers a toolkit to make them more productive. You have to find the right balance in doing that, but I think having that mentality is critical. And, you know, that has really served us well in the Dagster journey. I still can’t believe some of the use cases people apply this stuff to.
So, you know, the first principle is that getting your mental model right is super important. The second is to understand that your users are smart, busy people, typically building complicated things, and that understanding where to give them the tools to do the complex things, while simplifying everything else as much as possible, is also critically important. And then the last thing is that being consistent with messaging is utterly critical. I remember with GraphQL, once we started to see meetups that we didn’t organize, people were effectively saying the things that we had propagandized and advocated for. And I’m like, I remember the meeting where we decided on using that phrase and not another phrase, and now it’s being repeated in Johannesburg. And we didn’t need to talk to anyone to make that happen. That’s a very powerful thing. So consistent messaging is another thing that comes to mind.
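
The "declare the assets you want to exist, and treat the pipeline as an implementation detail" idea from this answer can be illustrated with a toy example. This is a sketch under stated assumptions, not Dagster's actual API: the `asset` decorator, the `ASSETS` registry, and `materialize` are hypothetical helpers written for this illustration, and the convention that a function's parameter names identify its upstream assets is an assumption of the sketch.

```python
import inspect
from graphlib import TopologicalSorter

ASSETS = {}  # hypothetical registry: asset name -> function that computes it


def asset(fn):
    """Register a function as the definition of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def orders():
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]


@asset
def revenue(orders):
    return sum(row["amount"] for row in orders)


@asset
def report(revenue):
    return f"Total revenue: {revenue}"


def materialize(assets):
    """Compute every asset after its upstream assets.

    Nobody writes the pipeline by hand: the execution order is derived
    from the parameter names of each asset function.
    """
    upstream = {
        name: set(inspect.signature(fn).parameters) for name, fn in assets.items()
    }
    values = {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = {dep: values[dep] for dep in upstream[name]}
        values[name] = assets[name](**inputs)
    return values
```

Calling `materialize(ASSETS)` computes `orders`, then `revenue`, then `report`. The author of `report` only states what it depends on; the ordering falls out of the declarations.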

Kostas Pardalis 32:12
Okay, that’s awesome. Actually, that was an unexpected outcome, to be honest. I didn’t expect to hear that from you. But that actually makes a lot of sense, I think, especially when we are talking about something new, a new technology, right? We’re in a period where the market out there needs to go through education. So consistency there, I think, is critical. So I’ll note that.

Nick Schrock 32:39
I think the other thing I’ve learned is that it’s really important for a technology to be viewed as a career-enhancing move to adopt. So if you can build a technology where people feel like they will advance further in their career and achieve better financial success and notoriety because they adopt you, that is an incredibly important place to be as a technology provider.

Kostas Pardalis 33:10
Yeah. 100%, 100%. Okay, one question, just to try and help, let’s say, the people out there who are coming from one or the other. And when I say one or the other, I mean product engineering on one side, and data engineering and data infrastructure on the other side, right? So obviously, these are two different domains. But there has to be some overlap, right? It’s still engineering. Both have to manage data, both manage state, both have to present something to someone to do something, and all these things. So the question that I want to ask you, because I don’t want to ask you what they have in common and what not, right? I’ll try to do things in a little bit more creative way. From your experience with GraphQL in product engineering: if you had to choose, let’s say, in data engineering, the technology that is closest to what GraphQL is for product engineering, what would you say that is? Which part of the stack out there? It can be the orchestrator. It can be, I don’t know, Arrow, as you mentioned at some point, or the query engine. The good and the bad thing with data infrastructure is that there’s so much to choose from out there. But what has a similar utility, in the end, for the engineer out there, a similar position in the stack, if there is something?

Nick Schrock 34:51
Well, I mean, I think one of the reasons why I was attracted to the layer in the stack that Dagster is in is that I felt that the orchestrator served as the basis of a layer which could serve a similar function as GraphQL, but in the data domain. Insofar as, you know, GraphQL is a very compelling choke point in a front-end stack, where all the different clients, Android, web apps, iPhone, all go through the same schema. And then that is backed by a piece of software that talks to every single service and backing store at the company and provides this organizational layer to kind of model your entire application, right? In a similar way, I think the orchestrator serves the same function in terms of being the place where you can model your entire data platform, where all the different stakeholders can view it through the same lens, this graph of assets, right? And then each one of those assets can be backed by arbitrary computation, arbitrary storage. So I really do think that’s why I was attracted to it, whether implicitly or explicitly: I felt this had that same property of being both a technological and organizational choke point through which you can deliver enormous value and have enormous leverage.
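
To make the “graph of assets” idea concrete, here is a minimal sketch in plain Python. It is not Dagster’s API: the `asset` decorator, the `ASSETS` registry, and the three example assets are hypothetical names invented for illustration. The point is only that each asset declares what it depends on, while the function behind it can do any computation against any storage.

```python
from graphlib import TopologicalSorter

# Hypothetical registry: asset name -> (upstream asset names, compute function).
ASSETS = {}

def asset(deps=()):
    """Register a function as a named data asset with declared upstream assets."""
    def register(fn):
        ASSETS[fn.__name__] = (tuple(deps), fn)
        return fn
    return register

@asset()
def raw_orders():
    # Stand-in for reading from any storage system.
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 35.0}]

@asset(deps=["raw_orders"])
def order_totals(raw_orders):
    # Stand-in for arbitrary computation (SQL, Spark, pandas, ...).
    return sum(row["amount"] for row in raw_orders)

@asset(deps=["order_totals"])
def revenue_report(order_totals):
    return f"Total revenue: {order_totals:.2f}"

def materialize_all():
    """Compute every asset in dependency order and return the results by name."""
    graph = {name: set(deps) for name, (deps, _) in ASSETS.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = ASSETS[name]
        results[name] = fn(*(results[d] for d in deps))
    return results
```

Calling `materialize_all()` walks the declared graph and computes each asset after its upstream assets. Every stakeholder reads the same `ASSETS` graph, which is the “same lens” property described above.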

Kostas Pardalis 36:20
And why do we need such a different implementation of the technology, then? Why can’t we take GraphQL and somehow adapt it, right, and have it be the interface for the data infrastructure too? What is the reason? And it’s more of a technical question that I’m asking, to be honest. Yeah, I mean, I’ll certainly be patient. But why is this happening?

Nick Schrock 36:43
Well, I just think it’s a completely different domain and problem space. You know, I remember people being like, “Oh, Nick, why aren’t you thinking about using GraphQL for analytics?” And I’m like, absolutely not. Absolutely not. And the reason why is what I was talking about before: GraphQL works for front-end applications because the net thing that you view on the screen is hierarchical, right? And that makes sense. When you’re dealing with analytics, you’re looking at tables. You’re looking at tabular data, and direct renderings of that in dashboards and whatnot. So SQL is the right tool for analytics. And GraphQL is a better tool, in my opinion, for building front-end products. And so they’re completely distinct domains.

Kostas Pardalis 37:33
Okay, it makes total sense. One last question that has to do with, let’s say, the boundaries between these two different disciplines, which also have some similarities. So, in data infrastructure, we are talking about orchestrators, right, which are systems like Dagster. We have, as you said, tasks to be executed, we have some scheduler, we have managing failures, all these things. In the product engineering space, there’s also the concept of the workflow engine, right? And there’s been a lot of conversation lately about workflow engines, and how close they should be to the state, like whether they should be part of the transactional database or not, or outside. But they have some similarities, right? In the end, even the workflow engine is a DAG, pretty much: you have some tasks that need to have some ordering in how they get executed. Maybe not necessarily doing data processing directly, but there might be an endpoint that you have to go and hit somewhere, right? So again, what’s the difference? Why can’t we have one that can drive, let’s say, the data infrastructure and the processing there, and the same also with product engineering, where we have to orchestrate tasks again?

Nick Schrock 39:06
What’s an example workflow engine in product engineering that you’re thinking of? Just so I can, because workflow engine can be different things to different people.

Kostas Pardalis 39:14
Yeah, 100%. Temporal, for example, is the one I have in mind, right?

Nick Schrock 39:20
Yeah, Temporal is really interesting. And actually, I think, fundamentally, something like Temporal is a more imperative and general-purpose tool. But it has to make explicit trade-offs there that make it less well suited for doing data processing in the context of a data platform. I think the simplest illustration of it is that, using Temporal, if you want to understand the lineage of your data assets without executing anything, that is literally impossible in Temporal, because Temporal is a more dynamic machine that makes very different trade-offs. So there’s nothing preventing you from using Temporal to perform a subset of the functions in a data platform orchestrator. But it just doesn’t fit all the needs of data platform teams. And so, you know, there’s a world where there’s a data platform stack where Temporal is a component of it.

40:32
But fundamentally,

Nick Schrock 40:35
fundamentally, it’s very different. Something like Temporal is interesting. I’ll be very curious to see how it develops over time, because it’s actually an extremely invasive programming model. And if I was hired as a VP of Engineering at some larger company that had bet heavily on Temporal for its infrastructure, I would be lying awake, sweating at night, thinking about, like, how do I debug this if it goes wrong? Because you’re putting so much faith in the system to be reentrant and manage all this state properly. And if something goes wrong, I would just have a hard time debugging it. But yeah, I’m both extremely intrigued and amazed by Temporal and also kind of terrified of it, at scale especially.
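
The lineage point in this exchange can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Temporal’s or Dagster’s API: `DECLARED_DEPS`, `upstream_lineage`, and `imperative_workflow` are hypothetical names. When dependencies are declared as data, lineage can be answered by inspection; when steps are chosen by running code, the “graph” only exists after execution.

```python
# Declarative: dependencies are data, so they can be inspected before anything runs.
DECLARED_DEPS = {
    "raw_events": [],
    "sessions": ["raw_events"],
    "daily_active_users": ["sessions"],
    "churn_model": ["sessions", "daily_active_users"],
}

def upstream_lineage(asset_name, deps=DECLARED_DEPS):
    """Return every transitive upstream asset by walking the declared graph."""
    seen = set()
    stack = list(deps[asset_name])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps[current])
    return seen

# Imperative: which steps run depends on runtime values, so the set of steps
# is only known once the workflow has actually executed.
def imperative_workflow(fetch_flag):
    steps_run = ["load_events"]
    if fetch_flag():  # decided at run time
        steps_run.append("build_sessions")
    return steps_run
```

`upstream_lineage("churn_model")` answers the lineage question without computing a single asset, whereas the only way to learn what `imperative_workflow` touches is to call it.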

Kostas Pardalis 41:29
Yeah, it makes sense. Makes sense. All right, cool. So let’s focus more on the data infra stuff now. Let’s talk about orchestrators in data infra, right? Dagster is, okay, obviously not the first one. There are many different solutions out there. Some are more niche, some are more generic. I would say probably the most well-known one is Airflow, for sure. So tell us how you see the landscape out there.

Nick Schrock 42:03
Totally. Yeah, Airflow is funny. So the lineage of Airflow is actually from Facebook. It’s based on a system that was kicked off at Facebook in 2007 called Dataswarm, which still exists. Then Max, who invented and created Airflow, who I know very well, left Facebook, went to Airbnb, realized they needed a similar system, and kind of, you know, basically built V2 of Dataswarm. And I think that Airflow did a couple of really important things. One is that you could build DAGs and write your code in Python, rather than having to use a UI or some really inflexible config system. And then it had a nice UI. And so between being able to use Python, which gave a level of dynamism and a language that data people understood, and a high-quality UI, it really took off. But there are a few things that are a problem with Airflow. One is that it was clearly not written for the local development experience. And these systems are complicated enough that you need to be able to test them, do automated testing, and have fast feedback loops, because those are the foundations of developer productivity. And developer productivity is absolutely huge. And the other thing, and this is funny, we like to say that even though it is kind of the incumbent orchestrator that people build data pipelines in, it actually is not a great tool for building data pipelines, because it’s not aware of the data that it produces. It’s kind of this tautological thing. It’s the wrong layer of abstraction for data pipelines, but it got its momentum and became a norm. But we fundamentally think that a more data-oriented approach is important.
So if you think about the landscape, and I’ll include dbt, Dagster, Prefect, and Airflow: Prefect is actually much more similar to something like Temporal at this point, because of Prefect’s new 2.0 product. That was a company started like a year before Dagster Labs, and they have this sort of DAG-less vision that’s similar to Temporal, where it’s just arbitrary workflows. And so it’s more imperative and generic. Then you have the task-based DAG system, which is Airflow. Then you have dbt, which is very popular, which is exclusively for Jinja-templated SQL, with a hint of Python these days, but like 99.9% of the usage is template-generated SQL. They build a graph of data assets as well. They call them models, and they exclusively execute over the warehouse. And they’re targeted at these kinds of software engineer and analyst hybrids they call analytics engineers. If you think of those as a spectrum, Dagster is kind of in between Airflow and dbt, meaning that it has a much more declarative, data-oriented approach, similar to dbt, but is targeted towards data and ML engineers and more trained software engineers. And it’s more flexible and can be backed by any arbitrary computation, not just Jinja-templated SQL. So that’s kind of the landscape: on one end, declarative and hyper-focused on the data warehouse and SQL, that’s dbt; all the way on the other end, something like Prefect or Temporal, which is completely DAG-less, more of a straight workflow engine; and then Airflow and Dagster in between.

Kostas Pardalis 45:47
Okay, that was awesome. And are there any, let’s say, more niche types of orchestrators? There’s this whole thing around MLOps, for example. Is there a group of orchestrators just for ML?

Nick Schrock 46:03
So MLOps is super interesting. I think it’s something that we’re going to be focusing on in 2024. Because, you know, we actually really believe that the MLOps ecosystem is unnecessarily siloed, and they don’t need their own orchestrators. MLOps should be a layer, not a silo. And there was this great article that hit Hacker News like six months ago, which was like, “MLOps is 98% data engineering.” And I think that’s totally true.

Kostas Pardalis 46:35
I wrote this, by the way.

Nick Schrock 46:38
You wrote that? Oh my God, I never connected that. Okay, so this is perfect, that’s amazing. We love that. That’s going to be the basis of our product marketing next year. Because, and we don’t emphasize our ML use cases at all, right, if you interview our cloud customer base, 90% of them use it for ETL and analytics, which makes sense, but 50% also use it for ML, and 40% also for what they call production use cases. So multi-use cases are the norm. And what happens is that a data platform team brings in Dagster, and they start using it, and then their ML team wants support in doing stuff. They talk to the ML team, and the ML team is like, “Well, we want to write Python, we want to write DAGs of stuff, we produce a bunch of intermediate tables, and at the end of the line, we produce models.” And they’re like, “Well, that sounds like data pipelining.” And Dagster provides a great foundational tool for the data engineering components of the MLOps job, which is 98% of it. That’s so funny that you’re the one that wrote that article. So we’ve totally bought into that view. Dagster is about data engineering. So we kind of think of data engineering as this layer. And then different parts of the data pipeline overlay different technologies on top of that layer. So in the middle, at the data warehouse, you might have dbt Core; in the ML component of it, you might have MLOps as a layer of tools on top of that. But it all shares a common control plane driven by data engineering principles.

Kostas Pardalis 48:12
Yeah, that makes total sense. I mean, obviously, I agree. That was the reason I wrote that blog post and

Nick Schrock 48:20
Great post.

Kostas Pardalis 48:22
Yeah, it has had much more impact than I expected, to be honest. And it was very interesting to see the reactions both from the ML, let’s say, group of people, and also from the data engineers. Anyway, maybe we should have an episode just talking about that. Because, yeah, I do believe there needs to be a convergence between the two disciplines. It is important if you want to keep adding more and more value and foster innovation in the industry. Otherwise, things are just way too fragmented. It doesn’t make sense.

Nick Schrock 48:54
Yeah. It doesn’t make any sense. No, we gotta get Sandy on this. He’s the lead engineer on the project. And we could get going for two hours getting ourselves whipped up about this subject. Yeah.

Kostas Pardalis 49:05
Yeah, we should do that. Absolutely. We’ll arrange that. So okay, let’s go back specifically to Airflow, and I have one last question here. If there’s something that you are, let’s say, envious of, that Airflow has right as well, what would that one thing be?

Nick Schrock 49:28
Oh, just the install base. Yeah, that’s pretty much it. I feel like we compare favorably on almost everything else. Actually, the install base and the existing corpus of searchable content related to the technology, you know, the advantages of incumbency. Those are the two things I envy. But yeah, we’re making good progress. Yeah, that’s true.

Kostas Pardalis 49:59
And I think you have also generated some pretty good content out there. So, okay, but Airflow has also been around for how many years now? It’s like 10 years, maybe a little bit less. There’s a lot to

Nick Schrock 50:14
Max started it in 2014, I think. It was open sourced pretty quickly, 2015. So we’re getting there.

Kostas Pardalis 50:20
Yep. Yep. All right, let’s talk about Dagster now. You’ve been working on this for quite a while. What’s next? What are the next couple of things that are coming out in Dagster? And what should we be excited about in future releases?

Nick Schrock 50:44
Yeah, so I think that our near-term future is very much about demonstrating that, you know, we kind of have this position in the stack. We claim to be this operational single pane of glass, and we have companies where a bunch of different stakeholder teams are adopting it. And now the next part of the journey is using that leverage point to deliver more value to teams. So one part of that is, you know, I think this show is going to come out like a week before our launch week, but we’re going to be announcing embedded data quality in the orchestrator. That doesn’t mean we’re going to try to replace dbt tests or replace Soda or replace Great Expectations. Those are all systems that we can leverage. It’s more about making the orchestrator data-quality cognizant, I would say. And this isn’t just us reaching outside of our domain; our users explicitly want this, because they’re used to looking at our asset graph and asking, “What is the state of my system?” And having an extra checkbox there that says, “I passed all my data quality tests,” is the most natural thing in the world to want to integrate, and then being able to alert, like, “Okay, if this thing fails, ping me in Slack,” or whatever. So it’s a very natural extension of the orchestration system. And we think that, you know, in five years, any orchestration platform that doesn’t include data quality will be viewed as woefully incomplete. So we’re adding data quality capability. Similarly, we are also adding consumption management capabilities to the system. So first of all, we’re going to be augmenting our integrations to make it very straightforward to collect metrics about consumption. So, like, how many Snowflake credits is each asset consuming? And then what’s very unique is that, one, we can index that by asset name in our system.
So we can give reports to say, like, “Hey, you’re recomputing this thing all the time, it’s consuming this many credits, are you sure you’re getting enough value out of that?” And then second of all, the orchestrator is also naturally a very interesting place to embed cost and consumption information, because you can do quoting, you can provide quotas, you can project how much computation is going to cost going forward. It’s just a very natural place to embed that sort of cost information. And we think that it’s going to be incredibly powerful. You know, Jeff Bezos famously said, “Your margin is my opportunity.” I think the equivalent in data is, like, your NDR, your net dollar retention, is our opportunity, because you can’t increase your Snowflake spend 80% year over year; eventually you run out of money. And you need tools to be able to control that. And we think that we’re in a natural place to do that. And then lastly, the other thing that we’re going to be releasing is a way that, especially, centralized data engineers and data platform teams can bring in all the computations of all their stakeholder teams in a way that’s much, much easier, that requires no modification or only minimal modification of that external code. I think Dagster historically has gotten dinged because it kind of feels like it has to take over too much of your system. And I think that feedback was pretty accurate, actually. And so we’ve really taken that to heart. We really want to make it so that instead of the entire organization having to become Dagster experts, only a centralized team has to become Dagster experts, and everyone else kind of becomes Dagster-curious, where they just know hints of Dagster and then they can use their operational tools and it all kind of works smoothly. So, you know, in the end, the goal of that launch week is really to make these centralized data teams, data platform teams especially, way more leveraged, so make it way faster for them to bring everyone into the orchestration platform. And then once they’re there, be able to use these value-added features to deliver enormous value super quickly. So “beyond orchestration” is kind of one of our internal themes. And we want to, you know, really develop this future of a more advanced control plane that I think data teams desperately need.
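
As a rough illustration of the two ideas in this answer, a data quality status per asset and consumption indexed by asset name, here is a small sketch in plain Python. It does not use Dagster’s API or reflect the announced features; the run records, check names, and credit figures are all made up for the example.

```python
from collections import defaultdict

# Hypothetical run records; in practice these would come from the orchestrator's event log.
RUNS = [
    {"asset": "orders_clean", "credits": 1.5, "rows": 1200, "null_ids": 0},
    {"asset": "orders_clean", "credits": 1.0, "rows": 1180, "null_ids": 3},
    {"asset": "revenue_daily", "credits": 4.0, "rows": 30, "null_ids": 0},
]

# Hypothetical checks: asset name -> list of (check name, predicate over a run record).
CHECKS = {
    "orders_clean": [("no_null_ids", lambda run: run["null_ids"] == 0)],
    "revenue_daily": [("non_empty", lambda run: run["rows"] > 0)],
}

def latest_check_status(runs, checks):
    """Evaluate each asset's checks against its most recent run."""
    latest = {}
    for run in runs:  # later records overwrite earlier ones
        latest[run["asset"]] = run
    return {
        asset: {name: check(latest[asset]) for name, check in asset_checks}
        for asset, asset_checks in checks.items()
        if asset in latest
    }

def credits_by_asset(runs):
    """Sum warehouse credits per asset name."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["asset"]] += run["credits"]
    return dict(totals)
```

Because both functions key everything on the asset name, the check result and the cost can sit next to the same node in the asset graph, which is the “extra checkbox” and per-asset report described above.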

Kostas Pardalis 55:29
Yeah, that makes total sense. All right. We are really close to the end here. And I want to give some time to Eric to ask any other questions that he has. But we definitely would like to do at least one more episode. I think we have a lot to chat about. So I’m really looking forward to doing this again in the future. And yeah, it’s

Nick Schrock 55:55
It was great. That’s so funny that you wrote that article. And I wasn’t just name-dropping it; I swear to God, that wasn’t on purpose. Because that would have been real 4D chess, to pretend to not know that and then drop it.

Eric Dodds 56:06
Yeah. Kostas, you’re a real data influencer. I mean, you’re the foundation of a product marketing strategy. Okay, just one last quick question. We’ve talked so much about data. And, Nick, you’ve been so articulate and so helpful on so many subjects. So my last question actually has nothing to do with data. If you weren’t working with data or building technology tooling, what would you do?

Nick Schrock 56:32
Oh, I actually have a good answer to this. We are on the cusp, in the world, of an energy transition, and no one understands the implications of it. And this isn’t for environmental reasons. We’ve reached this tipping point where solar energy is now cheaper than all fossil fuels, all other energy generation. So solar, wind and battery together: not only is it cleaner, which is interesting, but it’s also way cheaper. And by definition, there are no fuel inputs to that. And that is incredibly exciting. Because there’s kind of this mentality that, oh, by decarbonizing, we’re going to have to degrow and go back to some pastoral life where no one drives and no one travels. Completely false. If we do this right, we’re actually going to live in a world with effectively infinitely abundant energy. And working on that transition, I think, would be incredibly exciting. Because effectively, the way that the math works out for these clean energy systems is that the cheapest configuration for building them means you dramatically overprovision solar and wind generation capacity, so you don’t have to build as many batteries. And that means that most of the time during the year, you have a wild excess of essentially limitless free energy. So I think there’s going to be an entire new wave of industry that is built to effectively take advantage of this intermittent, infinite, virtually free energy. I think it could be an incredible future. So that is what I would be working on.

Eric Dodds 58:20
Fascinating, man. That’s an episode in and of itself. Well, as we like to say, we’re at the buzzer. Brooks is telling us to land the plane. But Nick, this has been such a wonderful conversation. Thank you for giving us your time. And we’ll have you back on very soon.

Nick Schrock 58:35
Awesome. Yeah, this was great. Thanks for having me.

Eric Dodds 58:39
What a guest. I feel like every time we asked Nick a hard question, he was able to come up with an answer that was concise and articulate, for every single question. That was really amazing. I think you probably asked one of the winning questions, which was around the data orchestration landscape. And what a fascinating answer. It was really helpful to hear him talk about the entire spectrum, where you have, you know, Temporal, which you brought up on the show, which is sort of embedded in application code and sort of a deeply integrated, generic workflow execution engine, all the way over to, you know, the dbt side of things, which is sort of Jinja templating and managing jobs that run SQL queries. And he really painted that entire picture. It was so helpful to me, and that is my big takeaway. So I think this is a show for anyone who wants to deeply understand the history and the current state, and then think well about the future of orchestration. This is a great show.

Kostas Pardalis 59:54
Yeah. Oh, 100%. I think it’s almost like the future of orchestrators is a glimpse into the future of infrastructure in general. And data engineering, I would say, also, because another part of this episode that I think is super unique and super fascinating is that we talked a lot about the differences, and also the overlaps, between product engineering and data engineering: the tooling, the infrastructure out there, why we need to have these different disciplines or domains, and what we can learn from one and transfer to the other, right? And I think Nick has such extensive experience at the kind of unique scale of Facebook, right? So his perspective, I think, is very interesting and very insightful. And it’s not that easy to find out there. So I would encourage everyone to tune in and actually listen to the conversation that we had, and hopefully we’re going to have more conversations with him in the future too.

Eric Dodds 1:01:03
I totally agree. One last bonus from this episode is that I think this is the episode where you became a data influencer, because, and I won’t give away too much, Nick referenced that they’re building their product marketing strategy off of a particular article that went on the first page of Hacker News that may or may not have been authored by one of the co-hosts of the show. So if that tantalizing piece of juicy information is interesting to you, listen to the entire show to hear more. Subscribe if you haven’t, tell a friend, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.