This week on The Data Stack Show, Eric and Kostas chat with Chad Sanderson, Head of Data and Data Contracts Advocate. During the episode, Chad talks about all things data contracts. The conversation includes topics such as the value of data contracts, dealing with the semantic and logical layers of data, implicit contracts at companies, how contracts fit into data infrastructure, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show, Kostas. Today we’re going to talk with Chad Sanderson. He has had a long career as a data practitioner, but he runs a community and creates a lot of content around data quality. And he talks a lot about data contracts in particular. And that’s what I want to ask him about. I don’t think we’ve talked about data contracts on the show. We’ve had discussions about data quality and a lot of the tooling that’s trying to accomplish that, but data contracts, I think, is a new subject. And so I want the 101 breakdown from Chad, because he is the expert.
Kostas Pardalis 01:05
100%, I’m very excited to have him on the show. To be honest, data contracts is one of these concepts we keep hearing more and more about lately. But it’s not the data contracts themselves that I’m so excited about, it’s more about having Chad on the show. Because we tend to talk about different things in infrastructure from the point of view of a vendor bringing a solution. But in this case, we have a person who is just passionate about the problem and tries to build not a product, but to change the way that people work around data, and data quality in particular. So there are many more, broader things that we can discuss with him, and I’m really looking forward to doing that, and to talking about data quality in general: why it is important, why it’s so hard to define and build data products, and how we can make things better.
Eric Dodds 02:15
All right, well, let’s dig in and talk about data contracts with Chad. Let’s do it. Chad, welcome to The Data Stack Show. A privilege to have you on.
Chad Sanderson 02:29
A privilege to be here. Thanks for having me.
Eric Dodds 02:32
Absolutely. Well, we have a ton to talk about. I’ve read a lot of your work; it’s been a huge help to me in the way that I think about a lot of things related to data quality. But give us your background, and what led you to what you’re doing today with the community and the content.
Chad Sanderson 02:52
These days, I spend most of my time writing, going to various conferences, and I may have a book or two in the works at the moment (it’s early days on that). And I’m running a community called Data Quality Camp. That community is focused around managing data quality at scale, which is an unsolved need. There are a lot of ways to think about data quality, and not too many standards, so I thought it’d be good to stand up a community that helps people manage their transition to data quality a little bit better. Before that, I spent three years at a company called Convoy, which is a freight tech startup based out of Seattle. They do not have what I would call big data problems, more like small-to-medium data problems, where the issue is less around cost and computation and more around the complexity of the data and ownership. From my time there, trying to solve these problems around the accumulation of tech debt in the data warehouse and ownership problems between producers and consumers, we derived a programmatic initiative around an idea called data contracts. And that’s what I spend most of my time writing and talking about these days. Prior to that, I was at Microsoft, where I worked on their artificial intelligence platform team. And I’ve been a big data analyst for just around 10 years altogether, working at a variety of other companies in the e-commerce space.
Eric Dodds 04:25
Okay, Chad, you mentioned that you do a lot of thinking and writing about data quality at scale, and you faced this problem in previous jobs. But as you mentioned, there’s a lot that goes into data quality. It’s a very wide field. There are lots of companies that are trying to solve this problem, and they’re doing it in some very different ways, right? Very different approaches. Can you break down data quality for us? How do you frame such a big topic?
Chad Sanderson 04:59
I would basically break down data quality into two main categories that each have their own subdivisions of concern. The first layer of quality is what I think any engineering team would think of when they hear quality, right? Does the application work the way that it was intended? If we have a set of requirements for a product, are those requirements being met? Are the SLAs that we have being met? And that could include, you know, the freshness of the data; it could include whether or not there are serious breaking changes being made to dependencies, whether the API that is being consumed is evolving in a way that is conducive to the health of that application. So that’s one part of it, I think, where we’re treating the data itself like a product, and there is some expected level of quality mapped to the requirements. The other element of quality, which I think is unique to data, is the idea of truth, or trustworthiness: the data needs to map to some real-world reality. If I have a shipment, and I have data about the shipment, and I know where the shipment was dropped off, where it was initiated from, and whether or not it arrived on time, all of that should map to whatever really happened in the real world. And that is a really complex subject, because you have different levels of big T truth and little T truth. If you’ve spent any time around philosophy, you’ll know there’s big T Truth, where there’s some objective meaning to what happened, and then there’s little T truth, which is the subjective interpretation of what happened. Many people at a company have different interpretations of what a specific metric might mean, or what a specific dimension might mean. So part of data quality is ensuring that everyone is speaking the same language, and that the objective truth about the world is reflected in the data itself.
That’s what I think are the main components of data quality.
Eric Dodds 07:19
That’s super helpful. And I love the philosophy angle. Do you see more struggle on the big T side or the little T side? I mean, obviously, if you don’t get the big T right, then you’re going to have a lot of problems with the little T. But do companies really struggle with the big T side of things?
Chad Sanderson 07:38
I think basically every company I’ve ever talked to struggles with the big T side of things. And this is, you know, not to jump the gun, but this is one of the major issues that data contracts are attempting to solve: ensuring that the data is defined in a way that maps to the real world, and that it doesn’t change unexpectedly for reasons that may have to do with something other than the data itself. Like, well, we decided to launch a new feature, or we decided to drop a column, or rename a column, because it didn’t really fit what we were attempting to do with our application. The goal is for the data we’re collecting from our source systems to be as tightly mapped to that big T truth as possible. And part of that mapping has to come from the consumers, who understand what the data means and what it maps to, having a great relationship with producers, who are responsible for maintaining the systems that are collecting that data. So if you think big T truth is a huge problem, little T truth is also a huge problem. And it really depends on where in the organization you’re looking and what type of business it is. But there are massive disagreements at, again, almost every company I’ve talked to, about what a particular metric even means. We had this dimension at Convoy that was called shipment distance. You would think that’s a pretty straightforward thing; it’s just the distance between a shipment’s origin point and destination point. But there were so many people that couldn’t exactly agree on what specifically we were talking about. We could be talking about distance in kilometers or distance in miles. Some people wanted to define the starting point as where the shipment was dropped off. Some people wanted to define it as from where the trucker was driving from. And these types of differences in thinking sort of apply to the use cases that the consumer is attempting to solve.
So wrangling everybody’s brain around the same semantic concepts is very challenging.
Eric Dodds 09:47
Yeah. Now, I want to get practical for a second: would you describe shipment distance as a big T or a little T?
Chad Sanderson 09:56
There are definitely elements of both. This is where we kind of get to the philosophy of all of this, right? And it actually becomes a really challenging conversation to have. There is obviously some objective distance, right? The shipment is traveling from one place to another place. Yep. So that part is real. The question is, what explicitly do we mean by shipment, and what do we mean by distance? The distance part is real. It’s the shipment part, and tying distance to the shipment, where people disagree.
Eric Dodds 10:29
Right? Yep. Yeah, that makes total sense. I was thinking about even something like delivery, right? It seems like it’s binary: this thing was delivered, or it wasn’t. But one team could say, okay, if it gets to the physical destination, it’s delivered. And another team may say, well, no, it’s when the customer opens it and verifies that what they got is correct; that’s a successful delivery. Which are all useful. But the question that comes up then, and I’m speaking from experience here, even in stuff that we do every day with our own data, is: okay, so you have some disagreements. Not because anyone’s necessarily right or wrong; in a lot of cases, it’s that in order to interpret their job, or understand the effectiveness of their work, they need to measure something in a slightly different way. But the problem that often comes up is that you start to have this proliferation. It’s like, okay, well, now we have 19 shipping distance variations: shipping distance underscore X, and so on. I do want to get practical and talk about data contracts and such, but philosophically, where do you fall on the spectrum of, we need to provide consumers with the information that they need to do their job well, without allowing things to run rampant and create all of this metrics debt that spirals out of control? I feel like the first time you say, okay, we’ll just cut a new version of this, it’s weeks later and the warehouse is already getting messy.
Chad Sanderson 12:38
Yeah, so I sort of see the data platform environment split into two halves: there is the semantic layer and the logical layer. And I’m using those terms a bit differently, I think, from how a lot of other companies use them, and there’s a reason why I think companies do this in a different way. When people talk about semantics, that means, at least in every definition I’ve seen, the nature of the thing itself, right? If I say the semantics of a car, I’m talking about the nature of a car. I’m not talking about abstract interpretations of cars. I’m saying, does the car have an engine? It has four tires. It has a function, which is to move from one place to another place. So that’s one layer I think needs to exist in the data platform. The other layer that needs to exist is the logical layer. These are derivations of real-world objects and events, and those are subject to our interpretation. Something like margin is an example of a logical construction, right? There is no real thing called margin that exists in the world that we can grasp. It depends on how we, the humans who work at a company, choose to define margin, and it can be cut in many different ways. I think that the semantic layer needs to have one type of governance and implementation and coordination. And I think that the logical layer needs to have a very different type of governance and organization, one that is based around promotion: sort of crowd-sourced, almost Reddit-style upvoted artifacts, right? So if, as a company, we agree that this definition of margin is the one that is most commonly used by the business, that doesn’t mean that everyone else can’t have their own interpretation. That’s fine.
But if anybody in the company has a question, what is margin according to some common definition, there should be a very easy way for them to get access to that data without having to try to understand the 30 unique versions of margin that exist all across the business. So I think there needs to exist some place where there can be iteration, where teams can derive logical aggregations based on real-world semantic objects. And as those logical derivations become more and more valuable at the company, they are elevated to a higher level of importance and treated like an API. And then there can be discussion, right? So if you want to change the definition of margin that is powering key data objects... actually, let me take a step back, because when we’re talking about these elevated meanings, it’s not just in a sort of abstract sense of, oh, there’s one set of metrics that are good and one set that are bad. Having that elevated version of a metric should allow you to use the metric in ways that are more actionable and production grade. For example, if I want to use this concept of margin in a dashboard that I serve out to my customers, then I have to use the official version, the elevated version. If I want to serve it to a sales team, or if I want to do something that maybe goes across teams, then I have to contribute back to this central, almost open-source definition of a metric. If you want to create your own version of a metric, and it lives in your little local environment, and you tinker with it, and you apply it to a dashboard that only you see, that’s fine. But once we start going cross-company, that’s where we need to have some agreement on what these terms actually mean.
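As a rough illustration of the prototype-to-promoted flow Chad describes, here is a minimal sketch in Python. Everything in it (the registry class, the team names, the margin definitions) is hypothetical and invented purely for illustration; it is not a tool mentioned in the conversation.

```python
# Sketch of the "promoted metric" idea: local definitions are free to vary,
# but cross-team consumers must resolve through the single elevated version.

class MetricRegistry:
    def __init__(self):
        self._local = {}      # (team, name) -> definition; teams may tinker freely
        self._promoted = {}   # name -> the one company-wide, agreed-upon definition

    def define_local(self, team: str, name: str, definition: str) -> None:
        self._local[(team, name)] = definition

    def promote(self, team: str, name: str) -> None:
        # Elevate one team's local definition to the official cross-team version.
        self._promoted[name] = self._local[(team, name)]

    def official(self, name: str) -> str:
        # Cross-company use must go through the promoted definition only.
        if name not in self._promoted:
            raise KeyError(f"no promoted definition of {name!r} yet")
        return self._promoted[name]

registry = MetricRegistry()
registry.define_local("finance", "margin", "revenue - cost_of_goods_sold")
registry.define_local("sales", "margin", "revenue - discounts - cogs")
registry.promote("finance", "margin")  # the company agrees on finance's version
```

Local definitions stay available for tinkering; only `official()` is meant for anything cross-team or production grade, which mirrors the "elevated version" rule Chad describes.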
Eric Dodds 16:25
Yep. Yeah, that makes total sense. So a centralized agreement on the most important things, but we’re not removing decentralization from the equation, right?
Chad Sanderson 16:44
That’s right. And this is my approach to both the semantic layer and the logical layer. I think there’s sometimes a misconception in data where we’ll look at all the spaghetti SQL in our data warehouse, and we’ll look at the pipelines that are actively failing, and the business logic where, you know, there’s no real clear agreement on what these entities are. And we take this approach of: we need to go and remodel everything, we need to have a very clear, well-agreed, established data model, and we have an entity called shipments, and it’s owned by this team, and it has to be brought here, and we have data warehouses, data marts, whatever. It always pitches this big, massive overhaul, where there is going to be a big T truth that applies everywhere, and you’re not allowed to change that. And that’s just not feasible or realistic. In my experience, you have to give people the ability to iterate and tinker and prototype and try out new things, but give them a path to move from that prototype, design environment to a production, high-trust environment. And that production, high-trust environment needs to apply all the best practices of software engineering to a smaller, but far more valuable and condensed, slice of our data pipelines.
Eric Dodds 18:05
Yep. No, that makes total sense. Okay, so where do data contracts fit in here? I know we’ve probably been walking all around the subject of data contracts in the philosophical discussion, or I guess practical discussion as well, around quality. But okay, break down data contracts for us and where they fit into everything that you just outlined.
Chad Sanderson 18:29
Yeah, so data contracts are, at their core, agreements between producers and consumers enforced through a programmatic mechanism. To put it simply, it’s an API for data. It is more robust and comprehensive than, I would say, a traditional API, because you’re not just thinking about the schema and the evolution of the schema, but you’re taking into consideration the data itself. So this goes back to that real-world truth that I was mentioning before. If I have an expectation that a particular ID field is always a 10-character string, then I need to ensure that the data itself reflects that. And if I get a nine-character string or a 15-character string, that means that somewhere a bug or otherwise a regression has been introduced. And that means my assumption that this data represents the big T truth has been violated, because it doesn’t make sense for an ID to be 15 characters. It doesn’t work in our system, right? So, thinking about what we’re talking about here: I mentioned before that quality splits in two. You have this one issue that’s about truth and semantics, and you have this other issue that’s about whether the product maps to the requirements that I have. I think that data contracts actually start primarily on that second side, as a quality mechanism to say, is my data product working the way that I expect, and do I have a very clear owner that’s willing to fix bugs and regressions in that data product? But I think that over time, they can be used to solve some of the semantic problems that I mentioned before as well.
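As a rough sketch of the kind of enforcement Chad describes, checking the data itself and not just the schema, here is a minimal Python example. The 10-character ID rule follows his example, but the field names, the `CONTRACT` structure, and the `violations` function are hypothetical, chosen only for illustration.

```python
# Sketch: a data contract as a set of expectations checked against the data
# itself, not just the schema. Field names and rules are illustrative only.

CONTRACT = {
    "shipment_id": {"type": str, "length": 10},  # IDs must be exactly 10 chars
    "distance_km": {"type": float},              # distance must be numeric
}

def violations(record: dict) -> list[str]:
    """Return a list of contract violations for a single record."""
    problems = []
    for field, rules in CONTRACT.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            problems.append(f"{field}: expected {rules['type'].__name__}")
        elif "length" in rules and len(value) != rules["length"]:
            problems.append(
                f"{field}: expected length {rules['length']}, got {len(value)}"
            )
    return problems

# A nine-character ID violates the contract, signalling an upstream regression:
print(violations({"shipment_id": "ABC123XYZ", "distance_km": 412.5}))
# prints ['shipment_id: expected length 10, got 9']
```

The point of the sketch is the second check: a schema-only validator would accept any string, while a contract also asserts properties of the values flowing through.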
Eric Dodds 20:28
One of the challenges that I’ve seen come up over and over again as it relates to data contracts is on the logic side, on the consumer side. So one of the challenges is that you have, like, a sales team, or a marketing team, or a product team, and they have some sort of tooling that allows them to do whatever they do, right? So they’re sending messages, or they’re moving deals through some sort of lifecycle or whatever. And tons of logic lives in there. But those systems tend to be very inflexible, understandably, because they’re built for that sort of purpose. And so when you think about a contract, I think one of the challenges is that you have business logic that, many times, is a contributor to and former of even some of the semantics, the big T truths: this is what a closed deal is, or whatever. And that lives in a downstream tool. But when we think about an API, as you described it, for data, a lot of that has to be centralized in infrastructure. What do you think about that, in the world of data contracts, and even the technical side of data contracts?
Chad Sanderson 21:49
Yeah, it’s definitely a challenging problem, but it’s actually one that I think is going to be solved at some point. Salesforce, for example, has their own sort of DevOps-oriented infrastructure now, where changes are logged through actions, and if you’re a developer, you can tie into that. And I think there’s a lot of interesting potential in those types of systems: essentially being able to say, hey, we detected by running a check that you are about to drop a column in your Salesforce schema, and there’s someone downstream that has a dependency on you, so we’re not going to let you do that. Obviously, you need an engineer to implement a system like that, but you can abstract the messaging up to the level of the non-technical user. There are obviously some systems that are very old, like ERP systems and things like that, that maybe you will never fully integrate; they’ll never have their own DevOps solution. But even then, I don’t think it’s an impossible problem to solve. The challenge is really getting in between the change and the data making its way to whatever the business-critical pipeline is. So, for example, you could do something where you say, look, I just want to have some staging table where I drop all the data from Salesforce or my ERP system, and I run a set of checks on it. If it were real time, all the better, but most of this stuff is pushed out through batch systems, so you can run a check maybe once per day, or once every few hours. And if you see any violations of the contract downstream, then you can revert to a previous version, or you could try to parse through that data and only allow whatever meets the contract at the row level through into the pipeline.
And then you can have some alert or notification for the salesperson or the business person that made the change that says, hey, something that you pushed out earlier in the day, or yesterday, was a violation of a contract, and you’re potentially causing a machine learning model to break; we’re going to need you to go in and update that. So some of this probably is going to require significant cultural change. It’s just people learning that changes you make to data have impacts elsewhere. But some of it is having the right tooling to get in between bad data arriving in a pipeline and having some messaging that goes out to these producers.
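The staging-table pattern Chad outlines (run checks over a batch, let conforming rows through, sideline violations, alert the producer) might be sketched like this. The contract rule, field names, and function names are hypothetical placeholders, not anything from a specific tool.

```python
# Sketch of the staging-table pattern: batch rows are split into conforming
# and sidelined sets, and an alert message is produced for the data producer.

def passes_contract(row: dict) -> bool:
    # Hypothetical rule: every row needs a non-empty account_id string.
    return isinstance(row.get("account_id"), str) and row["account_id"] != ""

def run_batch_check(batch: list[dict]) -> tuple[list, list, list]:
    good, sidelined = [], []
    for row in batch:
        (good if passes_contract(row) else sidelined).append(row)
    alerts = []
    if sidelined:
        alerts.append(
            f"{len(sidelined)} row(s) violated the contract; "
            "downstream models may be affected, please fix at the source."
        )
    return good, sidelined, alerts

batch = [{"account_id": "acme"}, {"account_id": None}]
good, sidelined, alerts = run_batch_check(batch)
```

Only `good` continues into the business-critical pipeline; `sidelined` is the candidate set for a dead-letter store and later backfill, matching the row-level filtering Chad describes.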
Eric Dodds 24:26
Yep. What happens when... you know, I think a lot of companies have, I would say, maybe implicit contracts, but not explicit contracts around data, especially when there’s not a centralized infrastructure or other tooling to mitigate that. How do you see that play out at a lot of companies?
Chad Sanderson 24:53
A ton of companies have implicit contracts. I call them non-consensual APIs. That’s great. Yeah. And it’s not good. It never really plays out well, honestly. I don’t think I’ve seen a single situation of those implicit contracts actually being positive for anyone downstream. But it also makes sense why they exist, right? You have some software engineer who owns a Postgres database, or a MySQL database, or something like that. They are thinking in terms of their production applications and ensuring that their applications have the right data to function. They’re not thinking about the downstream data, or the analytics, or the machine learning at all. And that’s because a lot of the tooling teams use, like ELT tools or CDC, allows data teams to not be concerned with those problems, and just say, hey, I’m just going to plug into your database, I’m going to pull your data out, I’m going to do something fancy with it, because I need to move quickly. And the engineering team says, okay, that’s cool, but just so you know, you have a dependency on me, and that’s that; I just don’t need to worry about it, you’re going to fix any issue. And that’s usually fine for the first few years that a company exists, right? Because, A, it’s very easy to be in the loop whenever an engineer makes a potential breaking change to your pipeline. And, B, people are just thoughtful, and nobody’s a jerk. And the data, I would say, isn’t valuable enough yet to really need any sort of strict data quality guidelines around it. It’s mainly for analytics, maybe for BI. Okay, if my customer churn table is down for a few hours, or maybe down for a couple of days while some analyst comes in and fixes it, that’s fine. It’s not that big of a deal.
But once you start getting to scale, now you have data engineers that are the bottleneck in a lot of cases, because you’ve got this large number of consumers and data scientists, and you have machine learning models, and those models are breaking all the time. You have all these changes that are happening, and all of those tickets get routed to this central data engineering team, and they’re spending all their time just solving tickets constantly. And it’s not fun for anybody. It’s not fun for the consumers, because they’re not having their problems addressed in a timely way. It’s not fun for the data engineers, because they’re constantly underwater, and they don’t get to do what they actually want to do, which is engineering and building things. And it’s not really fun for the data producers either, because they get yelled at every other week about something that they broke that they had no idea about. So yeah, that’s sort of how I’ve seen it typically play out. Most companies I’ve seen on the modern data stack that adopt that move-fast-and-break-things early architecture get to a point where that doesn’t actually work anymore.
Kostas Pardalis 27:52
Let’s go through an example of some data infrastructure and where the data contracts live in it, right? Let’s assume we have a typical example of a production database. Postgres generates data; of course, you want to export the data from there, so there is some kind of ETL, like CDC, whatever, it doesn’t matter. You take the data out of there, put it into the warehouse, there are some steps of transformation that happen to the data there, and you end up with some tables that can be consumed for analytical purposes. And let’s keep it to the simplest, most common scenario of analytics before we move on to talk about ML or more involved use cases. Where do data contracts fit in these environments? The reason I’m asking is because you used the words API and contracts, and in my mind, an API is always a contract between two systems, right? And in the world of data infrastructure, we actually have way too many systems that we need to work with, or make interoperate. So help me understand a little bit: where do we start putting these data contracts in such an environment?
Chad Sanderson 29:18
So, in general, we’ll start at a high level and sort of drill down to the tactical. At a high level, I think that data contracts need to exist anytime there is a handoff of data from one team to another team. That could be from the Postgres database to the data lake. It could be from the data lake to the warehouse. It could be from one team that owns a particular data model in the data warehouse to another team that consumes that model. But anytime data is handed off, and there’s some transformation happening there, there needs to be a data contract, and that sort of API input-output needs to exist. As you rightly pointed out, depending on where you’re at in the pipeline, the vehicle, the mechanism of enforcement, that the data contract takes is going to look different. If you’re trying to enforce at the production Postgres level, then you’re probably going to need something in CI/CD; you want to prevent the changes from being made before they happen as often as you can. If you have CDC and you’ve got an event bus, then you might want to do a set of enforcement there, right? We want to look at each row, and if we detect that at the row level there’s a violation of the contract, we can sideline that data, stick it into a dead-letter queue for backfilling later, and send out an alert to the data team that’s on call for that contract. The overall goal is to try to shift the ownership as far left as we can for each contract, and to make the enforcement as tactile and as embedded into the developer workflow as we possibly can. So if we’re just talking about Postgres, for example, going back to the use case, we might want to start off by defining a contract in some schema serialization framework. It could be protobuf, it could be Avro, it could be JSON Schema; I don’t have a strong recommendation there. You want to store those contracts in some type of registry.
And then there should be a mechanism for doing backwards-compatibility checks on that stored contract, and ideally on the data itself, during the actual build process. Then you can break the build, and you can send that alert and say, hey, there’s been a contract violation. That’s one example. But like I said, at each transformation stage, there are things that you can do that you can try to tie back to a producer.
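A backwards-compatibility check of the kind Chad describes might look like the following sketch, with the stored contract simplified to a plain field-to-type mapping rather than a full protobuf, Avro, or JSON Schema definition. All field names and the function itself are illustrative assumptions.

```python
# Sketch of a CI-style backwards-compatibility check: compare the stored
# contract with a proposed schema and report changes that would break
# downstream consumers. Adding new fields is backwards compatible, so
# only removals and type changes are flagged.

def breaking_changes(stored: dict, proposed: dict) -> list[str]:
    problems = []
    for field, ftype in stored.items():
        if field not in proposed:
            problems.append(f"removed field: {field}")
        elif proposed[field] != ftype:
            problems.append(f"type change on {field}: {ftype} -> {proposed[field]}")
    return problems

stored = {"shipment_id": "string", "origin": "string", "distance_km": "float"}
proposed = {"shipment_id": "string", "distance_km": "int"}  # drops origin, retypes distance

problems = breaking_changes(stored, proposed)
# In a real build, a non-empty result would fail CI and alert the producer.
```

Real schema registries (e.g. those built around Avro or protobuf) apply much richer compatibility rules, but the shape of the check, diff the proposed schema against the stored contract before the change ships, is the same.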
Kostas Pardalis 31:47
Okay, there are many different people involved, right? Probably more than the technologies involved in this whole process. So let’s not overcomplicate it; let’s at least assume two basic categories: we have the data producers and the data consumers. What’s the value that each one of them gets from implementing data contracts?
Chad Sanderson 32:15
So a lot of this comes down to the implementation. But I would say the primary value that the producer gets is awareness, if it’s implemented the right way. And with this caveat: I should say that the data contract is a really meaningful piece of technology, and it serves a very specific function, which is to define contracts and to enforce contracts. All around that core function, I think there need to exist other capabilities at an organization which add the value that you’re talking about. And I think of this as not super dissimilar from GitHub, where at its core, GitHub is a platform that facilitates source control. But all around source control, you have this other functionality that brings engineers from all across the company together: pull requests, code diffs, and things like that, which make deploying and managing code deployments in an agile way very easy for everybody. And that creates a great incentive to actually use the system. Data contracts require something similar. What we found at Convoy was that awareness was the big value for the producer. That meant understanding, if you own some upstream database, how is that data actually being used? Where is it being used? And if you’re going to make some changes, how is that going to impact someone else? The reason that this is valuable is obviously because, as an engineer, you want to build scalable, maintainable systems, and you don’t want to break them. Also, you deserve credit. If your data is being used in, let’s say, a pricing model for the company, and you ensure high data quality for your piece of the pie, and that makes the model better, then that’s something that you as an engineer really deserve credit for.
And then on the final part, it's not good if software engineers are pulled into an incident review because there was some breaking change made to a very valuable data product. So as often as we can avoid that, that would be ideal for them. Next, for the consumer, the value is really having higher-quality data, specifically for the things that are most important to them. And by that I mean, I don't think that data contracts need to apply everywhere. Not everywhere you have data, or every use case of data, requires a contract. Because contracts do add time and they do add additional effort, they should only be applied where the ROI justification makes sense. So you mentioned analytics; ideally it would be some report that adds a lot of value back to the company, like a dashboard the CEO looks at every single morning. Maybe in that case, a contract would make a lot of sense. And if you've got some data consumer that's on the hook for ensuring that the data is correct, they probably never want to be in a situation where they go into that meeting and say, oh, sorry guys, the dashboard is broken and I have no idea why. Just from a career perspective, and also from a business perspective, that's not really great. There are actually a couple more things I wanted to mention on the producer side really quickly that are very valuable. One of them is that I think contracts are bi-directional systems. So lineage, to me, is a huge part of the contract: being able to understand where the data is actually being used, what feeds into the contract, and also who is using the source data. And if it's bi-directional, it means that not only should the producer be accountable to the consumer, but the consumer has to be accountable to the producer.
So GDPR is a really great example of where this adds an enormous amount of value, right? If you're an engineer, and you're generating some data that might be audited, or you are accountable for how it's used at the company, you need to have that insight. Otherwise, it doesn't make sense to make the data available to anybody at all. So yeah, those are a couple of examples.
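The define-and-enforce core Chad describes can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not any real product's API: the field names and rules are made up, but it shows the shape of the idea, where the producer's promise is written down once and the pipeline checks every event against it.

```python
# A minimal sketch of a data contract and its enforcement (stdlib only).
# The contract maps each required field to the type the producer promises.
# All field names here are hypothetical, for illustration.

CONTRACT = {
    "shipment_id": str,
    "price_usd": float,
    "created_at": str,  # ISO-8601 timestamp, by convention
}

def enforce(event: dict) -> list:
    """Return a list of violations; an empty list means the event honors the contract."""
    violations = []
    for field, expected_type in CONTRACT.items():
        if field not in event:
            violations.append(f"missing required field: {field}")
        elif not isinstance(event[field], expected_type):
            violations.append(f"wrong type for {field}: expected {expected_type.__name__}")
    for field in event:
        if field not in CONTRACT:
            violations.append(f"unexpected field: {field}")  # unplanned schema additions surface here
    return violations
```

In practice the same idea is what a schema registry plus a serialization format gives you for free; the point of the sketch is only that the check is mechanical once the promise is explicit.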
Kostas Pardalis 36:14
Okay. That's awesome. And okay, you mentioned earlier, like Eric said, that there are always some implicit contracts, right? So let's say the company reaches the point where things being implicit is not a good thing. And I think pretty much everyone who has been working for a while has experienced that, right? It's part of the evolution of building systems. Where do we start making things explicit? And I'm asking you because you have the experience of talking with so many different teams and people who start this conversation about data products. Who usually pushes enough for this to happen? Who is the driving force behind it?
Chad Sanderson 37:06
Great question. Generally, the driving force is the data engineering team or the data platform team. The reason for that is they are the bottleneck; they're feeling a tremendous amount of pain. In most cases, this was my team, right? Every day we'd have 10, 15, 20 service desk tickets, and they all essentially followed the same pattern: something happened in a production system, and the downstream team did not have the ability to solve it themselves, so they relied on us. And we had a lot of churn for that reason. So the data engineers generally want to get out of that communication cycle between the producers and consumers, and this is a method of doing that, of breaking out of that centralized loop. In terms of where you start, it's a big cultural transition, and a lot of it depends on the company, and the use case, honestly. If you've got a use case that is unbelievably valuable to the business, then you can probably skip a couple of steps. So if you're Amazon, and you have your recommendations model or whatever, and that's making you $2 billion a year, I would guess with about 99% certainty that they have a lot of mechanisms in place to prevent that model from just breaking randomly. So that's a great starting point: is there something that's really valuable to the business? I think you can actually start directly with the producer in that case and say, hey, there are some constraints that we need, some policies that we need to implement about how data is changed. And we're actually not going to allow you to make schema changes or make significant changes to the data, because whatever feature you're building is not as important as our recommendation model; there's nothing you could create that could generate more value than that. And so therefore, we're going to block you. And that's probably a business decision.
In most other cases, what I'd say is the best thing to do is invest in this sort of awareness infrastructure. The goal is not to initiate change from the producer side on day one; it's to allow everybody in the pipeline to just figure out what's going to happen based on the changes that they make. If, as an engineer, you don't have the context of, if I do this, what's going to happen, then you can't possibly make an informed decision, nor can you take ownership of the data in the future. This is what we did at Convoy. We basically said, hey, we have a valuable use case; we want to inform, but not break. We had a GitHub bot, and if there was a change, a potentially breaking change, being made, we would use that GitHub bot to alert and say, hey, here's how the data is being used downstream. Here's the data product. Here's the SLA. Here's what's going to happen if the pipeline fails, like whether it's going to be an incident or not, and here's a person that you should go talk to, to actually work through this change. And then the producer has a choice. They can either say, you know what, I think it's fine, it doesn't really seem like a big use case, and I really need to get this thing out the door. And that's okay; they just push the change, and they're willing to deal with the results. And at worst, we can still alert the downstream consumers that a change is coming; we know exactly why, where it's coming from, and how to negotiate and deal with the problem. And in the best-case scenario, they say, oh yeah, maybe I should go have a conversation with this person, because I don't want to break them, and we come to some amicable conclusion of the contract. That's sort of answering your question in reverse there; the first part was, where do you start?
I think you ultimately have to do both, but it's much easier to start on the producer side. If you can get contracts on the producer side first, then for every transformation step below it, the owner is going to feel much more confident saying, well, my data is now under contract, and therefore I feel comfortable vending that data to someone else. If you try to go from the bottom up, you don't have that; you could still potentially be broken. And now, as a data owner, you're sort of right back to square one, where instead of the data engineer, that onus has just been shifted to whoever the data consumer is, or the analytics engineer that owns that data set.
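The "inform, don't break" bot workflow Chad describes can be sketched roughly like this. To be clear, this is a hypothetical reconstruction, not Convoy's actual code: the schema shapes, product names, and the downstream-use registry are all made up for illustration; in practice the schemas would come from a migration diff and the registry from lineage metadata.

```python
# A rough sketch of the PR bot flow: detect potentially breaking schema changes,
# then tell the producer who depends on them and who to talk to.

def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """Columns that were dropped or retyped are potentially breaking downstream."""
    changes = []
    for col, dtype in old_schema.items():
        if col not in new_schema:
            changes.append((col, f"column dropped: {col}"))
        elif new_schema[col] != dtype:
            changes.append((col, f"column retyped: {col} ({dtype} -> {new_schema[col]})"))
    return changes

def build_alert(old_schema, new_schema, downstream_uses):
    """Build the comment a CI bot could post on the producer's pull request."""
    lines = []
    for col, change in breaking_changes(old_schema, new_schema):
        for use in downstream_uses.get(col, []):
            lines.append(f"{change}; used by {use['product']}, talk to {use['owner']}")
    return lines
```

Note that nothing here blocks the merge: the bot only surfaces who depends on the column and who to talk to, which is the awareness piece rather than enforcement.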
Kostas Pardalis 41:07
Yeah, that makes a lot of sense. My question is, you know, whenever we are talking about APIs and contracts between services, usually it's a two-sided thing, right? There are two parties that have to agree on something. And I can think of engineers having a conversation and figuring out the schema, how it should look, adding or removing fields; all these are the more technical things, right? But here we are talking about, at least how I understand it so far, a process that has implications from the engineering side all the way up to the highest-level consumer of data, right? Because, as you said, there might be, I don't know, a dashboard that the CEO uses to report to the board. So when you have to communicate between the consumer and the producer to create contracts, and share that with other people, they think in very different terms, right? Even the language that they're using is different. Maybe that's a problem, or maybe, I don't know, maybe it's not as important. But how do you do that?
Chad Sanderson 42:20
Yeah, so two things. The first thing is that I think there is a maturity curve of implementing contracts at a company. And I think the curve should start with the technical producers and the technical consumers having that conversation, because at least their language is the most similar to each other; it's still different, but the most similar. And I believe the vehicle of communication is the PR. In that PR, if you can communicate, hey, this is how the data is being used, here's information about the lineage so you can see how it transforms, here's what the final data set looks like, and here are all the constraints and why we need those constraints, that is probably enough information, sort of the right level of communication, for producers and consumers to have a fruitful, productive discussion. I think that for the non-technical consumer, it's a lot more challenging to have that conversation directly with the producer. So, and again, I'm not even this far yet, but it's where I want to get to: I think that, in the same way there's this surface for conversation between the producer and the consumer, there needs to be a similar surface for conversation between the non-technical consumer and the technical consumer, where the non-technical consumer can essentially say, hey, here's what I know about the business, here's what I know needs to be true. And the technical consumer is able to translate that set of requirements into contracts that can then be fulfilled by the producer. So I think it's probably a sort of double hop of communication.
Kostas Pardalis 43:50
And how does it work with, like, a semantic layer in place? I know that at the beginning you talked about the difference between a semantic layer and a logical layer. But at least in my experience with the enterprise, where you have, you know, the Collibras of the world out there, it's a very top-down kind of situation, where the board will come and define what revenue is, and we are going to create the terminology of, this is what revenue is, and this gets spread across the rest of the organization, right? So how can these two things align?
Chad Sanderson 44:28
Right. It's very tricky. It's definitely a very tricky thing. This is going to be an unsatisfactory answer, but I think that there really need to exist levels of abstraction that are based around these, you know, fundamental engineering artifacts. I think it would be very hard to go directly from the business wanting to define some metric to then taking that and translating it down, when the foundations of trustworthy data are not there in the engineering and programmatic sense. That's why I always recommend starting off by ensuring that you have this foundational, highly trustworthy data pipeline that is defined between the technical producer and the technical consumer. And then I think there are lots of interesting ways that you can add abstraction the higher you go. Which, like I said, is sort of a non-satisfying answer, because people want to do all these interesting things with the semantic layer today. And my personal opinion is that we're kind of trying to reverse decades of bad practice, just to be frank with you. We've kind of been doing data the wrong way, where in a lot of cases we've started at the end. We started with the analytics BI tool and said, let's just very quickly get data into these really complex analytical instruments, and we can build out a lot of cool stuff and build out all our metrics and everything else. And the fundamental architecture and upstream ownership is just not there. And then we reach a point where we want to do so many more interesting things with our data; we want to have OLAP cubes and do slice and dice and have semantic layers and have these APIs, and all the great stuff, but you don't even have ownership from the source.
And so I think we need to reverse that trend, start from the top, work our way down, and then build the layers of abstraction onto that. So then, ideally, the non-technical consumer can say, hey, I have this version of margin that I would like to define, and here's how I like to define it, and that just sort of propagates back through the system. But I don't think the foundations are in place to do things like that yet.
Kostas Pardalis 46:41
I agree with that. All right. So let's talk about tooling. You've been mentioning a lot of things like GitHub, PRs, working all together, all that stuff. So if we would like to start implementing data contracts today, what do we need? What are the tools, let's say the fundamental tools, that an engineering team needs?
Chad Sanderson 47:12
So let's start from a requirements-first perspective, and then we can talk about very specific tooling. From a requirements perspective, you need some mechanism of defining a contract; a schema registry could work for that. You need a serialization framework to work in, so you need to be using protobuf or Avro or JSON Schema. And then you need some mechanism of catching backwards-incompatible changes. I wrote a whole article about exactly how you do this, but you can use Docker, you can spin up a clone of the database, and you can run a backwards-compatibility check against that during the build phase; or you can do a check against the Kafka schema registry and do backwards-compatibility checks against that. I would say that having the schema evolution pieces in place is the most foundational aspect of the contract, and the most foundational aspect of ownership in general. So if you get that in place, you're like 50% of the way there. The next big piece is how you enforce on non-schema-related data issues: semantics, cardinality, and so on. There are a few different places where enforcement makes sense. It really just depends on your use case and how the data moves through the pipeline. In Convoy's case, we had data lakes, we were doing streaming, we were using CDC with Debezium, we were already using Flink as a stream processing layer, and we were also using Snowflake. And so when you think about that spectrum of technologies, what we could do is have checks in the CI/CD layer. We thought that we could do checks in the application code on values, so if we detected that there's some value that falls outside of the constraint, we could block it there.
In the stream, we could use Flink to run some Flink SQL and have moderate checks at the row level: does this entity have a many-to-one relationship with another entity, and is that what we actually observed? If yes, great, allow it into the pipeline; if no, sideline it. And then when the data actually lands in a lake or warehouse, you can take data profiles; WhyLabs has a really cool open source tool for doing data profiles. You've got a bunch of great tools for monitoring out there, like Monte Carlo and Lightup, and elementary.io, which is an open source option. So you can do all those checks there. And then you've got the warehouse, and in the warehouse you've got Airflow, you've got dbt tests, you've got Great Expectations, and you can implement your CI/CD checks, still using the schema registry, if you're using a tool like dbt. And then you would have to do checks on batch, right? You'd run some batch process, you run all your checks there, you see if it passes or not, and then you have some system in place for either rolling back to a previous version or shunting the data to another table or something. So technically, all the tools to do this stuff already exist, right? All the open source tools are out there. It's just a matter of stringing all the pieces together so that you have the right level of enforcement in the right place. At least, that's how you would do the core data contracting technology.
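The row-level relationship check Chad mentions running in Flink SQL can be sketched in plain Python over a batch of rows. The field names here are hypothetical; in the streaming case the same logic would be expressed as SQL over a window rather than a Python loop.

```python
# Sketch of the many-to-one check described above: each child value should map
# to exactly one parent value. Rows that break the rule get sidelined, not dropped.
from collections import defaultdict

def many_to_one_violations(rows, child_key, parent_key):
    """Return the child values that map to more than one parent."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child_key]].add(row[parent_key])
    return {child for child, seen in parents.items() if len(seen) > 1}

def split_pipeline(rows, child_key, parent_key):
    """Allow clean rows into the pipeline; sideline the violating ones."""
    bad = many_to_one_violations(rows, child_key, parent_key)
    allowed = [r for r in rows if r[child_key] not in bad]
    sidelined = [r for r in rows if r[child_key] in bad]
    return allowed, sidelined
```

Sidelining instead of dropping matters here: the violating rows stay available for the producer and consumer to inspect when they negotiate the contract.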
Kostas Pardalis 50:41
Yeah, it makes a lot of sense. All right, cool. One last question from me, and then I'll hand the mic back to Eric. So we've been talking all this time, and I think we were equally talking about technology and people, right? People are always involved at the end; you have to take the people and educate them, or, I don't know, change the way that they do things, and all that stuff. And I agree: at the end, to have a contract that we employ, we have to agree on things. And I know that you are very active in building a community around this stuff. So I'd like to ask you about that: how important is education in all of this, and how does a community act as a vessel for these changes to happen? Share your wisdom with us about the community, because it's a very interesting
Chad Sanderson 51:42
topic. Yeah, totally. So I think community is critical here, for a couple of reasons. The first is this: one of the things that I've heard a lot from people that read my content is, wow, you're saying things that seem so obvious in retrospect. Of course you can't solve data quality unless the producer gets involved. How could you possibly do it otherwise? Garbage in, garbage out doesn't make any sense if the garbage is already in, right? You have to prevent it from getting in in the first place, and there's only one way to do that, and that's to start from the source. And I think part of the community is giving people the weapons, maybe that's not the right word, but giving them the tools to have the appropriate conversations with their producers or with their consumers. Oftentimes, data engineers and data platform engineers are so in the weeds, focusing on the day-to-day work, that it's hard for them to take a step back and figure out, how do I have these conversations in the bigger sense? And this is something I think community is really useful for. It's like saying, oh wait, I can actually contextualize all the problems I'm having in this larger narrative about the company. Why is data set up the way that it is? How is data quality affected by these various pieces of the business working together? And how can I speak to that and propose changes that actually make more sense? The other reason I think community is valuable, for at least talking about data contracts, is because, as you said, historically these types of problems have been treated as purely organizational, right? We need to make some organizational shifts. You hear a lot about data mesh, and data mesh is an organizational thing; it's like, we need to restructure our organization.
So you have better ownership of data objects and domains, which I don't think is entirely necessary; people already have enough problems to solve, and they really hate that kind of heavy process getting in the way of their work every day. It has been this really heavy organizational process, and part of getting in front of people is saying, actually, you will always need some element of cultural transition, but technology can really help, because technology makes that cultural transition easy, right? If it's easy for the producers to understand how the data is being used, it's easy for them to take ownership, and if it's easy for the consumer to define what they need, then people will do it. The bottleneck is that people don't do the right thing if doing the right thing is very hard and it takes away from their primary work. So I think that's another great message to spread through the community: helping people overcome the traumas of the past, where they've tried to do this stuff before and they've just gotten smacked down by the fist of reality. It's like, well, you've got to understand the reason you failed. The reason that you got smacked down is because you were asking the business to do this massive cultural change, and it wasn't really tied to business value, and it would have taken a year and a half or two years, and you had to involve the entire organization, instead of doing it iteratively and programmatically and very efficiently. So I think the community is really great for sharing stories like that, and for just helping people think through these types of issues.
Kostas Pardalis 54:52
This is great. Eric, any questions?
Eric Dodds 54:57
We're close to the buzzer here, but that was really helpful, practical advice. Yeah, it is so funny. I mean, you said this at the beginning of the show: it's almost cathartic to think about a full reset, you know? Let's just do a full reset and build all this stuff from the ground up or whatever. But that's not actually reality. But another question, I think, on the practical side to close us out here. The cultural side, I think, was really helpful. On the tooling side, it seems like there's a bit of a gap, and you described it really well, Chad, when you talked about, okay, well, you're a small company, and it's okay if a pipeline breaks and someone's dashboard goes down, so they send a Slack message, hey, something's wrong here? Oh yeah, let me look at that; okay, you get it up and running the next day. It's not like the company's losing money because this data flow or pipeline broke. But in that environment, you inevitably accrue a bunch of debt that you're going to have to pay back at some point. And it's interesting, because those smaller teams don't often have the resources to implement dedicated tooling around APIs or data contracts or whatever, right? How do you approach that? Thinking about our listeners who are maybe at smaller companies: they may be working on the cultural side, but from a tooling standpoint, it's like, well, I'm definitely not going to get the budget to go buy a really nice, dedicated tool for some of this stuff, but I also don't have the bandwidth to start building some of this stuff internally. Where should they start? What should they think about?
Chad Sanderson 56:52
So I think if you're at a small company, the best thing that you can do is try to be in the loop whenever producers are making changes to things, and just establish a good relationship with those teams, right? So if there's a meeting, explain, hey, we have some important data; can you just invite me whenever you're talking about making a major database change? Just loop me in. Let's put together a dedicated Slack channel: if you have changes that impact your database at all, push all the alerts to that Slack channel, so at least I'm notified and I can ask you a question. I think it really is about getting in the loop and having the conversation, if you don't have the resources for things like tools, or open source technology, or building something. And I think the point of transition starts to come when there is some data asset where, if you have incremental data quality, you start to experience incremental value back to the business, measurable value back to the business, right? So I've got, maybe, a machine learning model, and it's a relevant model, and it's running every day, and I know it's making us money, and we're having to drop 10-20% of the data due to null values, and those null values are being caused by issues in upstream systems. And you say, okay, if I'm just able to solve this one problem, this very small slice of a pipeline, by getting a contract on maybe one schema, or maybe even one or two columns upstream, and I can say, hey, I was able to reduce the amount of nulls flowing into this table by 25-30%, and I can connect that and say, hey, there's some real-world ground truth, and we're making better predictions, and now our model is making more money, then you have just justified why data quality is a meaningful investment.
What too many teams do, I've found, is they try to take this very holistic approach and say, well, we need data quality everywhere, we need monitoring everywhere, we need checks on everything. And number one, that leads to alert fatigue almost 100% of the time, like I said. The metaphor that I've used before is: if your house is on fire, you don't need a fire detector, you need the firemen, right? You already know the house is on fire; you don't need a bunch of alerts to tell you that it's burning, you need someone to come and solve the problem. And so if you have a million different alarms going off, it actually numbs you and desensitizes the teams to data quality issues, which is a bad thing. So you need to focus on a smaller piece of the problem that's manageable, that's iterative, that's not going to be a massive cultural shift for the producers, and where you have clear business value. This is exactly what we did at Convoy, and I will be honest and say I didn't start out doing that; I had to learn that this was the right approach. I took the big, wide approach at first, and that totally bombed out and completely failed. And then when I switched to the smaller, narrower approach, we just got so much more traction. And the great thing was, because it wasn't as large of a lift on the producer side, the engineers got to familiarize themselves with these processes. And it turned out they were like, wait a second, this is just integration testing. This is just CI/CD. This is just an API for data. Of course we should be doing this. Why aren't we doing this? And in fact, at some point, and maybe a lot of listeners will find this hard to believe, the conversation actually flipped. So instead of the consumers going to the producers and saying, hey, I need you to take ownership over this stuff,
it was the producers going to the consumers and saying, hey, I have some data here. Is it useful to you? And if it is, how do I put a contract around it? So I think that you just have to give people time and space, and allow them to see the successes one by one, and not try to rush it and solve all the problems in the world in one single project.
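The null-reduction win Chad describes, drop rate before versus after the contract, is easy to measure and report back to the business. A minimal sketch (the column names are hypothetical):

```python
# Sketch of measuring the null rate on a contracted column, so the improvement
# can be stated in concrete terms when justifying the investment.

def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)

def relative_reduction(before, after):
    """E.g. going from a 20% null rate to 15% is a 25% relative reduction."""
    return (before - after) / before
```

This is the "small slice" version of monitoring: one column, one number, tied directly to a model that makes money, rather than checks on everything everywhere.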
Eric Dodds 1:01:09
I love it. I think that is so well said, Chad. This has been such a helpful episode; even for the work that I do every day in my job, there's just so much here to implement right away. So thank you, thank you for joining the show. If people want to check out the community, where should they go?
Chad Sanderson 1:01:31
So you can go into your browser right now and type in dataquality.camp/slack, and you'll get redirected to the Slack channel. It's Slack, so it's totally free. And right now it's mainly a community for networking and finding peers who are in the data space. There are lots of people there who are heads of data science at big companies, heads of data engineering, heads of data platform, and they're all talking about how they're implementing data contracts, and monitoring, and data quality of all sorts. But later in the year, maybe the middle of the year, we're going to start working on some other things like in-person events and meetups, training courses, stuff like that. So there's a lot planned.
Very cool. Well, keep us posted on the books as well, and we'll have you back on to talk about whichever one you publish first. That would be fantastic. Great talking to you, folks. Thanks. All right, that was a fascinating conversation with Chad Sanderson, who runs Data Quality Camp, which is a community, and produces a ton of content. We covered so many topics, and I think the things he kept returning to over and over again were incredibly helpful: there's just so much practical stuff for people to get from the show.
Eric Dodds 1:02:49
I felt like I could walk away from the show with practical things that I could start doing tomorrow to make data quality better. And I think that was really refreshing, because the conversation around data quality can feel really big, right? It's a huge problem. How do you fix it? We have so much tech debt. What tools do I use? Where do you solve the problem in the pipeline? Do you try to do things proactively with schema management? Do you try to do, you know, sort of passive detection? I mean, there are so many things, and I walked away, especially after that ending, with a couple of practical things in my mind: I should probably go do this tomorrow to make our data quality better. There are small things that I can do. And so I think both for the listeners who are data leaders, and the ones who are doing the work on the ground every day, this was a hugely practical, helpful episode in terms of what you can do tomorrow to start improving data quality. We also talked about philosophy, which is always fun.
Kostas Pardalis 1:04:03
Absolutely. What I will keep from the conversation that we had with Chad, and what I found very refreshing and interesting, is that he's giving a definition of what data quality is from the perspective of, let's say, the agents that are involved in the process of working with data, and not trying to give an objective definition of, oh, you have this metric and that metric, something that a machine automatically checks off a box, right? And I think that's the most important thing here: at the end, no matter what, data is information, and we have to agree on how we use it. And I think the big change that Chad brings with his ideas, and what I find most interesting, is that he's not keeping it abstract. It's not like an abstract thing where an organization has to go and hire hundreds of consultants to coach you on how to do it. He tells you that you can do it today; the tooling is out there. And he positions technology in a very interesting way, on what the role of technology is in making this happen. So I think people should be encouraged to listen and re-listen to these episodes, because there's a lot of wisdom in the things we discussed: both the technology parts that can be used, and also the importance of people in the organization implementing processes around data.
Eric Dodds 1:05:48
I agree. Well, thank you again for joining The Data Stack Show. We will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.