Episode 183:

Why Modern Data Quality Must Move Beyond Traditional Data Management Practices with Chad Sanderson of Gable.ai

March 27, 2024

This week on The Data Stack Show, Eric and Kostas chat with Chad Sanderson, the CEO at Gable.ai. During the episode, Chad discusses the complexities of managing the data supply chain, emphasizing the importance of data quality, feedback loops, and aligning incentives within organizations. He shares his journey from analyst to data infrastructure leader at companies like Oracle, Sephora, and Microsoft. Chad introduces his company, Gable, which tackles upstream data quality issues. He critiques traditional data catalogs and advocates for a more dynamic, decentralized approach. The conversation explores the role of metadata, the integration of data quality checks in the software development lifecycle, the need for cultural shifts towards data responsibility, the significance of full lineage graphs and semantic metadata, treating data as a product with quality gates, and more.

Notes:

Highlights from this week’s conversation include:

  • Chad’s background and journey in data (0:46)
  • Importance of Data Supply Chain (2:19)
  • Challenges with Modern Data Stack (3:28)
  • Comparing Data Supply Chain to Real-world Supply Chains (4:49)
  • Overview of Gable.ai (8:05)
  • Rethinking Data Catalogs (11:42)
  • New Ideas for Managing Data (15:16)
  • Data Discovery and Governance Challenges (18:51)
  • Static Code Analysis and AI Impact on Data (24:55)
  • Creating Contracts and Defining Data Lineage (27:31)
  • Data Quality Issues and Upstream Problems (32:32)
  • Challenges with Third-Party Vendors and External Data (34:29)
  • Incentivizing Engineers for Data Quality (40:28)
  • Feedback Loops and Actionability in Data Catalogs (45:30)
  • Missing Metadata (48:57)
  • Role of AI in Data Semantics (50:27)
  • Data as a Product (54:26)
  • Slowing Down to Go Faster (57:38)
  • Quantifying the cost of data changes (1:01:24)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here with Chad Sanderson. Chad, you have a really long history working in data quality and have even founded a company, Gable.ai. So we have so much to talk about, but of course, we want to start at the beginning. Tell us how you got into data.

Chad Sanderson 00:47
Yeah, well, great to be here with you folks. Thanks for having me on again; it’s been a while, but I really enjoyed the last conversation. In terms of where I got started in data: I’ve been doing this for a pretty long time, starting as an analyst at a very small company in northern Georgia that produced growth parts, and then ending up working as a data scientist within Oracle. From there, I kind of fell in love with the infrastructure side of the house. I felt like building things for other people to use was more validating and rewarding than trying to be a smart scientist myself, and I ended up doing that at a few big companies. I worked on the data platform team at Sephora and Subway, the AI platform team over at Microsoft, and most recently I led data infrastructure for a freight tech company called Convoy.

Kostas Pardalis 01:47
That’s awesome. By the way, it’s not the first time that we have you here, Chad, so I’m very excited to continue the conversation from where we left it. Many things have happened since then, but one of the things that I really want to talk with you about is the supply chain around data and data infrastructure. There’s always a lot of focus either on the people who are managing the infrastructure or on the downstream consumers, right, the analysts or the data scientists. But one of the parts of the supply chain that we don’t talk about much is going more and more upstream, to where the data is actually captured, generated, and transferred into the data infrastructure. And apparently many of the issues that we deal with stem from there. There are organizational issues; we’re talking about very different engineering teams involved there, with different goals and needs. But in the end, all these people and these systems need to work together if we want to have data that we can rely on, right? So I’d love to get a little deeper into that and spend some time talking about why it matters, the issues there, and what we can do to make things better. That’s one of the things I’d love to hear your thoughts on. What’s on your mind? What would you like to talk about?

Chad Sanderson 03:20
Well, I think that’s a great topic, first of all, and it’s very timely and topical. The modern data stack is still, I think, on the tip of everybody’s tongue, but it’s become a bit of a sour word these days. I think there was a belief, maybe five to eight years ago, that by adopting the modern data stack, you would be able to get all of this utility and value from data. And to some degree, that was true. The modern data stack did allow teams to get started with their data implementations very quickly, to move off of their old legacy infrastructure very quickly, to get a dashboard spun up fast to answer some questions about their product. But maintaining the system over time became challenging. And that’s where the phrase that you used, the data supply chain, comes into play. This is the idea that data is not just a pipeline, it’s also people, and it’s people focusing on different aspects of the data. An application developer who is writing to a transactional database is using data for one thing. A data engineering team that is extracting that data and potentially transforming it into some core table in the warehouse is using it for something different. A front end engineer who’s using, you know, RudderStack to emit events is doing something totally different, and the analyst is doing something totally different. And yet all of these people are fundamentally interconnected with each other. That is a supply chain. And this is very different, I think, from the way that software engineers on the application side think about their work. In fact, they try to become as modular and as decoupled from the rest of the organization as possible so that they can move faster. Whereas in the data world, if you take this supply chain view, decoupling is actually impossible; it’s just not feasible, because we’re so reliant on transformations by other people within the company. And if you start looking at the pipeline as more of a supply chain, then you can begin to make comparisons to other supply chains in the real world and see where they put their focus. As a very quick example, McDonald’s is obviously a massive supply chain, and they’ve spent billions of dollars optimizing that supply chain over the years. One of the most interesting things that I found is that when we talk about quality, McDonald’s tries to put the primary burden of quality onto the producers, not the consumers. Meaning, if you’re a manufacturer of the beef patties that are used in their sandwiches, you are the one doing quality at the patty creation layer. It’s not the responsibility of the individual retailers and the stores that are putting the patties on the buns to individually inspect every patty for quality; you can imagine the kind of cost and efficiency issues that would lead to in a business where the focus is speed. And so the patty suppliers, the stores, and McDonald’s corporate have to be in a really tight feedback loop with each other, communicating about compliance and regulations and governance and quality, so that the end retailer doesn’t have to worry about a lot of these issues. The last thing I’ll say about McDonald’s, because I think it’s such a fascinating use case, is that the suppliers actually track, on their own, the patty needs, meaning the volume requirements, for each individual store.
So when those numbers get low, they can automatically push more patties to each store when needed. It’s a very different way of doing things, having these tight feedback loops, versus the way that I think most data teams operate today.

Kostas Pardalis 07:26
Yeah, makes sense. Okay, I think we’ve got a lot to talk about already. What do you think?

Eric Dodds 07:33
Let’s do it. We love having guests back on, especially when they’ve tackled really exciting things in between their first time on the show and their second time. And you founded a company called Gable.ai, so we have tons to talk about in terms of data quality generally, but I do not want to keep our listeners on the edge of their seats for the whole hour. So give us the overview of Gable.

Chad Sanderson 08:04
Yeah, so Gable is really trying to tackle a problem that I’ve personally experienced in basically every role in my career. Every time I started at a new organization, my focus as a data leader was to understand the use cases for data in the company and start to apply data management best practices, beginning with my immediate team of analysts and data scientists and data engineers. We would always go through that process, and at some point we would still be facing massive quality, compliance, and governance issues. And that’s because I found that a significant number of these quality issues were coming from upstream data producers that just weren’t aware of my existence. As time went on, I found that these producers were not averse to supporting us, but they did not have the tool sets to effectively do that. Oftentimes it required me explaining to them how data worked, or trying to get them to use a different tool outside of their stack, or, you know, saying, hey, here’s the data catalog, and I want you to look at it anytime you make a change to ensure you’re not breaking anything for anybody. And this is just very hard and complex. So we developed Gable.ai to act as the data management surface for data producers. It’s something that any engineer or data platform manager can use to, number one, understand the quality of their data coming from the source systems. Number two, they can create contracts, whether one sided or two sided, around the expectations of that data. And number three, they can protect themselves from changes to the data. That might mean data that is already in flight: maybe I’m consuming an API from a third party provider and they decide to suddenly change the schema out from under me. We want to be able to detect that change before it causes an impact on the pipelines. Or it could mean someone making a change to the actual code. Maybe there’s some Python function in code that is producing data, and the software engineer making a change just doesn’t know that it’s going to cause an impact downstream. We want to be able to catch that using the tools that engineers already leverage, like GitHub and GitLab, and stop it, or at least give information to both sides that a change is coming. So yeah, that’s basically how the tool works. That’s Gable.ai, and that’s the high level problem we’re trying to solve.
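
To make the contract idea concrete, here is a minimal sketch, in Python, of the kind of one-sided check being described: a consumer’s expectations of an upstream feed, applied to records in flight. The contract format and field names are invented for illustration and are not Gable’s actual API.

```python
# A minimal, hypothetical one-sided data contract: the consumer's expectations
# of an upstream feed, checked against records in flight. Not Gable's API.
EXPECTED_SCHEMA = {
    "user_id": str,
    "event_name": str,
    "occurred_at": str,  # ISO 8601 timestamp, per the agreed contract
}

def validate_record(record: dict) -> list[str]:
    """Return the contract violations found in one incoming record."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields the producer added without evolving the contract count as drift,
    # so both sides hear about the change before it lands in the pipeline.
    violations.extend(
        f"undeclared field: {field}"
        for field in record.keys() - EXPECTED_SCHEMA.keys()
    )
    return violations

# A third party changed the schema out from under us:
print(validate_record({"user_id": 42, "event_name": "signup", "plan": "pro"}))
```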

Eric Dodds 10:45
Awesome. Well, I have some specific questions about Gable.ai; I want to dig into the product a little bit more, especially since you chose the .ai URL, and I want to dig into the reason behind that, because I know it’s intentional. But let’s zoom out a little bit first. One of the things we were chatting about before we hit record was the traditional way of doing things, and you mentioned this term data catalog, right? It’s a huge buzzword. There are entire companies formed around this concept of a data catalog today. We were chatting a little bit about how there are certain concepts that have been around for a long time, like the data catalog, but maybe they aren’t necessarily the right way to solve modern day problems. So why don’t we just talk about the data catalog? Do you think it’s one of those concepts that we should retain? Because there are certain things, historically, that are good for us to retain. But things change, right? So maybe we don’t need to retain everything.

Chad Sanderson 11:53
Yeah, I think a catalog is one of those ideas that conceptually makes an enormous amount of sense on the surface. If I have a large number of objects, and I want to go searching for a specific object in that pile, having a catalog that allows me to quickly and easily find the thing that I need makes a lot of sense. But like you said, I think this is an older idea that’s based around a very particular organizational model. The original concept of the data catalog, back from the on-prem days, was actually taken from the library, where you have an enormous number of books, someone comes into the library trying to find something specific, and they go to a computer, or they open one of those very old school documents, a literal card catalog. From there they can search and try to find what they need. But this requires a certain management model of the catalog itself, right? You’ve got librarians, people who know all of the books in the library. They maintain the catalog, they’re very careful about what they bring in and out, they’re curating the catalog itself, and they can add all of the relevant, quote unquote, metadata that gives people the information they need. And this was also true in the on-prem data world. When you had data architects and data stewards, you had to be very explicit about the data that you were bringing into your ecosystem. You had to know exactly what that data was, where it came from, what it was going to be used for. And the catalog that you then provided to your consumers was this very narrow, curated list of all of the data that could possibly exist. But in the modern data stack, it’s not like that. It’s more that you’ve got your data lake, and that is a dumping ground for thousands or hundreds of thousands of data points. There really is no curation anymore. So what happens in that world? I do think that the underlying model needs to change.

Eric Dodds 14:12
It makes total sense. And digging in on that a little bit more: we think about the data lake, and of course there are tons of memes around it being a data swamp, and we’re collecting more data than we ever have before. What are the new ideas that we need to think about in order to manage that? Because what’s attractive about a data catalog, I guess you could say, is that you have, call it a single source of truth, or a shared set of definitions, whatever you want to call it, that people can use as a reference point. And like you said, the producers had to engineer all of that stuff, right? They basically designed from a spec, and that, essentially, is your data catalog. But when you can just point SaaS pipelines from any source to your data lake or your data warehouse, it’s this crazy world where, could a data catalog even keep up? So what are some new ideas for us to operate in this new world?

Chad Sanderson 15:16
Well, I think it’s a question of socio-technical engineering. So, funnily enough, there is sort of a modern day library, which I would say is Amazon. Jeff Bezos’s whole original idea was that it was a bookstore on the internet. But it was different from a typical library, because it was totally decentralized. There wasn’t one person curating all the books in the library; the curation actually fell to the sellers of those books. And what Amazon did is they built an algorithm based around search, a ranking algorithm. That ranking algorithm would elevate certain books higher in search based on their relevancy and the metadata that these curators, the book owners, would actually add. And there’s a really strong, powerful incentive for the owners of each book to play the game right, to participate. Because if they do a good job adding their context, their book ranks higher, which means more people pay them money. And the same is true for any ranking-algorithm-based system like Google or anything else, right? You’re incentivizing the people who own the websites to add the metadata so that they get surfaced in search more often. I think this paradigm is what a lot of the modern cataloging solutions have tried to emulate: let’s move more toward search, let’s move toward machine-learning-based ranking. But the problem to me is that it hasn’t captured that socio-technological incentive. The Amazon book owner’s incentive is money. The Google website owner’s incentive is, you know, clicks, or whatever value they get from someone going to the website. What is the incentive of a data analyst or a data scientist to provide all of that metadata to get their particular asset ranked? Is that even something they want at all? Because if they’re working around a data asset, do they want to expose that to the broader organization? If they have thousands of people now taking a dependency on it, it becomes part of their workload to support it, which they may not want to do, nor have the time to do. So I think the incentives are not aligned, and in order to exist in this federated world, there has to be a way to better align those incentives. I think that’s what needs to change.

Eric Dodds 17:50
Well, okay, you brought up two concepts in there, and I’m going to label them, but let me know if I label them incorrectly. There’s this concept of data discovery. I think the point about search is really interesting, right? You have this massive data lake, and you have a search-focused, data-catalog-type product that lets you find things, and then you can apply ranking, etc. But in many ways, that’s data discovery. The bookseller on Amazon is trying to help people who like, you know, murder mystery fiction discover their work, which is great. That is certainly a problem. But when you think about the other use of the data catalog beyond just discovery, there’s a governance aspect, right? Because there are these questions of: we found something that is not in the catalog, should it be in there? Or something in the catalog has changed, or we need to update the catalog itself. So how do you marry those two worlds? And I mean, I agree that catalog really isn’t even the right way to think about that, because discovery and governance, or quality, or whatever labels you want to put on that side of it, are extremely different challenges.

Chad Sanderson 19:18
Yeah, I think that’s exactly right, and I think they have very different implications as well. I do think that a great discovery system has to solve a couple of problems. The first is that really great discovery actually requires more context than a system built on top of downstream data alone is able to provide. If I’m a data scientist or an analyst, and I was at one point in my career, what I really wanted when I was doing a search for data was to understand: What does this data mean? Who is using it, which is an indication of trust? Where is it coming from? What was its intended purpose? Can I trust it at all? And how should I use it? These are the big categories of questions that I wanted answered. If a data catalog is simply scraping data from, you know, a Snowflake instance, then putting a UI on it, putting it into a list, and letting people look at the metadata, it’s only answering a small subset of those questions. It answers: what is the thing, can I find something that matches the string I typed into a search box? But all the other questions I now have to go figure out basically on my own, by talking to people, potentially talking to engineers, trying to trace this to some code-based resource or some other external resource. And that lowers the utility of the catalog by quite a bit. And then there’s the governance side that you mentioned. Governance, and quality, is really interesting, like I implied before, because in a supply chain universe, the quality and the governance burden is going to be on the producer. That’s really the only way. And if the governance is going to be on the producer, that means the producer needs an incentive to put that governance in place. And today it’s very hard as a producer to even know who is taking a dependency on the data that you are generating. You don’t know how they’re using it, and therefore you don’t even know what metadata would be relevant for them. And you may not even want to expose all of that metadata, like I mentioned before. So to your earlier point, I think catalog is probably not the right way of framing the problem, at least to me anyway. If I could frame it differently, it may be more around inventory management. And that’s more of the supply chain take than the old school take.

Eric Dodds 22:12
Yeah, absolutely fascinating. Actually, I’d love to dig into the practical nature of Gable.ai really quickly, because I think it’s interesting to talk about the supply chain, and maybe a fun way to do it: you and I recently talked about some of the data quality features that RudderStack recently implemented, right? I think it’s a good example, because they’re a very small slice of the pie. They’re designed to help catch errors in event data at the source, right at the very beginning. So you have events being emitted from some website or app, and you can define schemas that allow you to say: look, if this property is missing, drop the event, propagate an error, send it downstream, whatever, right? First of all, would you consider that sort of a producer, a source? How does that orient us to Gable? Where would the RudderStack data source sit? Is that a producer?

Chad Sanderson 23:13
Absolutely, I think that RudderStack would be a producer. Pretty much the way I’ve thought about it is that there are really two types of producer assets, or maybe three. There are code assets. There are structures of data, such as schemas and things like this. And then there are the actual contents of the data itself. And like you said, there are lots and lots of different slices of this problem. The events that you’re emitting from your application, like with RudderStack, are one area where you need this type of coverage. Like I said, there are APIs that you ingest, you’ve got back end events, you’ve got custom front end events, you’ve got C# and .NET, and this very wide variety of other things. So I think everything you talked about in the RudderStack webinar, being able to check the data live as it’s flowing from one system to another, doing schema management, all of that is totally relevant to what Gable.ai is working on as well. We are also trying to look at things like: can we actually analyze the code pre-deployment and figure out if a change coming through a pull request is going to cause a violation of the contract, where the contract is just an expectation of the data from a consumer? And there is some level of sophistication there. We have, for example, static code analysis that crawls an abstract syntax tree. We can basically figure out, when a change is made, what are all of the dependencies in code that power that change, what are all the function calls, and if any function call is modified anywhere in that syntax tree, we can recognize that it’s going to impact the data in some way. And then in addition to that, and this is where I think things get really cool, we can layer on artificial intelligence. So not only would we know how different changes within that syntax tree can affect the schema, we can also know how they affect the actual contents of the data before the changes are deployed. An example of that, and this is typically a very difficult thing to catch pre-deployment: let’s say I have a date time field, and it’s something like DateTime.now, and a product engineer decides to change that to DateTime.utcnow. If you’ve been in data for any amount of time, you know this is a very common change for engineers to make, but it represents an enormous amount of difficulty to detect and account for in all the places the data flows. In CI/CD, not only could we identify that the change is going to happen, but we could actually understand that it is changing to UTC and then communicate that to everyone depending on that data. That allows the consumer to either say, okay, I’m going to prepare all of my queries for UTC from now on, or, if it’s a really important thing, you might say: hey, software engineer, I want to give you some feedback that you’re going to cause an outage for different teams, so please don’t make this change right now. So that’s one big part of the platform: shifting left, trying to catch problems closer to the source as a part of DevOps.
And then the other side of it is, like you said with RudderStack, we try to catch stuff in flight as well. So if someone has made a bunch of changes, and there are a lot of changes coming through in files that land in a Postgres database or in S3, we scan those files individually, map them back to the contracts, and then we can send signals to the data platform team to say: hey, there’s some bad data coming through, now’s your opportunity to get in front of it so that it doesn’t actually make its way into the pipeline.
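
As an illustration of that static analysis step, here is a toy sketch using Python’s ast module to flag the exact change from the example above: a timestamp call switching from local time to UTC. A real implementation would crawl the full dependency tree of the change; the names and the diff logic here are simplified assumptions.

```python
# Toy static check: parse two versions of a file and flag changed timestamp
# calls. This only parses source text, so nothing is executed or imported.
import ast

WATCHED_CALLS = {"now", "utcnow"}  # timestamp producers we care about

def timestamp_calls(source: str) -> set[str]:
    """Return dotted call names like 'datetime.utcnow' found in source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in WATCHED_CALLS
        ):
            found.add(ast.unparse(node.func))
    return found

before = "def created_at():\n    return datetime.now()\n"
after = "def created_at():\n    return datetime.utcnow()\n"

changed = timestamp_calls(before) ^ timestamp_calls(after)
if changed:
    # A real system would map this function to its downstream consumers via
    # the lineage graph and notify them before the pull request merges.
    print(f"timestamp semantics changed: {sorted(changed)}")
```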

Eric Dodds 27:30
Yep. I want to drill down on that just a little bit more, and I’m going to give you an example of a contract; please feel free to trash it and pick something else. Let’s take a contract around a metric, like active users, right? Of course, that’s one of those definitions where you ask around a company and you get five different answers, and we need to turn it into a contract so that all the reports downstream are using the same metric, whatever it is. And maybe RudderStack event data is a contributor to that definition, based on a timestamp of some user activity. But there are tons of other ingredients in that metric. Maybe you need to roll it up at an account level, so you either need a copy of the Postgres production database so you can make that connection, or, you know, Salesforce or whatever it is. Maybe you need subscription data from a payment system so you know what plan they’re on, so you can look at active users by all those different tiers. So, we have that contract in Gable.ai. Can you describe the way that you might wire in a couple of those other pieces beyond just the event data? Because I think the other interesting thing is, when we think about data quality at RudderStack, we’re just trying to align to a schema definition. But the downstream definition in a contract may actually interpret that in the business context, as opposed to just flagging that there’s a difference in the schema, right?

Chad Sanderson 29:10
Yes. So I think there are two different ways to think about this. One way, and the way that I usually recommend people think about this problem, is to start from the top down. There are a couple of reasons for that. It can be organizationally very difficult to ask someone downstream to take a contract around something like a transformation or a metric in Snowflake or BigQuery if the inputs to that contract are not under contract, right? That feels a bit scary. It’s like, I am now taking accountability for something that I may not necessarily control. And so there’s oftentimes pushback to that, which is why I usually say that the best place to start with contracts is from the sources first, and then waterfall your way down. The second piece of that is, I think there’s a longer term horizon on this stuff where everything I just said doesn’t apply, which is starting to integrate more concepts around data lineage into contract definition. So let’s say that I have this metrics table, and I want to put a contract on it, but nothing upstream exists. In the ideal world, you would be able to say: I want these contracts, and now I want some underlying system to figure out what all of the sources are, end to end. I want to create almost like a data lineage in reverse. And then I simply want to either ask for a contract, or start collecting data on how changes to those upstream systems are ultimately going to affect this transformation of mine downstream. This is something we hear a lot, where teams basically say: I want contracts, but I don’t really have the social or political capital with my engineering team to tell them what to do without evidence, and I would like to just collect that data first. So I think that’s the other path: being able to construct that lineage, understand how things are changing, collect the data and create the evidence for the contracts, and then implement them from there.
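
As a sketch of that “lineage in reverse” idea, here is a toy traversal that walks upstream from a metric to every source that would need a contract. The graph and dataset names are invented for illustration, loosely following the active users example.

```python
# A toy sketch of "lineage in reverse": given producer -> consumer edges,
# walk upstream from a metric to find every source that would need a contract.
from collections import deque

EDGES = {  # producer -> list of consumers (all names invented)
    "app_events": ["warehouse.raw_events"],
    "postgres.accounts": ["warehouse.dim_accounts"],
    "billing.subscriptions": ["warehouse.dim_plans"],
    "warehouse.raw_events": ["warehouse.fct_activity"],
    "warehouse.dim_accounts": ["metrics.active_users"],
    "warehouse.dim_plans": ["metrics.active_users"],
    "warehouse.fct_activity": ["metrics.active_users"],
}

def upstream_of(target: str) -> set[str]:
    """Every node that feeds `target`, directly or transitively."""
    reverse: dict[str, list[str]] = {}
    for producer, consumers in EDGES.items():
        for consumer in consumers:
            reverse.setdefault(consumer, []).append(producer)
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        for parent in reverse.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Ask for contracts (or start collecting change evidence) on each of these:
print(sorted(upstream_of("metrics.active_users")))
```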

Eric Dodds 31:33
Love the phrase around taking responsibility for something that’s not under contract. Okay, I actually have a question. I know Kostas has a ton of questions, but I have a question for you and for Kostas. When we think about contracts, I brought up the example of active users, but it could be any number of things. Kostas, you’ve been a practitioner, you’ve used a bunch of data tooling: how fragmented are the problems with data quality? Maybe we could think about the 80/20 rule here. Part of the reason I want to ask is because, even in the work that I do with analytics and things like that, you always wonder: man, this is kind of messy, I wonder what it’s like at other companies. Is it the same set of problems? Is it really fragmented? Does the 80/20 rule apply, where you need a certain set of contracts and they’ll take care of 80% of the problems? What have you seen? Chad, maybe start with you, and then Kostas, I’d love your thoughts as well.

Chad Sanderson 32:32
So a lot of it depends on where the company is getting its data from. The numbers that I have seen are that 50 to 70% of data quality issues come from the upstream source systems, the data producers. That’s the most typical range that I’ve heard. Now, within that, I think there’s a pretty wide variety of problems. For example, changes to databases are not really that much of a problem, and the reason they’re generally not a problem for data teams is that engineers don’t do a lot of backwards incompatible stuff, because they’re scared of deleting columns that others are using. But there’s still a quality problem there, which is: as a software engineer, maybe I’m just going to add a new column that contains data from the old column, and I don’t communicate that downstream. So that’s an issue. Then on the actual business logic code side of the house, this is where we hear issues with the data content, like that date time UTC change I mentioned. We also hear a ton of problems around third party vendors, especially schema changes, and that’s because they’re really under no obligation not to make those changes. A lot of the legal contracts between companies don’t account for changes to the actual data structures themselves. The SLAs are more about what the actual service is, not whether this data will suddenly change from today to tomorrow. So depending on where companies have built the majority of their data infrastructure, you’ll see a very different split in which upstream problems are causing the most issues.

Kostas Pardalis 34:30
Yeah, I think that describes it very well. And it probably gets even more complicated when we start considering all the different roles out there that can make changes to a database schema, right? For example, let’s say you’re using Salesforce. Salesforce, at the end of the day, is a user interface on top of a database. You have people there who can go into a table, which they don’t see as a table, they see a list or whatever, and make changes to it, right? And these changes can propagate down to the data infrastructure and all that stuff. That’s what I find very interesting in what Chad was saying about the catalog, because, yeah, sure, back then we had a very narrow set of producers that were under a lot of control by someone. But now pretty much every system that we use in the company to do something is potentially a data producer. And the people behind those systems are not necessarily data people, or even engineers, right? They can be salespeople, or marketing people, or HR people, or whatever. I don’t think anyone can, let’s say, require them to understand what UTC even means when they’re going to make changes. And that’s obviously on top of whatever Salesforce on their own might change, right? Which I would say is probably more rare than what is caused by the actual users. So yeah, I think it makes sense that most of the quality problems come from the production side of the data. But the question I have for you, Chad, is: even if we focus only on the production side, let’s go upstream, is there, among the upstream producers at a typical company, another Pareto kind of distribution in terms of where most of the problems come from, compared to others?

Chad Sanderson 36:44
Yeah, I mean, I think you actually touched on a few of these. Third party tools like Salesforce, HubSpot, or SAP that are maintained by teams outside of the data organization. You said it exactly: it doesn’t seem like a problem, as a salesperson or a Salesforce administrator, to delete a couple of columns in the schema that you’re working with. But if you’re relying on that data for, you know, your finance team or your machine learning team, this becomes hugely problematic. So this is almost always a source of pain. I think the other thing that’s very problematic is events, and we hear front end events are especially notorious. This is something I think Eric and the RudderStack team are working on. But we hear it all the time, where you have this relatively legacy code base, and there’s a ton of different objects in code that are generating data. For every single feature that’s deployed, those may or may not change: events may suddenly stop firing, or new events might suddenly be added, and no one is told about it, and the ETL processes don’t get built. There’s just such a large communication gap between the teams working on the features that produce the data and the teams that use the data that really anything that can go wrong oftentimes does. And then the other really big area, I think, is external data. This is where it’s just unbelievably problematic. A lot of companies are not ingesting real time data feeds; it’s much longer batch processes that take a lot longer to load. So it might be every quarter, or every couple of months, that I pull in a big data set. And there’s so much change that happens on the producer side between the times they send these large datasets out that it can look like a completely different thing from month to month, or quarter to quarter. And there’s so much work that then has to go into putting the data into a format that can actually be ingested into the existing pipeline. There’s a company I was talking to where they basically said the data team lost its entire December to one of those changes. And I think these types of things are very common.

Kostas Pardalis 39:15
Eric, anything you want to add?

Eric Dodds 39:18
No, I know you have a ton more questions. Of course, I could ask a bunch of questions, but I’m just soaking this up like a sponge. I love it.

Kostas Pardalis 39:25
Okay, so let’s go back to the events and get a little bit deeper into that. And before we get into the data and the technology part, let’s talk a little bit about the humans and the organizations there. I have a feeling that not that many front end developers have ever been promoted because of their data hygiene when it comes to events, right? So how do we align that? Because you made a very good point about the incentives out there in the marketplace, where people are actually incentivized to input good metadata, or even get to the point where they try to game the algorithm through the metadata they put in, right? But inside organizations, teams are not necessarily aligned; even inside engineering, the data teams and the product teams might not be aligned, right? So how can we do that? And what are the limits of technology with all that stuff?

Chad Sanderson 40:28
Right, exactly. I mean, I think your last sentence there hit it exactly. I think that technology can only do so much, in my opinion. And what I’ve seen is, like you said, it comes down to incentives. In fact, when I was at Convoy, I asked engineers this exact question. I went to them and said: hey, how can I get you to care about the data that you’re producing? Because you’re changing things, and it’s causing a big problem for us. And the answer I heard pretty consistently was: well, I need to know who has a dependency on me. Who is using that data? Why are they using it? And when am I going to do something that affects them? I don’t have any of that context right now when I’m going through my day to day work. And it feels a bit bad, I think, if you’re an engineer going through your typical process. You make some code change, it gets reviewed by everybody on your team, they give you the thumbs up, you ship it, you deploy it. And then two and a half weeks later, some data guy comes to bang on your door and says: hey, you made a change, and it broke everything downstream. At that point, you’ve already moved on to the next project. You’re working on something new, you’ve left the old stuff behind. It just doesn’t feel good to have to retract all of that. And this is why something we’ve heard a lot is that product engineers generally tend to see data things as being the realm of data people, right? Anything in the data warehouse is kind of treated as a black box, and if there’s a problem caused there, then the data teams will just deal with it downstream. I think that this mentality needs to change, and I think that product can help it change. One example of this is DevSecOps. The whole discipline of DevSecOps has evolved over the past five to seven years from security engineers basically saying: look, we cannot permanently be in a reactive state when it comes to security issues. We can respond to hacking, we can respond to fraud, but the best case scenario for us is to start incorporating security best practices into the software development lifecycle, for example as just another step within CI/CD. And I think this is what needs to happen within data. Checks for data quality should be another step within CI/CD. And that step, just like any other integration test or any other form of code review, should communicate context to both the producer and the consumer about what’s about to go wrong. If I can tell an engineer: hey, the change you’re about to make is going to cause this amount of damage downstream, to these data products and these people, you’ve now created a sense of accountability if they continue to deploy the change even in that environment. You can’t say you didn’t know anymore. It’s no longer a black box; it’s been opened. And it provides an opportunity for the data scientists to plead their case and say: hey, you’re about to break us, you need to at least give us a few days or a week to account for this. I think that is the type of communication that changes culture over time.
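
As a sketch of what such a CI/CD step might look like, here is a hypothetical check script that fails a pull request and names the affected downstream consumers. The contract store and consumer registry here are stand-ins for whatever system actually holds them; nothing below is a real Gable or GitHub integration.

```python
# Hypothetical CI step: fail the build when a proposed schema violates the
# contract, and tell the engineer exactly who breaks downstream.
import sys

# Normally loaded from a contract store; hard-coded here for illustration.
CONTRACT = {"dataset": "orders", "fields": {"order_id", "total", "created_at"}}
CONSUMERS = {"orders": ["finance.revenue_dashboard", "ml.eta_model"]}

def check(proposed_fields: set[str]) -> int:
    """Return a process exit code: 0 if compatible, 1 if the contract breaks."""
    missing = CONTRACT["fields"] - proposed_fields
    if not missing:
        return 0
    affected = CONSUMERS.get(CONTRACT["dataset"], [])
    print(f"Contract violation on '{CONTRACT['dataset']}': drops {sorted(missing)}")
    print(f"Downstream consumers affected: {affected}")
    print("Coordinate with these owners or evolve the contract before merging.")
    return 1  # a non-zero exit fails the CI job, like any other failing test

if __name__ == "__main__":
    # Imagine these fields were extracted from the schema change in the PR.
    sys.exit(check({"order_id", "created_at"}))  # 'total' was dropped
```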

Kostas Pardalis 43:57
Yeah, makes total sense. Okay, we talked about the people, how they are involved, and what is needed there. But let’s also talk a little bit about technology. What are the tools that are missing? And where are the opportunities, let’s say, in the toolbox that we have today to go and build new things? You mentioned, for example, the catalog, which is a concept that probably has to evolve, right? It’s something that came up when we had a panel a couple of weeks ago with folks like Ryan from Iceberg and Wes McKinney: the catalog is one of these things we might have to rethink. Catalogs, by the way, in the way most people have them in their mind, are a place where you can go and, as you said, find an inventory of things, where I can find my assets and reason about what I can do and what I’m looking for. But catalogs are also what, let’s say, fuels the query engines out there; there’s also metadata that the systems need to check when planning the queries. So there are multiple different layers, from the machine up to the human, that have to interact with that. So what are the tools that you see missing? And where are the opportunities?

Chad Sanderson 45:30
So what I think is missing for the catalog to be effective is feedback loops and actionability. Or, to maybe phrase it another way: give something, get something. If I can provide, as a consumer, or even a producer for that matter, information to a catalog that helps me in some way, then I am more inclined to provide that information more frequently. And as a data product owner, one of the most valuable things I could get back in return is either some information about where my data is coming from, the source of truth, who it’s actually owned by, that class of problems I mentioned before, or I get data quality in response. And so this ties back to the point I was making earlier around lineage. I’ll give you a very simple example to illustrate. Let’s say within the warehouse there’s a raw table that’s owned by a data engineer, and then a few transformation steps away there is, like Eric was saying, some metric that’s been produced by a product team, and they don’t want that to break. Now, what they could do, through whatever the system is, is effectively describe what their quality needs are. And then we could traverse the lineage graph and say: okay, I can now communicate these quality needs to all of the producers who manage data that ultimately feeds this metric. And I can be sure that if there was ever going to be a change that violated those expectations, I would know about it in advance. Now I, as the metric owner, am a lot more inclined to add good information, right? I’ve created a feedback loop where I’m providing metadata and detail about the data object that I maintain, I’m getting something, quality, in return, and now I’ve built something robust that someone else can take a dependency on. And I think this is the type of system that basically has to exist, where the producer team of some data object gets a lot of value in return for contributing the metadata and the context, which I don’t think is the case today.

Kostas Pardalis 47:58
And you said we need to incentivize people to go and add that metadata. What is the metadata that’s missing right now? What does it take to construct that? Because the lineage graph is not a new concept, right? It’s been around for a while. So why is what we already have not enough? What is missing from there?

Chad Sanderson 48:20
Well, I think it’s a couple of things. Number one, the lineage graph doesn’t actually go far enough. You hear this a lot: right now, especially in the modern data stack, the edges of the lineage graph basically end at the structured data. If that’s where you stop, then you’re missing another 50% of the lineage, which means that if something does change in that unstructured, code-based world, it is ultimately still going to impact you, and any monitoring or quality checks at that point are just reactive to changes that have already happened. So number one, you need to actually have the full lineage in place in order for the system to work the way I’m describing. And then in terms of what metadata is missing, I think there’s a massive amount. The biggest question that I had as a data scientist, and got as a data platform leader, is: what does a single row in this table actually represent? That information is found almost nowhere in the catalogs, because again, there’s no real incentive for someone to go through all of the various objects they own and add it. The same is true for all the columns. For example, Convoy was a freight company, so this idea of distance was very important. We had probably 12 different definitions of distance, and none of them were laid out explicitly in the catalog. Distance might be in terms of miles, it might be in terms of time, it might be in terms of geography, it might be some combination. The catalog didn’t have any of those. But if I, as the owner of that data product, can communicate exactly what I mean by distance, then that’s going to help the upstream teams better communicate when something changes that impacts my understanding. So yeah, I think that’s the idea: all of the semantic information about the data. That’s the missing metadata, in my opinion.
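
As a sketch of what that semantic metadata could look like, here is an invented format that records row grain and column meaning next to the schema, using the distance example. The table, field names, and owners are hypothetical.

```python
# Invented semantic-metadata format: the row grain and column meanings that
# are missing from most catalogs, kept next to the schema itself.
SEMANTICS = {
    "table": "shipments",
    "row_grain": "one row per shipment per carrier assignment",
    "columns": {
        "distance_miles": {
            "meaning": "driving distance from pickup to dropoff",
            "unit": "statute miles",
            "owner": "marketplace-data-team",
        },
        "transit_minutes": {
            "meaning": "estimated drive time, excluding dwell time",
            "unit": "minutes",
            "owner": "eta-ml-team",
        },
    },
}

def describe(column: str) -> str:
    """One-line, human-readable answer to 'what does this column mean?'"""
    meta = SEMANTICS["columns"][column]
    return f"{column}: {meta['meaning']} ({meta['unit']}), owned by {meta['owner']}"

print(describe("distance_miles"))
```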

Kostas Pardalis 50:25
Yeah, makes sense. Do you see an opportunity there for AI to play a role with the semantics of all this data that we have? And if yes, how?

Chad Sanderson 50:38
Yes, number one, I think so. There are a couple of ways this can play out. Ultimately, I think this is what all businesses will need to do in order to really scale up their AI operations: they are going to need to add some sort of language-based semantic information to their core datasets. Otherwise, this whole idea of, oh, I’m just going to be able to automatically query any data in my data set and ask it any question, all of that is going to be impossible, because the semantic information is not there to do it. It’s just tables and columns, and nobody knows what this stuff actually refers to. I think one option is that leadership could just say: okay, everybody that owns something in data, we’re going to spend a year, or maybe two years, going through all of the big datasets in our organization and trying to fill out as much of the semantic detail as we possibly can. I think that could help as a start. But I tried this when I was onboarding a data catalog, and it’s temporary, right? You get the initial boost; maybe for a month you get a ton of metadata added all at once, and then it just gradually slopes off and ultimately isn’t maintained, which is pretty problematic. I think a better way to do it is to start from the sources and trickle down, in the same way I was describing to Eric before. I think all of this comes back to the contract. If you can have a contract that is rich with this semantic information, starting from the source systems, that is the responsibility of the producers to maintain, where they understand what all of their dependencies are, and anytime something changes they’re not allowed to deploy that change unless they have evolved the contract and contributed the required semantic update, then you get this nice model of inheritance where every single data set that leverages that semantic metadata can use it to build its own contract. And I think a lot of that could actually be automated. This is more of a far off future, but I think it would be a more sustainable way of ensuring that the catalog is actually up to date and the data is trustworthy.

Kostas Pardalis 53:05
Yeah, it makes total sense. Eric, we’re close to the end here. So I’d like to give you some time to ask any follow up questions you have.

Eric Dodds 53:13
Yeah, two more questions for me. One, just following on to the AI topic: what are the risks? You know, this is somewhat of a tired topic, but I think it’s really interesting in the context of data quality as we’re discussing it, because I agree with you that AI can have a massive impact on the ability to scale certain aspects of this. But when we’re talking about a data contract, the impact of something going wrong is significant, right? It’s not like you just need to double check your facts because you’re researching some information; you’re talking about someone potentially making an errant decision for a business. So what do you think about that aspect? And, as we think about the next several years, when do you see that problem being worked out?

Chad Sanderson 54:26
I think it’s going to require treating data as a product in terms of the environments that data teams are using. What I mean by that is: today, when we are building software applications, what delineates a software application in a QA or test environment from something that is produced and deployed to users is the process it follows to get there. Ultimately, the code is not that dissimilar. It’s just that there’s a series of quality checks and CI/CD checks and unit testing and integration testing and code review and monitoring. It’s the process you follow that actually makes something like code a production system or not. And I think in the data world it’s exactly as you said. What makes something production? Is it trustworthy? Is there a very clear owner? Do we know exactly what this data means? Is there a mechanism for evolving the data over time? Do we have the ability to iteratively manage that context? And I think the process that has to be followed from an experimental dataset to a production dataset is a lot of the same stuff, like CI/CD and unit tests and integration tests. I think contracts play a really big part of that; there needs to be a contract in place before we consider data production grade. And this is where the environment comes in. There need to be literally different environments for a data asset that is production versus one that is not. And it should have impacts on where you can use that data. If a dataset doesn’t have a contract and hasn’t gone through the production creation process, I can’t use it in my machine learning model, and I can’t share it with our executive team in a dashboard or report, in the same way that I can’t deploy something to a customer if I don’t follow my code quality process. I think this is the thing that probably needs to change the most. Right now in data, we don’t delineate at all between what is production and what is not production in the sense of customer utility. It’s all bunched into a big spaghetti glob.
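
As a sketch of that promotion gate, here is a hypothetical check that refuses to let a dataset feed a model or an executive dashboard unless it has been promoted to production with a contract and an owner. The registry format and dataset names are invented.

```python
# Hypothetical promotion gate: data can't feed a model or a dashboard unless
# it has passed the production process (contract in place, owner assigned).
REGISTRY = {
    "warehouse.fct_activity": {"tier": "production", "contract": "v3", "owner": "data-eng"},
    "scratch.adhoc_analysis": {"tier": "experimental", "contract": None, "owner": None},
}

def require_production(dataset: str) -> None:
    """Raise unless `dataset` has passed the production promotion process."""
    meta = REGISTRY.get(dataset, {})
    if meta.get("tier") != "production" or not meta.get("contract"):
        raise PermissionError(
            f"{dataset} is not production grade; add an owner and a contract "
            "before using it in models, dashboards, or reports."
        )

require_production("warehouse.fct_activity")  # passes silently
# require_production("scratch.adhoc_analysis")  # would raise PermissionError
```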

Eric Dodds 56:53
Super helpful. All right, last question. One way to summarize a lot of what we’ve talked about could be: you almost need to slow down to go faster, right? Actually defining contracts, actually putting data producers under contract. You used the term socio-technological, right? That involves people, and that takes time. Can you speak to the listener who has followed along with this conversation and said, man, I would love to start fixing this problem at my company, but it’s really hard to get things to slow down so you can go faster in the future? What would be the top couple of pieces of advice for that person?

Chad Sanderson 57:48
So first of all, I agree with you, there is some element of slowing down. But at the same time, I would say that’s the same for code quality, too, right? GitHub does slow us down, and CI/CD checks do slow us down. And having something like LaunchDarkly that controls feature deployments is going slower than just deploying everything to 100% of our audience. But what software teams have realized is that in the long run, if you do not have these types of quality gates in place, you will be dealing with bugs so frequently that you will be spending a lot more time on that than on shipping products. So that’s the first framing I would take, because I think this falls under that exact class of problems. The second thing I would say is that the problem a lot of engineering organizations, and even more business units, have with slowing down on the data side is that they are still not treating their data like it is a product. They’re treating it more like, hey, it’s just some airy thing: I want an answer to a question, I get an answer to a question. It’s not something that needs a maintainer and has to be robust and trustworthy and scalable and all these other things, even though that’s the implication. If I ask a question about my business, it is implied that the answer is trustworthy and high quality, but oftentimes that connection is not made. So what I often recommend people do is illustrate that to the company, and then illustrate the gap. A concept I used a lot at Convoy was this idea of tier one data services. That basically means there are some sets of data objects at your business where a data quality issue can be traced back to revenue or business value. In Convoy’s case, we were using a lot of machine learning models. A single null value in a record would mean that a particular row of training data would need to get thrown out. And if that’s happening a lot, you can actually map it to a dip in accuracy. And if you know how much value your model is producing, then every percentage point of inaccuracy can be traced to a literal dollar sign. So that’s one application. I think there are lots of applications within finance; there’s some really important reporting that goes on there. Once you identify all of these use cases for data, what I then like to do is map out the lineage and go all the way back to the source systems, to the very beginning, and say: okay, now we see that there is this tree, there are all these producers and consumers feeding into this ultimate data product. And then the question is: how many of these producers and consumers have contracts? How many of them know that this downstream system even exists? And how many times has that data been changed in a way that’s ultimately backwards incompatible and causes quality issues for that system? With all of that, you can actually quantify the cost of any potential change to any input to your tier one data service. And you can put that in front of a CTO, or a head of engineering, or a head of data, or even the CEO, and the level of risk the company faces by not having something like this in place becomes immediately apparent. So that’s a really excellent way to get started.
A lot of companies are beginning just with paper contracts, saying: here are the agreements and the expectations that we need as a set of data consumers, and then working to implement those more programmatically over time.
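
As a back-of-the-envelope version of that tier one costing, here is the chain of multiplication from a null rate in training data to an accuracy dip to dollars. Every number below is made up for illustration; only the shape of the calculation follows the description above.

```python
# Trace a null rate in training data to an accuracy dip, and the dip to
# dollars. All numbers are assumed; the point is the chain of multiplication.
rows_per_month = 1_000_000       # training rows produced per month
null_rate = 0.04                 # share of rows dropped for a null in a key field
acc_points_per_10pct = 0.5       # assumed accuracy points lost per 10% of rows dropped
dollars_per_acc_point = 250_000  # assumed model value per accuracy point, per month

rows_dropped = rows_per_month * null_rate
accuracy_dip = (null_rate * 100 / 10) * acc_points_per_10pct
monthly_cost = accuracy_dip * dollars_per_acc_point

print(f"{rows_dropped:,.0f} rows dropped -> "
      f"{accuracy_dip:.1f} accuracy points -> ${monthly_cost:,.0f}/month at risk")
```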

Eric Dodds 1:01:46
Such helpful advice, and something I really need to take to heart, since I work with data every day. Chad, thank you so much for joining us. If anyone is interested in connecting with Chad, you can find him on LinkedIn, and Gable.ai is the website, so you can head there and check out the product. Chad, thank you again for a great conversation.

Chad Sanderson 1:02:08
Thank you for having me.

Eric Dodds 1:02:10
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.