In this special episode of The Data Stack Show, Eric and Kostas are joined by the founders of Bigeye, Metaplane, Lightup Data, and Great Expectations. Together they discuss definitions of data terms, creating standards, best practices, and more.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome, everyone, to The Data Stack Show live. We have done a couple of these now, and it’s quickly becoming one of our favorite things that we do. We have a great show lined up today. We are going to talk about data quality, which is a very wide-ranging subject. We’ve collected some of the top thinkers in the industry who are building products around this, and we’re super excited to chat. Also, for those of you who are joining, please drop questions in the Q&A as we go through the show. We’d love to answer them live. We love going down rabbit trails that are helpful for the listeners. So please drop your questions in the Q&A.
Why don’t we go ahead and get started with some introductions? I’ll just go in order of the people that I see on Zoom. So Ben, do you want to give us a brief background and tell us about yourself?
Ben Castleton 1:13
Sure. Thanks, Eric. And thanks for having us on. I’m Ben Castleton, co-founder at Superconductive, and we’re the team behind Great Expectations. Basically, it’s a product that allows us to test and validate data and provide documentation, and it’s an open source platform. Excited to be here.
Eric Dodds 1:32
Excited to have you. All right, Manu, you’re up next.
Manu Bansal 1:36
Hey, everyone. This is Manu, founder and CEO of Lightup Data, a three-year-old startup backed by Andreessen, solving the data quality problem, or what we might sometimes call data observability. We are not open source, and proudly so at this point. We’re super excited to be building the product we have and about the progress we have made since we started. This is not an easy problem, as I’m sure every panelist here will agree. We’d love to talk more about what the problem means, separate from how we do it; that can always come later. I think there’s a lot of conversation still needing to happen around why we should be looking at this problem and why it’s central to the modern data stack. I’m very excited to be here on this panel today. So thanks for the show.
Eric Dodds 2:18
Of course, so excited to have you. Alright, Kevin.
Kevin Hu 2:21
Hi, everyone. I’m Kevin, co-founder and CEO of Metaplane, a plug-and-play data observability tool that helps you spot data quality issues, assess the impact, and figure out the root cause. And also thank you both for creating this DMZ between competitors. I think we can help grow the market together. We chatted before the show, and I was really looking forward to this conversation.
Eric Dodds 2:44
Absolutely. Well, so excited to have you, and I love the DMZ description. Alright, Egor.
Egor Gryaznov 2:49
I guess we left all of our knives at home, right? Hey, everyone. I’m Egor. I’m the co-founder and CTO here at Bigeye. We are a data observability platform. We help you build the sort of workflows that you need in order to detect and deal with any sort of data quality problems that come up in your systems. As everyone else mentioned, Eric and Kostas, thanks a lot for organizing this. I’m very excited to have a conversation and see what all these other brilliant minds think about the quality space.
Eric Dodds 3:20
Absolutely. Well, so excited to have you. Well, let’s start. One thing I love about this group, and I think everyone really shares the same mindset, is that getting into the technical details is certainly important, but it really is about the core problem that we’re trying to solve. So I’d love to try to put a definition to data quality in just a few sentences, and we’ll go in reverse order here. I think for a lot of our listeners, you hear data quality, data observability, data lineage, anomaly detection; there are tons of terms. Let’s zoom out from the details a little bit. Egor, can you define data quality for us?
Egor Gryaznov 4:08
If we want to zoom out all the way, I think that data quality is knowing whether or not your data is fit for the use that is intended for it by the business. I think that’s about as broad as I can make that definition. Everything else that we talk about, data observability, lineage, issue management, notification management, all of those are tools to detect and deal with problems that come up. But data quality itself is really just: can the business do something useful with the data? They want to do something, and is the data in the right shape, the right format, and the right state for somebody in the organization to work with it?
Eric Dodds 4:55
Love it. All right, Kevin.
Kevin Hu 4:57
Very well said by Egor. Data exists to be used, right? It exists to make a decision on, to take an action on, to automate something. So when there is a departure from what you’d expect, like you expect an hourly dashboard and it’s been more than an hour, then that is a data quality issue. And plus one to the idea that data lineage, data observability, all of these are just technologies in service of the problem, which is data quality.
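Editor’s illustration: the hourly dashboard example can be written down as a simple freshness check. This is a minimal sketch, not any panelist’s product; the function name `is_stale`, the one-hour default, and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_refreshed: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=1)) -> bool:
    """Return True when the data is older than its consumer expects.

    `max_age` encodes the expectation ("this dashboard is hourly");
    the one-hour default is an assumption for this example.
    """
    return now - last_refreshed > max_age


# Hypothetical clock reading, fixed so the example is reproducible.
now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)

fresh = is_stale(now - timedelta(minutes=30), now)  # refreshed 30 minutes ago
late = is_stale(now - timedelta(minutes=90), now)   # refreshed 90 minutes ago
print(fresh, late)
```

The threshold belongs to whoever uses the data: the same 90-minute-old table is a quality issue for an hourly dashboard and perfectly fine for a weekly report.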
Eric Dodds 5:24
Manu Bansal 5:26
I think Egor and Kevin did a great job with the broad definitions; maybe I can take a stab at making it a little sharper. From my point of view, I think of data quality as issues where data is broken but your infrastructure is healthy. When you think about what class of issues we would want to call data quality issues, there are issues where infrastructure goes down, a machine goes down, and your data stops flowing, and that’s a data quality issue to some extent. But you try to differentiate between a machine going down, and any effects that can have, versus when everything is looking good on your Datadog dashboard or your APM tool, and yet your data-driven product is not doing what it’s supposed to do. I’d love to dig deeper into this later in the conversation. Then there are issues that tend to be very close to the business and usually end up requiring manual judgment. That’s a class of data quality issues that’s very important, but not necessarily solvable by a tool. So we are starting to converge on a definition of more operational data quality issues, which are systematic and can be detected autonomously, as opposed to requiring judgment. But I would leave it with this: issues that are not coming from the infrastructure, and yet your data is broken.
Eric Dodds 6:41
Love it. All right, Ben, you’re going last after everyone else gave their definitions. I don’t know if that’s easier or harder, but you’re up.
Ben Castleton 6:49
I just took notes on what they said here. But seriously, I love what Manu is saying, and we think of the same thing. If you are running data through an application and nothing’s wrong with the application, you can still have a really impactful set of problems if you’ve got data that isn’t what you expect it to be. So at Great Expectations we think of separately testing the data as just as important as testing your software, and that’s kind of the crux of why we came up with that name. I also wanted to come in and say that the way we put it is that data quality can be reframed: you’re basically looking to see if the data is fit for the purpose that you intended it for. I could have a data set that has all these things that somebody else might say are issues, but if it’s intended for a purpose that I have in mind, and it works for that purpose, then I don’t really have data quality issues that I’m concerned about. And so we think of that as just testing to see if it’s fit for the purpose that it’s intended for.
Eric Dodds 7:59
Fascinating. Yeah, I’d love to dig into that a little bit more later in the show, because quality is both highly objective on certain vectors and highly subjective on other vectors. So I’d love to talk about that. First, though, we’ll do a round robin again, in reverse order again, and then I’ll hand the mic to Kostas. One thing I’m interested in is that all of you are building really great tools in this space. As we all know, the data space is adopting a lot of paradigms from software engineering: quality, observability, lineage, issue management, the whole constellation of specific components that ladder up to providing whatever data quality is. It almost seems like, okay, we learned all of this in software, great, we have a set of best practices. And now it seems like, and maybe this is just perception, oh, dang it, we forgot that when we started working with data, even though we knew it was really, really important. So Ben, there was a long period where data quality wasn’t necessarily a first-class citizen, even in some of the core pieces of data infrastructure that came out. Why do you think that is?
Ben Castleton 9:25
Yeah, I’ve actually done a fair amount of wondering about that myself: what actually caused it, meaning the entire world, the ecosystem, the economics behind it. Why was that? I think there’s a combination of factors. One obvious one is the technology: innovation in storage and compute, all the Moore’s Law happening in technology, so that you have a lot of ability to test data that you did not have a decade ago. It just moves so fast. So that’s one factor. You also have, and I don’t know if this is true, but sometimes I wonder if it was just kind of missed, because we started by tackling what we knew about, which is code, right? Let’s look at testing code. Let’s look at infrastructure to figure out how we make our code work. I think I did this too. I remember when I first started coding, I assumed every time I wrote a bit of code that it would work perfectly after I wrote it. I’m just going to hit Enter and run it, and it’s going to work. So we assumed, well, if we fix that code, we’re going to be fine. And then you start to realize, with all the data we have now, that it just doesn’t work without testing data. So I’ll let others chat, but those are some of the factors.
Super helpful. All right, Manu.
Manu Bansal 10:32
It’s an interesting question. Maybe I’ll challenge the premise of the question first, a little bit, right? I mean, how old is software? We’ve been talking about it since, what, the 1950s, ’60s maybe, right? But when did we see Splunk happen? And when did we see Datadog become universal? There’s a huge gap between those two moments, right? Software used to be the wild, wild west for a very long time. And then you started to see some open source libraries, half done, kind of trying to get productized, and people passing them around, and people still moving logs over FTP and whatnot, back in the day. It took some time for best practices to emerge. I guess the question then becomes, now that this has happened with software, can we just borrow that into data? I would love to; I think we all would. The fact is that data is so different from software, right? And we have seen some false starts when trying to carry over ideas. Not picking favorites here, but data versioning, for example, is not nearly as powerful as software versioning is. We’re still trying to figure out what that actually means and what kind of value it creates. Or, it took some time for a data build tool to show up, and it took a very different form than software build tools. So it doesn’t really carry over. And I’ve thought a lot about why that’s the case. Why is there a rift between the two? It turns out software is a design-time testable entity. The person producing software is a human being. But data, at least the way we talk about it now, is a runtime entity, which is coming in autonomously; you don’t have human intervention in most cases, right? So it’s just a fundamentally different object to be building or testing or monitoring or putting quality controls on. There’s no such thing as “let me test my data today,” because tomorrow it’s going to look different. It’s a dynamic thing. It’s a runtime entity, right, so those ideas need to be redone.
And that’s why we end up going back to a clean slate.
Kevin Hu 12:58
Riding on Ben’s point about Moore’s law a little bit: compute and storage costs are going down, so there’s not really a trade-off between analytical workloads and quality workloads anymore. People have been testing their data for a very long time, right? OG data and ETL developers have always had SQL scripts running against their databases. Now what’s possible is that you can have the SQL scripts run and not take down prod as well. I’d say that the demand side is also shifting a little bit, right, where some of the first BI tools came before Edgar Codd wrote his paper on relational databases. So BI is very old, maybe one of the first applications of software. But now it goes beyond decision support. We’re seeing reverse ETL, or I don’t know what the kids call it these days, for CTR and operational workflows. You’re seeing more automations powered by data, like machine learning, right? Now the stakes are higher. So even though everything old is new again, it’s new, but with a vengeance. The stakes of data going down are much higher, and so are the demands on the tools that we use to address that, both technologically and conceptually. And I agree with Manu that we have to be critical about what we borrow, right? We can’t borrow wholesale because there are many differences: data being a dynamic entity, but also the first-class objects, like lineage. What is lineage in software? You have traces, and that’s important, and it fills a somewhat similar role, but it’s not quite the same thing. And as a result, even the heuristics differ. Like “treat your software like cattle”: we cannot treat our data like cattle, right? They’re very precious thoroughbred racehorses that we have to coddle a little bit. But yeah, that’s a whole other thing.
Eric Dodds 14:55
Yeah. Yeah, love it.
Egor Gryaznov 14:59
I want to dig in a little bit to what Manu said as well, about testing and data being dynamic. I think this is why data observability as a term is a much more accurate representation of how to track data quality within data systems. But it’s the same story: software observability hadn’t really come into its own until, I’d say, 15-ish years ago, maybe. That is the notion of monitoring your systems and your applications through externally measurable properties in order to understand their internal behaviors. Datadog and New Relic and the Dynatraces of the world have made that possible for software systems, where if I have an application running on a server, I can guess at its health by monitoring things like latency and response times and CPU utilization. And there are a lot of externalities that are going to affect these measurements. Software has externalities like, oh, well, our switch went down, so now we’re routing through a different switch and your latency is going to go up just because of that. Or something else is running on the system, there’s CPU or memory contention, and that’s going to have an effect. In the same way, going back to Manu’s point, there are externalities in the data; data changes all the time. And so the best thing we can do is monitor across those properties and say, this is what we can observe about the data, and it is behaving differently than we expect it to behave. That may or may not be a problem with the data itself, but you need to be aware of it. And I think that’s why data observability has been such a fitting term for the act of looking at data and making sure that it is right and working as expected. Data quality then is saying, okay, when you are doing that data observability, is the data actually satisfying all of the rules that we expect? But yes, we need to understand that the data will change.
And going back to the software comparison, software didn’t really pick this up until very recently. Well, “very recently” in air quotes here, for everyone listening on audio, because 15 years ago is obviously not recent at all. But in the sense that software has been around for 50 or 60 years, that is a fairly recent development in software.
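Editor’s illustration: the idea of watching an externally observable property and flagging when it “behaves differently than we expect” can be sketched with a simple z-score over recent history. This is an assumed, deliberately naive approach chosen for illustration, not how any of the products discussed here work; the function name, the threshold of 3, and the row counts are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0) -> bool:
    """Flag `latest` when it is far from what recent history suggests.

    Uses a plain z-score against the sample mean and standard deviation.
    The threshold of 3 is a common rule of thumb, assumed here.
    """
    if len(history) < 2:
        return False  # not enough history to form an expectation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a constant series stands out
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily row counts for one table.
daily_row_counts = [1000, 1020, 990, 1010, 1005]

normal_day = is_anomalous(daily_row_counts, 1008)
bad_day = is_anomalous(daily_row_counts, 400)
print(normal_day, bad_day)
```

A flag from a check like this only says the property moved; as noted above, that may or may not be a problem with the data itself, which is why a human or a downstream rule still has to interpret it.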
Kostas Pardalis 17:37
Thank you so much, because you almost gave an answer to my question, to be honest, which is about the difference between the terms observability and data quality. When you did your introductions, and hopefully I’m not wrong, two of you mentioned that you are building data quality products, and two of you said observability products. So I’d love to get a little bit deeper into that and understand the reasons for each one of you to choose one or the other, and try to add a little bit more context. That will be helpful primarily for me, because as you all know, I’m a very selfish person and I want to learn primarily myself, but also for our audience. So I’ll start with you, Egor, because you pretty much gave a response to that already. But if you can add a little bit more context, I think it would be great.
Egor Gryaznov 18:36
Yeah. And I’m going to throw another term at you just to make this even more complicated than it already is. We at Bigeye have been thinking about data reliability, and data reliability engineering, as, going back to that software comparison, the data equivalent of SRE and site reliability. SRE introduces best practices for maintaining software systems and making sure that they are reliable and up and usable. Data reliability is the same set of tools and practices applied to the data space. Now, zooming into that, you need operational best practices, and this is a very human problem and a process problem that is trickier to solve than tooling. But you also need the tools to support those processes, and data observability is a tool in the toolbox that helps you generate the signals to understand what state your data is in. Then you can enact the processes to go and repair it, or change it, or modify it, or update your assumptions about it in order to make it more reliable. So the reason I say Bigeye is a data observability platform is because we are solving that piece of the toolbox right now. We are solving for how you can most efficiently monitor the state of your data, so you can start making better decisions about what to do and start creating the sorts of processes that you need to have around those signals in order to make your data reliable. And the interesting part to me is that we see data reliability picking up steam. I met a person the other day whose title is data reliability engineer, and that is very, very exciting to me, because that encompasses the thinking of: I am using tools and setting up the processes for my organization to have trust and understand that the data is as reliable as possible. So, hopefully that answers the question a little bit more. But also, there’s a whole other wrench in it.
Kostas Pardalis 20:42
Oh, yeah. And I’m pretty sure that everyone is going to add even more to the context. So, Kevin, you’re next.
Kevin Hu 20:50
That’s got to be good, to see that job posting, or to see that job title. Well, I’m a little bit conflicted when I agree so much with every other panelist on the show, but I assure the audience that we have not talked beforehand. So if we all agree with each other, it either means that we’re all right or we’re all wrong, and I hope it’s the former. How Egor described data observability, trying to collect as many externally observable properties of your data system as possible, is 100% correct. I would just break those properties into metrics, metadata, lineage, and logs; that’s four ways to organize it. And I always return to our customers, which are data teams. They spend all day providing data to their customers, right? Providing data to the sales team to make better decisions, understand the impact of their work, and prioritize their work. But what is the data for the data team? How do data teams know which tables are being used, which ones should be deprecated, what models they should create? Data observability is kind of creating the data for the data team. It’s data about data; now we’re getting very meta for a second. So we have this technology, which is data observability, and data quality is one really important problem that it solves. But it does solve other problems, right? Spend management is a major issue for data teams, as is prioritizing work, or refactoring a model and knowing what the downstream impact might be. Data engineering as a job, and I’m sure we’ll talk about the roles later, is a bucket of a whole bunch of disparate jobs to be done. And data observability is one technology that kind of layers on top of all of that. Not that any one is more important than the other, but it’s not a direct one-to-one mapping between data quality and data observability.
Kostas Pardalis 22:43
Manu, you’re next.
Manu Bansal 22:46
So I see it a bit differently. I think a lot of it, by the way, is basically irrational and non-technical. When you say data quality, the first thing you probably think of is a 20-year-old tool that no one wants to be called today, and you invent this new term called data observability just because you want to stand out, right? But let’s leave all that aside and go into the actual semantics here. I think all of these terms are actually great terms, and they’re describing different aspects of the same overall goal you’re trying to accomplish. If you go back to the language a little bit, what are we trying to accomplish? We want good data quality. We want reliable data. We want to operate it well. So that’s your overall objective here. What do you need to do that? You need to make sure that if data breaks, you catch it, right? So you need monitoring on some observable property of your data. And before you can monitor something, what do you need? You need observability; you need externally visible signals that are telling you what your data looks like. So these all work together; they’re different layers in the journey. The objective is good data, or data quality, or data reliability. The starting point is to have observables. And then in between somewhere you have monitoring, right? Just having observables is not sufficient if you’re not monitoring them; you need to make sure that if something changes, you will catch it. And then you have management. So all of these terms are actually accurate. It depends on whether you want to talk more about building those observable signals, or about the end result, which is data quality. They’re just different interpretations of the same objective.
Kostas Pardalis 24:30
Makes a lot of sense. And Ben.
Ben Castleton 24:35
Yeah, thank you. So much good stuff here, and I have a hard time disagreeing significantly with anyone. I would say for Great Expectations we approached it from a different angle, which was: you’ve got the ocean to boil, so let’s start with putting one test on one data asset and figure out how to prevent certain problems by understanding whether or not this data is what you expect, in a specific way. Building those tests, you can’t really get to scale without going out and doing what Kevin and Egor are talking about with data observability, having a system and software that allows you to observe the results of those metrics, the logs, and all that. But we approach it from testing as the entry point to collaboration with people across teams. So we’re coming at it from a different angle, but I think we all see sort of the same problem set that we’re trying to solve. For us, it’s more about how this team can collaborate with another team around data and remove some of the friction involved in data workflows. There is almost always a very diverse set of workflows, and a diverse set of people trying to collaborate around data to make the entire thing work. And so you’ve got to put in place almost like contracts, where you can say, I expect the data to be like this when it comes to this point where our team takes over, and make sure that you can test that and validate it, and then have good documentation around it. That’s an entry point into the whole world of observability, and just data collaboration in general.
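Editor’s illustration: the “contract” idea, stating what one team expects of the data at the handoff point and then testing it, can be sketched in plain Python. This does not use the Great Expectations API; the helper names (`expect_not_null`, `expect_between`, `validate`), the column names, and the sample rows are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Expectation = Callable[[List[Row]], bool]


def expect_not_null(column: str) -> Expectation:
    """Build a check that every row has a non-null value in `column`."""
    return lambda rows: all(row.get(column) is not None for row in rows)


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Build a check that every value in `column` lies within [low, high]."""
    return lambda rows: all(
        row.get(column) is not None and low <= row[column] <= high
        for row in rows
    )


def validate(rows: List[Row], contract: Dict[str, Expectation]) -> Dict[str, bool]:
    """Run every named expectation and report pass/fail per expectation."""
    return {name: check(rows) for name, check in contract.items()}


# A hypothetical contract agreed between a producing and a consuming team.
orders_contract = {
    "order_id is never null": expect_not_null("order_id"),
    "amount is between 0 and 10000": expect_between("amount", 0, 10_000),
}

good_batch = [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 99.5}]
bad_batch = [{"order_id": None, "amount": 25.0}, {"order_id": 3, "amount": -5.0}]

good_result = validate(good_batch, orders_contract)
bad_result = validate(bad_batch, orders_contract)
print(good_result)
print(bad_result)
```

Because each expectation has a human-readable name, the same dictionary can double as documentation of what the consuming team relies on.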
Kostas Pardalis 26:20
So, you mentioned something very interesting, Ben. You said that you are focusing on the collaboration with other teams. And that’s a very good, let’s say, starting point for my next question, which is: who should care about quality in the organization? Who are, let’s say, the people that are the next data reliability engineers, when this becomes a thing, which it is becoming, from what it seems? So let’s start with you, Ben. Who are the people who should be really, really interested in this, and who actually are the users of products like these?
Ben Castleton 26:55
Yeah. Should we just all agree to start with the board of directors, the CEO, the C-suite, all the managers, everybody? Yeah, I won’t get too much disagreement there. But specifically, our products are aimed first at users who are oftentimes data engineers building data pipelines. And then there’s a second level: people who run into data quality issues but are not necessarily coders or data engineers. They are more like subject matter experts, people who are analyzing the data and run into the problems, and who are oftentimes interacting with data engineers and people who are building pipelines. So we sort of target them first. But these sorts of conversations are really helpful to enlighten the entire ecosystem of business people who should be concerned about data quality, because when you have problems, it affects the entire organization.
Kostas Pardalis 27:57
100%. And actually, one of the reasons that I really love working with products that have to do with data is that you cannot focus on only one persona. Yeah, you have the main user of your product, who might be the data engineer, for example. But then you also have data analysts who are going to be working with the data, and the output of the data analysts is probably going to go to marketing managers or someone else. So you have this chain of stakeholders where one consumes, let’s say, the output of the other. When you’re building products like these, and especially when it comes to data quality, you really have to consider pretty much all of them. I have a question about that, but that’s for later, because now I want to ask Manu about his opinion on that.
Manu Bansal 28:51
This is a very interesting and very important topic. I think what Ben is saying makes a lot of sense: it should come all the way from the top. And we are actually seeing signs of that. That’s to the credit of the executives who do realize the importance of data. Maybe it’s because data is coming to prime time now, and it’s driving really high-value use cases. If data breaks, everyone knows: inventory is not getting stocked the right way, or the CFO is looking at her numbers asking why sales are suddenly low, and then everyone’s unhappy about it, right? I think Egor hit upon this initially. We care about the data reliability engineer persona emerging. We’re not actually seeing that yet. I think there’s still a bit of a debate, even when we talk to customers, where they will sometimes ask us who should be looking at this, who should be tasked with maintaining data quality. And we see this tension right now, where the stakeholders who have the most to gain or lose are the business stakeholders who are consumers of data. Unfortunately, they’re also the ones who detect those issues for the first time. But the people who I strongly believe should own data quality are the producers of data, right? So let’s say the data engineering team, who are moving data in, storing it, and then producing finished data that others can consume. We’re starting to see that happen. But I think it’s also going through the same kind of tension that you saw with the DevOps and SRE roles finally getting split out, where it was an orphan need. In software, software engineers were the best equipped to deal with software quality issues, but they were the ones who were not really interested in operating software and dealing with those issues, right? So then we had to create this role, and Google came out with the SRE handbook in 2006, 2007, and said, this is its own discipline.
And this is a first-class role. It requires a really skilled person to be running software and keeping Google’s page load time under a second, right? And I think the same transition is now happening in data. So who should own it? I think it should not be the consumers of data. Between producers and a different persona, I think that’s still an open question.
Kevin Hu 31:07
I agree that it’s an open question. And also, data teams, as much as we love them, do not produce or consume data, right? It’s the go-to-market teams, the product teams, the engineering teams. It’s a human putting in a number or a machine putting in a number, and at the end of the pipeline, it’s someone reading a number or taking action on it. As much as we talk about Snowflake as the source of truth, Snowflake does not ship with data; it is a box into which you put your truth. And as a result, from the data team being solely responsible for data quality to the whole organization being responsible for data quality, I’ve seen everywhere along the spectrum be successful. But I think it requires being realistic about behavior change. Because now you go to someone like a sales rep and say, please stop putting in the wrong number, right? Well, why do doctors have bad handwriting? It’s because they don’t have to read their own handwriting; pharmacists do. If you don’t suffer the consequences of your own actions, it’s very hard to change that, unless someone rules with an iron fist, which could work. So I’m just noting that the data teams that I’ve seen have the most success with up-leveling the state of data quality through an organization get everyone looped in. They say, this is the goal, this is how we reach it, but I’m going to need you to be on my side. And to help you be on my side, it’s going to hurt a little bit.
Egor Gryaznov 32:41
I’m piecing a couple of things together that have been said earlier. I want to go back: Eric made this comment, I think at the very beginning, which is that data quality is pretty objective, but sometimes it’s subjective. I’m actually going to argue with that. I’m going to say it’s always subjective. And the reason that data quality is always subjective is that the only people who can define what is expected about the data are the end consumers of the data. And so this is why I agree with Kevin, and I want to push back a little bit on, and get a little more out of, your statement, Manu, around the data team, the data producers, needing to be the ultimate owners of data quality. And the reason for this is that data producers often don’t know what the data actually means and what it’s being used for. I was a data engineer back at Uber. I had a bunch of pipelines, and the pipelines pushed things around and transformed them. And I talked to a PM and I talked to a data scientist, and they told me what it should look like at the end, and I made that happen. But what was that data being used for? What were their expectations about that data? Unless that was communicated to me by the business itself, by the data consumers themselves, I would never know or be able to encode that. And so to Kevin’s point, I think it’s important to get everyone on the same page. But what’s even more important is allowing the business stakeholders to define what data quality means for them, what their expectations of the data are, and providing the sorts of systems and tools that let them encode those expectations into the rest of the processes that are being run by the data producers.
So going back to my statement, data reliability engineers are really the people who support the tools, enforce the processes, and create these processes in the organization so that it has that sort of cohesion. Going back to Kevin’s point, it’s going to be a little painful. Somebody is going to show up with a giant document and say, if you want to build a dashboard, you have to say what the expectations of that dashboard are going to be. And you now have an extra four hours of work. Nobody likes that. But that is the only way that you will get real, valuable expectations of your data and then have them enforced through the tools that the data reliability engineering team or the data producers team can start enforcing.
Kostas Pardalis 35:08
Egor, I don’t know what’s happening today, but you almost always answer the next question that I have.
Egor Gryaznov 35:16
I have a binder.
Manu Bansal 35:20
This is a good segue to dig deeper into what we have been discussing. It is very interesting because it’s starting to get into why we are talking about data quality and data observability and how these two things relate to each other. Why is this still a conversation? If you’ll let me, I’ll just challenge Egor a little bit on this and say, yes, everything is subjective, but at that zoom level. If you do one more zoom in, right, I think this starts to factorize into two different components. One is subjective; the other need not be. So when you think of how to measure data and what it needs to be, these are two different steps. How you measure it is an indicator. We like to call it a data quality indicator, just like you think of a KPI. How you measure your business, that’s a KPI. How you measure your data quality, that’s a DQI in the language we use: data quality indicator, right? How you measure the performance of a website, you think of page load time as a metric, right? So the metric definition here can actually be objective. And that’s very technical; that’s actually coming from engineers, for the most part. Now, what this needs to be for the business to be successful, that’s subjective. And that usually comes from the business stakeholder. It’s the criteria or the rule on a metric. Yes, I think that’s very subjective. People will have different interpretations of what that metric needs to be. I think we have an opportunity here to standardize it. I’d love to hear what Ben has to say to that, because Great Expectations has been going around talking about creating a standard for it. And I think that can be done, and we’ll see more and more of that happen over time. You’re seeing that definitely happening with IT observability, where we talk about CPU and memory and disk as the most basic metrics that everyone will always want to measure. Do you want to run your CPU super hot at 80%?
Or do you want to run it super safe at 40%? Well, that’s a trade-off, right? I mean, that depends on how you want to operate. But there are no two ways about it: you should be looking at the CPU utilization metric.
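The split described above, an objective indicator plus a subjective rule on top of it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `null_rate`, `Rule`, and the thresholds are hypothetical names and values, not part of any product discussed on the show.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


def null_rate(values: Sequence[Optional[str]]) -> float:
    """Objective indicator: the fraction of values that are missing."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


@dataclass
class Rule:
    """Subjective part: the limit a given stakeholder is willing to accept."""
    indicator: Callable[[Sequence[Optional[str]]], float]
    max_allowed: float

    def passes(self, values: Sequence[Optional[str]]) -> bool:
        return self.indicator(values) <= self.max_allowed


# Hypothetical column of email addresses with two missing values.
emails = ["a@example.com", None, "b@example.com", None]

# Same indicator, two different (assumed) business thresholds.
strict_rule = Rule(null_rate, max_allowed=0.05)
lenient_rule = Rule(null_rate, max_allowed=0.60)

print(null_rate(emails), strict_rule.passes(emails), lenient_rule.passes(emails))
```

The indicator is computed the same way for everyone; only the threshold changes from one consumer to the next.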
Ben Castleton 37:28
Can’t argue with wanting to create a standard for some of those. I was agreeing with Egor and now I’m agreeing with Manu. So yeah, really hard to disagree with you guys. You’re too smart.
Kostas Pardalis 37:39
I have something here before you go on. What I really love, as a person who today works mainly on product and has tried to build businesses in the past, is that you are touching on one of the most, let’s say, interesting problems, which is how the world is perceived through the lens of an engineer and how it is perceived through the lens of an actual user, right? The subjectivity versus objectivity is exactly what engineering managers and product managers have to fight over every day when they try to define, okay, what are we going to build next. So I love that. And I think it’s one of the most interesting challenges that you guys have to solve with your products, because at the end, you have all these different, let’s say, people that are involved. You have the data engineer, and the data engineer needs something very concrete that is going to be measured, right? But then how do you communicate the outputs of what is measured and observed there to the marketing manager, when the only thing that person cares about is how much they can or cannot trust the data, right? Even the language is different. And that’s actually also the question that I would follow up with, but you started answering it: how do you think that this can actually happen? Because it sounds like, from a product design and management perspective, a huge, huge challenge. And Egor, please go on. I interrupted you.
Egor Gryaznov 39:21
Yeah, no, I will. I can weave both answers into one. So I agree that the signals can be objective in software, because infrastructure and software all behave the same way. They all consume CPU and memory, and they have endpoints, and those endpoints have latencies, and they’re hit a certain number of times. These are non-negotiable facts about software. The problem is there are very, very, very few non-negotiable facts about data. In my mind, the things that I’ve usually been able to enumerate at this point are: the table needs to be loading at some cadence, and the table needs to be loading some number of records. Those two signals are non-negotiable. And actually, that’s probably about it. I don’t even think nulls are non-negotiable, because in some datasets some fields can be null and some can’t; there’s still negotiation there. So there are only really two signals in data that are actually objective. Everything else is subjective. Because do I care if this column was ever null? Maybe I don’t. Maybe there’s some extras field that somebody’s dumping in here and may or may not choose to populate. Do I even want to start measuring that? Does that matter to me? Is that field being consumed? To Kostas’s point, is it ever being consumed downstream? And that is where you get into that subjectivity of, does observing it even matter? And to follow up on that, the business stakeholders care about the data when it’s actually being used in a data product. When I say data product, I mean something like a dashboard, or an ML model that’s generating output, something that they are then consuming. I think the best way to do this is to surface that information as close to the consumer as possible. So we’re talking about pushing it into the BI tools, into their query editors, into their data catalogs, where they’re actually starting to interact with the data. Now, this is where subjectivity comes back.
What matters? How do you determine that a dashboard is no longer fit for use? The only way to do that is to take the person who has built the dashboard and have them say, here is what matters about this dashboard, here are the properties that need to hold true. And that is always subjective on a business-by-business basis, and even a dashboard-by-dashboard basis. And so that’s why I still stand firm that data quality is subjective. But hopefully, Kostas, I’ve also answered your question.
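The two signals called non-negotiable above, load cadence and record count, are simple enough to sketch. The following Python is a minimal illustration, not any vendor’s implementation; the function names, the six-hour cadence, and the sample metadata are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at: datetime, max_age: timedelta, now: datetime) -> bool:
    """Freshness: the table was loaded within its expected cadence."""
    return now - last_loaded_at <= max_age


def has_expected_volume(row_count: int, min_rows: int) -> bool:
    """Volume: the latest load brought in at least the expected number of rows."""
    return row_count >= min_rows


# Hypothetical metadata for one table; in practice it would come from the warehouse.
now = datetime(2022, 6, 1, 12, 0, tzinfo=timezone.utc)
last_loaded_at = datetime(2022, 6, 1, 9, 30, tzinfo=timezone.utc)

fresh = is_fresh(last_loaded_at, max_age=timedelta(hours=6), now=now)
enough_rows = has_expected_volume(row_count=0, min_rows=1)
print(f"fresh={fresh}, enough_rows={enough_rows}")
```

Anything beyond these two checks, such as whether a given column may be null, would need an expectation supplied by whoever consumes the data.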
Kostas Pardalis 42:01
Yep. You did. Ben, what do you think about that?
Ben Castleton 42:07
I definitely think when we talk about data quality, again, it goes back to our definition: it’s being fit for the purpose that you intend it for. And also, you don’t want to waste a lot. There’s a cost to testing data, there’s a cost to the software tools, both in the effort to manage them and in the technology as well. So you don’t want to test everything, you don’t want to observe everything, because that doesn’t help you. You want to observe the important things, and you want to test the important things. And so I agree, you kind of have to start with the end in mind. However, when I go back to our customers, I think very objectively I can always say, well, a good place to test is from the application to its first landing place or staging area. Are you dropping it in an S3 bucket? Then you want to know, did it get from the application to that S3 bucket in the same form, or did something get missed there? So, testing on ingestion into the data warehouse. And then you’ve always got this: is the data as you expect through the transforms? And is it as you expect right before it gets pumped out into either an AI model or some sort of BI tool? Those things are usually objectively true with the customers that we see; there are usually problems that happen in there. And so we see there are some objective rules that we want to test. You mentioned number of rows, Egor, and that very few things are objective; that’s an interesting idea. And I think you’re right that you have to be subjectively pulling in what business users want to be able to decide what other metrics you’re testing. But having standards about how to test that just seems like it’ll create a lot of efficiency. And so that’s kind of the angle we’re going after.
And I would just say one thing. When I think about observability, and how this conversation wraps together, I want to push back on the idea that you can have the source of truth be in a data warehouse. It really has to encompass a much broader set of infrastructure. And so you want to be able to execute data quality tests, and just observability, outside the data warehouse in order to get a complete picture. And that’s really important in our framing for our products.
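The ingestion check described above, confirming that what left the application is what landed in staging, can be sketched as a comparison of record counts and content fingerprints. This Python example is an assumption-laden illustration: the `fingerprint` helper and the sample records are hypothetical, and a real check would read from the application and from the landing location rather than from in-memory lists.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def fingerprint(records: List[Dict]) -> Tuple[int, str]:
    """Return (row count, order-independent content hash) for a batch of records."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(records), combined


# Hypothetical batches: what the application emitted and what landed in staging.
emitted = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
landed_ok = [{"amount": 25, "id": 2}, {"id": 1, "amount": 10}]  # same data, reordered
landed_short = [{"id": 1, "amount": 10}]  # one record went missing

print(fingerprint(emitted) == fingerprint(landed_ok))
print(fingerprint(emitted) == fingerprint(landed_short))
```

Sorting the per-record digests makes the comparison insensitive to row order, so only missing or altered records cause a mismatch.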
Eric Dodds 44:40
Ben, I’d love to dig into that a little bit. Let’s talk about what the jurisdiction of these various components is. We don’t need to get into all the components, right, but the storage layer seems to be a really logical starting point because you’re trying to make Snowflake your source of truth. Great. I mean, that’s good. And that actually, in many ways, can sort of expedite some of these data quality issues, because you have a comprehensive way to detect certain things, et cetera. But what is the jurisdiction? Feel free to take the question wherever you want. I’m interested in the philosophical aspect of that, right, because, as Egor said, you can actually reach into the BI tools that people are using. So what is the jurisdiction? I’d love to know the way that you think about that.
Ben Castleton 45:38
Yeah, well, I think it comes back to the business drivers. And it was mentioned, I can’t remember if it was you, Kevin, that, sure, a go-to-market team uses data, like if you’ve got your data in Salesforce or some other CRM. By the way, totally separate topic, but there are new ways of selling, and sales teams want to use data. There’s all this innovation happening around that with product-led growth. And you think about the data that’s used there; that’s going to drive where you want to test it. So if you’re really working in Salesforce, for example, you’re going to want to have some tests around the Salesforce integration with whatever product analytics you’re doing. And you’re going to want to see that the data is as you expect, so that your sales teams are operating on infrastructure that is producing the stuff that they want to use every day. You don’t want to have that just show up two weeks later, when your sales team has been inefficient and unable to manage their processes because the data is bad. So that’s what drives it. And so we talked about wanting to test it at the source, and then there’s no jurisdiction here besides a pipeline, I think, and that usually crosses a wide variety of infrastructure.
Eric Dodds 47:01
Yeah, fascinating. Okay. So Kevin, you described Metaplane as plug and play so what’s your take on jurisdiction?
Kevin Hu 47:10
It’s a tough question, right? I think even in software observability, the jury’s still out on whether you want to test the symptoms or test the causes. There are arguments for and against both, right? You could say that if we are monitoring data within Snowflake and something goes wrong, that is too late; the problem has already occurred. Or you could say, and this is returning to the previous topic, that this is what matters to the person using the data, and therefore we should test it. I think there’s been a trend towards monitoring the symptoms first: one, because that’s much more aligned with what the users perceive, and two, because that helps you prioritize what kinds of causes to debug upstream. But the jury is still out. I think there are two ways to do it. And this is another case, Eric, of everything old is new again, right? When we’re talking about focusing on the outcomes of data, well, people have been talking about this in the academic literature for 30 years, right? Extrinsic versus intrinsic data quality dimensions: do you want to enforce referential integrity, or go talk to your user? And testing the symptoms versus the causes is yet another thing where I think they came to a conclusion 30 years ago, and we’re trying to re-derive it for the modern data ecosystem.
Eric Dodds 48:33
Got it. All right, Egor, what say you?
Egor Gryaznov 48:36
I think there’s a difference in jurisdiction between what is monitored and where the information is presented. So going back to what Ben said about monitoring: you have a process that takes data and puts it into S3. It needs to actually land there, it needs to be a non-zero-byte file, whatever other properties it needs to have. When I talk about data observability, I’m thinking about observing, looking at the data contents itself. I feel like there is a whole other dimension to data quality, which is, are the processes that are generating my data running correctly? And I feel like there needs to be a little bit more disambiguation in the term. I don’t know, maybe it’s data pipeline monitoring, process monitoring, whatever we want to call that. But there’s a whole other sphere of monitoring, which is actually much closer to application monitoring, and which can be much more objective, such as: did my pipeline run, yes or no? How long did it take to run? How much memory did it consume? All of these are properties of the process itself that you can start monitoring and collecting signal on. So I think that is still under the jurisdiction of data quality, and it impacts the state of the data. But I actually think it’s a slightly different problem. In terms of the BI tool thing that I mentioned, I think that is surfacing the information that is coming out of the systems. I do not want to have jurisdiction over the BI tool. That’s not in my best interest, not in Bigeye’s best interest at all. But we need to surface the signals that we’re collecting, the information that we know about the state of the data, closer to where the consumers are looking at it. And if that means pushing that into BI tools and pushing that into data query tools, that’s what it has to be.
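The process-level signals listed above (did the pipeline run, and how long did it take) are closer to application monitoring, and a rough Python sketch shows how little is needed to start capturing them. `RunReport` and `monitor_run` are hypothetical names for this example, and memory usage is left out to keep the sketch short.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunReport:
    succeeded: bool
    duration_seconds: float
    error: str = ""


def monitor_run(step: Callable[[], None]) -> RunReport:
    """Run one pipeline step and record whether it finished and how long it took."""
    start = time.perf_counter()
    try:
        step()
    except Exception as exc:  # record the failure instead of crashing the monitor
        return RunReport(False, time.perf_counter() - start, repr(exc))
    return RunReport(True, time.perf_counter() - start)


def load_orders() -> None:
    """Stand-in for a real pipeline step."""
    time.sleep(0.01)


def broken_step() -> None:
    raise ValueError("source file missing")


ok_report = monitor_run(load_orders)
failed_report = monitor_run(broken_step)
print(ok_report.succeeded, failed_report.succeeded, failed_report.error)
```

These reports say nothing about the contents of the data, which is the distinction being drawn between process monitoring and data observability.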
Eric Dodds 50:43
Great. All right, Manu, I would love for you to have the last word briefly on jurisdiction, and then we can wrap up with a really good question from one of the listeners.
Manu Bansal 50:52
To bring into context what Damien just said on the chat here, right, which is very relatable actually, that used to be my experience when I was building data pipelines. And we see this time and again, where it eventually lands on the data engineer building the pipeline. So as much as I think the producers of data should own data quality, it should not mean that they become a bottleneck, and all data quality becomes that one person’s or one team’s responsibility. But we see that happen just way too many times, right? You go to data engineers and they’re frustrated; instead of building pipelines, they’re just chasing data quality issues. And that used to be me. And then I’m going around trying to understand the business context and saying, hey, it’s not really my job to do, or I’m not the expert on what data quality should even look like, or how to interpret this data. So I think it goes back to who should own data quality. I think that role needs to be carved out. And the more we are talking to customers, the more we’re seeing a separation happening, where they’re creating an offshoot team out of data engineering and starting to call that a data quality or a data governance team. So these are people who are engineers by trade, who understand how to operate data, and who are now starting to ramp up on interpretation of data, have an operational mindset, and enjoy doing that, right? Now, this is more of a platform team though, right? They’re not the only ones responsible for every single system. Typically, the source of truth actually is not Snowflake. If transactions don’t reconcile in Snowflake, big deal. No one gets fired, right? Because you will go to the Oracle DB. And if that doesn’t reconcile, there’s a problem. But if that’s working fine, your transactions are fine, right?
So now you can’t hold this one person responsible for data quality across Snowflake, and Oracle, and Kafka, and data marts being shipped out of dbt, and whatnot, right? So they are an enabler. They need to create an easy medium for different stakeholders to come in and specify their own data quality tests, which could be on sources, or downstream systems, or anything in between, right? So that’s the kind of evolution I see this going into.
Eric Dodds 53:06
Yeah, love it. And we’re right at the buzzer here, but I’m going to read this question because you answered it, Manu, which is great, but I’d love to hear from the other panelists. I’ll just read this really quickly to give the listeners context. So Damien said: hi, everyone, I’m a data engineer working for a startup located in Paris, and I’m the only data engineer so far. I think a lot of us can relate to that. Regarding responsibility for data quality, I really strive not to become the single point of knowledge and responsibility for all pipelines, simply because it’s not possible to know everything about the business. That said, I really believe in what could be called DataOps, and in providing the tools and infrastructure to empower developers to be more conscious about data quality and what they are producing. But now I wonder, do you think this is a good approach? And if so, do you have any advice on how to onboard people onto these topics? So Manu, a great answer. And let’s see, we’ll go Ben, then Kevin, and then Egor can close this out.
Ben Castleton 54:04
Yeah, great question. And it is near and dear to many of our hearts, I think. I would just add a couple of other things. I think data quality tools are not going to answer the entire spectrum of this problem. It is super important to be integrated with a stack that allows you to do some of this. I would also be leaning heavily on some of the data catalog companies here, and leaning heavily on making sure that how I’m looking at data quality is integrated with my integration tools and with when data is moving, and then making sure that people across teams can see that. So collaboration across these teams is super important, and I really believe strongly that it should be mediated with software; software is well suited to do this. And so that’s what I think a lot of us here on this call are trying to build, to make that easier. So a great question.
Kevin Hu 55:03
I agree that no one tool will solve all of your problems. And I would start by speaking to your audience, right? Speak in a language that they understand, appealing to what they’re interested in. If you go to an engineer and say, you shipped an event name change that caused fallout with downstream data assets, eyes glaze over. If you go to your VP of sales and say, there are data quality issues, we have to invest in data integrity, eyes glaze over. Talk instead about, we want you to ship this change without us having to come back to you one month later yelling at you. And we want your reps to put in data in a way that makes it so that your dashboards can be up and you can be confident in them. And then once you speak the language of the rest of the org, then take it to the next level. Then talk about contracts. Talk about what they expect from the data and everything that we just talked about. It’s easy at that point.
Egor Gryaznov 56:01
Two pieces of advice from me. One is to start small and figure out the biggest pain point in the business right now. To Kevin’s point, speak the language of the business, figure out what they are struggling with the most, or what you are struggling with the most in order to support the business, and try to solve that problem and build the tooling and process around those areas. If it’s data quality, then that’s what it is. If it’s data discovery, and nobody knows where any dashboard lives or how to find them, maybe that’s the first place you need to look and solve the problem. The second, I think, to answer the original question: is this a good idea? Yes, totally a good idea. The best way to onboard leadership into this, in my mind, would be to explain to them how this is going to help you and your team scale. You’re a single data engineer at a small startup. You need to be efficient, and you are probably playing a lot of roles in the organization. And by showing how DataOps can help you scale through tools and processes, you can say, look, I don’t have to spend an hour a day doing this data quality check, because somebody else can go in, define what they expect, and get the notifications, and they will just funnel the most important things back to me. And that is going to resonate a lot with your manager and with your leadership team, because they’re going to say, great, you want to make yourself more efficient. We are all for that. Because now we have to hire only one other data engineer rather than two more, because you have the tools and processes in place to become more efficient. I think that’s going to be a very easy sell in your organization.
Eric Dodds 57:39
Love it. Well, this has been such a helpful show. I learned so much about data quality and all of the other components that surround it. So Ben, Manu, Kevin, Egor, thank you so much for giving us some time and joining us on The Data Stack Show live. And thank you to all the listeners for all the great questions.
Manu Bansal 57:57
Thank you. Pleasure being here.
Egor Gryaznov 57:59
Thanks for having us.
Eric Dodds 58:00
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.