Episode 100:

Data Quality is Relative to Purpose with James Campbell of Superconductive

August 17, 2022

This week on The Data Stack Show, Eric and Kostas celebrate their 100th episode with a chat with James Campbell, the co-founder and CTO of Superconductive. During the episode, James discusses data quality: the problems, a solution, and much more in between. From his perspective, let’s learn how different expectations can coincide.


Notes:

Highlights from this week’s conversation include:

  • James’ role at Great Expectations (2:33)
  • What Great Expectations does (5:49)
  • How Great Expectations approaches data quality (7:01)
  • Why a data engineer should use Great Expectations (16:41)
  • Defining “data quality” (19:16)
  • Translating expectations from one domain to the other (27:00)
  • Community around Great Expectations (30:59)
  • The user experience (33:41)
  • Something exciting on the horizon (40:27)
  • Interacting with marketers in a non-technical way (43:57)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show, Kostas. Today, we are talking with James from Great Expectations. Now, we've already talked with Ben from that company, and so we've gotten some interesting thoughts on definitions around data quality, etc. But Great Expectations is a fascinating tool. They have a command line interface and a Python library, and so the way that they approach the problem from a technical standpoint is super interesting. One of my questions, if we have time to get to it, is around how they think about the interaction between different parties within an organization who need to agree on data definitions. It's a huge thing with data, right? You have some sort of variance from some data definition. So I want to hear their approach on that, both in terms of whether their product supports it and from a philosophical standpoint, because there are potentially some limits to what software can solve in that regard. So that's my burning question. What about you?

Kostas Pardalis 1:29
Well, it seems like you’re going after the hard questions, so I have to be the good cop this time.

Eric Dodds 1:36
You’re usually the bad cop.

Kostas Pardalis 1:43
My intention is to talk with him a little bit more about the product and the technology itself. We had the opportunity with Ben to talk a lot about data quality and the need for it and all that stuff, at a little bit of a higher level. So I think it's a great opportunity to get a little bit more tangible around the product: how it is used, what kind of problems it solves, and in what unique ways these problems get solved by Great Expectations. So that's what I'm going to ask.

Eric Dodds 2:16
All right, well, let’s dive in.

Kostas Pardalis 2:18
Let’s do it.

Eric Dodds 2:20
James, welcome to The Data Stack Show.

James Campbell 2:22
Thank you so much. I’m excited to be here.

Eric Dodds 2:25
All right, well, give us your brief background, and tell us what you do at Great Expectations.

James Campbell 2:31
I'm the CTO at Great Expectations and one of the co-founders of the project, together with Abe Gong. It's crazy to think it's been about five years now. Wow. And it's been quite a journey, driven tremendously by community. And now, with the company, getting to focus on product is really delightful. Before working on Great Expectations, I spent most of my career in the US Federal Government, specifically in the intelligence community, as an analyst. So I did a lot of work originally on cybersecurity and understanding strategic cyber threats, and then on broader political modeling. And in both of those domains, I had a really exciting chance to move back and forth between very quantitative and very qualitative types of analysis. I sometimes joked that some of my job was Microsoft Word, and then I'd go have a job in Microsoft Excel, and back to Word and then back to Excel. Obviously, not just Excel, for data volume reasons. But that's been a lot of how I've gotten to spend my time. And now it's just a delight to work across, again, so much of that domain at Superconductive.

Eric Dodds 3:43
Yeah, very cool. Tons of questions about Great Expectations, but really quickly: it's always interesting to hear about things like political modeling, etc. I think those of us who aren't in it, which is most people, have this idea from the movies of uncovering secrets and all that sort of stuff. But tell us what it's really like to be an analyst and do political work.

James Campbell 4:12
The problem with that question, Eric, is if I tell you, I have to kill you, and you’re so far away.

Eric Dodds 4:19
I love it.

James Campbell 4:24
Firstly, I think one of the key things that's really important to remember in that field is that there's a whole bunch of different sources for how you build models. A lot of contemporary machine learning and AI focuses on the structure of data, and on trying to use data as the driving factor in building an understanding. At least in my experience, a lot of the practical modeling applications are still very much driven by significant domain expertise being put into the model itself; the structure of the model plays a significant role. Maybe one way to say it is that big data was all the rage, and there are still significant worlds where it doesn't take a lot of data. In some ways, the defining characteristic of the intelligence world is that maybe there's just one critical piece of information that changes everything, and you're on the hunt for that.

Eric Dodds 5:18
Super interesting. Well, I could go on and on about that, but I want to talk about some of the guts of Great Expectations. We talked with Ben and had some really good chats around definitions of data quality, etc., and so I'm excited to dig into the technical details. So, first question about Great Expectations. Actually, before we get going, could you just give us a super high-level view of what Great Expectations does, for listeners who may not be familiar?

James Campbell 5:47
Absolutely. Great Expectations gives users the ability to make verifiable assertions about data. So it allows you to define what you expect. It also helps you learn what to expect from previous data. And then we can test those expectations against new data as it arrives, and produce clear documentation describing whether or not it meets the expectations and when it doesn’t, what exactly is different?
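
To make "verifiable assertions about data" concrete, here is a minimal sketch using Great Expectations' pandas-backed API; the file name and column names are invented for illustration.

```python
import great_expectations as ge

# Load a batch of data as a Great Expectations dataset
# ("orders.csv" and its columns are hypothetical).
df = ge.read_csv("orders.csv")

# Declare what we expect from this data; each call is checked immediately
# and also recorded in the dataset's expectation suite.
df.expect_column_to_exist("order_id")
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("order_total", min_value=0, max_value=100000)

# Validate everything at once; the result describes which expectations
# passed and, for any failures, what exactly was different.
result = df.validate()
print(result)
```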

Eric Dodds 6:16
Super cool. Okay, so this is my first question. Actually, looking at the show on the calendar, I've been so excited to ask you this. Data quality is a broad problem, and there are a number of ways to solve it, right? I mean, even including brute-force SQL against really raw, messy data in the warehouse, which everyone hates. But what's interesting to me, if I can put it this way, about the geography of the data stack as it relates to data quality, is that you can address the issue of data quality in multiple places, and maybe you do want to address it in multiple places. This is a two-part question. The first part is, where does Great Expectations sit in the stack and in the geography of the data flow?

James Campbell 7:12
Great question. And I think the answer is sort of everywhere, but it's not the same expectations that will exist everywhere. So when you think about the stack of data that you describe, I think there are two pretty distinct things that happen in it. One is that data moves between systems, and is potentially enriched or augmented or clarified along the way. The other is that data is synthesized, right? You have a roll-up, you have an analytic running, you have a new model running on data, and the output of all those things is another data set. So there are two pretty distinct operations in a data system in that way. And for each of those types of operations, Great Expectations helps you both protect the inputs and protect the outputs. One of the things we've talked about on our team is that this is what makes Great Expectations powerful, but it's also a challenge for us to make it easy for users to understand how to use it effectively: helping to differentiate those different ways that people are addressing data quality problems.

Eric Dodds 8:30
Super interesting. Okay, so can you dig in one click deeper? Those two key points, where data is moving between systems and where data is being synthesized into another data set, how does Great Expectations interact with them? Because one is, actually, this is more of a question for you, right? If data is moving between systems, it can be raw data that's maybe being flattened out to go into a warehouse, where there actually could be transformations happening, etc. So I would love to hear about that, and then also the flip side, where data is being synthesized.

James Campbell 9:13
Yeah. So I think for the first case, where data is moving through transformation or enrichment, I think of that as being really applicable to what I'd call a contract model. There are vendors that provide data, and we can go out and buy datasets that are curated and, by definition, high quality. For example, it could be stock data, it could be health insurance records, it could be weather data; there are all kinds of datasets that have been processed and have characteristics that make them valuable for certain kinds of decisions. So the first thing there is being able to ensure that both parties understand what they're getting, right? When you're buying something, we want a contract about it: we want to know, I have this column, and this is what it should look like. Now, in the past, a lot of the ways we dealt with that was with endless, giant coordination meetings, and I kid you not, I've been the recipient of something like a 175-page diagram describing a dataset we were buying. It was like, what do I do with this, right? So a big part of what we're doing there is making it possible for you to agree in very precise terms that are self-healing. The biggest problem with that documentation, that 175-page PDF, is that it immediately gets out of date. But by making that contract a living artifact, something that can be tested as data continues to flow, a problem can be immediately flagged, and then we can also update the contract. With respect to the second thing, the analogy I think of is the concept of an emergent property from physics. If you're looking at a volume of air, you can think about where all the molecules are and what their characteristics are, and those might be the columns: this one is at location x, y, z, and has momentum alpha, and so forth. But in an analytic context, what we're doing is looking at a higher-order property: pressure, volume. We don't need to look at all those individual records anymore. And that's what a model is doing, right? It's taking all these individual pieces of information and synthesizing them together. The key thing that happens there is that the nature of the information is completely different, and we're reasoning about a different quantity. I'm not reasoning anymore about x's and y's and z's, I'm reasoning about pressures and volumes. Being able to support that kind of transition is really, really important, and it is one of our critical goals, and why we have invested a lot in supporting a contributor gallery, for example, where people can define expectations that are meaningful for them, like expecting this column to be in a particular geography. So we're not saying it has to be an x value and a y value, we're saying it needs to be in New York; if this is a lat-long, it needs to be in New York. That reflects how we think about data, and it helps you move to that emergent property, which is, I think, where data quality really needs to be as a field, because that's where we're helping stakeholders really get to the value they need.

Eric Dodds 12:29
Yeah. Okay. Super helpful. And I love the physics analogy: the individual components that make up something like pressure is a super helpful picture. The second part of the question is why you chose to solve it that way. I would love to hear you talk about ways that you had seen it solved before, and then why you decided to structure Great Expectations the way that you did.

James Campbell 13:01
That's such a rich question. I love it. The first thing is ways I've seen it solved before, and actually one of the important things is that it's not just before: when we encounter users of Great Expectations, I consider it a point of pride that many of them say, oh, I've written something like this, I've solved this, I've written the tests for nulls, for volume, for means, for stationarity of a distribution. The reason I think that's really a good thing is that it reflects the fact that we're in tune with how people process the world. Now, what's the key difference, the key insight that makes how we're solving the problem different? It's that we're providing what I would call a general-purpose language; we like to call it an open, shared standard. Some of the hallmarks of Great Expectations are that the names of expectations are incredibly verbose, and people love it. I love it. It's very precise: expect column KL divergence to be less than. It's these long names, but they mean something, and they help people really express their expectation. So why that? You asked, what does it give you? I think one of the most important things it gives you is explainability. When I see that some piece of data doesn't match my expectation, it can explain what the expectation was, because I told it what the expectation was. We should dive into some of the more technical details, because I don't want to suggest you have to sit at your keyboard and type "expect this" 100,000 times. No, you don't need to do that. But what we can do is make it really easy for you to get that very explainable report back of, all right, you thought this column was supposed to exist? It doesn't. In many cases those are the real problems that break dashboards.
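
As a small illustration of that explainability, each expectation call returns a result object whose verbose name and details describe what was checked and what was different; the column name below is hypothetical.

```python
import great_expectations as ge

df = ge.read_csv("sales.csv")  # hypothetical file

# The long method name doubles as the explanation that shows up in the report.
result = df.expect_column_values_to_not_be_null("northeast_region_sales")

print(result.success)  # False if any nulls were found
print(result.result)   # counts and sample values describing the failing rows
```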

Eric Dodds 15:08
Yeah, makes total sense. And the verbose naming couldn't be better aligned with the Dickens reference.

James Campbell 15:16
I’m impressed you– Yeah, you’re totally right.

Eric Dodds 15:19
That’s great. Super long, one-page sentences. Hopefully, your expectation in Great Expectations isn’t a page long. Okay. Kostas, I’ve been stealing the mic.

Kostas Pardalis 15:31
Yeah, you did, but it's fun, so you can continue doing it if you want. That's fine with me. I have a few questions to ask. I'd like to focus a little bit more on the product experience first, and also on the problem that would lead someone to a solution like Great Expectations. Many times, we assume that everyone who listens is aware of why we are doing the things that we are doing, but that's not always the case, right? So I would like to start from the very, very basics, and I'd love to hear it from you. Describe the work of a data engineer, or whoever is the person who faces problems that can be solved with Great Expectations, and walk through a small scenario until we reach the point where we can say, yeah, now we can talk about how this problem can be solved with Great Expectations. Can you do that for us, please?

James Campbell 16:34
I can do my best to present some. I think there are a lot of different ways, but one of the key whys, why people turn to this tool, is that they want to get ahead and be proactive instead of reactive. A lot of data engineering teams face this question of getting the phone call; we call them data horror stories sometimes, and I think other people use similar terms, and these are out there all the time. You get a call: hey, my dashboard is broken. When somebody says my dashboard is broken, I think it's useful to think about the way they're seeing the world. A great example would be a salesperson: northeast region sales shows zero. I know that's not true. And the reason I like framing it that way is that they had an expectation: I was out there, I made the sale, I wrote the ink down, I know it's not zero, I expect it's not zero, but it shows zero. So the data engineering team turns to Great Expectations in that case because they want to be ahead. They don't want to get the call, hey, the dashboard's broken; they want to see the issue first and be able to resolve it before it ever becomes this breakage or embarrassment. That's one of the really common problems. Another really common problem is the pager duty problem. That first example I gave is a semantic failure, right? The dashboard ran, there is a number in the cell, it's just not the number the person expected. Other times, you have a schema mismatch, or a load that totally failed, or something like that, where the key thing you're trying to solve is: I don't want to get paged at midnight, and if I am responding to a problem, I want to have the diagnostic information I need to get to a solution right away, and to be able to zero in on where the problem actually happened. There are variations on those, but they're both forms of being able to be proactive in addressing the core function that you're trying to solve as a data organization.

Kostas Pardalis 19:04
Yeah, absolutely. Okay. I have a slightly trickier question now, tricky also for me to communicate in the best possible way. When we are talking about quality in data, I'm always, let's say, confused, or going back and forth between two definitions. One definition, using a little bit of a metaphor here, is more like the stamp of QA that we put on products: it's about the consumer of the product, in this case data, being able to trust the product they're getting. That's one way we implement quality, as humans, in products. The other is to use tools like Great Expectations as a debugging tool, right? A tool for the developer, or the engineer, or whoever is responsible for anything related to the consumption of the data, to figure out what is wrong. Now, obviously, both functions are important. But in your opinion, both for Great Expectations and for data quality in general, which one of these two, let's say, definitions of quality do you think is closer to what is needed today?

James Campbell 20:34
That is a tricky question. I almost need to add a third one in order to be able to flesh them out, and it's very similar to debugging, but sticking on the theme of proactivity for a moment: it's proactive debugging, the way that you generate your expectations in the first place. And the reason I think that's really critical is, as you alluded to, that quality is a term that is relative to a purpose. In some ways, quality is your fitness for doing some particular job, and so it will vary; the same data is high quality for purpose one and not high quality for purpose two, for example. So being able to support the process of generating your understanding, kind of your mental model of the world, is actually one of the most important things the broader data quality ecosystem can do. Okay, so I added the third, and maybe what that lets me do is say, look, they're all important. It's really more a matter of phase. And to be honest with you, this is something we've been talking about a lot on our team internally lately: making sure that we're not trying to have the same conceptual objects, in this case literally the same code objects in Great Expectations, do too much, but rather exposing APIs and interfaces that are more intuitive for the way that you're using the tool. That might be in the phase of building and creating my expectations; it might be in the phase of ensuring quality, i.e., performing that QA function to make sure that I'm going to meet the needs of the product or the downstream data consumer; and it also might be performing a debugging task. Now, that last one is pretty challenging for us today because it's very interactive, and it's also interactive on a potentially historical data set, so there's a lot of nuance there we should dive into if we get into more of the technical detail. I don't know if I basically sidestepped your question; I said all three.

Kostas Pardalis 22:53
No, it's fine. First of all, I think there's value in that third dimension. And to be honest, I think these questions are a work in progress, both the questions and the answers, and it's important to ask them even when we don't have answers right now. We need to think about that stuff. And I think it's important not just for those of us who, in the end, sell products around this stuff, but also for the people whose job it is to ensure the quality of the data and deliver data sets for people to work on, to be able to ask the right questions, or have the right, let's say, conceptual models around that kind of stuff. So I think there's always value in these conversations, even if the answers aren't a clear yes or no, right? It's fine. It's good. It's important to have the conversation.

James Campbell 23:56
To that end, actually, I think there is one part of quality as QA that is really important, but that is a little bit less obvious or less clear in a lot of platforms that purport to provide data quality: it's really a two-way street. There's the provider and the consumer of data, and I've been both. If I'm providing an analytic model or a curated dataset, or a dashboard, or just a data product, a giant collection of records, it's actually potentially very useful for me to be able to package together with it a description of what I think it is good at doing. And similarly, if what I'm providing is a model without data, it's actually very valuable for me to be able to say, when you use this model, make sure that your data looks like this. And that way you can clarify. IKEA is really good because the instructions are simple; in some ways, so is Lego. Think about Lego and its incredible instructions. Imagine being able to give the consumer of a complicated product really elegant, Lego-like instructions based around the data.

Kostas Pardalis 25:20
That's a very interesting point. How do you think we're gonna get there? Because I don't feel like we have that right now.

James Campbell 25:28
Yeah. I think the answer is going to be that we allow people to provide expectation suites together with their products, which are validated at the time that the person brings in their own data.

Kostas Pardalis 25:41
Sure, but okay, let's get a little bit deeper into that, because it also touches the product experience. An expectation might be expressed in a very different way from a technical point of view than from, let's say, the point of view of the consumer of the data, who might not be a technical person, right? Let's say we have a marketer. In the end, the marketer wants to know: can I trust that the segmentation I'm going to do on this data is representative of the reality out there? But on the other hand, the data engineer probably doesn't even know what segmentation is, or they shouldn't care, right? It's not their job to do that, or their brains are not trained to think that way. Maybe they think more in terms of the security of the data, or whether we have the data or not, these kinds of parameters. So how do we bridge this? How do we semantically, maybe that's the right way to say it, translate the expectations from one domain to the other, so we can apply them in the end?

James Campbell 26:53
That's such a beautiful question for me. To me, that is the core of what we're doing. So let's dive in a little bit into how Great Expectations does that. Specifically, the way we do that composing is that we have a core object, one of the key concepts in Great Expectations, called the expectation. And what it does is translate; it is a semantic translation machine. We often call it a grammar, right? It's this long, verbose sentence. Now, I definitely don't want to suggest that this is an easily solved problem. But let's take your hard example of a marketer. What they're saying is, expect this segmentation to make sense, right? That's how they're thinking about the world. So we have to decompose that into what we call metrics. What an expectation does is ask for metrics about the data. And metrics are a very general concept in Great Expectations; it doesn't have to just be a statistic. A mean could be a metric, the number of nulls could be a metric, but so could the number of nulls outside of a particular range, or the country code of lat-long pairs; those are also metrics. So it's a pretty general compute engine under the hood. The expectation author is providing this declarative model, that verbose declarative language, and they're also doing the translation into the metrics that make it mean that. And then Great Expectations is sort of an orchestration engine that goes out, reaches and touches the data, finds the values of those metrics, does the comparison and reassembly, and then surfaces the result in the language the marketer was using to think about the world.

Kostas Pardalis 28:49
Okay. Okay. This is great. How do we assign metrics? Like how do we come up with the right metrics for the great expectation from the great marketeer?

James Campbell 29:02
Yeah, well, I think our answer to that is community. Okay. And so we have what we call our Expectations Gallery, where we're trying to encourage a robust process of community engagement, for people to be able to expand the vocabulary of expectations to include things that make sense in their domain and that do these metric translations. In order to do that, they're adding new expectations or adding new metrics. What we're trying to do is make sure that we're providing the substrate, or the mechanism, for expressing that, but then letting them take ownership of the semantic and domain model. The short version, of course, is that there isn't a right answer to the question, is this a valid segmentation? It depends on the organization. One of the things we see a lot is what we call custom expectations. So yes: schema, nullity, volume, people use those expectations a lot. But they also say, okay, well, I want to say, expect column values to be, and I'll pick an example, a valid ICD code. Under the hood, that might just be translated into a fancy regex. But the fact that there's that translation is actually really important, because that's what makes it usable to the marketer, or the consumer in our case.
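
As a hedged illustration of the "fancy regex" point, a domain-flavored check can be expressed with a built-in expectation while keeping a readable name in the report; the column name and pattern below are purely illustrative, not a real ICD validation.

```python
import great_expectations as ge

df = ge.read_csv("claims.csv")  # hypothetical file

# A "valid ICD code" style expectation might compile down to a regex match;
# this placeholder pattern is NOT a complete ICD-10 rule.
df.expect_column_values_to_match_regex(
    "diagnosis_code",
    r"^[A-Z][0-9]{2}(\.[0-9A-Z]{1,4})?$",
)
```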

Kostas Pardalis 30:36
Yeah, absolutely. Okay. That's super interesting. And how is the engagement of the community around that? What have you seen so far? Obviously, I'm aware of the great community that Great Expectations has, but what have you seen happening there? What have you seen working and what not?

James Campbell 30:56
Sure. The power of open source is one of these things that just blows my mind over and over, so I certainly can't suggest that I can wrap my head around all of it. Some of the things that work: we've done hackathons, and, as I mentioned, we produce the gallery. We're doing some experiments in the space of what we call packages, where we have domain leaders in a field who are willing to commit and say, these are expectations that are useful and valid, or valuable for understanding the kinds of concepts that are relevant. An example of a community-contributed project that's really been interesting is a data profiling framework, where there are expectations built around the Capital One data profiler, and they use that as the semantic engine to infer types and allow you to make expectations like, there shouldn't be PII here, and then that gets translated through. Now, we didn't build that on our team, but we are excited about supporting that community. Your question was also about the challenges, and to be honest, yeah, these are still challenges. There are lots and lots of expectations out there, so there's a discovery problem, there's synthesis, and there's a lot of work left to do in that space. There's a lot of opportunity still for helping people engage, and that's what I would really emphasize: our goal of making this a shared standard that people can engage on together and improve. So that's an exciting area. That's one area. There are lots of other exciting things we're working on, too.

Kostas Pardalis 32:45
Yeah, these are very, super interesting parts of growing and building, having the community as part of the product experience itself, right? It is part of the product in the end. Anyway, that's another conversation for another time; we could discuss a lot about community. So what I want to ask you now is: okay, we talked about the problems and how you think about the solution, but let's talk a little bit more about the experience the user has with Great Expectations. How do I use Great Expectations? I mean, right now I literally have the idea that there are some expectations somewhere that I'm testing against my data, but how do I operationalize Great Expectations as part of my day-to-day job as a data engineer?

James Campbell 33:39
Yeah, that's a great question. We think of it as four key steps to using Great Expectations, and we call it our universal map. First, just to be really explicit about it, Great Expectations is a Python library, so you run a pip install. Now, not to get into the commercial aspect, we are building a cloud product as well that's designed to make it very accessible, expanding the reach to more people but also simplifying the setup. But for setting it up today, we run our pip install. The next step is connecting to data. This could mean I'm just going to grab a batch of data, like reading a CSV off my file system, and work with that data. There's more you can do to connect to data, where you also configure the ability for Great Expectations to understand what your assets are and how they're divided into batches, so that as new batches of data come in, we can understand the stream of data as a unit. So you connect to data, and to be honest with you, that's an area where I see us needing to do some work. Some of the magic of Great Expectations early on came from the fact that connecting to data was a one-liner read_csv, and as we've added power and expressivity around ensuring that you can understand batches and so forth, it has become a little more difficult. That's one of the things we're actually working on right now: bringing that kind of magical, viral experience back. Anyway, so the next thing I do is connect my data; it's literally adding a data source, which could be pointing at an S3 bucket, say, or connecting to a database or a warehouse. And this is one of the important things about Great Expectations: we work across all the different backends, all the SQL dialects, Spark, pandas in memory, pulling data in from S3, any of those things. So we do that configuration. The next thing is you create your expectations, and for us, that's a notebook experience. So you're in a Jupyter Notebook, you have a sample of data, and it's an interactive, real-time experience: I say, expect column values to be not null, and I get an immediate response back. We check that right away, and it says, hey, success, or actually, 5% of these values are null. So what we see there is this interactive, exploratory process where you're creating expectations. Another way you can do that is with profiling, where you ask Great Expectations to go out and build a model of this dataset and propose a long list of expectations back to me, and I'll choose which ones to accept, maybe all of them, and that becomes my expectation suite. So: create expectations. And then the last step is the validation. What we typically see people run is what we call a checkpoint, and they'll embed that into an Airflow pipeline, or into a Prefect pipeline, or wherever they're running their validations. It's just an operation: Great Expectations, run checkpoint. What that then does is produce a validation result, which we convert into a web page that you can share or post with your team that says, here were the expectations; these ones passed, these ones didn't pass, and if they didn't pass, here are some samples, examples of what went wrong. And so you have that very tangible, shareable, visible report of the state of your validation.
So that’s how Great Expectations gets used in practice.
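
A hedged sketch of that last validation step, assuming a project has already been initialized with `great_expectations init` and a checkpoint named "orders_checkpoint" has been configured against a data source and expectation suite (all names here are hypothetical):

```python
from great_expectations.data_context import DataContext

# Equivalent CLI form: great_expectations checkpoint run orders_checkpoint
context = DataContext()
result = context.run_checkpoint(checkpoint_name="orders_checkpoint")

# Fail the surrounding pipeline task (Airflow, Prefect, etc.) if the data
# did not meet expectations; Data Docs will have the detailed report.
if not result.success:
    raise ValueError("Validation failed; see Data Docs for details.")
```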

Kostas Pardalis 37:29
Okay, that's super cool. And you mentioned that you support pretty much every back end out there. How does this translate in terms of interacting with each back end? Let's say, for example, I have my data in a data lake on S3, or I might have the same data on Snowflake. So first of all, is the experience I get from Great Expectations the same, regardless of what I have as a back end?

James Campbell 38:00
That's one of the pieces of magic. We talked about how what Great Expectations is doing is translating between expectations and metrics; one layer deeper than that, we translate from the metric into what we call an execution engine. So let's suppose we're connected to a SQL warehouse: we'll translate the request for the metric "mean" into the SQL dialect that will give us back that metric. Or if we're in Spark, we'll translate it into the appropriate Spark command to say, give me that mean value back. Great Expectations is handling that, and re-bubbling it back up into the semantic layer for you.
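
To make that backend translation concrete, here is a hedged sketch using the legacy dataset API, where the same expectation method runs against an in-memory pandas batch or a SQL table; the connection string, file, table, and column names are placeholders.

```python
import great_expectations as ge
import sqlalchemy as sa
from great_expectations.dataset import SqlAlchemyDataset

# Pandas-backed batch: the mean is computed in memory.
pandas_batch = ge.read_csv("orders.csv")
pandas_batch.expect_column_mean_to_be_between("order_total", min_value=10, max_value=500)

# SQL-backed batch: the same request is translated into the warehouse's
# dialect, and only the resulting metric comes back over the wire.
engine = sa.create_engine("postgresql://user:password@host/db")  # placeholder
sql_batch = SqlAlchemyDataset(table_name="orders", engine=engine)
sql_batch.expect_column_mean_to_be_between("order_total", min_value=10, max_value=500)
```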

Kostas Pardalis 38:44
Okay, and is this something that is strictly interactive, or is it something where, let's say, I run the expectations every hour and the results are kept somewhere, so I can look at how the expectations have changed over time? How does this part work? Because I can see you're generating even more data there.

James Campbell 39:07
Totally. Right. Wow, this is such a rich area for us to continue to build on, to be honest with you. So today, the way that works is that the core validation result is a big JSON artifact. Like I mentioned, we do render that and translate it into HTML, so if you go to your generated Data Docs site, you'll see a list of all the validations that have run. Now, what we're doing right now in the cloud product is providing a much more interactive, rich, linkable experience, and you can't really do that in open source when you're producing a JSON report; it's just hard to have that kind of cross-referencing between different elements. We're making that more possible in a cloud environment. But in open source, what you can absolutely get is that list of all the validations: a little X, this one failed this time and this time and this time, and then it passed this time and that time.
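
For the open-source workflow James describes, here is a hedged sketch of how those stored validation results get rendered and opened as Data Docs; it assumes a configured project directory.

```python
from great_expectations.data_context import DataContext

context = DataContext()  # assumes `great_expectations init` has been run

# Each validation run persists a JSON result in the configured store;
# Data Docs renders those results and the expectation suites as static HTML.
context.build_data_docs()
context.open_data_docs()  # opens the generated site in a browser
```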

Kostas Pardalis 40:08
Okay. That's pretty cool. All right, so one last question from me, and then I'll give the microphone back to Eric. I think I've punished him enough for abusing ownership of the microphone at the beginning. Is there something exciting coming in the future for Great Expectations?

James Campbell 40:27
I am absolutely thrilled about the cloud product. And I know that probably sounds like, oh, of course, but it's because we have a new concept available to us, and that concept is the user. In a library, we don't really have access to the user; we don't see who you are, and when you're interacting, you're interacting with files and going to static web pages. Where we really want to go is facilitating collaboration, right? At the end of the day, quality is fitness for purpose. We talked about contracts at the beginning and ensuring that people are on the same page. So what I'm really, really excited about is the world in which you and I are sharing something, sharing a piece of data, and we both have our expectations about it, and we can just say, hey, let's go look at that validation result together, drop in a comment, like, oh, this expectation should be a little bit different. The potential for turning data quality into more of a collaborative enterprise is really exciting to me.

Kostas Pardalis 41:45
Super, super interesting. All right. That's all from my side. I mean, we'll probably need another episode to discuss this a little bit more. But Eric, it's all yours now.

Eric Dodds 41:57
So interesting. This has been such a fun conversation, James. Understanding the technical flow was super helpful. So, pip install Great Expectations; again, the Dickens reference is not lost on me. Amazing, amazing work there, and it's the best sort of smile in the mind. So let's play out a little example here. You kind of talked through the flow of a data engineer who is implementing these expectations, connecting to data sources, running checkpoints in some sort of orchestration tool or wherever they're doing that. But let's zoom out to the relationship around that data engineer. Let's just say that data engineer is Kostas, and he's using Great Expectations to drive data quality, but I'm the marketer. The expectations that Kostas is implementing on the marketing side, for whatever reports or this or that, sort of presuppose that Kostas and I have talked about what I expect to see in my reports, the data types in these columns, and all that sort of stuff. So, could you help us understand what that relationship looks like from the beginning? Where I come with a requirement and say, hey, purchase values always need to be in x format, because if not, then we undercount and my boss gets mad at me, and blah, blah, blah. So I have that expectation as a marketer, but I don't do anything technical, so how do I interact with that process? I'd just love to hear how your customers do that when an expectation originates with someone who's non-technical on a different team.

James Campbell 43:54
Great. Yeah, happy to dive into that. Actually, I think my favorite example of this is, as I mentioned, it's crazy to believe we started this, I think, in 2017, when we gave a talk about Great Expectations. At that very first talk, there was somebody on a data engineering team facing this problem. She implemented Great Expectations, still a very, very young product, with her team, and we had a chance to connect with her much later and hear about it. What she was doing, and I think this makes a ton of sense to me, and I do think we can make better workflows with a web app and so forth, was literally creating forms for her team, like a structured interview. So it became a kind of requirements elicitation exercise, and it really helped to structure the conversation in a valuable way. This is something I saw a lot in my analytics work, in what we sometimes call round trips to the domain experts. In your example, this marketing stakeholder is an expert in what the data should look like, and it's really expensive to have to go back and forth, expensive in time and complexity and all these things. So the first answer for how I see that done is to provide a mechanism for conducting structured conversations and interviews to elicit what those expectations are. The second way I would flag, and Kostas, this is similar to what I think you were hinting at, is experimentation and having a really good notebook experience. Now, maybe you're going to say, well, will domain experts sit down and write notebooks? No, I don't think that's the case. But what actually happens is that you can accelerate the level of domain expertise, if you will, of a data engineer extremely quickly when you're letting them operate on these kinds of higher-order concepts, instead of staring at a CSV file.

Eric Dodds 46:09
Yep, super interesting. Yeah, that makes a ton of sense. Well, actually, let me follow that up with another question, because there are different approaches to this, and I'm super interested in the philosophy behind the decisions at Great Expectations. Some companies that are trying to solve data quality think that there should be a software layer that facilitates that interview process, a structured interview process like you said. And that's interesting; I actually struggle with how I feel about that. I mean, not that I'm an expert in this, but actually validating data technically is a pretty big challenge in and of itself. But solving relational connections between two people who play very, very different roles in a complex stack, it's kind of like, can software even solve that problem? In one way, it's really encouraging to hear, just do a really good structured interview and have a form, and that is a way you can create a very helpful but simple interface between these two people, and then the data engineer can take that and translate it using the notebook interface. But all that said, philosophically, what role do you think software, and maybe even specifically data quality software, plays in facilitating that relational interaction, regardless of the actual technical data validation?

James Campbell 47:47
That’s a stumper for a last question.

Eric Dodds 47:49
You gotta end on a high note.

James Campbell 47:54
I don't know the answer to that question, and let me tell you what leads me to say that. I absolutely believe in the power of software to facilitate structured interactions of any form. So to me, when I say you have a form, you have a structured interview, I absolutely believe software could facilitate that and make it happen in a useful, powerful way. And to go even further, there is no doubt in my mind that software will play a role in supporting those kinds of interactions, and we can do that in intelligent ways. We don't need to start a form from a blank slate, like, what is the answer here? We can say, here's what we've seen in the past. Firstly, does that meet your expectations? Secondly, what would cause things to be different? So that's the first part of it; on that side, actually, maybe I differ slightly from your priors. On the other side, in modeling we often call this elicitation: you're interacting with an expert, you're eliciting their knowledge. And one of the things I've found is that experts can actually have a very difficult time understanding which parameters in a model are doing the work. Say I ask you about how many days it should be sunny in your city, because we're going to put a quality measure around a weather sensor, and we're going to say, hey, if it says it's sunny every single day, maybe it's broken. If I ask you that question, it can be really difficult to get it right. So I guess what I'm saying is, there's a huge amount there; it's a rich problem. And it doesn't have to be a problem for data quality to solve alone; it can be done in concert with a much richer ecosystem of ways that we facilitate collaboration between people. One of my long-term passions is the way that we communicate probability effectively: what does it actually look like for people to see probability and understand probability? So I think there's a lot to be done in that space as well. So I'm going to give myself a little bit of a pass that we don't have to solve that problem yet in order to deliver a lot of value in data quality.

Eric Dodds 50:24
Yeah, no, that was a really helpful answer, and sorry to throw that one in. Such a good show. Thank you for your thoughtful, articulate answers. We've learned so much, and I really appreciate you giving us some of your time.

James Campbell 50:41
Eric, Kostas, thanks so much for having me. It’s been a pleasure.

Eric Dodds 50:44
I tried to think of a really good takeaway from the substantive material in this show, but I'm going to phone it in, because I can't get over how much I enjoy the multiple references to Charles Dickens: not only in the name of the product, but pip install, Pip being the main character in Great Expectations, and then the verbose nature of how you name an expectation, those names being really long. A data quality product as a Python library having such clever references to Charles Dickens just makes me really, really happy. So that's a big takeaway.

Kostas Pardalis 51:37
Yeah, absolutely. There's some marketing genius behind it. I don't know, maybe that's what we should do, actually: get the founders into a conversation about how they came up with that stuff. It's not a data-related conversation, but it feels like a very fascinating topic, how they came up with it, because it is pretty unique and extremely smart. You can have a conversation with someone from Great Expectations and every other sentence makes a connection to Dickens, which is kind of crazy. So, yeah, we need to figure it out; we need to get the playbook from them somehow.

Eric Dodds 52:25
Yeah, absolutely. It’d be great to have all the founders on the show and have them read passages from the book, Great Expectations.

Kostas Pardalis 52:33
Yeah, absolutely. We should do that. But outside of that, okay, it's always a great pleasure to talk with the folks from Great Expectations, because you can see there are some very interesting ideas and very deep knowledge around how they solve the problem of data quality and how they're moving forward. And I'm also looking forward to seeing their cloud product and what it will bring to this whole experience of doing data quality with Great Expectations.

Eric Dodds 53:07
I agree. All right. Well, thank you for joining us on The Data Stack Show. Tell someone about the show if you haven't yet; we always like to get new listeners. And we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.