The PRQL: What’s the Hardest Part About Data Quality?

August 12, 2022

In this bonus episode, Eric and Kostas preview their upcoming conversation with James Campbell at Superconductive.


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most of their customer data while ensuring data privacy and security. To learn more, visit RudderStack's website.


Eric Dodds 00:05
Welcome to The Data Stack Show prequel, where we talk about the show we just recorded to give you a little teaser. We talked with James from Great Expectations, which is a data quality tool, and it's really interesting. I think one of the things that was really interesting to me about the show was their approach to solving the data quality problem. A lot of the data quality companies we've talked with sort of sit on top of some repository of data, right, and then detect changes. So it sits on the data warehouse, or the data lake, or whatever, and detects variances, but it sits on top of a repository. Great Expectations takes a different approach: they insert checkpoints based on very explicit definitions. So you insert checkpoints within a data flow. You've built data products, you've built data pipelines. Do you think there's merit to one over the other? Actually, this is a better way to say it: do you think you need both methodologies? Or is there one primary way that you would approach solving data quality?
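The contrast Eric describes, validation checkpoints inserted inside a pipeline rather than monitoring a data repository from the outside, can be sketched in plain Python. All names below are illustrative and hypothetical; this is not Great Expectations' actual API, just a minimal picture of the idea.

```python
# Minimal sketch of a checkpoint-style data quality gate inside a pipeline.
# All function names here are hypothetical, not Great Expectations' real API.

def expect_no_nulls(rows, column):
    """Expectation: every row has a non-null value for `column`."""
    failures = [r for r in rows if r.get(column) is None]
    return {"success": not failures, "failed_rows": len(failures)}

def checkpoint(rows, expectations):
    """Run every expectation against the data mid-pipeline.

    Raise if any expectation fails; otherwise pass the data through
    to the next pipeline step unchanged.
    """
    results = [check(rows) for check in expectations]
    if not all(r["success"] for r in results):
        raise ValueError(f"Data quality checkpoint failed: {results}")
    return rows

# A pipeline step wraps the checkpoint around incoming data, so bad
# records are caught in flight rather than detected later in the warehouse.
raw = [{"user_id": 1, "email": "a@x.com"}, {"user_id": 2, "email": "b@x.com"}]
validated = checkpoint(raw, [lambda rows: expect_no_nulls(rows, "user_id")])
```

The point of the sketch is where the check runs: inside the data flow, gating each batch against explicit definitions, instead of sitting on top of the warehouse and watching for variance after the fact.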

Kostas Pardalis 01:15
You're really asking hard questions today. What is wrong with you? My answer is, I don't know, to be honest. I think there's probably no one way that you can ensure quality. And I think we have discussed this a lot with all the data quality folks on the show. There are so many different aspects of it that you have to go after. So I would say that, I mean, I don't know, but my guess is that, yeah, probably you need both. But it also depends on, let's say, the use case that you have, how you work with data, and also what kind of data you work with. What I find very interesting with Great Expectations is that they are not focusing only on the problem of running tests on the data. They also focus on helping people come up with the definitions themselves, which is, let's say, not a purely technical problem, but it's a very important aspect of the problem of quality. Is the right way to do it with collaborative environments like notebooks, or in a different way? I don't know. I think it's still early in the industry. There's more experimentation to happen there, and we'll see at the end what the market will adopt and use. But I have to say that, regardless of how the solution will look, this problem of communicating and defining the expectations that each person has around the data is part of the problem, and probably the hardest part. So sooner or later, I think most of the vendors out there will have to address it somehow. Yeah.

Eric Dodds 03:25
I agree. It was a super interesting episode. And to give another little teaser here, I think what makes Great Expectations really unique is that when you write tests, they automatically turn into documentation that's easily understandable by, you know, data consumers, which is a really unique approach. So definitely tune in to hear James talk about all things data quality and the way that Great Expectations solves the problem, and we will catch you on the next show.