Episode 67:

Now is the Time to Think About Data Quality with Manu Bansal of Lightup Data

December 22, 2021

This week on The Data Stack Show, Eric and Kostas talked with Manu Bansal, Co-Founder and CEO of Lightup Data. During the episode, Manu helps define data quality, explains how to design tests for quality, and describes what Lightup is doing to help businesses build solutions to test and use their data.


Notes:

Highlights from this week’s conversation include:

 

  • Manu’s career background and describing Lightup (2:31)
  • Why traditional tools don’t work for modern data problems (6:04)
  • How a data lake differs from a data warehouse (11:35)
  • Defining data quality (14:07)
  • The business impact of solving and applying data quality (31:36)
  • Constructing a healthy financial view on the impact of data (41:09)
  • How to work with unstructured data in a meaningful way (47:44)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Automated Transcription – May contain errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to The Data Stack Show. Today we're talking with Manu from lightup.ai, and we're going to talk about data quality. And as a marketer, when I think about data quality and sort of how to understand it, there are so many variables. And in marketing, one thing we talk about a lot is seasonality, which a lot of marketers use as an excuse. You're laughing because...

Kostas Pardalis 00:52
Yeah, I think marketers are the best people at finding excuses for why they have bad data.

Eric Dodds 01:01
But I'm interested to know what you think about data quality. How do you control for things like that? Because that's a really challenging problem, especially when I think about a tool like Lightup, that's SaaS, that's trying to do that. So that's what I'm going to pick his brain about. How about you?

Kostas Pardalis 01:17
Well, first of all, we have to say that, like, it’s been a long time since the last time that we were together in the same place.

Eric Dodds 01:23
That’s true. We’re in San Francisco together. Yeah. This is a very special, it’s a very special episode.

Kostas Pardalis 01:28
Yeah. But when it comes to quality, to be honest, I think I have more fundamental questions, like, what is quality? What do we mean by that? We see many vendors coming into this space, and each one has their own definition of what quality is and how we should implement it. Everyone is using some kind of metaphor with SRE, DevOps, and infrastructure monitoring, like Datadog, "the Datadog of data," and all that stuff. But that's a very bold claim, right? So I would like to investigate and see how close the tooling we have to offer right now for data is to the tooling we have for SREs. And yeah, that's pretty much what I think I'll be starting with Manu about.

Eric Dodds 02:19
Well, we'll find out. Yeah, let's dig in. Manu, welcome to The Data Stack Show. We're really excited to chat about all things data quality.

Manu Bansal 02:27
I’m glad to be here. It’s exciting.

Eric Dodds 02:29
So give us a little bit of your background, kind of give us an overview of your career. How did you get into working with data? And then tell us about what you do at Lightup, and what Lightup is.

Manu Bansal 02:40
Yeah, let's dive right into it. It's a long history we're talking about here. So I come from a very technical background, computer scientist by training, did a lot of signal processing for my PhD at Stanford, which was an interesting detour, and was building software for wireless systems and embedded systems, dealing with a lot of data. So we kind of talk about data now in a very different setting. That was processing 20 million events per second at microsecond latencies at one point, right? And then we spun out that research into a company called Uhana, that VMware acquired in 2019, where we built a predictive analytics pipeline for telcos. Right, so this was the AT&Ts and Verizons of the world. It was a very, very interesting experience processing 3 million events per second through a distributed system, coming in over Kafka, going into Apache Heron, which is like Flink, and then dumping it into InfluxDB, a time series database, right? And then serving it out to workers to make use of that data. And all of that was happening at sub-second latency, very interesting scale, very interesting richness of data that you're dealing with in the telco space, which you don't normally hear a lot about. And one of the hardest problems that we had to deal with, which we didn't have a good solution to at the time, was just keeping the data healthy. It's like, on any given day, the data would just change on us without any notice. And before we knew it, we were producing junk on the output side, predicting, hey, Eric is going to get a gigabit per second on his iPhone, and the customer's like, you know, this is ridiculous, right? And it turns out, maybe sometimes it was our fault. At times, it actually wasn't our issue at all. We were just getting fed garbage data into the pipeline, and then the system was producing garbage out. And we could keep our services up and running, we could monitor application endpoints, but we didn't have a way to detect those kinds of data outages, so to speak. Right? And we said, okay, there has to be a better way to build and monitor data pipelines than just relying on the customer telling us your system is faulty right now. So we said, okay, this problem needs solving. We were doing that for telcos. But as we now know, the whole world has become data driven, right? It's fintech, it's consumer tech, it's hospitality, you name the vertical, right? And you guys are at the forefront of it in many ways, right? So you're seeing this firsthand, obviously. And now that we are starting to rely on data so heavily, we need a way to make sure the data we are building off of is actually worth trusting, right? And so we said, okay, this is a problem that needs solving. The old tools don't work in the new data stack. And we had ideas because we had seen this problem firsthand. And that's how Lightup was born. And that's what we're solving today.

Eric Dodds 05:35
Very cool. And let's dig in just a little bit there. So the old tools don't work in the new data stack. Could you give us just a couple of specific examples of the tools that were insufficient for the job in the context of dealing with that type of data, at that scale, and in the telco space?

Manu Bansal 05:52
Yeah, I mean, so if you look at data quality, it's a problem that's at least two decades old, right? If you look at the space, the Gartner Magic Quadrant on it, for example, right? Informatica has been talking about it since 2005, or actually maybe even before that. Talend has had a product since the early 2000s, right? So before we even talked about big data, we used to talk about data quality, and then the whole Hadoop ecosystem happened, and the rest is kind of history at this point, right. So the traditional tools are designed in a setting where maybe you had a spreadsheet worth of data. You got a data dump from a third party that feeds you consumer phone numbers or names, for example, right, and now you want to put it to use in your marketing campaign. So you have a data steward who would have days to look at that data, make sure it's all a good fit for use, right, and then publish it to whoever was the internal stakeholder. Or if you were distributing data, you would do the same process in that setting, right. That's what the old tools are designed for, you know, built for small data volumes, usually static data, right, where you have a human in the loop who can stare at it at length, right, and kind of facilitate that interactive process of manual judgment on data quality. Right? What we are now talking about is the kind of pipeline I described, where you have Kafka bringing in a million events per second, or even per day, right, feeding into a Spark system downstream. You're talking about maybe minutes of end-to-end delay, or even less. That's a setting in which you still have the same problem, which is: is my data healthy? And if it's going to go trigger an action at the end of the pipeline, or populate a dashboard for an exec, or show up as a result for my end user, I need to make sure that the data is healthy. But that's the setting in which the old tools simply don't work, right? I mean, for a variety of reasons. It's the scale, it's the real-time nature, it's just the data cardinality you're dealing with, right? It's not just one spreadsheet, you're talking about thousands of tables, hundreds of columns each. Right. So that's the setting that we are designing for now. That's the big change.

Kostas Pardalis 08:02
That's my number one question before we move forward with talking more about data quality. You mentioned a few technologies and a few architectures, and you also mentioned the term modern data stack a lot. Can you give us a definition of what is this data stack that we are talking about, where data quality becomes relevant?

Manu Bansal 08:24
That's actually a very interesting question, not just from the point of view of what is the stack that we're designing for, but also, why now is a good time to think about data quality, right? For anyone building the so-called modern data stack, right? So let me just take a segue back into how big data has evolved, right? It started out with, let's say, the Hadoop ecosystem, right? Very file based, hourly batches maybe were the best you could do, but it was mostly like once-a-day kind of data processing of large, big batches, right? Then we saw Spark happen, and that was kind of building on Hadoop and made it in-memory. And then we saw Kafka happen, right? And we used to talk about ETL stacks at the time, right? Extracting data, transforming it, either through disk or in memory using Spark, and then putting it to use by loading it into a data warehouse or database, usually, right? Because the data volume would be quite compacted, we would have almost produced finished metrics by the time it would be published out of the big data stack. Right. And that was a very hard stack to work with. It wasn't clear, if you wanted to even monitor it, where your monitoring tool would integrate, because data was all in flight, all in memory. You have Kafka, you have Spark, and it's like, what's my story, what's a canonical stack I can rely on, for which I can now build monitoring support? And it wasn't clear. Why we are now finally articulating a modern data stack, and it's catching on, is because we have now gone to an ELT architecture, right? So the data warehouses since then have caught up. Right? Back in, I guess, the early part of the 2010s, right, we didn't have scalable data stores. And then Redshift was starting to get born, and then you saw Snowflake happen, and then BigQuery, right. And this basically started to pull us back to the convenient architecture, which is easy to reason about, easy to work with, right, which is you have a central store of data. That's, let's say, Snowflake or Databricks. And now you have all the raw data landing into that one place, and getting persisted through stages of transformation all the way to finished metrics. Right? So to me, that's the crux of the modern data stack. We can debate what are the right tools, what's the right level of aggregation, should one house do all of it, or should it be 20 different components working together? That's less important. I think it's more the idea that you have the central store.

Kostas Pardalis 10:56
Yeah. And this central store you're talking about, are you thinking more in terms of a data warehouse architecture, or a data lake architecture? And I'm asking that because we started with Hadoop, which was processing over a file system, and today with data lakes we end up talking about file systems again, right? So which one of the two do you find, let's say, the easiest to work with, or the most important one?

Manu Bansal 11:24
So I think there's two things here. You know, one is kind of a logical choice. To me, a data warehouse is actually a logical structure: you're declaring a certain data store to be a data warehouse. But I could take Snowflake, for example, and call it a data warehouse. If I wanted, I could actually call it a data lake too. Yeah, right. So it really comes down to what I want to declare as less prepared, more raw data, and what I want to call finished data that is fit for use, right? We have heard people use a terminology where they would call Snowflake the data lake, right? And then you could take Databricks, and then you could say, yeah, that's my data lake and my data warehouse, right? It's really just a logical partition. Or you could just entirely work off of an object store, you just dump everything into S3, but you still need some query layer on top, right? So you could bring Spark as a query engine, or you could use Presto or Trino. Right? So you have choices there. So I think it's kind of less important to decide if you're using a data lake or a data warehouse style of design, right? I think it comes down to having a central store. And so that's one part of the story, right? The other is, what is the scale of data you're dealing with, right? Where we see this distinction becoming important is when the so-called data warehouse technologies like Snowflake start to be too expensive, just from a data volume point of view. People will start to say, let's give up on some of the functionality of the query engine and the structured data definition, right, and go to a more freeform, less structured data destination, which I'm going to call a data lake. But both are great. It's easier to work with a structured data set, because now you have a query language available, it could just be SQL. But what we are seeing, I mean, the lakehouse pattern, for example, is giving you the same facility at the lake scale. I think that's where the world is going to go. So that distinction is just going to keep shrinking, and it really just comes down to disaggregation, where you have a store which is scalable, and then you have a query engine on top which can serve out that data through a well-understood query language.

Kostas Pardalis 13:34
Alright, and let's go back to quality. I think we should define it, first of all. Yeah,

Eric Dodds 13:42
I think it's one of those things. I was thinking, you mentioned data quality several times, and it's one of those terms where I know it when, like, if I see bad data quality, I know it, but I don't really think about it unless I see it, right? Which means that there's generally a problem. So I'd love to know, what is your definition of data quality?

Manu Bansal 14:07
Hey, I'm happy that you know when you see bad data, because it's a great starting point. You're right, it's kind of an elusive thing to define, honestly. Right. You know, what helps is to think about the symptoms that you're going to see if you were running into bad data, right? And especially the symptoms that you don't have monitoring support for today. Right. So, what would be bad data issues? How would they end up affecting data consumers, right, in ways that would actually totally go unnoticed right now? So to me, that's the big umbrella way of defining data quality issues: something that creates this ridiculous output from your data driven product, whether it's an internal consumer or an external consumer, and went unnoticed, right. So how could it present? Maybe let's start there, right? Let's say you're a food delivery company. Now your orders are not making it to the person who ordered the food, because you have an issue with the data describing that address and the phone number, right? It could be as simple as that. Now your product is entirely failing, right? Or you're Uber, and rides are not showing up on time, and customers are complaining that the ETA estimates are off, because you are making an error loading data from the mapping service, so you don't have the right traffic information coming in. Right? It could be credit scores getting mispredicted, because the data you are pulling as a bank, or as a credit scoring entity, from the credit bureau is malformed, so now you cannot correctly compute the credit score, right? Or ticket prices are off by a factor, and $1,000 tickets are getting sold for $50. Right? So you have pricing errors, right? Data quality issues show up in a variety of forms, depending on the nature of the business, and often to an extent where there's direct top-line impact to the business in the settings we're discussing. Then there are kind of equally harmful, but let's say more face-saving, issues where you have bad data showing up on a CFO's dashboard, right? The CFO just says, look, sales volume is looking too low today, and I can't explain this. Right, turns out you're dropping transactions, so you're not counting your sales correctly. There'd be issues of that nature too. So to me, data quality issues are any of those issues where data is not what you expected it to be, and that issue went unnoticed, right? Not noticed by your IT monitoring tools, not noticed by APM tools, right? Anything that you have to monitor infrastructure is not able to catch it. Right? It's kind of like those hidden data outages, if you will, which could be dropped events, which could be data getting delayed, which could be schemas being wrong, or values just being plain wrong, you're reporting cents instead of dollars, and now everything is off. Right?

Eric Dodds 17:11
But one question for you, then. This is such an interesting challenge, because you kind of defined two broad categories, which I think is really helpful. One is the end customer who's using an app or a website or a service, and something goes wrong, right? Which is sort of the worst type of problem, because you're getting feedback from the people who are sort of the lifeblood of your business telling you that something's really wrong. Right. Like, it's too late at that point. Let's go to the example of someone in the C-suite looking at a dashboard for sales. I think one interesting challenge in that example is that the data engineer or analyst who is managing the pipelines to deliver those reports, sometimes it may go unnoticed because they don't have context for what the thresholds are, right? Like seasonality, or we hired a bunch of new salespeople. I mean, there are a lot of factors there which I think complicate that, because they may notice it in terms of, oh, well, this number looks lower than it did last week, but sales forecasts fluctuate, right? Okay. Can you speak to that a little bit? Because there's also this organizational side of it, where the people dealing with the data may see the actual data and a deviation from whatever is expected, but they don't have the context to interpret that, necessarily.

Manu Bansal 18:43
That's a great question, actually. And I've been drawing comparisons to other monitoring tools that we understand better now, right, like Datadog or New Relic. But there's a big difference between the two: monitoring data from a data quality point of view versus monitoring a more standardized asset like a virtual machine, right? When you talk about monitoring IT, and you're talking about monitoring a container or a VM, the majority of metrics are well understood. They're standardized. It doesn't matter if it's an AWS VM or an Azure VM, right? You're monitoring CPU and memory and disk, right? Sure. When it comes to monitoring data, though, that's no longer true. And that's why it's such a unique and, in many ways, more challenging problem, which is that you're dealing with the customer's data model here. Right? And that data model is actually specific to the organization whose data quality you're trying to monitor, right. I mean, I would even argue that Lyft and Uber are not going to have the same exact data model internally. Right. Sure. And that's why we are seeing all the data technologies basically enabling definition of data models or data quality monitors, as opposed to prescribing a fixed set of pre-built definitions. Right? So that's what we need to do. That's the kind of grand challenge, if you will, but we're seeing success stories all around right now. I mean, dbt, for example, is really structuring how you could produce your data model, day over day, minute over minute if you wanted, where the model definition is still coming from you, who is the data engineering team, yeah, or the analyst team. Or RudderStack, for example, you guys are doing data collection, where you will let the end user define what data should be collected, what the event structure should look like, right. And what the tool is providing you is an easy way of encoding your own definitions, so that the system can take over and put it into production, and do it at scale, continuously. It's the same with data quality. And I think that's the key to unlocking the design of the solution, which is, you need to design it in a way that makes it easy for someone to go describe a data quality monitor they want to instantiate, without enforcing or limiting them in any way, so that it can work on any data model, but at the same time taking care of everything else that needs to happen after you have put that definition in place: continuously evaluating the data quality rule, continuously doing it at scale, doing all the incident management workflow around it, right, and so on.
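As a rough illustration of the pattern Manu is describing, here is a minimal Python sketch in which the user supplies check definitions against their own data model and a scheduler-style loop evaluates them. All names, tables, and the Monitor structure are hypothetical; this is not Lightup's actual API.

# Hypothetical sketch: user-defined data quality monitors evaluated on a schedule.
# Nothing here reflects Lightup's real API; it only illustrates the pattern.
from dataclasses import dataclass
from typing import Callable
import sqlite3

@dataclass
class Monitor:
    name: str
    query: str                       # metric the user defines in their own data model
    passes: Callable[[float], bool]  # user-supplied acceptance rule

def evaluate(conn: sqlite3.Connection, monitors: list[Monitor]) -> None:
    for m in monitors:
        value = conn.execute(m.query).fetchone()[0]
        status = "OK" if m.passes(value) else "ALERT"
        print(f"{status}: {m.name} = {value}")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 25.0), (2, None), (3, 40.0)])

    monitors = [
        Monitor("null_amounts", "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
                passes=lambda v: v == 0),
        Monitor("order_count", "SELECT COUNT(*) FROM orders",
                passes=lambda v: v >= 1),
    ]
    # In production this would run continuously, e.g. every minute, and feed alerting.
    evaluate(conn, monitors)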

Kostas Pardalis 21:22
Manu, I have a question. You mentioned traditional infrastructure monitoring, right? There have been many attempts there, and the industry has matured enough to start discussing and introducing some standardization, right? Like how we communicate the metrics, how we get them into these systems. Do you think we will reach something similar with data? And the thing that comes to my mind, actually, is that I find very interesting what Great Expectations is doing, right? Where you have expectations, and people can contribute new expectations, blah, blah, blah, all these things. And the reason I'm asking about that is because, from what I understand from the conversation the two of you already had, my problem with data is that semantics matter, right? The semantics of the data that you are consuming completely define what it means to have bad or good data, right? Yeah. So somehow we need to get the definition of the semantics into the loop. And I think that the schema itself, I mean, it's still on a syntactic level, right? There's more information there than we can get from that. So how can we overcome that, and what do you see in the future happening?

Eric Dodds 22:32
And to add on to that, I think one of the other interesting things is, take the easy example of a CPU, right? So, CPU to CPU, there may be sort of differences, right, like manufactured by a different manufacturer or whatever, but the core vitals are the same. Not only is that different from company to company, but I was thinking about, when we're talking about definitions, every SaaS company ever has had years of conversations about the definition of a lead, the definition of an MQL, the definition of an SQL, a sales accepted lead. Anyone who's worked inside of a SaaS company knows how hard that is. But what's interesting is, not only does that schema change from company to company, but within a company, the schema changes over time. Right? We change the MQL definition, and that has a direct impact on what you would consider quality and sort of, yeah, the downstream context. Yeah,

Manu Bansal 23:29
Yeah, yeah, absolutely. I think this is a really important question in terms of how do you scale keeping data health in check, right? How do you scale that? And this is regardless of how we solve the problem, how anyone solves the problem. The thing is that this is something we need to solve as an industry, right? I mean, if we're going to scale our data-driven operations, we need to answer this question, right? Because if the answer is that there is no common factor between one data quality rule and the next, it's game over, right? I mean, then we are just always going to be praying, and that's the best solution we can come to. We need better than that, right? And I thought a lot about this problem even before we started Lightup, right? Because we wanted to be sure that we were building something that can actually apply generically, regardless of what vertical you're in, what team within the organization you are in, right? I mean, if you're going to get stuck in the definition of SQL and MQL, it's not going to scale. Right? Let me kind of go back to my time at Uhana, the company I built before this, where we were building the predictive analytics pipeline for telcos, and I look back at the process we used to follow in just doing our own ad hoc, manual data debugging, right. And the thing that strikes me about that time is that the process was actually pretty much the same playbook, issue after issue. Even though I had the domain knowledge and I understood the context, or my colleague did, our basic process was: something is wrong with the data. That's something I can smell, or someone told me about. I need to debug whether it's really an issue, right. And the first thing I would do is say, give me the data as it stands today, but now let's also pull data from a week ago, when I know things were running fine. Right? And the first thing I would do is compare these two data sets, right, and then look for significant differences, which would explain or confirm that, yes, data is indeed looking very different today. Then I would start to reason about, in what ways is it different? Is it really a problem? What might have caused this difference, depending on what's the shape of the difference? Right? You're seeing too many events, okay, maybe there's duplication happening somewhere in my pipeline; I know which modules in the data pipeline could have introduced duplicates. Or I'm seeing records getting dropped, okay, I know exactly that I have a bad filter somewhere. Right. So then it would start to tell me how to go about fixing the problem. But the process was actually fairly generic, which was: I was comparing data on hand with data from a time I knew things were good. And that's something we now call anomaly detection, right? Anomaly detection is kind of this bastardized term where you think of false positives as the very first thing, but to me, anomaly detection is more a recipe, or an algorithm, right? It's not a solution by itself, right? So that's the principle you want to apply, where you can compare data from today with data in the past, right? And then it starts to be a repeatable process. Then I don't need to know what the semantics of this data asset exactly are, or how it relates to the business, right? So that's what's actually working really well in the field for us.
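A minimal sketch, assuming pandas, of the playbook Manu describes: pull the same slice of data for today and for a week ago, then compare volume and basic shape to flag significant differences. The column names, thresholds, and toy data are invented for illustration.

# Hypothetical sketch of "compare today against a known-good week ago".
import pandas as pd

def compare_to_baseline(today: pd.DataFrame, week_ago: pd.DataFrame,
                        tolerance: float = 0.3) -> list[str]:
    """Return human-readable findings where today deviates from the baseline."""
    findings = []

    # Volume check: dropped or duplicated events show up here first.
    ratio = len(today) / max(len(week_ago), 1)
    if abs(ratio - 1.0) > tolerance:
        findings.append(f"row count changed by {ratio:.2f}x vs. a week ago")

    # Shape checks per shared column: big shifts in mean or null rate.
    for col in today.columns.intersection(week_ago.columns):
        if pd.api.types.is_numeric_dtype(today[col]):
            baseline_mean = week_ago[col].mean()
            if abs(today[col].mean() - baseline_mean) > tolerance * (abs(baseline_mean) or 1):
                findings.append(f"mean of '{col}' shifted significantly")
        null_delta = today[col].isna().mean() - week_ago[col].isna().mean()
        if null_delta > 0.05:
            findings.append(f"null rate of '{col}' rose by {null_delta:.0%}")
    return findings

today = pd.DataFrame({"amount": [10, 12, None, None, 11]})
week_ago = pd.DataFrame({"amount": [10, 11, 12, 10, 13]})
print(compare_to_baseline(today, week_ago))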

Eric Dodds 26:46
So one quick question there, and this is really tactical. I'm coming at this as a consumer of reports, and as someone who, probably on a daily, weekly, monthly basis, looks at the data from this week and compares it to the last couple of weeks, just to, like all of us who work in business, do our own internal gut anomaly detection, which is rarely statistically significant. But how do you approach the problem of controlling for seasonality? Which, as a marketer, seasonality is the excuse that you can use for any problem with data. It's like, oh, the numbers look bad, and it's like, well, that's seasonality. Yeah. But if you think about the holidays, right, like November, December, and people going on vacation, comparing month-over-month data from sort of October to November, or December to January, is hard. And I also think about this in the context of, let's say you also don't have year-over-year data for this particular data set. How do you think about controlling for those problems when comparing time periods of the same data?

Manu Bansal 27:53
Yeah. And this kind of ties back to, I think, the question Kostas was raising earlier, right? Is there a standard here that can emerge, even if it doesn't exist today? The quick answer is, if the human being cannot affirmatively say whether the data is good or not, the system probably cannot. Right? So there is a boundary we need to draw between what is a confident conclusion about the health of data and what is subjective interpretation, right? And we'll always have that boundary. The question is, how much can we have as kind of the standardized tests on data? Right? We want to keep raising that bar, keep shifting that boundary, and start to actually think about test-driven development for data pipelines, or your data assets, right? If you can't test whether the data is good or not, how do you even base a business decision on it? So the question is, what are the tests I can run? And those are the tests you would always want to run. That's what the data quality system should be solving; everything else should be left to human experts to interpret. Right? I believe there's a large set of tests that can always be run, right? Some of these are very generic, like, what is the data delay? That doesn't really depend on seasonality, right? I mean, your data pipeline is processing with the same delay in November as it is in January. Yeah. The events you're seeing coming in, the data volume? Well, that depends on how much people are interacting with your service. Yes, that is subject to seasonality. So there are different kinds of tests you can run, some more black and white than others. The ones that are very clear cut should be standardized, and we should always insist on defining a contract around them, right? Data delay, you should always measure that; null values, unique values, things that you might even say are extensions of the data integrity constraints that you used to put in relational databases, right. And then there are all these interpretations that you're drawing around what are the semantics of the data, what is the content of the data, right? And there is a set of tests where experts can get together, let's say someone who understands the sales data and says, yeah, look, a sales value of $5,000 simply doesn't make sense for my consumer product, right? So that's definitely wrong. And you start to encode some of that knowledge. But these become what I would call custom tests, right? And then the system should facilitate easily encoding those data-model-specific tests. And then there's everything else where you can't even be sure what's good and what's not. Right. And we should not even attempt to make a conclusion on those aspects.
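To make the split concrete, here is a hedged Python sketch with a few generic tests (freshness, null keys, uniqueness) that apply to almost any table, plus one custom, domain-specific bound of the kind an expert would encode, in the spirit of the reusable expectations Kostas mentioned earlier. Columns and thresholds are made up for the example.

# Hypothetical sketch: standardized checks versus a custom, domain-specific check.
from datetime import datetime, timedelta, timezone
import pandas as pd

def standardized_checks(df: pd.DataFrame, ts_col: str, key_col: str) -> dict[str, bool]:
    """Generic tests that apply to almost any table, regardless of its semantics."""
    now = datetime.now(timezone.utc)
    return {
        # Freshness / data delay: the newest record should be recent.
        "fresh_within_1h": (now - df[ts_col].max()) < timedelta(hours=1),
        # Completeness: the key column should never be null.
        "no_null_keys": bool(df[key_col].notna().all()),
        # Integrity: the key column should be unique.
        "unique_keys": bool(df[key_col].is_unique),
    }

def custom_checks(df: pd.DataFrame) -> dict[str, bool]:
    """A domain-specific test an expert would encode, e.g. plausible sale amounts."""
    return {"sale_amount_plausible": bool(df["amount"].between(0, 5000).all())}

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [25.0, 49.0, 120.0],
    "created_at": [pd.Timestamp.now(tz="UTC")] * 3,
})
print({**standardized_checks(df, "created_at", "order_id"), **custom_checks(df)})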

Kostas Pardalis 30:39
So, Manu, okay, it's very clear from the conversation so far that you're very passionate about data quality. But you're also building a product for data quality, right? So do you want to tell us a bit more about that? How do you solve the problem of data quality with the product that you're building?

Manu Bansal 31:00
Yeah, and maybe before I go there, right, why am I so passionate about this? Why do I think it's high time that we all started to think about data quality? To me, it's like you just bought this fancy car, which is your business, let's say, and you're starting to drive it with data. Data is the gas here, and you have no control over the quality of the gas you're putting into it, right? Before you know it, you're just going to throw a wrench into the engine, and your car comes crashing down. So to me, it's actually bread and butter to just monitor the health of your data; we should all be doing it. I think there are two or three different hard angles here. One is what we have been discussing so far, which is what are standardized tests and what is custom to the business? Right, that's definitely one hard part. But the other kind of challenge is what we are seeing people run into who have actually been solving data quality for decades now, right? I'm talking about Fortune 500 companies with very strict controls over any data that's being collected or put to use, right? They have somehow managed to find the right mix of standard tests and custom tests, right. But where we are starting to see a lot of limitations now in incumbent tools is, number one, around data volume. Data volume has just grown, what, 100 to 400 fold in the last decade or less, right, and we are anticipating tremendous growth in the next five years, right. And now with the new modern data stack, the old tools simply don't work, right? They can't keep up with the data volume. So that's one challenge that we are solving. Another challenge we are seeing is kind of what comes along with data volume, which is data cardinality, right, or what you would sometimes call variety, right? It's no longer just a couple of spreadsheets and a couple of tables in a MySQL DB; you're talking about thousands of tables, and potentially a total of a million different columns across those tables in Databricks or Snowflake, right? I mean, at that scale, if you're writing a test by hand for every single column separately, it's not just expensive, it's infeasible. You're never going to finish covering all your data assets, right. And the typical number we see is one to 5% of all your tables and columns actually being covered by tests. Right? I mean, you don't work with that kind of coverage for your software, right? You would look for 95% or 99% unit test coverage on software that's in production, right. And we are running with 1% coverage for data in production. Right? Why is that? I mean, that's kind of the biggest challenge here, which is, it's just too hard right now to build out data quality tests at scale. Right. So that's the second big problem we are solving. The third one is that we are seeing a lot of this shift to more and more real-time stacks, right? And real time here, in the data world, could even be as much as a minute of latency, or a couple of minutes of latency, right? That's already very real time compared to doing nightly data processing with Hadoop, right? And the old tools were built for that environment where you're processing data once a day. And now you're talking about doing this every minute, maybe even every second, right?
So you need the notion of streaming checks, which are kind of inline with your data pipeline, if you're going to be able to do this before it hits the fan, or before bad data is actually put to use, right. You need to catch bad data before it can do the damage. And that's the third big challenge that old tools are running into, which we are solving really well.
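One hedged sketch of how the coverage problem could be attacked: instead of hand-writing a test per column, walk the store's catalog and attach a default set of baseline checks to every table automatically, leaving only the domain-specific tests to be written by hand. SQLite stands in for the warehouse here, and the check templates are invented for illustration.

# Hypothetical sketch: auto-generate baseline checks from the catalog instead of
# hand-writing one test per column (the "1% coverage" problem).
import sqlite3

BASELINE_CHECKS = [
    ("row_count_nonzero", "SELECT COUNT(*) > 0 FROM {table}"),
    ("no_null_first_column", "SELECT COUNT(*) = 0 FROM {table} WHERE {first_col} IS NULL"),
]

def discover_tables(conn: sqlite3.Connection) -> dict[str, str]:
    """Map each table to its first column, using SQLite's catalog as a stand-in."""
    tables = {}
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
        first_col = conn.execute(f"PRAGMA table_info({name})").fetchone()[1]
        tables[name] = first_col
    return tables

def generate_and_run(conn: sqlite3.Connection) -> None:
    for table, first_col in discover_tables(conn).items():
        for check_name, template in BASELINE_CHECKS:
            sql = template.format(table=table, first_col=first_col)
            ok = bool(conn.execute(sql).fetchone()[0])
            print(f"{table}.{check_name}: {'OK' if ok else 'ALERT'}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 25.0)")
generate_and_run(conn)   # in a streaming setting, this would run every minute on new data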

Eric Dodds 34:42
You know, I was thinking back to our conversation about looking at the sales number and seeing a sales number that's off. That is probably the main number that's covered in the 1%. Yeah. And then the head of sales isn't happy.

Manu Bansal 35:02
That's actually an interesting point. There's this big, I don't know if I would call it a tussle, but this tension, there's kind of a debate in the industry right now about who owns data quality. Right? And the point you're making, Eric, that you're covering your sales numbers well, it's because the sales team really, really, really cares about this number being correct. Right. And they will find a way to do it themselves, or they will get the data engineering team to implement some controls, right? But why aren't we covering the rest of our assets? Well, because if you look at the front end of the pipeline, which the data engineering team is responsible for, that's where you have the whole multitude of data assets, but the data engineering team is not yet seeing this as a responsibility. Right? It's really left to the consumer of the data, who is an analyst or a salesperson or a marketing person, right? Or an exec, for that matter, or even your customer. Right? Who is the one pointing out data issues, right? But we need that shift in mindset where we say, look, you need to be producing good data in the first place; it cannot be an afterthought after you have already served me with bad data. And a thought experiment I like to run: suppose there was an internal data market in the organization where, say, the analyst has to pay for the data and buy it from the data engineer producing it. Hmm. What would happen then? How would that change the landscape of data quality? I'm very curious if someone wants to run that experiment. If I was an analyst buying data from data engineering, I would want to make sure that I understand what they're selling, right? What's the spec? What's the contract? What's the QC on it? I'm buying a product from you. It's a data product, right? Can you prove that this is worth the dollar you're charging for it? And now suddenly it becomes the responsibility of the data engineers producing data to make sure that the data is worth selling, and it will actually sell, right? So, short of creating that market, I think that's the shift we need.

Eric Dodds 37:17
I love thinking about that in terms of economics, because I also think back to your point about testing. So I'm just going to put myself in the shoes of the analyst, who is essentially working as a middleman between a manufacturer, say, and the person they're delivering the final product to, right? And so my approach, naturally, just thinking about the economics, is: I have a limited amount of resources. I know I'll get more resources if I can deliver the right product. And so what I'm going to do is buy a small piece first and deliver that, and ensure that it solves a problem downstream. And looking at the QA and being more meticulous about that would naturally drive a testing mindset, right? Which is super interesting. I love that analogy. That's a really great way to think about it.

Manu Bansal 38:13
I can just see that going so far, right, where it will force communication between the data consumer and the producer, where a spec will have to be defined first, right? Yeah. And then that'll start to bubble up: what are the data tests, right? And yeah, maybe some cannot ever be encoded into logic, right, but most of those tests can, and they'll get pushed upstream, and they will start to affect the data pipeline itself. They become part of the data production pipeline itself, right? It's like what CI/CD has become, this kind of shift and blurring of lines between software QA and software development, right? With all the DevOps mentality. I feel like a lot of those themes will now start to make it into the data world. Short of creating that market, I think it's still happening, because, yeah, as data leaders are being pushed on the value they're creating for the organization, they have no choice but to start asking those questions. And we are seeing that happen now. Right. And the best leaders are actually doing this, going around asking, what are the controls in place, right, and making sure that the data engineering teams have the right tests in place, and have a way of not only writing those tests but being able to maintain them over time, so that they don't fall out of step with the evolution of the data pipeline itself, right. So we are going to see more and more of that happen.
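A minimal sketch of what pushing that spec upstream might look like in practice: the consumer and producer agree on a small contract, and a CI-style test fails the pipeline change if a sample of the produced data violates it. The contract fields and the orders table are hypothetical.

# Hypothetical sketch: a producer/consumer data contract checked in CI,
# in the spirit of test-driven data development.
import pandas as pd

# The agreed spec: which columns must exist, their types, and basic constraints.
ORDERS_CONTRACT = {
    "order_id": {"dtype": "int64", "nullable": False, "unique": True},
    "amount":   {"dtype": "float64", "nullable": False, "min": 0.0},
}

def violations(df: pd.DataFrame, contract: dict) -> list[str]:
    problems = []
    for col, spec in contract.items():
        if col not in df.columns:
            problems.append(f"missing column '{col}'")
            continue
        if str(df[col].dtype) != spec["dtype"]:
            problems.append(f"'{col}' has dtype {df[col].dtype}, expected {spec['dtype']}")
        if not spec.get("nullable", True) and df[col].isna().any():
            problems.append(f"'{col}' contains nulls")
        if spec.get("unique") and not df[col].is_unique:
            problems.append(f"'{col}' contains duplicates")
        if "min" in spec and (df[col] < spec["min"]).any():
            problems.append(f"'{col}' has values below {spec['min']}")
    return problems

def test_orders_contract():
    # In CI this sample would come from a staging run of the pipeline.
    sample = pd.DataFrame({"order_id": [1, 2, 3], "amount": [25.0, 49.0, 120.0]})
    assert violations(sample, ORDERS_CONTRACT) == []

if __name__ == "__main__":
    test_orders_contract()
    print("contract satisfied")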

Eric Dodds 39:43
You know, it's interesting to think about the economics, which I don't know a lot about. But when you put money into the equation, what's interesting, and I would love your thoughts on this one, is that part of the challenge is that data, in terms of quantity, to your point, is not a scarce resource. And when you introduce money into the equation, what you're actually doing is shaping the way that the manufacturers implement their process, or, if you wanted to boil it down to raw units, how they use their time, even. Right? And it's really interesting to think about, because volume doesn't feel scarce, right? Most people actually probably feel like they have too much data, like, I have so much data, I can't analyze all of it. Right. Yeah. So that's really interesting. I guess my question is, what are some ways that you've seen companies, outside of actually making an analyst buy data... maybe we should go get Monopoly money, and we can try this experiment where...

Kostas Pardalis 40:58
You should use tokens, or NFTs!

Eric Dodds 41:00
That's right. Yes, you do it on the blockchain. Outside of using Monopoly money, what are some ways that you've seen companies do a good job of sort of creating a healthy contract that starts to drive that testing mentality?

Manu Bansal 41:17
Yeah, so that's actually interesting. It's starting to take us in the direction of the team structures of the best functioning data teams, right, where data is most trustworthy and is actually returning ROI, right. And we see this pattern where a young company will just have, let's say, one team doing it all, right? They're the producers and the consumers, for the most part; they're the ones analyzing it and whatnot, right? Over time, then, as the team grows and the pipeline matures, you start to see some split happening between the data engineering skill set, the people who can store and process data at scale, and the people who know how to extract meaning out of it, who would be the analysts, right. And then the third stage that we see is this kind of birth of a data quality or a data governance team out of, usually, the data engineering team, right. So the organizations that are thinking about it the most will now start to separate out the function, which would kind of be this unaddressed or implicit function within the data engineering team, right, where it's best efforts, pretty much ad hoc, no well-defined contracts being passed from the consumers of data to the producers of data. But then they realize that's the effect they're seeing, which is why data keeps breaking, right. And so what they realize is they need to create some separation between data engineers and data quality people. And then they would start to create this data quality analyst group, or data quality engineering group, right, who works closely with both sides, but now starts to become the bridge between what is the definition of good data and how to test for that definition. Right? And at the same time, they are not interested in writing data engineering pipelines, right. So they start to be kind of like the SRE team, right, and what that did to software engineering and software production or operations, right, in starting to be this specialist group who would understand enough about how to think about software behavior in production, right? Page load time, for example. Okay, Google is supposed to load in under a second, otherwise you start to lose users, right? But at the same time, they'll have a sense of why the page could be slowing down, right, and how to even measure the page load time; they could be analyzing the metrics and whatnot. Right. So I think that's the kind of maturity model that we are seeing, where we're starting to see a data quality team or data governance team or data reliability team, if you will, and you'll start to see more and more of that, a team who is bridging the gap between these two worlds and enforcing a certain contract between the two teams.

Kostas Pardalis 44:15
I know, we keep talking about the SRE and DevOps parallel. One of the things that is quite important when you're monitoring infrastructure, right, is also doing some kind of root cause analysis, right? You need to figure out, okay, we have a problem, now how do we solve it? Now, data infrastructure is quite a complex thing. Also, a piece of data goes through a lot of, let's say, transformations. It's not like, from the moment that you collect it, you go and consume it directly. Right, right. How are we going to do that with data? How can we try to figure out what's going wrong and fix it?

Manu Bansal 44:58
Yeah, yeah. I mean, I used to be a networking person, so I tend to go back to what I understood in networks, and network debugging is a hard problem. And in many ways I see the data pipeline being a topology of sorts, right? We call it lineage sometimes, we think of it as a DAG, maybe, right? But it can crisscross at multiple places, so from where it starts to where it finishes, it may not actually even look like a tree anymore. Right? So it's a very generic, you know, topology through which data is flowing. And how do you monitor a system like that? Right? And how do you trace back an error on one end of it to the source of the error on the other end? I mean, look, truth be told, it's not an easy problem to solve, right? I don't think we know how to do that very effectively right now. And when we don't know how to do it, we end up relying on experts, right? So I think the short-term answer to this question is we need to facilitate presentation of data quality information from all parts of the data pipeline, so that an expert, or a group of experts for that matter, can look at that information and then come to a conclusion, the root cause analysis. So that's what the tool can do here, which is to create this single pane of glass, or single source of truth, where the data engineer, the data quality engineer, the data quality analyst, maybe the data analyst or analytics engineer, maybe even the business stakeholder who has been consuming the data, can all pool their context to be able to root-cause what's at hand, right? And then be able to say things like, okay, this can be explained by seasonality, this is not a data pipeline problem; this cannot be explained by seasonality or the definition of the sales number we have, right, so let's go further back. And now the data engineer starts to contribute their insight into it. Right. So I think it's going to be a collaborative process, definitely in the short term, maybe forever. And the way to approach it, in my opinion, is to facilitate that collaboration between all the different stakeholders at different layers of the stack. Yeah.
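One hedged sketch of how tooling could support that collaboration: if each stage of the pipeline publishes its check results, a simple walk over the lineage graph can point the group at the most upstream stage that is failing, as a starting point for root cause analysis. The lineage and statuses below are toy data, not any vendor's format.

# Hypothetical sketch: use lineage plus per-stage check results to localize
# where bad data first appeared, as a starting point for root cause analysis.

# Lineage as a DAG: each node lists its upstream dependencies (toy example).
LINEAGE = {
    "exec_dashboard": ["sales_metrics"],
    "sales_metrics":  ["orders_clean"],
    "orders_clean":   ["orders_raw"],
    "orders_raw":     [],
}

# Latest data quality status reported by each stage's checks (toy example).
STATUS = {
    "exec_dashboard": "failing",
    "sales_metrics":  "failing",
    "orders_clean":   "failing",
    "orders_raw":     "ok",
}

def first_bad_upstream(node: str) -> str:
    """Walk upstream from a failing node toward the earliest failing stage."""
    candidate = node
    frontier = [node]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        if STATUS.get(current) == "failing":
            candidate = current
            frontier.extend(LINEAGE.get(current, []))
    return candidate

print(first_bad_upstream("exec_dashboard"))  # -> orders_clean: start debugging there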

Kostas Pardalis 47:51
One last question from me, and then I'll let Eric ask you some questions. We talked a lot about structured data, we talked about schemas, we talked about measuring things and all that stuff. But in a modern data stack, you don't only have structured data, right? So what happens with the influx of data when we are working with binary formats, with free text, with events, more towards what we usually hear about in, let's say, an event pipeline, because that's also a parallel pipeline at the end, right? So does what we talked about so far apply there, or do we need a different approach for quality there?

Manu Bansal 47:59
it perfectly applies. What happens today is we just don’t think about it as much, right? I mean, we just ignore it. And we just hope things will solve

Eric Dodds 48:08
the problem, and it’ll go away.

Manu Bansal 48:11
But actually, that's one of the very interesting problem statements that we are hearing from the customers we're talking to. They're asking us exactly the same question that you just asked me, right? What happens to my other data? Right. And the good news is that the leading data teams are starting to realize that that data matters, if not more, then at least as much as structured data. But in many ways, that's actually an even more important place to monitor the health of data, because you want to shift the problem left, or shift detection left, as much as you can. Right? We all understand that if you don't collect good data in the first place, you're not going to be able to produce good data at the tail end of the pipeline, right? That's clear. But if it's discovered at the tail end, and you're trying to now fix the issue, you have to go back all the way and clean it out from the entirety of your pipeline for days, right. And that's a very expensive operation. You're so much better off just detecting the problem before this bad data percolates downstream. Right. So it's more proactive, it's more economical, and it cuts your costs a lot, not just in terms of productivity, but also in terms of the repercussions of bad data. Right. And in many ways, the root cause problem also gets solved, because now you're directly monitoring the source of the problem, right? So you know that is the root cause, instead of having to trace back from the tail end of the pipeline all the way to the source of the problem. Right? So that needs to happen. In some ways, it's actually more challenging, because the data is less structured, so how do you even start to analyze it, right? There are tests that you can run, right? You have a Kafka event stream coming in, let's say, and you could just track the delay that the events are coming in at, or the volume that they're showing up with, right, or the schema that the event has. And some of those tests, for example, are part of Rudder or Segment, right, and even Kafka on the cloud side now, right? So I think we're going to see more and more of that. But that's something that we are also solving. Eventually, you want a single pane of glass to be measuring the data health, right from ingestion to the object store to, finally, the data warehouse, right? So we need to do more of that. Absolutely.
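For illustration, a small Python sketch of the ingestion-side checks Manu mentions on an event stream: arrival delay, volume per window, and presence of expected fields. It assumes the kafka-python client and a locally reachable broker; the topic, fields, and thresholds are made up for the example.

# Hypothetical sketch of ingestion-side checks on an event stream: delay,
# volume, and rough schema. Assumes the kafka-python client and a local broker.
import json
import time
from kafka import KafkaConsumer  # pip install kafka-python

EXPECTED_FIELDS = {"event_id", "user_id", "event_type"}

consumer = KafkaConsumer(
    "product_events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=60_000,  # stop iterating if no new events arrive for a minute
)

count = 0
max_delay_s = 0.0
schema_violations = 0

for record in consumer:
    count += 1
    # record.timestamp is the event's producer/broker timestamp in milliseconds.
    max_delay_s = max(max_delay_s, time.time() - record.timestamp / 1000.0)
    if not EXPECTED_FIELDS.issubset(record.value):
        schema_violations += 1

print(f"events this window:  {count}")
print(f"max arrival delay:   {max_delay_s:.1f}s  (alert if > 120s)")
print(f"schema violations:   {schema_violations}  (alert if > 0)")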

Eric Dodds 50:33
Well, we're close to time, but Manu, one last question for you. I'm just thinking about our listeners who, you know, know they may have some data quality issues, because they've had to fight some fires. And we often like to ask people like you: if you could recommend one or two practical things that a data engineer could do this week, what would those things be, to sort of help start the conversation or start stepping towards data quality in their organization?

Manu Bansal 51:06
That's a great question. I had to think about what would be the first thing I would recommend someone do. And I feel like when we hire new recruits, or, let's say, when we were young programmers ourselves, starting out in our careers, right, we would have the tendency to just write the software first, then think about testing, right? And as you start to become more and more senior, you flip it. You say, let me first write out my tests, even if I don't implement them yet, but at least let me write this, because it brings the spec out, right? I mean, it tells you what your module is supposed to do. Anything you can test is what it needs to do. Anything you cannot test is actually immaterial; that's functionality you should never be implementing, because you don't even have a way of proving it. Right? And it starts to open up the design space so you can find an efficient and effective solution. And I would say the same thing to data engineers who are listening to this, right? First, think about how you will prove to your consumer that you are giving them the data they asked for, and go ask them, right? What properties do they expect to be true of this data asset, whether it's a table, whether it's S3 dumps, or the event streams you are collecting, right? What are some constraints I can apply on it? What are some invariants you would like to see in this data set which are always going to be true, which don't depend on how you're going to use this data? It's just innate to the data I'm producing. And if there's one recommendation, it's just that: think of test-driven data development, and start there. And then very quickly, you will start to look for ways to implement and encode your tests. Right. The rest of it will follow from there.
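To make that advice concrete, a hedged sketch of tests-first for a data asset: the invariants are written down as plain pytest-style tests before the pipeline that produces the table exists, so they double as the spec agreed with the consumer. The table, columns, and loader function are hypothetical.

# Hypothetical sketch of test-driven data development: write the invariants of a
# data asset first, as the spec agreed with its consumers, then build the pipeline.
import pandas as pd

def load_daily_orders() -> pd.DataFrame:
    """Placeholder for the pipeline output; the real implementation comes later."""
    return pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [25.0, 49.0, 120.0],
        "currency": ["USD", "USD", "USD"],
    })

# Invariants that are innate to the data, independent of how it will be used.
def test_order_ids_are_unique_and_present():
    df = load_daily_orders()
    assert df["order_id"].notna().all()
    assert df["order_id"].is_unique

def test_amounts_are_positive_dollars():
    df = load_daily_orders()
    assert (df["amount"] > 0).all()          # no zero or negative sales
    assert (df["currency"] == "USD").all()   # dollars, not cents or mixed currencies

if __name__ == "__main__":
    test_order_ids_are_unique_and_present()
    test_amounts_are_positive_dollars()
    print("all invariants hold")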

Eric Dodds 52:51
For sure. Well, that's great advice, and we have had such a good conversation. Manu, thanks for joining us. And one last question quickly: if people want to learn more about Lightup, where should they go?

Manu Bansal 53:04
You can start at lightup.ai, a very simple URL, or just get in touch with me, or anyone at the company would be very happy to strike up a conversation and give you an unbiased opinion on what you could be doing for data testing. If you wanted to try Lightup, we could very quickly set you up. We can deploy as a SaaS service, and we can also deploy in your own environment, right? So we can bring our software to where the data already lives, which makes it very easy to get this going without any compliance or regulatory issues. There's a full spectrum of ways in which we can work with you. Just find us at lightup.ai.

Eric Dodds 53:43
Cool. Well, thanks for being here, thank you for the advice and the thoughts, and thanks for joining us on the show.

Manu Bansal 53:49
Thanks for having me. It was a pleasure talking to you both, Eric and Kostas.

Eric Dodds 53:54
My takeaway, which is not going to surprise you, is that I love the idea of running an economic experiment inside of an organization where you have to use Monopoly money to buy data. I want to...

Kostas Pardalis 54:07
No, it has to be real money. Do you like playing poker with fake money?

Eric Dodds 54:12
I mean, I guess technically you can, but you're not... it's not the same. Yeah, exactly. But this is an experiment I want to run. Maybe we could get a Harvard economist to help us design an economic experiment on this.

Kostas Pardalis 54:27
Yeah, it's very interesting. I think the whole point of that is that we are starting, in a very concrete way, to talk in terms of value, right? Whatever theories we have, our obsession with products, technology, blah, blah, blah, whatever, at the end, what's the value, and for whom? Right? Okay, with data, obviously, in most cases our customers are internal, right? I am generating or moving the data around because marketing wants it, so my customer is marketing. Yeah. But at the end, this relationship and the quality of the product and the experience that our consumers, our customers, are going to have, that's what they're paying us for, right? Yeah. So I think it makes total sense. It's a very good experiment, and we need to figure out a way to do it. Yeah. All right.

Eric Dodds 55:17
Well, if you want to volunteer, we can help facilitate an economic experiment in your organization. Lots of great shows coming up. Thanks for joining us again, and make sure to subscribe if you haven't already so you get notified of the next episode, and we will catch you then. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.