Episode 57:

Improving Data Quality Using Data Product SLAs with Egor Gryaznov of Bigeye

October 26, 2021

This week on The Data Stack Show, Eric and Kostas spend some time with Egor Gryaznov, co-founder and CTO of Bigeye. Egor discusses issues surrounding data quality in organizations of various sizes and uses helpful analogies from software engineering to help define data quality.

Notes:

Highlights from this week’s conversation include:

Egor’s software engineering background and history with Uber (2:19)
Experimentation platforms and analytics definitions (7:49)
Bigeye’s function and use cases (9:40)
Managing the relationship between the data engineer maintaining the pipelines and the downstream teams providing the context (18:49)
Pinpointing problems in data compared to problems in software (21:55)
Defining data quality at Bigeye (24:13)
Machine learning models as a data product (28:38)
Determining SLAs (32:22)
How Bigeye brings different parties together and addresses natural communication barriers (36:42)
Looking at when an organization needs to implement data quality tooling (45:54)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds 00:26

Welcome back. Really excited to talk to our guest, Egor, today. He founded a company called Bigeye, and they are in the data quality space. It’s a really interesting topic. I think my burning question, Kostas, is: what does data quality really look like in an organization, and when do those problems start to become really acute? We’ve talked about scale a lot. There’s the startup that’s two people in a garage just querying their Postgres database, and they don’t even really have a sense of what their data is going to look like. And then at scale, at a company like Uber, where Egor spent time building data products, it’s a completely different game. So I’m interested in his perspective on when the problems become acute. When do you need tooling around data quality? How about you?

Kostas Pardalis 01:18

Yeah, I think I would start by trying, with Egor, to define what exactly data quality is, or at least give some kind of better definition. It’s one of those terms, together with some others that go under the broader umbrella of data governance products, that we talk about a lot. We use the term a lot, and quality is something that it’s very easy for anyone to have an opinion on. But I don’t think that we really have a very clear definition of what data quality is. I’d love to try to make this much clearer today with Egor, and I’m sure that we will have more stuff to chat with him about.

Eric Dodds 02:01

Absolutely. We always do. Well, let’s dive in.

Kostas Pardalis 02:03

Let’s do it.

Eric Dodds 02:05

Egor, welcome to the show. Super excited to have you with us today.

Egor Gryaznov 02:09

Thanks a lot for having me, Eric, and Kostas. It’s great to be on the show.

Eric Dodds 02:12

All right, many exciting things to talk about. But as always, we’d love to hear about your background, and then hear about Bigeye.

Egor Gryaznov 02:19

Oh, definitely. So I’m Egor. I’m the co-founder and CTO of Bigeye. My background is as a software engineer who fell into the data space. In my first job, I was working on call center analytics. We were working on a new platform using Hadoop, which back in 2012 was the hot new technology. At that point in time, we were writing raw Java MapReduce jobs and just trying to process information. How do we even make Hadoop and MapReduce into a scalable solution? Obviously, now there are a whole lot of better technologies out there for scaling analytics, but it was definitely an interesting introduction to the world of data. From there, I got into data warehousing. I joined a company called One Kings Lane, which is an e-commerce company. My team set up the data warehousing stack, from infrastructure all the way through ETL, data modeling, and visualization. So I got a taste of what the whole space looks like. How do you scale a data platform from the lowest level, which is just setting up your database and getting the data in there, all the way through to: What do the analysts use? How do we present dashboards? What tooling do we want there? An interesting part about that experience was that we were one of the first Looker users. We actually had, I think, one of the co-founders come in and present to us, because this was really when Looker was just getting started.

Eric Dodds 03:54

Yeah. Oh, wow. Okay, so remind me, what time period is that?

Egor Gryaznov 04:00

So this was 2013, 2014.

Eric Dodds 04:03

Okay. Yeah. Yeah, wild. Okay. Yeah, just in the last couple shows, we’ve talked about what it was like to build a stack back then versus now. I mean, just drastically different. And like, of course, Looker is now part of Google. And so that’s, that’s amazing.

Egor Gryaznov 04:19

It was very different back then. Back then everything was hand rolled. We wrote so many Bash and Python scripts just to make everything work together. And the options back then for analytics and BI were really, either Mode or Looker for the more modern ones, or you go with something like Tableau or MicroStrategy if you have people who are using that. We decided to try out Looker and it was interesting just to get into LookML. They had some really great ideas even back then. I’m really excited.

Eric Dodds 04:58

Yeah. Awesome. Okay.

Egor Gryaznov 05:02

And so from there, I actually joined Uber in late 2014. Uber at this point was trying to scale their analytics. And their data team, they were doing all of their analytics on a Postgres replica; there were a few of these replicas. That wasn’t scaling. The company at that time had 1,800 people or so. So we have the experience of building out the data platform, myself plus a couple of other folks joined and started the data warehouse team at Uber, did the same thing that we did at One Kings Lane, but at 100x, the pace and scale, set up the infrastructure, all the ETL necessary, data modeling, just corralling people into telling us how does the business look at the data? And what does it mean? All the visualization stuff. And by late 2015, early 2016, once the basic core was there, the platform was there, I worked on a lot more specific projects. I did a stint in ad tech. I worked on the experimentation platform, which I ended up being the tech lead for for my last two years at Uber. And then I also worked and contributed to some of the core data platform efforts.

Eric Dodds 06:24

Wow. And just this is a quick just nerd question from a marketer who does a lot of testing. Was your testing platform hand rolled in house?

Egor Gryaznov 06:36

Yes, everything at Uber in general was hand rolled in house. I think this is a little bit of an engineering fallacy: we can do it better, and we know how to build something better than we could buy. I think it’s still prevalent today, and it was definitely prevalent back then. But a lot of the things that Uber was building in house were very specific and hyper-focused on the problems that Uber was experiencing, which were generally not the same problems that every other company would experience. The pacing, the scale, the types of data that we had: it was all fairly unique compared to what else was on the market.

Eric Dodds 07:19

Yep. Yeah, I just asked because testing infrastructure is, I mean, a lot of things in the data space are non-trivial. But when you think about statistical significance, and there’s a lot of math that goes into it, when you get into multivariate stuff, it gets into pretty gutsy mathematics, in addition to, like, executing software that has a very, it’s at the sharp end of like, user experience. And so it just seems very complex to build.

Egor Gryaznov 07:49

Yeah, and it’s interesting that you bring up statistical significance. On the experimentation platform team, my biggest project was an experiment analytics tool, where users would come, pick the metrics that they want, pick their experiment, and then the tool would go and compute the metrics, run all the statistical analysis necessary, and then show them statistical significance on that. It was a really interesting experience scaling that out and making it generic enough that it could be used with any experiment and any metric, while at the same time still corralling everybody into a sane set of metrics that everyone needs to look at. Basic ones: How many trips are being taken? And is your experiment negatively affecting that?
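
As an illustration of the kind of check an experiment analytics tool might run, here is a minimal sketch of a two-sided two-proportion z-test in Python. The transcript does not say which statistical tests Uber's tool used; the function name, the choice of test, and the sample numbers are all assumptions made for illustration.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing conversion rates of groups A and B.

    Hypothetical helper: conv_* are counts of users who converted
    (for example, took at least one trip), n_* are group sizes.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or all 100%")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Made-up numbers: control converts 1,000 of 10,000; treatment converts 900 of 10,000.
z, p = two_proportion_z_test(1000, 10_000, 900, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
if z < 0 and p < 0.05:
    print("The experiment appears to be hurting the metric.")
```

A real platform would also need to handle non-binary metrics such as trips per user, multiple comparisons across many metrics, and repeated looks at the same experiment, none of which this sketch attempts.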

Eric Dodds 08:36

Yeah, yeah, that kind of goes back to, like, the hardest part about any analytics is actually just the definitions across the company.

Egor Gryaznov 08:46

I could definitely get into that. There are actually a lot of tools nowadays that are starting to address metric definitions and consistency. You have Superset, you have Transform; they’re all working to help businesses standardize their metric definitions. At Uber, same problem. It’s a large business, and everyone defines metrics differently. I remember part of the difficulty was just getting everyone on the same page on even: What is a trip? What is revenue? How do you count the fare splits and make sure that all of this works into the definition of the metric, and that every team can actually meaningfully use it?

Eric Dodds 09:29

Sure. Love it. Well, thank you for going down a little rabbit hole. That’s super fascinating. But okay, so Uber and now Bigeye.

Egor Gryaznov 09:40

Now Bigeye. And so Bigeye is a data quality platform. We want to help people ensure that their data is high quality; that it’s fit for use. And we are building a platform that helps them monitor their data constantly. Tell them whether something is going wrong or not with their data so that people can know ahead of time, rather than being unpleasantly surprised when they open a broken dashboard.

Eric Dodds 10:10

Yeah, totally. Could you give us a couple of use cases? So companies are collecting this type of data, and then there’s some sort of aberration or deviation. Could you give us a use case, maybe from a customer, just to give our audience an idea of what this looks like in action?

Egor Gryaznov 10:34

Sure. So I actually bet that every single one of your listeners has experienced a data quality problem at some point in the past, and probably as recently as earlier this week. Data quality problems range from something as simple as: our vendor told us that they would deliver the data by Monday at 8 a.m., and they did not, and now it’s Tuesday and we don’t have our data yet. These are known as freshness or latency issues, where the data just isn’t being refreshed on time. This isn’t always third-party data. Sometimes this is internal data that just doesn’t get updated for whatever reason. From there, you move on to more interesting cases of data quality. Let’s say your business has, on average, 10,000 users logging in a day. You can look at your logins table and see the fluctuation: it’s right around 10,000, maybe a few more users during the week and fewer on the weekends, but it’s in a reasonable range. Now, what if all of a sudden 100,000 users sign in because you’re getting spammed by a bunch of bots? You would want to know about that before all that information goes into your analytics and you’re presenting this dashboard to your head of growth, saying, look, we just 10x-ed our business overnight. That’s not real data, but you don’t know that, because you’re not actually looking at the data before you’re using it. So there are a lot of examples of data quality problems, but it always boils down to the question: Can I trust my data? And can I actually make accurate decisions with the data that I’m using?
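
The freshness and volume examples above can be sketched as two small checks. This is a minimal Python illustration, not Bigeye's implementation: the function names, the three-sigma threshold, and the sample login counts are all assumptions.

```python
from datetime import datetime
from statistics import mean, stdev


def is_late(deadline, delivered_at, now):
    """Freshness check: True if the data missed its deadline.

    delivered_at is None when the data has not arrived yet.
    """
    if delivered_at is None:
        return now > deadline
    return delivered_at > deadline


def is_volume_anomaly(history, today, max_sigma=3.0):
    """Volume check: True if today's count is more than max_sigma
    sample standard deviations away from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > max_sigma


# Vendor promised delivery by Monday 8 a.m.; it is now Tuesday and nothing has arrived.
deadline = datetime(2021, 10, 25, 8, 0)  # a Monday
now = datetime(2021, 10, 26, 9, 0)       # the following Tuesday
print(is_late(deadline, None, now))

# Daily logins hover around 10,000, then bots push the count to 100,000.
daily_logins = [9800, 10100, 10250, 9900, 10050, 9400, 9300]
print(is_volume_anomaly(daily_logins, 100_000))
print(is_volume_anomaly(daily_logins, 10_200))
```

With these made-up numbers, the late delivery and the 100,000-login day are flagged, while a 10,200-login day is not. A fixed z-score threshold ignores the weekday and weekend pattern mentioned above; a production check would likely compare against the same day of the week or use a seasonal model.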

Kostas Pardalis 12:28

You mentioned your journey through Uber. You worked on many different things right there, as you say, like you started on pretty much building the whole infrastructure, and had many projects later on. What made you focus with Bigeye on quality? Why do you think, like, quality is important, or at least why you’re so excited about it?

Egor Gryaznov 12:50

I’m excited about quality because it was one of the biggest pain points that we had when we were first building out the data platform at Uber. So many times, somebody would message us on Slack (it was HipChat back then, but same same) and say, something looks wrong with my query, I’m pulling up this dashboard, these numbers don’t make sense. And that’s about it. There’s zero other visibility into what they mean by something going wrong. And this happens over and over and over again. Sometimes it’s internal analysts and data scientists talking to the data engineering team, which was the case for us. Sometimes it’s an executive looking at some KPI dashboard, who then messages an analyst and says, this smells fishy, something looks wrong in this graph, can you double check this for me? If the analyst doesn’t have a place to go and say that all of the data feeding this dashboard is high quality and trustworthy, and is at least consistent with what we expect that data to look like, if they can’t say that with high confidence, then that analyst has to go and probably spend half a day digging through a bunch of tables and SQL queries just trying to understand why something looks wrong. Now, the problem of data quality is even more acute today, because it’s so much easier to scale data platforms today. If you think about even 2014, 2015, when we were building out platforms, everything had to be rolled in house. There wasn’t that much tooling around. You would buy a data warehousing solution; at that point it was either Vertica or Teradata. If you were really ahead of the curve, maybe you were already adopting Snowflake. Redshift was also a common one then. So you get one of these solutions, and then you’re kind of left on your own. Great, I have a place to put my data. Now what do I do with it?
Well, now you build out these processes to make sure you’re ingesting it and modeling it and presenting it in a very controlled manner. And so everyone had eyes on the data pretty much all the time. There would be an analyst who’s responsible for that specific data set, that specific dashboard. They would know what it looks like, they would have a gut feel for what it looks like, and they could identify issues early.

Egor Gryaznov 15:32

In today’s world, if I were to build out a data platform at any business today, in 2021, it would take me a matter of days. I would go to Snowflake, and I swipe a credit card and I get a data warehouse, I go to Fivetran, I swipe a credit card to get my ETL. And I go to Looker, Mode, or Tableau, and I swipe the credit card to get a BI tool. And all of a sudden, I have this full pipeline going. And the amount of data that I can get into my warehouse and actually start using for business decisions, has grown exponentially, because I can just connect as many things as I want through Fivetran. All of a sudden, I have my marketing data going here. And I have all my sales data going here, all my product data going here. And I’m just one data engineer. And I can’t meaningfully know what this data should look like. Is this correct? Can people be using it?

Egor Gryaznov 16:29

And so now I’m fielding all these questions from the business saying, well, my dashboard looks wrong. And my only answer to that is, well, the pipeline is running fine, so I have no idea what else is going on there. Data engineering today does not scale linearly, from a headcount perspective, with the amount of data that is actually being used by the business. Because of that, data teams need a lot more tooling in order to actually scale with the business and scale with their data growth. That’s why right now is such a good time to focus on building tools that help data teams scale: data quality, for example, or understanding where data is coming from, or all this pipeline management. dbt is another great example of this. dbt is just a very, very fast way for us to build data models in a repeatable, sane process. And so this is why you’re seeing this revolution in data tooling: data has started to scale so much faster than it ever has before.

Eric Dodds 17:39

Egor, one point you made that I think is really interesting, and that we haven’t talked a lot about, is the context around the problem. It’s funny, because earlier this week, on the marketing side, where we have lots of data pipelines that run and do our reporting, there was a number that just kind of seemed a little bit off. It wasn’t off enough to be super concerning, but I was just like, that’s really interesting, is that correct? As I went through that, the context, I think, is something that’s really hard to translate, right? If you think about marketing going to the data engineer, there’s so much context that the marketer has: these are the campaigns we’re running, these are the conversion rates that we’re looking at, and all that stuff that the data engineer doesn’t have.

Eric Dodds 18:27

How do you address that problem? And in many ways, I think in some ways it transcends the tooling and gets into the cultural aspects. But I would just love to know the ways that you approach that problem or have approached that in the past. And what does a successful relationship look like there between downstream teams who have context and the data engineer who’s making sure the pipelines are running?

Egor Gryaznov 18:49

I think that’s a really interesting question, because there are really two sides to this problem, as you surfaced. One side is really organizational. From an organizational perspective, you have disparate roles that don’t really understand each other’s domain. Marketers don’t understand data pipelines; maybe they’re writing some basic SQL, but they’re probably not at the level of what the data engineer is doing on a day-to-day basis. Then you have the data engineer who just says, well, I’m already overwhelmed with all of this data that I need to move over into the warehouse, I don’t have time to understand every single business domain. And so I think the right answer is to make them meet in the middle and bring the two knowledge bases together in a way that they can both benefit from each other.

Egor Gryaznov 19:48

And so something that Bigeye sets out to do is to build a tool that allows for that process to happen. We want to allow users to express their expectations around their data in a way that is understandable to business users, so the business user can bring their context over. If they say, well, we expect our average sale price or average number of views on an ad to be around 200, they have that information, and they should be able to provide it to a data quality monitoring tool. We do that through a simple-to-understand WYSIWYG UI, so that someone can just come in and say, this is exactly what I’m expressing here.

Egor Gryaznov 20:40

But on the flip side, it needs to be scalable enough where the data engineer can say, Okay, this thing is alerting me. It’s saying that something is wrong with this data set. Where do I even start? And so that should then be able to provide enough context for the data engineer to say, well, it’s this table, here’s the metric that’s alerting you, here’s some SQL that you can start running right now to try to help debug it. And I think that in the future, this then extends even further into really joint runbooks. This issue fired. The data engineer knows what they need to do in order to fix it on the infrastructure side. But then the marketer can then come in and contribute to the same runbook. And say, by the way, here is the expectation, here’s why this is an expectation. So now you have context around why this is a problem.

Eric Dodds 21:38

It’s almost like really good, like error logging, in a way, right? Like if you think about, like, a lot of detail around. Here’s the basis of the problem. Here’s where you should start troubleshooting. Like, it’s super interesting.

Egor Gryaznov 21:55

It’s like a stack trace for software. Yeah, you look at the next thing down, and you’re like, okay, well, great, what line of code caused that? I think it’s just so much harder in data, because there is no stack trace for data. If I could have any tool at the snap of a finger, it would be a stack trace for data, where you can say, here are the 10 records that are causing this, and by the way, they actually came from here, and they came from here. I know lineage is a very, very popular topic nowadays, but no one’s doing lineage at the record level. Software engineering has: this line in this file caused your exception. Data has, at best: something interesting is going on in your data. But which 100 records out of the 10 billion that I loaded today are causing this? Good luck. It’s easy for some cases. Sometimes you can say, okay, well, this column should never be null, and so if you have a null record there, fine, this is a very easy filter, easy fix. But what if your average moves? Or what if you’re doing some machine learning and your distribution shifts, like your variance goes up? What caused that? Well, it could be anything. You can’t really tell. And so I think it’s just so much trickier in data than it is in software to pinpoint these problems.

Kostas Pardalis 23:31

Yeah, absolutely. Something that I observed, like all this time that you were talking, like, it seems like quality is something that is pretty much like every part of the organization, right? Like it starts from the hardware that you use over there, right? Up to how the VP of Marketing, for example, interprets the numbers, right? And I want to ask you, it sounds like a very big problem like it’s hard even to define right as a problem. Talking about quality, it’s easy to use the term quality. But at the end, if you want to solve the problem, you need a better definition of the problem. So how do you define quality at Bigeye?

Egor Gryaznov 24:13

So I think that’s really interesting, because I agree with you that everyone defines data quality differently. In our viewpoint, data quality is about the quality of the final data product. So I’m going to take a step back. I use the word “data product” a lot. But if you think about software, it’s very easy to define what is the end deliverable for software. It’s usually a website, an API, an SDK, whatever it is, that is the product. And when something goes wrong with that product, it’s immediately apparent what is going wrong. If you go to your webpage and it throws an error, then your product is broken. For data, it’s important to define what those data products are. Now, a lot of the time, data engineers will say, this table is the data product. My deliverable is the fact that this table exists in the warehouse and is being updated consistently. But you need to take a few steps forward from that, because that table is then used in ETLs, goes into other tables, which then eventually go into a dashboard, a machine learning model, which may be feeding some sort of product functionality for your core application. So that is the end deliverable for the data team. And so then it’s important to measure quality at that stage. It’s important to understand that my KPI dashboard is good to use. My product is the KPI dashboard.

Egor Gryaznov 26:03

Now there might be 10 tables that are going into this dashboard. No one really cares about the tables; people care about looking at the dashboard at the end of the day. The tables are helpful in order to inform us about what could cause this dashboard to go wrong. What could cause this data product to be broken? At Bigeye, we have a concept of SLAs, which our customers use to define the state of data products. If you think about SLAs from software, service level agreements, it’s the ability to say when my application is available, when I would consider it unhealthy or broken, and which metrics are contributing to that. For applications, that’s error rates, latency, throughput, however you want to define your SLA.

Egor Gryaznov 27:03

For data products, it becomes the combined metrics that you’re measuring about the underlying data sources. So for example, let’s say I have my KPI dashboard, let’s say I have two tables feeding into that, my users table and my sales table. Right? Now, if my sales table is delayed, for example, or all of a sudden, we notice that there are negative values in the sales amount column, which should never happen. Then I can say this table is unhealthy, because this metric is outside of its expected range, in the same way that you could do that for latency. And then that can then flow into your data product and say, the KPI dashboard is unhealthy, because something that’s feeding that KPI dashboard has turned unhealthy, because one of the metrics has gone outside of an expected range. And so the KPI dashboard has an SLA, that SLA is now red, it’s violated, because an underlying metric has violated that SLA. And so we measure quality at that end product level, but enable users to build up that SLA from the underlying components from those metrics that we are using to measure the state of the data.
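
The roll-up described above, from metric to table to data product, can be sketched in a few lines of Python. This is an illustration only and not Bigeye's data model: the `Metric` class, the `sla_status` function, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    table: str
    name: str
    value: float
    low: float
    high: float

    def healthy(self) -> bool:
        """A metric is healthy when its value sits inside the expected range."""
        return self.low <= self.value <= self.high


def sla_status(metrics):
    """Return ("green", []) if every metric is in range, else ("red", violations)."""
    violations = [m for m in metrics if not m.healthy()]
    return ("red" if violations else "green"), violations


# Metrics on the two tables feeding the KPI dashboard (made-up values).
kpi_dashboard_metrics = [
    Metric("users", "hours_since_refresh", value=2.0, low=0.0, high=24.0),
    Metric("sales", "hours_since_refresh", value=3.0, low=0.0, high=24.0),
    Metric("sales", "min_sale_amount", value=-15.0, low=0.0, high=float("inf")),
]

status, violations = sla_status(kpi_dashboard_metrics)
print(f"KPI dashboard SLA: {status}")
for m in violations:
    print(f"  {m.table}.{m.name} = {m.value} (expected {m.low} to {m.high})")
```

Here a negative value in the sales amount column turns the dashboard's SLA red even though both tables were refreshed on time, which is the situation described in the conversation.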

Kostas Pardalis 28:26

It’s very interesting. You mentioned as an example of data products, usually like the outcome of BI, which is like reporting, right, what other data products do you see, usually in an organization today?

Egor Gryaznov 28:38

Machine learning models are going to be the most popular ones today. And it’s actually interesting, because BI is the most easily understood and grasped example of a data product. There is a dashboard that you can see on a screen, and it’s very easy to understand when that goes wrong. There are a lot more data products today that are more automated and less apparent when they go wrong, and machine learning models are a great example of this. With a machine learning model, you have a training data set. You use it to build the model, and then you use that model to make some sort of prediction. Usually that model feeds some sort of product functionality. So at the end of the day, whether we talk about the machine learning model being the end data product or that product functionality being the end data product doesn’t really matter; it’s pretty much one to one.

Egor Gryaznov 29:40

So let’s talk about the machine learning model. Now, if the data going into that model that’s used for training is incorrect in some way, then this is the classic phrase: garbage in, garbage out. If that data is incorrect, then your end data product is going to be wrong. This is actually extremely costly to the business, even more so than a broken dashboard, because with a broken dashboard a human’s going to look at this and make a decision about whether to trust the data or not, and whether to make a decision to believe it or not. If you have a machine learning model, no human’s looking at that. The first person who’s going to notice is the customer trying to use the product feature.

Egor Gryaznov 30:23

Let’s go back to the example of Uber. Let’s say you have a machine learning model that says, this is how far away we’re going to get drivers to accept a pickup. And let’s say that model trained on bad data, and now all of our drivers are coming from half an hour away, because we are failing to dispatch anyone closer. Well, that’s a problem, and that’s a customer-facing problem. They’re gonna stop using the app. That’s immediate impact on the business. And it’s dangerous because no one’s looking at it. No one’s looking at the model and saying, well, what’s the model doing? Sure, sometimes you have really tight feedback loops that can measure the outputs of the model and say, okay, something looks wrong, let’s roll it back. But most businesses don’t have this, and most organizations don’t build this into any sort of automated flow. And so you must monitor the data quality of those training datasets, and you have to monitor it holistically enough and deeply enough to be able to detect issues that can cause these things. A lot of times those training datasets aren’t monitored at all. I mean, it’d be great if even the inputs to those training datasets were monitored, but even those sometimes aren’t. So now who knows what’s going into this model? And you just expect it to work? That’s just not how machine learning works.

Kostas Pardalis 31:51

Yeah, absolutely. So okay, we are talking about a range of like different data products, where the stakeholders involved in them are like different, right? So who is the person who defines the SLAs for Bigeye? Because if I understand correctly, that’s where everything starts, right? Like someone has to define the SLA. And then the SLA is attached to a number of metrics that you are calculating below the SLA, and you come up with a warning. So who is the responsible entity?

Egor Gryaznov 32:22

So SLAs need to be agreements between both parties. And when I say both parties, the way that I see data teams organized is really into two segments: data producers and data consumers. At the end of the day, there’s going to be somebody who is producing the data that you are using. Typically these are data engineers. An even easier example is third-party data. Let’s say Facebook is sending you your impressions and all your ad metrics. They probably have an SLA with you that says, we promise to deliver this on this cadence, and it’s going to be complete, and so on and so forth. That SLA is between the data producer, Facebook, and the data consumer, which is your team. Now within an organization, same thing. A data engineer is the data producer, and the data consumers are usually the analyst, the data scientist building the ML model, or the product engineer who’s actually consuming some data feed and then using it in the product. So the SLA needs to be a contract between both. Within an organization, sometimes it’s a little bit tricky, because the consumer and the producer have different expectations of what the data should look like, but they can at least meet in the middle. The consumer says, all right, I expect this data to be updated daily. And then the producer might say, yep, that’s totally reasonable, we’re updating it more frequently already. Once they come together and set that expectation, it can then go into that SLA. The SLA for that data product now includes that expectation. And you can go down the list and set all of these expectations. The interesting part here is that there can be auxiliary SLAs. You can have your core SLAs where, full stop, is this data product good or not? Is it on time? Is it complete? Are there any serious anomalies in the data: nulls, bad formats, incomplete data feeds? We expected 1,000 records, but we got 200.
But then what we actually see our customers do is build auxiliary SLAs. So the data consumers are then saying, well, we have expectations about what the actual data should look like. And that might not even be the problem of the data engineer building the pipeline. This might actually be a case of, we instrumented logging incorrectly in our product, and by the time that it got here, something looked wrong. And so then they will go and instrument their own expectations around what the data looks like: we expect to have three product tiers and any other values are invalid. We expect a specific range of numbers when we’re looking at how much we’re charging users, maybe it’s somewhere between $1 and $100 because we know we don’t have anything outside of that range. And if it’s outside of that range, then the data itself is bad, I shouldn’t be using it. And so it really depends on what that SLA is trying to represent. But usually the way that we see it is there is a joint SLA that answers that fundamental “is the data good?” question. And then there are the secondary SLAs around, what does the data look like? And is it meeting my expectations as a consumer?
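The split Egor describes, a core SLA shared by producer and consumer plus auxiliary SLAs owned by the consumer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bigeye's product or API: the `Batch` type, the function names, the tier names, and the column names `tier` and `charge_usd` are all hypothetical, while the thresholds (daily updates, 1,000 expected records, charges between $1 and $100) are taken from the examples in the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Batch:
    """One delivery of a data product (hypothetical shape)."""
    loaded_at: datetime
    rows: list


def core_sla_violations(batch, now, max_age=timedelta(days=1), min_rows=1000):
    """Core SLA: is the data on time, complete, and free of nulls?"""
    violations = []
    if now - batch.loaded_at > max_age:
        violations.append("stale")
    if len(batch.rows) < min_rows:
        violations.append("incomplete")
    if any(value is None for row in batch.rows for value in row.values()):
        violations.append("nulls")
    return violations


# Assumed tier names; the conversation only says there are three tiers.
ALLOWED_TIERS = {"free", "pro", "enterprise"}


def auxiliary_sla_violations(batch, min_charge=1.0, max_charge=100.0):
    """Auxiliary SLA: does the data match the consumer's expectations?"""
    violations = []
    for i, row in enumerate(batch.rows):
        tier = row.get("tier")
        if tier not in ALLOWED_TIERS:
            violations.append(f"row {i}: unexpected tier {tier!r}")
        charge = row.get("charge_usd")
        if charge is None or not (min_charge <= charge <= max_charge):
            violations.append(
                f"row {i}: charge {charge} outside [{min_charge}, {max_charge}]"
            )
    return violations
```

Keeping the two checks as separate functions mirrors the ownership split in the conversation: the producer and consumer agree on the core thresholds together, while the consumer can tighten or extend the auxiliary rules without touching the pipeline.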

Kostas Pardalis 35:58

How do you bring these people together as part of Bigeye? Because from what I understand this is something quite important that affects also at the end the outcome of the product itself? Right? Around SLA, for example. A not well defined SLA or the thresholds of the SLA not being right, at the end might affect, like, the value that Bigeye delivers. So how do you handle human nature in the end with how people can communicate or in most cases cannot communicate?

Eric Dodds 36:28

Really light question here. *laughter*

Egor Gryaznov 36:30

This is a softball, yeah. *laughter* How do you solve massive organizational issues?

Egor Gryaznov 36:42

I think that’s something that we’ll always have to work on. At Bigeye, our goal is to build a product that allows for people to come together and talk about these very important topics. I think there’s the flip side to it, as you mentioned, that the human nature side of people just don’t want to do that. People want to focus on their work and their problems. And for us, a lot of that is just education, even me coming onto the show, someone’s gonna listen to this and think, oh, maybe I should go talk to my data scientist and just wonder what’s important to you about your data. And even just taking those small steps of education is important to us. From a product perspective, I don’t think a product can ever solve organizational issues. And the only way to do that is through education and through, really, at the end of the day, empathy, you have to have empathy for your co-worker. You have to understand that they are also just trying to do their job, and helping them understand what makes their life a little bit easier is just going to make for a better organization overall.

Kostas Pardalis 37:57

It’s really interesting. The reason that I am asking is because I’ve seen many companies that they’re building one way or another, like data-related products. And most of them have also to tackle some kind of organizational obstacles there. Because I think it’s the nature of when you’re working with data, it’s like one thing is managing the infrastructure, like the technology, blah, blah, blah, all that stuff. But at the end, you have pretty much the whole company, as a stakeholder who’s going to consume this data. So it’s always a collaborative game at the end. And I have at least seen a couple of different ways that companies are trying to solve these problems. One, the common, let’s say, the GitHub approach, right? We are trying to build a collaborative platform where people can get on the platform and collaborate and blah, blah, blah, put them on a workflow and all that stuff. Education. That’s like a very, very good point. And I think this is also like marketing, from the perspective of a company it becomes like an amazing tool because you can educate your users and customers. And then there is also because you mentioned, you mentioned Looker. And one of the things that always impressed me with Looker is how they solve an organizational problem, which in their case was like making sure that they separate the data engineer and the business user as much as possible. And they did that by providing LookML for the developer to create the modeling and then a user interface which is as easy as Excel for someone to use. And the two people okay, they have to talk to each other, but at least like the whole communication is, it is much easier.

Egor Gryaznov 39:37

Have you ever had to implement a LookML model in production?

Kostas Pardalis 39:44

I said they tried. I didn’t say they succeeded. Okay.

Egor Gryaznov 39:52

But the reason I ask is because if you completely separate the two, you’re never going to get anything meaningful out of it.

Kostas Pardalis 40:01

Yeah.

Egor Gryaznov 40:01

Because for the model to be meaningful, you need that input from the business and that stakeholder in order to know what needs to go into it. Yeah, the pivot table stuff is great. And I mean, Tableau did the same thing. Tableau extracts are meant to do the same thing, which is, well, here is a pre-built data set that you can now go in and WYSIWYG and drag and drop your way around. But without knowing what needs to go into that data set, you’re just going to get into that same cycle where the stakeholder is going to come back to you and say, Oh, this doesn’t exist in my model. And then you’re gonna go at it, and then they’re gonna say, Well, why is this filter on here? And you’re like, I don’t know, somebody else told me to do that. And so on …

Kostas Pardalis 40:41

Yeah, absolutely.

Eric Dodds 40:42

I was gonna say, I think Kostas, that’s a really interesting observation. A few thoughts here. So one, and I’m making some assumptions here. This is a little bit of a hot take. So Egor, please correct me if I’m wrong. But if you think about data quality, it happens at different places in the stack or in the data flow, right? So like, one thing that we’ve talked about a lot on the show, and Kostas and I talk a lot about, is tracking plans. That absolutely gets right at the heart of organizational problems, right? Because it gets to, like, shared definitions, and then new processes and all of these things, right. And I mean, there’s some really interesting companies doing some really cool things with tracking plans. And I know there’s some great solutions out there. But it’s a really hard organizational problem, because a lot of companies just don’t do it. Right. I mean, the bottom line is, you have your work to deliver, and tracking plans slow things down, and there are demands on the teams. And collaboration is hard in general. And so that’s sort of a difficult organizational problem to solve.

Eric Dodds 41:45

When you were talking about SLAs, and I’m going to speak specifically to the BI use case, because I think the ML one is a lot more complicated. But what’s interesting is I was thinking about even just my own day to day, and the way that Bigeye is approaching it through the paradigm of SLAs is really interesting, because even in our organization, a lot of those are just already defined, even if they’re not made explicit. Like, I know what my SLAs are for the marketing dashboard, right? And I haven’t necessarily written that down, or gone to tell the person who’s writing dbt models or whatever. And those conversations come up organically, but I know those, right, so it’s not difficult for me to mine that information. I already know that. And so it’s interesting when we think about the organizational challenge, like a product that formalizes something that already exists on some level that just hasn’t been made explicit, and then adds collaboration, I think, is really interesting, and I think that’s where a really well done product can actually really facilitate that. I won’t speak to machine learning because that seems, and I’m not technical, that seems like a much more difficult problem. But it seems like a lot of the SLAs already exist in the organization, there’s just not a great way to formalize them.

Egor Gryaznov 43:04

And a lot of it is just getting it out of people’s heads, like you said, you have an SLA in your head, you know what you expect out of this data. But a lot of the tools that data teams build internally, are usually very technical. And they’re very geared towards the data engineering team. They usually involve writing some sort of SQL or configuration or checking in code, even sometimes, or updating the ETL. And those can capture a lot of the basic information. Again, how fresh is my data? How many records do I have, but it’s not going to get all of that stakeholder knowledge in there. So that’s why any data quality tool needs to be able to be accessible enough to extract that information for the stakeholders, the data consumers to come in, and actually express what they have in their head. Because at least that gives you a starting point, you can at least go and write that down, create that configuration, start that monitoring. And when it goes off, you can then go to your data engineering team and say, Here were my expectations with this data. Which one of these did I get wrong? Do you understand something that I’m not understanding? And do I need to adjust my expectations?

Egor Gryaznov 44:24

And a lot of times the data producer team, the engineering team might say, nope, you got that right. And this is a real issue. We just didn’t know about it. Thanks for flagging that. Yeah. But it’s important to first and foremost get that out of your head, get that out of the stakeholder’s head and in a place where somebody can see it, visualize it and understand it, and then that will prompt that conversation.

Eric Dodds 44:53

Well, we’re getting close to time here, but one thing I’d like to talk about is to kind of get specific on when we think about data quality, and we think about Bigeye as a tool that helps solve that. One thing we talk a lot about on the show is that, and you mentioned this with Uber, right? Like the data problems at Uber’s scale are very different from the data problems that a much smaller company faces. What are the symptoms that you think, and maybe that you even see with your customers, that necessitate, like, okay, we need to start thinking about data quality and tooling around that? Are there particular tools in the stack, or data pipelines, that are indicative of the need for this? I’d just love to give our listeners a sense of when (like you said, everyone faces a data quality problem, and everyone’s probably faced it this week), from the Bigeye perspective, do you need to implement tooling?

Egor Gryaznov 45:54

In a very biased answer, as soon as you have data in a warehouse, you probably want data quality tooling. Yeah. In a more objective answer, it really depends. My gut feel would always come from how much time are you spending on data quality problems? And this is typically a question for the data engineering side, but it works for the business as well. Yeah. How much time are you wasting, looking at dashboards that are broken? On the data engineering side? How much time are you spending fielding questions from the business about why their stuff’s broken? Or can you look into this query and tell me what’s going wrong? Yep. Because one question a week, fine. Yeah, but if you’re spending five hours, 10 hours a week, just debugging people’s SQL to help them understand what’s going wrong, you might want to invest in something that’s a little bit more automated, yeah. Because at the end of the day, people just want to do a job that’s fun. They want to do their job, they want to do the fun parts of it. You know, the business that’s getting new insights, making decisions, driving direction. For the data engineers, that’s I want to build frameworks, and I want to build and create new pipelines and explore new tooling. And neither side can do that if there’s too many data quality problems, because they get in the way. And so at some point, the business will have this critical point. We at Bigeye actually have a term called the “Oh, shit moment”, which is, at what point did you have such a big data quality problem, that it completely derailed the whole business, say, KPIs were wrong, sales numbers were wrong, if a product rollout couldn’t be tracked, because the instrumentation was incorrect, and no one noticed for a week until you went to pull the report. So at some point, you’re going to have that moment. And you’re going to realize, we can never have a moment like that again. We need to start worrying about that.

Eric Dodds 48:06

Yeah, Kostas, I’d be interested in your thoughts on this. But so Egor, it was a little bit of a leading question, because I had kind of my own thoughts on this. As I think about this, and I’m just putting it through the lens of my own experience, for me, the trigger would almost be like, I have all my data in my warehouse, you start to build out dashboards, but then you go through this weird period where your dashboards aren’t stable, because you have all this data, and you’re just trying to figure out, what should I measure? How should I measure it? And then you get to a point where you’re like, okay, this is the dashboard that the marketing team is looking at every single day. And these are the numbers, and then you have a baseline from which deviations become really important. And that’s like SLAs in your head. So to me, it’s almost like, okay, the first signs of dashboard stability give you your initial set of SLAs that you can measure from. But would you agree with that, Kostas? Because you’ve done, I mean, all sorts of reporting, especially on the products.

Kostas Pardalis 49:04

Yeah, I would agree with the biased version of Egor’s opinion, to be honest. Like, the sooner you have at least some principles, I mean, you might not want to start using like a product or something, but at least have some principles to check what’s going on with the customer-facing side of your data product, let’s say, okay, which is, I don’t know, like your dashboard, for example, I think the better you’re going to be. I mean, it’s amazing how many times I’ve heard from pretty big companies, data engineers coming to us and being like, oh, we just realized that this pipeline stopped running three weeks ago. Whoa, like, this pipeline is feeding something, right? So why does it take so long? That’s a great point. This “Oh shit moment”, like usually it’s very late when this happens and someone is angry, right? You might have your board meeting and you don’t have your numbers. Ok. Nice. Fun, right? That’s where you have like the common excuse, we are still working on our metrics infrastructure. Until next time, it’s part of our OKRs. Right?

Kostas Pardalis 50:34

But, yeah, I think the sooner you do that, the better. And one of the reasons, outside of, okay, avoiding these “Oh shit moments”, is for people, especially people that start working with data for the first time, to understand and educate themselves that with data, something will always go wrong. There is no perfect data out there. It pretty much can be proven in computer science that you cannot have that, period. Okay. So I remember, for example, I’ll give an example that I kept remembering while Egor was talking. The first time, I mean, when we started Blendo, we were using Google Analytics at the beginning, right? So we were taking numbers and measuring from there. Then we started using Mixpanel. And we were like, oh, we have another data source with the same data. Let’s compare the two now that we have them in our data warehouse. Of course, they didn’t match, right? Okay, what do you do now? But the most important thing is not just how you’re going to tackle the problem, but realizing that this is the reality that you’re going to be operating in. And getting into this habit of caring about data quality will make you understand and incorporate this as part of your business practices, which, in my opinion, is probably super important to start as soon as you start reporting, even on an Excel document, some numbers to your board.
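Kostas's point that two sources reporting "the same" metric will rarely match suggests comparing them against an agreed tolerance instead of expecting equality. A minimal sketch, assuming a hypothetical 5% tolerance and made-up daily totals; the function names are illustrative and do not come from Google Analytics, Mixpanel, or Bigeye:

```python
def relative_gap(a, b):
    """Relative difference between two measurements of the same metric."""
    larger = max(abs(a), abs(b))
    if larger == 0:
        return 0.0  # both sources report zero, so they agree
    return abs(a - b) / larger


def sources_agree(a, b, tolerance=0.05):
    """True when the two sources differ by no more than the tolerance."""
    return relative_gap(a, b) <= tolerance


# Hypothetical daily session counts from two tracking tools.
first_source_sessions = 1000
second_source_sessions = 900
if not sources_agree(first_source_sessions, second_source_sessions):
    print("Sources diverge beyond tolerance; investigate before reporting.")
```

The tolerance itself is the kind of expectation the consumer and producer would agree on up front, since an exact match between independently instrumented tools is not a realistic target.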

Egor Gryaznov 51:58

The more you wait, the worse the problem is going to get.

Kostas Pardalis 52:01

Oh, shit.

Eric Dodds 52:03

Well, I also think about something that Kostas says a lot, which is it’s easy to talk about data in a way that it almost comes across as static. But the reality is data is changing a lot, right? New pipelines are added, other pipelines are deprecated, right? Like it is never static within an organization. And the complexity is only increasing.

Egor Gryaznov 52:24

Even within a pipeline. I mean, even looking at just one pipeline, because there’s plenty of … I’ve seen teams that have one table. And like this is our event log table. It is 500 columns wide, and it stores every single event that happens in our whole product. And even those pipelines can go wrong. Even if nothing changes about the pipeline, there’s no new pipeline. Yeah, but you stop publishing a signup event. And all of a sudden your conversion goes to zero because no one’s signing up. Even within a pipeline, things can go wrong. And like even there, data is never static.
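Egor's example of a signup event that silently stops being published can be caught by comparing the event types seen in a baseline window against the current window. The sketch below is an assumption-laden illustration, not how Bigeye works: the function name, the `min_baseline` parameter, and the event names are hypothetical.

```python
from collections import Counter


def missing_event_types(baseline_events, current_events, min_baseline=1):
    """Return event types present in the baseline window but absent now.

    Both arguments are iterables of event-type names, one entry per event.
    Types seen fewer than `min_baseline` times in the baseline are ignored
    so that rare events do not trigger false alarms.
    """
    baseline = Counter(baseline_events)
    current = Counter(current_events)
    return sorted(
        name
        for name, count in baseline.items()
        if count >= min_baseline and current[name] == 0
    )


# Hypothetical event log samples from yesterday and today.
yesterday = ["signup", "login", "login", "purchase"]
today = ["login", "purchase", "login"]
print(missing_event_types(yesterday, today))
```

A check like this watches the data itself, so it still fires when the pipeline code and schedule are unchanged and only the upstream instrumentation has broken.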

Eric Dodds 53:04

Totally. Well, we are at the buzzer. Egor, before we jump off, if someone wants to learn more about Bigeye, try it. Where should they go?

Egor Gryaznov 53:15

They can go to Bigeye.com. They can also email me at egor@bigeye.com.

Eric Dodds 53:21

Awesome. This has been such a fun conversation. So many rabbit holes we could have gone down. We’ll have to save that for another episode. And thank you again for giving us some of your time to talk shop about data. It’s great.

Egor Gryaznov 53:34

It was my pleasure. I really enjoyed the conversation. Thank you.

Eric Dodds 53:40

I love talking to our guests. They’re just so smart. We learned so many things. My big takeaway is the paradigm of SLAs. And I love the framework that Egor used to talk about SLAs for data products. And I think that’s just a really, really smart way to approach the data quality problem. So I’m even thinking about that for my own, you know, day to day work. So I just really appreciated that perspective.

Kostas Pardalis 54:11

Yeah, absolutely. Actually, I would say that it’s like a broader theme in the way that he was approaching the problem of building data quality related products. If you noticed, there were like two main things that happened during our conversation. One was using the definition of the term SLA, which comes from software engineering, right? And there is, again, the usage of the term data product, which he also defined exactly. And it’s, again, in terms that we are much more familiar with when it comes to software, but it’s something that we can reuse also in data. And I think that’s what Bigeye is trying to do: take a number of best practices and principles that are much more mature in software engineering and apply all of these also to the problem of data management and data consumption. And I think they’re doing a pretty good job. And I’m really looking forward to having another follow-up episode with him, because I think we just scratched the surface of quality. We didn’t even talk about what happens after we define the SLAs. So there are many more things that we can discuss with Egor, and I’m really looking forward to doing that in the imminent future.

Eric Dodds 55:22

Absolutely. Well, thanks again for joining us on the show, and we’ll catch you on the next episode.

Eric Dodds 55:29

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
