Episode 93:

There Is No Data Observability Without Lineage with Kevin Hu of Metaplane

June 29, 2022

This week on The Data Stack Show, Eric and Kostas chat with Kevin Hu, co-founder of Metaplane. During the episode, Kevin discusses why data problems are “silent,” how much we can trust data, and the role of data lakes in the future.

Notes:

Highlights from this week’s conversation include:

  • Kevin’s background and career journey (1:54)
  • Metaplane and the problem that it solves (6:47)
  • The silence of data problems (9:53)
  • The physics work that demanded the most from data (13:35)
  • Trusting data when bugs are present (19:12)
  • Building a navigable experience (22:36)
  • Developing anomaly detection (30:06)
  • What Metaplane provides today (35:05)
  • Metaplane’s plans for the future (37:45)
  • Comparing BigQuery, Snowflake, and Redshift (40:56)
  • Why data goes bad (48:15)
  • Advice for data workers on building trust (59:24)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com. 

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Today we’re talking with Kevin Hu from Metaplane. Kostas, there are a lot of tools in the data observability space, and that’s what Metaplane does. And I’m interested to know, of course, I do a lot of stalking on our guests for the show, but I want to know how he went from MIT to starting Metaplane, because that’s an interesting dynamic, coming out of academia and then going through Y Combinator and starting a company. So I just want to hear that backstory. How about you?

Kostas Pardalis 0:56
I want to learn more about the product, to be honest. It’s data observability and data quality, and I don’t know what other names the category is going to have tomorrow. It’s a very hot product category right now in terms of development and innovation, and I think he’s the right person to talk about that. So let’s see how Metaplane understands and implements data observability, and also what’s next after that. What are the plans there, and where are they going?

Eric Dodds 1:27
Let’s do it.

Kevin, welcome to The Data Stack Show. We’re so excited to chat with you.

Kevin Hu 1:33
I’m so excited to be here. I’m a longtime listener of the show, I recognize both of your voices, and to be here with you on Zoom is really a privilege, so thank you.

Eric Dodds 1:43
Cool! We always love hearing from our listeners, especially when they are guests on the show. Of course, I do LinkedIn stalking. Our listeners know this, and you probably know this as a listener of the show. So you started at MIT studying physics and then you made the switch over to focusing on more computer science subjects, so I have two questions for you. One, why did you make the switch? And two, did studying those topics from an academic standpoint influence you to start Metaplane?

Kevin Hu 2:18
Yeah, great questions. I did start studying physics. And I remember the gauntlet course at the time, the experimental lab course everyone took as a junior, was notorious for burning people out. Each week you replicate a Nobel Prize-winning experiment, and the following week you analyze the data. Something that really stood out to me was that the people who had the hardest time in the course weren’t necessarily the weakest physics students; it was the people who didn’t know MATLAB and didn’t know Python. They could collect the data but weren’t able to analyze it. They were the ones pulling all-nighters. At the same time, my sister, who is a biologist, had about five years of data on fish behavior. Tilapia are very interesting fish: you have a tank of them, you drop in another tilapia, and all the other tilapia change their behavior. They’re very tribal and very easy to observe. At the end of five years, she messaged me saying, “Hey, Kevin, can you help me analyze this data? Because I don’t know R.” And to me, this was just absurd. Why are some of the brightest people in the world bottlenecked because they don’t know how to write code? Obviously, that doesn’t apply only to scientists, but really to anyone in an organization that either produces or consumes data: if they don’t know how to program, they’re not necessarily working with data in the lowest-friction way. So that’s how I got into CS research, trying to build tools and develop methods for automated data analysis. This is back in 2013.

Eric Dodds 4:12
Okay, wow. Super interesting. Tilapia are also tasty, by the way, if you’re a good cook. That is a data point. That’s a qualitative data point. Happy to share that with your sister.

Kevin Hu 4:25
I have plenty of qualitative data points, too. Hopefully your listeners are not fish.

Eric Dodds 4:32
That’s right. Okay, so you studied computer science tooling, how to support people, help people, based on your experience of really bright people not being able to analyze data. Take us from there to starting Metaplane, and then tell us what Metaplane is and does.

Kevin Hu 4:53
So for six years, we built tools that, given a CSV, try to predict the most interesting, by some measure, visualizations or analyses that could come from that CSV. At first it was really rule-based, but then it was more machine learning based, where we had a lot of datasets and visualizations and analyses scraped from the web. The papers were really interesting, and it turned out you could predict how analysts worked on a dataset with relatively high accuracy. The problem was when we tried to deploy it at large companies, including Colgate-Palmolive and Estée Lauder, who funded a large part of my Ph.D. I still have many goodie bags; some of my colleagues got GPUs, I got retinol and lots of toothpaste. But when we wanted to deploy these tools, it became very clear. We’d ask, “Okay, can I just see your database?” And they’d say, “Which database? We have, like, 23 instances of SAP.” This was back in 2015 and 2016, so it was a bit worse then than it is today. But it became clear that data quality is one of the biggest impediments to working with data, not the last mile of generating analysis once you have a final, clean dataset. So that’s the motivation to build Metaplane. We couldn’t necessarily make that flower grow ourselves, and now we have augmented analytics and other categories arising to try to do that analysis, but we figured if we can plant the garden, maybe someone else can take it further from there.

Eric Dodds 6:40
Very cool. And so tell us about Metaplane. Like, what’s the problem that it solves?

Kevin Hu 6:46
So Metaplane, we like to think of it as the Datadog for data. It’s a data observability tool that connects across your data stack: to your warehouse like Snowflake, your transformation tool like dbt, your BI tool like Looker. And very simply, we tell you when something might be going wrong. Specifically, there’s a big asymmetry that we observe today, where data teams are responsible for hundreds or thousands of tables and dashboards. This is great in part because data is becoming a product, right? It’s no longer used just within the main vein of BI and decision support, even though that will always be important. With reverse ETL, okay, maybe that term is not cool anymore, data is being activated into your go-to-market tools and being used to train machine learning models, and that is all good; the promise of data is starting to become more and more true. However, while your data team is responsible for hundreds of tables, your VP of Sales only cares about one report, which is the Looker dashboard that they’re currently looking at. So there’s this asymmetry, where frequently data teams find out about data issues, or silent data bugs as we call them, when the users of the data notice them and message the data team. That matters for two reasons. One, if you’ve received those Slack alerts, and if you’re listening to this podcast, you probably have, there goes your afternoon, and you did not have much time to spare to begin with. But also, trust is very easy to lose and hard to regain, especially when it comes to data. Because once that VP of Sales decides, “Okay, screw this, I’m going to have my dev ops team build up reporting in a shadow data stack,” then what was the point of getting Snowflake and getting all this data together to begin with? If we don’t have a culture of trusting data, it doesn’t really matter how much of it you collect or use.

Eric Dodds 8:47
Hmm. Yeah, absolutely. I want to dig in on one thing, and then I’ll hand the mic over to Kostas. You mentioned the silence of errors or bugs or problems that happen with data, which is a really interesting way to think about the problems that we face in data. So two questions for you. One, how do you think the audible nature of those things differs in data as compared with, say, software engineering? Because in software engineering, if we think about Datadog, there’s a lot of defined process and tooling, and a lot of that’s being adopted into the data world. So one, I would love a comparison there. And then two, could you describe on a deeper level, and maybe do this first, what a silent problem is? Why are the problems with data silent? Why do you even use that term?

Kevin Hu 9:47
Great questions. Let’s start with the silent data bug. Frequently all of your jobs are running fine, Airflow is all green, Snowflake is up, and yet your table might have 10% of the rows you expected, or some distribution, like the mean of a revenue metric, has shifted a little bit to an incorrect value. These sorts of issues in the data itself, unless you have something that is continuously monitoring the values of the data, aren’t necessarily flagged by infrastructure checks like your systems being up or your jobs running. That’s why we want to make silent data bugs more audible, increase the volume a little bit, because if you don’t know these issues are occurring along the way, then inevitably the only place you will notice them is at the very end, when the data is being consumed. One, because that person has the most incentive to make sure the data is correct, but frequently the person using the data also has the most domain expertise. If they’re on the sales team, they might know exactly what should go into this revenue number. They might not know how it was calculated along the way, but they know when it’s wrong. And that is one departure from software observability, which really is the inspiration for data observability. The term was completely co-opted from the Datadogs and Splunks of the world. But to be fair, they co-opted the term from control theory, where observability has a very strict definition, as the mathematical dual of controllability for a dynamical system, where you want to understand the internal state of the system from what you can observe. So I don’t feel too bad.

Eric Dodds 11:45
Yeah. All art is theft, right?

Kevin Hu 11:50
Exactly, exactly. If we keep tracing it all the way down, back hundreds of years, we’ll find a Dutch physicist trying to make windmills turn at the same rate. But just to finish that thought: in the software world, before the Datadogs, you would frequently find out about software and infrastructure issues when the API went down or when your heartbeat check failed. But as the number of assets you’re deploying increases and increases, that level of visibility is not sufficient. Right now, if you’re on a software team, it’s almost mind-blowing to think that you would want your customers to be the ones who find out when your API is failing or when a query is slow; you want to find out about that regression internally.

Kostas Pardalis 12:45
Yeah, absolutely. Okay. Before we resume the conversation about observability, I want to go back to physics and your graduate studies, and I want to ask you, and this is a very personal curiosity of mine: from all the work you did in physics, which part required the most in terms of working with data and using R or Python? And which parts, let’s say, could almost not exist as a domain of physics if we didn’t have today’s computers and all these languages and systems to go and crunch the data?

Kevin Hu 13:32
Well, I have two answers to that question. One is when I was doing more pure physics research, like AMO, atomic, molecular, and optical physics research, think ultracold atoms using laser cooling and trapping, where the level of fine control you need to calibrate these systems, and the amount of data you’re retrieving from the systems you’re observing, is immense. There’s a reason why high-performance computing was really invented at CERN, and why the web was kind of invented at these scientific research facilities: they had the need for data first. Even today, the scientific computing ecosystem exists almost separately from our data stack, and the qualities of the data are completely different. The other strand was that at some point I got more interested in quantitative social science research. We published this paper on the network of languages, trying to understand how information flows from person to person via the languages that they know. Specifically, there is nothing stopping us from going to a news site in another language besides the fact that we might not know that language. We had tons of data at the time about bilingual Twitter users, about Wikipedia editors who edited in more than one language, and about translations from one language to another, to try to figure out the connectedness and the clusters of different languages. That wasn’t necessarily a big data problem; it all fit on one person’s laptop. But we couldn’t have collected that data without today’s systems.

Kostas Pardalis 15:23
Yeah, 100%. That’s super interesting. I remember one of the first episodes that we had, we had a guest who had worked at CERN. He was taking care of the infrastructure layer and writing code in C++ and other languages, and it was funny to hear his first impression when, after his Ph.D., he went into industry and heard about big data, people saying, “Okay, we need a whole cluster to process this data.” And he was like, “Okay, are you sure you can call that big data?”

Eric Dodds 16:04
He was dealing with petabytes and petabytes of data, just an unbelievable amount. So he goes to work in insurance and he’s like, this is the kiddie pool.

Kevin Hu 16:15
Totally. There are levels to the game, and I’m sure that when he goes down the hall and talks to another person there, they’re like, “Petabytes? We have even more data than that.”

Kostas Pardalis 16:23
Yeah. It’s super interesting to see the different perspectives when someone is coming from scientific computing, the point of view they have and how they solve problems after working for a long time with a lot of data. Although, again, we also have to say that the needs are completely different; the environment and the context in which they do the processing are also very different, so it’s not exactly comparable. You cannot say that the work Facebook is doing with the data they have is the same type of problem that’s solved by highly parallelized algorithms trying to solve partial differential equations, for example. They’re very different problems, and they have different needs, both in terms of infrastructure and in the software and the algorithms being used. But yeah, 100%, there is a reason, as you said, that the web came out of these institutions, and all these technologies are highly associated with physics.

Okay, enough with physics. Let’s go back to data observability. I have a question, and it’s very interesting, because you talked about this experiment with languages and being bilingual and all that stuff, but something similar, I think, also happens when we create new product categories. As you said, we stole the term observability from Datadog, which took the term from control theory, and who knows what that Dutch guy was up to. But you used with Eric the terms “bug” and “silent,” and in software, when we talk about bugs, there’s a, let’s say, fairly deterministic relationship. Okay, there are a few bugs that are hard to find, especially in distributed systems and things like that, because the behavior is not necessarily deterministic, but broadly, when we’re talking about bugs, we’re talking about something deterministic in a system, right? With data, my feeling is that when we talk about bugs in data, it’s not exactly that; there’s much more vagueness there, and it’s not that clear to define what the bug is. That’s why I often say that maybe it’s better to use the term “trust”: how much can we trust the data? So from a binary relationship, bug or no bug, we go to how much we can trust something. What’s your experience with that? What’s common and what’s not common between software engineering and working with data?

Kevin Hu 19:10
You’re so right. The way that we refer to data as having bugs is not one-to-one with software. A software bug is a logical issue: somehow your logic did not produce the outcome you expected when it encountered the real world. Either the real world was more complicated than you thought, which is often the case, or your logic was not sound, in which case get someone to review your PRs. Engineers on my team would be like, “Well, Kevin…” Data bugs are interesting because I think the root cause can be similar in some cases: yes, there can be logical issues in your DAG, extending beyond the warehouse from the very beginning to the very end, because it is conceptually a chain of logical operations. But the data could also be input wrong: either it came from a machine that did not do what you expected, or a person entered the wrong number. So you’re right that the scope of a data bug is a little bit larger in that sense. And as a result, what goes into data observability is slightly different from what goes into software observability. In software, you have the notion of traces: you have an incident that occurs, but you also have the traces, the time-correlated or request-scoped logs that help you answer, okay, where did this begin and where did this end? In data, that’s kind of replaced by the concept of lineage. But the tricky thing is that lineage is never perfect, at least until Snowflake starts surfacing it to everyone, and even then Snowflake will not cover it end to end, down to the BI tool and upstream as well. Maybe they’ll work with RudderStack to figure that out. But there’s always some loss of resolution along the way. So even if you build all those integrations and build an amazing parser, you’re still working with incomplete information, whereas traces in the DevOps world can be extremely exact. You might still have to infer causality, but at least you have all the metadata that is relevant.

Kostas Pardalis 21:39
With observability in DevOps, from a product perspective, the problem you have is that you need to build an experience around what is probably too much data, in a way. There’s just so much data, and you need to help the user navigate over that data to find the root cause. That’s the problem you have when designing the product experience there. But when we’re talking about data observability, we have vagueness together with probably way too much data at the same time, because if you start collecting all the metadata about the data, you can also have an explosion there. So how do you do that? How do you build an experience that can help people navigate this vagueness and complexity at the same time, to figure out the root cause of the problem at the end, or to figure out whether they can trust the data or not?

Kevin Hu 22:34
Part of this is a very challenging computational problem on the backend. But another part of it is a UI/UX problem, which is no less difficult and may even be more important. So let’s take, for example, a table that is delayed. It’s usually refreshed every 10 minutes; let’s say it’s been two hours, and that is unusual even after taking seasonality into account. If we surface this issue to a customer, then okay, that’s useful. But almost always the first question is: does this matter? If the table is not being used by anyone, maybe we don’t need to fix it right now. And then the second question is: what is the root cause, so can I do something about it? Obviously, when all three pieces fall into place, a real issue has occurred, it has an impact, and I can do something about it, it’s going to bubble to the top of your triage list. To answer your question, that means a few things on the Metaplane side, or for any tool that’s trying to do this for you. One is building really robust integrations across the data stack. It needs to be in your BI tool, ingesting all of your dashboards and the components of those dashboards, getting the lineage to a table at as fine a resolution as possible, and making sure that’s up to date and reflects the latest state of your warehouse and the latest state of your BI tool. The second is disambiguating entities correctly. If you have a transactional database that’s being replicated into your analytical database, how do you know that one table refers to the other? If you have a Fivetran sync, how do you know that this Fivetran sync is syncing entity A to entity B? That’s a tough problem. And then the third piece is prioritization. One table might have 100 downstream dashboards. How exactly do you want to surface this to your user? Do you just say the number 100, or do you list all 100? There’s a principle, at least in information visualization, Shneiderman’s mantra, from the inventor of the treemap, a professor at the University of Maryland I believe: overview first, then zoom and filter, and finally details on demand. So the way we try to map that onto Metaplane is giving you as useful an overview as possible of what happened in an incident, letting you filter down to what you think is relevant, and then finally zooming in on the details when you want them, for example, the number of times that one dashboard that depends on this table has been used.

Kostas Pardalis 25:29
Okay, that’s super interesting. You said it’s both a UI/UX and a computational problem. Let’s talk a little bit more about the computational problem. What are the challenges there? What needs to happen on the back end, and what methodology and algorithms do you have to use to track these things and make sure you surface the right thing to the user at the end?

Kevin Hu 25:57
One tough problem is anomaly detection. One reason why data observability exists as a category is that it’s tough to test your data manually. There are great tools to do that, where you say, okay, I expect this value to be above some threshold, and honestly, every company should probably have a tool like that for the most critical tables. However, it becomes quite cumbersome to write tests across your entire data warehouse and then merge a PR every time the data changes. Which is where data observability comes in: we and everyone in the category say, okay, you do that for the most important tables, but let our tool handle testing for everything else. One necessary ingredient is some sort of anomaly detection. It could be machine learning based, or it could be more traditional time series analysis, where we track this number for you. And of course, we have to take the classical components into account, like the trend component and the seasonal component. But there are a lot of bespoke aspects to enterprise data. For example, row counts tend to go up, and the rate at which they go up varies over time. If you use an off-the-shelf tool, you’d just be sending false alerts every single time the count goes up. Your data is particular; your company is a little bit different. So there’s a lot of work that goes into anomaly detection, because if you cry wolf too many times, people are just going to turn it off. The other component is log ingestion. Let’s say you’re using Snowflake: you have 365 days of query history, and a tool like Metaplane will be ingesting that query history and then parsing it, both for usage, so understanding how tables and columns are being used, and for lineage: what does this query depend on, and what does it transform those dependencies into? And this is a notoriously difficult problem. I think no one has figured it out with 100% coverage and 100% accuracy across all data warehouses, except for the data warehouse vendors themselves.
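To make the row-count example concrete, here is a minimal sketch of a seasonality-aware check in Python. It is only an illustration of the general idea Kevin describes, not Metaplane’s actual model: it groups historical hour-over-hour deltas by hour of day and flags the current delta if it falls far outside that history. The function name, the z-score threshold, and the minimum history length are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

def row_count_is_anomalous(history: list[tuple[datetime, int]],
                           current_ts: datetime,
                           current_count: int,
                           z_threshold: float = 3.0) -> bool:
    """Flag a row count whose latest change looks unusual for this hour of day."""
    # Work on deltas rather than absolute counts, so tables that steadily
    # grow do not alert simply for growing.
    deltas_by_hour = defaultdict(list)
    for (_, prev_count), (ts, count) in zip(history, history[1:]):
        deltas_by_hour[ts.hour].append(count - prev_count)

    seasonal_deltas = deltas_by_hour.get(current_ts.hour, [])
    if len(seasonal_deltas) < 5:
        return False  # not enough training history yet; stay quiet

    current_delta = current_count - history[-1][1]
    mu, sigma = mean(seasonal_deltas), stdev(seasonal_deltas)
    if sigma == 0:
        return current_delta != mu
    return abs(current_delta - mu) / sigma > z_threshold
```

A real monitor layers much more on top of this, trend terms, day-of-week effects, feedback from users marking false positives, but the core loop of learning what “normal” looks like per table and alerting on deviations is the same.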

Kostas Pardalis 28:21
Yeah. Why do you say that problem is notoriously hard? What makes it so hard? You have all the queries that have been executed over the past 365 days. What’s the difficult part of using that to build the lineage?

Kevin Hu 28:39
It’s a combination of things. There are differing SQL dialects from warehouse to warehouse. Things are starting to get standardized, but the SQL that you write for Snowflake is different from the SQL you might write for Redshift. Secondly, there is often a lot of ambiguity within the data warehouse. Which tables are being used within this query? That’s a relatively easy problem. But then, which columns are being used? Two tables might have overlapping or duplicate column names. You might say, okay, SQL is a well-defined language, and Snowflake is able to resolve the SQL into the columns and tables being used, but they have access to the metadata and they have access to their runtime.
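For a sense of what query-log parsing looks like in practice, here is a small sketch using the open-source sqlglot parser. This is only an illustration of the approach, not the parser Metaplane actually uses, and the function name and example query are made up. Note how the unqualified column in the example is exactly the ambiguity Kevin mentions: without the warehouse’s metadata you cannot say for certain which table it belongs to.

```python
# pip install sqlglot
from sqlglot import exp, parse_one

def query_dependencies(sql: str, dialect: str = "snowflake") -> dict:
    """Extract the tables and column references a query depends on."""
    tree = parse_one(sql, read=dialect)  # dialect matters: Snowflake != Redshift

    tables = sorted({
        ".".join(part for part in (t.db, t.name) if part)
        for t in tree.find_all(exp.Table)
    })
    columns = sorted({c.sql(dialect=dialect) for c in tree.find_all(exp.Column)})
    return {"tables": tables, "columns": columns}

print(query_dependencies(
    "select o.id, c.region, amount "
    "from raw.orders o join raw.customers c on o.customer_id = c.id"
))
# roughly: {'tables': ['raw.customers', 'raw.orders'],
#           'columns': ['amount', 'c.id', 'c.region', 'o.customer_id', 'o.id']}
# 'amount' is unqualified, so the parser alone cannot tell which table it came from.
```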

Kostas Pardalis 29:35
Yeah, absolutely. So you think this could be easier to handle if more metadata were exposed by the database system, right? If the information exposed by Snowflake, for example, were richer, that would help a lot to figure these things out. So it’s more about exposing more of the internals of the databases; that’s what’s needed there. That’s very interesting. All right, back to anomaly detection: what are you doing in your product with anomaly detection right now? Do you have some kind of functionality around that, and how does it work?

Kevin Hu 30:16
Yeah, one quick note on the data warehouses releasing their internal lineage: I know that Snowflake is starting to do this. It may only be available to enterprise customers right now, but the moment they do that, one whole category of tools, the data lineage tools, will have a much harder time, and everyone else will be exponentially more powerful. If we had access to that for all of our Snowflake customers, which is basically almost all of our customers, it would be insane the number of workflows that would unlock.

Kostas Pardalis 30:47
Okay, that’s interesting, actually. So it’s going to be a problem for the lineage companies and products out there, obviously, because that functionality is going to be provided, let’s say, by Snowflake, but at the same time, this is going to make things much more interesting for you. Why is that? Outside of having access to the additional metadata, is there something else that’s going to make it more interesting? Is it because all your customers are on Snowflake, or does that not matter?

Kevin Hu 31:24
I think it’s primarily being able to rely on their lineage over our lineage parsing, in that theirs would be much more correct and up to date, and have higher coverage than ours does.

Kostas Pardalis 31:35
Yeah, but on the other hand, that’s only the lineage that happens inside Snowflake, right? What happens before and after that? Let’s say you have Spark doing some stuff on your S3 to prepare the data, and then you load this data into Snowflake, which I think is pretty common in many use cases. Even if Snowflake does that, how can you see outside of Snowflake, especially before the data gets ingested into Snowflake?

Kevin Hu 32:10
Totally, yeah. They don’t have the full picture, which is where data observability tools come in and kind of complement it. The lineage within the warehouse might be a very key part of the picture, but it’s not all of it: it’s not the downstream impact, and it’s not the upstream root cause. That’s how the two play together a little bit.

Kostas Pardalis 32:35
Yeah, it makes sense. Makes sense. Okay, so back to anomaly detection, I don’t want to get us too distracted. What do we get from you today in terms of anomaly detection? What can I use out of the box?

Kevin Hu 32:48
So out of the box, right now, if you go to metaplane.dev, you can sign up through email or Google Workspace and connect your warehouse, your transformation tool, and your BI tool. Typically people can do this within 15 minutes; we’ve had highly motivated users do it within five, which is insane, because I can’t even do it within five. I guess when you want it, you really are motivated to do it. Off the bat, we cover your warehouse with tests based on information schema metadata. For Snowflake, row counts, schema, and freshness kind of come for free from the warehouse. You can go a little bit deeper with out-of-the-box tests, like testing uniqueness and the distribution of numeric columns, and you can write custom SQL tests. Our customers usually blanket their database and have hundreds of tests on top of it within 30 minutes. Then you just let it sit, because we have the anomaly detection running for you in the background as we collect the historical training set. Depending on how frequently your data changes, it can be between one day and five days until you start getting alerts on your data.
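As an illustration of how much of this can run off metadata alone, here is a small sketch that pulls freshness and volume information from Snowflake’s INFORMATION_SCHEMA.TABLES view, which exposes ROW_COUNT and LAST_ALTERED without scanning the tables themselves. This is a simplified stand-in for the kind of check described above, not Metaplane’s implementation; the environment variables, the two-hour threshold, and the function names are placeholders.

```python
# pip install snowflake-connector-python
import os
from datetime import datetime, timedelta, timezone

import snowflake.connector

# Metadata-only monitoring: row counts and last-modified times come from the
# information schema, so the query is cheap compared to scanning the tables.
METADATA_SQL = """
    select table_schema, table_name, row_count, last_altered
    from information_schema.tables
    where table_type = 'BASE TABLE'
"""

def collect_table_metadata():
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],  # placeholder credentials
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=os.environ["SNOWFLAKE_DATABASE"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    )
    try:
        cur = conn.cursor()
        try:
            cur.execute(METADATA_SQL)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()

def stale_tables(rows, max_age=timedelta(hours=2)):
    """Naive freshness check: flag tables not modified within max_age.
    A real monitor would learn each table's expected cadence instead."""
    now = datetime.now(timezone.utc)
    return [
        (schema, name, last_altered)
        for schema, name, row_count, last_altered in rows
        if last_altered is not None and now - last_altered > max_age
    ]
```

Snapshotting ROW_COUNT on a schedule is also what would feed the kind of anomaly detection sketched earlier; the history of those snapshots becomes the training set Kevin mentions.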

Kostas Pardalis 34:11
So it’s between one and five days. That’s neat. And what’s the footprint that you have so far? Because in the conversation we’re having about data observability, we’re focusing a little bit more, at least that’s how I feel, on the data warehouse. So would you say that what Metaplane is doing today is more observability of the data warehouse, or do you provide, let’s say, observability across the whole data stack that a company might have? Let’s say I have streaming data and a Kafka somewhere, and then I also have a couple of other databases, and I might also have a Teradata instance running somewhere. What kind of coverage would you say Metaplane provides today?

Kevin Hu 35:05
We are focused on the warehouse and its next-door neighbors right now. Part of that is a strategic move as a company: we want to start from the place of the highest concentration. Snowflake is getting tons of market share, as is Redshift, as is BigQuery, so we don’t have to build a whole slew of integrations; those three cover a lot of the market today, and every one of our customers uses one of those three. We have the downstream BI integrations, so Looker, Tableau, Mode, Sigma, Metabase, kind of go down the list. We also support the transactional databases, like MySQL and Postgres, and increasingly many OLAP databases like ClickHouse. But that’s where we stop, and honestly, that’s where everyone in our category stops today. I’m not very happy with that, because this is just level one of monitoring. When you check out an observability tool in two years, or in five years, it’s going to be completely different; it’s going to be much more like the picture that you described, Kostas. And I think that is not only important but really critical, because data is ultimately not produced by your data warehouse. Snowflake does not sell you data; it sells you a container into which you can put your data. That data is being produced by product teams, engineering teams, go-to-market teams, and it’s being consumed by those teams too. So when we talk about data trust, which you mentioned before and which I think is a much better category name than data observability, that trust is ultimately in the hands of the people who consume and produce the data. That’s where we as a category have to go.

Kostas Pardalis 37:00
That’s interesting. Okay, so what’s your experience so far with the other big container of data, which is data lakes? We have data warehouses, a much more structured environment, but we also have data lakes, where Databricks is dominating, and it’s a completely different environment when it comes to interacting with data. And okay, there’s also this new thing now with the lakehouse, where you also have SQL interfaces there. But what have you seen so far with data lakes and observability? Because that’s also a big part of working with data, especially with big amounts of data, and in many cases there’s a lot of work that happens before the data is loaded into something like Snowflake, and it has to happen within the data lake. Is Metaplane doing something with them today? Any plans to do something in the future? And what do you think is the role that data lakes will have in the future?

Kevin Hu 38:04
Honestly, we don’t come across data lakes too often. Part of it is where we’re focused in the market. If you’re, for example, at a company with fewer than 5,000 people, Metaplane is probably the right choice for you as the data observability tool: fast time to value, fast to implement, focused on the workflows. If you’re above 5,000, there are other options on the market, and you might be in a position to build it in-house too. We’ve found, and maybe this is incorrect, that Databricks is much more highly concentrated in the enterprise. And when we come across a company that uses Databricks, frequently they’re also using Snowflake or a data warehouse, and they’re using Spark for the pre-Snowflake transformation.

Kostas Pardalis 38:59
Yeah, 100%. That’s interesting, but you don’t see the need right now for Metaplane to move into observability for these environments. The reason I’m asking is technical: it’s something very different, and I’d love to hear what the challenges are, what the differences are, and learn a little bit more about that. That’s why I’m insisting on these questions around data lakes and the Spark ecosystem.

Kevin Hu 39:33
There are some big challenges. There are engineering challenges, like having to rewrite all of our SQL queries into Spark jobs, not necessarily against a table but against a DataFrame. And there are also differences in terms of the metadata that’s available to you. Data warehouse metadata, we’ve found, is quite rich in comparison with the metadata you might have within a data lake. The lake might have the number of rows, or it might not; you might have to run a table scan for that, or continuously monitor the queries to keep track of the number of rows. Even to get the schema, you might have to do a read. In general, it’s much harder to have the level of visibility inside a data lake that you have in a warehouse.

Kostas Pardalis 40:27
Yeah, 100%. The query engine makes a huge difference there when you have to interact with that stuff. Alright, cool. So, from your experience so far, you mentioned BigQuery, Snowflake, and Redshift, and from what it sounds like, a big part of your customer base is on Snowflake. What’s your experience with these three platforms so far? Give us your pros and cons of each one of them.

Kevin Hu 40:56
There are pros and cons of each. Snowflake has the richest metadata in terms of the freshness and the row counts of different tables; BigQuery also has that metadata. However, to use Metaplane, our customers either tack us onto an existing warehouse or they provision a warehouse specifically for Metaplane, and this is nice because you can separate out the compute and keep track of the internal spend that is incurred by this monitoring. At the same time, we necessarily impose a cost, whereas some users who use Redshift and are not at full capacity can tack on Metaplane at no visible financial cost to themselves.

Kostas Pardalis 41:44
That makes sense. I think that gauges the trade-off between having the elasticity of a serverless model, like BigQuery has, compared to paying for a cluster that obviously can be underutilized, and when it’s underutilized you can put more stuff there without paying more. But that’s the trade-off that every infrastructure team has to face at some point in their decisions. So, let’s say in terms of what is supported: do you offer the same experience across all the different platforms, or do you have more functionality toward one or the other because of what they expose?

Kevin Hu 42:26
It’s the same experience across all three. No major differences.

Kostas Pardalis 42:33
Okay. That’s great. And how much of a concern is the cost? I mean the additional cost that is incurred by a platform like Metaplane that continuously monitors the data in the data warehouse.

Kevin Hu 42:45
It’s surprisingly much less than people might expect, because we’re using the information schema and the existing metadata as much as possible. For the tests that rely on your metadata, we can read that within seconds at the top of the hour, or whatever frequency you set, and it turns out to be a pretty negligible amount of overhead compared to the spend you might have from other processes running in your data warehouse, measured in single-digit percentage points. Some customers have longer-running queries for much larger tables or more sophisticated monitoring, but typically that step is taken more deliberately, and then the cost is more justified.

Kostas Pardalis 43:33
Are there use cases where, okay, you have continuous monitoring where you establish your monitors and they run every, I don’t know, one hour, 10 minutes, one minute, whatever. But do you also see ad hoc monitoring from users? Do they use the tool not just for monitoring but also to debug issues with the data?

Kevin Hu 43:59
Totally, that is the next step after the monitoring, after the flag goes off. One, you know that an incident occurred. But two, you have this historical record of what the data should be and how it has behaved over time. It’s a little bit like debugging once you have product analytics. If you did not have a product analytics tool, you don’t necessarily know what the latency has been over time, what all the dependencies are, or what has happened in a user’s journey. It’s very similar with Metaplane, where in addition to the core incident management workflow, there’s another component, which is trust in and awareness of your data. Teams bring on Metaplane at first, of course, often because stuff has hit the fan and they’re like, okay, now we need to get ahead of it next time around. But right after implementing Metaplane, it could be within a few minutes, you see how queries are being run across the warehouse and how the lineage looks within your data stack, and it’s like, wow, how did I live without this?

Kostas Pardalis 45:14
Okay, take us by the hand now and give us an example. Let’s say we have an incident: the monitor goes off and it’s like, oh, something’s wrong with this table. From the things you have experienced, as a common example, describe to us the journey the user goes through in Metaplane from that moment on until they resolve the problem. And I’d love to hear what happens inside Metaplane for that, and what happens outside. How do the two work together for the user to figure out and solve the problem?

Kevin Hu 45:56
So today, Metaplane is like, let’s say you have a home security system: it is the alarm and it is the video, but it does not call the police for you or do the triaging for you. So Metaplane will send you a Slack alert, or maybe a PagerDuty alert, saying: this value, we expected it to be around 5 million, it fluctuates a little bit, but now it’s at 1 million. Here are the downstream BI reports, so this dashboard was last viewed today, this many times, by these people. And here are the upstream dependencies, all the dbt models that go into this model. What you can do from there is click into the application and see the overall impact, and assess, okay, what are the immediate upstream root causes. Then you can give feedback to our models: if this is actually an anomaly and you want to continue to be alerted on it, you mark it as such, and if it was actually normal, we’ll exclude it from our models, because at the end of the day, data does change and no anomaly detection tool is 100% accurate. You click and say, okay, this is actually a normal occurrence, do not continue to alert me on this. Frequently, when there’s an alert, our customers have a whole conversation around that alert, looping in other members of their team and creating Jira or Linear tickets to address the issue. But where we stop is the actual incident resolution. That’s where we want to go in the future.

Kostas Pardalis 47:48
Yeah, makes sense. And what’s a— This is my last question and then I’ll give it to Eric. From your experience—because obviously, you’ve been exposed to many different users out there and issues—what’s one of the most common reasons that data goes bad?

Kevin Hu 48:11
I like that you said there are many issues, because that’s what we’ve observed too. It’s like the Tolstoy quote: “Happy families are all alike; every unhappy family is unhappy in its own way.” The same thing is true for data. There are so many reasons why data can go wrong, and it goes back to what we were saying: either someone put it in wrong, a machine did something wrong, or some logic isn’t firing correctly. But that said, across all of our customers, delays, or freshness errors, are probably the most common issue. Second is probably a schema change, in the data warehouse or upstream. And the third is a volume change, where the amount of data that’s being loaded, or that exists, is higher or lower than you expect. There’s a beautiful long tail from there, and all of that is correlated with the causes of data quality issues, which depend on the team. If it’s a one-person team, you don’t have many data engineers or analytics engineers stepping on each other’s code, but there might be many more third-party dependencies that cause issues. If you’re on a larger team, perhaps shipped bugs, actual software bugs rather than data bugs, are more frequent.

Kostas Pardalis 49:37
Awesome. Eric, all yours. I monopolized the conversation, but now you can ask all your questions.

Eric Dodds 49:47
It was fascinating. Okay, so I want to dig into Tolstoy a bit more, because that quote is an amazing quote. Isn’t it a principle? The Anna Karenina principle or something? That’s exactly what it is. Okay. So this is the reason I want to dig into that a little bit more: you’ve mentioned the word trust a lot in our conversation, and in fact that’s been a recurring theme on the show through a bunch of different iterations, I would even say from the very beginning, Kostas; it’s just one of the things that comes up consistently. What’s interesting, though, is that if we think about some of the examples we’ve talked about, you have the executive stakeholder who’s refreshing a Looker report and something’s wrong, or the salesperson who doesn’t necessarily know exactly why, but the revenue number is off or whatever. Those examples kind of represent a one-dimensional trust, almost, which is: things don’t go wrong. I trust you if nothing ever breaks. In the real world, is that sort of one-dimensional trust really a great foundation for relationships? That’s kind of the undercurrent of the Anna Karenina principle, which I know I’m stretching a little bit, so thank you for humoring me. But it’s interesting: if the reports aren’t broken, then everyone’s happy, things are good. What are the other dimensions of trust, (a) that you’ve seen, or (b) that you are trying to impact with Metaplane, or with the way that you think about data quality and lineage and those sorts of things?

Kevin Hu 51:37
I love how you brought it back to trust, because that is simultaneously a very simple problem, if you can state it simply, but also extremely complex, like you’re alluding to. You could define trust not necessarily as nothing going wrong, but as there being some contract between two parties that is violated in some way. If the contract is not explicit, then the two parties will always have implicit contracts. And unfortunately, in the data world, the implicit expectation of a data consumer is frequently that the data is just not wrong. It’s exactly what you’re saying: if the data is wrong, what am I paying you for? Why are we paying Snowflake so much money if the data is wrong? As we’re alluding to, that is not a reasonable expectation across the board. A reasonable expectation from a data consumer might be: I am aware that data is not perfect, that it will never be perfect, the same way that you will never have software without bugs in the code, so how can you expect that to be true for data? I think part of it is establishing these contracts and these expectations up front with both the data consumers and the data producers, and saying, okay, this is what you can expect from the data and how it will trend over time, and how I’ll try my best as a team to make sure it meets the demands of this particular use case. That’s a shift I would love to see in the data world: instead of talking about data being perfect or ideal, talking about it being sufficient for the use case at hand. If this dashboard is being used every hour, do we really need real-time streaming data? If this is informing more of a directional decision, as opposed to being sent to a customer, does the data have to be completely correct, or just right enough not to shatter your trust over time? So I think really reverse engineering from the outcome, and from the people who are using the data, is the most clarifying approach we’ve found to think about data quality and data trust over time.

Eric Dodds 53:56
Super interesting. Okay, let’s dig into that just a little bit more, because I’m thinking about our listeners, and even myself; we deal with such things every day. I love what you said, but my guess would be that there are a lot of people out there who, well, let me put it this way. An explicit contract requires mutual understanding, and even mutual agreement. Say it’s a real estate contract: there’s mutual agreement on, say, default and other things, which both parties need to understand well for expectations to be set well. So if we carry that analogy over to an explicit contract between the data consumer and, say, the person who’s building the data products, in whatever form that takes, one of the challenges I think a lot of our listeners have faced is that if you try to make that contract explicit, the consumer can often just say, “You know what? I don’t actually care about these definitions we’re trying to agree on.” Sometimes maybe there’s some malcontent there, but a lot of times it’s, look, I’m busy, we’re all busy, and I would love to understand your pipeline infrastructure and data drift issues and whatever, but I can’t. Can you speak to how you’ve seen that dynamic play out? I think in some ways it’s getting better as data becomes more valued across the organization, but in a lot of places there can still be a struggle to make explicit contracts a practical reality and a collaborative process inside a company.

Kevin Hu 55:46
You’re right, it is an idealistic process. However, I do think the conversation is important, not just to talk about expectations of the data, but really to understand what exactly the user of the data wants. Being on a data team is a tough job, because a classic example is: someone asks you for a dashboard. But do they really want a dashboard? Do they want this number to be continuously updating over time, with a relatively fixed set of questions that can vary a little bit but aren’t super flexible? Or do they want data activation, to push it back into Salesforce? Or do they just want a number right now that doesn’t have to change over time? Or do they want a data application that is maybe more involved but more flexible, with both inputs and outputs? That is the importance of having a conversation about expectations with your users, your stakeholders. There are some downsides, and it takes a lot of time, but once the consumers of your data feel like you really understand where they’re coming from, that is a foundation from which you can build trust. It’s like, okay, they get what I’m asking for, and in reverse, I know the amount of work that goes into producing data products. Now the trust is much less brittle. Maybe you don’t need that explicit contract, but you develop an implicit contract, so that even when something breaks, I can still trust it, because there’s a human on the other end of it.

Eric Dodds 57:40
Yeah. If only there were software that could solve the problem of time compression and mutual understanding and the investment that it takes to build that between two humans.

Kevin Hu 57:53
We were talking before this call about all of the SaaS products that exist. I really think tools are just tools: they exist because people use them to do processes more effectively and more consistently over time. If a tool doesn’t result in something actually changing in terms of people’s behavior, and this is a tool that is ultimately used by people, not machines, then is it really that important?

Eric Dodds 58:22
Yeah, totally. Okay, we’re close to the buzzer here. I want to end by asking you an admittedly unfair question, but one that I think will be really helpful for our listeners and for me. I’ll start with the unfairness: none of the answers to this question can relate to Metaplane or data lineage or data quality tooling at all. So outside of what you’re building with your life and your team, if you could give one piece of advice to our listeners out there who are working in data, in terms of building data trust, maybe even one practical thing they could do this week, before the week is over, what’s the one thing you would tell them to do? If you can only do one thing to improve trust, what would that one thing be, outside of all the data lineage stuff? So sorry for the unfair question.

Kevin Hu 59:23
No, no. At the end of the day, data lineage and observability are a technology, right? It is one technology that can be used to solve a much broader problem that can’t be solved by one tool or even ten tools. I would say: conduct some user interviews. If you have a week or two weeks, have one-on-ones with every person at the company who could be using your data, or who is not using the data as much as you would like or in the ways that you would want, and sit down and really approach them as if you were a founder building a product for a customer. What do you really want? What problem are you trying to solve? How will you know that you’ve solved that problem? And how can I improve the product that I’m developing for you? That, I think, is the process we’ve seen our customers, especially the very high-performing data teams, go through over time, and they start from a position where the trust is theirs and it’s theirs to lose, as opposed to starting from zero and having to build it up over time.

Eric Dodds 1:00:35
It’s super helpful. All righty. Well, thanks for giving us a couple of extra minutes for me to ask you an unfair question. This has been such a great conversation, and best of luck with Metaplane. It sounds like an awesome tool, and it sounds like you’re doing great stuff.

Kevin Hu 1:00:48
Thanks, Eric. Thanks, Kostas. This has been an amazing conversation, and thanks for having me on. I’m such a fan.

Eric Dodds 1:00:55
Absolutely. Kostas, of course, I have to bring up tilapia, and the fact that you can drop a tilapia into a tank and they all start to behave the same, which is interesting, and which is actually pretty similar to VCs with new data technology. You drop in a new data technology and all the VCs start to behave the exact same way, which is really interesting. So that was one takeaway.

Kostas Pardalis 1:01:26
I think we should rename FOMO to the tilapia effect or something.

Eric Dodds 1:01:30
VC FOMO: the tilapia effect. I love it. So that was one thing. On a more serious note, I thought the discussion around implicit and explicit contracts was really helpful. We talked about the way data professionals interact with other teams, the way tooling facilitates those interactions, and so on. It was helpful for me, even in my own day-to-day work, to think about what implicit contracts I have with other people in the organization, whether they’re consumers of data that I produce, maybe my boss, or the data producers whose data I consume. So that was really helpful for me.

Kostas Pardalis 1:02:12
Yeah, 100%, I think that’s a big part of building organizations. I’m pretty sure you have experienced that, building companies from scratch. In scaling a company or a team, a big part of it is actually figuring out all these contracts and making them more explicit. When we say we need process to make things scale, that’s pretty much what we’re talking about, right? When you’re alone and running the whole growth function on your own, you kind of have these contracts with yourself, and then another person joins, and then another, and suddenly the contract is not exactly the same; that’s where friction starts. One of the first steps when you’re trying to scale an organization is actually doing that. I think that’s just human nature: it’s something we see with data, something we see with software, something we see with everything. So yeah, 100%, that was an extremely interesting part of the conversation, outside of all the rest we talked about, like where lineage and observability are going and how they all work together. The other very interesting point for me was how tied these products are to foundational products like the data warehouse, and whether the warehouse exposes its metadata and how that can be used to deliver even more value in observability and all these things. So yeah, always interesting to chat with Kevin, and I hope to have him back again soon.

Eric Dodds 1:03:52
Agree. All right. Well, thank you for listening. And if you like the show, why don’t you tell a friend or a colleague about it, we would love for you to share the episodes that you like the most with people you care about, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.