Episode 108:

You Can’t Separate Data Reliability From Workflow with Gleb Mezhanskiy of Datafold

October 12, 2022

This week on The Data Stack Show, Eric and Kostas chat with Gleb Mezhanskiy, CEO and Co-founder of Datafold. During the episode, Gleb discusses adoption problems, the importance of a data engineer, and how to implement new technology into your company.

Notes:

Highlights from this week’s conversation include:

  • Gleb’s background and career journey (2:51)
  • The adoption problems (10:53)
  • How Datafold solves these problems (18:08)
  • The vision for Datafold (26:27)
  • Incorporating Datafold as a data engineer (38:53)
  • The importance of the data engineer (42:12)
  • Something to keep in mind when designing data tools (46:46)
  • Implementing new technology into your company (53:18) 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Today we are talking with Gleb from Datafold. Datafold is a data quality tool, and they have a super interesting approach to data quality. One of my burning questions is around when to implement data quality tooling and processes in the lifecycle of a company. Because, and you and I both know this from experience, a lot of times you approach it in a reactionary way, right? Something breaks, a dashboard breaks, you're trying to do some sort of analysis, you want to launch something new, and you run into data quality issues that really limit what you're doing. And so then you begin to implement those processes. I know that Gleb works with companies who are trying to tackle data quality all across the spectrum, so I just want to hear from him: what does that look like today? Do you have to be reactionary? Is it worth the time and cost it takes to do it proactively? So that's what I want to learn about. How about you?

Kostas Pardalis 1:26
Yeah, I want to hear from him how you start building the technology and the product around data quality, because data quality is one of these things that's so broad. There are so many different ways that you can implement it and so many different parts of the data stack where you can start working on it. So I'd love to hear from him about his experience in starting a business and a product around it, and the hard decisions that you have to make in order to start. So yeah, I think we'll start from there, and then I'm sure we will have plenty of opportunities to go much deeper into the product itself, the technology, and all the different choices made around it.

Eric Dodds 2:14
I agree. All right. Well, let’s dig in and talk with Gleb.

Gleb, welcome to The Data Stack Show. We're so excited to chat with you about all things data, and specifically Datafold.

Gleb Mezhanskiy 2:27
Thanks for having me. We’re excited.

Eric Dodds 2:30
Alright, well, let's start where we always do: we'd love to hear how you got started in data in your career, and what you were doing before Datafold. So yeah, we'd just love to hear about the path that led you to start Datafold.

Gleb Mezhanskiy 2:49
Yeah, absolutely. So my original academic background is economics and computer science, and I started my career around 2013 as a data engineer. I joined a company called Autodesk, which focuses on B2B software for creators, but at the time they were putting together a consumer division. And I ended up essentially creating a data platform almost from scratch for the consumer division, tying together all the different metrics from the different apps that Autodesk had acquired. It was a really interesting time, because if you remember 2013, I think Spark had only just been released, Looker had just come out, Snowflake was still in stealth. A lot of the tools and companies and technologies that we now consider really foundational were, at that time, super early stage, super cutting edge. So it was a really exciting time to tinker with data. After a year at Autodesk I moved to Lyft, where I ended up being one of the founding members of Lyft's data team. At that time it was at a hyper-growth stage and data infrastructure was barely there. We had one Redshift cluster, we were building everything on top of it, and it was constantly on fire. I remember days when essentially the entire analytics org would basically grind to a halt because Redshift was completely underwater from all the queries that everyone was trying to run. I was initially tasked with building all sorts of data applications, from forecasting supply and demand to understanding driver behavior. But I was so frustrated, not with the quality of the data, but with the tools that I had at my disposal. And that wasn't necessarily Lyft's fault, that was basically what was available to data engineers at the time, off the shelf; the experience was quite terrible, from not being able to iterate quickly, to not being able to understand the data or trace dependencies. All of that was extremely time-consuming. And so I kind of gravitated naturally to building tools, or simple tools, for my team members. For example, I built a dev environment where they could run their ETL jobs before pushing them to production. Before that, we basically tested things live in production, which was a really bad idea. I also had some really, really bad war stories that I think ultimately led me on my journey to build Datafold. One of them, which I share quite often: we had this practice of data engineers being on call. Data at Lyft was so important that the entire company pretty much ran either on fully automated decision-making and machine learning, or on meetings organized around reviewing dashboards to see how the company performed. So delivering data on time, by certain SLAs, has always been super important, and that's why we had on-call engineers making sure that whenever a pipeline was clogged or late, we could actually address it. So I was the on-call engineer one day, and I was woken up by a PagerDuty alarm at, I think, 4 am, because some really important job computing rides had failed. I traced the error, found some bad data that had entered the pipeline, and implemented a two-line SQL filter. I pushed the change, everything was good, did some quick sanity checks, got a plus one from my buddy, and then went back to sleep. Everything seemed normal and green. And then, you kind of see where this is going.
So yes, the next day I came in to work, and probably two hours into the work day I was forwarded an email from, I think, the CFO, who was looking at dashboards where everything was all over the place and looked really weird. And the craziest thing is not that I managed to break lots of dashboards. The craziest thing is that it took us about six hours, sitting in a war room with me and a few other senior data engineers trying to understand what was going on, to actually pinpoint the issue to the hotfix that I had made. And obviously, if I was able to make such a trivial mistake, bring down so many tables, and have a really big business impact with just a two-line SQL hotfix, then if you extrapolate to how much loss is happening just due to data breakage across the industry, it's actually enormous, and it's really easy to break things. Luckily for myself, I wasn't fired back then; I was actually put in charge of building tools to make sure that this exact error doesn't happen again. So we introduced a real-time anomaly detection system, which was helpful, and then focused on improving the developer workflow to make sure that developers don't introduce such issues into production in the first place. One of the interesting learnings that I had, which ultimately formed the way Datafold approaches the topic of data quality, is this: the first reaction when we had this issue, and I was not the only one breaking things, was that we need a system that would catch anomalies in production, something that would detect when things are broken. Because we don't want the CFO to forward us an email with a dashboard screenshot and say, hey guys, I think this is wrong. That's a really bad way to learn about issues from stakeholders. So we implemented a really sophisticated real-time anomaly detection system that would compute metrics in real time using a bunch of layers, both from our ETL-transformed data as well as from some streaming events. And that was somewhat impactful, but we really struggled to get adoption and to make as much of an impact with it as we had hoped. The ultimate problem was that the system kicked in too late in the process, by the time something had already broken, so users really struggled to see the value there. And the second challenge was that it, in a way, existed outside of the workflow. So if someone introduces a breaking change, like I did, or a bug, and they have to learn about this from a system that detects the bug in production, they have to drop whatever they're doing, go outside of their workflow, and then focus on investigating whatever anomalies it finds. We found that disrupting the workflow is actually a really expensive way to get on top of data quality issues. And so we started focusing, back at Lyft, on building tools that would actually prevent things from breaking, and that also informs what we're doing at Datafold. Our philosophy is proactive data quality and shifting left, which means that as much as possible we try to detect issues very early on in the process, ideally in the IDE when someone types SQL, but at least in staging or in the pull request, not in production where things have already done the damage. But that's another story.
That's probably a longer answer than the question called for, but that's pretty much how I came to start Datafold.

Eric Dodds 10:42
Love the story. I'd love to dig into the adoption problems a little bit more, because I think that's really interesting, in large part because there are a number of data observability and data quality tools that are betting on anomaly detection as the primary way to solve that problem. Would you say, and maybe this is just me rephrasing what you already said, that the anomaly detection as you described it at Lyft was actually just a less manually intensive way of doing alerting? It was almost a more efficient way of doing alerting, but alerting tells you that something is already broken. And that's why people are kind of like, well, we already get alerted when something's broken; the fact that we can do that more efficiently isn't exciting. Is that how you would describe that dynamic? Or I'd just love to know more about that adoption problem.

Gleb Mezhanskiy 11:50
Yeah, I think there are a few challenges here. Well, I think in general, alerting and monitoring is valuable. We actually were able to catch really bad incidents in production where, for example, a really important step in the billing process didn't run, and that would have led to massive potential loss for the company, because we would basically have failed to bill riders one morning. That system actually helped us detect not just a data quality issue but a real production issue. So I'm definitely not discounting the value of alerting. I think the challenge is that a lot of times we forget about where data quality issues, or general issues, are coming from. And I like to think about this in a very simple form: ultimately, when you think about data quality, although it's a very big topic and a large problem, it's ultimately either we break data or they break data. When I say we break data, I mean that we, meaning data developers, people who actually do development on the data pipelines or touch data in various ways, introduce changes that change the definition of things. Like, we change how a session is calculated, and that really throws off all our calculations around conversion. Or maybe we change the schema of an event because a microservice can no longer provide a given field, and then some machine learning model that relied on it is no longer doing machine learning. Things like that. So that's the "we break it" category. And then there's also a category which I call "they break data," when external things happen. Those external things are, for example, we may buy data from a vendor, and that vendor ships some bogus data set that doesn't meet our expectations, which is completely outside of our control. Or, let's say, we're running Airflow, and Airflow is known to have a really funky scheduler: sometimes things don't get scheduled, or sometimes jobs may be marked as completed but are actually left in a hanging state. Not really our fault, it's an infrastructure fault. So I think where data monitoring really is helpful is in detecting things that are in the "they break data" category, things that are outside of our control, as well as maybe being the last line of defense for the stuff that we break, that people break. But I think what we're missing sometimes, the counterintuitive challenge, is that we tend to attribute the failures to external factors when in fact it's us, genuine people in the company, breaking things one way or the other. And so the major learning for me was that we really need to invest in building systems that are more anti-fragile and more robust to breaking. That means eventually having better systems for how we ship data to production, so in general improving our jobs and having data contracts. But the part I'm most excited about is improving how we develop data. So improving the development process, in particular how we introduce changes to data products, be that events, be that transformations in SQL or Spark, or even end-user data applications like Looker dashboards or machine learning models. That's probably my biggest bet: I see a huge opportunity in improving the status quo of "everything is broken" by really improving the change management process. And if you do that, then there are fewer things breaking in production.
And the other challenge that we saw is that data is inherently noisy, right? And it's always changing. So when we're talking about data monitoring, we're typically talking about unsupervised machine learning, where a system would learn a pattern of the data. By pattern I mean anything from the typical number of rows a data set gets daily, to the typical distribution of a given dimension or metric column in a data set. And then when reality doesn't conform to that baseline, we get an alert, right? But given that our business is always changing, especially in high-growth companies, and given that we are operating in a world where, for a data team, not even at a unicorn but at an earlier-stage startup, it's not uncommon to have thousands of tables in production and tens of thousands of columns, we can find anomalies there all day long, right? The real challenge is: how do you actually identify what is important? What is worth fixing? Is it a business issue or a data quality issue? And that's probably what makes or breaks a lot of data quality platforms. That's what really held us back with the old real-time anomaly detection system. And that's also why I think change management is so important: if we can prevent preventable errors early on in the dev cycle, before they reach production, then we have less to deal with in production. It's just fewer things to worry about.
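
To make the baseline-style monitoring Gleb describes concrete, here is a minimal sketch of an unsupervised check on a single table metric, daily row count, assuming a hypothetical daily_row_counts() loader; a real monitoring system would track many metrics per table and column (distributions, freshness, null rates) rather than just this one.

```python
# Minimal sketch of baseline anomaly detection on daily row counts.
# daily_row_counts() is a hypothetical loader that would query the warehouse.
import statistics

def daily_row_counts(table: str, days: int = 30) -> list[int]:
    """Return the last `days` daily row counts for `table` (hypothetical helper)."""
    raise NotImplementedError

def is_anomalous(table: str, todays_count: int, z_threshold: float = 3.0) -> bool:
    history = daily_row_counts(table)
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # flat history -> avoid dividing by zero
    z_score = abs(todays_count - mean) / stdev
    return z_score > z_threshold  # alert only on strong deviations from the baseline
```

In practice, as Gleb notes, the hard part is not computing a score like this but deciding which of the resulting alerts actually matter.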

Eric Dodds 17:30
Yeah, absolutely. No, that's super helpful. Well, I probably should have asked this question earlier, and I think we've already touched on many of these things, but can you tell us how Datafold specifically solves these problems? You described the problems really well, but I'd love to hear what the product does specifically, and how you've built it in response to the things that you've experienced.

Gleb Mezhanskiy 18:05
Yeah, absolutely. So to explain how Datafold approaches it, it probably makes sense for me to outline certain beliefs that I have about the space. The three principles of reliable data engineering that I hold are these. One: to really improve data quality, we need to improve the workflow. So not invent tools that sit outside of the workflow and flag problems after the fact, but look at how people go about writing SQL models, or modifying SQL models, or developing dashboards, and improve their workflow so that they are much less likely to introduce bugs, and so that it's less painful for them to develop and they can develop faster. An example of that: right now we've pretty much adopted the notion of version control in the data space. Maybe even five years ago that was kind of novel, but by now everyone agrees that we should version control everything: events, transformations, even BI tools, event schemas, everything. And that means we have the ability to stage our changes and to have a conversation about the changes in what's called a pull request or a merge request. That staging point is exactly the sort of circuit breaker within the workflow where you can improve things; that's one example. I think the other principle that's really important is that we have to know what we ship. It's kind of obvious, and it's kind of humiliating to admit, that we, data engineers, analytics engineers, a lot of times don't really know what we're shipping. We think we know our data, but data is far more complex. It's not uncommon for us to write a SQL query against a new data set without really knowing what's in there. We make certain assumptions: I think this column has a certain distribution, or I think this column always has values. But a lot of times we actually don't know, and those assumptions are wrong. And if we develop our data products with wrong assumptions, we are just setting ourselves up for bugs and errors. So I think it's really important to know the data that you're working with. An example of how Datafold helps there is profiling: anytime you're dealing with a new data set, we visualize all the distributions in the data set and provide you with certain quality metrics. For example, what's the fill rate of a column? Is this column unique or not? That really helps avoid a lot of errors when we are writing code or building dashboards on top of a data set. The other important part of knowing what you ship is understanding dependencies within the data, because it's hard enough to understand a given data set, but it's even harder to understand where the data comes from, what the actual source of a given data set is, as well as where it goes. If I don't understand where it comes from, I might not know how it actually represents my business; if I don't know which events, for example, a given table is based on, I may not know exactly what this table is describing. If I don't know who is consuming my data, I'm likely to break something for someone eventually. So understanding dependencies within the data platform is super important, and that is solved with lineage.
So Datafold provides column-level lineage, which essentially allows you to answer, for any given column in any given table, how it's built and what downstream data applications, such as dashboards and machine learning models, depend on it. That is really, really foundational information. And then, finally, I think the third principle is automation. No matter how great your checks or your processes are, unless it is so easy that people don't actually need to do anything, it just happens automatically, they won't do it. So we can't assume people will test something; we have to test it in a pretty much mandatory way. Software engineering figured this out a long time ago, right? When we ship software, we have CI/CD pipelines that build staging, run unit tests, integration tests, sometimes even UI tests, automatically, for every single commit. We need to get to this point with data too, because it's the only way to actually catch issues and not rely on people to do the right thing. The example of how this comes together in Datafold is that whenever someone makes a change to any place in their data pipeline, let's say a SQL table model in dbt, or a Dagster or Airflow job, Datafold provides full impact analysis of that change. We call it Data Diff: we show exactly how a change to the source code, let's say SQL, affects the data produced. Let's say I changed a few lines of SQL and now the definition of my session duration column changed. The question is: am I doing the right thing? So we compute that diff and show exactly how the data is going to change. And we're not only doing this for the table or the job that you're modifying; we're also providing impact analysis downstream. We show you, okay, if you're changing this table and this column, these are all the downstream, cascading changes that you'll see in other tables and ultimately in dashboards, and maybe also in reverse ETL applications. So we can trace how data in Salesforce that gets synced five layers down will be affected by any given change to the data pipeline. And that all happens in CI, so we don't need people to do anything; it's completely integrated with the CI/CD process, and it posts the report back into the pull request so that the developer knows what they're doing. Very importantly, they can also loop in other team members and give them full context about the change. This is really important because pull request reviews for data engineers are almost a meme topic: no one understands what's going on, and it's really hard to get the full context of a change unless you really see what it does. But now with Datafold, teams can have a conversation about what the change is going to do, loop in the stakeholders, and make sure that nothing breaks and no one is surprised. That's an example of how Datafold does it.
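
As a rough illustration of the "data diff in CI" idea, a check might compare the table built from the pull request branch against the current production table and summarize the impact; the sketch below assumes a hypothetical scalar() helper that runs a SQL query in the warehouse, and it is not Datafold's actual implementation.

```python
# Illustrative CI data diff: compare a staging table (built from the PR branch)
# against production and summarize row-level impact. scalar() is a hypothetical
# helper that runs a SQL query and returns a single number.

def scalar(sql: str) -> int:
    """Execute a query in the warehouse and return a single numeric result (hypothetical)."""
    raise NotImplementedError

def diff_summary(prod_table: str, staging_table: str, key: str) -> dict:
    return {
        "prod_rows": scalar(f"SELECT COUNT(*) FROM {prod_table}"),
        "staging_rows": scalar(f"SELECT COUNT(*) FROM {staging_table}"),
        # keys that appear only in the staging build
        "added_keys": scalar(
            f"SELECT COUNT(*) FROM {staging_table} s "
            f"LEFT JOIN {prod_table} p ON s.{key} = p.{key} WHERE p.{key} IS NULL"),
        # keys that disappear in the staging build
        "removed_keys": scalar(
            f"SELECT COUNT(*) FROM {prod_table} p "
            f"LEFT JOIN {staging_table} s ON p.{key} = s.{key} WHERE s.{key} IS NULL"),
    }

# A CI job would run this for each modified model and post the summary as a
# pull request comment, so reviewers see the data impact before merging.
```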

Kostas Pardalis 24:59
You've been describing, for quite some time now, all the different dimensions involved in data quality. And I think we all understand that data quality is just complicated, right? Data infrastructure gets more and more complicated, more moving parts. The stack, we can call it whatever we want, but the only certain thing is that there are too many moving parts that have to be orchestrated and work together, and too many actors that interact with the data, and each one of these actors might change something in a very unpredictable way. So my question is: having decided to build a product and a business around that, what's your vision? What's your goal, let's say, five years from now? What would you like to see the product deliver to your customers? And second, where do you start in realizing this vision? Because obviously it's not something that you can build from one day to the next, right? So I'd love to hear a little bit about how you, as the founder of the company, with obviously an obsession with this problem, try to realize this vision.

Gleb Mezhanskiy 26:25
Yeah, absolutely. So my vision for Datafold is to build a platform that automates data quality assurance, from left to right, in two dimensions. The first dimension is the developer workflow. If we consider where Datafold sits now, we're really catching things in staging: when someone is opening a pull request, how can we catch as many errors as possible before that pull request gets merged? We can go left from here, meaning we can detect issues even earlier, for example as someone is typing SQL, by plugging into their IDE and providing decision support at that point so that they don't even write buggy code in the first place. That's where the software industry is moving, right? We have code completion as well as static code analysis flagging all sorts of issues in the IDE, and I think we'll get there in the data space. We have a lot of the fundamental information to do that; what we need is to really plug into the workflow and into the tools where people do that development. Going to the right in the developer workflow also means that once someone ships the code to production, we're also going to provide continuous monitoring and make sure those data products stay reliable. To the point of data monitoring, I still believe it's very valuable; we just decided to start a little bit to the left of that, earlier in the development lifecycle. My vision for a good data monitoring solution is that it's not only about detection, I think that's the easy part. The much harder part that I'm excited about solving is root cause analysis and prioritization. Because, again, it's not helpful to receive 100 notifications about anomalies on a daily basis. We need to basically say: hey, you have 100 potential issues, but really you should focus on these top three, because this one goes into the CFO dashboard, this one powers your online machine learning model, and the other one gets synced to Salesforce. The way you do that is by relying on column-level lineage and the metadata dependencies, because we understand how the data is used. And then the other part is root cause analysis: is this a business issue or a data quality issue? Is it isolated to a given ETL job, or is it actually propagating from multiple places, maybe pinned to source data like an event stream or a vendor data set? Doing that, again relying on lineage and more sophisticated analysis, is what we are really excited about. That's the first dimension I mentioned: going to the left, all the way to when people start development, and then going to the right, into production. Now, the second dimension of what we want to cover is the data flow itself. If you consider the data flow, ultimately we have source data like events, we have production data extracts, we have vendor data, SaaS data, and other kinds of exported data; that all gets into the warehouse, then we have the transformation layer, and ultimately the application layer, meaning BI tools, dashboards, reports, machine learning, reverse ETL. Right now, Datafold really shines in the transformation piece, and we've already extended into the source data and into the application layer.
So for example, on the source part, the big problem is how do we integrate data reliably into the warehouse, because everyone is moving data around. For example, it's very customary to copy data from transactional stores into the warehouse, say from Postgres to Snowflake. Doing this at scale is really hard, because you have to sync data from data stores that are changing rapidly, at really high volume, into another data store. And what we've noticed is that no matter whether teams are using vendors or open-source tooling, pretty much everyone runs into issues there, and there hasn't been tooling that would help validate the replication and make sure that the data you end up with in your warehouse actually stays reliable. So about three months ago we shipped a tool called data-diff, in an open-source repo under the MIT license, that essentially solves this problem of validating data replication between data stores, and does it at scale and really fast. You can basically validate a billion rows in under five minutes between systems like Postgres and Snowflake. So this is an example of how we address the source and target data; there are more problems there, but this is a really major one that we saw. And then if we go to the right, from the data transformation layer to data apps, the goal there is to understand how data is used: once data leaves the warehouse, where it goes and how it's used. So we've been building integrations with all sorts of tools, from BI to the data activation layer. And that's really important, because once you can paint the picture of how every single data point is consumed, you can catch issues way earlier. For example, we recently shipped an integration with Hightouch, the data activation company. So now, if anyone is modifying anything in the data pipeline that eventually affects the sync to, let's say, Salesforce or MailChimp, they will see a flag that that sync will be impacted. So that's my vision: going from left to right, both in how the data flows and in how people work with data.
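
As a rough sketch of how column-level lineage supports that kind of impact analysis, lineage can be modeled as a directed graph from upstream columns to downstream columns and applications, and a change's blast radius is simply the set of reachable nodes; this is illustrative only, not Datafold's internal representation, and the table and dashboard names are made up.

```python
# Illustrative sketch: column-level lineage as a directed graph used to find the
# downstream blast radius of a change. Names are hypothetical.
from collections import defaultdict, deque

class LineageGraph:
    def __init__(self):
        # maps an upstream node (column, dashboard, sync) to its direct consumers
        self.consumers = defaultdict(set)

    def add_edge(self, upstream: str, downstream: str) -> None:
        self.consumers[upstream].add(downstream)

    def impacted_by(self, changed_node: str) -> set[str]:
        """Every column, dashboard, or sync reachable downstream of the changed node."""
        seen, queue = set(), deque([changed_node])
        while queue:
            for consumer in self.consumers[queue.popleft()]:
                if consumer not in seen:
                    seen.add(consumer)
                    queue.append(consumer)
        return seen

# Hypothetical usage: a change to sessions.duration cascades to a metrics table,
# an executive dashboard, and a reverse ETL sync to Salesforce.
g = LineageGraph()
g.add_edge("events.ts", "sessions.duration")
g.add_edge("sessions.duration", "daily_metrics.avg_session_duration")
g.add_edge("daily_metrics.avg_session_duration", "dashboard:executive_kpis")
g.add_edge("daily_metrics.avg_session_duration", "sync:salesforce.accounts")
print(g.impacted_by("sessions.duration"))
```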

Kostas Pardalis 32:34
All right, that's super interesting. Let's dive a little bit deeper, starting from the left. You mentioned data-diff, which is an open-source tool that can help you run diffs very efficiently and fast between a source database like Postgres and a destination like Snowflake, correct? And you said that it's really hard, no matter what kind of technology or vendor you're using, to connect a transactional database with a data warehouse and make this work at scale without issues. Can we dig a little bit into the issues that you have seen there? Because I think people tend to take that stuff for granted.

Gleb Mezhanskiy 33:26
Yeah, so basically, what happens is you have a transactional store that underpins your, let's say, microservice or monolith application, and that transactional store gets a lot of writes, a lot of changes to data. And then you're trying to replicate this into an analytical store, which might not actually be that great at handling a lot of changes, as opposed to just ingesting data. The way those replication processes typically work, both for vendors and open-source tooling, is by relying on a change data capture event stream: every time anything changes in the transactional store, it emits an event, for example that a given value in a given row changed to a certain value, and all of these events are transported over the network using different streaming systems into your analytical warehouse. So the types of issues can be as simple as event loss. Sometimes these pipelines are built on messaging systems that don't guarantee, for example, ordering or exactly-once delivery, and if those are guarantees you lack, then eventually you will have certain inconsistencies; you just have to accept that they happen. The other type of issue, which is very common, is soft deletes versus hard deletes. For example, if a row is deleted from the transactional store, making sure that it's also deleted from the analytical store it gets synced to is actually a really hard problem, because the change data capture stream might not be able to capture that. Those are just two basic examples of when things can go wrong, and then you have the infrastructure layer on top of that: what if you have an outage in an event stream, or a delay? How do you know what data is impacted? And the reason this gets critically important is because transactional data is considered the source of truth. If you have a table that underpins your production system and you record users in it, you rely on that as your source of truth about users, because everyone knows events are kind of messy and lossy. So if that data becomes unreliable in your analytical store, you've kind of lost your last-resort data source. That's why it's really paramount for data teams to keep that data reliable. And what makes things complicated, how you approach data quality there, is: how do you actually check that the data is consistent? Even if you think your replication process is reliable, we know that all things break eventually. So how do you make sure the two sides are consistent? How do you measure consistency? It's actually a very hard problem, because you might have a billion-row table in Postgres and a billion-row table in, say, Databricks or your data lake. You can run a count star, but that just gives you the number of rows. Validating consistency across that volume of data and being able to pinpoint it down to an individual value and row is very hard, and it's very hard to do in a performant way, because you also can't put a lot of load on the transactional store. So the way data-diff solves it is by relying on hashing, but doing it in a pretty clever way. Obviously, you can hash both tables and compare the hashes, and if there is any kind of inconsistency you'll detect it. But the problem is that a single hash just won't match:
it won't match if you have one record out of a billion missing, and it won't match if you have a million records missing, and you want to know the magnitude and location of the inconsistencies. So that's why we aren't doing just a single hash: we're actually breaking tables down into blocks, checking the hashes of those blocks, and then pretty much doing a binary search to identify, down to the individual row, where the data is inconsistent. What you achieve with that approach is both speed, so you can do a billion rows in under five minutes, and accuracy, so you can pinpoint the individual rows and values that are off. It's a really hard balance to get right, so we're really excited about that.
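
To make the block-hashing idea concrete, here is a minimal sketch of the technique Gleb describes, checksumming key ranges on both sides and recursively narrowing mismatches; it assumes a hypothetical fetch_rows() helper that queries each database, and it is an illustration of the approach rather than the actual data-diff implementation.

```python
# Illustrative block-hash diffing between a source and target table.
# fetch_rows() is a hypothetical helper that runs a SQL query against one side.
import hashlib

def fetch_rows(conn, table, key_col, lo, hi):
    """Return [(key, row_as_string), ...] for lo <= key_col < hi (hypothetical)."""
    raise NotImplementedError

def block_hash(rows):
    """Checksum a block of rows so the two sides can be compared cheaply."""
    h = hashlib.md5()
    for key, row in rows:
        h.update(f"{key}:{row}\n".encode())
    return h.hexdigest()

def diff_range(src, dst, table, key_col, lo, hi, min_block=1000):
    """Recursively narrow mismatching key ranges down to individual rows."""
    src_rows = fetch_rows(src, table, key_col, lo, hi)
    dst_rows = fetch_rows(dst, table, key_col, lo, hi)
    if block_hash(src_rows) == block_hash(dst_rows):
        return []  # this whole block is consistent, skip it
    if hi - lo <= min_block:
        # Block is small enough: compare row by row to report exact differences.
        src_map, dst_map = dict(src_rows), dict(dst_rows)
        return [k for k in set(src_map) | set(dst_map) if src_map.get(k) != dst_map.get(k)]
    mid = (lo + hi) // 2  # binary-search style split
    return (diff_range(src, dst, table, key_col, lo, mid, min_block)
            + diff_range(src, dst, table, key_col, mid, hi, min_block))
```

In a real implementation the checksums would typically be computed inside each database as SQL aggregates, so that only hashes, not rows, cross the network.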

Kostas Pardalis 37:59
That's awesome. And okay, let's say I'm a data engineer and I hear about this amazing data-diff tool. How do I incorporate it in my everyday work? Take a typical scenario: Postgres, using Debezium with Kafka, writing to S3 and loading the data into Snowflake. I would assume that this is a very typical setup, and definitely many things can go wrong. I mean, there's latency, first of all, right? There's no way that the moment something happens in Postgres it will also be reflected in Snowflake. So I have this amazing new tool and I can use it. How do I become productive with it? How do I embed it in my workflow?

Gleb Mezhanskiy 38:51
Yeah, absolutely. So when you're dealing with streaming data replication, there is no way around watermarking. By that I mean drawing a line and saying that all the data older than this timestamp should be consistent, because we do expect lags and delays. That's the first thing. Once we've established what the watermark is, you basically say, okay, I expect data older than that point to be consistent, and you connect data-diff to both data stores. We use typical drivers for that, for Postgres, let's say, and Snowflake. Then you can run the diff process, which typically completes within seconds or minutes. And then, ideally, because we're talking about a continuous data check, you would run it on a schedule. So you would probably want to wrap it in some sort of orchestrator like Airflow or Dagster and run it every hour, every day, or every minute, depending on your SLAs for quality. It's essentially a Python library that also has a CLI, so it's really easy to embed in an orchestrator, or you can even run it in a cron job. So it's quite flexible.
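
A minimal sketch of that scheduled, watermarked check might look like the following; compare_below_watermark() is a hypothetical stand-in for whatever diffing routine is used (the real data-diff library and CLI have their own interfaces), and the connection URIs and table name are made up.

```python
# Illustrative scheduled consistency check with a replication-lag watermark.
# compare_below_watermark() is a hypothetical diff routine, not the data-diff API.
from datetime import datetime, timedelta, timezone

REPLICATION_LAG_ALLOWANCE = timedelta(minutes=30)  # tolerate expected CDC delay

def watermark() -> datetime:
    """Only rows updated before this timestamp are expected to be consistent."""
    return datetime.now(timezone.utc) - REPLICATION_LAG_ALLOWANCE

def compare_below_watermark(source_uri: str, target_uri: str, table: str,
                            updated_before: datetime) -> list:
    """Hypothetical: diff rows with updated_at < updated_before, return mismatched keys."""
    raise NotImplementedError

def check_replication(source_uri: str, target_uri: str, table: str) -> None:
    mismatches = compare_below_watermark(source_uri, target_uri, table, watermark())
    if mismatches:
        # In practice: fail the orchestrator task, page on-call, or log metrics.
        raise RuntimeError(f"{table}: {len(mismatches)} rows inconsistent below watermark")

if __name__ == "__main__":
    # Run from cron or an orchestrator task on whatever cadence your SLA requires.
    check_replication("postgresql://...", "snowflake://...", "public.rides")
```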

Kostas Pardalis 40:08
All right, that's great. And actually, you've given me the right material for my next question, which is about developer experience and the importance of the data engineer in this whole flow of monitoring data quality. Again, I might start getting a little bit boring by repeating this, but data is something that can go wrong for many different reasons, because it interacts with so many different actors. And when I say actors, I mean actual people, like a CFO who did or didn't make a mistake in a query, as well as technical causes, like a cron job that ran at the wrong time so the snapshot of the data wasn't consistent at that point, or whatever, right? There are technical reasons and there are human reasons for things to go wrong. But the data engineer, at the end, is the person who's responsible for delivering, let's say for safeguarding, the infrastructure and also the quality of the data. What I hear when you talk is that you put a lot of effort into building the right tooling specifically for data engineers. It's not like you're building, let's say, a platform that's going to be used by an analyst or by someone who only knows Excel and does stuff in Excel. We are talking about data engineers here, we are talking about developers, people who have a specific relationship with software engineering. You mentioned, for example, CI/CD best practices, pull requests, data diffs, and all that stuff. So first of all, I'd like to hear from you: why do you believe the data engineer is so important in realizing the vision behind the product and the company? Unless I'm wrong, of course, so correct me if I'm wrong.

Gleb Mezhanskiy 42:08
Yeah, Kostas, you bring up a really great point. The reality is that it used to be, even five years ago, that data engineers were the people who owned the entire data pipeline: they would build everything, be responsible for everything, and have control over everything. As you point out, the reality now is very different. We have people to the left of the data engineer, software engineers, contributing a lot by instrumenting events in the microservices, and owning the tables that we talked about being copied from Postgres, and those software engineers might not actually have the data context. And then to the right we have less technical people on the data consumption side: analysts, analytics engineers, and even people from other teams, like financial analysts, who now have to become familiar with data pipelines because they need to rely on analytical data to get their job done. I think the right way to think about what Datafold ultimately solves and improves is the workflow of a data developer, and I define that really broadly. It can be a software engineer who defines an event that eventually gets consumed by analytics, or it can be a financial analyst who contributes to a dbt job because that informs how they build a financial model. The reason why data engineers and analytics engineers are really central in this conversation is that even though multiple teams touch the data, that is the persona and that is the team that builds the majority of the data stack and shapes how data is consumed and produced. They are almost the center of the collaboration around data. So giving them tools that empower them to do their job faster is really important for us. But what Datafold does actually goes beyond that. I don't want to use a corny phrase, but I'll say it: democratization. Because if it's so easy to test a change to the data pipeline, no matter who makes it, software engineer, analytics engineer, data analyst, data engineer, then it's much easier and less risky for other roles and other teams to collaborate on the data pipeline than before. One of the worst bottlenecks we're seeing in data-driven companies is that data engineering teams become really bad bottlenecks for the business: data scientists cannot build ML models unless they have the data, so that work gets pushed down to data engineers, and data engineers don't let other people contribute to the pipelines because they're afraid people may break stuff. So they basically become a bottleneck, and the business doesn't move as fast as it could. If we make change management and data testing easy and automated, regardless of who's making the change, then anyone who touches the data will do a better job, and we will be catching errors for everyone. So basically we're able to elevate the quality throughout the pipeline, and we can help the business move faster because more people will be able to reliably contribute to building data applications. Does that make sense?

Kostas Pardalis 45:37
Yeah, absolutely. Absolutely. And one last question from me, and then I'll give the stage back to Eric, because I'm sure he has many questions. And we should also mention that, as you say, there are many people touching the data inside the organization, and probably some of the most dangerous ones are marketeers who really want to touch the data, so we need to hear from Eric's experience of ruining data. But before we do that, one last question, which is a little bit of a selfish question from me as a product person who builds stuff for developers and engineers. Can you share, based on your experience so far, what is really important to keep in mind when you're designing an experience and a product for developers and engineers? What is, let's say, the first thing that comes to your mind when you start brainstorming about a new tool for data engineers?

Gleb Mezhanskiy 46:44
Yes, it's hard to say what the most important thing is, but I'll call out two. One, and I think you called it out, Kostas, is that the data stack is so heterogeneous: so many tools, so many different databases and data application frameworks. How do you generalize your product so that it works as equally well as possible with most of them? How do you distill your vision into certain principles and patterns that could work with any tool and would allow you, especially in the data quality space, to improve the workflow regardless of what stack people are using? And ultimately you will inevitably have to bet on a certain stack. For example, Datafold really focuses on what is called the modern data stack, so cloud data warehouses or really mature modern data lakes, like what you can get with Trino these days, that are almost self-serve for data teams, as opposed to older systems. You have to make those calls, and even when you limit your scope, the modern data stack is still really hard. I wouldn't say we've completely cracked this, but I think we've so far done a reasonably good job. I think the second problem is how you integrate into the workflow of data engineers, again given that different teams go about things in different ways. For example, if you want to do any kind of testing before production, then you have to rely on teams having version control for their pipelines, as well as having a staging environment, because if you don't have staging, there's nothing you can test, right? You have to take the new version of the code, whether that's SQL, LookML, or PySpark code, and actually test it; the way Datafold does it is by comparing it against production and showing you how things are going to change. But data teams may be building staging in so many different ways: sometimes using synthetic data, sometimes using production data. So how do you generalize this? It's a really big problem, and again it comes down to finding common patterns that are applicable to most teams. Sometimes that means betting on less popular ways of doing things, but betting that those ways will eventually become mainstream. One example of a tough trade-off we had to make: consider how data transformations are orchestrated these days. The most popular tool is Airflow, used by perhaps hundreds of thousands of data teams in the world. And then there's dbt, an emerging tool with a more modern approach to building data models, and maybe Dagster. Those tools have orders of magnitude smaller adoption than Airflow, and we had to make a call early on to not focus on building the data quality automation for Airflow. It's still supported, but not as deeply as we support, let's say, dbt. The reason is that with Airflow it's very, very hard to build a reliable staging environment; just the way the tool works, it almost forces you to test in production, and that makes it really hard to implement any kind of reliable change management process within Airflow. Whereas with a tool like Dagster or dbt, development and staging environments come with the tool, so it's really easy for us to come in and build Datafold on top of them.
And these are really hard calls, because by the time we started prioritizing integrations with dbt and Dagster, it wasn't at all apparent that they would actually win. We just thought that their approach was the one that would eventually take over, because it's more robust and more reliable. But it was a really tough call to make.

Kostas Pardalis 50:53
Super interesting. Eric.

Eric Dodds 50:56
Yeah, that is fascinating. I have more questions on that, but if I got into them, Brooks wouldn't be happy, because we do have to end the show at some point.

Gleb, I have a question about the adoption and implementation of Datafold, or technology like Datafold, in the lifecycle of a company. And I'm speaking from experience here as a marketer, to Kostas' point, who has messed up a lot of data. I mess up a lot of reporting by introducing new data.

I'll explain the question like this, with a personal anecdote from my experience at multiple different companies: you're growing really fast, or you're launching some new data initiative, and you have limited resources working on it. Say it's an analyst, and maybe you're borrowing some engineering time; you don't have a fully formed data team yet. Yet the company is growing really quickly in many ways. It's almost like what you described at Lyft, where data is the lifeblood of the company, but everything is growing so fast that everything seems to be on fire. And what's difficult in that situation is to slow down enough to implement both the processes and the discipline it takes to change your workflows, implement new tooling, etc., because you're moving so fast. And it really is a challenge, because you inevitably create technical debt, and later on you wish you had done it. But when you're in the moment, it's hard to be that forward-thinking because you're dealing with what's right in front of you. How do you see that play out with your customers? How have you dealt with that? And do you have any advice for our listeners who are thinking about that very challenge, where it's like: wow, I would love to implement this, but it's just really hard for where we're at as a company?

Gleb Mezhanskiy 53:15
Yeah, I would say, ultimately, it's not even about Datafold. I think this comes down to how a company and a data team think about putting together the data stack, and I think it comes down to doing the right amount of tinkering. By that I mean that if you look at the modern data stack today, you have really great tools that take care of most of your needs: ingesting data, moving it around, transforming it, analyzing it. And you really can assemble a stack in such a way that it just works, and then you spend your time thinking about business metrics, thinking about analysis, and how to drive your business forward. What is probably not the right way to go about this, and one of the mistakes I've seen teams make, is when they think they for some reason need to tinker more than they actually do: they start adopting, say, open-source projects and running them in house because they think that will save costs, or because they think they need control, or they're afraid of vendor lock-in. Sometimes it's premature optimization, sometimes it's engineering ambition out of control, sorry, engineers. But if a team does just the right amount, adopts the good tools, and focuses on the things they should be focusing on, that's great. And in that world, Datafold is extremely easy to implement, because we integrate with the standard modern data stack tools; the integration takes about an hour, and you basically improve your workflow in a day. You connect, let's say, dbt and your warehouse, and things just work out of the box. So it's actually not hard, and you don't need to spend a lot of time writing tests or anything. Where you may run into longer implementation times is when, let's say, you have a really legacy data platform, or a really unorthodox stack for some reason, where you've built a framework internally with certain patterns that are not common. Then you will probably use our SDK as opposed to the turnkey integrations, and that may turn into a longer implementation project. But all that is to say that for a data team on the modern data stack, using best practices, Datafold is extremely fast to implement. Unlike assertion tests, where you do need to invest a lot of time writing them, Datafold, or the regression testing capability, is really fast to get going. So I don't think you have to choose between moving fast and having reliable data; you can have both. You can move fast with stable infrastructure.

Eric Dodds 56:16
Love it, that is super helpful. Well, Gleb, this has been such a wonderful conversation. Super helpful, we've learned a ton. So thank you for giving us some of your time to talk about Datafold and data quality in general.

Gleb Mezhanskiy 56:32
Thank you so much, Eric and Kostas. I also really appreciate you asking really deep questions and helping me clarify my thinking. So I really enjoyed it as well.

Eric Dodds 56:43
Awesome show. What an interesting conversation, Kostas. I think my big takeaway, and I'll probably be thinking about this a lot this week, is the initial discussion around anomaly detection, which I thought was really interesting: how it was really helpful in some ways, in his experience at Lyft, but then also fell flat in some ways within the company because of some inherent limitations. And then the more I thought about that in connection with the way he described starting at the point in the stack where Datafold solves the initial problem, and then moving left and right, being involved in the pre-production process and the deployment process, it was really interesting. To summarize: anomaly detection, AI, all of that technology can be really useful, but ultimately you have humans involved in creating, managing, and processing data, and so you have to have tooling that actually is interjected into the process those humans are using. Gleb seems to have a really good handle on that. It's just a really good reminder about data quality and the particular characteristics of trying to solve it.

Kostas Pardalis 58:12
Yeah, I agree. I really enjoyed the conversation. What I'm going to keep, and probably think a little bit more about, is the role of the data engineer. I think we are seeing the data engineering discipline become much more solid and well established, something important around running your data infrastructure and working with data, and it's going through a very rapid evolution right now. We heard that from Gleb: how a couple of years ago, and when I say a couple of years we are talking about two or three years ago, it's not that far away, you could, as a data engineer, have complete control over your pipelines and the infrastructure. So it was much easier, I mean, you had control over what was going on. But that's not the case anymore. You have so many different stakeholders and so many different moving parts in the infrastructure, and the role has to adapt; it's part of the evolution of the role. So yeah, I think that's something we should investigate more in general.

Eric Dodds 59:31
I agree. All righty. Well, thank you for joining us on The Data Stack Show. As always, subscribe if you haven't, tell a friend. We love new listeners, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.