This week on The Data Stack Show, Eric and Kostas chat with Barr Moses, co-founder and CEO of Monte Carlo. During the episode, Barr discusses trust issues, defines terms like “detection” and “resolution,” sheds light on SLAs, and shares her experience with data teams.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome to The Data Stack Show. Today we are talking with Barr from Monte Carlo. She's one of the co-founders and CEO of the company, and they are in the data observability space. One of the questions I hope we have time to get to is the practical nature of what it takes to set up data observability in your company. Think about it like this: let's say you inherit a piece of software that you need to go back and write a bunch of tests for. I'm not an engineer, but I've been close enough to that to know it's the kind of job no one really wants. So I want to know, how hard is it to actually do this? Because we all have messy stacks, since we're trying to build these things out as we go. So that's my question. How about you?
Kostas Pardalis 1:12
I think I’ll spend some time with you on definitions like trying to define like better, what data quality is, what data reliability is what data observability is we’re using all these terms. In many cases, we take the definition for granted, because they are mainly used metaphorically coming like from other domains. Because a lot too always like exactly the same, right? Like it doesn’t mean that’s an SLA for server availability is the same thing as for data availability. Right. So I think she’s the right person to have these conversations and to try to understand a little bit better what all these terms mean.
Eric Dodds 1:51
All right, well, let’s dig in and get some definitions.
Kostas Pardalis 1:54
Let’s do it.
Eric Dodds 1:55
Barr, welcome to The Data Stack Show. We have wanted to talk to you for a long time, so what a treat to finally have you here with us.
Barr Moses 2:02
It’s great to be here. Thanks for having me.
Eric Dodds 2:05
Okay, so give us your brief background and kind of what led you to starting Monte Carlo.
Barr Moses 2:11
Yeah, so let’s see, I was born and raised in Israel, I moved to California about 12 years ago, work with data teams throughout my career most recently was at a company called Gainsight, where we work with organizations that help them make sense of their customer data, and basically improve their customer success and really create the customer success category. A large part of that was actually making that data-driven, that was a huge shift in the category like before, before that the world of sort of customer success was really built on pretty fluffy customer relationships, actually kind of like buy for me, I buy from you. And the introduction of subscriptions and recurring revenue actually forced organizations to think through how do we make customers successful every single day. And oftentimes, that actually requires data to do that. Well, and also, this was around the middle of last decade, when just it was easier to actually, like, ingest data and process data and analyze it. And so really, sort of trying to become a data-driven organization was something that more and more companies were doing. And as I’ve worked with these organizations, as VP operations, I noticed that you said, look at a company, there’s like someone there that makes a decision, like, let’s get data-driven, whatever that means. And they hire lots of data people, and for a lot of money into this, and just really get data-driven for a second, and honestly, like that those initiatives often fail. And I found that the number one reason that those fails is because people actually don’t trust the data. And so people might look to making a decision about the data or actually using the data, surfacing it to customers using it in production. When the data is wrong, we were like, well, why don’t we just like resort to gut-based decision making, this isn’t really working for us. And it’s incredibly hard for data teams to actually know that their data is accurate and reliable. So actually started Monte Carlo with the goal of sort of the mission of accelerating the world’s adoption of data by eliminating will be called Data downtime, which is basically periods of time when data is long or inaccurate. And a lot of the concepts that we use are actually concepts that have worked well for engineers. So we’re not really reinventing the wheel in any way. We’re actually taking concepts that work well in other spaces and bringing them over. So we started the company about three years ago, it’s been incredible to see the data observability sort of category really accelerate. I feel incredibly fortunate to be sort of at the forefront of this and work with amazing customers who are pioneering this. We have folks like Vimeo and affirm and Fox and CNN and off zero, and really, really strong sort of data teams who are actually adopting sort of data observability as a best practice and a real part of the modern data stack.
Eric Dodds 4:58
Amazing. Okay. I have so many questions, and I know Kostas does, too. But of course, we do a little bit of LinkedIn stalking, and one of my favorite questions to ask is how previous experiences influence the way you approach what you're doing today, especially as a founder. I noticed that you served in the military as a commander, which is fascinating. So I'd just love to know, are there any parts of that experience that have influenced the way you think about starting a company, running a company, and even solving the data observability problem?
Barr Moses 5:33
For sure. So in Israel, military service is mandatory, so everyone is drafted at age 18. Women typically do two years and men do three years. I actually originally wanted to be a pilot, but I was not accepted, and so as a result I ended up in an intelligence unit as part of the Israeli Air Force, basically working on data intelligence related to operations for Air Force units. I was quite young when I joined, 18 years old, and was promoted pretty quickly to be a commander, which meant that I was responsible for many other 18-year-old kids. You join the military without professional training, right? You just finished high school, without domain expertise, without a college degree or further degrees. So, wow, you really have to learn on the job; there's no more on-the-job than that. The second thing is that it's a lot of responsibility at a very young age. So I learned a ton. It's definitely a reality check: you have a lot of responsibility, and as a commander you're responsible for your soldiers on different levels. You're responsible for their professional expertise, making sure that they are the best at their job. You're also responsible for their physical well-being, and finally, you're responsible for their mental well-being, for making sure that they are driven and motivated and excited about what they're doing, and that there's camaraderie. So at a very young age, I was thrown into a situation where I really needed to create a cohesive team of folks with very little experience, to do work that was very impactful and important, and to do it in a way that people could also thrive. And that was hard.
I obviously learned a lot from that experience. I learned a lot about what people care about and how to care about people, but also how to motivate them, and honestly, about the power of bringing together people who are really aligned on a mission. Even if you don't have all the experience in the world, and even if you're not the world's greatest expert on a particular topic, you actually can learn on the job, and you actually can make a big impact. I think that gave me confidence later on, both to take on things that I wasn't necessarily the greatest expert on, diving into something and being confident that I could bring something to the table, and also to make bets on other people who are perhaps earlier in their career, helping set them up for success and making them shine. That's one of the things that is most important to me in building Monte Carlo: that we're really proud of the journey we're on, and that we make Monte Carlo the sort of company that can be life-changing for people.
Eric Dodds 8:28
That is incredible. Thank you so much for sharing, and you are building a great company, so those lessons you learned are clearly evident in the company and the team. Okay, I know I want Kostas to ask a bunch of questions here, but I'll kick it off. Let's talk about trust. You mentioned trust, and we actually were talking with a data engineer from a big consumer mobile company. We asked, what is the hardest problem that you've ever had to solve as a data engineer? It was fascinating, because he paused for a second, stared off-camera, and you could tell he was thinking really hard, and his answer was trust. He said, that's the hardest problem that I face every day. So my question for you, and I'd love to hear your thinking on this, even philosophically, as you're building Monte Carlo: how much of that is a technical problem, and how much of that is a human problem? There are certainly technical aspects to it, to your point, like adopting principles from software engineering: there's testing, there's all sorts of stuff, right? But trust is a very visceral human experience. I'd just love to hear how you think about that.
Barr Moses 9:43
Oh, man, that’s a great question. So first of all, I love that you asked that because that’s actually what I’ve experienced before starting Monte Carlo. So do this in the context a little bit for how we started the company actually started three companies in parallel when I started Monte Carlo, don’t talk to me Spore, like, what does Product-Market Fit look like or doesn’t like. And so actually read this book, it’s called the Mom Test. It’s a pretty bad title. But the book is quite good. I don’t know if you had a chance to read it.
Eric Dodds 10:12
I haven’t read it, but I’ve heard of it and the title is hilarious.
Barr Moses 10:17
Basically it gives good guidelines to help think about how to have conversations with customers early on. The idea is, basically, there are people who, if you share an idea with them, will give you positive feedback on that idea no matter what. You have to find the people who will give you real feedback, and those are often folks who don't care about you or what you're doing. That's contrary to the way most startups get started today, which is going to your network. So what I actually did is reach out to hundreds of people who knew nothing about me and owed me nothing: data engineers, perhaps like the person that you spoke with. I just asked them, what's keeping you up at night? What's your biggest nightmare? What's some shit that's annoying you these days? And their reaction was so visceral around this thing of data trust. People were like, I wake up sweating at night because there's a report that my CMO is going to look at tomorrow morning, and I'm not sure the numbers are going to be accurate and everything is going to work out. I remember the chief data officer of a public company told me, last week we nearly reported the wrong numbers to Wall Street. We caught it 24 hours before. Someone on my team, not even me, someone on my team caught the issue 24 hours before. That sort of thing has implications for your job, right, for your professional integrity; it goes to so many levels. So in starting the company, I saw that people are just really visceral about this. It's kind of ironic, because with data, people are like, you had one job, just get the numbers right. Sure, but it's freaking hard. And I experienced that myself as well. I was leading a data analytics team, and the numbers were wrong all the time, and I would get WTF emails from my CEO and others, like, what's going on? So first I'll say, I think part of the importance of this problem is that it's not just about getting the data right. It's also tied deeply to people's professional pride and their sense of satisfaction in their job, and I think that's why it's even more important to solve it. To your specific question on how you go about thinking about a solution like this: it's definitely a combination of tech and people, right? On the tech side, I think there are changes in the last couple of years that have made it possible for a company like Monte Carlo to create a standard solution. The rise of data warehouses and BI, and honestly a standardization around a finite set of solutions that a typical company will use, has allowed us to build a finite set of integrations. So Monte Carlo today can support data warehouses like Redshift, BigQuery, and Snowflake; BI solutions like Tableau, Looker, Sisense, and Mode; data lakes, which we're starting to support as well; and ETL solutions and orchestrators like Airflow and dbt. And because there's sort of this rise of, quick, there should be a buzzword alert on the podcast right now.
Eric Dodds 13:32
I love it.
Barr Moses 13:32
I do it a lot. The modern data stack, thank you. So with the rise of that, there's basically a consolidation, kind of an agreement, on what the top vendors are that folks will request, right? That actually allows standardization in terms of how you think about pulling metadata across those stacks, and how to think about what we call the pillars of data observability, which we codified as a framework of common, shared metrics for how to think about data observability. So I'll explain a little bit here: what does data observability mean? It really is a corollary to observability in software engineering. In software engineering, it's very well understood what you would measure, right? If you use AppDynamics or New Relic or Datadog or whatnot, you look at specific metrics, or engineering looks at specific metrics, to make sure that you have five nines of availability. Now let's look at data organizations. You have people like data engineers, data analysts, and data scientists who are working to create data products, and those data products can be dashboards or machine learning models or tables or datasets in production. They need to make sure that those data products are reliable, but how do you measure that? Whatever software engineering uses does not translate. So we actually had to codify what that means and build around that. When we think about the tech part, there are definitely advancements that have allowed that. And when you think about the people part, it's definitely the rise of what I would call the data engineer role in particular, because that person is now responsible not only for the job being completed, but for the data being accurate, the data being on time, everything else that encapsulates what we would call trusted data.
Kostas Pardalis 15:23
Barr, I have quite a few questions around that, so we'll have the opportunity to get much deeper into observability. But before that, I have a very simple question first, which is: why Monte Carlo? Why did you choose this name for the company?
Barr Moses 15:36
Great question. That is a great question. We are big Formula One fans at Monte Carlo. I'm actually wearing a Formula One hat, but that's actually not the reason. When we started the company, one of our principles at Monte Carlo was to build something that solves a real customer problem, versus building something that we don't know if anyone will actually use. Before starting the company we spoke to a couple hundred people from data organizations, asking them what their problems are, and we got to a point where we very quickly wanted to start working with customers. So I actually had 24 hours or something crazy like that to choose a name. I was like, okay, we'll just choose something for now and figure it out later. I studied math and stats in college, so I literally opened my stats book, which I have right here, actually, and read through it. The options were Bayes' theorem, which did not seem great. The next option was Markov chains, which seemed even worse. And the third was Monte Carlo, and I was like, oh, Monte Carlo, I can work with that. It's something that is approachable, people know about it, but it also has its roots in data, if you will. So it's basically named after the simulation, in that sense.
Kostas Pardalis 16:54
Okay. Okay, that’s really interesting. I really like the bridge between like, the racing there in Monte Carlo and statistical methods. All right, cool. So as you were like talking about data and data quality and observability, you, you mentioned, like a couple of terms, and two of them is data has to be accurate, it has to be reliable. So makes total sense, right? I don’t think that anyone’s going to argue against that. But many times we forget, like to be a little bit more of engineers would say and try to build a bit more accurate on the Stern show. What does it mean to be accurate and reliable when it comes to data?
Barr Moses 17:35
Great question. What you’re saying is spot on in terms of, let’s introduce more diligence to what it means to get trusted data. I think the path to that is actually operationalizing what that means. So, let me get a little more specific. When we think about, we sort of call this the data reliability lifecycle, there’s three core components to it that help us, or what we’ve seen is that our customers, if they actually operationalize with these three parts, that helps them generate trust. So there are three core components to it: the first is detection, the second is resolution, and the third is prevention. So I’ll double click into each of those for a second. And in by the way, just to clarify, each of these stages is like introducing tech, but also processes SLA is contracts between teams, ownership, clarifying sort of who’s responsible, who’s accountable, etc. So on the detection side, actually understanding when data breaks and understanding why it breaks or understanding the impact of it. So let’s sort of define how we actually do that I sort of talked before about how, in observability, and engineering, it’s really clear what you’re measuring, it’s not as clear in data.
So what we did, from all the conversations that we had with companies, ranging from large organizations like Facebook and Uber and Google who built this in-house, to small startups who didn't, is basically codify all the reasons why data breaks, all the symptoms of it, and all the different things that people do to deal with it. And we've come up with five pillars that we think together help bring that holistic picture.
The first is freshness of the data. There are different ways to look at it, but to give you an example: if there's a particular table that has gotten updated three times an hour for the last week, and today it hasn't gotten updated for the last three hours, that is a freshness problem that can potentially indicate some problem with your data. There are different ways to measure freshness: you can collect timestamps, you can look at the volume of the data. But it's basically, is the data arriving on time?
The second concept is volume. Again, pretty straightforward: you can look at the number of rows over time and say, okay, the number of rows has grown 5% every single day for the last week, and today it suddenly dropped by 30%. What's going on? Did we miss something? Maybe the job was completed, but the data actually wasn't transferred, for example. So the second is around volume.
The third is around distribution. Distribution is sort of a catch-all phrase for changes at the field level. For example, if this is a credit card field and you're expecting numbers, and then suddenly you get letters in that field; or if you have shoe sizes and it suddenly says shoe size 100 or something, it obviously doesn't make sense. You can also look at the percentage of null values, or negative values, etc.
The fourth is schema changes. Schema changes are actually a very common culprit for data going wrong. Oftentimes an engineer might make a change somewhere that results in a field type changing, or in a table being added or removed, and everyone downstream is not aware of that. So automatically tracking all changes to tables and fields that are being added, removed, or edited, that's the fourth pillar.
And the fifth pillar is lineage. We actually just released a great blog about how we built field-level lineage, and when I say lineage, I mean both table-level and field-level lineage: being able to automatically reconstruct your lineage without any manual input, just by connecting to your data warehouse and data lake and your BI, and understanding the connections between tables and fields. Overlaying data health on top of that is incredibly powerful. It lets you say: someone made a change somewhere upstream, and that resulted in this table downstream that now doesn't have the right data, which resulted in this report downstream that now has a higher percentage of null rates than what you expect. Having that view is the start of what we call the detection phase. Does that make sense? I'll just pause there.
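To make the freshness and volume pillars concrete, here is a minimal sketch of the kind of automated checks Barr describes, assuming a simple in-memory history of table loads. The thresholds, timestamps, and function names are illustrative assumptions, not Monte Carlo's actual implementation.

```python
from datetime import datetime, timedelta

# Illustrative load history for one table: (loaded_at, row_count) per update.
# In practice this would come from warehouse metadata, not hard-coded values.
history = [
    (datetime(2021, 9, 1, 6, 0), 1_000_000),
    (datetime(2021, 9, 1, 6, 20), 1_050_000),
    (datetime(2021, 9, 1, 6, 40), 700_000),
]

def freshness_alert(history, now, usual_interval=timedelta(minutes=20), tolerance=3):
    """Freshness: flag if the table hasn't updated for ~3x its usual cadence."""
    last_loaded_at, _ = history[-1]
    return now - last_loaded_at > usual_interval * tolerance

def volume_alert(history, max_drop=0.30):
    """Volume: flag if row count dropped sharply versus the previous load."""
    (_, previous_rows), (_, current_rows) = history[-2], history[-1]
    return current_rows < previous_rows * (1 - max_drop)

now = datetime(2021, 9, 1, 10, 0)
if freshness_alert(history, now):
    print("Freshness alert: table is hours past its usual update cadence")
if volume_alert(history):
    print("Volume alert: row count dropped more than 30% since the last load")
```

In a real deployment the cadence and thresholds would be learned from each table's own history rather than fixed by hand, which is what makes this kind of check automatable across thousands of tables.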
Kostas Pardalis 21:52
Oh, absolutely. Absolutely. I love the way that you have codified this, to be honest. But I have a follow-up question: the way that you describe it makes total sense, at least from my experience so far. But how much can be done with standardization, right? And how much do we need to account for the business context that this data operates in to make this framework work? Or can we completely automate it? What's your experience so far on that?
Barr Moses 22:26
Great question. What we see typically is that most companies today try to do a lot of this stuff manually, with tests that they write. 80% of issues actually go uncaught or unnoticed, meaning they hear about it from someone downstream. Let's say, for example, my CMO, or a data scientist, or someone that I'm working with pings me on Slack: why is the data wrong? What's causing the numbers to be off? Help. It's probably your fault, go figure it out. Not fun notes to get on Slack. So 80% is caught that way, and 20% is caught with manual tests or manual checks that you could write based on business knowledge. We think, in a better world, and what we're seeing with our customers and other strong data teams who are implementing this, is that with standardization and automation you can actually catch 80% of the issues. That's the reality of what we're seeing. I think everyone really thinks that they're a snowflake, that they're unique. Yes, it's true, they are snowflakes, and you can also automate a lot of this stuff, again thanks to the standardization of solutions in the modern data stack. So what we're seeing is that 80% of the issues can be caught with automation. Then there's probably 10 to 15% where only you would know; no automation in the world can catch that. That 10 to 15% is where the unique expertise of data engineers and data analysts should actually be spent. For example: our customers view this data every Monday at 6 a.m., so at 5:55 this better be accurate. Or, this field in our business doesn't make sense to be higher or lower than 100. There's some use case in particular where only the business would understand that. But that is only 10 to 15%; it doesn't make sense for those teams to spend their entire time building the rest. And then there's maybe one or two percent that just goes unnoticed and gets caught by downstream consumers. So I would say expertise and domain know-how is very, very important, but I feel like we can make better use of it, so that data engineering teams don't spend their entire time building all of this. Start with something off the shelf, and then add your knowledge in a more custom, directed way, if that makes sense.
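The business-specific rules Barr describes, data ready before a Monday 6 a.m. review and a field that never makes sense above 100, are the kind of checks a team would layer on top of automated monitors. A minimal sketch, with hypothetical rule names and thresholds:

```python
from datetime import datetime, timedelta

def field_bound_violations(values, upper=100):
    """Business rule: this field never makes sense above 100 (e.g., a shoe size)."""
    return [v for v in values if v > upper]

def fresh_for_monday_review(last_update: datetime, now: datetime) -> bool:
    """Business rule: customers view this data Mondays at 6 a.m., so by 5:55
    it must have been refreshed within the last 24 hours."""
    before_monday_review = now.weekday() == 0 and now.hour < 6
    return not before_monday_review or now - last_update <= timedelta(hours=24)

print(field_bound_violations([42, 38, 250]))     # [250]: violates the bound
print(fresh_for_monday_review(
    datetime(2021, 9, 5, 22, 0),                 # last refresh: Sunday 10 p.m.
    datetime(2021, 9, 6, 5, 55)))                # checked Monday 5:55 a.m. -> True
```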
Kostas Pardalis 25:07
Yeah. 100%. I totally agree. And, okay, so far we've had data to work with, right? A lot of data. Actually, one of the biggest problems we have is that we have way too much data sometimes. And now we also generate data about the data, so we can monitor the data, right? So how do we deal with that? What kind of experiences do we need to build, as product people, or companies, or vendors in the space, to make sure that at the end we don't cause more frustration for our users, but actually help them figure out what's going on with their data?
Barr Moses 25:47
Yeah, let me break it down. There was this big hoarding of data, and now there's this big hoarding of metadata, and you're like, that was useless, and this is going to be useless too; we're just going to do it because that's what people do. I'll say something controversial: I think metadata by itself is quite useless. Lineage by itself is quite useless. Nobody gives a shit about it, it doesn't matter, right? It's great eye candy, and you're like, oh yay, I have lineage. But where is it actually useful? What we're seeing is that metadata and things like lineage can be particularly useful when they're applied to a specific problem. For example, I talked about this data reliability lifecycle of detection, resolution, and prevention. In the detection phase, if some table breaks, but nobody's actually using that table, there are no downstream dependencies on it and no one is querying it, maybe I don't care about that particular table, right? Maybe I don't even need to know that the data is inaccurate there; it can be inaccurate, and who cares? On the other hand, if there's a particular table where the data is inaccurate, and there are 300,000 downstream nodes connected to it, and there are 10 reports that my CEO and my top executives and all of our customers are using on a daily basis, yeah, I better get that data right. So actually using that context can help inform detection and make it better. And similarly, we see that throughout the lifecycle. In the second part, resolution, which is basically how you speed up from being the first to know about data issues to being very fast at identifying the resolution of data issues, these things can also give us clues: okay, a table here broke, and at the same time there was a change in this field, or these three other tables also broke at the same time, and this is related to a change that a particular user on the marketing team made around the same time, for example. You can use all that information together to actually speed up resolution of data incidents. We find that data engineers, data analysts, and data scientists oftentimes spend between 40 to 80% of their time on data fire drills, so if you can use all that stuff in context, it's actually really powerful. And then in the third part, prevention, we're finding that people have reported north of 70 to 80% reductions in their data downtime incidents once they have more access to this context. Metadata and data together can combine: for example, a report on deteriorating queries can give us real insight into where there are specific problems in our infrastructure, and clues as to how to build a more robust infrastructure overall that reduces downtime incidents. So being more proactive about how we manage our metadata and our data, I think, also helps us make sure it's more robust and trusted at the end of the day.
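One way to picture how lineage turns metadata into something actionable, as Barr describes, is to score an incident by its downstream blast radius. A minimal sketch, assuming lineage is already available as a simple adjacency map; the table names are made up:

```python
from collections import deque

# Hypothetical lineage graph: each table maps to the assets that read from it.
lineage = {
    "raw_events": ["stg_events"],
    "stg_events": ["fct_sessions", "fct_orders"],
    "fct_sessions": ["exec_dashboard"],
    "fct_orders": ["exec_dashboard", "finance_report"],
    "scratch_tmp": [],  # no dependents: an incident here can probably wait
}

def downstream_assets(table, lineage):
    """Breadth-first walk of the lineage graph to collect all affected assets."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

impact = downstream_assets("stg_events", lineage)
print(f"Incident on stg_events reaches {len(impact)} assets: {sorted(impact)}")
print(f"Incident on scratch_tmp reaches {len(downstream_assets('scratch_tmp', lineage))} assets")
```

The same walk run in reverse (toward upstream sources) is what supports the root-cause side of resolution.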
Kostas Pardalis 28:48
Yeah. 100%. Okay, so we talked about detection a lot. The next step is resolution, right? So what does resolution mean?
Barr Moses 28:58
Yeah, so resolution, great question, means moving from the world today, in which it often takes weeks, sometimes months, I hope I haven't seen a case of years, but I might have to go back to my notes, literally very long periods of time to identify the problem, do a root cause analysis, and fix the data, and shortening that time significantly, right? I'm also thinking about SLAs for that: how quickly do those issues have to be resolved, and for what severity? P0, P1 through P3, different issues, we should have different agreements on that. So it's about having SLAs and contracts with your data team on different datasets. One thing we're seeing, as part of this movement of, buzzword alert, data as a product, is thinking about different domains and domain-specific ownership. You can have specific datasets and pipelines that are mostly used by the marketing team, for example, and maybe the SLAs there are different than the SLAs for data that's used by the finance team or the finance domain. Maybe finance uses the data once a quarter to report to Wall Street, so you have more time, but maybe marketing is actually feeding a pricing algorithm for pricing houses in a particular market, and if you're underpricing or overpricing a house, there's a big difference there. Or, to give another example, one of our customers, Vimeo, is a video hosting platform with a very strong team, and they have actually used data during the pandemic not only to sustain their growth but even to fuel it. They've done that by identifying new areas of revenue and opening new revenue channels, and a lot of that has been enabled by introducing data observability concepts, detection and resolution and prevention, across the business. They use real-time data to make decisions like how much bandwidth a particular user needs, for example, and so the SLAs on that kind of data are obviously very, very different than the others. So when thinking about resolution, you think about impact radius: who's actually impacted by this, downstream and also upstream? How do you locate that particular problem? Sometimes a problem can go beyond a particular data warehouse, and you can use logs and runs from dbt or Airflow or other orchestrators to help pinpoint it. But generally, I don't think resolution is necessarily about a platform automatically identifying the problem and solving it for you, but rather about giving data engineers the tools and information to identify the problem way faster, kind of like the way you would use New Relic or Splunk to identify a problem on the infrastructure or application side.
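To illustrate the severity-tiered agreements Barr mentions, here is a small sketch mapping severities to time-to-resolution targets per domain. The tiers and windows are invented for illustration (quarterly finance reporting can wait days; a live pricing feed cannot), not a standard:

```python
from datetime import timedelta

# Hypothetical resolution targets: (domain, severity) -> agreed time to resolve.
sla_targets = {
    ("finance", "P1"): timedelta(days=5),     # quarterly Wall Street reporting
    ("marketing", "P1"): timedelta(hours=4),  # feeds a live pricing algorithm
    ("marketing", "P0"): timedelta(hours=1),
}

def sla_breached(domain: str, severity: str, open_for: timedelta) -> bool:
    """True if an open incident has exceeded its agreed resolution window."""
    target = sla_targets.get((domain, severity))
    return target is not None and open_for > target

print(sla_breached("marketing", "P0", timedelta(hours=2)))  # True: past the 1h window
print(sla_breached("finance", "P1", timedelta(days=2)))     # False: within 5 days
```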
Kostas Pardalis 31:59
100%. Okay, I need your help to understand the concept of SLAs a little bit better. I love using metaphors in general, because they help a lot in getting people to understand what we're building, especially when we're talking about a new category like data observability. But the problem I have is that when it comes to infrastructure, an SLA is a very, very specific thing that has to do with availability, right? We measure how much time a service or a server or whatever is available to us. Now, with data, things are a little bit more complicated, right? You might even have cases where the data, because something went wrong, does not even exist. Take, for example, how RudderStack is used: we have a developer and SDKs that get integrated on a website, for example. If the developer does not capture an event correctly, let's say instead of sending the actual values it just sends nulls, then the data was never captured, right? So what does it mean to set an SLA there? And how do we communicate these kinds of problems that exist with data but that we don't have with observability when it comes to infrastructure?
Barr Moses 33:23
Yeah, such a great question. I can give my two cents from what I see our customers do, but just to be honest and transparent, these are early days of the category, right? So we are defining this as we go. More and more data teams are doing this, but it is still early days. I remember Meta did an observability summit a few weeks ago, and they talked about having a very high number of tests and starting to introduce SLAs. Some of the concepts they were really interested in are: how do you think about data downtime, and how do you measure data downtime? That's a very vague concept; how do you actually do that, right? One definition that we've introduced is a combination of three metrics. One is the number of incidents that you had; the second is time to detection, your average or median time to detection; and the third is time to resolution. So you measure data downtime as a function of those three measures. Now, I can introduce that concept all day long, but most teams actually don't even measure time to detection or time to resolution yet. So you're right: in order to introduce things like SLAs, there needs to be a baseline that's established, and I think that's where we're at these days, establishing those baselines. Now, in the example you shared, an SLA could be: this particular table gets updated every three hours, and we can only accept a deviation of X hours, or we need it to be on time 99.99% of the time. That's a very specific SLA definition for a specific table that we know needs to be updated on a regular cadence, so that's an example of a freshness SLA. But I totally agree with you, these are very, very early days. I'm happy to share, actually, a blog we wrote with an example of an SLA dashboard, from one of our customers, I believe, with specific SLAs for how often and how regularly they get their marketing data from third-party vendors like Facebook. They have particular agreements on how often that data is received, and contracts between teams on when they can actually use it, when they can rely on it. So I'm happy to share that as an example of putting this into practice. But I agree with you, it's early days, and we're seeing a lot of innovation, so I'm excited about where we're going with this. I can't tell you that I've seen a data team that operates with 100% data uptime on all five pillars and says, our data is perfect, don't talk to us. I've just never seen anything like that.
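Barr's three metrics compose naturally into a single downtime number. Here is a minimal sketch of that bookkeeping, with made-up incident timestamps; summing time to detection plus time to resolution per incident is one reasonable reading of her definition, not an official formula:

```python
from datetime import datetime, timedelta

# Made-up incidents: when the data broke, when the team noticed, when it was fixed.
incidents = [
    {"occurred": datetime(2021, 9, 1, 2), "detected": datetime(2021, 9, 1, 8),
     "resolved": datetime(2021, 9, 1, 20)},
    {"occurred": datetime(2021, 9, 3, 4), "detected": datetime(2021, 9, 3, 5),
     "resolved": datetime(2021, 9, 3, 9)},
]

def data_downtime(incidents) -> timedelta:
    """Total downtime: per incident, time to detection plus time to resolution."""
    total = timedelta()
    for i in incidents:
        time_to_detection = i["detected"] - i["occurred"]
        time_to_resolution = i["resolved"] - i["detected"]
        total += time_to_detection + time_to_resolution
    return total

print(f"{len(incidents)} incidents, total data downtime: {data_downtime(incidents)}")
# 2 incidents, total data downtime: 23:00:00
```

Until a team records those detected-at and resolved-at timestamps, there is no baseline, which is exactly the prerequisite for SLAs that Barr describes.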
Kostas Pardalis 36:06
Yeah, absolutely. And I think that’s like one of the differences that we are going to see with this and ladies in this industry, like compared to the infrastructure abilities that, okay, data is not like a server, it’s not like a binary thing. It exists or doesn’t exist on the branch or it doesn’t run, right. So I think when we are going to like get to a point where the standards are less better, they are going to be like much more dimensions that we measure there. And I think that it’s going back to the beginning of the conversation, and it has to do more with trust, like trying to measure like how much I can trust something in this data center, they have they know that much about these the data set available or not available, like these kinds of things. Also, I’m like, that was like, super interesting. And thank you so much, you really helped me like understand better, like the concept of SLAs.
One last question from me, and then I'll give the stage to Eric, because I've monopolized the conversation. This time I want to ask you something that has more to do with building a company than with data and data observability. You are one of those quite rare cases of people who decided to start a company while a new category was under formation, and you are at the forefront of that. What does that mean? What is, let's say, the fun part of trying to build a company while the category is not yet there, and what's not?
Barr Moses 37:34
Great question. In general, I would say it's funny: when you start a company, kind of everything is really hard. Nothing is easy. I always remind myself of the percentage of startups that fail; I think it's like 99.9% of startups that fail. So you're by definition embarking on a journey that is very likely to fail, and yet you're getting started anyway. There's something very weird about that.
Kostas Pardalis 38:04
There’s something wrong with these people, I know. I am one of them.
Barr Moses 38:12
Exactly. Like, why am I doing something that is basically doomed to fail, right? You develop different ways to think about the world. One of the things that helped me get conviction early on, and that's why I mentioned I actually explored three different companies, was: look, if I'm going to leave my cushy job and get other people to leave their cushy jobs, it had better be an idea that I have a lot of conviction on and am really excited about. So I spent a lot of time before actually starting the company with customers and with their pain, and that gave me a lot of conviction about this. Now, when you start a category, it sort of makes the existence of the company hinge on the category creation. I remember I was chatting with Lloyd Tabb, one of the founders of Looker, and he said, look, the worst thing that can happen to a startup in the early days is that nobody cares, nobody gives a shit about it or about the category. It's way better to have some strong reaction, either love or hate, but some strong reaction. So we spent a lot of time on category creation in the very, very first days of the company. One of the first things was that I wrote a blog post about data downtime and about my experiences. And I remember, this was before we actually incorporated the company, I was curious whether the concept of data downtime was something anybody cared about. So I applied to a conference, Data Council actually, with a talk titled data downtime, and this was before there was a product or anything like that. I was like, let's see what happens. I assumed only four people would show up, and it would be kind of awkward, and I'd kind of hang out with them and talk about some random things, and then we'd just move on with our lives and pretend it didn't happen. And actually, I don't know, more than 100 people showed up. After the talk they were like, thank you for giving us language for this; we feel like this is the beginning of a movement. And this was literally just me, I wasn't with my co-founder yet, and there wasn't anyone else. But that gave me the sense that, okay, we're working on something that people care about, and we need to actively invest in it all the time. So I would say the fun parts are that you're working on something that's really important to people, and you're solving real customer problems. 100% of our customers renewed with Monte Carlo this past year, and so many of them did because we've been able to make a positive impact on their lives. That's the stuff that I'm really excited about: we're actually doing something that people really care about. I think the hard thing about it is that it's hard. We're not competing against 10 other options for customers; we're educating our customers. The problem is something they're very well aware of, but there's a lot of education like, hey, there's actually a different way to think about this problem, right? Our customers live in a world where they might have to manually look at reports all the time to make sure that the data is accurate.
And now we’re introducing a world where you can actually sort of rely on sort of an alert to give to tell you that something is wrong versus having to like manually check that. And there’s also an education of saying, hey, look, there are concepts in observability, and software engineering that worked, they can also work in data, like there are a lot of engineering best practices that we can bring over and that is the education part. So we invest a lot in it, like we write a lot, and we spend a lot of time with our customers on it, we really see it as like an existential part of the company.
Kostas Pardalis 38:23
Yeah, 100%. I think the mechanism of metaphor is something very, very important in category creation in general. That's why I asked about the SLAs and all these things. And we could keep discussing this stuff for four hours, but Eric, all yours.
Eric Dodds 42:09
All right, I think we have time for one more question. Barr, I'd love to actually get practical here. In an ideal world, you get to build software from scratch with all of these principles, like unit testing and all that sort of stuff, which is cathartic to think about for all of us, because cleaning up messes is really hard, right? But the reality for most companies running even a stack with a moderate level of complexity, which I would say is most companies trying to do something related to the modern data stack, is that when you think about observability, you're often coming into an environment where, not because anyone made really bad decisions, but because companies are growing quickly, even just reacting to COVID and all of the data changes there, you're dealing with a lot of complexity. There were a lot of things that would have been great to do when you were initially building the stack, but you didn't, because you were moving too fast, you didn't have enough resources, or there was new technology coming in, etc. So could you help our listeners understand: it would be so great to just do observability from the ground up and have it integrated into every piece of what you're doing, but that's not the world anyone lives in. So practically, when you go into a customer and you're implementing Monte Carlo, what's the lift? I'm thinking about the data engineers, maybe even heads of data, who are kind of like, that sounds so nice, but ooh, I don't know if I literally have the resources for a six-month project and all that sort of stuff. So what does it look like? How hard is it? How long does it take? What's the lift, how many people, etc.?
Barr Moses 44:05
For sure. By the way, for the record, I don't think I've seen a customer who literally has the perfect, or even a clean, standard setup. Most folks, I'd say 100 percent, have a lot of debt. A lot of people come and go, there are a lot of open questions, and they have a lot of complexity, right? I think that's the reality for almost everyone. At Monte Carlo in particular, having been in the shoes of the teams that we work with, we recognize how little time they have and how unrealistic it is to spend six months on something like that. So early on in building the company, we invested a lot in making it incredibly easy to get started with Monte Carlo. If you have a standard stack, which I would say is Redshift, BigQuery, Snowflake, Looker, Tableau, etc., you can actually get set up in less than 30 minutes, and those five pillars I talked about, you get out of the box. Within 24 hours you actually have table- and field-level lineage, our models start working, and within a couple of weeks you will start having detection, resolution, and prevention, the features working for you. We can add customization on top of that for your own needs, but those five pillars are automatic, out of the box, within that 30-minute onboarding, with no other integration work required.
Eric Dodds 45:28
Wow, that’s, that’s amazing. So pretty low lift, I would say and then you can get into customization. And then just again, just trying to help our audience understand, like, what this dynamic looks like inside of organizations? Who are the users? Or who’s the primary user of Monte Carlo? And how do they interact with the product? And like, what does that cadence look like as part of their workflow?
Barr Moses 45:52
Yeah, so the users are data teams, most typically data engineers and data analysts, and sometimes data scientists as well. I would say titles are a little bit murky these days, right, and it depends a little bit on who the people actually responsible for the data being accurate are. Those three titles are the ones we see the most, I would say more data engineers and data analysts. In terms of what it actually looks like, folks are incorporating it more and more into their workflows. That might mean waking up in the morning and checking whether the status of the data is up to date. Then, throughout the day, if I want to make a change somewhere, I want to understand whether I'm going to break someone's workflow, so I might go in and see: if I'm making a change to this field or this particular table, who downstream is actually dependent on it? That's something you'd need to know to be thoughtful about the change. And then maybe later in the afternoon I get an alert about a particular problem in the data, and I might double-click into it to understand who is impacted, what the queries are, etc., and do the research to understand what actually happened. It's mostly embedded into folks' day to day, if you will. And largely, it's because folks end up spending a lot of time on data fire drills, and the goal is to reduce that amount of time, basically.
Eric Dodds 47:15
Okay, one last question. I said we had time for one more, but I've asked three. I love that you talk so much about interacting with customers and talking with customers, so you're really close to these data teams. Really quickly, I just want to know: what's your favorite part about working with data teams? You've worked with so many different teams over your career, but what do you love in particular about working with data teams?
Barr Moses 47:34
That’s the favorite part of my job, literally. That’s what gets me out of bed in the morning is, you know, to sort of hear the amazing stories of data teams, I think, maybe even a favorite part of like, how powerful data is no, we work with companies like Fox, for example, that you know, covers events like the Super Bowl, and they literally track like, number of users and time spent in London and devices and like literally powering such important events. And then we have customers in healthcare that use data for diagnostics. And it just like the use cases are so wide, even more so like, you know, with COVID-19 and everything, it’s become even more important, it just inspiring what data teams are actually working on. It’s pretty freakin’ cool. It makes me really proud to be working with them.
Eric Dodds 48:23
Awesome. Well, thank you again so much for giving us some of your time. A wonderful conversation, and we loved having you on the show.
Barr Moses 48:31
Thank you so much for having me.
Eric Dodds 48:33
Okay, Kostas, this is something that I've thought about a lot over the years, and I don't know why. Maybe this just comes from me doing a lot of consulting, but I was like, okay, there are probably 10 business models across B2B and B2C that you could build a basically predefined data schema and stack for, and it would work for 90% of the companies out there, right? A lot of times the customizations are not necessarily a good thing, even though each business is unique. And it was so interesting to me to hear Barr, to some extent, validate that. We probably can solve 80% of the data problems, because they're fairly known quantities, and some of that has to do with the tooling and other stuff like that. But it was just really interesting to hear her talk with such a high level of confidence and say, look, every business is unique, but really only 10 to 15% of the problems are of the nature that needs customized resources, and we can automate the rest. It was really cool to see her talk with such confidence about that. And of course, it validated my idea.
Kostas Pardalis 49:54
Yeah, it did. Okay, I agree. I mean, it was very interesting to hear someone like Barr talking about standardization, and how standardization should be part of the products that we offer. Standardization is usually mentioned and loved by engineers, but in the context of, let's say, that 10 to 15%, we don't really consider it as part of building a business. But it is important, and I think it's even more important when you're trying to build a new category, as Barr is doing right now with the rest of the vendors in this space. People need guidance, people need education, and standardizing processes and concepts is one of the best tools you have to do that. So yeah, I loved that part, and the whole conversation was amazing. Hopefully we're going to have her back again to discuss more about all these concepts.
Eric Dodds 50:55
I agree. Well, thanks for joining us again, and we will catch you on the next show.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month's shows, a sneak peek at upcoming episodes, and curated links from Eric, John, & show guests. Follow along on our Substack.