Episode 88:

What Is Data Observability? with Tristan Spaulding of Acceldata

May 25, 2022

This week on The Data Stack Show, Eric and Kostas chat with Tristan Spaulding, the Head of Product at Acceldata. During the episode, Tristan discusses how to update old technology, defines “data observability,” shares early symptoms of a data drift, and more.

Notes:

Highlights from this week’s conversation include:

Tristan’s background and career journey (2:43)
Updating old technology (11:40)
Defining “data observability” (18:44)
The primary user of a data observability tool (29:56)
Handling an incident (33:01)
Why multipliers for data observability (37:06)
Early symptoms of a data drift (43:12)
Tuning in the context of data engineering (50:11)
What keeps Tristan working with data (55:12)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show. Today we’re going to talk with Tristan from Acceldata, which is a data observability company. I’m sure you have questions about that, Kostas. My burning question is actually around the type of company that he has served throughout his career in data engineering and ML. He worked at Oracle, he worked at DataRobot, and now he’s at Acceldata. We’re talking about big enterprises facing data challenges, and that’s a very different customer than the one that, let’s say, your standard Silicon Valley data startup is serving. And I think that market is huge relative to how we tend to think about legacy companies. So I just want to ask him about that, because I think he’s going to help me and our listeners develop more of an appreciation for that world, if they don’t already live in it. So that’s what I’m going to ask. How about you?

Kostas Pardalis 1:22
Yeah, I’ll be a bit more predictable here, I guess, in what I’m going to ask. We’ve been talking with quite a few other vendors in the data quality and data observability space. It’s a very interesting and very vivid market. So I want to ask him how they perceive data observability: what it is, why we name it the way we do, and how other people are using it. I’d also like to get a little more into the broader problem space and the product itself, and see how it works and how it delivers value.

Eric Dodds 1:57
Great. And you just used the word vivid to describe a market, so you get some vocab points. I’ll give you some major vocab points for describing the market as vivid. What concerns me about that is you might actually turn into a venture capitalist if you keep using that language.

Kostas Pardalis 2:17
Maybe one day, who knows.

Eric Dodds 2:20
Okay, let’s go talk about observability with Tristan.

Kostas Pardalis 2:23
Let’s do it.

Eric Dodds 2:25
Tristan, welcome to The Data Stack Show. So great to have you.

Tristan Spaulding 2:28
Thanks so much for having me here. I’m excited to join the illustrious list of guests you’ve had here, and hopefully I’ll live up to it.

Eric Dodds 2:35
Oh, yeah. So many fun things to talk about. Okay. Give us your background and tell us what led you to be head of product at Acceldata?

Tristan Spaulding 2:42
Sure. So for me, it’s actually a long story that starts with being a philosophy major at MIT Tech labs, and all this stuff. But basically, that took me down a path with the Endeca group at Oracle, working on search, BI, analytics, big data, things like that. Then I found this nice Boston company called DataRobot and got in there as one of the early product managers, and helped see that grow from where we were to what is now a big company, a big player in the AI, ML, AutoML space. The interesting thing about that, though, and really the thing that led me to Acceldata, is that when you sit at that end of the data lifecycle spectrum, you’re dealing with these ultra-refined datasets, the best datasets there are, and you’re doing some really sophisticated things on top of them. But for a company that was based around automation, around whether we can automate a lot of the data science lifecycle, what you start to notice is that it works really well, but then you run out of problems, because there simply isn’t data sitting around that’s perfect and pristine, even in a static form, much less in an actual dynamic data pipeline that would be suitable to run on. So in one ear I’m hearing about all these awesome developments in machine learning across the world, and it’s incredible what’s happening. And in the other ear I’m hearing, “Oh, no, we’re not ready for ML. We can’t do that yet.” So for me, one of the things I was looking at was: what are the barriers to being in a world where ML, DataRobot, and tools like that can be used constantly, non-stop, because we’ve solved this data bottleneck?
And I saw all these really interesting companies sitting in the middle, not at the layer where you’re moving data between places, but at the layer where you’re observing all these tools together. I thought this was a really interesting thing. Obviously, observability and application monitoring are huge areas. I know there’s a debate to be had: does that really apply to data as well? My take, my bet, and I think people on this podcast may agree, is that there’s so much diversity and so much innovation happening in the data world that my belief, and our belief at Acceldata, is very much: yes, let’s go into this, let’s understand all of these rich tools that are becoming increasingly specialized and increasingly powerful, and provide a common layer to understand all aspects of your data pipelines. So that’s my story in brief, and I’m excited to see this. I think it’s turned out that these pipelines are as messy as we thought, and the benefits from cleaning them up are as appreciated as you would imagine.

Eric Dodds 5:34
Yeah. Love it. Okay, so Kostas knows exactly the question that I’m going to ask. And actually, this is in your wheelhouse, because it’s related to philosophy, and you come from the seat of philosophy. Of course, he’s laughing. Okay. This is something that I’ve asked a ton of people on the show, and it’s probably one of my favorite questions, because it’s just fascinating to see how people’s education and experience influence their work, especially when their training doesn’t necessarily fit the mold of what you would expect for someone who ends up in a very data-heavy or engineering-heavy role. So what are some of the main things from your study of philosophy that have influenced your work as an engineer or working with data?

Tristan Spaulding 6:27
Yeah, no, I think it’s an interesting question. There are answers at different levels. One that I think will resonate with everyone who works with data is that you’ll find much of your time is actually spent on what are sometimes philosophical debates: How do you define what sales really is? Which way do we do this? Is the thing we really care about this, or is it that? Of course, at some level this is all about answering things with data. But at the meta level of actually defining and articulating the connection between these things, what you’re looking at and what matters, that’s precisely the question the data can’t answer, because the data has already been structured one way. So on one side, there are all the fun arguments about that. The tactical answer for me was that a lot of what I ended up doing, and this was in maybe the initial attempted heyday of the semantic web, was really trying to go out and map a domain, which is essentially data modeling by a different name. You get into these nasty legacy databases and map those to these beautiful ontologies that we created in OWL and all these things, but really, underneath, it’s all these nasty SQL queries. So I think there’s also an element of operating at multiple levels and connecting two things together. For me personally, I went through this path of wrangling data: just throw it out there, make things work, figure out what Linux is, how to work with the command line, things like that. But where I ended up was in product management.
And product management, I think, is a great example of applied philosophy in its most glamorous parts, which for me was 2% of the time. The other 98% is a slightly different skill set. But in the parts where it’s relevant, it really is about trying to clarify what you’re working on, whether it’s worth doing, and whether you’re doing it for the right reasons or not. I’ve always been one of those annoying product managers who won’t let someone get away with, “Well, we’re doing this because it’s just easy,” or “Oh, that will take up too much time.” Everyone who has worked on technical projects knows that it always takes more time than you think, even taking into account that it takes more time than you think. So being really careful about why this is something important to work on, why you are really the best in your category, or why it makes sense to win in this dimension, becomes something interesting. I think those are all traits. But the biggest trait of all was really that you spend time in these philosophy classes, and then you realize this is a waste of time, and you want to get hands-on and build something. You become really impatient, and you get really tired of philosophical debates because you know they’re pointless.

Eric Dodds 9:20
Yeah. You don’t actually ship something at the end of the philosophy class.

Tristan Spaulding 9:24
Exactly. And that’s really what it is. It’s like, “Oh, I spent all this time arguing,” or whatever, and then you’re like, “Oh, wow, I can write some code and it does this cool stuff.” Or I can work with people who know how to write code very well, or do excellent designs, or sell things, or make solutions work. That’s incredibly rewarding, and you never want to go back to what you were doing before.

Eric Dodds 9:44
Yeah, for sure. No, I love that. I’ve worked with some people who studied philosophy over the years, and they’re some of the best people at running down to the root of a problem or a question. It can be difficult, right? Because you’re just like, “We’re trying to move fast.” But you’re right, answering those fundamental questions is so important for ultimately producing the best outcome.

Okay, I know Kostas has a ton of questions on the technical side, and we want to try to define observability and do some of those things. But one thing I’d love for you to give some perspective on: we have the benefit of talking with a wide range of people on the show. We’ve talked with some founders who are building really neat new tools, stuff that’s spun out of Uber and LinkedIn, open source technologies, really cool stuff. You’ve spent a decent bit of your career operating at a stratum that people working in, let’s say, the typical Silicon Valley data startup just aren’t as exposed to. So, like Oracle, DataRobot: we’re talking about companies that have been around a lot longer than maybe some of the customers of these newer Silicon Valley companies. They’re literally built on technology that’s been around for a lot longer, which has a lot of implications. That market is also pretty large, and the data problems they face are pretty different. I would just love to get your perspective: what challenges do you see in that enterprise stratum, companies that have been around for multiple decades and run their business on technology they adopted when it was state of the art 20 years ago, but that they’re still running today while trying to modernize? Any thoughts on that would be helpful to give us some perspective.

Tristan Spaulding 11:39
It’s always been eye-opening to go around. When you work in B2B products anywhere, you discover all of these companies and all these layers of the economy, and of society, that maybe you didn’t know about before. Certainly, as you work in these enterprise software companies, you realize things are more complex than, at least, I thought going in. And as you actually start working with these companies, you realize their internal landscape is incredibly complex. The reason we have such awesome modern startups, or companies that have now gone public (the Ubers of the world are the classic example, the Facebooks), is that they’re able to move fast because they don’t have a lot of things weighing them down. They’re able to do things differently and build the things they want. Then they get to a scale where they have to do things their own way, or in a way that hasn’t been done before, and you end up with Presto, Trino, and any number of other examples that came out of Uber or LinkedIn and so on. Those become quite appealing to the next wave of companies. The next generation comes out and says, “Look, we know what you’re feeling. We’re operating at huge scale. We started with an app, that’s what we had, and we went from there.” If I look at the challenges that these enterprise companies have, and the reason you sometimes end up with different fits and different product priorities for that group, I think one is maybe the obvious one: there is a history here of technology investments.
The obvious aspect of that is that there’s a lot of technology around that maybe isn’t the new, cool one. People aren’t giving talks about it. But it’s been around, it’s driven the company for a long time, and there are people there who have used it and know how to use it, and that’s just what they do. So one of the things you need to consider is, obviously, integrations: how do you connect with those things? But more than that, it’s about looking at the change process. These companies have very smart people. These are giant companies. They’ve been very successful, and they’ve been around for a long time for a reason. The people in decision-making, leadership, and architecture roles there are always thinking about how to transition, and they’re always thinking about previous rounds of choices that have been made, how to do better in the future, and how to help with the migration. One of the interesting things about the Acceldata background is that a lot of the core founding team, the founding engineers, were Hortonworks engineers, very much building out Hortonworks and this powerful Hadoop distribution, and helping install it and support it at these super complex installations. At a different layer, when I was at Oracle, I had a different perspective on the same phenomenon, of people going very big on this technology that seemed to offer a lot of promise for cheaper compute, bigger processing, and things like that. In some cases there were successes with Hadoop, but in many cases it’s now viewed as probably an over-investment. And people are looking at: how do I deal with that investment, so I don’t just leave it?
I make use of it, I transition the parts that it makes sense to transition, and I leave the parts that are there. On the other hand, now that I’m evaluating the new wave of technologies, the so-called modern data stack (and maybe there’ll be a data stack after that), I’m going to be facing the same discussion: how do I actually evaluate this, and how do I adopt it in a responsible way, one that’s not lurching from one thing to the next? Navigating these decision processes is definitely quite relevant for these groups, because they’re looking at a much wider span and much more significant stakes. So certainly for Acceldata, one of the places where we’ve spent some time with people is in becoming experts on this modern data stack and trying to advise people on it in an informal way. We want them to be successful running these workloads, and we want them to adopt new things. And like everyone here, we want to stay abreast of the latest technology and the best choice for each option. These organizations will also not always be hiring people who are going to be contributing to open source or who are masters of open source, but there is a role as well to understand their needs and help bring that in, in a responsible way. It’s a little different and, in some ways, much more diverse in terms of the types of challenges you face: constrained in some ways, but also open in others, given the budgets and the size of these companies.

Eric Dodds 16:40
Sure, no, I appreciate that so much, because I think it’s easy to lump a lot of things under the legacy umbrella. I just appreciate that you said there are really smart people trying to navigate this: how do I modernize a stack that has had tens or hundreds of millions of dollars invested in it over a decades-long period, steer that, and make good on that investment over time? Man, that is a really difficult challenge.

Tristan Spaulding 17:17
I think many people here have experience with technical refactoring. It’s a rare engineer who’s able to come in, take a legacy codebase or a really complicated pipeline, and gradually improve it, both keeping what we have now and restructuring it in a new way. It’s a lot more fun in many ways to build something from scratch. That’s not easy either, but it’s just a different skill set. So it’s tough. It’s the price of success for a lot of companies: they’ve built this, they’ve invested in technologies, and those technologies have succeeded in fulfilling those use cases, in most cases. There’s an explore-exploit dimension for them as well, though I don’t think they’d phrase it that way. And when they’re confronted with an endless stream of a million startups pitching them on their unique thing, how do you decide between them?

Eric Dodds 18:16
Sure.

Kostas Pardalis 18:19
I have a couple of different questions, but I’d like to start with the basics. So let’s talk a little bit about data observability. What is data observability? And why do we use the term observability and not something else, like data monitoring? Why is observability the right word for what we are doing?

Tristan Spaulding 18:41
Yeah. Just before we get into data, I think people generally draw a distinction between monitoring and observability. Monitoring is meant to tell you that something has happened and, in many cases, draw your attention to it. People often use observability to mean the thing that helps you understand the internal state of what’s happening. It gives you enough information not only to see the symptom but to go in and find the cause very quickly. Different tools do this to different degrees, but I think that’s what’s meant by it. Now, with data, what’s interesting is that sometimes you get asked a similar question around data quality versus data reliability, and things like this. The interesting thing with data is that many of the use cases have existed for quite a long time. Data quality, BI dashboards, reporting: these are not novel concepts. They’re likely done much better now than they were in the past. The tools are really awesome. But fundamentally it’s still the same question, and the social scenario is still the same. I’m in the room. I’m the analyst. I’ve presented something. I’m trying to convince someone to do something, and they say, “That number can’t be right. Where did you get that? Why is this data wrong?” That’s a bad feeling. You don’t want to be in that position, and you want to make sure it doesn’t happen again. So that use case of “I’m monitoring the data, tell me if there’s something weird” is an established one, and it fits with data quality.
And there are things that do that. Where I think data observability is different is that it’s really applied to some of the newer and, in some sense, more taxing use cases, where you’re actually providing a service to the outside world: whether that’s a product, whether it’s recommendations, or whether you’re literally selling your data or your analytics to a third party or in a marketplace, which Snowflake is obviously promoting, and Databricks, things like that. When you get into that situation, it’s not that you’re going to feel awkward or that your colleagues are going to lose trust in you. You’re literally going to lose business because this is broken, or delayed, or wrong, and no one’s going to tell you, either. They’re just not going to show up again. That’s where understanding the internal state becomes quite important. So for us, observability means not just “do I know that something happened,” but “how do I dig into the layer I need to dig into and figure out why?” Just to give an example of what that would mean, let’s take the classic one: my pipeline is delayed, what’s going on? It’s one thing, and I think this is monitoring, to know that this pipeline is delayed and we should go figure it out. It’s another thing to go into the pipeline. Let’s say it was using Confluent for streaming, it was using Databricks to do some large-scale aggregation (and maybe it’s using Databricks for more than that), and then it’s landing something in a data warehouse, using Snowflake for that, and you’re running queries against that, and that’s what’s late. Actually isolating at what point in this cycle things became slow is quite important. Then you dig in and say: okay, now show me, essentially, the Databricks console, or information gathered from it, on what’s going on. Was this constrained?
Or did I have the wrong number of executors? Was my data skewed? What happened here to cause this? Am I shuffling tons of data? That’s the level where I think you get into true observability, in the sense that the term is used in the broader, non-data context too. We see it as quite intensive, I would say, or more comprehensive than monitoring. And we’re fairly serious about saying that monitoring is a great thing. As with BI dashboards, there are better generations of that now than there were in the past. But we see observability as the thing that’s going to let you actually control, optimize, and operate this in a predictable way.
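
As an illustration of the "isolate which stage became slow" step Tristan describes, here is a minimal Python sketch. It is not Acceldata's method. It assumes you already record per-stage run durations somewhere, and the stage names, the numbers, and the `slowest_stage` helper are all hypothetical.

```python
from statistics import mean, stdev


def slowest_stage(history, latest, threshold=3.0):
    """Return (stage, z_score) for the stage whose latest duration is most
    abnormally slow compared with its own history, or None if every stage
    is within `threshold` standard deviations of its usual duration."""
    worst = None
    for stage, durations in history.items():
        mu = mean(durations)
        sigma = stdev(durations) or 1e-9  # guard against zero variance
        z = (latest[stage] - mu) / sigma
        if z > threshold and (worst is None or z > worst[1]):
            worst = (stage, z)
    return worst


# Hypothetical durations in seconds for a three-stage pipeline
# (streaming ingest -> large-scale aggregation -> warehouse load).
history = {
    "stream_ingest": [60, 62, 58, 61, 59],
    "spark_aggregation": [300, 310, 295, 305, 290],
    "warehouse_load": [120, 118, 122, 121, 119],
}
latest = {"stream_ingest": 61, "spark_aggregation": 900, "warehouse_load": 121}

result = slowest_stage(history, latest)
print(result[0])  # the stage to dig into next
```

Knowing that the aggregation stage is the outlier is still only the monitoring half. The questions in the answer above (executor counts, skew, shuffle volume) are what you would investigate inside that stage.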

Kostas Pardalis 22:35
Is observability more relevant to the infrastructure that handles the data, or to the data itself? Or is it actually both, in the end?

Tristan Spaulding 22:46
My sense, broadly, is that it’s both, and I find it hard to decouple them for a couple of reasons. One is that, as you get into these data platforms, the actual structure of the data has a huge impact on the compute layer. Just to take the Spark example again: if you have data coming in and that data has drifted in certain ways, that’s going to make whatever configuration and whatever resource provisioning you had before inadequate and suboptimal. At the volumes and velocities we can be talking about, that can be quite a significant difference. Likewise, as you start playing it out and looking at a complex data pipeline that goes against multiple things, timing becomes a factor as well. If this is supposed to read from this table at this time, and the upstream job was delayed for whatever compute reason, now you’re going to have semantic problems in your data that were ultimately caused by an infrastructure issue. So in my view, there’s certainly an aspect of being able to dig into this and underneath it, but it’s almost hard to disentangle them, even though I understand that traditionally there has been this division: I’m a data analyst type person or an analytics engineer, so I look at this; I’m an IT engineer, so I look at that. My read is that, especially as we move to cloud environments, where you’re not asking your central IT team to manage this stuff and you, as the data engineer, are interacting directly with the cloud provider and using their services, this starts to become more of a blending of skills and concerns that used to be separate.
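
The timing point above, where a delayed upstream job quietly turns into wrong data downstream, can be sketched as a freshness check that runs before a job reads its input. This is a minimal Python sketch under stated assumptions: the `is_stale` helper, the timestamps, and the two-hour tolerance are hypothetical, and in practice the last-load time would come from your orchestrator or warehouse metadata.

```python
from datetime import datetime, timedelta


def is_stale(last_loaded_at, read_at, max_age):
    """True if the upstream table's most recent load is older than
    `max_age` at the moment the downstream job reads it."""
    return read_at - last_loaded_at > max_age


# Hypothetical schedule: the downstream job reads at 06:00 and expects
# the upstream table to have been refreshed within the last two hours.
read_at = datetime(2022, 5, 25, 6, 0)
max_age = timedelta(hours=2)

on_time_load = datetime(2022, 5, 25, 5, 30)  # upstream finished as usual
delayed_load = datetime(2022, 5, 24, 5, 30)  # upstream has not run since yesterday

print(is_stale(on_time_load, read_at, max_age))
print(is_stale(delayed_load, read_at, max_age))
```

A check like this only reports that the input is stale. Tying it back to the infrastructure cause of the upstream delay is the part that needs the compute-layer view.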

Kostas Pardalis 24:29
Yeah, it makes total sense. Can you give us a few examples of the experience someone has with the product itself? Let’s say I know that I have a problem with my data and my pipelines, and your salespeople have convinced me that observability is the solution to all my problems. What happens next? How do I implement that, and what am I interacting with in order to experience data observability?

Tristan Spaulding 25:02
Yeah, and I’ll try to give a broad answer to this. I do think we have a unique take on observability and do things a certain way, and I think other people do too, so I’ll try to answer in general here. The way I see this happening effectively is that there are two entry points people look at, one of which I would prefer or recommend over the other. One of them is to connect to a data source, and you’ll see this in many solutions. I think this is a bit of a division in the industry, or the micro-industry that we’re talking about, the data observability industry. Basically, you might connect to the data source, which might be a data warehouse, or files, or a stream. At that point, these are all going to be analyzed by a data processing service. On the one hand, it’s going to look at the compute layer, where this is actually going through, and the jobs that are processing it, and pull information out of those. On the other, it’s going to look at the data itself, compute distributions on it, and analyze it for anomalies. Everyone’s building these little, simple time series models that forecast values and tell you whether or not values are anomalous, things like that. I think that’s one way to do it. The way to do it that we’ve seen a lot more excitement around has been actually instrumenting the pipeline itself.
These days, and I think it’s part of the unbundling, or rebundling, or whatever, there are a lot more code-based pipelines. Obviously, drag-and-drop ETL vendors still have a huge market, but many new initiatives, especially ones that are serving external customers, are pursuing code-based frameworks. You’re coding, and then you’re orchestrating with Airflow or Dagster or Prefect, the list goes on and on. So the approach we’ve seen people get really excited about, especially the people who are not writing the pipelines but are responsible for keeping track of the 10,000 pipelines that are written, is to decorate these pipelines and have that information emitted back to the mothership. Then you get a digest of the same type of information: what’s happening on the data side, does it have values that are out of range or not passing rules, as well as a read on the actual query statistics, or the load on the database at the time, and what that looked like. So there are a couple of entry points here. In terms of what the experience is as a user, supposing that you have this set up, I think there are two phases. One phase is defining and setting up what types of things you analyze. Sometimes people refer to these as tests, or quality checks, or dimensions, things like this. And it’s very important. My philosophy on this is to try to automate as much as possible. It pains me every time someone has to write a test here. What I really want to do with data is be able to say: I know that you have certain expectations of your data, that it’s not going to drift, that the values aren’t going to be over here.
And you can define those things for data in a way that you can’t even define them for software. With data, there’s a way to measure distribution drift, there’s a way to forecast whether things are anomalous, things like that. In my view, and in our view at Acceldata, those should be applied in as automated a way as possible. Of course, it’s not always possible to automate everything. You want to write code, you have custom rules, you need to look across multiple columns. So it’s not going to be fully automated, but you want to push that as far as possible. Once you have that instrumented, it becomes, in some sense, a familiar pattern: you’re getting an alerting framework, you’re getting told what’s going on, you’re getting an incident filed, you’re jumping in, and you’re seeing all these things connected together. The only big difference with data that you’ll sometimes see, and we had a big data seller request and drive this feature, is that you’ll want to actually split out the records that might have failed and segregate them from the ones that go through. Especially when you’re dealing with files very early in the stream, you want to filter out (or quarantine, or whatever you want to call it) some of the things that are a little suspicious before they end up contaminating all your warehouse data, or whatever is going into your model, or things like that.
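
Two ideas from this answer, measuring distribution drift automatically and quarantining suspicious records before they land, can be sketched in a few lines of Python. This is an illustration only, not how Acceldata implements either. The Population Stability Index (PSI) is one common drift score swapped in here as an example, and the field names, sample data, and the `psi` and `quarantine` helpers are hypothetical.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of categorical
    values. 0.0 means identical category shares; larger means more drift."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(expected) | set(actual):
        # Floor each share at eps so unseen categories do not divide by zero.
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


def quarantine(records, rules):
    """Split records into (clean, held): a record is held back if it
    fails any rule, so it never reaches the downstream table."""
    clean, held = [], []
    for record in records:
        target = clean if all(rule(record) for rule in rules) else held
        target.append(record)
    return clean, held


# Hypothetical baseline vs. today's batch of a `country` column.
baseline = ["US"] * 80 + ["DE"] * 20
today = ["US"] * 50 + ["DE"] * 50
drift = psi(baseline, today)

# Hypothetical incoming records and hand-written rules.
records = [
    {"amount": 10, "country": "US"},
    {"amount": -5, "country": "US"},
    {"amount": 7, "country": None},
]
rules = [
    lambda r: r["amount"] >= 0,
    lambda r: r["country"] is not None,
]
clean, held = quarantine(records, rules)
print(round(drift, 3), len(clean), len(held))
```

The drift score needs no hand-written test, which is the automation argument made above, while the quarantine rules are the custom, multi-column checks that still have to be written by hand. A PSI above roughly 0.2 is a commonly cited rule of thumb for a meaningful shift, but any threshold should be tuned to the data.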

Kostas Pardalis 29:27
And who’s the user who, let’s say, takes care of observability? In the infrastructure world, things are pretty clear. You have the SREs; they are the primary consumers of observability products, right? And for a very clear and good reason. When it comes to data, things are a little bit more complicated in terms of who the consumers are and who all the different stakeholders are. Who is the primary user of a data observability tool?

Tristan Spaulding 29:56
In our experience, it’s data engineers, but it’s not all data engineers, and it’s not all the time. What I mean by that is basically data engineers, machine learning engineers, data scientists, analytics engineers, these terms are kind of telescoping, or whatever you want to call it; there’s an explosion in specialization on these. I don’t know that the terms that we use for them, or the job descriptions, have actually perfectly lined up with where it actually is today, let alone where it’s going to be in a few years. I’ve also seen job posts for data quality engineers, data reliability engineers, things like that. So I think if you play this out a few years, there may be a specific role, similar to how there are SREs now, or cloud-ops-focused people that are focused on ops things. My view at this point is it’s sort of the responsible data engineer who is looking at this, and, I would say importantly, kind of their management chain. So I think as you’re getting up and you want a view into what people are doing, that becomes the primary user of an observability system. The first use case is, of course, hey, I’ve got my pipeline, why is it broken? Let me know how to fix it as quickly as possible. I think that’s a clear one. I think there are larger ones as you step back, though, the more that you go from having five pipelines that you’re monitoring to 10,000. I don’t know if this number sounds incredible to people, but we absolutely have customers that have 10,000 pipelines that they want us to monitor. It’s just insane. The tools have become so easy and so powerful that if you’re someone sitting in the central group trying to keep tabs on it, you have no idea what someone downloaded and what they’re running and what data they’re using.
And should they be using that data? If someone asked you to delete that data, could you? One of the things that we found is people start to get very interested when they’re like, wait, you can instrument our data pipelines and sort of keep an eye on them and tell us what’s going through them? That starts to be quite interesting, both to buyers as well as to people in sort of the data governance type of world. But I think the more that you aggregate this type of information, the more you’re actually getting, if you come back to the people we talked about earlier, the people judging and carefully considering the technology investments, a map of what your business is actually doing with data: which systems are being used, which pipelines actually work, which pipelines feed into others, which pipelines are reliable, who is reliable, who is not. So even though the initial user is absolutely a data engineer who just wants to get their stuff working, keep it running, and be able to progress to the next thing, this does end up having different users as you grow and expand it.

Kostas Pardalis 32:53
It sounds like something very interesting, and I want to ask a little bit more about that. So okay, let’s say we go instrument our pipelines. Now we are able to monitor things, everything goes well because it should, and at some point something breaks, right? You get your notifications, and the data engineer will go there and fix it, probably. But the difference between something like data and the observability of servers is that the impact you have with data is much harder, much more difficult to calculate, right? To figure out what happened to the organization in the end. Like, how many reports were wrong because of that? So how do we deal with that? Is there something that we do today? What’s your experience with the organizations that you have worked with? What happens after the incident?

Tristan Spaulding 33:52
My experience with this actually comes more from the ML world, like at DataRobot, looking at what happens when you deploy a model, and you start feeding data into that model, and it starts doing things. I think the case everyone’s probably familiar with is with dashboards, and of course this affects the dashboard or the report. I think with models there’s increased sensitivity: one, just because it’s kind of a newer technology and people have concerns about it; two, because of potential regulatory and reputational things, where you start putting out biased data or just, obviously, garbage things, and people come back and mock you about it on Twitter and things like that. So I think broadly with data, there’s one aspect that’s just being able to understand the whole flow of what’s being used, and I think there’s no way around this other than kind of having integrations with the various stops here. So I think people are certainly very curious: okay, we’ve got a history of historical runs here, this one went bad at this phase, now show me everything that was impacted by this. And so this is something that we built at Acceldata, to sort of show the downstream impact across all these tables, things like that. But it’s a bit of an infinite spiral here, where there are the tables affected by it, there are the reports affected by that, there are the people who looked at those, there are the choices they made off of those, and there are the customers affected by those choices. I mean, you’re right that there’s never enough that you can really trace this all the way through, in the way that with software, maybe sometimes you can say these people hit us and they failed.
On the other hand, I think you can at least trace that chain for a few steps here and say, in cases I’m familiar with, this batch of data was used by this model, which made these predictions, which were off in these ways. Let’s go back and make sure that, for these customers, we score them again, better, or we actually go back and offer them again, because we gave them a terrible offer last time, since we messed up the data going into the model.
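
Tracing that chain “for a few steps” amounts to walking a lineage graph downstream from the run that went bad. The sketch below assumes lineage is already available as a simple mapping from each dataset to the things that read from it; the `downstream_impact` function and every table, report, and model name are hypothetical and only illustrate the traversal.

```python
from collections import deque


def downstream_impact(lineage: dict[str, list[str]], failed: str) -> list[str]:
    """Return everything reachable downstream of `failed`, nearest first."""
    seen = {failed}
    impacted = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against cycles and repeated paths
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: each key feeds the assets listed in its value.
lineage = {
    "raw_orders": ["orders_clean"],
    "orders_clean": ["revenue_report", "offer_model"],
    "offer_model": ["customer_offers"],
}

impacted = downstream_impact(lineage, "raw_orders")
```

A breadth-first walk lists the closest dependents first, which matches the order an investigation would usually follow: tables, then reports and models, then whatever those fed.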

Kostas Pardalis 36:01
Yep, yep. 100%. And I think that’s where things are more complicated and more interesting with data. You have observability, and we are talking about that, right? Because that’s what Acceldata is doing right now. But then you start thinking, okay, if I also have other, let’s say, capabilities that come from data governance, like lineage, so I have, let’s say, a track of how the data got consumed and how it changed, and all these things, then I can start tracking what the impact is there. And then, if I also have proper auditing mechanisms in place, I can see the people that are getting involved in and affected by that. So what are the parts of data governance specifically, but also the broader data stack elements, that you think are important as multipliers for data observability, and vice versa?

Tristan Spaulding 37:04
This has really been one of the key areas that I’ve been looking at and trying to think through in the time that I’ve been at Acceldata. It’s one that everyone is sort of thinking through as well. And I think I would answer it basically with the theme that we’ve talked about throughout this discussion a little bit, which is really the theme between sort of existing and new, modern data stack. I don’t know what you call the existing data stack in a nice way, but basically the previous data stack, or the current data stack, and the modern, new data stack. And I think it’s an easy distinction, where a lot of the data governance things are almost, you even hear it in the word lineage, it’s almost a standing still, like studying how things are fitting together, almost academic. It has very real applications in terms of data privacy, data protection, things like this, but it is almost sort of understanding what’s the intended connection between some of these things. I think where observability has proven to be quite a useful additional lens on this is basically for these people that are not working in the data warehouse: they’re out writing code, they’re grabbing datasets from different places. And this stuff actually did not really go through the data catalog or things like that, it did not get classified, it did not get a million rules applied to it. And so it’s basically out there in the dark, as far as the data stewards are concerned, and I can tell you, that’s something that they’re quite concerned about in many cases, because going from 80% coverage and analysis of some amount of data to 90% is not that great if now there’s a whole new 80% of data that you have no insight into. And so I think these things converge in that.
And certainly one aspect is just what touches what. So I think the ability to go into pipelines, basically be generic about what you pull in and what you analyze, and sort of understand what the actual usage and impact was, is quite significant. So if a certain thing is always breaking, or always causing something else to break, that’s an additional point: you’re getting a real map of the actual dynamics of how your data is flowing, versus the intended map of how things should be working in the systems you have insight on. More specifically, if you dig into some of these things, I see the observability sector as a bit more, I would say, performance focused, in the sense of looking at what’s going to affect the outcome of the end result here. And what I mean by that is, again, this distinction, which maybe is just informed by my background and things like this, between the very established BI reporting side of the world and the data product, data service, ML-powered thing on the other side. So when you’re building and deploying a machine learning model, you very much care about subtle distinctions in how the data is distributed and how that’s shifting, and whether that is in a harmful way with respect to the model, and things like that. I see observability tools as very much getting into this and being very dynamic, because the thresholds, not to go into all this stuff, it’s a different question, but the ways that you measure this will change over time in response to seasonality and the different ways that you’re cycling the data and updating the model, and all this stuff. So all of those things, I don’t think any data governance tool cares about those, because those are not about the data steward’s life and making sure we’ve got a complete map of where things are. Those are about basically real-world performance and impact, things like that.
So, slightly different focus areas. And I think, certainly, and it’s one of the big things we think about with our partnership strategies, there’s tons of room for cross-collaboration and connection here. So just one example is, of course, all that great stuff that the observability tools dig into could very well be populated back into a data catalog that kind of owns the experience of where people look for data. And now that’s enriched with some awesome information, and a lot more detail than you would have gotten otherwise. I think we’ve actually seen some moves by some of the vendors there to start bringing in some amount of reliability and quality stuff. I think there’s a lot more to do. Likewise, I think you mentioned, one of the big things we try to do at Acceldata is automate as much as possible, and no more. And one of the things that helps with automation, and with defining policies and propagating policies across datasets and across columns, is metadata about those. And so when we know this data attribute has been classified in this way, it’s used in this way, it’s subject to these controls, that actually can inform the intelligent creation of policy, so that you, as the data engineer instrumenting your things, actually don’t need to instrument at all, other than to say, hey, point to my observability platform and let it take care of the rest. So these are early days, I think. It’s been one of the learnings for me as I’ve come in, just the different emphasis areas of data governance versus data observability, and the different personas that end up using them.

Eric Dodds 42:10
One question, if I can jump in, Kostas. Tristan, you talked about data drift, which is an interesting topic that I don’t think we’ve actually discussed in depth on the show. But I would guess that most of our audience has a base sense of what that means. How have you seen data drift start as a problem? Part of the impetus behind the question is, data drift is one of those things, we actually had a guest a couple of shows ago who said data is silent, which I thought was a very interesting way of describing a lot of the problems, or the nature of the problems. When you discover it, a lot of times, at least in my experience, which is probably on a smaller scale than what you’ve seen, you discover it when it’s a big issue, right? Where you’re just like, my goodness, and you realize, okay, this has been happening for a while. Where does it usually start? And what are some of the really early symptoms that our listeners could look for?

Tristan Spaulding 43:12
The first thing I want to do is distinguish a couple of different senses of this; you can tell I was a philosophy major now. I think sometimes data drift in the data engineering world refers to schema drift, like, hey, we’re changing the structure of this, or what type of field this is. That’s actually not what I’m talking about, and I don’t think it’s what your guest was talking about. When we talk about data drift in data science, it’s basically changes in the distribution of data. So you’ve got the same fields, but the patterns in the data are shifting. And I think one of the funny things about this, one of the things that I learned, is that in the machine learning context, a lot of the banks and insurers, since banking and insurance are by their nature fundamentally about forecasting and predictions, are extremely mature in how they govern models, since they’ve been using them for a long, long time. And so, for models for banks and insurance companies, there’s been a standard measure for quite some time called the population stability index, and there are various forms of this, where basically you’re building histograms of two datasets: one is the data when you trained the model, and one is the data as you just saw it, or in whatever time window you care to define. And for a long, long time there have been measurements of this, and you can quantify this and say, hey, if it’s above 0.2, that’s worrying, and things like that. And then I think one of the things that’s become more sophisticated as more people have adopted machine learning, and it’s become more critical, is that there are actually a lot more techniques to do this now than maybe there once were. So there are a lot more ways to do that. There are ways to do it using models.
There are ways to adjust it for time series, things like this. So it’s become almost a routine check in many machine learning contexts. I actually have always thought it should be used more generally, because that shift can actually be meaningful, and it can catch things that are not obvious to people. So I totally agree that the problem with data failures is precisely that they are silent unless you set up the things to alert on them. And in machine learning, they actually have learned that lesson, and it’s fairly common in MLOps now. Now, the interesting thing about this, I would say, is you’ve asked exactly the right question from my perspective, which is, when does this start? Because that data, before it got to your model, went through a whole data pipeline, and so incurred a lot of expense and touched a lot of things and landed in a lot of warehouses. Why would your alerting only be at the last second on this thing, when it goes into this model? And so I do think one of the opportunities for data observability companies is to bring in some of that stuff. It’s not necessarily what data engineers wake up every day and think about, and that’s exactly why I think vendors who build this in have a lot to offer, to be able to tune this and offer these checks out of the box for you, and to be able to apply those on a stream of data, or on a file, on a JSON file, before it gets processed and wrangled and ends up in the warehouse. I think there are a lot of things there. The other problem with that, sorry, is basically the alert fatigue thing, and I think that’s obviously an issue with any system.
I think it’s a particularly intricate problem here, since it’s like the carbon monoxide alarm, not to make it more dramatic than it is: it’s something that you wouldn’t sense otherwise, you don’t know that it’s there, and you don’t really necessarily know. So the last thing you want to get is 100 things. I’ve seen people do this: you get 100 alerts saying all my data is drifting, and it’s like, I don’t care, mute that, no one’s gonna notice. There are some other reasons it’s tricky in machine learning, where you wouldn’t notice the effects. But I think figuring that out, and basically reconciling these metrics of drift that people compute now against the actual business relevance of it, is one of the interesting UX problems that a lot of companies are kind of thinking through now.
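
The histogram comparison Tristan describes can be sketched with the commonly used form of the population stability index, the sum over bins of (actual share minus expected share) times the natural log of their ratio. The binning choice (ten equal-width bins taken from the baseline), the `eps` floor for empty bins, and the function name `psi` are assumptions made for this illustration; real implementations differ in how they bin and smooth.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))           # stand-in for the training-time data
shifted = [v + 30 for v in baseline]  # same field, shifted distribution

no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

Comparing a sample with itself gives zero, while the shifted sample lands well above the 0.2 level mentioned in the conversation. Whether a given score matters to the business is the judgment call discussed above, not something the number settles on its own.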

Eric Dodds 47:13
Yeah, super interesting. And I think one of the things that’s interesting to me is that in the context of the drift happening, even the final state of the model is sort of a variance on a baseline. And so you could have drift that doesn’t necessarily trigger an alert, but that is damaging in some way.

Tristan Spaulding 47:38
Exactly. It’s all stuff that’s hard to tune. I think the other distinction is, we saw this when I was at DataRobot during 2019 and 2020 and 2021. Everyone’s models were drifting, whether you could quantify it or not; there genuinely was a real change in the world, of huge severity. And so there will be quantifiable and legitimate drifts that you can understand. And we had people looking at that and analyzing it and adapting to it and things like that. So sometimes it’s real. Just the fact that an alert triggered doesn’t mean the data is wrong; it could be, no, this is a genuine shift that you should know about. And it’s almost as good an insight as you’d expect to get on any BI dashboard: hey, this attribute has significantly changed, and things like that. So I think it’s one of these things where, when I told people, hey, we’re trying to automate a lot of things and make it really easy for people there, they started asking, well, how do you automate the response? How do you automatically fix it? And my answer is usually, I don’t think you can automatically fix these, or you shouldn’t at this stage. Setting aside the permissions issue to actually affect these things, there are things that are possible to automate and things that aren’t really possible to automate well. And I think judging the impact of these things, is this actually wrong, is there a reason for it, I think that’s mostly a human discussion, where you need to go put your shoes on and do some detective work and figure out what happened here. There’s no tool that’s gonna magically do that; the tool will magically tell you about it.

Eric Dodds 49:11
No, yeah. Can we just spend a few minutes talking about tuning? So this came up when we were prepping for the show. And we kind of talked about the analogy of an engine: building an engine requires connecting certain parts. And I think a lot of people’s standard definition of a data engineer is building an engine that just runs really well, right? It’s efficient, it doesn’t have a lot of problems, it behaves consistently. But tuning is really kind of a different skill, right? We’re talking about performance, we’re talking about taking a system and identifying the elements that can be optimized, and then talking about the ways in which we want to optimize them. And a lot of times tuning is sort of balancing cost and benefit, right? We can accelerate one variable of the equation, but that may come at a cost to other variables. Tell us about tuning in the context of data engineering or MLOps. What is that skill set?

Tristan Spaulding 50:11
What’s interesting about it is it basically comes back to the diversity and the power of the tools out there. And what I mean by that is, one, a lot of these tools are quite new and evolving quickly. And so it’s not that things are static and you can master it and you’re going to be working with one vendor. You’re actually always facing this choice, to some extent, of, do I use something I know how to tune well, that I personally know how to tune well, or, as a manager, that my team knows how to tune? Or do I try something else that might be a better engine, but I don’t really know how to deal with it, and things like this. And so I think this is a choice everyone is facing all the time. I don’t know if you guys have seen this, but if you go and look at the DB-Engines page that’s showing the ranking of these, plot a few databases on there, and you’ll see the curve to accelerate to the top has become quite sharp. And so there are tons of cool companies on there. Presto/Trino spike up, you see Snowflake, obviously. There are more that are happening now. And so I think this just shows, to me, that the ability to choose a better engine is always out there. But you’re always facing this risk of, yeah, I don’t know how to deal with that, how should I do this? And so I’ll just mention one other thing, maybe one of the approaches, I’m not saying it’s the solution, but one of the approaches to solve it. In many cases the answer, right, is, oh, well, there’s a commercial company kind of backing it, and things like that, and they can take care of tuning for you, which is a great solution. It’s value for money on that. But I do think sometimes people are looking for control, and what if they want to do it differently?
Or they want to compromise, things like that? Or what if they switch, and things of that nature? So I think one of the things that we try to do, at least in the Acceldata world, is look into each of the major offerings, basically the cloud data engines here, in enough depth that if you’re a specialist in one, you can keep operating it as you need, but if there’s something that might be a better fit, you can use that instead; you don’t need to go all in on one. I know many of the vendors, not to name names, are trying to converge on each other’s territory, whether it’s data processing or analytical databases and things like that. So I don’t know that, for the actual end user, it needs to be a one-or-the-other thing, at a technical level at least. I think these things can be tuned through expertise or through data. And that’s something software, in my view, should be able to help with, if you do enough research into the actual platforms, and so you should be able to get that out of the box. I think the only other factor I’d throw out there is the cost aspect as well; so many things are solvable through paying more.

Eric Dodds 53:11
All things maybe.

Tristan Spaulding 53:14
No, some of these queries, no. But there’s always a solution that’s to spend more, buy more engines, things like this, the brute force, right? And at some point it doesn’t always solve things, but we have seen it. That wasn’t possible 10 years ago, necessarily, to just say, throw more at this. It was, well, I need to go back and buy another server that I’m gonna put this thing on. And today, I just say, yeah, get more. Maybe I don’t even need to do anything; it auto-scales up. And then I realize after the fact, oh, wow, I just spent $30,000 on this query or something, running this query for a month. Someone yells at you that way. So, I mean, it’s really complex. On the one hand, some really powerful tools have become available, and they’re very easy to use. On the other hand, it’s become a very complex thing, just because there are so many of them, and because it’s so easy to basically rack up some spend on those that might not make some of the people in finance happy. So it’s gonna be an interesting time. My view is that this is going to expand, and it should expand. I want to see these databases coming up left and right that are specialized for specific things. I want them to be really easy to use; that’s what’s going to help us accomplish these use cases better. It’s just that the data engineer sitting in the middle of this is going to have quite a complex world to navigate through.

Eric Dodds 54:36
Yeah, for sure. You find it is fun. And Brooks is telling us that we’re at the buzzer here, so I have time for one more question. And this is less on the technical side. But with the show, we want to just sort of help our audience get to know the people who are doing interesting things in data, beyond just the technical components. And so my question is, what do you really love about what you do working with data? What keeps you coming back to working with data? And why is the problem interesting to you?

Tristan Spaulding 55:09
It’s an interesting question. I mean, I think part of it is definitely that seeing some tangible improvement and tangible impact is always significant, or, I should say, quantifiable; maybe that’s the better way to think about it. One of the interesting things with data is that it is structured enough that you can measure, hey, this got this much faster. Why? Because I did it in a smart way, I was able to do it 10 times faster, and I can actually measure that. Similarly, some of the things in the machine learning world are honestly some of the most impressive accomplishments that have happened in the world in quite some time. And so I think, on the one hand, just understanding how they work, being able to learn about those, is stimulating in its own right, and I’m proud to see people building these products and services that use predictive technologies and things like that in a quite smart way that can explain what they are. It’s just very cool. You can spend forever learning about it. And then I think what we do day-to-day at Acceldata is rewarding in its own right, because it’s basically combining those two aspects. So it’s taking, look, we have crazy data technologies, but we don’t know how to do them totally, especially for a complex enterprise environment with a lot of investments already. And then there’s this promised land, if I could just get to it. If I could just get all these data pipelines actually working and running and things like that, I could use all of this awesome technology to help me build things we can’t even imagine, really. And so I think, I don’t know, it’s motivating for me. Sometimes data pipelines are kind of messy stuff, you know, it’s not always pretty and flashy, with shiny lights and shiny graphical interfaces and things like that.
But it’s kind of the part that brings it all together. And so I think it’s rewarding to kind of bring something to that group and give them a place where they can sort of accomplish these pipelines a little more easily.

Eric Dodds 57:06
Yeah, I mean, it sounds like, you kind of can have your cake and eat it too because you sort of get to address problems on a philosophical level. But you don’t have this intractable philosophical quandary that doesn’t really have a solution, you can actually like provide value and make things better and faster.

Tristan Spaulding 57:26
More engines, that’s the product.

Eric Dodds 57:27
Just more engines. This has been such a great conversation. We learned a ton and thank you again, for taking some time and teaching us about all sorts of things.

Tristan Spaulding 57:39
Yeah, thanks so much for having me on.

Eric Dodds 57:43
Kostas, you know that I love asking our guests about how what they’ve studied previously, whether it’s academic or not, that’s completely separate from data and engineering, influences what they do. And so my favorite part of the show was hearing about that from Tristan. I mean, it was so fun for me when I asked a question, and he said, my response is going to be indicative of the fact that I’ve studied philosophy, and he really, in a very concise way, broke down, in some ways, the problems with the question, right? Like, it requires definition and other things like that. And so I just really appreciate that. And I’m going to continue to ask those questions, which maybe you should, because I think your lineage, if I can use a term that’s relevant to what we talked about on the show, is much more philosophical than my lineage. But yes, I will continue to ask those questions. What did you take away?

Kostas Pardalis 58:45
Yeah, you should. I think one of the most interesting parts of the show is connecting the dots between what people were doing in their past, or what they studied, and how they ended up working with data. So yes, definitely keep asking, and I’m pretty sure that we will be hearing more and more interesting stories. From my side, I finally got the answer to, like, what do you study when you want to become a product manager? And that’s philosophy.

Eric Dodds 59:21
Oh, interesting.

Kostas Pardalis 59:22
Yeah. I was always thinking, okay, if you want to become a product manager, what do you do? Where do you study? How do you do that? And it made a lot of sense today. You go study philosophy, and then you’re equipped with all the analytical tools that you need to keep asking why, again and again and again, until everyone in the room is mad at you and they just want to get rid of you.

Eric Dodds 59:51
And that’s actually what shows you that you’re doing a good job?

Kostas Pardalis 59:55
Yes, exactly. Exactly. So yeah, it was a very interesting point of our session today, making this connection between having these skills that you get from something like philosophy and actually ending up developing and thinking about and designing products. So that was super interesting.

Eric Dodds 1:00:15
I agree. I feel like you have a very vivid picture of how that should work. All right, thanks for joining The Data Stack Show. Lots of great episodes coming up, subscribe if you haven’t, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
