This week on The Data Stack Show, Kostas and Eric are joined by Stephen Bailey, Director of Applied Data Science at Immuta. Immuta is a startup that focuses on enabling data teams to have really fast, efficient and understandable access controls on their data.
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome back to The Data Stack Show, we have another fascinating guest for you, Stephen Bailey of Immuta. He works on data governance inside of a company that has a product that does data governance. So it’s going to be really interesting to hear about, potentially his own usage of the product and his work. But he also has a fascinating background in studying the human brain, which I hope we can talk with him about as well. Kostas, you are doing some data governance work in our own product right now. What questions do you have for Stephen that you’re interested to ask about?
Kostas Pardalis 00:41
Yeah, absolutely. I mean, data governance, in general, is a very hot topic lately. And there are many things associated with it from access control to the data to data quality, data catalogs, metadata management, all that stuff that they sound a little bit too enterprisey at times, but actually, the more we work with data, the more of a necessity they become. And all of these are problems that we haven’t solved yet. So it’s very interesting to have a company that is trying to solve this problem. So yeah, there are plenty of questions around how they do it, why they do it, what the use cases are, and how they approach in general the actual definition of what data governance is. So I think, you know, we’re going to have a very interesting discussion, and a very useful one for anyone that works with data today. I agree. Well, let’s dive in and talk to Stephen.
Eric Dodds 01:33
Today we have Stephen Bailey from Immuta. Stephen, thank you so much for joining us.
Stephen Bailey 01:38
Oh, thank you all. I’m excited to be here and chat through some interesting data governance and privacy topics.
Eric Dodds 01:44
Well, that’s a subject that we love. But before we get going, it’d be great. You have such an interesting background with a variety of different experiences. I would love to get a quick overview of, you know, your background and what led you to Immuta and then also just give us an overview of what Immuta does. What problem are you solving?
Stephen Bailey 02:04
Sure, I’d be happy to. So I have always been interested in a wide variety of things. In college I did a chemistry and philosophy double major and really enjoyed digging into history and literature and intellectual ideas and bandying those about, but when it came time to get a job, I actually started in education, working in business operations for an education nonprofit. But through a series of turns of events, I went and got my PhD in cognitive neuroscience and investigated how kids learn to read and how the brain changes as kids grow from four to five to 15 to 55. What I found, you know, throughout all of that journey was that I just really loved working with data. I loved asking questions. I loved figuring out what is valuable and what is not. And even the process of managing data itself, you know, there are endless opportunities to optimize and change and improve things, and I just really fell in love with it. So as I was finishing up my PhD, I started looking for data science jobs, found Immuta, and it was just a perfect fit. Immuta is a startup that focuses on enabling data teams to have really fast, efficient and understandable access controls on their data. We use the word governance in most of our marketing materials, but really, it’s all about enabling more efficient and more responsible access control. Technically, the way we work is, we either sit in front of your database and mediate access to data, enforcing fine-grained access controls, masking, and row-level security directly, or we have plugins, essentially, that sit on the database systems themselves and can enforce access controls natively in the system. These are for technologies like Databricks and Snowflake, the cloud-native technologies.
What’s really exciting to me, as someone on the team who works on and leads our internal analytics efforts, is that access controls, data quality, and data governance are really the place where data engineering meets data science meets the business requirements, and all these people have to come to the same place. And it’s very much an unsolved problem. I think there are as many ways to define data governance, and define what good data governance looks like, as there are companies that are using data. And so it’s just a really, really rich place to innovate in.
Eric Dodds 04:56
Well, I know we have tons of thoughts and questions around data governance and would love to even discuss sort of the different definitions for that word, because as you said, you know, data control, data governance, data access, you know, there’s sort of overlapping components of those definitions. But before we get into that, I just have to ask this question, because I know from researching you, you have young kids, and you did a PhD and sort of understanding how kids learn to read. So I would love to know about your experience, studying that at sort of a doctorate level and then seeing your own kids learn to read and being part of that process. Was there anything interesting you can share from that experience of sort of studying it from an academic sort of data driven perspective? And then your own experience actually, actually doing that with your own kids?
Stephen Bailey 05:48
Oh, man, that is such a good question. And the reason I love it is because it really showed me the experience of studying cognitive neuroscience, and specifically how the brain rewires itself when you’re learning to read. The brain specifically takes visual circuitry and auditory circuitry and semantic association circuitry and makes a super efficient connection between those different systems in order to enable you to read rapidly and automatically. And that happens through practice, practice, practice, practice, and you can actually observe this happening in the brain. That’s what my lab was focused heavily on doing, using functional MRI. But, you know, I spent five years learning the techniques to manipulate medical images and do these group analyses and clean all of the data and all this stuff. And then ... I shouldn’t say it teaches you nothing about actually teaching kids how to read, but it doesn’t prepare you for the experience of teaching a child to read. So I think there are some principles you can get. You know, rewiring the brain takes practice, practice, practice, and it takes attention. So it’s not just about the amount of hours; you’ve got to have good hours, the kids have to be focused.
Eric Dodds 07:24
Like quality versus quantity. It’s not just brute force.
Stephen Bailey 07:27
Yep. And you’ve got to scaffold your learning. So you’re learning a bunch of skills that can kind of be learned independently, and then you’ve got to learn to associate them together, and then you’ve got to practice. And then the other piece is the emotional piece: the more kids like to read and enjoy reading with you, the more open they’ll be to additional practice, which leads to more, you know, neural refinement. So you can reduce the equation, so to speak, to some very dry variables from a scientific perspective. But when it comes to actually raising a kid who loves to read, you have to embrace the human elements: creating an environment where they enjoy it, finding books that they like. All of these pieces are super important. So there’s both the scientific question, and then there’s the human question that you have to take into account in practice.
Eric Dodds 08:32
Fascinating. And I mean, I would argue, and I’m monopolizing here, so I want Kostas to jump in, because I know he’s, I mean, honestly dealt with a huge number of data governance issues. But it’s interesting. In many ways, I would say the same principles apply to even data within an organization, you know, where having clean data and focusing on a process is one thing, but you have real teams using real data, which is messy. And, you know, when the rubber meets the road in a fast moving company, it’s, you know, it’s a little bit of a different game.
Kostas Pardalis 09:06
Yeah. Actually, I have a question similar to yours, Eric, before we move on. So Stephen, you said that you studied how kids learn, but you also tried to figure out how this happens in later stages of a kid’s growth. You mentioned some things earlier about emotion and attention; are these things still important in later stages of our lives? For example, how important are they for a person my age or your age? Because we keep learning. It’s not like we stop learning at some point, maybe not as rapidly or as efficiently as kids can, but learning is a process that continues throughout our lives. So how do these things change as you grow up and get older?
Stephen Bailey 10:00
That’s another good question. And I’m really loving going back to this brain stuff, because I haven’t talked about this since I graduated, so this is a breath of fresh air for me. This is awesome. In developmental neuroscience, there’s what they call “critical periods” where children or adolescents are particularly disposed to gain new skills; they can really just soak it up. If you see a child learning language between two and six, they just ambiently pull it all in, and it just kind of takes shape. What happens during those critical periods that doesn’t happen when you get older is that your brain is particularly plastic. It actually goes through and quickly disposes of connections that aren’t as useful; this is what’s called pruning. And as you get older, you sort of settle into a very efficient pattern. So the general model you can think of is: when you’re young, you’re very disposed to creating new connections very quickly, but as you get older, your brain basically figures out the most efficient paths for what you need to do, and it becomes more efficient and automatic at doing those things. Now, what’s cool about the brain, and why everyone loves studying it, is that you can change that as you get older. Right now, for example, I’m learning guitar, and I’m going from zero to, you know, trying to be able to play at least one song. And it’s very challenging; it would be very challenging if I were eight years old too, but as an adult, I have a lot more awareness, and I know how to structure my practice in an effective way. So I’m not worried about not being able to learn it.
It’s just that it’s probably going to take me a little more time, focus, and practice, and some structure and discipline around the way I’m doing it, to really be super effective at it.
Kostas Pardalis 12:14
Yeah, and probably you are also much better at controlling your emotions, something that kids need someone external to take care of. Actually, I found it very interesting that you tied the concept of emotion to liking the learning process. That’s very, very fascinating. But I think we need to arrange another recording just to discuss that.
Eric Dodds 12:33
I know I could go on all day because this is so fascinating. Yeah. Yeah.
Stephen Bailey 12:40
Just one last thing, and this will be a bridge to some data stuff. Anyone who studies the brain hopefully gets a little offended when people link AI and neural networks directly to the brain and say, oh yeah, we’re going to build something that’s exactly what the brain does. There’s so much that the body does to support brain functioning that’s just not even part of the conversation when many people talk about the relationship between neural networks and the brain: hormones, cortisol, attention, emotion, even sensations from your body. All of these things are super important for brain functioning and brain processing, and there’s just no real analog for them in data, computer science, or neural networks.
Kostas Pardalis 13:32
Yeah, absolutely. And I totally understand. That was something I was thinking about while you were talking about things like attention and emotion, because, for example, one big thing right now in all the deep neural network research that’s going on is how to use attention, as it’s called. Of course, attention in this context is much different from attention in the human brain, but we keep trying to find parallels between how the human brain works and how these computational models work. So when you talked about emotions, I couldn’t help but think, okay, is this the next thing after attention? Are we going to try to put emotions into neural networks too? But anyway, these are things that I think we need a lot of time to chat about, and we should probably arrange another call to do that. So let’s move forward with talking a little bit more about your role at Immuta right now. What I wanted to ask you, and what I find quite interesting in your case, is that you have a data-related role inside a company whose product is all about data, right? I assume, and this is something that I would really like to find out during our conversation, that data governance is something that also affects the lives and the work of data scientists and data analysts. So how do you use the product internally? What do BI and data analytics look like at Immuta, first of all? Is it for product, is it for business decisions? Can you give us a little bit more information around that?
Stephen Bailey 15:09
Sure, I can break this into two responses. First we can talk a little bit about the technical responsibilities and stack, and then maybe about the organizational piece, because I think both are very relevant. We’re heavy believers in dogfooding our own product, so one of the first things I did when we started building out our internal infrastructure for analytics was get our product in front of our database and behind our analytics tool of choice. Our current stack should be pretty familiar to anyone who’s heard of the modern data stack, as it seems to be called now: it’s basically Stitch to Snowflake to Immuta to Looker, and that forms the core. We use Argo, which is a Kubernetes-native container orchestrator, for orchestrating jobs. But it’s a pretty standard setup for a small company. So Immuta’s role, which I think is really the interesting piece here, is as an arbiter of access control, but also as a place to land and focus our metadata management. We have metadata about jobs, raw data coming in from Stitch and from some custom taps that we run, metadata about dbt and the models that we’re building in dbt, and usage data from Snowflake. What we want to use Immuta for internally is to aggregate especially governance-related data, such as where personal information is stored, who should have access to data, and identity management concerns, and to have Immuta push that to our consuming services, whether data scientists are accessing data in Snowflake or in Looker. We’re basically trying to build out a centralized governance or access control capability there.
Kostas Pardalis 17:20
So from what I understand, within Immuta right now you’re handling two main functions. One is the aggregation and management of metadata, and the other one is access control, which probably also needs the metadata in order to be implemented. Do I understand it correctly?
Stephen Bailey 17:40
Yeah, that’s correct.
Kostas Pardalis 17:41
So how is this metadata defined? Say you’re a data scientist and you’re starting a new pipeline for your data, a new project. What is this metadata, and how does it come into existence? And in the end, how do you use Immuta to store this metadata and use it outside of access management?
Stephen Bailey 18:02
That’s another good question. So the metadata that we leverage in Immuta is all built around enforcement policies. It tends to be much simpler than, you know, the massive amounts of metadata you could associate with an individual dataset or pipeline. In particular, we want to define a minimal set of tags that are related to any actions that are going to drive a decision about who has access to what data, for what reason. It basically boils down to three things: user attributes, data attributes, and contextual attributes like accessing data for a certain purpose. These are all elements of attribute-based access control, which a lot of companies implement. But what we found in working with companies and deploying Immuta internally is that you really have to take a step back at the beginning of building out your data warehouse and define: what are my hard requirements about what needs to be tagged, who should have access to what data, and for what reasons? And so at Immuta, we have a pretty transparent organization around data, but we still have heavy requirements around making sure that for any data that comes in, we identify whether it has personal information in it and whether it has privileged information in it, such as employee salaries, for example, and making sure we’re tracking that as it propagates along the data modeling layer, and then enforcing access control in our database system.
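The three elements Stephen lists (user attributes, data attributes, and contextual attributes like purpose) can be sketched as a tiny access-decision function. Everything below is illustrative: the attribute names, tags, and policy shape are hypothetical stand-ins, not Immuta’s actual model or API.

```python
# Minimal sketch of an attribute-based access control (ABAC) decision:
# combine who the user is, how the data is tagged, and why they are asking.

def can_access(user_attrs: set, data_tags: set, purpose: str, policy: dict) -> bool:
    """Return True if a user with these attributes may see data carrying these tags."""
    required = policy.get("required_user_attrs", set())
    restricted = policy.get("restricted_tags", set())
    allowed_purposes = policy.get("allowed_purposes", set())

    if data_tags & restricted:
        # Sensitive data: the user needs the required attributes AND an approved purpose.
        return required <= user_attrs and purpose in allowed_purposes
    return True  # data with no restricted tags is open in this sketch

# Hypothetical policy: salary/PII data only for the HR team, for compensation review.
policy = {
    "required_user_attrs": {"hr_team"},
    "restricted_tags": {"pii", "salary"},
    "allowed_purposes": {"compensation_review"},
}

print(can_access({"hr_team"}, {"salary"}, "compensation_review", policy))  # True
print(can_access({"analyst"}, {"salary"}, "dashboarding", policy))         # False
```

The point of centralizing this decision in one place, rather than scattering grants across each database, is that the single policy object becomes the agreement everyone can read.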
Kostas Pardalis 19:43
So we were just discussing how you’re using Immuta internally, which describes a very important use case for the product. Is this the main use case that you see, or have you seen people deploying Immuta in different ways, trying to solve other problems outside of the things you’ve mentioned already?
Stephen Bailey 20:06
The main use case for Immuta is simplifying that access control layer and uniting your different systems with the same identity access control. In particular, one of the core innovations in our product, I think, is a global policy builder that’s quite human-comprehensible. So, if you’re familiar with AWS IAM policies, you know how hard those can be to comprehend. Immuta makes it very easy to create a policy that a compliance person or a data access person or a data engineer can understand, and then apply it across any dataset that’s tagged a certain way. One of our core bets when the product was originally built was that, to do data governance better, we have to have better communication channels around our data. If I’m a data scientist and I can’t get access to data, why not, and what attributes do I need to get access? If I’m a compliance person, what is actually being implemented in Snowflake, and who has access to it? So that’s definitely the main use case. What’s great about attribute-based access control, and particularly policy-based access control that’s a little more human-understandable, is that it can take a ton of policies that might be in effect on a database down to a single policy in some cases, or a couple of policies in many cases.
Kostas Pardalis 21:43
Oh. Okay. That’s great.
Eric Dodds 21:46
Well, actually, I think you answered part of the question I was going to ask. And I know it varies with the complexity of the stack, the size of the organization, and probably even the industry and type of data. But you mentioned AWS IAM policies. Is that the primary way people are solving this if they’re not using Immuta or a similar tool? I guess, what are the ways that people are experiencing the pain that you solve, and how are they trying to solve it outside of Immuta?
Stephen Bailey 22:23
I think, to answer that question, you really have to ask: who are you talking about, and where in the pipeline are you talking about? Because take even a very simple pipeline like ours: we have to manage data access in Stitch, we have to manage it in the raw tables in the database, we need to manage it in the Immuta-sanctioned part of the database, and we need to manage any consuming application. So if you expose data in Looker, are you using a system user that has global access to the Snowflake data? If a data scientist comes in and wants to stand up some infrastructure of their own, how are you managing access to it? So I think there are two real issues. One is there’s just a huge proliferation of where data can be within an organization. And the second issue is that no one knows the answer to the question of who should have what data. That’s really problematic. I’ve been in organizations where there are some documents that exist somewhere on someone’s computer, or in some shared drive, about what a data policy is. But in effect, no one who’s on the front lines knows what that policy really is. And so if someone asks for data, they just get data, or they might ask for data and no one knows how to get them the data. So I think having clarity around how data should be used, and also, of course, knowing where it is: those are the two biggest pain points that companies are facing.
Eric Dodds 24:06
Yeah, absolutely. I think it’s very, very interesting to think about various levels of access at various points in the pipeline, and the points where you do need some sort of governance around access. But one more specific question, and then I’ll hand it back over to Kostas. In your pipeline, you said that you go from Stitch and some other sources into Snowflake, to Immuta, to Looker. So is Immuta actually sitting between Snowflake and Looker? I ask because we have leveraged Looker on top of Snowflake as well, and just as a user of that particular piece of the stack, I’m interested in what it’s like to insert Immuta into that equation and what it’s like to interact with Looker running on Immuta, if that’s actually what you meant by how it works.
Stephen Bailey 25:03
Yeah, so our Snowflake integration and our Databricks integration are what we call native workspaces, which means Immuta sits behind the scenes and actually creates views, or secure views, of your data within Snowflake, so your Looker would still be pointing to Snowflake. And so what we have internally, which is actually a really neat experience, is Google single sign-on to Immuta to Snowflake to Looker. There’s one identity; people don’t have to know any passwords except their Google password. And Immuta is enforcing access controls, whether they’re row-level security or column-level masking or just subscription-level access, on a Snowflake account, without anybody ever even having to log into Immuta or change where they’re pointing Looker. Now in other cases, for example, we started out on Redshift. In that case, Immuta does act as a proxy, so you’d be accessing your Redshift data through Immuta, and Looker would actually be pointing to Immuta as a Postgres proxy engine. But the Snowflake integration is very cool, because you can create different warehouses, and everyone accesses the data through the public role, but they’re having individualized access controls applied. So it really eliminates some role management issues that you might have if you’re trying to do dynamic access controls in Looker.
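The native-workspace pattern Stephen describes (the governance layer writes secure views into Snowflake so BI tools keep pointing at Snowflake itself) could be sketched as a small view generator. This is an illustration only, not Immuta’s actual implementation: `has_attr()` and `current_region()` are hypothetical stand-ins for policy functions, and the table and column names are invented.

```python
# Sketch: emit DDL for a secure view that masks sensitive columns and
# filters rows, so consumers query the governed view instead of raw tables.

ALL_COLUMNS = {"employees": ["id", "name", "salary"]}  # hypothetical schema

def build_secure_view(table: str, masked_cols: list, row_filter: str) -> str:
    """Build CREATE SECURE VIEW DDL with column masking and a row filter."""
    cols = ", ".join(
        f"CASE WHEN has_attr('unmasked:{c}') THEN {c} ELSE NULL END AS {c}"
        if c in masked_cols
        else c
        for c in ALL_COLUMNS[table]
    )
    return (
        f"CREATE OR REPLACE SECURE VIEW governed.{table} AS\n"
        f"SELECT {cols} FROM raw.{table}\n"
        f"WHERE {row_filter};"
    )

sql = build_secure_view("employees", ["salary"], "region = current_region()")
print(sql)
```

Because the view lives in Snowflake, Looker’s connection never changes; only the view definitions do when policies change.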
Eric Dodds 26:37
That’s very cool. Very, very cool.
Kostas Pardalis 26:40
Yeah, that’s amazing, especially when we’re talking about managing access to many different products and tools. We already mentioned at least two, right? We have the database itself, and then we might have various different BI tools on top. So that’s super cool, what you’re doing there. But who is responsible for these policies? Who has the role of creating these policies in Immuta? Who is the user of Immuta?
Stephen Bailey 27:07
This is a question where the answer varies depending on who you’re talking to, and I think it also varies heavily with the size of the organization. At a small startup, speaking from experience, what I found is that the person who owns the data platform is the one who knows the most about the data. He or she knows where the most sensitive data is, and they’re also the ones actually enforcing the policies for real, right? If there is no centralized policy defined, then whatever the database policies are, that’s the actual policy being implemented for that company. But in larger organizations, you might have compliance organizations that have standards, and there’s someone whose job it is to make sure that warehouses or data assets are up to that standard. What’s challenging in that scenario is that data changes so fast. I mean, it changes all the time. So if the person who owns the data platform and actually releases the data to people isn’t the person who’s most on top of the policies, and maybe even defining the policies, then whatever that downstream organization holds gets out of date, or it takes additional time to release a data product. Whereas if you have the data platform owning it, making sure the data is up to snuff, then they can release it without delay. It’s almost like a CI/CD process for releasing data, or for data governance, and that’s in some ways where I think the future is. It’d be awesome, and it’s sort of how it works now for Immuta users: when you create a pull request against your data warehouse, as long as you have the right metadata attributes on it, and you put those metadata attributes in Immuta, then as soon as that data is released to end users, the correct policies will be applied, because you’ve already defined those policies in the first place.
So it makes it easy for you to have one big initiative to define all your policies, and then just be confident that those policies are applied to the data as you add new datasets.
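The “CI/CD for data governance” workflow Stephen sketches could look like a small pre-merge check: before a pull request adding a dataset can merge, verify that every new table declares the metadata attributes the policies key off. The manifest shape and tag names below are hypothetical, not Immuta’s actual mechanism.

```python
# Sketch of a governance gate in CI: fail the build if any new table is
# missing the tags that access control policies depend on.

REQUIRED_TAG_KEYS = {"contains_pii", "owner", "sensitivity"}  # hypothetical tags

def check_manifest(tables: dict) -> list:
    """Return the names of tables missing one or more required governance tags."""
    return [
        name for name, tags in tables.items()
        if not REQUIRED_TAG_KEYS <= set(tags)
    ]

# A pull request's hypothetical dataset manifest:
manifest = {
    "customer_activity": {"contains_pii": True, "owner": "data-team", "sensitivity": "high"},
    "daily_signups": {"owner": "growth"},  # missing contains_pii and sensitivity
}

failures = check_manifest(manifest)
print(failures)  # ['daily_signups']
```

Once the check passes, the predefined policies can attach to the new tables automatically, which is the “define policies once, apply on release” flow described above.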
Kostas Pardalis 29:28
That’s very cool. All right. The product itself has monopolized the discussion a little bit, which, okay, makes sense, because it’s pretty interesting, and the approach you have and what you said about CI/CD are very interesting too. But let’s talk a little bit more about your role inside Immuta. What is your team doing, and what are the products that you’re delivering?
Stephen Bailey 29:52
Great question. So I talked a little bit about my background. When I started at Immuta, I was a data scientist. I came in focused on doing some ad hoc data science projects, looking at performance considerations or doing, you know, customer segmentation and things like that. As I pivoted more towards managing infrastructure and building a data platform for the organization, for downstream users, we went through that data maturity cycle of starting from, hey, let’s just get some basic counts that everyone can agree on: count of customers, count of opportunities, these basic things, and getting consensus around that. So that’s where we started. Then, as we started growing, we’ve been building all of this great analytics expertise and operational expertise within all of our different departments: within sales, within marketing, within product. And so now our data team is focused really heavily on enablement and the development of new interdisciplinary data products. So finding ways to unite sales data, marketing data, product data, and telemetry data into, for example, a single unified user activity stream, so we can understand what the customer journey looks like. That’s an example of something we’re working on right now. And it’s been great because it’s positioned us both as partners with the different stakeholders on each team, but also as independent experts who are creating custom data products that can accelerate the business’s impact.
Kostas Pardalis 31:36
That’s super interesting. So can you give us a little bit more color on how you unify the data? What kind of sources do you have? What are the challenges of unifying them? And where do you stand in terms of how mature this product you’re describing is inside the company?
Stephen Bailey 31:52
Yeah, I think one of the biggest challenges is building sustainability across the whole data supply chain. So from the original source system, for example Salesforce, making sure that that data is really high quality, and then you’ve got the technical infrastructure that extracts it, loads it, and transforms it into a custom data product that we expose in Looker. That’s a technical challenge. And then you have to train people on what that new product looks like. So you’ve got to have the high-quality source data, or the downstream product doesn’t work or isn’t valuable. And then you have to start repeating that process across different domains. And each time you do that, you know, you guys have worked in data, so you know how it is: you get excited, you build a proof of concept in two weeks, and then it’s six months of ironing out the kinks and realizing, oh, this doesn’t mean that, or there’s this weird data quality thing here. So it’s really about building out that supply chain. And then there’s a really big element of team building and education as well. It’s both exciting, and I really enjoy that aspect, but it’s easy to forget about.
Kostas Pardalis 33:11
Yeah, yeah, absolutely. I totally agree with you. We tend to forget how important the human factor is, because in the end, all these numbers and all this data are going to be interpreted by human beings, right? They have to make sense to the humans involved. And of course, you also have to train them. That’s a very interesting topic, actually, and as people who work in technology, we tend to forget about it. But you also touched on another very interesting topic, which has to do with quality. You said that the first thing you have to do is ensure the quality of the data supply chain, and you mentioned Salesforce, which I think is a very good example that we can discuss a little. So what is quality? When you talk about the quality of the data, how do you define it? And how do you solve the problem of data quality in the pipelines and systems that you’re building?
Stephen Bailey 34:04
That’s a great question. I think of data quality in sort of the same way that I think of access controls, actually. Access controls are basically agreements between people about who should get access to what kind of data, and I think data quality is in a similar state: it’s an agreement between the person providing the data and the person using the data, even maybe the person originally providing the data upstream, about what certain things mean and what the expectations should be across that data product. We’ve recently embarked on a data quality project, so we’ve been thinking a lot about it. You could take the approach of adding data quality and schema tests to every single column when you build out the original data model, but that quickly leads to noise and becomes impossible to maintain because things are firing all the time. What we’re trying to do currently is define critical fields. So define the key metrics that we want to back as a data science org, and then work backwards from there to identify the guarantees that we need to make as an organization so that the final product, that final number, is quality. Then we build visibility into that pipeline, so that people who aren’t on my team can maintain it and quickly identify when something goes down, but also so other people can look in and understand whether the number they’re seeing is actually correct, or whether there are some known issues around it. But it all comes back to taking the time to identify: what are the most critical components? What supports those components? And what is the agreement that we have made with our consumers around that?
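The critical-fields approach Stephen describes, defining guarantees only for the fields that back key metrics rather than testing every column, can be sketched in a few lines. This is a hypothetical illustration, not Immuta’s implementation; the field names and rules are invented for the example.

```python
# Hypothetical sketch of "critical field" data quality checks: instead of
# testing every column, define guarantees only for the fields that back
# key metrics, then validate incoming rows against those guarantees.

from datetime import date

# The "agreement": each critical field and the guarantee made about it.
# (Field names and rules are invented for illustration.)
CRITICAL_FIELD_RULES = {
    "account_id": lambda v: v is not None and v != "",
    "arr_usd": lambda v: isinstance(v, (int, float)) and v >= 0,
    "close_date": lambda v: isinstance(v, date),
}

def validate_rows(rows):
    """Return a list of (row_index, field, value) guarantee violations."""
    violations = []
    for i, row in enumerate(rows):
        for field, rule in CRITICAL_FIELD_RULES.items():
            value = row.get(field)
            if not rule(value):
                violations.append((i, field, value))
    return violations

rows = [
    {"account_id": "A-1", "arr_usd": 12000.0, "close_date": date(2021, 3, 1)},
    {"account_id": "", "arr_usd": -50, "close_date": date(2021, 3, 2)},
]
print(validate_rows(rows))  # -> [(1, 'account_id', ''), (1, 'arr_usd', -50)]
```

The point of limiting checks to critical fields is exactly the noise problem Stephen mentions: a small rule set that maps to the agreement with consumers stays maintainable, while per-column tests fire constantly.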
Eric Dodds 36:19
You know Stephen, that’s a fascinating point around agreement. In my past, we referred to that as the end-all-be-all definition. And one example that keeps coming back is a company I worked with who said, well, we need to track active users. Right? And that sounds like a simple metric. It’s just one metric. But when you started to ask people around the organization what the definition of an active user is, you would get wildly different responses. For what seems, on face value, like a very easy thing: let’s just track active users. And then you start getting into it, and there are all sorts of edge cases, and it can cross different user actions that are difficult to track. There are all sorts of complications in there. So the word agreement really resonated with me, because, unrelated to the pipeline or the actual data science work itself, the fundamental challenge of getting agreement is actually pretty formidable in a lot of organizations. Not because anyone’s necessarily territorial, but you just have to do a lot of work across teams to get to a shared definition.
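Eric’s active-users example is easy to reproduce: the same event log yields different counts depending on which actions the organization agrees count as “active.” A minimal sketch, with invented event data and hypothetical definitions:

```python
# Two plausible definitions of "active user" applied to the same
# hypothetical event log yield different numbers, which is why the
# cross-team agreement on the definition matters as much as the query.

events = [
    {"user": "ana", "action": "login"},
    {"user": "ana", "action": "view_report"},
    {"user": "ben", "action": "login"},
    {"user": "cal", "action": "view_report"},
]

# Definition 1: active means "logged in".
active_v1 = {e["user"] for e in events if e["action"] == "login"}

# Definition 2: active means "performed any tracked action".
active_v2 = {e["user"] for e in events}

print(len(active_v1), len(active_v2))  # -> 2 3
```

Both queries are trivial to write; the formidable part is the organizational work of choosing one definition, which is exactly the agreement problem being discussed.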
Stephen Bailey 37:40
Yep. And I’ve found that investment from executives and leadership is so key there, right? We have a leadership team that’s very invested in becoming data driven and in everyone looking at the same numbers, and we simply couldn’t be an effective data team without that investment. Because it forces the question of: what does this number mean? What are we going to accept that it means? And also, what do we accept is not known or knowable from this number? And that’s hard. I think that is one of the things people find very hard, because when they look at active users, it’s like, well, I want to know all of the information about the active users. But as soon as you define it, you’re also defining it in the negative: it’s not this.
Eric Dodds 38:38
Sure, sure.
Eric Dodds 38:41
One quick question, and this is just very practical; I’m thinking about our own experience. Over the last two quarters, we went through a similar effort of, hey, let’s just make sure that the numbers in marketing and sales are the same, right? And that we can agree upon all these numbers. A lot of our listeners are data engineers or work in roles related to data engineering. What was that effort like for you? Again, it’s one of those things that sounds simple, like, let’s get a set of numbers that we can agree upon. But what was that process like? Were there any particular challenges? And how long did it take? Because that’s one thing, going through it ourselves and having done it before, part of you wonders, man, does every other company have this sorted out? It seems like it’s taking us forever, and in reality it’s something that every company struggles with. So I would just love some practical thoughts on your experience with that.
Stephen Bailey 39:42
Yeah, I would definitely offer encouragement to anyone who’s feeling discouraged by efforts like this. It has been a two-year, rapid-growth experience for me. The number of conversations, and like you said, the time it takes to implement these things, is much longer than I expected. I think a large part of that is trust: you have to build trust among people, and then people have to trust the numbers. And if it’s something new, there’s an intrinsic skepticism. So for a couple of our more effective projects, the first thing we did was get a graph, find a format that people are going to see over and over and over again, and start shopping it around to different stakeholders. That starts to build familiarity. Then over time, as that graph is shared in meetings, that’s when you start to build trust in it, and then you can start getting derived information from it. As a data scientist, I know my instinct a lot of times is to put the big profiling dashboard together, like the 50 different ways we can slice this data model. And it’s too much too soon in many cases. I think it’s better to have one graph, slice it in a couple different facets, get people to build trust in that, and then roll out new stuff. At least that’s been my experience.
Eric Dodds 41:21
Yeah, it’s really interesting to think about that. And we don’t have time to get to it today, but as a consumer of the type of data that you’re talking about, in our organization I would be one of your internal customers. When I hear you describe it that way, what comes to mind for me, and I don’t know if I would have articulated it this way if you hadn’t given that explanation, is that I’m making decisions with the data, right? So it really does take time for me to understand and have enough confidence in a chart or a data set to make a decision on it, and then get feedback: are the decisions I’m making based on this data better? Are they helping the company? Are they helping my team? Are we progressing as a result of this? That really does take time to build trust. Not necessarily because I don’t trust you, but because there’s a lot on the line as I’m making decisions with this data. So I want to see that it actually proves out to be producing results as I use it in my job day to day.
Stephen Bailey 42:38
So I had this experience during my PhD, where we analyzed brain images. These can be three-dimensional image volumes, or they can be four-dimensional or even five-dimensional volumes. Coming in, I didn’t have any experience with this type of data; it was totally brand new data to me. It took me four years of working with brain data, day in, day out, running experiments and running processing on it, to really gain a lot of trust in that data and understand, at a sort of intuitive, deep level, what I was working with. When I saw a blob in one part of the brain, like a statistically significant result, I was like, oh, I trust that I know what it means. And that’s a situation where I actually had 100% trust that the data I was getting was correct, and I had complete control over the data provenance. But it was all about the expertise of becoming a user of that data; it was all about building trust, building intuition, and building knowledge. And that process just takes so much time. I share that just because it’s one of the few times where I had a totally unfamiliar data set and just had to build that intuition from the ground up. It just takes a long time to trust.
Kostas Pardalis 44:10
That’s actually super interesting. I’ve been observing the conversation the two of you had in these past few minutes, and in the end, I think we end up at data governance again. Because if you think about it, working with data can be distilled into one thing, and this is trust: trust in the data, trust among the people, and the understanding that people have around the data. I think this is, let’s say, the broader definition of the problem that data governance is trying to solve: how people can work with the data, trust the data, and also communicate and come into agreement among themselves about what this data means. I know we said that Immuta is more about access control to the data, but I think that’s a very foundational part of building trust, both in your data and in the processes and the people that you have inside the company. And then on top of that you can build other layers, like we were talking about the definition of a KPI and how we understand it. I don’t know what the plans are around the product, and that’s my next question, but from my experience, at least, a big part of data governance revolves around how we can have a data catalog where we agree upon the definitions of the data that we track and the KPIs that we measure. And it’s interesting, because these are problems that large enterprises have been solving, or trying to solve, for quite a while. But as the whole industry becomes more and more data driven, everyone will have to deal with these problems. So in the end, we discussed so many different things, but I think all of them came back to data governance. And having said that, my last question for you: what’s next for Immuta?
I mean, you have solved what seems like a very core problem around data governance, a very important one, and in a very elegant way. So what’s next?
Stephen Bailey 46:15
So we’ve had a lot of conversations around this. I think one of the cool things I’ve gotten to experience at Immuta is that when we started two years ago, I didn’t see any other access control, entitlements, and security startups that I would say were direct competitors. And we’ve started to see more of a movement in this space, which has been really exciting, because I think there’s an acknowledgement that governance has to be part of the data development lifecycle. So we’re starting to look into some of the adjacent governance responsibilities. There’s a really good article by Andreessen Horowitz’s group on modern data architectures, and they define data governance in sort of four buckets: metadata management, which would be your enterprise data catalogs; entitlements and security, which is what I’m currently doing; data quality; and then observability. Data quality and observability are of high interest to us: really creating a centralized place for data engineers to understand what’s going on in their data pipelines, and then exposing that to end users. That’s an area of intense, I’ll say, research interest right now, because I think it’s a big gap. As a data platform owner at Immuta, I’ve got my GitHub repos with Terraform in them, I’ve got a couple of AWS Lambda functions, I’ve got Snowflake, I’ve got an orchestrator, Stitch, we’ve used a little bit of Fivetran, I’ve got Looker, I’ve got Immuta. I’ve got a bunch of tools, and each of them does what it does really well and makes my life better. But now I have to manage all of these different tools, and all of them create dependencies for the golden data products that I want to give to end users. So we’re thinking about how we extend beyond data sharing agreements and go more into maybe data quality agreements or adjacent spaces. That’s really where our mind is at.
And then of course, improving the core experience of making access control simple, easy, and communicable. I think there’s so much to do there.
Kostas Pardalis 48:42
Absolutely, I mean, it’s a very foundational problem, as we said. And of course, there’s still a lot of space for improvement. I’m really interested to see what’s going to happen; what you described is very interesting. Personally, I’m also very interested in anything that has to do with data quality and observability. My feeling is that many of the things we take for granted as engineers when we develop code are missing right now when someone is working with data. So I think there are going to be very interesting times ahead of us, with very interesting products coming into existence, and I’m very excited about it.
Kostas Pardalis 49:20
Thank you so much. It was a great time and we really enjoyed the conversation with you. And I’m looking forward to connecting in the future and seeing how things are going with Immuta and you and discussing more about data and the human brain again.
Stephen Bailey 49:39
Yeah, this was really great, guys. One quote to end on that I think is really relevant: I was talking to a colleague who runs a data team, and he said, “When it comes to data governance, it just feels like there are tons of wrong ways to do things, but not a really clear right way to do things right now.” That has stuck with me. And as a community, I’m just really excited to see how we grow in terms of sharing best practices, and also technologies that help us build sustainable pipelines that we can be really confident in.
Eric Dodds 50:13
Absolutely. Well, again, thank you for spending time with us. Thank you for teaching us both about, you know, data governance and your work at Immuta, and also a little bit about how we can deal with kids learning to read, which I know is very relevant for me right now. So appreciate that from your background as well. And we’ll catch up with you soon.
Kostas Pardalis 50:33
Awesome. Thanks, guys.
Eric Dodds 50:35
Well, that was fascinating, not only because I’m teaching my four-year-old son to read, sort of working on letters and recognizing words, so it was really interesting to hear Stephen’s take on that. But I think one of the things I found most interesting, and this is somewhat of a theme we’ve seen on the show, is that the technical problems of data are absolutely fascinating, but they really are secondary to getting alignment within an organization around data. And that’s a particular skill and a particular endeavor on its own that doesn’t even necessarily, in its early stages, relate to the technology. I just found it really fascinating the way Stephen talked about that dynamic within Immuta and within organizations in general. What stuck out to you, Kostas?
Kostas Pardalis 51:27
Absolutely, I totally agree with you. Working with data is not just a technical problem; it’s also an institutional problem that every company has to solve. I’m pretty sure our listeners will notice how many times we used the word trust, right? And trust is a human characteristic. We need to trust our data, we need to trust our technology, and above all, we need to trust the people that work with the data, and to have a common understanding of how we interpret the data. I think this is a big part of what data governance is trying to solve. It’s a very interesting problem. As Stephen said, we’re at a stage where, for all the problems we’re trying to solve around this, there are many bad ways to solve them, but we haven’t figured out the good ways yet. So it’s very fascinating, it’s very exciting, and I think in the next year or so we will see more and more companies and people trying to come up with interesting solutions to these problems. And of course, we’ll see what Immuta is going to do. They started with access control to the data, and from what it seems, they’ve done excellent work product-wise to solve this problem. But I’m pretty sure they are going to attack other problems around data governance as well. So I’m very excited to see what’s going to happen in the future.
Eric Dodds 52:44
Me too. Well, thanks again for joining us on The Data Stack Show. As with many of our guests, we’ll check back in with Stephen and Immuta maybe in six months’ time or so and get updates on where they are with the product and what his team is up to. We’ll catch you next time.
Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month’s shows, a sneak peek at upcoming episodes, and curated links from Eric, John, & show guests. Follow on our Substack below.