This week on The Data Stack Show, Eric and Kostas chat with Prukalpa Sankar, a co-founder of Atlan. During the episode, Prukalpa discusses selling SaaS and building data teams, defines phrases like “agile data” and “the metadata plane,” and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome back to The Data Stack Show. We have another data term to dissect in today’s show, and that’s the term “data ops.” We’ve talked a ton about ops on the show and how ops is being adopted into the data space using a lot of the principles of software engineering.
We’re gonna talk with Prukalpa from a company called Atlan. Super interesting. Kostas, she comes from a background where she’s solved massive worldwide data problems focused on things like poverty, or access to clean fuel and water. And I am so excited to hear what that was like, because those tend to be different in so many ways from a lot of the things that data companies in B2B SaaS, sort of in the venture-backed world, face. And I’m sure there are probably some similarities. So that’s what I’m going to ask her about.
Kostas Pardalis 1:22
Yeah, I’m very, very interested in chatting with her about metadata. I know that to build a platform like the one they have, you have to build some kind of, let’s say, metadata layer there. And I really want to see, first of all, how mature the technologies are for collecting it, and also what you do with the metadata. The reason I am so interested is because first you have to work with the metadata, and then you can go to the semantics.
Eric Dodds 1:56
Always. Yeah, I guess it’s gonna get complicated now with the metasphere, or the metaverse. Yeah, talking about metadata, what’s that gonna mean?
Kostas Pardalis 2:07
Yeah, I think that’s going to be a very hot topic next year. Metadata is an important aspect of working with data, and it’s good that we’re starting to hear more and more about it, because it means that the foundations of the technology are starting to be solidified, so we can start working on the next iteration of how we can deliver value. Talking about metadata from a business perspective especially is, I think, a very good indication of the maturity of the space. So that’s good.
Eric Dodds 2:43
I agree. Super exciting. Well, let’s dive in and learn more.
Kostas Pardalis 2:46
Yep. Let’s get into it.
Eric Dodds 2:48
Prukalpa, welcome to The Data Stack Show. We’re so excited to chat with you.
Prukalpa Sankar 2:52
Thanks for having me.
Eric Dodds 2:54
Okay, let’s start where we always do. We’d love to hear about your background. I’m excited because you’ve done data work for some really interesting, internationally known organizations. So can you just tell us about your background and what led you to creating Atlan?
Prukalpa Sankar 3:14
Sure, yeah. So I’ve been a data practitioner my whole life. Prior to this, my co-founder and I founded a company called SocialCops, mainly with the mission of saying, hey, lots of problems in the world, like national healthcare and poverty alleviation, don’t seem to be using data, and it really feels like they should be. So let’s do something about that. Our model very quickly turned into us becoming the data team for our customers, because we were typically working with folks like the United Nations, the World Bank, the Gates Foundation, or several large governments, who did not have data teams, or technology teams for that matter. So we just became the data team, which is really where I learned everything I know about building and running data teams, and how complex and chaotic they can get. Because of the kind of work we were doing, we were lucky to be exposed to a wide variety and scale of data. At one point, we were processing data for 500 million Indian citizens and billions of pixels of satellite imagery, which all sound like really cool projects, but the day-to-day was a nightmare. I feel like as a data leader, I have seen it all. I’ve had cabinet ministers call me at eight in the morning with the nightmare that no data leader wants to be woken up with, which is “the number on this dashboard doesn’t look right.” And then I’ve done that wild goose chase of calling my project manager, who called my analyst, who said, hey, it looks like the pipeline’s broken, and then calling my engineer, who pulls out the logs and says, no, nothing looks wrong, and it takes four people and eight hours to figure out what went wrong. I have sat down and cried for three hours because an analyst quit on me one time, exactly a week before a major project was due.
And he was the only one who knew everything about our data, and there was no way I could deliver the project without him. Things like that just brought us to a breaking point. Our team was spending 50-60% of our time dealing with this chaos: which dataset should I use for this analysis? What does this column name mean? How do we measure annual recurring revenue? Now this dashboard is broken. Stuff like that. We realized we couldn’t scale like that, so we started building this internal project that we called the assembly line. The goal was basically to say, our team is super diverse, and we want to find a way to make our team work together effectively. Long story short, we tried to buy a solution, and failing to buy one, we were forced to build a solution. So Atlan was never born to be sold as a product to anybody else; we built it to make our own team more agile and effective. Over two years, we ran 200 data projects on the tooling we built, and in that time we made our team over six times more agile, and we realized we had built tools that were more powerful than we had originally intended. Our team went on to do things like build India’s national data platform, which the Prime Minister uses; it’s one of the largest public sector data lakes of its kind. What was really cool about that project was that it was built by an eight-member team in 12 months, which also makes it one of the fastest of its kind. That led us to realize that these tools could help data teams around the world be a little bit more agile and effective, and that’s when Atlan was born. We said, can we use these tools to help every data team in the world?
Eric Dodds 6:42
Sure. Okay, I have to ask, this is so interesting because we love hearing about really diverse experiences with data. And when we think about subjects as big as fighting poverty, and then apply sort of a data-driven mindset to that, could you just give us a little bit of insight into maybe, like, what’s a specific poverty-related project that you worked on? What data were they not using? What data were you able to introduce? And how did that change the project? That’s just so fascinating.
Prukalpa Sankar 7:19
Sure, yeah. So in some ways, I actually think social problems are some of the most complicated problems that can exist. In business, the outcomes are a lot clearer: you want to improve revenue, and you want to reduce cost. Whereas when you want to improve the quality of life of a human being, it’s a much harder problem to model. And we saw this. I’ll give you one example with a project that’s super close to my heart.
We partnered with the national government, which was rolling out clean cooking fuel to about 80 million below-poverty-line women across India. Just to give you context on the problem: women in India in rural areas and below the poverty line basically use firewood as the cooking fuel in their house, which is equivalent to smoking something like 400 cigarettes an hour, or some crazy number like that. It’s crazy. So obviously the government wanted to solve for this, and the program we were working on sent free gas cylinders to these below-poverty-line women. The program had initial operational monitoring, and we put data systems in place for that, and it rolled out really fast and really well. And then we started hitting this challenge: while the penetration of gas cylinders was increasing significantly, cylinders need to be refilled, right? And typically the refilling stations for gas cylinders were only in urban India, because there had been no penetration or demand elsewhere, and now the government was creating this very rapid demand because of what they’d done. Now, this was a super interesting problem, because the person who runs the gas station is actually an entrepreneur. It’s a decentralized, privatized model. The entrepreneur obviously cares about the station being profitable, which makes sense. On the other hand, the government wanted to create access. So the problem statement they gave us, or what the minister at that point told us, was: I would like a gas cylinder station to be within 10 kilometers of every single Indian’s home. So now I have this really unique problem where you’re balancing accessibility with profitability.
So how do you do that the right way? It took us a bunch of iterations. If you do top-down allocation, you’re talking about 640,000 villages. So what we ended up doing was turning it into a geospatial modeling problem. For those 640,000 villages, we brought together about 600 datasets, so population, affluence, a bunch of parameters like that, and we layered market data on top of that: where are the existing gas stations and cylinders, where is there already access? Then across the villages we ran a clustering algorithm with a threshold on profitability. So essentially you could say, hey, this is the population, and this is the affluence, so this is what we think people are going to be willing to pay. Every cluster ended up being a different size, in terms of the distance it was covering, and then we used that to figure out where you should open the next 10,000 gas stations across the country to solve for both profitability and accessibility. So those are just some examples of the kinds of modeling challenges we had to deal with.
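To make the shape of that problem concrete, here is a minimal sketch of the idea in Python. Everything in it is an assumption for illustration: the toy villages, the 10 km accessibility radius, and the `expected_revenue` model standing in for the affluence-based willingness-to-pay estimate. It is not the model the team actually built, just a greedy version of “cluster villages, keep a cluster only if it clears a profitability threshold, and place a station at its center”:

```python
import math

# Hypothetical village records: (x_km, y_km, population, affluence_index).
# Names and the revenue model below are illustrative assumptions, not the
# actual model used in the project described in the episode.
villages = [
    (0.0, 0.0, 1200, 0.4), (1.5, 0.5, 800, 0.5), (2.0, 1.0, 600, 0.3),
    (9.0, 9.0, 2000, 0.7), (10.0, 9.5, 1500, 0.6), (25.0, 25.0, 300, 0.2),
]

MAX_RADIUS_KM = 10.0   # accessibility constraint: a station within 10 km
MIN_REVENUE = 900.0    # profitability threshold for the entrepreneur

def expected_revenue(cluster):
    # Toy model: willingness to pay scales with population * affluence.
    return sum(pop * aff for _, _, pop, aff in cluster)

def cluster_villages(villages):
    # Greedy pass: seed each cluster at the largest uncovered village,
    # take everything within the accessibility radius, and keep the
    # cluster only if it clears the profitability threshold.
    remaining = sorted(villages, key=lambda v: -v[2])
    stations = []
    while remaining:
        seed = remaining[0]
        cluster = [v for v in remaining
                   if math.dist(v[:2], seed[:2]) <= MAX_RADIUS_KM]
        if expected_revenue(cluster) >= MIN_REVENUE:
            # Place the station at the population-weighted centroid.
            total = sum(v[2] for v in cluster)
            cx = sum(v[0] * v[2] for v in cluster) / total
            cy = sum(v[1] * v[2] for v in cluster) / total
            stations.append((cx, cy, len(cluster)))
        remaining = [v for v in remaining if v not in cluster]
    return stations

stations = cluster_villages(villages)
```

On this toy data the greedy pass places two stations and leaves the remote, low-revenue village uncovered, which is exactly the accessibility-versus-profitability tension the minister’s 10-kilometer mandate was meant to resolve.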
Eric Dodds 11:07
Yeah, no, super fascinating. That’s really helpful. It is wild to think about, because just off the top of my head, you mentioned geospatial, but there’s also the economic modeling component, the demographic component, the socio-economic component, and those are pretty wildly different datasets. Interesting.
Okay, so you’re dealing with issues like that. Let’s talk about what Atlan actually does. You talked about, okay, you get the call from someone who says the dashboard doesn’t look right. But like, what does it look like for a team to use Atlan and how does that make them more efficient?
Prukalpa Sankar 11:51
Sure. Yeah. So let’s jump in on some of those problems I talked about, which are pretty commonplace in most data teams around the world. If you think about these problems very deeply, you realize that the place they stem from is actually a fundamental reality of data teams, which is diversity. Data teams are diverse. To make a data project successful, you need an analyst, an engineer, a scientist, a business user, a machine learning researcher, an analytics engineer. All these people are very different. They have their own persona types, their own DNA in the way that they work, their own tooling preferences, and their own limitations. And while this diversity is in some ways our biggest strength, it’s also our biggest weakness, because a ton of the challenges I talked about come from the fact that all these people need to come together and collaborate, but they’re all operating in different contexts and ecosystems. So we see ourselves as a collaboration layer for the modern data team. Every function inside an organization has a hub like this: engineering teams have GitHub, sales teams have Salesforce. What does it take to create that true collaborative hub for a modern data team, knowing that the only constant in a data team is diversity? That’s the place we operate in. If you think about the fundamental modern data stack, which is your data ingestion, warehousing, transformation, and BI, that’s what I think of as the data stack. Atlan sits on the metadata plane, or the control plane layer, of the data stack. We bring in metadata from all of the different tools in your ecosystem, put it together to start creating intelligence and signals, make it super easy to discover data assets, and so on and so forth.
But most importantly, we actually use this to start driving better context back into the tools that you’re working in daily. So for example, when I am in a dashboard or a BI tool, I want to know: can I trust this dashboard? But the truth about whether you can trust this dashboard is actually in the ETL. It’s in whether the pipeline got updated today or not, and whether the quality check ran and passed. That’s the metadata Atlan brings together and makes sense of. We construct lineage automatically, we make sense of your entire data map in some ways and create a single source of truth. But then we take that back into tools like BI tools, into Slack, into collaboration hubs, into GitHub, into tooling like that, to make the day-to-day workflows of teams significantly more simple.
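The trust question she describes, “can I rely on this dashboard?”, reduces to walking the lineage graph upstream and checking freshness and quality signals at every hop. Here is a minimal sketch of that idea; the asset names, the metadata fields, and the 24-hour staleness rule are all illustrative assumptions, not Atlan’s actual data model or API:

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# Hypothetical lineage graph: each asset maps to its upstream dependencies.
lineage = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["raw_orders"],
    "raw_orders": [],
}

# Hypothetical signals harvested from pipeline runs and quality checks.
metadata = {
    "revenue_dashboard": {"last_updated": now, "checks_passed": True},
    "fct_orders": {"last_updated": now - timedelta(hours=2), "checks_passed": True},
    "raw_orders": {"last_updated": now - timedelta(hours=30), "checks_passed": True},
}

def trustworthy(asset, max_staleness=timedelta(hours=24)):
    """An asset is trusted only if it and everything upstream of it
    is fresh and has passing quality checks."""
    meta = metadata[asset]
    if not meta["checks_passed"]:
        return False
    if now - meta["last_updated"] > max_staleness:
        return False
    return all(trustworthy(up, max_staleness) for up in lineage[asset])
```

Even though the dashboard itself was refreshed moments ago, `trustworthy("revenue_dashboard")` comes back `False` here, because the raw source table is 30 hours stale; surfacing that upstream signal inside the BI tool is the workflow being described.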
Kostas Pardalis 14:28
I have a few questions, because you have mentioned some very, very exciting topics. I’d like to start with the people. You mentioned quite a few times the diversity and the complexity of data teams. Coming from the more technical side of things, from data engineering, when we talk about data teams we keep forgetting all the different stakeholders that are part of these teams, right? We focus a lot on the engineering personas, talking about data engineers and maybe sometimes also analysts. Based on your experience, can you give us a description of what a functioning data team realistically looks like? What are the personas involved?
Prukalpa Sankar 15:23
Wow, that’s a loaded question. I wish there were one typical way a data team functions. But I think the reality is that every team is diverse; every team is unique, and teams also evolve over time. We’ve seen everything from fully centralized data teams to fully decentralized data teams, to all kinds of hybrid structures in the middle. We’re increasingly starting to see, for example, some functions like data platform and enablement, which in my mind are a new form of governance, stay centralized, while other functions are decentralized into pod structures with analytics engineers and analysts. What I’ve realized over time is that there are four or five different ways you can structure your data team. I’m also a very big fan of not fitting people to JDs or fitting people to structures, and instead actually building a structure that works for your team. Because the reality is that there’s a lot of overlap. If you think about the fundamental skill sets, from an analyst to an analytics engineer to a data engineer to a machine learning engineer, you’re actually talking about overlapping skill sets; it’s not black and white. In a lot of ways, it has to do with the person. I’ve never met the perfect data scientist; I don’t think that exists. So I’m a very big fan of this methodology of starting from the fundamental skills and building roles around people. And then, in some ways, the structure of your data team gets shaped by your leaders: how do your leaders interplay with each other, and what are their skill sets? Like I said, it’s a loaded question, because I think that’s really the only reality in a data team.
Kostas Pardalis 17:29
Yeah, that’s a great point. Companies usually do not start with a data team. When you incorporate, when you start a new project or a new company, you don’t really have the resources, or even the need, for a data team. There is a certain point in the lifecycle of the company when you start needing one. Based again on your experience, because you mentioned having a core set of skills and then building on top of that, what is the core skill set required for the people who create this first data team in a company?
Prukalpa Sankar 18:10
Yeah. So I believe, and in fact I actually have a blog post on this, that the way to think about it is how you go about prioritizing. I get a ton of questions from startup founders like, oh, we want to invest in a data team, where do we start? What I typically ask them to do is think about this from a strategic perspective, in terms of what they want their data team to achieve in the first place. It needs to start with the biggest strategic priority of the company. Let’s say I am starting a hyperlocal delivery startup, or an Uber equivalent, for example. Maybe the most important thing when I’m starting on day zero is just operational analytics: I just need to know how many rides we are serving, and things like that. But right after that, or even at that point, probably the most important thing for the business ends up being the matching algorithm, which is actually a pretty complicated data science problem. So on day zero, you’re not just starting with analytics; you’re probably also investing in data science, because data science is a fundamental part of your product and your business. So when you’re investing in your team on day zero, you’re probably going to try to find a leader or an initial team; you’ll probably start with an analyst and a data scientist who can stretch, and then you’ll build out those two teams from there. On the other hand, let’s say you’re a software startup and you’re selling SaaS. When you’re selling SaaS, operational analytics is almost all you need, and you need it to work really well, up until you get to a relatively mid-sized company.
You want to invest in product analytics, you want to invest in sales analytics and sales ops. So in that case, for example, you probably just want to invest in a really strong analyst, maybe someone who comes with domain expertise in SaaS, because SaaS is complicated in the way the domain itself works, and you don’t need data science at all until maybe much later, when you decide to build a product using all the data you’ve collected in your SaaS product. So I think that is the nuance of building a data team and structure: you need to start with the first principles of what you’re trying to optimize for as a company, and from there figure out what skill sets you need your data team to have on day zero.
Kostas Pardalis 20:53
Yeah, I think you’re giving super valuable advice here, and a very interesting perspective on how to build things. I don’t know how many times I’ve seen SaaS companies at an early stage be like, okay, we’re struggling with attribution, for example; let’s find a data scientist to do some magic for us. It fails at the end. But anyway, that’s a topic for another episode. That was great; I really, really appreciate you sharing this with us. So, you mentioned at some point that using, let’s say, the platform that Atlan is today, teams become more agile. “Agile” as a term in software engineering has a very specific meaning; the easiest way to explain it is with the counter-example of waterfall. But what is “agile data”? What does it mean to become more agile when working with data? Is it the same thing as in software, or is it different?
Prukalpa Sankar 22:00
Yeah. At a high level, we thought about how you measure agility. In some ways, we thought about it as velocity: how fast can we get stuff done, but also at what level of quality, and how can we reduce the iterations that we need in our work when something changes? Change requests are a really important part of a data team’s job. Someone tells you, oh yeah, the dashboard looks great, that metric looks great, but can you just make this one change and pull this one additional number into it? Only a data person knows how difficult it is to go and get one more number into a dashboard. So how can you build your entire pipeline in a way that gives you that kind of reusability and reproducibility, to be able to manage change requests? All of those are components that go into agility. To answer your question on whether agile is the same as in software engineering: absolutely not. Software engineering is a very different practice. The one fundamental that’s different between software and data is that in software, humans create the code, so you control it. In data, we can’t control the data that we’re working with, in most cases. That itself is a fundamental paradigm shift between software and data. Second, in software, you often already know what you’re going to build and what you have to do; it’s much easier to measure execution and the quality of execution. In data, many problems are exploratory in nature. Why is our ARR number dropping? That’s an exploratory analytics project. It’s really difficult to scope a problem like that on day zero. So those are things that are fundamentally different between software and data.
And I think that’s why it becomes very difficult to just say, let me pick agile as a framework, it works in software engineering, and I’m just going to bring it into data. What was useful for us was that we tried to take best practices, but not just best practices from software engineering; we also took best practices from lean manufacturing and DevOps. Data itself is such an interdisciplinary field, so in some ways you can take learnings from a bunch of disciplines. You can take learnings from product teams, for example. Something I’m really, really bullish about is this idea of going from a data service team, where you’re just servicing requests, to a data product team. A product team builds for its end users, and its success is measured on whether users actually use the product at the end of the day. Can you think about your data products the same way, and measure yourself on success rather than just on closing out service requests? So I think those are all components we should learn from as a data community, and use them to build what our practice of agile, or what people call data ops, should look like in the ecosystem.
Kostas Pardalis 25:23
Great, great. That’s super interesting, and again, a very good definition. It’s good to make the differences clear, because many people, especially data engineers, come from a software engineering background and have been exposed to very specific semantics around these terms. So understanding the differences between what it means to be agile when you work with data and what it means when you work with software is, I think, really important if we want to increase, let’s say, the quality of the work we do at the end. I’ll keep with the same approach of trying to redefine terms. You mentioned data ops. Again, “ops” is not something new: we have DevOps, we have SRE, we have RevOps. Now we have data ops. Why do we need it?
Prukalpa Sankar 26:32
Yeah. The way I think about data ops is that it’s a principle, almost a way of doing things. I know it’s become a buzzword now and has gotten a lot of attention, and there are a lot of products that claim to be a data ops platform or a data ops product and all these other things. But I actually don’t think that’s what data ops is. Data ops is fundamentally about saying, how do we take the principles of agile and DevOps and lean manufacturing and bring them into a fundamentally collaborative practice that helps teams work together effectively? It’s built on the foundations of collaboration and reproducibility: how do you ensure that your datasets are usable and reproducible? It’s built on foundations like self-service: how do you create something that reduces the dependencies on the core data team? I think those are some of the elements of what data ops means. For example, in our case, we actually created something that we call the data ops culture code, which is about what implementing a data ops culture truly means inside organizations. And I think that’s the way we need to think about these concepts, be it data ops or be it the data mesh, for example. These are all design principles, ways of doing things. Technology is just a part, an enabler, in solving these problems, but it’s a broader principle that we’re working towards.
Kostas Pardalis 28:12
Alright, so I think that’s enough terminology; let’s get into the technology now. We figured out what data ops is and why we need it. How do we build such a platform, and what do we need in order to do that? Actually, no, before we go to this question, I have another question, sorry, which I think is going to help us with it. This question is about the data stack. We keep talking a lot lately about “the modern data stack.” We had a panel here trying to define what this thing is, why it is modern, when it stops being modern and isn’t modern anymore, and what’s going to happen in the future.
I’ll try to avoid the controversial conversations around it, but in order to work with data, there is some architecture that needs to be in place, some minimal set of pieces of technology that we need in order to operate. So based on your experience, two parts to this question: first, what is the minimum data stack that a company needs to have in place? And second, what is the minimum data stack a company needs in order for you to go and deploy your platform?
Prukalpa Sankar 29:33
Sure, absolutely. As I think about the data plane, or the data stack itself, I broadly think of it as a few building blocks. The first is around collecting your data in the first place. This is where you have data ingestion, you have CDPs: essentially, what does it take to bring your data together and collect the data that you need? The cornerstone of every data platform, in some ways, is the storage and processing layer. There are a bunch of different architectures you can use: it could be your cloud data warehouse or your lakehouse, whichever of those architectures you’re picking inside the org, but that, I think, is the cornerstone. Then there’s transformation: how do you go from raw to bronze, silver, gold, and so on. That’s the third layer, I’d say. And then the final one is what I call the application layer. That’s where I’d put the BI toolset, and then, depending on whether you’re a data science organization, maybe some data science tooling like Jupyter, for example. That, in my mind, is what forms the core data stack. There are a bunch of other tools I’m not mentioning, but these three or four form the minimum viable data platform. It’s at that point, once you have the basic data stack, that tools like Atlan start becoming helpful, where we say, hey, we’re building that metadata, or governance, plane for your data stack. A typical customer who brings us in has implemented something like a Snowflake or a Databricks or an AWS data platform in the last, say, 12 or 18 months, and they’ve already set up that initial BI; they’ve solved some of those initial problems with data.
And that’s when collaboration chaos becomes a reality. That’s when they start realizing, hey, we hired our first couple of analysts, but the analysts are not productive at all, because they don’t know what data they should be using. Problems like that start becoming real.
Kostas Pardalis 31:50
Is there a minimum team size you have observed that usually exists when Atlan becomes relevant?
Prukalpa Sankar 31:57
So we typically see that somewhere around a 10-member data team is where the problems start becoming a real pain. That’s when a really sizable chunk of your time, probably over 50%, is being spent on issues like this. Interestingly, we also see a bunch of data leaders now, people who worked in larger teams and are now going in and setting up teams at early-stage startups, and we have teams that are starting out with us much earlier, because we’ve started seeing data leaders say, hey, we’ve gone through the chaos of not implementing this and having to figure it out at a later stage, and we know how painful it is, so we just want to get it right from day zero; we don’t want to have to fix our problems when we grow. So we definitely see earlier-stage teams starting to adopt a lot of the practices that we recommend. For example, we talk about things like, how do you think about your datasets as data products? What does that mean? How do you create shipping standards or documentation standards from day zero? These are all things we think about as practices inside the team, and we’re starting to see people adopt them almost at day zero, rather than waiting until the problem becomes a real pain.
Eric Dodds 33:24
Yeah, I have a question on that because, in an ideal world, all of us working in data would love it if companies were constantly looking six months ahead and implementing processes and tools that would make their future data stack and data team operate more easily. But in the real world, for most companies — especially as you’re scaling, dealing with data, putting out fires, and adding that one number to the dashboard — it’s really hard to anticipate what things are going to be like in the future. So I’d love to hear you speak to someone who says: okay, I’m already experiencing that pain. We have a pretty robust data team and stack, we have a data science and machine learning practice, or we’re starting that journey. If you do have to go back and solve the pain after things have reached a tipping point, where do you start? Which discipline do you start with within your definition of data ops? Because there are so many things. Do we start with governance? Or do you need to solve cataloging before that, or lineage? There are multiple components of this that Atlan solves, but what’s the starting point?
Prukalpa Sankar 34:51
So I think the best way to think about this, in some ways, is what I think of as the journey that a data team will take. Teams arrive at different points in that journey depending on how they think about agility, how forward-thinking they are, and how much they plan ahead for their advantage — different teams operate differently. But for example, the way I think about it is: when you’ve just started, your data team is pretty early and pretty small, and the first set of problems you’re probably going to solve are pretty simple things. It’s going to be things like: do we all agree on the same metric definitions, and how do we measure the metrics we care about? It starts there. When you’re that early-stage data team at a startup, you’re mainly focused on saying: how do I help my business users or business stakeholders start to trust the data, start to trust me, start to trust the data enough to make data-driven decisions? That’s where you’re starting.
Very quickly, what starts happening is that people start relying on the data team and start sending what I think of as service requests to the data team. You start out maybe helping with the monthly business review or the quarterly business review, and then a bunch of ad hoc requests start coming to you. The early data team says, okay, we can’t handle this anymore, we need to hire new people — and this is when your data team starts growing. At that point, I think the biggest challenge the data team has is productivity. It’s really hard to get new analysts up to speed. Typically, the average time an analyst stays in an organization is 18 months, and you’re spending six months of that onboarding the person. So that’s where things like data discovery, data lineage, and the context or tribal knowledge around your data — documentation — start becoming a reality, and investing in them becomes super important. Now, even if you improve the productivity of your data team, and hopefully your data team is doing much better, the reality is that the requests your data team gets are always going to outpace it. The demand is going to be much, much more than you can handle no matter how hard you try, because you can only scale your data team linearly, and it’s likely that requests are going to grow exponentially.
So that’s sort of the time where we see data teams go through a mindset change, from building data services to almost a data product mindset. If you think about the difference between services and products: with a service, you’re servicing a single request; with a product, you’re building something scalable that everybody, or at least a good chunk of users, can use. It takes a little bit of upfront investment on day zero, but over time you’re actually eliminating a ton of the repetitive requests your team is getting, which saves you a ton of time so you can work on new things. At that point, the priorities start becoming a little different for the data team, and that’s where we start seeing people ask: how do I start looking at insights, or queries, as an asset or a product? In Atlan, there are two ways you can use this. One is the Atlan UI and interface itself, but Atlan also has a ton of APIs and apps that you can build on top of it, which you can connect into your CI/CD pipelines and into your downstream tools — which could be BI tools, and so on. So that’s where we start seeing people leverage a bunch of those capabilities. And then the final layer is starting to truly create self-service. The holy grail that every data person wants is to truly enable self-service in the organization, and at that point you’re starting to expose a bunch of your data products directly to your end users or business users. That’s where things like governance start becoming a reality. I always think about this — as much as democratization is a buzzword — democratization and governance are two sides of the same coin.
The more people are getting access to the data, the more you’re asking things like: who’s accessing my data? Are the right people accessing my data? What about PII? Those kinds of things become a reality. So I see this as a journey, and the question really comes down to where you are in that journey. For example, for teams that adopt us much later in their cycle, when they’re a much larger team, governance is a priority on day zero itself, because of where they are in their journey. Whereas if you’re a much earlier-stage team — you’re five people — thinking about access control and security is super unlikely. That’s the way we think about it.
Eric Dodds 40:14
Yeah. Makes total sense.
Kostas Pardalis 40:16
Sorry, Eric, but I have a question that I’ve pretty much had from the beginning, and I think now’s the right time to ask it.
Eric Dodds 40:24
Go for it.
Kostas Pardalis 40:24
Yeah. So we’ve been talking a lot about enabling collaboration between people, and about the data stack — you gave a very good description of the complexity of a data stack; even the minimum viable data stack has many moving parts. So I wonder: in order to build a platform like Atlan, you need to be able, on a technology level, to collaborate with all these different parts of the data stack. You need to somehow interact with them and pull some metadata (and I’d like to talk more about that a little later). How do you do that? Considering that every part of the data stack, and every vendor, mainly cares about their own problems — I don’t think the first thing they think about is how to expose their data or APIs to tools like yours. So how does this work, and how much of a challenge is it today?
Prukalpa Sankar 41:25
Yeah, I think we’re actually doing a decent job as a community today — tools do a pretty decent job of making it possible for you to get metadata out of them. That’s not true for the fringe tools in the ecosystem, where the use cases are not as elaborate, but for the main tools in the ecosystem it’s actually okay. The thing is, you might have to do some work on top of that to make that metadata useful — that’s a different angle, and that’s what products like ours focus on, in some ways. I think the true challenge is not the integration point as much as it’s the diversity of the integration points. The truth of the data ecosystem is that the data stack is always evolving. If the data stack were just, “these are the 100 tools in the data stack, and it’s going to be these 100 tools for the next five years,” that would be awesome and a relatively simple problem to solve. But I never thought I would be hearing about Firebolt even a year ago — and now you hear about it. The data stack is changing so often, new tools keep getting added to the ecosystem, and that is going to continue to happen. In fact, I think change is the only constant in data. And so, as a platform, we need to be truly agile to be able to support these integration points, because if we want to be the true collaboration layer, the only way we can do it is by supporting them. So we turned that into a feature rather than a bug.
So the way we thought about it is, we actually built behind the scenes what we call an open marketplace, which basically means that customers can build these apps on top of Atlan — not just integration points into the tools that we’re pulling metadata from, but also integration points into collaboration workflows and downstream tools that you want to integrate into. So for example, if a team has a specific workflow that they use on Jira, and they want to build a metadata orchestration workflow off of it, they’re able to do that on Atlan as well. And that’s the way we think about the role we play in the stack, in some ways.
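To make the idea of a “metadata orchestration workflow” concrete, here is a toy sketch of the kind of rule such a marketplace app might run — scanning asset metadata and emitting work items for a downstream tool like Jira. This is not Atlan’s actual API or schema; every field name here is hypothetical, and a real integration would call the catalog’s and Jira’s own APIs.

```python
def stale_doc_tickets(assets, max_age_days=90):
    """Scan (hypothetical) asset metadata records and return ticket
    summaries for assets whose documentation is stale. A real app would
    push these into Jira via its API; here we just build the strings."""
    return [
        f"Refresh docs for {a['name']} (owner: {a['owner']})"
        for a in assets
        if a["days_since_doc_update"] > max_age_days
    ]

assets = [
    {"name": "analytics.daily_revenue", "owner": "ana", "days_since_doc_update": 120},
    {"name": "raw.orders", "owner": "eng", "days_since_doc_update": 10},
]
print(stale_doc_tickets(assets))
# prints ['Refresh docs for analytics.daily_revenue (owner: ana)']
```

The point of the sketch is the shape of the workflow — metadata in, actions out — not the specific rule, which could just as easily be about ownership gaps, failing freshness checks, or missing PII tags.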
Kostas Pardalis 43:50
Yeah. Okay. And I know we don’t have much time left — which is a good thing, it means we’ll need to arrange another episode at some point to keep chatting about all this stuff. But before we reach the end of today’s episode, let’s talk about the metadata plane. The two main terms we usually hear are the control plane and the data plane, and suddenly we’re introducing a new term, the metadata plane. So what is this metadata plane? And what is a piece of metadata? If you could give us an example from a BI tool or a data warehouse, that would be amazing.
Prukalpa Sankar 44:27
Yeah, so let’s get started with what metadata itself is, and I think the simplest way of describing it is data about data, in some ways. What that means is that every one of your tools is generating data assets, and there is context that is created about each of these assets. Let’s take your BI tool, for example: you have context about usage — which of these dashboards you’re building are getting used the most, at what time, by which users. That’s metadata. Which data source, or which table in Snowflake, is connected to this dashboard? That’s metadata. In your data warehouse, you can use your query logs to figure out lineage — how different tables are connected to each other. That’s metadata. In your pipeline or your orchestration engine, you have metadata about what time the pipeline was last updated. That’s metadata too. So the way I think about it is that metadata could be technical, and it could also be social — about who’s using what, when, and where, about usage, things like that. The more you’re able to bring in the standard, technical forms of metadata and marry them with more and more other types of metadata, the more you’re able to create what I think of as a single plane for all your metadata in the ecosystem. It’s the same thing that happened with the data lake, actually. In the big data world back in the day, there was a time when you would bring in data from a bunch of different places and dump it into the data lake, saying: hey, we don’t know what the countless use cases for this are going to look like, but we know it’s valuable. We can talk, of course, about the implementation details and the issues that came with that.
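As a toy illustration of the warehouse example above — inferring table-level lineage from query logs — here is a minimal sketch. Real catalogs use full SQL parsers rather than regexes, and the table names here are purely illustrative:

```python
import re

def lineage_from_query_log(queries):
    """Infer table-level lineage edges (source -> target) from a list of SQL
    statements, using a naive regex over INSERT INTO ... SELECT ... FROM/JOIN.
    Only an illustration; it ignores CTEs, subqueries, aliases, etc."""
    edges = set()
    for sql in queries:
        target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.IGNORECASE)
        sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
        if target:
            for src in sources:
                edges.add((src.lower(), target.group(1).lower()))
    return edges

log = [
    "INSERT INTO analytics.daily_revenue SELECT * FROM raw.orders "
    "JOIN raw.payments ON orders.id = payments.order_id",
    "INSERT INTO marts.kpi_dashboard SELECT * FROM analytics.daily_revenue",
]
print(sorted(lineage_from_query_log(log)))
# prints [('analytics.daily_revenue', 'marts.kpi_dashboard'),
#         ('raw.orders', 'analytics.daily_revenue'),
#         ('raw.payments', 'analytics.daily_revenue')]
```

Chaining those edges is exactly how a dashboard can be traced back through the warehouse to its raw sources — the lineage use case described above.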
But if you think about it at the fundamental concept level, metadata also has a ton of different use cases, and I think we’ve just scratched the surface of what those use cases could look like. Today, in our ecosystem, we’re talking about data discovery, or data lineage, or data observability — these are just two or three use cases of what metadata can do. In the future, you could be using metadata to auto-tune your data pipelines, or to cost-optimize your entire data management ecosystem. There are a ton of different use cases for what metadata can do. So the way I think about the metadata plane is that it’s the foundation — the control plane, to be honest. You bring in all of your metadata, and then you use it to drive these use cases. Governance and security, and catalogs and discovery, are some of them, but there’s a ton of other, newer, intelligent and operational kinds of metadata use cases that still remain to be discovered, in many ways.
Eric Dodds 47:43
So interesting. Well, we are close to time here, but I have one more question for you, because we like to get advice from our guests. One really interesting experience you’ve had is tackling these massive data problems with multiple different types of data. Going back to the beginning of our conversation, we talked about clean gas and how that involved geospatial data and economic data. What are maybe one or two of the lessons you learned facing a big, sort of crazy data problem like that, that our listeners could learn from?
Prukalpa Sankar 48:29
That’s a great question. So it’s a couple of things. One — and this comes back to me, and maybe this is the same pattern, which is probably why I’m building Atlan — to me, I think it really just comes down to the team and the culture. I think that is the most important thing in being able to crack the most difficult data problems. For example, in that team I was telling you about, on the clean cooking use case, I honestly have not heard of a problem like that being cracked that way. It took us multiple iterations over three months to actually get there. And the reason I think we were able to do it was that we could fundamentally rethink how to frame the problem — accessibility versus profitability. It sounds simple today when you hear it, but when you were really trying to figure it out at the time, it was not. And we had a development economist in the room, we had a data engineer in the room, we had a project manager who came from a political background in the room — we had all these very, very diverse people in the room. I think that enabled us to rethink the problem from first principles in a way that a standard team, with maybe just analysts, just a single kind of persona, would not have been able to. So that diversity is very, very important. Again, I go back to that example: we actually had a solution that had been signed off by the client, and it was not the ideal solution. There are multiple ways to solve a data science problem, and it was a top-down way of allocating where these gas centers should be opened — which districts — and it wasn’t solving the access problem, it was solving the profitability problem.
And so, literally three days before the final presentation to the cabinet minister, I remember my co-founder and I were in the room, and my co-founder listens to the problem and he’s like: hey, this is not a profitability problem, this is actually an access problem. So why are we not thinking about it from a geospatial perspective? And we actually flipped the entire solution in two or three days, and that wouldn’t have happened if we didn’t have the diversity in the room. So that, to me, is the most important thing: you should really strive to find a way to build diverse teams and have them work together. I think the second aspect of that is trust. The problem with diversity is that it’s really hard to build trust in such teams. Going back to the number on the dashboard breaking — I know we laugh about it a lot in the data space, but the reality is that at the moment the cabinet minister called me and said the number on the dashboard was broken, I couldn’t answer his question as to why. At some level, the hard-won trust I had built with him dropped. At the same time, when I called my data engineer, he said, I’m going to pull audit logs and check — and at some level, I didn’t know if the problem was that the pipeline had broken or if my data engineer was messing up his work again. This creates a trust deficit in diverse teams. In most teams — say a sales team, where the leader started out as a sales rep — everybody does the same job and everybody has clarity. That’s not the case in a data team. So the second most important thing to build in a data team, to make it successful, is an ecosystem of trust. How do you help people trust each other? How do you help people trust the data they’re working with? That’s the second thing I would invest in as a data leader.
Eric Dodds 52:22
Incredibly wise advice and we thank you so much for that Prukalpa and thank you for your time today. It was a great conversation.
Prukalpa Sankar 52:30
Thank you so much for having me. This was a lot of fun.
Eric Dodds 52:34
My big takeaway — we covered a bunch of topics, but I appreciate that Prukalpa returned to a theme that we’ve heard on the show multiple times. It was so great to see her think through all of her experiences with data, building a data ops platform, and what she came back to as the most important thing in solving data problems as a team. I really appreciated how she said diversity is so important to have on a team that’s solving a data problem, but it also makes the trust component difficult, because people are coming from different backgrounds and skill sets and have different responsibilities to stakeholders in the project. That’s one of those things we have all heard in the back of our minds, but to hear it articulated like that is always a great reminder.
Kostas Pardalis 53:35
Yeah, 100%. If you think about it, when you build a company and you build a product, you build the product for a very specific persona — you have only one persona to keep in your mind, and even that is super hard, figuring out how to satisfy that one persona. Now put yourself in the shoes of a data professional — analyst, data engineer, whoever is a member of the data team. These teams have as their customers all the different departments and functions the company has, and they have to satisfy all those different personas by delivering services or products. That’s exponentially harder to do. And of course you need trust — without trust, you can’t do anything. So yeah, I think that was probably one of the most important topics that stuck with me during this conversation. We don’t usually talk that much about this when we talk about data and the technologies around it, but we should spend more time on it.
Eric Dodds 54:46
I agree. I agree. Well, thanks again for joining us, and we will catch you on the next episode.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at firstname.lastname@example.org. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.