Episode 180:

Data Observability and AI for Data Operations Featuring Kunal Agarwal of Unravel Data

March 6, 2024

This week on The Data Stack Show, Eric and Kostas chat with Kunal Agarwal, the Co-Founder and CEO of Unravel Data. During the episode, Kunal discusses the evolution of data operations and the role of Unravel in simplifying these processes. The group discusses the shift towards real-time workloads, the impact of AI and machine learning, and the challenges of cloud migration and managing complex data environments. Kunal shares his journey from fashion to data management and emphasizes the importance of observability for data ops teams. The conversation also covers cost optimization, the productivity of data teams, reliability of data systems, the unique cost management considerations in cloud versus on-premises setups, and more. 

Notes:

Highlights from this week’s conversation include:

  • The evolution of data operations (1:13)
  • Unravel’s role in simplifying data operations (2:17)
  • Kunal’s journey from fashion to enterprise data management (5:23)
  • The Unravel platform and its components (10:08)
  • Challenges in data operations at scale (16:34)
  • Users of Unravel within an organization (22:32)
  • Calculating ROI on data products (25:55)
  • Understanding the cost of data operations (27:01)
  • Measuring productivity and reliability (30:59)
  • Diversity of technologies in data operations (34:52)
  • Efficiency in cost management (44:15)
  • Implementing observability in AI (47:55)
  • Challenges of AI Adoption (50:17)
  • Final thoughts and takeaways (51:36)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here with Kunal from Unravel Data. Kunal, thanks for spending a few minutes with us today.

Kunal Agarwal 00:29
Eric, Kostas, thank you so much for having me here.

Eric Dodds 00:32
Give us your background, how did you get into data? And what are you doing today?

Kunal Agarwal 00:38
Yeah, so I’m Kunal, co-founder and CEO of Unravel Data, which I started with my co-founder, who’s a professor of computer science at Duke University. We both started this company to simplify data operations. We feel data engineers and data teams spend too much time firefighting issues rather than being productive in the data stack. And we wanted to automate and simplify some of the daily activities that these teams have to wrangle with every day, simplify their lives a little bit, and make the data work that they do more exciting. Yeah, and I’m really excited to be with you today, because I would really like to talk about how data operations have changed in the past, like, 10 years. Like, it’s

Kostas Pardalis 01:25
extremely interesting that you’ve seen this whole journey, like, from the Hadoop days up to today, all the changes that have happened. And I have a feeling that the complexity around data operations has exploded, right? Especially with having, pretty much every one or two years, new use cases around data coming in. So even, let’s say, observability in data: what does it mean? What did it mean five years ago, and what does it mean today, when we also have AI, for example, in the mix? So I’d love to get more into this journey, how things have changed, and what it means today to operate, to be an operator around data. And of course, learn about Unravel and how it helps in that. What about you, what are some topics that you’re excited about?

Kunal Agarwal 02:17
Yeah, there’s never a dull moment in the life of a data team member, especially in the last few years. So we’ve gone from doing things with Hadoop as the one Swiss Army knife, if you may, to having a multi-system stack now. We’ve gone from primarily running on-prem to now primarily running on the cloud; that’s a mega change that’s happened. The other is you’ve gone from doing these batch workloads with ETL to now doing real-time or near-real-time workloads, you know, in production and not just as a science project. And then we’ve gone from doing BI, business intelligence, or just advanced analytics workloads, to now doing machine learning and AI in production. So if you’re part of a data team, as a data engineer or a data scientist or a data analyst, you’ve had to keep up with the demand of your business, and you’ve also had to keep retooling and reskilling yourself, from how you work on a MapReduce-based system to now a BigQuery or a Snowflake system. It’s incredible, the pace of change and the rate at which things are evolving in this ecosystem, right? So that’s a very exciting part for us. Because what Unravel is ultimately helping do is simplify how these data engineers or data analysts are creating their applications, and how they’re making sure that these applications are reliable: that they work on time, every time, that they don’t break, that they get the right results. And then for the data leaders, and for people who are managing these projects, it’s making sure that they are able to scale in a very efficient manner. It’s not a linear scale in dollars versus, you know, productivity. Can we bend the cost curve as these environments are scaling up, so that they’re starting to get more bang for their investments? And as we now see, the most exciting thing, you know, which is actually here right now, not even in the near future, is AI, and how those workloads are changing businesses and turning industries upside down.
So it’s a really powerful industry to be a part of, and a really exciting time to be a part of this industry. But it’s not for the faint hearted. It’s for people who are up for a challenge, who like change, who like evolving and also like to try new things. And that’s what makes it exciting overall, for everybody in this industry, and I’m sure you’ve had the same experience through process

Kostas Pardalis 04:49
Yeah, 100%. I can’t wait to get deeper into all that. Eric, what do you think?

Eric Dodds 04:54
Kunal, I’m so excited to chat with you today and dig into all things data ops. But your story as a tech founder actually started in the fashion industry, you know, and you’ve come a long way to enterprise data management. So go back to the beginning: how did you start in the fashion industry as a tech founder?

Kunal Agarwal 05:22
You know, I just spent a lot of time trying to figure out what to wear every day. I’m sure we all did, and we still don’t look that good, do we? It came from an actual frustration: you know, why don’t we have something like what we have for songs, which recommends what you should be listening to? You’re not always thinking about the exact song you want to listen to; it just shows up, and it’s the right song at the right moment. So we decided to create an algorithm that helps you decide what to wear, based on a lot of different factors, such as where you are, what the weather is like, and what your friends are wearing. And then it took stock of the stuff that you actually have in your wardrobe versus things that you should be getting in your wardrobe. It was exciting. But I realized that, you know, B2B enterprise software is where I have more experience. And if you break down even that fashion experience, it is really a big data model that has a recommendation engine running on top of it, based on a lot of data, that helps you connect the dots and figure out what you should be wearing. And, you know, I was consulting with Oracle products back in the day, working with large enterprises, when I started to get my first exposure to technologies like Hadoop, which, obviously, you know, was very nascent. We’re talking back in the 2012, 2013 timeframe. It was very powerful: you could get a lot more processing done at scale for a really cheap price, because it’s open source software that can run on commodity hardware, unlike Vertica or Teradata back in the day, which were costing millions of dollars. But we realized that it needed to be a complete product. It needed to not have rough edges; it needed to be simple and intuitive for more users to get on the platform and start to use this powerful technology. So that’s when I met my co-founder, who was at Duke University.
And we figured that if we were able to simplify running applications in a highly performant way, and make that automated, then that would reduce the amount of toil a data engineer spends in getting their applications into production. And that was the hypothesis that we started Unravel with. And since then we’ve extended the platform to, obviously, continue to focus on performance and reliability, but then also start to think about efficiency and cost. And then, as Kostas was talking about earlier, the evolution has also led us to make sure that we are able to support all the technologies that the teams are using. Back in the day, it was just one technology, Hadoop, and today it’s a whole zoo of animals, you know, with complicated names, but really powerful stuff. So we want to give users a choice to bring any technology that they’re using, and then be able to get that same quality of service and an efficient way to go and scale their environment, right? That’s really our promise to the customers. But yeah, it’s been a fun journey from the fashion days, Eric.

Eric Dodds 08:38
Yeah, before we jump into Unravel, I do have to ask: what was the most surprising thing you learned about the fashion industry, or even fashion consumers, you know, during your time in it?

Kunal Agarwal 08:53
You know, interestingly enough, we learned that men engaged more than women did. Yeah, much higher than anticipated, and, you know, marginally higher than certain categories of women in different demographics. When you break it down by region and age, there were more men who were participating and being more active about this than women were. And I think the reason for that is women have so many other outlets to discuss fashion, and men did not. This became one of those places where they would actually engage and understand, like, oh, what are my choices, you know, right? But then you also have men who go all the other way, like the Zuckerbergs and, you know, the Steve Jobses of the world, who just have a uniform that they wear every day. And, you know, come to think of it, that may just be a better tactic anyway.

Eric Dodds 09:51
They do. Yeah, totally.

Kunal Agarwal 09:54
But that was definitely, you know, interesting and insightful. Yeah, maybe

Eric Dodds 09:59
we’re just much more clueless when it comes to fashion. And so you got to create it. For me it’s

Kunal Agarwal 10:04
looking around, boy. Yeah, that’s probably

Eric Dodds 10:08
Well, give us an overview of the Unravel platform. There are multiple components here. We talked about data ops, so maybe we can start just with a definition of data ops. How do you define data ops? Yeah.

Kunal Agarwal 10:21
So it’s rather simple. Think about all the stages your data pipeline or your code has to go through to get to an outcome: all the code that you have to write, all the sequencing you have to do, the infrastructure that it’s running on, the services that it’s touching. It’s a rather complicated tangle of wires, if you may; that’s the kind of visual that comes to mind. And when something goes wrong, or something’s not behaving the way you’re anticipating it to behave, then you start to ask the questions: what’s going on, why is it happening, and how do I go and fix it? And to answer those questions, you need this thing called observability, right? That’s the simplest way to think about it: you need to understand everything that’s happening inside to then be able to ask questions. So the Unravel platform, at its center, is an observability platform for data ops teams. And for data ops, really, it helps do a couple of things. Number one, it makes your applications highly performant. Your business is depending on certain data pipelines or certain AI models finishing correctly and on time; otherwise, revenue is hurt or your products aren’t advancing. So Unravel makes sure that happens, that your service level agreements, internally and externally, are met: your so-called SLAs. The second thing is, if you do have a reliability issue, then Unravel helps you troubleshoot that and fix those issues in a proactive and automated fashion, which we’ll dive into. And then third, nobody’s running a small data environment these days, because every company is becoming a data company. So when you’ve got them spending $100,000, to a million dollars, to $10 million, to the bigger companies spending hundreds of millions of dollars, you need to make sure that you’re doing it in an efficient manner. And what we’re seeing is companies are wasting upwards of 30 to 40% of their cloud bill by just doing wasteful and inefficient things that they may not even be aware of.
There are some common things like, you know, keeping the tap on when you’re brushing your teeth, when you should be turning it off. It ranges from something as mundane as that to writing more efficient code. But there’s a lot of, you know, efficiency to be gained out there. So when we step back, we look at it as: hey, let’s connect to everything. So if your data team has 7, 12, 14 different components that are stacked up together, Unravel connects to and collects data from everywhere, so we know absolutely everything that’s going on. And then we collect data from all the layers of the stack, horizontally as well as vertically, so from your code all the way down to infrastructure: see everything, gather everything, measure everything. And once all this data is inside the Unravel platform or our service, that’s when we run our algorithms and our AI models on top of it to automatically detect where the problems are, so you don’t have to go hunting for them, and to tell you why they’re happening, so you don’t have to do the cross-connection. It’s connecting the dots for you, so you don’t have to go and understand why something happened. And then, in some cases, it gives you an automatic resolution, and in some cases, where it’s not possible to automate things, it gives you a guided remedy that at least tells you what to go and do to get out of this particular issue. So it stops that trial and error that’s going on in your head: maybe I should try this out, maybe I should try that. You don’t have to do that anymore. And what we’ve seen is, if you approach it in this way, then you can save several hours per problem per engineer inside a company, which ultimately manifests itself in better productivity, just improvement in productivity rates and efficiency rates across your organization, but also improvement in the efficiency of your infrastructure. But more importantly, you can now start to depend on your data outcomes.
Companies are betting their reputation and their money on data outcomes. And if it doesn’t work half the time, then it’s useless. Now you can stand behind it and say, you know what? This thing that we’re launching, this recommendation engine that we’re creating, or this fraud prevention app that we’re launching, it will work on time, every time. And that’s when companies can start to confidently invest in the second wave of AI, or any other applications that they may be, you know, creating out there.

Eric Dodds 14:50
Yeah, it makes total sense. I wanted to get to the analogy, which I loved, you know, turning the tap off while you’re brushing your teeth. It made me think about something you said earlier, which was, you know, you started in data back when, in terms of big data, Hadoop was really the main game in town. And it made me think back to, you know, business intelligence, which originally was a finance function, right? A lot of times it reported up to the CFO, which in many ways makes a lot of sense. But then, you know, there are a couple of dynamics that happen. Number one, the cost of storage just starts to plummet, and storage and computation separate with this big migration to the cloud. And so, all of a sudden, even just that is this massive workflow optimization, right? Wow. Like, you know, we can be so much more efficient than we used to be, we can run way more queries, etc. Pipeline technology advanced significantly, and so it’s way easier and cheaper to move data around. And so, you know, free-for-all is probably too strong of a term, but it’s like, well, yeah, let’s just load the data warehouse, load the data lake, and we can do all sorts of analytics, self-serve analytics, you know, machine learning, all this sort of stuff. And now we’re sort of coming full circle, right? When you get the compute bill at the end of the month, finance is like, okay, you’ve got to figure out, you know, who owes what on this big compute bill. Can you kind of talk through that? Because you’ve lived through that story, and Unravel has lived that story, in many ways.

Kunal Agarwal 16:34
Yeah, no, you’re absolutely right, Eric. So if you break it down, there are three things that have increased, right? The number of use cases for data has increased. We’ve gone from, hey, it’s good for financial reporting and it’s good for understanding our sales data, to now, you know, we want to create brand new products, we want to improve our operations, right? So the use cases for data have increased. The data sets that we’re capturing have increased. We were only capturing a subset of our financial data or sales data. Now we know everything about the customer, right?

Eric Dodds 17:10
Yeah, it was just transactions, mainly. And now it’s, yeah, every digital touch point.

Kunal Agarwal 17:16
Exactly. And the users of the data technologies have increased as well. Earlier, this was limited to the hardcore engineers, and maybe the financial analysts who knew how to, you know, switch from Excel pivot tables to a more powerful system. But that was really it. Now product guys are on it, marketing guys are on it. Every department of the company wants to get on it; legal teams want to get on it, right? So the number of people jumping on these platforms has increased. By the way, all three of those things are good things to happen, because you can get some great outcomes with data. But back to your point, this does become a mess as companies start to scale this out. Because the promise that we have heard, that the cloud will solve all problems and world hunger, is actually not true. The cloud is better suited for data analytics, absolutely. It’s got limitless compute and limitless storage, so you can definitely scale out your systems, for sure. But as companies started to democratize data access and give it away to a large audience, there started to be spurts of people using data analytics in a fashion in which it should not be used, knowingly or unknowingly. And a big part of that is the range of skills that people have in these different departments. Not everybody is an expert on data systems. And you may have, you know, on the other end of the spectrum, some people who just use click-and-drop tools, maybe some SQL, and unknowingly what’s happening is they’re creating inefficiencies in code, inefficiencies in the way these pipelines are scheduled and run, or just in how these AI models are being used. So I’ll give an example. You know, again, a mundane one, like turning the tap off while you brush your teeth: it can be a SELECT * that a novice user does on mega tables, and it’s something that I hear about from our customers every week.
And it racks up hundreds of thousands of dollars in bills. And you’ve got to scratch your head: who did that? Why did they do that? Sometimes doing a SELECT * may be what you need to do on a table. But then: who did that? Why did they do that? How do we keep it from happening next time? Those are the questions people start asking once they’re shocked. Yeah, that’s just one simple example. There are 100 other ways in which people can rack up these inefficiencies. Another one is, for example, in architectures that are not serverless, or even in serverless architectures, you have to understand what the size of your warehouse is, or what the size of your containers is. And if you give people a choice of small, medium, large, and extra large, guess what everybody’s choosing? Extra large, because theirs is the most important, biggest, baddest workload compared to the next guy’s. Nobody ever chooses a small, or maybe a medium. But then, you know, you run a profiler on that, and you understand that you’re only using 10% of the resources. So it could have been one tenth of your cost. But people don’t know that. So, you know, I can go on; there are so many of these inefficiencies that happen all the time. But even before you go to improve the system, just understanding who’s spending what becomes a critical issue. So companies now have a policy of showback as enterprises start to increase their usage. Showback is: hey, look, we spent a million dollars last month, and we’ve got five departments. Did you all spend $200,000? No, this person spent 100, this person spent 300, etc., etc. So it really breaks down who’s spending what. And then companies are also going to chargeback, where each group actually has to pay its share out of that million dollars, and that’s how they’re going to go and pay this particular bill, right? So it’s an interesting evolution, because on-prem you didn’t have to think about that, because it was a fixed set of resources.
So the worst you could do is Eric could steal from Kostas, and Kostas’s workloads would stop. But the bill would still be the same, because it’s hardware that you are depreciating.
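
The showback idea described above (splitting one cloud bill by who actually spent it, rather than evenly) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the department names and per-job dollar amounts are hypothetical, chosen to add up to the million-dollar bill used as the example in the conversation, and the `showback` function is not part of any Unravel API.

```python
from collections import defaultdict

# Hypothetical per-job cost records: (department, dollars). Illustrative only;
# the amounts are chosen to total the $1M example bill from the conversation.
JOB_COSTS = [
    ("finance", 60_000), ("finance", 40_000),
    ("marketing", 300_000),
    ("product", 250_000),
    ("legal", 50_000),
    ("data_science", 300_000),
]


def showback(job_costs):
    """Return {department: (spend, share_of_total)} from per-job cost records."""
    spend = defaultdict(float)
    for department, dollars in job_costs:
        spend[department] += dollars
    total = sum(spend.values())
    return {dept: (amount, amount / total) for dept, amount in spend.items()}


report = showback(JOB_COSTS)

# Print departments from highest to lowest spend.
for dept, (amount, share) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{dept:<13} ${amount:>10,.0f}  {share:6.1%}")
```

In a chargeback model, the same per-department totals would be the amounts each group is billed, instead of simply being reported back to them.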

Eric Dodds 21:36
Yeah. Now, yeah, I mean, this is a fascinating conversation. And I agree, the showback and then the chargeback dynamic is, you know, super interesting. Can you help break this down? We’re talking about larger companies here; you know, a smaller company isn’t necessarily going to face these issues, because their workloads are fairly simple, right? And even their stack is simpler. But at scale is when these things really become a problem. Can you talk about who is the sort of owner of Unravel within an organization, or who is that group? And can you just kind of break down what their day-to-day problems are like? You know, I’m sure we have some of those people in our audience, and then some people who work with data in an org where maybe they’re not as familiar with that person’s sort of day-to-day issues relative to this infrastructure.

Kunal Agarwal 22:32
Yeah. So multiple people on the data team use Unravel. Let’s start with the data engineers, because they’re always near and dear to our heart. Unravel helps data engineers in a couple of ways. Number one, when you are developing your application, removing errors and removing bottlenecks from your code is something that Unravel helps you out with automatically. Putting that code into production, and making sure that it’s running there and meeting its SLAs every time, is something that Unravel also helps out with, using its AI engine to make sure it can understand any deviation in performance: what happened, was something new introduced? And if you don’t have something like Unravel, then data engineers get called into these production issues, or into the cycle that, you know, moves applications from dev to prod, when somebody’s doing a code check or a code review, for example, right? The other side that we help data engineers out with is, again, when it’s running in a production system, your boss, the head of product, or the head of the business unit may say, I need you to go and take this cost out, I need you to run this more efficiently. And now these data engineers have to go and hunt for ways in which they can reduce their costs, which is also something that Unravel can help you automate, telling you in plain English: look, go and do this thing; you will not sacrifice performance, but your costs are going to improve by so much. Right. So that’s one group of people that use Unravel on an everyday basis. The other group of people is the centralized group of leaders and operators who are responsible for making sure that this environment serves the purpose for every business unit, meaning they may get a Databricks environment, they may get a Snowflake environment with some Kafka and Starburst, right? And they are providing this to their business and saying, hey, use the stack, and it will run well.
So that’s the other group that uses Unravel, to make sure that the performance, the reliability, and the cost parts are all taken care of. And then this group is also able to set budgets and guardrails for all these subgroups that are using this platform, so that they can proactively understand how the cost trends are going and are not surprised at the end of the month. And then, if there is any misuse or rogue usage, you’re catching it live, rather than catching it in retrospect when you’ve already burned through the dollars, so you’re able to fix that problem in real time. And ultimately, when companies mature into becoming a true data-driven organization, where they’re actually generating revenue from their data applications and products, then we have business leaders using the product to go and understand what the ROI is of running these data endeavors, which ultimately go and generate revenue for the company, and whether we can improve that in a certain way. So it really starts to go bottoms-up, from the people who are running, you know, applications all the way up to how those applications are serving the business.
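
One simple way to picture the budget guardrail mentioned above is a run-rate check: project month-end spend from what has been spent so far and flag a team before the bill arrives. The sketch below is an assumption-laden illustration, not Unravel's method: the linear projection, the function names, the status labels, and the example figures are all hypothetical.

```python
def projected_month_end(spend_to_date, day_of_month, days_in_month=30):
    """Linear run-rate projection: average daily spend times days in the month."""
    return spend_to_date / day_of_month * days_in_month


def check_budget(spend_to_date, budget, day_of_month, days_in_month=30):
    """Classify a team's spend against its monthly budget.

    Returns (status, projected_spend), where status is one of
    "over budget", "on track to exceed", or "ok".
    """
    projected = projected_month_end(spend_to_date, day_of_month, days_in_month)
    if spend_to_date > budget:
        status = "over budget"
    elif projected > budget:
        status = "on track to exceed"
    else:
        status = "ok"
    return status, projected


# Hypothetical example: $60,000 spent by day 10 against a $100,000 budget.
status, projected = check_budget(60_000, 100_000, day_of_month=10)
print(status, f"${projected:,.0f}")
```

A real guardrail would likely use a better forecast than a straight line (workloads are often bursty), but even this check surfaces the problem on day 10 rather than at the end of the month.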

Eric Dodds 25:43
Yeah, one quick question. And I know Kostas has a ton of questions, but I want to dig into the ROI question, because you’ve mentioned a couple of times, you know, sort of a data output or data product, right? And I’m just going to pull an example, and tell me if this is a bad one, and maybe you can pick another. But I think about, you know, TurboTax: it’s a system that allows end users to submit their information, you know, and essentially file a tax return, right? That’s a hugely intensive data operation, because it’s ingesting all this information, it’s running it through all sorts of queries, I’m sure there’s machine learning going on in the background, and it has to check it against, you know, all sorts of regulations. I mean, that thing is probably gnarly, and it requires a huge amount of data infrastructure. And so can you walk us through how you would think about calculating the ROI on that product from a data and infrastructure standpoint?

Kunal Agarwal 26:44
Such a good example, Eric. So you can take Intuit TurboTax, which is ultimately costing you 10 bucks a pop, right? And then you walk backwards. And then you have to really understand what the unit cost of serving is, and the margin you’re making on that product. So any data product has multiple stages: you have to collect data, you have to cleanse that data, then you’re running some algorithms on top of it, and then you’re getting some outcomes. And I’m making it very simple. I bet the engineers listening to me on this podcast are probably like, yeah, that’s like 100 steps

Eric Dodds 27:21
than what they presented in the board meeting.

Kunal Agarwal 27:24
Especially the guys at Intuit would be like, dude, you made this way too simple. It’s probably a 100-step nested workflow; you’re looking at something running on Airflow, you know, something running on Spark. And I know Intuit is now on Amazon; it was a mega migration from on-prem. Again, that’s part of the evolution. So anyhow, what you need to ultimately do is understand what the cost of one unit of work is. So if you’re running on Spark, what’s the cost of that Spark job? And how many Spark jobs do you have? And then understand end to end, from source to outcome, what’s the cost of all those different stages and multiple pipelines put together, really. And then you have to think about what the optimized-cost version of doing that is. So there is a cost: it costs me $10,000 to run this pipeline. But if I understand the room for optimization, I can make this pipeline run for $6,000, for example, right? Now, how many users can that $6,000 serve? It can serve 1,000 people? Great, it’s six bucks a pop on the cost side. And then you want to bend the cost curve as you’re scaling up. So if it’s 4,000 people, what does it look like for 10,000 people? The answer should not be linear. And what if it’s 100,000 people, right? That’s how you start to scale it out, and then you understand how much margin you can get. Now, that’s a very advanced, mature company that is using Unravel data to be able to do that. But what we’re encouraging people to do is start to think about that from the get-go. Because you don’t want to run a full project and spend millions of dollars to then come to the conclusion that, you know, this is actually not a feasible product, or just not a feasible project for our business to even get into. This is especially true, Eric, in the age of AI, as everybody wants to create some sort of an AI outcome. But then if you get LLMs off the shelf, you’re spending about three to $10 million a year.
But if you create your own LLM, you’re looking at $150 million of spend. So really understanding how you are going to measure those costs and how you are going to break them down, and then thinking about products that way, is super important from the design phase itself. And that comes back to our philosophy of measuring everything, right? It’s like the philosophy of bringing all your data into a data lake. Now that you’ve done that, start to measure every process from the get-go, so you at least know what this is costing you from the get-go, to then think about how you should be scaling this up. Yeah,
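
The unit-cost reasoning above can be written out as a small worked example. The $6,000 pipeline serving 1,000 users and the $10 price come from the conversation; the larger scaling scenarios are made-up numbers, included only to show what a cost curve that bends (grows slower than the user count) looks like. The function and variable names are hypothetical.

```python
def unit_cost(pipeline_cost, users_served):
    """Cost of serving one user: total pipeline cost divided by users served."""
    return pipeline_cost / users_served


# Figure from the conversation: a $6,000 pipeline run serving 1,000 users.
print(unit_cost(6_000, 1_000))

# The "10 bucks a pop" from the conversation, used as an illustrative price.
PRICE_PER_USER = 10.0

# Hypothetical (users, total cost) scenarios. The costs are invented to show a
# sub-linear curve; they are not measurements from any real system.
SCENARIOS = [(1_000, 6_000), (10_000, 40_000), (100_000, 250_000)]

for users, cost in SCENARIOS:
    per_user = unit_cost(cost, users)
    margin = PRICE_PER_USER - per_user
    print(f"{users:>7,} users: ${per_user:.2f} per user, ${margin:.2f} margin per user")
```

If cost scaled linearly, the per-user figure would stay at $6 at every size; the point of the exercise is to check whether it falls as volume grows, and to do that check at design time rather than after the money is spent.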

Eric Dodds 30:03
makes total sense. Kostas, one more question, forgive me. So we’re talking about infrastructure, right? But I’d love to know, you know, and I’ll use the example of sales and marketing cost. You usually measure that in a ton of ways, right? But when you’re measuring it from a finance perspective, you’ll measure it like this: you know, okay, how much spend, you know, marketing spend, do we have? And then you have a fully loaded cost, which includes all of the headcount and all of the commissions on the sales side, right? When it comes to data and the types of ROI you’re talking about, how are organizations thinking about the human capital aspect of it? Because it’s not like these systems just run themselves, at least not now; maybe in the future. But you have people who need to run these systems, right? And how do you think about that as part of the cost equation there?

Kunal Agarwal 31:00
It is one of the bigger parts of the cost equation, actually. So even on the data side, it’s not just the infrastructure. On the stack side, it’s the infrastructure, it’s the datasets themselves, it’s the services that you’re using; you know, all of that stuff adds up to your total cost of the stack. And then you’ve got the cost of the people. The way we think about the cost of the people is in terms of a throughput measurement: what kind of productivity are you getting from a class of people? That’s what we’ve seen work best. The productivity or throughput metric could be anything that’s most relevant for your organization. It could be how many data pipelines per team, or per member of the team. It could be how many AI models or how many new pipelines you are able to generate every month with your team. And then, you know, you could also map that back to how many issues, how many problems, how much downtime you had in your environment, and then start to see the productivity of your team across that. What we have seen, though, Eric, is that people’s productivity is nowhere near where it should be. You’re getting about half productivity, meaning four hours of productive time a day out of the eight hours a day a data engineer is working is the average right now. That’s what you’re getting. So people are spending half their time firefighting, wasting time on troubleshooting, debugging, fixing problems. Things are breaking, and they’re trying to stand them back up. Things were working yesterday; today, they’re not. It’s a complicated piece of tech that these guys are running. And unfortunately, they haven’t had enough time to train themselves. People running Oracle systems have been masters of Oracle systems for 20 years, right? People running Databricks, Snowflake, or BigQuery have been running them for two years, three years at most, right? So they haven’t gone through those experiences and sorted this out. So productivity will get better.
That comes as more maturity happens and these data teams get more experience. But the business cannot stop. The business is running, because the competitors are creating amazing data outcomes. They just need to get theirs out to the market as well. And that’s where automation comes in, you know, which is what we do with Unravel. You don’t have to be an expert; it tells you in plain English how to go and fix certain things. So you could be a person who’s coming in straight from Teradata onto Snowflake, and you would know Snowflake overnight. And if you had any issues, you wouldn’t be spending four hours a day on that; it’d be a couple of clicks and a minute or two, if it’s not completely automated.
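As an editorial aside, the throughput metrics described in this answer (pipelines per team member, productive hours out of an eight-hour day) could be tracked with something like the sketch below. The class name, field names, and sample numbers are hypothetical, apart from the four-of-eight-hours figure quoted above; this is not Unravel’s implementation.

```python
from dataclasses import dataclass


@dataclass
class TeamMonth:
    """One month of (hypothetical) data-team activity."""
    engineers: int
    pipelines_shipped: int
    incidents: int
    productive_hours_per_day: float
    workday_hours: float = 8.0

    def pipelines_per_engineer(self) -> float:
        # Throughput: new pipelines delivered per team member this month.
        return self.pipelines_shipped / self.engineers

    def productive_fraction(self) -> float:
        # Share of the working day not spent firefighting.
        return self.productive_hours_per_day / self.workday_hours


# Illustrative values only; 4 productive hours of 8 is the figure from the episode.
team = TeamMonth(engineers=5, pipelines_shipped=10, incidents=12,
                 productive_hours_per_day=4.0)
print(team.pipelines_per_engineer(), team.productive_fraction())
```

Tracking the incident count alongside throughput is what lets a team see whether firefighting is eating into delivery.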

Kostas Pardalis 33:40
Okay, I have a question about reliability, especially in the mature environments that you have seen, like in the enterprise, and, let’s say, more purely data-driven companies, right? My experience is especially with the enterprise, because what is interesting with them is that they’ve been around long enough, right, to go through many different products for what they are trying to do. And what I’ve seen in practice is that usually technology does not get replaced immediately. You usually end up with pretty much everything running together. I think if anyone could take a look into a big account of a Fortune 100 company, they would probably see pretty much every possible vendor in their operating system, right? How is reliability managed when you have so many different systems, and so many steps that the data has to go through, right? And let’s look at it from a technology perspective for now, because when we get into the people aspect of it, it gets even more complicated. But how have you seen things working there, with all this diversity of technologies operating together?

Kunal Agarwal 35:15
You’re right, Kostas. So the people side is hard, right? Because no one person is an expert in all the systems in the stack. And that’s an inherent problem. On the technology side, look, people are choosing different technologies for different use cases. That’s the reason why they have different stacks. The other reason is just compliance: certain data cannot move to the cloud, so you end up with what we call the on-prem version and the cloud version. And then the third, as Eric was pointing out, is, you know, with democratization and opening up the data stack to the company, people were kind of encouraged to go, like, hey, if you want to go spin something up, go spin something up, right? You want to start a Snowflake cluster, start it up. And before you knew it, you had these, you know, bursts of clusters here and there. And then before you know it, the entire company started to use it. All these technologies are very different. The similarities end with the fact that it’s a SQL engine. But whether you work in Presto or Trino, the way you triage that versus the way you triage a Spark SQL application is completely different. So reliability is something that has always been an issue, since the early days of MapReduce. And the only way to solve that is to understand what’s happening under the hood. And a lot of people just don’t have that skill set. Like, you know how to drive a car, but you don’t know how to fix a car; it’s the same thing. And what Unravel does is attack that problem head on, by automating all the steps that somebody would be doing while triaging: collecting logs, collecting metrics, you know, connecting the dots between all of these different causes and effects, and then bubbling up and saying, look, it could be 100 things that could be causing this problem today, but this is what it is, and this is how you need to resolve it.
So instead of even giving you a check engine light, imagine it was more descriptive, so you knew what to do when you took it to a mechanic. It would just say, hey, you have this problem in your wheel, get this fixed, and that would be a faster way to resolve it. So that’s where we have seen the cloud fallacy of, hey, the cloud is a no-ops or a low-ops solution, actually, you know, fall flat, because bad code is bad code, right? It doesn’t matter where you write it. So you can have the same experience and problems no matter which environment you’re running in, if the underlying cause of those problems is similar across these different environments. So while the cloud makes some things easier, it’s not a silver bullet that will resolve all your reliability problems by itself. Now, the way it manifests itself, and, you know, coming a little bit to the people side of the question, is this: it could be an internal or an external application that you’re running. If it’s an external one, like your consumer store running on Etsy, or an online banking app, and that doesn’t work, your customers can’t use your services. And if it’s an internal one, then there are people who are waiting for that report or waiting for that analysis, and business decisions are getting held up for it. Each of them has an SLA. And what Unravel helps you do is guarantee those SLAs. We’ve seen companies where those SLAs were missed 10% of the time, 7% of the time, sometimes 20% of the time. So we’ve gone from 80% SLA attainment to about 99% SLA attainment, you know, for those kinds of different data applications, just because a system is looking over things and making sure that problems are caught proactively, and there is a fast solution to fixing that problem without it hairballing into an even bigger issue.
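For readers who want to put a number on “SLA attainment” as it is used in this answer, here is a minimal sketch. It assumes each run of a data application has a recorded duration and a deadline; the function name and the sample runs are hypothetical, chosen only to reproduce the 80% starting point mentioned above.

```python
def sla_attainment(runs):
    """Return the fraction of runs that finished within their deadline.

    `runs` is a list of (actual_minutes, deadline_minutes) pairs.
    """
    if not runs:
        raise ValueError("need at least one run to compute attainment")
    met = sum(1 for actual, deadline in runs if actual <= deadline)
    return met / len(runs)


# Hypothetical daily report with a 60-minute deadline:
# 8 of 10 runs finish in time, 2 run long.
before = [(50, 60)] * 8 + [(75, 60)] * 2
print(f"SLA attainment: {sla_attainment(before):.0%}")  # prints "SLA attainment: 80%"
```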

Kostas Pardalis 39:06
And, okay, you said that Unravel guarantees these SLAs, but there’s also the human factor here, right? At some point someone needs to go there and fix something, right? So how is this working between the technology and the person who’s on call today, right? How is this relationship working with Unravel?

Kunal Agarwal 39:34
Yeah, so before Unravel, you would get the problem, and you would be notified about the problem much later, because only now is it visible to somebody. So, Kostas did not get the report. Kostas, as the CEO of a company, now in a Monday morning meeting with his exec team, was not able to make the decisions he needed to make at, you know, 10am. This problem gets logged. Then somebody on the data team gets called. The person on the data team works out whether it’s a code-level problem or an infrastructure-level problem, and then tries to ping the relevant teams. And by the way, there’s a big fight that’s happening over here. There’s a lot of finger pointing going on. Infrastructure guys are saying it’s a code problem, code guys are saying it’s an infrastructure problem. I’m sure we’ve all been there. And then it turns out, okay, say we’ve identified that it’s a code-level issue. Then we try to find the engineer who actually created the application, who wrote that piece of code, to then go and debug and dissect that, right? So as you can see, it’s a very involved process, lots of people in it, lots of time spent. And then this person is going to dig into logs, check out a lot of metrics. And by the way, each unit of work can have 100-page logs, so that’s thousands and thousands of logs to look at to go and understand what’s happening. So it’s a very inefficient process. Really, this used to take several hours in man-hours, which could actually be days in terms of clock time. And then, you know, forget about the lost productivity that happens on the business side, because applications aren’t working properly, right? With Unravel, because we are able to do the identification proactively, you will firstly understand this problem before Kostas sees this problem.
It’s like, hey, this application is not going to finish on time, that Monday morning report is not going to be generated on time. We’ll notify you about that while the app is running, and then tell you what you need to do to fix that. Secondly, because it’s root causing the problem, there’s no more finger pointing. It’s like, look, today’s issue is infrastructure. Today’s issue is code. Today’s issue is data layout, or, you know, your services themselves. So you can go pinpoint that, and that brings the team together on one side of the table rather than, you know, being combative. And then it’s giving you a guided remedy, or it’s taking an action on your behalf. So in the guided remedy, it’ll tell you what to go and do. So depending on your role and permissions, you go and do those actions, and fix or improve the reliability and performance of this application. But then in a lot of cases, Unravel can also take the action on your behalf. So you can complete the loop of doing the action as well and see the results. So a lot of times we see people wanting to, as a simple example, prevent any app from spending more than $10,000 in a quarter. So Unravel could take the action to stop this data pipeline, or this machine learning model, as soon as it nears $9,000, for example, right? So that you don’t suffer that and then, you know, have to resolve this problem reactively. So what we’ve done is improve efficiency, improve the productivity of this team, and made it more like teamwork, where everybody’s on the same team rather than being on different teams. Because when problems happen, that’s when the finger pointing starts, and you want to avoid that as well.
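The spending guardrail in this example ($10,000 cap, act at about $9,000) boils down to a threshold check on accumulated spend. The sketch below is an illustration only: the function name, parameters, and return values are hypothetical and say nothing about how Unravel actually implements automated actions.

```python
def guardrail_action(spend_so_far: float,
                     budget: float = 10_000.0,
                     stop_at: float = 9_000.0) -> str:
    """Decide what to do with a running job given its spend so far.

    `budget` is the hard cap and `stop_at` is the earlier point at which
    the job is stopped, so the cap itself is never reached.
    """
    if stop_at > budget:
        raise ValueError("stop_at must not exceed budget")
    if spend_so_far >= stop_at:
        return "stop"
    return "continue"


# Hypothetical spend readings for one pipeline over time.
for spend in (2_500.0, 8_999.0, 9_000.0):
    print(spend, guardrail_action(spend))
```

Stopping below the cap rather than at it leaves headroom for the cost a job keeps accruing between the moment of measurement and the moment it actually halts.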

Kostas Pardalis 42:54
100%. Okay, and if we switch now to cost management again, you have a very unique perspective here, because you have seen things happening on the cloud, but you’ve also seen how things work on prem, right? And by the way, there are cases where you have a hybrid solution, right? Especially, as we said, in the enterprise, you might have data systems running in their own data centers, and also have parts of the workloads running on the cloud. But let’s say the economics of one and the other are very different. When you have your own data center, you buy your own hardware, you have it there, and you can’t really go and ask for more hardware; that’s probably going to take some time to become available. And on the cloud, you have a completely different situation: pretty much at any time, you can lease whatever you want, right? So the equations there, of trying to figure out what the cost is when you operate these workloads, are different. Can you help us understand a little bit the differences there, what it means to operate on prem and what it means to operate efficiently in the cloud?

Kunal Agarwal 44:15
So when you think about on-prem costs, you’re thinking about, you know, cost per machine, the fully loaded cost per machine. So there’s the hardware, getting the machine, and all the software and services you’re going to run on that, so what your licensing costs are for everything. And then, depending on the type of hardware, you depreciate that over three to five years, right? Straight-line depreciation, so if it cost you $30,000, it’s about $10,000 a year, right? Whereas running on the cloud, obviously, it’s by the drink, you know, 20 cents per hour for only one machine. And then, you know, you keep adding more machines, obviously. So there are a lot of differences in how people approach both of these equations. In some cases, people say, look, if you have predictable workloads, stuff that just needs to run every day, it’s not going to change, and it’s going to be the same way every day, it’s better and cheaper to run it on prem. That’s what we’ve seen across a majority of the enterprises, especially for large-scale workloads. And then if you have experimental workloads, things that you may be just trying out, or you’ve got seasonality in your environment, you know, your Friday, Saturday, Sunday workloads are bigger than your Monday, Tuesday, Wednesday workloads, for example, in any kind of situation like that, having a more liquid environment that can scale up and down is a better use of resources as well as cost. That’s the primary difference. The way to start thinking about cloud costs in particular is: nobody knows what it’s going to be on day one. You can have some sort of an idea if you break down your workloads into CPU and memory and, you know, just the basic units, but you’re never going to be right. So it’s always good to measure everything from day one, so you can start to see the trends and patterns of these things.
So by the end of month two, month three, you at least have an idea of what the yearly cost could be, and can then start to put in proactive guardrails to avoid exactly the problem that Kostas was talking about, which is, hey, yeah, the cloud has infinite scale, but do we want to give people that power? Because you don’t have infinite money. And how do we put some sort of guardrails against that? Now, obviously, looking at just the numbers tells you half of the story. You’ve got to talk to your team and understand what they’re actually trying to do. In some cases, they may not even know that they’re causing these inefficiencies. In some cases, that may be the actual use case, and they’re like, yeah, it spent $100,000, and why? Because it was doing an amazing thing, and it needed to run that way. And then you put the guardrails in appropriately. But then people who are running a hybrid environment are also using it in a unique way, because they’re thinking, let’s use the power that we have on prem, and only when we need to burst workloads, only when we need to scale out workloads, do we use the cloud. But then everybody’s got their own patterns and anti-patterns in how they run these things. But these are the most common ones that we see.
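The two cost calculations in this answer can be written out as a short sketch: straight-line depreciation for an on-prem machine, and a yearly projection from the first few months of measured cloud spend. The function names are hypothetical, the run-rate projection (average monthly spend times twelve) is a simplifying assumption rather than something stated in the episode, and the monthly figures are made up for illustration.

```python
def straight_line_annual_cost(purchase_price: float, years: int) -> float:
    """Annual hardware cost when depreciated evenly over `years`."""
    if years <= 0:
        raise ValueError("years must be positive")
    return purchase_price / years


def projected_annual_cloud_cost(monthly_spend: list[float]) -> float:
    """Naive run-rate projection: average observed month times 12.

    Assumes the months measured so far are representative, which will
    not hold for seasonal or still-growing workloads.
    """
    if not monthly_spend:
        raise ValueError("need at least one month of measured spend")
    return sum(monthly_spend) / len(monthly_spend) * 12


# The $30,000 machine over three years, as in the episode.
print(straight_line_annual_cost(30_000, 3))

# Hypothetical spend for the first three months on the cloud.
print(projected_annual_cloud_cost([4_000, 5_000, 6_000]))
```

Note that the depreciation figure covers hardware only; the fully loaded on-prem cost described above would add software licensing and services on top.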

Kostas Pardalis 47:25
Yeah, yeah, no, that’s super interesting. Okay, one last question from me, because we’re close to the end here. How have things changed because of AI? And I’m talking about observability here, and I don’t necessarily care that much about how AI helps someone perform observability, but more about how we implement observability when we are implementing AI. I would assume that it’s different when you have BI, it’s different when you have ML, and it might be even more different when you have AI, although it’s similar to ML. But what have you seen out there? I’m sure you have much more experience in that. And I’m very curious to hear from a vendor: what is missing today, or what works, when we’re trying to actually bring the same value with observability, but when we’re doing AI?

Kunal Agarwal 48:27
Yeah, look, AI is super interesting. Every company is rushing to create some innovative products, or at least they’re starting off with using AI to improve their own operations, right? But when you break it down, it’s again a series of data steps, and a sequencing of data steps, that need to happen to create, you know, meaningful AI outcomes. So a couple of steps are actually similar to, say, BI workloads, where you would have your ETL or ELT of bringing data in and prepping that data as a common step. So, in fact, we almost always recommend to people to think about all your data apps as being modular pieces, and think about what you can repeat and use again, so that your costs as well as your efficiency are great. So that’s one of the ways. But to answer your question, you still have to have something that can observe multiple systems, because AI is, again, not a one-system or one-technology-based app. You need something for data ingestion, you need something for data modeling, you need something for running your AI algorithms on top, something to serve it, etc. So you need observability that is capable of measuring things in multiple services across multiple environments. And what we’re seeing, Kostas, is that this is becoming very real as people are actually moving to a multi-cloud environment as well. So you need something that cuts across these pieces too. Now, with AI, you will, again, have more teams and more users using your data platform. Because the ideas for AI-generated apps are going to come from everywhere in the organization, you’re going to have a legal team, for example, jumping on and saying, hey, we can use this dataset to do amazing things with AI for our company. Which means that leaders need to be even more careful, and recognize that you’re going to have people with varying skills. And with that may come more complexity and inefficiencies in your platform.
So having observability from the get go to measure all the sub pieces is going to be even more crucial as you’re going to be able to.

Kostas Pardalis 50:46
Okay, that’s great. All right. Back to you.

Eric Dodds 50:51
Yes, I could. Okay, I have to ask. On a personal note, now having done a consumer startup and an enterprise startup, would you ever go back to consumer?

Kunal Agarwal 51:04
That’s an itch left to scratch again, for sure. There are all these exciting things that have yet to be created on the consumer side, believe it or not. But yeah, that’s gonna be one of the companies, you know, that I will create in the future. I don’t know how far in the future, but it’s definitely an itch to scratch.

Eric Dodds 51:24
Awesome. Well, thanks so much for joining us today. We learned so much. And best of luck with Unravel and your future consumer company.

Kunal Agarwal 51:32
Thank you. Eric. Kostas, thank you so much for having me here.

Eric Dodds 51:36
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.