Episode 84:

Why Are Analytics Still So Hard? With Kaycee Lai of Promethium

April 20, 2022

This week on The Data Stack Show, Eric and Kostas chat with Kaycee Lai, CEO and Founder of Promethium. During the episode, Kaycee discusses why analytics is hard, the relationship between data virtualization and ETL, data catalogs, and more.

Notes:

Highlights from this week’s conversation include:

  • Kaycee’s background and career journey (2:34)
  • Why analytics are hard (7:28)
  • Defining “data management” (11:47)
  • Defining “data virtualization” (15:57)
  • The relationship between data virtualization and ETL (18:34)
  • Where a company should invest first (21:40)
  • Building without a Frankenstein stack (25:19)
  • How Promethium solves data stack issues (27:53)
  • Giving context to data (35:14)
  • Cataloging: background, at Promethium, future (39:29)
  • Who uses data catalogs (48:00) 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. And don’t forget, we’re hiring for all sorts of roles.

Welcome to The Data Stack Show. Today we’re going to talk with Kaycee from Promethium. Really interesting background. I’m always interested in talking to people who build technology based on not just seeing a market opportunity or thinking of a cool technology, but who have worked in contexts around the problem and repeatedly experienced different kinds of pain that relate to the same problem. That’s what Kaycee experienced, and that’s why he built Promethium. He talks a lot in his blogs and materials about how analytics is still pretty difficult, even though we live in an age of modern tooling, and I want to ask him why that is. I think it’s something that people in different roles and companies feel different pain around, but it can be kind of hard to articulate: why are analytics actually pretty hard, and why are they huge projects, even at midsize companies? So anyway, that’s my question. You?

Kostas Pardalis 1:29
I think it’s a great opportunity to learn more about a new term, which is data fabric. I’d love to learn more about it and put some context around it: why we need a new term, what it means, and how it relates to the rest of the technologies that we use. I’d also like to revisit the data catalog, which I have the feeling is related to the data fabric. We have talked about it before as a data management, data governance-related tool, but I don’t think we’ve covered data cataloging that much, although I think it’s quite important. So I’d love to hear from Kaycee about data catalogs, how they’re used, and their evolution into the data fabric.

Eric Dodds 2:17
All right, well, let’s dive in and talk with Kaycee.

Kostas Pardalis 2:19
Let’s do it.

Eric Dodds 2:21
Kaycee, welcome to The Data Stack Show.

Kaycee Lai 2:23
Hey, thanks for having me.

Eric Dodds 2:25
All righty. Well, let’s start where we always do, I’d love to hear about your background. And then what led you to Promethium?

Kaycee Lai 2:32
Yeah, thanks. So my background is a little bit mixed. I’ve got a little bit of go-to-market, as well as product, as well as financial analysis all mixed in, which probably explains how I got into the data management space and how I became the founder of a data analytics company. I started my career as a business and data analyst, in the back crunching numbers, getting insights, and, as I like to say, the guy getting yelled at by my executives for always taking too long on those insights. That led me to do everything from taking a SQL class to learning how ETL worked, why data warehouses were structured the way they were, and why I couldn’t get a data mart refreshed every minute, why it had to be every three months. My journey led from there into being more on the go-to-market side with sales, business development, marketing, and then eventually back to product management. And 20 years after being an analyst, I somehow ended up as President and COO of a data catalog company, selling data management tools. One of the things I realized doing that was that the problem not only didn’t go away, it actually got a lot worse in 20 years. When I was a young guy crunching numbers, I was lucky enough to have one data warehouse and one BI tool. Most customers we talk to today, unfortunately, to be competitive and leverage their data, have to get data from multiple databases, SaaS applications, data warehouses, data lakes, multiple clouds. And to make it all worse, they can’t even standardize on a single BI tool. This is a challenge I saw a lot in my old job at Waterline Data, and it led me to want to find a way to just make analytics easy for people, please. Can we make it so that it doesn’t matter what type of data source you have, it doesn’t matter what BI tool you have? Can we actually streamline this process so that you don’t pay a tax just to try and use your data? That’s where I exercise the product management background in me, as well as the go-to-market side, in terms of figuring out product-market fit and how you actually deliver a product that hasn’t been built before. Because the old way was simply creating more of the same problems over and over again.

Eric Dodds 4:55
Hmm, definitely. Okay. I have so many questions, but I have to ask one first. I snuck around on your LinkedIn and noticed that early in your career, you were an analyst at the Federal Reserve. So I just want to know: what did you work on? What types of problems were you trying to solve? And did you discover anything in that role that really interested you or surprised you?

Kaycee Lai 5:23
I’m not sure I’m at liberty to say. I’m kidding. It wasn’t that exciting, trust me. When you work for the Fed, you actually do a lot of your macroeconomic analysis by looking at housing trends, big trends, stuff like that, and also looking at things that were affecting the banking landscape, like things that were driving M&A and regulations, and how some of those regulations monitor and enforce monetary policy. I would say that was the day job. I was in the statistics department, so I was doing a lot of number crunching. Believe it or not, in my spare time I realized somebody should actually build a database of all the different M&A activity that was happening, and I actually found time to do that. That’s where I really got interested in this whole space.

Eric Dodds 6:18
No way.

Kaycee Lai 6:19
Yeah, I know. When you work for the government, you actually have a lot of time.

Eric Dodds 6:22
It’s not like that as the founder of a startup, I’m sure. Thank you for entertaining me. Okay, I want to dig into a question. You’ve said analytics is hard, and people experience that in so many different ways, right? On one end, and I’ll use a marketing example because I’m a marketer by trade, it’s like, okay, I’m just trying to get these events into Google Analytics and get my Google Analytics accurate. That’s painful, right? But then on the other end of the spectrum, it’s like, okay, I have legacy systems, I have new systems, I have multiple lines of business, I have all this sort of stuff, and it’s really fragmented. Could you help us understand, from your perspective, why is analytics hard? And I agree with you, it seems crazy that it’s still hard today, because the tooling has gotten way better in many ways. But it is still hard, for sure.

Kaycee Lai 6:23
I look at it in a couple different ways. One way I look at it is that the analytics landscape changed a lot. Everyone first just put everything in their databases, and then they said, hey, don’t do analytics in your database, put it in your data warehouse. And from there we had new paradigms, right? Data lakes, Hadoop, then the cloud, and so forth. And that’s okay. Those are shifts I feel like we could deal with. The thing that made it worse, in my opinion, is the vendors, these damn vendors who made bad data management tools. If you look, it’s kind of crazy: it’s like, hey, I’m only going to do this piece of the whole data management process, and I’m only going to do it for this platform, for this type of data. I don’t know who started that trend, but it became vogue: well, if those guys can only do it for RDBMS, I’m going to do it for HDFS, or I’m going to do it for time series, or I’m going to do it for whatever, or just data on AWS. So I think the challenge has been that a lot of the data management tools only do part of the workflow and only address part of the environment or part of a data type. Now, that may not be so bad if you never, ever change your data infrastructure. If you said, I’m forever on this cloud, this environment, this data warehouse, awesome. The problem is, that never happens. Every day there’s a better, newer data warehouse, data lake, some SaaS application out there, and I’ve never seen a company say, we’re going to keep what we have, but we’re never going to innovate and get something new. And once the business units start consuming analytics, consuming reports, and it starts getting operationalized, good luck telling someone, hey, we’re going to shut that down because we bought something new. That never happens, right? You end up keeping it, and then you say, oh, for the new stuff, I’m moving on to the new platform. So what happens is you end up having to support the legacy data management tools on top of the legacy analytics infrastructure and so forth. And this is where it gets hard. Then, to make it worse, the knowledge of the people doesn’t necessarily go back 30 years, especially with the technology. Ask someone coming out of school today about some legacy technology, and they’ll kind of look at you funny, like, what are you talking about? The tech stack is moving so quickly that I find it’s also very hard for data teams to keep up. And the last part, the nail in the coffin: I think 10, 20, 30 years ago, it was okay to make data-driven decisions once in a while, and that was kind of the norm. But companies have shown us that, hey, if you can be data-driven, like Amazon, like Facebook, like the Googles of the world, you can go out and really kick butt, do really, really well. So companies now realize, I have to be data-driven. And you look at the pandemic, that actually taught us this lesson: if you can’t make a decision in three months, you may not be around in three months. Yeah. So it’s forced people into this mad rush of, oh my gosh, I have to somehow make it work with all this legacy, disparate stuff and the knowledge gap that I have. That, in my opinion, is one of the leading factors in why things are as hard as they are today.

Kostas Pardalis 10:59
When we started the conversation, you used the term “data management” a few times.

Kaycee Lai 11:05
That’s old.

Kostas Pardalis 11:09
We said there’s also wisdom there, so I would say you’re wise. But can I ask you to do something that can be hard sometimes: define what data management is, based on your experience so far? Because I feel like one of the, not exactly problems, but one of the very interesting things about this industry is that we use a lot of different terms, and the semantics are not very clear. Everyone uses them in a slightly different way.

Kaycee Lai 11:39
Drives me absolutely nuts. I know, we were talking about that. How would I define data management? Okay, my high-level definition of data management is all the stuff you do after the data lands in the database, the warehouse, the data lake, after the write has been committed. It’s all the stuff that you do to get the insights. That’s my rough overview definition of data management. So it is the ETL process, it is the data cataloging process, it is the prep, it is the modeling, it is the query optimization, query federation, the SQL process, and even, to some extent, the visualization process. But I would say that when people have a negative reaction to the words data management, when they’re scared or have a strong visceral reaction to them, it’s because they’re reliving some traumatic experience they’ve had going through those processes I just talked about.

Kostas Pardalis 12:45
Is there a minimum set of activities that every company needs to have? I would assume that, okay, you want to consume data in some kind of visualization tool, and data is being written to a database system. Based on your experience, what would you name as the minimum viable data stack, let’s say, and how does it look?

Kaycee Lai 12:46
Yeah, well, I’m going to start at the two ends. The first end is basically where the data originates. So that’s SaaS applications, RDBMSs, the data sources, and I would even include the data warehouse there. I know, yes, you pull data and put it in there, but for the purpose of analytics, think of data sources as anything that stores, houses, or generates data. So I put that at one end. At the other end of the stack is your typical BI tool, visualization. But even that’s evolved, right? I would say in the last few years we’ve gone beyond that; the dashboard isn’t enough anymore. People want narration, people want storytelling. There’s this trend of, hey, maybe I don’t want to have to go look at a dashboard every single time and figure out what’s going on. Maybe I want the tool to tell me what I should care about. So let’s call that the insight part for the moment. For the most part, every organization has figured those two out: they’ve figured out where the data is coming from and where it’s stored, and they’ve figured out how to visualize it. Anything in between, I think that’s where it gets messy. I’ve seen everything from data scientists doing crazy Python extractions, or R and Scala, to cobbled-up bespoke solutions that may have seen three CIOs come and go over 20 years. I think the best practice has been to put in different data management tools: you might have a data catalog for discovery and governance, you might have a tool to do prep and modeling, and obviously an ETL or pipelining tool to get the data into a data warehouse or somewhere else. And from there, we’ve also seen things like data virtualization technologies as well. So for the most part, I would classify it as: you have your discovery and governance layer, you have your prep and modeling layer, and you have your access layer. Under access is where I would lump ETL, moving the data, pipelining the data, as well as querying the data; I would lump virtualization in there as well. So I always say these are the three broad categories in the middle, between the data sources and the BI, visualization, and analytics tools.

Kostas Pardalis 15:46
That’s very interesting. Okay, I have a question about another term. You mentioned data virtualization. What’s that?

Kaycee Lai 15:57
Wow, how much time do we have? That term is so misused. The storage guys have a definition for virtualization, I’m sure the VMware guys have a different definition, and then there’s data virtualization in the data management space. It’s been around for a while; Cisco had a version of it a while back, and popular ones now are Dremio and, obviously, Starburst. So I’ll talk about the more recent, and more relevant to analytics, definition. And that is really the ability to use a layer that gives you virtual access to the data sources, where you don’t have to do the ETL first, you don’t have to load the data first, before you can query it, and being able to also do federated queries. Because if you look at something like Starburst, you’ve actually abstracted the SQL query execution engine away from the underlying data sources. Once you do that, you can push a lot of the operations, like joins, aggregations, and so forth, away from the underlying data sources, and you can do parallel processing of the SQL execution, so you can get better performance. But it also means it is now possible to run a query joining data from multiple sources, which before we would never think about, right? We would say, oh my gosh, no way, I have to do the ETL, transform everything, land it in one single data warehouse, and do it there. So when I say data virtualization, I’m really talking about the more recent incarnations of data virtualization, à la Dremio, à la Starburst, those kinds of technologies.
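To make the federated-query idea concrete, here is a minimal sketch using the open-source `trino` Python client, one engine in the family Kaycee mentions. Everything named here, the cluster host, the catalogs, and the tables, is a hypothetical stand-in for your own setup.

```python
# A minimal sketch of a federated query through a virtualization layer,
# using the open-source `trino` Python client (pip install trino).
# Assumes a Trino cluster with `postgresql` and `hive` catalogs already
# configured; the host, user, and table names are all hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One statement joins an on-prem Postgres table with a data-lake table.
# No ETL, no pre-loading: the engine federates the query and pushes
# work down to each source where it can.
cur.execute("""
    SELECT o.region, count(*) AS num_orders, sum(e.revenue) AS revenue
    FROM postgresql.public.orders AS o
    JOIN hive.analytics.events AS e ON o.order_id = e.order_id
    WHERE o.created_at > DATE '2022-01-01'
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```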

Kostas Pardalis 17:34
And I’m assuming a data engineer who’s listening might wonder: is their role disappearing? I hear you describing, let’s say, this data stack, and you mention both virtualization and ETL. But when I hear you describing virtualization, it makes me feel like we don’t really need ETL anymore, right? If we have, let’s say, ideal virtualization. So how do you see the relationship between the two? And I want to ask you to give me an answer that’s as pragmatic as possible. Do you think ETL is on its way out, or not?

Kaycee Lai 18:30
I think my thinking has actually evolved. I would say, two years ago, if you had a video of me somewhere, I was probably out there protesting with a sign that said no ETL. No, no ETL: we’ve got data virtualization, networks are fast, CPU and memory on servers are getting better, we don’t need it. Well, I have to say, in the last year or so, I’ve had to change my mind, and it comes from practical experience with customers. So I’ll tell you what I mean. I think data virtualization is fantastic when you’re exploring, when it’s ad hoc, when you have an idea you’re not sure about yet. What data virtualization gives you is a quick way to validate: is this the data you’re looking for, is it going to answer your question, without waiting for the complex ETL, without waiting for the data to be loaded? And that’s awesome. But the harsh reality is that physics exists. I love my brothers at Starburst and all the data virtualization folks, but when you get into a customer environment and they say, hey, I’ve got 12 billion rows across these two tables, one’s in the cloud and one’s an on-prem Postgres database, and I need you to do this join, this query, and I don’t want it to take more than a minute? Physics, man. No matter how many nodes you add, how much memory, there’s still that extra hop you’re going to take. So where I’ve changed my thinking, from what I’ve seen in customer deployments, is that data virtualization is a good way to start, but compared to that, when you’re talking about operational pipelines, operational jobs, operational analytics, and you have SLAs to meet in terms of time, on big data sets you’re not going to win versus a dedicated query. Especially if it’s against a data warehouse that’s really tricked out, where the data engineers know exactly how to tune it for performance. So this is where I do see them coexisting. I don’t see one as a replacement. The best practice I always tell our customers is: use the data virtualization to make sure this is what you want, very quickly. And then, if this is what you want, build the pipeline. Now you have full confidence that the pipeline is actually going to deliver exactly what you want, which is a lot better than before, when you’d trial-and-error with the ETL, the complex pipeline, only to have it break multiple times before you figured out, yeah, this is the one you want. So it’s actually a nice marriage, and I think that’s a good way to combine the two technologies and get the best of both worlds.
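One way to read that “virtualize first, then build the pipeline” advice as code: validate the join with a cheap, row-limited federated query, and only once it returns what you expect, materialize it as a dedicated table. A hedged sketch; the cursor, query, and table names are hypothetical, and this is not a description of any particular vendor’s workflow.

```python
# Sketch of the "virtualize to validate, then build the pipeline"
# workflow described above. `federated_cursor` is assumed to be a
# DB-API cursor against a federation engine such as Trino; the query,
# catalog, and table names are hypothetical.

CANDIDATE_QUERY = """
SELECT o.region, e.campaign, sum(e.revenue) AS revenue
FROM postgresql.public.orders AS o
JOIN hive.analytics.events AS e ON o.order_id = e.order_id
GROUP BY o.region, e.campaign
"""

def validate_then_materialize(federated_cursor):
    # Step 1: ad hoc exploration. LIMIT keeps the federated query cheap
    # while we check that the join produces the shape we expect.
    federated_cursor.execute(
        f"SELECT * FROM ({CANDIDATE_QUERY}) AS q LIMIT 100"
    )
    sample = federated_cursor.fetchall()
    if not sample:
        raise ValueError("join returned no rows; fix it before building a pipeline")

    # Step 2: only after the sample is signed off do we pay the cost of
    # a dedicated table (CTAS into a warehouse catalog), so operational
    # queries can meet their SLA without the federation hop. In real
    # life this step would usually become a scheduled pipeline job.
    federated_cursor.execute(
        "CREATE TABLE warehouse.analytics.revenue_by_region_campaign AS "
        + CANDIDATE_QUERY
    )
```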

Kostas Pardalis 21:21
Yeah, makes sense. Makes a lot of sense. Where do you think a company should start? Let’s say you have a medium-size or even small company, at the point where they want to start implementing some kind of data initiative, a data warehouse and all that stuff. Where should they invest first? Is it ETL, is it virtualization, or both?

Kaycee Lai 21:45
Yeah, and this is where I’m going to be a little controversial, because what I’m going to say, you’re never going to find in any book or manual you read out there. Every book, manual, or consultant you talk to is going to say: start with the data warehouse or data lake, move everything into one single place so you can find everything. That’s what everyone is going to tell you, and that is the current, quote unquote, conventional wisdom. The problem with that approach, and we’ve seen it time and time again, is two problems. One, take it from an old infrastructure guy: moving data sucks. It’s hard, it’s complex, things break, so you’re going to have a long project that probably won’t even finish on time. And two, after you actually build the data warehouse or data lake, take it from an ex-catalog guy: when you’ve just moved everything in there, you’re probably not going to find 80% of that stuff. So your users are now really mad at you, because for the next 18 months they can’t leverage the data. This is where conventional wisdom breaks. Look, it was relevant 20 years ago, when you didn’t have that many different data sources or that much data. But when you now have millions, billions, tens of billions of rows in tables across multiple data sources being piped into a single place, this is just the problem you’re going to run into. So my suggestion is, don’t start with that. Do that last. Actually start, if you can, with a data discovery process. By data discovery process I mean, some people use a catalog, but it doesn’t have to be one: a way to know what your data assets are. And when I say data assets, I mean the whole gamut: tables, views, and queries. Start with knowing what you have and where it is, and then start with knowing what people are actually using, so you have a way to prioritize. Because a lot of people, when they think about doing these data lake or data warehouse migrations, think they have to move everything. And I can tell you, nobody uses every single table, nobody actually uses every single query. In fact, most people have a lot of orphaned queries, stale queries, failed ETL jobs, and stale tables. Start with the ones people actually care about and are actually using, and use that as a basis to say, okay, this is what we want, let’s really make sure we know how to handle it from a discovery, governance, and performance perspective. Once you do that, and you know people are going to use it, then building your data lake or data warehouse first with that set of data is going to give you the best experience, the fastest experience, of getting that data lake or data warehouse up and running. And your customers are actually really happy with you, because they’re not waiting 18 months for you to tell them, okay, now you can work with it. So I would say: start with that discovery process, to rationalize what you have, where it is, and what people are using, what the most popular things are. Then, like I said, data virtualization is great for validating. And then having a data warehouse or data lake for that best possible performance is the next step. Those are the three steps I would recommend.
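The “migrate what people actually use” step can be approximated by mining query history before any data moves. Here is a minimal sketch under that assumption; the log format is hypothetical and the parsing is deliberately crude.

```python
# Sketch of usage-based prioritization for a migration: count how often
# each table shows up in historical query logs, and migrate the most
# used tables first. The log format and queries are hypothetical, and
# the FROM/JOIN regex is deliberately naive; a real implementation
# would use a SQL parser and the warehouse's query-history views.
import re
from collections import Counter

TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def rank_tables_by_usage(query_log):
    usage = Counter()
    for sql in query_log:
        usage.update(t.lower() for t in TABLE_REF.findall(sql))
    return usage.most_common()

query_log = [
    "SELECT * FROM sales.orders o JOIN sales.customers c ON o.cid = c.id",
    "SELECT count(*) FROM sales.orders",
    "SELECT * FROM staging.tmp_2019_backfill",  # orphaned one-off
]
for table, hits in rank_tables_by_usage(query_log):
    print(f"{hits:>3}  {table}")
# sales.orders migrates first; staging.tmp_2019_backfill may never
# need to move at all.
```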

Eric Dodds 24:52
Question on that, because this is such an interesting topic. You’ve talked a lot in the context of: okay, you already have these disaggregated sources, and the go-to conventional wisdom today is just get everything collected into a warehouse or a data lake. Let’s imagine a world where you can start over from scratch, right? Which I know never really happens, but entertain me here. Let’s say you’re starting out and you didn’t have to deal with this legacy integration debt and technical debt from a Frankenstein stack and all that sort of stuff. Would that change the way you approach building out or scaffolding the analytics infrastructure and practice inside a company?

Kaycee Lai 25:44
Temporarily, yes. And why I say temporarily is this: I’ve seen many examples where people say, hey, we’re building it from scratch. The problem is, where does the data come from? There’s only so much you can control, and that’s your internal data. The minute the business starts expanding, we have to take on new partnerships: oh, hey, look, their data is from another source, and they have to pipe that to us. Yep. The minute marketing starts going, hey, I actually need this type of third-party data, I need social media feeds, there’s new data being added. And this is just a cycle that has played itself out over and over again. You can probably get that clean start for your core main app and its data. But the minute someone says, we’ve grown as a company, we’re going to build a new app, that team says, I don’t want to be on the same development stack as yours. No way, man, I want independence to do my own thing. So what happens is, it’s a temporary solution where you get to say, I’m going to redesign from scratch. Eventually you get into the world of, oh my gosh, I do have data in multiple places. Sure, they might be newer systems. You could say, hey, all my data is now on cloud data platforms and data warehouses. But they’re still separate formats, still separate APIs to connect to. And if you try to do cross-source analytics, you’re still going to run into problems.

Eric Dodds 27:13
For sure.

Kaycee Lai 27:15
It’s temporary. You could be living in bliss for a little bit, but eventually you’ve got to pay the piper, man.

Eric Dodds 27:21
Yeah, it’s like you start a new job and your calendar is empty and your inbox is empty and you’re like, wow, I have so much time just to work on stuff, and then the train is off the tracks.

We’ve talked a ton about the problem, and this has been super helpful. How do you solve some or all of those types of issues with Promethium? How does the product do it? What’s your approach?

Kaycee Lai 27:52
Yeah, well, number one, I would say: whatever you think you know, there’s nothing more humbling than actually going out to customers and getting kicked in the face. And we’ve had the luxury of getting kicked many times. I used to be much better looking.

Eric Dodds 28:08
You have some aggressive customers.

Kaycee Lai 28:12
Have you worked with data engineers? No, I’m kidding. No, no, it’s all good. What you think you can do and the real world are always very different, right? One of the things we realized very early on was, you’ve got to connect to every data source out there, for the most part. You can just assume every customer you walk into that has this problem probably has a smattering of relational data sources, data lakes, data warehouses, cloud, Hadoop, you name it, times two. Just assume that’s going to happen. So right off the bat, that means you have to know how to connect to everything very quickly. And when I say connect, I actually mean being able to figure out what they have and being able to show people what they have very quickly. The old way of, well, I’m going to load everything into me, I’m going to connect and basically copy everything? That’s a horrible idea. Number one, you’re now creating yet another data silo. Number two, the performance impact of actually going and scanning every system to do that is god-awful. Some of the earlier versions of the catalogs went through that problem, and I can tell you, a lot of times they would take six months just to finish scanning, by which time you’re now behind by six months. At Promethium, we’ve actually figured out how to connect and, very quickly, within minutes, give you a logical view of every table, every query, every view. Then you’ve got to figure out how to deal with enterprises where not everyone’s a good citizen: not everyone puts all their data in the data warehouse, so you have to figure out how to get things from, like, Git repos, I kid you not, and be able to do the same thing. So just the ability to connect and give you this live view is one thing Promethium can do, literally in minutes. Literally, one day a customer told us: you showed me in 15 minutes what my existing legacy data catalog took a year and a half to show. So, get that global visibility very quickly. Then the next thing you need to do is help people understand the meaning behind the data and what it can be used for. I think this is where a lot of data catalogs have drawbacks: yeah, they can tell you the metadata information and so forth, but is that really that helpful if I’m trying to know, can I use this table to answer a specific question? Or is it more helpful if I tell you, this table has been used to answer these five questions that are actually very similar to the one you’re asking? That ability to extract context, how the data is actually being used, is super important. And then the last part, which I think is even more important, is the ability to actually let you use the data. A lot of the metadata tools are metadata-only, or if they do have some, quote unquote, preview, it’s a very small subset, and you have to move data to preview it. So Promethium had to figure out a very lightweight way of actually seeing what’s inside, a preview, but then, as and when you need it, actually letting you work with the data: letting you join, letting you build queries, very quickly, on the fly. And that’s a whole different experience. Because before, what people were used to was: I found it, I need to use it, let me go call someone else, and let’s hope they have access to it.
And let’s hope they can get it for me, and let’s hope it validates. Or: I can’t find it, it hasn’t been built, crap, I can’t do it, let me go get someone else to do it. Being able to actually take you all the way through is where Promethium shines. And then the part I think a lot of folks overlook is that you have to make this into a seamless workflow. Because today, these are all separate processes, potentially done by separate people using separate tools. I don’t naturally assume that, from what I find in my catalog, I can instantly build an on-the-fly query and virtualize it. That doesn’t normally exist today. So how do you make that not just easy, clean, and intuitive, but also performant? You’ve got to make sure it performs fast. For us, the goal has always been: get to the answer in three minutes. Try as hard as you can to have a single, easy workflow, but get to the answer in three minutes. And what we found is, with analytics, whatever question you think you’re going to answer, it’s actually not all you’re going to ask. When I was an analyst, most of the time, once you looked at the data, you’d go, ooh, I hadn’t thought about that, or, wait a minute, there’s something else, or, wow, this is wrong. The faster you can get to those points of iteration, as I call them, the better. The longer it takes, the hairier things get; it becomes, wow, maybe I can convince my boss to just accept this, let’s use the data from three years ago, right? So those are the things Promethium can help with: not only giving you that fast connection and understanding, but actually letting you work with it, all in a single platform, with end-to-end collaboration between the business person and the data team. In my years as a data analyst, I had no idea what the marketing guy was really going to do with the data. And as the marketing guy, you have no idea how to do the gnarly extractions, pulling the data from different data sources. So why make you wait until someone finishes and hands it over, only for Eric to say, hey man, this isn’t what I meant? Why not have them collaborate in real time, together? That’s the new era of collaborative analytics that we can bring to the table.
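The “connect and show what you have without copying anything” approach boils down to reading each source’s metadata instead of scanning its rows. Here is a minimal sketch of that general idea using SQLAlchemy’s inspector; it is not Promethium’s implementation, which the episode doesn’t detail, and the connection URLs are made up.

```python
# Sketch of a metadata-only "logical view": read table and column names
# from each engine's own catalog via SQLAlchemy's inspector, without
# scanning or copying any rows. This illustrates the general approach,
# not Promethium's internals; the connection URLs are hypothetical and
# each one needs its dialect package installed (psycopg2,
# snowflake-sqlalchemy, etc.).
from sqlalchemy import create_engine, inspect

SOURCES = {
    "orders_db": "postgresql://analyst@pg.example.internal/orders",
    "warehouse": "snowflake://analyst@acme_account/analytics",
}

def harvest_metadata(sources):
    logical_view = {}
    for name, url in sources.items():
        inspector = inspect(create_engine(url))
        logical_view[name] = {
            table: [col["name"] for col in inspector.get_columns(table)]
            for table in inspector.get_table_names()
        }
    return logical_view  # {source: {table: [column, ...]}}
```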

Eric Dodds 34:09
The dirty secret is that the marketing person may not know at all what they want to do anyways.

Kaycee Lai 34:20
I won’t tell if you won’t tell.

Eric Dodds 34:23
I know Kostas has a bunch of questions, but one specific question from me before I hand over the mic. You mentioned giving context to the data. So, show me everything that I have, which is really useful. I mean, goodness gracious, even in small companies like ours, there are already nooks and crannies in the warehouse, so that’s helpful. And then you talked about: this table has been used to answer five other questions like this one. On the surface, it feels like there’s a very high level of subjectivity and context there. How did you do that? I mean, are you sort of diffing SQL queries that have been run on the table? How does that even work?

Kaycee Lai 35:11
How much time do you have? This is actually part of the secret sauce, and we actually have a patent on it. We figured it out a couple of ways. One is, how do you actually figure out the semantics and the context of a query or a table? How do you figure out the relationships it has with other tables and other queries? It’s almost like understanding it from a graph perspective, a graph database perspective, to understand the multiple relationships that can exist between multiple objects. And the object could be a table, it could be a query, it could be a view, it could be a tag, it could be a BI tool. Figuring out how all these semantic objects map to each other is hard, but it’s very useful. That’s number one. Number two is also taking advantage of crowdsourcing: knowing what people have rated and reviewed, frequency of access, those types of metrics come into play. One of the things I learned early on is that you very rarely rely on one metric to determine viability or relevance. Oftentimes in an organization we look for multiple data points: has this been used by someone I trust? Who actually uses it? Frequency: when was it last used, how often is it used? That gives people a level of comfort. And crowdsourcing: four stars, thumbs up, thumbs down; believe it or not, that gives people comfort. Then some people want to go a little deeper; they want to look at lineage: tell me where it actually came from, show me what transformations actually happened, and prove it, so I can get a sense of comfort with how it was actually built. So with Promethium, we realized, number one, every organization has multiple things they look at, and it’s never just one, unfortunately. But what we found is that everyone probably uses the same six or seven things; they might just weigh them differently. And so we’ve figured out not only how to get those things, but how to create an algorithm to rank based upon those six or seven things in terms of relevance, and then have it tuned. It’s kind of like our own little PageRank, if you will, that determines relevance, and there’s a scoring behind it. If you don’t like something, you vote it down, or you don’t access it. And so it’s always live. This is where customers have actually started seeing a lot of value, because it’s not static. The problem with data catalogs and data governance tools is that they’re kind of static: it’s what someone says, or it’s a profiled data set. But if you don’t actually know how it’s being used, not just the data set itself but also parts of it and so forth, you don’t really get a complete picture. This is what we’ve been able to do. So you don’t realize it, but as you’re using Promethium to answer questions and very quickly build things, you’re actually contributing to the governance as well, because you’re contributing to one of those factors in the scoring.
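Kaycee keeps the actual algorithm proprietary, but the shape he describes, a handful of usage and trust signals combined with per-organization weights into a live score, can be sketched generically. The signal names and weights below are entirely hypothetical.

```python
# Generic sketch of the multi-signal relevance scoring described above:
# a few usage and trust signals, normalized and combined with
# per-organization weights. The signals and weights are hypothetical;
# the real, patented ranking is not described in the episode.
from dataclasses import dataclass

@dataclass
class AssetSignals:
    query_frequency: float   # queries in the last 30 days, normalized 0..1
    recency: float           # 1.0 = used today, 0.0 = never used
    avg_rating: float        # crowd rating, normalized 0..1
    lineage_verified: float  # 1.0 if transformations are traceable
    trusted_users: float     # share of uses by "trusted" accounts

DEFAULT_WEIGHTS = {
    "query_frequency": 0.30,
    "recency": 0.20,
    "avg_rating": 0.20,
    "lineage_verified": 0.15,
    "trusted_users": 0.15,
}

def relevance_score(signals: AssetSignals, weights=DEFAULT_WEIGHTS) -> float:
    # Weighted sum; because ratings, thumbs, and access counts feed the
    # signals, every use of the system updates the score ("always live").
    return sum(getattr(signals, key) * w for key, w in weights.items())

print(relevance_score(AssetSignals(0.9, 0.8, 0.75, 1.0, 0.6)))
```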

Eric Dodds 38:25
Fascinating.

Kostas Pardalis 38:27
That’s very interesting, Kaycee. I don’t remember us having talked in this depth before about data catalogs, so I’m going to ask you a bit more about them. Because I understand that the cataloging process is quite important in general, but it also has quite a central role in the product itself, if I understand correctly. So can you give us a little bit of background about cataloging, zooming in on it? As you said, you worked with catalogs in the past, you had high expectations at some point in your career, and you got hurt by them; you probably have some kind of trauma. How did it start? How would you like it to be, with the innovation from Promethium? And, if possible, talk a little bit about the future.

Kaycee Lai 39:26
Yeah, that’s a good question. Catalogs started decades ago as kind of just a way for DBAs to be able to annotate things and find things. You’ve heard terms like data dictionary: just put in the little things that help me understand what this term actually means, what this column actually means. Then people started adding in things like lineage, to really understand, because as things move from the sources into the data warehouse, transformations can take place, and you want to understand, hey, how did the transformation happen? So lineage started becoming a big thing. And then you have things like data quality scores, etc., that let people rank the trustworthiness of the data and so forth. So I would say they all started with a very heavy governance influence; that’s the background most data catalogs have. The one thing most catalogs have in common is really search. When we ask people, why do you want one, the number one reason is always search. And I will tell you that once someone has actually bought and implemented a catalog, if you ask them, hey, which features do you actually use? Search and tagging is like 80 to 90% of it. The rest, not so much, and the reason is that the rest either doesn’t work that well or is hard to implement. For the most part, I would say you get a catalog to, number one, find where things are and where they come from, and to have a place to put some sort of information that helps you assess whether this is good or bad data. That’s the high-level view of how catalogs usually work. And what we find is that recently, because, like I said, the problem is multiple data sources, multiple data types, and so forth, people are asking more from their catalog: I want to profile it, I want to see cardinality, I want to see statistics, and so forth. And you can do that. But where the catalog stopped, unfortunately, was that it could only find you things that are already there, meaning good data sets you can use, or find you things that are wrong with a table, which you probably wouldn’t know what to do with. A catalog never let you actually build; it never let you actually experiment with the data, other than saying, hey, I found it, and Kostas said this is good, Eric said this is bad, it has four stars, and it comes from this source, and so forth. What happens is that the user has to make a lot of interpretations before they even know whether they can actually use it. So if you look at most usage, most catalogs are being used by a data governance team, really to, quote unquote, manage whether or not something should be used, whether it poses a risk, who should access it, and so forth. And the reason why is that the next step, actually using it for analytics, requires you to actually work with the data; that’s a separate persona, a separate requirement that a catalog wasn’t built for. Catalogs, to me, are like 80, 90% metadata-only; they don’t do the building part. So they don’t worry about performance, they don’t worry about scale, they don’t worry about whether you can actually answer a question, about query optimization and all that stuff. Because of that, take away the marketing, and it’s not as useful as you might hope for analytics.
Because if you can’t do all that, how are you going to really help a data analyst or a data engineer determine the data sets and views to answer a question? So that’s a catalog, circa 2017, 2018. Where Promethium has gone is, we realized that the most natively intuitive thing to everyone, regardless of function, regardless of creed, is search. Nobody really needs to teach you how to search; it’s the most natural thing. So we didn’t start out wanting to build a catalog. What we realized was: almost everyone can search, almost everyone is used to the notion of tags, and almost everyone is used to the notion of ratings and reviews. I know four stars means something’s better than two stars; I know what thumbs up and thumbs down mean. So you can leverage that catalog capability as the entry point and build on top of it. Once I find something, or I think these five things are what I want, then add the buildable part to it: figure out a way for people to prep, model, query, and see the results. Then you have a way to very quickly get most people to actually work with the data, as opposed to having to stop post-discovery, go ask someone else, and use another tool. So this is where you’ve seen our data fabric come up. And let me talk about data fabric, data pipeline, data mesh, and data catalog, because they’re actually not the same, but the marketing is so darn confusing; right now I’m seeing a lot of catalog guys say, I’m going to call it a data fabric. Okay. So, the way to think about the fabric, in terms of what it should do and what it needs to have: yes, it needs the ability to connect to multiple data sources. It needs to have catalog-like capabilities, metadata ingestion, metadata governance. But it also needs to have a data modeling and access layer that you can actually use, and then some sort of coordination and orchestration layer saying: this is who uses it, this is how you should use it, this is what you should do next. That’s the broad overall definition of what a data fabric is; it’s Gartner’s definition as well. Now, we’ve taken that and modified it, in the sense that we think the access layer should be both direct and federated, because if you still require people to move data, it’s not going to be a good experience. And we also believe you do need visualization, because for a lot of people, that’s a better way to validate whether this is what you’re looking for or not. I challenge anyone: I can send you a 50-page SQL query, and you tell me this is the data you’re looking for. Maybe a Kostas-type smart guy could do it, but if you gave it to me, I couldn’t, and I wouldn’t do it to my friends either. But I can look at a pie chart and say, yeah, that’s probably looking pretty good, or the narration and the storytelling can tell you the value. So we actually go above and beyond what the standard Gartner definition of a fabric would be. Now, that means it’s doing all these things: what a catalog does, the metadata management and discovery, plus the access layer, plus the prep and the visualization. A data pipeline, by contrast, is just moving the data.
And if you think about the words (my dad was an English teacher, so I probably spend way too much time analyzing words and their meanings), a pipeline has the connotation of steel; it’s rigid. Once I put it in, it’s just going to move. Whereas fabric is loose, it’s flexible. And that matters, because a fabric should be as flexible as a question. If I ask one question and then ask another one, I’m going to iterate and change the question. A fabric lets you change what you’re looking for on the fly, very quickly build what you’re looking for, and have that flexibility. So that’s how we think about the world. And data mesh: our friends at Starburst have done a lot of work on the data mesh. I think the data fabric and the data mesh coexist. I think data mesh is a framework that encompasses a lot of things, and you can have a data fabric inside a data mesh framework. So that’s how I see those things.
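Since search and tagging carry “80 to 90%” of catalog usage in Kaycee’s telling, here is a minimal sketch of that entry point: keyword search over asset names, descriptions, and tags, ordered by a relevance score like the one sketched earlier. The catalog entries are made-up examples.

```python
# Minimal sketch of the catalog "entry point" described above: keyword
# search over asset names, descriptions, and tags, ordered by a
# relevance score. The assets and scores are made-up examples.
def search_catalog(assets, query):
    terms = query.lower().split()
    def matches(asset):
        haystack = " ".join(
            [asset["name"], asset["description"], *asset["tags"]]
        ).lower()
        return all(term in haystack for term in terms)
    return sorted(
        (a for a in assets if matches(a)),
        key=lambda a: a["score"],
        reverse=True,
    )

catalog = [
    {"name": "sales.orders", "description": "All orders since 2015",
     "tags": ["revenue", "certified"], "score": 0.92},
    {"name": "staging.tmp_orders", "description": "one-off backfill",
     "tags": ["orders", "revenue"], "score": 0.11},
]
for asset in search_catalog(catalog, "orders revenue"):
    print(asset["name"])  # the certified table ranks first
```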

Kostas Pardalis 47:27
Yeah, yeah, I agree, and it makes sense. I think we’re still in the process of properly defining all these terms and understanding them.

One last question from me. You mentioned at some point that data catalogs were mainly used by the data governance people in an organization. After this evolution of the data catalog into the data fabric, who are the people who use and consume this tool?

Kaycee Lai 48:00
We’re seeing data analysts and data engineers now actively using the data fabric to automate the building of data sets, automate the building of on-demand SQL queries, and so on. I think the next evolution, the next iteration, is as you layer on no-code, layer on NLP and NLG: a lot of business users. Right now, I would say even the fairly technical business analysts can use the fabric. But the goal, at least the goal I have, is: how do we get the fabric into the hands of even the non-technical people, the folks who just want to ask a question and get an insight? That’s where a lot of work around NLP and NLG and AI and free-text search is really going to come into play and take this to the next level. And that’s the part that makes it very, very interesting, because now the fabric can span the layman citizen business folks, the business analysts, data analysts, data engineers, and even the governance team. When you have that, that’s where things start to make sense, because you can have governance, analytics, BI, all under the same framework. Today that’s a necessity, because these crazy governance rules, GDPR, CCPA, are really hard. They’re really, really hard to be compliant with, and I can’t think of a way to do it if you still live in the siloed world most people have, where everyone says, I only do this, I only do that. It’s going to be a nightmare. So I see the data fabric as: finally, there is a way to actually do this, to drive velocity in decision-making, but to do it in a way that automatically takes care of governance.

Kostas Pardalis 49:49
I feel like we have more to chat about, to be honest, but I know that we are close to time here so I’d like to allow Eric to ask any last questions he might have, so Eric, all yours.

Eric Dodds 50:03
I think we’re at the buzzer. I think Brooks is telling us we have to close it down. Kaycee, whenever we run long, we know it’s a topic that is not only deeply interesting and valuable to us, but also to our audience. Thanks for digging in. Thanks for letting us get a little technical. And thanks for teaching us about not only data catalogs, but also helping to further demystify data mesh, data fabric, and all the other terms that marketers like me proliferate in the industry.

Kaycee Lai 50:31
Am I gonna see a blog from you on the data mesh/data fabric now?

Eric Dodds 50:38
Oh, man, I’d have to dig pretty deep for that one. But cool. Well, thank you again for your time. It’s been great to have you on the show.

Kaycee Lai 50:44
Yeah. Thank you guys for having me. I had an absolute blast. So appreciate it.

Kostas Pardalis 50:48
Thank you. Good to see you. Thank you.

Eric Dodds 50:50
I’m glad Kaycee let me ask him about working as an analyst at the Federal Reserve. I just had to know. And he said it was boring, which I kind of expected, but at least he built a database in his spare time, which is pretty cool. I’m glad he was a good sport. I was really intrigued by what sounds like a system they’ve built at Promethium that learns about the value and context of data over time. It sounds like they even have a patent on it. That was pretty interesting, and it’s a really compelling way to think about the challenge of data governance: as sort of a self-optimizing system. I don’t know if we’ve talked to a guest who’s brought up that approach yet, which is really interesting. So that’s my thing to think about for the week.

Kostas Pardalis 51:40
Oh, yep. I totally agree. I think this time I’m going to think about the same thing as you. It is a bit of a surprise that we haven’t heard more about the ever-changing nature of data and how it affects the things that we do, the products that we build, the infrastructure that we design. I think it’s something very interesting and, intuitively I would say, very important. It probably makes more sense to face this challenge when we’re talking about data cataloging, because the temporal nature of data is more obvious there. But I think it’s something that has a much broader impact on pretty much everything around data and the products built on it. So I think we should both keep it in mind and ask more about it with our guests from now on.

Eric Dodds 52:39
I agree. All right. Well, many more great episodes coming up. Subscribe if you haven’t, and we will catch you on the next show.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.