Episode 110:

How Can Data Discovery Help You Understand Your Data? Featuring Shinji Kim of Select Star

October 26, 2022

This week on The Data Stack Show, Eric and Kostas chat with Shinji Kim, the founder and CEO of Select Star, an automated data discovery tool. During the episode, Shinji defines data discovery and discusses metadata and economies of scale.

Highlights from this week’s conversation include:

  • Shinji’s background and career journey (3:35)
  • Defining “data discovery” (6:03)
  • The best conditions to use Select Star (8:45)
  • Where Select Star fits on the data spectrum (13:38)
  • Why Select Star is needed (17:35)
  • How Select Star uses metadata (21:02)
  • Exposing data queries (27:04)
  • Composing queries into metadata (33:27)
  • Automating BI tools (37:28)
  • Limits to data governance (41:39)
  • Maintaining economies of scale (48:56) 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show, we are going to talk with Shinji from Select Star today. Kostas, this is a really interesting company. They describe it in a really interesting way: they call it data discovery, but they sit in the data governance space. So I'm really interested to ask what data discovery means, and then how that plays into data governance, because, you know, when I think about the term governance, I don't think about terminology like discovery. So that's, you know, not quite a juxtaposition of terms, but an interesting set of terms to describe the product. So that's what I'm going to ask about. How about you?

Kostas Pardalis 1:12
Yeah, I guess before you govern the data, you have to discover what kind of data you have, so it's almost a prerequisite for governance. Historically, data governance was all built around data catalogs and dictionaries, so I guess there is an overlap there. But anyway, it's a very interesting topic, I think. We see many companies out there, new startups, that one way or another implement a small part of data governance, and they try to create a new category of products out of it, when these are things that, one way or another, are already in use out there in the industry. So what I really appreciate about Select Star is how they are not trying to create a new category, right? They say data governance is what we are doing. And I'd love to hear what the challenges are of building a product like this today, right? With all these different pieces of infrastructure out there, the complex data stacks that customers are using. So I think I'd like to focus a lot on the solution itself, and what kind of challenges they are facing while building the product.

Eric Dodds 2:53
Absolutely. Let’s dive in and talk with Shinji.

Kostas Pardalis 2:55
Yeah, let’s do it.

Eric Dodds 2:57
Shinji, welcome to The Data Stack Show. We are so excited to chat with you today.

Shinji Kim 3:03
Thanks for having me. Excited to be here.

Eric Dodds 3:05
All right, well, let's start where we always do. Can you tell us about your background, which is fascinating, by the way. You've had, you know, heavy-duty data engineering roles, and you're a successful founder with a previous company, which is really cool. So we want to hear about that. But tell us about your journey, and what led you to starting Select Star?

Shinji Kim 3:31
Sure, yeah, we’re very excited to be here. I have a computer science background worked as a software engineer, data scientist, product manager, and data engineer in the past, primarily, doing sales, forecasting, Model A, or looking at a lot of SQL, or building ETL. pipelines. I will say one thing that I used to do 10 years ago, password, I started a company in 2014, called concor system. We were having a lot of trouble scaring our real-time data pipeline out a mobile ad network that was growing very fast, hitting about 10 billion events, are they, you know, Flink or Spark Streaming didn’t really exist back then built our own solution, and spun it out at the company. And that’s called core systems. So that company, you know, after two years, I sold it to Akamai. Now it’s a product called the IoT Edge connect, where it’s designed to process, you know, billions of data points coming from different IoT devices around the globe, utilizing optimize a CDN edge network. And from that experience at Akamai starting to work with global enterprises, especially in the automotive and consumer electronics space, I’ve noticed that the next frontier of issues that are starting to happen was around utilizing data, utilizing data A company to have collected and have processed and in sitting in the data warehouse, being able to find and understand and hence, being able to actually pull insights and allowances out of data is actually starting to become a lot harder than it shouldn’t be. So this is why I started Select Star, so that we can give up more context around data. So that any data analyst, data scientist, or anyone that has access to data can easily find and utilize data as they need it.

Eric Dodds 5:38
Love it. And I'm gonna steal one of Kostas' classic questions here because he loves to talk definitions. Could you give us your definition? You call Select Star a “data discovery tool.” What's your definition of data discovery? You kind of covered a couple of the pain points there at a high level, but I would love your definition.

Shinji Kim 6:03
Yeah, I mean, to use the term discovery literally, data discovery is all around increasing the discoverability of your data. That means basically being able to find and understand the data sets that you have access to. So, regardless of where you're starting from, even if you may not know what the field or the data set may be called, by just typing a certain keyword, something that's related, you should be able to find the right data sets that you're looking for. And on top of that, whenever you are looking at a data set, sometimes you can try to query the data to see what's inside. But more often, even though you may see the data, it may not be exactly what you're looking for, because it might have been filtered beforehand or aggregated differently than what you expect, and so on and so forth. So truly understanding the data is the other component of, I would say, data discovery. And a lot of that will come from understanding where that data came from, who's using it inside the company, where it's being used, and how it's being used. We call that data context. Data discovery really cannot happen without context. So having a lot of context, providing that context, I would say, is what data discovery platforms do.

Eric Dodds 7:39
Yeah, super helpful. So can we dig into some of the conditions, from a technology standpoint but also maybe from a team or company standpoint, that are big drivers for companies who need a tool like Select Star? I think our listeners probably span a wide range. We probably have some that say, okay, well, data discovery for me is, you know, we're running Snowflake, and I can just scroll through all the tables and have a pretty good sense of what's going on, all the way to companies that have probably a mixture of on-prem and cloud infrastructure across data lakes, warehouses, production databases, and, you know, a pretty complex setup. So could you help describe where on the spectrum you see a lot of your customers in terms of their stack?

Shinji Kim 7:41
Sure. So I would say traditionally, the reason the data catalog and other mechanisms to increase data discoverability were introduced was mainly because companies had so many different sources of data. Just putting them all into one place, as an inventory that you can search through, was primarily the need. But today, that's really not always the case. There are still a lot of different sources of data, but all of that data is arriving at a single location, whether that is a data lake or a data warehouse. So the main issue around data discoverability is not that you don't know where to go. It's that even when you are in Snowflake, even when you have access to all the data, you have to sift through hundreds and thousands of different schemas and database tables in order to locate what you are looking for. And even then, you may not be 100% sure whether this is something that you're truly going to use or not. So there are a few main levers that I see when a company is really in need of data discovery tools. First and foremost is data growth. Once you have a single source of truth, a data warehouse where all of your data is arriving in one place, and you're recognizing that the number of tables and the number of schemas are growing, it starts to become a little bit more cumbersome to just say, well, you can just use this schema, there are only 10 or 20 tables there. As the number of tables grows beyond the hundreds, I think this is one area where you would want to put in some type of data dictionary, some documentation, to start putting together more context around the data sets. The other lever that we see is data team growth. This also happens as data grows, but sometimes you may have just a small data team that manages very large volumes of data sets.
And then from there, as you are trying to bring on new data analysts, data scientists, or data engineers, it just takes a lot more time for them to ramp up than you anticipate, because there's so much tribal knowledge within the data sets. And you may also realize that this tribal knowledge is very hard to transfer without significant effort. You may realize this because, you know, someone decides to leave the company, and you're thinking, oh, shit, how are we going to make sure that all the pipelines and materialized views that they built are going to stay sound even after they leave? So overall training, onboarding, and making sure all the data team members are on the same page, this is one of the goals that customers try to achieve. Last but not least, there are also, I would say, some companies that start thinking about data discoverability because it gets triggered by data governance. In data governance initiatives, you may need to understand the actual data flow and how the data passes through different parts of the stack or different models. You need to start putting together documentation or start tracking where the data is heading, and this is very hard to do manually. So having an automated discovery tool is another reason why companies start looking for solutions like Select Star.

Eric Dodds 12:53
Yeah, super interesting. It's funny, just earlier this week (or maybe at the end of last week) there was a Slack message floating around, and it was a new analyst asking people who had been around for a long time, what is this table? And it's like, okay, you've got to find the people who've been around the longest in the company, who've been close to the stack and the data and the reporting. So even though we're not a huge company, we see that every day.

One more question for me before I hand it over to Kostas. You mentioned data governance. One of the funny things about the data industry is that there is all sorts of terminology, right? You have data governance, data observability, data quality, etc. Where do you see Select Star fitting into that spectrum? Because discoverability isn't necessarily, you know, it's related to quality, it's related to governance, there are some components there. But I would just love to know where it is you see Select Star on that spectrum.

Shinji Kim 14:14
Yeah, I think that’s a great question. So we see ourselves as more of a horizontal platform that thinks and interacts with different stacks of data, whether that is on the theta, like generation side or to the data storage side or to BI side. So we’re the transformation side, we, you know, want to plug it in and bring out all the metadata together in one place, and also give you a more of a comprehensive analytics of that metadata, so that you have a one place that you can find out what is happening and how different data offsets that are related to another regardless of the tool that you are looking And so we see data discovery as more of like a capability that supports data democratization, because this allows everyone to be able to easily find and understand data. Suddenly data governance, so governance regarding its you want to, and the main difference that I see data governance and data, which both goes hand in hand, a lot of the metadata management that really is around being able to collect all the metadata housed in one place, having a unified metadata model that you can collect and utilize, and that including operational metadata, as well as like the catalog type of metadata. Governance is really kind of like a layer on top of that, that allows you to add ownership policies, and ways to like put a taxonomy on top of that. So we see discovery as a supporting capability for all those use cases. Because once you have this really good amount of discoverability, which is backed by the auto-generated context that we reveal, whether that including like, they don’t really age and popularity who’s using more top users, and a different quality during that ran, or like entity relationships that we see, like these are the components that you can use to drive the policies to define which ones that you want to measure the quality for, and to also keep the access to whom. So that’s kind of how we see data discovery playing in those use cases.

Eric Dodds 16:45
Super helpful. All right, Kostas, I have been monopolizing. Please take the mic.

Kostas Pardalis 16:51
Thank you. Thank you so much. All right, so Shinji, you were going through how things work when you were talking with Eric, and you were saying that all the data has to be collected into one place, typically the data warehouse. And on top of the data warehouse, you can then start creating the data catalog and doing data discovery. I'm wondering, once all the data has landed in that warehouse, why do we need another service on top of the catalog that the data warehouse already has in order to understand what kind of data we have there and what we can do with this data? Why is the database itself not enough?

Shinji Kim 17:34
Sure. The catalog that the database has, what we call the information schema, is really just the structure of the physical metadata, and that is pretty much where we are starting from as well. But what a data discovery platform does on top of that is way more than just showing the metadata. We will go through all the SQL query history, all the activity logs, and also the connections that other tools may have with that data warehouse, to try to piece together where the data is flowing and how that data is being used. So really, the one thing that we do at Select Star is parsing through the SQL query history and then piecing that together into different data models that everyone can consume, like column-level lineage. Regardless of where you're starting from, you can see all the upstream sources of data as well as the downstream effect that one metadata change may have. You can see how many people were utilizing this table or column in the last 90 days, how often it has been used, and which dashboards or other tables or queries it is flowing into, so you can see the impact, like where this data is going. And going further, we are now showing more of an ER diagram: data warehouses and data lakes have long lost properties like primary keys and foreign keys, so those can be detected, or used if they are already in there. That, plus any of the joins that we recognize, can be put together as more of a data model, so that you have a visual of how different tables are connected to each other. These are all parts that I would say go beyond the core features of the databases today. And I think data discovery becomes a lot more powerful as you're starting to connect the different tools that you're using on top of the data warehouse.
So being able to see, for your Looker dashboard, which upstream tables it's using, or which LookML views these things are coming from. And vice versa: by changing this production raw table or this specific field, how many dashboards may crash in your Tableau server? I think that is when it becomes a lot more interesting as a data discovery platform.
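The impact analysis Shinji describes, one change upstream, many broken assets downstream, is essentially a traversal of a column-level lineage graph. A minimal sketch, with invented table, column, and dashboard names (not Select Star's actual model):

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges: (source, target) pairs meaning
# "target is derived from source". All names are invented for illustration.
EDGES = [
    ("raw.orders.amount", "analytics.daily_revenue.total"),
    ("analytics.daily_revenue.total", "looker.revenue_dashboard.kpi"),
    ("raw.orders.amount", "analytics.order_stats.avg_amount"),
]

def downstream_impact(column: str, edges) -> set:
    """Return every asset transitively derived from `column` (BFS)."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([column])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Changing raw.orders.amount affects both aggregates and the dashboard.
print(sorted(downstream_impact("raw.orders.amount", EDGES)))
```

Running the query upstream (reversing the edges) gives the lineage view in the other direction: all sources a dashboard depends on.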

Kostas Pardalis 20:27
All right, that’s awesome. You must have a couple of different like sources of, say, information or metadata, but it’s used to create, let’s say, the user experience around so let’s start. The first one you mentioned was like, the metadata itself, that’s coming from data warehouse, or the database system. Can you tell us a little bit more about that, like when you say metadata from the data warehouse? What kind of metadata we are talking about? And how do you use them?

Shinji Kim 21:01
Yeah, so at the base level, the “metadata catalog” includes the names of all the tables, columns, schemas, and databases, and the comments on them. This gives us the structure, like which table belongs where, things like that. Then there's the operational metadata that we collect: when was this table created, when was it last updated, what is the DDL or DML related to it, who queried it last. Some databases also provide the row count, how big the table is, things like that. Those are the things that we utilize to try to give a snapshot of what that table looks like and what the current state of the table is today. So that's what we consider the core metadata. Then there is the aspect around, I think I mentioned the table and column comments, the description side. We usually consider descriptions part of the metadata because they are already baked into the database itself. And on top of that, there are the logs that we collect. The logs I wouldn't call fully metadata, but they are something that we can parse through to generate metadata, because I define metadata as data that describes a data set. So the query logs will tell us who's querying, how long the queries take, which tables and columns are used in that query, and whether a column is being queried directly, or, whenever we are looking at the actual transformation, whether it is being transformed or aggregated.
These are all details we extract to get better information about the query, so that we can analyze it all under the same umbrella and make what we call our table page or column page, our data asset page, much richer, if that makes sense. And then, last but not least, there's user metadata as well. It's more lightweight, but it's more like which users, how are they logging in, when's the last time this person logged in, things like that.
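The structural slice of metadata Shinji describes (table names, columns, DDL) is what a database's own catalog exposes. A runnable sketch using SQLite's catalog as a stand-in (Select Star targets warehouses like Snowflake, which expose an information schema instead; the function and its shape are invented for illustration):

```python
import sqlite3

# SQLite's sqlite_master and PRAGMA table_info play the role of a
# warehouse's information schema here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

def structural_metadata(conn) -> dict:
    """Names, columns, and DDL: the 'catalog' slice of metadata."""
    meta = {}
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for name, ddl in tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({name})")]
        meta[name] = {"columns": cols, "ddl": ddl}
    return meta

print(structural_metadata(conn))
```

The operational metadata (last updated, row counts, last queried-by) and the query logs would be separate extraction passes layered on top of this structural snapshot.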

Kostas Pardalis 23:53
Yeah, you actually answered some of my follow-up questions, to be honest, because I wanted to ask about the other sources. And I will. But before we do that, about the metadata that the database or data warehouse is exposing: have you noticed any considerable difference between the different technologies that you're integrating with? Are all the systems out there on par, or do you wish there was something more in terms of the core metadata? We're not talking about query logs and stuff like that right now.

Shinji Kim 24:32
I think the core metadata is primarily similar from one database to another. But the way to retrieve that metadata, and the type of access that we need just to get the metadata, can be drastically different. And this is something that we really care about, because as a metadata platform, by design, we try to carve out or give recommendations to our clients to give us access only to the metadata, but not to their actual data. Every database is slightly different in this way. For example, with Snowflake, we may have access to just their metadata database, the snowflake.account_usage schema. Whereas for something like Oracle or Postgres, or even Redshift, there are very specific types of tables that we would require access to, instead of getting “metadata database” access, because those would generally only be available for admins. So that's where I think things get very tricky. The other part is also how the logs are generated. Some data warehouses will already have the query history in a table, whereas for some other data warehouses, we would need to ask the customer to enable the logging, and enable other types of logging too, and then also have them point their CloudWatch logs to a certain bucket. So there can be more of this integration setup required depending on the database we are working with, which could be a whole other topic.
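The per-warehouse differences Shinji mentions can be made concrete. The system objects below are real (Snowflake's ACCOUNT_USAGE share, Redshift's STL_QUERY), but the exact columns, retention, and required grants vary by platform and version, and these are sketch queries, not Select Star's actual extraction:

```python
# Where query history lives differs per warehouse.
QUERY_HISTORY_SOURCES = {
    "snowflake": "SELECT query_text, user_name, start_time "
                 "FROM snowflake.account_usage.query_history",
    "redshift":  "SELECT querytxt, userid, starttime FROM stl_query",
    # Postgres has no built-in query-history table: statement logging
    # must be enabled server-side and the logs shipped (e.g. CloudWatch).
    "postgres":  None,
}

def history_query(dialect: str):
    """Return the extraction SQL for a dialect, or None if logs must
    be collected out-of-band."""
    return QUERY_HISTORY_SOURCES.get(dialect.lower())
```

This is why the integration setup differs: for Snowflake a metadata-only grant suffices, while for Postgres the customer has to enable and route logging themselves.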

Kostas Pardalis 26:37
Yeah, we’ll get there. We’ll get there. All right. And the next thing that I found, like, quite interesting, you mentioned is that like, there’s a lot of data that’s generated by parsing the queries. My first question is, have you seen any kind of like, resistance from your customers so far as to expose these queries to your service? The reason I’m asking is because you mentioned already that, okay, you’re a middle data platform. Don’t want to live to get access to the data itself. But queries, many times can reveal that say, some of the logs also, like important information that shouldn’t be late, right? Like, it’s always like, a little bit like, tricky. So how do you see that so far? How do you see customers and the companies out there react to that?

Shinji Kim 27:33
Yeah, so a few things. First and foremost, this has always been a concern of mine since I first started Select Star. At my last company, we had an on-prem product, right, where we just deployed into our customers' environments, so we didn't have to worry about this. But at Akamai, as I was building more of a platform-as-a-service product, the provenance of data and security on the data side is something I got a lot more interested in. So with Select Star, from day one, we got our SOC 2 Type II, making sure that everything is treated as confidential customer information, even though it's just the metadata. We first make sure that we are not collecting more data than we need, and anything that we bring into our system we treat with the best enterprise security model that we can. So that's the first thing. The second part is that with logs, it really comes down to what kind of queries you're running. First of all, there are very specific types of queries that we process, mostly around the creation of tables, the qualifying DDL and DML, things like that, but also a lot of SELECT queries. What we allow our customers to do is tag any sensitive fields in Select Star, and if a field is already tagged as PII or sensitive, then from our parsing perspective we will strip out all those values before we fully sync them into Select Star. So that covers anything that you might come across in Select Star, whether that is a query or something else. Because you can look up different queries in Select Star: you can look at how a table was created, or what the popular queries are that people use against this table. You can look up other people's queries by going into their profile pages or team pages.
So when these queries show up, we will strip out the values if the field was already defined as a PII field. That's another way we ensure any of the sensitive data itself does not enter our platform. And the last bit is something we are starting to allow more customers to do, initially just as a trial, and it will be more broadly available in the future: letting them load the metadata themselves, or load the logs themselves, so if they want to strip out any specific parts, or if they give us a certain configuration, like we don't want any of the queries from this user to come through, those are fairly straightforward settings that we can adjust. So that way we basically filter out, or don't touch, any of those queries from the get-go. Does that make sense?
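The value-stripping step Shinji describes can be sketched as a redaction pass over the query text before it is stored. The tag set and the regex below are illustrative only, not Select Star's actual implementation (a real parser would work on the SQL AST rather than regexes):

```python
import re

# Columns the customer has tagged as PII/sensitive (hypothetical names).
PII_COLUMNS = {"email", "ssn", "phone"}

def redact(query: str, pii_columns=PII_COLUMNS) -> str:
    """Mask string/number literals compared against tagged columns,
    e.g.  email = 'a@b.com'  ->  email = '<redacted>'."""
    for col in pii_columns:
        query = re.sub(
            rf"(\b{col}\b\s*=\s*)('[^']*'|\d+)",
            r"\1'<redacted>'",
            query,
            flags=re.IGNORECASE,
        )
    return query

print(redact("SELECT id FROM users WHERE email = 'a@b.com' AND age = 30"))
# SELECT id FROM users WHERE email = '<redacted>' AND age = 30
```

Untagged predicates (like `age` above) pass through unchanged, which matches the idea that only fields already marked sensitive get their values stripped before ingestion.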

Kostas Pardalis 30:54
Oh, absolutely. It's super interesting to see the complexity of building a product around that. Many people don't realize the kinds of processes you need to go through when building, or using, this kind of product. It's a big part of the product itself, the complexity of building such a product. It's not just the metadata that you have to process; it's also all these processes that are there for a good reason, around security, around auditing, and all the stuff where you need to pay attention and make sure that you provide all the functionality that your customers need.

Shinji Kim 31:43
Yeah, yeah. So this is something that we also see: it's not just about stripping out the data. Marking something as sensitive or PII is something that customers can really leverage to signal their sensitive data sets to the rest of the company. And because the data values are not exposed in Select Star whatsoever, even if the end user has access to the data set, they can actually understand that, oh, this is not something that I should easily or freely share with others, right? Where is all our data, where is it being used, and where did it go? Any of this reporting around GDPR and CCPA has been one of the areas that some of our customers are starting to use Select Star for. And it works out both from the usage perspective and from the lineage perspective, because you can follow the trail of where the field ended up, and then get the usage information or audit logs for all of those fields.

Kostas Pardalis 32:55
And okay, about parsing the queries: what kind of information are you looking for there? How do you decompose, let's say, a query into metadata that can be used in the product? What are you doing there? This was probably one of the questions I've wanted to ask since we started, so I'm really happy that I can do it now.

Shinji Kim 33:19
So your question is, what are we actually doing in the parser with the SQL? At a high level, there are a few things we're doing. I mean, we are not trying to, you know, reverse engineer the query. There are many different things that happen before a query gets executed in the database: if you are using dbt, or any templated query from a BI tool, many things may happen. But what we care about is, at the end of the day, how did it execute, and how does that look from the database's perspective? So the first thing we look at is: what are the fields and tables that are actually being selected? That can come through different CTEs and different subqueries, but we will look at how the result of this query is being mapped, and where that source came from. Around that source, we will try to match against the existing data that we have, almost like matching against a dictionary. Sometimes we may not find everything, but as long as it was already loaded into Select Star, we should be able to find it. And from there, as we are looking at the field-level selection, that's when we try to determine the general relationship: is the field being selected as-is, or is it being aggregated or transformed? These are some of the details that we try to build in. There are more things that we plan to do, but that's more or less the extent of how we look at the parsing. The information that our parser exports then goes into our different back-end models, and here's how it gets used.
So any of the SELECT queries will go into more of the usage side: we have a popularity score for every field and every table that this gets added to. If it was a DML or DDL query, then we will add it as information for our lineage: determining that this specific field is the source and that one is the target, and that it's been propagated in this manner. I think that is basically what our core parsing does. Does that answer your question?
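The usage side Shinji mentions, a popularity score fed by parsed SELECT queries, could be computed in many ways. A minimal sketch with an invented scoring scheme (recency-weighted query counts; Select Star's actual scoring is not public):

```python
from datetime import datetime, timedelta

def popularity(query_times, now, half_life_days=30.0):
    """Score an asset by its query history: each query contributes
    weight 1 when fresh, halving every `half_life_days`."""
    score = 0.0
    for t in query_times:
        age_days = (now - t).total_seconds() / 86400
        score += 0.5 ** (age_days / half_life_days)  # exponential decay
    return score

now = datetime(2022, 10, 26)
recent = [now - timedelta(days=1)] * 5    # 5 queries yesterday
stale = [now - timedelta(days=300)] * 5   # 5 queries ~10 months ago

# Same raw count, but the recently-used asset ranks higher.
assert popularity(recent, now) > popularity(stale, now)
```

The design choice here is the decay: a plain count would let a table that was heavily queried a year ago outrank one the team uses every day.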

Kostas Pardalis 36:26
Yeah, yeah. I mean, okay, I could keep discussing this stuff for a long time.

Shinji Kim 36:31
Yeah, we’re going definitely deep here. Yeah, we thought, yeah. And then we have a, I was gonna say, we have like a similar parsing that we use for DBP. Or put them out, like Yamo files, like, you know, and, you know, for ETL, it’s a little bit different, right? So we try to basically follow through the data model each stock has, and map it to, like the manner of the data model that we have of lineage, popularity, Entity Relationship things.

Kostas Pardalis 37:02
Interesting. Okay, let's leave the databases behind for a little bit and go to the BI side, because you also have to interoperate with these tools and pull out information. Database systems obviously have been developed from day one to make information accessible, but BI tools are mainly visualization tools, right? So how far can you automate this process? How can you go to a system like Looker and pull out useful metadata?

Shinji Kim 37:45
Yeah, so it doesn't happen overnight. We have to go in and look at the BI tool's data model. Our integration process requires mapping out how the rich metadata of the BI tool will map to the BI metadata model that we have, and if there are parts that the BI tool may not have, what other models we may need to augment in order to support that integration. Every integration we have will include lineage, popularity, top users, and the connection into the actual database connections. All of this happens very differently per BI tool. For example, with Looker, we can pull the dashboards and user information directly from their API. But the Looker API does not expose the LookML information. It gives you the Explore information, but it doesn't tell you the actual view it's coming from, or which database connection you have. So for those, we usually either get a snapshot of the LookML repo from the customer, or we connect directly to their repo as a delivery mechanism, to load and parse the LookML views so that we can bridge the connection between the data warehouse and Looker. On the other hand, for something like Tableau: Tableau has a Metadata API that exposes a lot of this through GraphQL, but we also use the REST API because it gives us other information that we need. Most companies run their dashboards through the Tableau data model, which can come from the API, but some workbooks and dashboards run through what they call, quote unquote, custom SQL. In those cases, we have to fetch the SQL separately, run it through our parser against the connection that we see on Tableau, and then bring out the result and add it to the lineage.
So yeah, it really depends on the nature of the tool, but we try to look at how the BI tool actually works, what its view of the data is, and hence how it defines dashboards, charts, and metrics, and we map those onto our model. Because eventually, some of our customers have multiple BI tools, and when they want to see all their dashboards together just by typing a keyword, we want to consolidate all of this under the same umbrella. So that's the exercise our team has to go through to ensure an integration is done correctly.
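For concreteness, the Tableau path mentioned above, asking the Metadata API via GraphQL which upstream warehouse tables feed each workbook, might look roughly like the sketch below. The `/api/metadata/graphql` endpoint, the `X-Tableau-Auth` header, and the `upstreamTables` field exist in Tableau's public Metadata API, but the exact query shape, server name, and token handling here are illustrative assumptions, not Select Star's code:

```python
import json

# GraphQL query asking, for each workbook, which upstream warehouse tables
# feed it: the kind of lineage fact a catalog wants from a BI tool.
LINEAGE_QUERY = """
{
  workbooks {
    name
    upstreamTables { name schema database { name } }
  }
}
"""

def build_metadata_request(server: str, auth_token: str) -> dict:
    """Assemble the HTTP request for Tableau's Metadata API endpoint.
    The token would come from a prior REST API sign-in."""
    return {
        "url": f"https://{server}/api/metadata/graphql",
        "headers": {
            "X-Tableau-Auth": auth_token,
            "Content-Type": "application/json",
        },
        "body": json.dumps({"query": LINEAGE_QUERY}),
    }

req = build_metadata_request("tableau.example.com", "TOKEN")
# In practice: requests.post(req["url"], headers=req["headers"], data=req["body"])
```

Workbooks running "custom SQL" would not surface their sources this way; as described above, that SQL has to be fetched and run through the catalog's own parser instead.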

Kostas Pardalis 41:31
Yeah, that's super interesting. Okay, one last question from me, and then I'll give the microphone back to Eric. Many times, especially those of us coming from an engineering background, we tend to abstract things to the point that they become too idealized, right? We talk about the data stack: we have the ETL pipelines, the data warehouse, then some consumers that are going to be BI tools, and all that stuff. But the reality is that things look a little bit different. Every company is doing things in its own way, and users are not always following the rules. I mean, I do that many times: I get frustrated with a tool, so I just export a CSV, put it into a Google Sheet, do what I want to do, and call it a day. I'm sure you see that in companies out there too. So in your opinion, for tools that deal with governance and data discoverability, that provide this kind of infrastructure so the users inside the company, and the company itself, have the best possible visibility into how data is getting used: what are the limits there? Do you think we will ever fully solve the problem? Do you think we still have work to do with products like Select Star to provide more coverage around governance? Or, in the end, is it okay to just accept that people will not always follow the rules, that there are always going to be exceptions, and that's fine; it's part of designing the product to cover, let's say, 80 to 90 percent of the use cases out there, while always keeping in mind that some things might be different? Did I confuse you?
Because I feel like I just talked for a couple of minutes straight. But let me know, I can rephrase the question.

Shinji Kim 44:00
Yeah, no, I can see where you're coming from. I guess one question in there is: do we just accept the fact that people are going to do their own thing, and it's not all discoverable or coverable? I think that, given we are still working with systems, working with something that can be parsed, that has compiled and run and is out there, we see it as something that we can process. And our job is to process it and show you what it actually looks like, so that you don't have to do that manual work. Once you start trying to add a manual lineage, or something beyond what the system already knows, then the question becomes: if we can parse everything of what you've done, what's the point of trying to augment the automation by adding something manual, especially if you cannot maintain it? A big part of automation in the beginning is that you save a lot of time and don't have to do this manually. At the same time, I think the bigger ROI of automation is that you don't have to maintain it. You don't have to go and update or delete or add anything, because your data model changes so fast. The part that we recommend our customers maintain more manually is around business processes: documenting business processes, making sure those are clearly defined and have the domain data model connected to them. These are things that we won't be able to automatically define for you. But in terms of lineage, usage, things like that, those are, I think, much better left to the machine to figure out. That's how I think about it.
But it also means that, yes, we do have a lot of work to do to ensure that everything is being processed fully and correctly.

Kostas Pardalis 46:43
Yep. Yeah. But that’s a good thing. Because it means that there is an opportunity there to build products and businesses. So that sounds like a pretty big opportunity. Excellent. So Eric, all yours. I think we are close to the buzzer, as you usually say, but—

Eric Dodds 47:02
Oh! I stole one of your typical questions, now you steal my line?

Kostas Pardalis 47:08
Oh, yeah, I like to take my revenge. Yeah. So all yours.

Eric Dodds 47:15
Shinji, my question is about some of the practical implications of what you and Kostas just discussed. It's around at what point you reach economies of scale in terms of coverage, and what the relationship between ongoing work and new data coming in typically looks like. One of the blessings and curses of modern data tools is the huge gain in cost efficiency for data storage, right? It has never been cheaper to pull in a huge amount of any type of data and store it. And so at a lot of companies, even smaller companies, you see large amounts of potentially unnecessary data flowing into the system, in part just because it's technically very easy to pull the data in. So when you think about cataloging and discovery, it seems like you hit an economy of scale where you get a certain amount of coverage, but then there's always new data flowing into the organization. What does that look like? How do you manage that relationship? Is it a lot of ongoing maintenance? What do you see on the ground with companies that are dealing with that?

Shinji Kim 48:56
I mean, the fact that there are new datasets being added every day, and new models or dashboards being created every day, is the pure reason why somebody would adopt Select Star over other tools or other ways of manually tracking this metadata. Because at minimum every 24 hours we will update all of your metadata, all of your lineage and popularity and the way that it's been used, so that you can always trust that whenever you're looking at Select Star, it is up to date and has all the latest and greatest information. That is really the only way to keep your metadata in one place: only when it's truly automated, in my opinion. And in terms of cost: cost is moving to compute, right? It's cheaper to store the data, but now you are running queries, and that's going to cost you some amount. It's easy to schedule a query that costs two dollars, but it adds up. So this is another reason why it's important to be aware of the usage patterns. We actually see a lot of customers end up saving on their cloud data costs, because they find pipelines, materialized views, and scheduled jobs that are fueling dashboards nobody looks at anymore. And that can really only come through having both the lineage and the usage model, the popularity, together. So I think that's more of an ongoing effort. And whenever it's an ongoing thing that you want to monitor, or refer to as the right place to go, you would want that tool to be automated and up to date with the state of the data, both storage and consumption, wherever that's happening.
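The cost-saving pattern described here, joining lineage ("this pipeline feeds that dashboard") with usage ("nobody has viewed that dashboard in months") to flag wasted compute, boils down to a small join over metadata. A hypothetical sketch, with made-up pipelines, dashboards, and dates:

```python
from datetime import date, timedelta

# Lineage: which dashboard each scheduled pipeline ultimately feeds.
pipeline_feeds = {"nightly_rollup": "sales_dash", "hourly_sync": "ops_dash"}
# Usage: the last time anyone viewed each dashboard.
last_viewed = {"sales_dash": date(2022, 10, 20), "ops_dash": date(2022, 3, 1)}

def stale_pipelines(today: date, cutoff_days: int = 90) -> list[str]:
    """Pipelines whose downstream dashboard hasn't been viewed recently:
    candidates for deprecation, and hence for cloud-cost savings."""
    cutoff = today - timedelta(days=cutoff_days)
    return [
        p for p, dash in pipeline_feeds.items()
        if last_viewed.get(dash, date.min) < cutoff
    ]

print(stale_pipelines(date(2022, 10, 26)))
```

Neither signal works alone: usage stats without lineage can't tell you which upstream jobs to cut, and lineage without usage can't tell you what's safe to cut.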

Eric Dodds 51:12
Yep, that makes total sense. And so do you see, with a lot of Select Star customers, that the automation runs and they're essentially just reviewing new information? It seems like there's very little adjustment to the model.

Shinji Kim 51:33
It's twofold. For the first six months, a lot of customers use Select Star as, in effect, a Google for their data. They already have a lot of data, and through Select Star they're finding information about this table, or that dashboard, or this pipeline that they didn't know about before. We hear it from customers all the time: they've been using this data warehouse or this BI tool for the last two years, and Select Star is telling them things they didn't know about their own data model. That's more of the value that we drive in the beginning. You didn't have to do any work other than connecting your database, but you're seeing a lot of new things that you didn't know about your data. Then, usually after three to six months, a lot of our users start using this data to either deprecate old dashboards and tables, start putting some taxonomy together, or add some documentation. And this is where the real data management and data governance starts happening. They have a lot of directions and hints to start from, because they're getting this visibility and lineage into how all their data is being used today.

Eric Dodds 53:01
Yeah, what a great answer. I don't think I asked that question very well, but you answered it perfectly in terms of the lifecycle of usage: getting over the learning curve of the initial economies of scale, understanding and learning about your data, and then taking action on it.

Shinji Kim 53:26
Yeah, and a lot of that is because they get to find something new every time they come back to Select Star, which really fuels the user engagement cycle. Then, as they start inviting others and referring to different Select Star links, they realize: oh, if I'm going to start sharing this with more of my data team, with this product manager and whatnot, I had better also put in some more documentation, some more context. Because we are all referring to this anyway, and it already has more than half of the documentation automatically filled in and kept updated.

Eric Dodds 54:08
Yep. Love it. Very cool. Super helpful. Well, this has been such an interesting show. I love the concept of Google for data, a Google for your own data. That sounds exciting, even to me, so I know a lot of our listeners will be interested in it. Shinji, thank you so much for taking the time to talk with us and teach us about data discovery.

Shinji Kim 54:36
Awesome. Thanks for having me here. This was fun. I think we went pretty deep on multiple subjects. You guys be great.

Eric Dodds 54:44
What an interesting show. I think one of my big takeaways was just appreciating how opinionated Shinji was. She had strong opinions about what types of things a machine should handle, and I know that's come up on the show a number of times, but I'm just a huge fan of removing the unnecessary, laborious parts of working with data from, say, the data engineer or the analyst role. It allows them to not only probably enjoy their work more, but be more valuable, because they're spending their time doing more valuable stuff. So I appreciated that. At the same time, it's a really challenging problem, as evidenced by some of the technical things that you discussed; just handing something to a machine doesn't always work out perfectly. So anyway, that was my big takeaway, and what I'll be thinking about. How about you?

Kostas Pardalis 55:52
Yeah, I mean, it's very interesting to hear how complex a problem data governance is. There are so many moving parts, and you have to synthesize and integrate information from so many different sources. So it's pretty hard to build a product that can actually help you with data governance. Between the experience of the product that you have to build and the reliability of the underlying technology, it's kind of crazy; I think it's a very big challenge. And I have a feeling that more and more companies out there that position themselves, one way or another, close to data governance will end up in this category. Even if we take the broad category of governance and break it down into smaller pieces, it's the same thing: you need all of them in order to deliver what companies and customers are looking for. So anyway, I'm looking forward to it. I'm sure in the future we can discuss more about all these challenges: integrating all the different sources and applications, how to parse all this information, and how you integrate it and present it to the user in the end. That's something we didn't get into that much, but I'd love to continue the conversation with her in the future.

Eric Dodds 57:27
Absolutely. All right. Well, thank you again for joining us on The Data Stack Show. Subscribe if you haven’t, tell a friend, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.