Episode 174:

Does Your Data Stack Need a Semantic Layer? Featuring Artyom Keydunov of Cube Dev

January 24, 2024

This week on The Data Stack Show, Eric and Kostas chat with Artyom Keydunov, Co-Founder and CEO, Cube Dev. During the episode, the group discusses the evolution of semantic layers, their importance in data management, and Cube’s growth and adaptation to industry needs. Artyom highlights the challenges in building a semantic layer and the solutions Cube has developed, including their own SQL engine. He also discusses the potential of integrating semantic layers with natural language processing technologies for improved accuracy and much more.


Highlights from this week’s conversation include:

  • Artyom’s background in the data space (0:32)
  • The growth and changes at Cube (5:58)
  • Pain points of managing metrics definitions across different tools (9:39)
  • Trade-offs between coupled and decoupled semantic layers (12:12)
  • Making a case for implementing a semantic layer (14:17)
  • The evolution of semantic layers (23:28)
  • Challenges in designing a decoupled semantic layer (24:16)
  • Different approaches to solving the interface problem (26:58)
  • Implementing a SQL engine in Cube (35:58)
  • Overhead and debugging in semantic layers (39:08)
  • The semantic layer and its importance (46:26)
  • The need for semantics in data products (47:34)
  • What’s the future of semantic layers and user experience? (51:49)
  • Final thoughts and takeaways (57:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We’re here with Artyom, the co-founder and CEO of Cube. Artyom, thanks for coming on the show. And welcome.

Artyom Keydunov 00:32
Thank you. Thank you for having me today. My name is Artyom, and I’m the co-founder and CEO of a company called Cube. I also co-founded the Cube open source project back in 2019; a year after that, I started the company with my co-founder. So it’s been a journey of building a universal semantic layer and going through cycles of evolution over the last few years. So yeah, exciting to be here and to chat all about metrics, semantic layers, and data.

Kostas Pardalis 01:04
So, we’ve had you on before. It’s been almost a year since you were on the show, Artyom, and many things have happened in the industry. I’m very curious to hear how semantic layers have evolved in this past year, and also what’s next, especially after this whole revolution that’s happening right now with AI, LLMs, and all these new technologies around data. So I’m really looking forward to chatting more about that. What about you? What are a couple of things that you’re really excited to chat about today?

Artyom Keydunov 01:44
It was a great year for semantic layers, for sure. And I’m very glad to see how the data community evolved in its thinking about the need for semantic layers. I saw different vendors, different companies coming up with semantic layer solutions, and I’m definitely happy to see the category maturing overall. And Cube, I hope, contributed a lot to the thinking, to the framework for how the semantic layer should fit into the modern data stack. Obviously, the elephant in the room this year was LLMs, right, and AI. I felt like that contributed to the ideas around, and the need for, the semantic layer, because LLMs are all about semantics. They are essentially text in, text out, right? And text is semantics. So this year was a strong tailwind: okay, we really need semantics, not only for humans but for APIs as well. So let’s talk about semantic layers, right, and how we can get semantics about our data.

Kostas Pardalis 02:48
No, that’s awesome. I think we have plenty to talk about. So what do you think, Eric?

Eric Dodds 02:53
Let’s dive in.

Kostas Pardalis 02:54
Let’s do it.

Eric Dodds 02:55
It is so fun to have guests back on the show. Again, it’s been about a year, and there are so many exciting updates to talk about. Before we get into the topics, can you just remind our listeners what Cube is?

Artyom Keydunov 03:16
Cube is a semantic layer; a universal, standalone semantic layer. The reason I highlight "standalone" and "universal" is that semantic layers have been here for a long time, right? BusinessObjects had a semantic layer. The problem started to happen as we got more and more BI tools. The cloud made it really easy to buy more and more tools, so we started to have a lot of semantic layers scattered across different tools, because now we have ThoughtSpot, Power BI, Domo, Tableau, you name it. They all have a semantic layer attached to the product, coupled with the product, and if an organization runs five BI tools, they all have their own semantic layer. The problem is that we repeat ourselves when we define a metric at every BI level, right? We go into one tool and define all the metrics; we go to the second tool and define them again. The frameworks all have a different way to define metrics, but essentially we define the same metrics. That creates a problem: your stack becomes not DRY, and in engineering we always try to keep things DRY. Do not repeat yourself, right? That’s the whole idea behind the semantic layer: let’s make our data stack DRY at scale. We take the metrics out of the BI tools and define them in one place. We call that place the universal semantic layer, and it sits between the cloud data warehouses and all the data visualization tools. We define the metrics in that place, and then we deliver them to all the different data consumption tools. So that’s the whole idea behind the universal semantic layer, and Cube is building one.
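The "define once, deliver everywhere" idea can be sketched in a few lines. This is a hypothetical Python illustration, not Cube’s actual data-model format; the metric names, tables, and generated SQL are invented:

```python
# Hypothetical central metric and dimension definitions: one place,
# instead of redefining the same metric in every BI tool.
METRICS = {
    "total_revenue": {"sql": "SUM(amount)", "table": "orders"},
    "order_count": {"sql": "COUNT(*)", "table": "orders"},
}
DIMENSIONS = {
    "country": {"sql": "country", "table": "orders"},
}

def compile_query(metric, dimension=None):
    """Generate warehouse SQL for a metric request from any consuming tool."""
    m = METRICS[metric]
    if dimension is None:
        return "SELECT {} AS {} FROM {}".format(m["sql"], metric, m["table"])
    d = DIMENSIONS[dimension]
    return "SELECT {}, {} AS {} FROM {} GROUP BY {}".format(
        d["sql"], m["sql"], metric, m["table"], d["sql"]
    )

# Any BI tool, embedded app, or AI agent asking for revenue by country
# gets SQL compiled from the same single definition:
print(compile_query("total_revenue", "country"))
# SELECT country, SUM(amount) AS total_revenue FROM orders GROUP BY country
```

Every consumer queries by name, so the calculation lives in exactly one place.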

Eric Dodds 05:06
What a great, concise definition. It sounds like you’ve explained that a couple of times

Artyom Keydunov 05:11
Before? I did, yeah.

Eric Dodds 05:14
Let’s talk about last year. So when we last had you on the show, you were focused on a fairly specific use case. I know that you were focused on analytics use cases, and I think you were talking about this concept of headless BI, if my memory serves me correctly. Can you explain the journey that you and Cube have been on, going from that positioning as an analytics-type solution using the term headless? How did the company grow and change? What did you learn about your customers that pushed the move to semantic layer?

Artyom Keydunov 05:56
Right. I think we went through quite an evolution since we started the company and the project. The project started as open source in 2019, and then the company itself started in 2020, so it’s been a little over three years now. I think we initially had this big vision where we wanted to create metrics, semantics, and then deliver them to different places. But we never had really good semantics about the semantics ourselves, I would say, about what to call it. We had different names: I remember calling ourselves an API for data at some point, then headless BI, and then the metrics layer. Eventually, it felt like the industry arrived at the term semantic layer, and everyone is using "semantic layer" right now. So even from a naming perspective, we went through several steps of the evolution here. From a product perspective, when we first started, the obvious problem to solve was how you use metrics in an embedded analytics or customer-facing application, because that’s where you still need to build a semantic layer, but you would build it manually. You’re not going to use one provided by a BI tool; you would actually write code in your Django app or your Ruby on Rails app to deliver the metrics to the customer, right? So we thought, let’s remove that piece that developers need to build inside those frameworks and make it generic. That was a very clear first use case, and it was a big need, and that’s why it helped us get initial traction. As we built on top of this use case, we started to have customers saying to us, hey, we’re using Cube to show metrics to our customers right inside our app, but we recreate the same metrics inside our BI tool, and our second BI tool, and a third BI tool. Why don’t we just use Cube to centralize all the metrics for all the different places?
And that’s how we started to go to the next step, the next age of evolution for Cube: okay, let’s bring the BI tools to work on top of Cube, too. I wouldn’t say our vision expanded, because we always wanted to do that, but the product started to expand toward the bigger vision, right? We added more BI tools, and different AI apps this year. If you go on our website right now, it’s quite a different website than a year or two ago; we talk about the bigger vision now. I think that was the major change on our end as a product: now we’re not only serving embedded analytics, but serving the bigger picture of powering all the different data experiences in the organization.

Eric Dodds 08:59
Makes total sense. One question I have for you is around adoption, and what I mean by that is, I’m interested in the point at which your customers or users come to you. You talk about having multiple BI systems that all have their own semantic layers. It would seem that a lot of companies hit a pain point where they’re managing those metrics definitions across a number of different tools and platforms. Do you see that as the main inflection point where companies come to Cube?

Artyom Keydunov 09:40
I think it’s a compound problem, because you get so many BI tools, and then even inside one BI tool, like Tableau, you may have a lot of different workbooks. Every workbook acts like a silo, with all its metrics inside it, and then you think, oh, how do these workbooks connect together? It adds up to the problem every time you build a new dashboard, every time you do a new report, or someone tries to do analysis in Excel. That’s why I think this problem is always top of mind for data engineers and data leaders. They always try to find the best way to manage data modeling and metrics, because it always felt like we made so much progress with ideas around code-first management and applying software engineering best practices. We have mature data pipelines, the medallion architecture, all these different ideas, but then we sort of fail at the last mile, where we actually need to build metrics. I think that creates this anxiety, this uncomfortable feeling, among data leaders and data engineers: there should be a better architecture than what I’m doing today. That’s probably why people start thinking about semantic layers, and they come talk to us and explore different options.

Eric Dodds 11:11
Yeah, that makes total sense. Are you seeing more companies or teams try to start with a semantic layer?

Artyom Keydunov 11:21
You mean, just from scratch, right? Yeah, I think it happens; I see it happen sometimes. I think it’s mostly something that comes after you have a warehouse and one, two, three BI tools, because then the problem is more evident, right? You see: now I have all that mess, I need to clean house here. But I do see companies now starting to think about a semantic layer from the beginning, which to me signals the maturing of the category, more awareness. A data team that is aware it will need a semantic layer eventually thinks, okay, let’s put it in sooner rather than later. The caveat here is that sometimes there is an opportunity to use, for example, Looker, which offers a great coupled semantic layer, and that might look like a good idea. Say it’s a mid-market company with 100 to 200 people and still a small data team, right? You might need only one BI tool at that point, and with Looker adoption, the semantic layer is coupled with Looker. And it kind of makes sense, right? You have your transformations, you have your semantic layer in LookML. The problem, though, is that once your organization grows, you will definitely hire people who say, hey, I’ve used Tableau all my life, why should I use Looker? And then Power BI will come along, and it can’t use my LookML. So that’s always the trade-off I see when companies pick a semantic layer: LookML really looks like a good option if you’re small, but if you think about what happens next, a decoupled semantic layer would probably be the better option. That’s an interesting caveat I’ve been seeing when smaller companies think about a semantic layer.

Eric Dodds 13:29
Yeah, that’s a tricky thing, and I’d love for you to help us think about how data teams can make a case for this. You said, and this was my hypothesis as well, that the perception of value becomes much higher when you’re in a lot of pain because you have multiple BI tools. But you save a lot of time and money by never getting to that place of pain in the first place. I think one of the challenges is justifying the expenditure, whether that’s paying for software or your team actually implementing it. How would you recommend that someone make a case for something that isn’t going to provide immediate value now, but will save a million dollars over the next three years in time that would have been spent wrangling all this data?

Artyom Keydunov 14:46
Yeah. I think we, and by "we" I mean all the semantic layer providers, and to some extent the BI vendors that want to integrate with semantic layers, need to make it as easy and as cheap as possible for small teams to implement the best practices from the beginning. In our solution, the way we think about pricing at Cube is to make it scale with the organization. We don’t want it to be as expensive as Looker, for example. We want to make sure it’s cheap initially, so you still have budget for, say, Superset, or Preset, which is a cloud version of Superset. You bundle these tools together and go with that architecture instead of Looker, where you’d have the vendor lock-in. So first, we need to create the right business model: come in cheap first and then scale with usage. But we also need to make sure our products offer a very good experience compared to coupled solutions. Obviously, with a coupled solution it’s easier to build a good user experience; with a decoupled product you need to make two products work together almost as if they were the same product, which is really hard. But we need to solve this problem. So those are the two things we need to do. And then, for data teams to justify it, it’s really about understanding the best practices and understanding that eventually you will need to scale. If vendors make it as easy as possible, with attractive cost and attractive integration, it becomes easier for data teams to go with that architecture sooner rather than later.

Eric Dodds 16:47
Makes total sense. Okay, well, we’re talking about semantic layers, and I feel like you’ve done a great job of explaining where the semantic layer sits. dbt is a very widely used data tool, and they emphasize the semantic layer a lot. A lot of people would describe dbt in a similar way to the way you’re describing things. So can you explain some of the key differences, or even use cases?

Artyom Keydunov 17:20
Maybe first we’ll go a little bit through the history of how dbt arrived at the semantic layer. dbt started as a transformation tool, right? dbt Core is a widely used and popular transformation tool. Then dbt the company raised a lot of money, as part of the low-interest-rate phenomenon, and started to build around the initial dbt Core CLI tool. At some point they announced that they wanted to build a metrics layer, a semantic layer. The first attempt was to build it in house, and they sort of failed to deliver on expectations. To solve that, they decided to buy a company called Transform Data, and Transform Data was one of our competitors. A really great team. You know, when you start a company, it’s always good to have competitors, because you see other smart people doing the same thing as you, so you’re probably doing something right, right? So it always felt good to have competitors like Transform Data, because it was a lot of reassurance that we were doing the right thing. What happened is that we got a bit more traction than Transform Data from a business perspective and were kind of leading the category, so Transform Data decided to get acquired, and dbt acquired them. I think that was the second attempt by dbt to deliver on the semantic layer. Now, that happened almost a year ago, and I think we’re still at a stage where it’s not quite clear what the dbt semantic layer is as a product. When I talk to the community, when I talk to users, I hear a lot of awareness of the dbt semantic layer, but I don’t see actual users and customers of it, because I think the product is still not there.
And it’s hard for me to talk about why that’s happening, because they have their own reasons. It’s a big company now; they raised a lot of money, which creates a lot of pressure. Maybe they’re looking into different areas of the product: how to optimize for monetization, how to optimize the conversion from open source to cloud. They brought in an AWS VP of product, right, to solve all these problems. So it feels like the semantic layer is not getting enough attention, because it’s a really hard problem to solve in both technology and product, and it’s not an existential thing for dbt. If the dbt semantic layer fails, they still have a business; if we at Cube fail at the semantic layer, we don’t have a business, right? We have to make it work. For dbt, it’s just one of the features they have at this point.

Eric Dodds
Yeah, super interesting. One question, and I should have thought of this earlier, but I’m sure a lot of teams build their own kind of semantic layer to address some of these issues. What tools are they using to do that, to essentially mimic what a Cube would do? What does the in-house build look like?

Artyom Keydunov
So I think we can categorize these approaches into two buckets, a simple one and a more complicated one. The simple version would be to use data marts as your semantic layer. The problem with leveraging data marts is that you would have to create a data mart for every level of detail, or grain, right? Non-additive measures, joins, they all create this complexity where you cannot have a single data mart serving metrics at multiple grains. So in that case you would have to build a lot of data marts, and that’s what some companies do.
If you have a process for how you produce these data marts and keep them under control, it works; some companies put a whole team in place just to do that. It’s possible, but again, it’s very expensive. The other option would be to create your own in-house virtual layer, a virtual semantic layer that generates SQL. I know some of the more sophisticated tech companies build their own in-house versions of that. But essentially, that would be Cube, right? A layer that offers you a virtual data representation and actually generates SQL when you query it; that way you solve this grain, level-of-detail problem. So I see essentially two options: you either put a lot of money and time into the manual work of creating and duplicating data marts, or you build your in-house version of Cube.
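A toy illustration of why Artyom says you need one data mart per grain: non-additive measures such as distinct counts cannot be rolled up from a pre-aggregated mart, while a virtual layer can answer any grain from the raw facts. This Python sketch uses invented data, not any real warehouse:

```python
# Toy raw fact data: (day, user_id) event pairs.
events = [
    ("2024-01-01", "a"), ("2024-01-01", "b"),
    ("2024-01-02", "a"), ("2024-01-02", "c"),
]

# Pre-aggregated "data mart" at daily grain: unique users per day.
daily_users = {}
for day, user in events:
    daily_users.setdefault(day, set()).add(user)
daily_counts = {day: len(users) for day, users in daily_users.items()}

# Naively rolling the daily mart up to monthly grain double-counts
# users who were active on more than one day:
wrong_monthly = sum(daily_counts.values())          # 2 + 2 = 4

# The virtual-layer approach answers monthly grain from the raw facts,
# generating a fresh query (here simulated in Python) per request:
right_monthly = len({user for _, user in events})   # users a, b, c = 3

print(wrong_monthly, right_monthly)
```

Because the distinct count is non-additive, a mart built at one grain is wrong at every coarser grain; that is what forces either many marts or per-query SQL generation.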

Eric Dodds 22:45
Yep, makes total sense. All right, Kostas, I’ve been monopolizing the mic, so please jump in, because I know you have a ton of questions. And I want to hear about the LLM and AI stuff.

Kostas Pardalis 22:57
Yeah. But before we get there, I think it would be good to spend a little bit of time getting a bit more technical about semantic layers. You mentioned, Artyom, a few seconds ago that semantic layers are a hard problem, both from a product perspective and a technical perspective, right? What does that mean? Let’s focus on the technical side of things: why is building a semantic layer hard?

Artyom Keydunov 23:27
Right. So essentially, at the core of the semantic layer you have a virtual data representation that can generate SQL. That’s the whole idea of the semantic layer, even when the semantic layer is part of a BI tool. If you look back at BusinessObjects, the first generation of BI semantic layers, or even at Looker, you’ll see that a semantic layer is essentially this virtual representation of data that lets you drag and drop things. When you build a query that way, the system generates SQL and executes that SQL against your cloud data warehouse in the case of a live query; in the case of extracts, obviously, it queries its own data store. So the core of the problem is how you build a virtual layer that exposes data as measures and dimensions to the end user and then generates SQL. And there you have all these problems: how do we deal with joins, how do we deal with fan-outs, fan traps, chasm traps, all of that, when we generate this SQL? So SQL generation, and creating the right framework for the abstraction, that’s one piece of the problem. The other big problem, which was not solved by any coupled semantic layer, is how we make an interface to a semantic layer. That problem really arises when we build a decoupled semantic layer, because decoupled means we need to have an API so different systems can connect to the semantic layer. From that perspective, the question would be: okay, we have Tableau, so how would Tableau connect to your decoupled semantic layer? There could be different ideas here. One could build a one-to-one connector with Tableau, but the problem is that you would then have to build one-to-one connectors with all the tools, right?
And that might be just a maintenance burden, almost impossible. So if you look at the different options, you will probably arrive at SQL. You would think, okay, the semantic layer should probably speak SQL, because all these tools already speak SQL. Now, the problem with SQL is that SQL doesn’t know about metrics, right? It’s just columns. You define metrics when you write a query: you write AVG, SUM, you can do some math right in your SQL, but you cannot say, hey, SQL, give me that measure, that metric. So that’s where the problem is: how do we make SQL look almost like MDX, or work almost like MDX, so that it can query multidimensional data? I think the missing piece here is the idea of the measure. How do we make SQL aware of measures, so that when you write your SQL query you can say, hey, I just want to get that measure with that dimension, apply these filters, and get a result back? So I think that would be the second big technical problem when building a semantic layer. The first one is how you design the architecture: how do you define measures, dimensions, all these objects, and generate SQL? And the second big problem is how you create an interface so that tools like Tableau can query the semantic layer.
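The fan-out trap Artyom lists is easy to see with a toy example. This hypothetical Python sketch shows how a naive join inflates a measure, and what the generated SQL must do instead (aggregate each table at its own grain); the tables and numbers are invented:

```python
# Toy tables: orders (one row per order) and their line items.
orders = [{"id": 1, "amount": 100}, {"id": 2, "amount": 50}]
items = [
    {"order_id": 1, "sku": "x"},
    {"order_id": 1, "sku": "y"},
    {"order_id": 2, "sku": "z"},
]

# Naive join-then-aggregate: order 1 has two items, so its row is
# duplicated after the join and its amount is counted twice.
joined = [o for i in items for o in orders if o["id"] == i["order_id"]]
naive_sum = sum(o["amount"] for o in joined)        # 100 + 100 + 50 = 250

# What a semantic layer's generated SQL should do: aggregate the
# measure at the grain of its own table before (or instead of) joining.
correct_sum = sum(o["amount"] for o in orders)      # 100 + 50 = 150

print(naive_sum, correct_sum)
```

This is exactly the kind of mistake a human writing ad hoc SQL makes, and the kind the SQL generator has to be designed never to make.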

Kostas Pardalis 26:57
Yeah, let’s start with the last one you mentioned. How do we do that? Do we, yet another time, try to extend SQL and add more syntax to it? Do we just add things like metadata or annotations? You’re the expert here, so, first of all: what are the different attempts people have tried to solve that, outside of building one-to-one connections with every possible BI tool out there?

Artyom Keydunov 27:36
Yeah, so far, outside of one-to-one connectors, I’ve mostly seen two attempts. One is to introduce your own query language. This query language would be some sort of custom thing; think of a NoSQL-style database query language. You’re still querying something, but it’s not SQL. That’s what Transform Data had before dbt acquired it. I don’t think that’s the right approach. The good thing about it is that it’s native for querying metrics, which feels good when you use it. The problem is that you have all this data infrastructure already built around SQL, and then you need to go and pitch Tableau and Power BI: hey, can you support my metrics query language? That’s almost impossible to do, right? So that’s why I feel it’s not the right approach. The other approach is to make it SQL-first. Now, here we have two branches. One is what we do at Cube and what Looker is doing; I’ll talk about it in a second. The other is what the dbt semantic layer is currently doing. What dbt is doing is taking SQL as a container and putting a bunch of Jinja inside it. From that perspective, SQL is more like just a protocol, the container, but the actual querying happens inside the Jinja template inside the SQL. It might solve the connectivity issue, just basic connectivity, but then the question is: would Tableau generate that query? Because Tableau generates a SQL query when you do drag and drop. Then you come back down to one-to-one connectors, because you would need to write a driver for Tableau that knows how to generate the Jinja template. So I don’t think that’s quite a solution.
Again, it might be easier for a human to write that inside the SQL query, if you’re used to tools like Hex or Mode, but it might be really hard to get Power BI or Tableau to actually generate it. And now the final option, which is what we do and what Looker is doing, is to be as SQL-compatible as possible, with the addition of the measure type. There is a special type in SQL that represents an already predefined metric, meaning a special column that knows how to evaluate itself. We define it in the data model; it could be active users, or the percentage of failed transactions, whatever metric. From the SQL standpoint, it’s just a column in your table with the special type measure, and you use a function called MEASURE, a special aggregate function, to query it. So you say, hey, I want to get my measure back, and by writing the SQL query that way, you’re telling the system: I don’t want to calculate it myself; you already know how to calculate it, just give me the value back. That seems like the least evil here in terms of changes to SQL, because we don’t want to change SQL, right? But we have to make this minor change, and it feels like the smallest necessary change we can make to make it work. The challenge is that it might not feel entirely natural, because you’re making SQL multidimensional at that point. So we want to make sure we’re not breaking any SQL standards or SQL expectations. But that’s possible to do. That’s what we do at Cube, and that’s what Looker is doing with the Looker Modeler. And just for context: when GCP acquired Looker, they announced that they wanted to turn Looker into a universal semantic layer as well, and that means they need to build an interface.
The interface for that is Apache Calcite, which is being developed by Julian Hyde, and that’s how Julian and the team approach the problem of querying measures as well: a special measure type, and a special MEASURE function. One thing to mention here: you may say Tableau doesn’t have a MEASURE function, right? It only has SUM, AVG, and so on. So we probably need to provide some backward compatibility for BI tools that don’t have it yet; maybe use SUM for measures with a sum subtype, and AVG for measures with an average subtype. There are some compatibility tricks like that. But long term, I think that approach is the most viable one.
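The MEASURE() idea can be sketched as a query rewrite: the tool sends SQL containing MEASURE(name), and the semantic layer expands it into the predefined expression. A minimal Python sketch, assuming invented measure definitions (this is not Cube’s or Calcite’s actual implementation):

```python
import re

# Invented measure definitions; in a real semantic layer these live in
# the data model, not in the querying tool.
MEASURE_DEFS = {
    "active_users": "COUNT(DISTINCT user_id)",
    "failed_tx_pct": "100.0 * SUM(failed) / COUNT(*)",
}

def expand_measures(sql):
    """Rewrite MEASURE(name) calls into their predefined SQL expressions,
    so the consuming tool never restates the calculation itself."""
    return re.sub(
        r"MEASURE\((\w+)\)",
        lambda m: MEASURE_DEFS[m.group(1)],
        sql,
    )

q = "SELECT city, MEASURE(active_users) FROM sessions GROUP BY city"
print(expand_measures(q))
# SELECT city, COUNT(DISTINCT user_id) FROM sessions GROUP BY city
```

The tool's query stays plain SQL with one extra aggregate function, which is why this change is so much easier to pitch to BI vendors than a whole new query language.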

Kostas Pardalis 32:46
So, okay, what I see here is, let's start with the approach of dbt. dbt is trying to solve the problem more on the front-end side, let's say, by treating SQL models as templates and then having a preprocessor based on Jinja logic that enriches the SQL with whatever it has to be. And I can see the value of that in terms of flexibility. Most importantly, you don't really have to go back to the query engine and make changes to it, which is a pretty hard thing to do in general, right? But as you say, you have the problem that all the BI tools, all the front-end tools, need to somehow understand this template language at the end. And then you have the other approach of introducing new types, which, okay, sounds like the more engineering-sound way to do it, right? You don't necessarily have to change things on the front end that much, but you have to change things on the back end. So what's the solution there? And let's keep Google out of it, because Google is a special kind of creature in terms of the resources they have and how they think about products. Let's talk about Cube, right? You can't go out there and say, hey, Redshift, let's introduce another type here, right? So how do you do that? How do you implement that in Cube?
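As a rough illustration of the templating approach described here, this is a toy preprocessor in the spirit of dbt's Jinja logic, using a plain regex instead of Jinja itself. The placeholder syntax, metric names, and SQL are invented for the example.

```python
import re

# Toy stand-in for a Jinja-style preprocessor: the analyst writes SQL
# with metric placeholders, and a compile step expands them before the
# query ever reaches the warehouse. Names here are illustrative only.
METRIC_SQL = {
    "revenue": "SUM(amount)",
    "orders": "COUNT(order_id)",
}

def compile_template(template: str) -> str:
    """Expand {{ metric('name') }} placeholders into real SQL."""
    pattern = r"\{\{\s*metric\('([^']+)'\)\s*\}\}"
    return re.sub(pattern, lambda m: METRIC_SQL[m.group(1)], template)

sql = compile_template(
    "SELECT order_date, {{ metric('revenue') }} FROM orders GROUP BY order_date"
)
print(sql)
# -> SELECT order_date, SUM(amount) FROM orders GROUP BY order_date
```

This also shows the weakness Kostas raises: the expansion has to happen before the warehouse sees the query, so every front-end tool must understand the template language or route through the compiler.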

Artyom Keydunov 34:25
Yeah, I think first, the high-level architecture here would be that Tableau generates that query with measures, all of that query is sent to Cube or any semantic layer, and then Cube, the semantic layer, generates the real SQL based on the data model, which gets executed in Snowflake or Databricks, and the result is sent all the way back to Tableau. So in that case the question is, how do we implement that SQL engine so it can talk to all of that? And yeah, that's the challenge: at Cube we implemented our own SQL engine. We're building one, obviously on top of existing technologies. We use Arrow DataFusion as a SQL parser and, to some extent, as a logical planner, but we extend the planner at the level where we introduce measures and dimensions. Then we build our execution engine in a way that, for some queries, part of the execution happens in the core of the semantic layer, where we generate the SQL query, execute it, and send the result back to the Cube SQL engine. The rest of the execution happens just like regular SQL, because you might have an inner query that goes to your semantic layer and then an outer query that does some post-processing once the data is fetched. So it's a combination of things. But to answer your question: yes, in that case every semantic layer vendor going with that approach would need their own SQL engine. For us, we built it on the Postgres protocol, so it's Postgres-compliant. We also support the Redshift flavor of Postgres, where some of the functions might be different, but essentially it's Postgres-compatible.

Kostas Pardalis 36:33
So you're still interoperating with other query engines there, right? You don't expect users to substitute, for example, BigQuery with Cube to do the data warehousing, do you?

Artyom Keydunov 36:48
No. I mean, we look like Postgres to Tableau, Domo, ThoughtSpot, all of these tools, and they send a query to Cube. Once we get that query, we generate the real query to all the backends: Snowflake, Databricks, Starburst, all of these tools. So it's a two-step process, right? You first send a query to Cube, which is a query to your semantic layer, a simple query, and then Cube generates a completely separate SQL query based on your real data backend. Yep.
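A minimal sketch of that two-step flow: the semantic layer receives a request for measures and dimensions and generates a completely separate SQL query for the real backend. The data model shape, table, and column names below are hypothetical, not Cube's actual model format.

```python
# Toy data model: one "cube" mapping semantic-layer measures and
# dimensions onto a physical table in the warehouse.
MODEL = {
    "orders": {
        "sql_table": "analytics.orders",
        "measures": {"order_count": "COUNT(*)", "revenue": "SUM(amount)"},
        "dimensions": {"status": "status", "city": "city"},
    }
}

def generate_backend_sql(cube: str, measures: list, dimensions: list) -> str:
    """Step two of the flow: turn a semantic query (which measures and
    dimensions the user wants) into real warehouse SQL."""
    m = MODEL[cube]
    select = [m["dimensions"][d] for d in dimensions] + \
             [f'{m["measures"][x]} AS {x}' for x in measures]
    sql = f'SELECT {", ".join(select)} FROM {m["sql_table"]}'
    if dimensions:
        sql += " GROUP BY " + ", ".join(m["dimensions"][d] for d in dimensions)
    return sql

print(generate_backend_sql("orders", ["revenue"], ["status"]))
# -> SELECT status, SUM(amount) AS revenue FROM analytics.orders GROUP BY status
```

In the real system the incoming query arrives over the Postgres wire protocol and goes through a full parser and planner; this sketch only shows the generation step at the end.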

Kostas Pardalis 37:21
So you actually rewrite the query in that case. And my question here is two things. One is the user experience and how it's affected, because, okay, we add another layer of indirection there, which is a very common way of solving problems in engineering, but it probably also means more latency. We don't know, but it might. The other thing is the developer experience, in terms of how you debug issues now. Because now you don't just have the SQL that I write, or let's say that I generate in my Tableau, which is just visual stuff for me as a user. It goes to Cube, right? Cube rewrites the query and executes it on, say, BigQuery. And you have all these different steps where the query gets transformed one way or another, where things can go wrong for whatever reason, right? And the reason I'm asking is that I remember back in 2015 or 2016, back at Blendo, we had a customer who was using Looker with Redshift, and they were in total panic one day because something went wrong with their LookML and Looker started generating queries that really destroyed the cluster. These things can happen, and it becomes harder for the developer to debug. So how do you find the right balance there?

Artyom Keydunov 39:07
Yeah, that's a good question. So first, let's talk about overhead, the potential performance penalty here. That's valid, it's true: you've got something in the middle that gets one query in and generates a second query, so that SQL generation can take some time. On the Cube side we optimize it so that many things are precompiled and reused from the data model generation and compilation perspective, so we usually keep that overhead to around 100 milliseconds or so. And we're dealing with analytics, where we usually talk in seconds anyway, so it's not a big overhead. We also have a caching layer, because Cube started as a framework for embedded analytics, and in embedded analytics performance is really critical. That's why we have a really sophisticated caching layer that in many cases can not only mitigate that additional latency, but even improve on the scenario where you wouldn't have Cube in the middle at all. Cube can actually remove latency in many cases if you use caching at the Cube level. So I would generally say that's not a problem we've seen with customers; it can be mitigated. The second thing is really interesting, and you're spot on about debugging and observability: how do you deal with that? The funny thing is that it's a problem, but an opportunity as well, and that's how Cube Cloud started. First we had the open source project, and when we started a company around it, raised a seed round, raised a Series A, at some point, okay, we needed to start building a commercial product, right? We need to make money.
My co-founder and I sat down and started to think about what we could build that creates value on top of Cube. The first thing we built was an observability and debugging platform that helps you understand what's happening with queries. Cube Cloud is much bigger now, with a lot of features and a lot of stuff, but that's how it started, and it's still a big part of it. We spend a lot of time building tools to help you navigate issues, because that's right: once you have something in the middle, you have to give people tools to debug and understand what's happening. Yeah.
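As a rough sketch of how a caching layer in the middle can remove latency rather than add it, here is a toy TTL cache keyed by normalized SQL. The normalization and TTL policy are illustrative; Cube's actual caching (pre-aggregations and so on) is far more sophisticated.

```python
import time

# Toy result cache: repeated queries are served from memory instead of
# hitting the warehouse again. Keys, TTL, and normalization are made up.
class QueryCache:
    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self.store = {}  # normalized SQL -> (timestamp, rows)

    @staticmethod
    def normalize(sql: str) -> str:
        # Collapse whitespace and case so trivially different queries share a key.
        return " ".join(sql.lower().split())

    def get_or_run(self, sql: str, run):
        key = self.normalize(sql)
        hit = self.store.get(key)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]                      # cache hit: no warehouse trip
        rows = run(sql)                        # cache miss: execute for real
        self.store[key] = (time.time(), rows)
        return rows

calls = []
def fake_warehouse(sql):
    calls.append(sql)
    return [("US", 42)]

cache = QueryCache()
cache.get_or_run("SELECT country, revenue FROM t", fake_warehouse)
cache.get_or_run("select country,  revenue from t", fake_warehouse)  # normalized hit
print(len(calls))
# -> 1
```

The second, differently formatted query never reaches the warehouse, which is the sense in which the layer in the middle can make things faster overall.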

Kostas Pardalis 41:48
Yeah, 100%. All right, cool. That was a very interesting dive into the internals of a system like a semantic layer, and for me it was important to do that, because it's hard for people from the outside to see the complexity of building a system like this. And there are still, as you say, open problems out there and room for even better ways to do it, and that's Cube's mission, right? So let's move to the future, and let's talk a little bit about semantics, because it's very interesting and relates a lot to AI and LLMs, given how fast the AI field is growing right now. Imagine that in a year or two from now, people would just speak to their laptop and the laptop would generate the SQL and come up with whatever they asked. We can argue whether that's realistic or not; I'm obviously exaggerating the hype here. But how do you see semantic layers working together with LLMs, and what's the importance of the two being together for an organization?

Artyom Keydunov 43:27
Yeah, great question. When LLMs came around, with ChatGPT a little over a year ago now, it created a lot of excitement. And in data, one of the first use cases was, okay, now we can write SQL automatically, right? That's what everyone was thinking about, and you started to see a lot of companies build around this idea. It's not a new idea, either: I remember ThoughtSpot. When they started, text-to-SQL generation was all over their positioning and messaging. So we're doing it again now, with better technology, for sure. At Cube we've been thinking a lot about it and talking to a lot of people trying to do this. To summarize my experience with using LLMs for text-to-SQL generation: I think the recent paper from the data.world team did a really good job of summarizing what's happening. They built a benchmark and published it as a paper. The idea of the benchmark is: take a dataset with a bunch of relations, I think a public dataset in the insurance domain, then ask a set of questions and expect a set of answers back, so you can measure accuracy, whether each answer is correct or not, with specific prompts. The first attempt is to run it directly on top of the schema. Essentially the prompt says: you're about to answer the question, use this DDL to learn about the schema. Accuracy there was about 16%, not good. Then they ran it over a knowledge graph.
So they took this ontology, an extended ontology they built, and fed it to the LLM, saying: here's the ontology, run the query on top of it. The result improved to something like 56 or 58%, essentially 3x better, but still hit and miss: one question would be right, another would be wrong. Still, a 3x improvement. After that, among our partners building text-to-SQL products on top of semantic layers, one believed that a semantic layer could significantly improve the accuracy of this solution. So they took Cube as the semantic layer, ran the same benchmark on top of Cube, and got to 100% accuracy. Cube gives all the semantics, the relationships, the ontology, to the LLM system; you craft the prompt with all this information, run it against Cube, and you get really good accuracy. So it went from 16% on raw SQL to 100%. That's why I believe that if we want to build a future with text-to-SQL products, it has to include a semantic layer. And I went to re:Invent last week, where AWS announced a few things, including Q, the chatbot that now lives across AWS products, connected to multiple products including QuickSight. QuickSight, just for context, is the BI from AWS, a pretty standard BI with all the features you'd expect, and now they've added this natural language capability. What you can do is create a topic, which is essentially a kind of dataset, a representation where you put a bunch of measures and dimensions together, and you call it a topic.
And then they require you to give a lot of semantics to the topic: what the synonyms are, how you refer to your metrics. You may have jargon or acronyms in your organization that you use to name a metric. So you essentially give all the semantics to the system, and then it can give some pretty good results, right? Now, what I think is going to happen is that every BI is going to add features like this, and every BI will require semantics to live inside that BI to make the feature work well. And I started with this, right? When you asked me what a universal semantic layer is, I said the problem is that your metrics end up defined across many tools. Now the problem is going to get even worse, because you'll have all these semantics and synonyms inside every BI if you want natural language to work well: semantics scattered across all of these places. So I think that's what's really going to happen, and that's why the value of a standalone semantic layer will be even bigger in this LLM-based world. Yeah.
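To illustrate why feeding the LLM a semantic model can beat feeding it raw DDL, here is a toy prompt builder that exposes measures, dimensions, and synonyms from a semantic-layer model. The model shape, metric names, and synonym lists are hypothetical, not the actual data.world benchmark setup or Cube's API.

```python
# Hypothetical semantic model: each measure and dimension carries a
# description and the synonyms users actually say out loud.
SEMANTIC_MODEL = {
    "measures": {
        "failed_tx_rate": {
            "description": "share of transactions that failed",
            "synonyms": ["transaction failure rate", "failed payments %"],
        }
    },
    "dimensions": {"country": {"synonyms": ["market", "region"]}},
}

def build_prompt(question: str, model: dict) -> str:
    """Assemble a text-to-SQL prompt from the semantic model instead of DDL."""
    lines = ["You answer questions by querying a semantic layer.",
             "Available measures:"]
    for name, spec in model["measures"].items():
        lines.append(f"- {name}: {spec['description']} "
                     f"(synonyms: {', '.join(spec['synonyms'])})")
    lines.append("Available dimensions:")
    for name, spec in model["dimensions"].items():
        lines.append(f"- {name} (synonyms: {', '.join(spec['synonyms'])})")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_prompt("What is the payment failure rate by market?", SEMANTIC_MODEL)
print("failed_tx_rate" in prompt and "country" in prompt)
# -> True
```

The prompt now tells the model that "market" means the country dimension and that the failure rate is an already-defined measure, so it only has to pick names rather than reconstruct aggregation logic from raw tables.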

Kostas Pardalis 49:02
So what is the semantic layer, in the example you mentioned, what is Cube bringing to the LLM that the ontology cannot capture, such that we get such a big difference in performance at the end?

Artyom Keydunov 49:19
I think the idea of an ontology is that it's mostly about relationships, right? Ontologies don't have metrics. And in analytics you can ask about metrics anywhere: what is my transaction failure rate, or what is the bounce rate of my website? That needs to be defined somewhere, correctly. So you either define it inside your BI, like you would do for Q in QuickSight, where you'd define it exactly there, but then you have the same problem, because it's going to be scattered across multiple places; or you define it in a standard place like a semantic layer. An ontology by itself just doesn't support that. Maybe if we took an ontology and extended it to some degree to cover analytics, it would help, but in that case it would just become a semantic layer, really, right? Yeah.

Kostas Pardalis 50:12
I get it. Okay. So it's not about how the information is represented, but what information is actually included as part of the ontology. Theoretically you could have an ontology describing metrics, nothing stops you from that, but obviously that wasn't the case here. Okay, that's interesting. And my question is this: what happens when you completely remove the guardrails from the human? Because whether we call them UIs, or a DSL, or a language, what we're really doing is creating a very rigid, strict context in which the human brain operates, right? Now, if we remove that and let the user just type whatever they want, what can be asked is completely open. We don't even have, let's say, syntactic checks, like we have when we write code. And sure, your semantic layer will always be a limited representation of the world out there; you cannot have everything in your semantic layer. On the other hand, the user can ask whatever they like. So how do you handle the user experience in such an environment? I don't expect you to have an answer, to be honest. I think this is one of the really hard problems that people who build with LLMs and bring LLM-based products to market have to answer. But I'd love to hear what you think about it.

Artyom Keydunov 52:07
I mean, that's a good question, and it's obviously more about the distant future, right? But I think it's good to think about it now. So, the flow as it is today, with the sort of data products you have as a data engineer: you build the semantic layer, and whether you're doing it inside your BI or in Cube, you still define metrics somewhere, and then you have a user who consumes that. Usually there's a whole conversation. You have a meeting with, say, the marketing team, and they say, hey, we have this HubSpot data, we want to look at a metric, maybe failed contact request forms. As a data engineer you have a conversation with them, you try to understand what they want and map it to what you have: do I have these metrics in my semantic layer, are they built or not? Then you say, okay, I have this and that, I probably don't have these few metrics and dimensions, but I'm going to build them for you, and next week we'll have a meeting and I'll show you how to use them. And then you do that, right? You build it, and you say: here are the metrics and measures, here's the list of things you can do, drag and drop, enjoy. Then they'll probably say, hey, I'm missing this dimension, can you build it? And it always goes back and forth like that. That's probably the best scenario; the worst would be them just asking for ad hoc reports. In the best case you're building an actual semantic layer and they're using it. But you're still going to keep developing the semantic layer. It's not something you build once and never touch again.
So now, say in the future we have this system you were describing, right? I think, from that perspective, the system should somehow act as a data engineer as well: it should be able to modify the semantic layer. The AI would know: okay, this is the semantic layer I have; you want me to get this information, but I don't have it yet, so I probably need to go and make a change to the semantic layer. Then it would either make an ad hoc change on the fly, extending the semantic layer at that moment to satisfy your request, or make a more universal change that can be applied everywhere. Now the question is: do we trust AI to fully automate that process, or is a person still going to be in the loop? That's a good question. I can see a world where the AI makes a pull request with a semantic layer change, and then a human reviews the change and accepts it. That could be possible. Or maybe it's fully automated and the AI just maintains it. We'll see how it develops, but that's roughly how I see the future flow in my mind.

Kostas Pardalis 55:01
Yeah, it makes total sense. Anyway, I think we're going to have some very interesting months and years ahead of us, with all these new things coming out, and I'm very excited to see what Cube is going to build there. So, handing the microphone back to you, Eric. I know you can't stay away from it for that long.

Eric Dodds 55:24
Yep. Well, either it has a magnetic pull or I have a magnetic pull; we'll never know. But I have one personal question, just to land the plane here. We're always interested in what people would do if they weren't working with data. So if you didn't have a job building tooling or working in the data space, what would you do?

Artyom Keydunov 55:48
I mean, I'm a big fan of tabletop games, and Dungeons and Dragons, and video games, RPGs, all that stuff. I would do that for a living; I think that would be a fun job. But I got pulled into software engineering early on, and I haven't really had a chance to think about what I would do outside of that. I understand it may not be as, how would I put it, profitable. But it's still going to be fun.

Eric Dodds 56:25
Yeah, that's interesting. I mean, anyone who's played a really good game knows how natural it feels, and then you play a game that's designed pretty poorly and you're like, whoa, this is just not fun, you know? So yeah, that would certainly be a fascinating problem space to tackle.

Artyom Keydunov 56:44
Yeah, I think games all involve stories, and I've actually been thinking a lot lately about how we can apply AI to games. I think a lot of people in the gaming industry are thinking about it too, especially for RPGs, where the story is a big part of it. There are a few projects online around Dungeons and Dragons, where you have a dungeon master leading a game, that are making an AI-based dungeon master to run a game for you, based on an LLM. I think that could be cool. So yeah, LLMs can definitely have a really interesting impact on that industry.

Eric Dodds 57:26
Very cool. Well, Artyom, thanks for giving us your time. It was so good to have you back on the show.

Artyom Keydunov 57:30
Oh, yeah. Yeah, I had fun. Thank you. That was really good.

Eric Dodds 57:33
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.