This week on The Data Stack Show, Eric and Kostas chat with Ahmed Elsamadisi, co-founder and CEO of Narrator AI. During the episode, Ahmed discusses limitations of progress, managing data with just one table, and how Narrator is impacting the data stack world.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome back to The Data Stack Show, Kostas. Today we’re talking with Ahmed from Narrator. I am so excited about this conversation because this is maybe the first guest we’ve had who has made the bold assertion that the modern data stack, or at least a subset of it, because I want to be fair here, sucks. And we talk so much about the ways that people are combining these tools to build architecture. And it was problematic for Ahmed, and he’s building a company to try to solve that. So my burning question, not to steal the thunder from you: why does the modern data stack suck? I think this is going to be a great conversation.
Kostas Pardalis 1:03
Yeah, absolutely. Absolutely. I mean, it’s always nice to see people that have, let’s say, a more radical view of the things that are happening out there. And I think this is something that we need, because it’s easy to just settle on whatever we call the best practice and continue with that. We should always be challenging our methods, the things that we’re doing, and the products that we have, and I think Ahmed is doing an amazing job at that. So yeah, I’m really looking forward to seeing why the modern data stack sucks, in a way, but also seeing what the alternative is that he’s building, which might not be an alternative in the end, right? It might be something that works pretty well with the modern data stack anyway. But let’s learn a little bit more in depth what the solution is there and what the problem is that they’re solving.
Eric Dodds 2:04
All right, well, let’s dig in and figure it out.
Ahmed, welcome to The Data Stack Show. We are so excited to chat with you.
Ahmed Elsamadisi 2:11
So excited to be here and dive into all these details.
Eric Dodds 2:14
All right. Okay, so give us your background. You have built various iterations of the modern data stack many, many times over, but give us the timeline. So how did you get into data? And what are you doing today?
Ahmed Elsamadisi 2:27
Yeah, so I started my career actually in robotics. I was really interested in how humans and robots interact together to make decisions: self-driving cars, bigger projects around human-robot interaction. I eventually made my way to AI for missile defense for the US government, so understanding how missiles go through space and how to stop them. I kind of got burned out from that intensity and switched to WeWork, where I built out WeWork’s data team and the infrastructure that you see today. That’s when I implemented the data stack many, many times and decided that there was a fundamental problem. I went on a tour of all these big companies asking, how do you solve this problem? And I pretty much realized that there are fundamental problems in data that these different approaches haven’t solved. So I decided to really rethink data, and that’s where I ended up founding Narrator: a single-table approach to answer any question in data. It’s a lovely 11-column table that you can use to answer any question, and it makes asking and answering questions with data really easy. And the really special thing about Narrator is that that single table is a standard. So whether you’re an airline company, a media company, ecommerce, sales, a crypto bank, which are all very different companies, you can use the same exact 11 columns to answer any of your questions, allowing us to share and reuse analyses. So it’s really bringing the data world together, enabling data analysts to truly make the best decisions.
Eric Dodds 3:52
Love it. Okay, I know that our audience’s ears are burning just like ours are, because we want to know what those 11 columns are, but I will save that for later in the show. I want to start with something you mentioned when we were talking right before the show as well: that you built different iterations of the modern data stack nine or 10 times. And when we were prepping for the show, you said it sucked. And that’s such an interesting thing to hear, because in the industry in general, and of course on the show, one thing we hear a lot is, well, you’ve got to move towards the modern data stack, right? Or, these are the components of the modern data stack, or this is the right architecture, etc. And I want to know why, and I’d love for you to be as specific as possible. Why did you come to the conclusion “this sucks,” when most of the industry is trying to push everyone towards modernity?
Ahmed Elsamadisi 4:52
I think that everyone who has implemented the stack more than once will tell you that it seems like the only way, and it’s a necessary evil. So at the core, you have your data everywhere, and you dump it into a warehouse; we call this ETL. There are a lot of tools that have automated this, so that process has kind of been solved. Then you have your warehouse, and there are a lot of different warehouses you can use, different flavors, different benefits, and that’s been solved. Then you have your middle layer, which we call a transformation layer, where you actually use data and write SQL to represent the questions that you need to answer. That table gets materialized and piped to your BI tool, a tool that allows you to build dashboards and visualize it. And anyone who’s ever done this will tell you what happens. You build a dashboard, then there’s a follow-up question, or your team says, yeah, but I want to slice the number of emails by how many people are repeat purchasers. And they go, cool, let’s go back to the data team, let’s build a new transformation, let’s build a new materialized view, and let’s build a new dashboard. And as time goes by, the number of transformations you have in the middle continues to grow, the amount of data that’s duplicated across multiple transformations continues to grow, and it actually gets so messy that you often have 700, 800 transformations, each answering a series of questions. Then you end up dealing with: hey, how come these two dashboards don’t match? How come these numbers don’t match? Why is my warehouse so slow? How come everything is so expensive?
And because of this entire cycle of constantly needing to go back and build these new transformations, the time to answer a question goes into weeks and months, and every new question goes into these complex, thousand-line SQL queries. What we’ve done is build different ways to manage this middle layer, but we haven’t solved it. Forty years ago this was Microsoft stored procedures, and you would do that in SQL Server. Then we added more ways to build a staging layer. Then we had Luigi, which was Spotify’s version of it, then we added Airflow, then we have dbt. All this history has built better ways to manage that SQL query. But the fundamental issue is that you still need a thousand-line SQL query to answer a series of questions, and that doesn’t go away. Now, why does that happen? Why do you need a thousand-line complex SQL query? That itself is the underlying problem. And it is because data is captured in separate systems that don’t relate, so you need to figure out ways to stitch them. How do you tie an email to an order? Everybody wants to know email attribution to orders. Well, you need to go from email to web page, web page to UTM parameters, UTM parameters to click, click to order, and assume no duplication. And the complexity of doing that simple join across all these systems ends up generating these really complex queries. I think that’s where a lot of data stacks consistently fail. No matter whether you’re Apple, or Airbnb, or Spotify, everyone will tell you that they have an entire team of people doing this. It has now become a job that people call the analytics engineer, whose entire job is to build these transformations so you can answer questions. Ask every company: how long does it take you to answer a follow-up question?
How long does it take you to answer every new ad hoc question you get? Is there an infinite backlog of questions on the data team? And the answer is always the same. I think the problem we need to solve is why this transformation layer causes all these roadblocks, and that’s the problem that I went out to really innovate in solving.
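To make the join problem Ahmed describes concrete, here is a minimal sketch in Python with SQLite. The table and column names (emails, sessions, orders, utm_campaign, and so on) are hypothetical, invented purely to illustrate the multi-hop chain, not taken from any real system discussed in the episode:

```python
import sqlite3

# A toy sketch of the multi-hop join: tying an email send to an eventual
# order means hopping email -> web session (via UTM params) -> order
# (via customer id). All names here are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE emails   (email_id TEXT, campaign TEXT, contact_id TEXT);
CREATE TABLE sessions (session_id TEXT, utm_campaign TEXT, contact_id TEXT, started_at TEXT);
CREATE TABLE orders   (order_id TEXT, contact_id TEXT, ordered_at TEXT);

INSERT INTO emails   VALUES ('e1', 'spring_sale', 'c1');
INSERT INTO sessions VALUES ('s1', 'spring_sale', 'c1', '2023-01-02');
INSERT INTO orders   VALUES ('o1', 'c1', '2023-01-03');
""")

# Each hop needs a key that may or may not exist in the raw systems;
# in real stacks this chain grows to many more joins plus dedup logic.
rows = con.execute("""
    SELECT e.email_id, o.order_id
    FROM emails e
    JOIN sessions s ON s.utm_campaign = e.campaign
                   AND s.contact_id   = e.contact_id
    JOIN orders   o ON o.contact_id   = s.contact_id
                   AND o.ordered_at  >= s.started_at
""").fetchall()
print(rows)  # [('e1', 'o1')]
```

Even in this toy version, one attribution question already takes a two-join chain with assumed keys; each new system adds another hop.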
Eric Dodds 8:29
Can you dig in just a little bit more into… You said that some of these modern tools, so you have stored procedures and then Airflow and then now dbt, what’s the fundamental limitation there? They’re making it easier to manage these transformations, but are they not making it easier to actually write them? Is there a fundamental underlying challenge with the structure of the data and the disaggregation? There’s progress that’s been made, but what are the limitations of that progress?
Ahmed Elsamadisi 9:13
I don’t think it’s a problem of the tooling. It’s a problem of the approach. The approach of building custom tables, to me, is like rebuilding a car for every question, where every piece you need has to be custom cast and molded, instead of a world where different pieces fit together really easily. So right now, the fundamental problem is that SQL requires you to join based on a key, and if that key doesn’t exist, you have to hack it together with a bunch of complexity. That is the problem. The tool you’re using to manage SQL doesn’t really matter if your SQL doesn’t solve this fundamental problem. And I think the core problem we realized is that it is actually a join problem, because joins depend on foreign keys, and foreign keys often don’t exist. So to solve it, you actually need to reinvent how you join and how you structure data, not how you manage transformations. Managing transformations is what dbt does; love the tool, I love Tristan as well. It’s one of the best tools to manage transformations in the traditional way of doing data, which is known as the modern data stack. But that way itself is fundamentally flawed. You need a different way that allows you to work with the way modern data is actually flowing, which is: how do you ask and answer questions and bridge all your systems quickly, easily, in seconds? And that’s the point. That’s the thing that we have to really highlight, because a lot of questions that appear so complex, where you have to write so much SQL, in Narrator appear so easy that you can answer them with a couple of clicks. And that’s because we solved the underlying problem that lies within SQL, which is joining the data.
Eric Dodds 10:57
Okay, so let’s dig in. How do you do that with only 11 columns? Because honestly, in many ways, it sounds too good to be true. And I know there’s no decision you make technically that doesn’t have a trade-off, so I want to get there as well. But if you think about even a moderately sized company that, say, has some behavioral data in their warehouse and is loading a bunch of structured data from marketing tools, or CRMs, or whatever, right? You have a bunch of materialized views. It’s not that hard to have tens, hundreds, thousands of tables, right? You can get there really quickly, and when you do, everyone knows the pain that creates. So it sounds crazy that you solve all that with an 11-column table. Tell us, how do you do that?
Ahmed Elsamadisi 11:54
Yeah. So first, I think we like to say one table because of the kind of shock factor; it’s like 95% one table, and you can add additional tables, but that’s not the core. The core single table that we’re going to discuss is none other than the activity schema; you can see it at activityschema.com. It’s an open source project that discusses this one-table approach. And this is really just taking the way that we speak about data and bringing it to the way you structure it. It’s just a time-series table: customer, time, action, and we abstract three features, so feature one, feature two, feature three, plus a couple of additional columns, but that’s the core of it. So it’s customer, time, action, and features. And you’re thinking, well, how can I fit everything I need into three features? I have so many features I need; I need like 100 features. And that’s where the tool Narrator provides, called Dataset, comes in: a way to pull in what we call borrowed features from different activities. So let’s take a simple example. I want to know, for every email, did that email then turn into an order? I want to know what the campaign of that email is. I want to know, when that person came to our website from that email, how many pages they viewed, and what page they landed on. It seems like we’re already talking about 10 to 50 features, right? But if you break it down into actions, you have the opened email action, which has one feature, the campaign; you have the visited website action, which has path, which is also one feature; you have the viewed page action, which might have some features on it, but what matters is just the fact that the customer viewed a page; and the fact that they completed an order. So now it’s four activities.
And all I’m doing is really pulling the data from each activity. So if I want to know whether, after that email, they have an order, I can pull by default the timestamp of the next order; I can count how many page views they had in between and say that’s the number of page views; I can grab the first page view from the started session activity. Pretty much think about it as taking this really long table and doing a very clever, fancy pivot, pulling the columns I want from each of these activities. And when you do that, it turns out that if you represent your business as this really rich customer journey, you don’t need that many features per action, but you do have a lot of actions. Those actions are where all that nice, rich information comes from, and because time and occurrence and all that stuff is given to you by Narrator out of the box, you don’t need to pre-build features like first visited page, last visited page, number of visited pages, number of visited pages in the last 30 days. All those can be recomputed on the spot, instantly, when you’re answering the question. Does that make sense?
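As a rough illustration of the single-table pivot Ahmed describes, here is a sketch in Python with SQLite. The activity names, the features, and the correlated-subquery approach are assumptions for illustration only; they are not Narrator's actual implementation:

```python
import sqlite3

# One time-series table of (customer, ts, activity, feature_1), with
# "borrowed" columns pulled from other activities via correlated
# subqueries instead of cross-system joins. Names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE activity_stream
               (customer TEXT, ts TEXT, activity TEXT, feature_1 TEXT)""")
con.executemany("INSERT INTO activity_stream VALUES (?,?,?,?)", [
    ("c1", "2023-01-01T09:00", "opened_email",    "spring_sale"),
    ("c1", "2023-01-01T09:05", "viewed_page",     "/pricing"),
    ("c1", "2023-01-01T09:07", "viewed_page",     "/docs"),
    ("c1", "2023-01-01T10:00", "completed_order", "o1"),
])

# For every opened_email: the next order after it, plus the number of
# page views in between, computed on the spot from the same table.
rows = con.execute("""
    SELECT e.customer,
           (SELECT MIN(o.ts) FROM activity_stream o
             WHERE o.customer = e.customer
               AND o.activity = 'completed_order' AND o.ts > e.ts) AS next_order_at,
           (SELECT COUNT(*) FROM activity_stream p
             WHERE p.customer = e.customer
               AND p.activity = 'viewed_page'
               AND p.ts > e.ts
               AND p.ts < (SELECT MIN(o.ts) FROM activity_stream o
                            WHERE o.customer = e.customer
                              AND o.activity = 'completed_order'
                              AND o.ts > e.ts)) AS pageviews_in_between
    FROM activity_stream e
    WHERE e.activity = 'opened_email'
""").fetchall()
print(rows)  # [('c1', '2023-01-01T10:00', 2)]
```

The point of the sketch is that "did an order follow, and how many page views happened in between" never touches a second table; everything is derived from one customer/time/action stream.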
Kostas Pardalis 14:48
Yeah, it does. So how do we populate this one table from all the data that we have? I mean, obviously, let’s say this is the data model that makes sense to have on your data warehouse for analytical workloads; obviously, the data that is coming in is not modeled for that, right? So again, we are going to do the extraction and the loading of the data. After we have staged the data that we have loaded into the data warehouse, how do we get to the point where we have, let’s say, this well-curated one table to rule them all?
Ahmed Elsamadisi 15:26
Yeah, great question. So Narrator provides a very, very thin layer that’s known as our transformation layer. And this is not like a dbt transformation layer, because you’re really just mapping columns. You’re pretty much saying, for example: my internal database has a users table, and I want to have an activity like “created user”. And you just map it to the 11 columns; you’re saying the timestamp is the createdAt of the table, the action is “created user”, here are the features that I care about. It’s a very thin layer to map it, so thin that it averages around 12 minutes to write. I think most customers that have experienced it see how easy it is to take your data, whether it’s already event-based, or relational, or tickets. And we have a library of all these common transformations in our doc site. You just map your data to that simple structure, defining each activity as a building block, and then Narrator migrates that data, does a bunch of caching to make it really nice and easy and fast to use, and provides you with an interface to actually ask and answer these questions. And the good thing about doing it with activities is that you only ever need to add a new building block when you have a new concept to add, not when you have a new question. With the tables you materialize in the modern data stack, every time there’s a new way of relating data, you build a new table. Here, you don’t do that; you just build what’s called a dataset, and that’s a couple of clicks. Every time you have a new concept added to your company, you add it. So you’re often writing these activity transformations in the first week, and then you add one every other month; it’s really rare that you’re adding a bunch of new activities.
Instead, you’re taking the building blocks that you’ve built and reassembling them to answer all sorts of questions.
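A minimal sketch of what such a thin mapping might look like, again in Python with SQLite. The users table and its columns are hypothetical; the point is that the whole transformation is a small column mapping rather than complex SQL:

```python
import sqlite3

# The "thin transformation" idea: mapping a raw table's columns onto the
# fixed activity columns is just a small INSERT ... SELECT.
# The source table and column names are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id TEXT, email TEXT, created_at TEXT, plan TEXT);
INSERT INTO users VALUES ('u1', 'a@x.com', '2023-02-01', 'pro');

CREATE TABLE activity_stream (customer TEXT, ts TEXT, activity TEXT, feature_1 TEXT);
""")

# The whole "transformation": point each raw column at an activity column.
con.execute("""
    INSERT INTO activity_stream (customer, ts, activity, feature_1)
    SELECT email, created_at, 'created_user', plan FROM users
""")
rows = con.execute("SELECT * FROM activity_stream").fetchall()
print(rows)  # [('a@x.com', '2023-02-01', 'created_user', 'pro')]
```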
Kostas Pardalis 17:09
And how is this table implemented? Is this a materialized view that gets populated inside the warehouse? Is it a logical view? What’s the actual implementation?
Ahmed Elsamadisi 17:22
Yeah, it’s an actual table. It’s a table that we insert into, update, and manipulate. Narrator does a lot of additional things like identity stitching across all your systems, handling fraud users, anomaly detection, and all that stuff. So we’re actually just constantly updating and mutating one single table. We’re sorting it and partitioning it based on your warehouse to optimize performance; we do a lot of stuff on that one table to make it really performant and really nice and fast. And then for dataset queries, there’s no free-form SQL in there; you’re actually using Dataset to answer any question, and all the queries it generates are super optimized for speed on that single table. So in your warehouse, you’ll have a schema or a dataset, depending on which warehouse you use, that’s called narrator. And in there, you’ll see the activity schema, or the activity stream, as it’s often called.
Kostas Pardalis 18:12
Okay. Yep. All right. So let’s talk a little bit more about the management of this table, right? I mean, obviously, this table relies on the underlying data that is getting loaded in the warehouse, right? How do you handle things like, okay, let’s say someone accidentally drops a source table that is used? What happens then? Are changes or, let’s say, removals in the source data going to be reflected with deviations in this table? What’s the logic behind working with data that might cease to exist at some point, or might turn out to be wrong data? How does this work?
Ahmed Elsamadisi 19:00
Yeah, great question. So one of the benefits of only having modeled activities is that the average query is 20 lines; they’re really small queries. We’re updating the stream incrementally, so every 5 to 10 minutes we’re importing the new data into the activity stream. Let’s say we go to insert and the query fails because the data is not there: we take that transformation and put it into what’s called a maintenance state. Anyone using that data will get a flag, hey, data isn’t up to date, something went wrong. You get notified, you can go in and fix it, and once the data is up to date, it gets resynced and the maintenance state goes away. We also provide out-of-the-box anomaly detection, so if data ever stops producing rows you get alerted, and you can also write your own custom alerts on it. So we’ve done a lot of stuff to make sure that as your data is migrating, it’s correct. We do a lot of duplication checks for IDs and stuff like that as well, to ensure that the data you’re inserting into your warehouse is always accurate. And the benefit is, again, because it’s a single table, we can do a lot more checks very cheaply and easily, because we have a guaranteed structure and guaranteed assumptions. The stream is always incremental, always time-series, and we get a lot of benefits from that. So people often find that managing that single table is actually the easiest part; it’s super cheap, because it’s often on the raw data. Because it’s so simple, you’re often just pointing a timestamp from your tables to a structure. There are really few complex queries that you’re putting in activities; all that stuff happens in datasets.
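The incremental sync with a maintenance state can be sketched roughly as follows. This is an illustrative guess at the pattern, not Narrator's internals; the high-water-mark query and the maintenance dictionary are assumptions:

```python
import sqlite3

# Sketch of incremental syncing: each run inserts only source rows newer
# than the stream's high-water mark, and a failure flips the
# transformation into a "maintenance" flag instead of silently serving
# stale data. All names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source (id TEXT, created_at TEXT);
INSERT INTO source VALUES ('a', '2023-01-01'), ('b', '2023-01-05');
CREATE TABLE activity_stream (customer TEXT, ts TEXT, activity TEXT);
INSERT INTO activity_stream VALUES ('a', '2023-01-01', 'created_user');
""")

maintenance = {}  # transformation name -> error message, surfaced to users

def sync(name: str) -> None:
    try:
        con.execute("""
            INSERT INTO activity_stream
            SELECT id, created_at, 'created_user' FROM source
            WHERE created_at > (SELECT MAX(ts) FROM activity_stream
                                WHERE activity = 'created_user')
        """)
        maintenance.pop(name, None)   # healthy again: clear the flag
    except sqlite3.Error as err:
        maintenance[name] = str(err)  # flag it instead of crashing

sync("created_user")
count = con.execute("SELECT COUNT(*) FROM activity_stream").fetchone()[0]
print(count, maintenance)  # 2 {}
```

Only the new row ('b') is inserted on this run; if the source table had been dropped, the same call would have populated the maintenance flag instead.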
Kostas Pardalis 20:34
All right. So let’s focus a little bit more on the modeling side of things. Now, one of the things that I have experienced when I’m talking with companies, or observing what a company is doing with their data, is how the semantics, let’s say, might change for the same thing. What is a user, for example, or a customer? How is a customer perceived by sales, versus how the customer is perceived by product, or how a customer is perceived by marketing, right? And just to give an example: you go to sales and chat with them, and you start hearing about prospects, and leads, and opportunities, and contacts, and all these things that we pretty much all learned to live with, because Salesforce became a thing and its schema became a way of representing sales in the world. Yeah. So how do you deal with that? Because from what I understand, the core concept of your modeling is that everything is around the concept of the user, the customer, let’s say, right? How do you differentiate there? And how do you make this accessible to people that use different syntax and semantics for the same concept?
Ahmed Elsamadisi 21:53
Yes, so honestly, this is actually one of the best parts about Narrator. One thing that we see a lot when you’re depending on dashboarding is that you have to force everyone to align on one definition: total sales has to be total sales, and total customers has to be total customers. In Narrator, you can have multiple identifiers that map to what is your global customer. And customer could be anything: we have ride-sharing companies where the customer is a car, and companies like WeWork where the customer is a building. You have different ways of defining customer. So what we talk about a lot is the idea of that entity having events. You might have a created lead activity, a created opportunity, a closed opportunity, a signed contract, a sent contract, moved in, made payment, started subscription. And the reason that’s so important is when you deal with the argument, and I heard this at WeWork a lot: well, when is the sale? Is it when they sign? Is it when they move in? Is it when they pay the first invoice? Is it when they start their subscription? What is the sale? You don’t have to actually fight that battle anymore. Instead, with Narrator you have this concept of a dataset: you have the activities, and you can represent them differently. Then when you go to create your KPI, your key performance indicator, which Narrator allows you to create, you can choose very explicitly what that is. And when the user sees the KPI, they can always click into it, see the underlying dataset, and say, oh, this uses the timestamp of the first created opportunity activity. It’s just a lot easier to get that transparency.

So when you’re modeling the data, you don’t need to model based on how it’s going to be used; you need to model based on what it is. Then, when it’s being used for a specific question, the user can choose very specifically whether they want it from the sales perspective or the invoice perspective. And then there’s also the global company KPI, which the company has decided is the thing they’re going to track and call total sales, and anyone who clicks on it can see, oh, they’re using signed contract as their definition of total sales. I think that creates three layers: the company’s global KPI, which people are using to measure; any dataset, which is answering specific questions; and your building blocks, which represent real actions that the customer is taking. It just creates very little ambiguity. A question we get at Narrator often is, “but what is that actually?” And it’s like, oh, just click on the dataset and see exactly what that means. The words are things like created opportunity, or made payment, and you can click into that and see the exact SQL, and that SQL is 20 lines, so you can easily understand it. It creates that separation so the data team isn’t fighting. And if the company decides, actually, we’re redefining total sales to be based on when the first invoice is made, you don’t need to talk to the data team about that; the data is already modeled. You just choose that for your dataset, and that can be done without involving data at all. And everything will cascade nicely, because, again, your building blocks are what you’re modeling, not the final results. You’re representing the world as these activities; everything else happens in the dataset layer.

And you can build datasets to combine them, you can build KPIs, and you can change those things without ever thinking about going back to the data model.
Eric Dodds 25:15
I’d like to dig into that with a specific question. And this is inherently biased, because I actually got to use Narrator, kick the tires on it, which was really, really cool. And I’d love to know, because unfortunately I didn’t dig in with our analyst team and data engineering team, but I was sort of a consumer of a question that we were trying to answer. In fact, I’ll tell you what the question is, because maybe that’ll be helpful, and then I have a specific question about how something happens under the hood. So the question we were trying to answer, which sounds like an easy question but actually ends up being difficult, is: does consumption of a particular type of blog content seem to influence an opportunity being created in a certain time period? So it’s like, okay, whether it’s thought leadership or engineering content or whatever, is increasing consumption a leading indicator of increased likelihood? And actually, it was very elegant the way this happened, because Narrator had first-touch and influence views built that were very easy to get, which was really cool.

But here’s my question under the hood: what makes that, and correct me if I’m wrong here because I’m not an expert in SQL, but part of what makes that difficult in raw SQL is actually not necessarily looking at pageviews and then saying, okay, was that user associated with an opportunity eventually, right? It’s that you may actually have multiple users who have entered the funnel but are related to the same account, which is also related to the opportunity. But in Salesforce, of course, the data structure doesn’t connect everyone that way. And so when you talk about something like influence, as opposed to something very linear like first touch, where user did A and then B happened in some specified time period, now you’re talking about a group of users who are associated with a different object, a different table in the warehouse, and what you want to know about is the opportunity, which is yet another table in the warehouse. There’s a ton of key-crossing across these tables to do something like that. And this is all assuming that your behavioral data has a layer of identity stitching as well, where you have unique IDs for the anonymous behavior, because that can also be unidentified, blah, blah, blah. Anyways, you get it. I won’t keep going.
Ahmed Elsamadisi 27:58
Awesome. So, first of all, that's a great question. It bridges multiple systems, it's really an analysis question, and you've probably seen a Narrative, which is one of the benefits of standardizing data: it allows us to build and reuse analyses, which is our intelligence that generates these stories that help you understand your data automatically, in seconds, with real answers. And the question you asked has a lot of nice complexity to it, right? Multiple systems, multiple tables, how do you bring that together, all sorts of pieces that make it really, really complicated. And if you talk to the data team that set it up, they'll probably tell you they set up those activities in a 45-minute session, because that's our proof of concept — it's usually a 45-minute session. So they set that up, got that answer, and gave it to you; the entire setup was 45 minutes. So what did they do? Two pieces here are really critical. One is that at Narrator, because we built an entire company around a single table, we got really good at identity stitching. So we have a very, very proper way of stitching data together, so that the thing you're talking about reads as one journey. And everything is a function of time — if you notice what Narrator does, everything is a function of time. I don't know the exact setup, but it probably looked something like this: "viewed content" was an activity, and it had the anonymous ID of whatever cookie belonged to the user who viewed the content. And then you probably had a contact or account ID that was your global identifier — just how you thought about your business, which is what creates the opportunity, creates the lead, and all those pieces for the user.
So you have an account identifier that applies to both the users and the opportunity, and you pass it through on the activities. That's the customer. Then Narrator allows you to create tiny little snippets that map data together. So you probably have one more snippet that says: hey, we know this cookie is now this account ID. There's a lot of explanation of how that works, but that's it. You build those as transformations, Narrator stitches them together and combines them, and then when you're asking that question in our tool, you can right-click on any piece of data, see the exact customers, and right-click into that customer's entire journey. So you can see that customer: viewed page, viewed blog, viewed blog, viewed blog, created opportunity, viewed blog… and you're able to understand the differences between those. We talked about the difference between knowing how many there were — a simple count — versus knowing the rate — the count divided by the time from the first one — versus the first content they viewed and the last content they viewed. All those words like "first" and "last" are still talking about actions the customer is taking. And that's kind of the beauty: the way you ask the question, you convert it into these action-based questions. Because when you say "how did the customer…", you've already combined in the fact that it has to be the same person. You're not asking how something affects something else with nothing tying them together — they're tied together by a person. And you're talking about two building blocks, viewing content and creating an opportunity, looking at a conversion rate, and trying to optimize it.
So just in the way you've asked the question, you've done 80% of the hard part of preparing data. All they did was take that same structure of how you imagined the data happening — customer views blog, customer creates an opportunity — and we enable you to create that structure, and then quickly structure the data the way you asked it. That's what makes the experience so seamless, and kind of magical: you've done the three things in your head for us already, and we just represent it the way you think about it.
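The stitching step Ahmed walks through — resolving an anonymous cookie ID to a global account ID so the activity stream reads as one customer's timeline — can be sketched roughly like this. All names and the layout here are hypothetical, not Narrator's actual implementation:

```python
# Raw activities arrive keyed by whatever identifier the source had:
# web events carry an anonymous cookie, CRM events carry an account ID.
activities = [
    ("2023-01-02", "cookie_abc", "viewed_content"),
    ("2023-01-05", "cookie_abc", "viewed_content"),
    ("2023-01-09", "acct_42",    "created_opportunity"),
]

# A stitching "snippet": we learned (e.g. at form submit) that this
# cookie belongs to this account.
identity_map = {"cookie_abc": "acct_42"}

# Resolve every activity to the global identifier; the journey then
# reads as one customer's timeline, ordered by time.
stitched = sorted(
    (ts, identity_map.get(raw_id, raw_id), activity)
    for ts, raw_id, activity in activities
)
journey = [activity for ts, customer, activity in stitched if customer == "acct_42"]
print(journey)  # ['viewed_content', 'viewed_content', 'created_opportunity']
```

Once every row carries the global identifier, "first content viewed" and "conversion to opportunity" become simple questions over one ordered list.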
Eric Dodds 31:47
Yeah, super interesting, super helpful. Okay. And I can verify — it was really cool to see that happen. Okay, I do want to play devil's advocate, and I'm actually going to ask Kostas a question here, because this is beyond my technical depth. When you talk about activities as the way you view the entire world, you're essentially converting every type of data into event data. Okay. And Kostas, there are a few things that come to my mind, but I would love to know: that's a non-trivial lens to put on all data. What comes to your mind as potential challenges, benefits, whatever, when you view the entire world as timestamped activities?
Kostas Pardalis 32:35
Yeah, that's a pretty interesting question. Usually there are questions you can better answer when you keep track of everything that has happened — having events is the way to do it — and there are questions that are much easier to answer when you have already replicated the current state of your concept or entity, or whatever you want to call it, right? The trade-off with events is that, yes, they equip you to measure change and things like that, but if you want to see how things look right now, you'll probably have to go and replay the whole journey — get the data there and reconstruct the current state. That's a naive description I'm giving, but it's usually what people have to deal with from an engineering perspective when deciding: am I going to work with mutable state, or keep events and work with events? Events usually give you this extra expressivity, but there's some kind of explosion in the amount of data you have to deal with, and in what it means to reconstruct the whole state by iterating over the events you have. Now, obviously there are things you can do only if you have events, right? If you want to see the journey of your customer, you need to have all the events there — otherwise, how are you going to do that? So turning everything into events makes sense, in a way. But the question — and that's a question I have for Ahmed — is: when does having this model become a problem?
What are the questions that are not impossible to answer, but harder to answer, because you have this different way of describing the world?
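The trade-off Kostas is describing can be sketched in a few lines — deriving current state means folding over the whole stream, while journey questions are only answerable because the events were kept. This is a hypothetical illustration, not anyone's implementation:

```python
# An append-only event stream: (timestamp, entity, activity, value).
events = [
    ("2023-01-05", "user_1", "updated_address", "Athens"),
    ("2023-02-10", "user_1", "updated_address", "Berlin"),
    ("2023-03-01", "user_1", "submitted_ticket", None),
    ("2023-04-20", "user_1", "updated_address", "Lisbon"),
]

def current_address(stream, user):
    """Current state: replay the stream, keeping the last value seen."""
    addr = None
    for ts, uid, activity, value in sorted(stream):
        if uid == user and activity == "updated_address":
            addr = value
    return addr

def journey(stream, user):
    """Journey: only possible because every event was retained."""
    return [activity for ts, uid, activity, _ in sorted(stream) if uid == user]

print(current_address(events, "user_1"))  # Lisbon
print(journey(events, "user_1"))
```

A mutable "current state" table would answer the first question in one lookup, but would have silently discarded the history needed for the second.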
Ahmed Elsamadisi 35:08
Great question. So, a couple of things to highlight. One benefit of us at Narrator putting in this really intense, rigid structure is that it allowed us to solve a lot of the core problems using datasets. One thing you can easily do with any activity is say: give me the last-ever updated subscription, or give me the last-ever status of this company. When you can use words like "last ever", it's really easy to know what the current state is. So what we find with a lot of our customers is that as things change — say you have a contract object, and that contract object is changing — if you want the current contract, you say "give me the last-ever updated contract" and you get that contract object. However, sometimes you're asking: what was the contract at the moment that person submitted a ticket? Those questions are nearly impossible with non-event data. But here you just say "give me the last updated contract before they submitted the ticket", and now you have the state at that moment. So the benefit is that generating current state comes instantly with "last ever", and we can also generate state at any given moment in time. This was inspired by — if you're familiar with it — a very well-known big-data paradigm called the Lambda architecture, where you have a streaming layer and then a batch layer to process it. One of the benefits of that approach is that it lets you reconstruct data at any moment in time. So those change questions can be easy. The second thing you asked is: what about things that aren't changing, like a customer's age, maybe, or their gender — things that change less often than you'd think?
Well, I said Narrator is mostly a single table, but we do have what we call an attribute table. Because everything is centered around a customer, we have a materialized view — like a dim_customer, for example — where you can add all the static attributes of the customer, and that makes things really easy. You often don't add things like when they first signed up — you don't add timestamps there, and if you do, Narrator will alert you that you shouldn't. But name, address, and so on, you can put there. If it's changing, you make an activity. So you might have an "updated address" activity, and you can ask for their first updated address to know where they were when you acquired them, or the last one to know their current address. That's how we handle a lot of these cases, and we handle them in the product. Now, about the single-table approach — I'll tell you the honest truth, it has two huge, huge downsides. The first is that querying a single table can get really hard. Take all the SQL you've learned and kind of throw it away. Kostas, you've done this before — you can imagine that a "last before" SQL query is very non-trivial, and if you write it without realizing, you might make it very inefficient. You think: oh, I can just use the LAST_VALUE window function? Good luck. What happens if it doesn't exist? What happens if there are duplicates? So the querying is extremely difficult — that's the challenge of having a single table.
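The "last before" relationship Ahmed describes can be sketched against a toy activity stream. The table and column names here are hypothetical, not Narrator's actual schema — the point is that a correlated subquery handles the "no prior row exists" case by returning NULL rather than a wrong value, which is where a naive window-function attempt tends to go wrong:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity_stream (ts TEXT, customer TEXT, activity TEXT, feature TEXT);
INSERT INTO activity_stream VALUES
  ('2023-01-01', 'c1', 'updated_contract', 'basic'),
  ('2023-03-01', 'c1', 'updated_contract', 'pro'),
  ('2023-03-15', 'c1', 'submitted_ticket', NULL),
  ('2023-06-01', 'c1', 'updated_contract', 'enterprise');
""")

# For each ticket, find the last contract updated BEFORE the ticket.
row = conn.execute("""
SELECT t.ts,
       (SELECT c.feature
        FROM activity_stream c
        WHERE c.customer = t.customer
          AND c.activity = 'updated_contract'
          AND c.ts < t.ts
        ORDER BY c.ts DESC
        LIMIT 1) AS contract_at_ticket
FROM activity_stream t
WHERE t.activity = 'submitted_ticket'
""").fetchone()

print(row)  # ('2023-03-15', 'pro')
```

The ticket on March 15 correctly picks up the "pro" contract from March 1, not the later "enterprise" one — the state at that moment in time, as Ahmed puts it.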
And the second thing — if you notice, with every question you're asking me, I'm doing something that makes Narrator work: I'm translating your question to be a little more defined in terms of activities. That's a mental shift I've experienced and mastered, but a lot of our customers take a couple of weeks to learn this new way of thinking. In SQL you can imagine stacking the data, the joins, and how it works; with this new approach, you have to relearn the mental model of combining data using what we call these temporal relationships. We actually find that customers who come from a deep SQL background have a harder time learning our relationships than people who come from a marketing or product mindset, because they're used to thinking about things from this perspective, while SQL people often think from a table perspective. So that mental-model learning is a big overhead, and the fact that the table is really hard to query by hand is another. So what we decided to do was build a company around it — that's the reason why the activity schema, the single table, isn't just an open-source project. We found that when we open-sourced it, people tried to use it and said: hey, this sucks, querying this thing takes forever. And they were right. So we spent years building and iterating on the user experience, so you can generate any table using this tool called Dataset — which Eric got to see — which is a really seamless, nice way of combining data. So we solved the first problem with product, and we solved the second problem with iteration. We give customers lots of examples, we do a lot of documentation.
We do a lot of blogging, and a lot of automatic analyses that we generate. We have a series of templates that help you see how to ask a question — an entire library, depending on your industry, of the different types of questions you can ask, showing how to map each scenario and answer it with your own data, in under 10 minutes each. So it's a huge educational effort. But one of the things you said that I found beautiful is this language that Salesforce created. I studied Salesforce for a while, because I think they're one of the most interesting companies. Prior to Salesforce, everyone had their own way of structuring sales data — every company had its own sales data model. Nowadays we say, of course every sales process can be represented with leads, opportunities, tasks, and contacts — but that's only because of Salesforce. They changed how we thought about the data and standardized the data model for sales. And the way to think about Narrator is that that's exactly what we're doing for data. We say: here's a standard data model. Yes, it is very rigid, but we've shown you can do so much with it and answer so many questions with it. And we've taken all the trade-offs and costs — the downsides of using this data model — and said it is Narrator's job to solve them: making it super easy to create for anyone, technical or non-technical, using tools like Dataset; making sure you can see the value instantly in beautiful analyses; making sure the benefits of the assumptions show up as automatic anomaly detection, instant analyses that can answer your questions, and templates for understanding CAC, LTV, and all sorts of complicated analyses. We give you so much.
So that it's worth the mental-model overhead of thinking differently about data. That's the goal. And that's kind of why I end up saying the modern data stack sucks — because all these approaches are so different. Every company you go into, you have to learn a new way they've represented their data across the thousands of tables they built. With Narrator, I can switch between any of the companies that use us and instantly answer any question, because it's a standard way of thinking and a standard way of answering questions. And we've shown it's flexible enough to answer any question. If you've come to any of my talks, you know I always ask the audience this: message me on Twitter, on LinkedIn, email me directly, and give me a question to answer. I'll either show you publicly exactly how you can express that question in Narrator, or I'll tell you if I can't. Having done this for five years, you see that almost always, questions can easily fit into this structure — you just have to think about it a little differently. And that's the downside: thinking differently. But we believe the upside, the value of speed, is just so incredible that it's a no-brainer for us.
Kostas Pardalis 42:52
Yeah, absolutely. And that's where the opportunity is, and that's why you're building a company around it. Okay. I find what you're saying about the comparison with Salesforce interesting, because I think there are similarities, but there are also some big differences, right? Salesforce went in with one domain to model, which was sales. With Narrator you're doing, in a way, not the opposite, but you're saying: okay, I have one model that is abstract enough and expressive enough to cover all the different domains out there, right? So in a way your job is exponentially harder than Salesforce's, I would say, because you now have to deal with all these different domains and the people working in them, and try to help them think in a different way. But obviously, if you manage to do that, the reward is probably going to be even bigger. But I have a question that has to do with expressivity, because this whole time we've been talking about customers and users — the center of this data model is very much around them. You have a user who acts, so you have an entity and activities, right? Is this all that we need in a company? Or are there also other activities and other things happening that maybe the future will also address?
Ahmed Elsamadisi 44:37
So, great question. There's two parts there. What we've done is not try to abstract away your business — I think trying to build a model that represents every business is a very hard thing. What we've done is build a model that represents how we ask questions, and what we've actually solved is behavioral change. We built a model that's really good at understanding change, and what we've shown is that almost every question is actually a function of understanding change. So when you think about a company, we think about a single table per core entity. Take the ride-sharing company we talked about — they have two streams; call them the customer stream and the scooter stream. The customer stream is everything the customer does: customer opens the app, customer buys, customer starts a ride, customer submits a ticket, customer makes a payment, customer enters a new zone, customer parks. But then you also have a separate stream, where the "customer" is actually a scooter: a scooter gets ridden, a scooter goes into maintenance, a scooter gets repaired, a scooter gets purchased, a scooter gets launched, a scooter gets presented to a customer — all sorts of things happen to a scooter. It turns out that everything in that company — I'll say 99% of things — can be represented as some sort of global entity that you're trying to understand how it's changing, and its actions. Whether an action is done to it, because of it, or by it doesn't matter; this action happened, and it's tied to this core object. It's really representative of how we speak: there's a noun, a verb, and you're just talking about the actions that are happening. So what we see is that most companies have one stream, but some companies — like us at Narrator — have two streams: we have a company stream and a person stream. We use the person stream to understand people's behavior.
We use the company stream for our financial reporting and for understanding a company's onboarding — when a company adds a user, the company behaviors we care about from the company's perspective, like a company pays an invoice. So you can create more than one stream, and Narrator makes it really easy to switch between multiple streams. But yeah, the thing I'll say is that everything in the business can be represented as some sort of entity whose change you're trying to see. And implementing this really strict data model has allowed Narrator to really focus on: how do we understand state change? Generating the current state of a business from the "last ever" of a change — that's really our secret sauce. We help people think in terms of change instead of thinking about static things: instead of a static "first touch" attribution model, thinking about the first time a customer visits the website and answering from that. So that's what we've really mastered — change — and we still look for things that don't get represented well by change.
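The multi-stream setup Ahmed describes can be sketched as one append-only activity table per core entity — every row is "entity did activity at time". The table and activity names here are hypothetical, loosely following his scooter example, not Narrator's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One stream per core entity; each stream answers change questions
# about its own entity type.
conn.executescript("""
CREATE TABLE customer_stream (ts TEXT, entity_id TEXT, activity TEXT);
CREATE TABLE scooter_stream  (ts TEXT, entity_id TEXT, activity TEXT);
INSERT INTO customer_stream VALUES
  ('2023-05-01T08:00', 'cust_9', 'opened_app'),
  ('2023-05-01T08:02', 'cust_9', 'started_ride'),
  ('2023-05-01T08:30', 'cust_9', 'made_payment');
INSERT INTO scooter_stream VALUES
  ('2023-05-01T08:02', 'scooter_4', 'got_ridden'),
  ('2023-05-02T10:00', 'scooter_4', 'entered_maintenance');
""")

# A journey question over one entity is just an ordered scan of its stream.
journey = [r[0] for r in conn.execute(
    "SELECT activity FROM customer_stream WHERE entity_id='cust_9' ORDER BY ts")]
print(journey)  # ['opened_app', 'started_ride', 'made_payment']
```

Because both streams share the same noun-verb-timestamp shape, the same "first ever" / "last ever" / "last before" questions apply to scooters just as well as to customers.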
Eric Dodds 47:39
I'm going to ask a question, because I think Narrator is a really interesting example. In fact, Kostas, we had an interesting conversation recently about roles in the data space, right? Data engineer, analytics engineer, analyst, data scientist, even. So Ahmed, Narrator sort of exists between different spaces in the data world, in many ways. So who's the user? And in some ways — maybe this question is leading a little bit — is there a new sort of user that Narrator imagines? Or is there a set of users? Who actually interacts with it in an organization?
Ahmed Elsamadisi 48:26
Yeah, so I’m about to give you a very controversial opinion.
Eric Dodds 48:29
I love it. We love a hot take.
Ahmed Elsamadisi 48:32
I like to think that everybody got into data to answer questions and make an impact by using data. That job used to be called a data analyst — data analysts were the people who would take questions, ask good questions, and derive answers. Whether you're a product person operating as a data analyst, or a data engineer answering a question and operating as a data analyst, I think the tool we've built is for people who want to answer questions, and those people are data analysts. What we've seen happen at companies — really interestingly — is that the job of the data analyst kind of disappeared, and now we have like seven roles that each do part of the data analyst's job. We have the analytics engineer, the data engineer, the data scientist, the BI engineer, the insights engineer or insights analyst — one of them is doing dashboards, one of them is building PowerPoints, one of them is building tables, all trying to answer a question. And what we thought about is: what if we got rid of all of them and enabled everyone to be a data analyst? Once the data is in your warehouse — data engineering splits into two parts: getting your data into your warehouse, pipelining and capturing data; and then the data engineering to structure data. Let's keep the first part and get rid of the second. Let's get rid of the analytics engineer, the data scientist, the BI engineer. Let's enable the data analyst — someone with very limited SQL knowledge, but with the real ability to ask good questions — to do that work: create the dashboard, create the analysis, create the story, represent the data in a way that answers the question, and do all of that in under 10 minutes.
And I think the future of the world is going to be where everyone becomes a data analyst. That's what drives business value — the people who are helping make decisions. That's really what everyone wants: to answer questions. And I think the less we focus on the means and the more we focus on the end, the more these data analysts are going to take over every company. When you think about it, every company's ability to answer questions is its competitive advantage. And I bet that the more these data engineers are retrained into asking and answering questions — and data scientists too; those are great skills — so that instead of working on preparing the data they're actually answering questions, the more insights you'll find, the faster you'll find them, and the more your business will grow. I think that's the future: the world just becomes all about data. And Narrator becomes just like Salesforce and its tools for salespeople — Narrator becomes the tool for data analysts to answer any question.
Eric Dodds 51:17
Wow, that's a super powerful vision. Ahmed, we're here at the buzzer — thank you so much for giving us some of your time. I learned so much about the way you're approaching drastic simplification, at least for analysts, with a single table. And we'd love to chat with you again soon to hear how things are going.
Ahmed Elsamadisi 51:35
Yeah, I love it. Excited to be here — thank you. If anyone's interested, just follow me on LinkedIn or Twitter, and you'll see everything I do.
Eric Dodds 51:45
My first takeaway: I was thinking about the interview we recorded, and I said the word "sucks" like 20 times. And I realized — my kids are young, so if they came home from school and said "sucks", I would probably say, hey, you're not allowed to say that, you're too young. But in the world of publicly accessible content, my son could play this episode back to me and say, well, you said so. So that's my main takeaway — the cat's out of the bag. No, actually, I think one of the interesting things was the controversial take on the role of the data analyst and the connected roles of data engineering. Ahmed basically said that the data engineering role around collecting data and managing ingestion pipelines will stay, but the data engineering around the transformation layer, as we've talked about on the show, he thinks should go away — and in fact, he thinks that anyone who has questions about data will be coming to a data analyst. Certainly a really interesting take. I will say, I don't know if I wholesale agree with it, but here's what I do agree with: the mindset that the tedious, manual labor of preparing data for simple things should go away. It is a good thing for technology to abstract those things away, so a human doesn't have to go through a laborious multi-thousand-line coding exercise to do things that aren't actually that difficult. Narrator is certainly a very opinionated way of doing that, by turning everything into an activity. But I agree with the vision that the laborious nature of some of the preparation work should go away, right? That's not a great use of really smart people's time.
Kostas Pardalis 53:46
Yeah, I agree. There's obviously a lot of space for improvement when it comes to the ergonomics of working with data. What I will keep from this conversation we had with Ahmed is how hard it is to change the way people think and the ways they've learned to work, right? I mean, if you take a step back and listen to what Ahmed was saying, it's not something that complicated — he says you just have to think in terms of actions. Okay, sounds great. Of course you have a user, and the user does something, which might be a sign-up — the user signing up, or signing off, and all that stuff. But even this simple change in the way we think is very, very hard to implement, and changing that for a whole industry is obviously a very big and hard task. What's very interesting is that this says a lot about how change happens, and how incremental — or not — it is in the end. So that's what I keep from the conversation. And I'm really curious to see what the future will be for an opinionated solution like this one, because it has everything to do with how people's habits change. I want to see how things will evolve in how people interact with and use the product.
Eric Dodds 55:27
Yeah, I agree. I think they'll do well. Whatever the final solution looks like, people who are thinking like Ahmed are certainly the ones who are going to invent the next iteration of the way we interact with data — the layer on top of the raw data. All right. Well, thank you so much for joining us. If you haven't told a friend about The Data Stack Show and you enjoy it, tell a friend, and we will catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.