This week on The Data Stack Show, Eric and John chat with Ryan McCrary, Product Manager at RudderStack. During the episode, the group explores the complexities of customer data management, focusing on data activation, identity resolution, and entity management. They also discuss RudderStack’s Profiles product, which aims to bring business users closer to data by making it actionable within their existing tools. The episode covers the challenges of stitching user profiles deterministically, handling anomalies, and the significance of reverse ETL in the data industry. They also touch on the importance of data ownership, visibility between teams, and the role of machine learning in building a data foundation. Overall, the conversation sheds light on the evolving landscape of data management and the need for structured, collaborative tools in the space.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:05
Welcome to The Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. We are here with Ryan McCrary, who is a product manager at RudderStack. So, close to home. And Ryan, you’ve been building a bunch of stuff that is intended to get business users closer to data. And I’m really excited to dig into that whole problem, because I think it’s been a hot topic in the data space in the last year or two, or at least it has reached a fever pitch with venture-backed companies. So I want to talk about that. But first, briefly give us your background.
Ryan McCrary 00:58
So my original background is as a software engineer. I’ve been at RudderStack for a few years now in a number of different roles, so I’m kind of approaching this from different phases of the customer journey. I started off in customer success engineering, working with our existing customers, and moved from there into solutions engineering, building higher-level solutions for prospects and customers. And now I’m on the product team, helping build out the actual solution that we needed this whole time.
John Wessel 01:25
Yeah. So Ryan, one of the topics I wanted to dig into is activation. I think I’ve had two or three conversations over just the last week with people complaining a little bit, maybe about dashboards, like, oh, you make a dashboard. He gets to the hill, and then really desiring, hey, what if I could have the data in the tools I already use? And I think that’s one of the big things about activation. So I’m happy to talk about that. And then I’m excited to hear what you want to talk about.
Ryan McCrary 01:52
Yeah, so from a data activation perspective, that’s kind of the impetus for what we’re trying to accomplish with Profiles, and specifically with that last bit, the activation piece. We’ve talked about this a lot internally, but to your point, yes, everyone’s used to the world of BI: here’s a view into our data. But it’s largely worthless if it doesn’t align with what they’re seeing in the downstream tools, and actually being in the downstream tools is kind of the prerequisite to that. So a dashboard is great, but unless you can act upon it in the tool where you live, whether you’re in marketing, product, advertising, anything like that, it’s really largely useless. So understanding not only how we model the data and visualize it, but then what we actually do with it, is kind of the key there. So that’s kind of what we’re, I guess, discussing today.
Eric Dodds 02:40
I’m excited. All right, well, let’s dig in. Okay, Ryan, there’s so much to talk about here. You mentioned Profiles, which is a RudderStack product. We did an episode on this a while ago, actually, so I want to get an overview of that. But first, I want to dig in a little bit to your experience going through multiple customer-facing roles before becoming a product manager at a data company, because I think this is really interesting. You mentioned that you started at RudderStack on the customer success side, then moved into a solutions architect role, and then eventually moved into product. So my main question is, and I’m assuming the answer is yes, has being customer facing helped you as a product manager? Maybe that’s not true. If it’s not, tell us. But if it is true, what are the specific ways in which that’s influenced your role as a product manager and the way you think about building products?
Ryan McCrary 03:40
So, I mean, obviously, in both of those previous roles I was working very closely with customers, which I think is the way to understand what we’re actually trying to solve for and what we’re trying to build. And what you see pretty quickly is that everyone believes that what they’re doing is very unique, but at the end of the day, it really aligns around a handful of use cases. Over the years, we’ve seen a lot of tools come into vogue for solving it, dbt probably being the primary one. And I’ll go ahead and caveat that most of our customers that use Profiles also use dbt, so we’re not thinking of the tool as a replacement for that, really more as an enhancement. The strength of dbt, which is also kind of the pitfall for this particular application, is that you can do anything with it. There is no opinionation. I mean, it’s largely just a better SQL interface, and that’s an oversimplification. And so Profiles introduces a light layer of opinionation, specifically around customer data, right? So around the entities, as we call them, that you’re actually interacting with from a business perspective. Our approach to that kind of data modeling is really built around those entities. For most of our users, that is a customer or a user or, you know, a person, essentially. And so basically we do two primary things. Identity resolution, you know, so what are all of the various...
Eric Dodds 05:07
This is the Profiles product, yes. So for those who didn’t listen to the previous episode, and I love that we’re just digging in here. Okay, so yeah, this is the Profiles product. So give us a breakdown. RudderStack Profiles is the product. What does it do?
Ryan McCrary 05:21
Sorry, a lot of excitement there to just jump in. But yeah, the two primary building blocks are, first, the identity graph. So we have this customer journey across online, offline, and different datasets. How are we reconciling that to a single user? And then that’s really the foundation for the second: okay, what do we want to know about those users? We make sure that we’re doing that on the solid foundation of an identity graph that we believe to be true and trust. And so that’s really it from a level of opinionation, right? We think less about a fully unstructured or relational dataset and more about how we coalesce around that individual entity, that individual user. I’ve mentioned entities a couple of times now. We can also have accounts or businesses or households or anything like that, which can be related to each other as entities but don’t have to be. And the reason we built on that level of opinionation is to find that single view of the customer, whatever that entity is. Then we’ve got to get that, like I mentioned before, into an actionable place, right? What does that look like in a marketing automation tool or a CRM or something like that? That’s really the reason for that opinionation, and that’s really where the opinionation is most dense.
Eric Dodds 06:29
Okay. And obviously, Ryan, you and I both work for RudderStack. And John, you’re an unbiased consultant, so I want to ask you a question here. I think there may actually be a number of questions that you have about Profiles, and then we’ll get into the activation piece and getting business users closer to the data. But John, why would you use this? I mean, you were a heavy dbt user in multiple different roles previously. And I guess another way to ask this question is, I’ve never been in a business where anyone asks for an identity graph. Ryan says, okay, you have tools that make writing SQL better, which is awesome, right? We all use them. And so adding in an opinionated layer around identity resolution is really interesting, because if you just go out and talk to a bunch of people, there aren’t a bunch of people saying, we need an identity graph. Right? And so, I’m sure you have answers for that, Ryan, but John, why is that? Is that a need?
John Wessel 07:35
Yeah, it’s funny. I remember first talking about Profiles as a product, and then some other similar products, and I had the same question: why would you not just do this in dbt? Or whatever. And I think the funny part is, there is a select number of people, and I think you’ve even talked to these people, who fully did this all themselves, in some role where maybe they just wrote custom software to do it, in SQL or whatever. But I think that’s a very small number of people. So then the next layer down is, okay, to your point, I’m probably not trying to solve an identity graph, but I want the result: hey, I want all my customer data in one place. And I want to start adding features like churn prediction, I want to add a lead score, I want to add those things and have them all in the same place. The other big concept is being sold on the idea that we want to have first-party data, and have data in a warehouse versus keeping it all in a marketing tool or all in an ERP or something. I think that’s the other key. But if you’re there, then you back into, oh, we need to solve identity resolution, we need to solve some of these other problems. And when you get there, I think it is easy to get trapped in, oh, we’ll just write some SQL, it won’t be that bad. And then you start down the path and you realize, oh, this is harder than I thought, this is way messier than I thought. And then, like any data project, even if you did it all yourself, you have to maintain it. And that is the kicker with any data project: even if you did an excellent job the first time, you don’t remember what your past self did, and it’s complicated. And then there’s the team member who maybe has to come in and maintain it when you’ve moved on, maybe because you’re now doing several other projects or something. It’s just so hard.
Eric Dodds 09:37
Yeah. I mean, maybe I can tell the story. I think our listeners may know that I used to do a bunch of marketing stuff at RudderStack, and we did a ton of work around understanding attribution. We did all of that in the warehouse, you know, with our own first-party data. And we happen to have this guy named Benji. We haven’t had him on the show, actually. It would probably be good to get Benji on the show. And he’s just unbelievable at SQL. I had some attribution needs, and so I just went to him. And I didn’t ask for an identity graph. I was like, hey, I need to see first touch, and then I need to see a couple of other things, or whatever. Right? And so, five or six thousand lines of SQL later, I have a table, and every week I’m asking him to add another one. Right? And he’s really the only one who could do it. So that’s kind of like tribal knowledge, I think.
Ryan McCrary 10:38
There are two things I’d mention. One is kind of what John said, and what we mentioned before: no one’s ever saying, you know what I really want to do today is get someone to build me a really solid ID graph. I think everything starts from a use case. John was saying he didn’t think about the ID graph so much as the features, but even the features themselves are driven by a business use case, like attribution. Okay, attribution is cool, but why? You’re trying to actually measure and quantify something. So if you think top down, you have that objective: we want to understand where we should spend more money, what’s being effective, or even, from a usage perspective, what’s the next best action. That then informs the features that we want to build. And you always end up in the same place: to build that feature, we need to understand the full customer journey. So that’s where it really does rely on the base of that being the ID graph. The second thing I’d mention is, you know, I’ve been the victim of some of Benji’s work, where you have this thing. And I’ll be completely honest, when I first saw the MVP of Profiles, I didn’t get it either. This is a part of Profiles: at the end of the day, it outputs SQL that you can read and audit. And I thought to myself, this is something I can just write myself.
Eric Dodds 11:57
Let’s stop there for a second. So, Profiles. We’re talking about the identity graph, and Profiles does additional stuff. But what actually happens is that Profiles generates SQL, and then that runs in your warehouse.
Ryan McCrary 12:12
Yeah, it’s all in your warehouse. The data we’re consuming is in your warehouse. The SQL is being shipped to your warehouse, along with Python for some of the ML models. And the outputs are in your warehouse, so the tables that we generate are, again, in your warehouse as well. Everything lives there. We’re generating SQL. And we see this a lot when we first talk to folks about Profiles. They’re like, I can see the SQL, can I just use this? And the answer is yes, you can just use it, and that’s fine. But to John’s point, that’s not the issue, right? With Benji, we’ve got a working model, we’ve got attribution solved, whatever. As soon as someone comes in and says, hey, we have a new dataset and we need this as an input, or this is a new part of the customer journey, or this is a new dataset that’s going to inform a feature, that’s when the whole thing falls apart, trying to shoehorn that in. And this is where you see data teams struggle, or really the business teams struggle with getting what they need from the data teams. They go to the data team and say, I need this simple thing added to this dashboard, or to this metric, can you just do it? And the answer is yes, it’s simple. And it is. But the problem isn’t that feature. It’s that by shoehorning it into a multi-thousand-line model, you risk affecting adjacent features. So customer success comes to you and says, can you add this in, it’s simple. You say yes. And as soon as you get it deployed, sales is yelling at you because their dashboards are broken. And so the impetus for Profiles is to encapsulate some of that, so that you’re not affecting other parts of the model as you add to it.
Eric Dodds 13:34
So you mentioned customer success and sales, and I actually want to ask about this, because I’m genuinely curious. I’m obviously fairly close to some of this stuff, but I don’t know exactly how the sausage is made, so we get to bring our curiosities and hear all of that. And John, please jump in here. Okay, so one interesting thing that I know about RudderStack, because I work here, is that when I was doing the attribution stuff with Benji, we were looking at it all on a user level, right? So it’s first touch, we’re looking at leads, we’re looking at how someone enters the site, and then what they went on to do. You break that down by channel, and there’s all sorts of stuff. Do they request a demo? Do they sign up for the app? Do they do all these other things? But the customer success team is actually much more interested in a collective account view. They don’t necessarily care about the lead number. They want to know what an account is doing in the product, how much data they are sending, which features they are using, etc. That is actually pretty interesting, and I ask because, of course, while we were doing attribution, I eventually asked Benji to add columns that were representative of, I guess you could say, an account roll-up. So you have a user, but then I also want to know how many other users are associated with this account. That actually is where things got really wild. If we thought it was complicated before, that’s when it got really crazy, right? That’s when he quit.
Ryan McCrary 15:11
That’s what I’m here to do.
Eric Dodds 15:12
He actually did change roles, but not because of that. But can we talk about that? We talked about an identity graph, and if we just have that on a user level, that’s fine. Benji sort of rolled a v1 of that in SQL. But how do you think about these different sorts of entities? I mean, you said entity. What does entity mean? Account and user is kind of a classic version of that.
Ryan McCrary 15:37
That’s a really common one. Another common one in a different space would be a household, right? So a roll-up of multiple users, and we think of an account the same way. To go back to where we started with this, let’s take RudderStack as a product, as an example. From a sales and marketing perspective, we care about getting an individual across the line, however we define a conversion: signing up for the app, setting up a source, whatever the case may be. From a customer success perspective, that’s not to say they’re not worried about the individual user. But when they’re thinking about the overall product adoption or health score of an account, they’re thinking of all of the many individuals within it and how they behave, because different people are going to use different parts of the product. And some might not use it at all. Again, using RudderStack as an example, the front-end engineer who may be responsible for most of the instrumentation is never in the app. So if you’re looking on an individual basis, you’ll say, hey, this front-end engineer is very uninvolved.
Eric Dodds 16:37
Oh, understood. They’re the person upstream of the API. They’re just sending it to an endpoint, right?
Ryan McCrary 16:41
Just sending it. But if you look at an account level, you might say, oh wow, the business user is in there daily, looking at their health dashboard, understanding the overall volumes and their downstream destinations. They’re the ones getting the emails that say, this threshold has dropped, go in and check. So in one sense, there’s the aggregation of those users. On the other side, there’s also excluding some. I work very closely with a lot of our customers, so I’m in some of their workspaces. If you were looking on an individual level, or if you weren’t calculating the account entity correctly, you might say, wow, this is a really active account, they’re in there setting stuff up, this person’s in there. But then you realize that’s me. That’s an internal employee working on the customer’s behalf, but I’m part of their account. And so it’s important to include the right metrics, but also to exclude things that might influence it incorrectly.
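The account roll-up Ryan describes here, aggregating individual users into an account-level view while excluding internal staff who work inside a customer’s workspace, can be sketched roughly as follows. This is only an illustration, not how Profiles implements it: the event fields, the example domains, and the idea of spotting internal users by email domain are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical product events: one row per user action.
EVENTS = [
    {"account": "acme", "email": "frontend@acme.example", "action": "send_event"},
    {"account": "acme", "email": "analyst@acme.example", "action": "view_dashboard"},
    {"account": "acme", "email": "analyst@acme.example", "action": "view_dashboard"},
    {"account": "acme", "email": "support@vendor.example", "action": "edit_source"},
]

# Assumption: internal (vendor) staff can be identified by email domain.
INTERNAL_DOMAINS = {"vendor.example"}


def account_rollup(events, internal_domains):
    """Aggregate user-level events into one summary per account,
    skipping events generated by internal users."""
    event_counts = defaultdict(int)
    active_users = defaultdict(set)
    for event in events:
        domain = event["email"].rsplit("@", 1)[-1].lower()
        if domain in internal_domains:
            continue  # exclude internal activity from the account's metrics
        event_counts[event["account"]] += 1
        active_users[event["account"]].add(event["email"])
    return {
        account: {
            "events": event_counts[account],
            "active_users": len(active_users[account]),
        }
        for account in event_counts
    }


print(account_rollup(EVENTS, INTERNAL_DOMAINS))
```

Passing an empty set of internal domains shows the inflated picture: the vendor employee’s activity then counts toward the account as if a customer had done it.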
Eric Dodds 17:30
Like dev versus prod also, right? You may see a bunch of activity in a dev environment. Yeah, that’s interesting. John, talk about entities a little bit. Did you face any of that?
John Wessel 17:42
So, I mean, the funny thing about entities is, if I had to pick one thing in data that almost everybody handles poorly, it would be entities. Across companies I’ve worked for and companies I’ve worked with, getting that right is so hard for them. And I think some of this is that a lot of products are tiered around entities, where you have to be on the enterprise tier to do that. That’s part of it. And then people end up just hacking around things, because they don’t want to upgrade. But the second thing, and this is probably even bigger than entities, and I have to bring it up, is duplicates. When you do ID resolution, there’s no magic, right? There’s no AI magic that can deduplicate your customer records yet. And then, of course, there are two different types of duplicates. One is the kind ID resolution solves: the record is in two different systems, we have an ID, and we can stitch them together. The other one is the tough one, where they’re truly duplicates. They have different IDs, and there’s no clear way to be sure they’re the same. I’m sure people would be interested in how you’ve seen customers address that problem.
Ryan McCrary 18:54
Yeah, I mean, that’s a tale as old as time. When we think about stitching users together in Profiles, it is largely a deterministic system. But the way that we stitch is based on the ID types themselves, and that gives us the ability to map back and find some of those outliers. When we set up the initial ID graph, we have some scripts that will run some QA on it. There are some outliers that are very easy to spot, and others that are more difficult. We worked with a customer recently where we built this out and they were very pleased with it. But when we did the QA of the ID graph, we found there was a single user that was stitched to, I think, like 10,000 different identifiers across the...
Eric Dodds 19:34
Might be in trouble if...
Ryan McCrary 19:36
There are two things I think about there. One is, depending on the use case, you may not care, because you may know, hey, that’s an internal user impersonating folks for testing. If we’re stitching it together for marketing use cases, that doesn’t matter much. It means that person might get a couple of extra emails to their other email addresses, so it’s not a huge deal. When we’re thinking about, you know, maybe custom offers, or if we’re doing more sophisticated things, like the folks that use their customer data for password unlocks or account unlocks, it’s much more important to stitch that user correctly. So part of it is that this could seem like a bug: oh, well, Profiles is stitching all these people together. In a way, it’s a feature, because it helps point out instrumentation flaws. What we realized with this particular customer is that their standard operating procedure was that some of their employees would impersonate other users to place orders on their behalf. That seems fine. But then you realize it only takes one node to stitch all this together, right? Anonymous IDs are a good example. Every time you clear your cookies or launch your browser, you may get a new anonymous ID, so having a bunch of anonymous IDs is not a red flag. But if two users each have a bunch, and now you’ve impersonated one of them, you’re stitching all of those other anonymous IDs together and sharing everything that they are stitched to. And so we do have mechanisms for excluding specific nodes. Often we’ll know, for example, that we should just ignore internal email addresses. But we can also do it programmatically, where we feed duplicates above a certain outlier threshold into a table of IDs that are excluded in the future. 
And so a good example is, for most operational systems, I shouldn’t say all, but by and large, a user should have one email address. So for most systems, if they have two email addresses, that is something we want to take a look at, to understand why we’re stitching them together, because that’s going to be more interesting. And so we can also put thresholds around individual ID types. Again, for anonymous IDs, we’re okay with a threshold of, you know, anything below 250, whereas for emails we want exactly one per user, and for internal IDs as well. So it helps find some of those anomalies. And again, a lot of times the challenge that we’re helping solve goes back to instrumentation, and to data inputs in general. For the customer I was just referring to, we actually realized they had really good server-side identification on these users, so we were able to basically ignore anonymous IDs. We know there’s web browsing behavior, and we know that there are systems in place that they’re not going to get rid of that are merging some of these together, but we’re using a much more robust internal identification system, so it was really fine to ignore those. And it also allowed us to speed up the project, because that’s a lot of stitching you’d otherwise have to do.
John Wessel 22:19
Oh, gosh. So I have a funny example of this, and I think this happens a lot in businesses. At a previous company, we had an order management system that was not connected to some of the online systems we used, and people would enter orders, right? And then we had some integrations that would flow between the systems. You got me thinking about it with the ID graph thing, with one node tied to a bunch of different IDs. So we had this customer in there, and it started popping up on analytics reports. It was, let’s say, Jane Smith, some person’s name. And she would just have this massive number of orders, and it was like, nobody’s ever talked to her. We should give her our best, she’s our best customer, who is this? And I think this is true of a lot of OMSs and even CRMs. What had happened was that an integration grabbed the name off of the first order that came in, and it stuck. It was the integration for all the Amazon orders, so for everything that came in from Amazon, it grabbed the first name that came in, which was Jane Smith, and then it just stacked the orders up under her. So if you ran a report, it was like, wow, there’s this Jane. But I think that happens in a lot of these systems. If you’re just browsing one record at a time, operationally, it just doesn’t show up. But when you get into data problems like this, it shows up in a big way sometimes. Yeah, that’s super interesting.
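The two ideas in this stretch of the conversation, deterministic stitching where a single shared node merges otherwise separate users, and per-ID-type thresholds that flag suspicious clusters, can be illustrated with a small union-find sketch. Profiles itself generates SQL that runs in the warehouse, so this is not its implementation. The edge format, the sample identifiers, and the choice to treat each limit as a maximum allowed count are assumptions made for the example.

```python
from collections import defaultdict


class UnionFind:
    """Minimal union-find used to group identifiers into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving keeps later lookups short.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def stitch(edges):
    """Deterministically merge identifiers that were observed together.
    Each edge is a pair of (id_type, value) nodes."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    clusters = defaultdict(set)
    for node in list(uf.parent):
        clusters[uf.find(node)].add(node)
    return list(clusters.values())


def over_threshold(cluster, limits):
    """Return the ID types whose count in this cluster exceeds its limit."""
    counts = defaultdict(int)
    for id_type, _value in cluster:
        counts[id_type] += 1
    return {t: n for t, n in counts.items() if t in limits and n > limits[t]}


# Hypothetical limits echoing the conversation: one email per user,
# a generous allowance for anonymous IDs.
LIMITS = {"email": 1, "anonymous_id": 250}

EDGES = [
    (("email", "a@shop.example"), ("anonymous_id", "anon-1")),
    (("email", "a@shop.example"), ("anonymous_id", "anon-2")),
    (("email", "b@shop.example"), ("anonymous_id", "anon-3")),
    # Impersonation: user b's session reuses one of user a's anonymous IDs.
    (("email", "b@shop.example"), ("anonymous_id", "anon-2")),
]

for cluster in stitch(EDGES):
    print(sorted(cluster), over_threshold(cluster, LIMITS))
```

Without the last edge the data resolves to two separate users. With it, everything collapses into one cluster, and the email count is what flags the cluster for review or for an exclusion list.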
Ryan McCrary 23:37
And one thing I’ll add to that is, I mentioned that everything we’re doing is SQL that’s running on your warehouse, that you can see. That’s something that’s appealing to me because, traditionally, using these black box systems, you don’t ever see what’s happening. You live with that for who knows how long until you realize, oh shoot, this has been happening forever. In a closed system, that just happens and you’re unaware, whereas if you can see the SQL that’s running, you can debug what might be causing it.
Eric Dodds 24:04
Yeah. Well, I mean, going back to entities, and then I want to talk about... I mean, we haven’t even gotten to it, which is fine. Brooks is actually not here today, for all the listeners, and so we can go long, which is when I get invited, when the producer is not here. Yes, that’s exactly right. The producer’s gone, let’s get Ryan on. So, John, I want to return to something you said. You said one of the things that most companies do poorly from a data perspective is entities, right? I think that is probably most of the explanation for why every Salesforce instance is the biggest nightmare. All Salesforce customization is like trying to wrangle entities into a system that is, like, lead, contact, account, opportunity, whatever.
Ryan McCrary 24:54
You know, it’s why Salesforce developers exist.
Eric Dodds 24:56
Yeah, and they make a very good living. But it is essentially a fairly complex entity resolution problem inside of a system that doesn’t support it, one that is only designed, from a data model perspective, for a limited number of entities.
John Wessel 25:15
Yeah. And in this day and age, you’ve got companies that are splitting and merging, so it’s not a simple problem. Even the simple cases, like parent company and child company, or multiple people in one company, are easy enough to mess up. But once you have parent companies that spin off, merge back together, and change names a hundred times, that’s the challenging data problem, especially over time. Do you want to update that information in place forever? Or do you want to keep a record that in 1997 they were this, and then later they were that? In data, that’s the slowly changing dimension problem, which almost nobody handles. We talk about it a lot, but we just retroactively update it whenever they change names or get acquired.
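The slowly changing dimension problem John raises has a common textbook answer, often called Type 2: instead of overwriting an account’s name, close out the old row and append a new one with validity dates. The sketch below is a generic illustration of that pattern, not something described in the episode. The record shape, field names, and sample companies are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AccountNameVersion:
    account_id: str
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def rename(history: List[AccountNameVersion], account_id: str,
           new_name: str, effective: date) -> None:
    """Close the current version and append a new one, so that the
    previous name stays queryable instead of being overwritten."""
    for row in history:
        if row.account_id == account_id and row.valid_to is None:
            row.valid_to = effective
    history.append(AccountNameVersion(account_id, new_name, effective))


def name_as_of(history: List[AccountNameVersion], account_id: str,
               on: date) -> Optional[str]:
    """Look up the name that was valid for the account on a given date."""
    for row in history:
        if (row.account_id == account_id
                and row.valid_from <= on
                and (row.valid_to is None or on < row.valid_to)):
            return row.name
    return None


history = [AccountNameVersion("acct-1", "Old Co", date(1997, 1, 1))]
rename(history, "acct-1", "New Co", date(2020, 6, 1))

print(name_as_of(history, "acct-1", date(2010, 1, 1)))
print(name_as_of(history, "acct-1", date(2021, 1, 1)))
```

The retroactive approach John describes amounts to keeping only the latest row, which is simpler to maintain but loses the ability to answer what an account was called at an earlier point in time.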
Eric Dodds 26:05
So let’s switch gears a little bit here. That’s really interesting, and I have a bunch more questions. Actually, okay, one more question on this to close it out, from a product manager standpoint, because I think it’s really interesting to think about how we build data products generally. We’ve had a ton of people on the show, but this is very interesting to me. As a product manager, one of the things that you spend a huge amount of your time on is a product that generates SQL, the first output of which is an identity graph, which is something that no one asks for inside of a company, but that is required in order to resolve entities or whatever. How do you think about that? That seems like a very difficult problem, right? No one’s asking for this, but it’s actually what you need.
Ryan McCrary 26:57
I mean, you’re hurting my feelings right now. We’re working on a product whose primary output no one wants? Thank you. Yeah. I mean, I think it goes back to what we were saying before. It’s like, you have to solve that to solve the actual problem, right?
Eric Dodds 27:07
And so what is the actual problem? Actually, like? I know, you mentioned this, but just to say like, because identity graph is a stepping stone? Yeah.
Ryan McCrary 27:14
I mean, the actual problem is to solve business use cases in the tools where these business stakeholders live. You know, like, yeah, again, like we talked about Salesforce, like, I don’t care who you are, you’re not going to get your sales team out of Salesforce. Yeah, of course, you’re not, you shouldn’t. Yeah, you’re not going to get your marketing team out of Customer.io, Iterable, Braze, whatever you use. Like, that’s where they are going to live. That’s where they are doing their jobs. And so, you know, all of this is for nothing if we can’t make use of it. Yeah, totally. So
Eric Dodds 27:40
I guess maybe to put a little bit sharper point on the question, like, the solution to that problem, and what you’re building, lives really far upstream of Salesforce. Yeah. I mean, I guess, whatever, you could argue about the distance, right? But the person in Salesforce probably should never know about the intricacies of, like, entity resolution, or, yeah, all of that that’s happening, you know, in the data warehouse, right? Yeah. How do you think about that just as a product manager? Like, you have this outcome that needs to happen in the business, and then you have this really technical process. Well, I guess what’s interesting about profiles is that the identity graph is actually just a stepping stone to produce, like, computed users. Yes. Right. Yeah. And so it’s even upstream of the stuff that the data team produces?
Ryan McCrary 28:30
Yeah. Yeah. I mean, at a high level, you know, it’s a single product, profiles, but there are really two interfaces for it. There are the actual data definitions, you know, the ID stitching, the building of the features, which eventually result in these output tables that we mentioned in the warehouse. But also, yeah, like I said, it’s all for nothing if you can’t access it. And so, you know, we have a UI that essentially, you know, backing up, I guess, profiles is a set of configuration files that connect to the warehouse, you know, build these queries, run these queries, build out the tables, and then that’s all done in, like, a version-controlled environment. So you can manage that in whatever version control you use, and then that actual Git repo can be connected within the RudderStack UI, and allow for the business users to interact with it. Oh, interesting. Okay, so the data team is doing all of this in their own dev workflow with config files. Yep.
Eric Dodds 29:31
But then the actual user interface, like the RudderStack web app, is reading the outputs.
Ryan McCrary 29:38
Yeah, it’s connected to that Git repo, which is what’s being used to kind of source and build those tables. And then the UI sits on top of the warehouse tables as well. So, you know, I always have to preface this when I’m doing demos or explaining the product to folks, that the UI is admittedly slow because it’s actually pulling from the warehouse. Interesting. Everything that you see within the UI exists in your warehouse. Wow, okay, and
Eric Dodds 29:58
so, of course, the reason for that has to be to expose that data to someone who’s not on the data team, because why wouldn’t I just go?
John Wessel 30:08
Like into a warehouse?
Ryan McCrary 30:10
I’m literally already there. And I have the config files, right? Yeah. Okay. So walk us through that. Yeah, I mean, if you think of it, it’s a spectrum, right? So teams operate in a lot of different ways. In some teams, you know, you have to teach everyone how to use the BI tool, or how to query this data to get what they need. And so the way that we think of it is, you know, how can the data team live where they want to live? Can they have a technical tool and use software development best practices, but then give that to the non-technical stakeholder? How can they, you know, have that in a way that they can see and understand it? And then understand, like, at what grain am I sending this to the downstream tool? Like, what do I actually want to send? And again, different teams operate different ways. Some teams send everything, some will, you know, slice that data according to their needs and send subsets of that, some will send full audiences as just lists of users, some will send traits and then do the dynamic audiences in the downstream tool. It really kind of caters to whatever they need. We talk about it, it’s kind of funny. I mean, this is a technical audience, so I can be honest here, but we talk about it as being, you know, for the two different users. There’s the technical solution for the technical user, and then the UI version for the non-technical user. But they’re really both for the technical user. The technical user wants the business teams to have that self-serve as much as, maybe more than, the business team wants self-serve, because they don’t love sending CSVs. Exactly. And they don’t want to, you know, handle Jira tickets. So, like, they’re both for the data engineer. But that’s, I mean, that’s a necessary solution for the technical. So
John Wessel 31:46
it’s like, it’s like a presentation layer. So like, hey, look, I can show you what this thing does.
Eric Dodds 31:50
Yeah. Okay. Okay, so, can we take a slight but related detour, and quickly talk about reverse ETL? Yeah, the producer is not here, we can do it every three years, we can do whatever we want. So this concept of reverse ETL, you know, has cropped up in the last couple of years, but I think it’s actually an old idea, right? This is, yeah, that’s ETL. Actually, I mean, you’ve actually mentioned this, like, I’ve talked with a bunch of companies who just call it ETL. Yeah,
Ryan McCrary 32:18
I mean, it is. I mean, I would put reverse ETL in the same bucket as the ID graph. It’s not like anyone’s like, give me a reverse ETL. I mean, they are now because we’ve told them they want it, but, like, no one’s out there like, you know, what I really want to do today is, like, get into some really cool reverse ETL. Like, it’s just a means to an end. Like, in the same way that the ID graph is what we need to build, you know, reliable data solutions, reverse ETL is what we need to get those data solutions into the tools where we actually can use them. Okay, so spicy data take for you. Like, how did it become?
Eric Dodds 32:52
There’s obviously a ton of buzz around it, right? I mean, RudderStack has a reverse ETL pipeline, right? Yeah. But then the other thing is, it seemed like a, quote unquote, industry unto its own. But now just about every company is building this, right? Even, like, the marketing tools. Right. So John, I mean, you see this every day, right? I mean, it’s actually just ETL, data movement, and any company can build a pipeline to slurp it up. But how did it become, like, a thing a couple of years ago? We
John Wessel 33:22
talked about this a little bit before the show today, and my theory on it is, I had this pulled up, but think Snowflake IPO, around 2020. You know, the biggest IPO in tech, the biggest IPO in tech history, really splashy, so that freed up a bunch of money for startups. Right. And then, I think those were Eric’s words, it’s like people formed startups around features of products. And then you had all these tiny little slices of, like, we do ETL, we do reverse ETL, we do observability, we do transformation. I mean, just every little slice imaginable, right? And they all got funding. Yeah. And then in the last couple years, AI has kind of been the focus, right? So the funding is a little bit drier in the data space nowadays, and you’re seeing some merging of companies, acquisitions, and some others that are like, I don’t know if they’re gonna make it. So I feel like it’s just the macro environment that created it, honestly. Like, in another time, let’s say Fivetran, would Fivetran just be, like, the data pipes company? They do reverse and maybe transformations too? Like, I don’t know, maybe, which is kind of what Alteryx is, right? Yeah, like, ’cause that was, like, the generation before. Right. Exactly.
Eric Dodds 34:33
Yeah. So I mean, who knows, remains to be seen. I agree. I think I would add on one layer, and Ryan, feel free to disagree with any of this, because we love a spicy take when the producer goes. But I mean, the intent is good, right? Like, I think, to your point, Ryan, you’re like, who cares about an identity graph if you’re not getting it into some tool that marketing can use to send a campaign to increase conversions, or whatever. There are use cases downstream, right? And, of course, if you’re just writing a Python script to do that, you know, or you have some custom ETL job, like, that’s annoying to manage over time. And it’s, you know, arguably not the best use of the data team’s time. And so having that as a managed service, like, of course, makes sense. But I do agree that, you know, it is probably a feature. And we see that now, right? Now marketing teams can literally self-serve from their own platform. Yeah, like, data that’s available in the warehouse. Yeah.
Ryan McCrary 35:31
I think some of that comes from, too, I mean, to John’s point about the Snowflake IPO, I think you’ve seen a huge acceleration of just the accessibility of a data warehouse. And so you’ve got teams that normally wouldn’t have had access to that. Now you can just sign up for Snowflake for free, or BigQuery, or whatever. Sure. And so these are smaller, you know, maybe even younger software teams a lot of times, maybe not even data teams, they’re just the software, they’re just the engineering. Sure. And so ETL is not a concept that they’re well versed in. And so there is a place for, I think, reverse ETL from that perspective, but I think, as you enter into a mature data team, you see that it becomes much more of a, you know, just kind of table stakes.
John Wessel 36:14
Yeah, I think you bring up a great point, Ryan, because historically, databases were very locked up. Yeah, like, if it’s a production database, it’s lock and key. You locked developers out of it, you’ve got, like, a couple of ops people that have access to it. You’ve got privacy concerns, you’ve got uptime concerns, we don’t want to take down production databases. So some of it, too, is that, like, oh, I can go click a button, sign up for this thing and have a database, this is cool. And then I can move just the data I want. And it’s not going to impact, you know, production, and I can anonymize things. Like, yeah, some of that is, like, we kind of unlocked what used to be a lot more tightly held.
Eric Dodds 36:50
I think about the early days of Data Studio and just how, well, we probably don’t need to go down that path. But there was a certain element of magic to it. Yeah, it’s so easy to get data into BigQuery. Yes, it’s so easy to just bleh Data Studio right on top of this, and, you know, do, like, really cool reporting things that were so, so hard any other way. Yeah. Now, of course, like, you know, right. I think Ryan’s Oh, it was like, yes, there are a number of things.
John Wessel 37:29
You prefer him to call it Looker Data Studio? Gotcha. Yeah, maybe
Eric Dodds 37:33
worse. Okay. Yeah. Well, okay, that’s a totally separate episode. But okay, so we talked about reverse ETL a little bit. Thank you for the spicy take. But let’s just say I have a reverse ETL pipeline, it doesn’t matter which, right? And profiles is outputting this identity graph, you build all these traits in profiles, or, you know, what do you call them, features? Okay. Okay. So features, user features, entity features, I guess I need to be very accurate. Yes. And so I just have this table, or maybe a set of tables, that are like, okay, this is my entity, and here’s, like, everything I know about this entity. So do I just slap a reverse ETL job on there, and, like, I’m off to the races? And this is, of course, a leading question, because you as a product manager just shipped two features. One is called cohorts, and one is called activations. And cohorts is actually sort of like an opinion about creating subsets of this giant table that represents an entity. Yeah. And so why don’t I just use a reverse ETL job to, like, connect to this entities table? Yeah. And then send the data where I want?
Ryan McCrary 38:42
Yeah, I mean, you can ultimately, like, that was kind of the original intent was just like, hey, you have this you can send all of it, or some of it, you know, wherever you want. And, you know, I think that becomes a challenge at scale. Because, you know, who knows what, or how you want to send that, you know, like, that’s a level of opinionation, for the business to understand. And so what we did, so we’re just,
Eric Dodds 39:07
I mean, honestly, just a ton of data. Yeah. Yeah. Hundreds of columns. Yeah.
Ryan McCrary 39:13
I mean, if you think about sending all of that to a downstream tool, I mean, yes, most modern tools support custom traits and things like that. But you’ve just shipped that mess somewhere else now, and so you have to deal with that. And so I’ve mentioned entities a couple of times. Early on in the product, we had the concept of entities since day one, and we noticed customers kind of almost hacking this. So, like, a good example would be, you know, multiple customers, we found, were stitching users as an entity together and then had a second entity that was, like, known users or customers. So essentially, like, recomputing that identity graph for users where they had an... Oh, interesting. Right. So, like, you want the whole ID graph when you think about things like attribution, when, you know, you have a bunch of anonymous users, but you still want to understand how they got there, yeah, what their behavior is. But then when it comes to targeting them, you know, again, to the same point as tons of columns, that’s tons of just empty records that you don’t need to send to your marketing automation or, you know, ESP or anything like that, right? And so we found them kind of hacking this together as, like, a users cohort and a customers cohort, I mean, a users entity and then a customers entity, which was just, at the end of the day, a subset of that, but just driving up the compute. A filter, yeah. And so cohorts was kind of born out of that. It’s, okay, you have your entity, and you can now define, on those traits that exist in the entity, a cohort, which is a subset of that entity graph. And all it’s really doing is filtering that stitched master ID based on some criteria that exist about those. So now you have a user entity, and then you can have a known user or customer cohort within that, or really any type of cohort that you’d want. Those cohorts can also have different features than the main set. 
And so that’s kind of what we saw customers beginning to break down into different entities. Because if you think about, you know, something simple, like just calculating an aggregate like LTV on customers, even if everything’s null, to calculate that on all of your anonymous users takes time and compute. And so you really want to compute those features only on the cohort to which they actually apply. Yeah,
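As a rough illustration of what Ryan describes here, a cohort can be thought of as a filter over the stitched entity table, with expensive features computed only on the filtered rows. The Python sketch below uses invented data and invented names (users, cohort, with_ltv, main_id); it is a toy model of the idea, not the profiles configuration format or the SQL the product generates.

```python
from typing import Callable, Dict, List

# Hypothetical stitched "user" entity rows: one row per resolved ID.
users: List[Dict] = [
    {"main_id": "u1", "email": "a@example.com", "orders": [40.0, 60.0]},
    {"main_id": "u2", "email": None, "orders": []},  # anonymous visitor
    {"main_id": "u3", "email": "c@example.com", "orders": [25.0]},
]


def cohort(entity_rows: List[Dict],
           predicate: Callable[[Dict], bool]) -> List[Dict]:
    """A cohort is just the subset of entity rows matching some criteria."""
    return [row for row in entity_rows if predicate(row)]


def with_ltv(rows: List[Dict]) -> List[Dict]:
    """Compute a lifetime-value feature only for the rows passed in."""
    return [{**row, "ltv": sum(row["orders"])} for row in rows]


# The "known users" cohort: entity rows that have an email.
known_users = cohort(users, lambda row: row["email"] is not None)

# LTV is computed on the cohort only, so anonymous rows cost nothing.
known_with_ltv = with_ltv(known_users)
print([(row["main_id"], row["ltv"]) for row in known_with_ltv])
# [('u1', 100.0), ('u3', 25.0)]
```

The full entity (including anonymous users) stays available for things like attribution, while the feature that only makes sense for customers is never evaluated on rows where it would be empty.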
Eric Dodds 41:15
yep. That makes total sense. Yeah. That’s super interesting. Have you seen cohort creation kind of follow team use cases? And so I guess, like, the immediate thing that came to my mind was, if I have a known users cohort, like, as someone who works in product, or, you know, I’m trying to understand feature adoption, or I’m trying to increase lifetime value of my customers, or ecom, or whatever. Like, do cohorts fall on business lines? Or what kind of patterns are you seeing there? Well, that’s
Ryan McCrary 41:51
where it gets really interesting. We’ve seen, you know, customers in different verticals, and really even different structures of internal teams, that have taken those different ways. And so in some cases, yes, it’s by kind of function, you know, so the product team wants to look at this, the marketing team wants to look at this different cohort. We’ve also seen cohorts acting as journey steps or funnel steps, where you can have mutually exclusive criteria for each of these, and folks can move, you know, between them, and you can target those accordingly. That’s something where, you know, we’re still deciding, but we probably won’t have a heavy opinion on that, because I think it really depends on the team and how they operate. And so some teams will have just that basic, you know, user cohort and the known users. And then they will activate or send that known users cohort, either the whole thing, or they will segment on those features that exist and send subsets of those. That’s where, you know, maybe a more resource-constrained team, where there’s a single data engineer, says, this is the clear definition of a customer, you guys go run with it and send it to the tools that you want. And then we see teams with more robust data teams, where they say, here are the five primary cohorts that we have defined and split the customers into and then built features on. And so these are your entry points. And that could be something like US customers, or, you know, we’ve worked with an ecom customer recently where their primary ones are, you know, business and residential. And I’m sure there’s two teams that operate very differently. Oh, interesting. Yeah. Yeah. So really the sky’s kind of the limit as to how those are segmented.
Eric Dodds 43:13
Interesting. All right, John, cohorts. Did you try to do this? Were you, like, rolling a bunch of stuff? Yeah.
John Wessel 43:20
It’s funny. And this is kind of a sad story. But we got our
Eric Dodds 43:28
The producer leaves, and we get, like, these hot takes, sad.
John Wessel 43:32
It’s a sad story of one of those companies that was funded in that, you know, 2020 range. They built an awesome product that got acquired and then basically killed. But yeah, there’s a small, primarily email tool, but they really built a pretty robust kind of customer data feature into it. Like I said, they no longer exist. But one of the things we did was feed custom entities into that. And they did some neat things like computing, like, predictive stuff inside the tool as well. But that was something that we found was really helpful for, you know, targeting and for email and customer messaging inside that tool, and then getting insights. Like, one of the cooler things that we did is we had this, like, product ranking thing where it was an X and Y axis, and it scored on views and conversions. So what are your, like, high view, low conversion, or low view, high conversion, on, like, an X and Y axis? So that was something that we, like, piped data into. And then from a customer data standpoint, I think the biggest problem we faced, when we were selling B2B and B2C, was how do you pinpoint the customers you should reach out to, especially businesses, because you’d get purchases and there’d be some big names, like, wow. And, you know, that wasn’t necessarily the only indicator they would be a good customer, but that was an interesting one, because you can’t reach out to everybody. If you do want to reach out, especially if a business buys something, like, well, what else do you buy? Like, where else are you going? And that was probably one of the more interesting customer problems. Yeah, we were working on that, like, at the very end of my time, thinking through, like, all right, how can we rank them? Let’s find properties to rank them on and give, like, a call sheet or email sheet or something to a sales team, and then automating that further. So that was probably the most interesting.
Eric Dodds 45:25
Man, it’s so rare to have, like, some sort of customer engagement tool that actually handles entities well. Maybe I don’t.
John Wessel 45:35
Then it’s gone. Yeah, it really is.
Eric Dodds 45:39
Okay. Well, actually, speaking about that. Okay. So cohorts are one of the things you recently launched. But speaking about email tools, there’s this other piece of this called activations. Yeah. So what are activations? And to put a spicy take on it, like, that sounds just like reverse ETL? It is.
Ryan McCrary 46:00
Reverse ETL? Reverse, reverse ETL? Wait, that? That’s just ETL?
Eric Dodds 46:09
It? Yeah. So you’re like, cancel out. ETL cancels out? Yeah,
Ryan McCrary 46:14
I mean, so yeah, at a high level, essentially, what we saw is that we were providing a way to define these entities in a trustworthy manner for the data team doing that definition, and then for the data team to segment that further into, you know, again, cohorts that different business units or teams or, you know, different phases of the customer journey cared about. And so that became the grain at which we saw people needing to actually get that into the downstream tool. You know, like, you’ve built your ID graph, you’ve built your features, you’ve subset that into, you know, usable buckets of users. And then that’s where it was like, okay, now we’ve got it to a place where we can actually take action on it. And so, like, again, beating a dead horse here, but you still can’t do anything until it’s in the tool where you want it. Sure. I mean, you’re literally talking about materialized views in the warehouse. Yeah, yeah. So that’s literally the grain at which it was like, okay, now you need to get this into a downstream tool. Because RudderStack is building these, we know exactly how the views are materialized and where they live in the warehouse. And so it becomes very simple, then, for the non-technical user to say, you know, I’m looking at this UI, which, again, is built on top of Snowflake or warehouse data, and I want either this cohort, or even a further segment of this cohort, or even some traits or features of this cohort, in my marketing tool. And so activations is basically, you know, a UI, low-number-of-clicks way to get that there. I like it. I wish it was one click, that’s what I was really going for, but honestly, you have to kind of map features. Yeah, you map it to the field in the tool that you’re sending them to. 
So, sure, but the idea is that it gives you a centralized place to say, you know, again, business users are exploring that, saying, this is what I want, this is a subset of this, and then put it into the downstream tool. And then that’s now connected in the UI, at least to that cohort, so they can see all the places it’s being sent, or what sub-slices of that are being sent. And so it really kind of ties a bow on that notion that I mentioned of the data team owning the definitions in the config. Yeah, and the business stakeholders owning the interface to that. Yep. So to go back to the spicy take, you are actually actively just turning reverse ETL, like, melting it into, like, it’s under the hood, and a business user, like, goes to look at data, and then they’re just like, I just want it in this tool, which actually is just, like, reverse ETL is bogus. Use my reverse ETL. Yeah.
John Wessel 48:43
Exactly. Awesome. Yeah, that’s, like what tools are supported out of the gate? Or commonly used?
Ryan McCrary 48:50
So downstream, it’s any of the integrations that RudderStack... Oh, wow. Sounds like a big library. Yeah. So anything where you send clickstream or reverse ETL data is automatically supported by activations. So, okay, yeah,
Eric Dodds 49:01
I will trade your one click for, like, yeah, sending data anywhere. Okay. I gotta ask a question, though. And this is for both of you. Okay. And I don’t care who goes first, you guys can fight over it. We just talked about it. And I mean, I kind of know this for myself. And maybe this is just because I was, like, a very technical marketer, and I actually did go into the warehouse, so I’m probably not right. Yeah. But, like, why is it important? You mentioned the data team can own the definitions, right? Like, couldn’t I just go in and create a bunch of definitions? Like, why is it important to have that dynamic? Look, you as a marketer? Sure.
Ryan McCrary 49:38
Well, because you would do it wrong.
Eric Dodds 49:42
Okay, hold on. I didn’t have to create it. Me personally for life.
Ryan McCrary 49:50
I mean, that’s a good question. And I mean, my answer to that is that data is never as clean as you want it to be. And so, okay, a good example. We worked with a customer recently where they, in their downstream tool, wanted, like, a list of what they wanted to see and, you know, do activities on recently active users. And this was a product like RudderStack, you know, a SaaS-based product. And so someone on the marketing team, probably someone brilliant, like you, just literally grabbed, like, you know, the count of, like, did they have a session in the last week, and they were using that as active users. And, you know, it wasn’t converting like they thought it would or had in the past. And the data team was able to come in and say, you know, this tool is primarily used as a browser extension. And so, like, if you’re signing in, you’re probably not really using the tool well. If you’re using the tool well, you’re using it from a browser extension, and you’re never signing in. And so they were able to come in and essentially correct that feature, the definition of it. But everything downstream remained the same, right? So everything that marketing had already built around “recently active” was still fine. But the definition was just done more closely around the business concepts. And that’s not, I mean, I love to knock marketing people, don’t get me wrong, that’s my favorite activity. Me too. But that was just someone doing the best with what they had, because that’s what existed in that marketing tool. Just, like, clickstream data was what they had available. Yeah, but the business has access to those metrics in a different data set that’s not available in the interface of the marketing tool, that can say, like, oh, this was the average number of minutes they used the tool last week, and, like, that’s much better. And so yeah, that’s why I think it’s important for the data team to own those definitions. 
But the marketing team, as much as I hate to admit this, is always gonna know how to use those better, right? Like, how do we actually touch
Eric Dodds 51:37
her? Yeah, yeah. But it’s like filters on its core business definitions, that shouldn’t change, because it can create situations where someone is sending a campaign, and actually reporting something that’s inaccurate. Exactly. Yeah. Interesting. Yeah.
John Wessel 51:51
Yeah. Well, and I think, just as a general concept, some of the best tools are tools that bridge two teams, or more than one team, together. I mean, that is a lot of the value of really any SaaS tool. It’s like, okay, this is how marketing looks at it, this is how data looks at it, and it provides, like, clarity and an interface to work that out. Because the flip side is often true too, right? Where the data team likes to do things, they model things technically correct, and they need to do everything right, like, technically, and then marketing’s like, yeah, but this isn’t useful. Yeah. Because of things like X, Y, and Z, like business rules. Yep. Or just oddities, like how some system’s set up that can’t be changed or whatever. Yeah. So, like, you have to have that. Like, these tools, you know, like profiles, force you to kind of get it on paper and to agree on that. And, like, I think it can flesh out a lot of the problems with data definitions, because a lot of times the data teams and marketing teams aren’t working together, because the marketing team can have their fully enclosed black box. Yeah, and just, like, do stuff, and they would never even know if it was wrong, because they don’t have anything to, like, compare it against. Yep.
Eric Dodds 53:01
So it sounds like both of you are advocating for this world where, like, you know, I mean, there was this whole, like, self-serve analytics, data democratization, blah, blah, blah, right? Which, you know, that’s another episode, we could really talk about how that didn’t materialize, or when it did, it had some severe issues. But what’s interesting is, like, you could say, okay, we’re just gonna send this data to your tool, and then you can do whatever you want with it. But to your point, Ryan, I think what’s interesting is, without context, or sort of without an agreement on what the meaning of some of those core business definitions are, which ultimately, like, materialize as, you know, some sort of a column or a table or something, like, that’s actually how it exists physically, quote unquote, in the business. But it seems like there’s this desire to, how do I say this, get the marketer closer to those, like, physical assets in the warehouse, but in an environment that has, like, a bunch of safeguards. Yeah, exactly. Okay. Interesting. But
John Wessel 54:06
I think the vision of, okay, we need clear ownership for pieces of this thing. Because, like, if you’re at a certain scale, nobody can do everything, right? You have clear ownership. But then you also have, like we were talking about with a GUI earlier, visibility between teams. Like, I’m not the owner here, but I can at least go see what is happening at a high level on this downstream thing that impacts me. And then the same on the other side, like, I can see the results of what I did. I think that’s a really positive thing. So the teams feel like they’re actually doing something useful, versus completely siloed, or, like, I ship it across the wall, I don’t know what will happen, what happens before me or after me. It’s like an assembly line mentality in a bad way, versus, like, the full visibility of what’s going on and then, like, clear ownership lines in the process. Yeah. Super
Eric Dodds 54:57
interesting. All right. So two more questions before we end here. I actually have no idea how long we’ve been recording, which is a great feeling. Yeah.
55:06
That’s such a great
Eric Dodds 55:11
All right, we may change the format of the show. Yeah. Just, you know, the two sets of video, you know? Yeah. Okay, so two questions, one for Ryan, and then one for both of you, which will actually end on a spicy take. Great. This isn’t the spicy take, this is for you. So what are you building next? Okay, so you have your identity graphs. Profiles, you generate an identity graph, no one asks for it, everyone needs it. You build traits on top of that, and it becomes this sort of table that represents everything you know about an entity. You create cohorts that are business definitions. Business users go in there, look at a cohort, they can filter it, they send the data to their tools. I mean, this sounds like a great world. Yeah. What are you building next?
Ryan McCrary 55:53
Yeah, a couple of things we’re focusing on right now. One is already live, but we’re doing a lot of work around it, and it is around the ML piece of this. So, you know, a lot of teams that are approaching us and wanting to use profiles are wanting to do so to build kind of that solid foundation to start thinking about ML use cases. You know, everyone’s trying to do it, right? Oh,
John Wessel 56:13
wait, like that’s what you do. Do you mean AI? Ryan? Oh, God, this whole time, nobody said AI?
Eric Dodds 56:19
I think we’re like an hour in. You did mention... Oh, you said there’s no AI tool. Yeah, you’re right. You’re right. I have not felt that it should be a game. Although, as a product manager, you did say it’s a feature, not a bug. That’s you.
Ryan McCrary 56:34
Yeah, I think I’m contractually obligated to say this. Yeah. So, you know, a lot of folks are doing this to build that foundation to start to leverage some of these more advanced techniques. By the nature of Profiles, as I mentioned a couple of times, we’re outputting this table that’s got all of these features that you’re defining. We also, from the start, have stored historical snapshots of that over every run. Oh, interesting. So every job run, there’s a new, like, materialized view that has this point in time? Yeah, so the view that you look at in the UI is pointing to the most recent run, but all the previous runs are in there, and you can kind of set the retention that you want. That sounds fruitful for ML. Yeah. And so, you know, that was the intention: we can do this for teams to have a good foundation for their ML. But then we realized, you know, we know exactly how we’re writing this, so we can do some ML for them. And so our Predictions product kind of sits on top of that and allows you to say, well, you’ve got all these users and their feature evolution day over day. And some of this is as simple as defining which feature you would say is a conversion, and then which ones you want excluded. Like, obviously, you don’t want to predict on my first name and my state. But, you know, excluding those, what are the things that are changing day over day? And then, given how far of an outlook you want, we can run models on that which will say, hey, this is the, you know, either lead score or churn score, but this is the propensity to do this defined conversion action. Okay, fascinating.
Eric Dodds 57:57
You go find the training data?
Ryan McCrary 57:59
I mean, I guess it’s all there, but it’s trained on the customer’s data. So yeah, we train it on those. It trains, you know, by default, usually weekly, and then on the inputs to whatever you define as, like, your conversion or something. Yep. Interesting. And so, thinking about what else we can layer on to that: we’ve recently built an attribution unit, so you can do first touch, last touch, multi-touch. And then, you know, eventually that gives you, on a per-user basis, what’s the next best action for this person based on, you know, actual training data. And then, you know, we think we’re building some of the things around, like, LTV prediction, category prediction, and things like that. So that’s really exciting. And then the other is, everything I’ve mentioned today is a batch process. So this runs on a defined cadence, sure, and calculates these things into the warehouse. Does it make sense? Yeah. And so real-time features are kind of something that we’re beginning to work on currently. And, you know, that’s the ability to have that daily aggregate run, which then gives you quick access to that historical aggregation. And then you can compare that to events coming through in real time, and access those either through our API, which hasn’t really been mentioned yet, or tack them on and send them to the downstream tools, so that you have that kind of real-time access. And that’s being used in beta by some customers right now for, like, fraud detection, you know, different things like that. But understanding your users at that point of contact, versus, you know, having to wait for the batch process, can be exciting. Super interesting.
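As a rough illustration of the first-touch, last-touch, and multi-touch attribution mentioned above, here is a minimal Python sketch. It is not RudderStack’s implementation: the function name `attribute`, the sample journey, and the choice of a linear (equal-split) model for multi-touch are all assumptions made for illustration.

```python
from collections import defaultdict


def attribute(touches, model="linear"):
    """Split one conversion's credit across the channels in `touches`.

    `touches` is a list of channel names ordered oldest to newest.
    Hypothetical helper for illustration only.
    """
    if not touches:
        return {}
    credit = defaultdict(float)
    if model == "first_touch":
        # All credit goes to the earliest touchpoint.
        credit[touches[0]] = 1.0
    elif model == "last_touch":
        # All credit goes to the most recent touchpoint.
        credit[touches[-1]] = 1.0
    elif model == "linear":
        # Multi-touch, assumed here to be an equal split per touchpoint.
        share = 1.0 / len(touches)
        for channel in touches:
            credit[channel] += share
    else:
        raise ValueError(f"unknown model: {model}")
    return dict(credit)


# Hypothetical journey for a single user who converted.
journey = ["ad", "email", "ad", "webinar"]
for name in ("first_touch", "last_touch", "linear"):
    print(name, attribute(journey, name))
```

Other multi-touch schemes, such as time-decay or position-based weighting, would only change how `share` is computed per touchpoint.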
Eric Dodds 59:11
Yeah. Wow. Okay, well, we’ll definitely have to have you back on to talk about that. We’ll make sure that producer is in here. Okay, so last, the spicy take. Okay, so reverse ETL is getting turned into a feature of a bunch of other products, which Ryan is at the spearhead of. He’s changing the entire industry right now. What is another one that you see, like a sort of, let’s say, cottage, you know, sort of data-explosion, VC-backed product, that is just gonna get turned into a feature?
John Wessel 59:46
My first thought would probably be observability. Like, there’s a lot in that space where it’s like, how does that not get rolled into something? Yeah, there are so many places that it could get rolled into. Like, it could get rolled into orchestration, it could get rolled into the warehouse itself or the pipeline tools. Sure. So that’s the one that feels the most likely. It’s useful. You know, I don’t wanna say it doesn’t do anything, because it’s useful, and you can have alerts and stuff. But at the same time, it’s not core, like, hey, this is a data pipeline that moves data from here to here that I need. That would be my guess.
Eric Dodds 1:00:24
Yeah. And there are so many tools that have access to the same stuff that could just build that.
John Wessel 1:00:31
Yeah. And then the catalog. Yeah, the catalog. It’s maybe a similar space here. Yeah.
Eric Dodds 1:00:35
Yeah. I mean, that’s yeah, both of those things are kind of just like, it’s just a matter of time until Snowflake and Databricks. Right, right. They may, they probably already have products or have acquired companies who have done that. Yeah, you know, right.
Ryan McCrary 1:00:50
I’m gonna stick in the same vein as reverse ETL. But I think ETL. Interesting. I think your traditional, like, Fivetran, Stitch, Hevo, all those players. I mean, we’ve built huge businesses around this, and I think we’ve already started to see it some, but I think the actual cloud tools themselves should start writing that data to the warehouse. Yep. Yep. Kind of cut out the need to, like, dedicate
John Wessel 1:01:11
Like zero ETL, just directly. Like, exactly, they may just have, like, data shares, basically, with all of these, like, HubSpots and these big providers.
Eric Dodds 1:01:19
Well, I mean, that’s kind of what you see with the... I mean, going back to reverse ETL, you see this with marketing platforms that are just like, we’ll just plug into Snowflake. Yeah, yeah, exactly. Yeah, I agree. I agree. I mean, the other thing is, for, let’s call it traditional ETL, at some point the big cloud providers, I mean, they already have tools that can do this, right? And so at some point they just acquire or build or something the connectors, or just the zero ETL thing. Yeah, that’s interesting. And actually, that sort of goes back to, like, a bundling, right? Where, you know, he mentioned, you know, Alteryx, or some of these others. Yeah, he’s large.
John Wessel 1:02:02
Which is super interesting, which is back to, like, the lock-in thing that a lot of people are trying to get away from. And then we’ll probably unbundle again, you know? Yeah. Next day.
Eric Dodds 1:02:10
Yeah. So, well, I think it’ll be... actually, that’s a whole other subject. But we recently had Andrew Lamb on the show from Influx, and this whole idea around object storage and, like, Apache Arrow is actually creating some crazy unbundling, you know, sort of on the analytics stack, which is really interesting. So, I don’t know. We’ll see. All right. I can’t wait to see how long we recorded. Yeah, it was, like, 25 minutes, or more than 20.
John Wessel 1:02:36
It’s more than 25.
Eric Dodds 1:02:37
I think so. All right, Ryan, thanks for coming back. Have you been on the show? I was saying “come back.”
Ryan McCrary 1:02:42
I don’t think so.
Eric Dodds 1:02:44
Wow. Okay. Thank you for coming on.
Ryan McCrary 1:02:45
Thanks for having me. Great.
Eric Dodds 1:02:47
All right. Well, when you get the ML stuff sorted out,
John Wessel 1:02:49
yeah, let us all know, walk over to my office. And when you get AI figured out. Yes. There we go. Okay, there’s my third mention. We needed another mention. Yeah.
Eric Dodds 1:03:00
All right. Thanks for joining us. Yes, subscribe if you haven’t, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.