This week on The Data Stack Show, Eric and John welcome Lewis Dawson from Momentum Consulting to discuss the complexities of marketing attribution. Lew shares his extensive background in data, from coding in the late 90s to working in data warehousing, martech, and cybersecurity. The conversation delves into the challenges of attribution, including data accuracy, integration, and the need for robust monitoring systems. Lew emphasizes the importance of creating unique join keys for campaigns, fostering collaboration across teams to improve data-driven decision-making, and more. Don’t miss this episode!
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
John Wessel 00:03
Hi, I’m Eric Dodds. And I’m John Wessel. Welcome to The Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data
Eric Dodds 00:13
work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Welcome back to The Data Stack Show. We have a special guest today: Lew Dawson from Momentum Consulting. Lew, you have such an interesting background and have done lots of different things. We met when you were a RudderStack customer. Now you’re a RudderStack partner, and you and I have actually talked many times about one of our favorite subjects, which is attribution and all of the related data and reporting stuff. So I am pumped to spend a whole hour talking with you about that. Welcome to the show, and give us just a high-level background of your journey in data.
Lew Dawson 01:04
Yeah, thanks, Eric. Awesome to be on the show. Thanks for letting me come on. In short, my background, real quickly: I started writing code back in the late 90s, so I got started really early. I loved it and have been doing it for over 25 years now. I got started early in the data warehousing space and spent a long time doing that, then moved over to the marketing space, doing the early days of martech and implementing a lot of martech technologies from scratch for companies. I did a little cybersecurity in there, and now I’m back to really solving martech full time. That’s the niche I found that businesses really need help with, and where they can really use my consulting services: how do you implement a proper, great martech ecosystem? So that’s where we are today, and that’s how I got here. Awesome.
John Wessel 01:58
So, Lew, you were talking before the show about attribution, and we’re gonna dig deep today. We’re gonna be pulling out wires, you know, “Where did that go?” It’s gonna be fun. Yeah. So what attribution topic are you most excited to jump into?
Lew Dawson 02:16
Oh, man. Attribution is a deep and wide topic, and it’s one that interests me immensely, because it’s a hard business problem and a hard data problem to solve. It touches every facet of a business and every facet of data, from coordinating with leadership, product, and marketing, so yes, you have to deal with scary people, right? All the way to those really scary, down-in-the-basement engineers. Then, talking about the data side for a second, you have to figure out: how do I model my data? How do I make sure my data is accurate? And how do I accurately represent it to the people who care about that data, so you can make good marketing decisions? And it’s just a cycle that continues over and over. Hopefully, if done right, you optimize your ad and retention ecosystem, you keep getting better and better, and you continue to grow conversions by using that attribution data. You know, how
Eric Dodds 03:21
It is hard to get there. You make it sound so simple. We’re gonna break it down today, so let’s dig in. Yeah, let’s do it. Lew, I’m so pumped to have you on the show, and I’m kicking myself, because we have talked a lot, I guess, over the last year plus. Like, every week. Yeah, every week, we talk every single week, and somehow I haven’t invited you on the show. I’ve been keeping the secret of our conversations, but now we’re going to expose that to the world. So you gave us a brief overview in the intro, but go back maybe just a couple of roles. We met when you were doing data and martech stuff at Allbirds, who’s a RudderStack customer, and that’s how we connected. We’ve maintained our friendship, and now you’re doing consulting. So go back a couple of roles, maybe prior to Allbirds, and tell us about that story, and then give us an overview of your consultancy, Momentum, and the types of projects that you work on. Yeah, of course.
Lew Dawson 04:22
In short, I really got started in the entire data ecosystem back before I got out of college. I worked at Teradata, the data warehousing company, for a long time. Probably a lot of viewers are familiar with it to some degree. They were one of the primary vendors at that time for data warehousing, yeah, and so I got a lot of exposure there to data warehousing and large-scale processing of data. And then somehow, I don’t recall the details and it doesn’t really matter, I moved over to Intuit for a while, and early on I was tasked with rewriting the personalization engine on the marketing website. A lot of that was: how do we optimize what the customer sees on the marketing website? Engagement is part of it when they come, and then how do we really optimize for conversion when they get here? How do we get them into the product? So that was my big exposure, and my big realization that this is a cool technology, this is a cool area to focus on, and I like it. There are really interesting problems to solve. So I built that for a while. Then, like I mentioned, cybersecurity for a while; I’ve always been interested in security. That’s probably less interesting on the show. And then I ultimately ended up at RudderStack through an acquisition. Sorry, not RudderStack, my apologies: Allbirds. I ended up on the data team, and that was early on. We wanted to implement Customer 360 so we could improve our acquisition and retention campaigns, but especially retention. And so we partnered with RudderStack early on. I think y’all were a really early-stage startup at that point, if I remember correctly. Yeah, I think you were one of our earlier customers. And so we worked with you on a lot of stuff. If I remember correctly, some of what we ended up implementing fed directly into some of your requirements, so, yeah, we built off of each other, yeah.
And ultimately, some relationship, yeah, yeah. Ultimately that culminated in a somewhat successful Customer 360 at Allbirds, and then I realized that I really enjoy doing this for a lot of different customers, and I enjoy the data space, and that got me to consulting for multiple different companies, doing a myriad of different things in the martech space. But I do always love talking about acquisition, because it’s such a challenging problem. Yeah, and so that’s what I’m doing today with Momentum Consulting: anything martech related. I do other stuff outside of that as well, but it’s generally marketing, focused on martech. And the niche I’ve really carved out for myself, in short, is a combination of providing a strategy for how to implement martech for folks from marketing and product all the way up to leadership, so communicating with them, getting the requirements, etc., and occasionally actually implementing solutions, either me or a team of people working with them. So that’s what we do at Momentum Consulting.
Eric Dodds 07:35
I love it. Okay, there’s so much to cover, but I want to start out with a question for both of you, and this is for you too, John: a brief anecdote, an attribution war story that was either wildly successful or a huge failure. Either one, but it needs to be on either end of the spectrum. So, John, why don’t you go first? Attribution war story, huge win or huge failure.
John Wessel 08:09
Oh, I’m definitely going to go with a huge failure, because those are the most fun. There’s kind of two parts. Those are kind
Lew Dawson 08:14
of more frequent. That’s yeah, like, yeah.
John Wessel 08:17
Those come to mind quickly, too. This was a fun one. It came to mind during prep, actually. So picture this: it was several years ago now, at a board meeting. At the time I’m sitting in, like, an IT spot. I eventually started managing marketing and IT, but I’m sitting in the IT spot, and we had a marketing leader in the board meeting presenting the overall acquisition performance and talking through that. So they’re presenting the ROAS, the return on ad spend, a super common metric. And they’re saying it’s so good, things are great, it’s an 800% return on ad spend, which is quite high. That’s, yeah, quite high. And my data brain starts turning a little bit. Fast forward a little bit: I ended up taking over that group and digging in really deep on the attribution and all of that, and we found two major problems, one of which I think was already there, and one of which I think we created. The first one was the most obvious problem, but it happens a lot: conversion events were firing twice, so that eight was really a four. And that is a massive financial difference if you’re trying to understand your ROI on ads and
Eric Dodds 09:43
your willingness to deploy a budget, yeah. So
John Wessel 09:47
that was, like, an early-on find of, “Ooh, this is not good.” And then the second one, which was just a bizarre one and hard to find: this was a B2B site. We had some larger orders, but not every order was large. And there was some bizarre bug where orders over $1,000 didn’t get captured correctly. It had something to do with, like, placement. On a typical day we would have several orders over $1,000, but not a lot, and it was just off. It was the hardest thing to find, because odds are, if you pick a random order, pick a sample, it’s not over $1,000. But there were enough to where it was a big problem overall. So those are two attribution data challenges where, yeah, it was tough. All right, your turn, Lew. Yeah, I
Lew Dawson 10:42
could think of, like John was saying, the failure that comes to mind. The quick gist, and the one that comes to mind immediately, is that attribution stopped working. Specifically, conversions stopped firing in a lot of cases, and no one noticed this for a while, right? So it’s, ooh, you basically see a massive drop-off. And theoretically they’re depending on this data in order to make decisions on how to reallocate ads. But there’s a failure on both the data side and the marketing side. On the data side, we didn’t notice; we weren’t alerting on it. And on the marketing side, it’s like, were you guys actually using the data and paying attention to the data? How did you not notice a massive drop-off? So that’s definitely the one that always sticks with me. You really need robust alerting and monitoring mechanisms on your data, which is one of the many, many problems for attribution to solve.
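Both war stories come down to checks that are cheap to automate. The following is a minimal Python sketch, not the setup either speaker actually used. It assumes conversion events carry an `order_id` that can serve as a dedupe key, and it uses a made-up 50% threshold against a trailing daily average for the drop-off alert. All function names and sample numbers are hypothetical.

```python
def dedupe_conversions(events):
    """Keep only the first event seen for each order_id (assumed dedupe key)."""
    seen = set()
    unique = []
    for event in events:
        if event["order_id"] not in seen:
            seen.add(event["order_id"])
            unique.append(event)
    return unique


def roas(events, ad_spend):
    """Return on ad spend: attributed revenue divided by spend."""
    return sum(event["revenue"] for event in events) / ad_spend


def conversions_dropped(recent_daily_counts, today_count, threshold=0.5):
    """True if today's count falls below `threshold` times the trailing average."""
    if not recent_daily_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent_daily_counts) / len(recent_daily_counts)
    return today_count < threshold * baseline


# Hypothetical sample: every conversion event fired twice.
events = [
    {"order_id": "A1", "revenue": 400.0},
    {"order_id": "A1", "revenue": 400.0},  # duplicate fire
    {"order_id": "B2", "revenue": 400.0},
    {"order_id": "B2", "revenue": 400.0},  # duplicate fire
]
spend = 200.0

raw_roas = roas(events, spend)
clean_roas = roas(dedupe_conversions(events), spend)
alert = conversions_dropped([100, 110, 90], today_count=20)
```

With this sample data, the raw events give a ROAS of 8.0 and the deduplicated events give 4.0, the same halving John describes, and the drop-off check flags a day with 20 conversions against a baseline of about 100.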
Eric Dodds 11:43
Yeah, totally. Okay, well, let’s start our deep dive. We’re sort of at the edge of the hill here, and I’ll nudge the car toward the slope. Yes, okay. I want to talk about why attribution is hard. But Lew, can you just give us a high-level definition of attribution? What is the business problem that you’re trying to solve using data? Because I think this is probably something that a lot of our listeners have exposure to, but perhaps some of them don’t, right? The levels of exposure may differ, and it can look so different at different companies. So level-set us with just a really high-level definition. What problem are we trying to solve when it comes to the subject of attribution?
Lew Dawson 12:32
Yeah, I think that’s actually one of the very early challenges among the many challenges of attribution. But in short, attribution is taking all the traffic that you receive into your engagement properties, so where the customers are coming to actually do their final conversion. Taking e-commerce, for example, like a website that’s selling things: if the user converted, so they checked out, they bought a product, you want to understand where they came from. You want to attribute where they came from to an order to understand, essentially, your customer acquisition cost, and understand how well your ads are performing. So you mentioned ROAS earlier, John, things like that. You want to understand, at the end of the day, how efficiently am I spending my dollars, number one; how well are my customers converting across my various channels, number two; and then tangentially, number three, how well am I retaining customers across different channels? So that’s the highest level. The last thing I’ll say is it’s a challenging problem because every business is a little different in how they want to look at it when you dig into the details. And then further down, different businesses have different stakeholders with different weight in that. To give you a slightly more specific example: in some businesses, let’s say the acquisition team is kind of the driver, so their leader has more weight than the engagement or the retention leader. Or let’s say the engagement leadership has greater weight; they might care more about conversion, right? Especially at larger companies where it’s, let’s say, KPI-driven development, people care about getting promotions, so they care about boosting their KPIs. Yeah, they’re going to potentially care more about prioritizing their KPIs and boosting their metrics.
So, like, conversion and engagement versus maximizing revenue. I think that’s one of the challenges, again, just defining the problem: what are you trying to optimize for? What are you trying to measure?
Eric Dodds 14:57
So, I think that’s such a good point. And I think your definition is great. You have customers coming through channels, right? You used the example of an e-commerce website, but it could be a store. And actually, more and more you have e-commerce companies that started online that are launching a brick-and-mortar presence. So you have these channels, and you want to know: we’re trying all of these things to get more people to walk into the store, to come to our website, and then ultimately make some sort of purchase. When I hear that definition, I think, before I actually had to face this challenge, it’s really easy to think, okay, I’m pretty savvy with technology and with data. We have a set of channels, so we need a measurement mechanism, and then we need to see the conversion. And I’m pretty good at math, and I feel like I can tackle that. And that is not untrue, but I think it’s easy to start out with the idea that it doesn’t seem like that hard of a problem, and it actually turns out to be a very difficult problem. But why is that? Break down for us the different dimensions of why actually putting that math problem together is really challenging. Because there is an entire multibillion-dollar industry of software focused on this, and that doesn’t even include all of the effort and time and compute that goes into companies that are hand-rolling this on their own stack, in their own infra.
Lew Dawson 16:41
Yeah, yeah, absolutely. It’s a multifaceted challenge, like I said. From the business perspective, all the people involved are one challenge. Just to give you one super quick example, and we can dig into this later: you need to structure campaigns a certain way, like the wording, how you define them, etc. So right there, that’s a people problem, but that then becomes a data problem. Getting to the data side, there are massive amounts of data challenges to actually make this work. Using that same example, you get the data on the other side. Well, what if the campaign name is not the same every single time, even when it comes from the same source? What if someone’s browser mangles it? Well, now, without additional logic, you can’t attribute 100% accurately every single person coming in and every single conversion, right? So the data challenges are, I don’t want to say immense here, but there are a lot of them, and they are complex. From the data side, there are a few challenges. I just highlighted data accuracy, so getting the data in fully, accurately, and correctly. That’s acquiring it, transforming it, and spitting it out correctly. Yep. Then, before getting the data in, part of it is just generating the data. So on your engagement portion, on your website, your mobile app: am I even generating the data necessary to track where someone came from, and also what they purchased? So again, I came in, I checked out on my Shopify cart. How do we get the data that says I, Lew, purchased this product, and I came in from these channels? How do I then merge that data with two things: the data that came in-session, session data, sorry, so all my behavioral data, and then also all the ad spend data that I then pulled in?
How do I merge those together to say, oh yeah, for Lew, I spent 30 cents showing him an ad; I spent $10 showing him that ad? You have to connect all that data. There’s a huge data connection problem that’s way more complex than it seems on the surface. Next is the what-do-I-do-with-that-data problem. So it’s, that’s cool, you’ve connected it, you now have data, but the presence of data alone doesn’t help you. Now you have to figure out what to do with it in order to give me data that I can go take action on to evolve my business, to improve my conversion, to improve my revenue and profit. And that’s a challenge on its own: how do you first figure out what’s important, and then whose hands do I get that into so the correct decisions can be made, so that we can evolve these campaigns, boost the good ones, kill the bad ones? And then lastly, again, it’s a people problem. How do I coordinate everyone to do all the things correctly across all these technologies we just talked about, to make sure that nothing breaks and that everything is done in a normalized enough fashion that we can continue to do this over and over?
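The data connection step Lew describes can be sketched as a join between conversions and ad spend on a shared campaign key. This is an illustrative Python sketch only: the `campaign_key` field, the row shapes, and the function name are assumptions made for the example, and a real pipeline would more likely do this join in the warehouse.

```python
from collections import defaultdict


def join_spend_to_conversions(conversions, spend_rows):
    """Join conversions to ad spend on an assumed shared `campaign_key`.

    Returns a per-campaign report and the conversions that matched no spend row.
    """
    spend = defaultdict(float)
    for row in spend_rows:
        spend[row["campaign_key"]] += row["spend"]

    orders = defaultdict(int)
    unmatched = []
    for conversion in conversions:
        if conversion["campaign_key"] in spend:
            orders[conversion["campaign_key"]] += 1
        else:
            unmatched.append(conversion)  # e.g. a mangled or missing campaign name

    report = {}
    for key, total_spend in spend.items():
        count = orders.get(key, 0)
        report[key] = {
            "spend": total_spend,
            "orders": count,
            # Cost per acquired order; None when the campaign produced no orders.
            "cac": total_spend / count if count else None,
        }
    return report, unmatched


# Hypothetical sample rows.
spend_rows = [
    {"campaign_key": "fb_spring_sale", "spend": 100.0},
    {"campaign_key": "fb_spring_sale", "spend": 50.0},
    {"campaign_key": "google_brand", "spend": 80.0},
]
conversions = [
    {"order_id": "A1", "campaign_key": "fb_spring_sale"},
    {"order_id": "B2", "campaign_key": "fb_spring_sale"},
    {"order_id": "C3", "campaign_key": "fb_spring_sale"},
    {"order_id": "D4", "campaign_key": "FB Spring Sale"},  # same campaign, different spelling
]

report, unmatched = join_spend_to_conversions(conversions, spend_rows)
```

The unmatched list is the interesting part of the design: a conversion whose campaign name does not match any spend row is kept visible instead of being silently dropped, which is exactly the naming problem raised earlier in the answer.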
John Wessel 20:21
No. And I want to expand on that, on the people problem of this, because I think this is really fascinating. 100%, I think you hit all of the major components there, but there’s this additional people problem of people having to do the right things. Like you said, name the campaigns the same every time for the same campaign. So there are people problems like that. There’s also this people problem where, at its fundamental level, we are taking this big pile of money, of revenue for the company, and trying to figure out who gets credit for what. And that creates drama in most companies, right? Like you said, if you’re driving hard, you’re the Amazon channel, or you’re the inside sales team, or you’re the whatever, each of them wants their fair share, their fair credit, you know, attribution for whatever they contributed. Many of them have financial incentives. That adds a whole other dimension to this problem, besides the extremely big technical problems. Yep,
Lew Dawson 21:30
yeah. And I think that’s especially prevalent, again, when you’re figuring out what you’re measuring. That adds a whole layer of problems, especially when you get into multi-touch attribution, which we’ll talk about later in greater detail. But in short, people get partial credit. Yeah, and that’s a problem when a business partner, like a stakeholder, decides, “I disagree with that. I think I should have gotten more credit for that one.” Yep, yeah, there are all sorts of people issues here. I think they’re almost as prevalent as the technology issues, I’ve noticed.
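To make “partial credit” concrete, here is a small Python sketch of one possible multi-touch scheme, a linear model that splits an order’s revenue evenly across every touchpoint. The speakers do not name a specific model here, so the linear split, the function name, and the channel names are all assumptions chosen for illustration.

```python
def linear_attribution(touchpoints, revenue):
    """Split revenue evenly across touchpoints (a linear multi-touch model).

    A channel that appears more than once accumulates one share per touch.
    """
    if not touchpoints:
        return {}
    share = revenue / len(touchpoints)
    credit = {}
    for channel in touchpoints:
        credit[channel] = credit.get(channel, 0.0) + share
    return credit


# Hypothetical journey: four touches before a $100 order.
journey = ["paid_social", "email", "paid_search", "email"]
credit = linear_attribution(journey, 100.0)
```

In this example email receives 50.0 and the two paid channels receive 25.0 each. A stakeholder who believes the first or last touch deserves more would argue for a different weighting, which is the disagreement Lew is pointing at.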
Eric Dodds 22:06
Okay, let’s dig into the tech stack a little bit. And Lew, let’s walk through the sequence that you discussed, because I want to dig into the people side a little more later. I think, to your point, arguably, if you can solve the people side, then that actually paves a pathway for the tech side. But let’s talk about the stack really quickly, just to orient everyone. We talked about collecting the data. Are you even collecting the data? So let’s start there, before we even get to accuracy. Where is this data coming from? What are the data sources? And what mechanisms are you using for this capture? If you’re going to go in and put together a strategy, just describe the types of, you know, pipelines, I guess, or data sources,
Lew Dawson 22:57
but I’d actually take it up even one step from that, crazily enough. It just feels like table stakes, but having been on a number of platforms, I have to say this: even being able to create those campaigns comes first, right? Yeah, it sounds stupid to say that, but some of those platforms are actually a little bit on the harder side to even create campaigns on successfully, to get them started, like
Eric Dodds 23:27
You’re talking about someone going into an advertising platform. You have to create some entity that’s a campaign. It has to target some subset of users. You’re sending something, text or images or something, that’s going out, yes, to reach these
Lew Dawson 23:46
people. It has to go to a valid landing page. Like, oh yeah, it should be a tailored landing page. It’s the easier part, but nonetheless this is still a barrier in itself. For someone who’s new to this whole paradigm of, let’s say, an e-commerce website, that’s the first barrier they have to hop over: how do I even run an ad? And that, on its own, will take time to learn for one platform, let alone Facebook, Google; there are a number of different platforms, right? Yep, I would say that’s first, yeah. And what are the
Eric Dodds 24:25
speak to the listeners who are on the other end of the pipeline, where the campaigns and landing pages are generating data, but they’re on the other end, so they’re seeing this come through, and they probably see it as tables of data. Speak to them a little bit about the things to keep in mind about that process. Let’s just call them assets. You have to have some sort of assets that are actually going to generate this data. Yeah, right. There’s a campaign that’s being served, someone’s clicking on something, they land on some landing page, or something like that, right? Which ultimately generates the data. For the data professional on the receiving end of the pipeline, what are the main things they need to know about that whole process
Lew Dawson 25:17
you’re referring specifically to all that data flowing in on the other end? Yeah, totally. Understanding, okay. There are a number of things that have to be orchestrated on that end. Let me know if this doesn’t completely answer your question, yeah, but there are a number of different areas that have to be orchestrated together to get all that data right, which we’ll talk about in a second. But effectively, that data only flows in if you enable the campaigns, and further, that data only flows in if you are collecting behavioral data manually or your platform is in some fashion collecting the data, especially the UTM params that are in the URL. Yep,
Eric Dodds 26:00
Those, really quick, just for those who don’t know what UTM parameters are: give us a quick breakdown on UTM parameters, because I think that will become important later in the conversation. Yeah,
Lew Dawson 26:13
it’s kind of an antiquated paradigm in technology at this point. But in short, query params, well, two things. So, query params: in any URL, after the question mark, you’ll see key-value pairs. The key is some sort of text, then an equals sign, then you’ll see more text, and then possibly an ampersand, and you’ll see that repeating over and over. Those are query params. They give you the ability to add additional data and/or metadata that, in a lot of cases, modifies the behavior of the experience the customer sees, or just tracks data. So UTM is Urchin tracking metrics, I think; I can’t remember the M, yeah. But nonetheless, it’s a company that, I would say, to a degree was the initial starter of a lot of what we would call modern analytics. They were the company that developed what is now Google Analytics. Google Analytics actually bought them, or rather Google bought them and turned it into Google Analytics. So in short, there’s a specific set of UTM params. So UTM name for, like, the campaign, or is it UTM campaign? There are a few of those, and those are standard, and they’re used to track various dimensions of a specific campaign. Yeah. So those ideally come in on every channel, every time a user comes from an external site or an external entity into your engagement experience. I say ideally because that doesn’t always happen, due to a myriad of reasons, and that’s yet another reason why this is challenging.
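The key-value structure Lew describes is easy to see in code. Below is a minimal Python sketch using the standard library’s `urllib.parse` to pull the five conventional UTM (Urchin Tracking Module) parameters out of a landing-page URL. The example URL and the function name are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# The five conventional UTM parameter names.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")


def extract_utms(url):
    """Return whichever UTM parameters are present in the URL's query string."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


# Hypothetical landing-page URL.
url = (
    "https://example.com/landing"
    "?utm_source=facebook&utm_medium=paid_social&utm_campaign=spring_sale&color=blue"
)
utms = extract_utms(url)
```

Here `utms` contains the source, medium, and campaign, while the unrelated `color` parameter is ignored. A visit that arrives with no query string simply yields an empty dictionary, which is the “doesn’t always happen” case.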
Eric Dodds 27:57
Yeah, I think that’s one of the fascinating things. I mean, query params are used for all sorts of things in software, right? They can filter a list, they can do whatever. And I think when Urchin decided to use that back in the day as essentially a way to capture metadata about the source a user is coming from, it was a very elegant way to solve a pretty tricky problem in a ubiquitous manner. Then Google Analytics, as a free tool, gets worldwide mass adoption as the go-to way to track web analytics, which means that UTMs, for better and now probably for worse, yes, are cemented as the way. So you have five dimensions as key-value pairs that drive marketing reporting for, like, most of the world, and
John Wessel 28:55
There are five arbitrary dimensions. They are completely made up. This is something I didn’t know: they are completely made up. You can type whatever you want, and you can have as many as you want. But, yeah, like you said, because of the Google Analytics adoption, yep, these are the five that somebody, like you said, 20 years ago decided on and kind of standardized. Yeah. I think the
Lew Dawson 29:19
Another part of that, like you were saying, John, is that in addition to people being able to decide what goes in there, each platform suggests you use certain UTM parameters differently, too. Yeah, right. Yeah, challenging. So it’s like, here’s how we generally do it on here, but you could do it whatever way you
John Wessel 29:38
want. Yeah, it is the worst kind of standard, because it’s completely unenforceable and interpreted differently, right? So while there is a standard as far as these five things, people use them so wildly differently that it’s almost not worth having, right, right?
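Because nothing enforces how UTM values are written, teams often normalize them before reporting. The Python sketch below shows one hedged way to do that: lowercase, trim, and collapse spaces and hyphens into underscores, then map a few source aliases onto one canonical name. The alias table and function names are hypothetical, not a convention any platform prescribes.

```python
import re

# Hypothetical alias table; a real one would come from auditing your own data.
SOURCE_ALIASES = {"fb": "facebook", "ig": "instagram", "adwords": "google"}


def normalize_utm_value(value):
    """Lowercase, trim, and replace runs of spaces or hyphens with underscores."""
    return re.sub(r"[\s\-]+", "_", value.strip().lower())


def normalize_source(value):
    """Normalize a utm_source value and map known aliases to one canonical name."""
    cleaned = normalize_utm_value(value)
    return SOURCE_ALIASES.get(cleaned, cleaned)


campaign = normalize_utm_value("  Spring-Sale 2024 ")
source = normalize_source("FB")
```

After normalization, `"  Spring-Sale 2024 "` and `"spring_sale_2024"` land in the same reporting bucket, and `"FB"` is counted with `"facebook"`.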
Eric Dodds 29:53
Well, and that’s kind of why I wanted to speak to that a little bit for the person who’s on the receiving end of that, because my gut is just to say, like, come on, we have, like, five. Okay, actually, it even reinforces it: we have five dimensions here, this can’t be that hard. But it actually is. Yeah, it is pretty tricky to actually get things tight, even just from tagging those five dimensions as metadata. I think
Lew Dawson 30:19
At the end of the day, this foreshadows a conversation we’ll have later, so we’ll build a little bit of “ooh” here. But there are ways that you can do this. You could make it work across all these paradigms, yeah, and we’ll unpack some of those, just to let people know that there are some better ways,
Eric Dodds 30:34
yes. Ooh, yes. Oh, I like that, Lew, foreshadowing. Yes. Actually, Lew, I think that you have some immensely helpful methodologies here to help overcome that. Okay, so then we have to collect the data. So you have to create the campaigns and the assets, and then we’re collecting data, and you’re using pipelines to do that. So there’s probably behavioral data and structured data that’s coming in.
Lew Dawson 30:59
Well, yeah, yeah. So collecting the data: there are kind of two phases to collecting any data. First, it’s getting the data out of the source system, so out of Google Ads, Facebook Ads, and again, this whole thing is crazy, but there’s a myriad of challenges there. Number one, everyone does it differently, so the schema is different, the data structure is completely different. And number two, some of these platforms make it really challenging to get the data out, both because the naming is convoluted and complex, and also because of throttling. Facebook is a great example of this. Their paradigm of how much data you can get out within a time frame is completely dependent on your audience size, like the audience that you reach in Facebook. So the larger the audience you reach, the more data you can get out at a time. Which makes sense when I say it out loud at a high level, but it creates some pretty tough challenges when it’s like, yeah, we’re always getting throttled, we’re so far behind collecting the data. So that’s one thing, just getting the data out of the source system. And then the other challenge, which is a little bit easier, is getting that data into a place where you can transform it, where you can do the actual attribution. Generally that’s going to be a data warehouse. Sometimes people favor data lakes, and sometimes they’ll do a data lake like S3 and then data warehouses. But nonetheless, wherever you store your data, you have to get it in there, right? And we’re talking some pretty large volumes of data for some of these companies. It’s not trivial. It’s not data that’ll just take, like, 30 seconds to stream. Sometimes, yeah, we’re talking about impressions. Go ahead. Yeah.
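A common way to cope with throttled extraction is to retry with exponential backoff. The Python sketch below is generic and does not model any real ad platform’s API: the `Throttled` exception, the `fetch_page` callable, and the delay values are all assumptions, and the demo injects a fake fetcher and a fake `sleep` so it runs instantly.

```python
import time


class Throttled(Exception):
    """Raised by the (hypothetical) fetch function when the API rate-limits us."""


def fetch_with_backoff(fetch_page, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page, doubling the wait after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch_page()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Demo with a fake fetcher that is throttled twice, then succeeds.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise Throttled()
    return ["row1", "row2"]


delays = []  # records requested waits instead of actually sleeping
rows = fetch_with_backoff(fake_fetch, sleep=delays.append)
```

In the demo the two throttled attempts request waits of 1.0 and 2.0 seconds before the third attempt returns the rows. Passing `sleep` as a parameter is a design choice that keeps the retry logic testable without real delays.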
John Wessel 32:49
I mean, there’s also just this bad alignment between your interests and Google’s, Meta’s, whoever’s interests, in that they don’t want you to get the data out. They just want you to trust them. They’re like, oh, your ROI is this, or your whatever is this. They don’t really want you to dig into it. I mean, let’s face it: A, it’s better for them, because it’s costly to be streaming all that data out of their system. That costs them money. And then B, the bigger thing is, yeah, just trust us. We’ll tell you if it’s going well or not.
Lew Dawson 33:25
Yeah, yeah. That’s a fantastic foreshadowing point too, one that we’ll have to touch on. It’s like, well, how does Facebook, how does Google, track a conversion? Are they tracking it the same way? Are they counting every single user who came to your site as a conversion? They say they don’t, but it is a black box. And when you go and calculate some of these yourself and compare them, they’re wildly different, your calculation with your RudderStack behavioral data versus their calculations. So sometimes you question, is the fox guarding the hen house? Because they’re incentivized to boost the conversions you’re seeing, because theoretically it will grow their ad revenue, because you’ll be like, oh yeah, I’m gonna spend more. So that’s an interesting comment, John.
John Wessel 34:19
Yeah. And then you have the attribution fighting problem too. If you’ve got multiple platforms you’re using for advertising, and multiple for retention, you’ve got this kind of war of, oh, I want to take credit for this one, and then some kind of retention tool wants the credit too. And in reality, the numbers rarely end up adding up. Say it’s $100 in revenue, and then the attribution adds up to $200. Well, I only got $100, but this attribution data adds up to $200. All of these can’t be right. Yep, it’s just another challenge.
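To make John's $100 versus $200 example concrete, here is a small worked sketch in Python with made-up numbers. The platform names are placeholders, and the proportional scaling at the end is just one assumed way to reconcile the claims, not a method discussed on the show.

```python
# Hypothetical numbers: one $100 order that two tools each claim in full.
actual_revenue = 100.00
platform_claimed = {"ad_platform": 100.00, "retention_tool": 100.00}

# Sum of what every platform says it drove.
claimed_total = sum(platform_claimed.values())

# How far the self-reported numbers overshoot what actually came in.
overclaim_ratio = claimed_total / actual_revenue

# One simple (assumed) reconciliation: scale each claim down proportionally
# so the attributed total matches the revenue that really exists.
reconciled = {
    name: claim / overclaim_ratio for name, claim in platform_claimed.items()
}

print(claimed_total, overclaim_ratio, reconciled)
```

The point is only that platform-reported credit is not additive; any reconciliation rule is a modeling choice the business has to make.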
Eric Dodds 34:54
Yeah. Okay, so we’re collecting data from source systems on the advertising side. We also need to collect data from, you know, the website or the digital property. So, obviously, it’s RudderStack for that. That’s the behavioral data: capturing page view data, conversion data, etc. And you’re streaming that to the data store, so a data lake or a data warehouse. Okay, so now we’re with the person who’s on the receiving end of that, and they probably have a lot of different tables. So what do we do now?
Lew Dawson 35:33
A lot of tables, both in terms of the number of tables and the amount of data within those tables.
Eric Dodds 35:38
Yes, yeah, go ahead. Well no, so I’m saying, Okay, what
Lew Dawson 35:43
What do we do now? Yeah, what do we do? So at that point, the data all has to be transformed, and ultimately it all has to be merged together. The precursor to all that, which a lot of engineering folks especially struggle with, is: okay, what’s the end state? What are we trying to accomplish here? Because it’s very tough to actually merge the data together and figure out what we’re trying to get out of it if you can’t really say what the end state is. So that’s usually the first step, which we’ll unpack in a minute. But speaking directly to your point: once you figure that out, once you say, okay, I want to understand how well users are converting on my website across all the channels that we’re advertising on, for example Facebook, Google, etc. Let’s just say channel level to start with, to keep it easy. Yep. So Facebook is a channel. Google Ads is a channel. Just to clarify: how well am I converting
Eric Dodds 37:01
there? So I spend advertising dollars. People are clicking on ads. They come to my site. When we say converting, it’s just, okay, how many people who come from Facebook actually buy something, where I make money on the purchase relative to the advertising dollars that I put towards the ad that they clicked on?
Lew Dawson 37:23
Yeah, so I spent $10 on the ad. Did the user purchase, first of all, and then how much did they purchase? Essentially, did I get more back than I put in? That’s ultimately the question you want to answer. Yep. And that then ladders up to all sorts of different interesting things. The other fairly big thing you might want to measure is conversion. Now, I’m not a huge believer in measuring conversion, because that can be gamed. We can talk about that later. But nonetheless, those are kind of the two main things. So basically, it’s a transformation problem. You have to get all that behavioral data that you’ve collected on the website, so that’s got UTM params, user conversions, things like that. Yep. Usually, you also have to get all your order data, which gives you your conversions and the amount the user spent. Sometimes you merge those two to a degree to make sure they align closely. As John said early on, sometimes you can’t get 100% of the data in behavioral, so that’s why you’d want to merge in your actual e-commerce data, let’s say from Shopify or whatever. Yep. Then you have to merge in your ad spending data, so we’re talking Google Ads and Facebook Ads here. You then have to figure out, okay, how do I normalize that data to work out, per channel, how much did I spend? And usually this is temporal data, so you do it per day, per week, per month, per year, etc. Same with the other two, I should mention. And then lastly, once you’ve merged all that together, you have to generate data from that: metrics, measurements. And that’s, like I talked about a minute ago, that conversion, that revenue,
Eric Dodds 39:22
etc. Okay, so I want to ask two things. One is that I’m going to play dumb and ask about the keys that you join on at a very high level. And the second is, I actually want to circle back to your way of thinking about UTM parameters and how to solve some of the problems around them, because you have a couple of ways, and we’ve talked for a long time about some ways of overcoming the challenges there. But okay, one: the join key. I’m massively oversimplifying this, but I think it’s fun, and I think it’s important to get into the details. Hopefully it’s helpful. One of the join keys that makes sense to me is that you have behavioral data from the website that contains the UTM values from a page view. So when someone clicks on an ad, they come to the site. We’ll use RudderStack as an example, as you and I have talked about a ton. It fires a page call that goes into your warehouse. It gets flattened into a table, and there’s a column that says UTM campaign from that page view, with the timestamp, in that table. Okay, then there’s the data that comes from the source advertising systems. There’s some campaign and ad; you know, there’s an ad with a row of data. However you have to clean it (probably not; well, probably you do, actually, in your experience, so I can’t play completely dumb), you clean it up and you essentially get some clean tables that are rows of data where there’s a URL that you input into the source advertising system when you deploy the ad, so that when they click on it, the user goes there. So at a very high level, you can join on UTM keys, or sort of the components of the URL, in order to tie it together: okay, I spent this much money on this ad, and then I see this many UTMs in the behavioral data, and then you can sort of correlate that to conversion. Now, what makes this really gnarly is that you have to do that at a unique user level, right?
Because you have to tie the purchase and the page view and the conversion and all of that to a unique user, so that you can say, okay, this page view is associated with this user, who is associated with this actual transaction that has a dollar value tied to it. And so there’s almost a user reconciliation, identity resolution type element to this too, where you have to make sure that you’re reconciling that cleanly from a user standpoint. Am I thinking about that correctly? Yes,
Lew Dawson 42:00
you absolutely are. And there’s even more to it as well. So you’re spot on. It actually is an identity resolution problem, and that identity is basically the, we’re gonna say channel right now, because we’re doing channel level, but it depends on what level you’re doing, right? So channel, ad set, ad: at each level, it’s an identity resolution problem.
Eric Dodds 42:27
So, oh, right, yes, yeah. Like at each entity, right? Because you have, yeah, yeah. You have to reconcile all the different, disparate data from the source system actually to whatever key you’re going to join on so that you can, yeah,
Lew Dawson 42:38
yeah. So taking channel, you have to do identity resolution on: what are all the channels? In this case, theoretically, it’s Facebook and Google. Then you have to figure out, okay, for each one of those channels, what are the order values that we talked about? And so your join key in the end is those two channels. But then there’s the part where I was saying there’s a little bit more to it. You also have to figure out your spending in the ad platform, which, again, needs a join key. Ultimately it’s a combination of what did I spend at a channel level, and then joining that with the other two to get channel, orders, and spending, right? Ad spending, conversion dollars, and channel. And so the combination of those three, at a high level, is how your joining works. And again, think about how that gets more complex with each level you go down. Just touching on ad set for a second: for listeners out there who may know a little bit less, an ad set is the step below a campaign. So within a campaign, you have an ad set, and an ad set, or an ad group, can basically be multiple ads. We’ll unpack later why you’d want to do that, but for now, just think of multiple ads. And so now your join key is the ad set and campaign. So
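The channel-level join Lew walks through (behavioral data, order data, and ad spend, all keyed on channel) can be sketched in a few lines of plain Python. Everything here is illustrative: the rows, field names, and the `channel_report` helper are hypothetical, the ad spend is assumed to be already normalized to a channel name, and each user is assumed to map to exactly one channel, which skips the real identity resolution work discussed above.

```python
from collections import defaultdict

# Hypothetical rows; in practice these would come from warehouse tables.
page_views = [  # behavioral data: which channel brought each user
    {"user_id": "u1", "channel": "facebook"},
    {"user_id": "u2", "channel": "google"},
    {"user_id": "u3", "channel": "google"},
]
orders = [  # e-commerce data: what each user actually spent
    {"user_id": "u1", "amount": 30.0},
    {"user_id": "u3", "amount": 50.0},
]
ad_spend = [  # ad platform data, assumed already normalized to channel names
    {"channel": "facebook", "spend": 10.0},
    {"channel": "google", "spend": 40.0},
]


def channel_report(page_views, orders, ad_spend):
    """Join the three sources on channel and compute revenue per dollar spent."""
    user_channel = {pv["user_id"]: pv["channel"] for pv in page_views}

    revenue = defaultdict(float)
    for order in orders:
        channel = user_channel.get(order["user_id"])
        if channel is not None:  # orders with no known channel are left out
            revenue[channel] += order["amount"]

    spend = defaultdict(float)
    for row in ad_spend:
        spend[row["channel"]] += row["spend"]

    return {
        channel: {
            "spend": spend[channel],
            "revenue": revenue[channel],
            "return_per_dollar": revenue[channel] / spend[channel],
        }
        for channel in spend
        if spend[channel] > 0
    }


report = channel_report(page_views, orders, ad_spend)
```

Going down to ad set or ad level would mean widening the key from `channel` to something like `(channel, campaign, ad_set)` in all three sources, which is where the matching gets harder.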
Eric Dodds 44:21
campaign would be like, you know, overstock sale. And channel, sorry: you have overstock sale, but that could be a campaign in Google and a campaign in Facebook. Then you could have an ad set that’s like shirts, and an ad set that’s like shoes or pants or whatever; these are sort of logical groupings. Yeah. And then you may have ads within an ad set that are like blue shirts or green shirts or something. And so you have a pretty complex hierarchy, even just to try to triangulate all of that with spend. Yeah, and so that’s your
Lew Dawson 44:58
join, right? So your join key is the combination of all those things, at whatever altitude you want to look at. Wow. And so again, this gets back to: what does the business want to measure? What’s the outcome? You have to decide that up front, but a lot of people don’t understand that you have to decide that up front. I mean, I guess you don’t technically have to. You can always do it later. But to really do it well, you should decide up front. Yeah, that
Eric Dodds 45:25
may actually be... sorry to interrupt you there, Lew. Not at all. I just want to reiterate that that may actually be one of the most helpful things I’ve ever heard about attribution: decide what you want up front, because there are so many ways to slice this. And altitude, I think, is a great word for that. You can go so granular and get so close to the ground with a magnifying glass, right? Or you can be at 30,000 feet. And none of those are wrong. But trying to do every level of altitude is
John Wessel 46:05
impossible, yeah, at least a bad idea. At
Eric Dodds 46:08
a minimum, rarely ever worth Yeah, the effort, right? But I
Lew Dawson 46:12
I think that gets to the second part exactly. For sure, you can do any altitude. But a naive individual might be like, oh, let’s just go all the way down, and then we’ll have the data all the way up. Sure, you can do that. But that actually is the hardest implementation there is. It’s harder to implement the deeper you go, but it’s also harder to gain information that you can use to make actionable decisions the lower you go. In a lot of cases, I equate this to stock trading a little bit. The more information you have, possibly the better decisions you can make, but also the worse decisions you can make. If you’re trying to pick between two stocks, it’s an optimization problem: stock A or stock B. And likewise, you’re trying to pick between advertisement A and advertisement B. Because you’re at the ads level, you’re trying to figure out which one to go with. There are a lot of different data points that you can decide on, so it’s not always gonna be a straight answer of I should always go with A, or I should always go with B. Same with stock trading, right? Because stock trading is economics based, it’s news based. So there’s a myriad of different things you have to look at in order to actually decide which ad should I boost, which ad should I kill, or should I do nothing? And so the decisions get more complicated the lower you go, because you also have to get more data, and you have to decide on more ads: which ones do I want to keep? Which ones do I want to get out of? Same with stocks: the more stocks you’re looking at, the more it’s like, which ones do I trade more of? Which ones do I get out of? It’s a Kelly criterion optimization problem, whether it’s stocks or ads. You could apply it, kind of. Yeah.
And so there are your join keys, just taking that back. And then also, if you think for a second about the other challenge of just generating the join keys, which I want to call out to people, like I highlighted earlier: the data is not consistent. I think that’s actually one of the biggest challenges at any level, but especially as you get lower, because the join keys get more complex. Say 100 different users came to my website from Facebook. For 95 of them, the campaign name was correct, but the campaign had a space in it, and so for five of them, the space isn’t represented as %20, it’s represented as a plus, right? So now, theoretically, if you’re doing a direct string match, you actually have two different campaign names. If you naively do a direct string match in order to create your join keys, you now have two different campaigns, even though they were the same campaign. Yeah, but the characters were different. Yep. So that creates a whole different set of challenges. It’s basically creating standardized join keys. You have to standardize those names. You have to figure out which ones are the same, and which ones actually are different even though they look similar. Yeah?
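Lew's `%20` versus `+` example can be reproduced with Python's standard library. This is a minimal sketch, assuming the campaign value arrives still URL-encoded; `normalize_campaign` is a hypothetical helper, and the campaign name and the 95/5 split are illustrative.

```python
from collections import Counter
from urllib.parse import unquote_plus


def normalize_campaign(raw: str) -> str:
    """Decode URL escapes, collapse whitespace, and lowercase a campaign name."""
    decoded = unquote_plus(raw)  # both "%20" and "+" decode to a space
    return " ".join(decoded.split()).lower()


# 100 visits to the same campaign, encoded two different ways.
raw_values = ["overstock%20sale"] * 95 + ["overstock+sale"] * 5

# A direct string match sees two campaigns.
naive_counts = Counter(raw_values)

# After normalization there is only one.
normalized_counts = Counter(normalize_campaign(value) for value in raw_values)

print(len(naive_counts), dict(normalized_counts))
```

The trade-off Lew mentions still applies: aggressive normalization (lowercasing, treating every `+` as a space) can also merge names that were genuinely meant to be different, so the rules need to be agreed on rather than guessed.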
Eric Dodds 49:38
Well, and totally, because I think it’s easy to conceive of the modeling problem. It’s like, okay, multiple levels of altitude, yes, that can get complex. But if you assume that you’re not going to have dirty data, it’s like, okay, that can get complex, but that’s doable, right? But the dirty data problem compounds, because you have different levels of altitude within each platform. You have all the different platforms, and you have the fact that the data is actually delivered differently in all these different platforms. And because they’re all different tech, the conventions can break in all sorts of different ways. And so the long tail becomes absolutely insane.
John Wessel 50:26
Well, and even if you have a completely aligned marketing data team, you know, the whole team is aligned, and all of your stuff is named perfectly correctly every time on every platform, which never happens, even if that were the case, this is free-form data. Any user, yeah, if you’re an evil person and you want to mess with some marketing people, let me give you some tips. No, but really, any user can advertently or inadvertently, like you said, introduce a little space, any of the millions of people that may be on your website, and then all of a sudden you have two campaigns for that one little record. And so it is an unsolvable problem to get to perfect. Yep.
Lew Dawson 51:08
Or John Doe decides he wants to do something different, because he’s new to the company and he doesn’t really know or understand, or he’s like, I don’t want to read all the materials, and he names the campaign differently, or he modifies a currently named campaign because of, like, a spelling error. Yeah.
John Wessel 51:24
sure, yeah, familiar now, yep.
Lew Dawson 51:27
Now you’ve splintered your campaign, right? Yep. And to highlight, John, that’s a great point. It’s free form. It’s an absolute nightmare.
Eric Dodds 51:37
yeah, okay, so we’re clearly gonna have to, gonna have to turn this into a two part, a two part series, because we are, like, maybe 5%
Lew Dawson 51:50
at least, if not more. One
Eric Dodds 51:52
thing I do want to cover really quickly, because this is great. I actually think we’ve gotten pretty deep down into the stack and into the data. But Lew, talk us through some of the ways that you mitigate that free-form data challenge and the inherent limitations of the prevailing five-dimension metadata methodology that is so ubiquitous because of Google Analytics. One thing I love about this approach, which we’ve talked about many times, is that it’s a holistic way of thinking about the problem, both in terms of the inputs and in terms of join keys, and even the way that you think about solving the modeling problem. So just walk us through a different way to think about that, one that can help you move beyond being beholden to five free-form dimensions that are, you know, impossible to solve for. Yeah, two
Lew Dawson 52:59
things before I get into that. Number one, this isn’t perfect, first of all, right? As John eloquently put it, it’s still free form, and it’s impossible to get perfect. But this is an improvement. And then number two, credit where credit’s due. I’d been kicking a general idea like this around for a while, and I was talking with Eric about how to do this better. And Eric mentioned that his old firm had come up with a way to do this as well, and they’d come up with a pretty good way. So it was a combination of my thinking and that conversation. So Eric, thank you. You actually helped out a lot in this space, you and your team of folks, like you and Benji. So this is definitely not just me, right? This is far from just me coming up with this.
Eric Dodds 53:52
many conversations over many months. Yeah. But
Lew Dawson 53:56
in short, it is a key, right? At the end of the day, think about it from that perspective. And actually, I’m sorry, one more thing super quick that I wanted to highlight before: what I’m about to say is why it’s so important, again, to define what you’re trying to do up front. Doing this up front will save you so much trouble and will enable you to do historical merging of your data, versus if you don’t do this until later on, it’s going to be tough to nearly impossible to go back and do your historical attribution. So getting into the meat of it: to be more successful at this, and to take out a lot of the “hey, UTM params, especially at lower altitudes, are really hard to merge together and create a key from” problem, you just create that merge key up front. So it’s: create the merge keys up front and attach one to every single campaign. Every single Google campaign, every single ad set, every single ad has a unique key, and it’s a spaceless key, right? It’s a key that’s gonna be tough for a browser to munge. I’m not saying it’s impossible, but it’s gonna be very tough. And you attach that to essentially every single ad, and that unique join key is a query param. There are some nuances to that, obviously, which you and I, Eric, have talked about before and which we can unpack here. But basically, if you do that job up front, you can use that join key to skip all of the challenges we just talked about, and just join on that key. Right?
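One way to picture "attach the key to every ad as a query param" is sketched below using Python's standard library. The parameter name `jk`, the example key value, and the helper names are all assumptions made for illustration; the conversation does not specify what the parameter is called or how the key is formatted.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def tag_url(landing_url: str, join_key: str, param: str = "jk") -> str:
    """Return landing_url with the join key attached as a query parameter."""
    parts = urlparse(landing_url)
    # Keep existing parameters (such as UTMs), replacing any previous key.
    pairs = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != param
    ]
    pairs.append((param, join_key))
    return urlunparse(parts._replace(query=urlencode(pairs)))


def read_join_key(page_url: str, param: str = "jk"):
    """Pull the join key back out of a page-view URL; None if it is missing."""
    return dict(parse_qsl(urlparse(page_url).query)).get(param)


# Usage: tag the ad's destination URL once, then recover the key from the
# page view that lands in the behavioral data.
ad_url = tag_url("https://example.com/sale?utm_source=facebook", "a1b2c3d4")
recovered = read_join_key(ad_url)
```

Because the key contains no spaces or characters that need escaping, it is much less exposed to the `%20` versus `+` style of mangling than a free-form campaign name.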
Eric Dodds 55:41
Yep. And you’re usually generating that as some sort of hash, correct? So what are the inputs to that hash? Because one interesting thing about this that you and I have talked about, Lew, is that if you limit yourself to five dimensions... well, one, at a base level, just from a strict technical standpoint, you only have five dimensions, and you don’t want to add spaces and other things like that. And so practically, what ends up happening, probably the best way to say it, is that the people who are creating the key-value pairs, who are generally marketers, get very creative in how they package information into those five dimensions.
Lew Dawson 56:28
Yeah, well, and
John Wessel 56:31
I think, just for people that are less technical, when you’re talking about key-value pairs and such: it can be really simple. I think we’ve done this before, I’ve done this before: hey, we’re gonna start at one, and we’re gonna put the number one in there, and then in a reference sheet, number one equals, yeah, that trade show we went to in London, blah blah. You can say whatever you want, and then you can categorize it in 12 different ways for later groupings. And then when somebody changes their mind, you go and change all those categories. Yep. And it works,
Lew Dawson 57:00
right, yeah. And I think the key there is you re change them only on your system of record, like, internal, yes, exactly, or you augment it, right? So,
John Wessel 57:08
Like, you don’t reuse that number again. One is toast. Do not reuse it. Exactly,
Lew Dawson 57:13
right? There are a couple of nuances to this to unpack, and you just hit on one, John. But basically, Eric, you don’t necessarily need to hash all that data, every single thing you’re interested in. You’re hashing on an agreed-upon set of columns. It could even be the five UTM params, if you want to keep it simpler, and you’re just hashing that. And you’re hashing it, like John said, one and done, meaning if you go with a hash, once you generate your hash, you never change it. Even if you change the UTM params, you keep a stable hash, because it’s your join key, right? So, yep, that’s one gotcha. One key piece is that you have to be diligent about not changing your hashes when you change things internally. Another is that you have to be diligent about tracking this, so you have to have a system of record. Sometimes that gets a little complicated. A simple way to do it could be a spreadsheet that you feed into your data warehouse. People make mistakes, so you just have to be careful. By and large, I would say hash: a hash is highly resilient to collisions, meaning, you know, the same input should always generate the same output, and any variation in the input should generate a wildly different output. The internet is very broken if that’s not true of modern hashes. So that’s why a hash is probably the best way to do it, generally, because it fits that paradigm very well. Yep. And
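The "hash an agreed-upon set of columns once, then never change it" idea might look something like the following Python sketch. The column list, the use of SHA-256 truncated to 16 hex characters, and the in-memory `CampaignRegistry` standing in for the system of record are all assumptions for illustration; the conversation does not name a specific hash function or storage format.

```python
import hashlib

# Assumed "agreed-upon set of columns": here, the five UTM-style dimensions.
KEY_COLUMNS = ("source", "medium", "campaign", "term", "content")


def make_join_key(record: dict) -> str:
    """Hash the agreed-upon columns into a short, spaceless key."""
    canonical = "\x1f".join(
        str(record.get(column, "")).strip().lower() for column in KEY_COLUMNS
    )
    # Truncating to 16 hex characters is an arbitrary choice for readability.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class CampaignRegistry:
    """Toy system of record: the key is generated once and then stays fixed."""

    def __init__(self):
        self._by_key = {}

    def register(self, record: dict) -> str:
        key = make_join_key(record)
        if key in self._by_key:
            raise ValueError(f"join key {key} is already in use")
        self._by_key[key] = dict(record)
        return key

    def update(self, key: str, **changes) -> None:
        # Metadata can change internally; the key is deliberately NOT rehashed.
        self._by_key[key].update(changes)

    def lookup(self, key: str) -> dict:
        return dict(self._by_key[key])


registry = CampaignRegistry()
original = {"source": "facebook", "medium": "paid", "campaign": "overstock sale"}
key = registry.register(original)

# Someone renames the campaign later; the join key attached to the ad is unchanged.
registry.update(key, campaign="overstock clearance")
```

In practice the registry would be a spreadsheet or table loaded into the warehouse, as Lew suggests, and the same "never reuse, never regenerate" rule applies whether the key is a hash or John's simple running number.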
Eric Dodds 58:43
Lew, one thing I love, just to circle back to what you mentioned earlier, and which I called out... I’m really saying this to myself, almost to assuage my pain from life, right? So this is me doing a little self-therapy. Yeah, thank you.
58:59
That’s why we do this. But
Lew Dawson 59:01
I think it’s good to get those out there and help people
Eric Dodds 59:03
out. Totally, totally. Your mistakes, defining... I mean, the hash thing, as we’ve talked about, really can be a game changer, because it just solves so many different issues. But one thing you have to be careful of is that you can pack as much information as you want into the hash, right? I could have 1,000 columns of data that I want to pack into a hash and the system of record and whatever, and then I have the ability to unpack all of that, right? But to your point, Lew, the thing is, what do you need to hash? It’s the requirements that you defined up front. That’s what you actually need to hash, right? And so, man, that’s just such good advice in terms of getting super sharp on that, because that determines the level of complexity that the system needs to serve. Not that that can’t be changed over time. But in all of these things, there’s really no limit to how much you can add. And of course, our tendency is to just say, well, we might need to use that. And so you tend to add more and more and more, or do what you said, which is, let’s just do every level of altitude, right? So, right? Changing it
Lew Dawson 1:00:12
over time is definitely the reason I say it’s really important, ideally, to define this up front, to define what you’re trying to accomplish. Changing it over time is not impossible, but it adds a massive layer of complexity, because undoubtedly you have to do a full refresh of your data ecosystem, say, if you’re using dbt. So it just generates a lot of complexity if you ever have to go back and regenerate history
John Wessel 1:00:38
This is the “think like an accountant” part of the show, right? Because if you put that accounting hat on, you’re like, oh, I’m gonna regenerate all these financials and do this to the bank? Like, think: if you grab an accountant and pull them onto your team, they would do this perfectly. Maybe that’s the strategy we’ve all been missing. Yes, totally.
Eric Dodds 1:00:59
Yeah. Okay, well, unfortunately, we are over time, but Lew, let’s get you back on as soon as we can. Because, okay, we’re at the point now where we’re deep in the stack. We understand, at a high level, the inputs, some of the complexities, and why this turns into a really gnarly problem, and we have a way to do this way better with a hash. We just scratched the surface there. I think there’s a lot more to talk about, but we literally have not even talked about, okay, you’re producing a metric, and that is the other side of it. That gets even crazier. So come back on and we’ll start where we left off. We’ll dig back into the hash and talk about some specific methodologies. I think this has already been super helpful.
John Wessel 1:01:47
I’ve got a teaser for the next show. The other thing, like, zoomed way back out, like, if I’m just listening in, like, it’s like, Man, that sounds really complicated. Like, when does it make sense to do this? Gotta answer that question. Yes.
Eric Dodds 1:01:59
Okay, so the agenda for the next show is: deeper into the hash, attribution models, right? And then, especially, when to apply advanced techniques that include machine learning. And then, Lew, I think another thing that would be really helpful is... I mean, this sounds cliche, but legitimately, how is AI shaping this, right? There are some things around that that I think are super important as well. So stay tuned for part two. I already can’t wait, because this is so fun. The Data Stack Show is brought to you by RudderStack, the warehouse native customer data platform. RudderStack is purpose built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.