Episode 194:

Building Retail Churn Prediction on DuckDB with Clint Dunn of Wilde

June 19, 2024

This week on The Data Stack Show, Eric and John chat with Clint Dunn, Co-Founder at Wilde. During this conversation, Clint shares his journey from an economics major to a data professional, detailing his experiences at Afterpay and in building data teams for e-commerce companies. The discussion covers Wilde’s focus on predicting customer lifetime value (LTV) and churn for retail brands, emphasizing the importance of accurate data for business decisions. The group also explores the challenges of integrating data predictions into marketing workflows, technical aspects of managing and analyzing large datasets, and more.


Highlights from this week’s conversation include:

  • Clint’s Background and Journey in Data (0:51)
  • Starting a Data Career (2:01)
  • Transition to Startup SaaS World (4:27)
  • Clint’s Connection to a Federal Reserve Database (5:31)
  • Challenges in Predictive Modeling (10:27)
  • Data Input Challenges (15:50)
  • Marketers’ Workflow and Data Integration (18:29)
  • Soft ROI vs. Hard ROI in Data Analysis (21:31)
  • Balancing Internal Marketing and Data Team’s Value (22:35)
  • Simplifying Data Inputs for Predictive Models (25:09)
  • Data Analysis Workflow and Tech Stack (29:06)
  • Open Data Formats and Impact on Data Platforms (34:40)
  • The S3 and Ecosystem Model (37:08)
  • In-browser SQL Queries with DuckDB (39:24)
  • Data Security Concerns and Solutions (41:47)
  • Clean Rooms and Data Sharing (43:32)
  • Final Thoughts and Takeaways (47:35)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:05
Welcome to The Data Stack Show.

John Wessel 00:07
The Data Stack Show is a podcast where we talk about the technical business and human challenges involved in data work.

Eric Dodds 00:13
Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com. Welcome back to the show. We’re here with Clint Dunn from Wilde. Clint, welcome to The Data Stack Show. We are super excited to chat with you.

Clint Dunn 00:47
Thanks for having me, guys. I’m super excited.

Eric Dodds 00:50
Alright, well give us just a brief background. Tell us about yourself.

Clint Dunn 00:54
Yeah, I’m the co-founder of Wilde. We do LTV and churn predictions for retail brands. In a prior life, I worked at Afterpay in the marketing data science department. Before that, I was building data teams at small e-comm companies.

John Wessel 01:11
So Clint, one of the topics I’m really excited to talk about is DuckDB. We both, I think you more than me, have an affinity for it, so I’m excited to talk about that. And then we’ll definitely have to talk about your experience as head of data, working in data and producing business outcomes as well.

Clint Dunn 01:33
Yeah, well, there’s not a lot of podcasts I get to go on and talk about technical stuff. So I’m enjoying that.

John Wessel 01:39
Yeah. Awesome. All right.

Eric Dodds 01:41
Yeah. Are you ready to dig in?

John Wessel 01:42
Let’s do it.

Eric Dodds 01:45
Okay, Clint. So many interesting things about your background. You gave us a brief introduction, but how did you start your data career? Did you study data? Sorry, did you study engineering or anything technical in school?

Clint Dunn 02:01
No, I was an economics major. We did a little bit of SAS, which is a real old-school programming language that’s common in economics. And I had finance internships all through school. But I think the turning point for me was going into my senior year. I was working at a UFC Gym, which basically franchised out gyms. My whole job, basically the whole summer, was selling the rights to the gyms in the Delaware Valley. And I got roped into some meeting with the president of that company at some point. And he was like, “All right, we have a major problem here. We’re giving free memberships away, as basically free trials, indefinitely, to a number of customers.” And somebody raised their hand right away and was like, “All right, how many people is this affecting?” And he was like, “We have no idea, but I’m guessing 4%.” And so I started talking to the technical team. I was like, “How is this possible? Don’t we have a SQL database? How do we not know?” And nobody really could work it out. And I just got this hunger from that point to start answering business questions a little bit better than putting your finger in the air.

Eric Dodds 03:18
And so you were actually selling, like, rights to gym memberships?

Clint Dunn 03:23
Not literally. People were kind of coming to us, and I was doing some of the analysis to say how many stores we could put into the Delaware Valley. Yeah, I was going all through Pennsylvania. It’s like, we’re gonna stick one in Mechanicsburg, there’s gonna be one in Harrisburg, we can get two in this other city. Kind of fast, boots-on-the-ground analysis. Yeah. Okay. And then, so where did things go from there? So I ended up working at a fractional CFO and accounting company after school, and I was, like, the data guy. And I think that’s a common situation a lot of folks start their careers in, where they’re tasked with broad data responsibilities and maybe don’t have the skills to do them. So like I said, I knew a little bit of SAS. I stayed after work every day and taught myself Python. And I was really good at Excel, and just tried to figure things out. But it was a little bit of everything at that job.

Eric Dodds 04:18
Yeah. And then when did you get into the startup SaaS world, like software as a service?

Clint Dunn 04:26
Yeah, honestly, I’ve come to it a lot more recently. I’ve been mostly on the retail side of things. So with that job, we were a fractional CFO and accounting company, and we were working with startups. So I got to look at data for a couple of SaaS companies, but I did a lot of things like food and bev and, you know, kind of traditional e-comm analysis for folks. And then I went and was at a fractional marketing company doing basically the same thing, and then was in-house at Hairstory, eventually. So yeah, I’m kind of a retail guy through and through, really.

Eric Dodds 05:02
Yeah. Okay, interesting. Now, you skipped over one really important piece of your history that we talked about briefly before the show. And that is the fact that there’s a database at the Federal Reserve in Kansas City that is named CLINT, which bears a surprising resemblance to your name. Can you give us the quick story on how you got a database at the Federal Reserve named after you?

Clint Dunn 05:31
A horrible mistake, and no one knows this about me. Yeah, I was at the Federal Reserve Bank of Kansas City. I had an internship there one summer. I was an undergrad, everyone else in the research department had PhDs, and they’d never really had an intern before. So I didn’t really have any right to be there, and I was well aware of it. But they had this project where I was supposed to catalog every economic event since World War II. It’s pretty obvious and actually kind of cool, and maybe relevant to the data folks in companies now: nobody really knows when things happened. So if you just ask somebody when Hurricane Katrina happened, it’s very hard to pull that out. But it will affect a lot of analyses that you’re doing, especially in the South, in the aughts. So the head data scientist had this idea to catalog all these things. My job was to go through these old binders that secretaries had typed up manually in the ’50s and ’60s, and I was digitizing all of them. It was really cool. And I gained a huge appreciation for what life was like, right, like the gas shortages in the ’70s, what it was like day by day. But yeah, as a joke, I named it after myself and had a terrible backronym: the Chronologically LINked Timeline, with L-I-N capitalized. I thought it was a really dumb joke. And I realized all federal databases are named after people, like FRED and EDGAR and a few more that I can’t think of. So yeah, I knew it had kind of caught on when the head of research came down to my desk one day and was like, “I heard you’re the guy working on that CLINT database, aren’t you?” Can’t escape it now.

Eric Dodds 07:20
That is how to leave a mark as an intern, for sure. A lasting mark.

Clint Dunn 07:25
Right? You’ve had a similar experience.

Eric Dodds 07:28
I just, actually, we were joking back and forth, John and I, on LinkedIn this week, because he tagged me in a post where someone said, you know, you build a quick prototype and you end up with a database called something like “John test” or whatever. And so the joke around RudderStack is that our version of that is Eric DB. When I first joined RudderStack, I was getting access to the product, and I asked for a schema in Snowflake so I could do testing and prototype. And there are still a lot of production workflows that run out of Eric DB today, four years later. We’ll eventually fix that. But yes, it’s great when you’re onboarding a new employee and they’re like, “What is Eric DB here?” Yeah.

Clint Dunn 08:22
Yeah, you gotta be really careful what you name after yourself.

Eric Dodds 08:26
Yeah, exactly. That’s very true. Well, I want to talk about Wilde and what you’re doing there. But can you talk about your experiences as a data professional before founding Wilde? How did those shape what you wanted to do at Wilde? Were there problems you kept running into? And maybe just start with a brief overview of what Wilde does, so that the listeners have some context.

Clint Dunn 08:57
Yeah, sure. So at Wilde, we’re basically sucking up information about your customers, either from Shopify or from a data warehouse, and then we’re using that information to predict a few things about them. The primary ones are future lifetime value, so how much are they going to spend in the future, and then the probability that they’re going to come back and make another purchase. We’re starting with those two, and we’ll probably build other models in the future. But those two are foundational to the way e-commerce and retail brands operate their businesses. It is the economic basis for the business, and, I would argue, also a decision point for how you should handle every customer. This is what I call horizontally important and vertically important. And so what I saw when I was internal to these brands was that setting up these models to predict things was relatively the same from brand to brand. Basically, you use the same information: the inputs are the same, the outputs are the same. Everyone needs them. And it’s deceptively hard, right? From a coding perspective, you can get this up and running in a day or two, but to productionize it, to run all the testing that you need, to communicate internally with stakeholders, and to kind of productize what your prediction is, is actually really hard. So yeah, when I started Wilde, it was basically just to solve those problems.

Eric Dodds 10:26
Yeah, it makes total sense. In terms of the stakeholders, I’d be interested to know, can you dig into that a little bit more? So you’re on the data side, and you have these stakeholders. Let’s just take lifetime value prediction, for example. Some customer has made a purchase, or maybe not, and maybe there are some characteristics that you’re using as an input. But let’s say they’ve made some sort of purchase, or a couple of purchases, and then you’re running some sort of model that predicts what their eventual lifetime value is over some time period, however many years or whatever. So is the business asking for that? I’d just love to hear what the genesis story is, with you being the data person. How does that come up within the organization? Who on the business side is asking for that?

Clint Dunn 11:19
Yeah, it’s a really good question. I call this the LTV maturity curve. I see a lot of companies starting off where finance and operations own LTV. They’ll usually do a historical analysis: they’ll take cohorts of customers, draw those classic cohorted lines, come up with a churn rate and AOV, and then kind of back into an LTV number. And that works pretty well until the business starts changing and those lines start going up and down. Then it’s very hard to interpret: what is good LTV? What’s the reason for things going up? And so it’s not very actionable. So usually the marketing team will then go to the finance team and say, “Look, it’s great that we have an understanding, economically, of how well our customers are performing and how profitable they are, but we need to take action on those profitability signals.” And so a lot of companies will start building RFM models to play around with those. That’s basically recency, frequency, monetary value: how recently have they purchased, how frequently have they purchased, and what’s the AOV or total revenue. Those are great. But usually you segment those into three-by-three kinds of grids. So for each letter you have three segments, and you end up with, like, nine segments, which is just way too many to actually market to. And so I consider the end of that maturity curve to be the LTV number: just one number, super simple. And it’s tied into what the finance team was trying to do originally, which is understand the profitability of these individual customers. So, big picture, finance usually is driving these conversations. They’re kind of proselytizing the importance of economic viability, especially right now in the e-comm world. And then everyone else kind of needs to get on board.
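
The RFM scoring Clint describes can be sketched in a few lines of Python. This is a minimal illustration and not Wilde’s method: the sample order rows, the function names (`rfm_table`, `tier`, `rfm_segments`), and the rank-based three-tier cutoff are all assumptions made for the example.

```python
from datetime import date

# Hypothetical order rows: (customer_id, order_date, order_total).
ORDERS = [
    ("a", date(2024, 6, 1), 120.0),
    ("a", date(2024, 5, 1), 80.0),
    ("a", date(2024, 3, 15), 100.0),
    ("b", date(2024, 1, 10), 40.0),
    ("c", date(2023, 9, 5), 15.0),
    ("c", date(2023, 8, 1), 20.0),
]


def rfm_table(orders, as_of):
    """Aggregate raw orders into recency (days), frequency, and monetary value."""
    stats = {}
    for customer, day, total in orders:
        row = stats.setdefault(customer, {"last": day, "frequency": 0, "monetary": 0.0})
        row["last"] = max(row["last"], day)
        row["frequency"] += 1
        row["monetary"] += total
    return {
        customer: {
            "recency_days": (as_of - row["last"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer, row in stats.items()
    }


def tier(value, values, higher_is_better=True):
    """Rank-based tier from 1 (worst) to 3 (best); an illustrative cutoff rule."""
    ordered = sorted(values, reverse=not higher_is_better)  # worst value first
    return 1 + (3 * ordered.index(value)) // len(ordered)


def rfm_segments(orders, as_of):
    """Label each customer with a three-digit R/F/M string such as '333'."""
    table = rfm_table(orders, as_of)
    recencies = [row["recency_days"] for row in table.values()]
    frequencies = [row["frequency"] for row in table.values()]
    monetaries = [row["monetary"] for row in table.values()]
    return {
        customer: "".join(
            str(score)
            for score in (
                tier(row["recency_days"], recencies, higher_is_better=False),
                tier(row["frequency"], frequencies),
                tier(row["monetary"], monetaries),
            )
        )
        for customer, row in table.items()
    }


SEGMENTS = rfm_segments(ORDERS, as_of=date(2024, 6, 19))
```

Every added dimension multiplies the number of labels a marketer has to handle, which is the pain point described above and the reason a single predicted LTV number is easier to act on.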

Eric Dodds 13:14
So marketing gets these values, and what are they doing? Some sort of segmentation, and then they dump these people into different campaigns? Could you just give a couple of examples, like, what is the specific number they’re trying to move, or an example segment?

Clint Dunn 13:31
Yeah, so I think the basis of the LTV and churn predictions is that they are, again, horizontally and vertically important. What I mean by vertically important is that it’s a C-suite-level metric, but it’s also actionable for the tactical folks who are actually executing on campaigns. Marketing is a great example, but I also think CX should be using it, operations should be using it. You kind of go down the list: everyone can leverage these and use them as a North Star. In terms of use cases, a super simple one that we’ve had a lot of success talking about: Klaviyo actually has some of these predictions. But they have this black-box model, and nobody really knows what the accuracy is. Nobody can really pull out what the predictions are. And so we’ve had customers compare us to Klaviyo and figure out that Klaviyo was over-predicting churn by four times. And so I think it goes to show the importance of data teams in this stack, in validating the numbers that marketers are actually taking action on, rather than just trusting what’s on other people’s platforms.

Eric Dodds 14:40
Yep. I have a question. And I mean, John, you used Klaviyo heavily previously, so this is a question for both of you. What are the mechanics of why Klaviyo is over-reporting? I mean, I know it’s a black box. But, you know, Clint, you’re building these models. What is the data input problem, or the regression problem? What would cause that?

John Wessel 15:07
I’ll let you take this one. I have a suspicion, but yeah, I’m curious what you found, because my knowledge is a couple of years old.

Eric Dodds 15:16
Let him validate your suspicion. Yeah.

Clint Dunn 15:22
Shoot, I want to hear this suspicion first.

John Wessel 15:24
All right. Yeah, so my suspicion is more like this: Klaviyo has first-party access to your Shopify data. So, theoretically, you have access to the same data, right? What I would guess is that they build a more generic model, and they’re just going to run everything through that more generic model, whereas you’re able to build a more bespoke, focused model. As far as predicting, that’s my high-level hypothesis.

Eric Dodds 15:50
But, okay, so to interject on the suspicion there: isn’t that what makes machine learning applications on Shopify so appealing, though? Because the ecosystem is consistent, right? I mean, Shopify has a consistent data model, if you’re gonna try to scale that for someone like Klaviyo.

John Wessel 16:12
The fields are named the same for every customer. Yeah. Was that? Yeah. Okay.

Eric Dodds 16:16
Enlightened? Yeah.

Clint Dunn 16:18
No, I think that’s one element of it, too. I would say there are probably three elements. The first is some model differences. I don’t know what their model is, and they don’t give accuracy metrics, so I can’t really comment on it. I wish I could, because I’d know a lot more, and I’d feel a lot more comfortable with what their predictions are. But I think the second element is that some brands do have sales outside of Klaviyo, or sorry, outside of Shopify. So, you know, one of our brands has three dozen retail locations that they own and manage throughout the country. And so that information is actually not flowing through Shopify, and Klaviyo does not include it. So they’re missing really important indicators.

John Wessel 17:08
That’s fairly common, I think. In my past life, we didn’t have physical locations, but we had phone sales that didn’t go through Shopify. That’s just another application.

Eric Dodds 17:21
Fascinating, right. Yeah, I guess. Yeah.

John Wessel 17:27
And we’re not talking, like, one or two phone sales. We’re talking, like, 20, 30% of revenue.

Clint Dunn 17:30
Yeah. Yeah. Anytime you’re mixing sales channels, right, like it’s getting much more complex. But I think that’s where data shines, simplifying all that. So this data team we were working with, right, we’re sitting on top of their, in this case, we’re sitting on top of their warehouse rather than their Shopify instance. So the data team was able to do the identity resolution from in store to online, and kind of handle that so that we’re looking at like, one unified understanding of who the customer is. Yeah.
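
A unified customer view across channels, as described here, could be approximated with something like the following Python sketch. It assumes the only join key is a normalized email address; real identity resolution usually relies on more signals than that (loyalty IDs, phone numbers, and so on). The sample orders and the names `normalize_email`, `unify_orders`, and `total_spend` are hypothetical.

```python
from collections import defaultdict

# Hypothetical orders from two channels; the email is the only shared identifier.
SHOPIFY_ORDERS = [{"email": "Pat@Example.com ", "total": 50.0}]
POS_ORDERS = [
    {"email": "pat@example.com", "total": 30.0},
    {"email": "lee@example.com", "total": 10.0},
]


def normalize_email(email):
    """Trim whitespace and lowercase so the same address matches across systems."""
    return email.strip().lower()


def unify_orders(shopify_orders, pos_orders):
    """Group orders from both channels under one normalized customer key."""
    unified = defaultdict(list)
    for channel, orders in (("shopify", shopify_orders), ("pos", pos_orders)):
        for order in orders:
            key = normalize_email(order["email"])
            unified[key].append({"channel": channel, "total": order["total"]})
    return dict(unified)


def total_spend(unified):
    """Sum order totals per customer across every channel."""
    return {key: sum(o["total"] for o in orders) for key, orders in unified.items()}


UNIFIED = unify_orders(SHOPIFY_ORDERS, POS_ORDERS)
SPEND = total_spend(UNIFIED)
```

With the sample data, the online and in-store purchases for the same person land under one key, so any downstream model sees a single order history per customer.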

Eric Dodds 17:59
So you have a table with each customer, and then their combined order history across point of sale and Shopify. Yeah, exactly. Yep. Okay, I have another question on the business results side of things. And this is, I think, again, just based on your experience, both before Wilde and at Wilde, because you’re producing some sort of output. And I’m gonna pick on marketers here, because I’ve been a marketer for most of my career. It’s fun to do. But a lot of times, and I think this is changing to some extent because marketers are getting increasingly technical, at the end of the day, you talk about Klaviyo’s model versus Wilde’s model or whatever, and the marketer doesn’t actually care. They just want the score so that they can do something with it. So how do you think about that, based on your past experience and then with Wilde as well? To your point, the details are extremely important. The underlying data concerns are extremely important. But the end customer doesn’t actually really care about that. So how do you think about balancing that? Because you’re producing some sort of output that’s really critical to the business, that has all these important components, but your customer is like, “Yeah, I mean, I don’t really care. Just let me know who to email.”

Clint Dunn 19:28
Yeah, I guess, from a data perspective generally, whether I’m in-house or building data products, I’m not a huge believer in dashboards. I think they’re valuable, but I don’t really think that’s what our end goal should be as data people. I think what we should be trying to do is integrate ourselves as tightly as possible with other people’s workflows. So in the Klaviyo example, my ideal case, if I’m internal to a brand, is that you don’t ever have to leave Klaviyo as a marketer. The intelligence that we have as a data team has been pushed to you, and you’re not having to go somewhere else to get it.

Eric Dodds 20:11
Yep. I love it. John, thoughts on that? I mean, you did a bunch of those.

John Wessel 20:16
Yeah, no, I think, talking just about general data maturity over the last five years, a lot of people are like, “Wow, data collection is really easy.” Clint and I were talking about this before the show: there are a lot of Azure and AWS bills that are high right now, because data collection is really easy. And then you’ve got all the data in this database, and you can query it, and that’s exciting. And you can even easily hook up a BI tool to it. But that’s all very unopinionated stuff. There’s no structure, there’s no business framework, nothing. It’s just whatever is in that analyst’s or data engineer’s mind, and their level of sync with the business, which is often not very in sync, determines the outcome. So by getting into the destination tool, you’ve just enforced a structure, some business logic. You’re forcing a certain number into a certain field. Even if it didn’t have anything to do with the workflow, even that structure and opinionation, I think, is helpful.

Clint Dunn 21:30
Yep. Definitely. That moves you from soft ROI to hard ROI. Soft ROI is like building a dashboard: you might inform some decisions. And I think there’s the classic question in data, right: how much does our data team generate? And that’s very difficult to answer if all of your ROI is soft ROI. I mean, reducing an Azure bill is one example. But I think if you can actually generate top-line revenue and point to, like, “Hey, we enabled this,” that’s a hard ROI. That’s actually something you can point to and ask for more heads on your team because of it.

Eric Dodds 22:11
Yeah. Clint, and really both you and John, do you think there’s almost an internal marketing element to this? What I mean by that is, I totally agree: let’s get the churn score or the predictive LTV value into Klaviyo, or whatever tool, so they can integrate it into their workflow with no disruption. But to some extent, that can create a dynamic where, how do I say this, it almost looks too easy, to where all of the work that went into that from the data team is undervalued. And so you can’t get another head on your team. How do you think through that element? John and I talk about this a lot: a lot of times, things that are really well done seem easy when you see the final product. Which is great, and that’s part of the point, but then you don’t want that to come back and bite you.

John Wessel 23:13
Yeah. Well, I mean, in marketing, we’ve talked about that a lot, where you read through something, and it’s a perfect logical flow, good messaging, all the things, and you think in your mind, “I could have done that.” And then when you’re actually on the marketing side of it, trying to do that, it’s impossibly hard. And data has a little bit of an advantage there, because there are at least the technical aspects, like, “Well, that seems kind of hard.” But there still is that really clean delivered product of, “Oh, all you did was fill out CLTV in Klaviyo. How hard is that?”

Clint Dunn 23:47
Yeah, I was actually talking to a head of data recently who’s having this problem right now. And we were, you know, half joking, half talking through what you do, because he has made things look really easy. And then the marketing team is coming back and being like, “Okay, well, you could just do this, and it’ll take like a week, right?” And it’s actually about educating them on how hard the work is. Just getting clean data is really hard. Just tracking customer interactions is really hard, as you guys know very well. So yeah, there is a bit of internal marketing. And I think good data leaders are also a mix between marketers and product managers. I’m a big believer in the product mindset internally. You kind of do a little bit of both.

Eric Dodds 24:38
I love it. Okay, we’re gonna switch gears here. Because, John, I know you’re chomping at the bit with a bunch of technical questions, and I cannot wait to hear about this. So I’m going to ask a question just to kind of transition. Let’s talk about Wilde now. My question is, what’s happening under the hood? So you’re connecting to either a table in the warehouse that has certain data, or Shopify. Let’s just use the Shopify example: what data are you pulling in from Shopify when you access Shopify?

Clint Dunn 25:08
We try to keep it pretty narrow. We’re looking at some customer demographic information, and we’re looking at a lot of transaction, kind of order history.

Eric Dodds 25:17
Okay. I mean, that’s it? So, I mean, can you give us, is it like 30 columns or like six columns?

Clint Dunn 25:24
When you pull it from the API, you know, there’s, I think, a couple hundred just from those two endpoints, really, once you blow everything out. Yeah, we look at, like, five or six columns.

John Wessel 25:41
I think you said no PII, like you can do it without PII.

Clint Dunn 25:45
And we can deal with that. Yeah, you can hash an email before you send it to us. We’re trying to keep the scope really narrow, because I think a lot of folks want to fit as many demographic pieces of information or interactions in as they can. And again, as you guys know, it’s really hard to collect that information, and it’s really hard to clean it and organize it. And so I think our onboarding engagements would be, like, five times longer if we wanted to collect a bunch of information from different platforms. So we just keep it super narrow, and we get 95% of the benefit.
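
Hashing an email before sharing it might look like the Python sketch below. The normalization rule (trim and lowercase), the choice of unsalted SHA-256, and the function name `hash_email` are assumptions for illustration; the conversation does not say which scheme Wilde expects.

```python
import hashlib


def hash_email(email):
    """Return a hex SHA-256 digest of the trimmed, lowercased email address."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The same person hashes to the same key regardless of casing or stray spaces,
# so order history can still be grouped per customer without the raw address.
KEY_A = hash_email("  Pat@Example.com ")
KEY_B = hash_email("pat@example.com")
```

Because the digest is deterministic, it still works as a join key across orders. An unsalted hash of an email is pseudonymous rather than anonymous, so a team with stricter requirements might prefer a keyed hash agreed on by both sides.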

John Wessel 26:16
And I think, from talking to you in the past, you realized with the models that you’re working on now that it’s about the signal-to-noise ratio. If you pulled in every single data point you could from Shopify, it’s super high noise. But as you narrow it down, the beauty of a really good predictive model is that we know the five things that matter, or however many it is. And then for everything else, if we get a slight increase, you have to ask: is it worth it? And is it truly a slight increase every time, or is it a one-off? I think that’s the beauty of simple inputs into sophisticated models.

Clint Dunn 26:58
Yeah, conceptually speaking, when you’re talking about retail purchases, online or in store, there’s a lot that’s exogenous to anything that you can measure. So if I go down to my bodega guy every day, I might be a super loyal customer. But if I move apartments, I will never go back to that bodega again, and that bodega guy is probably not gonna know what the reason was. There are a lot of reasons outside of our actual purchase behavior or interaction with a brand that dictate our journey with that brand.

John Wessel 27:31
Right. I still remember, I did a lot of work with Shopify, a lot of work with Shopify apps, and I still remember one sales pitch, because we were really wrestling with the pricing problem. We had thousands of SKUs, and pricing is hard, especially at scale. I still remember this model this guy was selling me, and he was like, “Yeah, we take in hundreds of data points. We look at behavioral data, we look at visits, and we produce dynamic prices for each of your, you know, 20,000 items.” We demoed it. And there was basically no way to prove, like, cool, is this more sales or more margin? They didn’t have that built into the product yet. And we ended up with, like, oops, new pricing for like 20,000 SKUs. So we had to roll it all back, which was a nightmare. But that was a really good lesson of, all right, less is better here. And understandable is way more to be desired than something that is eking out each little percentage of, quote, efficiency in a model.

Clint Dunn 28:39
Yep. Pragmatism might be the most important characteristic in any of these models, right? Interpretability and getting it out the door really quickly is gonna get 95% of the value your stakeholders expect.

Eric Dodds 28:51
All right. So, tech stack. Yeah, we have to go there, right?

John Wessel 28:54
Yeah. So okay, so we started with, you know, Shopify API endpoint, maybe a data warehouse. Yeah. So what happens next? The high level flow? Yeah,

Clint Dunn 29:06
So the Shopify integration is relatively new for us. I mean, we’ve been very warehouse-focused for a while now. And so my co-founder and I were looking at different technologies. Because we’re starting from scratch, we have the freedom to do what we want. And so basically the flow we landed on is: we land data in S3 for cold storage. We use dbt for transformation, an awesome tool. And then we’re actually using DuckDB and MotherDuck for all of our storage, transformation, and warehousing needs, on the back end and the front end. So yeah, I’ve been learning that stack lately.

Eric Dodds 29:50
This is definitely a first. I don’t think we’ve had anyone on the show who has used DuckDB in production in this way.

John Wessel 30:00
I think it’s a first, and we were talking before the show about BI tools and browser BI tools. So if you remember, I guess it’s been 10-plus years now since Tableau came out. And one of the major things there was their query engine, or their storage solution, as part of the tool, where you can extract the data and then manipulate it on your desktop in this amazingly fast experience. That was one of the big deals there. Then they take it to the web, ironically, right? And they had a bunch of trouble early on. I remember they hired somebody from AWS to try to help figure out the web version of Tableau, and, you know, obviously they eventually got something that was good enough. But I still remember that initial experience of, like, hey, I’ve got this massive file with millions of lines, I extract it, I can use it in Tableau, and it’s awesome. So tell me about your workflow. I think you’ve had a similar experience with DuckDB, a different workflow, but maybe similar. Yeah,

Clint Dunn 31:05
So a couple of things that we’ve really liked working with it: first off, it pulls the front-end analysis that we’re doing a lot closer to the data team. And so actually, like, I don’t know any software engineering, front end or back end, really, but I’m an okay data guy, and I can go into our data stack, right, our actual proper data repo, and modify the queries that exist on the front end. And so having that close connection with the front-end web app is kind of ridiculous for any team.

John Wessel 31:44
Because it’s, so what you’re saying is this is almost missing that traditional, like, middleware, that middle-layer piece.

Clint Dunn 31:51
Yeah, totally. See, there’s not that handoff, organizationally, from data to a software team of, like, okay, now we need to abstract what’s going on here, we’re going to move it into some other framework. We’re doing SQL queries from the front end. Yeah, it’s fast.

John Wessel 32:08
So I know, there’s gotta be some software engineers listening. They’re like, No, this is a terrible idea. Here’s all the reasons that you need that layer.

Eric Dodds 32:17
DuckDB is like such a great polarizing, like...

Clint Dunn 32:23
My co-founder is a software engineer. And he’s, you know, coming to the data stack. And he’s the one who’s really been pushing for DuckDB. So he can go fight all the software engineers.

John Wessel 32:38
Yeah, I was going to say, yeah, well, we’ll put him on LinkedIn. On LinkedIn, yeah.

Clint Dunn 32:42
Yeah, it’s been great. I think another cool thing that has been an awesome benefit for us is back-end analysis. So, you know, we do this cold storage in S3. If I want to run an analysis on just data that we have parked, you can basically glob everything from different S3 buckets. So, you know, we’ll do, like, bucket stars, and I can select from a bunch of different S3 buckets simultaneously. And so we can basically do ephemeral analysis on multiple brands without actually joining and merging that data together. So that’s been kind of a nice added benefit for us. Oh, wow.
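
As a rough illustration of the multi-bucket pattern Clint describes, here is a minimal sketch, not Wilde's actual code. It only builds the SQL text for DuckDB's `read_parquet` function, which accepts a list of glob patterns. The bucket names, the `orders/*.parquet` key layout, and the helper names are all hypothetical.

```python
from typing import List


def s3_globs(buckets: List[str], key_pattern: str = "orders/*.parquet") -> List[str]:
    """Build one S3 glob pattern per bucket (bucket names are hypothetical)."""
    return [f"s3://{bucket}/{key_pattern}" for bucket in buckets]


def multi_bucket_query(buckets: List[str], key_pattern: str = "orders/*.parquet") -> str:
    """Return a DuckDB query that scans every matching file in every bucket.

    Nothing is copied or merged into a table: the files are read in place,
    and filename = true adds a column recording which file each row came from.
    """
    paths = ", ".join(f"'{p}'" for p in s3_globs(buckets, key_pattern))
    return (
        "SELECT filename, count(*) AS row_count "
        f"FROM read_parquet([{paths}], filename = true) "
        "GROUP BY filename"
    )


query = multi_bucket_query(["brand-a-cold-storage", "brand-b-cold-storage"])
print(query)

# To actually run it (assumes the duckdb package, its httpfs extension,
# and valid S3 credentials are available):
#   import duckdb
#   duckdb.sql("INSTALL httpfs; LOAD httpfs;")
#   duckdb.sql(query).show()
```

Because the query reads the Parquet files where they sit, the analysis leaves no combined table behind once the session ends.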

John Wessel 33:30
So have you looked into Iceberg at all? Like, as far as...

Clint Dunn 33:36
We haven’t, no. I think Patrick did a little bit and was getting very intrigued. I was asking about it the other day. Iceberg, is that the one that was just acquired recently? Yeah, like the other one,

John Wessel 33:49
The commercial one was acquired by Databricks.

Clint Dunn 33:53
Yeah, right. Yeah. We were talking about it. I think that’s on the horizon for us. Have you played with that at all?

John Wessel 33:58
No, not really, but I was reading about this really interesting workflow with Snowflake, of basically people using Snowflake for the write layer into Iceberg and then DuckDB as the read from it. So then you’re cutting your Snowflake compute, right? Because it’s just being used in the ingestion, but the read out of Iceberg tables is just straight with MotherDuck or DuckDB. So, I mean, it’ll be so interesting, because Iceberg is an open format. Even though the commercial company got acquired by Databricks, that’s still an open format. It’s an Apache project. But it’d be so interesting to have, like, all right, say I want to store everything in that format, and then you just have these engines, right? You’re like, all right, Snowflake engine, you’re going to do my writes. DuckDB, you’re going to do my reading. Or any number of other combinations. It’s going to be really fascinating, like, the cost savings, right? And then just the creative things you can do when you’re able to modularize and split things up like that. And then I’m sure there’s some kind of AI application here, too: if we’ve got everything in the same format, it’d be easier to access.

Clint Dunn 35:18
I’ve been seeing some of these narratives recently, and I haven’t gone super deep, admittedly. But like, do you think this kind of structure hurts or helps a Snowflake or Databricks?

John Wessel 35:31
Like an open structure, like using Iceberg?

Clint Dunn 35:35
Yeah, where you don’t need to put your storage into one of those platforms and where they become purely a compute layer.

John Wessel 35:43
I don’t know. I don’t think it hurts them that much, because of the way they’re going. So, like, Snowflake, just in the last couple of weeks, I’m sure you’ve probably seen it, they’ve got the full Python notebook experience in the browser, you know, for Snowflake. They’re doing that, and they already have the Streamlit stuff. So they’re just going all out on all the things that we can use compute for. And they’re going to have ML and AI models and stuff. So, like, compute time is all going to be more and more used on that stuff. You’re going to be spending a ton of money on compute for AI and ML stuff, and a ton of money with them for your Python notebooks. And maybe even querying starts to move down the list as far as what you’re spending money on. So I don’t know. I mean, I think it might help the industry in general, a little bit of a rising tide for everybody. Because, depending on how it works out, you might end up with a standard where pretty much everybody just uses Iceberg, because it works with Databricks, and it works with Snowflake, and, you know, X, Y, or Z, the other thing that you want. So, I don’t know, that might help the general industry. But it’s hard to say whether it really helps an individual, you know, Snowflake or Databricks, or hurts them.

Eric Dodds 37:06
Yeah, it is interesting. The episode that we had with Andrew Lamb from Influx, you know, and Influx does time series stuff, so a different use case, but I just made this connection. One of the things that we talked about on that show was his prediction that things would move towards essentially having everything in S3, and then an ecosystem around that, to your point, you know, where it’s like, okay, Snowflake does this, DuckDB does this, right, and building an ecosystem around that model. And so, Clint, it’s fascinating that you guys have actually adopted that for your product, right? I mean, we were talking about this just in terms of analytical workflows within a company, right? That’s actually how you run your entire product. Yeah.

Clint Dunn 38:03
I mean, I think, from what we’ve experienced so far, it is not as easy as standing up Snowflake right now and just, you know, letting it rip in there. I think there’s probably a little bit of a way to go in terms of accessibility, right? But it’s definitely interesting and opens up some pretty cool capabilities. I mean, I can speak to the DuckDB thing alone. I was telling you guys earlier, we have more than 500 brands that we’re doing analysis for, and so we have all this transaction data. It’s two and a half billion dollars, I think, in the last year, in total GMV that these brands have done. And so I started doing an analysis where I pull up, like, a Jupyter Notebook. You know, it’s like a one-liner to connect to these S3 files. And I’m off and away writing SQL inside of a Jupyter Notebook. And on, you know, hundreds of millions of rows, I’m getting, like, instantaneous queries on my local machine. That’s crazy. Yeah, and, you know, just connecting to Snowflake from my Jupyter Notebook would kind of be a pain. So

John Wessel 39:09
it is, yeah,

Clint Dunn 39:10
there are some elements where it’s, like, so easy as an analyst to get something out. And I was just able to focus on the fun of being an analyst again, rather than all the engineering setup.

Eric Dodds 39:23

John Wessel 39:24
Are there some pretty cool in-browser things you can do with DuckDB and MotherDuck, from that standpoint? Yes.

Clint Dunn 39:34
I think a lot of that’s related to what we’re doing on the front end right now, which is we’re basically running the SQL queries directly from the front end, right? Yeah, I’m well versed, I think, on the...

John Wessel 39:48
I had some discussions around this, and I wish I could represent it better. But it’s that same, like, where we started, that same Tableau concept of basically you have this extremely fast, compressed data set where the query experience feels just about instant, but it’s in the browser, which historically has been a huge problem with just about any BI tool that I’ve used in the browser. And I’m sure they’re continuing to improve that part of the product. But it’s pretty cool to see it.

Clint Dunn 40:20
I’d be interested to see what, you know, MotherDuck’s go-to-market strategy is, because they do kind of have two different use cases right now: one that runs stuff really fast on your local machine, and then one that runs stuff really fast in your browser. And I don’t think they’re necessarily mutually exclusive, because obviously we’re using both to good effect. But yeah, it’ll be interesting to see which one they kind of lean on and which one proves more valuable. Sure.

Eric Dodds 40:49
Clint, one thing we talked about before we hit record kind of related to this. And it came back to mind because you were talking about having, you know, querying a bunch of different datasets, obviously, there’s a security concern related to that, right. So I mean, maybe you’ve stripped PII or whatever. What do you think about that? Because I can, my mind is instantly going towards all sorts of interesting use cases. And right, I mean, you can provide insights across different customers, you know, because everyone’s in retail, you could provide sort of, you know, reporting benchmarking, I mean, there’s all sorts of interesting product possibilities. But from a data perspective, you have to tread very carefully there, right? Because, you know, you, there are agreements that you have with each customer about, like, how you’re managing their customer data, you know, security concerns around like, if you’re combining all of that in a single place, and they’re, you know, I mean, how are you approaching that side of it as you’re working with data across all of your customers?

Clint Dunn 41:46
Yeah, so we restrict PII for everything. We’re hashing customer information as well, and oftentimes when we’re joining information, we’re looking at merchant-anonymized information as well. So that’s kind of the first layer. The second is, we actually spin up separate DBs for each customer, so each customer lives in their own DB environment. And then when we join it, it’s being joined ephemerally using DuckDB, so there’s no hard table where the data is landing. So I think we probably have some work to do on all of that. But it gives us a pretty good model where we’re getting some flexibility without just mixing it all together. And to be honest, I talk to a lot of vendors who do kind of push all the data together. It’s pretty standard. Yeah, yep. We’re trying to be data conscious on it. One thing I will throw out: none of the tooling out there is really designed to work across a bunch of these databases. And so we’re really having to, like, grok a few of the tools, you know, because we’re basically running a different dbt instance for each. We have one central dbt repo, obviously, but each customer is getting their own dbt run. So it’s, yeah, it’s a lot harder to manage this way.
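
The isolation model Clint outlines (hashed customer identifiers, one database per customer, joins that exist only for the life of a query) can be sketched roughly as follows. This is an illustration under stated assumptions, not Wilde's implementation: Python's built-in `sqlite3` stands in for DuckDB so the example runs anywhere, and the salt, brand names, and table layout are hypothetical.

```python
import hashlib
import sqlite3

SALT = b"hypothetical-salt"  # assumption: a secret salt kept outside the data


def hash_customer(email: str) -> str:
    """Replace a raw email with a salted SHA-256 digest before it is stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(SALT + normalized).hexdigest()


def build_connection() -> sqlite3.Connection:
    """Attach one separate in-memory database per brand (hypothetical names)."""
    con = sqlite3.connect(":memory:")
    for brand in ("brand_a", "brand_b"):
        con.execute(f"ATTACH DATABASE ':memory:' AS {brand}")
        con.execute(f"CREATE TABLE {brand}.orders (customer_hash TEXT, amount REAL)")
    return con


def load(con: sqlite3.Connection, brand: str, rows) -> None:
    """Insert (email, amount) rows into one brand's database, hashing the email."""
    con.executemany(
        f"INSERT INTO {brand}.orders VALUES (?, ?)",
        [(hash_customer(email), amount) for email, amount in rows],
    )


def cross_brand_totals(con: sqlite3.Connection):
    """Combine brands inside a single query; no shared table is ever created."""
    query = """
        SELECT brand, COUNT(DISTINCT customer_hash) AS customers, SUM(amount) AS revenue
        FROM (
            SELECT 'brand_a' AS brand, customer_hash, amount FROM brand_a.orders
            UNION ALL
            SELECT 'brand_b' AS brand, customer_hash, amount FROM brand_b.orders
        )
        GROUP BY brand
        ORDER BY brand
    """
    return con.execute(query).fetchall()


con = build_connection()
load(con, "brand_a", [("ann@example.com", 40.0), ("Ann@example.com ", 10.0)])
load(con, "brand_b", [("bob@example.com", 25.0)])
totals = cross_brand_totals(con)
print(totals)
```

The combined result exists only as the output of `cross_brand_totals`; each brand's rows stay in their own attached database, which is the property the "no hard table" remark is pointing at.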

Eric Dodds 43:10
Yeah. Yeah. All right. Well, we’re getting close to the buzzer here. But we had talked before recording about clean rooms. And that’s probably a good place to end like, as far as what you’re thinking about for the future of Wilde. So tell us about clean rooms? How does that relate to what you’re doing as a product?

Clint Dunn 43:32
Yeah, I think at first blush it feels far afield. But what we’ve learned, you know, looking at 500 brands’, 600 brands’ data at this point, is that a lot of data exists outside of the Shopify ecosystem, because so many of these brands have gone omnichannel now. A lot of them are selling in retailers, a lot of them are selling on Amazon. And so I started doing research earlier this year on, like, okay, if I’m a big brand, how do I solve this? Because I’m not going to be satisfied just not having this data. We kind of started getting into the clean room world. And what we really learned there is that accessibility for clean rooms is a huge issue. Obviously, LiveRamp and Snowflake both have products, they acquired two companies for data clean rooms, but they’re technically expensive and monetarily expensive. And, you know, most retail brands are not using those technical tools. And so what we’ve been exploring lately is basically productizing a lot of these clean rooms so that we can continue sharing data with brands, but then also the retailers.

Eric Dodds 44:43
Oh, wow. Okay, so you’re sort of building it on existing clean room technology from someone like a Snowflake?

Clint Dunn 44:51
No, we’re not. We’ll actually build some of that ourselves. Yeah, we have some hypotheses about that. But yeah, probably too early to say now. But yeah, we’ll be building a stack for that.

Eric Dodds 45:07
Love it. All right, John, any final questions before we hop off?

John Wessel 45:11
No, I think the data sharing part, like, if you don’t know what a clean room is, right, maybe a quick little definition of that for somebody. And then in general, I think data sharing is a really big place for this stuff to go next, whether it’s sharing to an app, like, I use Klaviyo, and I want to share it to Klaviyo. I don’t want to, like, ETL. That’s too hard. Let me just share it. Or I want to share it to Salesforce or whatever. So I think that general concept is big. But if you could just focus on the clean room piece, tell people what that is. Yeah,

Clint Dunn 45:41
so I actually really dislike the term clean room. We refer to them as collaboration rooms, which I think explains a bit more what you’re actually trying to do rather than what it is. You know, effectively, if John, you own a brand, and I own a brand, and we want to share information about our customers, neither of us wants to share a list of our customers. We don’t want to expose that. And so you can use these clean rooms, or as we call them, collaboration rooms, as basically a third party where you can dump the information in. And then neither of us can look at the individual PII, but we can do aggregate queries of that data, kind of predetermined aggregate queries. And so conceptually speaking, it sounds a little bit esoteric, but the actual use cases are quite interesting. So, you know, Amazon has a clean room solution. And so if you’re running on Shopify and Amazon, you can do things like give Amazon a list of your Shopify customers so that you can target them in Amazon’s ad platform, and Amazon won’t actually know who those customers are. And you can do that same thing with Google and Facebook and a few other platforms. The TV platforms have the same technology. It also means that you can go to a retailer. So if you’re selling at Kroger, you can get customer-level sales information from Kroger. All of that is kind of inaccessible to most brands because of their revenue and because of the technical requirements. But the big brands can tell you how many new versus returning customers they have in a retailer.
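
To make the "predetermined aggregate queries" idea concrete, here is a toy sketch of a collaboration room. It is a simplification under stated assumptions, not how any vendor's clean room works: the class name, the minimum group size, the unsalted hashing, and the definition of "returning" (a retailer buyer who is also on the brand's own customer list) are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Set

MIN_GROUP_SIZE = 3  # hypothetical threshold: smaller groups are never reported


def hash_id(email: str) -> str:
    """Hash a normalized email. Real systems use stronger matching schemes."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


class CollaborationRoom:
    """Holds each party's hashed customer list and exposes only aggregates."""

    def __init__(self, min_group_size: int = MIN_GROUP_SIZE) -> None:
        self._lists: Dict[str, Set[str]] = {}
        self._min_group_size = min_group_size

    def submit(self, party: str, emails: Iterable[str]) -> None:
        """Store only hashes; raw emails are not kept in the room."""
        self._lists[party] = {hash_id(e) for e in emails}

    def new_vs_returning(self, brand: str, retailer: str) -> Dict[str, int]:
        """The one predetermined query: counts only, never individual rows."""
        brand_ids = self._lists[brand]
        retailer_ids = self._lists[retailer]
        returning = len(retailer_ids & brand_ids)
        new = len(retailer_ids - brand_ids)
        if min(returning, new) < self._min_group_size:
            raise ValueError("group too small to report")
        return {"returning": returning, "new": new}


room = CollaborationRoom()
room.submit("brand", ["a@x.com", "b@x.com", "c@x.com", "d@x.com"])
room.submit("retailer", ["b@x.com", "c@x.com", "d@x.com", "e@x.com", "f@x.com", "g@x.com"])
result = room.new_vs_returning("brand", "retailer")
print(result)
```

Neither party gets a method that returns the other's list; the only output is a pair of counts, and the room refuses to answer when a group is small enough to single people out.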

Eric Dodds 47:17
Hmm, that’s fascinating. Yeah, that is fascinating. Super fascinating.

Clint Dunn 47:22
Well, it’s been pretty fun to learn about.

Eric Dodds 47:23
Yeah, for sure. Well, as you build that product out, keep us posted, and we’ll have you back on the show, because I think that’s a huge topic for us to tackle.

Clint Dunn 47:34
Yeah, definitely. It’ll be awesome.

Eric Dodds 47:36
Clint, well, thank you so much for joining us on the show. It’s been a fascinating conversation, and we’ll have you back on sometime.

Clint Dunn 47:43
It’s been a blast. Thanks for having me.

John Wessel 47:45
Yeah, thanks Clint!

Eric Dodds 47:48
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.