Episode 215:

Data Sharing and the Truth About Data Clean Rooms with Patrik Devlin of Wilde AI

November 13, 2024

This week on The Data Stack Show, Eric and John welcome Patrik Devlin, Co-Founder and CTO at Wilde AI. During the conversation, Patrik talks about his background in software and data engineering, his startup experiences, and the technical intricacies of Wilde’s predictive lifetime value (LTV) product. Key topics include the use of DuckDB and MotherDuck in data architecture, the realities versus marketing of data clean rooms, and the evolution and technical challenges of QR codes. Patrik also discusses Wilde’s data-sharing strategies, the importance of developer experience, future directions in data processing technologies, and more.

Notes:

Highlights from this week’s conversation include:

  • Patrik’s Background and Journey to Wilde (1:12)
  • The Evolution of QR Codes (4:09)
  • Marketing Analytics and Clean Rooms (9:52)
  • Challenges in Data Sharing (13:20)
  • Technical Challenges with Clean Rooms (15:37)
  • Exploring Current Data Infrastructure (19:11)
  • Data Orchestration Tools (22:50)
  • Performance Tuning and Data Syncing (24:00)
  • Choosing Data Tools (26:08)
  • MotherDuck and Data Warehousing (30:31)
  • Flexible Data Architecture (32:40)
  • DuckDB Implementation (35:36)
  • Data Marketplace Concept (38:34)
  • Asset Availability in Data Queries (42:21)
  • Transition from Software Engineering to Data Stack (46:36)
  • Data Contracts and Type Safety (49:10)
  • Database Schema Perspectives (50:27)
  • Final Thoughts and Takeaways (51:35)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06
Welcome to the Data Stack Show.

John Wessel 00:07
The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work.

Eric Dodds 00:13
Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Welcome back to the show, everyone. We are here with Patrick Devlin from Wilde AI. We actually talked with your co-founder recently on the show, so I'm excited to dig into some more of the technical details with you, Patrick. Thanks for joining us.

Patrik Devlin 00:42
Yeah, thanks for having me. Happy to be here. All right, well, give us just a quick background on how you got into the world of data and ended up building a product that uses a ton of data. So I’ve been in software for about a decade now, and in the startup world pretty much my whole career. I got together with Clint, like you mentioned, at the beginning of this year, and we started working on a predictive LTV product, doing a lot of data-related stuff there. Most of my career has been on the software engineering side, and as of late I’ve definitely entered the world of data engineering and the data stack. I’m excited to chat about what we’re doing here at Wilde, and hopefully some of the information will be useful to the listeners.

John Wessel 01:30
So Patrick, one of the topics that I’m really excited about, okay, maybe two topics. One of them is DuckDB and MotherDuck; I always appreciate a conversation about that. And the other one is your experience with clean rooms. We’ve talked a little bit about the marketing around clean rooms, and we’re excited to compare marketing versus reality when it comes to clean rooms. So what are some topics you’re excited about?

Patrik Devlin 01:53
Yeah, definitely. I mean, DuckDB and MotherDuck are a huge part of our stack and our architecture, so that’s gonna be great. And I think this whole data sharing space is really interesting, and the concept of these embedded OLAP databases really unlocks some interesting stuff to innovate on in how data sharing actually happens. I think there’s probably some inspiration being taken from data clean rooms; ultimately, we decided not to go that route. But yeah, happy to take you through the story and journey of how we made that decision.

Eric Dodds 02:31
Definitely. Well, let’s dig in.

John Wessel 02:33
All right. Let’s do it.

Eric Dodds 02:34
All right, Patrick, you gave us a brief intro, but dig a little bit deeper into the startups that you worked at early in your career, and then you had a pretty good run at the most recent one.

Patrik Devlin 02:46
Yeah, definitely. So I actually spent time in the blockchain space. We were building NFT collectible cards for sports players. It was really exciting, but I think we were probably a little early there, and we quickly realized that sports fans just don’t care about the blockchain.

Eric Dodds 03:08
Which is a shocker. Yeah, exactly.

Patrik Devlin 03:12
So we ended up building a lot of tech on top to obfuscate that part of the architecture, and I think ultimately we dug ourselves into a hole there. But I moved on from there. My boss at the time actually went over to The DTX Company, which is now Flowcode. He brought me along. He said they needed engineers, and I was like, yep, sounds great, I need a job. So, yeah.

Eric Dodds 03:37
That’s a very startup-to-startup transition.

Patrik Devlin 03:41
I love it, yeah, definitely. So I came on as the founding engineer at Flowcode and got to work on a ton of different systems. It was super exciting. I was there for four and a half years, and moved on to Wilde, I guess, in December of 2023.

Eric Dodds 03:58
So almost a year now. Very cool. Okay, we have hundreds of questions about clean rooms and DuckDB and MotherDuck, but let’s talk about QR codes for just a minute, because they’ve had this really interesting life cycle. I remember, and we were chatting about this before the show, this has got to be 2011 or 2012: there was a guy in the co-working space where I was working, and he tweeted something like, “I have never scanned a QR code.” And this tweet went viral. I mean, this guy’s not a super well-known guy, right? But it really just encapsulated the joke of QR codes at that point, because you had all these different apps, and it was a hugely painful process. It was literally harder to interact with a QR code than it was to just Google the company’s name. But Flowcode is a successful company, and that whole dynamic changed. So you were on the inside building some of this technology at really a great time for that. What was it like to be on the inside as that tide was completely changing?

Patrik Devlin 05:17
Yeah, yeah. So I would say some of the inspiration was the fact that the Eastern Hemisphere had been using QR codes; they’re very prolific over there. I remember Tim Armstrong had taken a trip over there and was essentially inspired by all of this sort of physical activation, and they were using QR codes. So when he came back, he brought that to the team while we were trying to figure out exactly what we were going to do. I think one of the requirements was that we wanted to bridge this world of offline to online, and we wanted to make it quick, and it also needed to be super easy to use, a very low barrier to entry. QR codes sort of enabled that, because you don’t need an app; it’s built into your phone. And so you can really create this QR code ubiquity without requiring people to do too much. Now, the caveat is that there was definitely this huge education portion that needed to happen. Like you said, no one over here in the US was really engaging with or scanning QR codes; they kind of looked at it as this old, archaic tech, right? So that was a problem we spent a lot of time trying to figure out, and we spent a lot of marketing material on simply educating users on how to scan them and what that looks like. And then COVID happened, and that was all the user education we needed. It really enabled us to switch to: now that we know people are using them, how can we create a really strong experience around it? We spent a lot of time designing and making sure the QR codes were integrated into the brand. We didn’t want them to stick out, right? So yeah, COVID, unfortunate as it was, sped us up to explore other parts of how Flowcode could come together. And really, on the consumer side, you’re solving this offline-to-online problem. But as a business, you don’t really understand the attribution in those physical marketing spaces. Flowcode unlocks that, and really there’s a huge data product behind it, right?

Eric Dodds 08:02
Behind the QR codes, marketing analytics, for sure. Yeah,

Patrik Devlin 08:06
definitely.

Eric Dodds 08:06
Okay, last question, because I don’t want to spend too much time on this, but it is fascinating. How many QR code companies died before the hardware, and then COVID, made it ubiquitous? That’s what I think about. So, what was the trickiest technical problem you faced at Flowcode?

Patrik Devlin 08:26
Yeah, that’s great. It was one of my first projects, and it was probably the most fun thing that I worked on there. I got to work on it with Neil Cohen, who was our Chief Architect. He’s a brilliant guy, and we essentially built the QR code generator from scratch. Really? Yeah. Because there was such a requirement to make sure the design was flexible enough that you could completely integrate it into your brand. Interesting. So we didn’t just pull something off the shelf. Sure, there are tons of them out there. Exactly, yeah, we were too constrained by what was out there, so we ended up rebuilding it. And that was probably the most math I’ve done since college.

Eric Dodds 09:20
You’re never going to use this in your job, exactly. That’s great. Well, speaking of marketing analytics, that’s probably a great segue into the clean room topic. As our listeners, especially our long-time listeners, know, I used to be a marketer, which I can say now. But it actually makes it harder to joke about marketers, because I’m not on the inside there anymore. Either way, that’s not gonna stop you. It won’t stop me. Clean rooms are always fascinating to me, because the promise is really big: there’s so much opportunity there, but it’s so fraught with peril. When you think about all the situations where you would want to share data, even with a non-competitive brand, it can be so helpful to share data, and there are ways of doing it, but almost all of them are very technically painful. They create a lot of risk exposure. A lot of the time you’re dealing with a physical file, or some sort of manual data munging. So there has to be a lot of juice for the squeeze to be worth it, because it’s really expensive, right? And as we were talking about earlier, the marketing around clean rooms is: you just dump your data in there, and it just works. Okay, so with that context: you did a ton of clean room research at Wilde. Can you walk us through the story there? What were you trying to accomplish, and what did you learn about clean rooms? I don’t think we’ve actually dug deep into clean rooms on the show. Brooks, correct us if I’m wrong. So, what was your use case? And then educate us about clean rooms.

Patrik Devlin 11:08
Yeah, definitely. So, some brief history on Wilde and what we’ve been doing since I started: there’s really this LTV product at its core. We come in, we ingest some data, we run our predictive model on top, and we give you some really interesting insights. This was sort of a foray into the data sharing marketplace that we had this idea for. But let’s back up for one second. The data sharing piece is really between the consumer brand and the retailer; we’re hyper-focused on essentially solving the data boundary between online data and retail data. Like a retailer that sells multiple consumer brands? Correct, yeah. So typically that’s a black box for you as the consumer brand. You don’t know who those customers are. And when you do, you’re a tier-one brand or retailer, and you have a one-to-one engagement that takes a year to set up. There’s probably a data clean room involved in that situation,

Eric Dodds 12:15
or probably SFTP.

Patrik Devlin 12:20
So yeah, we were super focused on just bridging the gap between retail and online. And we had this hypothesis that we needed to build this huge corpus of consumer brands on our platform, in our system, and then we could take that to the retailers and be like: hey, look, we have all of these brands that are really interested in the consumer data that you have. That would sort of jump-start the marketplace. That has switched, though: we found that retailers actually have a bunch of brands that they want to work with. There’s this pressure to monetize data, and so we found they actually are willing to come in and bring the consumer brands onto the platform.

Eric Dodds 13:09
That’s a nice discovery.

Patrik Devlin 13:11
Yeah, yeah. So our go-to-market changed a bit there. But back to the data clean room piece: data sharing was sort of core to our infrastructure. Whether it was going to be the first step of our product, or the third or fourth, we always knew we had to execute on it. So we spent a lot of time talking with folks and asking: what is a data clean room? Turns out no one really knows.

Eric Dodds 13:42
That’s the dirty secret.

John Wessel 13:46
Everyone’s got

Patrik Devlin 13:47
their own definition of it, and so everyone talks about it differently. And I think the name kind of sucks too, because it has this implicit definition that data is put into this room and then you can analyze it safely. But really, a lot of data clean rooms support different types of sharing architectures. With some data clean rooms you actually have to move data; some data clean rooms are all in-place sharing. And this concept of in-place data sharing is one we’re very bullish on. We see that as the future of how brands interact with their data.

Eric Dodds 14:27
Really quickly, what would the flow there be? If you actually have to move data, am I writing data into a clean room, or sharing a table into a clean room?

Patrik Devlin 14:43
So, I mean, it depends. There are so many different factors, and each avenue has a different clean room solution.

Eric Dodds 14:55
So if you’re both on, like,

Patrik Devlin 14:57
Snowflake, yeah, exactly. Potentially there are regional boundaries as well: maybe you’re on the same architecture, but your data is stored in different regions, so in some cases there just inherently needs to be replication, right? So we started to dig into Samooha, which was acquired by Snowflake and is now just Snowflake Data Clean Rooms, and we looked at Habu, which was actually acquired by LiveRamp, as potentially being the infrastructure provider for us at Wilde to support this data sharing use case. We quickly found out it would just be way too expensive. But also, behind the scenes, yes, there’s some cryptography going on to enable secure queries, but ultimately the major use cases for the data clean room are all in-place sharing, so you’re actually not moving data. You just sit on top of your data tables. Let’s take the Snowflake example: if you’re Company A, you have Snowflake tables in US East, and Company B has some tables in, say, US West. Well, the clean room is really just this protected space that tells you what you can and cannot query, and it’ll go down and actually pull information from those disparate data sources.

Eric Dodds 16:32
So you just get, like, an authorized materialization of whatever you can query. Yeah.

Patrik Devlin 16:41
I think at one point, and it’s been a while since we looked at this, but I think at one point we found that it was just a bunch of Jinja templates telling you what you can and can’t query.

Eric Dodds 16:54
So, you know, take that for what it is. I’m sure those come from the config. Like, I have some sort of config in the clean room, and they’re just translating that to a Jinja template, which just creates restrictions on what’s queryable. Yes, yeah, exactly.
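To make that concrete, here is a toy sketch of the pattern being described: a config declaring what a partner may see, rendered through a Jinja template into the only query shape they are allowed to run. The table, columns, and threshold are invented for illustration; this is not Samooha's or Snowflake's actual implementation.

```python
# A toy "clean room as glorified permission system": config in, allowed SQL out.
from jinja2 import Template

# Hypothetical config: what the data owner has agreed to expose.
config = {
    "table": "retailer.transactions",
    "group_by": ["order_date", "sku"],
    "min_group_size": 25,  # suppress small groups so no individual is exposed
}

# The template is the only query shape the partner may run: aggregates only,
# grouped on approved keys, with small groups filtered out.
ALLOWED_QUERY = Template("""
SELECT {{ group_by | join(', ') }},
       sum(revenue) AS revenue,
       count(*)     AS group_size
FROM {{ table }}
GROUP BY {{ group_by | join(', ') }}
HAVING count(*) >= {{ min_group_size }}
""")

print(ALLOWED_QUERY.render(**config))
```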

John Wessel 17:15
Interesting. Databases have had permissions for a lot of years, so it’s just funny. I mean, I’m sure there are some subtleties around some of it, but it’s like, yeah, cool: this sounds a lot like row-level security and the other database features that have been around for a lot of years. Sounds like marketing, yeah.

Patrik Devlin 17:32
I think it was kind of this glorified permission system. And like you said, John, I’m sure there are some details that we’re missing there, but I think the core learning was: okay, they’re sharing data, it’s in-place sharing, so there’s no replication. And the other piece is there’s some cryptography going on to make sure that you can write queries against that sort of clean room state without anything being exposed, so you can’t be a bad actor in that situation.

John Wessel 18:10
Yeah, and that would be unique. I can’t think of any technologies that let you write like that. A lot of times it’s like, oh, we can give you read-only access to a certain piece of this. Writing is more complicated. So yeah, that’d be novel.

Eric Dodds 18:22
And the ability to do that without moving your data is pretty awesome. That is cool. But okay, so some cryptography, some permission stuff, but those obviously didn’t meet your needs at Wilde.

Patrik Devlin 18:38
Yes, yeah. I mean, it’s a huge burden to even just maintain and provide a data clean room, not only if we were a consumer brand ourselves, but also because, in our situation, we would need to run these things for every data sharing relationship within the platform. It quickly became unscalable in that sense. And, I don’t know if we’ll chat about this later, but to give a little teaser: we opted to focus on the in-place sharing piece of data sharing. I think that was the biggest learning out of the data clean rooms: even in the scenarios where you’re across vendors and there’s some replication going on, there’s still that in-place sharing at the core.

John Wessel 19:37
Yeah. So I think that’s the perfect segue to get into your current data stack. One part of the infrastructure I thought was really cool, and we talked about it previously: most applications, when you write the application, you typically start with some kind of relational database, like Postgres, for example, or SQL Server, or whatever you’re on, and then you’re writing to this common database. And then you get your first client, and then basically the only thing that’s separating clients is an ID column or something. All the data is stored together. And then, getting access back out, say a client is like, oh, I want this custom report, or I want this thing or that. Now you have to kind of reverse engineer it and make sure you’ve got filters on every single little thing, to make sure this client ID is always true if it’s this client’s view of the data. So you have all this work built into it. So part of your model, which I think is really interesting and I’d love for you to expand on: you don’t do that. You actually store the data by client to begin with, separately.
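For illustration, here is a tiny sketch of the contrast John is drawing; the table, column, and bucket names are all invented. The point is that per-client storage makes tenant isolation structural rather than a filter you must remember.

```python
# Pattern 1: one shared table, tenants separated only by a client_id column.
# Every query must remember the filter, or data leaks across clients.
SHARED_TABLE_QUERY = """
    SELECT order_id, total
    FROM orders
    WHERE client_id = ?  -- forget this once and you've mixed clients
"""

# Pattern 2: physically separate storage per client (the approach described).
# Isolation is structural: the path itself scopes any query to one tenant.
def client_orders_path(client_id: str) -> str:
    return f"s3://raw-bucket/{client_id}/shopify/orders/*.parquet"
```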

Patrik Devlin 20:40
Yeah. Because we’re building data products, we’re not the data producers; the customer there is the data producer. So we’ve essentially built a system using Dagster and DLT to sync and land data into S3 on our end. And then, because we have that raw-level data, we’ve got the flexibility to put any type of compute engine on top. In the future we’ll definitely look into experimenting with Iceberg and getting a little more sophisticated on that front. But as things work today, we land things in S3, and we use Dagster, which in probably the majority of cases would be used to orchestrate your internal analytics; we actually use it to orchestrate all the pipes needed to process customers’ data. So every job, every asset in Dagster, is partitioned by the customer.
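Here is a minimal sketch of what that pattern can look like, using Dagster's static partitions with a dlt pipeline per client. All names are hypothetical and the Shopify source is stubbed out; this is an illustration of the approach, not Wilde's actual code.

```python
import dlt
from dagster import StaticPartitionsDefinition, asset

# One partition per customer, so every run is scoped to a single client.
clients = StaticPartitionsDefinition(["client_a", "client_b"])

def fetch_orders(client_id: str):
    # Stand-in for a real Shopify API source; yields raw order records.
    yield {"order_id": 1, "client_id": client_id, "total": 42.0}

@asset(partitions_def=clients)
def raw_shopify_orders(context) -> None:
    client_id = context.partition_key  # e.g. "client_a"
    pipeline = dlt.pipeline(
        pipeline_name=f"shopify_{client_id}",
        destination="filesystem",         # bucket_url (s3://...) set via dlt config
        dataset_name=f"raw_{client_id}",  # per-client prefix in the bucket
    )
    load_info = pipeline.run(fetch_orders(client_id), table_name="orders")
    context.log.info(str(load_info))
```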

John Wessel 21:51
Wow. Yeah, just a couple things. One, I’m curious what other architectures you considered. And two, maybe explain DLT a little bit. I think a lot of people are familiar with dbt, but this is another three-letter tool.

Patrik Devlin 22:11
Yeah, yeah. So

John Wessel 22:12
Talk about the architecture some, and then DLT.

Patrik Devlin 22:17
Definitely. So we knew we needed something to orchestrate all of the pipes. We have this ML model that we run, but before that we need a clean data set. So we knew there was a piece of ingestion and transformation, then running the actual models on top of that clean data set, then doing some additional transformations after the model gets run, and then prepping it for the platform and the product. So we looked at Dagster and Airflow, and we could have hand-rolled some of that orchestration; there are a ton of other options out there. But ultimately we went with Dagster. They had good support for partitioning, which we knew was going to be critical to this system, especially for the processing. Like I said, it’s not only our internal data set that we’re producing; we’ll probably have something else split out once we generate enough scale. So yeah, we opted for Dagster. And then DLT, which you can find at DLT Hub. What’s it stand for? Data load...

John Wessel 23:31
tool. Data load tool, yeah. If you Google it, Delta Live Tables comes up a lot, which is

Patrik Devlin 23:44
different. Yeah, different. So, I mean, I guess the beauty of building a greenfield project is we can experiment with a lot of these different tools. And DLT, I think at the time, didn’t even have a major release; they just released a stable version, like, last month. But we didn’t have any problems using it. It was great from the beginning, and it’s all Python-based. I think what’s interesting about DLT is you can have high concurrency, and it stores the pipeline state in the destination. So every time a Dagster job spins up and it’s like, hey, sync Client A’s data to S3, it’ll go and grab the state from S3, know exactly where to pick up, and just run from there. So after a little performance tuning, we were able to sync data pretty fast, which is important for onboarding into the product. We talked to some other vendors, and this process of backfilling historical transaction data can take up to days, depending on the size of the business, and DLT has been great at accelerating that.
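A hedged sketch of that incremental behavior: the dlt resource below resumes from the cursor value dlt keeps in the destination's state, so re-runs only fetch newer records. The API stub is invented, and the demo writes to a local DuckDB file to stay self-contained, whereas the setup described lands in S3.

```python
import dlt

def fetch_since(cursor):
    # Stand-in for a paginated Shopify API call filtered by updated_at > cursor.
    return [{"id": 1, "updated_at": "2024-06-01T00:00:00Z", "total": 10.0}]

@dlt.resource(primary_key="id", write_disposition="merge")
def orders(updated_at=dlt.sources.incremental("updated_at",
                                              initial_value="2023-01-01T00:00:00Z")):
    # updated_at.last_value is restored from stored pipeline state on every
    # run, so each sync picks up exactly where the previous one ended.
    yield from fetch_since(updated_at.last_value)

if __name__ == "__main__":
    pipeline = dlt.pipeline(pipeline_name="shopify_sync",
                            destination="duckdb", dataset_name="raw")
    print(pipeline.run(orders()))
```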

John Wessel 25:07
So timing-wise, I’m curious, because this would have been probably five years ago now. There was an analytics tool built for Shopify, and this was even before Shopify had pretty good analytics, before they redid all their dashboards and stuff. The promise of the tool was to give you better analytics out of the box than Shopify, and then I think it hooked up to a couple of other things. I think Daasity does this now, but this was before all that: it would hook up to your email and Google Ads and stuff. I remember hooking it up, and then this popup came up, like, estimated time: come back in 48 hours, or 17 hours, or something like that. And I was like,

Patrik Devlin 25:48
what? Like, yeah,

John Wessel 25:51
It was no exaggeration; it really was, yeah. So I totally understand how, like you’re saying, the parallelization and speed is a big deal. Especially if it’s a larger brand on Shopify, it can take a while to backfill data.

Patrik Devlin 26:07
Definitely, yeah.

Eric Dodds 26:08
What other tools besides DLT did you look at? That’s question one. And then question 1A is: as a software engineer, does it give you a little bit of pause to be using, in production, a tool that just had their first stable release like a month ago?

Patrik Devlin 26:23
Yes and no, because honestly it’s really up to the vendor to decide whether it’s a stable release or not. So it’s sort of inherently made up in that way. I mean, they went from, like, a 0.6 to a 1.0 release, and you’re just like, wait, what?

John Wessel 26:48
It’s so subjective. Like, Gmail was in beta for what, a decade, right? Do you remember that? Yeah, that’s true. So I feel like that’s the personality of the founders. They’re just staying pre-stable until they reach some level, or it can be the opposite, where it’s just, oh, here’s the first release, it’s stable. Really? Right? So,

Patrik Devlin 27:11
I mean, it’s definitely a signal, an indicator, and you should be like, okay, this is what I’m getting into. But I don’t think it’s the end-all be-all. Yeah, sorry, what was the other part of the

Eric Dodds 27:22
question? Yeah. So in terms of that stage of the pipeline, you chose Dagster, and you gave a number of options you explored for the orchestration piece and for making sure you could do it with robust partitioning support. But in terms of that last mile before it hits S3, what other tools did you look at to solve that part of the pipeline?

Patrik Devlin 27:49
Yeah, yeah. I mean, we looked at Airbyte, and I’m sure there were a few others as well, but we liked the fact that DLT was super easy to just run and spin up, and it already had an embedded integration with Dagster. Versus, the first sync I ever did for this product was on Airbyte, and I think I had Kubernetes running on my freaking laptop to get the same thing going. And, I don’t know, it’s just a little off-putting. Granted, they do have this really mature connector community, and they’ve got a ton of options there. But for what we were solving, I think DLT just fits in really nicely.

Patrik Devlin 28:40
Yep.

John Wessel 28:41
Is there something about the architecture that maybe is different? I’m thinking about this architecture with separate data pods, if you want to call them that. Is there something different about that than maybe a standard company? Because, at least so far, you have a little bit of a unique scenario here where a lot of your customers are on Shopify, I think almost all of them. So it’s almost like you already have the data standardized; now you’re just landing it in different places. Whereas previously, I think one of the reasons it was so crucial to get everything into a common database is that that was the standardization step. The reason we’re all in the same table with a client ID is that when we write to this table, client name always matches up with client name, and address, at least, is in the right field. Whereas if we just started storing things separately, with no orchestration tool and no common schema, you’re just going to end up with a mess. Yeah.

Eric Dodds 29:36
You just need a Shopify clean room, right?

Patrik Devlin 29:40
Yeah, exactly. The system has worked great for us; we haven’t really had any issues. I think the other unlock is using MotherDuck as part of that transformation layer and data warehousing, which allows us to sit on top of that raw information and really start to do some interesting SQL to get the data looking how we want it to look. You mentioned Shopify being our only source, but in theory we can sync from any source. DLT manages the schema evolution, and so as long as our dbt jobs are set up to know what that source looks like and how we can pull it into some of our mart tables, it works. I’m happy to dive in on that part of the stack, if you’re interested.

John Wessel 30:34
Yeah, I’m really curious about MotherDuck, or DuckDB. So I’ve got all this data, you’ve standardized it and stored it in the same schema in S3. What does it look like to essentially union it if you needed to, say, for internal analytics? How would you union across a bunch of different S3 buckets?

Patrik Devlin 31:02
Yeah. So it’s as simple as a glob pattern. As long as you have your S3 folder schema set up, which we partition by client and then data source type and so on, you can select a glob pattern to query across everything, or query across just a subset. And what’s interesting about the dbt package is that it will compile the SQL you write as part of your staging files into the actual S3 URLs. So when you go and run those first transformations, they’re going straight to S3 to grab that information. And potentially, if it’s a CTE or a view or something like that, that could end up getting compiled all the way up to a materialized table, right? So yeah, it’s been great to work with on that front. And part of the decision with MotherDuck, which has provided a lot of scale for us, sort of bringing that modern data warehouse and tooling into that DuckDB in-process world, is the fact that we can serve an entire platform without having to front all of this data with an analytics API or something to negotiate the contract between the data we’re producing from our predictive models and what’s actually showing up on the front end of the UI. Yeah.
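For illustration, here is roughly what that glob-pattern querying looks like in DuckDB, assuming a hypothetical bucket laid out as client/source/table; the bucket name and layout are made up.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # adds s3:// support
# (S3 credentials and region are configured separately, e.g. via a secret.)

# One glob unions every client's orders across the whole bucket...
con.sql("""
    SELECT count(*) AS total_orders
    FROM read_parquet('s3://raw-bucket/*/shopify/orders/*.parquet')
""").show()

# ...and narrowing the pattern scopes the same query to a single client.
con.sql("""
    SELECT count(*) AS client_a_orders
    FROM read_parquet('s3://raw-bucket/client_a/shopify/orders/*.parquet')
""").show()
```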

John Wessel 32:40
And the other thing is, the architecture is so flexible, right? Let’s say you’ve got a big client or a big prospect you’re working with, and they want to use Snowflake or Databricks or something: it’s in S3, right? Or you decide, like you said, you want to dive more into Iceberg as a format. It’s all there. You have the data landed, you have it partitioned. This is probably the easiest architecture I’ve seen to just make a pivot and be like, alright, cool, S3 is the storage and Snowflake is the compute, or Databricks is the compute, whatever. So that’s pretty cool.

Patrik Devlin 33:21
Yeah, definitely. I think that’s sort of how we’re thinking about the architecture for this data marketplace and data sharing. What we found with the data clean rooms is that depending on where the data is, there’s a whole different path for how the clean room comes together and what data needs to be replicated to execute on it. We wanted to create a system that could essentially sit the compute on top and have the flexibility to go into Snowflake or BigQuery or blob storage and pull that information together. That was a requirement: meeting our clients where their data is. And that has implications for security as well, right? If we don’t have to replicate data to our side, then they don’t have to worry about that from a compliance standpoint.

Eric Dodds 34:17
I’m interested in the story of how you ended up trying DuckDB for this solution. Because, to your point, John, the set of requirements itself is interesting, right? And so is the stack leading up to it: doing partitioning with Dagster and then landing the data in the client partitions in S3. But what was the process you went through? Like, okay, I now have this data in S3, and I have this use case I need to implement. Give us a little bit of the story of what cycles you went through, and how you ended up deciding.

Patrik Devlin 34:55
I mean, I have to give some credit here: he’s the one who told me about DuckDB and got me on it. As soon as he mentioned it, I downloaded it and took a look at how it functions within your local development. Something I think about a lot is: what is the developer experience going to be like, and do the tools we choose create a really strong experience there? Partially because, being a bootstrapped small team, we need to be able to push new features extremely easily. And having DuckDB be able to sit on that raw data and just run locally, and perform really well on top of this not-huge data set... everyone’s got their own definition of what big data actually is, but by my standards we’re not dealing with big data, so everything can be run on my laptop. The fact that DuckDB can enable that workflow allows me to iterate faster, and it allows Clint, who’s done a ton of work on the data side of things, to iterate really fast. I think that’s definitely been a huge part of allowing us to push something out and get it into production. Yeah.

Eric Dodds 36:28
Also, a disclaimer for Wilde customers: your production instance is not running on a laptop.

John Wessel 36:37
That is correct, yeah. But it is a really interesting model that I don’t know of any other vendor doing, where you have this local developer experience with DuckDB. You get to use all that amazing capacity of some of these modern laptops, and you’re not wasting credits on, you know, developing all day and burning credits. And then when you’re in production, there’s the truly managed-service production thing. So it’s a cool model.

Eric Dodds 37:08
Well, actually, and Patrick, I’m interested to hear your thoughts on this, I immediately thought about hiring as well. Think about bringing in a really good engineer: they’re probably going to be like, man, that sounds so nice. It’s interesting from that regard as well, where it’s like, okay, when I’m hiring, I can show you a pretty awesome set of tools that we’re using that will allow you to move really fast, maybe without some of the traditional restrictions.

Patrik Devlin 37:38
For sure. I mean, that’s a huge part of it, yeah. And this should be baked into any decision on infrastructure choices and things like that.

John Wessel 37:46
Well, I think you bring up a really good point on hiring, because my analogy for this: I feel like colleges picked up on this maybe 15 or 20 years ago, where it’s like, oh, guess what, the dorms matter. If they’re nice, kids want to come here. If the food’s good, kids want to come here, you know? Because at one point I think it was a little more bent toward just the academics, like who has the best ranking here and here. And then, 20 years ago or however many years, somebody was like, oh, the student experience matters.

Eric Dodds 38:16
Same thing in practice here. And I still think there are companies that have some set of tools, because they’ve always had them, or because they’re cheap or whatever, and then they’re just shocked: we just can’t find the right people, we can’t find the right talent. It’s like, maybe nobody wants to work with this set of tools. Okay, Patrick, I want to talk about the marketplace a little bit, to bring this all full circle. So, the marketplace in the context of Wilde: there’s one component of just being able to do secure data sharing, or joining data sets to find some common stuff, right? And that could materialize as a product within Wilde, where you could engage and then consume some sort of data that shows crossover, or whatever value you could provide from a data perspective by showing information to the consumer brand and to the retailer. It’s essentially an analytics product or a data product. But a marketplace is distinctly different, in that you’re still showing what’s available, but then it’s actually an asset that the end user of your platform does something with, right? Like, I can consume this; I can actually do something with it. So can you walk us through the differences between providing a data product that relies on the sharing, and providing an asset that you’re selling, or facilitating a transaction around?

Patrik Devlin 39:49
Yeah, definitely. So as far as the dynamics of this data marketplace, it’s less transactional and more of a subscription to access the retailer’s data. Okay, interesting. And we see this as a batteries-included experience. Because we’ve created this system to securely share and pull nuggets of information out of both the retailer’s consumer data, meaning your sales within that retailer, and your online sales, and because we have that granular, row-level data, we can build up from there and essentially... I don’t know how to describe this without giving up too much of the trade secret here.

Eric Dodds 40:48
the secret sauce.

Patrik Devlin 40:50
Yeah, exactly. We’ve been calling them these ephemeral warehouses. Because of this new era of quick, embedded OLAP databases, we can run them really efficiently, and we can spin them up whenever we want. So the process is essentially: we create a warehouse, we pull in or query data from external locations, run transformations and models on top of that, pull out information and aggregates, and then persist just the de-identified, non-PII results. And because that’s all run in-process and in memory, we essentially let it all go; your data stays with you. We actually created this system where we’ll store the results, because we need to show the product and the past results, right? But the process of getting to those results: if you want a really good understanding of that customer crossover, you need to start at the smallest, lowest grain you can get. And so that’s the system we’re building towards, and how we enable this kind of secure compute environment without replicating data.
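A rough sketch of that "ephemeral warehouse" flow as described: an in-memory DuckDB queries each party's data where it lives, and only a de-identified aggregate is persisted. Buckets, columns, and the pre-hashed join key are all invented for illustration.

```python
import duckdb

con = duckdb.connect(":memory:")  # the ephemeral warehouse
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// access; credentials configured separately

# Compute the brand/retailer customer crossover at the lowest grain available,
# entirely in process; row-level data never leaves its source buckets.
con.execute("""
    CREATE TABLE crossover AS
    SELECT r.store_region,
           count(DISTINCT b.customer_key) AS shared_customers
    FROM read_parquet('s3://brand-data/orders/*.parquet')   AS b
    JOIN read_parquet('s3://retailer-data/sales/*.parquet') AS r
      ON b.customer_key = r.customer_key  -- a pre-hashed identifier, not raw PII
    GROUP BY r.store_region
""")

# Persist only the aggregate result, then let the warehouse go away.
con.execute("COPY crossover TO 's3://results-bucket/crossover.parquet'")
con.close()
```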

John Wessel 42:20
Yeah, that’s really interesting. I’m thinking of it like the physical world: rather than, hey, I’m going to give you a physical key to my house, it’s, we implemented a digital lock here. You can have temporary access to the house, do the thing you need to do, get the result from that thing, and then you lose access to the house. And you’re not storing the PII; the PII stays inside their infrastructure, right? And yet they can get the results, like customer lifetime value, or whatever things you calculated off of the data. That’s pretty cool.

Patrik Devlin 42:56
Exactly. I think that’s where some of the inspiration from the data clean rooms came in. But we don’t need cryptography, because we’re sort of in this black-box environment that Wilde controls. It’s not like we’re allowing the retailer and the consumer brand to go in and query each other’s data. And we don’t have to move that data set anywhere, like you’re saying. Yeah, that’s cool.

Eric Dodds 43:23
I mean, it is really fascinating to think about. I used the term asset, but it’s fascinating to think about what’s happening under the hood, where really it’s the availability of this asset, and you essentially generate it when it’s needed. But one question I have: because you’re dealing with timestamped transaction data, the queries are going to take longer over time, right? So in an ideal state, you have a customer who’s a really awesome customer, and they’re growing like crazy. How are you thinking about the scale? Because it’s not a static data set; it tends to grow over time. And then, from a machine learning standpoint, the larger data set is going to allow you to provide more accurate results, but there are obviously performance implications relative to that.

Patrik Devlin 44:15
Yeah, no, that’s a great question. The use cases we’re solving for are small-to-medium consumer brands, and then small-to-medium, mid-tier retailers. So when you think about the quantity of your sales happening within that retailer, and just historically how many transactions you as a consumer brand have had, you’re still in this small-data realm, with tons and tons of room to grow. And in that space, I guess the short answer is that we can do it all on a single machine, a single process, and we now have access to these tremendously large boxes that we can deploy at will; we can run most things in memory. But yeah, it does pose a really interesting question: where does this architecture go when you get beyond the scale you’re talking about? Because it’s not really the customer we’re going for, we haven’t done much thinking there. But it could be a really interesting space to solve: how things get cached throughout the system, how we reduce the load on the actual data provider, and things like that.

Eric Dodds 45:54
A problem that your investors would love for you to have, if it became a serious pain.

John Wessel 45:58
You’re right, though. The brilliance of spot instances, where, hey, I need 30 minutes of this enormously oversized machine to do a thing, is the perfect use case here. And since you’re in the retail space, a transaction is a physical good, so somebody had to ship it and deliver it, typically, and that does not scale the same way digital goods do, where somebody takes 1,000 pictures, or whatever the digital thing is. So I think that’s the other constraint: it would have to be a massive retailer, I think, for you guys to have problems.

Patrik Devlin 46:32
I mean, I hope we have to solve that. I

John Wessel 46:35
I hope you have that problem.

Patrik Devlin 46:39
But, yeah.

Eric Dodds 46:40
I think we’re close to the buzzer here, but one more question for you. You have a background in software engineering, doing really cool projects like building your own custom QR code generator and digging into all the associated math. But what’s fascinating to me about our conversation today is that you’re running a very sophisticated data stack, one that a lot of companies would run just for their own internal analytics infrastructure, but you’re running it in production across multiple clients, with some intelligent partitioning and querying and everything. So I think we can officially say you could probably put data engineer on your resume. I’ll take that. I mean, I guess I can’t bestow it, but John probably can. Granted: data engineer emeritus, or honorary. But can you walk through going from, let’s say, traditional, which I know is a loaded term, but more traditional software engineering as a software developer, to actually building out this data stack as a software product? What are some of the things that have really stuck out to you, that you’ve learned, or that are wildly different about building out a data pipeline infrastructure as software, as opposed to a more traditional code base, which I know you have as well?

Patrik Devlin 48:19
I think my background in software has definitely influenced a lot of these architecture decisions, and it remains to be seen whether that’s good or bad. But I think some of the biggest differences for me coming into the data world were around the tooling: you just don’t have the same tooling as you do with a backend microservice stack or a frontend stack. And I think part of the reason is that the majority of data stuff is built in Python, and Python is built for flexible, explorable, easy-to-write workloads. But when you build a microservice architecture, you need real contracts between your APIs and the data everything is producing and consuming; you want this really secure contract between the two. On the data side, that evolution of using Python has kind of created a system where you don’t have a lot of those guarantees. There’s new tooling coming out to solve that problem: you see SDF coming out with their column-level lineage, where you’re able to classify your data across your transformation stack so you know exactly where your PII is landing. So there’s some interesting stuff that allows you to ensure type safety and data contracts. I think the other piece, and it’s a little more of a tangent, is that the way I thought about database schemas on an OLTP system, where you have high throughput and high transaction volume, is completely different from how my brain thinks about what these mart tables are going to look like, what’s going to be in them, and how they’ll get queried. So there was a shift there, in thinking about storage and schemas on the data side versus the OLTP side. But the last piece I’ll touch on is related to the tooling front: I just value a really good developer experience, and anything I can do to build that into the data stack, I will probably take a bet on, because I know how important that is and what type of efficiencies you can gain from it. So yeah, I think that’s the other piece.
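One way (among many) to get the API-style guarantees Patrik is describing is to validate records against an explicit schema at each pipeline boundary. The sketch below uses pydantic purely as an example; it is not what Wilde runs.

```python
from pydantic import BaseModel, ValidationError

class Order(BaseModel):
    order_id: int
    customer_key: str  # a hashed identifier, never raw PII
    total: float

def validate_batch(rows: list[dict]) -> list[Order]:
    # Enforce the contract on every record before it moves downstream.
    valid, violations = [], []
    for row in rows:
        try:
            valid.append(Order(**row))
        except ValidationError as exc:
            violations.append((row, exc))
    if violations:
        # Fail loudly instead of letting schema drift flow into the marts.
        raise ValueError(f"{len(violations)} rows violated the contract")
    return valid
```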

John Wessel 51:09
Love it. And I think the developer experience... I don’t have a good way to quantify this, but as you’re optimizing for that and thinking about tooling a lot around that, I think it translates into the customer experience. There’s just something about it where, if it’s easier to do the right thing as a developer, then the right thing gets done more, which makes it a better product. So yeah, I think that’s cool.

Patrik Devlin 51:35
For sure. I mean, the more you can iterate and experiment, the better the end result is going to be. I guess one of the parallels you can draw is with frontend frameworks now: you can write some new code and immediately see it on the page, see what it looks like. On the data side, if you’re running dbt on Snowflake, you’re going to write some new SQL, and then you’re going to wait while it executes and all this stuff. And I think with DuckDB bringing that more local, you can just fly through some of these workloads. That’s been particularly exciting for me.

Eric Dodds 52:20
Super cool. Well, Patrick, this has been so fun. The time really flew by. Thanks so much for joining us, and keep us posted on how the marketplace goes.

Patrik Devlin 52:28
Yeah, definitely. I really appreciate it.

John Wessel 52:30
Thanks a bunch. Thanks for coming on the show.

Eric Dodds 52:33
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.