Episode 04:

Data Council Week: Using Data Anonymization for Identity Protection with Will Thompson of Privacy Dynamics

April 26, 2023

This week on The Data Stack Show, we have a special edition as we recorded a series of bonus episodes live at Data Council in Austin, Texas. In this episode, Brooks and Kostas chat with Will Thompson, the Director of Engineering at Privacy Dynamics. During the episode, Will talks about all things data anonymization and privacy. The conversation includes the challenges of data privacy and other complexities in data, the journey of being an engineer in a startup, how Privacy Dynamics is solving problems in the space, and more.

Notes:

Highlights from this week’s conversation include:

  • Will’s background in data (0:28)
  • Privacy Dynamics and data anonymization (4:18)
  • Addressing data privacy problems in the space (10:33)
  • Developer experience with Privacy Dynamics (13:49)
  • How does Privacy Dynamics work? (21:09)
  • Update of real-time anonymized data (26:29)
  • The problem of dates and other complexities in data (31:24)
  • Being a data engineer in a startup (34:44)
  • Moving at the speed of a startup (41:01)
  • Connecting with Will and Privacy Dynamics (43:28)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Brooks Patterson 00:23
All right, we are here, if you’re following along, at Data Council Austin, with the chance to record some shows in person, which is great; usually we’re on Zoom. Today we’ve got William Thompson here at the table with us. He’s the head of engineering at Privacy Dynamics. I’m Brooks, and I’m filling in for Eric this week. Again, if you’re following along, he couldn’t make it to the conference, so you’re stuck with me. Kostas is here. We’re excited to talk with William Thompson today. Will, to start us off, can you share a little bit about your background?

Will Thompson 00:55
So originally, my background was very data oriented, but in a different way. I worked in a document-centric world, for a legal publisher, and we were building a legal research platform. So you’re dealing with text; we were essentially an XML shop. We had a search engine, note-taking tools, and a very specific customer, which is a lawyer who is trying to work on their cases. Then that company was bought by Thomson Reuters, and I joined the startup Privacy Dynamics, which was a completely different tech stack and a completely different problem. So that was a huge shift for me. I’ve dived into this world of Python, data science, and enterprise, B2B-type software, which was a huge change, but I found it super interesting.

Brooks Patterson 02:22
So cool. I mean, it’s such a big change, both on the data side and, I’m sure, in your day-to-day work life, going from the legal industry, working for a publishing company building out a digital platform, straight into the fire of working at a startup. You mentioned before the show that at the publishing company you were building out this digital platform, but the business already had an extremely successful publishing business. So the pressure to move fast, which in the startup world is just always there, was not necessarily there in the same way. I think you said, we got to do things the right way, we had a clear vision of what we needed to do, and we went and executed. It’s a totally different story working at a startup. Could you talk a little bit, maybe from a personal perspective, about what that’s been like?

Will Thompson 03:22
Well, yeah, I had never worked at a startup, so I really had no idea what to expect. And obviously it’s totally different. It’s an entirely different set of challenges, and it’s definitely more challenging, so you have to prioritize, and you have to be really careful with how you’re allocating your time. The main thing I’ve learned is you always have to stay focused, and you can’t get too married to any particular idea. Before, we knew exactly who our customer was, and in an early-stage startup you’re learning who the customer is. You think you have your customer figured out, and then it needs to shift, and then you have to change your engineering priorities. But you don’t want to leave a trail of garbage at every turn in this little road. So that’s a different challenge.

Brooks Patterson 04:38
This is fascinating. I want to talk a little more about Privacy Dynamics and data anonymization, which you even said yourself is kind of an overloaded term. I love to use it. Kostas loves to talk about definitions, so I’m going to hand it off to him, and he’ll dig into the definition of that from a couple of different perspectives.

Kostas Pardalis 05:00
So what is anonymization? Let’s start with that. Right?

Will Thompson 05:05
It depends on who you are, right? In our case, we’re talking about data anonymization. In a lot of cases, people think of anonymization more as a security problem, which is who has access to what data. So there’ll be encryption, tokenization, that type of thing. We start more from the assumption that someone needs access to this data, but we don’t want any individuals in the data to be identifiable. So anonymization in our context is protecting identities in data that you need to use, rather than hiding specific information. We can still tokenize things if you need it. But generally, people need to have format consistency, or do research on some data, and you don’t want to make it possible for anyone to figure out who is in it.

Kostas Pardalis 06:08
Okay, that’s super interesting. So let’s start with the first thing that comes to mind for anyone who has written some code in their life: okay, I have an email field somewhere, I just put random strings there, and I use that. I have a feeling that it’s something much more than that. So let’s talk a little bit about the technical side of things, how anonymization is actually built and implemented on top of data, and especially how it relates to data that we might not think is important to anonymize. The email or my social security number are very obvious, right? But there might be very clever ways to identify a person. So let’s look at that.

Will Thompson 07:10
Sure. So these attributes are typically grouped into two categories. One is direct identifiers. The other is indirect identifiers; people also call them quasi-identifiers. Direct identifiers are what you just went over: names, addresses, and social security numbers. A lot of people are working on that, and how you treat those things just depends on who the user is. In a lot of cases, tokenization is fine. In other cases, like in dev/test, some developer is going to scream bloody murder if the email address is some random string of characters: I need it to be an email address. Some people need it to be a routable email address. So you have all these different concerns for that. Indirect identifiers are where it gets tricky, and that’s what we focused on initially, because that’s really important to healthcare, and it’s also important for CCPA and GDPR as well. You can’t identify someone directly with their zip code, or their gender, or their date of birth. But if you combine those together, the combination becomes more unique, and then you can identify people. The risk is what’s referred to as a linkage attack. You go get some data somewhere that has people in it, and then you do a statistical attack. Essentially, you try to relink these people based on the set of quasi-identifiers, and then you can assign probabilities: what’s the likelihood that this person here, who’s anonymous, is this real person that I know? It doesn’t have to be 100% to be a risk, but sometimes you can match with a lot of certainty. For this type of anonymization we use what’s called k-anonymity. The concept is you create groups. Our algorithm belongs to a category of algorithms called microaggregation. 
And the idea is, essentially, you create clusters for everybody in the dataset, clustering people based on similarity. The more protection you need, the larger the group. So we’ll cluster all these people together, then we find the center of the cluster, and we make everybody match the center. So maybe we’ll find somebody who lives close to your zip code, who’s the same gender and a close age, and we’ll shift you to be the same. And then you will no longer exist in the dataset. Or maybe you don’t change at all in the dataset, but now there’s at least one other identity in the dataset that matches your combination of quasi-identifiers. So now, at a minimum, if someone is trying to link across the dataset, there are two. And then you can increase it to make it even more difficult.
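
The grouping idea Will describes can be sketched in a few lines of Python. This is a deliberately naive illustration, not Privacy Dynamics’ algorithm: it assumes the quasi-identifiers are numeric, approximates "similarity" by simply sorting on them, and uses the rounded mean as the cluster center. The function names `microaggregate` and `smallest_group` and the sample records are made up for the example.

```python
from collections import Counter
from statistics import mean


def microaggregate(records, quasi_ids, k):
    """Return a copy of records where every quasi-identifier combination
    is shared by at least k rows (numeric quasi-identifiers only)."""
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records")
    # Naive similarity ordering: sort row indexes by the quasi-identifier tuple.
    order = sorted(range(len(records)),
                   key=lambda i: tuple(records[i][q] for q in quasi_ids))
    # Cut the ordering into groups of k; fold a short tail into the previous group.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    treated = [dict(r) for r in records]
    for group in groups:
        for q in quasi_ids:
            # Everyone in the group is moved to the group's center.
            center = round(mean(records[i][q] for i in group))
            for i in group:
                treated[i][q] = center
    return treated


def smallest_group(records, quasi_ids):
    """Size of the smallest set of rows sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(counts.values())


people = [
    {"name": "A", "zip": 78701, "age": 34},
    {"name": "B", "zip": 78702, "age": 36},
    {"name": "C", "zip": 78745, "age": 61},
    {"name": "D", "zip": 78748, "age": 59},
    {"name": "E", "zip": 78750, "age": 22},
]
anonymized = microaggregate(people, ["zip", "age"], k=2)
print(smallest_group(people, ["zip", "age"]))      # every row is unique before
print(smallest_group(anonymized, ["zip", "age"]))  # at least k rows match after
```

Real microaggregation methods choose the groups much more carefully to limit distortion, and they also have to handle categorical columns such as gender, which this sketch leaves out.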

Kostas Pardalis 10:50
Okay. That’s fascinating. So I would imagine, if I was a data scientist in, let’s say, an insurtech company, and that’s not a random example; I chose it on purpose, because we had a conversation at some point in the past on the show with some people from insurtech, and we were talking about privacy. They are data scientists, and they were saying that it is an issue, in the sense that we need to remove anonymity to do our job in a way, to build these models, because we need this information to go and do risk assessment or whatever. So from what I understand, everything is a matter of making the right trade-offs, right? How do we do that? In theory, I get what you are saying, but in a real setup, let’s say I’m that data scientist and I’m going to use your platform. How do I choose the right parameters there? How big should the k be, and all that stuff?

Will Thompson 12:09
Yeah. So there’s a flip side of this. We have this kind of privacy dashboard for everything, and we have a set of tools for two things. Whenever you treat data like this, it falls somewhere on this privacy-utility curve. You can increase privacy up to a point, and if you do 100% privacy, your data is 100% noise. Slide it all the way back the other way, and there’s no privacy. So you want to find that sweet spot. On the privacy side, you can measure it with this risk assessment, which is pulled almost directly from healthcare literature on how to do these things. It’s like a Monte Carlo simulation of an attack: we do this simulated linkage attack, and then we say, here is approximately how linkable we think your data is, and we put it into these categories, basically low, medium, or high risk. So that’s the risk analysis. The other one is we have tools for measuring distortion. You’ll run the dataset through the system, and we’ll show you how your distributions have changed and how your main, top-level statistics have changed. We recently added something that shows relationship distortion: how has the relationship between age and some other column, which maybe has non-identifiable information, changed? This way, a data scientist can look and see where the privacy is according to the risk assessment, and how badly the data is distorted. Ideally, what you would do is dial it to as much privacy as you can get for the distortion that you can accept, and then set that, and let that be your

Kostas Pardalis 14:08
baseline. Yeah, yeah, that makes total sense. How much more complexity, let’s say, does it add to the life of a data scientist to do that?

Will Thompson 14:19
We would hope not that much. This is one of those things where we want to iterate if anybody gets blocked, but we try to make it as frictionless as possible. Ultimately, we want to give you all the information you need, so you can say, all right, this is too much distortion, or maybe we should notch this up, dial it, run it again. Hopefully you experiment a little bit until it’s what you need, and then you don’t worry about it anymore. Maybe you come back and check. Maybe we send an alert if the risk level changes by more than some percent, something like that. But essentially, our idea is to let the data scientists work on another problem. We will handle the anonymization, you can come check the dashboard, you can integrate it in your system, and then you work on whatever it is your company does.
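
The two measurements discussed above, re-identification risk and distortion, can be illustrated with a small Python sketch. It swaps in a simpler technique than the simulated linkage attack Will mentions: each row’s risk is taken as one over the number of rows that share its quasi-identifiers. The function names, the risk thresholds, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter


def reid_risk(records, quasi_ids):
    """Return (average, maximum) risk, where a row's risk is
    1 / (number of rows sharing its quasi-identifier values)."""
    keys = [tuple(r[q] for q in quasi_ids) for r in records]
    sizes = Counter(keys)
    risks = [1 / sizes[key] for key in keys]
    return sum(risks) / len(risks), max(risks)


def risk_band(max_risk, low=0.05, high=0.2):
    """Bucket a risk score; the thresholds are arbitrary placeholders."""
    if max_risk <= low:
        return "low"
    if max_risk <= high:
        return "medium"
    return "high"


def mean_abs_distortion(original, treated, column):
    """Average absolute change in one numeric column, row by row."""
    total = sum(abs(o[column] - t[column]) for o, t in zip(original, treated))
    return total / len(original)


original = [
    {"zip": 78701, "age": 34}, {"zip": 78701, "age": 36},
    {"zip": 78701, "age": 59}, {"zip": 78701, "age": 61},
]
treated = [
    {"zip": 78701, "age": 35}, {"zip": 78701, "age": 35},
    {"zip": 78701, "age": 60}, {"zip": 78701, "age": 60},
]

for label, rows in (("original", original), ("treated", treated)):
    average, worst = reid_risk(rows, ["zip", "age"])
    print(label, average, worst, risk_band(worst))
print("age distortion:", mean_abs_distortion(original, treated, "age"))
```

Running the same two measurements for several group sizes is one way to look for the point where extra privacy starts costing more distortion than the analysis can tolerate.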

Kostas Pardalis 15:18
Yeah, 100%. Okay, sounds good. Back to the other type of anonymization, the social security numbers and all that stuff. It was interesting, because you mentioned developers saying, okay, I need something that looks like an email. I get that: if you have a regular expression somewhere to match something, and you want to test that it actually works, a random string there is a problem. Tell us a little bit more about that. We talked about the data scientists, but there are always developers and engineers involved, and they have different needs, right? Anonymization affects their work in a different way. Tell us a bit more about that, because it sounds very interesting, especially around the developer experience of working with this and how it affects their job.

Will Thompson 16:15
Yeah, so it’s an overlapping problem, but they have these unique concerns. If you’re a data scientist and you want to work on anonymized healthcare data, it’s probably just one dataset, or a handful of datasets that may or may not actually be linked together. Developers, on the other hand, have a database with tables, and those tables have foreign key and primary key relationships, and you need to maintain those relationships. So these are the features we started adding for developers. They’re like, well, we want to copy all these tables over, and we don’t want to expose the same keys, but we need to maintain the same key relationships. So you have to tokenize those a certain way. For email addresses, we had to build format-consistent emails and the like, and you had all these little problems. One of them was that their system was actually sending emails, so it needed to be a valid email, but then it needed to not be routable. So off I go into, what is it, the IETF document on email domain naming, and it’s like, oh, well, there are actually a handful of these top-level domains. So you build into your format thing, you know, dot example. What is it? I don’t know. But you actually have these things that will pass their regex, but then bounce if they try to send an email. And then social security numbers, those are just numbers. But names: one of the problems people have is, yeah, you can generate random, normal-looking names, but they want it to be the same name for this record when it comes through the next time. 
And so we can do that in some cases, but not all cases. It’s hard: you’ve anonymized this, but then you need to make it compatible. Essentially what you want is something like a cryptographic hash of someone’s name, so when this row comes back again, it gives you the same name. So these are the kinds of things we’re working on now to improve the developer workflow.
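
The developer-facing requirements above (keys that still join, emails that validate but never deliver, and names that stay the same from run to run) can be sketched with a keyed hash from Python’s standard library. This is only an assumption about how such a feature might look, not the product’s implementation. The secret, the name lists, and the function names are hypothetical; the `.example` top-level domain is one of the names reserved for documentation and testing in RFC 2606, so addresses under it cannot be delivered.

```python
import hashlib
import hmac
import re

SECRET = b"demo-only-secret"  # hypothetical key; a real one would be stored securely
FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley"]
LAST_NAMES = ["Archer", "Brooks", "Carter", "Dalton", "Ellis", "Foster"]


def _digest(value, purpose):
    """Keyed hash: the same input and purpose always give the same bytes."""
    message = f"{purpose}:{value}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).digest()


def fake_email(real_email):
    """Format-consistent address under the reserved .example domain."""
    token = _digest(real_email.strip().lower(), "email").hex()[:12]
    return f"user_{token}@anon.example"


def fake_name(real_name):
    """Pick a stable, normal-looking replacement name for a real one."""
    digest = _digest(real_name, "name")
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


def tokenize_key(key, table):
    """Replace a key with a stable token so joins across tables still line up."""
    return _digest(str(key), f"key:{table}").hex()[:16]


# A typical loose email check that the fake address still passes.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

print(fake_email("Jane.Doe@corp.com"))
print(fake_name("Jane Doe"), "|", fake_name("Jane Doe"))
print(tokenize_key(42, "users"))
```

Because the token for a user’s primary key and the token for an order’s `user_id` are both computed from the same value and table name, the anonymized tables join exactly as the originals did. With name lists this short, different people will often map to the same replacement name, which a real generator would need to account for.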

Kostas Pardalis 18:47
Yeah, that’s super, super interesting. What other types? We talked about names and foreign keys. What other types are tricky and challenging, and that developers care about? What about timestamps, or dates, for example?

Will Thompson 19:06
Yeah, with timestamps, that’s a rabbit hole. You’re like, oh, timestamps? Talk to a senior developer who’s worked with a lot of data about time zones, and the color will wash from their face. It’s the same thing with dates, because how many date formats are there? It’s just one of those problems. It’s a big, messy problem. There’s not a beautiful, simple design that solves it; you just have to build it out as needed. Luckily, each additional thing is not an enormous challenge, though some things are trickier than others. Beyond that, we also try to identify all of these things automatically, so that if you have thousands of columns, you don’t have to go through and set them all; maybe you just go through and check that we got stuff right. It’s probably impossible to get everything everywhere 100% right, so people are always going to have to check this stuff, but our goal is to have it as automated as possible. But some things are just, you know, trying to find the US Postal Service rules on what is a valid address. And even then, let’s say you get that right, or most of it. A lot of data is entered by humans, and they will enter it wrong, and you have to handle that too. So those are not fun, because they’re just messy and kind of annoying. And the hardest thing about it is you have to build a system that can withstand all these additions and bolted-on exceptions without becoming incomprehensible, because you’re never going to stop adding stuff to it. 
And then if it just turns into a pile of spaghetti, it becomes unmaintainable. So that’s a totally different challenge,

Kostas Pardalis 21:24
Yeah, no, and it’s a very interesting problem, to be honest. All right, so talking about working with the data, let’s talk a little bit more about the actual product experience. I’m a developer, we have a database somewhere, and I want to take your product, the Privacy Dynamics product, and use it for my database. How does it work? What does it take, how easy is it, how transparent is it, and what’s the process after that?

Will Thompson 21:59
I mean, ideally, it’s super easy. Let’s say you have Postgres, BigQuery, whatever. You sign up, you create a connector, and you enter the credentials and location for the source database. Then you create another one for the target database, and you walk through a wizard. We introspect the tables and columns, and we’ll try to auto-detect what’s there. You check which ones you want to keep, what settings you want, what anonymization you need, what defaults. Unless you have hundreds of tables, it’s a pretty quick process. Then you set a schedule, and it runs. Assuming you don’t need to make a bunch of changes to what is included or excluded from the project, all you need to do is check the dashboard and see if the data looks like you expect it to: did the distributions look good, did the auto-detection work like you expected? After that, hopefully, you don’t need to use it that much, except maybe if you want to integrate it with some part of your process.

Kostas Pardalis 23:26
Okay, so let’s say we set it up. Who inside the engineering organization is doing the setup and installation? What type of engineer is usually involved in that? Could it be an admin, someone from security, from InfoSec? Is it someone from, I don’t know, somewhere else?

Will Thompson 23:49
It’s usually, yes, an admin. I haven’t come across anybody who doesn’t have good programming experience. Usually they’re working in infrastructure or operations; they’re the ones who are setting it up. Because this deals with sensitive data, we have a SaaS product, but we also did a lot of work to make sure that we can install this on prem as well. Those installs are much more involved, because we work with their ops people, things like that. If you use the SaaS, you just sign up, and then all you need is access to the database. If it’s a small company, you might just have to look up the credentials, and then you get going. We want CISOs to be able to just click, and then they have the information they need.

Kostas Pardalis 24:42
Yeah, 100%. And okay, let’s say now I’m, I don’t know, a product engineer. I’m building a front end and I’m going to have access to this production database. Do I have to know about the existence of Privacy Dynamics? How do I interact with the data?

Will Thompson 25:01
Right. So it would fit in your pipeline, your ETL. The way we would recommend setting it up is that very few people have access to the sensitive database, and you wall that off. The credentials are encrypted, on our system or in your infrastructure. And then there are less private databases that more engineers have access to, maybe in a lower environment. So you give that to the engineers, and they just know there’s a database, and that thing is kept up to date. We run batches on whatever kind of interval you need.

Kostas Pardalis 25:44
Oh, right. Okay, I get it. So the anonymization or encryption, the processing of the data, does not happen on the fly when I execute the query, right? You create a replica of the database that is anonymized, and then people go and access that.

Will Thompson 26:01
Yeah, the anonymization process, no matter what, is somewhat expensive. And also, we have to have a picture of all the data in order to anonymize it. And to do the risk assessment, we need to know everything that’s in it, because one unique row increases linkability. So we have to see everything.

Kostas Pardalis 26:21
Yeah, makes sense. Okay. So I guess streaming is something

Will Thompson 26:25
that, yeah, it’s definitely something we’ve discussed and want to do, because we’ll need to do it for extremely large datasets. It would be a very large project, and it’d be something really fun to work on. But I have to stay focused on what everybody needs,

Kostas Pardalis 26:48
what most customers need and all that. So, okay, from what I understand, we are talking about use cases that are more analytical, right? Someone is going to work with static data that they extract from the database, like a data scientist who wants to build a model. And not so much use cases where, for example, you have a real-time application that has access to the database and needs very consistent, up-to-date data that is also anonymized. Is this correct? Did I get it right? Or do you also support more real-time use cases?

Will Thompson 27:31
I mean, it wouldn’t be truly real time. But depending on the size of the data, we can run it pretty quickly. You can run it hourly, or even every 10 minutes if you needed to, if it wasn’t an enormous dataset. So we can keep data pretty up to date. Also, if you have big data and you’re going to install on prem, we can outfit you with a really large instance, and it’ll go faster.

Kostas Pardalis 28:08
That’s very interesting. So you mentioned big data. One of the most important jobs a good data engineer has is to make the pipelines incremental, right? Because when you have billions and billions of rows, going and processing everything from the beginning every time is overkill. How can you do that when you need to have access, in a way, to the whole dataset to do the

Will Thompson 28:37
Oh, no. We reread; we have to reread it. Incremental is something that we’ve sketched out as an idea, but it’s really hard, because essentially we’re clustering everything. You create all these clusters, and then you add 1,000 rows. How do you change these clusters? That’s complicated. I think we could handle the more data-engineering side, strictly streaming the data and running transformations. That’s all what they call a SMOP, right, a simple matter of programming. But updating a clustered dataset, that’s going to take some time, that’s going to take some tinkering. But yeah, that’s something we want to do.

Kostas Pardalis 29:46
And okay, outside of tabular data, do you see other data that are also part of the picture, like PDF files? How do you work with these types of data? We don’t

Will Thompson 29:59
know yet. The thing we’ve gotten the most requests for is more semi-structured data like JSON, or just arrays, things like that. That’s something we need to do, but it’s also really challenging, for the same reason dates are challenging, but by an order of magnitude. You have these weird dates, and, well, we were just talking to somebody recently at this conference about a column of theirs that was a JSON blob. There’s no schema for it, so I can’t even assume that it’s the same from row to row. So it’s doable; it’s just a big lift. What we would have to do is take that data, normalize it, run our anonymization, keep a map back to the original data, and then use that to maintain the format consistency of something completely arbitrary.
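
The "normalize, anonymize, map back" idea for schemaless JSON can be sketched in Python as follows. Will describes this as something not yet built, so this is purely a hypothetical illustration: each leaf value is paired with the path that leads to it, a rule is applied to matching leaves, and the paths are used to rebuild the original shape. The function names and the sample document are invented for the example.

```python
def flatten(value, path=()):
    """Yield (path, leaf) pairs; a path is a tuple of dict keys and list indexes."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from flatten(child, path + (key,))
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            yield from flatten(child, path + (index,))
    else:
        yield path, value  # a scalar, or an empty dict/list kept as a leaf


def _restore_lists(node):
    """Turn dicts whose keys are all list indexes back into lists."""
    if not isinstance(node, dict):
        return node
    rebuilt = {key: _restore_lists(child) for key, child in node.items()}
    if rebuilt and all(isinstance(key, int) for key in rebuilt):
        return [rebuilt[index] for index in sorted(rebuilt)]
    return rebuilt


def unflatten(pairs):
    """Rebuild the nested structure from (path, leaf) pairs."""
    pairs = list(pairs)
    if len(pairs) == 1 and pairs[0][0] == ():
        return pairs[0][1]  # the whole document was a single leaf
    root = {}
    for path, leaf in pairs:
        node = root
        for step in path[:-1]:
            node = node.setdefault(step, {})
        node[path[-1]] = leaf
    return _restore_lists(root)


def anonymize_blob(blob, rules):
    """Apply rules[field](leaf) to leaves whose nearest dict key is `field`."""
    treated = []
    for path, leaf in flatten(blob):
        names = [step for step in path if isinstance(step, str)]
        rule = rules.get(names[-1]) if names else None
        treated.append((path, rule(leaf) if rule else leaf))
    return unflatten(treated)


doc = {"user": {"email": "jane@corp.com", "phones": ["555-0100", "555-0101"]},
       "plan": "pro"}
rules = {"email": lambda _: "redacted@anon.example",
         "phones": lambda _: "000-0000"}
print(anonymize_blob(doc, rules))
```

Because the paths are computed per document, two rows with completely different shapes can go through the same code; the hard part that remains is deciding which paths hold identifying information when no schema says so.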

Kostas Pardalis 31:01
Yeah, yeah. No, I mean, it’s hard. I can feel what you’re talking about. It’s very rewarding if you figure it out, but there are all these little things that can go wrong. It’s a very challenging problem that you’re dealing with,

Will Thompson 31:19
I’d love to be able to just sink my teeth into some of those problems. They are fun.

Kostas Pardalis 31:26
I mean, I don’t know. Like, I think even if you might not like to solve dates, like your name is going to. I’ve always had a problem with dates in databases. I always forget the stupid language where you define the format, right? I always have to go back to the documentation for each database and see what the format is: when do I need a capital letter, and when not?

Will Thompson 32:01
it always looks like a regex. But it’s not. Yeah.

Kostas Pardalis 32:04
And every time I go through that again, I’m like, I’m too old for that. We live in an age where we have OpenAI that is going to, I don’t know, make us all obsolete, or whatever they say on Twitter today. But I still cannot give a date to some software and have it tell me, this is the format in that language. Or at least I’m not aware of it. If someone is aware of a library that does that, please let me know; you will make me a much happier person. The reason I’m saying that is because there’s always a lot of hype around what is currently happening, but people don’t realize how much boring, really hard engineering needs to happen for all these things to actually work at scale in the end. On one side you have OpenAI, where you ask something and it replies to you. And on the other hand, you still have to go and struggle with dates, right? That sort of problem is still there. So I can feel you, and I think you should be talking more about that stuff. I don’t know if you have a blog or something, but talk about all these little problems, like what you said about the email that has to be testable, where we need to make sure that it looks valid and goes through Meltwater or something, even if it bounces. All these little things that nobody cares about until they have to. And that’s 99% of the engineers out there; that’s why they get grumpy every day, because they have to do these things. So it is important to talk about that stuff, I think.

Will Thompson 34:11
It’s not glamorous, so people don’t want to talk about it.

Kostas Pardalis 34:12
Yeah, yeah. But I mean, I don’t know. I think we’ve got to make it glamorous. Like, if we talk about reality, at the end, all these small things together are what changes the world. You know, at the end, it’s not like suddenly, one day, you come up with a trained model at OpenAI and that’s what happened, out of the blue. No, there are many people that worked hard to figure out a lot of raw data to train this thing. Yeah,

Will Thompson 34:39
exactly. Yeah. The real world is very messy. And solving problems in the real world requires addressing that messiness.

Kostas Pardalis 34:48
Yeah, 100%, and we have to embrace it, actually. That’s also important. Talking about messiness, let’s go back to your experience being an engineer, a founding engineer in a startup, right? Tell us a little bit more about how it feels, what kind of experience it is, how different things are. Because, okay, I think people can probably imagine, but what do you have to go through as an engineer to make yourself productive in such an environment?

Will Thompson 35:24
Yeah, it was certainly uncomfortable for a while, right? Just a real shift in what my objectives were, which went from, you know, we know the customer, we know exactly what they need, we’re going to build this feature, it’s clear, to this situation where, you know, the ground’s moving. I had gotten into a comfort zone where I was able to keep things neat. Everything’s tidied up, everything’s just so, it’s easy to figure out where everything is. And, you know, it was nice. I was the kid who cleaned his room, right? But then going into the startup world, it’s a luxury you can’t really have all the time. And that’s not to say you have to embrace creating messes, you just have to prioritize. Someone called it brutal prioritization. Yeah. And I think it is, but it’s uncomfortable. You have to say, when do I have to stop on this? And also, what do you have to sit down and deal with? You really have to think hard about what is the most consequential thing right now. I always have this paranoia, and that’s what drives a lot of my design: what is the most likely thing to come up and bite me in the ass? What is something we’re going to forget about, and it’s just going to ruin our day someday? These little, you know, time bombs. And so you really want to try to not set those things up. Otherwise, when you’re running full speed ahead, six months later you just trip and eat it, and you’re cursing your former self. So it’s a lot of, like, what’s going to hurt the least? Yeah. You do have to, I think, be okay with being a little uncomfortable. And yeah, that’s kind of the big change in the startup world for me. Yeah.

Kostas Pardalis 37:42
And dude, you chose to go and work on a problem that’s just an infinite number of exceptions that you can’t, like, have in your mind beforehand. You really saved yourself, like, we’re

Will Thompson 37:54
limiting stuff was different, right? Like, there’s literature, you know, you can read all the stuff that people are working on in research. But then it’s like, oh, we need to automate the format detection. Oh, okay. Well, I’ve done messy stuff like this in the past. You know, in the legal platform, there’s all human interface stuff. Yeah. And then we had this case database of all these cases. Some of those date back, you know, hundreds of years, and some of them were probably entered on typewriters in the yard or whatever. So, you know, this messy stuff. I was used to kind of working around that. But that was our messy data. Yeah. Now it’s everyone’s messy data. So it wasn’t a problem I was unfamiliar with. It’s just a different scale.

Kostas Pardalis 38:49
Yeah. Yeah. That’s so interesting. I think that’s especially true when you’re pre product-market fit, because, okay, after that, I think things get more normalized, right? You at least have, let’s say, six months ahead of you that you know you are going to be developing. But before that, the way that I visualize the process, and it’s not just for engineering, it’s for everyone, I think it’s just much more uncomfortable for engineering, is like doing these things where you go, like in a shower, from really hot to really cold. You’re like, yeah, we’re going to build this, it’s going to be, like, yeah, we are going to warm the world. And then suddenly you’re in an ice bath, because you put this thing out there and suddenly it’s like, “What the fuck is this? No, I’m not going to pay shit,” you know? And you have to go back and forth and not have a heart attack. You know, that’s emotionally the thing that you have to go through. And okay, for a salesperson or a marketing person who, let’s say, grew up working in an environment where everything is unexpected, it might be a little bit easier. But for engineering, okay, at the end, we live in a very deterministic world, right? For us everything has to be Boolean in a way: it works or it doesn’t work. There’s no in between. If it doesn’t pass the tests, we don’t put it in production. So I think, from an emotional point of view at least, it’s a very different experience. And it is brutal, 100%.

Will Thompson 40:35
I completely agree with it. And it’s hard to accept that, you know, this thing that you built, the uptake on it isn’t what you expected. But we’re still doing really well right now. You know, we’re still going to need this. It’s just that wasn’t the, like, unlock, right? And so yeah, you have to steel yourself a little bit more emotionally, for sure. Yeah.

Brooks Patterson 41:06
One thing I’ve shared with our marketing team, when it’s just, you know, “hey, we need to ship this project on a super aggressive timeline,” is a quote from Mario Andretti, you know, the legendary race car driver. He said, if it feels like you’re in control, you’re not going fast enough. And I think that applies to everybody at a startup, right? You have to go faster than you’re comfortable with, and just know that you can maintain control. You’re just not going to feel like you have as much as you want. Your room’s not as clean as you want, there are dishes in the sink. Yeah. You just have to get comfortable with it not being as tidy as you want, and just keep, you know, keep moving. Because I think a lot of times, as we move, that’s how things get better, faster, instead of, like, “I gotta get this thing perfect first, man.” Yeah. It’s emotional, emotionally

Kostas Pardalis 41:59
taxing. Oh, yeah. And I think a way to think about it is, you mentioned, like, data is very messy, right? And at the end, data is a very simplified model of the world that we live in. So the world is even messier. So you just have to embrace that. And, yeah, I’ll get easier to shave under but yeah, at the end, that’s what you have to go through. But thank you for having fun. So don’t be discouraged. Like, in the end, it can be fun.

Will Thompson 42:30
Yeah, the fun stuff is definitely very fun. Right? Just because once you get some, you know, once you get something, right, it’s like, you know, looking back on all the work it took to get there. Yeah, it’s, you know, you can kind of impress yourself, and then that’s really gratifying.

Kostas Pardalis 42:46
Yeah. And now I remember, someone said something, it was probably a tweet, many years ago, that having a startup is like having a newborn. It’s like 99% of the time crying and full of shit. And then there’s this 1% when it smiles at you and looks at you, and it’s so rewarding.

Will Thompson 43:08
That’s very apropos. I like that. I joined the startup months before we had our first. And so yeah, it was two babies at once. Yeah, I think that’s an apt comparison. Yeah,

Brooks Patterson 43:25
we have. You have quite the mettle if you can handle that. That’s amazing. Yeah. We’re at the buzzer here. But we will be on the lookout for your blog about the email problem, the emails that send but don’t deliver. For folks who want to learn more about Privacy Dynamics, to check out what you’re doing, what you’re building, where can they find that?

Will Thompson 43:49
You head over to privacydynamics.io. We have a docs site, and if you want to learn more about anonymization, we have detailed, you know, literature explaining how we do it, why we do it, how it all works. We have blogs that show how to get started, you know, quickstarts for all these different types of setups. So yeah, just head over. There’s a lot of good information.

Kostas Pardalis 44:18
Yeah, and actually, I would say, I know, okay, our audience is probably more on the technical side. But I think anonymization is one of the things that everyone should read about, and read not just about the legal aspect of it, but just to see the effort that goes into the engineering for these things to happen. And we should also all be at least a little bit aware of what is going on out there, because in the end it’s our data, right? Like, the medical records belong to me. Yeah, someone’s storing that, but it is my data. So we should all be more literate around that stuff. And it’s amazing stuff that you’re building, what kind of knowledge base. So we should spread the word around. Yes.

Will Thompson 45:02
That’s great. If anybody has any questions, just reach out to us. Our emails should be on our website. So yeah, we’re happy to answer questions. Awesome. Thank you so much. Thank you guys. I had a lovely conversation. Yeah, I really enjoyed it. Well, thanks.

Brooks Patterson 45:15
Yeah. Thanks for joining us. Listeners, thank you all for joining us as well. Check out privacydynamics.io, and we will catch you on the next episode.

Eric Dodds 45:24
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.