Episode: 61

What is Data Design? With Kevin Gervais of Touchless

with Kevin Gervais

Founder and CTO, Touchless

This week on The Data Stack Show, Eric and Kostas are joined by Kevin Gervais, the founder and CTO of touchless.inc. Their conversation highlights how business is a flow of data and that making intentional choices with handling and structuring data can make a big impact on the business.

Notes:


Highlights from this week’s conversation include:

  • Kevin’s interaction with data at an early age (2:35)
  • Working with telecom data (5:08)
  • Analyzing emojis in customer sentiment (8:44)
  • Infrastructure needed for diverse data (12:22)
  • Building better interfaces and looking out for human error (24:17)
  • Dealing with differences in identities in different layers of the stack (41:21)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Today we’re gonna chat with Kevin Gervais. He’s done a lot of interesting things with data. He’s been working with data for a very long time. And the topic that we want to get into today with him is data design. And his philosophy is that, before you start talking about any of the technology, or data flows or infrastructure involved with data, you need to design the data itself, really fascinating stuff. Kostas, I’m interested to know what led Kevin to this philosophy? You don’t arrive at sort of a thesis like data design, without having gone through probably a lot of painful experiences, which isn’t uncommon. So I’m really interested to hear what his background is, and where he sort of built the foundations of this theory.

Kostas Pardalis 1:21
Yeah, 100%. And I’d love to get into more detail about what this whole data design thing is. We keep forgetting that data has a shape, and we define the shape. And it has some properties where again, we define these properties. But we don’t talk about that much, mainly because we have some other more primitive problems to solve. But when it comes to building a sustainable data infrastructure, it’s inevitable to get into these kinds of design conversations and like how you model the world. And it becomes like a bit more of a future of technology. But it’s a very, very, very important aspect of working with data. And it’s like the first time that we are going to discuss that stuff. So I’m very, very excited.

Eric Dodds 2:08
Great. Well, let’s jump in and chat with Kevin.

Kevin, welcome to The Data Stack Show. We’re super excited to chat with you about all things data, and specifically data design.

Kevin Gervais 2:18
Great to be here. It’s a good day to talk about data.

Eric Dodds 2:23
It always is, right, every day. Give us a little background. I mean, you’ve been actually working with data since you were a child, literally, which is pretty wild. But just give us a little bit of your story. And then tell us what you’re doing today.

Kevin Gervais 2:35
Yeah, absolutely. My life started once I got into data. I’ve always been really fascinated by organizing things. We were chatting earlier about games growing up: we were using Macs, which didn’t have many games on them. And so back then that’s where I learned that we could use AppleScript to change the data inside of an app, and by changing a couple of things, you could make a stormtrooper replace the icon of what was some boring thing before, or you could change one piece of data and change the sound of something. And so just learning that working with data can provide immediate gratification, that you can actually see the impact of it instantly, has always fascinated me. And even just the progression of it: when I was getting into web design, someone asked us to build sites for organizing embroidery, like shirts that we’d get embroidered, and they handed us a bunch of CDs and a bunch of catalogs and said, go figure it out. Realizing the satisfaction that you can get out of just organizing stuff has always been a passion. So I spent about 15 years doing that on the web side of things, working on ecommerce and different web projects, and then got into the telecom sector the last eight years. And that’s where you start to see what data looks like when it’s really clean or when it is standardized. Even then it’s not beautiful, but seeing how others have dealt with some of these things and how they organize their things has been fascinating. And I’ve been kind of privileged to have been exposed to so many different situations of how data can be organized, good and bad. So yeah, it’s something I love talking about.

Eric Dodds 4:36
Yeah. One question on the telecom side of things. What kinds of data were you dealing with in telecom, in terms of format? I mean, is it sort of standard stuff or like, I’m just interested to know in the telecom sector, what are the most common types of data and then maybe some of the more challenging types of data that you dealt with in telecom?

Kevin Gervais 4:57
It’s a good question. No one’s ever asked me that. What data did you get exposed to? Oh, boy.

Eric Dodds 5:07
We try to dive deep here on The Data Stack Show.

Kevin Gervais 5:08
Man, that’s a little personal. I’m starting to get emotional because it’s gonna bring back memories. *laughter* So the business we were in was trying to help telecom companies better serve their customers, so they have a better life cycle with them. So instead of a random person from a call center talking to someone out of the blue that they’ve never met, we were working with a telecom to remember them, to make it so the person that sold somebody a phone or tablet was the person that would follow up. And that would keep that relationship alive for years. And do that over text. So in order to have a great relationship like that, they had to have context. So you’re working with transactional records, purchase records, slick packages, how long has it been since they were last talked to? Notes, history. And then as we got into that, we had to deal with conversational data. So you’d have to deal with, like, how do you determine sentiment when most of the APIs that are out there are trained on, say, email communication, or well-formatted sentences? How do you look at the sentiment of somebody who’s replying with acronyms over text? Or an emoji? Right? So we had to deal with a lot of data, millions of records of data, that you couldn’t just apply the standards to. And then we got to the POS data, too, because if you’re trying to figure out, how do you have a good conversation with someone? Or is this conversation working? Or is this script working? You have to tie back to transactional data. In that scenario, the carrier would have certain data about a customer, only the products that they sold, but then the store that sold stuff to them would have accessories and other stuff that the carrier doesn’t know. And so we had to marry these two things and worry about duplicates. So it accidentally ended up that we put ourselves in the middle of all of these crazy data problems.
And we had to actually solve a lot of them in order for us to do our job and have accurate reporting, right? Like, is this campaign working? You need to deal with all these different things. So yeah, it was like a very interesting experience of having to be exposed to different formats. And also, I think the surprising thing out of that is just seeing that even these large companies that spend hundreds of millions of dollars on some of their systems, they don’t have the cleanest data either, right? So everyone seems to maybe dream of the day that oh, like, someday I’m gonna have everything all perfectly clean. No, I’m sorry. Like, it’s not gonna happen. It’s just how much mess are you willing to have today? But there will always be a mess.

Eric Dodds
I mean, Kostas’ words ring in my ears all the time. I mean, data in general is messy. Customer data tends to be very messy. I have one very clickbaity question for you. How did you deal with emojis and sentiment? Like, that’s just a really interesting topic that I actually think is probably pretty relevant.

Kevin Gervais
Well, emoji are data too. It’s all just converted into an ASCII code, basically, so it’s just being able to understand which code means what, but also knowing which ones are inappropriate, which ones are inferring something very negative in some cases. Like you could have a very positive statement with a series of emojis after, and the emojis cancel out the meaning of the words. And so yeah, it was interesting. In the heart of COVID, when that was happening, a lot of these telecoms shut their stores right away. They have since opened them up, but as soon as COVID hit last year, everything was shut down. 80% of them just shut their stores. And we were trying to understand what was the pulse of people who were still buying, or when carriers were reaching out to customers, what were they saying? So in that respect, we came up with a model where we detected that certain phrases or a series of emojis could indicate whether someone was afraid, or joyful, or sad, and then we compared that to prior periods to come up with a bit of an index of what is the consumer sentiment during this time of crisis. We did see a difference. Whenever the stocks would dive, we saw an increase in the fear in the way that people replied. I think what was fascinating, actually, is that when we got into helping people do outreach, right away we created this concept of standardized lists and standardized chat starters. And so since the beginning with that business, the chat starters in some cases never changed over the years. So for a given campaign, to a given segment, this is what replies we should expect to see. And you’d be able to know this because it was all kind of standardized right at the beginning. Because we did that, it allowed us to come up with these patterns that you wouldn’t otherwise, because if you weren’t always asking the same question, you wouldn’t be able to know, is the sentiment changing or not? Right? If you’re trying to read sentiment just based on random, free-form conversations, it’s gonna be all over the place. So yeah, I guess the learning was that because we worried about being able to tie it back to specific baselines, right, like cohorts and scripts, right at the beginning, that was an enabler for us to do some of these types of sentiment analysis. We had something to go back to. We knew how people replied to that same question that people would ask, like, hey, how’s the phone you’re using working out? Or, have any questions about your phone? We knew how people replied to that the week before all these things shut down. People reacted differently to that same question. It was like, huh. Yeah, it was good learning from that.
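To make the baseline idea concrete, here is a minimal sketch of how replies to one standardized chat starter might be compared across two periods. It is not the model Kevin's team built: the lexicon, the emotion labels, and the function names are all hypothetical, and emoji are matched as Unicode characters (strictly speaking they are Unicode code points rather than ASCII).

```python
from collections import Counter

# Hypothetical lexicon mapping emoji and phrases to an emotion label.
EMOTION_LEXICON = {
    "😀": "joy", "👍": "joy", "thanks": "joy",
    "😢": "sadness", "😞": "sadness",
    "😨": "fear", "😟": "fear", "worried": "fear",
}


def emotions_in(reply: str) -> Counter:
    """Count lexicon hits per emotion in a single reply."""
    text = reply.lower()
    hits = Counter()
    for token, emotion in EMOTION_LEXICON.items():
        hits[emotion] += text.count(token)
    return hits


def emotion_share(replies: list[str], emotion: str) -> float:
    """Fraction of replies containing at least one hit for the emotion."""
    if not replies:
        return 0.0
    flagged = sum(1 for reply in replies if emotions_in(reply)[emotion] > 0)
    return flagged / len(replies)


def fear_index(baseline: list[str], current: list[str]) -> float:
    """Change in the share of fearful replies to the SAME chat starter."""
    return emotion_share(current, "fear") - emotion_share(baseline, "fear")


# Replies to one fixed chat starter, before and during the disruption.
baseline_replies = ["All good 👍", "thanks!", "works great 😀", "ok"]
current_replies = ["a bit worried 😟", "All good 👍", "😨", "ok"]
print(fear_index(baseline_replies, current_replies))
```

The comparison only means something because both lists answer the same question from the same kind of cohort, which is the point made above about standardized chat starters.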

Kostas Pardalis 12:07
What kind of infrastructure do you need in order to deal with such a diversity in the data that you’re working with? How did you manage to work with all this data? Like in a consistent way? Right?

Kevin Gervais 12:22
It took us a while to figure that out. I could tell you how not to do it. Actually, what we had to deal with is what I think a lot of companies go through, really a lot of folks, because our business started out where people would upload CSVs. Everyone knows CSVs, okay, so upload CSVs: they’d give us structured data, and we would upload it. And so at the time, when we first started the company, it was like, oh, this is what we do. Somebody gives us a file that always looks like this, and so we will have columns in the database that are exactly the columns that we received, no problem, right? And we did that for years. And then once people started giving us new types of files, we’re like, oh, okay, I guess we’ve got to jam these into the columns we had before. And then, as we got into more and more types of data, it became messy for us to figure out. I think the big thing that we came away with later is that we shouldn’t have been so opinionated at the beginning about having columns for specific types. We shouldn’t always assume that there’s a column called subscriber ID, because maybe it’s not a subscriber ID, right? Maybe it’s an accounting ID, or maybe it’s a Salesforce ID. And so I think the lesson out of that is we should have structured the data based on what type it was, right? Was it an identity? Was it a person record? Was it an org record? Or is it an event? What we ended up moving to with the new architecture is an event-based CQRS model where you’re event sourcing, right, so you’re actually designing the domains of your data. And then you’re using AxonDB and a bunch of other stuff to kind of force everything into events, and that creates your model. But that was a lot of work, and extremely difficult.
And I think, yeah, if we had put the data into a more universal format at the beginning, like just realize that the names of our columns probably will matter, or like, let’s not always expect everything to be perfect integers in a column, we could have saved ourselves a lot of pain. And I think most businesses, maybe they’re not working with the same scale or kind of data, but I think every business has a lifecycle to the data that they have. The data that they collect at the beginning is different than the types of data that they collect five years down the road, right? They might change their billing system. They might want to change their CRM out in the future. And so I think just designing for agility, right, becomes really important.
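As a rough illustration of structuring data by type and storing changes as events rather than as fixed columns, the sketch below replays a list of events into the current state of one entity. It is plain Python, not the Axon API, and the `Event` fields, event type names, and identity system names are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Event:
    """One immutable fact about an entity (hypothetical shape)."""
    entity_type: str          # e.g. "person" or "org"
    entity_id: str
    event_type: str           # e.g. "identity_added" or "trait_changed"
    payload: dict[str, Any]
    occurred_at: datetime


def apply(state: dict[str, Any], event: Event) -> dict[str, Any]:
    """Return a new state with one event applied; the input is not mutated."""
    new_state = {
        "identities": dict(state.get("identities", {})),
        "traits": dict(state.get("traits", {})),
    }
    if event.event_type == "identity_added":
        # The source system is data, not a hard-coded column name.
        new_state["identities"][event.payload["system"]] = event.payload["value"]
    elif event.event_type == "trait_changed":
        new_state["traits"][event.payload["name"]] = event.payload["value"]
    return new_state


def replay(events: list[Event]) -> dict[str, Any]:
    """Rebuild the current state by applying events in time order."""
    state: dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.occurred_at):
        state = apply(state, event)
    return state


events = [
    Event("person", "p1", "trait_changed",
          {"name": "plan", "value": "unlimited"}, datetime(2021, 1, 1)),
    Event("person", "p1", "identity_added",
          {"system": "carrier_subscriber_id", "value": "S-1"}, datetime(2020, 1, 2)),
    Event("person", "p1", "identity_added",
          {"system": "salesforce_id", "value": "SF-9"}, datetime(2020, 3, 1)),
    Event("person", "p1", "trait_changed",
          {"name": "plan", "value": "basic"}, datetime(2020, 1, 1)),
]
current = replay(events)
print(current)
```

Adding a new source system here means emitting events with a new `system` value, not adding a column, which is one way to read the "designing for agility" point.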

Kostas Pardalis 15:29
That’s a great point. I’d like to hear your opinion on that. Because I think that models change not just because of ignorance from the beginning, right? It’s also because, in the end, we are building these data models to somehow represent reality, the business reality, and the business reality changes, right? If we think about a company like RudderStack, like a startup, what RudderStack was a year ago compared to what it is today is a completely different thing. And of course, this is also reflected in the data models that we have. Could we have done a much better job back then? Of course. But I think that even if we had managed to do the best possible job in modeling our world back then, just because we didn’t know the world that well yet, we would have had to change things at some point. So that’s why I think what you said about building all these components with agility in mind, and being able to change and adapt your data, is super important. So how do you do that? How do you build agile data models? What principles drive the design?

Kevin Gervais 15:34
That’s a very good question. There’s 17 …

Kostas Pardalis 16:08
Too personal? *laughter*

Kevin Gervais 16:12
No, that’s good. That’s a good one, you know, here’s how I think about it. I think, first, all a business is is the flow of data. Everything serves the data in the end. I mean, meaning like, it’s like it was talking about like a website, for an example. We put a website up, if you create a design, you put a website up, what’s the whole point of the site? Well, you want someone to call, right? Or you want someone to text in or you want someone to fill out a form? Okay, once they filled out a form and maybe started an order, what is it now? It’s data. So the point of actually a website is to trigger either a data connection to make a phone call, or a data connection to start a text or capture some information and grab that as data, and then flow that to somewhere. That’s really the sole job of the site, right? And especially if you’re trying to, even if you’re branding, your site is just to help make people feel good about the brand, then how do you know if you’re doing that? Well, then the job of the site is to collect data to see if you’re accomplishing that goal, which in that case, time on site, are they interacting with the cool pieces that you’ve put in there that are branding elements, are they watching the videos, etc.? So really, like data is, is so important, and usually it’s thought about as an afterthought. So I think just recognizing the fact that the flow, the capture and transformation, and the flow of data, is kind of what drives business, right. And remembering that makes, I think it’s just important because it helps us with a design process where you collect the data is not usually where you want it to end up. And then also, just remembering that where it ends up today is not necessarily where you want it to end up tomorrow. Most businesses go through a lifecycle, right or, or even an evolution in the systems that they use. 
And so the challenge, to answer your question of how you go about designing for it: I think first you have to know your inputs. First, we have to be able to track all kinds of things, right, all kinds of events. We should be able to identify the types of things that we’re tracking. And we should be able to move those things into different systems without a whole bunch of work.

And what happens if you do want to switch systems? Because at some point, you’re going to want to switch systems out. You’re in one CRM one day, and you want to go to another. So I think just having those as inputs into the design process shows some of the variables that you have to consider, right? And what I’ve noticed is that you actually can design your data. In the web world, or even application design, there’s a notion of user interface design or user experience design, right? That’s a function where everyone can understand, okay, I need to have a person draw up something that someone will interact with. Where should the button go, right? And it’s very easy to start there and kind of only focus on that, because you get that immediate benefit. You can draft it, put it out there, someone interacts with it, and you think your job’s done. But data needs more design than an interface, because data integration and data transformation don’t happen by accident. If you want your data to flow seamlessly between systems and to be future proof, you should design it as much as, and I would argue more than, any graphical interface that you have. And so just like there are standards for user experience design, like don’t put your close button in a random place off to the side of the screen so that you have to shake your phone in order to see it, that’d be bad design, you should think of those rules. There are similar things in the world of data. We know what a person looks like. A person, as an example, has a name, and that name can change. They have a birth date, they have a death date. And they probably have an identifier attached to that. But that’s a person. I mean, it’s sad, but that’s a person: a name, an identifier, a birthdate and a death date.

Now a person can then have identities attached to them. They can have traits attached to them. But those traits and identities can change over time, right? Names can change, addresses can, even interests, personalities, gender, those sorts of things could change. And so when you actually look at how most CRMs treat that data, if you think that’s going to be your perfect data model, if they think of a person as first name, last name, gender, address and phone number, and that’s a contact, you can see why that doesn’t fit many situations. So you end up with duplicates if somebody belongs to multiple organizations, et cetera. So I think going back to how you fix it, it’s abstracting away what’s fixed and what could change. If I think of what is an actual person, you have a given name, you have maybe a gender at birth, right, that might be on the record. And then you might have a birthdate and a death date. And that’s it, everything else is changeable. And then a person can be related to various places and reuse. And if we design for things like that, I think we would end up with a better understanding of the relationships across our data.
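One way to picture the split between what is fixed and what can change is a small record for the person plus a list of dated traits. The sketch below is only an illustration of that idea; the class names, field names, and the `trait_as_of` helper are hypothetical and not taken from any CRM.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    """The part that does not change once recorded."""
    person_id: str
    given_name_at_birth: str
    birth_date: Optional[date] = None
    death_date: Optional[date] = None


@dataclass(frozen=True)
class Trait:
    """Anything that can change over time, with a validity window."""
    person_id: str
    name: str                 # e.g. "last_name", "address", "employer"
    value: str
    valid_from: date
    valid_to: Optional[date] = None   # None means still current


def trait_as_of(traits: list[Trait], person_id: str, name: str, on: date) -> Optional[str]:
    """Return the value of a trait as it was on a given date, if known."""
    for trait in traits:
        in_window = trait.valid_from <= on and (trait.valid_to is None or on < trait.valid_to)
        if trait.person_id == person_id and trait.name == name and in_window:
            return trait.value
    return None


person = Person("p1", "Alex", birth_date=date(1990, 5, 1))
traits = [
    Trait("p1", "last_name", "Smith", date(1990, 5, 1), date(2015, 6, 1)),
    Trait("p1", "last_name", "Jones", date(2015, 6, 1)),
]
print(trait_as_of(traits, "p1", "last_name", date(2010, 1, 1)))
print(trait_as_of(traits, "p1", "last_name", date(2020, 1, 1)))
```

Because a name change adds a new `Trait` row instead of overwriting a column, the same person never needs to be re-created, which is one source of the duplicates mentioned above.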

Kostas Pardalis 23:08
Let’s take a real life situation, you have like an annoying salesperson who decides to go on your Salesforce and put a flag there just to remind them if they have visited like a contact or not, and they have like results to contact or not without consulting the data model, without reaching out to the person who is responsible for the data model or whatever. How do you deal with that?

And what I mean is questions like how do you deal with the human nature of taking control of things to attain what they want in the end, right? Because the problem that I have seen so far, like with all these things that have to do like with modeling and having like, like a very crystal clear, let’s say, where you have, like understanding and distilled way of understanding, like the world around us, is that the biggest enemy of this is like the rest of the people involved, right? They make mistakes, or they decide that I need something else, but I need it now. I’m going to change it. How do we deal with that? How do we deal with humans?

Kevin Gervais 24:16
I think you should like first, I think we can build better interfaces, right? I think like with recent situations, a client I was working with had messy contact records and messy addresses. And they wanted to understand if there is a pattern to customers living in a certain area like do they seem to be getting more people from a certain area? And in order to do that, we looked at the data and it was a human entry error, where so many addresses would have like notes in them, dashes, weird quotes. And oh, instead of having like a unit number, it was like, right in the actual address. And then we recently fixed over, like 50,000 records last weekend just to kind of, you know, get some sense of standardization. And then once we did, we provided instructions. And please make things all in capitals. And even still, because it’s human nature, to your point, someone, even if you do all the cleaning, right, because this was like an extreme example, where we actually cleaned everything, standardized everything, and we gave instructions. And even still, because the interfaces allowed for it, people would go in and just skip through it, or just put a period. Or they would type the name of the city wrong, it wasn’t on purpose. It’s not because they wanted to mess with the model. It was because the interface let them.

So I think ultimately you need to know where you want to end up. But then to actually solve it, don’t give people the ability to mess it up. So I think just being willing to enforce that and build interfaces that check for quality, or check for duplicates.

It’s really the responsibility of a business providing a tool to their staff. It’s more humane, it’s more empathetic for a business to put those filters in place to prevent issues ahead of time. Because when they don’t, you’re just going to frustrate everybody. You’re going to get inefficiency, you’re going to have bad reporting, and you may even get angry at people: why did you put a space there, why did you put a dot? And they actually can’t help it, because the interface is letting them. So I think, first, just be willing to fix the interface so you don’t have bad data coming in. And then the other thing is, I would call it data management. Even if someone were to get through all of the filters somehow and found their way to put bad data in, you can have a way of going through the warehouse and cleaning it automatically, as well as watching for issues. It is something that you can detect as a business and fix, and then push those things to the various sources once you’ve corrected it. Know that there is probably going to be someone who will find a way around all the controls you put in, yeah, but don’t accept it.
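A minimal sketch of the kind of check an entry form could run before accepting a contact record, assuming a hypothetical list of known cities and a simple name-plus-phone key for spotting duplicates. A real form would more likely validate against an address service, and the rules here are illustrative only.

```python
import re

# Hypothetical reference list; a real form might query a postal API instead.
KNOWN_CITIES = {"TORONTO", "OTTAWA", "WINDSOR"}


def dedupe_key(record: dict) -> str:
    """Normalize name and phone so trivially different entries collide."""
    name = " ".join(record.get("name", "").lower().split())
    phone = re.sub(r"\D", "", record.get("phone", ""))
    return f"{name}|{phone}"


def validate_contact(record: dict, existing_keys: set) -> list:
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    # Reject a lone period or dash typed just to get past a required field.
    if not re.search(r"[A-Za-z0-9]", record.get("street", "")):
        problems.append("street is empty or punctuation only")
    if record.get("city", "").strip().upper() not in KNOWN_CITIES:
        problems.append("city not recognized")
    if dedupe_key(record) in existing_keys:
        problems.append("possible duplicate")
    return problems


existing = {dedupe_key({"name": "Ann Lee", "phone": "555-0100"})}
bad = {"name": "Bob Ray", "phone": "555 0199", "street": ".", "city": "Torotno"}
print(validate_contact(bad, existing))
```

Returning the problems to the person at the moment of entry, rather than fixing records afterwards, is the "don't give people the ability to mess it up" approach described above.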

Kostas Pardalis 27:50
Yeah, I think I have a good example that’s going to resonate very well with Eric. One of the most frustrating things that happens when you build a new product is when your developers start signing up to test things. Right. So you get into this situation where you want to start tracking signups, of course, but at the same time, you have people who are signing up that you don’t want to include in your measurements, because they’re your developers, right? And you have to clean this data, of course. And then one day you come and say, listen, guys, we have to fix this problem. Okay. So from now on, you’re going to be using a specific format of email when you sign up, so I can go and easily filter it. Well, guess what? Everyone agrees on that. But it does not happen.

Eric Dodds 28:46
Yeah. I mean, that’s it, it sounds and it I mean, it’s hilarious, because that it sounds like such a simple problem to solve. But there’s always an edge case, right? Like, to your point, Kevin, like, people always figure out a way around it. And that’s actually true. It’s really interesting, because just thinking back to some of my previous experiences, the same is actually true for direct to consumer products, right? If you think about a business, creating a user experience or user interface for their own employees or staff to do their job. Someone’s always going to find a way to sort of shortcut the process. And the same problem actually applies with, let’s just say a consumer mobile app, right? You try to set these guardrails for onboarding and activation and inevitably someone figures out a way to do something weird that creates a poor experience, both for them as a user, and then also the business who’s trying to optimize the experience and so it is …

Kevin Gervais 29:53
Well even just to that point, if you accept it, so first, like accept there’s like, this is gonna happen, but previously to solve for this was really, really hard, right? Like, this is something like, that’s why I think a lot of people would throw their hands up and say well it happens, right?

Eric Dodds 30:10
But right, it’s almost like accepting a margin of, of error, or sort of, I’ve actually seen this before, it’s just like, okay, well, our reporting is probably just going to be X percent off, because there are these sort of edge cases, right, and so fine, like, we’ll just deal with it.

Kevin Gervais 30:30
But accepting that, sometimes having those margin of error exceptions, it really ruins the reporting too. Especially when you’re, if you’re trying to understand and like, like adoption patterns in your app. And you’ve got a bunch of employees that slip through the cracks, right, that all of a sudden, their interactions are now being tracked? Right? It throws all of your understanding off, because maybe those employees are doing things with the app that no other user is doing. Or maybe they are going to try to look at one thing and then leave. And then so your metrics are like, Oh, no, we’ve got a massive churn problem. Like, it could waste huge amounts of money, time and energy, because the reporting is a bit off, quote, unquote.

Eric Dodds 31:25
Totally, or actually time to activation is another one. If you have people who are very familiar with an app, and they go through and activate very quickly to do a demo, or walkthrough, or test something, but they already know the user flow ahead of time, your activation time can be skewed significantly by people who complete the process really, really fast, right? So then you have a huge deviation that is pulling the average way down. And so you think that people are actually onboarding to the product way faster than they are.
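The skew described here is easy to see with a few numbers. The sketch below compares the average time to activation with and without internal testers; the email domain used to flag employees and the sample figures are made up for the example.

```python
from statistics import mean, median

# Hypothetical domain used by employees and testers.
INTERNAL_DOMAINS = {"example-corp.test"}


def is_internal(email: str) -> bool:
    """Flag signups that come from an internal email domain."""
    return email.rsplit("@", 1)[-1].lower() in INTERNAL_DOMAINS


def activation_summary(signups: list[tuple[str, float]]) -> dict:
    """Summarize minutes-to-activation with and without internal users."""
    everyone = [minutes for _, minutes in signups]
    external = [minutes for email, minutes in signups if not is_internal(email)]
    return {
        "mean_all": mean(everyone),
        "mean_external": mean(external),
        "median_external": median(external),
    }


signups = [
    ("a@mail.test", 60),
    ("b@mail.test", 80),
    ("c@mail.test", 100),
    ("dev1@example-corp.test", 2),   # tester who already knows the flow
    ("dev2@example-corp.test", 3),
]
summary = activation_summary(signups)
print(summary)
```

With these made-up numbers the unfiltered mean is far lower than the mean for real customers, which is the false impression of fast onboarding discussed above. Filtering by domain only works if testers actually use the agreed email format, which is the weak point Kostas raises.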

Kevin Gervais 32:04
Or I see this all the time on the web, especially for people who have signup processes for, let’s say, an app. They’ve got users that log into their app, right, and maybe it’s a B2B SaaS product, or even a consumer app. A bunch of those users might Google it and go to the website to log in. They’re known customers, right? They’re known identities, and yet they often are showing up in Google Analytics reports and things as just regular visitors. So you could be looking at a bunch of reports, and if you’re not segmenting your data properly, if you’re not accounting for the fact that this stuff happens and filtering it, then it can throw off all these other metrics. Someone could look at the report and go, wow, our campaign’s working, when really 80% of those are all just people going to the site to log in. You really should be removing all of those visitors from your reporting, because they’re not marketing visitors. They are known visitors. And so if you’re in marketing trying to figure out if your campaigns are working, maybe they aren’t, right? Maybe most of your visits are just people who come back anyway. And so figuring out when to flag these things, and how to filter them at the point of collection, I think is really important. I’ve actually seen a situation where someone was thinking they had a massive churn problem, when really it was just a data problem. They were measuring churn improperly, or they didn’t know how to measure it. Maybe they were going based on the number of unique identities in the system, but what they really should be doing is looking at people who are billed and going from that as the source of truth.
So like, sometimes it’s just knowing, like changing your source to power a certain metric, or, yeah, accounting for the fact that you might have duplicates, like there is sometimes a data solution, right? To first figure out what your baseline is at, right? Because it can completely change your decision making. And you might invest in fixing a problem that actually isn’t a problem. Right.

Kostas Pardalis 34:35
Quick question, you mentioned a bit earlier that the company can establish, let’s say like this, the right mechanisms to figure out when issues with data and around quality specific things happen. Can you give us a little bit more context around that? Like what kind of mechanism a company can use to detect that for example dates or the addresses problem that you mentioned, right?

Kevin Gervais 35:02
Addresses are a big problem. I have more recent theses on this, but where I started from, which I think is a good baseline, is this: even if you only serve a certain market, right, a certain area of your state, or maybe you’re only in the US, or you’re only in Canada, you should store your data in ISO format. Or if you look at SmartyStreets, or some of the other APIs that are available, there are these international APIs that show you what an international address should look like. Don’t store things in a way that says, okay, zip, province, city. What if it’s a rural route? If you ever look at a rural address, sometimes it’s like, County Road 46, Rural Route 3, intersection of this. You can’t just assume that everyone can fit into this address one, address two, city, state or province, country. So think of things like localities, administrative areas, sub-administrative areas, and account for the fact that maybe there’s not a real address and you have to have a latitude and longitude. But you don’t have to invent these things. These models already generally exist in certain APIs; again, SmartyStreets is a good one. Or you could look at ISO standards. If you stored your stuff in that format, and started to create a structure for where this stuff should go, it’s like putting it in the right filing cabinet, right, so at least you know where to look. Then once you’ve done that, you have the right data model. And you don’t have to overthink it. I think just starting with some of these big, well-known international formats is a good start.
So let’s say that you’re doing that in Postgres, as an example, or SQL Server or some sort of database. Then you can put things like Hasura, or Prisma, or something on top of that database, which gives you triggers: on update, on insert, or on deletion, you could trigger little micro functions, which could be hosted somewhere. Those micro functions could be things that know that bad data could make its way in accidentally, and at the point of insertion they start a transformation step that extracts the unit number from the first part of the address. Maybe some people put in 200-1 Main Street, where 200 is the unit number. Well, pull that out if you notice there’s a dash, convert that to unit 200, and put that in the unit field. So I think from a tool perspective, previously that would have been a lot of work, right? But now, because you can have your data go into a nice warehouse, you can have an API layer sit on top of that for free to look for changes, and that can trigger effectively free functions which clean up these little patterns. You can actually make the data clean itself. And hopefully, yeah, force it into a standard format, and then push that to the various places.
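The unit-number cleanup described here could look something like the following Python sketch of a single "micro function". The `Address` type and `split_unit` name are hypothetical, and the trigger wiring (whatever layer calls the function on insert) is deliberately left out. The pattern assumes the "unit-dash-street number" convention from the example; a range such as "12-14 High Street" would be misread, so a real cleanup step would need more context than this.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Address:
    unit: Optional[str]
    street_line: str


# Digits, a dash, then a street line that itself starts with a digit,
# e.g. "200-1 Main Street" where 200 is the unit number.
_UNIT_PREFIX = re.compile(r"^\s*(\d+)\s*-\s*(\d.*)$")


def split_unit(raw_line: str) -> Address:
    """Move a leading 'unit-' prefix out of the street line, if present."""
    match = _UNIT_PREFIX.match(raw_line)
    if match:
        return Address(unit=match.group(1), street_line=match.group(2).strip())
    # No recognizable prefix: leave the line alone rather than guess.
    return Address(unit=None, street_line=raw_line.strip())


print(split_unit("200-1 Main Street"))
print(split_unit("County Road 46, Rural Route 3"))
```

Leaving unrecognized input untouched is a deliberate choice in this sketch: a cleanup function that runs automatically on every insert should prefer doing nothing over rewriting data it does not understand.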

Kostas Pardalis 38:33
So we talked about addresses, and you said that they’re a very common source of issues with data. What other issues have you seen most commonly, together with addresses?

Kevin Gervais 38:47
Oh, I think person records jump out. If you’re using something like Salesforce, which I think a lot of people are, people don’t set things up properly and it causes issues: they don’t store unique identifiers, and they don’t link similar contacts. So if you have a person that is across multiple accounts, like the same contact is in three different accounts in Salesforce, you should have a field to store the unique identifier. That way, you can start to tie together in the future that these contacts are related to each other, and then you can set up rules to sync them. So I think one of the biggest issues I see is just duplicates, right? The second piece is just the quality of what’s in a name. You’ll see a lot of folks put both the first name and the last name in the first name field and keep the last name blank. So just what gets put in the fields is often an issue. And then even just formatting: I see this all the time in marketing cloud data, where you’ll have some contacts that are all caps, some that are capitalized, and some that are all lowercase, and that is how it goes out in an email, right? Usually I see this stuff on the data side, and it actually will reduce your click rate and could cause more opt-outs if you’re saying, “Hey, FRED,” and it’s all caps. So capitalization, putting the right thing in the right field, and tracking that these three different contacts might be the same one: I would say those are the top ones.
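Two of the name problems mentioned here (both names stuffed into the first name field, and inconsistent capitalization) can be sketched in a few lines of Python. The function names are hypothetical, and the rules are simple heuristics assumed for illustration: the last whitespace-separated word is treated as the last name, and only all-caps or all-lowercase values are re-cased so that names like "McDonald" are left alone.

```python
from typing import Tuple


def _fix_case(value: str) -> str:
    """Re-case values that are entirely upper or lower case."""
    if value.isupper() or value.islower():
        return value.title()
    return value  # mixed case is assumed to be intentional


def normalize_name(first: str, last: str) -> Tuple[str, str]:
    """Split a combined first-name field and tidy capitalization."""
    first, last = first.strip(), last.strip()
    if not last:
        parts = first.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) == 2:
            first, last = parts
    return _fix_case(first), _fix_case(last)


print(normalize_name("FRED", ""))            # ('Fred', '')
print(normalize_name("fred smith", ""))      # ('Fred', 'Smith')
print(normalize_name("Fred", "McDonald"))    # ('Fred', 'McDonald')
```

Heuristics like these will get some names wrong (multi-word surnames, for example), so they work best as a flagging step with human review rather than a silent rewrite.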

Kostas Pardalis 40:41
It’s actually a very interesting problem, which has to do with identity in general, especially now that we are using so many different SaaS applications, each of which imposes a data model of its own. When you use Zendesk, they have their own way of representing what the user is. When you’re using Salesforce, the same. Your marketing tools probably differ a little bit too, and of course the people involved there are also different, right? So what’s your suggestion for how to deal with this problem? It’s inevitable, right? That’s how life is; we have all these different systems.

Kevin Gervais 41:21
Yeah, at its most extreme, the best one that I’ve seen that does a good job of this is Adobe Identity. It’s normally used by very large orgs, but I think most orgs can learn from it, even if they do only a portion of what Adobe does. You can see all this in their developer docs as inspiration. They look at everything as an identity that’s attached to a person, and the identities can change. So you have an identity record, and you can set the type of identity. Is it a permanent identity? Is it a ticket identity, like a Zendesk identity? You can basically come up with your own types of identity, and then you can attach an identity to and detach it from a record, a contact, at any time. It’s a little bit overkill for most people. If you were to simplify it, it’s just keeping a record of relationships: you have a list of identities, and then you have a table that says, okay, this identity is related to this record. Having that somewhere can go a long way to at least keep track of these things, instead of assuming that you’ll always be able to correlate them to each other. Just creating this type of relationship mapping is an easy way to keep track of it. And yes, sorry, go ahead.
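The simplified version Kevin describes, a list of identities plus a mapping to a person record, could be sketched as follows using Python’s built-in `sqlite3` module. This is not Adobe’s schema; the table names, column names, and the `attach`/`detach` helpers are all assumptions made for illustration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE person (
            id INTEGER PRIMARY KEY,
            display_name TEXT NOT NULL
        );
        CREATE TABLE identity (
            id INTEGER PRIMARY KEY,
            identity_type TEXT NOT NULL,              -- e.g. 'zendesk'
            external_id TEXT NOT NULL,                -- ID in that system
            person_id INTEGER REFERENCES person(id),  -- NULL when detached
            UNIQUE (identity_type, external_id)
        );
    """)


def attach(conn, person_id, identity_type, external_id):
    """Point an identity at a person, creating the identity if needed."""
    cur = conn.execute(
        "UPDATE identity SET person_id = ? "
        "WHERE identity_type = ? AND external_id = ?",
        (person_id, identity_type, external_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO identity (identity_type, external_id, person_id) "
            "VALUES (?, ?, ?)",
            (identity_type, external_id, person_id),
        )


def detach(conn, identity_type, external_id):
    """Keep the identity row but unlink it from any person."""
    conn.execute(
        "UPDATE identity SET person_id = NULL "
        "WHERE identity_type = ? AND external_id = ?",
        (identity_type, external_id),
    )


def identities_for(conn, person_id):
    return conn.execute(
        "SELECT identity_type, external_id FROM identity "
        "WHERE person_id = ? ORDER BY identity_type, external_id",
        (person_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
pat = conn.execute(
    "INSERT INTO person (display_name) VALUES (?)", ("Pat Example",)
).lastrowid
attach(conn, pat, "zendesk", "zd-42")
attach(conn, pat, "salesforce", "sf-007")
detach(conn, "zendesk", "zd-42")
print(identities_for(conn, pat))  # [('salesforce', 'sf-007')]
```

Keeping detached identities in the table, rather than deleting them, preserves the record that the external ID exists even when it is not currently linked to anyone.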

Kostas Pardalis 42:53
Oh, no, I find what you’re saying very interesting. I’m just trying to think about who is modeling these identities, who is responsible in the end. What we are doing here is trying to solve this problem by adding another level of indirection, let’s say. So we say, let’s create this concept of identity, and instead of mapping Zendesk to Salesforce, Zendesk to Marketo, and Marketo to Salesforce, let’s have a Marketo identity, a Salesforce identity, and a Zendesk identity. If we do that, of course, then all of them are mapped, right? But still, someone has to manage these mappings to the reference identity that we are creating in the identity management system. So who does that?

Kevin Gervais 43:42
Yeah, I don’t know if that role exists yet. I think we’ll find over time that data quality will become a function of the business, and I think it should. It’s unfortunate that that’s required, but it is a role that is realistically required today to manage the fact that this is always going to occur. And the ones that do invest in managing this are the ones that are going to get way more out of their base, because they can infer things that the others can’t. Just as a quick example, even with phone numbers: I was working with a bunch of records today for a FinTech trying to do some outreach to customers over text. In the records provided from the marketing system, some have pluses in front of them, some have brackets, some have too many digits, some are missing digits. That affects your ability to reach people, right? If we had expected the format to always be clean, we would have rejected 30 to 40 percent of the records, and so you’d be marketing to fewer people. Once we were able to standardize all that in an international format, now you’re reaching 80-something or 90-something percent of the records that were provided. But if you don’t put someone in that role of responsibility to ensure quality, you could really be hampering your ability to do marketing or to infer things, right?
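The phone number standardization described here can be sketched in Python as below. The function name `to_e164` and its defaults are assumptions for illustration: the sketch assumes North American numbers (country code 1, ten national digits) unless the input already starts with a plus sign, and it returns `None` rather than guessing when a number has too few or too many digits. The 15-digit ceiling comes from the E.164 international format; the lower bound of 8 is only a sanity check chosen for this example. A dedicated phone-number library would handle real-world data far more robustly.

```python
import re
from typing import Optional


def to_e164(raw: str, default_country_code: str = "1",
            national_length: int = 10) -> Optional[str]:
    """Best-effort conversion to '+<digits>' form; None if unsure."""
    digits = re.sub(r"\D", "", raw)  # drop brackets, spaces, dashes, plus
    if raw.strip().startswith("+"):
        candidate = digits  # caller already supplied a country code
    elif len(digits) == national_length:
        candidate = default_country_code + digits
    elif (len(digits) == national_length + len(default_country_code)
          and digits.startswith(default_country_code)):
        candidate = digits  # country code present, plus sign missing
    else:
        return None  # too many or too few digits: flag for review
    if not 8 <= len(candidate) <= 15:
        return None
    return "+" + candidate


for raw in ["(416) 555-0123", "+1 416 555 0123", "14165550123", "555-0123"]:
    print(repr(raw), "->", to_e164(raw))
```

Records that come back as `None` are the ones a data quality owner would review, rather than silently dropping them from the outreach list.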

Eric Dodds 45:23
Yeah, jumping back to Adobe, it’s such an interesting point, Kevin. Having been a past user of some of the Adobe Marketing Cloud products, I think they get a bad rap for being a huge, expensive monolith. And in many ways they are. But it’s really powerful technology, and I just love your comment about looking at their developer docs for inspiration. I think the challenge a lot of companies face is, one, it’s unattainable from a cost perspective. And then two, to the question Kostas asked about who manages the central identity: well, in the Adobe world, it’s Adobe, right? And so you’re locked in. That creates a huge amount of inflexibility, which I think is very problematic in many ways.

Kevin Gervais 46:18
And that’s why I think most people should manage that in their own warehouse. Now, the question is, okay, what does that look like? What’s the schema for that? And what’s the turnkey way for them to manage their own identity system in their own environment? I think there’s a lot of folks trying to get solutions into the market to solve that. It might still take some time, right? I don’t know that this is something people can just buy as an out-of-the-box thing today that will magically solve all their identity issues and run in their own environment. I think it’s only become clear to me that this is a problem that has to get fixed. So it’s going to take some time before just anyone can do this. I think you can start in a simple way, right? You could basically have a table that just stores really hard-coded things: a column for an ID as your main ID, a column for the Zendesk ID, a column for the Salesforce ID, a column for the Marketo ID, and then just track that. That might be a shortcut, where you’re not trying to manage the whole identity layer, but you’re at least trying to map the relationship, that these three IDs are all tied to the same contact. It’s just a little step up from what someone might do on their own. I mean, if you really want to cheat, since we’re talking about person records, you could have a person record with some extra columns in it. If you don’t want to get into the whole relationship mapping piece, just add some columns for these different identities, and that will allow you to eventually tie them together. Just having them stored somewhere is better than having it all up in the air and hoping that you can always match based on email address. Because that’s usually what people do, right?
They’ll go and try to just match on that. But maybe Salesforce has someone’s work email, and HubSpot has their Gmail. So you won’t really be able to match them if you assume that you’re always going to be able to go based on email. So I think there are shortcuts to solve this. I don’t think you have to jump right ahead to the perfect-world identity management thing. I would agree with you that relying on a vendor to hold all of those identities is dangerous, because what if you want to move? What if you want to take control of that? You’re not going to be able to get that perfect export of all the Adobe IDs that they’ve created. They do make it easy, but also Adobe Identity is kind of an overkill solution for most companies that don’t have that type of complexity. So yeah, that’s definitely something people should take on themselves.
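The "shortcut" version, a person record with one extra column per tool, can be sketched in Python like this. The `PersonRecord` fields and the `find_person` helper are hypothetical names, and the contact data is made up; the point is only to show why matching on stored IDs succeeds where matching on email fails when two tools hold different addresses for the same person.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonRecord:
    main_id: str
    # One hard-coded column per tool, as in the shortcut described above.
    zendesk_id: Optional[str] = None
    salesforce_id: Optional[str] = None
    marketo_id: Optional[str] = None


def find_person(people: List[PersonRecord], column: str,
                value: str) -> Optional[PersonRecord]:
    """Look up a person by one of the per-tool ID columns."""
    for person in people:
        if getattr(person, column) == value:
            return person
    return None


people = [PersonRecord("p-1", zendesk_id="zd-42",
                       salesforce_id="sf-007", marketo_id="mk-9")]

# Made-up exports from two tools holding different emails for one person.
salesforce_contact = {"id": "sf-007", "email": "pat@work.example"}
marketo_lead = {"id": "mk-9", "email": "pat@personal.example"}

matched_by_email = salesforce_contact["email"] == marketo_lead["email"]

a = find_person(people, "salesforce_id", salesforce_contact["id"])
b = find_person(people, "marketo_id", marketo_lead["id"])
matched_by_id = a is not None and b is not None and a.main_id == b.main_id

print(matched_by_email, matched_by_id)  # False True
```

The trade-off of this layout is that every new tool needs a new column, which is the point at which the separate identity table becomes the more comfortable design.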

Eric Dodds 49:26
Well, you answered the question. We’re at the buzzer here. Brooks is telling us that we’re at the buzzer, so we need to close the show. I was going to ask you what the starting point is, but I couldn’t agree more that the starting point is actually just beginning to tie together some of the basic pieces of unique identifiers from the various places in your stack, to build a foundation for that unified profile in the warehouse. And even if you do the basics, like you said, where you’re literally just mapping the unique IDs across tools, that is such a useful foundation to build on for the future. Kevin, one thing we didn’t talk about that we discussed before the show is that you have built some unbelievably fast and SEO-performant websites, literally just using technology to push pages to the first page of Google. We didn’t get a chance to talk about that. But would you come back on the show? Can we break down the stack for the latest, greatest SEO-performant website, especially relative to the data piece? We would love to have you come back on the show, if you’d be willing.

Kevin Gervais 50:35
Yeah, that’d be great. Especially because, to be able to do those cool things with fast web experiences, the data model really is important. You need to put your data in a certain format, and you need to have a certain flow working, because you have to make it so the browser doesn’t do any other work. The reason why things are slow is that 98 or 99-ish percent of the web works this way: sites put all the work on the browser. Someone asks to go to the site, and then the browser has to make a whole bunch of stops to get all the information that the user might be asking for, and all those things take time. There’s a whole bunch of calculations and work done by the browser to present it. So if you want the browser to do no work and just present information instantly, in under half a second, the data model needs to be pretty clean on the back end to make that possible. But yeah, once you get there, the benefit is you can do some pretty cool stuff. So yeah, happy to walk through how someone could go about setting that up.

Eric Dodds 51:53
Love it. Well, that was a great preview. We’ll have Kevin back on the show. Kevin, awesome discussion. I learned a ton. Thank you so much for giving us some time. And we’ll talk with you again soon.

Kevin Gervais 52:05
Yeah, thanks. It’s great to chat about all things data.

Eric Dodds 52:10
It always is. Fascinating conversation. My big takeaway was when Kevin said, “All a business is, is the flow of data.”

I haven’t really chewed on that statement enough to know whether I have a strong conviction about it. But it was very thought-provoking, and in many ways, I think it makes sense when you break a business down into its component parts. Even the conversation that a salesperson is having with a prospect: the content of that conversation is data. So that was very thought-provoking to me, and I think that statement is probably what I’ll be chewing on this week. How about you?

Kostas Pardalis 52:53
Yeah, absolutely. I really enjoyed the conversation that we had with him about modeling and abstractions around data. I think what I’ll keep from this conversation is that in order to be as correct as possible, or to have the right mechanisms in place to monitor quality or react to issues, you need to have a good abstract model of your world: how your company, all its functions, and all your interactions with customers are going to work. That’s what I’m going to keep. I think it’s a piece of wisdom that we took from him, and I think it’s great advice for every engineer: before you start implementing, spend time designing things and thinking about why things should be organized in a certain way. It’s something that is super, super important, and it comes with maturity. I mean, it’s not a coincidence that he had to deal with so many issues related to data to come to this conclusion in the end. So yeah, that was, I think, a very important part of our conversation, and that’s something that I will definitely think about and keep.

Eric Dodds 54:18
Absolutely. Well, thanks again for joining us on The Data Stack Show, and we will catch you on the next episode.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.