Episode 170:

Discussing Data Roles and Solving Data Problems with Katie Bauer of GlossGenius

December 27, 2023

This week on The Data Stack Show, Eric and Kostas chat with Katie Bauer, the Head of Data at GlossGenius. During the episode, Katie shares her journey in data science, starting from academia to working in various industries including natural language search, social media, and now the beauty and salon space. She discusses the evolution of the data scientist role, the challenges faced in different companies, and the importance of understanding the specific needs of different business models. She also highlights the potential of using data products to provide value back to businesses, the importance of having an analytics engineer in an organization, and more. 

Notes:

Highlights from this week’s conversation include:

  • The evolution of the data scientist role (1:03)
  • Common problems in different companies (2:05)
  • Measuring and curating content on Reddit (4:29)
  • The challenges of working with unstructured content at Reddit and Twitter (11:03)
  • Lessons learned from Reddit and applying them at Twitter (13:17)
  • Data challenges and customer behavior analysis at GlossGenius (20:16)
  • How the data scientist role has changed over time (25:10)
  • The essence of the data scientist/engineer role (29:00)
  • Dynamics and overlaps between different data roles (32:09)
  • The perfect data team for Twitter (34:19)
  • Building a data team at a startup like GlossGenius (36:36)
  • The right time to bring in a dedicated data person in a startup (38:52)
  • The analytics engineer role (46:25)
  • Challenges in implementing telemetry (50:31)
  • Final thoughts and takeaways (52:24)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We’re here with Katie Bauer from GlossGenius. Katie, welcome to the show.

Katie Bauer 00:29
Thanks for having me. Excited to be here.

Eric Dodds 00:32
All right, you’ve had an amazing career in data. Give us a quick overview.

Katie Bauer 00:37
Yeah, I got into data science in the early 2010s and spent time in a bunch of different places, working at a natural language search startup and in social media. And now I am leading the data team at a vertical SaaS startup in the beauty and salon space.

Eric Dodds 00:51
Awesome. Well, so excited to chat with you. What topics do you want to dig into?

Kostas Pardalis 00:58
Oh, we have plenty to talk about. First of all, my favorite topic, which is always about definitions, right? So I’d love to hear what the data scientist role is, like, what’s the definition behind it. And more importantly, how it has changed. We have Katie here today who has experienced a lot of the evolution of the role, so I’d love to hear from her how things have changed. And also to talk a little bit more about how the role is different in different companies, different sizes, different products, right? And how the data scientist overlaps with, well, we’ll see, Katie will tell us about other roles that have to do with data. So it will be, like, super educational for me at least, but I have a feeling it’s also going to be for our audience. What about you, Katie? Is there something on your mind that you would love to chat about today?

Katie Bauer 02:05
Yeah, something that I think would be an interesting connection to what you’re talking about is what kinds of problems occur across all these different types of companies. Like, what is something I’ve had to do at multiple jobs? Or what types of problems appear in more than one context?

Kostas Pardalis 02:19
Yeah, let’s go and do it. What do you think?

Eric Dodds 02:22
Let’s dig in. All right, we’ll start where we always do. So tell us how you got into data at the beginning of your career, then sort of what you’ve done in between, and then what you’re doing at GlossGenius today?

Katie Bauer 02:36
Sure. So like a lot of people who ended up in the data science world in the early 2010s, I was in an academic role, looking to do something different, and decided I’d go try this tech thing. My background was linguistics, and I ended up working at a startup on linguistic data sets, as an annotator and curator. I did that for a while, and it built a lot of my foundational data skills. And when that company was done, I ended up working as an analyst and a data scientist for a couple years in ad tech, before I moved into social media, which is where I spent a lot of my career. I joined as an early data scientist at Reddit. There were about 200 people at the time, and we had no data science prior to that point. So that was an interesting journey, being there through a period of hypergrowth. And then after that, I got pulled into a role at Twitter by a friend who was working there at the time. I spent a couple years there until notable news events pulled me away and brought me to my current role at GlossGenius, where I run the data team. GlossGenius is a Series C company now; it was a little over 100 people when I joined, and now we’re more than twice that. The data team was a single person, and it’s grown pretty substantially over the course of the past year. I’ve really spent the past year getting the team up and running, hiring people, building the data stack, etc., and looking long term for the company at what we might do commercially with the data that we have.

Eric Dodds 04:16
Awesome. Okay, I want to dig into each of those career phases. Let’s move on to Reddit. So social media, I mean, what a topic in and of itself, and the data science behind it, what a topic. I’d love to know what your big takeaways were, because if you think about it, you worked in, you know, sort of the moderator and user space within Reddit. And you have all this user-generated content, which is fascinating to me from a data science standpoint, because you have all this unstructured data. How do you derive the meaning? There are, you know, sort of lanes within, like, content that should be allowed in certain subs, all that sort of stuff, right? What were the big lessons that you took away from, I mean, I kind of view it, and maybe this is wrong, as a wild west in many ways, right? User-generated content, unstructured data, things that are subjective but that sort of need to be objectively implemented in terms of standards. So what were your big takeaways?

Katie Bauer 05:23
I mean, I guess one thing to say on the subjectivity: one thing I developed a really strong allergy to, working at Reddit long enough, was the idea that we would define quality content. Like, there were a lot of ways to talk about, is content doing something illegal? Is it distracting? Is it off topic? But “is this good content?” was always a question people didn’t like. There’s no way to measure that. We tried so many different ways, and it was always a dead end.

Eric Dodds 05:55
So this is interesting. I mean, obviously, Reddit creates some content, but it’s primarily user-generated content. Was there a desire for good content? I mean, was that sort of a, that’s interesting to me, because it’s all user generated, right?

Katie Bauer 06:09
Yeah. Well, and like, there are a lot of different angles on this. One, people want to see content that’s engaging, so they’ll come back to the site. It’s, fundamentally, a product where you need to match content with people’s interests. And being able to figure out what is a good thing to show someone at the start of their Reddit experience is, like, an age-old question: how do we make sure people find their home as fast as possible, so they get the site and come back? But there was also, a big thing while I was there, and a big initiative that I participated in, was trying to flesh out other content verticals on the site that were a little underdeveloped. So for example, people for a long time thought of Reddit as a place for gaming, or political discussions, or things like that. And a big thing that we focused on was trying to make it a more broadly appealing website, like trying to get more beauty content, for example, or just recipes, or things about families. It was really a way of trying to make it a more broadly appealing website, which, if you think about it from the mechanics of the business, as a company that was monetized by, and still is monetized by, placing ads, having more broadly appealing content meant you could have more advertisers. So that was a financial incentive. But it’s also something that helps you attract different people. And it was tremendously successful; the website is way more mainstream than it used to be, from trying to curate and amplify content that we wanted to make sure people knew was on the site.

Eric Dodds 07:47
Super interesting. And just to dig into that one step deeper: how did you measure that? I mean, there are different ways to sort of measure the quality of content, and obviously, that’s subjective. But in terms of, like, how would you know that the beauty section of the site is working, or the recipe section of the site is working? What were the things, as a data scientist, where you’d say, okay, we’re making the right recommendations? How did you know that? Because, obviously, you feel strongly about the subjectivity of good content, but you have to have some objective measure to build these models for recommendations.

Katie Bauer 08:31
Yeah. And I guess maybe a general lesson to take from this was that it was really a bunch of separate problems, and we needed to be able to tell what the separate problems were. So another very big initiative we had was to have good category labels for subreddits. Which sounds easy on its face, to say, like, yeah, the gaming subreddit is about gaming, MakeupAddiction is about makeup, but there’s a long tail of weird stuff on Reddit, which I don’t necessarily mean in an unsavory way. There’s just stuff.

Eric Dodds 09:06
Oh, sure. Yeah, that makes sense. Super neat. Yeah.

Katie Bauer 09:09
So, to really understand what parts of the site were doing well, we needed good category labels. And that was something that we actually couldn’t really do programmatically. We ended up having to use a human-in-the-loop program to do that, where we would get informed users to give ratings of what they thought something was. And then we would do some stats to reconcile the ratings, inter-rater agreement for anyone who’s curious, and figure out which ones there was strong agreement upon. If there was strong disagreement, we’d go back and get more annotations. And that ended up being the most scalable way to get high-quality labels. But once we had those labels, it unlocked a lot of different things, where it would tell you kind of where most of your comments were, or most of your posts, or just any type of engagement really, because upvotes are important for Reddit too. But then you can start asking: is it the same type of people who are all engaged in this? How broadly appealing is it? Is there crossover between different genres? And an effect that we started observing over time was that there would be, for lack of a better term, events on the website that would suck up oxygen from everything else. There’s kind of a baseline amount of activity you would see on a regular basis, and if there was some controversial event in the moderation world, for example, suddenly all of the posting and commenting behavior for that day would be on one post or something. And it helped us realize, yeah, you do need more people. It does need to be more broadly appealing, so there are different people with diversified interests to, I guess, make sure that your eggs are not all in one basket from a content perspective.
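
The reconciliation step Katie describes can be sketched in a few lines. This is a hypothetical illustration, not Reddit’s actual pipeline: it accepts an item’s majority label when rater agreement clears a threshold and queues the rest for more annotations. (A production version would use a formal agreement statistic such as Fleiss’ kappa; the subreddit names and labels below are made up for the example.)

```python
from collections import Counter

def reconcile(ratings: dict, threshold: float = 0.75):
    """Split items into accepted labels vs. items needing more annotation.

    `ratings` maps an item id (e.g. a subreddit) to the list of category
    labels its human raters assigned. A label is accepted when the
    majority label reaches `threshold` agreement; otherwise the item
    goes back into the annotation queue.
    """
    accepted, needs_more = {}, []
    for item, labels in ratings.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= threshold:
            accepted[item] = label       # strong agreement: keep the label
        else:
            needs_more.append(item)      # strong disagreement: re-annotate
    return accepted, needs_more

votes = {
    "r/MakeupAddiction": ["beauty", "beauty", "beauty", "beauty"],
    "r/mildlyinteresting": ["humor", "misc", "photography", "misc"],
}
accepted, needs_more = reconcile(votes)
print(accepted)     # {'r/MakeupAddiction': 'beauty'}
print(needs_more)   # ['r/mildlyinteresting']
```

The threshold is the tunable knob: raising it trades label coverage for label quality, which matches the “go back and get more annotations” loop she mentions.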

Eric Dodds 10:49
Fascinating. I mean, I can’t imagine the WallStreetBets situation.

Katie Bauer 10:55
Yes, I was like, Yeah, that was after my time. But as that was happening, I was like, oh, man, I can only imagine what’s happening internally.

Eric Dodds 11:02
I can only imagine what you were thinking was happening there. Okay, let’s move on to the next social media role. What’s fascinating to me here is that you were dealing with a bunch of unstructured content and subjectivity at Reddit, and then you moved into a role at Twitter, I guess now it’s called X, obviously, where you were building tooling for teams that were doing similar work to what you were doing at Reddit. Why did you want to move into a role where you were building tooling, as opposed to sort of working on the models that were driving the decisions around the end-user experience?

Katie Bauer 11:51
Yeah. Partly, it was the team that I joined. The team that I was a part of when I started at Twitter was working very closely with our finance team on narratives for Wall Street. And my initial mandate when I joined was to help create a relatively objective narrative around product velocity, which meant we had to spend a lot of time measuring the quality of the internally developed technology at the company, and Twitter has a lot of it. They became a large company at a point where there was not an off-the-shelf tool to build, say, a huge key-value store, so they created their own instead of using something open source. And over time, that kind of morphed into being more purely focused on the infrastructure organization, and helping engineering managers and product managers in that organization understand the impact of the different things they were doing, and measure and evaluate and make better decisions about where to invest. I also kind of didn’t want to work in a consumer-facing role after Reddit. Like, I was a big Reddit user, and I am a big Twitter user even still. And working in a consumer-related role at Reddit was something where, over time, I started having kind of a hard time separating my feelings about a product that I really liked from the business about the product. So I kind of wanted to step back and do something that was a little bit different. But it ended up being very fascinating in its own right.

Eric Dodds 13:17
And what were some of the big lessons that you learned? Did you take lessons from what you learned at Reddit to Twitter as you were building tooling for people who were doing similar roles? I mean, obviously, there were a bunch of internal builds that, you know, sort of needed to change over time. But what were some of the big things that you took with you from Reddit to Twitter?

Katie Bauer 13:40
Well, the thing that I mentioned a moment ago, that a problem you think is one problem is many problems, that was something I went in thinking about actively. And it was not only reinforced, it was made much more nuanced. Something that was interesting about joining Reddit at such an early stage is that we had basically nothing to work with from a data science perspective. Like, there was data, but nothing was aggregated, and we didn’t really know what was in any of the datasets that we were going to work with. And coming into my role at Twitter, the thing that my team was doing was also something that had not been done at the company before. So it was kind of a similar thing, where it was just: we don’t know what’s in these datasets, we don’t really know what we’re going to find, and people don’t know how to think about this. So I ended up spending a lot of time helping people understand why you would measure things in the first place, or why certain types of measurement were important versus just the observability dashboards engineers were used to looking at. And, also related to the lesson about categorizing subreddits, it ended up being really important to start developing segmentations of different types of developers at Twitter to kind of take apart some of the problems. So we had a bunch of data about what they were doing, like all their different developer tools. But when you viewed it in aggregate, it would be impossible to figure out what you should do. Once you started segmenting it into, these are back-end developers, these are machine learning engineers, that ended up being a great way to figure out more targeted problems to solve that were a lot more tractable, as opposed to just having the mandate of: there are 4,000 engineers at this company, help them. It’s like, who can we help the most first? How can we describe their problems in ways that they understand? It just makes it more targeted and easier to relate to for the end customer.

Eric Dodds 15:36
Super interesting. Okay, one more question about Twitter. How did you collect that data? I mean, I think about your experience at Reddit, where, you know, you have some sort of telemetry running on the site, and you’re collecting all that data into a data store and, of course, sort of running your process from there. But thinking about internal data, collecting telemetry on those sorts of things, seems like a challenging technical problem in and of itself. How did you think about that?

Katie Bauer 16:07
Yeah, it was a very multifaceted problem, because different tools had different, I guess, ways that you would measure them. Like, some of them have a web UI, some of them are command-line tools. Fortunately, someone relatively early on in the company history, or I don’t know if it was relatively early on, but at least four or five years back, had instrumented a lot of our developer tools, at least the ones that were maintained by a certain organization. So we did have a lot of data, though it was often incomplete or not exactly what we wanted. But that was a big source of some of the data that we looked at. We also would go into the history of the version control system and pull out information about what languages people were using, how often they were committing code, and that sort of thing.
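
Mining version-control history the way Katie describes can be approximated by parsing `git log` output. A minimal sketch, with a made-up log snippet and an illustrative extension-to-language map (the emails, file paths, and mapping are assumptions for the example, not Twitter’s actual data):

```python
from collections import Counter

# Map file extensions to languages (illustrative subset).
EXT_LANG = {".py": "Python", ".scala": "Scala", ".java": "Java", ".sql": "SQL"}

def summarize_log(log_text: str):
    """Count commits per author and languages of touched files, from
    output shaped like `git log --pretty=format:'commit %ae' --name-only`."""
    commits, langs = Counter(), Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue                                  # blank separator line
        if line.startswith("commit "):
            commits[line.split(" ", 1)[1]] += 1       # author email
        else:
            ext = "." + line.rsplit(".", 1)[-1] if "." in line else ""
            langs[EXT_LANG.get(ext, "other")] += 1    # touched file
    return commits, langs

sample = """\
commit alice@example.com
pipeline/etl.py
schema/users.sql

commit bob@example.com
search/Index.scala
"""
commits, langs = summarize_log(sample)
print(commits)  # Counter({'alice@example.com': 1, 'bob@example.com': 1})
print(langs)    # Counter({'Python': 1, 'SQL': 1, 'Scala': 1})
```

In practice you would feed this from `git log` directly; the point is that commit metadata alone already answers “what languages, how often” style questions without any extra instrumentation.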

Eric Dodds 16:56
Yeah, that’s actually very forward-thinking of someone, to put telemetry in on the dev tools. I’m sure you were very grateful, because that is not the case in many places. Okay, tell us about GlossGenius. Can you dig a little bit deeper? I know you gave us an overview, but dig a little bit deeper into, you know, what does the product do? Who are the customers, etc.?

Katie Bauer 17:22
Sure. GlossGenius is a vertical SaaS company targeted at independent owners of beauty salons. So that’s everything from hair and nails to things that are maybe a little less expected, like med spas and injections. The product is sort of an all-in-one platform to help them manage their business. Kind of our tagline is that they can focus on the craft and the art of what they do, and they’ve got a business manager in the product. That’s not our literal tagline, I’m paraphrasing a little bit. It’s a booking website that helps them with outreach to customers, like texting and email marketing, helps them take payments and keep cards on file, helps them get paid faster, and a variety of different things.

Eric Dodds 18:10
And so when we think about one of your customers, are we talking about, you know, there’s a point-of-sale system where someone comes in for their appointment, and then they check out, and that’s in the calendar system at the point of sale, so they can accept the payment. And then there’s a booking thing that sends the email, all that stuff is baked into the platform. Yep. Interesting. Okay. The users of your customers, do they create an account? How does that work?

Katie Bauer 18:46
Yeah, they don’t, which is an interesting thing about our platform compared to maybe some other companies in the space. Right now, the customers of our customers, they may know who we are, but they don’t necessarily. We’re very tightly aligned with the salon owner, which is a good thing for incentive alignment, and maybe suggests future opportunities for us in terms of areas where we can expand.

Eric Dodds 19:11
That’s interesting. Yeah, because, yeah, unlike the Mindbody sort of model, where it’s, you know, sort of centralizing, and there’s the, you know, I mean, you’re B2B2C, and so some models sort of anchor around the end consumer and then sort of commoditize the provider. Okay. That’s super interesting. One thing that I would love to dig into just a little bit is what you think about on a daily basis, or, you know, the horrible cliché way to say this is what keeps you up at night. But when I think about GlossGenius, it’s a closed system. And what I mean by that is, for your customer, you’re not selling this concept of the modern data stack, you know, that’s completely modular, where you need to ingest this data with this tool and model it with this tool and blah, blah, blah, right? It’s, no, I mean, I’m assuming, you spin this up, you can send things to your customer’s schedule, you can communicate with them, you can accept payments, reschedule, etc. And of course, I’m sure there are, like, backend analytics that you provide to your customers, which is super interesting. What advantages do you have with a closed system from that standpoint? Like, what are the things that you like most about working with a system that you actually kind of have control over? I mean, in some ways, it seems like the dream, where it’s like, wow, you have all of the data for your customer in one place, which seems great, but it’s almost like, oh no, I have all this, what do I do?

Katie Bauer 21:00
Yeah. I mean, it is a good problem to have, maybe having more than you’re prepared to deal with. But I would say the problems that we experience from a data perspective are problems that you would have at any company. Like, was the telemetry implemented correctly? Does it actually say what we need it to say? There are a lot of instances where a customer enters something in themselves, and they spell “hair cut” differently than some other person, like a haircut with a space versus haircut as one word. Working with that data, reconciling it and figuring out how to process it, is kind of an ongoing thing for us. But a lot of our problems are really about, how do we make sure we know what our customers are really doing with the product? Because with any product that you spend some time working on, you have ideas about what people should be doing. But then when you look at the behavior, you’re like, wait, what is this? Like, somebody in the real world...
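
The “hair cut” versus “haircut” reconciliation is a classic normalization problem. A minimal sketch, assuming a small hand-curated canonical vocabulary (the service names here are illustrative, not GlossGenius’s real taxonomy):

```python
import re

# A tiny canonical vocabulary; illustrative only.
CANONICAL = ["haircut", "blowout", "gel manicure"]

def _key(s: str) -> str:
    """Collapse case, spaces, and punctuation so variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def normalize(raw: str) -> str:
    """Map a customer-entered service name onto a canonical one.

    Unknown entries pass through lowercased so they can be reviewed
    later and possibly added to the vocabulary.
    """
    key = _key(raw)
    for name in CANONICAL:
        if _key(name) == key:
            return name
    return raw.strip().lower()

print(normalize("Hair Cut"))     # haircut
print(normalize("hair-cut"))     # haircut
print(normalize("GelManicure"))  # gel manicure
print(normalize("Beard Trim"))   # beard trim (unknown, passed through)
```

Real pipelines usually layer fuzzy matching (edit distance) on top of this exact-key step to catch typos as well as spacing variants, but the collapse-then-compare idea is the core of it.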

Eric Dodds 22:00
you’re like, what, what are you doing?

Katie Bauer 22:04
Sometimes it’s because you found a bug. And sometimes it’s because the user is doing something interesting and creative, and you want to find a way to help them turn that into a real product feature, which in my opinion, for product analytics specifically, is one of the coolest things about the role: really knowing what’s happening. And there’s a ton of data work you need to do to enable something like that. But another common theme threading throughout a lot of my career is trying to figure out, how do you make the problem smaller? For us, we work with all different types of beauty professionals, and something we’re thinking about right now is what’s different between, say, someone who does hair versus someone who provides waxing. These people have different needs and different business models, and there are different expectations for different types of salons. Making sure that we can differentiate that in our data will reveal things to us that we want to be able to act on. Like, we are looking at this, to be clear. But it’s something that we need to make sure we’re always focusing on: trying to find the right segments within our business, to know what are the distinct personas that we’re serving, what are their needs, and how do we build a better product for them?

Eric Dodds 23:21
Super interesting. Now, there’s a product analytics aspect, which I know is probably an oversimplification, but understanding what’s really going on, I think, is a great way to describe it, the words that you used. There’s also this other interesting aspect, and I have to think that as a data leader you’re thinking about: okay, if we have a bunch of hair salons and we learn things about what’s working well, you can build data products that you could use to sort of push value back. How much are you thinking about that? That’s got to be one of the more exciting potential things that you’re thinking about.

Katie Bauer 24:06
Yeah, I’m definitely thinking about it a lot. It’s part of the long-term mandate that I have: to make sure that we’re putting our data to good commercial uses, whether it’s through our product or otherwise. And something that we’re starting to do, that I’m pretty excited about, is just using our data to find economic trends. We’ve been in a couple of trade publications talking about, yeah, it’s fall, so that means there are lots of pumpkin facials, that sort of thing. And right now it’s a bit more fun and entertainment than it is necessarily something we’re going to productize, but it’s sort of a way of exploring what we can do right now, and building demand and awareness of the valuable data that we have about beauty, which is a huge industry, as I’m sure I don’t need to tell anyone. But there’s a lot that we can learn from the data that we have, and we’re looking for ways to put it to good use and also to make sure people know all the exciting stuff that we have.

Eric Dodds 25:03
Awesome. All right, well, Kostas, I’ve been monopolizing the mic. Please, please jump in.

Kostas Pardalis 25:10
Okay, Eric, you were having an amazing conversation; I really enjoyed listening to the show. Katie, you’ve been a data science professional for quite a while now, and many things have changed. I’d love to hear from you, like, give us a little bit of a glimpse into this journey, like, how the role has changed, right, from what it was a couple of years ago to what it is today.

Katie Bauer 25:43
Yeah, that is a topic I could go on about for a while. And something that is interesting to me too, as I try to learn more about the history of roles like what we call data science, or maybe AI engineers now, is that people have been doing this for a really long time. But the specific tasks they’re applying it to, and the tools they use for it, are maybe a little bit different. When I started, the hot thing was data science and deep learning. Then we kind of went through an era of the modern data stack, and maybe analytics engineer was the hot job title at that point. And now we’re kind of back on deep learning, although we call it AI now, because what is ChatGPT except a transformer? And now we’re calling the people AI engineers. Fundamentally, I think the thread through all these things is that companies have a bunch of data, and they want to put it to use for something. And the idea of how we do that changes, partly based on what’s trendy, but I think it’s also kind of a reaction to previous waves. If you think about something like data science, it’s kind of a reaction to maybe more traditional BI, the idea that everything has to be really curated by a person. But then you have the promise of all these algorithms that are going to find amazing insights for you. And then that enthusiasm kind of fades, the pendulum shifts back to data that needs to be curated by people, and we get into the modern data stack era, which is really oriented around analytics. And as people get disillusioned about that, you move back into this idea that AI, like some kind of algorithm, is going to do all the curation for you. I feel like that is a trend that is probably going to continue for all of human history, or at least as long as we have people working with data: we’re going to get disillusioned with one thing, try the other thing, and we will gradually keep making progress as we move through all these hype cycles. What we call these people is constantly changing, but still, it’s the same thing.

Kostas Pardalis 27:51
Yeah, I think so. So what’s the essence of them all? Like, what is the goal? Let’s say someone had to choose to be a data scientist or an AI engineer, whatever we want to call it today. Why would they do it? What should they do? Let me actually ask the question in a different way: what kind of things should this person be interested in, so that in the end they would be happy? Because that’s why we work, at the end, to be happy. So talk a little bit about this, because I think one of the problems with all these changes in titles, and creating new categories of professions that are adjacent and overlap, is that it gets confusing for people who want to enter the profession. I’d like to understand what I’m going to be doing in the end, because it’s one thing to call someone a scientist and another to call them more like an engineer, right? There’s a reason we have different words there. So tell us a little bit more about that.

Katie Bauer 29:00
Yeah, I guess, if there are truly two roles in data, like, we have all these different titles and theoretically job families, but if there are truly two roles, I think it’s, one, the scientist-like, exploratory, knowledge-oriented type, and then the builder type, which often gets called engineer. So that maybe is partly what the pendulum is swinging back and forth between. If you want to be an AI engineer, for example, you’re probably going to be doing something in production; you’re probably going to be trying to figure out how to use data, and you don’t really care that much about how it’s structured, to create an end product experience. Whereas for something like an analytics engineer, the structure of the data is the main thing that you care about, and you care a lot about getting it into the right place and shape, so that it is oriented around what you know. I see that as a bit more of a knowledge-oriented role, even though it has engineer in the title; it’s one where understanding the business logic is really important. Whereas for an AI engineer, less so; it’s not necessarily about modeling a business process as much as it is trying to create something that people can interact with, and by which I mean people who are probably external to your company. So maybe this is also an internal versus external dynamic that I’m describing, or operations versus production. But I kind of see those as the two main profiles, and whatever is trendy at the time maybe says something about a hype cycle, or what people are excited about from technology.

Kostas Pardalis 30:36
So we talked about the data scientist — about their role, how it has changed, and how it looks today. And I love the perspective that you gave — you're like a true data scientist: at the end, you're looking for patterns, and you found a pattern there when you said we swing from one type to the other, so we can even extrapolate what's going to happen in the future. We've mixed a lot of things, but one of the things I personally find so interesting when it comes to data in general is that there are plenty of different professions that need to come together to deliver, at the end, some kind of product. We have data engineers, we have ML engineers, we have data scientists, we have BI analysts, we have the business ops people who now also work with data, and we even have production engineering, since more and more the data becomes part of the product itself, so they need to get access to that stuff and deliver something there. How do you see the dynamics between these different roles and how they're defined? And what kind of overlaps exist? Because I'm pretty sure that in your experience you've seen a lot of overlap — out of necessity in many cases, because one role might be missing and someone else might need to do the job for them. How do you see these dynamics today?

Katie Bauer 32:09
Yeah, I guess a hot take I have is that data science would actually be a really good name for all of these jobs collectively, because I feel like they're all — I don't know if it's a spectrum, maybe it's some kind of higher-dimensional space where you're moving through these different attributes that people have. But my general theory on specialization in data roles is that the smaller the team, and the smaller the company, the less you need to be specialized. As the scale of what you're doing becomes bigger, there are just more ways in which it can be complex, and as things become more complex, the different components of the problem necessitate that you have people who think about narrower pieces of it. To give an example from my own career: on a lot of early data teams, or early-stage projects, although I've been a data scientist, I've mostly done data engineering — building pipelines, defining my own metrics, and building all of the infrastructure that I needed for myself, or in some cases even writing production code as if I were a member of the engineering team. And at a really big company, my job has become more specific. A lot of the data science team at Twitter, for example, was really hyper-focused on experimentation — they were almost pure statisticians sometimes. That would never happen at a company like the one I work for now, because we don't even have the scale of user base that would require the level of sophistication in experiments that Twitter had. So partly you're required by your environment to focus on particular parts of the problem, or to develop or exercise specialized skills. But it's always helpful to be aware of the parallel skills, to reach into them and use them as you need them.
If you have a big team where there are a lot of people who can focus on narrow parts of the problem, you need to figure out how to coordinate and work well together. But if you don't, it's a lot easier to just do things for yourself, if possible.

Kostas Pardalis 34:19
Yeah, that makes sense. So let's take the different companies as examples, just for the stages they're in — starting with Twitter. If you had to design, let's say, the perfect data team for a company like Twitter, what would it look like?

Katie Bauer 34:45
One thing I'll say is that Twitter would have been a great place for data mesh. They were far enough along in their data journey that it would have been reasonable for them to pivot to it. It's such a big company with so many different concepts that having a centralized team can be kind of weird, or a strange fit, and pushing some of the domain knowledge into specific teams would have made a lot of sense there — just because it's not reasonable for one person to hold everything in their head, at least with tooling as it is today. So I would say you'd probably want an overall leader for data science at Twitter, which there was not while I was there. And then maybe a hub-and-spoke model — I like hub-and-spoke models, because you can get a standardized quality for the whole company. But if a company becomes huge enough, you almost start having separate companies within the same company — like Microsoft, for example. There's no way you could have a centralized data team there. So the bigger you become as a company, perhaps you start getting a fractal thing, where you have a local data team that does everything they need for that area with a certain structure, and it's totally separate from some other part of the company that has the same structure, or whatever structure works for them.

Kostas Pardalis 36:14
Yeah, that makes sense. And then moving to something like Reddit — I'm curious about it because of the unique moment when you came into the company, where there was, let's say, a lot of opportunity but no structure. It's like a startup, but a large-scale startup in a way, where you take all the problems of a startup and exaggerate them to the scale of something like Reddit. How would a data team look there?

Katie Bauer 36:53
Yeah, I think the right thing to do in that case is to really look at, essentially, which teams would benefit from having data savvy in them. Not every team is going to be super quantitative, but some should be — like growth, which was the team I worked with when I joined. These are things that need to be measured and quantified, and that's probably where you should put your data team first: an area where they can be impactful. The wrong thing to do is to try and cover everything all at once. You need to have targeted areas so you can make progress. And for people who are not familiar with working with a data team, it helps them understand what they get by working with one, because you have examples of what it looks like when you go super deep. Whereas it can be kind of underwhelming when you're just spread vaguely across everything.

Kostas Pardalis 37:49
And what about a startup like the one you're at now? The reason I'm asking is because that's the space where I also have more experience, but all my experience comes from B2B companies, where you don't have much data at the beginning. So it's always kind of a question of when is the right time to bring in a dedicated person who can work on the data — because now we have data, and before we didn't. In B2C it's a little bit different, because you tend to get data at scale much earlier in the lifecycle of the company. But still, there's a right time for that too, right? I wouldn't assume you hire a data scientist as your first hire when you start a company. So when is the right time to do that? And what would the team look like?

Katie Bauer 38:54
I guess my answer is kind of related to the answer I gave earlier about specialization. As a company grows, at least to start, data is probably something that product managers are doing, or some biz ops person is doing on the side. Eventually it grows and becomes complex enough that it needs to be someone's full-time job. I don't know if I have a unified theory of this, but I would say it's probably around the time when your user base starts scaling, or maybe when you have some early product-market fit and you want to grow. That's a very good time to start bringing in data, because you have to be more serious about measuring things. For GlossGenius, data originally was part of our finance function, and that makes sense — your investor metrics are pretty important for a company, and they also need to be right, so the level of accuracy necessitates someone really paying attention to it. As we've expanded from there, it's been about figuring out what the most important business verticals are, what the right staffing model is, and where we put people. I have a pretty strong opinion that there should be a business case for every dedicated embedded data person you hire. So usually there needs to be some combination of me identifying the need, and a stakeholder they would be working with agreeing that they need it. The model that we have right now, at least in terms of embedded resources: we have some working with product, some working with go-to-market — which is both marketing and sales — and someone working with our customer experience team. And then we also have analytics engineers and data engineers. Some would say our team is a little bottom-heavy in the number of analytics engineers and data engineers we have relative to analysts. But part of that is because I believe having a well-maintained data stack requires it to be someone's full-time job.
If someone's always fighting off stakeholder requests, there's never going to be anyone thinking about optimizing your dbt models, for example, or refactoring the things that were written quickly and then ended up not scaling.

Kostas Pardalis 41:01
What's the difference between an analytics engineer and a data engineer?

Katie Bauer 41:05
In many cases, there probably isn't one. The divide that we have is that the data engineers are really focused on ingestion and managing Snowflake — a combination of managing it via vendors, as well as writing custom Lambdas in some cases. And then analytics engineers are much more purely focused on the business logic pieces: modeling metrics, and modeling concepts that are relevant to the business and need to be spoken about and measured consistently.

Kostas Pardalis 41:31
Okay, and what are the boundaries between the two in terms of data models? Because the data comes in, and you can have data that's structured, okay, but still very raw — even data from customer interactions, for example. You ingest them into tables, and from that point to the point where an analyst can start working with them, there are quite a few steps of modeling that need to happen in many cases. So where does the data engineer stop, and where does the analytics engineer start?

Katie Bauer 42:16
That's a good question. Our boundary is really a concept called a staging model, which is sort of a slightly transformed bit of raw data. That's stuff like removing PII, as well as — if there's some code in a production database that is just a number — translating it to its actual English equivalent. That's connected to the ingestion piece: data comes into the warehouse and is connected to, I guess, a very loose, high-level business concept. And then the metrics are either the tables that people query, or more like fact-and-dimension-style models. It's decoupled from the source, so we can also swap sources relatively easily if we need to change them, which has happened. But basically, data engineers make sure that data gets into the warehouse in a way that is compliant and protects customer privacy, and then analytics engineers take that and use it to model important concepts. I should also say that analysts sometimes write dbt models, or models in Looker, but that's generally considered more of a prototype — it's not recommended for anything super serious, because it's not built in a way that's intended to scale.
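The staging-model idea Katie describes — strip PII, translate opaque production codes into their English equivalents, and keep the result decoupled from the source — can be sketched in a few lines. This is a hypothetical illustration in plain Python (GlossGenius's actual staging models are written in dbt, and the column names, codes, and labels below are invented):

```python
# Sketch of a staging-model transformation, using Python dicts in place of
# warehouse rows. All field names and code mappings are hypothetical.

# Raw rows as they might land in the warehouse from a production database.
RAW_APPOINTMENTS = [
    {"id": 1, "client_email": "a@example.com", "status_code": 2, "price_cents": 4500},
    {"id": 2, "client_email": "b@example.com", "status_code": 1, "price_cents": 8000},
]

# Translate opaque production codes into their English equivalents.
STATUS_LABELS = {1: "booked", 2: "completed", 3: "cancelled"}

# Columns containing PII that must not reach the analytics layer.
PII_COLUMNS = {"client_email"}

def stage_appointment(row: dict) -> dict:
    """Produce a staging row: PII removed, numeric codes replaced by labels."""
    staged = {k: v for k, v in row.items() if k not in PII_COLUMNS}
    staged["status"] = STATUS_LABELS.get(staged.pop("status_code"), "unknown")
    return staged

staged_rows = [stage_appointment(r) for r in RAW_APPOINTMENTS]
print(staged_rows[0])  # {'id': 1, 'price_cents': 4500, 'status': 'completed'}
```

Downstream fact-and-dimension models would then build only on `staged_rows`-style output, which is what makes swapping the raw source painless.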

Kostas Pardalis 43:40
Okay. And how does this loop close? You have the data engineering team doing the ingestion, and then the analytics engineer doing the modeling. But models are dynamic creatures by nature, right? As the business changes, the models also change. And then you have the analyst who is using these models to go ask their questions and get answers. The reason I'm asking is exactly what you said — that sometimes the analyst can also create a model. And I think that's where we see the need to go and change the models on the back end: it's exactly because something new has been created, and that usually comes from the analysts. It's not like the data engineer knows what's going on or what's changing, or even the analytics engineer, right? So how does this lifecycle happen, and how well structured is it — or is it more of a natural process?

Katie Bauer 44:57
I guess right now the team is small enough, and we're still building out foundations, so it's not like we have a super well-oiled machine where every time we release a new product feature, something gets ingested, modeled, and put in a dashboard somewhere. Mostly we're very prioritized and focused on things that are important to company strategy and big initiatives. So it ends up being a little ad hoc, and we're always looking for ways to optimize the process. But so far, it's not happening at a pace we can't manage just through conversations. And we're looking for ways to make it more streamlined, with fewer people involved, since lots of things are in motion at the company right now.

Kostas Pardalis 45:42
Yeah, 100%. Have you seen this done differently at bigger companies? Because I would assume the flow itself remains pretty much the same — it might look different, and you have different roles, but at the end, change needs to be fed back to the back end from the front end. And the front end is usually a data scientist or a data analyst, or someone who is building the product or interfacing with the problem that's getting solved. So how have you seen it at other companies, like Twitter or Reddit or anywhere else? Do you have any experiences from there?

Katie Bauer 46:24
I would say the thing I most commonly see missing from this chain that I've described is the analytics engineer. It's something I've been pulled back into doing as a data scientist. Or maybe the data engineer is more purely infrastructure-oriented, or doing something way more production-oriented than just pulling data into a data warehouse for analytics — maybe they do both. But then someone needs to actually build a performant data set for internal usage in dashboards, analysis, and experimentation, and there's no one to do it other than a data scientist who doesn't really know the right principles for it, or an analyst in some cases. Because it's not the thing they're interested in doing, and they're not trained in how to do it right, they just do a fast version that ends up breaking or becoming a problem later. This job title gets called a lot of different things — at Twitter it was called data engineer; there were two different kinds of data engineers. So I don't really think the title is that important, but I do think it's important to have someone who is primarily responsible for using business logic to shape data that is then used for other purposes in the company. And it might not even always be used by someone with a data job title — it could be someone on a performance marketing team working with this well-curated data. The specific form it takes will vary with the company, how it's structured, and how analytical the end stakeholders are. But generally, I find there are a few key steps in the end-to-end: data is created; it gets moved somewhere it can be used for decision-making purposes, usually a cloud data warehouse; someone needs to transform it; and then someone needs to use it.
The most consistent across all of those is the data engineer moving things into the database where people query it. Who does the transformation piece and who does the analysis piece, I think, varies a lot by company.

Kostas Pardalis 48:30
Yep, 100%. And one last question from me, and then I'll give the microphone back to Eric. Is there a tool or, let's say, a technology that you would like to see exist, or be used more, in the environment you operate in today? What would be your wish — something that would make your life as a data scientist, or the lives of the data scientists who work with you, much better and easier?

Katie Bauer 49:07
The tool that I want — maybe it's something else, and I can see data contracts maybe being argued as this — but something I would really like is actually not a tool for data people. It would be a tool that creates a better developer experience for engineers implementing telemetry. So much hinges on your event data, and a lot of the time the engineers putting it in can't really QA it very well. They usually don't have good context for what it will end up being used for, and there's usually so much of that data that it's really hard to document what's happening and give them a guide to work with. So I'd really like a tool that improves the developer experience, to make it easier for engineers to put in telemetry.

Kostas Pardalis 49:53
That's interesting. And what do you mean by telemetry? Because, okay, there are many definitions of that. Are you talking about telemetry in the sense that SREs think of it, or —

Katie Bauer 50:05
I guess I'm using the term a little imprecisely. I just mean when people are putting eventing into an app — clickstream data, behavioral data, that sort of thing.

Eric Dodds 50:13
Just auto-track and everything will be fine.

Katie Bauer 50:17
Nothing ever went wrong.

Kostas Pardalis 50:21
Okay. I don't know, Eric — I think you'd have a lot to say about that, right? That's coming from the other side.

Eric Dodds 50:31
Ah, man, event data is so tough. It's super interesting, and I think it's really challenging, because ultimately what's difficult is instrumentation. We like to think about it as a technical problem, but it's ultimately a relationship between people in the organization, right? I think you said it well: the engineer who's instrumenting it doesn't necessarily have all the context for how it's going to be used. The other thing I would argue is that when you initially do instrumentation, it's kind of like the person at Twitter who implemented a bunch of telemetry that wasn't really heavily used — but then when you got there, you were like, yes, thank you. It isn't perfect, but thank you. And so I think one of the hardest things about telemetry is that to do it really well, you need to be very explicit, but in many cases it's about option value, and that requires multiple team members to think way farther ahead than the immediate need. There's some linting and some other things that can certainly make that easier. But yeah, it's tough, because event data gets messy faster than anything else. And boy, if you want to see a nasty warehouse — event instrumentation gone wrong is your quickest path to it.

Katie Bauer 52:02
Definitely. I'm sure we all have detailed horror stories about what we've seen.

Eric Dodds 52:06
Yes. Okay, well, we're at the buzzer, as we like to say. But Katie, one more question for you. We've talked so much about data — if you had to leave data entirely and do a completely different career, what would you do?

Katie Bauer 52:24
I guess I think it would be fun to have a catering business where someone gives you a really specific theme and you figure out how to make dinner for them. That would be cool — figuring out the logistics piece of it, as well as coming up with a creative dinner theme for them.

Eric Dodds 52:42
Yeah, I love it. That's tricky — my wife is a florist, and perishables are crazy to work with. You're on a timeline. Super interesting. Well, thank you so much, Katie, for joining the show. We learned so much, and we'd love to have you back. Thanks again.

Katie Bauer 53:01
Yeah, thank you.

Eric Dodds 53:02
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.