This week on The Data Stack Show, Eric and John chat with Ben Rogojan, Owner and Data Consultant at Seattle Data Guy. During the episode, Ben discusses the complexities of budgeting for data teams within organizations. The group explores various funding models, including chargebacks and independent budgets, and considers the implications of each on the perception and effectiveness of data teams. The conversation highlights the potential for unhealthy budgeting practices, such as investing in new systems without clear business benefits, which can lead to financial difficulties for companies. The role of the CFO and the importance of aligning data team budgets with organizational goals are also examined, emphasizing the need for strategic investment in data capabilities to support business outcomes. Don’t miss this data discussion!
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:04
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here with Ben Rogojan. Ben, you were on the show, actually, and this is crazy to say, a couple of years ago, so you were one of our very early guests. And it’s so great to have you back on. Thanks for joining us.
Ben Rogojan 00:40
Yeah, no, thank you. Thanks so much for having me jump on.
Eric Dodds 00:42
All right. Well, for those new listeners to the show who didn’t hear your original episode, give us a brief background. Where did you come from, and what do you do today?
Ben Rogojan 00:52
Yeah. So hey, everyone, thanks so much for joining the show. My name is Ben Rogojan. A lot of you will know me as the Seattle Data Guy online. Currently, I help companies set up their end-to-end data infrastructure, help them, you know, figure out which solutions to pick, since there’s just so many these days, and sometimes help implement. And before that, you know, I’ve worked as a data engineer for nearly the past decade. My last job was at Facebook, working as a data engineer, and before that I was working at a healthcare analytics startup, doing a lot of similar work. And honestly, I started out at a hospital, doing a lot of things like programming and dashboarding and things of that nature. That’s really kind of where I started my data journey.
John Wessel 01:35
So Ben, we talked a little bit before the show about your background with Facebook, and then the journey into consulting. So I’m really curious to dig in a little bit on that. Like, what problems did you work on at Facebook? And then, now that you’re consulting for various companies, which of those problems, you know, really cross between the two, where you’re like, yeah, this works well? And which of them are like, meh, that’s more of a Facebook, you know, bigger tech company problem, and it’s not applicable? Really excited to dig into that topic. What about you? What are you excited to talk about?
Ben Rogojan 02:06
Yeah, no, I think that’s definitely the subject I’m interested in talking about: really comparing some of the differences, and, you know, in some ways the similarities in terms of the outcomes you’re always trying to get to, but also how you get there, maybe the amount of data you’re dealing with, or just the complexity and the various challenges that companies of different sizes and data maturities face.
John Wessel 02:29
Yeah. Awesome. All right, let’s dig in.
Eric Dodds 02:32
Let’s do it. Ben, I’m super pumped to have this conversation with you about relating data to business outcomes, which is a huge topic. I think it’s become much more acute of late, actually, just because, you know, with the macro environment, there have really been a lot of layoffs. I mean, we hear all the time, and I’m sure you hear, and John, I’m sure you hear, especially as consultants, you know, our data team isn’t as big as it used to be. Right? And so there are a lot of things to figure out. And one topic that John and I have been talking a lot about is, you know, how do you relate data stuff to some sort of business outcome? And that sounds a little bit like a tired cliche, but it’s really not as straightforward as you would think, especially as we think about how far upstream some of the data stuff can sit from, let’s say, you know, a number moving in the right direction in some executive BI dashboard, right? So I’d love to dig into that in today’s show. And John, you were really interested in what that looks like at Facebook, which I think is a really interesting topic, because of the complexity. If we think about that supply chain of data, you know, and how far stuff can sit upstream at one of the FAANGs, it’s gonna be a totally different world than, say, a mid-market type company.
John Wessel 04:01
Yeah, and some context here: I actually took one of my first, like, true programming analytics courses, I think it was on Udacity or Udemy, one of those, and it was a Facebook analytics engineer that taught it. No way. So yeah, it was a great course. I learned a lot from it. But I’m curious on the business outcomes. Maybe talk about some business outcomes that you worked on at Facebook and how you got there, and then maybe we could talk about the same or a similar outcome and how you would get there now in a consulting role. I’m imagining they’re not always, or probably often not, the same path.
Ben Rogojan 04:43
No, it is always dependent, right? Like, one of the nice things about Facebook is that their infrastructure was arguably very mature, right, and well integrated, in terms of the data as well as the solutions. Whereas, you know, when I work for a company as a consultant, I’ll often come in and they’ll be like, hey, we’ve got, let’s say, a marketing funnel, but it’s across seven different solutions. Maybe I’m exaggerating, it’s probably like three or four. But, you know, it’s across multiple solutions, they have multiple steps going throughout all those solutions, and sometimes, you know, maybe one of those steps isn’t captured, or is kind of skipped. And so you kind of have to put it all together. Whereas, you know, at Facebook, a lot of that data is generally pretty well integrated, right? Like, it generally has a flow. I think that was something that I was impressed with when I first started there: as soon as you signed up for one application, or one internal system, you were basically propagated through all of it, and you had an ID that kind of, you know, went through all of it. It was kind of interesting in that way. So, you know, in terms of business outcomes, some of it was even very similar to that. I worked very closely with, like, the HR and recruiting data teams. And so, especially at that time, right, when we were hiring very heavily, you know, we were often looking at the recruiting funnel and figuring out, okay, where are we winning? Where are we losing?
You know, how are people actually going through the questions? Or, you know, trying to figure out how different interviewers do in terms of, like, do they have higher or lower acceptance rates, and just seeing if there are ways we can improve, maybe, how we teach interviewers, to make sure that they do a good job of actually helping who they’re interviewing in the right ways. Like, hey, if you do have a candidate that could have gotten through, but it was maybe something you didn’t do in terms of giving a good set of, whatever we’re gonna say, hints, or whatever might have been required, how do we improve that? So there was a lot of focus on that, especially at that time. It’s not necessarily a business outcome, but one of the first projects I did was honestly all around data modeling. At the time, you know, like most companies, we had proliferated these multiple data models around recruiting and HR, and you have multiple teams taking them on. And there just was this kind of lack of standards across all of them, right? Like, everyone’s kind of doing their own thing. And eventually, that starts impacting the ability for analysts to essentially work, right? Because it’s like, okay, when I work with this data set, they’ve got multiple IDs that all can kind of join to each other, but we don’t really know what the master ID is here. So that can cause a problem.
You know, there were some challenges for some people dealing with certain types of data formats. And these seem like small things, but that was my first project: okay, we’ve got all these different data models, how do we create one that we can all own, that helps analysts, you know, create insights faster, get the data, and not reach out to data engineers as much? That was really my goal: how do we make it so they don’t have to reach out as much, they can just work? You know, it’s very clear when they look at it, like, this is the data we’re looking at, we understand what IDs to join to. And that kind of just helps build confidence and build those results much faster.
Eric Dodds 08:05
Well, I have to ask a question about the HR project, because I think that’s really interesting. At a company, you know, as large as Facebook, especially in a phase where there’s a ton of hiring, you have an entire sort of business unit or data practice that’s dedicated to that, right? Whereas at a lot of companies, you know, you have to be pretty large to get to that point. But I think it draws an interesting dynamic out, and I think this relates directly to the question around the relationship between data stuff and business outcomes, in that there’s a high level of subjectivity there to some extent, right? So, of course, what are you measuring as part of that? Well, you know, what’s our close rate on hiring for key positions? Okay, maybe that’s the way the HR team is held accountable, right? Again, there could be a number of metrics there. But when you’re interviewing, there’s a level of subjectivity that’s actually pretty hard to capture with data, even though quantifying the pipeline is really important to drive accountability, set priorities, and stuff like that. And then there’s also probably a lot of inherent noise in the data, because the interviewees are going to be very different, right, and have different styles. And that’s okay. So how did you think about, or how did Facebook think about, you know, data as an input to this with some sort of hard business outcome, which is, you know, are we hitting a certain close rate on key positions, and then the subjectivity element of it? Because I just think that’s such a good example: there’s a huge data aspect to this, but it’s also the marriage of all these different inputs that creates, you know, an outcome where the whole is greater than the sum of the parts, if they can all come together.
Ben Rogojan 10:00
Yeah, I mean, let me see if I can answer that question. I think, you know, especially at that point, there were some clear goals that I want to say were pretty public in terms of how much Facebook wanted to grow. Right? Like, maybe that was to some degree, you know, to make people on the stock side happy, to show, hey, we’re constantly growing, we’re growing in all different aspects. I don’t remember what the targets were, but, you know, there was at some point maybe even someone saying something about doubling. I’d have to find articles, but I feel like I remember seeing certain goals being pretty public. And, you know, because we’ve done all this research, we know that, hey, if we interview 100,000 people, we’re going to likely land, whatever, 1,000 employees, right? And so now you know, like, okay, if our goal is to hit 80,000, you know, in the next three years, we need to interview X amount of people. And then you kind of just keep walking that back: okay, that means every day we need to run X amount of interviews. It kind of just keeps building upon itself, because you have that information now. If this is our target, and we also know how much it takes to get there, we can now break down exactly what we should be doing long term. And then you see things throughout that whole process, right? Like, okay, now we’re seeing a reduction, maybe, in the numbers, because eventually you will. That was something that we talked about: okay, we’re seeing kind of a reduction in percentage. Is that because we’ve interviewed everyone? Which I think was part of it, where we’d already interviewed most people that fit, and maybe they, you know, passed or didn’t want the role, and that’s why they’re not here.
You know, should we now reach out to them again? Right, okay, these people didn’t want a role here, let’s set up something to send them an email again and be like, hey, we know you didn’t want this six months ago or a year ago. Are you interested now? So that kind of gives you that information.
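As an editorial aside, the walk-back Ben describes can be sketched in a few lines of Python. This is only an illustration: the 100,000 interviews and 1,000 hires come from his rough example, while the hiring target, the horizon in working days, and the function names are assumptions made up for the sketch, not Facebook figures or tooling.

```python
def interviews_needed(target_hires: int, past_interviews: int, past_hires: int) -> int:
    """Interviews required to reach target_hires at the historical hire rate."""
    if past_hires <= 0 or past_interviews <= 0:
        raise ValueError("historical counts must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -((-target_hires * past_interviews) // past_hires)


def interviews_per_day(total_interviews: int, working_days: int) -> int:
    """Spread the total evenly over the available working days, rounding up."""
    if working_days <= 0:
        raise ValueError("working_days must be positive")
    return -(-total_interviews // working_days)


# Historical rate from the conversation: roughly 1,000 hires per 100,000 interviews.
# The target (5,000 hires) and horizon (750 working days) are hypothetical.
total = interviews_needed(target_hires=5_000, past_interviews=100_000, past_hires=1_000)
per_day = interviews_per_day(total, working_days=750)
print(total, per_day)
```

If the observed hire rate drops over time, as Ben mentions, rerunning the same calculation with the newer rate shows how much the required interview volume grows.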
Eric Dodds 11:51
Super interesting. How about the infrastructure? I’m imagining the heavy-duty infrastructure that you would use to drive something at Facebook, right, which can be very expensive, both in terms of hard costs and headcount.
John Wessel 12:03
I think, before you jump into infrastructure, one thing I’m also curious about is the collection, because I think people really gloss over that. You mentioned that these are some very, like, fuzzy things you’re potentially trying to capture. So were there any creative things you all did around the collection? It could be as simple as, like, alright, we have the managers fill out this form at this step, or, I don’t know, they chatted with a Slack bot. Because I think people skip over that. And especially in the analyst mindset, you just completely ignore data that you don’t have, and that can be really valuable data. If you just collect, like, one or two first-party things, you can really benefit downstream. So I don’t know if anything comes to mind, but that’s something that I was thinking about.
Ben Rogojan 12:56
I feel like it’s one of those things where there were interesting ways that we captured information and I’m just spacing on it. But I do remember, throughout the flow, there were obviously all these ways that we would capture information. Like, again, after you interviewed someone, you’d go through and write your notes. And they’d yell at you; they had systems that would be like, hey, it’s been, you know, four hours since you interviewed this person, and the longer you wait, the worse your memory is going to get on this. Right? And then there was also just a clear form where it was like, okay, where did they do well, where did they do badly, how many questions did they get through, which questions did they get through? We basically had a pretty preset list of questions, which, you know, I think you could basically just find on Glassdoor. And we would also have information like, hey, this person has interviewed before. So before you even interviewed this person, you’d already know, hey, they’ve interviewed before, they’ve seen question A, they’ve seen question B, so you need to make sure you don’t ask that same question. So there were definitely a lot of those things throughout the process. Facebook’s interview process at the time, and it might have changed at this point, it’s been more than five years now, I want to say six years since I interviewed, and even since I was doing the actual interviews it’s been like three years. But it was very much like, we had a system, and it was very standardized. You know, I think the goal being that if it’s standardized, you kind of remove some bias out of it, and you have more of a process. So that’s kind of the goal.
But yeah, maybe one of the ways we would capture it was just, you know, as you’re going through the process, they’d be like, hey, time for you to review and give your perspective. And they’d definitely hound you if it took more than, I don’t know what it was, like they’d give you 72 hours, and if you didn’t fill it out, I think your score didn’t count or something. I don’t remember exactly.
Eric Dodds 14:47
I like that. That’s data governance now. I do think that’s a really good point, though, and I didn’t even think about this. But when we think about tying a data project to some sort of outcome, thinking about the datasets that are important to that is huge, right? Not being biased by what you have, exactly. To your point, okay, you can quantify a funnel, right? I mean, that’s not rocket science. But are you using all of the available inputs to answer every question?
John Wessel 15:28
In the e-commerce space, for example, quizzes were awesome. If you could get people to take a quiz, or just even get halfway through a quiz, that’s nice, like, first-party, high-intent data. Like, yeah, that’s useful data that might not natively be in your data warehouse. So marketing might be doing that, and data teams, like, just didn’t even think of it.
Ben Rogojan 15:55
But not to flip it around too much on y’all, you know, you’re talking about first-party data. Obviously, one of the discussions going around the data world is the death of the third-party cookie, which we still haven’t seen. It’s a dying cookie in 2032. Yeah. So do you guys see people kind of being like, we have to collect first-party data even more now, so we can understand who our customer is? Because I feel like you guys deal with that more on the event side of things.
Eric Dodds 16:27
Yeah, for sure. It’s certainly a big topic. I think a lot of companies are thinking critically about how they adapt to the future when it comes. And I would say, increasingly, we have seen data teams who are really trying to adopt, you know, I guess I would call it a first-party-first approach. Is that even a thing? First-party first, you heard it here. Right? And I think the big question there is the sacrifices that you make. They fall into a couple of categories, and actually, I’ll do another flip and ask both of you, because I think you’re seeing a lot of this on the ground as well. There are a couple of areas that we see. So one is advertising, obviously, right? We talked about Facebook, but you’re advertising on Meta, you know, through the ecosystem of their apps. And so that’s a big concern for companies who have a lot of revenue that’s heavily reliant on, you know, the third-party scripts and cookies being on their site. Now, one thing that is very interesting is that, you know, no one likes change, right? And so if that’s changing, and the third-party cookie’s going away, and we, you know, could expect X revenue from advertising on Google search or whatever it is, there’s also this sense of, man, it’s going to be great not to rely on this black box that we’re beholden to. Because whenever they change the rules, or their conversion logic, or their attribution logic, you’re beholden to that. And that can be a really big challenge. So that’s sort of one area. And then the other area would be, you know, any sort of operational tooling. You can think about, of course, Google Analytics as a huge one, but there are all sorts of scripts running on everyone’s websites and apps. And so that’s, in many ways, more of an operational thing, right?
Like, are those tools going to face limitations if they can’t store a cookie? And so am I going to lose functionality for some operational tool? I mean, it’s all sorts of stuff, right? From, you know, screen recording to analytics to whatever it is, right, personalization tools. So, what are you seeing?
John Wessel 19:04
Yeah, I think so. I started at an interesting time. I started in, you know, the Google Analytics, like, web space around 2015, 2016. And the general attitude was, well, this is what it is, this is what you use. You just use Google Analytics. We’re beholden to Google. Like, we hate them some days, we like them okay other days. Yep. That was just what was available for the vast majority of people. And I think people were happy enough. And then you’ve got some evolution of tooling, and you’ve got probably some further skepticism around Google and Facebook both. Then you had the big thing with Apple and Facebook that really hit some e-commerce companies, with, basically, Facebook not being able to target as well. And then I think people reacted with, like, I need to dig into this more and be able to control this more. And I think from there, like, Facebook and Google, for e-commerce, are really what drive a lot of the traffic for people. So then you have this attitude of, okay, well, what could we do if we control this? And you get some data people involved, and then you end up with, oh, wow, this actually opens up a lot of opportunity, not the least of which was just two very basic things. One, site speed. There are so many things you can A/B test, but if you just make the site faster, that’s one of the best things to do for people, because you’d get these marketing teams that would just pile pixels on, and they’d have, like, 27 pixels, with three-second, four-second page load times. And then attribution was the other thing. At least when I was getting involved, it was like, oh, because I have an email tool, and we use Shopify, and then some Google, and you compare attributions, and it would add up to like 200%.
Like, well, because they’re each trying to, you know, grab credit and say, oh yeah, that sale is attributed to us. So having that objective, first-party data to do some objective attribution was another.
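As an editorial aside, the "adds up to 200%" effect John describes can be reproduced with a small Python sketch. Everything here is hypothetical: the tool names, the order IDs, the touch log, and the choice of a last-touch rule are assumptions for illustration, not how any particular platform computes attribution.

```python
from collections import Counter


def claimed_share(order_ids, claims_by_tool):
    """Total conversions claimed by all tools, as a fraction of real orders."""
    real_orders = set(order_ids)
    claimed = sum(len(ids & real_orders) for ids in claims_by_tool.values())
    return claimed / len(real_orders)


def last_touch_attribution(order_ids, touches):
    """Credit each order to exactly one channel: its latest recorded touch."""
    latest = {}
    for order_id, timestamp, channel in touches:
        if order_id not in latest or timestamp > latest[order_id][0]:
            latest[order_id] = (timestamp, channel)
    return Counter(
        latest[order_id][1] if order_id in latest else "unattributed"
        for order_id in order_ids
    )


# Hypothetical data: two real orders, and two tools that each claim both of them.
orders = ["o1", "o2"]
claims = {"email_tool": {"o1", "o2"}, "ad_platform": {"o1", "o2"}}
# Hypothetical first-party touch log: (order_id, timestamp, channel).
touches = [("o1", 1, "email"), ("o1", 2, "ads"), ("o2", 1, "ads"), ("o2", 3, "email")]

print(claimed_share(orders, claims))  # tools' own claims, relative to real orders
credited = last_touch_attribution(orders, touches)
print(credited, sum(credited.values()))  # one credit per order
```

Because each order receives exactly one credit under a single first-party rule, the credited total can never exceed the number of real orders, whichever rule the team settles on.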
Eric Dodds 21:19
I was gonna actually ask both of you about that as sort of a follow-up question. I mean, one of the things that we’ve seen with a lot of our customers, when we think about business outcomes, is that as the warehouse has increasingly become the center of the data stack, if you have a first-party-first approach, it seems like it’s been way easier for a lot of companies to create a business case for the data side of things. Because you’re not having to defend the ad platform’s or marketing platform’s interpretation of conversion, which you then have to do some sort of mapping on, right? So if you think about it, the data team is collecting some sort of data from websites, wherever, right? And let’s say you have transactional data, so you have purchases, or add-to-carts, or whatever those are. Way downstream, that maps to some sort of business KPI, right? It’s number of orders, which is revenue, and, you know, there’s margin, and you sort of apply all that. But it’s this really interesting dynamic where, a lot of times, it’s almost like we have to defend our interpretation of what’s happening in the ad platform, as opposed to saying, this was raw data, and we modeled it to reflect the actual reality of the business, and you can prove that. Which is pretty interesting. Are you seeing that, Ben?
Ben Rogojan 22:50
Definitely. I think I haven’t had to spend too much time in, like, purely advertising recently. I think most of my projects, thinking back, were very, what am I trying to say, very domain specific, like working with a casino and analyzing their gaming, or working with a telecom company and analyzing calls and things like that. So a little less focused on, like, how are you converting somebody, and more focused on, how are people just using our product or using the thing that we do? So it’s been very domain specific there.
Eric Dodds 23:29
Yep. Makes sense. Okay, how about the infrastructure question? I’m dying to hear about this, because it can get really spendy. And I think in today’s environment, it’s a good topic to discuss.
John Wessel 23:41
Yeah, I’d love to talk about Facebook first, some of the infrastructure and tooling. And then, what are you using now day to day with consulting clients? I’m expecting the answers are pretty different, but I’m curious.
Ben Rogojan 23:56
Yeah, I mean, you know, at the time, and I’m sure this is somewhat similar even now, though obviously they’re investing tons more on the, like, Gen AI side, in hardware and things on that side, and probably making internal tooling to make even that development easier for developers. But that’s something I think Facebook has always done well. When I was at Facebook, they made your job very easy, to the point that I would work with certain data engineers who would then pull me aside after a few months and be like, I’m bored, right? Because your job has been made easy. You know, for example, they’ve got something internally that’s very similar to Airflow for workflow orchestration. And really, all you’re doing is making this kind of half, or more like 75% SQL, you know, 25% Python configuration file that you then just push somewhere, and it runs, and, you know, it kind of just works, right? There’s no need to spin up your own Kubernetes cluster or spin up all of these various things. Someone else is managing the actual infrastructure; you’re literally just committing files somewhere. Which obviously, I think, is very Facebook specific. They actually had a whole team that was dedicated to it, it was called Dataswarm, just developing that and managing that. So they were constantly making it better, as well as maintaining it on a daily basis. So if it went down, you weren’t like, I need to solve this problem. It was like, well, I have nothing to do for the next hour, because someone else is solving that problem. That’s not my problem, and I can’t even solve it, really. It’s not even accessible for me to solve this problem. So I think there’s that aspect of it.
I think the interesting thing is that Facebook, and I think probably a lot of the big tech companies, were doing this well before it caught on more recently: they were doing the whole, hey, we’re gonna put our data in kind of this open format, right? Like, it’s just going to exist in, you know, this data lake or data warehouse state somewhere, and then we’re going to use whatever engine on top of it, and you can specify that engine later on. And you’re seeing that now, I think, with, like, Iceberg: people are putting things in S3 and then using whatever engine they want on top of it, whether it’s more cost effective or it just makes more sense for that specific job. So I do remember that kind of being the thing when I left. It was like, okay, hey, you want to use Presto, use Presto; you want to use Spark, or, you know, something else, you can kind of pull that off the shelf and use it to run this specific job on that data set. And it’s very abstracted away, where it’s literally just, again, that configuration: this job is gonna be Spark, this job’s gonna be Presto, and you just call it out early on. But again, I think you’re starting to see that now. And it makes sense, right? As people are trying to control costs, they’re trying to figure out, okay, sometimes it’s about cost, and sometimes it’s about performance. I do imagine there’ll be a line where, for certain companies, it will just make sense to stick with one. You know, I see that with most of my clients that are more in that mid to small size: you’re not going to try to juggle BigQuery and Databricks and Snowflake. You’re gonna pick one and try to do that really well and make sure it fits. But when I look at the larger organizations I work with, they already are using all of the above.
And it’s more about maybe trying to coordinate it longer term to try to figure out what makes the most sense for various teams.
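As an editorial aside, the "this job is gonna be Spark, this job’s gonna be Presto" idea can be sketched as a job definition that names its engine up front. This is a hypothetical toy, not Dataswarm or any real orchestrator’s API: the `JobSpec` class, the engine registry, and the runner functions are all invented for illustration, and the runners only return strings instead of talking to a real cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class JobSpec:
    """A job is mostly SQL plus a little configuration, including its engine."""
    name: str
    engine: str
    sql: str


def run_on_spark(sql: str) -> str:
    # Placeholder: a real runner would submit the query to a Spark cluster.
    return f"spark ran: {sql}"


def run_on_presto(sql: str) -> str:
    # Placeholder: a real runner would submit the query to a Presto cluster.
    return f"presto ran: {sql}"


# Hypothetical registry mapping an engine name to the code that executes it.
ENGINES: Dict[str, Callable[[str], str]] = {
    "spark": run_on_spark,
    "presto": run_on_presto,
}


def run_job(job: JobSpec) -> str:
    """Dispatch the same SQL to whichever engine the job's config names."""
    if job.engine not in ENGINES:
        raise ValueError(f"unknown engine: {job.engine}")
    return ENGINES[job.engine](job.sql)


job = JobSpec(name="daily_funnel", engine="presto", sql="SELECT 1")
print(run_job(job))
```

Switching the job to another engine is then a one-word configuration change, which is the kind of abstraction Ben describes.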
Eric Dodds 27:21
Yep, yeah.
Ben Rogojan 27:24
Well, that’s probably just touching the iceberg. I think that question can go, you know, multiple different directions. So feel free to keep digging in.
Eric Dodds 27:30
Well, I think maybe this is what you were thinking of, John. So that’s Facebook, they have all of this. I mean, what a luxury to have an entire team, you know, work on this internal tooling. But as we’ve seen in the data space so often, you know, the FAANGs are really pushing the boundaries on inventing stuff, because their teams are solving problems that very few other companies have faced. So in the mid-market, like you said, okay, we’re sticking with sort of one cloud, like, we’re a Google shop, a Snowflake shop, a Databricks shop, whatever, and we’re going to do that really well. What about some of the other tooling? I mean, it seems like there’s a lot of SaaS popping up that can help sort of act as that dedicated data team, to take care of, you know, those pieces for teams that don’t have the resources for a bespoke solution. Are there areas in particular where you see, okay, there’s a ton of really great tooling that’s making this more streamlined and accessible to smaller companies that don’t have resources? Like, in what areas of the stack are there efficiencies due to new tooling?
Ben Rogojan 28:41
Yeah, I think, you know, it’s interesting, because I think Ethan Aaron posted about this. In, like, 2015, data teams were like one person, especially at midsize companies. Then, like, 2020, they were like 30 or 50, whatever, they blew up pretty big. And now we’re, you know, in 2024, and we’re looking at, like, three to five people again on these teams. And so it’s interesting that we got to that point. You know, back in 2008, I think what happened is people found out very quickly that if you built 100 data pipelines, you had to maintain 100 data pipelines. So the faster you build, which a lot of these tools could kind of give you, the more you have to maintain. And then you just kept having to build bigger and bigger teams to maintain them, and only 20 of them actually got used, you know. Yes, exactly. Or half of them get used, or 5% of them get used, or whatever. I’m sure you could find some interesting statistics around that. But there’s definitely a lot of tooling that I do think can make things easier. You know, I think what’s interesting about the solutions that exist now is, like, you know, having been working in this space for a while, we’ve somehow still recreated the same problems we had before. What I mean is, okay, we have a tool, whether it be, you know, Portable, Fivetran, or Estuary, to do data extraction. Great. Now we need, like, okay, a tool for transformation. Great. And now we’re doing the same thing we were doing before, which was, okay, someone created a cron script to do data extraction. Great. Okay, someone created a cron script that called a stored procedure somewhere, and it’s a separate script. I say cron, but I mean, like, a Python script managed by cron. And so now I have to set up these two things to run about an hour and a half apart, because that’s the optimal timing. And it feels like, in some way, we recreated that.
In this world that’s like, okay, it’s easier now. But we’re still at the same problem where it’s like, okay, your Fivetran estuary job runs a certain time. And now you hopefully run your DBT job or coalesce or whatever your transformation tool is, at the same time, and then hopefully, you know, you’ve got your next you know, your Power BI dashboard updating at the exact same time, or the right right time. So it’s funny how that’s happened. And like, now, again, we have all these orchestrators that had been developed to like, kind of go around that were like, yes, you know, it was what airflow was to, like Python scripts and sequel, you know, kind of one off jobs back in 2015. And it’s just like, it’s the same thing. It’s like, we created the same problem, you think we would have built this solution into it? Or had this in mind, but I do find that I think interesting. But again, all these tools do help. I do see them like, actually, like, I had a client, one of the first clients. I had, when I quit, that I built up their solution with a few tools. And like, every once awhile, I reach out to them, like, Hey, are you doing anything? And every once in a while, they’ll reach out to me to be like, Hey, we think we might need you to help on something. And then like, 24 hours goes by, nope, nevermind, we solved it. And you know, it’s just like one data person, essentially, who’s kind of managing it all and kind of kind of handled it. So yeah, I do think a lot of this has helped. But it is always interesting how we’ve kind of recreated some of the same problems we’ve had for a long time now.
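The timing problem described here, where extraction, transformation, and dashboard refresh each run on their own clock and everyone hopes the offsets line up, is roughly what orchestrators address by running steps in dependency order instead. A minimal Python sketch of that idea follows. The step names (`extract`, `transform`, `refresh_dashboard`) and the placeholder task functions are assumptions for illustration, not anything named in the conversation, and a real orchestrator would also handle retries, scheduling, and failures.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "refresh_dashboard": {"transform"},
}


def run_in_dependency_order(pipeline, tasks):
    """Run each task only after everything it depends on has finished.

    This replaces "schedule the second job 90 minutes after the first"
    with "start the second job when the first one is done".
    """
    completed = []
    for name in TopologicalSorter(pipeline).static_order():
        tasks[name]()  # an exception here stops all downstream steps
        completed.append(name)
    return completed


def make_task(name, log):
    """Build a placeholder task that just records that it ran."""
    def task():
        log.append(name)
    return task


log = []
tasks = {name: make_task(name, log) for name in PIPELINE}
order = run_in_dependency_order(PIPELINE, tasks)
print(order)
```

The point of the design is that ordering comes from declared dependencies rather than from guessing how long the upstream job will take.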
Eric Dodds 31:56
Yeah, it’s like a system that allows for innovation in individual problem areas ends up creating a more complex system, right? But these systems have to operate like a system, if that makes sense. So yeah, it’s super interesting. Okay, I have a question I was thinking a little bit more about. Earlier we had sort of discussed this distance of the data project, data team, whatever, from the business outcome, right? So this is a question for both of you. Where have you seen that become a problem? And when I say become a problem, to put a sharper point on that: funding gets cut, or the data team comes under scrutiny, because it’s like, well, this is just a cost center, what value are they adding? And to some extent, there is a bunch of infrastructure that runs upstream of what’s happening downstream, which is what shows up in the executive BI dashboard. What are the symptoms of that distance becoming a problem? Where it’s like, okay, you’re in the realm now where things are getting dangerous, or there may be issues, because even though on the ground all the stuff we’re doing, all this infrastructure, is making this stuff possible downstream, perception is reality, right? Sort of that list of “you might be in trouble if,” as a data team. Because a lot of times those things are not a problem until they become a problem, if that makes sense. That dynamic can persist for a while until, whatever, the company has a bad quarter, or a new VP comes in who’s going through every line item on the budget and inquiring about every single thing, right? Those things happen.
And so sometimes those dynamics can persist, where a perception doesn’t come to light until there’s some sort of event that brings it to light, and at that point it becomes a problem. So what do you think? What are those dynamics that you could catch earlier, like symptoms of that?
John Wessel 34:11
Yeah, I think a quick one for me is: you might be in trouble as a data team if you just produce reports and dashboards. Because if you’ve got your data warehouse integrated into pushing things out to key partners, via integrations with tools that people already use, like you’re pushing data back into Salesforce, back into ERPs, those data teams, I think, are seen as indispensable, because the sales team’s like, oh, well, you know, I use that thing. It’s in Salesforce, it’s useful to me. Whereas if you’re just doing dashboards, I mean, I think dashboards can be useful and reports can be useful, but those can be in trouble, because those can be things where it’s like, well, I don’t remember my login, or, I used to check that, but the data was wrong one time and I don’t look at it anymore. So that would be my number one thing: are you integrating into the tools people already use? And then, are you integrating with partners that do really useful things with data?
Ben Rogojan 35:12
I think something along those lines, where you start having clear disconnects, where your business doesn’t seem to care. I think you referenced that apathy, where it’s like, okay, we ask them for things, and it’s wrong, or it breaks eventually. I had a client a while back who was like, oh yeah, we don’t use the data warehouse anymore, because, you know, this one report broke, and now we just don’t do it; we use other options, you know, which we created. So if you start having that apathy, I think that’s one way. I think it can also manifest itself if you’re building things because you think it’s the right way to build things, but no one in the business is asking where things are going to go. I think that’s never a great sign, right? If you’re really just building, as Ethan Aaron kind of puts it, infrastructure for infrastructure’s sake, and no one at any point is stopping you, like, hey, what are we doing this for? Then there’s some concern there. That’s more just immaturity than anything else. There should hopefully be that maturity where the business understands, hey, this should probably come in stages; at this stage, when can we expect to actually get a feel for it, play with the data, and understand it? Because I think the more you can give them some tangibility, the more they’ll see that they can do things. Because on the flip side, when I start creating a data warehouse for clients, they have this initial vision of what they’ll do, right? Because they’ve had Excel, they’ve got their initial world and what they think.
And then if you give them a little more access, suddenly they’re like, oh my gosh, I’ve got 20,000 things I want to do, because I can see all this data, I can play with it, I can poke at it. And then the game becomes more like, hey, we now need a process to, you know, decide what needs to be prioritized, right? That becomes the discussion: not what’s going to be created, not create all the things, but okay, now that you have all this access and all these ideas, because you finally do see it all, how do we funnel that into an actual process? So that’s what you want. You want to get to that point where the business is super excited and, if anything, you’re having to spend time prioritizing what actually should be done, and also maybe spending time getting rid of old things and things like that.
John Wessel 37:29
Yeah, I think organizational structure is also a big piece here. Because I’ve found that if I can find or make embedded data analysts, like maybe there’s already a financial analyst or something, or maybe there’s somebody just interested in analytics who’s already embedded in a marketing team or an ops team, those can be some of the best people. And as far as driving adoption inside those teams, they can do way more than I could ever do as a data team, because they’re just there every day. They can say, oh, hey, you’ve got this problem, and they can take the data and apply it to a problem in the moment, when you’re on the ground.
Eric Dodds 38:09
Can we dig into that a little bit? So when you say you find an analyst, because you were a CTO, right? And so you oversaw the data practice, call it the technical side of things. So you’re saying there’s an analyst who works in finance. Are you essentially building an alliance with that person, making sure you’re serving them with the things that they need, so that they’re almost an advocate for the data team in there? Or are you trying to poach them?
John Wessel 38:36
That’s a good clarification. No. I did poach one or two, the good ones. But in general, they stay in their current seat. And these people are typically highly analytical. Finance especially is great, because if you’ve got that accounting background, and maybe you’re a financial analyst... I’ve done this at two companies now, where financial analysts take days, hours and hours, to close out the books for the month before. It’s so much work for them in Excel. And there have actually been two companies now where analysts have gotten the right access to data in a data warehouse, and then they’ve self-taught SQL and have been some of the fastest and most motivated learners of SQL. And they’ve reduced the close times by days at both companies, just because they were eager and hungry, and then had somebody to give them the right access to the data. So that’s just one simple example. And then other analysts, maybe ops analysts, often get really bogged down in manually tracking things, spending hours and hours in Excel. If you’re already putting the time in, then, because Ben mentioned automation, you’d always look for jobs like SQL automation. That automation thing can be really crucial for those analysts that are already just spending hours doing stuff manually.
Eric Dodds 40:01
Yeah, one question. I’m laughing here because our friend Matt just sent us a message and said, you might be in trouble if all finance seems to care about is your capex. Which is true, I would say, across the board. But that brings up an interesting point about the way that data teams are budgeted, or projects are budgeted, because that can vary a lot. And I’ll give you an extreme example, but I think this actually also relates to how the organization views the data team. So I was talking to someone the other day, and it’s a very large company. They work on the data platform team, and they actually do not have a budget for this team that they’re on. That team tracks usage of the data product they built internally, and they have chargebacks that go to the teams. Which is a little bit weird, because, I mean, that’s slightly perverse, in that you want people to use more data, but that usage is also how you get, you know, your budget. So that’s kind of an extreme example. But I would love to hear, like, there are different ways that data teams get funded. They’re an independent organization and they get their own budget, right? There can be chargebacks. So maybe just think through some of the situations, or maybe an example of a healthy dynamic and an example where it’s not as healthy, just in terms of how the budgeting works around that stuff.
Ben Rogojan 41:40
In terms of unhealthy, that obviously can go in both directions, right? On one side, like I said, in 2020 it was: let’s just keep adding more, and because we have added more, let’s add more, without truly trying to connect it with, you know, does this help? Does it add up? Because we’ve added these new systems, will our business do better? There were a ton of startups, or companies that went from startup to IPO, that probably didn’t make it past 2022, right? They either went bankrupt or, you know, their stock price is doing terribly. I want to make sure before I say this company name; let me just check. Yeah. So, for example, and this is not to talk ill of any company, but if you’re talking about a company where people would look at it and say, hey, their data infrastructure is amazing, Stitch Fix, I think, is a great example. Right? As a data person, it was cool to go to their website, see what they were doing, and go do it yourself. And it’s just, is it unhealthy? Is that over-fascination with data helpful or not in the long term? That I can’t answer. I don’t know their internals. But I think that can happen. You look at a company like that and you’re like, hey, they’re cool, they’re doing data, and then you think your company needs to be that, and it kind of becomes this cargo cult thing. Again, it’s just to say that data isn’t everything. Just because you have cool models, just because you’ve done all that, your business can still do poorly. And so I think that happened a lot in 2020. We had all these businesses just grow, and they were like, let’s hire more data people, that seems to be what everyone’s doing.
And then you end up struggling, because if you’ve got 20 people that you’re spending $150-200k a year on each, that’s a significant amount of your budget, especially if you’re a startup. On the flip side, you can also be at this point where, I often hear people say, if your data team rolls up to the CFO, you’re gonna have a hard time. So that’s kind of the other side, where you can be treated like you’re just a cost. And to some businesses, I say, you might just be a cost. That might just be your role, and you have to understand that sometimes. But if you think you can do more, it’s gonna be really hard in that situation. That’s unhealthy on the other side, where you just don’t get enough attention, or you don’t get enough budget, and so you’re only ever going to be able to do just enough, which keeps them from having an advantage they could have had. I’d say in a healthy situation, hopefully you’re not growing your team unless there’s a specific business reason to be like, yes, we need a data engineer. Maybe you had a data analyst, because I think a lot of people start with a data analyst. They’ve been building all this nice stuff, but now it’s getting hard to maintain, right? They’ve got these three or four, or four or five, reports. They’re having to manually create them. It’s taking a long time. Is there a way we can automate this? And is there a way we can justify hiring a $150-200k person to do that? Does it actually add that to our bottom line? Or does it still just make sense to have this data analyst kind of manage it?
So I think a healthy team would have those discussions. They wouldn’t just be like, we need to hire a data engineer because that would solve the problem. It’s like, well, these reports are only saving X amount; it might not make sense long term. So something along those lines.
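The "does this hire actually add that to our bottom line" question can be put as rough arithmetic. The Python sketch below is only an illustration. The hours saved, analyst hourly cost, and 48 working weeks are made-up assumptions, and the $175k engineer cost is simply the midpoint of the $150-200k range mentioned above. It also ignores benefits a hire might bring beyond time saved.

```python
def annual_net_benefit(hours_saved_per_week, analysts, hourly_cost,
                       engineer_cost, weeks_per_year=48):
    """Yearly value of analyst time freed by automation, minus the cost of the hire.

    All inputs are hypothetical planning numbers, not figures from the episode.
    """
    savings = hours_saved_per_week * analysts * hourly_cost * weeks_per_year
    return savings - engineer_cost


ENGINEER_COST = 175_000  # assumed midpoint of the $150-200k range

# One analyst saving 10 hours a week at an assumed $60/hour.
one_analyst = annual_net_benefit(10, 1, 60, ENGINEER_COST)

# Eight analysts each saving 10 hours a week at the same assumed rate.
eight_analysts = annual_net_benefit(10, 8, 60, ENGINEER_COST)

print(one_analyst)     # -146200: the hire costs far more than it saves
print(eight_analysts)  # 55400: the same hire now pays for itself
```

Under these assumed numbers the hire only makes sense once the automation serves a team of analysts rather than a handful of reports.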
Eric Dodds 45:12
Yeah. Makes sense. John, thoughts? Yeah.
John Wessel 45:17
I think when you see the infrastructure and data engineering stuff as a productivity driver, typically for more than one analyst... maybe it’s just one analyst, but usually more than one. Say we have teams of analysts doing X, Y, and Z every single week, and then they have to go through the mental exercise of, like, cool, what if they did less? Do we actually need this report? All those things need to happen prior to, like, no, we need this, it drives value, here’s why. And the business goes through that exercise, and then they get to the point where it’s like, okay, I think we need data engineering help. It’s an enablement for these analysts, they’ll be more productive, it’s useful. I think that’s a really good exercise. Versus, to what you’re saying, like, oh yeah, we should hire a data engineer, we need a data warehouse, we need this, we need, like, AI, right? Like all
Eric Dodds 46:10
the cargo culting? Yes. Yeah.
John Wessel 46:12
But I think that process is super helpful. And then the finance thing was interesting, too. The safest data teams... if you want job security, be on a data team that reports to the CFO. But if you want to work on really cool stuff, maybe not, because the CFO is typically, not always, obviously, going to think more like accounting, right? They think more about cost. The good news is, CFOs are in charge of the budget, and usually they stick up for their people, which is good news. But you might not get to work on the most interesting things, your team is going to be small, and you’re gonna have to work hard. That’s very typical, but I think that’s kind of the general rule.
Eric Dodds 46:56
Yeah, that’s okay. Oh, sorry. Go ahead.
Sandy Ryza 46:59
No, I think it’s super interesting. That’s kind of the end. Yeah.
Eric Dodds 47:04
Okay. We’re close to the buzzer here. But just changing gears a little bit with the last couple of minutes we have, for both of you: you’re seeing a ton of different companies working on a ton of different stuff. What’s one of the most fun, cool projects you’ve seen recently, or done recently, with a company? John, why don’t we start with you, and then Ben can take us home.
John Wessel 47:29
I think, at least in the e-commerce space... I haven’t gotten to do a project on this yet, but I’m really excited about search. We got to talk to a really neat company, Marco, that’s working in this space. The biggest data-related problem for e-comm, in my opinion, is this discoverability thing. If you need a part and you type in a part, Amazon works great. If you’re not sure what you’re looking for, the search experience is really difficult. And the only way discovery works well is if you have a really small SKU count. So if you only sell like 10 things, then that’s fine, it’s easy. I think that, and then incorporating data into search, like search intent and signals, is a really interesting space. But I haven’t gotten to do one of those yet, so we’ll see.
Ben Rogojan 48:18
Yeah. You know, I think a lot of my projects end up being migrations, which aren’t necessarily boring, but they’re not the most thrilling. Like, last year, for one client who was spending upwards of, I think, $35,000 a month on their infrastructure, I was proving out a simpler version and helping them move to that. Which, you know, it’s always cool to hear those numbers and hear the reductions that you can do in that regard. Like, okay, this is totally possible to reduce, let’s do it. Recently, there’s one that’s more just, I think, kind of a nice project. I like it when they have this realness to them. Again, $35,000, to bring it down to like $10,000: that’s real. The other one that was real: I have this client, I’ve had them for a while, where we work off and on, and they always have kind of interesting ideas. They’re basically a logistics company. They deal with busing, and people rent them for various reasons. And they’re like, hey, one of the things that we do is, during the summer, we do these specific sets of bus routes. And one of my employees essentially has to wake up really early, because we have all the bus routes and all the pickup stops, and has to plan that out. They wake up to manually do this whole process. And I don’t want them to ever quit, because I don’t think anyone else can do it. Can we try to automate even 70% of it? And so basically, we’ve developed a system to automate that process. And that’s been really cool, because in the end we’re saving someone from having to wake up at 1am to develop this whole thing, and it feels good in that regard. And it really isn’t a complex ML model. It really is just a rules engine that we created.
Part of the client really wanted to go down this gen AI route. And I was like, I don’t think it’s going to work. Like, maybe, but I know we can definitely get something to work in more of a rules engine kind of fashion. So we went down that route. So yeah, I think that’s always kind of cool. It makes me think of another instance, where we ended up doing a migration that helped some analysts avoid having to wake up on Saturday and Sunday to do this one report, because it was a report they had to run every day. So anything like that is always kind of cool, just to help someone out who has a real problem.
Eric Dodds 50:44
Yeah, I love that. Well, really quickly, before we hop, remind us where listeners can find your information, connect with you, and see all your content?
Ben Rogojan 50:54
Yeah, I mean, you can look up Seattle Data Guy. I’m on YouTube, Substack, LinkedIn, and probably a few other places. But yeah, you can pretty much find my content there. So if you want to watch videos on things like becoming a data engineer, or even more specific topics, like data modeling, I’ve got a few pieces on that. And the same thing on Substack: I’ve got a pretty good plethora of content that ranges from beginner content to, you know, the organizational side, how you should set up your organization and things like that. So, yeah.
Eric Dodds 51:20
That’s great stuff. We read it all the time. Well, Ben, thank you so much for joining us on the show. Great conversation. I learned a ton and have a ton to think about, and we will have you back on again soon, now that you are a multi-time guest.
Ben Rogojan 51:37
Thank you. Thanks so much. I appreciate it.
Eric Dodds 51:38
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.