This week on The Data Stack Show, Eric and John chat with Jeff Skoldberg, Principal Consultant, Data Architecture and Analytics at Green Mountain Data Solutions. Jeff has been a data consultant specializing in supply chain analytics and cost optimization and shares his journey from starting as a business analyst at Keurig in 2008 to becoming an independent consultant. They discuss the evolution of the data landscape, including shifts from Microsoft SQL Server to SAP HANA and later to Snowflake. Jeff emphasizes the importance of cost optimization, detailing strategies for managing data costs effectively. The group also discusses two frameworks for using data to control business processes and create actionable dashboards, and more.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:05
Welcome to The Data Stack Show.
John Wessel 00:07
The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work.
Eric Dodds 00:13
Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Welcome back to the show, we are here with Jeff Skoldberg from Green Mountain Data. Jeff, welcome to The Data Stack Show. We’re super excited to have you.
Jeff Skoldberg 00:37
Thanks so much for having me. Excited to be here.
Eric Dodds 00:41
All right, well, you are calling in from the Green Mountains. So you come by the name of your consulting practice, honestly, but give us a little background. How’d you get into data?
Jeff Skoldberg 00:53
Sure. So I first started in data in 2008, when I was working at Keurig as a call center business analyst. I was with Keurig for about 12 years doing data, a couple of those years actually on the BI team. So kind of transitioning from a business analyst who is just knee deep in data all the time to, now it’s actually my job title, even though I’d been doing it the whole time. And then about five years ago, I said, hey, let me try and do this on my own and see what happens. So about five years ago, I left Keurig to become an independent consultant. And since then, I’ve been helping companies on their analytics journey, doing end-to-end stuff: business analysis, data pipelines, solution architecture, as well as the last mile, which is analytics and dashboards. So that’s what I’ve been doing for the last five years.
John Wessel 01:42
So Jeff, one of the topics I’m really excited to dive into is cost optimization. That’s one that’s near and dear to my heart, previously being in a CTO role. So I’m really excited to dive into that one. Any topics you’re excited to cover?
Jeff Skoldberg 01:57
Cost optimization sounds awesome. We can talk about some frameworks that I use with my clients as I’m walking them through their business problems, and we’re thinking about how we’re going to solve the problems that they come to me with. That’s another one. And we could also talk about how the data landscape has evolved over time.
John Wessel 02:14
Yeah, okay. Sounds great.
Eric Dodds 02:16
All right. Let’s dig in. Jeff, one thing I love is that we get to talk about your tenure at Keurig a little bit. It seems less and less common, especially in the world of data, for someone to spend over a decade at a single company. And not only did you do that, but what a decade. I mean, you sort of chose the years to span that time at Keurig, from before some of the most high impact technologies we use today were even invented. So start at the beginning. Was that an early job out of school? How did you even get into data at Keurig?
Jeff Skoldberg 03:02
Absolutely. So I went to school at the University of Vermont. And this was my first job out of college. With Keurig being the largest employer in Vermont at the time, and remote work not being as popular as it is now, it was basically my only choice for working at a big company. My first data role there, in 2008, was as a call center business analyst, where we’re looking at average wait time, average talk time, and average handle time, which is like the sum of the two, and using that for forecasting models, like when people should be taking their lunch breaks or their 15 minute break, etc. So we’re really building a staffing model based off of our call history and those types of metrics. So to keep a very long story short, eventually I moved into a role as supply chain analyst, where I had very smart managers and mentors who were really great at coming up with KPIs to manage the business. And then I was responsible for bringing them the data so they could really fulfill those KPIs. We could go into some examples later in the show if we want to dive deeper there. But it was very much an exercise of like, hey, we’re gonna try and reduce XYZ cost, and now we need the data to understand XYZ cost, and let’s automate that. So we’re not paying our analysts to pull data; we’re paying our analysts to really present it and show where the pain points and the weak spots are. The good bulk of my career at Keurig, I would say like eight of those 12 years, was as a supply chain analyst, and then I moved into an actual BI developer role on the BI team. When Keurig decided SAP HANA, or just SAP in general, was going to be their ERP system of choice, my manager made sure that I, as a supply chain analyst, had unfettered access to the SAP HANA database. So when he had a business question about how the business is doing, like what’s in stock, what’s our in-stock percent, are things shipping on time, et cetera, I would be able to write those queries and give him the data. And then, shortly after that, I joined the BI team as an official member of the BI team, even though I was unofficial BI that entire time. Yeah. Very cool.
Eric Dodds 05:16
Okay, I have a couple of questions. So one is about the Keurig business and the digital nature of it over that time period. In terms of distribution channels, and you said supply chain, was there a big shift to more online sales over that time period? And did that have a big impact on your work with the data?
Jeff Skoldberg 05:42
It’s interesting, it actually kind of went the opposite way, where it went more B2B in the long run. So when I started there in 2007, they were the largest shipper in Vermont. I’m sorry, they were the largest shipper in New England. So this is before people bought everything on Amazon. Of course, Amazon is the largest shipper in New England now; I don’t have to look it up or fact check that, you just kind of know, right? But in 2007, Keurig Green Mountain was the largest shipper in New England. They were doing about 20,000 orders a day out of the Waterbury distribution center. Holy cow. Yeah, it’s pretty amazing. And that didn’t grow a lot. Their consumer direct basically stayed static over the years, plus or minus. What really exploded was their grocery distribution; their club distribution, club being like Sam’s, Costco, BJ’s; and retail, so like Bed Bath & Beyond. We would call that the retail channel, which is separate from grocery. And as they became a nationwide brand, looking at metrics like what percent of grocery stores are we in, they very quickly got to 100% of the national brands. So it was just absolute explosive growth.
Eric Dodds 07:03
Wow. Okay, so now walk us through the changes in the tech stack, right? Because you have a business model change, and then explosive growth. And then the other variable is the changing tech landscape. And explosive growth generally means we have all sorts of new data and need all sorts of new processes to handle way more data than we were ever processing before. So give us that story.
Jeff Skoldberg 07:35
So I was very fortunate that I landed in a company that knew what business intelligence was. They had a BI team since before I worked there. So I started there in 2007, with my first data role in 2008. I don’t know how long they had a BI team for, but they had one back then, a small one. They had a dedicated team called the BI team, which is really cool, because a lot of places that I’ve consulted at are not there yet. So I was really fortunate to learn in an environment where they were thinking about sustainable data assets, where they were thinking about, oh, well, Jeff, you might do it this one way in Excel, because that’s all you know how to do, but let’s show you how you would do it in a database, how you structure your data to support this KPI, to automate it and stuff. So I actually had really great mentorship on the tech side as well. So it’s really cool that they were light years ahead of their time. But on the other hand, the actual tech that they were using was, of course, Microsoft SQL Server as their business warehouse. Their ERP system was an Oracle-based product called PeopleSoft, which is kind of ancient history now. It was very much a Microsoft shop. And it was interesting to see it evolve to SAP. Honestly, it almost felt like we were downgrading when we were coming into SAP BusinessObjects and, I forget, what is it called? BW, SAP Business Warehouse. It was like a huge downgrade from SQL Server Analysis Services, like the cubes that you have in SQL Server. Going from Analysis Services to BusinessObjects and Business Warehouse was a tough pill to swallow. But then coming into SAP HANA, it acts like a relational database; it acts like a modern data warehouse. Some people might not know about HANA, so I’ll just explain real quick. It’s an in-memory database. So when you turn it on, it takes all of your tables and it loads them in RAM. So this runs on massive servers that are specially designed for SAP HANA, usually manufactured by Hewlett Packard. They have a partnership where Hewlett Packard is designing machines specifically for this database.
Eric Dodds 09:58
I had no idea about that.
Jeff Skoldberg 10:01
It’s totally wild. We’re talking like terabytes of RAM.
Eric Dodds 10:05
Wow, I had no idea about that. That’s very cool. HANA trivia.
Jeff Skoldberg 10:09
Yeah. And so we were able to do forecast analytics on a 6 billion point dataset using a live connection in Tableau. You were able to just drag and drop and it would just come up on the screen, with maybe like a one second delay or something like that. Computing live on a 6 billion point dataset. Wow. That’s wild. Totally. So that was really cool. That was my first, I will say, big data. At 6 billion, you’re kind of getting there. In the big data realm, you need specialized equipment to process 6 billion rows, and that’s how I define big data. Some people will say the number of terabytes is big data; I define it as, do you need specialized equipment to actually process this data? And certainly, with 6 billion rows, you do. That was my first experience there with big data. And it was really fun to be able to optimize HANA, like learning about the partitioning, learning how we’re going to be clustering our data and organizing it. And using a union instead of a join to do those forecast accuracy comparisons was like a huge speed boost. All of the little performance tuning tips that you learned along the way were really enjoyable.
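A minimal sketch of the union-instead-of-join pattern mentioned above for forecast accuracy comparisons. The conversation does not give the actual query, so the table names, columns, and sample rows here are hypothetical, and SQLite stands in for HANA purely so the example runs; it makes no claim about the speedup itself. The idea is to stack forecast rows and actual rows into one shape and aggregate once, rather than joining two large tables on product and week.

```python
import sqlite3

# Hypothetical tables and data; SQLite is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forecast (product TEXT, week TEXT, qty INTEGER);
CREATE TABLE actuals  (product TEXT, week TEXT, qty INTEGER);
INSERT INTO forecast VALUES ('kcup_a', '2016-W01', 100), ('kcup_b', '2016-W01', 50);
INSERT INTO actuals  VALUES ('kcup_a', '2016-W01', 90),  ('kcup_c', '2016-W01', 20);
""")

# Stack both sources with UNION ALL, padding the missing measure with 0,
# then aggregate once per product and week.
UNION_QUERY = """
SELECT product, week,
       SUM(forecast_qty) AS forecast_qty,
       SUM(actual_qty)   AS actual_qty
FROM (
    SELECT product, week, qty AS forecast_qty, 0 AS actual_qty FROM forecast
    UNION ALL
    SELECT product, week, 0 AS forecast_qty, qty AS actual_qty FROM actuals
)
GROUP BY product, week
ORDER BY product, week
"""

rows = conn.execute(UNION_QUERY).fetchall()
for row in rows:
    print(row)
```

One side effect of this shape is that products present on only one side (forecast without actuals, or the reverse) still appear in the result, which a plain inner join would drop.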
Eric Dodds 11:20
Very cool. And when did Tableau come into the picture? Along with SAP, or?
Jeff Skoldberg 11:27
Yeah, I think they implemented Tableau in 2016. So medium-early adopters there. Tableau was certainly around in 2010, and then kind of had a huge growth.
Eric Dodds 11:37
Were there any major changes in pipelining over time, especially as you acquired additional data sources? I mean, I’m sure you had all sorts of pipelines from all these different distributors, vendors, all that sort of stuff.
Jeff Skoldberg 11:52
Sure. So a lot of my analysis came from two places: SAP directly, our ERP system, our system of record; and our demand planning system, which was called Demantra, an Oracle-based product. And we had Demantra re-forecasting every combination of product and customer every single day. And not only is it doing that once every single day, it’s doing it for multiple versions of the forecast. So you can have your pure system-generated forecast. You could have your sales outlook forecast, where the sales reps want to put their little juju on the forecast; that’s your sales outlook. You could have your annual operating plan, which is what you set at the beginning, and that doesn’t really get adjusted. And then maybe altogether you have like nine versions of the forecast. So it was forecasting every combination of customer and product, and then nine different sets of those, every single day. And we’re saving all of these snapshots for a certain amount of time before we drop them. And there are certain snapshots that we saved forever. So like, one week leading up to the sale, we really care about that. 13 weeks before the sale, we really care about that. And the reason why we cared about those two is because one week leading up to the sale, you can make the K-Cup; 13 weeks leading up to the sale, you can buy the Keurig brewer and get it there from China. Six months leading up to the sale and one year leading up to the sale, those are also lags that we would keep forever. And this was years ago, so I’m improvising here a little bit, but more or less it’s directionally accurate. And so we had to come up with a process to ingest all of this data, snapshot it, delete certain slices of the snapshots as time elapses, and then actually do the comparison. So that was like my main role in SAP HANA. You were asking me about multiple systems, so let me circle back to that. They also had IRI data. IRI is like cash register data. So Sam’s, Costco, BJ’s, Walmart: what did you, Eric, or you, John, actually buy at the cash register? And it’s collected at the row level like that. It’s actually pretty creepy what they can do with the data. They can be like, what’s the average credit score of the person that buys our K-Cups? They can do that because when they get IRI data, unless you pay in cash, they know who bought it. And luckily, I wasn’t responsible for that pipeline; one of my friends and colleagues was responsible for that pipeline. So I just got to consume that data as a resource within HANA. That got piped into HANA as well. And I was able to act
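The snapshot retention rule described in this answer (keep every daily forecast snapshot for a while, but keep the one-week, 13-week, six-month, and one-year lags forever) could be sketched roughly as follows. This is an illustration under stated assumptions, not the process Keurig actually ran: the exact day counts for each lag, the 30-day rolling window, and the function name are all hypothetical.

```python
from datetime import date, timedelta

# Lags (days between the snapshot and the sale date) that are kept forever.
# Assumed day counts: 1 week, 13 weeks, ~6 months (26 weeks), ~1 year (52 weeks).
KEEP_FOREVER_LAGS = {7, 91, 182, 364}

# Hypothetical rolling window: other snapshots are dropped after this many days.
ROLLING_WINDOW_DAYS = 30


def keep_snapshot(snapshot_date: date, sale_date: date, today: date) -> bool:
    """Return True if a forecast snapshot should be retained."""
    lag_days = (sale_date - snapshot_date).days
    if lag_days in KEEP_FOREVER_LAGS:
        return True  # one of the lags the business cares about long term
    # Everything else is only kept while it is recent.
    return (today - snapshot_date).days <= ROLLING_WINDOW_DAYS


today = date(2024, 6, 1)
sale = date(2024, 3, 4)
print(keep_snapshot(sale - timedelta(days=7), sale, today))   # 1-week lag
print(keep_snapshot(sale - timedelta(days=10), sale, today))  # old, unimportant lag
```

A scheduled job could apply a predicate like this to each snapshot slice and delete the slices for which it returns False.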
John Wessel 14:44
You say luckily. I assume that was a huge pain?
Jeff Skoldberg 14:47
I think it was a big lift because they didn’t have the integrations that they might have today. So you’re dealing with things like automating exports and stuff like that. Sure. Lots of ways like TO servers. Yeah, large, large, certainly. Very large volumes. Exactly. Because it’s every sale so
Eric Dodds 15:07
Fascinating. So far I feel like I’m learning things, like the details of some of the stuff that we haven’t really talked about on the show.
Jeff Skoldberg 15:18
Oh, totally. I love supply chain analytics, because it’s its own niche thing. Consumer packaged goods is certainly its own niche thing, but with general supply chain, if you make heaters, I can help you; if you make airplanes, I can help you. Because supply chain is about knowing how much product we have in stock today, when is more product coming in or when are we making more product, and how does that compare to our demand for the product? That’s all stuff that I almost have a template for now. It’s like an easy problem for me to solve. So I really enjoy talking about this stuff. And those are the types of clients where I really excel, even though I’ve had clients in all types of industries at this point.
Eric Dodds 15:59
John, I know you probably have a ton of questions, but one last question on this subject for me. Looking back over your time and all the changes that happened across a number of different vectors, the technology landscape has changed drastically. But what hasn’t changed, from your perspective? What made me think about this was you saying they had a BI team before you even joined. What types of things haven’t changed, even though the world you describe in 2008 is so drastically different in many ways, from a technological standpoint, for a company that would be setting up the same sort of infrastructure?
Jeff Skoldberg 16:36
Star schema. It hasn’t changed. Seriously, I learned about star schema in 2008 when I was at Keurig, because that’s how they designed their cubes. And nowadays we can take liberties, and every year at Coalesce there’s going to be a talk called “Is Kimball dead?” or “Is the star schema dead?” And they kind of say, yeah, you don’t really need to do that anymore. But realistically, what we’re doing is we’re just not following all of the Kimball best practices, like surrogate keys and foreign keys and all of this stuff, but we’re still keeping our facts and dimensions separate. And we’re still joining them right before we send the data into Tableau. We take our fact table, and we join it to our customer table, our product table, and our fiscal calendar. And then we send one big table to Tableau. So it’s not really a star schema, because we didn’t do it by the book, with all the extra steps and added time and complexity that Kimball said you should do. You don’t really need to do that anymore. But this concept of facts and dimensions is timeless. And actually, I hear Joe Reis talking a lot about how people don’t care about data modeling anymore, and how the average data model is very sloppy and isn’t following a rigid technique. And to a certain extent, I think that’s kind of okay. On one hand, it’s really cute to have a perfect star schema, but there’s the six weeks extra that it would take versus just getting the answer, like, here’s my query, and this gets the answer. Sometimes you have to look at time to value.
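The pattern described here, keeping facts and dimensions separate and joining them into one wide table right before handing the data to the BI tool, might look something like the sketch below. All table names, column names, and rows are hypothetical, and SQLite is used only so the example is runnable; the source does not show the actual models.

```python
import sqlite3

# Hypothetical fact and dimension tables, kept separate in the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (customer_id INTEGER, product_id INTEGER, sale_date TEXT, units INTEGER);
CREATE TABLE dim_customer (customer_id INTEGER, customer_name TEXT, channel TEXT);
CREATE TABLE dim_product (product_id INTEGER, product_name TEXT);
CREATE TABLE dim_fiscal_calendar (calendar_date TEXT, fiscal_week TEXT);

INSERT INTO fact_sales VALUES (1, 10, '2024-01-02', 500), (2, 10, '2024-01-03', 120);
INSERT INTO dim_customer VALUES (1, 'Grocer A', 'grocery'), (2, 'Club B', 'club');
INSERT INTO dim_product VALUES (10, 'K-Cup 24ct');
INSERT INTO dim_fiscal_calendar VALUES ('2024-01-02', 'FY24-W14'), ('2024-01-03', 'FY24-W14');
""")

# Join the fact table to customer, product, and fiscal calendar at the last step,
# producing the "one big table" that would be sent to the dashboard tool.
WIDE_TABLE_QUERY = """
SELECT f.sale_date, cal.fiscal_week, c.customer_name, c.channel, p.product_name, f.units
FROM fact_sales AS f
JOIN dim_customer AS c ON c.customer_id = f.customer_id
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_fiscal_calendar AS cal ON cal.calendar_date = f.sale_date
ORDER BY f.sale_date, c.customer_name
"""

wide_rows = conn.execute(WIDE_TABLE_QUERY).fetchall()
for row in wide_rows:
    print(row)
```

The joins here use natural keys rather than surrogate keys, which matches the "not by the book" shortcut discussed above.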
John Wessel 18:18
You always have to look at time to value, right, Eric?
Eric Dodds 18:23
Yes. Such a tired phrase. But so true.
John Wessel 18:28
ROI and time to value. Yeah.
Eric Dodds 18:32
John, I’ve been monopolizing the mic.
John Wessel 18:35
Yeah. So let’s dig into some of that cost optimization stuff. We talked before the show about how, starting in, let’s say, 2008, you probably had servers, and maybe you had them in Rackspace, right? That’s a nice 2008 hosting company. And then at some point, let’s say 2010, 2012-ish, you get into AWS. And you still have servers in AWS. And then eventually you’ve got things that come out like Lambda, and you go further and further down the road of basically pay per hour, and then pay per second. So there’s this evolution here. Maybe walk us through, from your data experience, how that impacted your job on a regular basis. Like, I was thinking about this in 2008, or 2010, or 2012; now I’m thinking about this in 2024. What’s the evolution there, specifically around cost optimization?
Jeff Skoldberg 19:34
Sure. I’ll kind of think about this in three chunks. There’s my pre-SAP HANA days, which is the world you described, where it’s a fixed cost, more or less. Your data gets bigger, you need to spend more, but you know what you’re going to pay for a year, because you pay for a year upfront, right? Then there’s my HANA days, and then there’s my post-2018 days, and that’s where I’ve seen more change. So from 2008 to 2018, the big change that I went through personally in my data endeavors was adopting SAP HANA. I’d like to talk more about 2018 onward, as now I’m getting into Snowflake and pay-for-compute models and stuff like that. It is such a drastic shift in mentality. One little prenote is that not all companies are there yet. One of my main clients today is using Amazon Redshift, and they’re not using serverless. They’re using pay per month. So they know what they’re getting upfront; it’s a couple thousand dollars a month, and it doesn’t change. It doesn’t matter if you use it or don’t use it. It’s a fixed cost. There’s something to be said for that, because they know upfront what they’re going to spend, and there’s not going to be, like, oops, I ran a BigQuery query without a LIMIT clause and accidentally spent $5,000. That is actually a problem that people have in BigQuery.
John Wessel 20:57
Yeah, no, it is. I’ve done that before, not quite to that extent. We had an analyst at my last company exploring some Google Shopping data, and I think it was like 500 bucks for like 20, 30 minutes, because of the dataset size. But to your point, companies budget annually based on a fixed cost model. That’s what your budget is, whatever it is, $30,000 for a warehouse per month at a large company, maybe per year at a smaller company. But then you have these variable costs, and in any sort of leadership position it’s actually really challenging to manage the cost, up or down. Obviously cheaper is always better in general, but if you manage it too well, you lose your budget, and then the next year you’re fighting to get the budget back. It’s a tough problem. Whereas before, it was like, let’s buy this, let’s depreciate it over three years; it’s an asset. It’s not just in that operating cost bucket like it is now for a lot of companies. I know you can buy reserved instances and sign a Snowflake contract for multiple years and accomplish some of the same things. But the mindset is tough. I’ve always had a tough problem with financing it and figuring out how to explain how much it’s going to be. “Well, we don’t know” is not a good answer, right?
Jeff Skoldberg 22:33
Yeah, totally. So I think it comes down to your organization needing to grow expertise in cost optimization for the warehouse that you’re using. And I’ll talk about Snowflake, because that’s what I know the most about. I think that most companies grossly overestimate their usage in the first year, and then go way over budget in the second year. So with Snowflake, I think it’s still a $25,000 minimum investment if you don’t want to pay on a credit card. You could sign up with a credit card, and your account costs 20 bucks a month if you use it just a little bit, whatever it is. But if you want them to send you an invoice instead, and you want a little bit of a discount, you have like a $25,000 minimum investment to deal with that sales rep. Most clients aren’t hitting that their first year, because they’re implementing and they don’t have a lot of usage. Year two, all of a sudden, it’s like, oops, we spent a lot of money. And so you really have to figure out how to rein it in. And there are certain areas we can look at to rein it in. Would you like me to go into a few of those?
John Wessel 23:47
Yeah, I think one interesting thing is the why behind it. Like, why did it blow up? And I think the positive answer would be: we’re trying to democratize data. We’ve got all these analysts writing queries who didn’t have access before. We have BI tools where people are building their own dashboards, and maybe even composing their own queries in the tool. So all that is theoretically a positive thing. But you did just democratize data, which means, from a cost standpoint, you just gave a bunch of people access to this thing where the meter runs every time a new person uses it.
Jeff Skoldberg 24:26
Well, hopefully that’s the case. But oftentimes, the transformation step is spinning the warehouse more than people are actually using the data. And so you want to know: what’s your ratio of dbt jobs running in your warehouse versus users actually using the data? And the other thing is, a lot of companies are paying more to load their data into Snowflake than their Snowflake bill, meaning that their Fivetran bill is higher than their Snowflake bill.
John Wessel 24:57
yeah.
Jeff Skoldberg 24:57
You want to know that, and you want to fix that right now, basically, because it shouldn’t cost a lot of money to load data into Snowflake. So that’s kind of looking at the entire stack.
John Wessel 25:10
By using something other than Fivetran? Is that what you’re saying?
Jeff Skoldberg 25:14
It is, actually.
John Wessel 25:15
Okay, I didn’t know. You don’t want to share with us? No?
Jeff Skoldberg 25:20
Yeah, I mean, sorry. I have my opinions, but Fivetran has a very well-known pricing problem in the industry. And I’m not here to talk down about any company by any means. But if you’re paying by the row, you want to pay less by the row, and paying by the row is really tough. So let me just give an example. This forecast accuracy thing that we talked about before, where they’re re-forecasting every product every day: that means the entire dataset changes every day. It’s not like event data, where you’re just getting new events and it’s like, oh, I want Fivetran to track the new events. It means that I have like a couple billion new rows every day. So right off the bat, you can’t use Fivetran, because you would never be able to afford it. And a lot of times in supply chain data, when you’re dealing with MRP, which is material requirements planning, or you’re dealing with inventory snapshots, or you’re dealing with forecasting, again, you have more data that changes every day than doesn’t. Like, by product, how much stock did I have at this warehouse today, and how much did I have yesterday? Almost all of it changed, right, if you’re a busy warehouse. You don’t want to pay by row for those types of things. You want to either come up with your custom loading or find the right tool. I love tools; I actually prefer tools over custom programs. I think that when teams adopt tools, they can move faster; they can be a leaner and smaller team by adopting a few smart tools. So you just want to look at how the tool is priced and make sure that it fits your use case.
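To make the pay-by-row point concrete, here is a back-of-the-envelope sketch comparing how many changed rows a month an append-only event table produces versus a snapshot-style table that is rewritten every day. The row counts are illustrative assumptions (the snapshot figure borrows the "couple billion rows every day" order of magnitude from the conversation), and no vendor pricing is modeled; the function name is hypothetical.

```python
def monthly_changed_rows(rows_changed_per_day: int, days: int = 30) -> int:
    """Rows a row-priced loader would count as changed over a month."""
    return rows_changed_per_day * days


# Append-only event data: only the new events count as changed rows each day.
# (2 million new events per day is an assumed figure for illustration.)
event_table = monthly_changed_rows(2_000_000)

# Daily re-forecast or inventory snapshot: essentially every row changes daily.
snapshot_table = monthly_changed_rows(2_000_000_000)

ratio = snapshot_table // event_table
print(f"event-style table:    {event_table:,} changed rows per month")
print(f"snapshot-style table: {snapshot_table:,} changed rows per month")
print(f"snapshot data is {ratio:,}x the billable volume")
```

Whatever the per-row rate is, the bill scales with that changed-row count, which is why full-refresh datasets fit row-based pricing poorly.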
John Wessel 27:01
Well, I think in Fivetran’s defense, they have done a nice job of developing a lot of quality connectors. If you compare them to a lot of the open source connectors, they’re just clearly better. So in, like, the marketing space, if you compare their HubSpot or Salesforce connectors versus some of the open source ones, and you’re not just a massive operation, it makes a ton of sense.
Jeff Skoldberg 27:29
It does make a lot of sense for a lot of data sources, the data sources that aren’t in the list I just mentioned, where you have massive amounts of changing records every day. For systems like HubSpot, or even Salesforce, you could do okay. I’ve seen Salesforce run out of control on the Fivetran side as well. But a lot of these connectors, I agree with you, are high quality, maybe the highest quality for reliability and accuracy and ease of use. When I evaluate another extract-load tool, I basically say, is it Fivetran-simple? And if it’s not Fivetran-simple, it’s kind of just off the list.
John Wessel 28:06
So the answer is usually no, right? There are some good ones out there now, but there are a lot of them where it’s like, this is just going to be too hard, or it’s a good tool but you need a more technical person to manage it.
Jeff Skoldberg 28:22
But circling back to where we started, you want to be hyper aware of: am I spending more on loading data into Snowflake than my entire Snowflake bill? Because that means you’re a little bit upside down, right? And then you also want to know: am I spending more on my transformation compute than my end users are consuming? And really see if you can get it so that your end users are the cause of your high Snowflake bill, because that’s what you really want. That means you’re paying the price for democratizing data, and that’s money well spent, whereas the other money might not be money well spent.
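The two checks described above (is loading costing more than the warehouse itself, and is transformation compute outweighing end-user consumption) can be written down as a small sketch. The class, field names, and dollar amounts are hypothetical, and treating the warehouse bill as just transformation plus end-user compute is a simplification that ignores storage and other charges.

```python
from dataclasses import dataclass


@dataclass
class MonthlySpend:
    loading: float         # extract/load tool bill, in dollars
    transformation: float  # warehouse compute used by transformation jobs
    end_user: float        # warehouse compute used by dashboards and analysts

    @property
    def warehouse_total(self) -> float:
        # Simplification: warehouse bill = transformation + end-user compute.
        return self.transformation + self.end_user


def cost_flags(spend: MonthlySpend) -> list[str]:
    """Return warnings for the two 'upside down' situations."""
    flags = []
    if spend.loading > spend.warehouse_total:
        flags.append("loading exceeds warehouse bill")
    if spend.transformation > spend.end_user:
        flags.append("transformation exceeds end-user consumption")
    return flags


# Hypothetical month: $5,000 to load, $2,000 to transform, $1,000 of actual usage.
print(cost_flags(MonthlySpend(loading=5000, transformation=2000, end_user=1000)))
```

An empty list is the state the speaker is aiming for: most of the spend comes from people actually consuming the data.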
John Wessel 28:55
Well, and the other thing too is I like to think of it as a pull model versus a push model. The push model is like: open Fivetran, connect everything you can find an API key for, check all the boxes, get all the data in. Versus a pull model of like: hey, there’s a business requirement, somebody actually needs something, so pull it through. Check the box in Fivetran, transform it in dbt, deliver it in your BI tool of choice. That’s an efficient model, but it’s kind of easier to do it the other way, right? Like, hey business, look at all this data we have for you, we have everything possible, and then you just list out every system the business uses. It could be 20 systems. And then, A, they’re overwhelmed, and B, you’re gonna waste a ton of money constantly ingesting and storing that data. Do you see people struggle with that, where they just kind of check all the boxes, suck it all in, and aren’t using most of it?
Jeff Skoldberg 29:50
I do, and every single Fivetran customer that I’ve helped has gone back and unchecked boxes. It’s like, how many boxes can we uncheck? What’s being used? If it’s not being used, let’s uncheck it.
Eric Dodds 30:05
So it is that hoarder mentality, though, where we’re like, yeah, we might need that, you know?
Jeff Skoldberg 30:10
Just check it again later. It’ll be right there.
John Wessel 30:14
Yeah, exactly, exactly.
Eric Dodds 30:18
One thing I was thinking about related to cost optimization: you said you have these frameworks that you use to help your clients think through data projects. Do they incorporate cost optimization? Like, are they sensitive to it? Could you explain your framework?
Jeff Skoldberg 30:38
I wish that they were. So maybe it’s my lack of being able to communicate this to the actual IT teams instead of business teams. But a lot of times, at least with the clients that I have, I’ll say that I stick with clients for a long time. So it’s usually a few clients, they have not been using my framework to optimize their costs. Unfortunately, even though I wish they were using my framework to solve business problems, but not always to optimize the costs. But I was gonna say we could shift into the framework or we can I there’s one or two more things I could say about, hey, yeah,
Eric Dodds 31:16
totally sorry. I didn’t mean to change gears. But yeah, let’s run the cost optimization. Sure. So
Jeff Skoldberg 31:22
I will, just to give people a few tips, the number one mistake that I see out there, which is a very expensive mistake is to use your virtual warehouses and Snowflake like cost centers. So create a supply chain, extra small, finance, extra small, HR Extra small. And so now you’re trying to apply like a cost center, or the department who’s using the warehouse to the warehouse, because a lot of times, I’ll log into the client Snowflake account, and all three of those extra small warehouses that I just said, are on at the same time, when one of them could just be on. So if you just had like, reporting Extra small. Now, instead of paying to have three warehouses on for three different teams, you just have one warehouse on that’s like, kind of like my number one tip is have as few virtual warehouses as you can even to the extent of like, just one of each size, and then just use those. And then the way that you actually apply your cost centers and figuring out who’s using the data is through either query tags, or putting comments in your DBT models, or there’s a great one, so I’m a huge fan of slack DOT Dev. They’re a Snowflake cost optimization company, they have kind of turnkey ways that you can see who’s using what data and what they do is they apply a comment to every query that gets executed in DBT, or Tableau, however you specify it so then I could say, okay, my supply chain team spent $30 consuming data and my HR team spent $100 consuming data. So it’s a much better way of allocating the cost centers than by splitting out your virtual data warehouses.
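Editor’s sketch: the tag-based allocation Jeff describes can be illustrated roughly as follows. This is a minimal Python sketch, not the method any particular tool uses. The record fields (`tag`, `size`, `seconds`), the sample queries, and the price per credit are all hypothetical, and cost is approximated as execution time multiplied by the warehouse’s hourly credit rate, which ignores idle time and queries sharing a warehouse.

```python
from collections import defaultdict

# Credits per hour by warehouse size (each size up doubles the rate).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}

# Hypothetical price; the real figure depends on contract and edition.
DOLLARS_PER_CREDIT = 3.0


def cost_by_tag(queries):
    """Sum an approximate dollar cost per query tag.

    Each query is a dict with hypothetical keys 'tag', 'size', 'seconds'.
    Cost is approximated as runtime times the hourly rate of the warehouse,
    ignoring idle time and concurrent queries on the same warehouse.
    """
    totals = defaultdict(float)
    for q in queries:
        hours = q["seconds"] / 3600
        totals[q["tag"]] += hours * CREDITS_PER_HOUR[q["size"]] * DOLLARS_PER_CREDIT
    return dict(totals)


# Hypothetical query log: every team shares the same warehouses,
# and the tag (not the warehouse name) identifies the cost center.
queries = [
    {"tag": "supply_chain", "size": "XS", "seconds": 1800},
    {"tag": "hr", "size": "XS", "seconds": 3600},
    {"tag": "supply_chain", "size": "M", "seconds": 900},
]

print(cost_by_tag(queries))
```

The point of the design is that the warehouse name carries no cost-center meaning at all, so teams can share one warehouse per size while spend is still split per team.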
John Wessel 33:05
Yeah, let me comment on that. I made that mistake early on.
Jeff Skoldberg 33:09
I did it. That’s how everyone did it. You have to learn. Yeah, but go ahead, John. Sorry.
John Wessel 33:14
Yeah, I made that mistake early on. Not too badly. But I had an ingestion warehouse, a transformation warehouse, and a reporting warehouse. So I split it by workload. And, you know, you look at the bill, and then the logic of like, oh, wow, all three of those are running, guess what, that’s three times as much as one of them running. So then we just tried, like, hey, what if we just literally combined everything into one? And everything was fine. Didn’t have any performance problems at all. So definitely turn on the auto scaling then.
Jeff Skoldberg 33:46
So yeah, sure. For your extra small, let that scale up to maybe three nodes. Or, yeah, three clusters, actually.
John Wessel 33:54
Let’s actually talk about auto scaling just for a second. But I’m curious, what’s the trade-off? So you have a smaller warehouse that runs for longer to do something, versus a larger warehouse that can run for a shorter time, but it’s more expensive. What do you think about that?
Jeff Skoldberg 34:13
When we’re talking about scaling up to the next size warehouse, yeah. If the query runs for more than two minutes, then you could try and get it down to one minute. So the one minute is the minimum billing increment. If you’re running a query for exactly a minute, you’ve had 100% utilization. So you don’t want queries running for less than a minute on a larger warehouse size, but you want them running the shortest amount of time possible, down to one minute. So that’s one kind of marker that you can go by for a nice price. And the way that you could tell if scaling up is actually going to help you is if you look in the query profile, there’s something called disk spillage. And then there’s spillage to remote storage, which means it actually spilled to S3. Okay, so a virtual warehouse is actually a computer, right? So it’s trying to process as much as it can in RAM. And when the RAM runs out, it’s spilling to disk. But when you fill up the hard drive on that virtual computer, it’s now dumping out to S3. And if it’s spilling to S3, you know for a fact that going to the next warehouse size up is going to help: it’s going to have a bigger hard drive, and it won’t dump to S3, because now it has more RAM, right? It would have more RAM, it has more everything. That’s right. So looking at disk spillage, and specifically the remote spillage, is how you can tell if scaling up will help. And then you don’t want to scale up until your queries run in two seconds. You want it to run for a minute. Let’s say the query is taking an hour on an extra small. You could go orders of magnitude up until that thing is running in a minute, and it’s basically costing the same.
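Editor’s sketch: the arithmetic behind “running in a minute and basically costing the same” can be worked through as below. This is a simplified Python illustration under two loud assumptions: the speedup is perfectly linear with warehouse size (real queries rarely scale this cleanly), and the 60-second minimum is applied per query, following the framing in the conversation. The function names are hypothetical.

```python
# Credits per hour by warehouse size (each size up doubles the rate).
CREDITS_PER_HOUR = {
    "XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16, "2XL": 32, "3XL": 64,
}

# Minimum billed time, per the one-minute increment discussed above.
MIN_BILLED_SECONDS = 60


def credits_for_query(size, runtime_seconds):
    """Credits billed for one query, with the 60-second floor applied."""
    billed = max(runtime_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed / 3600


def ideal_runtime(xs_runtime_seconds, size):
    """Runtime on a larger size, ASSUMING perfect linear speedup."""
    return xs_runtime_seconds / CREDITS_PER_HOUR[size]


# A query that takes one hour on an extra small.
for size in CREDITS_PER_HOUR:
    runtime = ideal_runtime(3600, size)
    credits = credits_for_query(size, runtime)
    print(f"{size:>3}: {runtime:8.2f} s, {credits:.4f} credits")
```

Under these assumptions the cost stays flat while the runtime halves at each size, until the runtime drops below 60 seconds. Past that point the billing floor means a bigger warehouse costs more without the query finishing any cheaper.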
John Wessel 35:56
Yeah, nice. That’s super helpful. That is very cool.
Jeff Skoldberg 35:59
And then there’s this concept of scaling horizontally. So you can have an extra small warehouse, and then you could set min clusters and max clusters. So I could call it Jeff’s extra small, okay? And I’ll say max clusters is three and min clusters is one. What that means, and that’s for a concurrency issue: if I’m only using it, it’ll just be one cluster running. If now all of a sudden there’s like 40 or 50 Tableau users using that same warehouse at the same time, it’ll just automatically spin up an extra cluster of Jeff’s extra small. So now it’s scaling horizontally to handle concurrency, versus scaling up to handle a harder query to process. So I think three is a good number just to start with. So let all your work use the smallest warehouse you can, and let it scale up to three if you think you need to. And for Tableau, a lot of times I’ll start Tableau with a medium, just because I want Tableau to be a little bit faster. But again, letting it scale up to three if it needs to. And it almost never does. Yeah, it has to be at least eight queries for it to scale up one more.
John Wessel 37:08
Well, when you think about your average workload, right, at a midsize company, or even a larger company, I would guess that there’s certain peak hours, and then peak times a month, where you may have, you know, 100 people or 80 people or whatever, but the majority of the time, even during the workday, it’s going to be a fraction of that. So that makes a lot of sense.
Jeff Skoldberg 37:34
Totally, totally. Yeah. And what’s really nice is that Snowflake will then automatically handle it by, hey, what’s the number of queries I have in my queue right now? Okay, let’s turn on another one.
John Wessel 37:44
Just kill that queue. Yep. Nice.
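Editor’s sketch: the min and max cluster behavior discussed above can be modeled, very roughly, as a clamp on the number of clusters needed for the current concurrency. This Python sketch is a toy model and not Snowflake’s actual scaling algorithm, which depends on queuing and the configured scaling policy. The figure of eight queries per cluster is taken from Jeff’s rule of thumb and is an assumption here, as are the function and parameter names.

```python
import math


def clusters_running(concurrent_queries, min_clusters=1, max_clusters=3,
                     queries_per_cluster=8):
    """Toy model: how many clusters a multi-cluster warehouse would run.

    ASSUMPTION: one cluster absorbs `queries_per_cluster` concurrent
    queries before another is started. The result is clamped to the
    configured min and max cluster counts.
    """
    needed = math.ceil(concurrent_queries / queries_per_cluster)
    return max(min_clusters, min(needed, max_clusters))


# One analyst, a busy morning, and a 50-user Tableau spike.
for load in (1, 9, 50):
    print(load, "concurrent queries ->", clusters_running(load), "cluster(s)")
```

Capping `max_clusters` bounds the worst-case spend during a spike, while `min_clusters=1` keeps the quiet hours as cheap as a single shared warehouse.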
Jeff Skoldberg 37:48
That’s my best cost-saving tip. That’s almost like the free lunch one that anyone could do. You could do it right now without making a huge impact on your organization. You do have to be a little bit careful, because you might break some jobs which were using the warehouses that you’re deleting. So yeah, do them one at a time, see what breaks. Well, first, understand ahead of time what you think is gonna break, fix it, turn off one, see what breaks, you know, that type of thing.
Eric Dodds 38:11
I love it. It’s super helpful. Why don’t we talk about your frameworks to round out the show?
Jeff Skoldberg 38:19
I’d love to. So when a client comes to me with a particular business problem, and it’s normally the business teams reaching out to me more so than the IT teams, I’ll just kind of put that on the table, it’s very much like there’s a business problem that someone’s trying to solve. There’s two different frameworks that I walk them through. And the overarching idea that sits on top of these two frameworks is this concept that the only reason to use data is to control a business process. So you want to use data to control something. It’s not an FYI thing. For example, how much did we sell last week? Let’s even generalize it further to a sales dashboard. Why have a sales dashboard unless you’re trying to control your sales to meet that target? A sales dashboard isn’t an FYI thing. It’s: are we marching towards our goal? And if we’re behind our target for this point in the year, what are the reasons why I’m behind my target? So that’s kind of the overarching thing that sits on top of my two frameworks. But framework number one just comes from the Six Sigma manufacturing methodology. And within Six Sigma, they have this process called define, measure, analyze, improve, control, and you can see this framework is set up so that it ends with control. So first, we start thinking about controlling a business process. And then we say, here are the steps to actually control the process. We’re going to define what you’re trying to measure and what your problem is. We’re going to define it, then we just figure out how to measure it, analyze the result, what are you gonna do to improve it, and then you just get to this point where all you’re doing is using the dashboard to control the process and make sure that the process is in control. And you can apply this to anything. You can apply it to sales, to forecast accuracy, to your Snowflake spend.
So the one thing that I love about this tool called SELECT, that I mentioned earlier, is that it has dashboards that show you where you’re spending, what your most expensive queries are. But it really comes down to this fact that we’re going to use it to control the process of getting our spend under control. So that’s kind of framework one. And then framework number two is, what should a dashboard do? What should be on a dashboard? Well, every dashboard should do three things. It should say, where are we today? What is the trend? And what is the call to action? And so if we unpack each one of those three: where are we today, that’s at the top. They call it a BAN, a big-ass number at the top, that’s like, the good, the bad. Like, this Tableau dashboard came to my email, I see a picture of the dashboard in the body of the email, and there’s a green check at the top next to a number, I know I’m good. Or there’s a red X at the top next to the number, I know I’m bad. So that’s where we are today. And we’ll just use the sales dashboard example, because it’s a very simple concept for people to think about. If you’re behind your target, it’s an X, right? So number two of all dashboards must be, what is the trend? So this might be your sales in weekly buckets on a line chart. So the simplest tools in Tableau, the simplest chart types, are the best chart types. So we have a line chart that says, here’s our sales by week, here’s our target by week, which weeks were we above and below the target. And same week last year, maybe. So maybe have three lines on a chart, so you could see where this year is compared to your target and last year. So now you have a really good picture of the trend. And then the third thing is, what is the call to action? And that, said another way, is: what are the items within my business topic that are driving my business results?
And if we think about a sales dashboard, it’s like, well, these three sales reps are really behind on their sales, they’re the problem. Or these three product IDs are really behind on the sales, it’s these products that aren’t doing well, or these brands, or these package categories, or whatever it happens to be. But at the bottom of the dashboard, below the trend chart, is going to be a bunch of different bar charts, usually, that can then act as filters on the stuff above it. So I could click on this sales rep’s name, and then the whole dashboard filters to just that sales rep. And I could see how far behind he was, and what his products were, and how his products are doing. And so that brings it all together. We have now: where are we today? What is the trend? What’s the call to action? And I can use that whole thing to more or less bring the process under control. So it’s the job of the manager who’s consuming the dashboard to then use it to go figure out why the problems are the problems. That’s my framework.
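Editor’s sketch: the three questions every dashboard should answer (where are we today, what is the trend, what is the call to action) can be expressed as a small data summary. This Python sketch is only an illustration of the structure. The function name, the input shapes, and the sample sales figures are hypothetical, and a real dashboard would of course be built in a BI tool rather than in a script.

```python
def dashboard_summary(weekly_sales, weekly_targets, sales_by_rep,
                      target_by_rep, top_n=3):
    """Compute the three parts of the dashboard framework from raw numbers."""
    # 1. Where are we today? One big number plus a good/bad indicator.
    total_sales = sum(weekly_sales)
    total_target = sum(weekly_targets)
    status = "on target" if total_sales >= total_target else "behind target"

    # 2. What is the trend? Weekly gap between sales and target.
    trend = [s - t for s, t in zip(weekly_sales, weekly_targets)]

    # 3. What is the call to action? The reps furthest behind their target.
    gaps = {rep: sales_by_rep[rep] - target_by_rep[rep] for rep in sales_by_rep}
    behind = sorted((rep for rep in gaps if gaps[rep] < 0),
                    key=lambda rep: gaps[rep])

    return {
        "status": status,
        "total_sales": total_sales,
        "trend": trend,
        "call_to_action": behind[:top_n],
    }


# Hypothetical figures for three weeks and three sales reps.
summary = dashboard_summary(
    weekly_sales=[100, 90, 120],
    weekly_targets=[110, 110, 110],
    sales_by_rep={"ana": 150, "ben": 60, "cy": 100},
    target_by_rep={"ana": 110, "ben": 110, "cy": 110},
)
print(summary)
```

The call-to-action list is sorted by how far each rep is behind, which mirrors the bar charts at the bottom of the dashboard that point the manager at the biggest contributors first.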
Eric Dodds 43:16
Yeah, I love it. Okay, one question right off the bat, and I think I know the answer. But do you find that this increases or decreases the number of dashboards in a company?
Jeff Skoldberg 43:28
I think it does increase, because they want to control more points, right? But I do like this idea of, I mean, Ethan Aaron always talks about 10 dashboards, which I don’t think is a reasonable number of dashboards for an organization that has more than 10 departments. I mean, just think of any company that makes something that goes on the grocery shelf, right? They have manufacturing, they have distribution, they have purchasing, human resources, and finance. They have more than 10 departments. You can’t just have 10 dashboards, right? But each team should not have more than 10. And each individual team maybe should only have like five KPIs that they’re really looking at. So if you’re the supply chain team, you should care about things that are in stock, things that are late, things that aren’t shipping on time, how your inventory is, and is your forecast accurate. That’s hopefully five things that I just said. And then anything else that you want to measure, you say which one of those five it’s going under, and then you end up with maybe 50 dashboards for the whole entire organization. But to get back to your question, it does beg a lot more questions. And they say, well, that was so effective, what else can we control? And I just did this at one of my clients where the first dashboard went live. They put it on the monitors throughout the building, and everyone was looking at it as they were walking around throughout the building. And instantly this particular KPI, within like one week, got so much better, like hundreds of percent better, basically. It was just a marked improvement, because people’s names are on the dashboard. If you put someone’s name on a dashboard with a problem next to their name, they’re gonna go clean up that problem, and the problem starts getting cleaned up right away.
And you want to be careful because you don’t want a punitive culture, but you also want an effective culture as well.
John Wessel 45:19
Right. Tell me about this. This is something that I guess I hadn’t thought about in this context, but I’ve done before. From a dashboard perspective, dashboards are like anything you put up. So you mentioned that first week, it got a lot better, right? But that same dashboard a year later, people just walk by and ignore it. So one of the things that I’ve done in the past that I’ve found works is keeping it visually fresh, but also just almost like rotating, because it’s like, alright, team, we’re gonna focus on improving this metric. And almost visually, like, alright, that’s the biggest number, and everything else is still there. And, you know, almost keeping people’s interest as part of the strategy, and keeping people’s focus. Whereas you’re focusing on five things versus one thing. So let’s pick out what we think is most impactful this month or this quarter, focus on that, make that really big, make that the focus. Still putting everything else out there, because we don’t want to completely drop off on the other four, but we want to focus on the one. Have you seen people implement that strategy, or thought about that?
Jeff Skoldberg 46:28
One thing that I’ve done is, when it’s time to say, hey, this dashboard is no longer the thing that we need, it’s not the hot topic anymore because we’ve achieved control, basically, you just put a watch on it. You can sunset the dashboard and just do a Slack alert that’s like, hey, if this number goes above 300, I want to know, because then maybe I can pull that dashboard out of archive and see what’s happening. So yeah, I think just putting a watch on things. You could use Tableau alerts, or you could use Slack alerts if you have a pipeline tool.
John Wessel 47:04
So yeah, nice.
Eric Dodds 47:05
Awesome. We’re at the buzzer. And this has been an amazing conversation, Jeff. I mean, I don’t actually think we’ve discussed SAP HANA on the show, maybe ever, but we learned some really interesting things about it. We learned how to optimize Snowflake, and I love your framework on the dashboarding. I mean, that really is, I was thinking back through, you know, unfortunately, in all sorts of analytics and BI stuff, I’ve never used a framework that clear. But as you were describing that, I was thinking back on which ones worked really well, you know, which ones can I look back on and be like, that was actually extremely effective. And they basically all aligned with that almost exactly. You know, it’s like, oh, yeah, the big number at the top where it’s like, this is good or bad, and that’s really the only thing that matters. Like, yeah, it was really effective.
Jeff Skoldberg 48:09
Awesome. I’m so glad to hear that conversation was enjoyable. And yeah. Thanks a lot, guys for having me on and asking great questions.
Eric Dodds 48:17
Absolutely. Absolutely. Have a great one up in the Green Mountains. And we’ll see you out in the datasphere on LinkedIn. Peace. The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.