This week on The Data Stack Show, Eric and Kostas chat with Vinoo Ganesh, a founding team member at Bluesky Data. During the episode, Vinoo discusses how to benchmark cost, how to optimize your workloads, and the role Bluesky plays in addressing your Snowflake bill.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome back to The Data Stack Show. Today we are going to talk about a really interesting topic: the ROI of all of the data workloads that you run. Kostas, I know that you have questions about what the definition of a workload is, and we want to dig into that. We’re going to talk with Vinoo from Bluesky. And what I’m really interested in is that on their website, they say, if you’re spending $50,000 or more on your Snowflake bill, you should talk to us because we can help you drive better ROI, which is fascinating. So I want to know about that number. I also want to know about their relationship with Snowflake, right? Because they’re in the business of reducing your Snowflake bill, so are they even friendly with Snowflake? That’ll be interesting. I have so many questions to ask. And then, of course, what does the product do? But how about you?
Kostas Pardalis 1:19
Yeah, I mean, it’s not that difficult to spend 50 grand on Snowflake, right? You know how easy that is. So yeah, it will be very interesting to hear some war stories, let’s say, about what they experience with their customers. I definitely would like to chat about the definition of workloads, and what they see out there in terms of what is the most expensive part of the operations around data. And the other thing, which I think is going to be really interesting: I know that right now the product is focusing on Snowflake, but what would it mean to take this kind of product, this kind of service, and deploy it on different data warehouses and data lakes, on data infrastructure in general? So I think it’s going to be a very interesting conversation. There’s a lot of discussion lately about the cost of Snowflake, so I think it’s the right timing to have this conversation today.
Eric Dodds 2:20
I agree. Well, let’s dive in and talk to Vinoo.
Vinoo, welcome to The Data Stack Show. We are super excited to chat today.
Vinoo Ganesh 2:28
Thank you. Very excited to be here.
Eric Dodds 2:31
All right, well, give us your background. You’ve done some really interesting things in some really interesting industries. So tell us about your background and what led you to Bluesky.
Vinoo Ganesh 2:41
Absolutely. I started my career off at Palantir and was there for almost seven years. I did virtually every job you can imagine, from software engineer building some of our core distributed systems, to salesperson selling our product, to deploying it in commercial, healthcare, and military environments, before eventually leading the core compute team. So at one point, every bit and byte of data that flowed through Palantir flowed through my team. After Palantir, I realized that we had built these incredibly powerful analytical tools, but a lot of our customers, and a lot of analytics tool consumers generally, didn’t have the data size or scale to warrant the power of those tools. So I decided to focus on that area and built a data-as-a-service company, Veraset, that’s doing about $15 million ARR now and still chugging along. Veraset really focused on how we take huge amounts of data and make that data accessible to consumers, who then don’t have to do these crazy expensive cleaning operations. After that, an old friend from Palantir reached out, and I ended up joining Citadel, the hedge fund, leading business engineering for Ashler Capital: building all the tools and technologies, and managing the data engineering team, for the last-mile aspects of portfolio managers’ alpha-generation processes, basically helping people make money effectively. After that, another mutual friend made an introduction, and I got introduced to Bluesky, where I am right now on the founding team. Our goal is really to provide an optimization mechanism, a mechanism for people to introspect their own query workloads and their own data cloud workloads, and really get the maximum ROI out of their data cloud. And that spans a number of dimensions.
Eric Dodds 2:59
Awesome. Okay, I have a question about your time at Palantir, because the spectrum of job titles that you mentioned is astounding in many ways, right? You just rarely ever hear of someone who goes from software engineering to sales to owning the data platform, and it sounds like there were multiple jobs in between. I would just love to know: having that breadth of perspective inside of a single organization, what were some of the most interesting or unexpected things that you learned doing such drastically different roles?
Vinoo Ganesh 4:43
Absolutely. I think this is one of the things that Palantir does best, where you can have 10 different jobs under the same umbrella company. So first and foremost, the reason that I ended up forward deploying, as they would call it, is that a lot of the design decisions that my co-engineers and I made on some of the early data storage products were not always optimal. And until we had real customer workloads (using the word workloads again), and real workload understanding, there was no way we could have designed a system that actually made sense. So I think the first big and surprising thing (anyone who’s shipped production software will know this) was truly how different production is from development: really understanding how we build software, and how we develop it with user focus, especially when the tools and technologies aren’t directly consumer-facing. You wouldn’t think of a distributed system or a data storage system as being particularly customer-facing, but all the decisions that you make, from design to compliance to storage, directly affect the user experience. The second thing is the value of having a technical slant, not necessarily in your sales cycle, but augmenting your sales cycle. Being able to communicate with the people purchasing your software with a deeper understanding of why things are implemented the way they are, and of some of the challenges and limitations, gave me a lot of respect for the engineering background that I had. And conversely, going back to the engineering side, really understanding how hard it is to move a contract from an initial POC to an enterprise agreement. So difficult.
Eric Dodds 7:17
Yeah. Yeah, that’s great. I’m glad you brought that up, because I was going to ask you about the engineering side. A lot of our listeners are technical, so it’s really helpful to hear that perspective on the sales side, right? Because I’m sure, for salespeople, building production software probably seems really, really hard. Moving a contract is difficult, too.
Okay, well, I know Kostas has a bunch of questions, but I actually want to start with a really specific data point that you list on the Bluesky website, and I think this will be a great jumping-off point. There are multiple vectors to ROI, but a big one is cost. And when we think about data infrastructure, and the term workloads that I know we want to dive into, ultimately it kind of boils down to: what is it producing versus what does it cost you, right? And on your website, you list a number, spending $50,000 annually on Snowflake, which is interestingly specific. I’m a marketer, so it stuck out to me for that reason alone. But I’d love to know: why that specific breakpoint? And whether or not it’s the perfect number as a proxy for what Bluesky helps solve, what is represented underneath it? Specifically, I’d love to know how we can help our listeners benchmark cost.
Vinoo Ganesh 8:50
Absolutely. So I will say, transparently, the 50k number was kind of a number that was just picked. However, it’s one of those things, almost like a backronym, where we picked the number and then realized, wow, this is actually indicative of something pretty real. I think what’s been really interesting, especially with something like the Snowflake ecosystem, is starting off as a small-scale user. Snowflake is an incredibly powerful tool, SQL is really easy to write, and there are all these built-in integrations. That’s a blessing and a curse. The number one thing that I’ve heard from our Snowflake customers is that the speed at which you can ramp up your Snowflake spend by doing things that genuinely add business value is unparalleled. Deploying something like a Sigma or a dbt: these are incredibly powerful technologies and tools, but they almost add an exponential growth aspect to your spend. And so that number in particular almost marks the precipice of, I’m now going to become a heavy Snowflake spender, a heavy Snowflake user. Bluesky has customers and partners that span anywhere from that 50k number up to double-digit millions of Snowflake spend. And where you are on that data journey, and when you actually decide to engage us, tells a lot about how you think about your utilization of a data cloud. So we picked that number largely because it really does mark the precipice of starting to expand your utilization and your Snowflake footprint.
Eric Dodds 10:28
Super interesting. And one follow-on question to that, which is probably multiple questions packaged into a single one. Let’s use the examples that you mentioned, right? Of course, SQL is easy to write, and it is wonderful that we live in an age where you can deploy Snowflake, start writing SQL, and get a huge amount of value in a really short amount of time. But then think about a tool like Sigma, or dbt, which might be a better example: dbt in particular is pretty low in the stack in terms of where it interacts with the data, and then it pushes value out in a large variety of contexts, right? So you almost have what I would call ROI fragmentation. There’s the cost side of it, but then how do you think about ROI when it’s fragmented like that, touching so many parts of the business? That’s not necessarily a simple calculation.
Vinoo Ganesh 11:37
Definitely. Maybe this is the finance side of me, but anytime I think about ROI, I think about whether I am effectively deploying my capital as a business. What I mean by that is not necessarily, am I spending a certain dollar amount, but whatever dollar amount I am spending, am I actually spending it in the most effective way possible? I keep using this car analogy: if I’m driving around in a car, I can either be very gas efficient, or very bad at consuming gas. If I’m slamming the brake or slamming the gas, I’ll burn through a lot of gas really quickly. Even the car that I use matters; if I’m driving a Hummer around, it’s going to be guzzling gas like crazy. I’m paying for the gas either way, but does that gas consumption actually add the value it should add to my business? So when I think about ROI, I don’t just think about whether I’m getting a dollar of value back for the cost of compute I’m putting in, but whether I’m deploying that capital for my business in the most effective way possible. As a concrete example, I think dbt is a super powerful tool, right? Being able to test and almost have a CI/CD process around SQL is incredibly powerful. Absent something like dbt, you can run a series of failed queries, one after another after another, just cranking up more and more cost. Letting a query run and fail is a horrible way to deploy capital. If I can instead use a tool like dbt (dbt has its own costs), get that failed-query capital back, and deploy it against another business-critical problem, that to me is a much better ROI and a much better deployment of capital.
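Vinoo’s failed-query point is easy to make concrete. The sketch below is hypothetical: the record fields, statuses, and credit numbers are made up for illustration, and this is not Bluesky’s or Snowflake’s actual API; it just shows how failed runs show up as capital that bought nothing.

```python
# Hypothetical sketch: tally spend lost to failed queries from
# query-history metadata. Field names here are illustrative.

def wasted_credits(query_history):
    """Sum approximate credits consumed by queries that ultimately failed."""
    return sum(q["credits"] for q in query_history if q["status"] == "FAILED")

history = [
    {"id": "q1", "status": "SUCCESS", "credits": 4.0},
    {"id": "q2", "status": "FAILED", "credits": 2.5},  # ran a while, then died
    {"id": "q3", "status": "FAILED", "credits": 1.5},  # retried, died again
]

print(wasted_credits(history))  # credits that produced no business value
```

Half of the credits in this toy history bought nothing; a test-before-run workflow like dbt’s is what claws that capital back.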
Eric Dodds 13:30
Okay, I’m going to ask one more question, but Kostas, I feel like I’ve been hogging the mic. Vinoo, could you help us understand: let’s say I’m looking at my Snowflake bill. It’s 75 grand, and we’re starting to have internal discussions like, wow, this cost has really ramped up. Describe the process of how Bluesky would come in and help us address that situation.
Vinoo Ganesh 14:00
Absolutely. The first thing is, as engineers, we start with data, right? We want to understand not just what the bill is, but what actually makes up that bill, and why it looks the way that it does. Notably, we never need access to any of your business data, or anything other than the metadata of your query history. From that, using some proprietary algorithms, we can tell where your compute is actually going. Am I overspending in my Snowflake warehouses on a whole bunch of idle compute? Do I have massive queries that take up thousands of credits in a single execution? So we really start with an understanding of the information we have on the ground from Snowflake. From there, we start introspecting by adding our own kind of flavor and opinions into our product. Some of the examples I gave, like warehouse idle credits, or even the ability to look at a query and say, this insert is ordered by a particular column, so consumers of that table should take advantage of that ordering and filter where they can. Those are insights that we can surface as well. So it starts with understanding and onboarding the unique aspects of a data cloud, and then we look forward. It’s really easy to say, okay, we’re in this position now, let’s do a P-zero tourniquet cost-cutting exercise, and then end up in the same situation three months from now when the spend has grown again. So what we instead do is also provide tools and mechanisms for controlling costs, from a guardrail perspective, as you move forward.
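The warehouse-idle insight Vinoo mentions can be approximated from metadata alone. A minimal sketch, with simplifying assumptions: per-second billing with no minimum increments (Snowflake’s real billing has them), and illustrative credit rates and durations.

```python
def idle_credits(running_seconds, busy_seconds, credits_per_hour):
    """Credits billed while the warehouse was up but no query was executing."""
    idle = max(running_seconds - busy_seconds, 0)
    return idle / 3600.0 * credits_per_hour

# A medium warehouse (4 credits/hour, illustrative) up for 10 hours
# but actually executing queries for only 2 of them:
print(idle_credits(10 * 3600, 2 * 3600, 4))  # 32.0 credits of pure idle
```

Eight of every ten billed hours in this example bought no query execution at all, which is exactly the kind of signal that falls out of query-history timestamps.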
These are ways of coalescing functionally equivalent queries, or semantically equivalent queries, together to actually attribute a cost, or of highlighting something like a misconfiguration, where I’ve sized a warehouse a particular way when the workload doesn’t actually warrant that sizing. So it’s a data-driven approach that really starts with visibility, before extending into this insights level of what you can manually change, before building toward Bluesky’s end vision, which is an automated tuning and healing layer. Eventually, you’re going to get tired of implementing these insights by hand; maybe you just turn on autopilot, and we can figure out what to do for you.
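Coalescing “semantically equivalent” queries usually starts with fingerprinting: normalizing away the parts that vary between runs so repeated parameterized executions group together for cost attribution. A toy sketch; real systems normalize the parsed AST rather than raw text, and this regex version is only illustrative.

```python
import re

def fingerprint(sql):
    """Collapse literals and whitespace so repeated parameterized runs
    of the same query text share one fingerprint."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)           # string literals -> ?
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)   # numeric literals -> ?
    s = re.sub(r"\s+", " ", s)               # normalize whitespace
    return s

a = "SELECT * FROM orders WHERE day = '2023-01-01'"
b = "select *  from orders where day = '2023-01-02'"
assert fingerprint(a) == fingerprint(b)  # same workload, different run
```

Summing credits per fingerprint, instead of per query ID, is what turns ten thousand dashboard refreshes into one attributable line item.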
Eric Dodds 16:26
Super interesting. All right. That’s a great point on which to hand it off to Kostas. Kostas, thank you for your patience.
Kostas Pardalis 16:36
Thank you, Eric. So Vinoo, let’s talk a little bit about workloads. People are using data warehouses, obviously, for analytical purposes, but there are many different things that happen in a data warehouse before we can get a dashboard or a report or whatever, right? So can you help me understand how you define workloads at Bluesky? And then we’ll get deeper into that. Let’s start with this.
Vinoo Ganesh 17:13
Sounds good. In the older terminology, there’s OLTP and OLAP, and batch versus streaming versus analytical compute. For me, a workload really goes back to that finance framing: it is the way I’m deploying my capital in my data cloud. So a workload can involve anything from me writing data, to persisting that data, to Snowflake doing its auto-clustering behind the scenes, to me repartitioning data, or even things like a reverse ETL process writing data out of the cluster. These don’t necessarily fall into the clean definitions of what were previously, you know, “I’m a batch-compute-heavy workload.” It’s much more about how I’m utilizing that compute. That’s how I think about a workload.
Kostas Pardalis 18:10
So what are some common categories of compute utilization that you see out there?
Vinoo Ganesh 18:18
So first and foremost, and I would have never expected this before Bluesky, although I think you can kind of guess it, the big one is really BI: all the business intelligence tools like Looker, Tableau, Sigma. There’s such an inundation of wanting to get insights out of my data, out of my system, that BI actually accounts for a huge amount of the workload. And whether or not these dashboards are actively used or consumed, they are the ones firing these automated queries. The challenge with BI, especially from a workload perspective, is that a BI tool isn’t working nine to five. It will execute its queries whenever it wants; it will do data refreshes at any time. So your heaviest consumers can actually be something like BI tooling. The second, and I’m going to pick the ones that I think are unique, is maintenance. Few people actually think about maintenance in terms of what needs to happen for your data cloud to operate optimally. These are literally things like Snowflake’s repartitioning or reclustering (I’m using partitioning and clustering interchangeably here), where I want my data clustered a certain way, but there are maintenance operations that need to happen on Snowflake’s compute behind the scenes to ensure that I’m able to read tables the way that I want, and that the tables look semantically the way I want them to. So I group all of that into maintenance, which I think is distinctly separate from something like compliance: CCPA, GDPR, these right-to-be-forgotten workloads. I actually group those into a separate type of workload, because they involve either a cyclical linear scan, or taking advantage of some unique file-format way of scanning through your data, and then actually deleting and making incremental changes.
So I think those are the unique ones. Of course, you also have your analytics, someone running an ML job or a one-off SQL query and getting a table back, and you have the ETL pipelines that come from a variety of sources as well.
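The buckets Vinoo walks through (BI, maintenance, compliance, ETL, ad hoc analytics) could be roughly approximated with rules over query metadata. Everything below is a hypothetical sketch: the field names, the service-account names, and the matching rules are invented for illustration, not how Bluesky actually classifies.

```python
def classify(q):
    """Crude workload bucketing from query metadata (illustrative rules only)."""
    user = q["user"].lower()
    sql = q["sql"].lower().strip()
    if user in {"looker", "tableau", "sigma"}:       # automated BI refreshes
        return "bi"
    if q.get("system_initiated"):                    # e.g. background reclustering
        return "maintenance"
    if sql.startswith("delete"):                     # GDPR/CCPA-style deletes
        return "compliance"
    if sql.startswith(("insert", "copy", "merge")):  # pipeline writes
        return "etl"
    return "ad_hoc"

queries = [
    {"user": "looker", "sql": "SELECT region, SUM(x) FROM t GROUP BY 1"},
    {"user": "snowflake", "sql": "", "system_initiated": True},
    {"user": "pipeline", "sql": "MERGE INTO t USING s ON t.id = s.id"},
    {"user": "ana", "sql": "SELECT * FROM t LIMIT 100"},
]
print([classify(q) for q in queries])  # ['bi', 'maintenance', 'etl', 'ad_hoc']
```

Even rules this crude make the BI-heaviness Vinoo describes visible: sum credits per bucket and the dashboard traffic usually dominates.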
Kostas Pardalis 20:31
Yeah. It’s interesting that you didn’t mention ETL as one of the main workloads out there. Why is that? Do you just include it as part of BI? How do you see ETL being part of the workloads there?
Vinoo Ganesh 20:44
That’s a great question. ETL is kind of the bedrock: I’m using data, and there’s going to be some cleaning process, some transformation process, some load or extract process. They all live in the same ecosystem. But the reason I don’t think about ETL as as prevalent of a workload is that it actually tends to be the place where people are investing most of their time and energy. It’s not the long tail of, you know, I built this dashboard two years ago and forgot about it. It’s clear and present business value, because every day these tables need to be updated, transformed, written to. So deploying capital against ETL jobs is almost an easier justification than deploying it against BI tools, which may not have the right consumers or may not generate as much business value.
Kostas Pardalis 21:39
Makes sense. Okay, you mentioned BI, maintenance, and compliance. In terms of what you’ve seen out there, I would expect that BI is one of these things that is kind of predictable, in a way. Outside of interactive analytics, where obviously you need to see things live on top of your BI tool, experimenting with queries and all that, when you have dashboards you can apply different methodologies to optimize the process, right? Materialization, for example, is one of them, or caching. These are techniques that have been around for a very long time, and database systems have really evolved around them. So what do you see happening out there? Because obviously there’s still a lot of space for optimization. Is it that we are missing the right tooling to do that? Why is there still so much room for optimization when it comes to BI?
Vinoo Ganesh 22:54
It’s a great question. I think in the past, BI fell into this read-only category: I’d have a dashboard, it was executed once, and it would just sit on a page for someone to consume. In the new world of data applications, with companies like Hex and Streamlit (which Snowflake acquired) doing really powerful data application creation, you as a non-technical user, or a less technical user, can now interact with the platform in a way that you previously couldn’t. Not just filtering; I can actually bring in and join other tables, with no-code or low-code solutions, and create new derivative data products just from my own system. So materialized view creation and caching both solve that root-node problem of the same compute happening over and over again, by persisting it for the derivative products. Even notebooking tools (Hex, I think, is a really cool product) let you create all of this derivative value, all of these derivative data products, that still kind of live in the realm of BI, although people are using them for ETL as well. It’s still the BI world, as opposed to the previous model of just an individual dashboard sitting there consuming data with no one looking at it.
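The “root node” problem is easy to see in miniature. In this sketch, plain Python stands in for the warehouse and a scan counter plays the role of billed compute; the data and functions are invented for illustration.

```python
calls = {"scans": 0}

def scan_orders():
    """Stand-in for an expensive full-table scan."""
    calls["scans"] += 1
    return [("us", 10), ("eu", 7), ("us", 3)]

def revenue_by_region():
    totals = {}
    for region, amount in scan_orders():
        totals[region] = totals.get(region, 0) + amount
    return totals

# Naive: every dashboard tile recomputes the shared root node.
tile_a = revenue_by_region()
tile_b = revenue_by_region()
assert calls["scans"] == 2

# "Materialized": compute the root once, derive both tiles from it.
calls["scans"] = 0
base = revenue_by_region()
tile_a = base
tile_b = {k: v for k, v in base.items() if k == "us"}
assert calls["scans"] == 1
```

A materialized view or result cache does exactly this at warehouse scale: the expensive root is paid for once, and each derivative data product reads the persisted result.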
Kostas Pardalis 24:23
And do you feel like we need new tooling to optimize these, let’s say, not entirely new workloads, but new facets of existing workloads? What do you see there? I mean, obviously there is an opportunity; that’s why Bluesky is out there. But what should a database system do to account for these new ways of interacting with data, and make the full process more performant in the end?
Vinoo Ganesh 24:54
Yeah, so the interesting thing is that SQL has been around forever, right? The ANSI SQL standard has existed for ages. The execution engine is the thing that’s been particularly played with over the years. I mean, Snowflake is effectively Oracle without the DBAs, deployed in the cloud, that you can manage on your own. But the execution engine is what actually does a lot of the magic of Snowflake: the clustering, the fact that you can write a query and potentially never have it fail, it can just keep spinning, whereas if you do the same thing in something like Spark, it just crashes. Those are double-edged swords. And if you look at something like Databricks (I think Databricks and Spark, such a great company with really cool technology), they’re investing so heavily in things like Photon and Catalyst, technologies that are really made for the purpose of optimizing query execution, or I would even say making query execution more predictable. So in terms of the need for tooling: for as long as we have people authoring queries, we’re going to need either people who educate folks on how to author queries in the most optimal way, or automated tools and solutions that abstract that problem away. This is maybe an imperfect comparison, but anyone who worked with old C++ memory management has experienced the challenges of memory leaks. Knowing when to malloc or deallocate memory is really, really hard. So people built layers on top of that; Java became one of the predominant technologies, and then we got G1 garbage collection and all of these new ways of abstracting that problem away.
So what I see us doing in this space in particular is adding a layer of abstraction that handles the complexity of otherwise having to optimize low-level SQL code, base table semantics, and query semantics.
Eric Dodds 26:58
I have a question, actually, and this is for you, Vinoo, and Kostas as well, because I know that you’ve looked at some of these tools. One interesting dynamic, digging a little deeper on the tools that allow an end user to drive up compute, is that because those tools can offload compute to Snowflake, cost doesn’t have to be a first-order concern for the people building those products, right? And so in some ways you can create a user experience that is amazing, but it has the double-edged-sword quality of, okay, this is enabling a lot more people to do a lot more things, but it’s creating a huge bill on the back end because you’re just hammering compute. How do you see those products managing that? Because in my mind the optimization is a non-trivial component; there are literally entire companies built around this kind of optimization, and obviously Bluesky is exhibit A. But I’d love to hear your thoughts on that.
Vinoo Ganesh 28:25
Kostas, do you want to go first? Or I can.
Kostas Pardalis 28:29
Yeah, I can. What I want to ask you, Vinoo, is this: traditionally, the database system has the query optimizer, right? You have a piece of technology that is, one way or another, responsible for going out there and making the best possible choices to execute the query in the best possible way. Obviously, that’s a really hard problem, and it will never be perfect. But at least you have access to it. Traditionally, that was the role of the DBA, right? When things start going wrong, I can use the query optimizer, the planner, the EXPLAIN command, all that stuff, to see what’s going wrong and figure out ways to manually optimize things. But now we put so many layers of abstraction in between. And I’m talking specifically about things like Looker, BI tools where you also have languages that you use to model the data, and there is another piece of software that takes this data model definition and does whatever it wants with it. How do you even try to tackle this problem? A query that is generated by Looker is then optimized by the query optimizer, turned into a plan, and executed. How do you even figure this out? In my mind, at least, it’s really, really hard. Abstraction is good, but it also adds complexity. So how do you think we can tackle this problem?
Vinoo Ganesh 30:13
Absolutely. I’m going to go back to the metaphor of how I think about Snowflake, where any query author is someone driving a car, and their goal is to get from point A to point B. How much gas or fuel they consume on that journey doesn’t just depend on their ability to be the best driver in the world. It depends on so many things: the kind of car they’re driving, the environment they’re driving in, who else is on the road. The parallel here is that a warehouse in Snowflake, a logical grouping of compute, a compute cluster, is the car. The optimal route selection, or the optimal gas consumption to reach the destination, depends not only on the person but also on the car. The best driver in the world can still burn a bunch of gas driving a Hummer to wherever they want to go. It’s the same thing here. If I’m authoring a query, and my query optimizer is particularly amazing and has done everything perfectly, that’s only part of the equation. The second part is: where do I choose to execute that query? Am I bin-packed with other incredibly computationally expensive queries, so I’m going to slow down because the warehouse can’t scale up that much? Am I going to have any kind of data locality, depending on the technology that I’m using? All of these come together, so even if the query is the most basic query, like select star from this table, there are still so many other elements involved in my query execution. So when I talk about the level of abstraction, I mean: we could train every driver to drive optimally, and even then, all these external factors could still throw things off. What I really mean is, how do I augment that driver, either by extending what query optimization can do,
but also by adding contextual information about the street conditions, the road conditions, meaning, in Snowflake terminology, what other queries are being executed, the car (the size of my warehouse), the number of clusters I have and how they scale up and down. Abstracting that entire problem space away, so that a user executing an individual query doesn’t have to worry about it, is incredibly powerful. Let me make this a little more concrete. Snowflake does a very interesting thing with this terminology, warehouse. A warehouse doesn’t mean anything physical; it’s just a logical collection of EC2 instances. And how people actually name and use these warehouses really changes from organization to organization. When people set up Looker, its setup instructions say, create a Looker warehouse, or if they’re a little more detailed about it, a Looker extra-small warehouse or extra-large warehouse. But the challenge is that the logical grouping is dependent on the product, not on the actual workload of that individual technology. So if Looker is coming in every day and firing one query every 24 hours, and that query happens to execute on a massively extra-large warehouse with a very high auto-suspend, I may be spending a lot of money on that one query, and it may well be over-specced. So all of these problems come together, and contextual information is really how I think about solving this: instead of a layer of abstraction on top of just the query, it’s on top of the data cloud as a whole.
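A back-of-the-envelope sketch of the Looker misconfiguration Vinoo describes. The credit rates, the $3-per-credit price, and per-second billing with no minimum increment are all simplifying assumptions for illustration, not Snowflake’s exact billing.

```python
def daily_warehouse_cost(credits_per_hour, query_seconds, auto_suspend_seconds,
                         runs_per_day, dollars_per_credit=3.0):
    """Dollars/day for a warehouse that wakes for each run, then idles
    until auto-suspend kicks in (simplified per-second billing)."""
    billed_seconds = (query_seconds + auto_suspend_seconds) * runs_per_day
    return billed_seconds / 3600.0 * credits_per_hour * dollars_per_credit

# One 30-second Looker query a day on an XL warehouse (16 credits/hour,
# illustrative) with a 10-minute auto-suspend:
oversized = daily_warehouse_cost(16, 30, 600, 1)       # ~$8.40/day

# The same query on a small warehouse (2 credits/hour) suspending after 60s:
right_sized = daily_warehouse_cost(2, 30, 60, 1)       # ~$0.15/day

print(oversized, right_sized)
```

Most of the oversized bill is auto-suspend idle, not the query itself, which is why warehouse sizing and suspend settings are configuration insights rather than query rewrites.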
Kostas Pardalis 33:44
It’s really interesting, and I want to offer another dimension to this problem and ask you about it, because you’ve also worked in the financial sector, so you know how people get motivated to optimize based on the profit they can make, the alpha they can generate in the end. We all strive for some of that alpha. And I keep remembering cases like BigQuery, for example. There was this case where you could write select star with a limit of 10, right? You would expect the query engine to just fetch 10 rows and return them. No, it would go and actually scan the whole dataset, and then return just those ten. And at the same time, that’s how the query pricing works: it’s based on how much data is read during the operation. So there’s a lot of motivation to actually do that, because that’s how the product can make more money, right? Especially in consumption-based models, I think this is a very strong force guiding the certain choices that the engineering teams or the product teams at Snowflake, or whatever company, will make. So how important do you think, outside of the technology itself or the abstractions we put on top, are these other factors, like the pricing models of the components, the contracts, the business side of things? How much do they affect, in the end, how much it will cost us and how we should optimize the workloads we have?
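The BigQuery behavior Kostas describes comes down to bytes-scanned pricing: on a plain (unpartitioned, unclustered) table, a LIMIT clause truncates the output but not the bytes the engine bills for. A rough sketch, with an assumed dollars-per-terabyte rate rather than a quoted price:

```python
# Sketch of why "SELECT * ... LIMIT 10" can still be expensive under
# bytes-scanned pricing: on a plain table the engine bills for the full
# columns it reads, and LIMIT only truncates the output. The per-TB rate
# here is an illustrative assumption, not an actual price.

DOLLARS_PER_TB_SCANNED = 5.00

def scan_cost(table_bytes: float, selected_fraction: float = 1.0) -> float:
    """Cost of a query that must read `selected_fraction` of the table's
    bytes (1.0 for SELECT *), regardless of any LIMIT clause."""
    return table_bytes * selected_fraction / 1e12 * DOLLARS_PER_TB_SCANNED

ten_tb = 10e12
print(scan_cost(ten_tb))         # SELECT * ... LIMIT 10: all 10 TB billed
print(scan_cost(ten_tb, 0.02))   # one narrow column: only ~2% of the bytes
```

Selecting only the columns you need, or partitioning so the engine can prune, changes the bill; the LIMIT clause alone does not.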
Vinoo Ganesh 34:51
That’s a good question. One of the really interesting things is how Bluesky approaches this. Normally people would think, and it’s entirely understandable, Snowflake must hate us, right? You’re taking all of our money away, and it’s causing a bunch of issues. It’s been the complete opposite of Snowflake’s actual reaction: Bluesky is actually a Snowflake partner, which is super interesting if you think about it. And I think a lot of this goes back to Snowflake’s consumption-based pricing model. Arguably, you can look at it and say they want you to spend as much as possible so you pay them as much as possible. But there’s a danger underlying that. It’s almost like the finance side: if you put all of your eggs in one basket, or all of your money in one stock, it’s generally very high risk. I think Snowflake recognizes that problem, and for them, this optimal deployment of capital has a multifaceted benefit. They have solutions architects, who function a bit like Oracle’s DBAs, and who are really interested in helping companies grow their data cloud in a responsible way. I think that’s incredibly powerful, because Snowflake realizes that if I’m spending my entire compute budget on one poorly written ETL job, there’s no way I’m going to explore any other pieces of technology in the Snowflake ecosystem. Whereas if they can instead say, we’ve optimized this one query, so maybe you try a new BI tool or a new business venture with the compute money you’ve saved, then you’re actually further entrenched in the Snowflake ecosystem. So diversifying the workloads, like diversifying your investments, is actually really beneficial.
And I actually think it’s one of the same dynamics Amazon figured out as well. Even back at the startup I was at previously, if they gave us compute credits, or some way of offsetting spend with private pricing, we’d take that money and use something like Redshift and try something completely new. So from the consumption-based pricing perspective, I actually think a diversified investment is better for the data cloud. And even having their solutions architects, potentially, at some point, use Bluesky and say, here are the areas where you can cut costs, or here are areas where you can redeploy capital, is incredibly powerful. And, pointing back to your previous question about building these data apps: Snowflake is now almost this API, right? I forget who called it this on LinkedIn, but they were saying Snowflake is building its own Apple App Store, where you can build all these data apps backed by Snowflake. I think it’s a really great characterization, because it shows that Snowflake is now handling the back-end computation for all of these tools and technologies, enabling people higher in the stack to generate business value they ordinarily wouldn’t be able to, given their lack of experience building some of these technical products. So it’s kind of the same thing. If I’m Snowflake, I’m not necessarily interested in just squeezing as much compute spend as possible out of one app developer, because savings mean they can spend money building, refining, and doing other things. So running those optimizations behind the scenes, or deploying a tool that can help these folks who are creating data apps grow and scale, I think is really powerful.
And one example I will give, since it’s public: Houseware won Snowflake’s Startup Challenge a few months ago, and they’re an awesome team building really cool products. I’m not an investor, so I’m not plugging them, but I think they’re actually really cool. One of the things they’re doing is helping people build these data apps. If you’re a small company, you can’t build all of that technology yourself, and I’d be terrified of a big compute bill from some user accidentally running something expensive. So it’s the guardrails around safe compute as well that I think are really powerful.
Kostas Pardalis 39:45
Yeah, that makes total sense. And one last question for me before I hand it back to Eric: is Bluesky right now working only on top of Snowflake, or do you also support other data cloud solutions?
Vinoo Ganesh 40:02
So the irony right now is that I actually didn’t know anything about Snowflake until I started working at Bluesky. My experience is fairly heavily Spark and Databricks. Right now we’re focused on Snowflake for two reasons. First, Snowflake is a big, dominant player in the ecosystem. And second, I think there’s a lot of opportunity in Snowflake in particular: just from a configuration or query perspective, there’s a lot we can do, especially with SQL. So right now we’re focused on Snowflake, but that will almost certainly change as time goes on.
Kostas Pardalis 40:38
So do you see more opportunities in systems similar to Snowflake, or also in systems that are more like Spark? The reason I’m asking is that the computation models are very different, right? Very different types of deployments, and different teams involved. So how do you see the difference between a system like BigQuery, Snowflake, or Redshift, and systems that are more like Athena, or Spark, or Databricks?
Vinoo Ganesh 41:20
That’s a good question. Taking a step back from the perspective of just Snowflake: at Bluesky, and I always sound like a broken record here, it’s really about this efficient deployment of capital, right? If we can optimize someone’s spend, that’s awesome, but I wouldn’t say my goal is to go into a customer and bring their Snowflake spend down as far as humanly possible. My goal is instead to have them effectively deploy their capital. So if they have a bunch of failed queries, or they’re not using certain BI tools, that’s actually what I’m trying to address, not necessarily negotiating their price down or something. So when you look at something like Spark or Databricks, the number of levers you have makes it a much higher-dimensional problem. In Snowflake, for example, I don’t have to set Spark’s driver memory or executor memory. I execute the query against some t-shirt-sized warehouse and depend on Snowflake to execute it optimally. And as you move out of Snowflake to other technologies, and BigQuery doesn’t have a lot of these knobs either, nor, to an extent, does Redshift, the dimensionality of how many variables you have to tune becomes more and more complicated. Our goal is really to look at things from an organization-wide or team-wide perspective, not necessarily “this query is slow, let me optimize this individual query.” It’s really across the organization: here’s what you’re trying to do, let me help you optimize that. And I’ll give you an example that may be interesting for some of the listeners: incremental pipelines, which we’re seeing all over the place now. A table is appended to, over and over and over.
And oftentimes, because businesses are moving so quickly, we’ve noticed more than a handful of cases where a table is being incrementally appended to, and the downstream consumers of that table still do a full scan of the entire table without actually looking at just the deltas. That’s actually a really hard problem to identify without a tool that’s actively looking for it as a best practice. And so the dimensions we can explore, and the areas we can expand within Snowflake itself, actually lend themselves to this focus: given the multi-dimensional problem in Databricks and other places, it’s almost easier and more focused for us to concentrate on optimizing this one sole area. I will say that even being very knowledgeable in Spark, having led Palantir’s Spark team, I don’t know whether that dimensionality is going to make things easy or hard for us. It could be a really challenging space, or it could be somewhere we can apply some of the same principles.
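The incremental-pipeline anti-pattern described above can be sketched in a few lines. The in-memory “table” and the watermark column are hypothetical stand-ins for a real warehouse table; the point is the difference in rows read per run between a full rescan and a delta read:

```python
# Minimal sketch of the incremental-pipeline anti-pattern: a table is
# appended to each run, but a downstream job rescans every row instead of
# just the new ones. The list-of-tuples "table" is a toy stand-in.

table = []  # each row: (loaded_at, value), appended to by an upstream job

def append_batch(loaded_at: int, values: list[int]) -> None:
    table.extend((loaded_at, v) for v in values)

def full_scan_total() -> tuple[int, int]:
    """Anti-pattern: reprocess the entire table every run."""
    rows_read = len(table)
    return sum(v for _, v in table), rows_read

def incremental_total(prev_total: int, watermark: int) -> tuple[int, int, int]:
    """Best practice: read only rows newer than the last watermark."""
    new_rows = [(t, v) for t, v in table if t > watermark]
    total = prev_total + sum(v for _, v in new_rows)
    new_watermark = max((t for t, _ in new_rows), default=watermark)
    return total, new_watermark, len(new_rows)

append_batch(1, [10, 20, 30])
total, wm, _ = incremental_total(0, 0)                  # first run: 3 rows read
append_batch(2, [5, 5])
full_total, full_rows = full_scan_total()               # rescans all 5 rows
inc_total, wm, inc_rows = incremental_total(total, wm)  # reads only the 2 new rows
print(full_total, full_rows, inc_total, inc_rows)
```

Both approaches produce the same total, but the full scan rereads the whole history on every run, which is exactly the waste that grows silently as the table keeps appending.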
Kostas Pardalis 44:05
Absolutely, that makes total sense. All right, Eric, all yours.
Eric Dodds 44:12
Okay, I want to return to the car analogy as we get close to the end of our time. One interesting thing: the car analogy with the driver is really helpful, because you have training, a driver trained to operate the vehicle in a resource-efficient way, and then, to your point, the vehicle itself has a huge amount to do with it. But I’m going to extend the analogy, probably to the point of breaking, to formulate my question. I was actually thinking about this the other day because I was driving a car from the 80s. It’s really old, and you can basically watch the gas gauge go down as you drive; I think it gets eight miles a gallon or something. It’s also not very fast, so if you push the car really hard, you get a lot of physical feedback as a driver telling you, hey, you are definitely using a lot of fuel here: the car is really loud, and you can see the gas gauge going down. So you’re not only getting feedback on your own driving, you’re also getting feedback from the vehicle itself. Then look at the modern version of that. You get in a Prius, a hybrid vehicle, and it gives you real-time feedback on the economy of your driving style, and even on the efficiency of the vehicle itself in conserving resources. So again, I’m stretching the analogy, but if you think about executing a query in Snowflake, just in the raw SQL editor, you have the little time counter, and that is basically your only physical feedback.
And then if you have a data app on top of that, where you’re doing something that doesn’t give you any physical feedback, it creates this weird dynamic where it’s hard, as a user, to actually get the information you need to optimize while you’re doing your job. I’d just love to hear your thoughts on that as part of this problem set. I know Bluesky comes in and helps, and you talked about whether there could be a completely automated solution. But it is interesting. To your point, I don’t believe there’s anything malicious, like “we’re going to obscure all this so that our NDR is crazy.” They have to take a balanced approach. But there actually isn’t a lot of feedback that helps you, while you’re doing the work itself, adjust what you’re doing to account for resource usage.
Vinoo Ganesh 47:08
Yeah. So one of our customers asked me a couple of months ago, why don’t you just build a SQL query linter that gives you the red Microsoft Word squiggly lines saying this is not optimal? And honestly, it’s not a bad idea. The reason I think this is a challenging problem is that in your Prius, or whatever car, you have that one dimension; I’d argue the brake and the gas are two sides of the same coin. Don’t slam the brake as aggressively, don’t put as much gas into it. With that as your only lever, the problem space of what you can do to fix whatever is on the dashboard is much smaller. Whereas if I say, hey, this query is not being run optimally, or even, you’re doing a linear scan over this dataset, the amount of knowledge and expertise it takes to figure out how to solve that problem in an optimal way is actually pretty massive. Compare it to garbage collection in Java. In C++, my IDE could say, you’ve allocated this thing, and you didn’t destroy it properly somewhere else in the code. Those warnings are great within, effectively, a single program’s scope, but when there are external dependencies outside the same file, it becomes a really complicated, almost intractable problem. Knowing what to do is the second step, and it’s not always easy, even for really experienced query authors. I’m our deployment lead, so when I go to customers, I actually use Bluesky, and I’ll say, here are areas where I think you can optimize. But for me, it’s still: let me actually introspect your query, let me understand the table, not just the schema, but its attributes at a fundamental level.
All of that has to be surfaced to the point where you can tell a clear story about what needs to happen. And that’s almost why, rather than just linting Java and saying, here are all the things you can do to better fit or optimize your memory, you build the garbage collector that handles it for you. That’s kind of our end state: let’s just handle some of these challenges for you. And actually, one additional dimension is that several of these challenges you may not even be able to see. In a multi-JVM setup, if I have two things running, and one service is thrashing the hard drive of whatever box it’s running on, then I, as the second service, may have no idea why my job is so slow, or why it’s being queued, or why I/O is as bad as it is.
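The squiggly-line linter the customer proposed can be sketched in a few lines, which also shows its limits: regex rules can flag surface-level cost smells, but, as Vinoo argues, they can’t see table statistics or cross-query context, and they can’t tell you how to fix what they flag. The rules below are illustrative, not a real rule set:

```python
# A toy SQL "linter" of the kind described above: regex rules that flag
# common cost smells. Illustrative only; real optimization needs table
# statistics and workload context that pattern matching cannot see.

import re

RULES = [
    (re.compile(r"select\s+\*", re.I),
     "SELECT * reads every column; project only what you need"),
    (re.compile(r"^(?!.*\bwhere\b).*\bfrom\b", re.I | re.S),
     "no WHERE clause: full-table scan likely"),
    (re.compile(r"\border\s+by\b(?!.*\blimit\b)", re.I | re.S),
     "ORDER BY without LIMIT sorts the whole result"),
]

def lint(sql: str) -> list[str]:
    """Return the warning for every rule whose pattern matches."""
    return [msg for pattern, msg in RULES if pattern.search(sql)]

warnings = lint("SELECT * FROM events ORDER BY ts")
for w in warnings:
    print("warning:", w)
```

Every warning here still leaves the hard part to the author: knowing whether a projection, a filter, or a different warehouse is the right fix requires exactly the context a pattern matcher lacks.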
Eric Dodds 49:50
Yep. If I had to summarize that, it would be: you as a driver shouldn’t have to worry about all of these various inputs, because it’s a much bigger problem than gas pedal and brake pedal.
Vinoo Ganesh 50:04
Exactly, you should just be able to focus on driving. And I think one of the things Snowflake has done, and I don’t know if people think about them this way, is build a giant multi-tenant system. Everyone can share data with everyone else; everyone is executing queries on technically the same AWS or GCP infrastructure as everyone else. So you’re really part of this massive cluster doing all this computation, and that can affect a lot, including whether your query gets scheduled and run in the same amount of time every single time.
Eric Dodds 50:39
Yep. All right, we’re close to the buzzer here, so one more quick question for you, and this is advice for listeners. For anyone listening who maybe doesn’t have a huge Snowflake bill, but whose wheels are now turning, wondering what, if anything, is super inefficient in the way they’re executing stuff on Snowflake: where would you have them start looking? What’s the best place to start that investigation?
Vinoo Ganesh 51:12
Honestly, I’d say one of two things. The first is that the Bluesky team is incredibly knowledgeable on this, and that’s not just me giving you a sales pitch to come talk to us. Finding people with a lot of expertise in this space is actually really hard, because developing that expertise requires a lot of contextual information: people at big Snowflake-consuming companies know their own unique patterns of data, but they don’t necessarily see the whole swath of other ways Snowflake is being used. So you can reach out to the Bluesky team. The other thing I would do is read. Snowflake: The Definitive Guide just came out, and there’s a lot of great material in there. There are blog posts, and there are a lot of sessions like this one, from people who are spending time in the space and sharing tidbits of best practices, like the auto-suspend settings we discussed. Bluesky has a blog, too, where we’re slowly creating more and more content. But the main thing I would really do is experiment: try some of this stuff out. Snowflake has made it so easy to try these queries at smaller capacity, or even spin up a test instance. Actually playing with the tools and technologies, I think, is pretty powerful.
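One concrete way to start the investigation Vinoo suggests is to export query history (Snowflake exposes this through its ACCOUNT_USAGE views) and rank query patterns by approximate credit consumption per warehouse. The record layout and the cost weighting below are simplified assumptions for illustration, not Snowflake’s actual attribution model:

```python
# Sketch: rank (warehouse, query pattern) pairs by rough credit consumption
# from a query-history export. The rows and credit attribution are
# simplified, hypothetical stand-ins for a real ACCOUNT_USAGE export.

from collections import defaultdict

# (warehouse, query_fingerprint, elapsed_seconds): a hypothetical export
history = [
    ("ETL_XL", "insert into facts ...", 1800),
    ("ETL_XL", "insert into facts ...", 1750),
    ("LOOKER_XL", "select * from wide_table", 40),
    ("ADHOC_XS", "select count(*) ...", 5),
]

WAREHOUSE_CREDITS_PER_HOUR = {"ETL_XL": 16, "LOOKER_XL": 16, "ADHOC_XS": 1}

def spend_by_pattern(rows):
    """Aggregate approximate credits per (warehouse, query pattern),
    sorted with the most expensive pattern first."""
    totals = defaultdict(float)
    for wh, fingerprint, secs in rows:
        totals[(wh, fingerprint)] += WAREHOUSE_CREDITS_PER_HOUR[wh] * secs / 3600
    return sorted(totals.items(), key=lambda kv: -kv[1])

for (wh, q), credits in spend_by_pattern(history):
    print(f"{credits:8.2f} credits  {wh:10s}  {q}")
```

Even this crude ranking usually surfaces the pattern Vinoo describes: one or two repeated jobs on a large warehouse dwarf everything else, and that is where to look first.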
Eric Dodds 52:30
So helpful. Vinoo, this has been such a great episode, I learned a ton. The multiple analogies were great. So thank you so much for spending some time with us.
Vinoo Ganesh 52:42
Absolutely, thank you both so much for having me here. I’m glad we got a chance to chat. I really appreciate the time, and I hope this was helpful.
Eric Dodds 52:50
My takeaways from this, Kostas: there were a lot. I love the car analogy; I obviously dug into that multiple times. But I think it was really helpful to hear someone so technical mention how difficult it is to actually get a sales contract from initial conversation to signature. I just really appreciated that. It was funny, because Vinoo is obviously a brilliant person to even be able to perform all of those job functions: not only brilliant from an engineering and data perspective, but, obviously, interpersonally too. To actually be a salesperson as well is a whole different skill set, so that’s a really rare combination. It was just really enjoyable to hear it from the other side, to hear an engineer say how hard it is to actually get a sales contract through. Usually it’s the other way around: you don’t understand how difficult it is to scale a distributed system, or whatever it is. So that was my big takeaway, along with all the other great stuff, but that just made me smile.
Kostas Pardalis 54:12
I think what I really enjoyed from the conversation is the framing of workloads as capital allocation. I thought that was a very interesting phrasing, and in general, this whole mental model of how to think about your data infrastructure, how it’s utilized, how you can optimize it, and what optimization means in the end. That was probably the most valuable part of the conversation, at least for me, and hopefully also for the listeners out there who, sooner or later, will face the need to optimize not just for performance or latency, but also for cost. So yeah, we probably need another episode; I think there’s more stuff to discuss, and I’d love to get into more of the technical details of distributed systems. So I’m looking forward to having him on the show again in the future.
Eric Dodds 55:14
Absolutely. Well, thank you so much for listening. Subscribe if you haven’t, tell a friend about the show, and we will catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.