Episode: 59

Making ETL Optional With Justin Borgman – Starburst Data

with Justin Borgman

Chairman & CEO, Starburst Data

This week on The Data Stack Show, Eric and Kostas are joined by Starburst Data’s chairman and CEO Justin Borgman to discuss driving analytics while dealing with increasingly siloed and fragmented stacks.

Notes:

Highlights from this week’s conversation include:

  • Starburst Data is Justin’s second startup (2:42)
  • Starburst focuses on doing data warehousing analytics without the need for the data warehouse (4:14)
  • Multi-cloud solutions among merger and acquisition use cases (8:32)
  • Ways the stack is increasing in complexity (12:25)
  • Comparing essential components of a data stack from 2010 to now (15:01)
  • The future of ETL (27:36)
  • The best maturity stage for an organization to implement Starburst (31:27)
  • Starburst connectors (36:55)
  • Monetizing enterprise solutions while promoting open source ones (41:52)
  • The history of Presto and Trino (45:37)
  • Benefits of a decentralized data mesh (49:53)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

 

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds  00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

 

Eric Dodds  00:27

Welcome back to the show. We have Justin Borgman from Starburst Data. And I’m really excited to talk with him because I think he may help us make some sense of data mesh, but at the very least, we’ll learn a ton about federated queries and building analytics across different components of the stack. So my main question, and we’ll talk about Presto and Trino and get into the details there, but I think my main question, Kostas, is the view of the stack increasing in complexity. So we had a guest recently talk about how the premise of the cloud was that it will unify all this data and everything. And in fact, it’s creating more complexity and more data silos. I thought that was very compelling. And I think Justin is living that every day with Starburst, trying to make it easier to drive analytics with an increasingly fragmented stack. So I want to ask him about the complexity of the stack and how that’s changing. How about you?

 

Kostas Pardalis  01:30

Yeah, I want to learn more about Presto in general. Presto has been around for quite a while and you know, it has gone through like many different transformations. So that’s definitely part of the conversation that we are going to have. And I want to learn more about Justin’s view of how this data stack is maturing and where he thinks that we are going with the technology, mainly because the interesting part with Presto is that it has a very, very different approach when it comes to querying. It has a very decentralized approach, which is something completely different, actually opposite to the best practice of trying to source all the data and store it in one centralized location and do the queries there. So yeah, I think we will have a lot to chat about with him.

 

Eric Dodds  02:22

Well, let’s dive in and get to know Justin and Starburst.

 

Kostas Pardalis  02:26

Let’s do it.

 

Eric Dodds  02:28

Justin, welcome to the show. It’s really great to have you with us today.

 

Justin Borgman  02:32

Thanks, Eric. Super excited to be here.

 

Eric Dodds  02:35

Well, let’s start where we always start. We’d love to hear your background, and you’ve done some really cool stuff, but kind of what led you to Starburst.

 

Justin Borgman  02:42

Yeah, so let’s see, this is my second startup. My first startup was back in 2010. It was called Hadapt. And it was an early SQL engine for Hadoop, just as it was starting to pick up momentum. And really at the time, people were thinking about Hadoop as a kind of cheap storage or a way of doing batch processing on massive amounts of data. And our idea was to turn it into a data warehouse. In fact, I think the business plan we wrote was to become the next Teradata, with really doing data warehousing within Hadoop. Now as luck would have it, we actually ended up being acquired by Teradata four years later. And I became a vice president and general manager at Teradata, responsible for emerging technologies and really trying to think about the future of data warehousing analytics and what that might look like. And it was in that context, that I actually met the creators of an open source project called Presto, they were at Facebook at the time, Martin, Dain and David. And we started collaborating and working on making Presto better and better and better. And today, that effort is now known as Trino. So the name changed along the way. But that’s really how Starburst was ultimately born as really the founders and creators of that open source project, leaving our respective companies. I left Teradata, they left Facebook, and Starburst was born.

 

Eric Dodds  04:05

Very cool. And can you just give us a quick rundown of what Starburst is? And what does it do? Just for our listeners to have a sense of the product?

 

Justin Borgman  04:14

Yeah, so much the way my first company was really SQL on Hadoop, this is SQL on anything. And I think that was what got me so excited about it. It’s about doing data warehousing analytics without the need for the data warehouse. And from a technical perspective, it’s basically a database without storage, and it thinks of all other storage as though it’s its own. So you can query the data where it lives. You might have data in Mongo, you might have data streaming in on Kafka, you might have data that you want to access via Elastic and text search. You might have data in traditional legacy systems like Oracle or Teradata, you might have Snowflake, you might have data lakes. That’s one of the areas where we really excel, is accessing data in data lakes. And in all of those cases, you have kind of a single point of access to query the data where it lives, without the need to move it around and do those typical kind of ETL pipelines. So it’s really about giving you faster time to insight. That’s the way we think about it, and removing a lot of that friction traditionally associated with classic data warehousing.
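To make the "database without storage" idea concrete, here is a minimal Python sketch. It is not Starburst's or Trino's actual code: the names `Connector`, `InMemoryConnector`, and `QueryEngine` are hypothetical, and the in-memory lists merely stand in for real sources. The only point is that the engine holds no rows of its own and asks each source for rows at query time.

```python
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

Row = Dict[str, object]


class Connector(Protocol):
    """Hypothetical interface: anything that can hand back rows for a table."""

    def scan(self, table: str) -> Iterable[Row]: ...


class InMemoryConnector:
    """Stand-in for a real source such as MongoDB, Kafka, or a data lake."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterable[Row]:
        return iter(self._tables[table])


class QueryEngine:
    """Holds no data; it only knows which connector serves which catalog."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._catalogs[catalog] = connector

    def scan(self, catalog: str, table: str) -> Iterable[Row]:
        return self._catalogs[catalog].scan(table)

    def join(
        self, left: Tuple[str, str], right: Tuple[str, str], key: str
    ) -> Iterator[Row]:
        """Hash join across two (catalog, table) pairs, read at query time."""
        index: Dict[object, List[Row]] = {}
        for row in self.scan(*right):
            index.setdefault(row[key], []).append(row)
        for row in self.scan(*left):
            for match in index.get(row[key], []):
                yield {**match, **row}


# Usage: two separate "systems", one point of access.
engine = QueryEngine()
engine.register("mongo", InMemoryConnector({"users": [{"user_id": 1, "name": "Ada"}]}))
engine.register(
    "lake",
    InMemoryConnector(
        {"orders": [{"user_id": 1, "total": 40}, {"user_id": 2, "total": 15}]}
    ),
)
rows = list(engine.join(("lake", "orders"), ("mongo", "users"), "user_id"))
```

Nothing is copied into the engine ahead of time. Adding a new source means registering another connector, not building another pipeline.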

 

Eric Dodds  05:14

Super interesting. So let’s talk about, I’d love your perspective, because you have a great perspective, because you’ve both built systems that drive analytics from a database standpoint, and then are now leading a company that solves problems across different pieces of infrastructure. We had a guest recently, who made a really good point. It sounds very obvious, but the data stack is increasing in complexity, right? I mean, you have all these tools that are making functions within the stack, easier to do that before required a significant amount of engineering effort. And it’s like, okay, great. Like, we’re getting beyond some of the low level plumbing problems, which is awesome. But especially as you reach scale, the stack is increasing in complexity, right? So you have data warehouses, data lakes, Kafka, there are a number of different sort of core pieces of infrastructure that you’re running at scale, which actually makes traditional linear data warehouse into BI dashboard way harder. So can you just talk us through what you’re seeing on the front lines? Like, how are stacks increasing in complexity? And then I’d just love to hear your perspective on Starburst as the answer to managing that without necessarily having to get into the plumbing?

 

Justin Borgman  06:34

Yeah, absolutely. Well, first of all, I 100% agree with your previous guest about the stack gaining complexity. And I think of an old quote from really a legend in the database space, a guy named Mike Stonebraker, who’s a professor at MIT, and he was the creator of Ingres, and Postgres, and Vertica, and a variety of different database systems over the years, you know, won the Turing Award. And he had written a paper that basically said, there is no one size fits all database system, meaning that you’re always going to have different databases for different types of jobs, different types of use cases. And I think that’s true. Some applications you want to build on Mongo, some might be Oracle, some might be something else. And I think that, for better or worse, leads to greater complexity, because now you have even more data sources. And we find particularly in large enterprises, this is compounded by the fact that you have different departments, different groups within an organization doing their own thing. You may acquire businesses, and every time you have M&A, and you acquire a business, you just acquired their data stack as well. Right? Right. And that’s actually one of the fastest ways we find that our customers end up being multi-cloud: they bought somebody who runs on Azure or GCP, and now they’re multi-cloud. So 100% agree on complexity. And that’s a big part of what we hope to solve by essentially allowing you to go direct to source and run those analytics by connecting directly to where the data is. I think that’s the power of the platform. Essentially, I like to describe it as really giving a data architect or a data engineer infinite optionality. If they still want to consolidate data into a data lake or data warehouse, that’s cool. I would argue data lakes are probably the better bet for the long run for consolidating data. And we could talk about that just from a TCO perspective, but we …

 

Eric Dodds  08:29

We’ll definitely talk about that.

 

Justin Borgman  08:32

Yeah, absolutely. But you know, the point is, at least you have the freedom of choice. And so that’s really what we’re trying to do is kind of create a single point of access across all those different data sources to add an abstraction, and abstractions are always really for the purposes of creating simplicity, where there is complexity. And I think we allow you to do that within the data architecture realm.

 

Eric Dodds  08:55

Let me ask you, you’re a two time entrepreneur, so I’m gonna ask you a business question that relates directly to this problem. So a lot of times, let’s take the example that you gave of a business acquiring another company and inheriting their stack, right? Integrations and all of that are a whole subject unto themselves. But I would argue that in a lot of those cases, like the synergy, wow, synergy is such a bad buzzword, but let’s say that the results you can produce from understanding the power of the relationship between the two businesses tends to have an outsized impact. Okay, then we’ll just call that synergy for the session.

 

Justin Borgman  09:39

Yeah, I mean, that’s like the truest definition. I agree with you. I know. It has negative connotations only because it’s usually, I think, overinflated, right? Like people talk about synergy, and then maybe they don’t find the synergy, but you’re absolutely right. Yeah. And in this day and age, more than ever, synergy can be created by combining data assets, right?

 

Eric Dodds  10:00

And that was going to be my question like, do you see that, especially among Starburst customers, where, ultimately a lot of these things come to a head and analytics that then influence business processes that influence product? You know, there’s a variety of implications here, right, the analytics and understanding those components is usually the tip of the spear in terms of driving the decisions that filter out and shape the business. Do you see that a lot where when you can combine data from different sources in a way that would be I mean, some of these things like you’re talking multi cloud, if you put a set of data engineers on this, you’re talking months of work to get a basic understanding of how the data relates, and then you’d have a ton of BI work and analyst work to get the insights on top of that. And so do you see that a lot among your customers?

 

Justin Borgman  10:48

Yeah, 100%. In fact, it’s a great use case, actually, for us. Because when we see that an M&A transaction is taking place, we know that there’s instantly going to be an opportunity for the reasons that you mentioned. You’re inherently talking about two different sets of data. And you’re talking about an integration effort, which, from speaking to at least one customer that is quite acquisitive, often takes like two years to fully integrate those two entities to get the value that the investment banker had written up in the original proposal, right? So it takes a long time. And the beauty of this mindset, or this approach of kind of a single point of access, or what some are now calling a data mesh, which I’m sure we’ll talk about as well, is that you’re getting instant connectivity. So you don’t have the delays of all the challenges associated with getting the data out of one system, navigating how to transform it, and load it and get it prepared into another system. All of that can be done in weeks, rather than months or years. And that speaks to that time to insight ability that we can provide.

 

Eric Dodds  11:56

Yeah. Okay. One other question for me, and this is I’m just genuinely curious about this. So the stack is increasing in complexity. And you’re seeing this on the front lines, because you’re providing an antidote to that. How is it increasing in complexity? Are there specific trends that you see around particular technologies that maybe add to the complication of what you would normally solve from a low-level plumbing standpoint?

 

Justin Borgman  12:25

Yeah, well, one thing that I’ll mention, and this ties a little bit back to my Stonebraker quote, but there’s a lot of different systems out there now. And it’s not just different types of databases. It’s other forms of data as well, it’s CRM systems, it’s web analytics, it’s a whole host of different data sources that you want to combine to understand your business better, like customer 360 is a very classic use case that we work on with our customers. And very often that involves pulling together a variety of data sources. I think part of this also, candidly, is fueled by a tremendous amount of venture capital that’s poured into the data space over the last decade. There’s a data landscape that FirstMark Capital produces every year, I’m not sure if you’ve seen it, Matt Turck is the VC who maintains this, and I like to go back just for fun sometimes and look at like the 2012 version of this data landscape. And it’s already complicated. There’s like 30 different data sources. And then you look at the 2021 version, you’re like, it’s an eye chart, like you have to zoom in, you know, like, it’s hard to even find my own company in that space. I think that’s part of it as well, you’ve got a lot of different niche players. Maybe at some point, there’ll be some consolidation that simplifies it, but we don’t see that, at least anytime soon. And that means ever greater complexity. But one other thing I’ll mention that I think is compounding this problem is a demand from the user side, which could be an analyst or data scientist, for more self-service access to the data that the organization has. And so you’ve got greater complexity on one end, and a wider variety of potential users on the other end. And I think that that’s a painful place to be in the middle.

 

Eric Dodds  14:09

Yeah, for sure. We had a recent show, we did a fun exercise where someone asked us, how would you build this in 2012, which is a really interesting mental exercise, right, relative to all the options you have now. So okay, well, this is super fascinating. Kostas? Please.

 

Kostas Pardalis  14:27

I have quite a few questions. Justin I’d like to start with a pretty simple one that has to do with the conversations that we had around the data stack. And I’d like to ask you, from your experience or through the lenses of Starburst, what are the essential components today of a data stack that a company needs? And if you can, I’d like to compare it to how a data stack looked back in the Hadoop era when you started your previous company and what are the differences there?

 

Justin Borgman  15:01

Okay, great. All right, well, I’ll start there. Maybe I’ll start with the past and then go to today. So in 2010 was an interesting transition point or the beginning of a transition, I would say, the concept of a data lake was in its infancy back then, of course, back then data lake was synonymous with Hadoop. That was the only data lake. Now it’s increasingly cloud object storage, like S3, or Azure data lake storage or Google cloud storage. But back then it was Hadoop. And I think what people at the time were just starting to think about or transition is like, can I do some data warehousing in Hadoop? Can I do some ETL in Hadoop, at least the T part of ETL? Of course, can I do some transformations in Hadoop and essentially offload very expensive compute from my Teradata system or my Oracle system and use this cheaper batch oriented, infinitely scalable, open source platform instead. And so it’s very interesting from that perspective. I think, a lot fewer data sources in that world. Teradata was striving to be the single source of truth with I will say mixed results, meaning that they were probably the closest thing to a single source of truth. But you still had different data marts and other databases. SQL Server here and there, and Oracle here and there, and so still a bit of a heterogeneous environment, but not nearly at the degree that it is today.

 

Justin Borgman  16:26

For the players back then I would say Tableau was the new kid on the block and killing it. But absolutely the new kid back then, displacing maybe some of the older BI tools like Business Objects, or Cognos and MicroStrategy at the time. And ETL back then was synonymous with Informatica, I think that’s another big change, right? So if we fast forward to today, I think we are in a much “cloudier” world. I mean that in the sense that more data is in the cloud, which maybe makes it cloudy on multiple levels, especially for those customers who are hybrid; I think those are unique challenges, too. But data lake is now synonymous with cloud object storage. I think Snowflake is trying to be the Teradata of the future, but very much embracing this same concept of a single source of truth. And then you have the Fivetran or Matillion, or other players sort of like Informatica 2.0. And then at the BI level, Tableau is still very strong, maybe Looker is a more recent addition, there’s also Preset, the company behind Superset, which is interesting, too. But on a surface level, you might say these stacks are similar. I think though, we’re at a point where data lakes have matured or at least data lakes as a data warehousing alternative have matured a lot as a concept. I think back in 2010, when I was doing that first business, it was an appealing idea. But not a lot of people were doing it in practice, largely because it takes a long time to build an analytic database. I learned this the hard way, building a cost-based optimizer and building an execution engine takes a long time. And in 2010, they were all very early so you couldn’t get the same performance out of SQL on Hadoop as you could in Teradata, for example. If we fast forward to today, that gap is much, much narrower to the point that it’s almost insignificant.
And whether that’s Starburst querying data in a data lake or other players in the space, like Databricks as a SQL engine now for querying the data lake as well, you see this idea of a lakehouse becoming more popular, where I’m going to store a lot of my data in a data lake, and maybe skip out on the Snowflake model. So I guess I would summarize by saying I think the data warehousing model, irrespective of the individual players, is being challenged today in a way that it wasn’t previously in history.

 

Kostas Pardalis  18:56

Yeah, yeah. Makes total sense. I thought that was a very, very interesting comparison between the two points in time. You mentioned data lakes. And it’s been like a couple of months, at least now that we see quite a few data related companies getting substantial funding, right? And also quite a few open source projects. We have Iceberg that came from Netflix. Hudi, which came from Uber. And of course, we have Delta Lake, right? So what’s your opinion there? Like, what do you see? Because the way that I see it, and how I feel about it is that we have like, some kind of decomposition of a database system, right? Because if you think about something like Postgres, you have an extremely complex system that is like a black box in the end, which you query using SQL, a very simple, let’s say, language, okay. And we have reached the point right now where we are talking about transaction logs, about query engines on top of like the file system, like it kind of feels like we have decomposed the database system into small components and the data engineering teams are trying to take all this and recreate, let’s say, a large scale database system. Where are we today? Like, how mature are these technologies? Like if we take for example Hudi, or like Delta Lake compared to something like Snowflake?

 

Justin Borgman  20:24

Yeah, so first of all, I agree with your general sentiments. I mentioned in the opener that it’s like a database without storage. So you could say we’re like the top half of the database, the query engine, execution engine, SQL parser, the query optimizer, and Iceberg is like the bottom half, if you will, of a database. It’s the storage piece, or Hudi or Delta. And I think what we’re seeing right now, which is kind of an exciting period in history, is back to that point about data warehousing analytics in a data lake. The one missing piece throughout the last 10 years has been the ability to do updates and deletes of your data. And that’s the gap that I think we’re closing with those data formats, which now allows for what, you know, Teradata calls active data warehouse, like being able to do updates, do deletes, modify your data, and still perform high performance analytics and power BI tools, all within one system. And I think we’re right on the cusp of eliminating that delta, if you will, no pun intended, between data warehouses and data lakes as we speak. And I think that decomposition is good for customers, in the sense that it gives them a lot of optionality. So for example, if you’re going to standardize on Delta, you can use Databricks to train a machine learning model, create a recommendation engine, if you’re a retailer, if you buy this pair of shoes, you might like this pair of pants. That’s a great use case for Databricks. And then you might use Starburst to generate your reports, use Tableau to access that data and figure out how much did we sell last month? Or how much do we think we’re going to sell next month? And they can both work off the same file formats. And that’s pretty cool. So I think that gives customers just a lot of flexibility to interchange engines. And also they have flexibility around which formats they choose, Iceberg, Hudi, Delta, all very interesting and promising options.
And I guess I’ll just mention one last point, I think the big distinction between this way of thinking and Snowflake is when you load your data into Snowflake, you’ve now locked it into a proprietary format. And that’s an important piece with respect to vendor lock in, and having control and ownership over your own data. And that’s one of the things that I observed even in my time at Teradata. Nobody ever said Teradata was a bad database. It’s a great database, but they really hated the fact that it was inflexible, and they’re very expensive, right?
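The point about table formats closing the update-and-delete gap can be illustrated with a toy. The Python sketch below is a deliberate simplification and is not how Iceberg, Hudi, or Delta Lake are actually implemented. `ToyTable` is a hypothetical name, and the copy-on-write strategy shown is only one possible approach. It shows the general idea of keeping data files immutable and recording each change as a new snapshot that lists which files are current.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class ToyTable:
    """Toy copy-on-write table: immutable files plus a log of snapshots."""

    files: Dict[int, Tuple[Row, ...]] = field(default_factory=dict)
    # Each snapshot is the tuple of file ids that make up the table then.
    snapshots: List[Tuple[int, ...]] = field(default_factory=lambda: [()])

    def _write_file(self, rows: List[Row]) -> int:
        file_id = len(self.files)
        self.files[file_id] = tuple(rows)  # never modified after this point
        return file_id

    def append(self, rows: List[Row]) -> None:
        new_file = self._write_file(rows)
        self.snapshots.append(self.snapshots[-1] + (new_file,))

    def delete_where(self, predicate: Callable[[Row], bool]) -> None:
        """Rewrite only the files that contain matching rows."""
        current: List[int] = []
        for file_id in self.snapshots[-1]:
            rows = self.files[file_id]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                current.append(file_id)  # untouched file is reused as-is
            elif kept:
                current.append(self._write_file(kept))
        self.snapshots.append(tuple(current))

    def read(self, snapshot: int = -1) -> List[Row]:
        return [r for f in self.snapshots[snapshot] for r in self.files[f]]


table = ToyTable()
table.append([{"id": 1, "sku": "shoes"}, {"id": 2, "sku": "pants"}])
table.append([{"id": 3, "sku": "socks"}])
table.delete_where(lambda r: r["id"] == 2)

latest = table.read()                    # rows after the delete
before_delete = table.read(snapshot=-2)  # the previous snapshot still reads
```

Because the files are plain and the snapshot log is the only source of truth, any engine that understands the format can read the same table, which is the interchangeability Justin describes.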

 

Eric Dodds  22:55

Justin, one question and Kostas, I apologize for jumping in here, but I’d love to benchmark when we talk about performance a lot of times and speed-to-insight is a term that you’ve mentioned a couple times. I’d love your benchmark on that. Because one way I like to frame this question is the definition of real-time has changed over time. Right, and so real-time, at one point, may have meant a couple times a day, right? And so it’s getting faster and faster and faster. I’d just love to know, like, what’s your perspective on that changing, especially relative to query performance. And I know that can change based on the business model. But when you talk about recommendations, from an e-commerce standpoint, the bleeding edge of that generally has very heavy requirements as far as performance in real time. But that also is relative. So I’d just love to know, what are you seeing with your customers as far as requirements on performance and delivery from that standpoint?

 

Justin Borgman  23:56

Yeah, so there are two dimensions that we think about with our customers. One is the query response time. And that’s what I think people have classically referred to as performance when it comes to analytic database systems. Like I run a query on a certain amount of data, how fast does it return? And there are industry benchmarks that have been used for a long time, TPC-H, TPC-DS, these are sort of like standardized benchmarks that you can run your queries through. And of course, we would always say the best benchmarking is actually on your own data, though, even better than industry benchmarks. But that’s one dimension of performance. The other dimension, which I think is often overlooked, and this is what we really refer to when we think about time-to-insight, we think of that as a bit more holistic of a measure, factoring in how long did it take from the moment that data was created to my ability to analyze it. And if you think about it in that context, just to compare and contrast, let’s say Snowflake versus Starburst. Snowflake, maybe a query runs in two seconds, and maybe it takes Starburst 2.6 seconds and you might say, well, Snowflake ran that query faster. Yeah. Okay, a little bit faster. But it might have taken three weeks to get the data into Snowflake in the first place. And so really that query was three weeks. Right? And that’s what I mean by time to insight, is I think people learn over time that there’s a prerequisite step before that traditional data warehouse is able to actually run that first query. And that’s an important tax that you don’t necessarily need to pay.
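The arithmetic behind that comparison is simple enough to write down. The Python sketch below uses the figures from the conversation (two seconds, 2.6 seconds, three weeks). The function name `time_to_insight` is made up for illustration, and the zero delay in the federated case is an assumption: connecting to a source is fast but not literally instant.

```python
from datetime import timedelta


def time_to_insight(data_ready_delay: timedelta, query_time: timedelta) -> timedelta:
    """Elapsed time from data creation until the first answer comes back."""
    return data_ready_delay + query_time


# Warehouse path: load the data first (three weeks), then run a fast query.
warehouse = time_to_insight(timedelta(weeks=3), timedelta(seconds=2))

# Federated path: assumed zero load delay, slightly slower query.
federated = time_to_insight(timedelta(0), timedelta(seconds=2.6))

# How much of the warehouse number is the query itself?
query_share = timedelta(seconds=2) / warehouse
```

On these numbers the 0.6-second query gap is negligible. Nearly all of the warehouse path's time to insight is the load step, which is the "tax" being described.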

 

Eric Dodds  25:27

Yeah. Super interesting. Yeah. I think that’s a subject that we want to explore more in the show, just because when you talk about latency, time-to-insight, like those are very subjective, depending on where you’re on the pipeline. So super interesting.

 

Kostas Pardalis  25:43

Yeah. And that’s also something else very interesting, Justin. So let’s talk a little bit about ETL. Okay. And I want to hear from you, what do you think is the future of ETL? ETL has been around since we had the first database systems, exactly because as you said at the beginning, we cannot have one system that does everything; different kinds of workloads require different architectures and different systems. And the environment today is probably even more complex, if you consider that you have to download data through REST APIs, because something is behind your Salesforce instance, for example, metric or whatever, right? What do you see happening to ETL? Because from what I understand, when you are incorporating, like, Starburst in your architecture, for example, the need for ETLing the data from, I don’t know, like production databases, for example, to your data warehouse is reduced, right? And at the same time, like I’ve seen, I was looking like today, for example, there was an announcement from Snowflake that Iterable, which is like a company like in marketing, if I’m not mistaken, Eric. Right. It’s a marketing product?

 

Eric Dodds  26:56

Yes, indeed. Yeah. Yeah. Like customer journey, like orchestration.

 

Kostas Pardalis  27:01

So now you can get access to your Iterable data directly on Snowflake, without doing the ETL, through the data sharing capabilities that Snowflake has, right?

 

Eric Dodds  27:15

That’s interesting.

 

Kostas Pardalis  27:17

They just announced the product today. Again, where’s ETL there? Like until yesterday, if I was using Iterable, I would have to have a pipeline there to pull the data, it will take days, blah, blah, and put it into Snowflake. So how do you feel about ETL? What’s the future of ETL? Based on your experience?

 

Justin Borgman  27:36

Yeah, so I was going to say, and I did not read the news, because you’re more up to speed than I am, but my guess is that Iterable is probably running Snowflake themselves. The way that Snowflake is building its data sharing marketplace is really like a proprietary network: companies using Snowflake can share data with other companies using Snowflake. So that would make sense to me in that context.

 

Justin Borgman  28:06

And I think that’s like Snowflake’s view of world domination. It’s like, if everybody’s using Snowflake, then great. Yeah, it’s a happy world, you can share among Snowflake databases. So I get it from a business perspective. And obviously, they’ve been a very successful business and Frank Slootman is a very successful CEO. However, I don’t think it necessarily reflects the reality of the data landscapes that customers have. I think it’s probably naive to think that everything will get ingested and sucked into Snowflake databases so that it can be shared and used. So our approach basically just says, all data sources are essentially equal and we can work with any of them. But to answer your question about the future of ETL. So I think it’s the E and the L that we’re most focused on making optional, I guess you could say, there may still be times where you want to do the T, for sure. And I think like the way we see the future of this industry, moving forward, we think there’s going to be great reasons to pull data together into one physical place, maybe it’s to power a particular dashboard, or for certain applications, it would make a lot of sense to pull data together. But we think that increasingly, that will be the data lake because of the economics involved, right? Like at the end of the day, the data lake is always going to be your lowest TCO play, the storage is going to be the cheapest, whether it’s S3, Azure Data Lake or whatever. And you get to work with these open data formats that we already touched on earlier. So you’re not locked in. And so we think that’s going to be like your best bet for when you need to consolidate data. And then for other cases, you can just query the data source directly and again, that kind of goes back to that optionality. So I guess to summarize, I would say I don’t think ETL goes away, but I think it becomes more optional.

 

Eric Dodds  29:56

Interesting, just to jump in there, Justin, that is really insightful and I’m going to put my marketing hat on here, because I’ve been burned many times by a marketing tool saying we have this direct integration. And in reality, it’s actually just a sort of behind-the-scenes ETL job. And so it makes total sense that, if it really is delivering on the promise, it probably is that they have their data in Snowflake. And from an actual data movement standpoint, that makes a ton of sense. That was just very clarifying for me, because it’s like, yeah, I’ve heard that so many times before. And it’s not true, they’re actually just running some job in the background. And it’s not in real time. And of course, ETL has major problems when it comes to things like schemas, and all that sort of stuff. But if both systems are in Snowflake, that would actually work pretty well. But then to your point, you’re in the Snowflake ecosystem, right, and their boundaries. So I just appreciated that as a marketer, understanding the technical limitations of problems I faced before trying to move data around.

 

Kostas Pardalis  31:05

All right, that was super interesting. I’m very interested in ETL, as we can all understand. So Justin, let’s chat a little bit more about Starburst as a product, right? And my first question is, at what stage of maturity of the data stack, as we talked about, does it make sense for Starburst to become part of that stack?

 

Justin Borgman  31:27

Yeah, well, it depends on where you’re starting from. We kind of think about customers on a journey to somewhere, but they’re all starting at a different point in time. For some of our customers it’s simple. The simplest way to get started with us is you have data in S3, and you want to query it. And you’re currently thinking about, well, do I load it into a data warehouse like Snowflake? Or do I just leave it in open data formats? Do I use something like Athena on AWS, which by the way, is actually Presto/Trino under the covers; that’s what powers Athena. How do I want to build my modern data warehouse type of stack? And that’s a great application. That’s where the leading internet companies end up using our technology. They have the luxury of designing their stack from the ground up. And very often it is a data lake in S3, or some other cloud object storage, and just querying it directly with Starburst. And in that sense, you’re essentially building an alternative data warehousing style platform. Again, you might use Iceberg, you might use Delta, you might use Hudi, if you want that ability to do updates and deletes as well.

 

Justin Borgman  32:34

So that’s a very simple place where people often start, particularly if they have the luxury of starting with a clean slate. Another place that customers start is they say, Okay, I have a data lake, but I also have a bunch of other databases. And maybe I’ve got Mongo, maybe I’ve got Oracle, and I really need to join a table that I have in Oracle with some tables that I have in S3, or Hadoop. And that’s another great place to start: really combining datasets that currently live in different silos. And we can very easily provide fast SQL access to both systems.
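
As a rough illustration of the cross-silo join described here, the sketch below uses two independent in-memory SQLite databases as stand-ins for an Oracle database and a data lake. The table names, columns, and the `federated_spend` helper are hypothetical; this is a toy model of the idea, not how Starburst or Trino is implemented.

```python
import sqlite3


def build_sources():
    """Create two unrelated databases that stand in for separate silos."""
    # Hypothetical stand-in for an operational database such as Oracle.
    oracle_db = sqlite3.connect(":memory:")
    oracle_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    oracle_db.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )

    # Hypothetical stand-in for files sitting in object storage.
    lake_db = sqlite3.connect(":memory:")
    lake_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    lake_db.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 40.0), (1, 60.0), (2, 25.0)]
    )
    return oracle_db, lake_db


def federated_spend(oracle_db, lake_db):
    """Join customers (source A) with orders (source B) in the engine layer."""
    # Each source only answers a query about its own data.
    totals = dict(
        lake_db.execute(
            "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
        )
    )
    customers = oracle_db.execute("SELECT id, name FROM customers ORDER BY id")
    # The join itself happens here, outside either source system.
    return [(name, totals.get(cid, 0.0)) for cid, name in customers]
```

In a real federated engine the user would write a single SQL statement instead. In Trino, tables are addressed as `catalog.schema.table`, so the two sources would simply appear as two catalogs in the same query.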

 

Justin Borgman  33:12

Another way that people think about us is as an abstraction layer that hides the complexity of data migration. So a lot of people are going through digital transformation where they want to move data off of Teradata or Hadoop, and they want to move it to the cloud. But that can be a pretty disruptive endeavor if you’re trying to really just turn the system off and move into some totally different system. So another approach is you connect Starburst to those systems, have your users end up sending queries to Starburst. And that gives you a bit of breathing room and the luxury of time to kind of move tables out of one system and move them into another system more gradually, without the end user having to know where the data lives. And that’s sort of like hiding where the data lives. That’s thinking of us as a semantic layer, essentially, above all the data. So those are kind of three different areas where we typically start working with customers.
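
The migration pattern above can be sketched as a routing layer that maps logical table names to whichever system currently holds them. The `TableRouter` class and the system names below are hypothetical and only illustrate the idea that users keep querying the same name while a table moves.

```python
class TableRouter:
    """Maps logical table names to the system that currently stores them."""

    def __init__(self):
        self._location = {}  # logical table name -> system name
        self._systems = {}   # system name -> {table name -> rows}

    def add_system(self, name):
        self._systems[name] = {}

    def load(self, system, table, rows):
        self._systems[system][table] = list(rows)
        self._location[table] = system

    def migrate(self, table, target):
        """Move a table to another system and update the routing entry."""
        source = self._location[table]
        self._systems[target][table] = self._systems[source].pop(table)
        self._location[table] = target

    def query(self, table):
        """Users ask for a logical table and never name the backing system."""
        return self._systems[self._location[table]][table]

    def where_is(self, table):
        return self._location[table]
```

A user calling `query("sales")` gets the same rows before and after `migrate("sales", "cloud_lake")`, which is the breathing room described above.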

 

Kostas Pardalis  34:03

Yeah, that makes total sense. And let’s talk a little bit about the experience, the product experience. And when I say the product experience, I have two personas in mind: one is the data engineer, the person who is maintaining the data infrastructure and probably has to interact with the service as a piece of infrastructure, and then there are the users who are querying the data, right? So they are obviously different. So what’s the experience these two personas have when they are interacting with Starburst?

 

Justin Borgman  34:37

Yeah, so for the data engineer, they first of all have two choices. We have really two product offerings. Today we have Starburst Enterprise, which you manage yourself. So if you want to control the entire infrastructure, maybe you want to deploy on prem, maybe you want to deploy in the cloud, but you have a particular setup that you want to maintain. Maybe you need Kerberos integration or LDAP integration or you want to run on Kubernetes on prem, you have a lot of flexibility with Starburst Enterprise, but you have to manage it yourself. So that’s for somebody who’s up to that challenge, or maybe who has the requirements to run in their own environment.

 

Justin Borgman  35:13

The other option is something called Starburst Galaxy. And Galaxy is a cloud hosted offering; we manage all that complexity. And essentially, you have a control plane that allows you to connect to your different data sources, and configure the system. You can auto scale up and down, so you’re using your EC2 resources efficiently. You can even auto suspend the cluster, where it will just shut off automatically if it’s not being queried. And because we’re like a database without storage, restoring it takes a few seconds. And we’re connected already to the data sources you have. So there are a lot of nice ease of use features, in particular around Galaxy, to make the data engineer’s life as seamless as possible. For the end user, the experience for both platforms should be roughly the same in the sense that this whole thing should be pretty transparent, meaning that they are just using their favorite tool, whether it’s a query tool and they like to write their own SQL, or they’re using a popular BI tool. And that connects to either our JDBC, ODBC, or REST API. And now they’re accessing data. And they can be joining Table A in one data source with Table B in another data source, and not have to deal with any of that complexity, going back to Eric’s earlier question about the growing complexity of the data stack. So we really try to hide that from the end user.

 

Kostas Pardalis  36:32

Are there some requirements from the side of the data sources in order to work properly with Starburst? In terms of data modeling, for example, like, what are the limitations there? How do you take something from Mongo, right, which is like a document based database and something from Postgres, for example, and query them at the same time? Like, how do you do that?

 

Justin Borgman  36:55

Yeah, so the short answer is we have this notion of connectors, but the word connector almost sells it short, because the connectors are actually pretty sophisticated; there’s quite a bit of logic involved in each one. And each connector is different based on the source system that you’re working with. So in a nutshell, the connector is connecting to the catalog of the underlying system, and knows how to essentially pass that SQL query or execute that SQL query or translate that SQL query to the underlying system. It also has the ability to do push down in some cases, to minimize the data moving over the network. Some connectors are parallel. So if you’re connecting to an MPP database system, like let’s say, Oracle, or again, Teradata, or Snowflake, that creates a parallel connection, so you get an even faster read. So each connector is a bit different. But that’s essentially where the logic lies that tells the system how to actually pass through and execute that query.
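
To make the pushdown idea concrete, here is a minimal sketch assuming a hypothetical `Connector` interface with a single `scan` method. It is not the Trino connector SPI; it only shows why pushing a filter down to the source reduces the rows that cross the network.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[int, str]


class Connector:
    """Hypothetical connector: scans a table, optionally with a filter hint."""

    supports_pushdown = False

    def __init__(self, rows: Iterable[Row]):
        self._rows = list(rows)
        self.rows_transferred = 0  # rows shipped from the source to the engine

    def scan(self, min_id: Optional[int] = None) -> List[Row]:
        if self.supports_pushdown and min_id is not None:
            # The source applies the filter itself and ships only matches.
            out = [r for r in self._rows if r[0] >= min_id]
        else:
            # No pushdown: the source ships every row.
            out = list(self._rows)
        self.rows_transferred += len(out)
        return out


class PushdownConnector(Connector):
    supports_pushdown = True


def run_query(connector: Connector, min_id: int) -> List[Row]:
    """Engine side: request a filtered scan, then re-apply the filter locally."""
    rows = connector.scan(min_id=min_id)
    # Re-filtering keeps the result correct whether or not pushdown happened.
    return [r for r in rows if r[0] >= min_id]
```

Both connectors return the same answer; the difference shows up only in `rows_transferred`, which is the cost that pushdown is meant to reduce.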

 

Eric Dodds  37:54

Interesting. So that was going to be one of my questions is maybe a way to frame this to be like ergonomics. So like in terms of the ergonomics, like it is writing SQL and then having the connectors and so again, that abstraction layer, where you’re not having to go a level, is that the idea?

 

Justin Borgman  38:15

Yeah, yeah. So those connectors, I mean, many of them were created by us. Some of them were created by others in the community. And again, they vary in terms of the level of performance or sophistication. The most popular ones tend to be the fastest and most feature rich, just because we have the most people using them. But yeah, that’s exactly right. In fact, you can build your own connector. I was just speaking with a customer who had their own homegrown time series database, and they wanted to create a connector to it. And they were asking, like, how do I build a connector? And it’s open source. And we can point you to the documentation on how to create a connector to your data source as well.

 

Kostas Pardalis  38:54

So Justin, from what I understand, Starburst is mainly for asking questions, right? It’s like a querying mechanism. Do you also have use cases where people are using it to write data back? For example, I’m creating some features to train a model, right? Or something like that. So I need to write the information that I have created out of the initial data sets back into S3, so then I can use, as you mentioned as an example, Databricks and train my model. Is this something that you see as a use case, and is it also something that can happen with the product right now?

 

Justin Borgman  39:34

Yeah, it can. Now it depends on the data source of the connector again, but yes, many of those connectors do support the ability to write data back. In fact, we discovered some actually pretty interesting use cases that we wouldn’t have even thought of, where companies are doing what you described, and also even doing kind of ETL style workloads, despite our conversation earlier, where they’re taking data out of one system, maybe it’s a traditional data warehouse, and writing it to Google Cloud Storage to then be ingested by BigQuery. And they’re using Starburst as that federation layer. So it’s pretty flexible that way. Yeah.
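
The write-back pattern can be sketched as reading from one system and writing the result into another. The example below uses two in-memory SQLite databases as stand-ins, and the `copy_table` helper is hypothetical. In Trino itself this kind of job is usually written as `CREATE TABLE ... AS SELECT` or `INSERT INTO ... SELECT`, for connectors that support writes.

```python
import sqlite3


def copy_table(source, sink, select_sql, target_table, columns):
    """Read rows from one system and write them into another (CTAS-like).

    `columns` is a list of column definitions such as "id INTEGER".
    Identifiers are interpolated directly, so this is for trusted input only.
    """
    rows = source.execute(select_sql).fetchall()
    col_defs = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    sink.execute(f"CREATE TABLE {target_table} ({col_defs})")
    sink.executemany(
        f"INSERT INTO {target_table} VALUES ({placeholders})", rows
    )
    sink.commit()
    return len(rows)
```

The function returns the number of rows written, which is a convenient check when the sink is later ingested by another system.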

 

Kostas Pardalis  40:11

No, that’s super, super interesting. So if I understand correctly from all the conversations that we have had so far, a very solid stack would be: I have my data lake with something like Hudi or Iceberg. That depends on me. From the Starburst point of view, it doesn’t matter what kind of format I’m using. Then on top of that, I can have Starburst to query the data, right? And on top of that, I have a BI tool like Looker, for example, or Tableau, right. And I can use either the on prem version of the product, or I use cloud. Yep. So how important, and this is a question that’s not just technical or product oriented, it’s also a question to the CEO of the company: how important is the cloud model for data related products? It’s something that we have seen happening with many companies. Databricks, for example, is a case like this. Confluent, right. And it’s also a very common evolution that we see with open source projects. We start with the project, and we end up also offering a cloud solution. How important is this, and do you see any alternatives to that if someone wants to monetize a data-related product, especially if it starts from an open source project?

 

Eric Dodds  41:36

Man, heavy, heavy questions, Kostas. Give a softball question.

 

Kostas Pardalis  41:44

So I have asked my question.

 

Eric Dodds  41:46

Absolutely. I mean, Justin’s solving the problem. I’m super interested.

 

Justin Borgman  41:52

Yeah, absolutely. Look, I think cloud-hosted solutions are the new frontier for building businesses around open source. And I think there are a couple of reasons for that. I think, first of all, it gets you out of the sometimes challenging situation of deciding what to contribute to the open source versus hold back for your enterprise edition, which can sometimes be, you know, challenging conversations, because you want to grow the open source project, because that’s your adoption vehicle. But you also want to be able to convert that, so you end up with this tension between growing the pie and increasing your share of the pie, right. And I think that the cloud offering takes a lot of that away, because you’re actually adding a new dimension of value for the customer, which is you’re removing complexity, and you’re making it easy, and people are very willing to pay for that. I think that’s the way they’re used to consuming products now at this point. So yeah, big deal for us. I mean, I think Confluent and Mongo are great role models for us in particular, largely because both of them actually went through the same journey that we’re going through, where they had a self managed enterprise edition, and then built a cloud offering. And they really serve both markets and have these markets kind of work together. For Mongo, it was the Atlas product, which is their cloud product. Confluent has built a cloud offering as well. And look at what we’ve seen in both cases. In fact, Mongo had a nice jump in stock price a few weeks ago; Atlas now represents more than half of their revenue, and is the fastest growing part of their business. And similarly, for Confluent, the cloud offering is maybe less of a share, but the fastest growing element of their business as well. And so we’re very bullish on the future and the prospects of a cloud product here.

 

Kostas Pardalis  43:37

Yeah, it’s very interesting. One last question from me. And I know that you have a lot of experience also in the enterprise space, where until recently we primarily had the model of on prem installations. Many people predict that the cloud is going to dominate completely, right? Like all these large enterprises out there, they are going to migrate completely to the cloud. Do you see this as the net result in the end? Or do you feel like things are going to be a bit more hybrid in the end? Like, what’s your opinion on that?

 

Justin Borgman  44:11

I really do think they’re going to be hybrid, either for a very long time, or forever, at least long enough that it feels like it will be forever. We serve a lot of financial services customers, we serve a lot of healthcare customers, and these regulated industries are going to be just more cautious about putting their data somewhere else. And also, not for nothing, I think there are actually sometimes TCO arguments to be made for running some infrastructure on prem, despite the complexity of having to run your own data center. So I think we’re going to live in a hybrid world, at least among large enterprise, Fortune 500 customers, for quite a long time. We think that’s also good for our business in the sense that we can provide connectivity even across, you know, from one cloud to another cloud or from the cloud to on prem.

 

Eric Dodds  45:00

Super interesting. We’re getting close to time here. One thing I’d love to do is actually just take a step back and talk about Presto and Trino. Because you were there towards the beginning, and you have some insight, I would just love to know, how have those projects developed individually? And what are the differences? I think Presto is pretty familiar to a lot of our audience in general, but the difference between Presto and Trino and just the way that those communities developed, you have some specific insight there and we would love to hear about that.

 

Justin Borgman  45:37

Yeah, okay, sure. So Presto, just as a refresher, was created at Facebook in 2012, and open sourced in 2013, created by Martin, Dain and David and a guy named Eric, as well. And all those guys work at Starburst today. But in 2012, 2013, they worked at Facebook. And I actually first met them in roughly 2014. So maybe a year after Presto had been open sourced. And we started collaborating together, again, while I was at Teradata. And that collaboration grew over years. And my team at Teradata, which had been acquired from Hadapt, was contributing and they became leading contributors. And so you have this really vibrant core of, call it, 10 or 12 engineers who are writing the overwhelming lion’s share of the project. That continued and Starburst was formed in 2017. And actually, initially, the creators of Presto were still at Facebook. And it was not until maybe a year or so after we had started with Starburst that they decided to join us. And actually, before they joined us, they had left Facebook over a disagreement about how the project would be governed, how it would be run. And Martin, Dain, and David were very adamant that it be a meritocratic sort of governance model. And Facebook had Facebook’s priorities, which makes sense, right? Like they wanted to take the direction that benefited their needs. And by the way, Facebook was running basically all of their analytics on the project, so it had become very core and very strategic to them. But these were slightly divergent goals, where Martin, Dain, and David wanted this open community, a vibrant diversity of users and contributors, where you would earn maintainer or committer status based on the merits of your contributions. And Facebook was like, we got to ship this feature, we need to do this thing for our business needs. And because of that, they ended up parting ways.
And so Martin, Dain, and David left Facebook and continued developing, but developed on a different code repo called Presto SQL. So there was Presto DB and Presto SQL. And for a few years, nobody knew that there were two Prestos; people weren’t really paying attention. But there were actually these two divergent code repositories. Now, when they ended up joining Starburst, we already had about half of the leading contributors to the project. So the Presto SQL side ended up moving much, much faster as a development organization. And long story short, about a year ago, there were some disputes over the trademark itself, the trademark of Presto. And it turned out that Facebook ended up donating the trademark, the name, which they technically owned, because even though Martin, Dain, and David created it, they created it while employees of Facebook. So for any open source creators out there, if you are working for a company, and you create an open source project, that name is technically owned by the company you work for. So just keep that in mind. But ultimately, they donated that to the Linux Foundation, and the Linux Foundation said hey, we can’t have two Prestos. So you’re going to have to rename Presto SQL. And that’s how Trino was born. So Trino, that lineage of Presto, is what the creators and leading contributors work on, and since then a number of the leading contributors from Facebook have joined us as well, so they’re working on Trino now instead of the original Presto. Trino is what Netflix and Airbnb and LinkedIn and a lot of the big internet companies are running with. And that’s the future. But that’s the backstory of the names and how we got where we are.

 

Eric Dodds  49:16

Yeah, love it. No, that’s a great backstory and I love to peel back layers on the evolution of open source technologies. Well, we’re close to time here. Two more questions for you. One is, what’s the future look like for Starburst? I mean, we’ve talked about problems you’re solving now. But as you look at the stack, I mean, your bet is that you have hybrid on prem cloud. Stack is increasing in complexity. So we’d love to know how Starburst is thinking about the future. And then second, how can people explore Starburst if they’re interested in it today?

 

Justin Borgman  49:53

Cool. So in terms of the future, I will say we’re very bullish on this concept of data mesh. So I don’t know if your audience has heard of the data mesh at this point. But it’s basically this kind of paradigm shift that essentially recognizes that data is inherently decentralized. Not only as a practical matter, for a lot of the reasons we mentioned, but also because there are actually benefits to decentralization if you think about it the right way. And the analogy that I like to use with people is Wikipedia, where, you know, anybody can create an article, but it’s generally the expert who knows the most about that particular subject who’s writing the Wikipedia article. So you get the person writing about a particular subject area who knows it very well. And they have ownership of that. That’s part of what this notion of decentralization means from a domain authority perspective, meaning that the people who know the domain are making the decisions about how to interact with that data and what fields are available. So rather than centralization, putting everything in the hands of a data warehouse team in a monolithic way, you let the owners of the data itself essentially curate the data and publish it, serve it up to the organization as a data product. And that’s another big pillar of data mesh: thinking about data as a product, which is an interesting concept, I think, as well. So it’s an area we’re very excited about.

 

Eric Dodds  51:21

Okay, so in terms of data mesh, this is a really interesting topic, because it’s a new term; there are different interpretations of how to define it. And hearing you talk about Starburst actually is a little bit of a light bulb for me in terms of data mesh, because in the conversations that we’ve had, the challenge with defining data mesh is a tension between decentralization of data and the need to actually centralize that in a way that makes a ton of sense for the business as a whole. And so I would love your thoughts on that tension, right? Because decentralization generally applies to technology, where you have different technologies being employed by different teams, which means different formats of data, all that sort of stuff. But you still have this need to centralize it. And so I would love for you to speak to that tension as it relates to data mesh, and then specifically, is Starburst the stepping stone to making sense of that?

 

Justin Borgman  52:23

Yeah. So I mean, to me, it all centers around this concept of a data product. And having the data owners, the ones who understand the domain of that data, be the ones responsible for creating and curating that data product. Now that data product, I want to stress, doesn’t have to be a specific database, or even a specific table, or a specific data set; it could be any combination of those things. So the data product might have a table that lives in S3, and it might have a table that lives in SQL Server. Take, I’d say, the customers who spend the most and watch ESPN, if you’re a cable provider, for example. Maybe those live in two different data sets. One’s a billing data set, one is a shows watched data set, and you have them in two different systems. But the data product that you’re offering is top spend sports enthusiasts, right? That product now can span across those data sources, but it’s still offered up to the organization to consume that way. And Starburst essentially becomes the abstraction layer that allows you to serve up those products without having to necessarily reveal where those datasets live. Like the end consumer of that product doesn’t need to know it came from a data warehouse over here and a data lake over there.
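
The cable provider example can be sketched as a named data product whose consumers call one method and never see the underlying sources. Everything below is hypothetical: the `DataProduct` class, the sample customers, the owner name, and the spend threshold are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for two separate systems.
billing = {"c1": 180.0, "c2": 40.0, "c3": 220.0}  # billing data set
shows_watched = {                                  # viewing data set
    "c1": ["ESPN", "News"],
    "c2": ["ESPN"],
    "c3": ["Cooking"],
}


@dataclass
class DataProduct:
    """A curated, named data set published by the team that owns the domain."""

    name: str
    owner: str
    build: Callable[[], List[str]]

    def read(self) -> List[str]:
        # Consumers call read() and never learn which systems were involved.
        return self.build()


def top_spend_sports_enthusiasts(min_spend: float = 100.0) -> List[str]:
    """Customers who spend at least min_spend and watch ESPN."""
    return sorted(
        customer
        for customer, spend in billing.items()
        if spend >= min_spend and "ESPN" in shows_watched.get(customer, [])
    )


product = DataProduct(
    name="top_spend_sports_enthusiasts",
    owner="subscriber-analytics",  # hypothetical owning team
    build=top_spend_sports_enthusiasts,
)
```

Calling `product.read()` returns the qualifying customers without exposing that billing and viewing data come from different places.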

 

Eric Dodds  53:46

Quickly listeners who are interested in checking out Starburst, what should they do?

 

Justin Borgman  53:52

Yeah, you can check us out at starburst.io. And you’re welcome to either download the product and get started or register to use Galaxy, which is currently in beta and will be GA in November. So depending on when this podcast comes out, it may be GA already. Those are your options.

 

Eric Dodds  54:12

Awesome. Well, Justin, this has been really informative and just a great conversation. We’d love to have you back to talk about team structures around data mesh as we shed more light on that subject on the show.

 

Justin Borgman  54:24

But yeah, I think it’s a great topic. It’s probably one of the most important elements of actually implementing a data mesh. It is all about people, process, and technology, and the people being the trickiest part. So would love to.

 

Eric Dodds  54:37

Awesome. Well, we’ll catch up again soon. And thanks again for taking the time.

 

Justin Borgman  54:42

Cool. Thank you guys.

 

Kostas Pardalis  54:43

Thank you, Justin.

 

Eric Dodds  54:46

As always, a great conversation. I think my big takeaway is actually on the data mesh side of things. I think that analytics, federated analytics, as Justin talked about them, I think is the most tactical explanation of the value of data mesh that I’ve heard yet, in a way that makes sense from a technological standpoint, right? Because I think as we’ve talked with other guests on the show, one of the challenges of data mesh is, you know, fragmented technology. Everything’s decentralized. Centralization across all that is very difficult. And having an infrastructure technology agnostic solution to that makes data mesh make a lot of sense. I think my follow up question which we didn’t have time to get to is, the analytics is one thing, like taking action on that data is another thing. But that was really helpful. So I really just appreciated his perspective on that.

 

Kostas Pardalis  55:47

Yeah, absolutely. And I think we have many reasons to want to have him on another episode. There are many things to talk about. One hour wasn’t enough. Yeah, for me, I think the most interesting takeaway was the conversation around ETL. And how it is changing in these more decentralized and federated worlds that we are moving to. And it was interesting to hear from him that the E and the L are not going away. But they’re not as important as they used to be. But the transformation is there and so we will keep needing to transform the data. So yeah, it was very interesting. It was also interesting to hear about the history, the story behind Trino and the trademarks …

 

Eric Dodds  56:34

I loved it. Yeah, coming out of Facebook and …

 

Kostas Pardalis  56:37

the open source drama, which is always interesting. And yeah, I’m really looking forward to recording another episode with Justin, it was great.

 

Eric Dodds  56:45

For sure. Well, thanks for joining us again, and we’ll catch you on the next show.

 

Eric Dodds  56:51

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.