This week on The Data Stack Show, Eric and Kostas are joined by Sven Balnojan, product owner (analytics) at Mercateo Gruppe and author of the recent article “How to Become The Next 30 Billion $$$ Data Company”. Their conversation compares and contrasts Databricks and Snowflake and explores what it takes for an open-source project to succeed when a whopping 98% of them fail.
Highlights from this week’s episode include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Eric Dodds 00:27
Today, we have a guest who has written content for a while about data companies, the data space, and major trends in data, and he wrote an article called “How to Become the Next $30 Billion Data Company.” It talked a lot about open source, about major players in the data space, and about a lot of new up-and-coming players in the data space. We’ll link to it in the show notes. Sven Balnojan has an academic background: he studied mathematics, he actually has a PhD, and then he has worked in a variety of data contexts. So we’re just really excited to chat with him. Kostas, one of the things I’m interested in is that Sven talks a lot about the major players in the data space now, but argues that they’re actually not necessarily going to be the next really, really big data company, which on the heels of Snowflake’s IPO feels weird to say, because that was such a monumental event, both in the financial markets and in the tech space. But I actually agree with him, and I want to hear from him why he thinks that is. So that’s what I want to know. How about you?
Kostas Pardalis 01:38
Yeah, absolutely. I think we’re going to be discussing this. I mean, I’m passionate about that stuff anyway, and especially when it comes to Snowflake and Databricks. So yeah, I’m looking forward to this conversation; actually, that’s the main question that I have. He has this unique trait of being, let’s say, the person with a PhD, very technical, but at the same time, he really enjoys communicating with people out there. He has a newsletter; he’s blogging. So yeah, it would be nice to hear about what drives this passion for communication and expressing his thoughts through these channels. So these are the things that I would like to chat with him about.
Eric Dodds 02:16
Great. Well, let’s dig in.
Kostas Pardalis 02:18
Let’s do it.
Eric Dodds 02:20
Alright Sven, welcome to the show. You have such an interesting background. You have a PhD in a very interesting field of mathematics, but you also work in data day to day. And then you also write a lot about data as an industry and the tooling around it. So I don’t even know how we’re going to pack all the things we want to talk about into one episode. But thanks for joining us.
Sven Balnojan 02:44
Thank you as well.
Eric Dodds 02:45
Alright, well, starting out, can you just give us a little bit of your background? How did you get into the field of data? And then what day to day work do you do in data at your job?
Sven Balnojan 02:59
Yes, of course. So how did I get into data? I think 12 years ago, a friend actually asked me if I wanted to explore a problem and found a company with him, and it wasn’t in the data space. It was just about gathering lots and lots of data and providing it to a certain customer set. That didn’t work out, because we found that the different co-founders had different ideas about where the company should go, so we decided to split up. That was my very early introduction to the world of startups and data. But then later on, I did my PhD in singularity theory, which is this weird little subfield of mathematics, and also kept working at a marketing agency for five years. Then I decided to leave academia and joined a company’s internal data team, first as a data scientist and developer, and then as a DevOps engineer. And after I went through all of these steps, I decided I actually wanted to do product management, and became product owner of the internal data team, and that is where I’m at now.
Eric Dodds 04:10
Very cool. And you’ve been writing on Medium for a while, but you also have a newsletter? Do you want to just tell us briefly about your newsletter and maybe where people can sign up?
Sven Balnojan 04:22
Oh, yes, just Google “Three Data Point Thursday” to find the newsletter. I started it simply because I didn’t feel there were many good data newsletters out there that take a holistic perspective. So I decided to write my own. And I just write out my kind of special and opinionated perspectives on things that happen. Not just the new stuff; I simply collect interesting stuff I’ve found over the years working in the industry.
Eric Dodds 04:53
Very cool, and it’s a great newsletter. We are huge fans of it here on our end, and after reading some of your content for a while, I realized, well, this is great, we should just have him on the show. And I think that’s how this whole thing got started. First of all, I want to ask you kind of a funny question based on your background. So, singularity theory in mathematics, you have a PhD, can you just … two questions on that, just because it’s kind of fun to hear about people’s backgrounds … Could you briefly explain what that is? And then I’d love to know, are there any lessons from that study that you still apply in your day-to-day work with data on a practical level?
Sven Balnojan 05:38
That is a good question. For the first one: singularity theory studies places where stuff changes suddenly. Like if you imagine a function that has a little kink. And yeah, that’s basically it. It just asks, okay, how do these things generally look? What happens to all sorts of weird systems that these things could describe? So that is singularity theory.
Sven Balnojan 06:08
So I do have one thing I apply quite frequently, but other than that, I was just going to reply that I don’t use anything at all. That one thing is definitely using examples all day long. So that’s using examples to play around. When I develop products, I always go very, very iteratively and proceed in small little steps, and then work my way forward. That’s what I actually learned in my PhD thesis: doing little examples, and then trying to find the general picture inside them.
Eric Dodds 06:45
Yeah. So it’s a workflow and a process that you’re applying to just a different practice or different field of study. That’s super interesting.
Sven Balnojan 06:54
Yes, exactly. I mean, I think the great engineering professor Richard Hamming called it “napkin calculations”. And that’s the very same thing I try to apply every single day.
Eric Dodds 07:08
Love it. And love the concept of napkin calculations. That’s great. Well, let’s start to dig into the world of data, which you’ve written about, and the subject I want to tackle. We’ve touched on this in a previous episode, but Kostas, I know, in particular, is extremely passionate about the Databricks versus Snowflake conversation. And there’s a lot there, from a business perspective, a data business perspective, and other considerations around open source and all sorts of different things. But before we get into that, I’d really love your perspective on what those two platforms do. Because in one sense, you could say, okay, they’re kind of similar, right? They both allow you to store and do things with data, with a constellation of different tools around that. But they’re pretty different, and they’re used for fairly different things at this point. So could you give us a quick rundown of what Databricks does and its most common use cases, and then what Snowflake does and its most common use cases?
Sven Balnojan 08:17
Yes. So that is a great question. Let’s take the specific perspective first. Databricks came out of a university spin-off, basically, from the Apache Spark guys. And onto this massively parallel computing framework, they layered Delta Lake, which allows for ACID transactions, and then they slowly added stuff that kind of fits into that context: basically notebooks, lately also ETL transformations, and the most recent addition actually is that they acquired the company Redash, which I think you also talked about in another episode. So they now also have a front end. So that is Databricks, in detail.
Sven Balnojan 09:04
And Snowflake came out as this huge cloud data warehouse, which had a really cool feature, because it decoupled computation from storage. You could scale both things independently, which you usually can’t do in a cloud data warehouse; everywhere else, you mostly have to scale both things together. And then they started layering on stuff: data integration tools, and now they also have machine learning stuff and so on. So for me, Databricks comes from the world of unstructured data, whereas Snowflake comes from the world of structured data, and they will simply converge into one space. But if you take a step back, I think they are actually going to form a whole new sector of cool companies. Because, I mean, if you just take a look at what data companies actually do fundamentally, my take on this would be: they enable their customer companies to derive value from their data.
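The decoupling Sven describes, compute clusters that can be resized or spun down while the data stays put in shared storage, can be pictured with a toy sketch in Python. This is not Snowflake's architecture, just the shape of the idea; all names here are invented for illustration:

```python
# Shared storage lives independently of any compute cluster.
storage = {"orders": [{"amount": 10}, {"amount": 25}, {"amount": 5}]}

class Warehouse:
    """A compute cluster whose size is chosen independently of the data."""

    def __init__(self, name, size):
        # Scaling compute means changing `size`; storage is untouched.
        self.name, self.size = name, size

    def query(self, table, predicate):
        # Reads go against shared storage; no data is copied or
        # re-sharded when a warehouse is resized or torn down.
        return [row for row in storage[table] if predicate(row)]

# Two differently sized clusters querying the same storage:
etl = Warehouse("etl", size=8)
bi = Warehouse("bi", size=1)
big_orders = bi.query("orders", lambda r: r["amount"] >= 10)
print(len(big_orders))  # 2
```

In a coupled architecture, by contrast, adding storage means adding nodes that also carry compute, which is the "scale both things together" constraint Sven mentions.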
Sven Balnojan 10:13
I think they have to wrestle with four forces of data, and I call them the DAKS forces. One is the sheer amount of data, which is growing exponentially, as far as my really bad forecast tells me; the exponential growth of data, they try to handle that, and they’re certainly doing this. The second is the kind of data. In 10 years, I think the amount of data we will have will be 15-20x what we have today, but most of this data will be unstructured: event data, real-time data, lots of imagery and lots of sensor data. That’s the data we will have to deal with in the future. That’s the kind of data. Then, and I did write about that quite a few times, there’s the snowflake problem. And the snowflake problem is a very simple thing. If you go into any company, then take a second company and compare the data setups, the different sources and targets data has to flow from and into, they’re going to mismatch by probably 80%. Each one will be its own unique snowflake. And any company has to solve that snowflake problem in some way. And then the fourth force is the decentralization. And that’s a big thing. Decentralization means, one, data is now being emitted in lots and lots and lots of decentralized places, and also, it is to be used by every single employee in the company in a decentralized way. If you go into a bookstore, you will very likely pull out your phone and look at data, a book review, or check prices, and so on. So there’s decentralized data emission and consumption happening in one place. And now these are the four forces. What I’m trying to say is that companies like Snowflake and Databricks are actually moving into this space and trying to wrestle with all four of these forces. And I don’t think there’s another company currently that is doing that; there’s almost no company that is trying to take on all four of these forces.
I think Databricks has a slight edge on that, actually, because they come from the unstructured world, and Snowflake is not yet there. But that’s kind of my take on it. They are moving into this space. It’s really this new frontier in which, I think, the big IPOs in the data space will actually happen in the next five to 10 years.
Eric Dodds 12:41
Yeah, it is kind of crazy to think about Snowflake’s IPO. It’s crazy to think about another IPO in the data space that’s even bigger, because that one was so high impact. But I agree with you; I think a lot of the things we’re going to see, the ones that are going to have an even bigger impact, are still nascent.
Kostas Pardalis 13:00
So Sven, I have a question. I mean, I want to dive deeper into the differences between these two platforms. I hear that you’re passionate about these two tools, right? So can you tell us a little bit more about your experience with them? And what was the moment you realized that these two companies and these two products are on a collision trajectory, let’s say? The reason I’m asking is because, at the beginning, when Snowflake was out there, everyone was thinking that, okay, Snowflake is a data warehouse, and the main competition is going to be BigQuery, or it’s going to be Redshift, right? But it seems Snowflake had a much bigger vision than that. And today, it’s becoming more and more clear that the actual competition is going to be between these two companies, Snowflake and Databricks. So what brought you to this conclusion? And also, before that, what’s your experience with these tools? How have you used them? And how do you feel about them as products?
Sven Balnojan 14:07
So that’s a great question. I actually haven’t used either of those tools in earnest; I’ve played around with both of them. And I actually noticed Databricks because they bought Redash, because I was playing around with Redash at the time they bought it. It was a coincidence. And that was also the point where I realized their master plan, actually, to move into this huge space. So Databricks was pretty clear in their intent to move into this bigger space.
Sven Balnojan 14:42
Snowflake, on the other hand, is a weird thing. Because, as you’re saying, there are good alternatives for cloud data warehouses out there. I mean, Redshift took the market by storm, right? The weird thing is that, even though I knew that, Snowflake kept coming up as the default choice for lots and lots and lots of companies and teams and simply kept growing. So at some point I just had to check them out. And I realized that they’re actually moving in a very different direction. It’s an amazingly well-designed product that is able not just to compete with but to actually beat Redshift and BigQuery in their own markets in some way, because they’re able to differentiate themselves even against these huge players with nearly unlimited resources. And they also are adding lots and lots and lots of stuff in a very simple and easy way to their product. Which, by the way, the hyperscalers don’t do; for some reason, they tend to make stuff a little bit more complicated. So that’s my experience so far.
Kostas Pardalis 15:43
Yeah. Makes total sense. Actually, I find these companies very interesting because, in a way, they started from completely opposite starting points, right? As you say, you have the academic background of Databricks: you have a bunch of geeks in Berkeley trying to redefine Hadoop and MapReduce and come up with a new processing architecture. And they came up with Spark, and then around Spark they built the company we see today.
Kostas Pardalis 16:20
And on the other hand, you have Snowflake, where you have a bunch of people coming from Oracle, which is the exact opposite; it’s the definition of the corporate environment, right? And they said, okay, there’s something wrong in the industry, which means that there is an opportunity, let’s go build a really nice technology. And if I remember correctly, I think they started with the data warehouse, and the main concept, what was different in their case, was the separation between processing and storage, which made for amazing, awesome marketing material, let’s say. I think they made a lot of noise around that. And people started thinking, oh, this sounds cool, it sounds like something good: I can store my data and pay a different price for that than for my processing, and all these things. But I have a question for you, because you’re also a deeply technical person. Don’t you think that this separation already existed, in a way? Databricks, or Spark, didn’t store the data. The data was on a distributed file system, HDFS, right? That was the beginning, and that was also how Hadoop worked. So okay, of course, we’re not talking about a cloud solution there; we’re talking about an on-prem, let’s say, installation. But my feeling at least, and I want your opinion on that because I might be wrong, is that this separation actually already existed. It’s just that nobody talked about it that much outside of the hardcore engineers that were doing the work, right?
Sven Balnojan 18:01
Yes. I mean, it’s kind of like the big debate about self-driving cars. I keep telling people the technology for self-driving cars is already there; cars can already drive by themselves, but the practical stuff isn’t figured out yet. So almost everything is already there. I mean, the stuff that technology has already invented is mind-blowing. The hard part, I don’t think, is actually inventing something new. It’s making it easily accessible, and maybe it’s just wrapping it nicely. I think that’s actually the hard part. Not so much the invention; it’s more about letting other people in on the invention. And I think Snowflake does an amazingly good job of exactly that part.
Kostas Pardalis 18:49
Yeah, I agree with what you’re saying. And I think this is where we have to thank marketing. Eric, that’s a big part of the value that marketing brings to this world: it educates and re-educates people about the products out there. So thank you so much for that. But I think, from the technical perspective, Sven, what was really impressive with Snowflake at the beginning is that not only did we have the separation, right, but at the same time, we had a very complex infrastructure that nobody had to actually manage. And that, at least at the beginning, I think, was the difference between Databricks and Snowflake. Because with Databricks, I mean, okay, today they also have a cloud offering, which is self-serve, but back then, working with and setting up and operationalizing Spark, I don’t think that was the easiest thing to do. Right?
Sven Balnojan 19:50
I’m pretty sure it actually was, for most companies. I know from firsthand experience that running and deploying most open source tools today is a pain in the ass. For almost any tool, I would rather have a managed solution. So that point is definitely true. And you just touched on it: because Snowflake came out of this structured space, I think they must have realized that, being in this structured data space, they’ve got to move out of it. Because my personal take, as far as my forecast for the next 10 years goes, is that in 10 years, databases for structured data will have no customer base. Well, at least just a very little one. So I think all of these companies actually have to move out of this space, or at least amend their offering so much that it isn’t going to be the core anymore.
Kostas Pardalis 20:48
Yeah, it makes sense. Makes sense. Although I have to add something here about the structure of data. I remember when Snowflake came out. Back then, BigQuery wasn’t that big, right? The main competition, the main data warehouse in the space, let’s say, was actually Redshift. And one of the biggest problems you had with Redshift is that it was, exactly as you say, completely structured. It was a database: you had your tables, you had your relations, you had your columns, very well-defined data types, blah, blah, blah. One of the early selling points of Snowflake was that they had amazingly good support for semi-structured data. And when they said semi-structured data, they actually meant JSON. And I think also XML, but anyway, I don’t know who was working with XML. And they still have that. But okay, after a while BigQuery also came along, and BigQuery had really good support for nested data structures and stuff like that, so it was a very natural fit for JSON files. But the reason I’m saying all that is because I want to ask you: when you say structured versus unstructured, what do you mean? What, in your mind, is the unstructured data that Databricks was excelling at working with?
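The semi-structured support Kostas describes, querying into raw JSON by path as if it had columns, can be illustrated in plain Python. This is a toy stand-in for what warehouse dialects do with path expressions over JSON columns, not Snowflake's actual implementation; the sample document and the `extract` helper are invented for illustration:

```python
import json

# A raw JSON event, as it might land in a semi-structured column.
raw = '{"user": {"id": 42, "plan": "pro"}, "items": [{"sku": "a1"}, {"sku": "b2"}]}'

def extract(doc, path):
    """Walk a dot-separated path into a parsed JSON document,
    roughly the way warehouse SQL lets you address nested fields."""
    value = doc
    for key in path.split("."):
        # Numeric keys index into arrays; everything else keys objects.
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value

event = json.loads(raw)
print(extract(event, "user.plan"))    # pro
print(extract(event, "items.1.sku"))  # b2
```

The selling point of doing this inside the warehouse is that the document never has to be flattened into fixed tables up front; you keep the blob and query into it on demand.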
Sven Balnojan 22:22
That is a good question. And, of course, I presume it’s similar. By the way, Redshift is based on Postgres, I think, as far as I know, and Postgres has amazing JSON support as well, and I presume it parses XML, by the way, as well. So I presume Snowflake and Redshift actually had probably feature parity in support for semi-structured data. And by unstructured, of course, I mean data that comes as some kind of big blob. And yes, I also mean objects, images, and videos. The reason I’m saying that is, in the future, we will have lots and lots of data from decentralized spaces, which simply means it comes in different forms. That doesn’t mean it’s actually unstructured; it just means I might not know the structure, or I may have to deal with lots and lots and lots of different structures, and I might as well treat them as unstructured. That’s actually the point. The true question is not so much, do I have a database that supports dumping all this stuff in there; I mean, Snowflake actually, I think, has some kind of external table extensions now which go to S3, so maybe in the future they will also support images, if they don’t already do so. The questions are: what do we do with this kind of data? How do we work with it? How do I search through stuff, and all these kinds of things?
Kostas Pardalis 23:51
Yep. That makes total sense. And I think from a product perspective, if we look at the two companies, the use cases and the actual workloads, let’s say, that they were working on in the beginning were quite different. You had Snowflake on one hand, which was more traditional BI workloads: people wanted to put structured data up there, do BI, and see how the company is performing, that kind of stuff. Then on the other hand, you had Databricks, where, if I remember correctly, one of the most important first use cases was actually ETL. You would see, many times, actual data warehouses used together with Spark; Spark was used for ETLing data, especially when you had very unstructured data, as you mentioned, Sven. And of course, after a while, one of the main use cases for Spark became, okay, let’s train models, let’s do statistical analysis, let’s do things that don’t naturally fit the SQL dialect of a data warehouse. And of course, this gap is closing right now, and we can discuss that more later. But that’s a testament to the type of data that Spark can work with: you can have unstructured data, you can work with text, you can work with CSV files, you can work with video files or binary files and stuff like that. Things where, okay, a data warehouse or database might have a binary blob type that is supported there, but that’s it; it’s not like you’re going to train a sophisticated model on top of that. And that’s also reflected in the tools they were supporting, right? With how you can work with Spark through Pandas, you would see the difference also in the people that were using them: the data scientists versus the analysts. Each of these categories has different tools, and these products, solutions, and technologies were accommodating these different users.
But do you see that this has been changing lately? Do you see that Databricks, for example, is being used more for analytical purposes, let’s say, and not just these kinds of sophisticated ML workloads? And on the other hand, do you see, or anticipate, that Snowflake might try, or already does, more of the ML and data science related workloads? Because in that case, they need different features that they might not have and would have to introduce.
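The ETL role Kostas attributes to Spark, turning messy raw text into structured rows before any analysis can happen, can be sketched in pure Python. A real Spark job would distribute the same map/filter shape across a cluster; this toy single-machine version, with made-up log lines, only illustrates the pattern:

```python
# Toy ETL over "unstructured" text lines, the shape of job Spark was
# often used for before loading results into a warehouse.
raw_lines = [
    "2021-06-01 user=42 event=login",
    "2021-06-01 user=7 event=purchase amount=30",
    "garbled line with no fields",
    "2021-06-02 user=42 event=purchase amount=12",
]

def parse(line):
    """Extract key=value pairs; return None for lines we can't use."""
    fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
    return fields or None

# Extract -> transform -> load (here: a per-user purchase total).
records = [r for r in map(parse, raw_lines) if r]
purchases = [r for r in records if r.get("event") == "purchase"]
totals = {}
for r in purchases:
    totals[r["user"]] = totals.get(r["user"], 0) + int(r["amount"])

print(totals)  # {'7': 30, '42': 12}
```

The map/filter/aggregate steps here correspond directly to Spark transformations; the point of the engine is that the same logic scales when `raw_lines` is billions of lines on a distributed file system.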
Sven Balnojan 26:35
Yeah, sure. So on the Snowflake side, it’s actually true, I don’t know their recent product development that well. I can just imagine that the analytical side of the business might actually be a problem there, because they still have to keep this customer base and not dilute it away while moving into this new space. I mean, it’s a challenge that can be overcome. On the Databricks side, I’m not sure as well. Obviously, they’re trying to move into the analytical space, because they have this new feature area, Delta Live Tables, I think, which is basically models you can put into Databricks in a very easy way, and by the way, also in SQL. And you have the Redash integration, which means you can actually type SQL right into the query editor. So quite a bit of the analytical work that other companies would run on Snowflake, you can put into Databricks. But they’re not quite there yet, because they have to carry along the data engineers and the data scientists, all of these guys, because they still live in this node world, in this Apache Spark cluster world. They do have support for dbt, for instance, but they’re slow to move in that direction. And I feel, paradoxically, that Snowflake is slow to move in the machine learning and data scientist directions. It’s actually an interesting play to watch.
Eric Dodds 28:12
I have a question for both of you. So Sven, you mentioned these four major forces that are going to be drivers of the new category of companies that we see; if you could give us just a quick review of those. And then I’m interested to know which one you think will drive the first wave, as far as the problems or the pain points it’s going to create, that this next wave of data companies will respond to. Kostas, the reason I’m asking is that the available technology dictates the behavior of an organization, right? So if you’re using Snowflake and you come from the analytical, structured data background, you align your processes to match that. And then technology changes. But organizational change is pretty hard, right? Because you’ve built your processes and data flows and pipelines and teams around an infrastructure that’s more focused on structured data. So I’m just interested to know which of the four forces you mentioned do you think is going to create the first wave of organizational change that responds to technological change?
Sven Balnojan 29:27
That’s a great question. So to recap, the four forces are dragons that have to be tamed: the decentralization of data; the exponentially growing amount of data; the kind of data (unstructured, real-time, event-based data); and the snowflake problem. These are the four dragons I’m talking about. And I have no idea. I feel all four are actually equally important, and all have to be solved. I see a lot of companies just focusing on the growing amount and on a specific kind of data. But I also see the decentralization movement that is actually happening; that’s the pattern of the data mesh emerging, which is about decentralizing data usage and production. So I feel it will be all of them together.
Kostas Pardalis 30:24
Yeah, that’s an excellent question. Actually, in my opinion at least, I don’t think it’s the nature of the data that is going to drive this, mainly because before you start dealing with a problem, you don’t really know which dimensions of the data are going to affect you more. And I totally agree with Sven: pretty much all of them are relevant; some are just more important in some cases than in others. And I will give a very easy example to explain what I mean. If you take a B2C and a B2B company, one of the most obvious differences in dealing with data is volume, right? A B2C company has, from almost day one, much, much more data compared to what a B2B company will have, even when it reaches a billion-dollar valuation or whatever. On the other hand, with a B2B company you have the decentralization, because the data is suddenly siloed in so many different tools, and somehow you need to access all of it, and you have the complexity of the data. Take Salesforce alone: the default setup of Salesforce, if you try to pull the data out of there, comes with probably, I don’t know, 250-300 tables, right? And that’s the bare minimum. Do you need all of them? I don’t know. But they are there. And I’ve seen companies, especially when they have more applications installed on top of SFDC, where you can easily have 800-900 tables. Okay, that’s a huge, huge complexity, even if the volume is low.
Kostas Pardalis 32:03
So what I would say is that, in my opinion, what is going to happen, and what we are going to see, is that there are going to be business reasons that push companies into trying to figure out opportunities to create value for themselves from the data that they have. And as they start trying to do that, either by doing more sophisticated analytics, or more complete analytics, or by starting to do predictive analytics, they will get to the point where they say, okay, my data looks like this, so I need a solution that can accommodate that. And these things, of course, are going to change, right? I think the industry is still learning, so all these best practices around what you need, if you come from this space or the other space, I think they’re still being defined, as are the technologies. I think what is important for Snowflake and Databricks is how they can create platforms that can accommodate potentially all the different use cases. So when they land inside a company for one use case, they then have opportunities to expand. Because both of these solutions take a lot of investment from a company to start using them, right? So they are very sticky.
Eric Dodds 33:19
Super interesting. Okay, a follow-up question for both of you based on that. Because, I mean, we don’t know which of the four dragons are going to have industry-wide impact, and I’m sure, depending on the business and the volume and the type of data, different individual companies will probably face them at different points. But this question is, I think, interesting, just based on a lot of examples I’ve thought about recently: companies who have a core competency more on the Databricks side and are moving towards the structured data side, and then the other way around, where you have a core competency on the analytics side and you want to move more into the Databricks machine learning, data science side. Do you think that moving in one of those directions is easier than the other?
Sven Balnojan 34:11
Oh, that is also a good question. If you look at both of these companies, I don’t think so. I mean, they’re both trying, and even though these two companies are actually the experts in these two moves, they both seem to have trouble. So I think these are both hard moves. That would be my take on that.
Kostas Pardalis 34:34
Yeah, that’s an excellent question. I would say that what Databricks is trying to do is much more ambitious and harder. If they manage to do it, I think it’s going to be an amazing feat, and I’ll explain why. But before I explain why it’s harder for Databricks, I’ll talk about our friends at Snowflake. Snowflake recently launched a product called Snowpark, or something like that, I don’t remember exactly the name, but they’re playing a little bit with Spark in the name. And actually, it’s a very competitive product to Spark. So they are adding this kind of functionality there. They want to support ML use cases, they want to allow users to build much more, let’s say, sophisticated business logic on top of the data engine that they have. So they are moving into the space of Databricks, and they do it fast. And if there’s one thing that nobody can deny about Snowflake, it’s their execution; they’re extremely efficient at executing. Okay. Now, on the other side, our friends at Databricks started with a distributed processing engine, right? And that’s very good when you want to process data in the way that you usually do in machine learning, where you have all your data there, you know exactly what kind of processing you want to do, and you want to scale that, because you have a lot of data, for example, or because the computations you are doing are so complex that they have to be spread across more than one machine. So they have that.
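The scale-out processing pattern Kostas describes here is essentially map, shuffle, reduce over partitions of the data. As a rough sketch (this is a single-machine toy, not Spark itself; Spark runs the same structure in parallel across a cluster, and the example data is made up):

```python
from functools import reduce

# Classic word count: the data is split into partitions, each partition is
# mapped independently (Spark would do this on different machines), then the
# intermediate pairs are shuffled by key and reduced.

def map_partition(lines):
    """Map step: emit (word, 1) pairs for one partition of the data."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group pairs by key, as Spark does between stages."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

def reduce_groups(groups):
    """Reduce step: combine the values for each key."""
    return {key: reduce(lambda a, b: a + b, vals) for key, vals in groups.items()}

partitions = [
    ["spark makes this", "pattern automatic"],
    ["this pattern scales"],
]

mapped = [pair for part in partitions for pair in map_partition(part)]
counts = reduce_groups(shuffle(mapped))
print(counts["this"])  # "this" appears once in each partition -> 2
```

The point of the pattern is that no step needs to see the whole dataset at once, which is what lets the work spread across many machines.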
Kostas Pardalis 36:23
Now, what they are trying to do, and this is, let’s say, the promise of Delta Lake, or the Lakehouse as they call it, is to take the concept of the data lake and add some of the stuff that is very core to a data warehouse: ACID guarantees, or transactions. You don’t have transactions in Spark; initially, you didn’t need transactions, because when you’re using a system like Spark, you don’t have one user competing with another user in accessing and changing state, right? But if you want to do analytics, if you want to run typical data warehouse workloads, you need to have that. And they are trying to build this on top of the architecture of a data lake. Now, if they manage to do that, what they are going to do at the end is pretty much figure out a way to take a database system and turn it inside out, which is quite a feat if they pull it off technically. Keep in mind that building databases and ensuring all the guarantees that a database needs is not a simple task, right? Databases are extremely complex systems, especially when we are talking about distributed databases. Of course, they have some of the smartest people in the industry doing that; that’s the good thing about having a company that started from a bunch of geeks from Berkeley. So it would be amazing if they managed to do that, because they are going to completely change the way that we work with and build databases. It’s going to be a huge paradigm shift if they do that.
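The core trick behind adding transactions to a data lake is an ordered, append-only commit log kept next to the immutable data files: a write is atomic because it is a single new log entry, and readers replay the log to get a consistent snapshot. Here is a minimal toy sketch of that idea (the file names and log layout are illustrative only, not Delta Lake’s actual format):

```python
import json
import os
import tempfile

# Toy "table": immutable data files plus an ordered log of commits.
# Each commit is one JSON file named by version; replaying the log in
# version order yields the current, consistent set of live data files.

table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_log")
os.makedirs(log_dir)

def commit(version, added, removed=()):
    """Record a change atomically: one JSON log entry per table version."""
    entry = {"add": list(added), "remove": list(removed)}
    with open(os.path.join(log_dir, f"{version:08d}.json"), "w") as f:
        json.dump(entry, f)

def snapshot():
    """Replay the log in order to compute the current set of live files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        live -= set(entry["remove"])
        live |= set(entry["add"])
    return live

commit(0, added=["part-000.parquet"])
commit(1, added=["part-001.parquet"])
# Replace part-000 with a rewritten file, e.g. after a compaction:
commit(2, added=["part-002.parquet"], removed=["part-000.parquet"])

print(sorted(snapshot()))  # ['part-001.parquet', 'part-002.parquet']
```

Because readers only ever see whole committed log entries, a half-finished write is simply invisible, which is the "turn the database inside out" move Kostas is describing: the transactional machinery lives beside plain files instead of inside a database engine.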
Eric Dodds 38:06
Yeah, I think it’s really interesting. When I think about this on a very practical level, without necessarily considering the technology, but considering the day-to-day progression that we see in companies: one thing that I’ve noticed over the past several years is that getting your analytics to a good place, generally, is a catalyst for machine learning projects. Because when you have enough data in a clean state where you start to derive insights, that’s a fertile environment for machine learning to accelerate lots of different projects. And you have to be at a certain scale for machine learning to add a ton of value to your organization. So I think it’ll be really interesting to see. If you think about it from Snowflake’s standpoint, maybe Databricks, from a technology standpoint, comes after Snowflake more quickly. But Snowflake may have a user base that they can pull into the machine learning world because they have a fertile foundation for it.
Eric Dodds 39:21
Alright, we are actually pretty close to the end of the show. But let’s hit one more subject. This has been an awesome conversation. I want to hit one more subject. And Sven, you have written a lot about open source. And so one topic that I’d love to hear you expound on is–and the background behind this question is, you’ve written a lot about business models, which we probably need to just do a show on that alone, because that’s a fascinating topic and you’ve recently written about pricing and open source business models–I kind of want to do the same thing we did with the Databricks versus Snowflake conversation where we start at the very beginning. And I think you would say, and correct me if I’m wrong, when you think about open source business models, you have to ask the question, what is a successful open source project? And then you can talk about how that feeds into a business model. So talk us through what is a successful open source project? And what are the characteristics of that?
Sven Balnojan 40:31
Again, excellent question. I actually do have a blog post I’m currently writing, and I actually do have around 300 blog posts which I haven’t published yet, and that might be another one I’d add. So the open source industry is actually really young. Open source has been around for 20-25 years, something like that.
Sven Balnojan 40:59
So what’s a successful open source project? Just to remind you, the usual research says about 98% of open source projects fail, okay? It actually went up from 90% four years ago to 98%. So I think there’s kind of a three-dimensional, three-tier model of open source projects. The very first dimension is where you work on your own project, your one repository or multiple repositories, and you basically try to expand the project and get people involved. And what do you want to do? You want to get other developers to, first of all, use your crude new tool. Success in that dimension simply means lots of people using it and lots of people contributing to it. That part is easy: you make it really easy to use, easy to deploy, and really easy to contribute, providing whatever SDK, CDK, making it easy to set up the test environment, and so on. But that’s just the very first dimension. WordPress, and the company Automattic, actually started out that way. They worked for two to three years on just the WordPress core, which was all there was at the beginning. And then they realized, okay, we need the second dimension, which is what I call guided extension. That’s the space where you start to add modules, plugins, extensions, and the ability for others to customize your project in their own way. And then you get a little bit of an explosion of repositories which you don’t own anymore. That can actually be a scary thought, the stuff you don’t own anymore, but that’s your success in that dimension. And you have to reach a certain level of success in dimension one before you can start working on success in dimension two. So WordPress did that in 2004 or 2005 when they introduced plugins. Success there means making it really easy for people to extend stuff, to add plugins: an SDK, something like that.
But then, around the second dimension, you actually also want to start to create an ecosystem: you want to have companies that sell themes, you want to have a company that actually sells plugins, because these companies are core contributors, and they have a business incentive to contribute to your product and make it successful. That’s what WordPress did, I think, at the end of 2008 or so. Just working on these two dimensions, they hit 12% adoption across the CMS space; 12% of the world wide web was powered by WordPress, which already is an amazing feat. And then they must have realized somewhere along the way that they actually need a third dimension. Once you hit that space of guided extension, you actually don’t go into the third dimension on purpose, it just happens. It’s the one of unguided extension. For WordPress, that happened in 2011 with the appearance of e-commerce sites based on WordPress. That’s the space where they thought it’s a blog, and then they thought it’s a static website, and then they realized, oh no, actually, people are starting to use our thing and build something completely new out of it. So WooCommerce, for instance, emerged in that space. And that is, I think, the scariest part for most open source projects, because I think only a dimension-three open source project is a truly successful project. It means helping others build their own “thingy” out of your project; it means enabling others to actually build their own WordPress, just targeted at e-commerce, for instance. And WordPress then reconciled that with their business incentive in the way that they simply bought the company. Okay, so that’s a way to deal with that. So that is what I think is a really successful open source project. And by the way, it took them 15 years to get to that space, and other projects took similar time, if you think about Linux, or MySQL, and so on.
Eric Dodds 45:19
Yeah, it’s super interesting. And I love the way that you’ve consolidated a lot of different components into a very concise set of characteristics. One question I have, though: it really struck me that the failure rate for open source projects has increased and is now just incredibly high, and then there’s the example of WordPress. What struck me was that a lot of open source projects are fairly limited in scope, whereas in the examples you gave, MySQL or Linux or WordPress, the total addressable market for what the foundational technology enables is enormous. If you think about a CMS, the World Wide Web is expanding at an unbelievable rate, the number of websites is as well, exponential growth, and so you’ve had this almost unlimited potential for growth of a CMS product for web content. Do you think there is some sort of threshold, even if we can’t define it exactly, for how large the total addressable problem has to be for an open source project to be successful?
Sven Balnojan 46:34
Not at all. I mean, WordPress is a great example, because they grew into that market. In 2004, they were probably targeting a tiny market; they were just doing blogging, and that was probably a thousand blogs worldwide, maybe that’s too low, but it was a super small market. And then they extended to the static site market, which was also small at the time. So, I mean, they kind of found their way and built these tangible things on top of that. So I don’t think so. It depends on the project, and I think every product can find its way.
Eric Dodds 47:08
Yeah, I think that’s a great point. And that’s a really good reminder that it really was blog focused at the beginning. And actually, I mean, really, in many ways, the community forged it into a more comprehensive, flexible CMS. And I mean, even user accounts now and all sorts of stuff that were way beyond the initial scope of blogging, which was, I think, just a really great answer.
Eric Dodds 47:32
Well, Sven, we are at time here, but we have so many more questions to ask you about open source. This has been a really fun conversation, and I hope that our listeners have really enjoyed thinking through some of these big market shifts with us. And we’d love to have you back on the show sometime, again, to pick back up the open source conversation.
Sven Balnojan 47:53
Sure, I’d love to do a follow up conversation. Thank you guys for inviting me.
Eric Dodds 47:57
What a fun conversation. Anytime I can ask questions that elicit strong opinions from both our guest and you, Kostas, I feel like I’ve had a great day. I’ve had a really great day, and I think I accomplished that. So thank you for that.
Eric Dodds 48:15
I think my big takeaway… I mean, there were so many things about Databricks versus Snowflake. But I really was caught off guard, I think by the statistics around open source that Sven shared about the failure rate of open source projects, which is just really interesting to me, because he’s studied the open source space a lot. He’s studied open source data companies. And his bet is that the next round of really big data companies will have open source foundations, yet the failure rate of open source projects has been increasing. And so that just is going to give me a lot to think about this week.
Kostas Pardalis 48:51
Yeah, absolutely. I totally agree with you that open source as a phenomenon in general is something we should discuss more on this show. Outside of the failure rate that Sven talked a lot about, there’s another characteristic, and that has to do with abuse. It’s very common for people out there maintaining open source repos, and pretty popular ones, to quit at some point because of all the abuse they have to go through from people who are just asking and demanding new features and stuff like that. Actually, open source is a very interesting area for seeing, let’s say, the best and the worst of human nature. So yeah, it would be very interesting to discuss this more. And I agree with him that we are going to see more and more big companies whose foundations are on open source, and that has to do with the type of the industry, the type and complexity of the problems around data, and what the competitive advantages of these products are compared to a SaaS solution, for example. And we’re just at the beginning, by the way. I mean, Snowflake and Databricks are just the first players in this space. So we have a lot to see in the future. Yep.
Eric Dodds 50:11
Yeah, it’s gonna be a really exciting decade. Well, thank you again to everyone who joined us on the show. Lots of exciting guests coming up this fall. And we’ll do a season wrap up here pretty soon for season two. And until next time, we will catch you later.
Eric Dodds 50:28
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me Eric Dodds at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.