The Data Stack Show just wrapped up Season Two and hosts Eric and Kostas are back to recap some of the biggest themes and trends from the past season.
Highlights from this week’s episode:
- Dissecting the different team structures from organizations in season two (1:16)
- The people behind the data are key to the data itself (9:17)
- Open source licensing and the core components needed for large scale commercial viability (15:13)
- Game-changing core technologies in the new data economy (22:09)
- Snowflake vs. Databricks battle. “The UFC of Geeks” (25:54)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Eric Dodds 00:29
Welcome to The Data Stack Show. This is a wrap up of season two. That’s right, we’ve completed two whole seasons of the show. I think we’ve recorded 50 episodes, not all of those are out yet. Season Two was 22 or 23 episodes. And for this one, we actually decided to turn the video cameras on. We thought that would be fun so you can see what Kostas and I look like face to face, you can see our podcasting equipment, and all that fun stuff. So we’re going to do this one on video and post it on YouTube. In the season recaps, we like to just do a quick overview of what we covered in the last 20 or so episodes of the show. And so we have a list of the things that stuck out to us. Kostas, I’m just going to run down our list here because we have a lot to cover. And we like to keep these fairly short.
Eric Dodds 01:16
The first topic is team structures. There were a couple of things that stuck out to me across multiple episodes here. One of them was the different structures we see in the relationship between data engineering and data science, and a couple of episodes come to mind there. One is Policygenius, versus two other companies, Homesnap and The Atlantic. We talked with data scientists at all three companies, leaders in the org, and here’s the high level overview, and I’d just love your thoughts on this. Policygenius sort of has purview over all the data. So data science is a component of the data practice in the company. You can go back and listen to the episode to hear about that. At Homesnap and The Atlantic, we talked with lead data scientists, and they really do nothing on the data engineering side. They just sort of receive the processed, cleansed data that they need in order to build their models. I think there are advantages to both. It’s obviously working very well in different contexts. But what stuck out to you about the different team structures that we heard about? There are more examples, but those are just the ones that jumped out.
Kostas Pardalis 01:33
Yeah, and I think that’s a kind of theme that is going to be repeated in future episodes too. I think we are going to be discussing a lot about that, mainly because anything that has to do with data and how data is positioned inside the company is like something else, right. I mean, quite recently, we started discussing MLOps, for example, like what is MLOps. Who is responsible for MLOps? Is it going to be the data scientist, is it going to be the data engineers, is it going to be some different team? We don’t know yet. Right?
Kostas Pardalis 02:53
Like all these things are emerging out of like where the companies are still trying to figure out how to work with the data, what value they can get from the data, and of course, reorganize the company in a way that can maximize the benefit that comes from using the data. I think something that makes a big difference in how data teams are structured is two things. One, how important data is to the product itself. So I’m talking about Policygenius, I’m talking about a company that operates in the insurance market. Of course, data is very important there. The data itself is actually part of the product, right? So it has a very core role inside the company and inside like the business itself. So that changes things a lot.
Kostas Pardalis 03:42
Now on the other side of the spectrum, you have like this amazing case of The Atlantic, where you have like a very old organization, which is working on introducing data into even the product itself. Because if you remember when we talked, the staff there are pretty amazing, like they are using ML to help a lot with the experience that the end user is going to have reading The Atlantic. But still, you have an organization that needs to evolve into a data-centric, let’s say, I don’t like the data-driven term that much, like a data-centric organization, compared to another company that’s built from day zero with data at its core in terms of the value that they provide. So of course, this is going to affect also how the companies are structured, right? Yeah, as we say, it’s still something that we as professionals, and the market itself, are still trying to figure out.
Eric Dodds 04:41
Yeah, I think it’s interesting. If you think about Policygenius, the product wouldn’t work without data. Yep. And I think actually, Jenna, I’ve thought about this multiple times where we said, you’re not really a product company, you are more of a content company. And she disagreed. She said content is the product. And I loved that. But at the same time, you could publish plain text on the internet. And you don’t need any data for that, of course, it is data. And I know, Jenna’s point was very well made, but to your point, from a business model standpoint, it’s not like they have to have some sort of underlying data set in order for the product to actually function, you can just, you know, produce content, which is super interesting.
Eric Dodds 05:26
One other point that came up, and this came from Policygenius, but we heard about different setups, was where data professionals live within the organization. There were two big structures we heard about. One is where data is a centralized practice that acts as a shared resource across teams, a service center, if you will. The other one was what the team at Policygenius called structured embedding, where they actually assign data professionals in various capacities, so an analyst or an engineer, to specific teams. So you’ll have people on the data team, but they’re embedded into the product team or the marketing team, which is really interesting. And that was a really interesting conversation. My question for you, Kostas, because you run an engineering org and have run engineering orgs: what are the reasons that you would want to pursue structured embedding as your org structure, as opposed to a shared service centralized model? What are the different business models or questions?
Kostas Pardalis 06:32
That’s an excellent question. And actually, it doesn’t have to do only with data engineers or data analysts, right. Like, it’s a more general question of how you want to organize your engineering organization at the end. And there are like two main, let’s say, directions there. One, you have each function that’s siloed, in a way, right. And each silo interacts with the other one, depending on their needs. So for example, if the engineering team that builds a specific feature needs some data, okay, they will go to the data team, and be like, we need this pipeline to be built and this data to be delivered, and blah, blah, blah, all that stuff. And the data team is going to prioritize this internally in the way that they want, and implement it and deliver it, etc, etc. And that’s not just with data engineering, or data in general, like thinking about design or product. Like, again, it’s already been like, front end and back end development. And then you have this concept that I think was pioneered by Spotify, or at least that’s what I remember right now as a very good example of a team that does that, where you have like this concept of a squad, where you have teams, but they have all the functions inside them, right. So you are going to have the front end developer that’s assigned to a specific team, you’re going to have the back end developer, you’re going to have a data engineer, you’re going to have the designer, you’re going to have the product owner and all that stuff. Now, there’s a whole industry behind how to structure teams and how to increase productivity, with coaches, with consultants, blah, blah, blah, and all that stuff. Again, I think it’s a matter of culture, the way that you structure your organization is a matter of culture, I think it has to do a lot with the early leadership team. And of course, like the CEO, and the culture that the CEO brings to the company. 
And at the end, it has to do with how you can manage communication better. These are like the two things that affect that. Outside of this, I wouldn’t say that I prefer one or the other. Because at the end, people need to figure out how they can communicate better. And the structure is going to emerge from that. So I don’t like to see things black or white, although, with an engineering background, I’m trained to do that. But unfortunately, when we are talking mainly about humans and human relationships, things are like, not just black and white. There are many shades in between with many different colors. It’s not even just gray. So that’s my opinion. And yeah, whatever works at the end and keeps people happy with less friction, that’s the most important thing.
Eric Dodds 09:17
Yeah. Okay, next topic on our list was … and this has really been a recurring theme through season one and continued into season two … the theme is that the people behind the data are really the key to the data itself. And I’ll bring up two specific examples. We talked with one of the early data scientists from Shipt, which is a grocery delivery company that was purchased, I believe, by Target so they’re just absolutely huge. And then we also talked with Peter, the founder of a company called Aquarium Learning, it’s tooling for building better models. And it was really interesting to talk with both of them. So Shipt deals with a lot of out of stock or in stock type predictions around people purchasing things online across grocery stores that are geographically distributed, a very complex problem. Peter got his start in machine learning working at a self-driving car startup that ended up becoming a gigantic organization. Just really, really interesting people. Yet the thing we heard from both of them was “a model is a model.” And you always have to ask the right questions, and have the right mental framework when you’re building a model. And I’m just interested for you as an engineer, and I actually don’t even know what your experience with building, you know, machine learning models is, but just give us your perspective on that as an engineer, what are the pitfalls, did that really resonate with you, do you think there are areas where that’s not necessarily true?
Kostas Pardalis 10:56
Yeah, I think that’s an excellent point, actually. And I really enjoyed the conversation we had with Aquarium because it managed to communicate something that people tend to forget a lot when it comes to machine learning. Before you end up having a model that does its magic, you might need maybe even thousands of human beings to do very boring stuff like annotating data. And that’s, I don’t know, at least like for the people that live in the Bay Area, in San Francisco, they might see all these Waymo cars driving around again, and again, and again, like they just create data, right, so you need someone to drive around to create, like the data set that is going to feed the models, and we are talking about like, probably millions of hours of driving without “purpose”, you know. And that’s the same thing also with pictures where if we are capable today to have models that can identify all the different breeds of dogs, like I don’t know, like all this crazy stuff that you see out there. It’s because one way or another, we humans figured out a way to utilize humans to create these data sets, and log a very big part of the work behind all these models. And it’s something that’s, okay, probably it’s not sexy, it’s not so exciting. We prefer to create this kind of image of these better than human agents that we are going to create blah, blah, blah, like, you know, all that stuff. But in the end, there’s a lot of labor behind that to happen. And I’m really happy that this is something that came up during our conversations. And I think another theme, which is probably more profound when it comes to data and machine learning, especially but I think it’s like a recurrent conversation also in engineering. Right? About why would you go and build something that’s going to replace data engineers, for example, because data engineers are building pipelines, why are you going to automate the process of that, right? 
Because then you will have the engineers losing their jobs. And in the same way, we say why are we going to create something that does diagnoses on x-rays, our doctors are going to lose their jobs. But if you see at the end, what happens is that all these tools, actually what they do is that they augment what the humans are doing, right? They don’t replace the humans. And I think this is obviously true in engineering. I think it’s also very true about machine learning, at least today. And this is something that I think came up in many episodes this season, and I think is probably one of the more philosophical, let’s say, things that we covered, but probably one of the most interesting and most important ones that we have managed to communicate.
Eric Dodds 13:54
Yeah, I agree. We kind of talked about this a little bit in each case, as well, but machine learning can tend to be an abstract concept in the mind of the average person, when they think about AI. It’s actually one of those terms where everyone’s familiar with it. But when you ask someone to put a really concise definition on it, especially your average person who isn’t an engineer, it’s actually pretty hard to define. But it shows up in very, almost intimate ways in your life when you think about taking a car to get someplace that you need to go or ordering groceries because you need to make food for your family. And so I think that will become an even more important conversation in coming years because machine learning, you know, and AI are really intersecting with our lives in a lot of ways that many times you don’t even necessarily know about as consumers, but there’s a direct influence there which is really interesting. Okay, next topic. Unless you’re done with that one. Do you have more?
Kostas Pardalis 14:57
I’m done. I mean, you know me, if we start discussing this, we will probably make three episodes out of it, so …
Eric Dodds 15:01
Well, I’m saving the best for last. Yeah, you are somewhat of a Greek philosopher yourself. So when we get onto the philosophical stuff … Okay, another hot topic: open source. So, there are two episodes that come to mind. So we talked with Jim from Cockroach Labs. And he had very strong opinions about open source, we kind of talked about the whole Mongo thing and got his opinion on that. And he was great, a very opinionated guy, and I loved it. And then we also talked with Sven. And he has done a lot of writing and thinking about open source business models and how they scale over time. I think that episode was recently published: Slaying the Four Dragons of Data, which is a really, really interesting episode. Either way, they both talked about open source, and they kind of had a little bit of a different take on it. So Jim, from Cockroach, I think said rightly, there’s an ethos behind open source where people want to help other people, and provide that tooling and those resources for free, because it’s a community effort where you’re trying to help make each other’s jobs easier and your work easier. And then people get wrapped around the axle on the licensing conversation, right. So that’s kind of where people get wrapped around the axle. And then Sven kind of had this really interesting point around what are the core components of actually being an open source company that would allow you to become a very large organization, in his terms, the next $30 billion valuation data company. So there’s a lot of crossover between what they said, but what’s your take on that, on Jim’s question around the licensing, and that’s where people get wrapped around the axle. And then also Sven’s point about what are the core components that you need in order to build a company that’s founded on open source that actually becomes commercially viable at a large scale?
Kostas Pardalis 17:10
Yeah, there’s also another episode that has very interesting insights about open source. And that’s the one that we had with Tecton with Willem, because he’s like the main contributor of Feast, which is the only open source feature store out there right now. And he also added another dimension of open source, and especially open source that is not backed by a commercial entity, which is the abuse that some of the maintainers have to go through. It’s not easy, and many open source projects out there are maintained by big corporate entities, right? Like, when you see things like Kubernetes, it’s open source, but of course, you have Google behind it, right? Like you have engineers that are getting paid to maintain this. And then you have other tools like Feast, for example. That’s like the other extreme, where you have someone who just wants to do it and build something, and he’s the only maintainer, like he does it not as part of his job. The difference there, and the interesting part, is that the people who interact with these repositories pretty much have the same, let’s say, expectations, regardless of who’s behind the project. And that’s something that is quite interesting and can be quite taxing for projects that do not have someone to sponsor the project, right.
Kostas Pardalis 18:40
Now, I think that as we see this new data economy getting built, we will see open source becoming more and more important, a business tool actually, like something that is going to be important to build a business. And it’s a big conversation, why and how. But a very good example of this is like database systems, right? Like you mentioned CockroachDB, for example. Outside of the big corporations of the past, like Oracle, for example, pretty much everything exciting that happens right now in terms of database systems is open sourced. And that’s something that I think we will keep seeing. Kafka, Databricks, Spark. I don’t think there’s like a specific recipe on how you can do that. But I think that it’s part of the culture of engineers and developers to interact and use open source tools. So I think that’s always going to be an important dimension, especially when we build products that are going to be consumed by developers.
Eric Dodds 19:48
Yeah, I think it’s really interesting. I think, if approached in an authentic way, when you’re dealing with core data technology, it’s a way to understand some of the challenges you face in building the product or some of the problems people face in using it way, way faster because the community is giving you a steady stream of feedback, which is interesting. Whereas you compare that with maybe some open source tools that are very helpful but aren’t necessarily as core to data infrastructure inside of an organization as something like a database or data pipeline or something like that. So super interesting.
Eric Dodds 20:24
Okay, two more questions. And I’m saving the best one for last so that we have a hard time-stop on it, because we’ve talked about it before. And I know we can get long-winded. Okay. Tools, I’m gonna let you pick two of these to talk about. But we talked with a bunch of companies building really interesting things. We already talked about Aquarium, so I’ll leave that one off the list. Meroxa is doing CDC, change data capture stuff, super interesting. Tecton is doing feature store as a service, super interesting. Materialize is doing some really interesting things with streaming SQL and materialized views, which is fascinating. And then we also had multiple discussions about graph. Neo4j is one of the ones that comes to mind as well. We talked with Affinio as well, who is doing some interesting graph stuff on top of Snowflake. So pick two of those, and not necessarily based on the company, because they’re all really cool. But when you look at the data landscape, or as you said, the new data economy, which is a very, very elevated term there.
Kostas Pardalis 21:33
I want it to happen.
Eric Dodds 21:37
Yeah, that’s like, you’re getting into like Forbes territory there. So be careful. So as you look at the new data economy “trademark”, which of those tools, as an engineer, do you look at and say, like, okay, this has the potential, the core technology, to be really game changing or have an outsized impact?
Kostas Pardalis 21:56
That’s a very good question. I don’t know if I would choose the product itself, but …
Eric Dodds 22:04
It was an unfair question. But that makes it way more fun for me and the listeners. Yeah.
Kostas Pardalis 22:11
At least one of the things that I think is going to become really important, and we saw with Affinio, if I remember correctly, if not, Affinio, please forgive me, is the whole concept of building data processing applications on top of a data cloud provider like Snowflake. What we saw happening there is that Affinio has amazing technology and algorithms for processing Graph data. And instead of building their own database engine to do that, they built on top of Snowflake. I think as a trend, it’s quite early. If it works, I think it’s going to be one of the foundations of the data economy, okay, to accelerate and enable the data economy. So then I go and write articles in Forbes and The Economist. That’s my dream, that’s why I do whatever I do in my life.
Kostas Pardalis 23:08
So that’s one thing. That’s an amazing trend that I see. And I’m very, very interested to see how it’s going to work in the end. The other thing is anything that has to do with stream processing and streams in general. And there we see two things. One is the transformation of traditionally non-streaming data, like a database, for example, into a stream. That’s what CDC is, at the end. We take the database and turn it into a stream of changes on the state of the database. So from something where we used to only interact with the latest state that the database has, now we have like a stream there. And that’s what Meroxa is doing.
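To make the CDC idea above concrete, here is a minimal Python sketch that is not taken from the episode. It diffs two in-memory snapshots of a table to produce a stream of change events, then replays that stream to rebuild the latest state. The names `ChangeEvent`, `diff_snapshots`, and `apply_events` are hypothetical, and the snapshot diff is only a stand-in: log-based CDC tools typically read changes from the database's transaction log rather than comparing snapshots.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]
Table = Dict[int, Row]  # primary key -> row (an assumed, simplified table model)


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in the change stream (hypothetical shape, for illustration)."""
    op: str                 # "insert", "update", or "delete"
    key: int                # primary key of the affected row
    before: Optional[Row]   # row state before the change, if any
    after: Optional[Row]    # row state after the change, if any


def diff_snapshots(old: Table, new: Table) -> Iterator[ChangeEvent]:
    """Yield the changes that turn the `old` snapshot into the `new` one."""
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before is None:
            yield ChangeEvent("insert", key, None, after)
        elif after is None:
            yield ChangeEvent("delete", key, before, None)
        elif before != after:
            yield ChangeEvent("update", key, before, after)


def apply_events(table: Table, events: Iterable[ChangeEvent]) -> Table:
    """Replay a change stream onto a copy of `table` to get the latest state."""
    result = dict(table)
    for event in events:
        if event.op == "delete":
            result.pop(event.key, None)
        else:
            result[event.key] = event.after
    return result


if __name__ == "__main__":
    yesterday = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
    today = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}
    for event in diff_snapshots(yesterday, today):
        print(event.op, event.key)
```

A consumer that only ever saw the latest state would miss that row 2 existed at all; the change stream keeps that history, and replaying it with `apply_events` still recovers the current table.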
Kostas Pardalis 23:48
And the other thing, which is quite interesting, is what I see with Materialize, where we start having some stream processing techniques and products, like Materialize, that are much, much easier to work with and to integrate with existing database systems. And why I think this is important: anyone who hasn’t done that in their engineering or developer career, try to take something like Spark, which can do stream processing, or Flink, and run it and measure the time it will take you to do something meaningful with it, and then try to do the same thing with Materialize, and you will see what the differences are. I’m not going to say that one is better than the other. But what we can see here is like a shift in the ease and the access that we have in processing realtime and streaming data. And I think these two things that I mentioned are two trends that are going to be important and interesting in the next couple of months or years in terms of how new data processing paradigms are going to be introduced.
Eric Dodds 25:02
Yep, I agree. And we talked with lots of cool companies. Those are just the ones I pulled off the top of my head for the recap. But I just want to do a quick shout out for the ones that I didn’t mention. Avo is a data governance tool. We talked with Stef from Avo. Really, really cool. That was a super neat tool. We talked with Chris from DataKitchen. Let’s see Grafana. We talked with someone from Grafana. Of course, that’s a tool that we all love. I mentioned Neo4j and Meroxa. I just want to make sure I mentioned everyone here. Pandio, Josh from Pandio, they built on top of Apache Pulsar, super interesting for AI/ML use cases. We talked with Sokratis of Clerk, and they’re doing some really interesting things around authentication and user management as a service. I think those are all the tools that we talked about. So just to our audience, if you missed any of those episodes, go back and listen, they’re building some really cool stuff.
Eric Dodds 25:54
Okay, last but not least, we only have a couple minutes here, but we had multiple episodes where we talked about Snowflake versus Databricks. I’m surprised you didn’t see that one coming, Kostas.
Kostas Pardalis 26:11
Oh, I think this is going to turn into like the UFC of Geeks, you know …
Eric Dodds 26:17
The conversation is really interesting, though. And I’ll just give the listeners a brief overview of the very high level and you’ve written about this extensively Kostas, and Brooks can put that in the show notes. But Snowflake is approaching the industry from the data warehouse side with an analytics use case primarily. And then moving into data lake, machine learning, and all the ecosystems around those functions. And Databricks is coming the other way. Right? They came from Spark, they were open source, they came out of academia, as opposed to being born out of another large startup and being a venture-backed commercial entity from the beginning. And Databricks is now moving towards the warehouse and analytics use cases. So the simple question is who’s gonna win? And when do we think that’s gonna happen?
Kostas Pardalis 27:14
I don’t know. I mean, wearing my Economist, Forbes hat, the winner at the end is going to be the consumer, right? Like it’s going to be the customer, because they are going to have some excellent products.
Eric Dodds 27:28
Steve Forbes could not have said it better. I mean, that is straight out of the editor’s note in Forbes. So Steve, if you’re listening, we’d love to have you on to get your opinion on this.
Kostas Pardalis 27:44
Yeah, please, please. I don’t know who’s going to win. What I know is that I think this whole war for the data cloud is just beginning. We have two, let’s say main competitors there. But the most interesting thing, for me, at least like for the next couple of months, is the new companies that are getting into this arena, pretty much like every data lake project from Apache is going to, if they haven’t done it already, they are going to turn into a commercial entity and try to build that. And that’s something we are going to be hearing more and more in the near future like Iceberg, Hudi, and of course, we have other components like Dremio, we need to see what Confluent is going to do.
Kostas Pardalis 28:36
So I think that this competition just started now. And there are many things that can happen. And there are, okay, we were joking, but I think we are going to see some great new technologies that they will try to become, let’s say, the Data Cloud, because that’s what it is at the end like how we can create like a platform where we can have more cases like Affinio where we go there and build products on top of that, just as we do with AWS. Right, like when we want to build our products, we need servers, we go to the cloud provider to do that when we are moving to build data products. We need data infrastructure to do that. Who’s going to become like the leader in this? I don’t know, we’ll see. I think it’s going to be very interesting. And I think that the most important thing to do is keep our eyes open to see the new companies that will enter the market. That’s what is very, very exciting for me.
Eric Dodds 29:35
Yeah, I agree. If I put my Steve Forbes hat on and do my note from the editor, as we think about both Snowflake and Databricks and all the tools that we mentioned before that we had the chance to learn about in the last season, there are two major things that really excite me. One, like you said, we’re just at the beginning here. I think it’s fun to talk about Snowflake and Databricks and debate back and forth, but the bigger story is that we’re at the very beginning of a decade-long trend that we’re going to get to see play out in real time, which is going to be really cool.
Eric Dodds 30:13
The other thing I would say is that, with a lot of the challenges that all of these tools are solving, they’re removing artificial scarcity from the equation of providing value with data. And many times the artificial scarcity is paid for in engineering time. And so as a lot of these low level plumbing problems are solved by new technologies, I think we are going to see people do things with data that we haven’t even conceived of yet, because they will have no artificial scarcity and they can apply all of their talent and resources to actually creating value with data. And so I think the next 10 years are going to be phenomenally exciting. And hopefully, we’re still recording episodes then.
Kostas Pardalis 31:02
We’ll be here. I mean, we will also be writing for The Economist and Forbes, but the podcast is going to continue …
Eric Dodds 31:11
Okay, well, that’s the buzzer. One thing we want to say before we hop off of this episode, we would love your feedback, you can send that to Eric@datastackshow.com. You can also go to datastackshow.com and contact us there. We’d love your feedback. If there’s a technology or a subject that you’d like us to discuss, we’ll go find an expert and talk with them. If there’s something that we’re doing that annoys you, we’d like to hear about that as well. No guarantees that we’ll stop it. And go check out the website to see all of the episodes you missed. We didn’t get to talk about data governance, we didn’t get to talk about stack architectures, there were a lot of good things that we didn’t get to in the season wrap-up. So make sure to go to the website, or onto your favorite podcast provider and check out what you missed. And we will catch you at the beginning of season three next time.
Eric Dodds 32:08
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.