In this episode, Eric and Kostas look back over the great topics and guests from season three of the Data Stack Show.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Automated Transcription – May contain errors
Eric Dodds 00:06
Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. If you are going to the Data Council Austin event in January, on the 27th and 28th, you’re definitely going to want to meet Kostas and me in person at El Mercado on the night of the 26th. We will buy you a drink and talk all things data. We will be on site for the conference, and we’re super excited to meet you. Kostas, tell me what you are most excited about asking our listeners if they actually show up to meet us in person. I don’t know if I want to ask something. I was thinking about, like, maybe I’d love to play the game where I say "this is interesting" and then we all do shots. Yeah, if you come and visit us, you will have the opportunity to play this new game that doesn’t have a name yet, where I say "this is interesting" and we all take a shot of tequila or something. Maybe you can play that game. I don’t want to do that many shots of tequila. But we would love to meet you in person. It’s going to be a great conference, and we’re excited to meet some of our listeners. So come by January 26. You can reach out to us on datastackshow.com: fill out the contact form, let us know you’re coming, and we’ll buy you a drink. See you there. Welcome to the Data Stack Show season three recap. Kostas, I can’t believe we recorded three seasons of shows. It’s kind of crazy. I think we have 80 shows in the books. Of course, not all of them are quite released yet, because we do the recording ahead of time. But that’s pretty wild. Did you think that we would get this far when we?
Kostas Pardalis 01:58
Oh, yeah, no, I mean, I never expected that it was going to last this long, to be honest. Yeah, it’s been quite a journey. I think that soon we’re going to be, you know, like shows like Friends and all that stuff. We are getting closer and closer.
Eric Dodds 02:13
Okay, so I want to talk about a couple of specific themes that arose that are really interesting. First, I wanted to ask you a question. So when you messaged me on Slack and said, "Let’s do a podcast," and we hopped on a call and talked about it, you said, "I want to talk to the people who are doing interesting things in the data space," both so we can just learn about what’s happening out there and then meet the people behind it. Do you feel like you understand the data space, the data stack, better as a result of doing the show? Like, what’s sort of been the impact for you, personally? I mean, you work in the data space every day. Has it been helpful for you?
Kostas Pardalis 02:56
Okay, that’s an interesting question. I feel like I got answers. But it also created new questions, right? But I think at the end, what matters is to try to get in contact with people and see how they’re thinking and why they’re doing the things that they’re doing. Because, okay, at the end, the market is so young, there are so many things happening, and not all of them are going to survive. And of course, not all of them are the best way to do things or whatever. So we don’t really know yet what will be happening a couple of years from now. But having this kind of contact with passionate people who really love what they’re doing, okay, we had people joining us, and all of them, I mean, they have done amazing stuff, right? Very smart people, very honest people about why they are doing the things they are doing. So yeah, I think that, for me, the most important part of this show is this kind of connection with all these people. I think it’s what really keeps both of us, I mean, okay, I’m talking more about myself, but I think it also applies to you, I think it’s what keeps us doing it. So yeah, I mean, okay, we say that we do it because we want to share things with other people. We are also selfish, right? So primarily, we are having fun, and we meet all these nice people. Yeah,
Eric Dodds 04:19
Super fun. It is kind of a paradox, because I agree with you. I think questions have been answered. But it’s a paradox in that, you know, I think some of the simpler things, where we think about technology around data warehouses or data lakes, are sort of becoming crystallized across stacks across the board. You know, it’s kind of like, okay, we see patterns emerging there. But then you talk with people who have developed really groundbreaking technologies, and a lot more questions open up, right? Because these people are really sort of pushing the envelope of what can be done, which is super interesting. Okay, let’s just cover a couple of quick themes here of what we talked about. I’m just going to rattle off the main themes from the episodes that I jotted down in my notes as I was reviewing the season. So we talked a ton about ML: machine learning as a service, MLOps. The emphasis is kind of saying, okay, ML may be the next step beyond analytics, right? So you build a data stack to serve analytics, and then once you get that sorted out, you serve ML use cases. We talked about batch versus stream, which was super interesting. And then federated data, which was really interesting, and sort of that tension. And then we talked a lot about observability. Actually, we talked to several companies who are trying to solve for the challenges you run into with all this data running through a system that’s increasing in complexity. But I would say in, maybe not every episode, but a lot of them, the process and even the thinking around how you deal with data is increasingly adopting thought patterns from software engineering. And that actually is reflected, I would say, both in the team structure and in the tools that people are trying to build. Observability, for example, right? I mean, that’s sort of a direct adoption.
So as a software engineer, tell me what you think about that. That was just a consistent theme we heard throughout the entire season. Yeah.
Kostas Pardalis 06:33
I mean, I think it’s very reasonable for that to happen. There is a reason that we have all these different disciplines that have the term engineering in them, from mechanical engineering, to chemical engineering, to software engineering, to, I don’t know, whatever other engineering we have. Imagineering, social engineering? Yeah. I mean, at the end, when you engineer something, there are some very specific principles that are shared across all the different disciplines, right? And I don’t think that data would be something different. I think, actually, it’s an indication of the space maturing, of what is happening right now. As it matures, it has to be much more serious, it has to deliver much more consistent results. And that’s when you start moving, let’s say, from the experimentation phase to the engineering phase, where now you need to put processes in place, now you need to ensure quality, now you need to observe things and make sure that they work in the way that they should be working, right? So how do you do that? I don’t know. I mean, obviously, in data, things are different compared to infrastructure observability, for example, or whatever else, but still, the principles remain the same. We have a process, we need to observe the process, and there is some data about this data or these processes. So we have some data that we need to track, and we try to reason using these numbers and see, okay, can we trust the data? Can we trust our pipelines? Can we trust our data lake or whatever? So we are going to see more and more of these principles being applied to anything that has to do with data. We have companies that are doing versioning, for example, right? I don’t think we had them on this season.
But there are companies out there like Pachyderm, for example, that are doing data versioning, right? GitOps. I mean, at the end, we will get something like DevOps for data. So what I’m trying to say is that if we try to detach ourselves from what is happening and take some distance, we will observe, and that’s something that we have said many times, right, that the data engineer today is a role that’s a hybrid between engineering and operations, right? That’s an indication that we are still early. Probably this is going to break into different roles, and then you might have DataOps and data engineering, right? Where someone is responsible for writing all the stuff that we need to execute, and then we have someone who’s operating the software or whatever. And yeah, I think the next couple of months, maybe years, like one or two years, are going to define how exactly this discipline matures. I would say, though, because you mentioned at the beginning the ML part, and that we consider ML as the next step or whatever: if I learned something, it is that ML is actually not the next step. It may be something that’s out there, right? What is happening inside the companies, though, is that ML and analytics are kind of two different functions, right? And in many cases, this is also reflected in the infrastructure that the companies are using. You see different infrastructure that ML is using compared to the BI function, for example: one is using a data warehouse, the other might be using a data lake. I think one pattern that we are going to see a lot, especially with the lakehouse paradigm or whatever, is the merge of these two into one. So we will see that everyone inside the company is going to be using one infrastructure. And if you want, okay, I’ll refer to a term that we usually make fun of, which is the data mesh.
If there is, as I see it right now, value in this term, it is exactly this unification. Yeah, we are going to have one infrastructure for all the data practitioners inside the company; we are not going to have them separated as we have them right now. And I think this is happening. Now, how exactly is it going to happen? Which paradigm is going to succeed at the end? Is it going to be called data mesh, data networks, data something? I don’t know. I mean, it doesn’t matter. But there is a unification that’s going to happen in terms of how data is accessed and how it’s used inside the organization.
Eric Dodds 11:00
I agree with that. And I think a big driver of that is, whatever you want to call it, the democratization or commoditization of certain technologies. Making ML easier isn’t necessarily the right term, but I think that it’s making it more accessible, right? If you think about common data schemas, or tooling that can enable common ML use cases on top of existing technology, there are a lot of things that are making it way more accessible, which is super exciting. Let’s talk quickly about observability. So we talked with a couple of companies, Bigeye and Lightup, and it was a common topic in general. But what do you think about observability? Let me give just a little bit of context here. We said in one of our recent episodes that the stack is expanding, right? It’s not contracting in complexity, it’s actually expanding in complexity, which creates all sorts of problems in terms of being able to understand whether there are problems across the stack. What do you think about the observability space with data? I mean, is it a huge need? Do you think that those companies are solving a really true problem? What are your thoughts?
Kostas Pardalis 12:24
Yeah, obviously, they are solving a real problem. There’s no discussion about that. The thing is that I think it’s still early when it comes to how observability can be successfully implemented. I’ll give an example, right? Consider not just this season but the whole show, right? In the past, we have also talked with companies like Apple, for example, right? They don’t call it observability; they call it quality, right? And they’re focusing more on the streaming side of things, while companies like Bigeye, at least for now, are focusing more on the data that is already in the data warehouse. But we see that we have, let’s say, two sides of the same coin, right? It’s again about data quality, and figuring out if we can trust our data, right? Now, what’s the best way to do it? Is it best to rely on an architecture where everything happens on the data warehouse, or do you have a more decentralized architecture where quality and observability are part of each part of the whole workflow that we have in the whole stack? That remains to be seen. I think right now all these companies are attacking the same problem from a different angle. And at the end, the market is going to decide who’s going to win, based on which one of these is, let’s say, the most important. Because at the end, what happens with markets in general is that we have a consolidation, and we end up with one player that does everything, blah, blah, all these things.
Eric Dodds 14:00
I agree. My hot take is that I think there’s going to be some combination of both. If you think about the sort of micro problem of data quality at capture, that’s really, really important for certain teams in a localized sense, right? So if we have data coming in that’s driving some sort of very personalized experience, for example, it probably makes sense to be very rigorous on capture. Now, I’m not saying that’s unimportant for analytics and other things. But I think about observability as a more comprehensive solution that crosses certain points of the stack, as opposed to, you know, a rigorous approach to ingest. But like you said, it remains to be seen. It’s a fascinating problem. I think maybe one of the most interesting ones, beyond maybe data lineage, which has come up with
Kostas Pardalis 15:00
Yeah, okay, that’s what you’ll see at the end. But I think it would be interesting, and maybe that’s something that we should include in our shows from now on, to also interview, or I mean chat with, VCs who have invested in this space. Because the thing is that we are talking about product categories that are so new that I really don’t know; whatever you see today is probably not going to be true a couple of months from now, right? So it will be interesting to ask these people, who invest their money and have every reason to do it as early as possible, why they do it, and what the thesis behind that is for the state of the market. Because again, talking about data warehouses, I mean, it doesn’t make sense to ask the investors; it’s better to go to the companies right now. But for observability and, yeah, quality, I think that it’s the right time, where it’s going to be much more interesting to hear what a VC has to say, not even the founder.
Eric Dodds 16:07
Sure, yeah. That’s such an interesting proxy for what the vision is of the problems that are being solved, as people sort of look at their horizon.
Kostas Pardalis 16:18
Yeah. Because from a portfolio management perspective, that’s also something that these people are doing. There are also correlations between all these different companies and their investments. So it will also be interesting to hear how they see this category in relation to other categories in data that they might also be investing in. Anyway, I think it’s worth doing: find someone who is very active in investing in data-related companies. Yeah. Good.
Eric Dodds 16:53
All right, listeners, if you hate that idea, go to datastackshow.com, fill out the form, and tell us, because otherwise we’ll get someone on the show. Okay, last question, because we’re coming up on time here. One other subject that we discussed a lot was the modern data stack. So we talked about this with someone who’s been at Mixpanel for over a decade, and they’re sort of migrating to this paradigm where they view the warehouse as a central component of the data stack, which is really interesting for, you know, a product analytics company. And we had a panel with dbt, Databricks, Fivetran, Hinge, and then actually a VC who is pretty active. That may not have actually made it into season three, so that’s a preview for everyone of what’s coming up. I have mixed feelings about the subject of the modern data stack. In one regard, I think about some of the episodes where we talked with people who just sort of assumed the basic components: you need good ingestion, you need a single source of truth, you need to be able to move data easily, you need flexible pipelines. And then it was kind of like, what are you doing with that data? And then we also had episodes where people were talking about serious problems with any one of those components of the data stack. But I think probably one of the most interesting things was just hearing people who are practitioners actually trying to explain what it’s like to use the modern data stack, and they just put way less emphasis on the toolset and more on what it enables them to do, which I think is really interesting. So with that theme, do you feel like you understand the modern data stack better? Or are you more convinced that people like me are making it into a marketing term?
Kostas Pardalis 18:46
Okay, I don’t think that there is, let’s say, some kind of clear definition of what this modern data stack is. Okay, there is no such thing. I mean, we made, I think, a very good attempt to make things clearer on the episode that we recorded, but I think the consensus is still that, okay, it depends. And probably what it is today is not going to be what it is tomorrow, right? And usually, when you attack these kinds of problems, where you have, let’s say, very semantic issues, right, where people cannot agree on the definition, I think you need to, again, take some distance, and focus not so much on the definition, but on the words that we are using and why we are using them. And why do I say that? The most important thing at the end is that whatever is happening right now in the market is going to be a stack, right? And what’s important about the stack is that you don’t have one component that can work on its own, no matter what, right? This is not going to be like, I’m going out there and I’m buying a CRM, so I go to Salesforce, and that’s it, it works. In SaaS, for example, which was, let’s say, the previous wave of integration, you didn’t talk about a stack: I need Shopify together with a CRM and, I don’t know, Marketo, a mail tool, or whatever. Yeah, you didn’t need all of them together in order to have something that operates; you could have each one of them on its own. Did companies end up buying all of them? Yeah, they did, but they didn’t go out there to buy them as a stack. That’s what I’m trying to say. So what is, I think, very, very interesting and very, very important is that this is a space where synergies are very important. There is no one tool, no one platform, that will come and say, we are doing everything. Even if we are talking about Snowflake, right?
Or, I don’t know, Databricks, or Google even, right? I can’t go to Google right now and not use any other tool to do my job. No, you will probably need to use something for pipelining, or something else for, I don’t know, versioning or observability or whatever. So I would say, to people who are getting angry and thinking that this thing is a marketing term that is just used by the market to convince them to go and buy: don’t think in this way. At the end, focus on the words that are used and the terms that are used, and try to understand how this is going to affect your work in the future. Because as a buyer, you will never be a buyer of one product; you will always have to choose many different products, and how they work together is going to be important, right? That’s why we also see that partnerships in this space are something that companies start much, much earlier than what happened with the SaaS companies of the past, for example. So yeah, if you want my opinion, that’s what I would say about the data stack and the importance of the modern data stack. And the rest, about the definitions and who’s going to be the winner of each one of the data stack parts, remains to be seen. And I would say it doesn’t matter, at least for the market, right? I mean, for the owners and the people who work there, it matters a lot, but for the market it doesn’t matter.
Eric Dodds 22:14
I agree. Well, we’re at time. Let me just do a couple of quick thank-yous. We talked with Ben, the Seattle Data Guy, from Facebook; Ananth, who runs the data engineering newsletter and whose day job is at Zendesk, great episode; Tristan from Continual; James Serra at EY; and Bart, who runs the Data on Kubernetes Community, which is great. Of course, we mentioned Mixpanel, which was a really fun episode on the modern data stack and the warehouse. We also talked with Pete Goddard from Deephaven, and that’s a really interesting episode on the difference between batch and streaming, and doing stuff extremely fast. They have some pretty cool stuff going on there. Jeff Chao from Stripe on stream processing was a fascinating conversation. Definitely check that one out if you haven’t. And we talked with Egor from Bigeye, and Scott from InterSystems, who talked about data federation, which is really interesting. We talked about making ETL optional, another federation conversation, with Jeff, or sorry, Justin Borgman from Starburst, which was a really great conversation as well. We talked about open source with Ashley from Benthos, a really cool tool. That was a great conversation, with a great mascot for the open source project, and he’s just a hilarious guy. We talked about data design with Kevin from touchless technology. We talked about IoT, which was a great episode, not a theme, but a great episode. We talked with Rob from ThingLogix, and he talked about how he uses his own technology on his cattle farm in Oregon, which was amazing. We talked about ETL versus ELT with Matillion, which was a great conversation as well. We talked with Airbyte about open source and ETL, which was a super fun conversation. And we talked about data teams, which was a really interesting conversation, with Srivatsan, who works at Robinhood and actually has a long history at a bunch of other data companies, which was a really, really great conversation as well. So definitely subscribe if you haven’t.
That’s just a quick rundown of some of the highlights of season three. We will catch you on the next one. There are many, many exciting episodes that we’ve already recorded for season four that will come out early next year. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.