Episode: 22

Season One Recap with Eric Dodds and Kostas Pardalis

Eric Dodds and Kostas Pardalis

Hosts, The Data Stack Show

Season One of The Data Stack Show is in the books, and in this episode, Kostas and Eric take a look back at some of the biggest takeaways, trends, and topics from the season. With some great guests already set for season two, the next slate of episodes is shaping up to take an even deeper dive into the world of data and the people shaping it.

Notes:

Key points in the conversation include:

  • Patterns with data warehouses and data lakes (3:38)
  • Looking back at the people behind the data and their stories (8:12)
  • Minimizing flaws while remembering that data is built by humans, for humans (11:02)
  • Using proven technology and making mature solutions (15:20)
  • Data involves a significant amount of trust (23:38)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds  00:06

Welcome to The Data Stack Show where we talk with data engineers, data teams, data scientists, and the teams and people consuming data products. I’m Eric Dodds.

Kostas Pardalis  00:15

And I’m Kostas Pardalis. Join us each week as we explore the world of data and meet the people shaping it.

Eric Dodds  00:25

All right, we are wrapping up season one of The Data Stack Show. We had 21 episodes in the first season, which is amazing. It feels like it wasn’t that many, but we actually talked to a ton of different people. Kostas, how do you feel now that you have completed the first season of a podcast that you started?

Kostas Pardalis  00:49

Ah, it was an amazing experience. Actually, for me, it was also my first experience with podcasts, and it was a much, much better experience than I expected, to be honest. Full of surprises. And as you said, like a ton of people that we had the opportunity to chat with, and many different use cases. I mean, when we started this podcast, the main thing we had in our mind was that we wanted to focus on data, right? And I think one of the most amazing outcomes of all these conversations that we had this season is that, in the end, data is behind pretty much everything. We had the opportunity, for example, to talk with companies that at first look you wouldn't think are data companies, or that they are working with data. But in the end, they are. If you remember the episode with Slapdash, for example, which is a productivity tool, but in the end it's built on top of a ton of data. Same thing with Panther Labs, where they're a security tool, but as we were discussing with Jack, the main outcome of the conversation was that security is a data problem. So yeah, it was an amazing journey, actually, full of surprises and amazing guests. And I'm really looking forward to continuing this in the next season.

Eric Dodds  02:17

Yeah, I agree. I think the concept, or the reality, of everything running on data is really interesting. When you look back at our shows, I'm thinking about sort of two sides of that coin, and I know there are different ways to look at it. But when we talked with Mason from Bookshop, he talked about the data that they have to wrangle just in order to display a correct listing of books on their e-commerce website. You know, and that's sort of core to the product, right? They have to leverage that data in order to deliver a product that provides a good experience. But then on the other hand, I'm thinking about our conversation with Jason from Bind in the health insurance space. And he talked a lot about delivering data products to other parts of the organization, right? So, of course, they work on things that feed the core product. But then you have data driving marketing, you have data driving product, you have data driving, sort of, testing in different areas of the company. So it's just interesting to think about data both driving sort of the core product use cases, but also being a product itself that's consumed by other parts of the organization.

Kostas Pardalis  03:38

Yeah, absolutely. And if you think about it, I mean, it looks like data is everywhere, which makes sense. And I think that's another, a little bit more technical, outcome from all the conversations that we had: there are some emerging patterns out there, at least in terms of the architectures that companies are using. I'll give you some examples. Even from the first episode that we had, with Mattermost, for example, we saw this pattern of building everything around the data warehouse, right, where you have the pipelines, which are one part of the architecture, pulling the data from the different sources that the company has, all the data they need, and pushing it into a data warehouse. The data warehouse might be something like Snowflake, for example. And actually, one big difference compared to the past, because it's the same pattern, is that more and more people prefer to just extract and load the data into the data warehouse and implement any kind of complex, let's say, transformation logic on top of the data warehouse. That's a result of huge changes that have happened in the data warehousing space, mainly the scalability of the solutions, in terms both of processing and storage, and of course cost, which is very important. Also, these platforms have become more and more powerful in terms of their expressivity, the things that you can do. And we've reached a point where you see technologies like dbt come into play, right? That's another very common layer that emerges inside companies as they build their infrastructure: with dbt you have a layer where you can transform and model your data, with all the benefits around that. And then of course you have the consumption part of your architecture, where you have your BI tools connecting there, using something like LookML, which also has to do with modeling, but this time more on the visualization side. And that's a very common pattern that we see.
I think it exists in pretty much every company that we talked with. But there is also another pattern that is emerging, which has to do with data lakes. The data lake is something that we see coming up more and more when we are talking with organizations that incorporate data science in their practices. And the interesting thing, if you ask me, is that the first architecture pattern is more about building the infrastructure for the internal consumption of data. If you notice, Eric, we talked about BI, right, in the first pattern, where BI is, like, 100 percent something that is going to drive your organization, and then other departments, like your marketing, your finance. With data science, we see how data can become a product that's exposed to your own customers. I think this is the most powerful use case around data science. Of course, you can use data science to also do things that are consumed internally, like lead scoring, for example, right, or do some forecasting for your sales or finance. But I think the most powerful part of data science, and how it is used, is building products that are going to drive the customer experience. And there, the requirements are a little bit different. We had some hints around that in the first season, but we are going to have even more exposure to these use cases in season two, and it's one of the things that I'm really excited about and looking forward to in the next season.
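The warehouse-centric extract-load-transform pattern described above can be sketched in a few lines. This is an illustrative sketch only, not anything discussed on the show: it uses Python's built-in sqlite3 as a stand-in for a cloud warehouse like Snowflake, and all table names, column names, and rows are hypothetical.

```python
# ELT sketch: extract and load raw data untouched, then transform with SQL
# inside the "warehouse" itself (the layer a tool like dbt would manage).
import sqlite3

# Hypothetical source system: raw events as they arrive, no cleanup applied.
raw_events = [
    ("user_1", "signup", "2021-01-05"),
    ("user_1", "purchase", "2021-01-07"),
    ("user_2", "signup", "2021-01-06"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract + Load: land the data as-is in a raw table.
warehouse.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, day TEXT)")
warehouse.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: build a modeled table with SQL in the warehouse, rather than
# transforming in-flight before loading (the old ETL approach).
warehouse.execute("""
    CREATE TABLE user_activity AS
    SELECT user_id, COUNT(*) AS event_count, MIN(day) AS first_seen
    FROM raw_events
    GROUP BY user_id
""")

rows = warehouse.execute(
    "SELECT user_id, event_count, first_seen FROM user_activity ORDER BY user_id"
).fetchall()
print(rows)  # [('user_1', 2, '2021-01-05'), ('user_2', 1, '2021-01-06')]
```

The point of the pattern is that the raw table stays untouched, so the modeled `user_activity` table can be rebuilt or changed at any time by rerunning SQL, without re-extracting from the source.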

Eric Dodds  07:18

Yeah, I agree. I think that we didn't necessarily have a plan for a breakdown of the types of roles that we would talk to in terms of guests who work in the data space. But we did talk with several people who work specifically in data science. So I mentioned Jason from Bind; we talked with Stephen Bailey from Immuta, who does a ton of stuff in data science, and Arian Osman, who works for Homesnap. We talked with multiple people doing data science, and it is really interesting to look at that subset of the data world, where there are different requirements than, you know, sort of your quote unquote standard pulling data, processing it, and then either sending it into tools or preparing it for BI.

Eric Dodds  08:12

You know, one thing that I think is really interesting, that really stuck out to me, was that we got the chance to, how do I want to say this, sort of look at and discuss the people behind data and behind the technology in a variety of different ways. So first of all, I think one of the things that I've come to enjoy most about the show is that there's lots of cool technology, there are lots of companies doing really neat things with data, but the people who are doing those things have really interesting stories. So I think about Andrew from Earnnest, who, you know, works on real estate transactions and is doing some really interesting things there, and just hearing about his work as a marine biologist, working in the ocean. That was just fascinating. Stephen Bailey, who I mentioned before, has a PhD in childhood reading or education. It was just really, really interesting to see that. But then the other angle that we saw, I think, is the human element in the actual work. So, you know, we talked with Duc Haba, who was at Cognizant and has had a long career working in AI. And the way that he addressed it was really interesting. He talked about how people fear AI, or they're skeptical of AI, you know, or AI produces a bad result, and so you tend to blame the technology. But he said there are actually people behind that, and we need to remember that. And then the last one, well, there are many more, but the last one I'll mention that stuck out was Arian Osman of Homesnap, talking about building models for sort of predicting the time it would take to sell your home depending on the price. You know, so you have a slider, you say, okay, if I price my home at this number it should sell in, you know, 25 days, and of course the time lengthens if you raise the price.
But we talked about how it's hard to train a model to account for human perception around things like whether the price is too low or too high, you know, people have questions, and he talked about ways that you can incorporate that. But all that to say, I just really enjoy, I think one of my favorite things is, getting to meet the people who are doing this work and hearing about their lives, you know, kind of outside of their data work, and the things that influenced their data work, whether that's prior experiences or other projects, and then also just the human element of actually doing the work of data, and how, you know, technology still requires a real human element in order to produce a great experience for users.

Kostas Pardalis  11:02

Absolutely. I think you're touching on a great, great point, and a great insight that we tend to forget, I think, especially as people who work in technology: technology in general, regardless of whether it is around data or not, is built by humans, and it is built for humans. So it's not the responsibility of the technology, right? In the end, if something goes wrong, it's because our models or our architectures have some kind of flaw. And of course, they're going to have a flaw. There's no way that we can build something so complex that it's perfect the first time we make it public. Iterations are needed. And anything that has to do with data, and especially anything that has to do with data science and data analytics, the productization of these practices is something very new. So it will take time. It will take time both for the people responsible for building them to develop the best practices and the engineering practices for how to engineer something that is going to be the best possible solution, and more predictable in its outcomes, but also for the people that are using these tools and products. They have to get educated, right? Technology is still technology. A data science model is a model that predicts something very specific. It's not a human that we have in front of us that can adapt without the intervention of other humans. So it's a very interesting point. I think we need to always remember that, as I said, these are things that are built by humans and for humans, and we have a lot of work in front of us to improve them and figure out the best products that we can build. Which brings me to another outcome of all the conversations that we had, which has to do with the maturity of this market and the industry. We saw that there are specific technologies that are really mature right now.
So, for example, anything that has to do with data pipelines, right. We talked about ETL that became ELT. That's a pretty mature part of the stack, where I would say the products are almost commoditized right now. And there are some parts of the stack that are very mature; same with data warehousing. But there's also a huge, huge space for new products and a lot of opportunities. And we saw that, for example, in one of the first interviews that we had, with DeVaris from Meroxa, about CDC, right? CDC is very hot. I mean, it's something where we see many companies right now trying to come up with solutions and products to address it. Then something else that is super hot is anything that has to do with data governance, which, by the way, is funny, because in the past at least, you'd hear "data governance" and think, okay, that's something super boring that only Fortune 500 companies care about, and suddenly it becomes something that is relevant for everyone, right? And you see many companies trying to address it. We had Immuta, right, which addresses just one part of data governance, access control. Then we had Iteratively, our last episode, where the guys over there are working on trying to figure out how to improve and control the quality of data. So that's super hot, and I'm pretty sure in the next couple of months we are going to see more and more companies appearing to try to solve these problems. And of course, just to make the connection with the architectural patterns we were talking about earlier, there's a lot of space for products around anything that has to do with what is today called MLOps, which is all the operations around how to productize ML, machine learning, and data science. And as I said, more about this in season two. I think it's going to be very fascinating.
And just to get to something that I think is one of your favorite outcomes: it's about the importance of boring technology, right? What do you think about this?

Eric Dodds  15:20

Yeah, yeah, that came up on multiple episodes. You know, when we were talking with people from Bookshop, they brought up the boring stack. We sometimes ask for a stack breakdown, just because it's interesting to see the different ways that people shape their data stacks around the needs of different businesses. And it was interesting to talk with Mason. I mean, his response was: we actually have a pretty boring stack, but it works. And I know that we have an episode coming out in season two with a company called LeafLink, and we have a similar conversation there. I won't give too much away, but in that episode we talked about the balance between building something that is scalable and reliable, that works for your users and your company now, with an eye to the future, as opposed to just adopting new technology because it's new, right, because it has some promise. That's not always the best decision. And so we talked to a lot of companies where, you know, certainly people are doing really interesting things, but in many cases they say: we have a really boring, but very stable and very efficient stack. And I just found that fascinating. Pushing the limits is really cool, but I think we saw several very mature data engineering and data science practitioners who have seen a lot, have implemented a lot of things, and really build for something that's going to be scalable, maintainable, and provide a great experience. But that's from my perspective. I'm interested in your thoughts on that, because you have the actual experience of building all sorts of different architectures. So how did it hit you that we saw that repeated over and over across episodes?

Kostas Pardalis  17:21

Yeah, I think in the end, I mean, from an engineering perspective, people who are in technology always love to play with new toys, right? And in some cases it's really hard to control the excitement of using the latest shiny thing that came out and promises to solve a new problem, or an existing problem, in a much better way. But if you approach this from a more mature engineering point of view, in the end you cannot really build something new without having a very stable foundation, right? I mean, you can, but probably you're going to end up having some really bad nights where you won't be able to sleep, because things will go really, really wrong, and it will be super hard to find resources to solve your problems. So if you want my opinion, when you're trying to build a new product and solve a problem that wasn't solved before, or solve it in a new way, one of the best choices you can make is to base the foundation of your solution on proven technology.

Kostas Pardalis  18:34

So what we actually see with the companies that have this approach is that there are some really mature and experienced engineering teams behind them who know: I shouldn't focus on trying to debug and understand how this new shiny thing works; I should free my mind and let it focus on the problem that I'm trying to solve with my own product. So yeah, I think it's a great sign of maturity and good engineering practice in the end. And as we approach and chat with more companies, I think this pattern will appear more and more. Either way, just to add something here: we might think, on the other hand, yeah, but look at Google, look at Netflix, right? They are using all these new shiny tools, and they are building their own shiny tools. But when we get into this thought process, we forget something very important: these companies are addressing problems at a scale that is completely new. They are really pushing the frontiers of technology and what is available right now. And at the same time, they have the resources and the talent to build new solutions that can address the unique challenges that they have. So from starting a company to becoming a company that has to deal with the traffic, for example, or the reliability requirements that Netflix has, that's a huge, huge road. So that's another thing that we always need to keep in mind when we see these amazing companies using and building all these amazing products.

Eric Dodds  20:18

Speaking of Netflix, I mean, it was a real honor to speak with Ioannis from Netflix and hear about all of the various challenges they have. But when we asked him about how they evaluate using, or even building, new technologies, I think we can easily have the perception that, well, there's a bunch of engineers in this company, and they have tons of resources, so they can just try all these new things. That's not necessarily untrue. But the reality is that they have a very balanced approach to making those decisions, especially around building something new, and there are multiple people inside the company involved in those decisions. You know, he talked about people from the business side, even, sort of thinking through: do these things make sense for us to build? We have a problem, it's affecting the customer experience, it's creating, you know, some sort of friction in the business process. But they're not reactive in the way that they adopt or build new technologies. And so, yeah, it was really cool to hear about that from Netflix, and to hear that it isn't just a free-for-all of exploring new technology; they actually have a very principled approach to the way that they use new stuff.

Kostas Pardalis  21:40

Yeah, absolutely, Eric. And also, we need to keep in mind the unique nature of a company like Netflix, right? I'm pretty sure that if we were talking with someone from the, I don't know, the production teams that they have, right, the ones that produce the shows, for example, what we would hear would be a little bit different from what we heard from Ioannis. And there's a good reason for that. Because in the end, Netflix's main business is producing content and shows. Technology is there to support that. And that also gives a different, let's say, freedom and flexibility to the teams that work to build the technical backbone of the company. If you go, for example, to a company like Snowflake, which, by the way, is the company where Ioannis works now.

Eric Dodds  22:33

Oh, right, he went to Snowflake.

Kostas Pardalis  22:35

So maybe in the future we should also try to get him on another episode and see how the two environments are different, how the technologies are different, and the products. I'm pretty sure that at this point Snowflake has some very strict methodologies and processes for how to introduce new technologies, introduce new practices, or even change the core product that they have. So it's very easy to get excited when we see just one announcement or one blog post from a company, especially at that scale. But we should always try to remember what the company does, why it does it, and that in the end each company is different. The culture is also different. And all these things make the whole process of approaching new technologies very different from company to company, without saying that one is better than the other. In the end, what matters is the success of the product and the company, and it seems that even companies with very different approaches can succeed. So that's great.

Eric Dodds  23:38

Well, one last thing before we close out the season with this season wrap-up show: the last subject I wanted to touch on was trust. And we talked about this with multiple people. I think one of the first times that it came to the forefront of conversation, actually, and we didn't necessarily use the word trust, was our conversation with Axel from Pool. And aside from him telling an amazing story about Paul Graham telling him his startup idea was horrible, which was one of my favorite stories from the season, he had a really simple but really powerful piece of advice for early stage companies. He said you need to be very diligent about collecting the data. But, he said, especially in the early stage, sort of before, you know, being able to have statistical significance and make decisions based on that, I always use the data as a way to figure out which customers I should talk to directly in order to learn about how they're using my product. And that sort of brought up this idea that data involves a significant amount of trust, right? So both trust in the data, as Axel pointed out, and, as multiple other guests discussed, how that trust works inside of an organization. So that showed up in terms of companies where, I think it was Stephen Bailey from Immuta who said that there was just a huge lack of trust in data, and that, you know, sort of colors the way that people think about data engineering and the operations around it, and turning that around is a really significant effort. And then we also talked with Iteratively about the impact of their work around data governance for companies that adopt them. And they really pointed to trust as well: it creates more harmony between teams, because everyone is really confident in the data.
And I think another component that came up was that the people who are consuming the data are making decisions with it. And so the more confidence and trust they have, the faster they can move, and the better that is for the business. So that was a really powerful topic, I think, in terms of summarizing a lot of the themes that came up around data. What did you think about trust as sort of a summary of a lot of the things we discussed with our guests?

Kostas Pardalis  26:12

Oh, yeah, absolutely. I think that, as you said, trust both in the data itself and in the teams involved, because keep in mind that data inside an organization is an interdisciplinary thing, right? It's super important. And that's why we see the emergence of data governance and all these companies that are trying to tackle its different aspects, from access control to data quality. And this is something that, I mean, I keep saying that, but I'm really excited about what's going to happen in the next season, especially as we get inside MLOps, where, as we said, machine learning is not just about the impact that you have inside the organization with the data, but also about how to deliver innovative products to the customer. Trusting the data, trusting the models, trusting the products that are built on top of the data is going to be huge. We will see a lot of work done on that. And it can be a total disaster if, for example, your models or your data are exposed to your customers in a way that might be perceived as something wrong. I'll give you an example. We all experienced what happened at the Capitol Building, right? And I remember people talking about the recommendation engine, I think it was on YouTube, where there was a video of all these very sad things that were happening there, and the recommendation engine was recommending products related to survival and guns and stuff like that. Which, I mean, if you think about it just from the perspective of the data itself, it makes sense, right? The two are related concepts. But they are very wrongly related concepts at that specific time. So the more data is exposed and becomes an integral part of the products that we build, trust in both the data that we are using and what we build on top of it is going to be super, super important.
And that brings me to my last highlight of all the discussions that we had this past season, which is around open source. And I think open source is another way of building more trust in the technologies that we are using. It's a very foundational component of building technologies that we can use with our data in the best possible way. Absolutely.

Eric Dodds  28:55

Well, I am extremely excited about doing another season. We already have several episodes recorded that I'm really excited about, and we'll cover other topics. Please feel free to reach out to us with your feedback. We'd love to hear what we're doing well, what we're not doing well, what types of guests you would like to hear, and what types of topics you would like us to cover. So feel free to ping us on that. Kostas' email is kostas@rudderstack.com and mine is eric@rudderstack.com, and we will catch you in the next season.