Episode 66:

How Data Infrastructure Has Evolved and Managing High Performing Data Teams with Srivatsan Sridharan

December 15, 2021

This week on The Data Stack Show, Eric and Kostas talked with Srivatsan Sridharan, Head of Data Infrastructure at Robinhood. During the episode, Srivatsan discusses his experience helping to build a data team at Yelp, how organizations have evolved their view of data engineering over the years, and the unique challenges of data infrastructure in fintech.


Notes:

Highlights from this week’s conversation include:

  • Starting his career on the first-ever data team at Yelp (2:00)
  • How to approach the adoption of new technology (7:04)
  • When to use stream processing vs. batching (11:35)
  • What is a pipeline and why is it core to a data engineer? (14:07)
  • Where a new data scientist should begin their career (19:14)
  • The key factors impacting a new technology decision (27:09)
  • Managing team emotions in decision making (34:25)
  • The unique challenge of Fintech vs other consumer industries (45:03)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Automated Transcription – May contain errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. Today we're going to talk with Sri, who is the Head of Data Infrastructure at Robinhood, and I have tons of questions. This is not going to surprise you at all, Kostas. Before he went to Robinhood, Sri spent almost a decade at Yelp, and started doing data stuff there very early on, and then spent just a really long time there sort of heading up all sorts of stuff on the data infrastructure side of things. And so I am just fascinated to hear what his experience was like spending almost a decade at a startup like Yelp, especially because he joined shortly after the iPhone came to market. So he got to see so much change. So that is what I'm going to ask him about. What's on your mind?

Kostas Pardalis 01:12
Oh, yeah, I'll ask questions around how things have changed, like, all through all these years. I mean, he has been there doing data engineering, from when the term data engineer didn't even exist until today. And I'm very, very interested to see, from his perspective, how technology has changed, but also how organizations have changed through these times. So yeah, I'll focus on that. And I'm pretty sure that we will have more stuff to chat about with him.

Eric Dodds 01:40
All right. Well, let's dig in and talk with Sri. Let's do it. Sri, welcome to The Data Stack Show. We're super excited to talk with you.

Srivatsan Sridharan 01:48
Thank you, thank you, Eric and Kostas. Thank you for having me on the show. I'm also super excited.

Eric Dodds 01:53
Well, give us a background. You've worked in data for a long time. But can you tell us about your career journey and what you're doing today?

Srivatsan Sridharan 02:00
Yeah, definitely. So I guess I started my career after grad school about 10, 11 years ago. And in my first job, when I started at Yelp, I started on a team that was the original data team back then. We didn't have a concept of a data team or a data engineering team. But I still remember my very first project, when I was building ETL pipelines and building our very first data warehousing solution. And since then, I've really been excited about this space, because I've found that data forms the central fabric for everything in any organization. And I've always enjoyed kind of being in a position where I have a lot of breadth and visibility, and I thought data gave me that. And so I stayed on that track for many years as an engineer, and then transitioned into a management role, and built up the team and supported the team to help grow the data platform at Yelp. So I was there for about eight and a half, nine years in total. And then last year, I made a switch to Robinhood. I'm in a similar role here at Robinhood, supporting the data infrastructure org, working in a similar space but with a different set of challenges, because the fintech world is very different from the consumer web world. But yeah, my journey has mostly been in, you know, the intersection of data, people leadership, and technology.

Eric Dodds 03:20
Very cool. So many questions to get to, especially around going from consumer social to fintech, which is really interesting. I have a few questions about your time at Yelp. So first of all, the iPhone came out, I think, just a couple of years before you joined Yelp, and so it hadn't really had the exponential growth curve yet. And mobile adoption from, say, late 2007 or 2008 through to last year, over that decade, was just mind blowing. And Yelp, I'm guessing, was primarily mobile. I could be wrong about that, especially in the later years. But I'd love to know, from a data perspective, managing infrastructure through the mobile revolution, what was that like? Was there anything in particular, as everything shifted to mobile, that became a concern for you from an infrastructure standpoint? Would just love to know about that.

Srivatsan Sridharan 04:23
Yeah, that's a great question, Eric. And I was fortunate to be in a position to see that transition happen. I still remember, I forget the year, this was early in my time at Yelp, where we were discussing as a company that, hey, we need to go mobile first. And at the time, Yelp did have a mobile app and a web app, but most of our focus and effort was on the website. And there was plenty to think about, right? It's so funny to think about it. But I still remember being in that, you know, company all-hands meeting or whatever, where the CEO was like, we need to go mobile, mobile is the thing for the future. And the skeptical person in me at the time was like, who uses an iPhone? iPhones are too expensive, and nobody is going to use this. But I was wrong. And yeah, I mean, that really took off, as you said, over the next several years. And the interesting things that kind of manifested in the data world were primarily two things I can think of. Number one is the sheer rate of growth of data, right? Like, all of a sudden, you go from single digit millions of users to double digit millions of users or even triple digit millions of users. And that's an exponential increase in data, which means that all of the systems that store data, process data, transform data now need to adapt to this rapid change of scale and growth. The other interesting thing was, now we are dealing with lossy clients, right? So especially when it comes to tracking and understanding user behavior, which is a pretty common pattern that most consumer web companies do. You run your A/B tests, you collect data about how people interact with the app. Doing that on the website is easy, because it's not lossy, or it's less lossy; doing it on a mobile phone is very lossy. And so dealing with those challenges made things really interesting. So I'll give you a specific example. For instance, timestamps, everybody's fun topic, timestamps. I remember an issue where, when we were looking at computing metrics for the company, or measuring experiment results, you can't rely on timestamps coming from mobile devices, because you don't know when those messages are going to get emitted, right. And so dealing with those timestamp issues wasn't something that we had to deal with when we were just working with web as a platform. That's just a small example. But yeah, those were kind of the big things that came out.
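To make the timestamp problem concrete, here is a minimal sketch of one common mitigation, not necessarily what Yelp did: record a server-side ingestion time alongside the client-reported event time, and fall back to the server time when the client clock looks implausible. The field names and the skew threshold are hypothetical.

```python
# Hedged sketch: prefer the client-reported event time, but fall back to the
# server-side ingestion time when the client clock is clearly off (events
# "from the future" or buffered on-device for days). Field names and the
# tolerance window are hypothetical.
from datetime import datetime, timedelta, timezone

MAX_CLOCK_SKEW = timedelta(hours=48)


def resolve_event_time(event: dict, received_at: datetime) -> datetime:
    client_ts = event.get("client_timestamp")
    if client_ts is None:
        return received_at
    client_time = datetime.fromisoformat(client_ts).astimezone(timezone.utc)
    # Mobile clients can emit events late or with skewed clocks.
    if abs(received_at - client_time) > MAX_CLOCK_SKEW:
        return received_at
    return client_time


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    event = {"client_timestamp": "2030-01-01T00:00:00+00:00"}  # implausible client clock
    print(resolve_event_time(event, now))  # falls back to the server-side time
```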

Eric Dodds 06:37
Fascinating. So let's dig a little bit more into that. So you have exponential scale in data, and you really saw massive changes in infrastructure across a number of vectors. How did you approach adoption of new technologies, right? So let's just rewind and say it's 2010, and you have certain ways of storing data, data warehouse, pipelines, maybe you buy some, maybe you build some, whatever. And then over the decade, you have these technologies come out that really change a lot of that, or sort of change the process. How did you think about it? I mean, of course, I'm sure you modified the stack over time. And I'm thinking about our listeners, for whom there are so many tools coming out. How did you decide when it was actually time to make a change? Because there's a cost component to that, like engineering time, ROI over time. How did you approach that?

Srivatsan Sridharan 07:41
Yeah, that's a great question. And there's no easy answer to that, I think. So I'll talk about kind of how I approached it and talk about my observations on how I've seen the industry evolve here. My personal approach here is to not necessarily jump to the latest and greatest immediately, because, as you correctly pointed out, if this was an open source project I was working on, or a hobby project I was working on, yes, absolutely, I'm up for the new shiny tech, I'm gonna jump to the latest and greatest and learn all of those things. But I think working in an organization where you have to support existing use cases, your systems have to be really, really reliable, I'm not very keen on taking risks and jumping onto the latest shiny tech until it has been battle tested. And partly, this is also the appetite of the organization, right? There's one thing that I can think of, or my team can think of, but at the end of the day, we need to align with what the company's strategy is and what the company's appetite is. And so perhaps, if it's a much larger organization, then it might be harder to jump onto the latest and greatest very quickly, because the cost of migration is very high. But if it's a much smaller organization, or an organization that's early in its startup journey, it might be much easier to jump to the latest and greatest. So I'll give you an example of one such decision that was easy and one decision that was hard. The easier one was the adoption of Kafka. So this was perhaps in 2013 or 2014. At the time, LinkedIn had, you know, launched Kafka, and Kafka was really taking off. And our initial version of our ETL pipeline was, we had an open source queuing system that we were using, and it really wasn't scaling for us. And we were constantly dealing with issues with data loss and workers going down and basically the distributed system not being resilient enough, and then Kafka came along. And at the time, we were like, if Kafka really works for a company of LinkedIn's scale, it can certainly work for a company like Yelp. And so we started prototyping in 2013, 2014, and, you know, it really worked well for us, and we've seen Kafka become a very prominent industry standard now. So that was kind of an easier decision to take. The harder decision was the move to streaming, or stream processing, which probably happened around 2015 or 2016. Up until then, most of the data processing systems were batch oriented. But with Kafka picking up, there was a big kind of cottage industry of stream processing solutions coming up, both vendor and open source. So 2016, 2017, I remember us debating whether we should invest in a stream processing solution. Apache Flink was something that Yelp had used back then; it was promising new tech, it solved a lot of the problems that we wanted to solve. But the adoption was very hard, because a lot of that code was written in Scala, and Yelp at the time was a Python shop. And so that one was a harder decision to take, because it wasn't clear what the value proposition was going to be. We could see that it was solving some use cases for us, but we also saw barriers to adoption, because streaming, or stream processing, was also a harder concept for people to grasp, because you have to think about things like windowing and joining, which you normally don't think of in a batch processing system. So those are some things that immediately kind of jump to my mind.

Kostas Pardalis 11:05
Sri, it's very interesting what you mentioned about streaming versus batch. Where do you need stream processing? And where do you need batch? Or do you need both, or can you use only one? What's your opinion on that? Because there's a lot of debating, like, oh, we shouldn't do streaming, batching is just a subset of streaming, blah, blah, blah, all these things. So what's your opinion? And your experience?

Srivatsan Sridharan 11:31
Yeah, that's a million dollar question. That's a holy war right there, Kostas. I hope I don't antagonize our listeners here. But I think there's a place for both of them, although that might change in the next few years, right? I think streaming, or stream processing, works really well when you have use cases that require data to be real time, like order-of-millisecond latency. Let's say you want to do, you know, complex joins across multiple data sources, and the data is changing very, very quickly, and you need to be able to deliver results with order-of-milliseconds or order-of-seconds latency; then batching doesn't really cut it. An example of that would be anything that's on the critical path of a user journey, for instance, right. So you take any kind of consumer social media product; let's say you're tracking events, the clicks that the users are making, how they're navigating through the website. And let's say you want to provide personalized recommendations based on their behavior, and you want to be able to provide that personalized recommendation quickly. Then you can't rely on a batch processing system to do that, because the data is changing very quickly, you need to provide timely recommendations, and so on. And then in places where the use case is typically batch, where something needs to happen in the next four hours, in the next eight hours, in the next 30 minutes, batch processing lends itself better. That said, the worlds are bridging now, right? Like, we have micro-batching; Spark Streaming is really taking off. The folks at Databricks realized that Spark is a very powerful tool, and Spark Streaming is a micro-batch bridge for that. And then there are technologies like Beam, which try to abstract away batching and streaming behind a common API so that users don't have to worry about whether, you know, data is being batched or streamed under the hood. So I think the industry is changing, I think it's going to converge. But right now, I do think that there is an opportunity to leverage each of them for distinct use cases.
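To ground the windowing idea mentioned here, below is a minimal sketch of a micro-batch style, event-time windowed aggregation using PySpark Structured Streaming. The Kafka topic name, schema, window size, and watermark are hypothetical, purely for illustration.

```python
# Minimal sketch: counting click events per one-minute window with PySpark
# Structured Streaming (micro-batching). The topic name, schema, window size,
# and watermark are hypothetical, for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("click-window-counts").getOrCreate()

event_schema = (
    StructType()
    .add("user_id", StringType())
    .add("page", StringType())
    .add("event_time", TimestampType())
)

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "click_events")  # hypothetical topic name
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Event-time window plus a watermark to tolerate late-arriving mobile events.
counts = (
    clicks.withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "1 minute"), col("page"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```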

Kostas Pardalis 13:30
Yeah, yeah, super interesting. Ah, okay, I have a question. I started thinking about this during our conversation, to be honest. I realized that one of the most commonly used words in data engineering is the word pipeline. But we never, like, try to define what a pipeline is, right? Like, we take it for granted that a pipeline is something important, but I've never asked anyone, what is your pipeline? So based on your experience, and you have a very long experience, what is a pipeline? And why is it such a core concept in the life of a data engineer?

Srivatsan Sridharan 14:09
Yeah, that's an excellent question. To me, when I think about it from kind of abstract first principles, a pipeline is a way to transport and transform data. And the reason why I think it is so fundamental and so important is, most companies today run on data. And I would even like to claim that every single engineering team for any kind of tech company is a data team, because anything that you do revolves around moving, transporting, storing data, and a pipeline is a very core construct for that. And I know over the years, the meaning of the word pipeline has also changed, because I think 10 years ago, building a data pipeline meant you just, I don't know, SCP'd data from one place to another, right. Now we can't do that on a single machine in memory; this is a distributed data processing problem now, which is why pipelines have become much more challenging and complex over the years.
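As a toy illustration of "transport and transform," here is a minimal, single-machine sketch of a pipeline: extract records from a source, apply a transformation, and load them into a destination. The file and field names are hypothetical, and, as noted above, real pipelines distribute these stages across many machines.

```python
# Toy single-machine pipeline: extract -> transform -> load.
# File and field names are hypothetical; real pipelines distribute
# each of these stages across many machines.
import csv
import json
from typing import Iterable, Iterator


def extract(path: str) -> Iterator[dict]:
    """Read raw records from a CSV source."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Clean and reshape records: normalize fields, drop bad rows."""
    for row in rows:
        if not row.get("user_id"):
            continue  # drop records with no user
        yield {
            "user_id": row["user_id"],
            "page": row.get("page", "").strip().lower(),
        }


def load(rows: Iterable[dict], path: str) -> None:
    """Write transformed records to a destination as newline-delimited JSON."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "clean_events.jsonl")
```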

Kostas Pardalis 15:10
Mm hmm. A follow-up question to that. You mentioned that when you started working, you were actually what we define today as a data engineer, but nobody used the term back then, right? So why do we need to, let's say, define a different discipline in engineering? Like, how is a data engineer different compared to a software engineer, or an SRE, or, I don't know, a DevOps engineer, whatever? What's the difference there?

Srivatsan Sridharan 15:39
Yeah, that's a great question. I think what has happened over the years is, as the industry has grown, many of these roles have started to become specialized. So an example of, you know, why there is a special skill for data engineering, as opposed to, let's say, a backend engineer or an SRE, is because one of the key skills that a data engineer needs to have is the ability to debug large scale pipelines, the ability to optimize large scale pipelines. And that requires knowledge of how the data is organized and represented, how people are querying the data, how the data is stored, what the business use case for that data is. So all of these things, I mean, a backend engineer could do it, right, but it requires them to learn some special skills. And given that the use of data in the industry is growing more and more, I think companies are finding more value in carving this out as a niche profession. I think it's very similar to perhaps how DevOps morphed into SRE, right? Like, yeah, there was a time when the ops team would manage all the services, and then they realized that doesn't work. So I think it's something similar, where backend engineers were probably managing data, and then they realized that, okay, there is some special set of skills needed, in addition to being a good engineer, which is perhaps why it evolved that way.

Kostas Pardalis 17:00
Yeah. Yeah. And I think also, that's something that I usually say when somebody asks me, what is a data engineer? I think data engineering is an interesting discipline, because it's, let's say, a hybrid between ops and software engineering, right? You have to build your pipelines, but you also have to operate your pipelines, and operating the pipelines is closer to what an SRE does, for example, while building them is closer to what a software engineer does. So you need to have, let's say, a way of thinking that comes from both worlds. And I think we can see that also recently, reflected in the tools that we build for data engineers, right? Like, some of them are coming more from the SRE kind of space, some of them are coming from software engineering. So I think that's something that's very, very interesting with data engineering. So if someone wants to start their career as an engineer right now, and they are considering data engineering, or someone wants to make a change and go towards data engineering, what's your advice to them? Like, how should they start? And what is some fundamental, let's say, knowledge that someone should have?

Srivatsan Sridharan 18:11
Yeah, that's a great question as well. I think even the role of data engineering is becoming so vast that there are sub-roles within it. And so the way I like to think about it is, there's a whole spectrum from the infrastructure to the user of data. So the closest to the infrastructure is like a data infrastructure engineer; this is the person who's building things like Kafka, or building libraries on top, or managing these distributed systems. And then you go one layer on top, which is what is more traditionally known as a data engineer; this is the person who is using these technologies to build pipelines, to build data sets, to operate those pipelines. And then you go one layer above in the stack, and this person is sometimes called a data scientist or machine learning engineer, but these are the people who are using data to derive insights or make decisions for the company. So for someone who's considering a profession in data, I think the first kind of suggestion I would offer them is to try to figure out where they want to be in this data stack. Like, do they want to be close to the business, or do they want to be close to the infrastructure? And based on that, the paths vary. So if they want to be close to the business and they want to get involved in using data to make decisions, then kind of moving to a profession like data science, where educating themselves about statistics and machine learning techniques can be a good path. Whereas for someone who is interested in moving more to the infrastructure side, learning about how some of these distributed systems work could be a good path for them. But I think at the core of it is this passion for data. I think that's the fundamental thing, because if you have that passion for data, I think you discover your path, depending on the company and depending on the opportunities that those companies provide.

Kostas Pardalis 20:00
100%. No, I think you gave some really valuable advice there. We also have, as people might have heard recently, terms like analytics engineer, something that gets promoted a lot by dbt, because dbt, as you said, is a tool that usually affects people that are closer to the business than people that are closer to the infrastructure behind it. So yeah, and then you have MLOps, for example, so many things coming up, like, every day. So I think we will hear more and more terms around that stuff. Great. So another question. You mentioned some technologies like Kafka, for example, right? And you were an early adopter of Kafka, actually, from what I understand. And that's quite an achievement, to be honest, because Kafka is not exactly the easiest piece of technology to operate at scale, especially in the early days. So how have you seen the data stack change through all these years, right, from a little bit after the iPhone was released until today?

Srivatsan Sridharan 21:12
Yeah, I think there's been a huge tectonic shift in the industry, right? Like, just seven or eight years ago, some of these technologies were brand new and, as you correctly said, not easy to operate, right, not easy to scale. But I think what we've seen in the last few years is rapid expansion of these technologies and these technologies becoming very commoditized. The Snowflake IPO is a good example, right? Like, the company performing really well and carving a niche for itself. Similar to Confluent: what started as Kafka became a big company, an enterprise company there. So I think a lot of these technologies are becoming commoditized, and I think that's a big change. And I think that's actually healthy for the industry, because it reduces the barrier of entry for companies that want to get better, that want to serve their customers better with data, but maybe don't have the skill set and experience to build and operate these large scale distributed systems. And so I think what has happened is it's just democratized the space, and it's opened up possibilities for newer companies to emerge. Because if you look at seven, eight years back, it was only the likes of, you know, the Googles and the Facebooks, the places that were known for really good quality engineers, making dents in the data world. But now, if I were to start my own company, it's so easy for me. Obviously, I need to have the money to do that, because some of these vendor solutions might be expensive, but I can easily integrate with a Databricks or a Confluent or Snowflake and bootstrap my data stack.

Kostas Pardalis 22:41
And based on your experience, if you had to choose just one technology that you have worked with all these years, which one would you say was the most influential in making this tectonic change?

Srivatsan Sridharan 22:53
Yeah, it's a good question. Let me think about this. I think the ones that immediately come to my mind, probably not surprising to our audience, are Kafka and Spark; those are the two big things that I see. It's very clear how they became their own companies, and very successful because of that. And the shift to Spark was interesting, because when MapReduce came out many, many years ago, it was the hottest thing in the industry, and at the time I was thinking this is probably the compute framework for decades to come. And then Spark replaced it, and it's now the compute framework for perhaps decades to come. So I think those were very, very influential. Data warehousing has been an interesting topic, because I don't think there's been a clear winner in the data warehousing space. Data warehousing has been there for many, many years, right? Like, we've had the likes of Informatica and, you know, IBM, and it's not a new concept. And there are obviously newer technologies and new ways of doing it. But I feel like that space is still pretty wide open. Yeah.

Eric Dodds 23:55
Oh, sorry to jump in, Kostas.

Kostas Pardalis 23:58
No, go ahead, Eric.

Eric Dodds 24:00
I was just gonna say, one thing I would love your perspective on is: at Yelp you worked at a massive scale, and at Robinhood you're working at a massive scale. You have the data and the resources to do machine learning. If you think about the last decade, are these new technologies that are making these components of dealing with data easier unlocking machine learning use cases for companies that aren't operating at that larger scale, who don't necessarily have the resources? Are you starting to see that shift happen and sort of go down market, so machine learning use cases are now being made available, or sort of are much more easily enabled, for smaller companies?

Srivatsan Sridharan 24:45
Yeah, I definitely think so, because the barrier of entry for using data is very low. So you don't need to have a 40 person engineering team to, you know, build up this data stack anymore. If you have the funding, you just use these solutions and bootstrap that very quickly, and I do think that that's opening up new opportunities for smaller companies, for sure.

Kostas Pardalis 25:05
I was thinking, as you were giving your answer about the most influential technology, that the first thing that came to my mind is actually Kafka, but I'm super biased, because I built a company on top of Kafka. It was a very important piece of the architecture of Blendo. But anyway, we talked about the technology, but the technology lives inside some context, and the context usually is the company, right? So how does the technology reflect the company and vice versa? How have you seen different types of companies adopt different technologies? What kind of impact do those choices have? All these things. Let's see how we can approach this from more of an organizational point of view, based on your experience.

Srivatsan Sridharan 25:53
Yeah, it's a good question and a hard one to answer. Because at the end of the day, theoretically, you can evaluate these different technologies and say, hey, they provide this capability, they provide that capability. But when it comes to the rubber hitting the road, it becomes a different ballgame. Because you have the complexities around, what are the interests of the engineers on your team, right? Like, you can't just say, use this technology; if people don't like the technology, nothing will happen, right. And you also have to understand the lifecycle of the company, the appetite of leadership. So for instance, if the appetite of leadership is to move towards a BI solution, then you either have to adopt that strategy or convince, you know, people up the organization to change their strategy. So a lot of these things come into play. I think what I've seen, with many of these evaluations that we have done, like with me and my teams over the years, is that at the end of the day, the final decision ends up being based on three factors: cost, how quickly we can get this up and running, and what the excitement level of the engineers who are working on that piece of technology is. Because let's take data warehousing as an example. Obviously, different solutions have different benefits, but there are not a lot of foundational differences between Snowflake or Delta Lake or Iceberg or Hudi. They're fairly similar. There are some differences in feature sets. But at the end of the day, how do you pick, right? And I think that really boils down to, based on the existing organization's context, how quickly and how easily you can adopt this technology, make it work with the cloud or the data center that you have in your company, make it work with your developer tooling ecosystem, understand what your customers are passionate about, what your engineers are passionate about. So those things come into play.

Kostas Pardalis 27:48
Sorry, based on your experience, and you're working with many engineers, like, every day, a bit of a provocative question: do you think there is a technology out there right now that is much more, let's say, preferred by data engineers?

Srivatsan Sridharan 28:04
Great question. Let me think about this. I don't think I've heard a consistent answer; I've heard different people wanting different things. Maybe what I would say, going back to my previous comment about Kafka and Spark, is I think Kafka has become so foundational that people don't even think about it anymore, right? Like, it's a layer that exists underneath, and there are abstractions that people have built on top of it. And I think Spark is another critical one. When I've interviewed engineers, or when I have interviewed with other companies, I've seen this very common pattern where people kind of assume or expect that you know Spark, and that might be something that has, you know, come up. And Airflow is another thing, right? Like, some people love it, some people hate it, but a lot of the data engineering community uses it. So those are probably things that immediately come to my mind. But I've definitely seen the jury be divided there.

Kostas Pardalis 28:57
That’s super interesting. And sorry, Eric, I have one follow up question.

Eric Dodds 29:03
Good, provocative, that’s great. Provocative is good.

Kostas Pardalis 29:08
There's a very interesting detail in the technologies that you have mentioned: all three of them are open source. You didn't say, for example, Snowflake, the $100 billion, like, gorilla in the room. So do you think that being open source is something that is important when it comes to the preferences of developers?

Srivatsan Sridharan 29:27
I 100% believe that, because with a lot of the engineers that I've worked with, and being an engineer myself, I can kind of empathize with it, even though I'm probably a terrible engineer now. But I think it's very ingrained in us as engineers, right, this aspect of open source and being able to showcase to the world what we worked on, and being able to incorporate what the world has given to us. So what I've seen is, companies that are able to hire software engineers, data engineers, and so on, which have a high density of engineering, I've seen them naturally gravitate towards the solutions that have an open source component, because it's just more appealing. As you said, a $100 billion company like Snowflake is successful, and I've known a lot of big tech companies using Snowflake. But I've not seen a lot of big tech companies yet completely using vendor solutions across the board. So you might see a company that might be using Snowflake for their data warehousing solution, but they might be using open source Spark or open source Kafka or open source Flink. And that might be because engineers are very excited to work on open source. But companies that don't have a big engineering presence, I like to call them, and maybe I'll irk some people here, companies that have an IT department rather than an engineering department. I love our IT professionals, and I'm just making a joke here. But I think companies that don't have a big presence of data engineers or software engineers, they perhaps would want to go to off-the-shelf solutions, which just make it really easy for them to move.

Kostas Pardalis 31:03
Yeah, those are all great points. And I'm very interested to see how Snowflake is going to respond, let's say, to this lack of open source that they have, because I think at some point they will do something. Like, there is a gap there for them compared to Databricks, or Confluent, or, I don't know, even Google, let's say, for example. So I'm really looking to see what they are going to do in the next couple of months around that. That's all, Eric, it's all yours. No more questions.

Eric Dodds 31:35
I wanted to jump in, because we talked so much about technology, which we love, obviously; that's a huge part of what the show is about. But you have so much experience managing teams working on data and data infrastructure. And you mentioned your evaluation criteria for new tools, and I think, correct me if I'm wrong, Kostas, but it's the first time we've heard someone talk about engineer excitement as a major factor in adopting a technology. And I would love to just dig in on that a little bit more. Have we heard that, Kostas? You looked unsure.

Kostas Pardalis 32:15
Yeah, no, I don't think we have discussed this before. But I think especially among people who are responsible for hiring, or who have had, at some point, to hire engineers, the stack is something that's always, always important. Like, there are these jokes about COBOL, for example, where, I mean, being a COBOL developer right now is probably the best thing you can do in your life, because there are so few of them and there's so much COBOL code in banks. Seriously, seriously, I believe that, yeah. They can make, like, a crazy amount of money. But I don't think it's easy to go and hire COBOL developers, right? So it's always important what kind of stack you have, and the stack changes. We see, for example, what's happening with Rust right now as a language, what happened with Golang a couple of years ago, what happened with Scala even before that; that's why we have products like Kafka and Databricks being based on Scala, with whatever that means. Actually, I think, and Sri might correct me if I'm wrong, but in the past, the way that you were saying someone was a data engineer, one of the things that we were looking for is if they knew how to write code in Scala, which for, like, this period of time was considered the data engineering language. And these things change, and as they change, you have to keep that in mind, because, yeah, you might end up having issues hiring. So it is important.

Eric Dodds 33:46
Yeah. Well, this question is for both of you: how do you approach making a decision? And I'll set up a little bit of an unfair question, to continue our theme of trying to be provocative; maybe this isn't going to be productive. But, Sri, let's say you are looking at adopting a new technology. The cost component makes a ton of sense, the ease of migration makes a ton of sense, you have buy-in from your boss and/or the other stakeholders, but the engineers aren't excited about it. How much do you weigh that? And how do you think about that both from a near term and then a long term? Like, maybe they're not excited now, but this is the right decision. Like, how do you navigate that as a manager?

Srivatsan Sridharan 34:33
Yeah, that's a really good question. I think one of the biggest mistakes you can make as a manager here, which I have done before in the past, and therefore can speak with authority, is taking a decision in a vacuum and then going back to your engineering team and saying, this is the decision we've taken, and then trying to convince them to buy in on that decision, or even incurring the cost of people being pissed off and so on. So I think what I've learned to be a good way to approach it, I still won't call it the most optimal way, because I don't know if that's the most optimal way, but a good way to approach it is to include everyone in your decision making process. So not only include your stakeholders, your boss, but also include the engineers who are going to be responsible for implementing that. And then when you get all of these diverse perspectives together, consensus is much easier to build, because the people who are on the ground might have a lot more detail about how one specific thing works better than the other. And obviously, there's the excitement piece too, and the engineers can see what I am thinking or what the stakeholders are thinking. And it just builds that shared context, which makes the decision much easier to take. To kind of answer your earlier question, which you were kind of alluding to, I think, Eric, around how do you factor in the engineering excitement towards it: I don't think, you know, that should be the only factor, obviously. Because if you're over-indexing on your engineers' excitement, what happens if those engineers leave the company, right? Like, you can't really base a decision on what two or three people are excited about. It has to be a decision where the technology has a future, and you need to be confident about the technology having a future, and you need to be confident about the ability to hire people with those experiences and skills for that technology. This goes back to that Scala question, right? There are far fewer Scala developers today than there are Java and Python developers. So that's an interesting data point to consider, for instance. So at the end of the day, it's an input in the process, but it's not the be-all end-all.

Eric Dodds 36:38
Sure. So continuing on this topic of teams, because I think it's really helpful. One thing we chatted about as we were prepping for the show is, how do you think about building a team? Right? So we've talked about sort of the context of operating in companies like Yelp or Robinhood, where there's a ton of both sort of technological infrastructure and team infrastructure. You got to see that from the beginning at Yelp, and I would guess have built sort of internal teams in different disciplines. How do you think about, and how would you help our audience think through, the best way to approach building data infrastructure? And maybe you can help us think about that from, like, the startup perspective, which is very different from building a data team inside of a larger organization.

Srivatsan Sridharan 37:32
Yeah, I think, given what we've seen over the last eight years, 10 years, and so on, data is the foundational fabric for every company; every company is a data company today. And startups actually have a distinct advantage compared to larger companies, because you get to build things from the ground up and get things right from the get-go. Many organizations make the mistake of not investing in their data stack early on. It might be appealing to not invest in a data stack early on, because when you're an early stage startup, you're building the product, you're finding your product market fit, there are a lot of unknowns, and you want to be scrappy. You don't want to be investing in pipelines and curated data sets and machine learning techniques; you're just trying to get a product out of the door. So it's understandable why companies don't invest in data early on. But the cost of not doing that can be immense. So I'll give you an example. I think once companies become larger, and you're trying to introduce a data stack in a larger company, you deal with a bunch of problems. Number one, you're probably already a data company without realizing it, and if you've not invested in your data stack, you're probably doing a lot of manual work to generate your metrics, to generate your experiments. And then you need to take a sufficiently large organization and get them to adopt a new technology, which can be very, very costly, a lot of migrations. And then if you delay that even further, you get into the situation where different teams will start building their own data stack, whether they know it or not, because the business will put pressure on them to deliver data driven insights. So if you don't have a consolidated data stack, they would start building their siloed data stacks, and then you'll run into a situation where different domains of your company will produce different data, and that data won't agree with each other. You run into issues of schema compatibility, and what's the source of truth. And these are things that, you know, I've run into, and I've seen my peers and partners run into. So I think startups have a distinct advantage of getting this right. And I'll make a kind of brief hat tip to the data mesh, which is everybody's favorite topic these days. I do think it's a great paper. And I think building a self-serve data platform and making sure that every person in the company is invested in making sure that the data is accurate, and treating their data as if it were a product or an API, can make a huge difference early on.

Eric Dodds 39:53
And so let's talk about maybe one of our listeners who might find themselves in this situation where they buy into that. And maybe even they're in a situation where they came from a larger organization where everyone was bought into data, they had awesome tooling, it was pretty self-serve, they were able to do really cool things. And then they go to work for an earlier stage company, and they can kind of see, okay, in six months or a year, we're really going to wish we had sort of these pieces of infrastructure in place. Or even, a lot of times it can be, we're really going to wish we had this kind of data and had been collecting it for a long time, and we're not doing that. That can be kind of a hard sell, right? Because it's like, okay, I want to spend money and engineering resources, which are the most valuable, you know, sort of hours to vie for inside of a company. How do they sell that internally? Because, like you said, we're trying to get a product out the door, right? We're trying to figure out if we have product market fit. And in many ways, it's true when the founder says, that's just not as important as this. And it's, of course, more complicated than that. But how would you approach that?

Srivatsan Sridharan 41:09
Yeah, and there's merit to that, right. I think if you're a very, very early stage startup, and you're in this existential crisis mode, then maybe it's not right to invest in your data stack; then you need to figure out your story of what you're delivering and who your customers are. But I think once you've achieved some kind of a product market fit, that is a great time to push for building a data centric culture. And the way I would approach that, or would suggest it, is, obviously you can't go to your CEO or to your boss and say, I need five people, I need $5 million to set this up, right? Obviously, that's going to be a hard no. I think it's about finding incremental places, or incremental opportunities, to build towards a self-serve data platform for the future. So one example of that could be, every company has to report metrics, right? And typically, what companies do when they're early stage, a lot of this is manual, right? Somebody is writing a SQL query, somebody is putting that into an Excel, maybe there's a visualization dashboard, right? You throw that in there. And there's a good business case to be made around automating that: how to make that data correct, how to ensure that people aren't copy-pasting stuff and spending hours and hours validating data, and introducing a tool to solve that problem. And then once you solve that problem, then there comes the data collection problem, and you introduce a tool to solve the data collection. So I think you can incrementally build this. The key thing, I think, is making sure that you get your entire company invested in using data. If you're able to do that, the technology and building the platforms becomes much, much easier.

Eric Dodds 42:45
Yeah, it's interesting. It's easy to think about, okay, we need to build a data stack, and it's easy to start with, here are the upfront costs, right? It's engineering and technology or whatever. And it is, I think, way better to think about it in terms of, the metrics point is a great one, like, how much time can we save the company? Right? And in many cases, it's probably going to break even or even, like, have positive ROI. Because you're right, I mean, someone writing SQL, really, if you think about how many amazing tools we have, there are a lot of companies that, like, query the production Postgres or MySQL and then deliver an Excel file to an analyst who hammers on it to sort of do the metrics. And if you can automate all that, it's a huge savings. Yeah, yep, definitely.

Kostas Pardalis 43:37
Sri, I have one last question, because we're almost out of time here. Let me put it in a different way than it's usually asked: what's the difference between B2C and B2B when it comes to the data stack that someone needs, right? And there are, okay, some very obvious differences there. But you have experience in two, let's say, very different types of B2C companies. You've been at Yelp, and now you're at Robinhood, both of them consumer-facing products, many users, a lot of data, but they are very different in terms of the type of product and the industry they come from. So what are the differences there? How does the one data stack differ from the other because of being in a different industry?

Srivatsan Sridharan 44:23
Yeah, they definitely bring in unique challenges. I think the biggest contrast that I've seen is, the stakes get much higher in a fintech company. I don't mean to say that for companies that are in the social media world the stakes aren't high; I'm sure the stakes are high there too. But the analogy that I like to draw is, what happens if the review that you posted doesn't show up, or the like that you made on a post doesn't show up, versus the transaction that you made to purchase some shares at a certain price doesn't execute, right? I mean, this is not from a data perspective, just talking about the fundamental nature of the businesses. But I think what that translates to is, the stakes get higher on the data side. One particular place where it manifests is correctness. So in a company that's a, you know, B2C company, consumer web, social media, and so on, you can afford, I don't want to say you can afford to be incorrect, but there's a certain tolerance for correctness, right? Like, you can afford to be 99.999% correct with your data. In a fintech company, you can't afford to be anything less than 100% correct. And that introduces very unique challenges on the distributed systems side, because you have to not only optimize for, you know, scale and latency and fault tolerance and performance, but also for correctness. The flip side is, typically in large social media companies, the scale is much larger compared to fintech companies. And I don't mean this to be a sell for Robinhood, but I do think it's a unique space, because when you merge the world of fintech and a mass consumer product, you basically get challenges of scale and correctness, which is something very, very unique. The other thing is customer privacy, right? Obviously, it's a very big topic, it's a very important topic for everybody. Here, the stakes get higher again in a fintech company, because privacy means a lot more now than in a different kind of social media type company, because now you're dealing with people's money, and that just raises the stakes.

Kostas Pardalis 46:24
That's super interesting. And I'm pretty sure we probably need another episode to discuss how you can have scale and consistency and, like, everything at the same time. So hopefully we will have the chance to do that in the future, Sri. It was a great pleasure having you here. We really enjoyed talking with you, and we're looking forward to recording another episode with you.

Srivatsan Sridharan 46:50
Sounds great. And it was a pleasure. I think you guys asked really good, provocative questions, which I appreciated, and it was definitely a wonderful experience for me.

Eric Dodds 47:02
That was a great conversation. I'm trying to decide what my big takeaway is, but I'm actually going to talk about something from the very end of the conversation, which is when Sri talked about how the stakes are higher in fintech than in consumer. And I loved that you could tell there was a tension there. He didn't want to say that correctness was not important in, like, a social, direct-to-consumer app context, right? It is very important. But when you're talking about someone writing a review on a meal they had at a restaurant, versus someone trying to spend their own, you know, hard-earned money to buy stock in another company, it's just a little bit of a different game. And so, I guess, all of that to say, my big takeaway was, he said, in a consumer social company you can afford to be 99.9% correct, or however many nines he said, but in a consumer financial company you can't afford to be anything less than 100%. And it just really made me think about that a lot. Because in, like, so many things in life, sort of the last 1% to perfection can be the most difficult or the most complex, right? Or sort of building the infrastructure to ensure that. So that is going to consume a lot of my thought this week, 99 to 100%.

Kostas Pardalis 48:27
You're very philosophical today. Usually it's me who's more into that.

Eric Dodds 48:32
Maybe I'm just trying to share the burden with you.

Kostas Pardalis 48:38
Yeah, it was a great conversation. I think there are many things to take from this conversation. I mean, there's a wealth of information that comes from Sri. Like, today we defined what a data pipeline is and why it is important, you know, things that we take for granted but actually shouldn't be taken for granted. We should spend time meditating on all these core concepts that we have in our discipline, and always keep in mind that these things change very, very rapidly, right. And I think that's, like, the biggest takeaway from this conversation: how things can change really, really fast, and how you have to, let's say, always stay relevant in this profession and keep up to date with whatever is happening. And yeah, that's what I keep from this conversation. And I'm really looking forward to having another episode with him, because I think we have plenty more to chat about with him.

Eric Dodds 49:46
And we also learned that if your main goal is being highly in demand and making a ton of money, you should learn COBOL

Kostas Pardalis 49:52
Ah, yeah, that's true. That's true. Yeah, COBOL is, like, yeah, I mean, if you're willing to, you know, like, sacrifice so much of your life, then yeah, you will be rewarded. That's true.

Eric Dodds 50:07
Well, thanks for joining us, and we'll catch you on the next Data Stack Show. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.