This week on The Data Stack Show, Eric and Kostas get the chance to talk with Ananth Packkildurai, Zendesk’s principal data engineer and the author of the Data Engineering Weekly newsletter.
Highlights from this week’s episode:
- Ananth’s background (2:51)
- The evolution of Slack (4:54)
- Kafka and Presto: two of the most reliable and flexible tools for Ananth (9:43)
- How Snowflake gained an advantage over Presto (13:24)
- Opinions about data lakes (17:23)
- Core features of data infrastructure (23:22)
- The tools define the process, and not the other way around (31:30)
- Defining a data mesh (36:44)
- Data is inherently social in nature (40:31)
- Lessons learned from writing Data Engineering Weekly (49:14)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Eric Dodds 00:27
Welcome back to The Data Stack Show. Today we’re talking with Ananth Packkildurai who publishes the Data Engineering Weekly newsletter and my guess is that a lot of you out there listening are already subscribed. If you haven’t, definitely subscribe. Kostas and I both are avid readers, and it’s a tool we use to keep up to date on the industry. And Ananth has a fascinating data engineering career. He has worked at Slack, sort of in, you know, hyper-growth mode, which is really cool. And now works at Zendesk. Kostas, I think my question, which I’m probably just gonna try to steal out of the gate and get in before the conversation gets going, is: what was it like to be a data engineer at Slack in 2016, versus 2020? Because that period of growth for them is really mind-boggling in many ways. I don’t have the numbers on hand, but I’m so fascinated to hear about that. So after I ask that question, what are you going to ask?
Kostas Pardalis 01:28
Okay, first of all, I have to say that you’re quite predictable. But it’s a very good question. Being at Slack at that time, both in terms of the lifecycle of Slack, and also in terms of where the industry was back then with data technology and all that stuff, I think it makes a lot of sense to ask him that, and I think we’re going to hear some very interesting things. For me, I’d love to get a little more technical and abstract at the same time with him: chat a little bit about architectures, what the differences are, which products he has seen that are quite important and have been in use since he started working, and how the data stack has changed.
Eric Dodds 02:16
How predictable. Alright, let’s jump in and talk with Ananth. Ananth, we are really excited to talk to you today. I feel like we never have enough time to talk to any of our guests. But the list of things available to us to talk to you about is really, really long. We’ll get right to it. But first of all, thanks for joining the show.
Ananth Packkildurai 02:38
Hey, thank you for having me.
Eric Dodds 02:40
Okay, we start every episode the same way, by just asking you to give a little bit of your background and how you ended up where you are today. So tell us about your background and what you’re up to.
Ananth Packkildurai 02:51
Yeah, totally. I’m working as a principal data engineer for Zendesk, essentially overseeing customer-facing analytics and what makes Zendesk tick in terms of analytics, from the earlier layers through post-processing, where I build the data infrastructure, the orchestration engine, all sorts of things. I have an engineering background and have been working in the industry for almost 15 years now, so I’m pretty passionate about data engineering, and I’m happy to be talking about that.
Eric Dodds 03:23
And you mentioned this in the intro, but you also publish the Data Engineering Weekly newsletter, which has grown significantly. If anyone in our audience doesn’t subscribe, you should absolutely subscribe to the Data Engineering Weekly newsletter. I personally get a lot of my news about the industry and the space from it. So congrats on that being a really successful newsletter.
Ananth Packkildurai 03:47
Thank you. We just crossed the 50th edition of the newsletter. So I’m looking forward to building more on top of it.
Eric Dodds 03:53
Great, well, well, we’ll talk a little bit more about that just because as you know, content producers ourselves, Kostas and I have lots of questions to ask you about that. You’re so productive with that. But let’s start with Slack. Slack is a fascinating company, just an unbelievable acquisition by Salesforce, kind of mind boggling. And you started there in 2016. And as I was thinking about this before the show, I actually remember, gosh, I want to say it was 2015 or 2016, when the company I was at switched from HipChat to Slack. And Slack was kind of, at least for us, in our circles, kind of a new thing. And we were like, Oh, this is awesome. And then you were there for four or five years. So you got to see this unbelievable growth in Slack. Can you just describe to us what it was like when you started there? And then what was it like towards the end of your time from sort of a data engineering perspective?
Ananth Packkildurai 04:54
Yeah, I think that’s a good question. In 2016, we were barely starting to build the base foundation for data infrastructure. And the thing about Slack, as you mentioned, is that it grew exponentially, so whatever assumptions we made would be invalidated within six months, right? The scale at which it grew, and our ability to scale the data platform at that pace so the business could keep innovating on top of it, that was the big challenge we ran through. I’m glad I was there, learning a lot of pragmatic decisions to scale the system over that period of time.
Eric Dodds 05:40
And in 2016, when you started, you said you were just barely building out the data infrastructure. What was the componentry there? And what were the initial needs you were trying to address? I know that the exponential growth means that a lot of those things need to be updated in a short amount of time. But what were the initial problems in 2016?
Ananth Packkildurai 06:03
Mostly it was: we have this analytics team, so now we have to enable and empower them to build their reporting and dashboarding needs, to understand our product usage, to understand our customer experience, that sort of thing. How do we empower the analytics team and our data science team to be much more productive? When you started in 2016, there was no Snowflake, there was no other mature database for us to go and approach, and not much maturity in the ingestion frameworks either, so we had to build everything: either you adopt an open source system, or you build it yourself. And even if you adopt an open source system, it may not scale at some point in time, so you have to really dig in and do certain things. The components we had at the time: we were running one 12-node Kafka cluster, we were barely starting to use Airflow, and we ran a single-instance Airflow for some time, plus an EMR cluster. It was a pretty standard structure to begin with, for any data infrastructure.
Eric Dodds 07:18
Got it. Yeah, it’s wild to think about Snowflake not being mature enough back in 2016 for Slack. It’s just crazy to hear that. And then just to quickly fast forward, what was the high-level architecture in 2020, towards the end of your time there?
Ananth Packkildurai 07:39
Yeah, I mean, the foundation remains the same, but the scalability of the system keeps changing, right? For scalability, we could no longer run Airflow on a single box. At some point in time we were well past 1,000-plus DAGs and almost 30k tasks running per day, so we adopted a distributed approach on the Airflow side. We were also increasingly adopting other vendors like Snowflake for certain business use cases, we grew our EMR cluster to larger instances, and Kafka went from one 12-node cluster to three or four Kafka clusters with 80-plus nodes. So the system components fundamentally remain the same. But there were more and more security and compliance requirements we had to satisfy, so we added additional layers on top to fulfill those.
Ananth Packkildurai 08:40
So it was about scaling while keeping the foundations the same. Because of the rate of business innovation happening at the time and the speed at which we were moving, it would have been very, very expensive to change anything foundational. So we kept iterating, kept scaling things. For example, we ran Airflow on a single box for a very long time. And one of Airflow’s scalability problems is with sensors waiting for files to arrive or partitions to become available: it just spins off a process that keeps waiting, doing nothing. So we had a choice: either go distributed, or fix it and get a little more headroom. So we implemented a sensor that releases its slot and retries at a later point. I think the Airflow community also adopted a similar pattern later. So we always had to understand what was really going on internally, and then fix it for our needs to make sure things kept running.
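The sensor problem Ananth describes can be sketched in plain Python. This is a toy illustration, not Airflow code; the class and function names below are invented for the example. A "poke"-style sensor holds its worker slot for the entire wait, while a "reschedule"-style sensor gives the slot back between checks, which is the pattern Airflow later shipped as its sensors' reschedule mode.

```python
import time

class WorkerPool:
    """Tracks how many worker slots are free at any moment."""
    def __init__(self, slots):
        self.free = slots

    def acquire(self):
        self.free -= 1

    def release(self):
        self.free += 1

def poke_sensor(pool, condition, interval=0.01, checks=5):
    """Holds a worker slot for the entire wait: the slot is occupied
    even while the sensor is sleeping and doing nothing."""
    pool.acquire()
    try:
        for _ in range(checks):
            if condition():
                return True
            time.sleep(interval)
        return False
    finally:
        pool.release()

def reschedule_sensor(pool, condition, interval=0.01, checks=5):
    """Releases the slot between checks, so other tasks can use the
    worker while this sensor is waiting to be re-queued."""
    for _ in range(checks):
        pool.acquire()
        try:
            if condition():
                return True
        finally:
            pool.release()
        time.sleep(interval)
    return False
```

With a small pool and long waits, the poke style pins workers doing nothing, which is why a single-box Airflow ran out of capacity; the reschedule style trades a little latency for far better worker utilization.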
Kostas Pardalis 09:43
Ananth, I have a question for you. You’ve been in this space for quite a while and you’ve been through a very pivotal phase with the space that has to do with all the technology around data. If you had to choose, let’s say one technology which you consider as the biggest thing that happened this past, let’s say six to seven years, what would you choose? You mentioned a few different things. You mentioned Kafka, EMR, Airflow. But if you had to choose just one, what would that be?
Ananth Packkildurai 10:16
That’s a tough choice, I think. In the last eight years, there are two technologies that have kind of defined data infrastructure. One is Kafka. The other, I would say, is Presto. It’s the reliability and the flexibility these two tools provide. I would choose either one of them; I would rate both equally.
Kostas Pardalis 10:46
That’s really interesting. It’s one of the first times actually that someone is mentioning Presto in our conversation. Can you give us a reason, why Kafka and why Presto? Why these two?
Ananth Packkildurai 10:59
Yeah, so if you take Kafka, the guarantee that Kafka as a system gives is pretty simple, a very, very simple guarantee: you send an event, it stores it in sequence and gives you access to it in sequence, and you can scale it up whatever way you want. And it’s much more reliable. That eliminates a big need for us, because everything we deal with in data infrastructure comes out of events, and when you have a solid system like Kafka that can scale, store, and process that information in a much more reliable way, that’s a big win for the data infrastructure. The second thing is Presto. Presto is kind of an engine. There are two things I like: a clear separation of storage and compute. The way Presto was designed from the beginning, it focuses just on the compute, not on the storage part, relying on all the other storage engines and data formats to do their jobs. So the same principle goes for Presto: the guarantees and the constraints of the system are much simplified, and it does that very well. Having this clear boundary established by those two systems, each doing what it does really well, I think that’s what makes them very powerful. And I think that’s the critical thing for any other system getting adopted in our sector: stick to this boundary and do it very well.
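Kafka's guarantee as Ananth describes it, append events to a log and read them back in the same sequence by offset, can be illustrated with a toy sketch (a deliberate simplification for this conversation, not how Kafka is actually implemented):

```python
class LogPartition:
    """A toy append-only log: each event gets the next sequential
    offset, and reads always return events in write order."""
    def __init__(self):
        self._events = []

    def append(self, event):
        offset = len(self._events)
        self._events.append(event)
        return offset  # offsets are assigned in strict sequence

    def read(self, from_offset=0, max_events=None):
        """Consumers track their own offset and can replay from anywhere."""
        end = None if max_events is None else from_offset + max_events
        return self._events[from_offset:end]

log = LogPartition()
for e in ["signup", "login", "message_sent"]:
    log.append(e)

print(log.read(from_offset=1))  # ['login', 'message_sent']
```

The simplicity of that contract is the point: because the log only ever appends and hands back sequences, it is easy to replicate, scale, and reason about, which is what makes it a reliable backbone for event-driven data infrastructure.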
Kostas Pardalis 12:41
Yeah, absolutely. And actually, I found very interesting what you mentioned about Presto and the separation of computing to storage. And I want to ask you, because you also mentioned Snowflake and Snowflake built a whole story around delivering exactly that, right? Like we are the first data warehouse that separates storage from computing, and these guys like these, these and that benefits, blah, blah, blah, all that stuff. And that’s also my belief, to be honest, it’s not like a new concept, right? Presto, in a way, was doing something similar. Why do you think that although we had Presto, today, we are talking about Snowflake and not Presto. What made Snowflake win, let’s say in this market at the end?
Ananth Packkildurai 13:24
I would give a little more credit to Redshift here. When Redshift came out, around 2014, I think it changed the way people had been thinking since Hadoop in 2008. Hadoop introduced this concept of big data processing where you can throw anything in and then process anything, and then quickly people understood there was a mistake: you cannot just throw anything in without a structure, or your processing will not be as efficient as you would like. So then the shift happened to, okay, let’s store it in Parquet or ORC format. And if you store it that way, you want to access it in a very structured way, and that’s SQL, which other systems then popularized. And then we slowly went back to the traditional data warehousing style: why can’t we do that with a database? I think Redshift was the first system, in my opinion, that tried to crack that, and whether it successfully answered it is left to the market to decide. I think Snowflake then came along and captured the concept from S3 and Presto of having a clear separation of storage and compute, the elasticity that provides to scale independently, and the mistakes we learned from HDFS and the Hadoop world, and they built that very well. That is where the traction is: a managed service that also adopted a similar strategy of scaling compute and storage separately. As people famously say, life is too short to scale both compute and storage at the same time. So I think they got in at the right moment, the right time, with the right business problem in the industry.
Kostas Pardalis 15:25
Yeah, and my feeling, from discussing more and more with people around that: in a recent conversation we had with another guest, he said something about Snowflake, that Snowflake is like the Apple of databases, of data technologies. With Apple, you get the iPhone; that’s exactly what he was saying, and I found it very interesting. I get the iPhone, and I keep buying the iPhone. I don’t know exactly why, but I like it. And I think that’s what Snowflake did very well compared to other technologies: it was also the product experience itself. It wasn’t just, okay, we separate storage from compute. In a way, Hadoop was also doing that, right? You have HDFS, and then you could run Spark on top of it. So it’s not a new concept, but the actual product experience that Snowflake provides, especially if you compare it with Redshift, because back then Redshift was the dominant data warehouse. Then Snowflake came, and Redshift, compared to Snowflake, was too much manual labor, right? Vacuums, deep copies; to resize your cluster, you had to have downtime. And suddenly you went to Snowflake and you didn’t have to do any of these things. So I think the product experience they offered to customers was amazing, especially for this product category. That’s something they did very well. And you mentioned the concept of taking all the data, regardless of structure, and trying to process it, and how this generates a mess. What’s your opinion about data lakes, and the data lake as an architecture? Because it kind of has this approach, right? The whole idea is: we have a file system, let’s throw everything there, then create a first layer of some structure, then another layer, and then start processing. What’s your opinion about data lakes?
Ananth Packkildurai 17:24
Yeah, that’s a good question. With data lakes, I would say we should not implement a data lake just for the sake of implementing a data lake; it all depends on the nature of the data source we’re dealing with. I like to categorize data sources in two ways: controlled data sources and non-controlled data sources. What I mean is this: a controlled data source, taking the example of a company like Slack, is where the user interaction events are generated within the product experience. Whatever you’re capturing is a completely controlled environment, because it’s within the company’s scope, so you can define the structure upfront. If you have controlled data production capabilities, I don’t think there is any reason you should not take a structured approach to dealing with the data. But that cannot always be true; there are scenarios where the data source producing the data is not within your control. In that case, it makes sense to take the data, like satellite imagery or any other information you’re getting from a third party, put it into a data lake, and then apply a structured approach afterwards. So it’s not one size fits all. I’m sure you’re aware the lakehouse concept is coming along, with Delta Lake and Apache Hudi and Iceberg kinds of systems. I feel like these two approaches will go hand in hand. It depends on the nature of the source and the nature of the data we’re capturing.
Kostas Pardalis 19:12
Yeah, I completely agree with you. And actually, it’s a space that I’m really paying attention to. There are so many new technologies out there. You mentioned Hudi, you mentioned Iceberg. I’m pretty sure all of them have received funding, or are going to receive it if they haven’t already. So we’re also going to see quite a few new companies in this space, and I think it’s going to be very interesting to see what kind of products come out of it. This is a merging of the data lake concept and the data warehouse concept, which is something we saw with Snowflake. Snowflake started as a data warehouse; at some point they changed the narrative a little bit and started saying you can also build a data lake on top of Snowflake. And now they’re talking about the data cloud, which is an even more expansive concept than the previous one. So I think this whole category is still being defined, and it’s going to be very interesting to see where it goes and what will happen.
Ananth Packkildurai 20:07
One of the things that I’m very passionate about, and that I’m interested to watch unfold in that space, is something Snowflake calls data sharing: you can share certain, maybe publicly available, data, or any other data, across different companies. We see more and more companies adopting SaaS products, right? If I’m a retailer and I want to open a shop, I can use Shopify, I can accept payments through Square or Stripe. If I want to run a business, I can literally use a mesh of SaaS technologies to build my business over a period of time. Now the question becomes: how do I get the intelligence out of it? Each and every system is very good on its own, but how do we get the integrated view? I think that’s where data sharing is going to emerge. And obviously the CDP platforms like RudderStack, and how much they are going to disrupt this space. That is something that is very interesting to watch.
Kostas Pardalis 21:10
Yeah, absolutely. That’s very interesting, and that’s a very good point. Actually, I remember when I was reading the S-1 filing from Snowflake, one of the first things they talked about was these data sharing capabilities and data marketplaces, and how important this direction of data sharing is to their vision, especially because, based on what they’re saying at least, it generates very strong network effects. From a business perspective, we’re talking about a completely different game if you manage to implement something like this and actually get people to use these kinds of capabilities on top of your data warehouse, data cloud, or whatever you want to call it. So yeah, I really want to see what’s going to happen with data sharing. That’s super interesting. Cool. So another question, actually not exactly technical, more of an architectural question. You mentioned when you were talking with Eric that what has changed since the beginning of Slack until today is not so much the architecture itself, but the scale of the architecture, right? And from what I understand, at least, there are some standard components out there. You might implement them differently depending on your scale or your needs, but in the end the data infrastructure has some kind of standard structure. Can you give us your perspective on that? What does this architecture look like? What are these components? And then we can discuss a little bit more about each one of them.
Ananth Packkildurai 22:49
So I mean, this again, is one year gone, so whatever I’m telling right now might be invalidated. So this is where I used to be. So just take that into consideration.
Kostas Pardalis 23:00
Sorry for interrupting you. I’m not talking that much about Slack specifically, but in general, from your experience as a data engineer. So it doesn’t have to be something specific to Slack itself. I guess there is some kind of more generic architecture, some pattern that we find in every company, right?
Ananth Packkildurai 23:22
Yeah, totally. So there are two parts to it. One is the data infrastructure perspective, and the other is the data management perspective. The standard component I found in most companies is Kafka, obviously, taking all the events and streaming them into Kafka. To stream the data, there’s a common approach in most companies: they use Avro or Protobuf or Thrift structures, and some kind of agent running on the individual machines to capture that data and send it back. I saw that approach very predominantly. Then the other part is ingesting third-party data, like marketing data or your Stripe, HubSpot, and whatnot. It seems people are increasingly adopting SaaS solutions for that, like the various ingestion frameworks, over a period of time. In terms of where the data gets stored, it’s predominantly S3; most of the ecosystems I see in AWS store data in S3 for long-term retention, maybe in Parquet format, in columnar storage. So that’s just getting the data into the system. And how do they access the data from there? People may use Snowflake or Redshift or any other database for storage; it depends on the scale and the need. Presto and Spark SQL are predominantly used for a lot of the SQL query processing. That’s mostly the infrastructure perspective. I think the orchestration engine is one of the core features of data infrastructure. I’ve seen very widespread adoption of Airflow, and obviously Prefect and Dagster are also gaining adoption. So that’s the standard structure I see.
So there is an ingestion layer that brings the data either from third parties or from the internal systems, and there is the storage for long-term retention, and also for efficient access after the metrics have been computed over a period of time. The nature of a data pipeline is that as you go downstream, the volume of the data narrows, because further down the pipeline you tend to store more and more aggregated information, losing granularity over time. So I see more aggregated information getting stored in Snowflake, and people adopting a hybrid strategy where raw data lands in S3 and the aggregations flow into Snowflake or Redshift. So yeah, that’s pretty much a standard.
Kostas Pardalis 26:32
That’s interesting. And how do you go from the raw data on S3 to the aggregations that are going to be stored inside Snowflake? What tools are usually used for that?
Ananth Packkildurai 26:44
So usually Airflow, which already supports a Snowflake operator, or some kind of tooling that can do a bulk insert into either Redshift or Snowflake. Because more and more aggregation happens down the pipeline, it narrows toward the business logic, the business domain, right? So if a marketing team wants a set of aggregations, those can then be easily accessed to build dashboarding and reporting on top of them. Different teams are trying to understand the data at different aggregation levels. I think that is where tools like Snowflake are very, very helpful: the concepts of cloning, virtual tables, and all sorts of things are very powerful, letting you easily clone certain datasets to share across different domains.
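As a rough sketch of the pattern described here, the bulk load from S3 into Snowflake typically boils down to a COPY INTO statement that an orchestrator like Airflow runs on a schedule. The table, stage, and path names below are hypothetical, and the helper is an illustration rather than any specific operator's API:

```python
def build_copy_into(table, stage, prefix, file_format="PARQUET"):
    """Render a Snowflake COPY INTO statement for files sitting under
    a prefix of an external stage. Identifiers are caller-supplied and
    not validated here; this only builds the SQL text."""
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage}/{prefix}\n"
        f"FILE_FORMAT = (TYPE = {file_format})"
    )

sql = build_copy_into(
    table="analytics.daily_usage_agg",  # hypothetical aggregate table
    stage="raw_events_stage",           # hypothetical external S3 stage
    prefix="aggregates/2021-03-01/",
)
print(sql)
```

In an Airflow deployment, a statement like this would typically be executed by the Snowflake provider's operator or hook as the final step of the aggregation DAG, so the raw data stays in S3 and only the narrowed-down aggregates land in the warehouse.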
Kostas Pardalis 27:50
Makes sense. And okay, it seems that the data infrastructure, at least, is something quite well defined so far. What about data management, and data governance within data management? You mentioned that this is another important part. What is the state of data management today? Do you see technologies missing, products missing? Or is everything in place? What about best practices? What’s your feeling about that?
Ananth Packkildurai 28:15
So I think data management usually comes as an afterthought. That’s what I’ve found at most of the companies I talk to. No one is going to say straightaway, this is my data governance tool, let me bring the data in; there will already be some kind of ecosystem in place, and people running some analytics. At some point in time, either they’re going public, or they have some kind of GDPR compliance need, or they want to expand the market, and then they need data governance. So it will always be an afterthought, and that makes adopting a data management platform very, very complicated, because you have to find a solution that somehow integrates with your existing infrastructure, and that infrastructure might carry a lot of technical debt and multiple integration problems. I think that is the significant challenge of data management. Even data lineage and data quality, I always think, are an afterthought in most data pipelines when they get started. So that is a big, significant challenge. I don’t think anyone has solved this problem; I think it will remain this way, and it will be a challenge moving forward. It will remain for some time, because it’s hard to expect a business to think through that aspect while they’re bootstrapping a company. The company will essentially look to solve the business problem, not necessarily data governance and the other aspects. And that’s the gap: between the time they realize they need data governance and the tooling that is available in the market, there will always be a mismatch between the system they built and the governance that is offered, right?
I can give a simple example. Look at the existing open source tooling available to implement data governance, like Apache Ranger, for example. It heavily relies on a Kerberos or LDAP system being in place to define the role-based access and all those things. If I had to implement role-based access that way, I would also have to manage Kerberos or LDAP, which are pretty expensive things to run in my system and which the organization may not need at all. So that is where the gap is. How do you manage that? It’s going to be a bigger challenge.
Eric Dodds 30:46
And one thing on data governance that I’d love your thoughts on, especially having seen it at scale at companies like Slack and Zendesk: the tooling is certainly one part of it, and I think you made a really good point about the mismatch between the needs of the company, especially in the early stages, and the tooling that’s available, and just how much time you invest in it. But we’ve also heard repeatedly on the show that the other side of data governance is really cultural inside the company, right? You can have great tooling, but you really have to have a culture around it and align teams around it. What’s your experience been on that front, not considering the tooling aspect of data governance?
Ananth Packkildurai 31:30
So I have a reverse thought on that. In my opinion, the tools essentially define the process, not the other way around. If the tool is sufficient and fulfilling enough, then that is a good enough process to put forward to implement data governance. I think of it the way we think about systems: it’s never the people’s fault, it’s always the system’s fault when a system fails, right? So in this case, I would say the tools are still not there; it’s not that the people are failing.
Eric Dodds 32:06
Yeah, super interesting. I'd agree with that. I mean, I think the primitive example of that is just how many companies use Google Sheets as their primary store for managing data governance, or at least alignment across teams, which sounds really crazy. But I think that's a symptom of what you're talking about, where the tooling really hasn't matured at an enterprise scale that allows people to do it.
Ananth Packkildurai 32:35
An efficient tool always, you know, enables workflows to be much more efficient. So let's say I wanted to implement data governance, and this is the workflow my company follows: this is how I generate the data. Let's say I use protobuf to generate the data sets. Then there should be a systematic way for me to say, hey, whenever you're writing this protobuf message, add this tag that marks it as SPI, and we will take care of the rest. Which means it doesn't change any of my developer workflow, or the way people access the system; the tool is behind the scenes taking care of it. If you don't have that kind of tool that meets the developer experience or the user experience there — and that goes back to the point we discussed about how Snowflake became successful — I think that's the bigger lesson we need to learn. Are the tools meeting the user's needs and giving the user experience a higher chance of success, rather than just checking off tick marks for a period of time? Okay, yeah, there's an "encrypting PII" tick mark, but that doesn't scale.
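To make the tagging idea concrete, here is a minimal, hypothetical sketch in Python, using dataclass field metadata to stand in for a protobuf field option (the `pii` flag and the `scrub_pii` helper are invented for illustration — this is not Slack's mechanism or an actual protobuf API):

```python
from dataclasses import dataclass, field, fields

# Developers annotate sensitive fields once, at schema definition time.
@dataclass
class SignupEvent:
    user_id: str
    email: str = field(metadata={"pii": True})
    plan: str = "free"

def scrub_pii(event):
    """Redact any field tagged pii=True before it leaves the pipeline."""
    redacted = {}
    for f in fields(event):
        value = getattr(event, f.name)
        redacted[f.name] = "[REDACTED]" if f.metadata.get("pii") else value
    return redacted

print(scrub_pii(SignupEvent(user_id="u1", email="a@b.com")))
# {'user_id': 'u1', 'email': '[REDACTED]', 'plan': 'free'}
```

The point of the sketch is the workflow Ananth describes: the developer annotates the field once, in the schema, and the pipeline scrubs tagged fields automatically, with no change to how anyone writes or reads events.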
Eric Dodds 33:48
Yeah. And one thing that’s interesting about data governance is that you sort of have two opposing forces, which makes building the tool really difficult. I’m not an expert, of course, but data by nature, especially when you think about things like customer data, is very dynamic. It’s constantly changing. There’s always some sort of mess with data. And governance is all about some sort of standardization and enforcement. And those two opposing forces, I think, make the tooling pretty hard, especially when the needs vary by business model, team structure, technologies used. So to have a sort of pervasive set of tools that solve those problems really elegantly is challenging because of those opposing forces.
Ananth Packkildurai 34:34
Yeah, totally. I think this is similar to a conversation I had some time back about Master Data Management systems too — where does the MDM system stand at larger scale? With a system like MDM, it's much easier and more efficient to implement data governance. The fundamental difference in the world we are living in is that modern data technologies mostly adopted a schema-on-read approach: storage and compute are completely separated, and I no longer necessarily need to know the schema up front. Whenever I'm going to read the data, that's when I learn what the schema inside it is. Whereas if you want to enforce data governance with something like a Master Data Management system, you need a schema-on-write system: before I write, I need to know what I'm writing. So there is a conflict of philosophy between the modern data pipeline approach in data engineering and having data governance, with the MDM technology sort of trying to catch up. It's an interesting challenge to solve.
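A toy illustration of the two philosophies, with an invented two-field schema (nothing here reflects a specific MDM product): schema-on-write validates before persisting; schema-on-read stores raw and interprets only when reading.

```python
import json

SCHEMA = {"user_id": str, "amount": float}  # assumed schema for the example

def write_schema_on_write(record, store):
    """Schema-on-write: validate before persisting, reject anything else."""
    for key, typ in SCHEMA.items():
        if not isinstance(record.get(key), typ):
            raise ValueError(f"schema violation on field {key!r}")
    store.append(json.dumps(record))

def read_schema_on_read(store):
    """Schema-on-read: persist raw, interpret (and coerce) only at read time."""
    for line in store:
        raw = json.loads(line)
        yield {k: SCHEMA[k](raw[k]) for k in SCHEMA if k in raw}

lake = ['{"user_id": "u1", "amount": "9.99"}']  # raw, loosely-typed landing zone
print(list(read_schema_on_read(lake)))
# [{'user_id': 'u1', 'amount': 9.99}]
```

Note the conflict Ananth points at: the raw record in `lake` (with `amount` as a string) would be rejected outright by the schema-on-write path, but sails through a schema-on-read pipeline until someone decides to interpret it.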
Eric Dodds 35:40
Yeah, it really is. It'll be cool to see — there are companies out there trying to solve it; it'll be cool to see what happens. Okay, before the show we brought up a subject that I'm actually really surprised, in almost 50 episodes now, we have not discussed — and correct me if I'm wrong here, Brooks or Kostas — but data mesh has not come up on the show as a topic we've discussed at length, at least to my knowledge, though my memory is somewhat faulty. And you talked about working in the context of data mesh. I don't know if "buzzword" is the exact correct term, but it's a really interesting subject that gets a lot of buzz online. Why don't we just start out with a question I think a lot of our listeners have: some of the definitions of data mesh can become pretty complex, but you have experience working in that context on the ground. Can you give us your definition of data mesh, at a 101, basic level?
Ananth Packkildurai 36:44
Yeah, totally. My take on data mesh — and this is what I've observed and understood from my perspective, so it doesn't mean this is the definitive data mesh, just what I understood and what I've perceived in organizations — is this: the founding principle of data mesh that I really liked is treating data as a product. Culturally, what happens is that the feature team is very busy developing the feature products. Take Slack as a messaging product: all the feature team cares about is, if I type a message, is it received by all the recipients? Now, there's a whole other world where we need to capture that business logic and understand and define user behavior over a period of time. There is a lot of context missing when it goes from the feature team to an analytics team. So there is a lot of back-and-forth, manual synchronization and knowledge sharing, and often it turns out to be an expert-knowledge-sharing way of building analytics. What data mesh brings in is the idea of treating data as a product: the feature team that produces the data should have ownership of that data too, so they can add more and more context — what does this particular field mean? What is the business context behind it? — making it easier to collaborate and build a data product on top of it. That is a concept I really liked. But at the same time, as we discussed with data governance, it is a good concept, a good philosophical approach to designing the system, but there's still a lack of tooling to support the theory. And if you want to introduce a certain process to the system, there has to be a tool to back it.
And if you don't have tooling to back it up, you will end up creating a non-deterministic process, which is a chaotic environment. So that is the confusion, in my opinion. Concept-wise, it's really good, but it still needs some tooling and maturity around it.
Eric Dodds 39:14
Super interesting. Okay, Kostas, I need your take on this, because offline we've discussed data mesh. Does that definition line up with your understanding? I'd just love your thoughts on Ananth's definition.
Kostas Pardalis 39:29
Oh, yeah. I mean, I agree with the definition, to be honest. Data mesh — as an architecture, a best practice, or whatever you want to call it — I think is still being defined, and as Ananth said, there's tooling missing. As people start building tooling around it, the definition will also change, because the tools are also going to change the way we implement things and the way we do things. I think it's still early, but there's definitely a reason why it exists. And I think it has a lot to do with more than just a technology concept; it's also an organizational concept. It tries to create a kind of framework for how the organization, and the people inside the organization, interact with the data. I'd say it's still early, but I think it's going to be interesting to see how the definition changes and how companies adopt and implement it.
Ananth Packkildurai 40:31
Yeah, I think one challenge in that case is fully adopting decentralization. It will be a challenging factor, especially for data infrastructure. Data, as I mentioned initially, is inherently social in nature. What I mean by that: in the domain-driven approach we are adopting in microservices right now, a user domain — maybe a customer domain, in this case — can work standalone in the microservices world. I think people are trying to copy that same concept into the data mesh principle and say, well, this is the customer domain, so all the analytics within the customer domain are owned by the customer domain team. The challenge with that approach is that data is inherently social in nature: a standalone customer domain will not add any value on its own. If you have customer information, you also need to correlate that data with maybe their activity, maybe other information happening across domains — that adds more insight and more value. So how is cross-domain communication going to happen? And the model we define in one domain has to be consistent with the other domain, right? And if it isn't — it's a very common thing; even in a very controlled environment, something like a user ID can be represented in multiple ways, you can find a user table named one way here and a user_123 table there, it's pretty common to find in the modern data warehouse — it will exponentially increase the silos. How is cross-domain communication going to happen? Where does standardization happen? And that brings back all the challenges we face in MDM and in implementing those data governance systems, right. So that is where the system is going to have to balance it out.
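The cross-domain inconsistency Ananth describes can be made concrete with a toy sketch (the domain records, key names, and the `KEY_MAP` contract are all invented for illustration): two domains publish the same user under different keys and types, and only an explicit shared contract lets them be joined at all.

```python
# Two domain teams publish "the same" entity under different conventions.
customer_domain = [{"customer_id": "123", "plan": "pro"}]
activity_domain = [{"uid": 123, "events": 42}]

# A shared contract: map each domain's key to one canonical string user_id.
KEY_MAP = {"customer_id": ("customer", str), "uid": ("activity", str)}

def canonicalize(record):
    """Rewrite a domain record onto the canonical user_id key."""
    for key, (domain, cast) in KEY_MAP.items():
        if key in record:
            out = dict(record)
            out["user_id"] = cast(out.pop(key))
            out["source_domain"] = domain
            return out
    raise KeyError("no known user key in record")

# Cross-domain join is only possible after canonicalization.
joined = {}
for rec in map(canonicalize, customer_domain + activity_domain):
    joined.setdefault(rec["user_id"], {}).update(rec)
print(joined["123"])
```

Without the contract, `"123"` (string) and `123` (int) never match — which is exactly the silo-multiplying inconsistency, and why some centralized standardization survives even in a decentralized mesh.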
Eric Dodds 42:42
Yeah, two thoughts on that. I think that's a very astute observation about some of the practical challenges on the ground, because, of course, I agree the idea is great. But you have this challenge where there's value and velocity that come from decentralization, yet you almost need some level of centralization to make the decentralization work — like you said, the schema has to match across domains. The other component that comes to mind, just thinking about what this looks like inside an organization, is that the skill sets and expertise across domains are not necessarily equal, right? Maybe you have multiple data teams, and I'm sure there are ways to make that work in your organization, but different types of data lend themselves to different skill sets and different processes. So again, you run into this issue where varying skill sets or emphases across domains will tend to produce unstandardized results in a decentralized system. But in order to get the most value, you need some level of standardization. So it'll be really interesting to see how it plays out.
Ananth Packkildurai 44:02
Yeah, totally. I think it's not only the skill set but also different priorities — conflicting priorities. Because the whole purpose of the domain teams is to fulfill the business need and satisfy their users, producing the events and making sure they're synchronized and standardized is an additional responsibility. So when it goes to project management and you're prioritizing — this is my quarterly plan — it always gets pushed down the priority lane. That would be a hard fight. And a product manager especially is incentivized towards delivering the feature, the customer-facing feature, and less incentivized to do that other work, right? So all the human factors play into where the incentives go. How do you measure the success of a domain and its domain modeling? So there are various practical limitations there.
Eric Dodds 44:59
Yeah. Absolutely. One question. And I want to switch gears here just a little bit unless Kostas, do you have any other hot takes on data mesh in our first extended conversation on the subject?
Kostas Pardalis 45:13
I have a question for Ananth. You mentioned the cross-domain issue, right? You cannot have data that is isolated in only one domain, and I found this extremely interesting. Earlier, when we were talking about data management, you mentioned data lineage. Do you think data lineage is one of the ways we can control and understand how data moves in a cross-domain fashion? Do you think there's value there, or is there something else missing in order to effectively use this data across all the different domains in the company?
Ananth Packkildurai 45:52
Yes, totally. I think that's a very good, very fair point to make. I think the adoption and maturity of data lineage can potentially minimize the risk of producing that inconsistency over a period of time. That said, the current incarnation of data lineage is after the fact, right? We just say there is a new data set created, let me contribute it back to the original lineage, and then we visualize what is happening — there's hardly any practical application yet built on top of the lineage structure. I think Marquez, the lineage tool, started to build the first application that I know of — triggering a backfilling job based on the lineage — which we couldn't do before. So how our lineage infrastructure and those systems systematically enable building applications on top of lineage is going to play a significant role in the success of data mesh. Instead of reacting, how can it be an active system, capturing the model and acting on it while we are generating the data itself? Right now it's a reactive engine, not an active engine, the current way of doing data lineage.
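The "backfill from lineage" application Ananth mentions can be sketched as a simple graph traversal (the dataset names and edges are made up, and this is not Marquez's actual API — just the shape of the idea):

```python
from collections import deque

# Lineage graph: dataset -> datasets derived directly from it.
LINEAGE = {
    "raw_events": ["sessions", "clicks"],
    "sessions": ["daily_active_users"],
    "clicks": ["daily_active_users", "ad_revenue"],
    "daily_active_users": [],
    "ad_revenue": [],
}

def downstream(dataset):
    """BFS over the lineage graph: everything that must be backfilled
    if `dataset` is corrected, visited breadth-first without repeats."""
    order, seen, queue = [], {dataset}, deque(LINEAGE.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(LINEAGE.get(node, []))
    return order

print(downstream("raw_events"))
# ['sessions', 'clicks', 'daily_active_users', 'ad_revenue']
```

Making lineage "active" in Ananth's sense would mean consulting something like `downstream()` at write time — flagging or blocking a breaking change before bad data propagates — rather than only drawing the graph after the fact.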
Kostas Pardalis 47:15
That's super interesting. That's something I've always found to be a very fascinating topic, mainly because it's something the industry has been trying to implement for quite a while — it's not something new, right? Especially in enterprise data management systems. But you put it very well: it's very reactive, and it still feels like there's something missing to actually take data lineage and extract the value that we can from it. And probably data mesh is the environment in which data lineage is going to find its position to deliver that value, right?
Eric Dodds 47:56
The Data Stack Show hot take on data mesh. No, it is a super interesting conversation. And we're getting close to time here — we have a couple minutes left. Ananth, I wanted to ask you: you've published 50 editions of the Data Engineering Weekly newsletter, so let's call that a year's worth of content. And you're doing, if my back-of-the-napkin math is right, eight to 12-ish summaries of pieces of content every week in the newsletter. That gives you an incredible purview over what's happening in the landscape of data, data tooling, and data thinking, because you're really studying it and curating it in an editorial way, which is fascinating. I'd love to know: what are the things you've learned, or that have really stuck out to you, as you've spent a year curating thousands of pieces of content down into what you actually put in the newsletter?
Ananth Packkildurai 49:01
Is that learning mostly on the technology aspect of it, or on the industry aspect of it?
Eric Dodds 49:06
Actually, I think both are really interesting questions. So I think both perspectives would be great.
Ananth Packkildurai 49:14
Yeah, I think what I really like about it — first of all, writing the Data Engineering Weekly newsletter gives me a very structured approach to learning, right? That is what I really enjoy about writing the newsletter: it gives me good, dedicated time to learn and then curate information. That's the biggest benefit I'm getting out of it. What I've learned is that when you take an accumulated view across different articles, you can always extract patterns: how those companies are solving those problems, or what problems they are even working on. That is a good indication of what the existing challenges are. So over the past year, what I saw was that most companies are working on some kind of data management system, some kind of data discovery system, a data lineage system, and the question of how to make data democratization happen and become a more data-driven company. That is a challenge most companies, from small to large, are struggling with, and most of them are writing about it. I think data mesh has become a little bit popular basically because of that very pain: how can we introduce these systems? Also, four years ago, around the beginning of my time at Slack when I started reading these things, most of the blog posts coming out were: how do I scale a cluster? How do I scale X, Y systems? That has reduced now. I feel data infrastructure and data processing have largely moved towards cloud-based solutions. People are no longer talking about how do I scale this, how do I do that; rather, more and more focus is going into data management and then the data knowledge side — the data literacy side — enabling cultural change in the company. I think that's a very fascinating development.
Eric Dodds 51:18
Yeah, absolutely. Super interesting. I mean, and just out of curiosity, if you’re willing to share, what kind of time investment is it? I have to believe that it’s a huge time investment to curate all of that. I mean, certainly probably educationally helpful for your career, but no small amount of work.
Ananth Packkildurai 51:37
Yeah, totally, I think roughly three and a half hours a week. Not much more than that.
Eric Dodds 51:43
Yeah, wow, that means you’re really efficient. You must be a very fast reader.
Ananth Packkildurai 51:49
Usually in the morning, I just go through different articles and flag the ones I want to read at a later point in time. Usually I'll have, you know, 15 to 20 articles, minimum 15. I'll read through those, filter some articles out, and then focus on maybe, as you mentioned, eight to 10 articles. So there are also a lot of articles that I read but don't include.
Eric Dodds 52:15
Yeah, absolutely. Okay, one last question. And I’m just thinking about our audience, I think there have been so many helpful things. But there are tons of professionals working in data out there. You had a chance to build some incredible stuff at Slack. You’re doing some amazing things at Zendesk, which we didn’t even get to talk about. But if you could just share maybe one or two pieces of advice for someone who aspires to work in a data role at a company like Slack, or Zendesk, what would you tell them?
Ananth Packkildurai 52:50
That's a good question. For anyone starting a new career: data engineering is a vast field, but you don't need to learn everything to start with. It is a continuous learning process. So I would say, if you're starting with simple SQL and Python knowledge, that's sufficient for you to get started in data engineering. Keep an open mind and keep learning more as you progress over time. Don't spend too much time taking every course out there; simple tooling is more than sufficient for you to get in, and you can learn over a period of time. I think that advice will hold.
Ananth Packkildurai 53:34
The second thing is that this is a very fast-moving field, and a very new field, and a lot of things change over time. The infrastructure I worked on four years ago may no longer be relevant: we no longer want to maintain an expensive EMR cluster; we want to move towards cloud-based databases and outsource the computing and all those things. But what is important is to focus on the founding principles — how the system actually works. Build a foundational understanding of distributed computing and the basic principles of data engineering. That can go a long way. So: simple tooling, focus on the foundations, and you will be good.
Eric Dodds 54:20
I think that’s such good advice. And I think, like many, many disciplines, if you understand the foundations, you can learn the new tooling. And it’s really important to get a good foundational understanding. So wonderful advice, both for our audience and really for me and Kostas as well.
Eric Dodds 54:37
So thank you for that. Ananth, this has been a great show. So interesting to learn about your experience. Congrats again on 50 editions of the Data Engineering Weekly newsletter, and again, those of you listening, if you haven't subscribed, please subscribe. It's a great newsletter and it will keep you up to date on everything happening in the data engineering world. Thanks again, Ananth.
Eric Dodds 55:02
Pretty amazing experience that Ananth has. And it's interesting — we talk a lot about tooling and stuff, but it's really cool to talk with someone about architecting a system from the ground up, using what I would call the core componentry. We've talked with a couple other guests who were in contexts where off-the-shelf SaaS products just weren't sufficient, and it was really striking to hear Ananth say that, going back to 2016, Snowflake wasn't going to work for them. So I really enjoyed hearing him talk about the way they architected this from the ground up. My other big takeaway is actually for you, Kostas. I really wanted a spicier take on data mesh, because your philosophical tendencies are really good for contested topics like that. So I'll bring it up again in another episode, and you can give us a more opinionated response.
Kostas Pardalis 56:06
Yeah, please do, please do. I’ll be more than happy.
Eric Dodds 56:09
What was your big takeaway?
Kostas Pardalis 56:12
I think I'll be predictable. What I found really interesting is this concept — and I think it's not the first time we've heard it — that when we are talking about the architecture of a data stack, the components are pretty much the same regardless of the type of company or the size of the company. What really changes is scale and control, right? Those are the two main things, and I think Ananth put it very well and described it very, very well. The other thing I found very interesting: when I asked him about the one most important technology after all these years he's worked in this space, he actually mentioned two — one of them is Kafka, and the other one is Presto. And I was surprised by Presto, to be honest. But that part of the conversation we had about Presto — the separation of storage and processing, why Presto didn't make it... I mean, okay, Presto wasn't a company, but there were companies trying to monetize Presto. And it's very interesting: one of the first companies that tried to offer Presto as a service was Treasure Data, which, by the way, ended up being a CDP in the end, and then they were acquired by Arm. That part of the conversation, I think, was really fascinating.
Eric Dodds 57:33
That really was. Kafka makes sense, but I was not expecting him to say Presto either. So something unpredictable in a sea of predictability from you and me. Our questions are completely deterministic.
Kostas Pardalis 57:51
After that many episodes probably yes, yeah.
Eric Dodds 57:56
All right. Well, thanks again for joining us on the show. Make sure to subscribe if you haven’t yet on your favorite podcast network, and we’ll catch you on the next one.
Eric Dodds 58:07
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me Eric Dodds at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.