This week on The Data Stack Show, Eric and Kostas chat with James Serra, data platform architecture lead at EY (Ernst & Young), about data warehouses, lakes, and meshes. James regularly shares his thoughts on the industry at jamesserra.com.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Eric Dodds 00:27
Welcome back to the show. Today we're going to talk to James Serra, and there are lots of interesting things to discuss. He has a great blog, and we read it consistently. I'm also excited to ask him about something that came up a couple of episodes ago, a buzzword that is kind of all over the data space: data mesh. James has written a lot about it, and I have been forming my own opinions on data mesh as a concept in the data space. James has some strong opinions about it as well. So that's what I want to ask him about, and I may even let some of my nascent opinions on data mesh come out. Kostas, what do you want to ask James about?
Kostas Pardalis 01:20
Yeah, I'm very interested to ask him about the industry as a whole, to be honest. He's working at Ernst & Young, he's probably involved in pretty big projects, and in projects with companies that we probably don't hear that much about here in Silicon Valley. So I'd love to hear what the experience is like working with the rest of the industry as they try to become data driven: what kinds of technologies they are using, whether there are any differences compared to the technologies we see here, and all that stuff. That's what I find super fascinating, and I'm happy to have the opportunity to talk with him about that.
Eric Dodds 01:58
Great. Well, let’s jump in and talk to James.
Kostas Pardalis 02:01
Let’s do it.
Eric Dodds 02:03
James, welcome to the show. We’re really excited to dig into a number of topics with you. So thanks for giving us the time.
James Serra 02:12
Yeah, thanks for inviting me here.
Eric Dodds 02:14
All right, well, just give us a brief background. You have a long history working with data. So tell us where you’ve been and what you do today.
James Serra 02:22
Sure. I'm currently the data platform architecture lead at EY (Ernst & Young), where I've been for about five months. My main focus here is to build a product internally: a data fabric. The idea is that you collect tons of data, whether third-party data, EY internal data, or client data, into this data fabric and make it available to other products that EY sells to customers, as well as use it to understand our own internal metrics. So it's a very large project, about 200 people, and it's very interesting because we work closely with Microsoft, building on the Azure stack. It's unique in that something at this large a scale hasn't been done much, and with Microsoft's help we want to have this built out within the next few months. Before EY, I was at Microsoft for seven years in various roles, the last being at the Microsoft Technology Center in New York City, where I spent every day engaged with different customers, whiteboarding data platform solutions. It could be that they came in wanting to learn, as an example, what a modern data warehouse looks like, and through discovery and asking a lot of questions, I would come up with a high-level architecture with products that fit their particular use case. That was always very challenging. There could be many, many products at Microsoft that do the same thing, so I wanted to help narrow them down to make sure they made the right decisions. They don't know what they don't know. So it was very much an educational session for each of the customers, across various industries and customer sizes. I was always in pre-sales technical roles at Microsoft, so this role at EY is great experience on the building side of things. Before that, I spent many years working with Microsoft databases and data warehouses, architecting and developing solutions. The main goal was just to collect data and help customer companies make better business decisions. I was also a DBA for many, many years; I started back with SQL Server 1.0 and OS/2 in 1989. So I have a long history of working on the data platform stack.
Eric Dodds 05:11
Super interesting. One question before we dig in. We want to talk about a lot of warehouse stuff, because you've produced some great material on your blog, which we'll put a link to in the show notes. But one question on the project, if you can talk about it. One thing that really stuck out to me when you described the project is that there are multiple vectors of both internal and external facing parts of it. To be specific, there is both first-party data and third-party data, which isn't necessarily uncommon. But usually, the most common use case we see is that you have first-party data and you want to augment it in some way with some set of third-party data. Then, also, it sounds like the project itself will serve both the business and be included in the customer-facing products that you sell. So sort of an internal data use case, and then also an external data use case. Can you talk any more about that? The main question that comes to mind is that it seems fairly complex dealing with those multiple vectors, multiple types of data, and multiple audiences for the data.
James Serra 06:39
Yeah, it is complex. And then you add in the security that is needed at that extreme level to deal with client data, and at a regulated company like EY there are various rules and regulations you have to follow. And then, of course, each customer's data that you collect, they don't want other people to see. So there's a really high level of security and a lot of challenges with that. But the main idea is, let's aggregate all this data together and make it available to the products. So as an example, EY has many products they sell, and a product that a customer may be interested in could take data that the customer has, it could take third-party data, to your point, and aggregate it together to build better machine learning models, or reports or dashboards that the company could use to find out more about their supply chain and where they can increase profits. They could use that data to find fraud or money laundering that's going on if they're a bank. They could use that data to find competitors that are gaining on them in the industry. So there are many dozens of use cases. Well, all those products need data. And you don't want a situation where a new product comes along, creates its own ingestion platform, and ingests its own third-party data and client data when it's already been done for many other products. So it's unifying that experience and having one ingestion platform collecting the third-party data. In addition, think of the licensing savings: a company like EY spends tens of millions of dollars on third-party data sets, and there are likely a lot of repeat data sets where people didn't know those data sets already existed at EY. So we'll have one place where we collect all this data. Then we have a data explorer/marketplace type of environment where anybody can go and search the data we have and say, oh look, we have this data, and here are the hooks into it. So it's a great product accelerator. If somebody comes up with a new idea for a product and they say we need 10 different data sets and client data, they can go and find that it already exists in this data fabric. They can quickly ingest that data, use it, get insights from it, build their product, and go to market a lot quicker.
James Serra 09:19
So that's the big idea in the data fabric we're building, because think of the challenge of ingesting thousands of files from many different customers. You have to clean this data, join it, aggregate it, and secure it. You don't want everybody reinventing the wheel and doing their own thing. So this is built for multiple different products, and also for internal use: maybe somebody looks at all this data we've collected in the various engagements EY has had and says, well, let's see where we can optimize things, let's collect the metrics and maybe build some machine learning models on that. Well, we need the data, so let's have it in one unified place. That's what the data fabric gives them. It's quite challenging because of all these various data sets, and client data has much different security requirements and required data sets. So we're going through all those challenges now. It's been a great experience working closely with Microsoft to see the various products they have, and wherever the gaps are, filling them with products outside of Microsoft.
Kostas Pardalis 10:27
James, I have a question for you. I'm listening to you describe this quite complex architecture, and I'm wondering: one thing is, okay, we want to ingest data, we have data coming from many different places, and we are going to store this data in one place. This is going to solve problems around data silos and give access to all the data to the whole organization, built on top of first-class security so we can ensure security and privacy around the data. But I was wondering, as we enable more and more use cases around data, and more and more people and organizations at the company are able to access this data and process it, how important is it to keep track of what's going on with this data? More specifically, what I am referring to is data lineage. So first of all, how important do you think it is, when does it become a problem, and how do you deal with it?
James Serra 11:29
Yeah, I would say the biggest gap that I've seen customers have, especially when I was at Microsoft helping build data warehouses, is that they didn't give enough time to data governance. You really need to spend a lot of time thinking through the data governance piece, which includes data lineage, as you asked, plus data quality, data security, and data access; all these things can be quite complex. Frequently, customers just did not put enough time in the project plan for all those different areas. Data lineage is a big one, because when the end user gets that report, they may say, this particular number, I'm not quite sure it's accurate, where did it come from? You want to be able to respond to them and show them the various stages it went through. So data lineage is a big part of what we're implementing. There are various products that give you lineage, like Azure Purview, and there are many other great products outside of Microsoft for this, that will track that this data came from this particular data source, it was transformed and cleaned via this procedure, it landed in, say, a data lake, it was moved into, say, a relational database, and then it was moved into something like Power BI, where it became a data set used for a report or a dashboard. If you can't get that answer quickly to the end user, you're going to lose their confidence in what you're giving them. The challenge is that data lineage is not something where you can just press a button, scan all the data sources, and come up with a lineage; there's a lot of work that has to be done behind the scenes. You may have to send this information to the data lineage tool yourself if you're changing data inside a stored procedure, because it's too much for a product to scan a stored procedure and tell you everything it's doing. So we have to set up guidelines: if you're transforming the data, you have to call these APIs in, say, Purview, to tell it what you're doing. That becomes a lot of oversight and governance. It's coming up with these frameworks and guidelines, and somebody has to oversee it; maybe you have a center of excellence where anything that's submitted has to follow these rules, and one of those rules is that it's got to send lineage over to a product. This gives you a nice, clean way of seeing everything. It also helps make sure that, as you're building this along, you're not missing steps or failing to clean something properly, and you're avoiding duplication of data from the individual source systems coming in, because in most cases the data you're pulling into this data warehouse could come from dozens and dozens of different sources. It's really important to track where it starts and where it ends.
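To make this concrete, below is a minimal T-SQL sketch of a transform step that records its own lineage as it runs. The table, procedure, and column names are hypothetical, and in practice you would typically push these events to a catalog such as Azure Purview through its API rather than to a homegrown log table.

```sql
-- Hypothetical lineage log: each transform step records where data came from,
-- where it landed, and which process moved it.
CREATE TABLE dbo.DataLineageLog (
    LineageId    INT IDENTITY(1,1) PRIMARY KEY,
    SourceObject NVARCHAR(256) NOT NULL,
    TargetObject NVARCHAR(256) NOT NULL,
    ProcessName  NVARCHAR(256) NOT NULL,
    RowsAffected INT NULL,
    LoadedAt     DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Assumes stg.Customers and dbo.DimCustomer already exist.
CREATE PROCEDURE dbo.LoadDimCustomer
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @rows INT;

    -- The transform itself: copy cleaned rows from staging into the warehouse table.
    INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName, StateCode)
    SELECT CustomerId, TRIM(CustomerName), UPPER(StateCode)
    FROM   stg.Customers
    WHERE  CustomerName IS NOT NULL;

    SET @rows = @@ROWCOUNT;

    -- Record what this step did, so a number on a report can be traced back to its source.
    INSERT INTO dbo.DataLineageLog (SourceObject, TargetObject, ProcessName, RowsAffected)
    VALUES (N'stg.Customers', N'dbo.DimCustomer', N'dbo.LoadDimCustomer', @rows);
END;
GO
```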
Kostas Pardalis 14:41
Yeah, it makes total sense. It's a very interesting topic, and I'm glad that you broke down data governance into its different pieces, because I started thinking … you mentioned data quality, and data quality, at least among companies in Silicon Valley, is a pretty hot thing lately. I think a couple of companies just raised another hundred million dollars or something, and you have companies like Bigeye raising money from Sequoia. In general, everyone is looking into the data quality problem and trying to solve it. From your experience, if you had to describe the two or three most important requirements around data quality that a new product should address, what would those be?
James Serra 15:32
Yeah. When I look at data quality, the first challenge a customer has to answer is: who owns this data? I've been in rooms where almost fistfights resulted from trying to answer that question, because whoever owns it is responsible for the data quality. As far as collecting this data into a data warehouse, I can't tell you how many times a customer said, oh, my data is perfectly clean. And I would say, I'll bet you a hundred bucks I'll find some problems with it. And sure enough, as soon as that data comes in, you find that, oh, in the order entry system, in order to get past the field, you have to enter a birthdate. So people were entering birthdates in the future, or dates that made people 200 years old. So you have to get this data and then clean it. This is part of data quality. Now you can plug those holes, and you can go back to the source systems, but the damage has already been done, so somebody's got to clean it all. So there are a lot of questions that go back to the owners of the source system: what do I do in these situations? If the birthdate is not valid, should I put null? What should it be? There's also going to be a lack of conformity if you're pulling in data from different source systems that all have customers in them: one of those systems could use abbreviations for a state, and others could use the full name.
James Serra 16:51
And if you're generating reports, you have to have it in one common standard. So a big part of that is somebody has to define the standard. Usually you have a center of excellence team that goes through and says, okay, we need to conform everything, and this is what we're going to do. Now add in the complexity of master data management, which is also part of data governance. Once you pull in that customer data, the last thing you want is to create a report where the end user looks at it and goes, wait a minute, why is this person in here twice, with their name misspelled, when they're really the same person? Now you've lost their trust, and it's really hard to gain that back. So you have to think through this data governance. That's why I say you can spend a lot of time on it, and mastering the data is going to be another important part of data governance.
James Serra 17:35
And even with data quality: how do you know the data is bad? Is it a null? Or is it a zero? What does that mean? A lot of investigation has to be done here, and this is where you want to work closely with your end users. Get them involved in the process early. Ask them: how do I know this data is valid? How should I clean it? Do these numbers look right? That way they're not left at the end going, well, here's the report, and I've had no input into it, so they don't feel like they were part of it. I always say, get those customers involved early on, because then they'll be rooting for you if they're part of the process, as opposed to having an almost negative reaction to things that are just handed to them. It involves a lot of change from what they may have been doing previously, and it's hard for people to embrace that change if you haven't made them part of the process. And the last thing I will say, when people build out their warehouses, is that you want to have one version of the truth. I've had situations where I found people creating reports that were not accurate because they were, in some ways, changing the numbers to make themselves look better. Once you centralize the data in a data warehouse and come up with one formula for all these various metrics and KPIs, you're possibly going to have a lot of disputes over what those metrics or KPIs should be. So again, you get these rooms and you have these arguments. But in the end, you will have one version of the truth, so people can be confident that they're getting the same answer to the question they're asking, not different answers to the same question. So it all revolves around data governance. I wish I could say there's a magic button that can look at your data and clean it up, but there's not. There are no shortcuts to this. It's a lot of time and effort to get the data quality right, but in the end it's going to be worth it. You just have to put that in your project plan and spend a lot of time on data governance.
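To illustrate the kinds of checks being described, here are a few simple T-SQL data quality queries over a hypothetical staging table; the table and column names are assumptions for illustration, not part of any actual implementation.

```sql
-- Birthdates in the future, or implying an age of well over 100 years.
SELECT CustomerId, BirthDate
FROM   stg.Customers
WHERE  BirthDate > CAST(GETDATE() AS date)
   OR  BirthDate < DATEADD(YEAR, -120, CAST(GETDATE() AS date));

-- Lack of conformity: if the agreed standard is two-letter state abbreviations,
-- flag rows where a source system supplied the full state name instead.
SELECT CustomerId, StateValue
FROM   stg.Customers
WHERE  LEN(StateValue) <> 2;

-- Possible duplicate customers: the same email address with differently spelled
-- names, the kind of issue master data management has to resolve.
SELECT Email, COUNT(DISTINCT CustomerName) AS NameVariants
FROM   stg.Customers
GROUP  BY Email
HAVING COUNT(DISTINCT CustomerName) > 1;
```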
Kostas Pardalis 19:41
Yeah, I think those are some great points around data quality. I would also add that there's a reason we group all these different pieces under data governance: for example, data quality and data lineage, it's important to have both together, right? One complements the other in terms of the end goal. Same with data access and all that stuff. So yeah, it's a super interesting topic and a super hot topic.
Kostas Pardalis 20:14
Also, I think the industry is now trying to figure out the right ways to implement all these methodologies and functionalities at a large scale, and I think we are going to see a lot of interesting new companies trying to tackle these problems in the future. But talking about companies, I want to ask you something about a company that, at least in Silicon Valley, we keep forgetting about when we are talking about data, and that is Microsoft. You have a lot of experience with Microsoft and their products. And it's interesting, because in database systems, at least, Microsoft is supposed to have probably one of the most comprehensive, most complete database systems, which is MS SQL. It might be a pain to manage, but in terms of its capabilities and functionality, it's probably the most advanced database on the market right now. Can you give us a little more information about the products Microsoft has around data? What does a data warehouse look like with Microsoft, if someone wants to go to Azure today, and what other tools and products do they offer for all the things we've discussed so far?
James Serra 21:30
Sure. I had this discussion many times with customers, because, again, they were confused about why Microsoft has so many products. So, okay, what's your use case, and I'll narrow down that product list for you to do research on. If we look at the OLTP side, you have SQL Server and you have SQL Database, relational databases that have been around forever, especially SQL Server. SQL Database, which comes in many different flavors, is a PaaS solution, instead of the IaaS solution you get with SQL Server in a VM. But those are mostly for OLTP. Sometimes you can get away with a data warehouse in there if it's small, say under four terabytes, but that only applies to very small customers who don't see a lot of growth in the data they're collecting.
James Serra 22:28
Once you get over four terabytes, or around there, you want to start looking at a data warehouse solution. In the Microsoft realm, Azure Synapse Analytics is the tool of choice, I would say, for large amounts of data for that data warehouse. When I first started at Microsoft seven years ago, I was on the Parallel Data Warehouse, which was Microsoft's on-prem data warehousing solution. It's essentially SQL Server with MPP, massively parallel processing, technology. That technology gives you an advantage over traditional SMP technology in that it can handle massive amounts of data: it distributes the data and distributes the queries. It could be a long conversation just in itself how that works, but it opened the door for queries to go anywhere from 20 to 200 times faster than a traditional SQL Server query.
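As a sketch of what that distribution looks like in practice, here is the kind of table definition you would write in a Synapse dedicated SQL pool, the descendant of that MPP engine; the table and column names are hypothetical.

```sql
-- In a dedicated SQL pool, each table declares how its rows are spread across
-- the pool's distributions, so large scans and joins run in parallel.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,
    DateKey     INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),  -- co-locate rows for the same customer
    CLUSTERED COLUMNSTORE INDEX        -- columnstore storage suited to large analytic scans
);
```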
James Serra 23:35
That product eventually migrated into Azure SQL Data Warehouse, which was around for a number of years, and it then morphed into Azure Synapse. That technology is still in Synapse as the dedicated SQL pool. But Synapse also added a bunch of other features, such as a serverless pool; it has Spark clusters in there, and it's got a data factory built in. So it's a great tool if you're going to build a data warehouse: everything is on a single pane of glass. That's where Synapse has tremendous value for customers, enhancing their time to market, or time to build the solution, because of that integration of all those products in a single pane of glass. And it still has that MPP technology in the dedicated pool. So that's the way to go with customers. Within there, you can even make the argument for the serverless options: instead of having a dedicated pool, which can be very costly and which I may not want to use for databases and data warehouses that are small, I can use the serverless option and only pay for the query, so maybe I can open this up to smaller databases and data set sizes. A lot depends on the customer: what's your current skill set? If you're SQL Server developers, that's going to make the transition into Synapse pretty easy. So I asked a lot of questions during discovery to see who would be a good fit, and usually it doesn't take more than a couple of days for anybody used to SQL Server or SQL Database to move to something like Synapse. So that product is what I would say is the go-to; in almost 90% of cases with customers, Synapse was the solution for them.
Eric Dodds 25:43
It is really funny. Kostas, I don't know if you remember, but we talked to a startup company that was building a product in the medical space, actually, and they were building on a Microsoft stack. That was an early episode, I think. But it was fun to hear about, because, like you said, it's really easy to forget, especially in the world of data with all these new fancy tools and new fancy startups, that Microsoft has some really awesome technology. So James, thanks for giving us some detail there and reminding us of that.
James Serra 26:21
Yeah, sure. It's always interesting with customers; they don't know what they don't know. They come in and they think, well, we should do everything in SQL Server. Well, wait a minute, we have these PaaS solutions like SQL Database, which has flavors: serverless, managed instance, Hyperscale. So you can handle databases that can be extremely large. And the challenge is the technology is changing so quickly that even though keeping up with the data platform was my full-time job at Microsoft, I could barely do it. So you can't expect customers to keep up with it all. They would come to Microsoft, which had cloud solution architects and MTC architects like myself to educate them, or they'd go to partners for that education. Because the reason data warehouses fail, in most cases, is just that customers use the wrong technologies for their use case. I would see customers using a certain product and I'd ask, why didn't you use this other product? And they'd say, we didn't even know about that. Well, okay, that's the reason. So it's really important up front to be aware of all the products and their use cases, and to choose early. Don't run into the mistake where you're a few months in, many months in, many millions of dollars spent, and you realize this was not the right product and you have to go back to the beginning.
Eric Dodds 27:44
Wise words. Well, let's switch gears a little bit. We started out talking about an extremely complex project with all different types of data and all different types of users, and then talked about the complexity of data governance and data lineage at scale. Let's step back a bit to something you've written about a good bit, which is actually just the fundamentals of the data warehouse. You have a great post on your blog and a great video on YouTube that I think is called Data Warehouse Explained, and I'd love for you to just give us an overview of that. As I was saying before the show, I think we get exposed to so many new, interesting technologies in the data space that it's easy to assume we know the fundamentals of a tool we use every day. So I think zooming out and getting context is helpful no matter where you're at in terms of working with data. So James, give us a high-level overview of the data warehouse, explained.
James Serra 29:00
Yeah, sure. And this is particularly true of smaller companies who are just beginning their journey of trying to get better insights and make better business decisions through data. It could be that they have some source system, maybe a homegrown OLTP application, that collects all this data, maybe about customers. It could be a CRM or an ERP system like SAP. And they say, well, we want to generate some additional reports, and we may want to combine what we have across multiple source systems. It could even be, hey, why are sales slow in certain areas of the country? Well, maybe it's something weather related, so we need to combine our data with weather data or competitive data.
James Serra 29:50
Okay, so we want to generate better reports. What you don't want to do is run that reporting directly against, say, SAP or your homegrown applications and just hammer them with queries, because you're going to make the end users very angry. That's the first problem I see with customers reporting on live production systems: they spike the CPU, and people start getting angry at IT. What's going on here? Somebody wrote a malformed query. Man, if I had a dollar for every time I ran the kill command as a DBA, I'd be rich right now. So you need to offload the data from the production system. Now, you could replicate that data, and there are various ways of doing that in SQL Server, but a better way is to take that data and copy it into some location where you can optimize it for your queries.
James Serra 30:49
So I can put different indexes on it, I can lay it out in a certain way, I can position it in a certain way. I can also change the field names and the table names to make them easier for people to understand, because if the data comes from, say, an ERP system, you may have some really cryptic names. The idea is you want to have self-service BI. You want to create a warehouse with tables that are very easy for an end user to go to a tool, click and drag those fields onto a report, and build it out without having to get IT involved. So you need to make it more presentable by copying it out of that source system into the data warehouse. You can also put a lot more compute on top of the data in the data warehouse. You can ingest many different sources of data, you can clean the data there, and you can master the data there. It gives you protection against, say, a source system upgrading: if you're running reports directly against the source and it gets upgraded, the original reports may break. If you copy that data into a data warehouse, the ETL into the data warehouse may break with the upgrade, but at least the data in there is okay, and you're not going to have this huge problem of going back and rewriting all those reports and their queries. It also allows you to clean the data and find holes in the source system, so you can go back to the source system and plug those holes where the data is not clean. And by having that data in a data warehouse, you have one version of the truth that can be used as the basis for all reports and dashboards. You can put that data in the data warehouse in third normal form, with many relational tables that can be joined together to produce those queries. But a lot of times customers will go one step further and create a star schema, which takes those multiple tables and joins them together. So you've got fact and dimension tables, and you have a lot less complexity, because somebody's done the work to create those joins. And so, again, that end user can very quickly and easily generate reports off of that.
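As a rough illustration of the star schema being described, here is a minimal T-SQL sketch with one fact table and two dimension tables; all of the names are hypothetical.

```sql
-- A minimal star schema: one fact table joined to dimension tables, so an end
-- user can drag and drop fields without writing complex joins themselves.
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(200) NULL,
    StateCode    CHAR(2)       NULL
);

CREATE TABLE dbo.DimDate (
    DateKey      INT  NOT NULL PRIMARY KEY,  -- e.g. 20240131
    CalendarDate DATE NOT NULL,
    CalendarYear INT  NOT NULL
);

CREATE TABLE dbo.FactOrders (
    OrderKey     BIGINT        NOT NULL,
    CustomerKey  INT           NOT NULL,
    DateKey      INT           NOT NULL,
    OrderAmount  DECIMAL(18,2) NOT NULL
);

-- A typical report query: sales by state and year.
SELECT c.StateCode, d.CalendarYear, SUM(f.OrderAmount) AS TotalSales
FROM   dbo.FactOrders  f
JOIN   dbo.DimCustomer c ON c.CustomerKey = f.CustomerKey
JOIN   dbo.DimDate     d ON d.DateKey     = f.DateKey
GROUP  BY c.StateCode, d.CalendarYear;
```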
James Serra 33:05
Now, there are other steps you can take. You can aggregate the data, you can put it into a product like Azure Analysis Services, where it becomes a cube that aggregates the data for performance reasons, so you can quickly get answers to queries that might otherwise take quite a long time, and you can put hierarchies in there. So there are all these additional steps you can take. Now, you may be saying, well, this is additional cost and complexity. It is, but there's a reason for it: you are making it very easy to have reports that are not only easy to create but very performant. So there's a trade-off in cost and complexity, but it's worth it, because you will have the speed and the simplicity for your end users. A lot of what I explain to customers is that as the data moves through this modern data warehouse and gets copied along the way, the end result is going to be worth it. But you've got to do the work up front.
Kostas Pardalis 34:08
James, you gave a very good description of what a data warehouse is, but I would also like to ask you about the concept of the data lake. It's something that you hear more and more about lately. Can you tell us a few things about what a data lake is, what the differences are compared to a data warehouse, and when someone should consider one or the other?
James Serra 34:33
Sure, and that's a very hot topic. Take what I just described, where you put everything in a relational database. That was the way it was for many years, but problems arose with it. The first is that you have to have a maintenance window where you knock end users off the system, because if I'm loading all this data and I need to clean it and master it, that's a lot of CPU, a lot of processing. Many times I see maintenance windows of three hours, four hours, even eight hours. And what happens if somebody wants access to the data 24/7? What happens if you kick them off, but then there's a problem, you overrun the maintenance window, and you have to tell them they can't get on the system until it finishes?
James Serra 35:17
So along came the data lake to help with some of those problems. There are many reasons for a data lake, but one of them could be: I want to offload all that transformation of data, that staging area you have in a relational database, and put it into a data lake. The data is copied into the data lake instead, I put compute on top of the data lake, and I do all those transformations without affecting the relational data warehouse. Then that maintenance window essentially goes away, or shrinks to maybe a few minutes when you load the data from the data lake afterwards. So the data lake becomes the staging area, and that's one huge reason for a data lake right there. Another is that I can pour data into a data lake very cheaply; the cost can be very, very low, especially compared to putting it in a relational database. With a relational database, which is very costly, I have to delete older data or only keep data I'm absolutely sure I need. With a data lake I can just dump all this data in, see down the road if I need it, or keep a complete history. Because the data lake is schema on read, I can put data in there without any upfront work. It's like a glorified file folder on your laptop: create folders and put the data in there. A relational database is schema on write, meaning I have to go in there, create a database table, create the fields, write the ETL, and land the data in there, so there's a lot of extra work. With a data lake, I can put the data in very quickly, and then somebody with the skill set to read it can go in, look at the data, investigate it, and see if it's even valuable before you go through the work of putting it in a relational database. I spent plenty of time as a DBA doing all that work for an end user to put data in the database, only to have them tell me, oh, it turns out we don't need that data, or it's not relevant, or it doesn't give us the value we thought. Wow, that's weeks out of my life that are gone. Now I can instead just dump it into the data lake, and if they have the skill set, they can query that data and see what's important before I do all that work. Or maybe they just need a one-time report, or maybe their data scientists need to build a machine learning model; now they have the data lake to do it in. So it's kind of the best of both worlds, because you have that quick access to it.
James Serra 37:46
However, you still, in most cases, want to have a relational database for a few reasons. One of them is in the data lake, the metadata is separate from the data. So it can be quite confusing and challenging for end users to make sense of the data if the metadata is not along with it. Now this is changing. And products like Synapse have ways of making it easier to make sense of that data. But in the end, it’s just files sitting in a folder system. And so that can be too challenging for end users.
James Serra 38:21
It can also have less security. If I'm dealing with a file folder structure, I can put security on a file. But what happens if that file needs to be accessed by many different users who should only see certain rows in it? Maybe it's separated by department. Well, you can't do that in a data lake. There's no row-level security like you have in a relational database, no column-level security, and none of the additional security features that have been part of relational databases for many years.
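For context, this is the kind of row-level security a relational engine provides out of the box. Below is a minimal SQL Server style sketch that filters a hypothetical table by department; the schema, table, and the use of SESSION_CONTEXT to carry the user's department are assumptions for illustration.

```sql
CREATE SCHEMA Security;
GO

-- Hypothetical table that several departments share.
CREATE TABLE dbo.EngagementData (
    EngagementId INT           NOT NULL PRIMARY KEY,
    Department   NVARCHAR(50)  NOT NULL,
    ClientName   NVARCHAR(200) NOT NULL
);
GO

-- Predicate function: a row is visible only when its Department matches the
-- department stored in the session context for the current user.
CREATE FUNCTION Security.fn_DepartmentPredicate (@Department AS NVARCHAR(50))
    RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_result
    WHERE @Department = CAST(SESSION_CONTEXT(N'Department') AS NVARCHAR(50));
GO

-- Attach the predicate; from now on, queries only return the rows each
-- user's department is allowed to see.
CREATE SECURITY POLICY Security.DepartmentFilter
    ADD FILTER PREDICATE Security.fn_DepartmentPredicate(Department)
    ON dbo.EngagementData
    WITH (STATE = ON);
```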
James Serra 38:49
Yeah, there are certain workarounds in the data lake that give you some of that, but it's very challenging: a lot of complexity, a lot of extra thought. So a lot of customers said, I'm going to use the data warehouse as the security layer and the presentation layer, and I'll use the data lake for cleaning and transforming the data and for use by power users. So in most cases, as I've argued for many years, you should have a data lake as well as a relational database. That's changed a bit with Synapse giving you options to use just the data lake. For example, you can use T-SQL in Synapse on data sitting in a data lake. And that was the big problem before: a customer would say, well, I want to use data in the data lake, and you're telling me I have to use something like Hive SQL or Spark SQL; I just want to use regular T-SQL.
James Serra 39:45
And as similar as those dialects could be, it still wasn't enough, and products Microsoft had like U-SQL failed because they were just too different. So Synapse gives you the benefit of using T-SQL. What you can actually do is create a view on top of a file, and then you have the metadata in that view, and you can use regular SQL. That made it a lot easier and opened the door for customers to say, well, maybe I'll just keep everything in a data lake, because you also have this serverless component that scales up and down, so I can save money that way. But the bottom line is that it can still be very confusing to have data in a data lake if you're dealing with many sources, many files, and many folders. Still, the large majority of the time, you want to have a relational database alongside it. But I can see a little bit of movement toward getting away with just a data lake, especially when you look at what Databricks has brought into play with their data lakehouse and Delta Lake, which I could talk more about too. But understand that the data lake is not what people thought when it first came out, this land of rainbows and unicorns where you just dump data in and magic comes out, all cleaned and governed. It's more work to use a data lake. You'll get a lot more benefit out of your solution if you have a data lake and a data warehouse, but realize it doesn't speed up the process of data governance; it adds more to it. In return, you can get a lot more value out of your data.
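The "view on top of a file" idea looks roughly like this in a Synapse serverless SQL pool; the storage account, container, and path below are hypothetical.

```sql
-- A T-SQL view over Parquet files sitting in the data lake. End users query it
-- like any other table, and the serverless pool bills per query.
CREATE VIEW dbo.SalesFromLake
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS src;
GO

-- Regular T-SQL from here on.
SELECT TOP 10 *
FROM dbo.SalesFromLake;
```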
Eric Dodds 41:26
It's interesting hearing you give these explanations. Hearing you describe all of the practical uses and value you can get from a data warehouse, it almost feels like "data warehouse" is a strange term. When you think about a warehouse, at least the first thing that comes to my mind is just storing a bunch of stuff, right? And almost every part of the description you gave was actually really active: you can do this, you can do that, it makes this process easier, there are these levels of security, which is really interesting. I guess it's more akin to an Amazon warehouse, where you have all these robots driving extreme efficiency on the floor as opposed to just storing stuff. One question before we leave the data warehouse and data lake discussion. At scale, it certainly makes sense to have a data lake and a data warehouse, and we probably don't have time to get into the details of the data lakehouse and some of the new architectures we're seeing. But one thing we've talked about on the show that I think is helpful is that, in the life of an organization, you go through phases where you hit breakpoints, whether scale or business reasons, where you may need to implement new technology. We've talked about how two guys in a garage as a startup are just querying their production database, because they don't have enough data for it to be worth adding additional infrastructure, and then at the extreme scale you have companies with multiple data warehouses, multiple data lakes, data marts, complex orchestration, and so on. In terms of a data warehouse and a data lake, I would love your perspective on which one comes first, and when it makes sense to augment with the additional tool. I know that's a little bit of a loaded question, because there are a lot of dependencies, but I would just love your high-level thoughts on that.
James Serra 43:49
Yeah, sure. Most customers I saw had been down the road for a number of years and were having pain points. Maybe they have this relational database, and they're going: my queries are taking forever, I have this maintenance window, I need to load more data, the DBA is saying we have no more space and no more compute to do all that, and now the reports start suffering and I can't augment with additional data. So you have all these challenges. That's the case with a traditional data warehouse; you have these limits, especially if you're on-prem. Then the modern data warehouse came along, and you can think of it as migrating to the cloud, because in the cloud I have unlimited compute and storage, and I can also use additional tools that make life easier, like SQL Database or Synapse, which are PaaS offerings, and then start using additional tools to master and clean the data. In the end, a modern data warehouse has five steps: you ingest the data, you store it, you transform it, you model it into a form that's easy to use in a relational database, and then you visualize it. Along the way there may be machine learning as well. So the idea is, I need to collect all this data, and for a lot of customers that's the first challenge. And I have these four stages of maturity.
James Serra 45:26
The first stage is: I have this data that's sitting everywhere. It's structured, but it's locally managed, and you have spreadmarts and Excel spreadsheets. Stage two, where most customers are at, is that you need to centrally locate the data, and it's always surprising how many customers are not through stage two yet. That could mean creating a modern data warehouse, putting all the data in one central location, and then starting to report off of that. And that's great, but it's sort of a rearview mirror approach: I can use that data to see where I've been and see trends. The next stage, stage three, is predictive analytics. I want to take all the data I've captured and put predictive analytics on top of it. Maybe I want to use it to predict customer churn and take action beforehand, so instead of being reactive I can be proactive. Maybe I want to know when a part is going to fail and change that part because machine learning tells me it's going to fail before it does. And then the stage after that is transformative, where you want to take data, no matter the size, the speed, or the type, and collect it all at a very large scale.
James Serra 46:39
And this is where we get into showing customers the art of the possible. If you ask end users what they would like in addition to what they have now, and they're using Excel, they're just going to tell you they want additional features in Excel. They may not be aware of a product like Power BI, or of machine learning. I always say show them the icing on the cake up front; give them the art of the possible. They're going to look at those Power BI reports and dashboards and those machine learning models, and go, I'm completely shocked. You can see the light bulbs going off in their heads; sometimes you can practically see them, because they get all these ideas. They had no idea you could get all that value out of the data, and they start going, I see so many ways I can save money for my company, I see so many ways I can take shortcuts in generating reports, all this machine learning stuff is awesome. You start showing them the industry models they can create, and they just go crazy, because you're making their lives easier. But then you have to tell them, okay, to do that, you have to get to stage two, at least, and collect all that data, and that's a lot of work. But you're now getting buy-in from the end users, you're getting buy-in from the business units, and that may unlock some budget. So I saw this trend of talking more with the end users who would come to me than with IT, because IT saw everything as just additional work and may not be so excited about building this modern data warehouse, whereas the end users see the value. They don't care about the technical details, which get passed on to IT; they see what they can get out of this. Especially if you prototype things with something like Power BI that makes it easy, they can quickly see and touch and feel that report and say, this is awesome, this is what we want.
James Serra 48:27
So that gives me a level set. If customers are starting out new, they're going to use a data lake in almost every case, and they're going to use a data warehouse in almost every case. If they've come from a traditional setup where they just use a data warehouse, they usually want to incorporate a data lake, and there are ways of incorporating it where not everything goes through the lake at first; maybe just the new data sets they haven't been able to ingest go through the data lake first. So there's a little bit of variation until they eventually get to the ultimate solution where everything lands in a data lake and then some of the data goes into a relational database.
Eric Dodds 49:09
Yeah, absolutely. And I think one of the points that sticks out, and that I think is a really wise point, is that it's easy to think about the technological or data-scale triggers that might necessitate augmenting your stack. But that doesn't take into consideration trust, which has been a really big theme on the show since we began: trust with the people who are going to consume whatever data products your architecture produces. And I think the reporting example is great: can we actually deliver real value with this component of the stack to an end-user consumer within the business? That, of course, justifies augmenting the stack for more complex use cases in the future. I think that was a really helpful way to think through it. We're closing in on the end here, and one of the subjects we wanted to get your thoughts on is what we'll call a data buzzword. It came up on an episode maybe two or three episodes ago, and it's a term I was really surprised we hadn't covered yet on the show. You've written a lot about it.
Eric Dodds 50:38
So the term is data mesh. And I'll say the same thing I said as we were prepping for the show: data mesh is one of those things that sounds cool, and we all think it probably is pretty cool. But if you ask the average person to define it, to just define data mesh in a couple of sentences, that's actually kind of hard, and there are parts of it that are still ambiguous on a practical level. So can you give us your take on data mesh? And then we'll dig into a couple of questions from there.
James Serra 51:13
Yeah, I was unaware of data mesh, that buzzword, until maybe a year ago when I first came across it. It was very confusing, and to your point, this is one of the challenges with the data mesh: how can you have a new way of building a solution if nobody can agree on what the term means? I think it's got a way to go, because I'm seeing people call everything a data mesh now. The bottom line is that a data mesh is really focused on organizational change, not a technical change. The idea of a data mesh is a mind shift where you go from centralized storing of the data to decentralized. Everything I've been talking about has been about putting data into a central location, a data lake. Well, the data mesh theory says: why not have all these various organizations in your company treat data as a product and have their own data domains, so that instead of copying, say, HR, payroll, and a homegrown application that deals with customer orders into a central location, you keep it decentralized? Each of those teams in those domains, who know their data best, keep the data in their organization, and you as IT give them the rules, sort of like a contract they have to follow, to govern the data, clean the data, and master the data. But the data is kept distributed. So you're reducing the amount of ETL and the copying of data, you're allowing the people who know the data best to create the reports and dashboards, and you're reducing the bottleneck of IT having to do everything. The idea is that we can scale better now, because we're not limited by IT being the bottleneck; we can have all these organizations with IT-like people embedded in them, all off doing their own thing.
James Serra 53:39
So it becomes decentralized ownership instead of centralized ownership. You have fewer pipelines going to a central location and more local pipelines. You think of data as a product owned by each of these organizations, and you now have cross-functional domain teams instead of one siloed data engineering team. That is the definition I would say most people agree on, but there are many, many exceptions people make to it, which is why we see a lot of confusion around it. And while all that sounds great in theory, implementing it with technology can be very challenging; I don't think the technology is there yet. The reason I have a lot of concern about the data mesh is that while it sounds great in theory to give each of these domains their own responsibility, imagine you're a large company and you have dozens of these domains. Now you're going to tell all of them to control their own data, which is extra work for them, and you have to give them the benefits of why they would do that. And they're going to be thinking in their own terms: I'm just going to collect the data that satisfies my own needs. They're not thinking enterprise-wide; HR may not be thinking about how to combine their data with all these other domains. So somebody has to have the enterprise view, and somebody has to collect all that data, and that's where it gets extremely challenging. So while I like the idea of a data mesh, I see it being useful for maybe 1% of customers, because there is so much upfront work to make that organizational change that for many companies it's not going to work. You also have to be at a size where you have this complexity and these challenges of scale, which, again, 99% of companies don't have. Many of the current solutions scale very well and will continue to scale very well; I've seen Microsoft handle many petabytes of data and make it work. So sometimes the argument for a data mesh is that things are not scaling, but they are scaling, and sometimes I feel like people are creating a panic point where there isn't one. That's what I put in my latest blog post: a lot of the challenges I see with the data mesh. But I'm hopeful that for certain customers it's going to be worth the extra development time, and they're going to wind up getting a lot more benefit out of their data if they build this and it works correctly.
Eric Dodds 56:41
Yeah, super interesting topic, and it's been interesting to consider it over a couple of conversations on the show. I think you hit the nail on the head: conceptually, you just say, decentralize your data and you get these effects of democratized access and all these other benefits, but practically it creates a lot of complexity in the stack for most companies, at least as it seems to me. One of the concepts that has come up a lot over the episodes is that, especially when you're dealing with particularly critical or high-scale data concerns, simpler is often better. We've heard a couple of times that, yeah, the way we do it is kind of boring, but guess what, it works, it's reliable, and it delivers on the mission-critical things for our customers or internal stakeholders.
Eric Dodds 58:02
And so you see that the tooling around centralization is getting better and better, and actually making things a lot simpler, right? We didn't get into what Databricks and Snowflake are doing around combining functionalities, but things that were once harder are becoming easier in the context of centralization. So it is interesting. It kind of reminds me, and I'm far from an expert on organizational design, but I remember maybe five years back there was a really big push for an organizational design called "holacracy." It's kind of like data mesh, where at the outset it sounds really great. I happened to be really close to a really large-scale company that was implementing it, and on the ground, practically all the employees just said, this is way too complicated; can I just go talk to my manager? So it kind of feels the same way. But at the same time, time will tell, and there are certainly things we said 10 years ago, because certain technologies didn't exist then that do today, that changed the way we thought about things. So we will certainly see where things land. But I will say one thing that is neat is hearing you point out that you've actually seen it happen on the ground at a real company; we haven't talked to someone who's seen that before.
James Serra 59:31
Yeah, and you really hit the nail on the head. It's a lot of change, it's a lot of complexity. The problem I have with data mesh is that sometimes it's presented as almost an easy button, and as customers get into it, they realize it's more work. If you look at the use cases, and there are not a lot of cases I've seen yet that have implemented it, companies were spending, in some cases, years building a data mesh, even before data mesh was a word, because of the complexity and difficulty of getting all the domains within their company, sometimes dozens of them, to buy into the data mesh. And the problem is, if you have just one that says, I'm not doing a data mesh, you're telling me I've got to do extra work, you told me it's going to take a year or two, and I need to get work done now, so forget the data mesh. Well, now you have a data silo. How do you deal with that? And if everybody goes off doing their own thing, even though they said they were part of the data mesh, somebody is using SQL Server, somebody is using Oracle, and everybody is coming up with their own technology solutions. Then you're the person who has to collect all this data and make sense of it as one whole, and now you're opening up a lot of extra work. Then there are the skill set challenges: each of those domains now has to have its own IT-like people to go and build these solutions. We're seeing at EY that finding the talent who can do that is so difficult, and now you're asking to find even more of that talent, and they may not be as skilled or have the expertise of somebody in IT, so they may build something that's sub-optimal. Somebody had a great analogy: it's like telling all these cities to go and build their own roads. Well, I kind of think I can build a road, I can dig a hole, but the end result may be some roads that are not built well, and I may build them in a completely different way than the other cities. So you have this huge mess, and now you have to say, all you cities have to connect your roads together, from one city to the other. Well, who's going to do that? They're going to say it's not my responsibility, so then IT has to go and do it and connect all the roads together. So it can take a lot of extra time and a lot of extra buy-in. Again, it could be worth it, but you have to know these things up front. That's why I try to put in my blog all the concerns you have to go through, so you can address them and decide, yes, this could help us, or no, we're going to take a pass.
Eric Dodds 1:02:08
Yeah, absolutely. I think the road analogy is a great one, and I think it points to a huge benefit of having central purview over all of the components of the data stack. But time will tell, and technology will tell. Unfortunately, we are out of time, but I’m so glad we got to talk about the data mesh buzzword and dig a little bit deeper into it. It’s always fun to talk about the buzzwords du jour. James, thank you so much for taking the time to join us on the show. We learned a ton. It was really fun to hear about all the cool stuff at Microsoft and all the cool stuff you’re working on at EY. We’d love to have you back on the show sometime soon.
James Serra 1:02:50
Yeah, happy to come on again. I love talking about this. I can spend hours until my voice goes out. Thank you for having me for this hour.
Eric Dodds 1:02:58
Absolutely. And tell us where people can read your blog. It’s a great blog. We read it a lot. And that’s actually where I’ve read a lot about data mesh. So where can people find your blog posts?
James Serra 1:03:10
Yeah, it’s my name: JamesSerra.com. You will find a lot of posts on data architectures and on data mesh. There is a contact me button, so if you have questions, feel free to shoot them over and I’ll be happy to answer them.
Eric Dodds 1:03:30
Great. Well, thanks again for joining us. And we’ll talk again soon.
James Serra 1:03:32
Thank you for having me.
Eric Dodds 1:03:34
My takeaway is not related to data mesh, although I’m glad I shared some opinions with James and we were able to, maybe not complain about data mesh, but point out some of the issues around it. My main takeaway actually goes back to something from one of our earliest shows, and that is all the different tools that Microsoft offers that are really cool. Microsoft, probably for a lot of the reasons we all know, kind of has this weird feel of not being cool, especially for startups or data infrastructure. But they actually offer some really cool tools. So it was fun, and I’m really glad you asked him about some of their products. That’s my big takeaway. Kostas?
Kostas Pardalis 1:04:26
Yeah, absolutely. I really enjoyed that part. It was a pretty good introduction to all the different data infrastructure products that Microsoft has. And we shouldn’t forget that Microsoft is huge. Regardless of what we think about them, they have built some amazing technologies; MS SQL is one of them, for example. And there are many companies out there using Microsoft products, right? That’s how Microsoft has become so big. So that’s something we shouldn’t forget. They also do a lot of research. That’s also one of my takeaways.
Kostas Pardalis 1:05:04
The other thing I found very interesting and important was the discussion about data governance. James’s walk through data lineage, data quality, and security gave us a good picture of how complex data management really is. I think this is a space where we are going to see a lot of innovation happening in the near future, so it was very interesting to hear his opinion about it and how important it is.
Eric Dodds 1:05:35
Absolutely. And I know data lineage is a subject that you’re particularly passionate about. It grabs a lot of headlines.
Kostas Pardalis 1:05:45
Yeah, it’s my equivalent of data mesh for you.
Eric Dodds 1:05:51
One thing, here’s a quick hot take on the perception of Microsoft, for those of you who make it to the end of the episode. Here’s my one-minute theory that I just came up with. Do you remember how we talked about BigQuery maybe having some brand perception problems, because the people evaluating it also use Google Docs and Gmail? It feels strange to use Google for a large-scale ML project on your warehouse when you also have a personal Gmail inbox full of spam. So here’s my quick one-minute theory on Microsoft. They have always provided tons of data infrastructure products, but that was a bigger deal several decades ago, before their consumer products gained worldwide traction. I think for a lot of the people working in data today, their primary interaction with Microsoft was through the Office suite, which is sort of its own conversation. And Office, which is still the most widely used business software in the world, is now seen as uncool because all the cool kids are using Google Docs. So if I’m going to choose infrastructure for my startup, I’m not going to choose Microsoft, because I have a bad taste in my mouth.
Kostas Pardalis 1:07:18
Yeah, that’s true. And to your point, we shouldn’t forget that probably one of the most sophisticated and most widely used pieces of data manipulation software out there is Excel. Regardless of what we are doing, many very serious decisions about our lives every day are based on stuff that’s happening in Excel. So never forget that.
Eric Dodds 1:07:49
Never forget. We should make some t-shirts that say “Excel. Never forget.”
Eric Dodds 1:07:59
Well, this was your little bonus round, with some one-minute theories and a t-shirt idea. Thank you again for joining the show. We’ll have more interesting guests, and potentially surprise hot takes at the end of the show, coming up soon.
Eric Dodds 1:08:15
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.