Episode 58:

Data Federation is No Longer The “F” Word with Scott Gnau of InterSystems

October 26, 2021

In this week’s episode of The Data Stack Show, Eric and Kostas host a conversation with Scott Gnau, the head of data platforms at InterSystems, an organization that processes data from, among other industries, the electronic medical records and capital markets spaces. Their discussion dives into an analysis of the concept of a data fabric and how it compares to the idea of a data mesh.


Notes:

Highlights from this week’s conversation include:

  • Solving problems with data has been a long-time passion of Scott’s (2:52)
  • Day-to-day use of data at InterSystems (6:25)
  • The technical aspects involved in constructing a data fabric (17:52)
  • Companies at a variety of maturity levels can adopt a data fabric (26:49)
  • A paradigm shift in the marketplace (28:39)
  • Comparing and contrasting data fabric and data mesh (30:49)
  • Sharing data across the business and not having it siloed in different departments (39:46)
  • Privacy and security within a data fabric (41:22)
  • The future of data fabric and pushing the edge (43:17)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds  00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds  00:25

Welcome back to The Data Stack Show. This week, we're talking with Scott from InterSystems. He's the VP of data platforms there, and he has a really long history of working in data. He was at Teradata for a long time, he was the CTO at Hortonworks, and he has just done a number of things that I think give him a really interesting perspective on how the industry has changed over time. And today he's doing some really interesting things at InterSystems, namely promoting this concept of a data fabric, which can be really interesting. This is not going to surprise you, Kostas, or probably our listeners, but I just love talking with people who have worked in data from the very beginning of what I guess we could call the modern data industry, which actually goes back only a few decades, amazingly enough. And I always love hearing people's perspective when they look back through all the changes that have happened. So that is what I'm going to ask, no surprise. But as always, I think we'll get some interesting insights.

Kostas Pardalis  01:31

Yeah, I mean, I’m also waiting to hear like many stories of how things have changed. I mean, he’s been in this space, like for many decades, and I think it’s going to be great to hear from him how things have changed from like, going from mainframes to the cloud, and then to the data fabric. And of course, I want to learn more about the data fabric itself, like, what is this new thing? We have data meshes, data fabrics, data lakes, data warehouses, lake houses, and who knows what else?

Eric Dodds  02:03

Get the real story right here on The Data Stack Show.

Kostas Pardalis  02:06

Yeah, yeah, I’d love to know more. And like, see, how much of it is more of an architectural pattern? And how much of it is an actual, like technology that is implemented? And what’s the impact it has? And I think we have the right person to answer this question. So let’s go and chat with him.

Eric Dodds  02:23

Let’s do it. Scott, welcome to the show. Really excited to chat with you about lots of different topics, we probably won’t get through all of them. But I really appreciate you taking the time.

Scott Gnau  02:36

Thanks for having me.

Eric Dodds  02:37

You have an incredible resume, we talked a little bit about Hortonworks, kind of a connection that we have there from the east coast, but we’d love to just hear about your background, how you got into data, and then what you’re doing today?

Scott Gnau  02:52

Yeah, I mean, sometimes it sounds like it was a plan, but it really wasn't. Solving problems with data has always been a passion of mine, even from the first assignments that I had that weren't necessarily very sophisticated analytically, but involved a lot of data, and being able to resolve that data into some sort of a decision or action quickly. And I started my career in a massively parallel processing kind of environment, back in the dark ages, like the 90s, when the world's largest data warehouse at that time was, do you want to guess? 30 gigabytes. And that was huge. And it took racks and racks and racks of space to pull this together. But the point is, there was a lot of information and a lot of intelligence in that data, and I really started my career with the notion of parallel processing to kind of break that down into hundreds and thousands of parallel threads, so that the analytics could actually run really quickly without requiring a mainframe-class kind of compute. And I always found that really interesting, not just because of the science of doing it, and the physics and all of the programming that goes into it and the analytics. Obviously, one of the things that thrilled me is that when you do it, and you do it right, sometimes the answer you get back is completely unexpected, and you learn something from your data. And that's actually the cool thing. Later on, when I moved into more of a big data world, it was the same problem to solve, but with much more variety of data. It was no longer just transactions and purchases and customers, but weblogs and social sentiment kinds of data that could be entered into those analytics to get a much more thorough view of what's happening and a much better kind of decision.

Eric Dodds  04:39

So interesting. It is so crazy to think about. It wasn’t actually that long ago, when 30 gigabytes seemed like such a huge amount of space. And now everyone’s phone has more space than that, which is wild. And what do you do today, working in data day to day?

Scott Gnau  05:01

So now I’m here at InterSystems, which gives me an opportunity and it really feels like a synthesis of all the experiences that I’ve had across my career, whether it be in massively parallel processing, highly efficient processing, to transaction processing, to large variety of data and adding in new kinds of analytics all together, is really the mission that we’re on and that I’m working with here at InterSystems and our data platform organization. Our technology at InterSystems actually started in the healthcare world. And you might imagine, there’s a lot of data in healthcare. It’s really varied, it could be your w-ray image, it could be physician notes, it could be the payment that you made for the visit that you just took, which is structured, transactional, and synthesizing all of that together for better treatments and better outcomes is a use case. And the physics behind that use case is very similar. Now, what we’re seeing is kind of expanded and extended analytics that take advantage of lots of different data of different origin, and deliver analytics or insights directly at the time of interaction with a consumer or directly to someone’s device.

Eric Dodds  06:09

Super interesting, could you just give our listeners a quick example of some of the customers and use cases just so they have sort of a practical knowledge of what your work looks like day to day, you know, serving the life of businesses and consumers?

Scott Gnau  06:25

Sure, you know, day to day here at InterSystems and with our data platform, we think that we capture more than half of North America’s electronic medical records, and certainly a large percentage outside of North America. So just think about anytime you’re interacting with a physician, or at a hospital, or getting a treatment or a service or having an insurance claim, that information is flowing through our technology and being used for not just keeping track of you and your treatments, and all of those things, but also being used in many instances to provide for better outcomes, better treatments, better proactive kinds of treatments, as well as from an operational perspective, a lot of our clients will use that technology to optimize their own operations. How many folks do I need on-call at what period of time? Is there seasonality and all those things so that we can line up the supply chain and all of those things? We also have a decent footprint in the financial services industry and capital markets. And so about 15% of global equity trades, again, go through systems that are managed by InterSystems’ IRIS Data Platform. And again, you think about the synthesis of that, very high volume, can’t lose it, it’s got to scale. And I’ve got to make some decisions about pricing and adjudication in a very fixed amount of time. Those are the kinds of problems that we solve with our technology.

Eric Dodds  07:45

Sure, that’s incredible. I mean, just thinking about the scale of 50% of EMRs, sort of interacting with your platform. Well, Scott, one thing that we chatted about before the show, which I just want to dive right into, is this concept of a data fabric. And we love breaking down terminology on the show. So recently I talked about the term data mesh, and there’s a lot of people excited about this term data mesh. And you’ve talked a lot about this concept of a data fabric. So break it down for us. What is a data fabric?

Scott Gnau  08:22

Well, data fabric is kind of a logical construct that we like to use and think about, one that kind of sets the bar to help enable our clients and folks in the industry to be successful. And I'll back up and I'll talk first about the requirements, and then that'll kind of lead into how we think about, and why we think about, a data fabric as kind of a concept, right? First, we hit on it when I was talking in the introduction, right: there's data volume and data variety, and it's just, like, off-the-charts crazy, right? Everything now has a digital footprint, and the devices in our hands are compute devices, and they're creating digital footprints, and all kinds of new data connected and on the web, and social media and all of that interaction data, as well as some of the more traditional transactional systems that folks have, whether it be stock trades, or your checking account, or retail purchases. So, first and foremost, data is just everywhere. It's high in variety, and it's extremely high in volume, and it can be very volatile. And when you think about that, that's different than certainly 10, 15, or 20 years ago, when the majority of data was kind of created inside of a corporate firewall, largely by mainframes connected to PCs; it was very structured, transactional, and very controllable. Now it's kind of out there. And so one of the results of that is that kind of traditional processing of, hey, let me consolidate the data into a place and try to normalize it and then do something with it, just doesn't work. It's just kind of physically impossible, A) to do it, and B) just to keep up with it.

Scott Gnau  10:02

So that means now it's more important to think about connectivity of data than consolidation. So one of the key underpinnings of a successful data fabric is the notion of data connectivity. Can I really play it where it lies? Can I get access to it in a very seamless fashion? Okay, so there's that. Another thing that's happening, obviously, and we see it on the nightly news, is that people talk about all the rage around artificial intelligence and machine learning and deep learning. To me, because there are massive amounts of compute available in the world today and massive amounts of bandwidth available, right, we now have a whole new set of analytics that it's possible to actually deploy. And not only is it possible to actually deploy them, it's possible to actually get a relevant answer and use it for a much more sophisticated kind of analytics and ultimately drive a better insight, a better activity with a client, customer or prospect. And so analytics are no longer just aggregating and summarizing and, you know, joining tables, but now include all of these other kinds of capabilities as well. So there's a requirement for some sort of flexibility in the model: what kind of analytic can I run? When can I run it? And how can I interject new analytics as they're invented into those pipelines in real time without starting over? So that's another construct of what we're talking about in a data fabric.

Scott Gnau  11:29

Certainly, in the first point, I talked about data variety, and tomorrow there'll be some other kinds of data that we haven't thought of, right? And so it's no longer possible, and it's no longer efficient, to just have a tool that is a SQL engine, or a NoSQL engine, or this or that; you've got to think about your data fabric as being able to consume and store any kind of data in its natural format without having to change it when you store it. You don't want to convert it into rows and columns. You don't want to apply any change to it. And well, why is that? Well, number one, if you're going to connect back to it, you want to see it in its native state. But more importantly, if you're going to generate trust across this ecosystem, you've always got to be able to map back to the origin of the data and how the data came to you. If you make changes to it, you can't do that, and you can't build that level of trust. Think about running a machine learning algorithm to optimize a treatment for a patient: you really want to trust that the data came from the right place, and that the prediction that you've made is accurate. It's a life-or-death kind of scenario. So there's being able to have that kind of construct in it. And I'd say kind of the last thing is that you've got to think about being able to deploy insights at various places along the chain. It's no longer relevant to run a bunch of batch nightly uploads, and over the weekend you run some data mining algorithms, and Monday morning knowledge workers show up and they do something, right? You've got to deliver a recommendation that's relevant, to a device, to a consumer, while they're interacting with you. And so, I mean, just the raw physics of that and kind of the speed of light mean that you're processing out to the edge, and that you've got to have the capability and the sophistication to deliver in real time, or in the right time. So if you take those four main constructs that I just described, that's what we think of as kind of the requirements set for a modern data fabric, where you're able to weave together different kinds of data with different timeliness from different sources, in the cloud, on-prem, with different kinds of analytics, with different destinations. And one of the things that we were talking about earlier was, so what about the cloud in all of this? Well, the cloud is actually kind of a culprit to some degree. Number one, because you can now create almost infinite compute resources on demand, which means there's a whole new set of analytics that's possible that wasn't possible before. And it's affordable; that's actually really cool. The other thing is, with all of these connected devices and everything that's going on with cloud-based technologies, it just is a further distribution of data. And for the first time, there is a whole class of data that'll actually live its entire lifecycle only in the cloud. And then you've got to have the ability to do connectivity in a very seamless and transparent way that generates trust and traceability back to the source.

Eric Dodds  14:36

Interesting. What an interesting concept. I've never thought about being at a point where data will live its entire lifecycle in the cloud. And that is just so interesting to consider. But one point you made, Scott, that I'd love to dig into a little bit. I know Kostas probably has a bunch of questions, but I'll retain the microphone for just a minute longer. So one thing you said, an emphasis on connectivity, connecting data because it's coming from various sources in different formats, as opposed to sort of collecting or consolidating it, was one of the dynamics you talked about. And what struck me about that is that it still seems like a lot of companies, even in the context of a cloud data warehouse that you mentioned, are still just trying to collect all their data in one place, right? It's like, if we can just get all of our data into Snowflake or BigQuery, we'll get so many answers. And so that seems to be a trend that's still pretty strong, but it may depend on the size of the company and sort of the complexity. I'd just love to dig into that a little bit more, since we do see a trend towards companies working really hard to collect data, where you're saying connectivity is actually sort of the bigger problem, it seems like.

Scott Gnau  15:57

Yeah, I think connectivity is more sustainable, right? When you think about consolidation of data, there are a whole lot of aspects to that. One is just the sheer movement of data, which is expensive and time consuming. Just the latency of data movement can mean that you're getting access to the data too late to actually do anything about it, right? So there's that aspect of it. There's also the notion that in that whole consolidation kind of scenario, when the consolidation fails, or when there's a failed network port or something like that, then human beings, who are typically more expensive than software, have to get involved to kind of resolve what happened and what's going on. If data gets out of sync when you're consolidating it, then you violate some of the trust you've built up, because you may get different answers at different points along the pipeline. So I think really solving that problem, and thinking about judiciously using consolidation versus doing connectivity, becomes a really important new paradigm. So certainly, you know, there's a lot of buzz in the market about folks moving data and analytics to the cloud, and isn't this really great? I think that that will abate very quickly, because in the end it'll still have the limitations that I described, because it's still a consolidation play; you just happen to be consolidating in a different place that happens, perhaps, to be a bit more expensive than the place you used to be consolidating.

Eric Dodds  17:21

Hmm, interesting.

Kostas Pardalis  17:23

I have one main question right now, which is about data fabric. And Scott, it sounds like a great idea, right? It makes total sense that instead of replicating the data from all the different places where it lives, trying to move it into one place and all these things, we can just connect the data together and work on top of that. But how is this data fabric created and implemented, from a more technical perspective? What are the components of a data fabric?

Scott Gnau  17:52

So that’s a really good question. So obviously, what I described is kind of a logical construct. And if you think about it, from an architectural perspective, where the rubber meets the road is, how the heck do I go and implement this? Right? And, and so there are a lot of different folks and a lot of different companies talking about different ways to do it. I would say that in many cases, certainly the cloud vendors are saying, yeah, you can go build this kind of stuff. You basically cobbled together a collection of seven or eight different technologies, and you can kind of get this functionality. And we see people doing that and trying to make that successful. Certainly the Data Fabric definition that I provided, is kind of the bar that we set all of our technology investment towards at InterSystems with IRIS Data Platform, likening to be able to do that with a single set of technology and provide a little bit lower risk to our customers. Like I say it’s, it’s more a logical construct. And then it becomes kind of the bar that we set for ourselves when we’re making investment decisions in the technology and the flexibility that we create. And like any set of blueprints that you get from an architect, right, you can choose your materials and build out the structure differently. Certainly  we like to think that we can compete with the set of materials that we bring, which is very simple and easy to support.

Kostas Pardalis  19:15

Is there a set of, let's say, fundamental components that this architecture has? Something like, let's say, you cannot have a fabric without at least these components?

Scott Gnau  19:25

Yeah, I mean, breaking it out a little bit further: certainly you need persistence. You need pipelines, transports. And then you basically need the calculation functions. And I think about it mostly in a microservices kind of architecture. Say you're able to move stuff, you're able to persist stuff, and you're able to calculate, whether that's an analytic or a transformation or whatever. If you have those and you can kind of cobble those three services together, pretty much like our DNA is made up of four base materials, then in different combinations you can make very complex and interesting things. It's the same thing in this data fabric. And what you need the underlying technology of your data fabric, and the standards that you choose, to be able to do is host those different things, combine them, and then manage them as the end-to-end applications that you build.
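
To make that three-service idea a bit more concrete, here is a minimal sketch in Python, with hypothetical names and toy data (this is not InterSystems IRIS code), showing persistence, transport, and compute as small services that get composed into one end-to-end flow:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass
class PersistenceService:
    """Stores records in their native (here: dict) form, keyed by source."""
    _store: dict[str, list[dict[str, Any]]] = field(default_factory=dict)

    def write(self, source: str, record: dict[str, Any]) -> None:
        self._store.setdefault(source, []).append(record)

    def read(self, source: str) -> Iterable[dict[str, Any]]:
        return iter(self._store.get(source, []))


@dataclass
class TransportService:
    """Moves records to a sink callable (a stand-in for a pipeline)."""

    def move(self, records: Iterable[dict[str, Any]],
             sink: Callable[[dict[str, Any]], None]) -> int:
        count = 0
        for record in records:
            sink(record)
            count += 1
        return count


@dataclass
class ComputeService:
    """Runs an arbitrary analytic or transformation over records."""

    def run(self, records: Iterable[dict[str, Any]],
            analytic: Callable[[Iterable[dict[str, Any]]], Any]) -> Any:
        return analytic(records)


# Compose the three services into one small end-to-end "application".
persistence = PersistenceService()
transport = TransportService()
compute = ComputeService()

persistence.write("claims", {"patient": "a1", "amount": 120.0})
persistence.write("claims", {"patient": "b2", "amount": 80.0})

staging: list[dict[str, Any]] = []
transport.move(persistence.read("claims"), staging.append)
total = compute.run(staging, lambda recs: sum(r["amount"] for r in recs))
print(total)  # 200.0
```

The point of the sketch is only the combinatorial idea Scott describes: each service is interchangeable, and different combinations of the same three base pieces yield different applications.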

Kostas Pardalis  20:21

That’s super interesting. Is there like some kind of relationship between a Data Fabric and what technically used to call a query federation? Because you mentioned a lot of being able to have this kind of decentralized architecture where, from what I understand, at least, I have an analytical function and I can write wherever the data is, instead of having to get the data in one place, and execute and migrate it there. So I remember, for example, a store, okay, of course, like Presto works in your own environment so it wasn’t so decentralized. But in a way, the whole idea of query federation was that instead of copying the data and bringing it into one place, let’s execute the query where the data lives, get the results back and somehow connect and consolidate the results. Is there some kind of relationship there are between the two ideas?

Scott Gnau  21:14

Yeah, there is. And I have to tell you, you may edit this out, but for the first 25 years of my career federation was the "F word" to me, because it never really worked, right? But when you think about the ability to do connectivity, you now have a new set of tools that can make that kind of a use case work. Although there are different terms that come up, data virtualization and other things, you now have some more tools that make it more of a reality. I think there really are two things that we kind of hold ourselves accountable for at InterSystems. One is actually being able to push the processing to the data, or the data to the processing if you want, but typically you want to push the processing to the data, because that's the cheapest thing to do, and then get the result sent in a pipeline somewhere else, right? And then the second thing that we do, that I think is kind of unique in the industry, is that we are multilingual in the kind of process that we allow to run against our data. So we're multilingual, meaning you can speak SQL, Java, Python, and you can interact with the data. And we think that's really important in the data fabric construct, because what I described is all these new analytics, so it's no longer just a SQL statement. You might want to bring in some machine learning that's written in Python, and you just want to push that out and have the technology stack figure out how to do that, how to run that process in an optimal way, and then get the answer back. We're seeing some interesting use cases from our customers who are able to do this. Because certainly in a traditional machine learning data science model that doesn't use a data fabric or the InterSystems stack, there's this huge data extract: you extract a bunch of data and give it to the data scientists, and then they run their stuff, and they find something that's interesting. Okay, we think this is interesting. And they wipe the data out and they go get more data and they run it and they get the answer. But then they have to take the answer and manually put it back. Think about all the latency that's created there. If you can just run the machine learning model on the data where it lies, even if that processing were a little less efficient, you're removing all that latency from the process, and you now have the reality of being able to have a much better decision in time to do something about it.
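
As a rough illustration of the federation idea Scott describes (pushing the processing to the data and only moving small results back), here is a minimal sketch in Python using two in-memory SQLite databases as stand-in sources; the table, schema, and numbers are hypothetical, and a real fabric would also handle query planning, security, and differing source dialects:

```python
import sqlite3

# Two in-memory SQLite databases act as separate data sources. The same
# aggregation is pushed down to each source, and only the small partial
# results travel back to be combined, instead of copying full tables
# into one central place first.

def make_source(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
    return conn

source_a = make_source([("ACME", 100, 12.5), ("ZORP", 50, 3.1)])
source_b = make_source([("ACME", 200, 12.6), ("INIT", 10, 99.0)])

def federated_total(sources, symbol):
    partials = []
    for conn in sources:
        # The heavy work (scan + aggregate) runs where the data lives.
        qty, notional = conn.execute(
            "SELECT SUM(qty), SUM(qty * price) FROM trades WHERE symbol = ?",
            (symbol,),
        ).fetchone()
        partials.append((qty or 0, notional or 0.0))
    total_qty = sum(q for q, _ in partials)
    total_notional = sum(n for _, n in partials)
    avg_price = total_notional / total_qty if total_qty else 0.0
    return total_qty, avg_price

qty, avg_price = federated_total([source_a, source_b], "ACME")
print(qty, round(avg_price, 3))  # 300 12.567
```

Only two small tuples cross the "network" here; the same shape of pushdown is what removes the extract-and-reload latency Scott mentions in the machine learning workflow.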

Kostas Pardalis  23:35

Yeah, that’s very interesting. I have, like a very specific question about machine learning, but is this model like applicable both for training and inference, or it’s more about one of them, because I can imagine that like inference can happen much easier at the edge, let’s say on like a mobile application, or like whatever, but training because it’s kind of like a little bit more complicated, more iterable kind of process can happen. All the use cases around like working with data can be like served on a data fabric.

Scott Gnau  24:10

All of those modes. And I think when you think about the fabric that you deploy, and the technologies that you choose, you really think about all those modes, really needing to exist as close to the data as possible.

Kostas Pardalis  24:24

What are the most common use cases that you have seen deployed on the data fabric? Is it machine learning? Is it more BI-related use cases? What have you seen your customers do?

Scott Gnau  24:35

Kind of like the market, right? In the market, everybody gets BI. Human beings kind of think relationally to some degree, and they're used to interacting with those tools. Since that's kind of the base of experience in the industry right now, you probably see more of a predominance of BI algorithms than the others. But the machine learning stuff is starting to grow, and I think, gosh, I remember in the late 80s BI was a new concept, right? And yes, I'm that old. Sorry. But it was a new concept, and it took a while for people to catch on that it actually really worked and was very meaningful. People would kind of say, well, I know my business, I don't need BI to tell me my business, right?

Scott Gnau  25:21

We’re seeing some of that now with some of the early machine learning stuff. And certainly some of the early adopters get it, and so on, and so forth. But if you look at kind of the middle of the market, kind of the mid to late adopters, they’re still just playing with it, they haven’t fully bought in, but when they do, that capability will become even more important, you’ll see that volume grow. I also think that being able to combine those modes is also a really interesting thing, to think even about some of the most simplistic, I want to do pricing adjudication on transactions, and into that, I want to factor the risk capital impact that it will have on my business. And I want to consider the real relationship that I have with customers. And oh, by the way, I want to run an ML model that maps the vector of the securities pricing or the underlying security to predict whether or not this will be a good transaction, and then I can put all that together in a couple of milliseconds and price to transaction. They combine all of that technology together.

Kostas Pardalis  26:20

Okay, that’s super interesting. Do you, I mean, from your experience with like the customers that they are working with simply meaning like a data fabric, do you think that today, it is basic, like type of company or organization that it’s like, let’s say more ready or more mature to implement and adopt data fabric, or it’s something that you think of like a benefit, or it can even be implemented, like, in any company?

Scott Gnau  26:49

I’m seeing it across the board. I would say that in some of the less mature companies, since they’re kind of coming in, at this point in history, it’s almost like the de facto requirement and that they’re building from, versus a little more mature company that’s got all kinds of legacy applications and legacy businesses, it certainly can’t be compromised. I see them doing things a little bit more incrementally and thinking about how do I go transition and the cool thing about the data fabric is if you if you weave it correctly with the right technology, it will plug into the legacy stuff and leave that kind of unadulterated and start to build new applications in this space and it’ll start to take on critical mass. And at some point you’ll kind of see that cross the chasm, and that’ll be now the de facto standard.

Kostas Pardalis  27:39

Yeah, yeah, that’s interesting. And I’ll go back again to the to the components of the data heartbreak mainly because I mean, if someone like follows the market and like the news and what is happening out there, might be like, aware of things like the data warehouse in the cloud, right Snowflake, BigQuery, then we started like getting into using the the concept of the data lake. Now we have the lake house, which is like a hybrid between the two. I don’t know what will follow after this, but because companies are out there investing right now like huge amounts of both money and efforts to implement like all these patterns, like architectural patterns, right? How do you see these fit under the concept of a data fabric? Is there some kind of conflict there? Is the Data Fabric like something that sits on top? How do you see it? And also like, give us some best practices and some like advice on how we should think as architects, as data architects, with all these different components.

Scott Gnau  28:39

I think it’s really the next generation architecture right? There was a time when data marts were state of the art architecture for analytics, and then they became enterprise data warehouse, and then that became data lake. And then I think this is just kind of what comes after. Right. And as the industry matures, and, by the way, each of those things in and of themselves, were extremely relevant when they came to market, but the market because it’s changing rapidly, and there’s this huge volume explosion of a variety of data. And also because the bar is set higher, because you and I and all of us are much more educated consumers, we expect that the folks that we interact with will understand us better, right? So all these things kind of come together and say the ball has moved. And so, and you mentioned like, and a lot of people and that’s why I’m gonna run my data warehouse in the cloud. That’s interesting, but it’s kind of like putting new leather seats in a 40-year-old car that the transmission just fell out of. Sure, your ride will be more comfortable. And that’s interesting, but is that really your sustainable mode of transportation? And so I really think about it as kind of a paradigm shift, not to use a buzzword, but a paradigm shift in a marketplace where all of these things were relevant at a time and data lakes were very relevant at a time, right? Because I got to capture all the data and figure out what it is and understand it. The thing that sometimes is missing in data lakes is the notion of traceability and connectivity for trust, and they become data swamps. And so there’s that. And again, not bad technology, not bad concepts for the time, but I think the world and the market has moved on. And this is a new place, whether you call it a data fabric, or data mesh, or some other thing, I think whatever that thing is, is really driven by the four underlying pressures that are happening in the marketplace that make each of those previous technologies less interesting to solve the entire problem.

Kostas Pardalis  30:35

Yeah. You mentioned another technology that I’m still trying to figure out, to be honest, which is data mesh. So what’s the relationship between the data mesh and the data fabric? Or like, where are the differences or the overlap?

Scott Gnau  30:49

Again, I think it depends who you're talking to. I defined what I mean by data fabric, and I have heard people use data mesh and other terms to kind of describe 80% of what I'm describing, and then another 20%. And so I think, again, that the big notion is that the world is moving on from data lakes, and certainly from data warehouses, into kind of this next-generation data infrastructure. And whatever you call it, it's going to be driven by the marketplace requirements, which I think I described for you. And folks who figure out, number one, how to deploy and actually build out that architecture to make their business successful will, I think, be much more successful than those who don't. And I think technology vendors like us who can actually provide a better mousetrap will get some good attention as the market kind of moves into that space.

Eric Dodds  31:45

Scott, one question on enterprise scale. When you think about the work that you do in the healthcare industry and in capital markets, that's massive scale, and you have worked in data for a long time. So I'd love for you to speak to those in our audience who are hearing what you're saying, and they probably agree in theory that it makes sense, but then they're facing the day-to-day of, okay, my charge is to go implement, or sort of get value out of, the data lake and data warehouse setup that our company is implementing. There's sort of a long tail of the market where, if you're not solving these extremely complicated problems at scale, maybe some of those tools are sufficient, at least for the problems you're solving. How would you tell someone like that to think about the future? How do you prepare for that? And when do you begin to tactically think about things like migration, adopting new technologies, all that sort of stuff?

Scott Gnau  33:03

Yeah, so I think there are a couple of things, right, and you're also spot on: it can be a daunting task to go sell a vision of this nature inside of an organization that tactically needs to get things done, right? So, a couple of things. Just like early in my career, when I talked about breaking things down into small problems and doing parallel processing: break it down into small problems. There are the drivers that I described, so go look at some of those small problems and figure out, okay, is data variety, volume, and location going to impact my application? If not, okay, I'm not going to worry about that for today, but I certainly need to at least have adjudicated that decision. I think also, and this certainly requires corporate CTO buy-in and things of that nature, one of the really cool things about cloud is that it's easy to spin up applications quickly using a collection of microservices. There's no big capital acquisition, etc. The problem is that it also creates sprawl and silos in a way that we've never seen before, right? And again, for my entire career, the marketplace and clients have been talking about the problem of data silos, right? And I think that in today's world it's even harder, because it's easier to create them. In the 80s, back when you had to go to the capital committee and get approval to move data, there were still data silos. Today you don't need any of that approval; you can create your own thing. And so there are more and more and more. So my point is certainly to try to take the long view. You might say, I can't afford to go build a five-year plan to go build a data fabric because I have to run my business. I get that. But there are very easy architectural decisions that you can make to make that transition easier and to avoid a continuation of this data sprawl and data silo proliferation, where you end up with disconnected data that you can't analyze, which ends up being extremely expensive and potentially redundant. And ultimately, when you start to look at it from that perspective, the ROI on at least agreeing to a data fabric kind of architecture becomes very easy to justify. And then it becomes, how do I tactically go solve this problem? Well, new applications are going to use this architecture and legacy applications are not. And just because you decide to use an architecture doesn't mean you have to slow down rolling out solutions. It just means that the choices that you make on storage and transport and the algorithms, and the actual technology standards that you choose, are a forethought and not an afterthought.

Eric Dodds  35:47

Sure, that’s interesting. Two things that I just wanted to reiterate that I think were really helpful. One is, I don’t know, if it’s just subconscious, I’m sure I was exposed to some sort of marketing messaging with all of these cloud tools. But the cloud kind of had a promise of like, helping solve the data silo issue. And in many ways, it’s refreshing to hear you say, it’s worse now, because anyone can go into AWS and spin up whatever service they need for whatever they’re doing. And then all of a sudden, even a small to midsize company have a sort of these pockets of replicated technology that are sort of managing data independently of each other, which is super interesting. And then the other one is that when we think about technology migrations, and you think about something like an on prem to a cloud, where it really is sort of a major overhaul, right, like there is a massive migration. If everything’s in the cloud, I think your point about saying, you can make decisions now in the cloud that it’s not like you’re migrating the entire infrastructure of all the technology of the company, you’re dealing with, like you said, sets of microservices. And you can choose to construct those in a way that sort of paves the path, as opposed to thinking about it as, okay, you go from data lake and warehouse to fabric, and it’s a massive one time, you know, sort of painful migration.

Scott Gnau  37:15

Yeah, and it’s just kind of like good programming practice to not put conscience into your program, but actually always point to variables so you can change your mind later, right. Being able to think about it as disaggregated from a specific cloud vendor, but more as an entity of its own becomes very freeing, because then the cloud is a source, a provider, but it also avoids potential lock-in or other downstream impacts. Because you’ve actually up-leveled the whole architecture. And, and I think that’s important. I mean, it’s just, you know, human nature and the nature of business, right? Just through my career, right? Most large companies, you say, well, what’s your BI tool? Well, we have all of them, well, what’s your database standard? Well, we have all of them. And what’s your class standard? Well, we have all of them.

Scott Gnau  38:05

It’s gonna happen. Or you acquire a business that had a different cloud and you want to get the lifeblood of the data and the intelligence and the insights that can be driven from it. If you’ve up-leveled your architecture, and you think about it in a virtualized abstract across multiple clouds, that’s also very valuable in terms of future proofing, what you’re rolling out. And again, it’s not, I’m not here to say one cloud vendor is bad, or one cloud vendor is good, or it’s got nothing to do with that, it’s just the nature of the market is going to dictate and it’s going to change. And it may change suddenly, and without a whole lot of notice. And without a whole lot of logic. And if it disrupts the value chain of the insights that you’re driving and the interactions you’re having with your customers, that’s very bad.

Kostas Pardalis  38:53

Scott, when we are discussing data meshes, one of the definitions of a data mesh that comes up many times is that a data mesh is, let's say, 80% about an organizational architecture and not a data architecture. It says a lot about how companies should be working with data or how they should be organized around the data. And I'd like to ask you: companies have to change in order to adopt these new paradigms, right? What do you think are the main changes that a company has to make, especially the bigger ones, where it's more difficult to change, in order to maximize the value that they can get from something like the data fabric? Or maybe there aren't any, but if there is something that also has to change in how the company is structured, what is it?

Scott Gnau  39:46

I think one of the things that is really important in that scenario is really a C-level kind of discussion, or a board-level kind of discussion, right? That data about my business belongs to my business and not to an individual department. And I think most large companies are at that point now and kind of get it competitively, especially thinking about all the new FinTech stuff, because those players are not segregated by business unit; they're innovating across all this data that's available. And then that leads, more pragmatically, into the notion of balancing between security, privacy, purpose, and access to the data, right? Because if all of my data is my business's asset, then at its logical conclusion, I want everybody in the company with a need to know to have access to everything. Oh my God, how do I manage that in a world of, you know, security and privacy and cyber attacks and all that stuff? So that's why I say I think this is a C-level kind of discussion that has to happen. Okay, we agree that this is a corporate asset, here's how we intend to use it, here's for what purpose we intend to use it, and we kind of set that vision at the top level so that it can then apply to different rule sets and different implementations of those protections, and to what use cases are considered possible versus not.

Kostas Pardalis  41:08

Hmm. You mentioned privacy and security, do you see any implications around that when someone implements a data fabric, or it’s actually like a better architecture to promote both privacy and security?

Scott Gnau  41:22

It’s an architecture and then it becomes an implementation. So just because I have access to data for my job, I may not actually be able to see your discrete record or identify it with you. But it’s important for me to see the diagnosis, the outcome of the treatment, etc, etc. So I can aggregate that with a whole bunch of others to understand trends in that space. And so when I talk about that at a C-level data usage, just kind of policy statements become really important, because that can then frame it right? So yeah, I mean, very few employees, except maybe the attending physician would need that. No, this is Kostas’ information, because he’s sitting here in front of me. Yeah. But there are a lot of use cases where your data, not associated with you necessarily, personally, can be used for managing and anticipating supply chain and what people need to be on, appointment scheduling, and all those kinds of things. And so, and inside of the data fabric, certainly there are plenty of technologies that can be deployed to kind of protect that.

Kostas Pardalis  42:30

Great. And one last question from me, and then Eric can continue with his questions. Are there any limitations to the decentralization of a data fabric? What I mean by that is, I can think of, say, instead of moving all the data from our databases into a data warehouse, we can, let's say, federate, or connect directly to these databases and execute the queries there. Okay. But can we push this even farther and have these queries, or these analytical functions, executing on a mobile device? Is it something that's applicable also in IoT cases, where you have the edge and you need to do some processing there? What are the limits? And what do you see happening in the next couple of years?

Scott Gnau  43:17

Yeah, I mean, the limits are how far out to the edge you want to go, right? So the edge isn't one thing; it's like the boundary of an amoeba, and it's changing all the time, right? And it's expanding, because edge devices are getting smarter, and smarter, and smarter. So the edges are moving and ebbing and flowing. And like I said, ultimately, five years from now we'll be talking about some other really cool stuff that you can do further out on the edge, because the edge has broadened its boundary. But certainly playing the data as close to where it's created and where it lives is the important concept there. And so, certainly from an IoT use case, you try to push out: there's not a single edge, like the end device, but there are multiple layers of the edge, and you just try to push out as far as is appropriate. And yeah, certainly data fabric architectures and technologies need to take that into account. I mean, think about ARM processors and how much more powerful they're getting. I had a Raspberry Pi sitting here somewhere that runs a complete image of our database. That's like, okay, great. If that makes sense, and there's an analytic we can push out there, and there's data that's been consumed into that device, then we want to be able to make that happen.

Eric Dodds  44:30

Well, we’re actually getting close to time here. Scott, one thing I’d love to do, you have seen major life cycles in the world of data. And one thing I’d love for you to do is just give some advice to our listeners, especially maybe those who are early in their career, or who, especially maybe aspiring to a leadership role in data, and what are the types of things that you would encourage them to be thinking about now or sort of lessons that you’ve learned that might be helpful to them.

Scott Gnau  45:02

It’s not an assembly line kind of job, meaning it’s not repetitive, right? One of the things that I was interested in reading a couple years ago about data scientists was a new hip job to go get a degree in. And why was that because every day you come to work, it’s a different job. So if you like variety, and creativity, it’s a great place, because I don’t see any slowing of the rate of change in any of the aspects of what’s happening in our environment. So there’s definitely that. And I’d say the other thing is actually I learned this in university in writing papers, whatever it is, like, you can often find and make data, tell whatever story you want to tell. And try not to do that. Because if you’re gonna be successful, you got to learn stuff from the data that you didn’t expect. And that’s when you’re really doing your job well.

Eric Dodds  45:55

Really good advice. And actually, I was talking with an advisor earlier this week, and he was talking about how to do reporting really well. And he said the thing that makes it really hard is that you can tell whatever story you want. Which is true.

Scott Gnau  46:13

I hope my boss isn’t watching this, because he’ll never trust any report that I bring in.

Eric Dodds  46:17

But yeah, that’s, that’s really great advice, and I really appreciate that insight. Well, Scott, it’s been really wonderful to have you on the show, I loved learning about data fabric. It was really helpful for you to break that down, and loved just learning from you in general about all the amazing things that you’ve done in your career. So thanks for giving us the time to be on the show.

Scott Gnau  46:38

Thanks very much. It was fun being here. And hopefully we’ll see you all again soon.

Eric Dodds  46:42

What a great show. My takeaway is very specific, but I don't know if I've ever heard anyone say the cloud is making the data silo problem worse. And I think that's because there are so many cloud tools that maybe promise to solve that problem. I just found that very refreshing, because I think a lot of people sort of experience that pain, although it may not be as challenging from a pipeline perspective to solve for that as it was in on-prem days without sort of streaming tools. Yeah, that was just really interesting. So that's my takeaway. Kostas?

Kostas Pardalis  47:21

Yeah, absolutely. I don’t think I can agree more with you. And he’s right, like, I even built a business because of that right, like, Blendo to consolidate the data from the cloud to the clouds.

Kostas Pardalis  47:44

So there are business opportunities everywhere. Yeah, he's right. I mean, just because you have the cloud and you kind of move everything to the cloud doesn't mean that you don't still have silos, right? And maybe the problems are even bigger there. Because at least back in the days when you only had your mainframes behind your firewall, you had total control. Over the clouds, you don't. When you are using a cloud-based ticketing system or CRM, you don't have that much control over what the interfaces are, how the data will look, how fast you can access the data, and all that stuff. Which introduces some very interesting challenges out there. So yeah, that's very interesting. And I have to say that, after the conversation we had with Scott today, I'm very, very curious to see how these new patterns like the fabric or the data mesh are going to evolve. There are some very interesting technical challenges there, and there's a lot of value, and we'll see how we implement them, obviously. But as with everything else, usually reality is a little bit different than what we have in our minds. And I think it's going to be… we're going to have a couple of very exciting years in front of us in terms of new technologies and how they are going to be implemented.

Eric Dodds  49:00

No question. Well, thank you for joining us on The Data Stack Show, and we’ll catch you on the next one.

Eric Dodds  49:07

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.