Episode 37:

The Components of Data Governance with Dave Melillo of FanDuel

May 26, 2021

On this week’s episode of The Data Stack Show, Eric and Kostas talk with Dave Melillo, senior manager of operational analytics at FanDuel, about data governance, finding a better term for reverse ETL, and impending industry consolidation.

Notes:

Highlights from this week’s episode include:

Dave’s “nerdy” interests in sports statistics and data (2:12)
Trends in collecting, processing, and using data (4:45)
Finding a better term for “reverse ETL” (5:48)
The blurring of the distinction between sources and destinations (7:41)
The role of BI is changing (13:24)
Data governance and the physical execution behind it (19:00)
Data governance is defining and managing data in a logical way that is actionable by the business (23:43)
Consolidation of tools and services (28:49)
Databricks vs. Snowflake (33:49)
Dave’s focus on regulatory data at FanDuel (45:47)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Thanks for joining the show today.

Eric Dodds 00:18

Welcome back to The Data Stack Show. Very interesting guest–Dave from FanDuel. FanDuel is sort of a fantasy sports and sports betting suite of apps. So it’s gonna be really interesting to talk with Dave about that. I think it’s fairly new, I know, they’ve been doing fantasy sports for a while, but I think the betting aspect is new from a regulatory standpoint. So perhaps we’ll get to hear a little bit about that. But Dave has a really varied background and has worked in all sorts of contexts with data. So I think one of my burning questions, which is pretty tactical, is, when you think about fantasy sports, you have to ingest data from a ton of different places. I mean, you’re talking about statistics, you have games across multiple sports happening on a daily basis. And so when you think about all that’s required to run, you know, a sort of consumer mobile app like that, where people are interacting every single day with data that needs to come from third parties, I always look at that and say, Man, that’s an interesting pipeline problem. So I want to ask about that. Kostas, what’s on your mind?

Kostas Pardalis 01:18

Yeah, first of all, I’ll probably want to learn a little bit more about Fantasy games, to be honest, like, I don’t know much about it. But outside of this, I mean, Dave has a very diverse background. He has worked with data science, data engineering, he has even done work in data architecting. So I want to learn from him about his experience with all the different fields around data and also pick his brain on what’s coming in the future in this space.

Eric Dodds 01:50

Great, well, let’s dive in.

Eric Dodds 01:52

Dave, welcome to The Data Stack Show. We’re really excited to chat with you.

Dave Melillo 01:56

Thanks, Eric. I appreciate it. I’m really excited to be here.

Eric Dodds 01:59

So you have a varied history with data. And we want to hear all about it. So why don’t you give us the brief, you know, sort of two-minute overview of when you got started with data, the different companies you’ve been at, and what you’re up to today.

Dave Melillo 02:12

Totally, I think I’m going to Tarantino it because you know where I am today, that’s kind of the apex of what I’ve been trying to do with data my whole life. I’m currently working at FanDuel, which for people who don’t know, is a daily fantasy sports betting company. It’s in, you know, the sports entertainment space. When I first started studying data, back in high school and things like that, you know, what piqued my interest was sports statistics. I’ve always been kind of a nerd that way. I thought I was going to graduate college and be the statistician for the New York Yankees. Unfortunately, that didn’t happen. But what did happen is I was able to kind of parlay that interest in statistics and data and information technology into roles at, you know, Fortune 500 companies, I worked at software startups. Kind of ran the gamut from different places that I worked at throughout my career. But everything revolves around data, right. The same things that I was doing at Fortune 500 companies, I was doing on the side consulting for small businesses in my area. And so that’s everything from data engineering, to data architecture, to data science, and all the fun stuff in the tip of the spear. But yeah, it all finally came full circle to me landing closer to my passion here at FanDuel, where, you know, we solve everything with data.

Eric Dodds 03:30

Very cool. And tell us a little bit. I know you haven’t been there too long, but what’s your role? Do you have a team? What kind of data projects are you working on at FanDuel?

Dave Melillo 03:40

Yeah, at FanDuel, I’m on the operational side of the business. So it’s a lot of back office support, compliance support, you know, regulatory support. So it might not sound like the sexiest of roles, but it’s really cool, because it’s at the hub of everything that FanDuel does with data. So I get exposed to a lot of different pieces of data, not just gameplay stuff, and it’s really, really interesting. And there’s a team and it’s growing exponentially along with the market. So it’s a really fun and exciting place to be right now.

Eric Dodds 04:10

Very cool. Well, I have tons of technical questions. I know Kostas does, too. But one thing I would be interested to hear your perspective on is sort of major trends in the data space. And I’ll name one specifically to maybe direct the question a little bit more but, you know, we have this concept of a data mesh that seems to be becoming more popular. What kind of trends are you seeing in terms of the way that companies are sort of organizing themselves around sort of collecting, processing, and actually using data?

Dave Melillo 04:45

Yeah, that’s a great question. So the trend that I’ve seen most strongly over the past few months, maybe six, maybe even a year, is that people have doubled down on technology like Snowflake, right? And like cloud data warehouses have become commonplace. And that requires a significant amount of investment from a company’s perspective, right to just spin up and migrate to Snowflake, or Redshift or BigQuery, or anything like that is no easy job, right. And it’s no cheap job either. So over the past probably 5-10 years, companies have done that. They’ve started to understand or have a revelation that just because everything’s in that cloud data warehouse doesn’t mean that the business is exposed to it. So that leads to this whole trend of reverse ETL, that has started to emerge. I don’t really like the word reverse ETL. I feel like that’s very much like a sales and marketing term.

Eric Dodds 05:41

I’m so glad you said that. Because, okay, let’s talk about this, I was gonna ask you why you don’t like it, tell us what you would call it.

Dave Melillo 05:48

I call it data portability. And that’s how I’ve always advertised it internally, to my stakeholders and the people I’m working with on projects, because it’s about, you know, making sure that data is portable, no matter where the analysis or the data is generated, right. I think thinking about it as reverse ETL makes sense, because you can marry it to something that people are familiar with. But I’m not really sure if the concepts of ETL are actually what this is doing. So portability is the word that I use. But you know, it’s really all about getting data in front of people where they’re working on a regular basis. Because just as IT organizations have doubled down and gotten things like Snowflake and Fivetran and this whole chain of tools, you know, go-to-market functions in the business have also done the same thing. You know, I’m sure that you guys can sympathize with this. But at any company that I go to, they have like a million different SaaS applications, you know, one for customer success, one for sales, one for marketing, etc, etc. And, you know, asking people in the 21st century to be swivel-chairing between like a BI dashboard, and Salesforce and some spreadsheets and things like that is a little bit archaic. So, you know, that’s what I think this whole data portability trend is trying to solve. People understand that, hey, just because I have analysis in a dashboard or in a data warehouse, that really doesn’t mean anything to the people who are actually using this data and making it actionable.

Eric Dodds 07:19

Sure. Okay. So I’m going to give you a really brief sort of three-stage history of where we’ve been with data, maybe even the last five years, and Kostas, I want your opinion on this too. This is me, I’m gonna go off the cuff here. But you kind of have the introduction of the … this is dangerous.

Kostas Pardalis 07:38

It’s very exciting, I think. Go, go. Do it, do it.

Eric Dodds 07:41

Alright. So you have the introduction of the data warehouse, right. So Redshift was sort of the first major player there. And then on the heels of that came Snowflake. And of course, BigQuery is a major player there too. And this allows you to collect all of your data in one place and sort of achieve analysis that before was much more difficult. And then you have … that sort of created the challenge of the second phase and tools like Fivetran and all that solved it, which was okay, now I can collect all my data, but actually doing that is kind of hard. And so I need much, much easier ways to get all these pipelines to talk to each other and sort of integrate my stack, whether it’s sort of sources to the warehouse, and then also sort of sources to like SaaS tools. And that was phase two, right? So you saw the Segments and the Fivetrans and all the pipeline tools come of age over the last five years, and many are sort of mature now. And I think the third phase, and this is probably where there’s, you know, sort of some prediction coming in, is, and I love the term data portability, is where every source is also becoming a destination. And so this paradigm of sort of linear collect, store, you know, transform, process and deliver is actually becoming almost bi-directional, in a way where the distinction between sources and destinations is starting to blur. How did I do? Was that accurate?

Dave Melillo 09:04

Oh, yeah, I think you nailed it. And I mean, as you were talking, you know, one of the things that started to percolate in my mind is also this whole movement around, kind of view materialization. I know, DBT has come on really strong as of recently, and again, I think, you know, maybe even like the next phase of all this, and what’s going on in the future is all around data governance. And, and maybe that’s the data mesh piece that you talked about at the beginning. It’s like, Okay, I have all these sources, I have all these tools, how can I observe them, make sure that they’re available? How can I make sure that people know what the single sources of truth are? How can I easily create these single sources of truth from large data sets and, and kind of make that available to the rest of the organization? So yeah, I think you did a great job. And I think the future to your point is kind of, it’s kind of a little bit like the Wild West, right because all the big boulders have been solved, but people still experience pain. So, you know, I think you see different vendors kind of attacking the future and from different angles. You know,

Kostas Pardalis 10:06

Dave, I have a question. And I’d like to hear from your experience. So about reverse ETL, right? I mean, it’s a new term, as you say, it’s let’s say this portability, data portability, let’s call it like this. How was it done before? I mean, before the markets. And why did the market decide now to go after this problem?

Dave Melillo 10:28

Yeah, again, I always say this. When I started this conversation about data portability, and why it’s emerging, I believe it’s because things like the cloud data warehouse have become very accessible, right? It’s not hard. Like usually in the past, there was a lot of configuration, a lot of customization, a lot of integration. But now you can white label everything, right, you just subscribe to Snowflake, look at that, you have a cloud data warehouse. Same thing for building data pipelines. In the past, you’d have to know Airflow, you’d have to get familiar with DAGs. And you’d have to build it all yourself. Now you just subscribe to like Stitch for a monthly fee, and you can get all of your data into your data warehouse. But now people understand that like, Wow, so we’ve doubled down on this minimum viable data stack. But no one cares, right? Like my salespeople don’t care that I have a cloud data warehouse, because they’re still consuming content through BI dashboards or through things that we send to Salesforce. So you know, it’s completing that circuit. And that’s really why I believe these portability tools, or even things like DBT, have really become needed, because it’s that bridge between all that technical debt that you build up with this minimum viable data stack and actually making it actionable. And yeah, I don’t know if that answers your question 100%, but I do believe that you wouldn’t have one without the other. Right. If these cloud data warehouses, if these pipeline tools didn’t exist, I don’t think things, you know, in the data portability or the DBT space would be emerging as well.

Kostas Pardalis 12:05

Yeah, absolutely. To be honest, I think and I believe, actually, that the real enabler here is the cloud data warehouse, I think the rest pretty much emerges because we have access to cheap storage and processing on the cloud. Something that like in the past, we didn’t, I mean, and that’s what makes things easy from one side, but also complicates things like the cloud, makes things cheaper and more accessible. But at the same time, it complicates things by introducing many silos, like all the different SaaS applications that we have. And suddenly, we also have to pull data from there. And it’s not just database systems in the same data center where we, you know, control everything as it was in the past. Because, of course, like ETL is not something new, it’s existed pretty much since we’ve had database systems. So yeah, I would say that, I totally agree with you. And I would probably emphasize a little bit more like the importance of data, cloud data warehouses for that. So you mentioned BI, and I mean, traditionally, data warehousing was the technology that was supporting BI. Do you see the role of BI changing inside the organization? Do you see it as going away, or do you see new roles outside of the BI analysts emerging?

Dave Melillo 13:24

100%. And that’s it, you know, now I’m remembering what your last question was, right? Like, how did we solve this before? All of these great tools? And BI was the answer, right? I remember when I started my career, probably closer to 10-15 years ago, BI was the thing, right? BI was solving all these complex problems that you couldn’t with spreadsheets, right? So something like QlikView and Tableau like they were dominating the space, because they made it so much easier to answer the questions that you had, that you’re trying to solve with spreadsheets and kind of first gen technology back then. So in that way, I totally think that BI now is changing, right? Because you don’t have to do the end to end process with BI anymore. And if you are still doing that, and you’re basically using something like Power BI or Tableau as your data platform, I think that you’re way behind the curve, because you just can’t process things as quickly. You can’t anticipate as quickly. It’s not scalable. Right. And so now, yeah, it’s very interesting. I don’t think BI is going away. But it’s just not the one-stop shop anymore. I think it’s one of many tools that analysts will have to learn and in that way to your point Kostas, I think the definition of an analyst is changing. You know, it used to be that you just had to be good with visualizations and creating some charts. I think the scope of an analyst is increasing now. Right? Like, I think analysts nowadays have to be comfortable jumping into things like a Jupyter Notebook or a Databricks notebook, right? Because there you can do some ETL, you could do some transformation, and, you know, set up for visualization later down the line. I don’t think it was like that before. So I totally think that there still is a role for BI. I just don’t think it’s going to be as critical or as pivotal as it was in the past.

Kostas Pardalis 15:19

Yeah, no, I totally agree. I think the role of BI is transforming. Obviously, it’s not going away, because reporting is always going to be like the foundation of whatever we’re doing, right. Like, we need to understand the past in order to act in the future and in the present, so I don’t think that like BI is going anywhere. It’s just that, instead of like the BI analysts, we will have a bit of different roles where BI is going to be just part, you know, of a toolset that you are using, as you put it very well earlier. And talking about roles, there is a very interesting new category or new role, let’s say, that is promoted a lot by DBT: the analytics engineer, right? What do you think about this? Like, what’s your definition? What does it mean for someone to be an analytics engineer? What is this thing?

Dave Melillo 16:07

Yeah, it’s funny, like, I don’t know, what does it mean, right? All these data things, have all of these data roles been valuable from day one, right? Because when I came in, what an analyst was, is not what an analyst is today. And I like this idea of an analytics engineer. I mean, what that means to me is, is someone who’s doing the more technical work behind analytics, right? Because people say, okay, analytics, you get a data set, you chop it up, you pivot it in Excel, and you’re an analyst, it’s like, well, you know, that that world has increased in scope, right, and breadth very much so. So like, I think of an analytics engineer as doing those things like view materialization, even like some data governance, and, and maybe more of what would be thought of more of a data engineer, but not like a very technical data engineer like so in that breath, right? I think that data engineers are becoming more and more and more like developers, right, they are definitely shifting over to more of a developer persona, developer day to day developer tool stacks, right. So I think that analytics engineer, is starting to emerge as the people who are technical enough to be in the conversation with the developers, but still analytical and business minded enough to be able to, to match business requirements to what needs to be done on the back end to set up the business to analyze, right? So again, when I think of things that analytics engineers are doing, it’s, you know, view materialization, data governance and indexing data, building data catalogs, even building maybe some observability and monitoring pieces of the stack, which, you know, that’s another piece that’s emerging. So, yeah, I don’t know, I’m probably not the person to define what the analytics engineer is, but that would be my best guess if I had to take it.

Eric Dodds 17:59

Dave, we brought up data governance a couple times here, and I’m, I’m really interested in … so in many ways, like this and this is, you know, unfortunately, a lot of times the case where the marketing kind of leads too early with the future vision that companies can achieve. And then, you know, you sort of like when the, when the data warehouses came out, you know, sort of, or in the early days, when they were becoming really popular, you know, you had this whole thing of like, and now you can get a 360-degree view of the customer. It’s like, well, in reality, you needed all these pipeline tools in order to make that feasible for, you know, your average company. But to your point on data governance, and I think it’s really interesting in the data mesh concept, governance becomes a problem, because now you have all these different pipelines, maybe different vendors, you know, different internal builds, all that sort of stuff. And so you can sort of move data more easily, and centralize it more easily. But now you’re sending it to all these different places. And so now you have sort of a, it’s hard to do governance at a central level. What are the ways that you see companies solving that?

Dave Melillo 18:59

I think that’s a great question. And I think, you know, Kostas also kind of touched on this. I really think that the physical component of governance is what’s starting to emerge, because I even remember, what was the … there was a company that we were doing an RFP for, you know, again, like 10 years ago … Collibra, right? It’s a really famous, like, data governance, data catalog, and all that. It was really like a fancy spreadsheet of, you know, data metric definitions and what they were, and it allowed people to collaborate, right? So in that way, I think it’s bringing that concept to life and making it physical. So again, what does that really mean? I keep coming back to view materialization. But like, there is no data governance without some type of physical execution behind it. So whether that means that you’re going to roll out GitOps, so that everything in your GitHub repository aligns very much with all the metrics that you’re creating. I mean, this whole code as documentation, I think is a piece of it as well. Right? Like your code should be your data governance assets. When someone asks like what, you know, MAU, monthly active users are, you shouldn’t be like, pointing to a cell in a spreadsheet and words that define what a monthly active user is. Like, you should be able to point to maybe a view and say, oh, well, here’s our view of active users. And this is the SQL behind it or the Python that builds this view. And it’s pulling from these tables, and it has these columns. And these are the characteristics of each column and the type. Like, that’s the piece of data governance that has been missing, I think, for probably a long time: that physical piece to say, Okay, yeah, you’ve defined it. Right. And you’re governing it from that aspect, but how are you making it real?
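To make the "point to a view, not a spreadsheet cell" idea concrete, here is a minimal sketch in Python. It is an illustration only, not FanDuel's setup: the `events` table, its columns, the sample rows, and the choice of SQLite as a stand-in warehouse are all assumptions made for the example.

```python
import sqlite3

# Hypothetical metric definition: the SQL below *is* the documentation
# for "monthly active users". Table and column names are assumptions.
MAU_VIEW_SQL = """
CREATE VIEW monthly_active_users AS
SELECT strftime('%Y-%m', event_date) AS month,
       COUNT(DISTINCT user_id)       AS active_users
FROM events
GROUP BY month
"""


def build_demo_warehouse():
    """Create an in-memory stand-in for a warehouse with a few sample events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            (1, "2021-04-03"), (1, "2021-04-20"), (2, "2021-04-11"),
            (1, "2021-05-02"), (3, "2021-05-09"), (2, "2021-05-15"),
        ],
    )
    conn.execute(MAU_VIEW_SQL)
    return conn


def monthly_active_users(conn):
    """Read the governed metric from the view."""
    return conn.execute(
        "SELECT month, active_users FROM monthly_active_users ORDER BY month"
    ).fetchall()


def metric_definition(conn):
    """Return the SQL behind the view, i.e. the 'physical' definition."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'view' AND name = 'monthly_active_users'"
    ).fetchone()
    return row[0]
```

When someone asks what MAU means, `metric_definition(conn)` returns the exact SQL that produces the number, and that SQL can live in version control next to everything else.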

Eric Dodds 20:58

Yeah. And that, and that kind of goes back to something that has been a recurring theme on the show across so many disciplines within data, whether it’s data science, data engineering, data governance, is that it’s an organizational and sort of cultural question first. And that is, you know, getting shared definitions around how you define the business. And then I love the analogy you gave of the physical manifestation of that, I think that’s just a really helpful way to think about that. And I agree with you there. I mean, DBT is a huge step forward in building some process and tooling around that. But I still think we have yet to see all the different things that are going to make it way easier to do that centrally, within the context of sort of the data mesh future, if we want to call it that.

Dave Melillo 21:51

So and you know, where you hit the nail on the head is, I think all of these tools are still a little bit too technical for business users, right? Like, when I think about DBT, when I think about any of the good, you know, tools that are making it easy to, you know, manifest, this whole process, they’re still very technical. I think the first company or you know, vendor who comes up with like a business way, or a way to empower business users to participate in that process, I think that that’ll be where the major impact comes, because that’s what you’re missing. At the end of the day data people are data people. And it’s great that that’s starting to happen, because I feel like in the past, you were like a marketing person that also knew how to work spreadsheets. So now you’re the marketing data person, right? And I think it’s flipping now, people are understanding, like, you wouldn’t do that with HR, right? Like, at a company, you wouldn’t be like, you’re in marketing, and you’re good with people. So you’re going to be our HR person, but think about it. That’s the way that data has been working for the better part of you know, the 21st century. Only recently have there been college graduates, you know, graduating with analytics degrees and a concentration in statistics that is specific to programming. So it’s like, you know, you know, I think, as data people actually stake their claim, and they are data people, you know, you’re going to need tools that bridge the gap between the data-minded person and the subject matter expert, you know,

Eric Dodds 23:25

Yep, totally. You’re so right.

Kostas Pardalis 23:28

Before we move forward, I have a feeling that this conversation is going to be a lot around data governance, and for a good reason, because it’s something that’s very, very interesting. Yeah. So Dave, can you give us a bit of a definition of what data governance is?

Dave Melillo 23:43

Yeah. I mean, I really, I really think it’s defining and managing your data in a logical way that is actionable by the business. I think of data governance, as for example, a lot of single source of truth projects, right? It could be as simple as customer value. Well, how do you have a data governance program around customer value? It might seem really easy, it’s like, well, the number in Salesforce is our customer value. But where did that number in Salesforce come from? Right. So it’s this whole data lineage that maps all the different data sources to the metric that you want to create, and not only the data lineage and where that information is coming from, but then what is the logic? Is customer value based off of the start and end date? Is it a monthly value? Is it an annual value? And, you know, for all of those questions, how is the answer manifested, and that’s where I think the documentation as code or code as documentation really plays a point. So you have this data lineage piece that traces all of the information that you’re using for the metric, you have the logical piece that is using code to define what these metrics are, and it’s a tangible thing, and then, you know, there’s some type of delivery mechanism. And that’s where, again, the physical piece is really stressed. It’s like, okay, well, once we have the lineage right, once we have the logic down and committed to code, how are we delivering this to stakeholders on a regular basis? Are we materializing views? Are we using a reverse ETL tool to get it out of our data warehouse? Is there another process that we’re using? Right? I think there’s many solutions to the problem. But when I think of data governance, those three pieces of lineage, logic, and delivery are kind of the main components for me.
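The three components Dave names (lineage, logic, delivery) can be sketched as a small data structure. This is a hypothetical illustration: the class name, the source names, and the SQL are assumptions, not something described in the episode.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GovernedMetric:
    """A metric counts as governed only if all three pieces are present."""
    name: str
    lineage: Tuple[str, ...]   # upstream sources the metric is built from
    logic: str                 # the code (here SQL) that defines the metric
    delivery: Tuple[str, ...]  # how stakeholders actually receive it

    def missing_components(self) -> List[str]:
        """List which of the three components have not been filled in."""
        missing = []
        if not self.lineage:
            missing.append("lineage")
        if not self.logic.strip():
            missing.append("logic")
        if not self.delivery:
            missing.append("delivery")
        return missing

    def is_governed(self) -> bool:
        return not self.missing_components()


# Hypothetical example: source names and SQL are made up for illustration.
customer_value = GovernedMetric(
    name="customer_value",
    lineage=("salesforce.opportunities", "billing.invoices"),
    logic="SELECT customer_id, SUM(amount) AS annual_value "
          "FROM invoices GROUP BY customer_id",
    delivery=("materialized view", "sync back to Salesforce"),
)

draft_metric = GovernedMetric(name="churn", lineage=(), logic="", delivery=())
```

Here `customer_value.is_governed()` is true, while `draft_metric.missing_components()` lists all three pieces, which is roughly the difference between a definition in a spreadsheet and one that is physically executed.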

Kostas Pardalis 25:36

Makes sense. That’s very interesting. And what are the tools that we have today to implement data governance?

Dave Melillo 25:45

Yeah, like, I think that there’s like some all-in-one tools. I know Collibra is a really big player in this space. Obviously, there’s some more legacy providers, like Informatica, I know they have really robust MDM and data governance features. You know, personally, I don’t really think that there’s like a cool data governance platform, right, like an emerging one that kind of fits with this minimum viable data stack, because people are kind of managing data governance in a technical way, right? They’re like, Well, you know, our version of data governance is that everything lives in this zone of our Snowflake data warehouse, and then when we clean it, and we prepare information that’s ready for consumption, it’s in this other zone of our data warehouse. Some people, I think, are solving it with a tool like DBT, right? If it’s scheduled with DBT, and, you know, set it and forget it, then that’s our data governance, and basically anything in production is governed. Anything in dev is not governed. But again, what that does is in a way, it excludes the business user. Because unless the business user can go through, can fork a GitHub repo, can read SQL, can understand all the different programming languages and the transformations that are being done to that data, it’s kind of hard, like, you need to be walked through that process. So like I said, the first company that comes by and can map the technical pieces of data governance, the lineage, the logic, and the delivery, to things that the business people would understand and also be able to contribute to, like, I think that’s where you’re going to get lightning in a bottle.

Kostas Pardalis 27:28

Yeah, that’s really interesting what you’re saying, because you are talking a lot about like, a governance platform, like a unifying kind of experience around like governance, which is what Informatica was trying to do, right, or Collibra. In general, like all these more enterprise kind of companies that we have seen so far, like in this space, IBM, I mean, all these companies had some kind of like, master data management platform. But at the same time, I think the Silicon Valley way of doing things is getting these platforms, right, and decomposing them into meaningful parts and building companies and products around that, right. So now we see companies like Immuta, for example, right? Like they just raised a Series D of $90 million. And the product is all about data access, right? And how you manage that. And then you have like a number of companies that are doing quality, and even more niche things than just quality, right? Like, just tracking schema changes, right? So this creates a very fragmented kind of landscape with all the tools that are out there. Do you think this kind of works, or is it pretty much a necessity, in order to realize the real value of data governance, to have just one platform that does all that stuff?

Dave Melillo 28:49

I honestly think that we’re in a bubble of all these different data tools. And I have to believe that there’ll be consolidation in the future, which, I think, is what you’re hinting at. You know, I think you’re already starting to see it with like, I think Twilio bought Segment, or maybe it was the other way around. I’m not sure.

Kostas Pardalis 29:07

Yeah it was Twilio, yeah.

Dave Melillo 29:09

Yeah. So you know, that was a big, not shocking, but, you know, I thought Segment was a huge player in the space and you see them consolidate. I worked at a DevOps company, and they are very similar. They have similar problems when it comes to tool chains that data does, like there’s a million different data pieces that you can do. And, you know, for DevOps, you can have, you know, five different tools just for testing, right. And so, as there has been consolidation in the DevOps space, where like, you know, Google and Microsoft start buying up these little pieces, I think it’s gonna happen with data again. If you want to map the journey from BI to where we are now, think about the huge BI vendors that have gotten acquired, right, I think about Looker. I think they went to Google, right? Yep. Yeah, and there have been some other consolidations. I totally think that in the future there will be conversations like in the next five to 10 years, I don’t think that we’re going to be talking about a bunch of different vendors, I think we’ll be talking about one or there will be a solution that emerges. And I’ve already seen this because I like to work with early stage startups around data. You’ll find a tool almost like Zapier, right? That can almost white label all these services, and put them in one thing. So that you’re kind of working off the Snowflake engine, the Fivetran engine, but you’re working in X tool, right, to bring it all together. I’m not sure which one’s going to happen first. But it’s either going to be consolidation, or it’s going to be some type of white labeling, because there’s no way that people are going to want to, you know, switch from thing to thing as they’re trying to go about their day, you know.

Eric Dodds 30:49

Yeah. And you kind of see it broken out by business discipline, because you have some companies in the space, to Kostas’ point, that are focusing on sort of like sales ops, and some sort of like marketing ops, and governance there, I think, Dave, have you heard of a company called Great Expectations?

Dave Melillo 31:05

No, I don’t think so.

Eric Dodds 31:07

They’re kind of interesting, and our listeners, if you haven’t checked them out, it’s just kind of interesting … I think it gets at some of the things you’re talking about where, I mean, they’re an early stage startup as well. And so they’re in their own way, taking a slice of the pie, but they kind of have an interesting framework for thinking about data governance, and sort of managing it at the pipeline level, which is really interesting. So definitely give them a look.

Dave Melillo 31:32

Definitely. No, no. And honestly, like, maybe we’re far away from the consolidation, because it feels like I’m learning about new tools all the time. I know, Presto has started to emerge, you know, from like, to solve for big data issues. I’ve been speaking to Monte Carlo, because I think that’s just a really interesting space around data observability. Right. And sure, it makes a lot of sense. You have all these data tools, now what if one of them fails? Would you even know? Like, are you even doing data quality checks across the whole tool chain to make sure that there’s some, you know, form of validity to everything? So, yeah, to your point, I think that there’s new emerging ones all the time, I just can’t imagine that people will want to continue to buy more subscriptions. Someone’s gonna come along and consolidate for the good of the market.
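To make Dave's question about data quality checks across the tool chain concrete, here is a minimal Python sketch of two basic checks on a batch of records: a row-count check and a completeness check. It is illustrative only and is not how Monte Carlo or any tool named in the episode works. The function name `run_quality_checks`, the `CheckResult` type, and the sample field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def run_quality_checks(rows, required_fields, min_rows=1):
    """Run a few basic checks on a batch of records (a list of dicts)."""
    results = []

    # Volume check: an empty or tiny batch can mean an upstream tool failed silently.
    results.append(CheckResult(
        "row_count",
        len(rows) >= min_rows,
        f"{len(rows)} rows (minimum {min_rows})",
    ))

    # Completeness check: count records where a required field is missing or empty.
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        results.append(CheckResult(
            f"not_null:{field}",
            missing == 0,
            f"{missing} rows missing '{field}'",
        ))

    return results


if __name__ == "__main__":
    # Hypothetical sample batch; the second record has an empty account_id.
    batch = [
        {"account_id": "a1", "amount": 10.0},
        {"account_id": "", "amount": 5.0},
    ]
    for result in run_quality_checks(batch, ["account_id", "amount"]):
        status = "PASS" if result.passed else "FAIL"
        print(f"{status} {result.name}: {result.detail}")
```

Running the same checks after each stage of a pipeline (ingest, warehouse load, outbound sync) is one simple way to notice when a single tool in the chain has stopped delivering valid data.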

Eric Dodds 32:19

Yeah, it’s gonna be interesting.

Kostas Pardalis 32:21

I was thinking about what’s interesting with the data space is that the acquisitions actually started from the BI tools, which probably makes sense, because they are like, the most mature ones. But think about how crazy it is that even publicly traded companies like Tableau got acquired. Tableau got acquired by Salesforce, right, but outside of this, we haven’t seen anything major happening. And I think it’s probably … okay, we have the Twilio/Segment acquisition, which was pretty big, right? I think it was like $3.2 billion, but the market is in the right conditions for acquisitions, there’s a lot of liquidity, there’s a lot of cash, stocks are pretty high. So I don’t know, I really want to see what Snowflake is going to do. I don’t think they have acquired anything so far. So I think we should be like, keeping our eyes on them.

Dave Melillo 33:09

Definitely. I would peg Snowflake as one of the consolidators. I mean, if you think about it, it would be great to get Snowflake to acquire something like Fivetran and something like a RudderStack, a Census, a Hightouch, then, basically, you’d have a way into your cloud data warehouse, you’d have the cloud data warehouse, and you’d have a way out of the cloud data warehouse, right. So in that way, I’d basically have everything I need. Obviously, there’s other bells and whistles that I could add to that. But I mean, you know, I could kind of plug and go and have a data platform with one vendor, you know, so yeah, it’ll be very interesting to see what happens.

Eric Dodds 33:49

Which makes total sense, because a lot of the … especially in sort of the SMB mid-market, are already using all those tools, right? I mean, it’s just consolidating into sort of one system. Okay, so speaking of data warehouses, and this is actually Kostas, for you and Dave. So Kostas wrote an article recently about sort of Snowflake versus Databricks, and sort of the impending collision there, which is really interesting. We’ll put it in the show notes for everyone to read. It’s really an excellent piece. But Dave, would love your opinion and Kostas, jump in here as well, because you’ve studied this pretty deeply, you have sort of the warehouse side, which is Snowflake, and then you have the data lake side, which is Databricks, and then you have this new emerging category, which is being called “lakehouse”. So I would love your thoughts on what are we going to see happen there in the next five to 10 years related to sort of all the things we’ve talked about?

Dave Melillo 34:43

Yeah, I would love it if Kostas went first so I could kind of copy his answer because I have thoughts on it, but if we have a subject matter expert, it would be great for you to get us going.

Kostas Pardalis 34:53

Yeah, I don’t know if I’m an expert, but I’m very fascinated, especially from the product side of things with that stuff. And that was the whole idea of the article and what I tried to communicate: that we are actually converging into one data platform at the end. Now, how is this going to be named? Is it going to be named cloud data platform, or it’s going to be a data lake, or a lakehouse, or whatever, that’s something that product marketing will figure out. And it’s not that important. But what is important, and I think that resonates very well with what Dave was saying also about data governance, is that we need to have like one experience in one platform working with data and unify many of the functions that we have under one platform, that’s the opportunity for the market. But also, that’s what is needed if you want to really create this data economy and create an industry around data. Right now things are like, extremely fragmented, for a company to manage to have like a data stack, there are just way too many vendors that have to be involved there. Even for pipelines, Eric, like, think about it, like how many different vendors someone needs to have a complete data pipeline inside the company, right? Like it’s probably at least three. So everything’s going to be around one platform. And what I’m thinking is that, I think that’s also like the vision that Snowflake was trying to communicate through their S-1 filing is that there’s going to be a data platform, and on top of that, there are applications that are built. So BI becomes an application, right? The pipelines are something that are working around this platform and connect to this platform in and out. And you can build some very interesting things over that. Like, for example, you can start having marketplaces around data. And when you do that, then you have network effects, right. And that’s where it gets like, really, really fascinating. I think we’re just at the beginning. 
But I also think that the direction of where we’re heading is becoming more clear.

Dave Melillo 36:59

Totally. And I would agree with all of that. And, and to just pick up on it. From my perspective, I think that the platform that is most wide open, but also standardizes on some really basic things is going to be the winner. So the reason I say that is like Databricks, right? Databricks very much takes advantage of SQL and Python, two common languages. And, you know, I’m familiar with Snowflake, I’ve used it, but you know, it feels a little bit more kludgy to me, or like click and drag and drop. I know there is a SQL component to it as well. But you know, I think it’s very appealing to be able to leverage your developer language skills, right. The number one thing that I hate, and I hope does not happen is that someone comes up with their own syntax to manage all of this, right? I really think that the success of any platform, whether it’s Snowflake, or Python, is to capitalize on standard components of the data industry, right? Because again, if you think about it, like if I’m in Power BI, I need to know like DAX, and their language, right? In QlikView, they have their own. And so when it comes to BI, like even Looker, you have to know LookML, it’s all based off of SQL and, and Java-based languages. But I mean, it’s kind of a pain. If you’ve invested five years or you know, you went to the Flatiron School and you learned how to code Python, and then all of a sudden, you’re in Tableau. And you have to drag things onto shelves and figure out how to create a chart by clicking on a bunch of different buttons, right? So when I think of what has the most potential in the future, I mean, I love the notebook infrastructure, right? I’m a big fan of Jupyter Notebooks. I love the Google Colab product. 
And, you know, I’m a big fan of Databricks that way, because it’s a very, it’s like a blank canvas, you’re still guided, but like, I could be using Python in one cell, I could use SQL in another, it’s super flexible, I can fork different pieces of code that I find on the internet into my notebook and make it all work together, the scheduling is a little bit more technical, and less clicky. So when I think about what’s gonna emerge, I think it’s going to be the platform that takes advantage of the popular skill sets in data and doesn’t make people relearn things or learn a specific way of doing things. Hopefully that makes sense.
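As a small illustration of the "Python in one cell, SQL in another" workflow Dave describes, the sketch below mixes the two languages in one script. It uses Python's built-in `sqlite3` module as a stand-in for a warehouse or notebook SQL engine; it is not Databricks code. The `sales` table and its columns are hypothetical sample data.

```python
import sqlite3

# An in-memory SQLite database stands in for a warehouse or lakehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 25.0), ("east", 10.0), ("west", 40.0)],
)

# "SQL cell": let the engine do the aggregation.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# "Python cell": post-process the SQL result with ordinary Python.
grand_total = sum(total for _, total in totals)
share = {region: total / grand_total for region, total in totals}

for region, fraction in share.items():
    print(f"{region}: {fraction:.1%} of sales")
```

The design point is the one made in the conversation: both halves use widely known languages, so nothing here requires learning a vendor-specific syntax.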

Kostas Pardalis 39:30

Yeah, it does and I totally agree with what you’re saying about platforms and all that stuff. I think products need to be built with the assumption that they are going to really fast become part of the workflow that the developer has and not create more friction or more, let’s say mental overhead to the engineer to learn something new, right, which by the way, is probably something that will exist only as long as the company exists. So yeah, I totally agree with that. And I think we will see this paradigm of the past where companies were building their own languages like Splunk, for example, right, like, you have to use their own query language to do that. And you have people who specialize only in that, like, that’s what they have on their CV. I think we’re going to see that less and less in the future. And it’s going to be much more risky for companies to do that, and try to build a business around that, unless they do execute very, very well, like DBT, for example. But what I think is very smart that DBT did is that it built on top of an existing language, which is SQL, and they just added enough, let’s say, special sauce there from their engineering to make it easier to work with and do things that we couldn’t do in the past. Because, okay, SQL also had like, a lot of like issues, and the ergonomics of the language was very problematic. I mean, that’s amazing what they did with that. But yeah, I totally agree. I think that Python, R, Jupyter Notebooks, every product in the data space needs to at least interoperate with these tools.

Dave Melillo 41:07

Definitely. Yeah. And again, to your point on, like, what does this become? Does it become the delta lake, the lake house, the, you know, the data mart. I’ve heard, remember data marts and data stores from, you know, BI times. I mean, the architecture of this, I think, is really up for grabs. And I think that’s the part that needs to be bespoke. Right? Because I’ve worked at places that have big data problems, right? I’m at a place like that now at FanDuel. Right? I mean, data is the product. And there’s just voluminous volumes of data coming in every second, right? So there’s a whole, you know, the whole data streaming thing is appropriate here, you know, data lakes talking about that’s appropriate here. And so using tools like Databricks that solve big data problems is really apropos, right. But you know, I do a lot of consulting gigs. I’ve also worked at smaller startups, and they don’t have those problems. I was at a startup where, like, their biggest data set was the 50,000 accounts that they had in Salesforce, you know what I mean? So there wasn’t necessarily a big data problem there. But I should still be able to use something like Databricks to solve all the problems that I have at a small company that might not need a data lake, they might not need, like this robust cloud data warehouse, but I can still use that tool in order to facilitate a solution, right, it won’t feel like I’m using a rocket launcher to solve for, you know, something that I could with a hammer. So that’s where that flexibility piece comes in. I do believe the architecture piece is going to continue to be bespoke, per industry, per company, per vertical, right? Because SMB companies in software development are going to have much different data needs than, you know, like a restaurant company that might be you know, nationwide or global. And, that’s another trend that I see emerging. 
And that’s why I think these data tools are so important, I think small businesses haven’t even really taken full advantage of their data, because they see that like, oh, well, that’s like a corporate, that’s an enterprise problem, right. But if you did have, like, these really great accessible tools, that anyone who knows SQL or Python can jump into, you know, you’d be able to solve, you know, for problems of like a local gym, or like a local bar so that they can manage their data and all their data assets in their business the same way that, you know, Google or a Fortune 1000 company would. So, yeah. So for me like that question of what does the architecture look like in the future? Is it lakes? Is it warehouses? What is it? I don’t think that’ll ever be standardized, but I think that like the tools that we have, should be able to build a variety of those solutions.

Eric Dodds 43:54

Yeah, you know, Dave, it’s really interesting to think about DBT. And I know, I’m not the first person to have this thought. But I think an interesting point to make for our conversation is, DBT has spanned individual users to enterprises and retains its ability to add value. And that in and of itself, is extremely rare to be able to serve, you know, successfully as a company or tool to be able to serve an individual user and the enterprise, especially as you grow because the natural need of any business is to focus on the users that it serves best, right? And so it is, it’s almost impossible to serve an individual user and an enterprise simultaneously. I mean, you have to make all sorts of choices around product features and roadmap and marketing and all that sort of stuff. So really interesting thoughts there. But I agree, I think it’s going to, you know, tools that sort of help democratize that and span the size of business are going to be a huge part. One thing I want to do, I know we’re getting close to time here. I cannot believe we’re getting close to time. It feels like we just started talking. We’ll have to have you back on because I have so many more questions. One, I’d love to ask you about FanDuel, a little bit. And my question is pretty tactical, but I think it’d be interesting for our audience. You have all sorts of types of data at FanDuel. And it looks like, I mean, I’m not an expert, but it looks like you’d have to ingest a ton of data and statistics across a huge variety of disciplines. And so I’m interested to know, you know, even if we just think about sports, like fantasy football, for example, where and how do you ingest all of the statistics that you need in order to sort of run daily fantasy programs in the app? I mean, that seems like a major sort of data engineering pipeline challenge.

Dave Melillo 45:47

Yeah. And you know, what’s great about my job now and working at a big company is that I’m obfuscated from a lot of those decisions, right? In other roles where I used to be involved in building the pipeline, managing the database, and also producing the insights, you know, in this role, I’m really fortunate to be able to focus on delivering the insights. And so we still have to build like mini pipelines, because to your point, I still have like this massive data lake, let’s call it, and that’s how I see a data lake is like, all of the possible information that you could use from an analysis standpoint. And we have to build mini pipelines so that we have dependable views that are slices of time, for performance reasons, and just for feasibility. But I wish, you know what, I wish I could tell you how all the information gets into there. And, and to be honest with you, as the company goes through acquisitions, as the company transforms, right, think about it, I mean, this company FanDuel has started to work in an industry that just became legal when you’re talking about sports betting, right, like you look back 5-10 years ago, like there wasn’t legalized sports betting outside of like, Las Vegas, and maybe Atlantic City. I don’t even know if it was in Atlantic City at that point. But to that point, it’s still a little bit of the Wild West. I mean, it’s a mystery. And that’s probably because it’s something that the company is solving for on a daily basis. So I wish I could answer that question. But I can confirm that there is information coming from mobile applications, from reference data that we’re grabbing from databases, and again, it’s a multi product company. So you’re not just talking about one application for daily fantasy, but you’re talking about sports books, racing, poker, like there’s, there’s a myriad of them. But, you know, again, I think that process of ingesting, architecting, and delivering, those are still the core tenets. 
Like the things that we’re doing at our group level, you know, the little mini pipelines that we’re building, the little databases that we’re building and the views that we’re materializing. It’s the same approach that the company is using. But, you know, I wish I could answer all of it.

Eric Dodds 48:05

Yeah. Yeah, no, we’ll have someone from maybe the data engineering team, if you’d make an intro for us, just because I think it’d be so interesting to hear. Anytime we talk with companies who are ingesting significant amounts of outside data and combining it with internal data, those are always fascinating pipeline conversations. Well, to close this out, could you just tell us maybe, since you are sort of delivering data products, and less on the pipeline side, could you just tell us about maybe one of the data products you’re delivering at FanDuel right now?

Dave Melillo 48:36

Sure. We do a bunch of regulatory reporting, which again, doesn’t sound very interesting. But regulators are, they’re mostly accountants by trade. So they’re people who know numbers, and they hold us very accountable. Let’s just say that right? Because what’s at stake at the end of the day is tax money that fuels everything in their state, right? So building regulatory reporting, it’s not as exciting. There’s not as many colors and graphs and charts and dashboards as I’ve been used to in my career. The focus is really more on like, data timeliness, let’s call it, right. All about accuracy and setting up those checks. And it’s also about delivering information in very interesting ways, right, like pushing a CSV file to an SFTP server that people could pick up. Now in my past, I really haven’t done a lot of that, right? Because I’m delivering dashboards to people. I’m delivering people analysis and predictions and things like that. Very rarely am I trying to figure out how I get a 6 million row file scheduled on a daily basis to drop into an SFTP server across multiple states, you know, and regulators on a daily basis. Right. So those are really interesting challenges that we’re solving for and, and that’s where like having a Swiss Army knife tool, like Databricks is super, super helpful because anything that I find online on Stack Overflow about, you know, building those types of pipelines, I can repurpose immediately, and then start deploying in our organization.

Dave Melillo 50:12

So yeah, it’s solving those less sexy–it sometimes feels like archaic types of information–but it’s all about knowing who your audiences are, right and years and things like that. And accountants, they want the line level data, right. There’s no ifs, ands, or buts, you can’t give them a fancy chart, you can’t give them summary information. And on top of it, it has to be accurate, or else you’re going to be spending more time doing reconciliations than you will delivering the product that you signed up for. Right? That whole regulatory reporting pipeline has been really interesting for me, and it feels a little unnatural, about like what I’m doing. But you know, everything changes when your audience changes.
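The regulatory workflow Dave outlines (line-level data, accuracy checks, then a file drop) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not FanDuel's pipeline: the function names, the `txn_id` and `amount` fields, and the reconcile-before-send ordering are all hypothetical. The actual SFTP transfer is left as an `upload` callable supplied by the caller, so the sketch stays within the standard library.

```python
import csv
import io
from decimal import Decimal


def build_line_level_csv(rows, fieldnames):
    """Serialize line-level records to CSV text, one output row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def reconcile(csv_text, amount_field, expected_rows, expected_total):
    """Re-read the file and confirm row count and amount total match the source."""
    count = 0
    total = Decimal("0")
    for record in csv.DictReader(io.StringIO(csv_text)):
        count += 1
        # Decimal avoids the rounding drift that floats would introduce in money sums.
        total += Decimal(record[amount_field])
    return count == expected_rows and total == Decimal(expected_total)


def deliver(rows, fieldnames, amount_field, expected_total, upload):
    """Build the file, reconcile it, and only then hand it to the upload callable."""
    csv_text = build_line_level_csv(rows, fieldnames)
    if not reconcile(csv_text, amount_field, len(rows), expected_total):
        raise ValueError("reconciliation failed; file not delivered")
    upload(csv_text)  # hypothetical hook, e.g. a function that writes to an SFTP server
    return csv_text


if __name__ == "__main__":
    # Hypothetical sample rows; amounts are kept as strings to preserve exact values.
    sample = [
        {"txn_id": "1", "amount": "10.50"},
        {"txn_id": "2", "amount": "4.25"},
    ]
    deliver(sample, ["txn_id", "amount"], "amount", "14.75", upload=print)
```

Checking the file against an independently computed total before it leaves the building reflects the point made above: with line-level regulatory data, an inaccurate file costs more time in reconciliation than a late one.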

Eric Dodds 50:57

Yeah, I love it. I think it’s really fun for me, and I hope our audience as well to hear about a different kind of data product on the regulatory side, because the requirements are very different than sort of, maybe summary data around usage, you know, where margin of error is acceptable on some level, because, you know, customer data is a little bit messy, and, you know, there are sort of outliers and other things like that. But line level data on regulatory that is critical to your business continuing to function is a very different type of product to deliver. So super interesting to hear about that.

Dave Melillo 51:29

So I’ve very much made comparisons to healthcare, right? It’s like, you know, we’re all you know, if you can, like messing up insurance claims is really affecting people’s lives. Like, same thing here. It might seem trite, but like this tax money is really important to these states, especially now with the state of you know, the world, right. So, you know, there’s a lot more riding on it, it’s a lot less directionally accurate, right. And that’s a word that I’ve used throughout my career to save my butt is directional accuracy. And I can’t use that word anymore.

Eric Dodds 52:03

Sure. Very cool. Well, Dave, this has been a really wonderful show, we’d love to have you back on, we’d love to get someone from the data engineering team to hear about your pipeline. So we’ll be in touch. And thank you again for your time.

Dave Melillo 52:16

Oh, thank you guys. This has been wonderful. And I really appreciate you having me.

Eric Dodds 52:20

Well, that was a fascinating conversation. I didn’t get my question answered, but maybe we’ll get someone from data engineering on the show. But I love it when we have a show where we get on a topic that everyone is passionate about and has opinions about and we can really dig in on it. I think one of the interesting things to me from the conversation was the comment around data portability. So there’s all sorts of terminology. So you know, data mesh, and connected stack and all these different things. And the concept of data portability, I think, is a really, really helpful way to think about, you know, where things are headed, at least as far as we can see now. So that was what stuck out to me.

Kostas Pardalis 52:58

Yeah, I mean, I think you’re not alone in this. Eric, I also didn’t manage to ask that many questions about fantasy games, but it doesn’t matter. I think I really enjoyed the conversation. We had a lot to chat about with Dave, about data platforms, and what the future will look like. So that was super interesting for me, and I really liked his opinion and his views and all these things around how we are going to be using data in the future. It’s fun also to interact with people who really understand the products. And they don’t agree with the marketing terms that we come up with while we try to market new products. So this whole thing about reverse ETL and what’s the right name of it? I think he puts it very well with the term data portability. And yeah, I’m really looking forward to talking with him again. And hopefully next time, I’ll manage to ask my questions on fantasy gaming.

Eric Dodds 53:54

Yes, we can have it. That’d actually be fun, to have an episode where we cover topics we don’t know about. All right. Well, thank you again for joining us, subscribe on your favorite podcast app. You’ll get notified of new episodes every week, and we’ll catch you next time.

Eric Dodds 54:09

The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at RudderStack.com.
