This week on The Data Stack Show, Kostas and Eric are joined by Ioannis Papapanagiotou, senior engineering manager at Netflix. Ioannis oversees Netflix’s data storage platform and its data integration platform. Their conversation highlights the varied responsibilities of his lean teams, how they utilize open source technology, and how they incorporate change data capture solutions.
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome back to The Data Stack Show, we have an extremely exciting guest today, Ioannis from Netflix. And if you’re in the world of data and technology and open source tooling, there’s a really good chance that you’ve heard of Netflix because they have so many projects that have become extremely popular, and have really done some amazing things. So this is going to be a very technical conversation, probably some of it over my head. But it’s rare that we get a chance to talk with someone who’s had such close involvement with projects like this. Kostas, what are the burning questions in your mind that you want to ask Ioannis about working on the data side of Netflix?
Kostas Pardalis 00:48
Yeah. First of all, what I think is going to be very interesting is that Ioannis is actually managing two teams over there. One is dedicated to anything that has to do with media storage, and the other team is dedicated to syncing data between the different data systems that they have, for more kinds of analytics use cases. So it’s very interesting. I mean, it’s quite rare to find someone who has experience in these kinds of diverse use cases of manipulating and working with data, and it will be amazing to hear from him what commonalities there are, and what differences. So that’s one thing, which I think is super interesting. The other thing, of course, is scale. I mean, Netflix is a huge company dealing with millions of viewers; they have very unique requirements and needs around the data that they work with, something that’s not easy to find in other companies. So I think it would be great to hear from him what it means to operate data teams inside such a big organization, not necessarily in terms of the people involved, but at least in terms of the data that needs to be handled. And of course, all the use cases and the products that they build on top of data. I mean, everyone knows and talks and makes comments about the recommendation algorithms that Netflix has, for example, and of course all of these are driven and supported by the things that Ioannis is managing. So I think it’s going to be super interesting to learn from his experience.
Eric Dodds 02:18
I agree. I think the other thing that will be interesting is to hear about some of the tools they use that people might not be as familiar with; Netflix gets so much attention for the things they’ve built, so it’ll be great to learn more about some of the more common tools they use internally. So why don’t we jump in and start learning about Netflix.
Eric Dodds 02:40
We have a really special guest today who I’m so excited to learn from: Ioannis from Netflix. He’s a senior engineering manager, and we’re going to learn about the way that they do things at Netflix and the way that they build things. Ioannis, thank you so much for taking time to join us on the show today.
Ioannis Papapanagiotou 02:56
Thank you so much for having me, Eric and Kostas.
Kostas Pardalis 02:59
Yeah, it’s great to have you here today, Ioannis. For me, there’s another reason, actually; it’s not just the stories that you can share with us from Netflix, which obviously are going to be interesting for everyone. For me, it’s also important because you are the first Greek person that we’re having on this show. You’re another expat from Greece, as I am, so I’m doubly happy today for this episode. And I’m really looking forward to discussing and learning about your experience at Netflix.
Ioannis Papapanagiotou 03:29
You know, I found out over the last few years that there are a lot of people working in the data space who are Greeks. I don’t know if this is for a specific reason, or maybe they had great faculty members in Greece in the data space that resulted in them working in this area. But you know, there are a lot, so maybe you’ll have more in the near future.
Kostas Pardalis 03:45
Yeah, actually, that’s very interesting. I don’t know exactly why this is true. But first of all, there’s quite a big team at Redshift; there’s kind of a Greek mafia there in the engineering of Redshift. Snowflake also has quite a few Greeks working there. And there are many more. I mean, you know more about the academic space here, but from what I know, there are also quite a few Greeks in the academic space in the United States working on databases. So it looks like the Greeks have a thing for databases. Databases and DevOps also; they are quite well known for being good SREs. That’s also interesting. Yeah. Cool. So let’s start. Ioannis, can you do a quick introduction? Tell us a few things about you and your background. What did you do before you joined Netflix? And of course, what is it that you’re doing at Netflix today?
Ioannis Papapanagiotou 04:42
Yeah, absolutely. I’m a little different than most people in the Bay Area. I joined Netflix about five, five and a half years ago out of academia. I was a faculty member at Purdue University before I came to Netflix. Well, before that, I was a software engineer as well. So for me, coming to the Bay Area about five and a half years ago was a new thing. I joined Netflix as a senior software engineer, working mainly on the key value stores. And over the years, for the last three years, I have been managing parts of the infrastructure, starting from the key value stores, the NoSQL databases, and then recently, about a year ago, moving to a new organization called Storage and Data Integrations. Our focus is building integration solutions and storage solutions for whatever the company needs us to provide.
Kostas Pardalis 05:36
Yeah, that’s very interesting. I know that there are actually two parts and two teams under your organization, let’s say: one that is working on the data storage platform, and one which is the data integration platform, and you have two separate teams. And if I remember correctly, and correct me if I’m wrong, the data storage platform is responsible for having an overall storage solution for the company, which includes how you also store your media, which is, of course, a very big thing at Netflix, right. And then there is also the data integration platform, which from what I understand works mainly on how the different data storage systems can exchange and sync data between them. Is this correct?
Ioannis Papapanagiotou 06:19
Yeah, that’s correct. I’m surprised with the background research you’ve done. So yeah, that’s definitely correct. On the storage side, you know, we’re ingesting more and more media assets out of productions, and those productions happen anywhere around the globe. So my team is responsible for some of the storage and transfer solutions, and also for how we store the data; most of the data ends up being stored, in its final form, encrypted on S3. So my team has been, in short, responsible for services on how we transfer and store the data, how we index the data, how we encrypt the data, from the time it arrives at Netflix up to the time it is stored in an object store in our infrastructure. Those systems, of course, have seen a great evolution in the last few years as the company has grown a lot and become one of the largest studios, and that’s one of the reasons we actually started building solutions in the space. And the other side of the team, the integration team, is mainly focusing on building integrations between different data systems, like Cassandra, AWS Aurora, Postgres, MySQL, and other systems, like sending the data to Elasticsearch or sending the data to a data warehouse. Both teams actually evolved in the last few years out of the needs of the company to invest in this space. An example: we were building a lot of services, especially on the content side, where we were using one source of truth, and then we had another database to, for example, index the data, like Elasticsearch. And we were talking about ways that we can effectively synchronize those data systems, and a few years back there was not a good solution, you know, just using some scripts or using some jobs over the weekend.
And then we thought, what’s the best possible way for us to build a solution that will synchronize those two systems. And then eventually, as we evolved as a team, we started supporting more systems, like moving data from Airtable and Google Sheets to data warehouses, and also moving data from data warehouses to key value stores, for example, for our machine learning team to do online machine learning. So yeah, this is how we effectively formed those two teams. New teams, great, exciting areas to work on.
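The synchronization problem Ioannis describes, keeping a downstream index in step with a source of truth by consuming an ordered stream of change events rather than re-copying data with periodic jobs, can be sketched in a few lines. This is a toy illustration, not Netflix’s implementation; the event shape and all names here are hypothetical.

```python
# Minimal change-data-capture sketch: apply an ordered change log to a
# downstream "search index" so it stays in sync with the source of truth.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    op: str                 # "insert", "update", or "delete"
    key: str                # primary key in the source database
    value: Optional[dict]   # new row image (None for deletes)


def apply_event(index: dict, event: ChangeEvent) -> None:
    """Apply one change event to the downstream index."""
    if event.op in ("insert", "update"):
        index[event.key] = event.value
    elif event.op == "delete":
        index.pop(event.key, None)


def sync(index: dict, change_log: list) -> dict:
    # Events must be applied in log order; otherwise a stale update
    # could overwrite a newer one.
    for event in change_log:
        apply_event(index, event)
    return index
```

The key property of this pattern, compared to the weekend batch jobs mentioned above, is that the downstream system converges continuously: each committed change in the source eventually produces exactly one corresponding mutation in the index.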
Kostas Pardalis 08:42
Yeah, that’s super interesting, actually. Before we move into more details about the technical side of things: from an organizational point of view, these two teams, to someone from the outside, sound like they are working on quite different things. I mean, okay, it’s data again, but very, very different types of data, and I assume that inside the organization the consumers of this data are also different, right? So how does this work in terms of managing these teams? Are there, on a technology or organizational level, similarities between the problems being solved? What do they have in common, and what’s the difference between the two? This is very interesting for me because it’s quite unique, and of course it has to do with being Netflix, where you have a studio and you have the scale of Netflix. But usually, you know, I meet data teams that work mainly on database systems, more structured data. So I would like to hear from you what the commonalities between the two problems are, and also the differences and the challenges you have seen in managing these two teams.
Ioannis Papapanagiotou 09:54
Yeah, that’s a great question. Both teams have evolved from the needs of the company in the emerging content space, so both teams have been focusing a lot on the content space. And while the technology they are building is different, they have a few things in common. The first and most important is that they’re solving immediate business problems, right, whatever the status of the team and the evolution of the team. And the second aspect of both teams is that they’re building what we call high-leverage data platform solutions: solutions that can be used by many different teams. Now, in regards to the other aspect of your question, about the challenges in leading two teams: there are challenges, of course, but we have spent a significant amount of time in hiring and retaining really amazing talent in the team, and eventually that makes it a little easier for the manager to manage the team. And of course, over the last few years, we have evolved and shared some practices within the teams in terms of the way we do program and project management, and we have found some interesting efficiencies and organizational structures as a group, which effectively makes everyone’s life easier, right. The other aspect is that, looking at pretty much the identity of a Netflix engineer, which is usually a senior software engineer, we’re hiring people who are great communicators in terms of building products, but also people who are great in how they deal with customers, how they deal with partners, and how they work cross-functionally. So that’s why, usually, the management aspect becomes a little easier.
And that’s one of the reasons I would say that my job has not been extremely hard, managing that wonderful team.
Kostas Pardalis 11:40
That’s great. That’s great to hear. So can you share a little bit more about the structure of the teams? First of all, is the structure between the teams identical, or are there any differences? And can you share a little more about their size, the roles, and so on, just to give us an idea of how a company like Netflix has evolved in managing these kinds of problems?
Ioannis Papapanagiotou 12:13
So one of the teams is, if I recall correctly, about 12 engineers or so, and the other team is about five engineers. So on the storage space, we have invested a little more than on the integration space in terms of size. But to some extent, both teams are working cross-functionally with many other departments in the company. So you may think that we are building a product as a team, but we’re not; we’re usually building products in collaboration with many other partners. And again, this is an artifact, as I said, of the need of many teams to jump in and solve those business problems, and of the fact that we were a little more lean and agile in how we do those practices. And of course, there are many other teams at Netflix that, as you said before, work on a specific problem: one team may do the warehouses, another team may do the databases, another team may do streaming platforms, and so forth. So yeah, we are, I guess, a bit broader now, to some extent.
Kostas Pardalis 13:13
It’s very interesting. I mean, it’s surprising for me how lean the teams are for the size of the company; I find this very, very interesting. Cool. So my interest is more around the data integration team, to be honest, also because of my background and the stuff I have done in my working life. But before we move there and discuss it further, can you give us some more technical information about the data storage platform that you have, especially for the media? I think it’s a very unique problem that you’re solving there on a global scale; as you said, you have production teams all over the world. And I think it would be great to know a little bit more about what kind of technologies you are using, what the use cases are, how the teams interact with this data, and what the life cycle of this data is.
Ioannis Papapanagiotou 14:08
Yeah, absolutely. So, as you said, we have productions anywhere on the globe, and we ingest data into our storage infrastructure in a few ways. One is people uploading the data to Netflix: we provide some sort of an upload manager, and there is a UI that people can use to upload the assets. We also plan to provide, very soon, a file system in user space where people can store the data, and effectively the data will be backed by the cloud; you can think about it like Dropbox for media assets, I would say. Then we also have ways that people can upload the data through different APIs. And finally, productions can even upload the data through Snowball devices, those big suitcases that AWS provides, where eventually the data is stored in S3. But in all these cases, at the end of the day, the data is encrypted and stored in a specific format that we use on AWS S3; that’s where it finally ends up. And while the data is being transferred, we’re also pretty much indexing each of these files. So we know what the size of the file is, what the metadata of the file is, and then we can even group files together and create file sessions, or group files together into what we call, on the media industry side, assets, where an asset can be, let’s say, a movie, right, represented by many different files. And then you have a number of other services that use those files and, to some extent, their metadata, to serve any kind of business need they have. And this is, at a high level, how the storage is organized.
As the storage platform evolves, we offer some other products as well. We offer file systems as a service; there are places in the company that also use the AWS file systems, but we also offer a file storage service, which is based on Ceph. And as I said, our team is also managing the way we store the data on S3 as well.
Eric Dodds 16:14
Ioannis, one question for you on file storage; I thought of this when you mentioned groups of assets, and I may be thinking about this incorrectly. But say you serve ads in a dynamic way on certain content: how are those files managed? Because that can change depending on the context of the user. Are the ad assets and the actual media assets of the show or movie or piece of content that the user is consuming stored together? And if not, are there challenges around delivering those in a dynamic way?
Ioannis Papapanagiotou 16:58
Yeah, that’s a good question. First of all, Netflix does not offer advertisements on the platform. But the area we have been focusing on is more how we ingest media assets into Netflix, not how we stream the media assets out. The streaming side is handled by a different team, the Open Connect organization; we have caches around the globe, and when you click to play a movie, you effectively get the content from such a cache. Our team is mainly focusing on the span from the time the data arrives from production to Netflix up to the time we do any post-production activities, like encoding and so forth.
Eric Dodds 17:41
Interesting. And one follow-up question to that: so, five-plus years at Netflix, has the file format or compression component of the actual assets themselves changed in that time period? You know, with heavy assets like video, compression and file format are concerns. What changes have you seen from that standpoint, and has it affected the way that you store that data?
Ioannis Papapanagiotou 18:13
You know, I am not sure; that would be the honest answer. I don’t think we’re compressing the data right now in the way we store it. Of course, the files come in a particular format, depending on the resolution they have been encoded at, and so forth, or how they were captured by the cameras. But I don’t think we’re actually compressing the data right now before we store it in object storage; this is probably something we should be looking at, right? But I’m also not sure about the efficiencies we can get in terms of compression. So yeah, I’m not sure about that area, to be honest.
Kostas Pardalis 18:53
So Ioannis, if I understand correctly, your part of the work, in terms of the life cycle of a production at Netflix, starts from the production, I mean, when the content is actually captured, and it ends when it has gone through production, and probably also post-production, and then you’re done, right? Then another team is responsible for taking this content and figuring out how it has to be streamed and delivered to the end user. Is this correct?
Ioannis Papapanagiotou 19:23
Yeah, that’s correct. But it’s not only the productions themselves, and of course the productions can do whatever they like as well, in some cases. There can also be post-production vendors that may use our ecosystem: VFX artists can use our systems, or the animation space, or other post-production vendors. So it can be used by different partners, I would say. And a lot of that is also, to some extent, abstracted from us, because they’re actually using some of the higher-level business-logic applications that the company has built. Usually it comes to us when a file arrives at Netflix and enters what becomes, for us, an indexable ecosystem to use.
Kostas Pardalis 20:07
So, a bit more of a technical question on that. You mentioned two things about these assets: one is indexing and the other is encryption. So let’s start with indexing. When you say indexing, are you talking about indexing the metadata of these assets, or do you also perform some other analysis on the video itself, so that it can be searchable?
Ioannis Papapanagiotou 20:31
So yeah, that’s a question of where we sit: we’re a platform team, like a low-level platform team. For the file itself, we get metadata; we keep an ID for each of the files, and through that ID we can characterize the files themselves. With that ID we keep a structured format of metadata about the file itself. So, for example, when you want to see how many files have been stored for a specific production, you can actually use that ecosystem to derive those statistics. And then, after that, we actually send the data to S3, and we create the objects. We have our own key management service that effectively takes the data and encrypts it, and then we store it on S3 eventually. Yeah. And then we also attach some form of metadata to the objects we store in S3 as well.
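The flow just described, assigning each file an ID, keeping a structured metadata record, and grouping files into assets, can be sketched roughly as below. The field names and grouping scheme are illustrative assumptions, not Netflix’s actual schema.

```python
# Toy sketch of a file-indexing service: build a metadata record per
# file, group records into an "asset" (e.g. all files of one movie),
# and derive per-production statistics from the records.

import hashlib
import uuid


def index_file(name: str, data: bytes, production: str) -> dict:
    """Build a structured metadata record for one uploaded file."""
    return {
        "file_id": str(uuid.uuid4()),
        "name": name,
        "production": production,
        "size_bytes": len(data),
        # A content checksum lets us verify integrity after transfer.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def group_into_asset(asset_name: str, records: list) -> dict:
    """Group file records into one asset, the way a movie is
    represented by many underlying files."""
    return {
        "asset": asset_name,
        "file_ids": [r["file_id"] for r in records],
        "total_bytes": sum(r["size_bytes"] for r in records),
    }


def files_for_production(records: list, production: str) -> int:
    # The kind of per-production statistic mentioned above.
    return sum(1 for r in records if r["production"] == production)
```

In a real system the records would live in a database and the statistics would come from queries against it; the point here is only the shape of the index, not its storage.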
Kostas Pardalis 21:22
Okay, so where does this indexing happen, and where are these indices stored? Is this part of S3 again, or do you have a different kind of technology where this indexing happens, which is then exposed to the users for searching and whatever other use cases you have?
Ioannis Papapanagiotou 21:38
Yeah, so in terms of the files, we actually have a service that does that. It used to be backed by a graph database, based on TitanDB, or rather its more modern successor, JanusGraph; we recently replaced that with CockroachDB. And then there are some indexing capabilities on top of that through Elasticsearch. And for the metadata about how we store the data effectively on AWS S3, we’re actually using a Cassandra cluster, and of course we also have an Elasticsearch cluster for indexing the data.
Kostas Pardalis 22:10
Oh, that’s very interesting. How did you decide to use CockroachDB, by the way?
Ioannis Papapanagiotou 22:17
I mean, there are some qualities of CockroachDB that we appreciated. As we want to effectively make some of these services more global, the ability to have distributed transactions became fairly important for the services. So we thought that Cockroach, as more of what we call a NewSQL database, provides those capabilities. It was also interesting for us because it speaks the Postgres wire protocol, so it felt easier for us, you know, for people who understand SQL. So it became an easy transition for us from the TitanDB interface, which we initially thought was great, but then we effectively understood that the level of nesting between different files is not that deep. So that’s why, eventually, we decided to go to CockroachDB.
Kostas Pardalis 23:03
Oh, that’s very interesting, you are like one of the … I’m aware of CockroachDB, and I’m following their development, but it’s very interesting to hear from someone who’s using it in a production environment. So that’s why I wanted to ask; I didn’t want to miss the opportunity. That’s great. So, second question, because as I said, one part is the indexing and the other is the encryption. How important is encryption, and how do you perform encryption efficiently at such a large scale? Because I assume that if we’re talking about uncompressed media files, we’re talking about huge volumes of data. So how does this work, and what kind of overhead does it add to the whole platform?
Ioannis Papapanagiotou 23:48
You know, it’s definitely a lot of media assets, at the petabyte scale, but at the same time, the speed at which we receive the assets is not as high as you might expect from, let’s say, a direct-to-consumer case, because this is more of an enterprise software side, right. So in that case, speed is less important, though it does matter in many cases, when we have to turn things around pretty fast for a production. For us, it’s important because we want to make sure that we store our data in a secure way, and that even the access mechanism for the data is fairly controlled, right? We make sure that whoever accesses the data has the right to do that, so the data cannot be viewed by anybody external. That’s why we focus a lot on the encryption side of the data. And of course, we have different formats and encodings in which we store the data, and each one of them is encrypted as well.
Kostas Pardalis 24:46
So Ioannis, just to understand a little bit better: have you implemented this as a file system that itself implements encryption, or do you encrypt the object itself, on top of the file system?
Ioannis Papapanagiotou 25:01
Yeah, so we encrypt the actual object itself that is being stored on AWS. But the way we present the data to a user could be through a UI, in a file format, or through a file system in user space, and the data, the way users see it, isn’t going to appear encrypted, right? You’re going to think you’re using a normal file system and doing normal interactions, as you would with, let’s say, an NFS mount on your laptop, right? You just see the data. But of course, in order to get the data, you have to have the proper privileges and the proper access to the proper project, and so forth. So from the user’s perspective, let’s say from the artist’s perspective, they’re just seeing the file system, but the way the actual data is stored on the cloud is encrypted, right?
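The pattern described here, objects encrypted at rest with per-object keys while authorized readers see plaintext transparently, can be illustrated with a toy sketch. The "cipher" below is a throwaway SHA-256 keystream for demonstration only; a real system would use an authenticated cipher such as AES-GCM, and the per-object data key would itself be wrapped by a key management service rather than handled in the clear.

```python
# Toy sketch of per-object encryption at rest. Illustrative only:
# the keystream cipher here is NOT secure and stands in for a real
# authenticated cipher.

import hashlib
import secrets


def _keystream(key: bytes, length: int) -> bytes:
    # Derive a pseudo-random byte stream from the key (counter mode
    # over SHA-256), purely for illustration.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_object(plaintext: bytes):
    """Encrypt with a fresh per-object data key; in a real system the
    key would be stored wrapped by a KMS, not alongside the object."""
    data_key = secrets.token_bytes(32)
    stream = _keystream(data_key, len(plaintext))
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
    return ciphertext, data_key


def decrypt_object(ciphertext: bytes, data_key: bytes) -> bytes:
    # XOR with the same keystream restores the plaintext, which is
    # what an authorized mount or UI presents to the user.
    stream = _keystream(data_key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

The transparency Ioannis describes falls out of putting `decrypt_object` behind the access-controlled read path: storage only ever holds ciphertext, while an authorized reader receives plaintext without any extra steps.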
Kostas Pardalis 25:46
Oh, that’s super interesting. And the user management that you have built in, like access management and all that stuff: do you use technologies from AWS to do that, like IAM or something? Or is it something that you have built internally?
Ioannis Papapanagiotou 26:02
We use IAM, yeah, but of course there are a number of internal services that our information security team has built that are specialized for the Netflix business itself.
Kostas Pardalis 26:19
Right. Okay. I think that’s enough questions about the data storage platform. As I said, I didn’t intend to ask so many questions around that, but it was super interesting, and every one of your answers actually brought up more questions. So let’s move forward. I’ve seen that Netflix is quite active in terms of open source, and when I say active in open source, I mean both in how you adopt open source internally, and I think we’ve heard examples of such technologies already, but also in contributing back to the open source community. What’s the reason for that? Okay, I understand why you use the tools, although I’d like to hear your opinion on that too, on why you prefer open source solutions. But what’s the relationship that Netflix has with open source? Why did you decide to do that? And what’s the value that you see, not only as an organization, but for you personally?
Ioannis Papapanagiotou 27:21
Yeah. So I’m going to focus a little on the data platform perspective for this question. Netflix kind of follows a balanced approach. There are a number of systems that, of course, we build in house. We’re also users of many open source projects, like Apache Cassandra, Elasticsearch, Kafka, Flink, and many others. And we have open sourced a number of our own solutions, like Metacat, Iceberg, EVCache for our caching solution, and Dynomite, which is a proxy layer for Redis. We’re also using vendor solutions, like a lot of the database offerings from AWS. We have invested a lot into both open sourcing some of our code and supporting the open source community. As an example, we have a healthy number of Apache Cassandra committers in our database team. And of course, there are many projects where we support the community as we use those products, both because we use them and want to make sure that if there’s a bug, we can fix it, and because we want to give back to the open source community. There are many reasons that we also do open sourcing, but I think fundamentally one of them is hiring, right? You can get really great engineers: when they contribute to your project, you tend to know them better by the way they interface, not only their technical skills, but also how they collaborate, how they communicate, and so forth. But there are also other benefits, in my opinion. For example, when somebody open sources a project, and then maintains that project properly and in the open, it becomes like an identity, right; you tend to have these external identities, so to some extent you make yourself marketable in the future as well.
So that’s why we’re excited about open sourcing some of our projects. Another reason is that, of course, we run in production systems that exist in the open source space, and for many of these systems we want the community to contribute to them, evolve them, and make them better, so that they can fix bugs, we can fix bugs that we see, and others may hit similar problems. The more our open source projects are adopted by the community, the more commonalities we’re going to have between the different companies that use the same open source projects. And of course, as I said, there are a number of projects that we have donated to the Apache community or to cloud foundations and so forth, so that we can effectively enlarge the community beyond just Netflix engineers working on a project.
Kostas Pardalis 29:53
Yeah, I think you had some very good points around open source and why it’s important to the company, and that’s really interesting to hear, especially what you said about two things: one, hiring is important, but the other is also what you mentioned about collaboration. That’s super interesting. So, in terms of the projects that you have contributed so far, do you know which one has been the most successful in terms of adoption by the open source community, so far, from Netflix?
Ioannis Papapanagiotou 30:26
I think there are a number of projects that have been successful, and to be honest with you, most of the most successful ones I was not involved in. What I’m thinking right now, I think Spinnaker has been fairly successful as a multi-cloud continuous delivery platform. Metaflow is another recent example; it was recently spoken about publicly. So these are the two main projects that come to my mind right now that have been a big success recently.
Kostas Pardalis 30:55
Your favorite one that came out of your teams?
Ioannis Papapanagiotou 30:59
My favorite one… So I will say I have two favorite ones, because I was managing those teams, the key-value stores at Netflix. One of them is EVCache, which is our caching solution that we use at Netflix. And the second one is Dynomite, which is a proxy layer we use for some of the, again, you know, key-value stores that we have here at Netflix. When I joined Netflix, I was part of the Dynomite team for about two-and-a-half years, serving this project and contributing back to the open source. And I would say that it was really, really exciting to work and collaborate with a number of companies and open source users.
Kostas Pardalis 31:32
Hmm, that’s interesting. What was the initial need that made you build Dynomite? You said Dynomite is the cache on top of Redis? Right?
Ioannis Papapanagiotou 31:45
Yeah, because I think back then, fundamentally speaking, Redis was a single node system. I think later on, with Redis Cluster, it became a multi node system, but it was more like a master-slave system, or primary-secondary system, which is great, but if you think about the CAP theorem, it focused a lot on consistency and partition tolerance. Whereas Dynomite at Netflix is mostly focused on the availability side, because what makes sense for the business is to make sure that we achieve something like seven nines of availability. So that’s why we wanted a system that would still have the properties of Redis, which is really amazing in terms of a key-value store with advanced data structures, and all the amazing work that Salvatore Sanfilippo has done, but still make it highly available. And that’s why we chose to build that kind of proxy layer above Redis. There are also a few other things: we had been working in the Cassandra space for many years, again, another AP system, and we had substantial experience with the way the Dynamo protocol works. So a lot of the sidecars and the components of the ecosystem, and the automation, were pretty easy to adapt to Dynomite based on this architecture.
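[Editor’s note: to make the availability trade-off concrete, here is a toy Python sketch of the Dynamo-style idea Ioannis describes. Plain dicts stand in for Redis nodes, and the class and rack layout are purely illustrative; this does not reflect Dynomite’s actual implementation. The core point: every write fans out to one node per rack, so reads survive node or rack failures.]

```python
import hashlib

class Node:
    """Stand-in for a single Redis instance (one per rack, in Dynomite terms)."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

class DynamoStyleProxy:
    """Toy illustration of Dynomite's core idea: replicate each write to one
    node per rack, favoring availability over strict consistency (the CAP
    trade-off mentioned above)."""
    def __init__(self, racks):
        self.racks = racks  # list of racks, each a list of Nodes

    def _owner(self, rack, key):
        # Token-aware placement: hash the key onto one node within the rack.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return rack[h % len(rack)]

    def put(self, key, value):
        # Fan the write out to the key's owner node in every rack.
        for rack in self.racks:
            node = self._owner(rack, key)
            if node.up:
                node.data[key] = value

    def get(self, key):
        # Read from the first rack whose owner node is alive and has the key.
        for rack in self.racks:
            node = self._owner(rack, key)
            if node.up and key in node.data:
                return node.data[key]
        return None
```

Even if an entire rack goes down after a write, the read still succeeds from a replica in another rack, which is the "availability side" of the design.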
Eric Dodds 32:57
Ioannis, a question on internal projects that’s sort of more general, not necessarily about specific projects, but do you, what is the process like of deciding to undertake a project like Dynomite? Do you have lots of conversations about those things internally? And then, as a follow up, are there lots of things that you talk about that you don’t end up building?
Ioannis Papapanagiotou 33:25
You know, I think one of the great things about Netflix is the fact that a lot of the decision making is happening at the software engineering layer, where we have this notion of informed captains. Usually, the informed captain brings up a business use case and why we need to build a product or a project, and then tries to convey that to other partners and make sure that there’s alignment that this will provide substantial business value to the company. Then they continue building the project to some extent and try to showcase, maybe through a prototype, the value this project is going to have for the company. And then it takes some sort of natural path, with the leadership team funding the project and making it a successful project within the company.
Kostas Pardalis 34:11
So Ioannis, I mean, from your experience so far with all these open source projects that you have published at Netflix, and considering that many of these projects are the outcome of very specific and large-scale problems that Netflix has, I’m pretty sure that there are many people out there, like all the data engineers, who are dealing with probably similar problems, but not at the same scale, right? So what’s your advice for the people out there who learn about these technologies: how should they use and adapt these technologies to the scale of the problem that they have? Is this something that you have seen, or have communicated about with the communities out there? And what do you think is important for someone to keep in mind when using all these projects that Netflix is maintaining right now?
Ioannis Papapanagiotou 35:00
Yeah, that’s a good question. I mean, I have found it fairly challenging, even myself, to really identify the right source of information for it all. So I understand, when people see many companies, including Netflix, open source a number of projects, the question is which one a person may want to invest in. You know, to be honest, in many cases, some of these projects are built based on the advanced needs of a specific company. So if I were starting new, what I would propose is to first understand the problem space before going deep into a specific solution. Again, unfortunately, I have not really found a great description of our space, which is the data platform, other than, I’ll say, the most recent post by Andreessen Horowitz about the high-level architecture of data platforms. The second step would probably be to identify a project that is in an interesting area and maybe has a healthy number of contributors, so someone can collaborate and grow by learning from other people that are more experienced. And of course, that project does not have to be a Netflix project. But, as I said, if somebody is interested in a Netflix project, there are a few of them that have a very healthy community around them. One of them, as I said, is Spinnaker, the multi-cloud continuous delivery platform. And of course, there are other projects that Netflix has been using, where we have been making a fair number of contributions to the infrastructure, and many other projects that other companies or other entities have built.
Kostas Pardalis 36:32
That’s some great advice, I think. Cool, thank you so much for that. So moving forward, let’s chat a little bit more about the data integration platform. Can you describe in a little bit more detail what the data integration platform does? What’s the problem behind it? Why is it a problem at Netflix, and what are the solutions that you have come up with for these problems?
Ioannis Papapanagiotou 37:02
Yeah, so the data integrations team is, I would say, a small but very talented team, which effectively focuses on building a number of integrations. The formation of the team was initially based on the fact that we wanted to build some solutions with which we would be able to keep multiple data systems in sync. So we started investing in building change data capture solutions and connectors, you know, for relational databases, like Postgres, MySQL, Aurora, and most recently, about a year ago, we also started investing in the NoSQL space, like Cassandra. The latter is a little more complicated, because it has those characteristics of a multi-master, eventually consistent system. Actually, one of my team members gave a talk recently at QCon, where he spoke in detail about those, for people who are quite interested in that. And, of course, we have written a few blog posts about Delta and DBLog, which is the system that we have built. But at a high level, we were seeing patterns where people had different data systems, and they were trying to solve this problem with some sort of multi-system transactions, which don’t really work, or even with some sort of repair jobs when one of the systems became inconsistent. So we tried to build solutions that would not need to do that, but rather parse the log of a database, send these log events through a streaming system, and then send the data to another system, which is going to be your secondary system.
But as our infrastructure evolved, more services were actually using those database integrations, and effectively we moved towards a more high-level project, which many teams are now working on, which we call the data mesh. That is more about centralizing a lot of how we move data between different data systems. We also started a different effort, which we call the batch data movement effort, where the focus is more about how we efficiently move data from data warehouses to another system, like a key-value store, so that people can do point queries over there. And, of course, like I said, we’ve also been working on moving data from semi-structured external systems, like Airtable and Google Sheets, to our data warehouse, so we can do business analytics and then build business intelligence on top of that. So this is an area where we have been investing with this team for about the last year and a half.
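[Editor’s note: Delta and DBLog’s internals aren’t spelled out in the conversation, so here is a minimal Python sketch of the general pattern described: parse the database’s change log, track a log position, and replay mutations onto a secondary store. The `ChangeEvent` shape and function names are hypothetical, not the actual Netflix API.]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str              # "insert" | "update" | "delete"
    key: str
    row: Optional[dict]  # new row image (None for deletes)
    lsn: int             # log sequence number, for ordering and dedup

def apply_changes(events, secondary, last_lsn=-1):
    """Replay database log events onto a secondary store, skipping anything
    at or below the already-applied log position (e.g. after a restart)."""
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.lsn <= last_lsn:
            continue  # already applied
        if ev.op == "delete":
            secondary.pop(ev.key, None)
        else:
            secondary[ev.key] = ev.row
        last_lsn = ev.lsn
    return last_lsn
```

Tracking the log position is what makes this safer than the "multi-system transactions" and "repair jobs" Ioannis mentions: replaying the same events is a no-op, so the secondary converges on the primary’s state.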
Kostas Pardalis 39:36
That’s super interesting. Based on my understanding, at least, the most common way of performing CDC is by attaching to the replication log, or whatever log mechanism the database has, listening to the changes that happen there, and then replicating these to another system. That’s, at a very high level, how CDC is usually implemented on a database. But you also mentioned Cassandra, and you said that there are specific challenges there because of the eventual consistency, the multi-node environment, and all that stuff. So can you give us a little bit more information on how the CDC part is implemented on something like Cassandra? And where do we stand on that? Do you have this currently implemented and in use inside Netflix? And what are the differences and challenges compared to the more traditional CDC that we have seen on something like Postgres or MySQL?
Ioannis Papapanagiotou 40:42
Yeah, so the CDC events from a NoSQL database with an active-active setup, like Cassandra, do have some unique characteristics, like data partitioning and replication. Most of the current CDC solutions for these rely on running within the database cluster, providing a stream with duplicate events. Our solution was more focused on leveraging a stream processing framework. So effectively, this involves having a copy of the source database in a stream processor, like Apache Flink, which we use a lot at Netflix. This enables, to some extent, better handling of the CDC streams, since we have the before and after images of the row changes themselves. This is a little different from the traditional CDC that you have seen in, as I said, Postgres, MySQL, MariaDB, and other systems, in which you have a single stream of events that comes from a single node, which is your primary node. So, as I said, it was a little more difficult to do it in Cassandra, because of the challenges of partitioning and replication and so forth, and mainly it was the deduplication of events.
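[Editor’s note: here is a small Python sketch of the two problems just described, assuming a simplified event shape of `(key, write_timestamp, value)`: deduplicating the copies that arrive from each Cassandra replica, and keeping row state in the stream processor so each mutation can be emitted with before/after images. In the real system this state lives in Flink; here a plain dict stands in, and all names are illustrative.]

```python
def process_cassandra_cdc(events):
    """Deduplicate replicated Cassandra CDC events and attach before/after
    images by keeping a copy of current row state in the stream processor."""
    seen = set()   # (key, write_ts) pairs already emitted
    state = {}     # key -> latest value: the "copy of the source database"
    enriched = []
    for key, write_ts, value in events:
        if (key, write_ts) in seen:
            continue  # duplicate of the same mutation from another replica
        seen.add((key, write_ts))
        before = state.get(key)   # previous image, None if the row is new
        state[key] = value
        enriched.append({"key": key, "before": before, "after": value})
    return enriched
```

With a replication factor of three, each mutation shows up three times in the raw stream; the `(key, write timestamp)` identity collapses them into one enriched event.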
Kostas Pardalis 41:52
That’s interesting. You mentioned that another thing you’re doing recently is moving data out of the data warehouse and syncing it into a key-value store, right? Two questions here. One is, what’s the use case behind this? Traditionally, the data warehouse is considered mainly the destination of data, right? We collect the data, do some ETL and all that stuff, put it in the data warehouse, and from there we do the analytics and reporting, and that’s the traditional BI that we have. So what you’re describing goes a step beyond that: you’re actually pulling the data out of the data warehouse and pushing it into a key-value store. Why do you want to do that? And what kind of data is it that you are pushing into the key-value stores?
Ioannis Papapanagiotou 42:43
Yeah, so you know, recently we wrote about this system, which we called Bulldozer as its initial name. As I said, there are many services that have the requirement to do fast lookups of fine-grained data, which needs to be generated periodically. An example would be enhancing our user experience in the application with features that use subscriber preference data to recommend movies and TV shows. And the truth is that data warehouses are not designed to serve those point queries; rather, the key-value stores are designed to do that. Therefore, we built Bulldozer, a system that can move the data from the data warehouse to a global, low latency, fairly reliable key-value store. And we tried to make Bulldozer some sort of self-service platform that can be used fairly easily by users, by effectively just working on a configuration. Behind that, it uses some of our ecosystem, what we call the Netflix scheduler, which is a scheduling framework built on top of Meson, a general purpose workflow orchestration system. And I guess the use cases include things like predicted scores data for members, to help improve their personalized experience; or metadata from Airtable or Google Sheets, for data lifecycle management; or even modeling data for messaging personalization. And when you think about a case like machine learning, right, you can’t really serve that from the data warehouse; you probably need to do that from a key-value store. When you think about the CDC and the Delta concept, which I described, it’s kind of different, because it’s actually the opposite direction, right?
You move data from the primary data store to a secondary, and that secondary could be a data warehouse, whereas Bulldozer is more from the data warehouse to the key-value store.
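[Editor’s note: Bulldozer is described as configuration-driven, so here is a hedged Python sketch of what such a batch move might look like. The config keys (`table`, `namespace`, `key_column`, `value_columns`) and function names are invented for illustration; Bulldozer’s real interface is not public beyond Netflix’s blog post.]

```python
def bulldozer_move(scan_rows, kv_put, config):
    """Config-driven batch move: scan a warehouse table and write each row
    out as a key-value record suitable for low-latency point lookups."""
    key_col = config["key_column"]
    value_cols = config["value_columns"]
    count = 0
    for row in scan_rows(config["table"]):
        # Namespace the key so multiple data sets can share one store.
        key = f'{config["namespace"]}:{row[key_col]}'
        kv_put(key, {c: row[c] for c in value_cols})
        count += 1
    return count

# Usage with an in-memory "warehouse" and dict standing in for the KV store:
warehouse = {"scores": [{"member_id": 1, "score": 0.9, "extra": "x"},
                        {"member_id": 2, "score": 0.4, "extra": "y"}]}
kv = {}
cfg = {"table": "scores", "namespace": "score",
       "key_column": "member_id", "value_columns": ["score"]}
moved = bulldozer_move(lambda t: warehouse[t], kv.__setitem__, cfg)
```

The self-service quality comes from the fact that a user only touches the config dict: which table, which key, which columns to project into the store.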
Kostas Pardalis 44:41
And how does this differ from the traditional CDC approach? Do you see it as the same thing, just flipping the destination and the source? Is the methodology the same, or, because you primarily have to pull data out of the data warehouse, do things have to change?
Ioannis Papapanagiotou 45:02
Yeah. So I mean, if you think about a system where you pull the data out of the warehouse, you’re probably going to do that in some sort of batch way. So you’re going to read the data and take it out. Whereas the CDC ecosystem is more focused on real time: parsing the log in real time. Of course, many people ask, what is really the latency I need, right? But in the CDC case, what you need is to get the mutations happening in the database pretty fast, and then move them to another system, because latency matters. On the other hand, when you move data from your data warehouse to a key-value store, the latency of actually moving those batches is important, but it’s not that important. What is important is the latency to access the data from the key-value stores. So fundamentally speaking, the systems are totally different. One of them, as I said, uses a workflow orchestration framework, Meson. Whereas the other one is more about parsing the log, sending it through Kafka, and at the same time also doing enrichments. Because what happens is, once you move the data from a primary source to a secondary, you may want to enrich the data by combining information from different other types of services, which eventually makes the way you design microservices much simpler.
Kostas Pardalis 46:25
That’s super interesting. That’s really, really interesting. So what do you use as a data warehouse at Netflix? I assume it’s some of these popular cloud data warehouses, like Redshift or Snowflake, or do you have something that was built in house?
Ioannis Papapanagiotou 46:45
So I think our data warehouse consists of a large number of data sets that are stored in Amazon S3. We use Druid, Elasticsearch, Redshift, Snowflake, MySQL; our platform supports anything from Spark, Presto, Pig, Hive, and many other systems. So it’s not just one system, it’s multiple systems that we use.
Kostas Pardalis 47:06
Okay, and Bulldozer can interact with all these different systems?
Ioannis Papapanagiotou 47:11
Well, right now, the way we pull the data is out of S3 buckets, in the Iceberg format, and then we send the data to the key-value store. Actually, we use an abstraction over the key-value stores, and through that, it’s sent to the key-value stores, like a caching system.
Kostas Pardalis 47:34
Okay, that’s interesting. So you have the different, what we’d say, data warehouse technologies, like Redshift and Spark and all these different technologies, but at the end, all this data lives on S3 using Iceberg. And Bulldozer comes after Iceberg to pull the data from there and push it to the key-value stores that you want, through this abstraction layer that you have built. Right?
Ioannis Papapanagiotou 48:01
Right. So Bulldozer typically comes after Iceberg, that’s correct.
Kostas Pardalis 48:06
That’s great. So, we have talked about data sources so far that are traditional database systems, or data processing systems like Kafka or Spark. I know that you are very pro-microservices at Netflix. Do microservices also play a role in all these processes? For example, do you consider CDC implemented on top of microservices, pulling data or events from there and moving it around? Or is that something that the team does not work with, or you don’t utilize, or there’s no reason to do that at Netflix, in terms of the architecture in general that you have?
Ioannis Papapanagiotou 48:47
So, you know, Delta, or those CDC events, it’s in the platform. The Delta platform is not just CDC. It’s also the way we do enrichments, and the way we talk to other services to get information has simplified a lot of the way we use microservices. Of course, the microservice collection is kind of a different thing. But if you think about how we usually implement things in terms of the way we communicate with microservices, Delta can simplify parts of that. I’ll give you an example: say, a movie service, where we want to effectively find information from different microservices, like, let’s say, the deal service, or the talent service, or the vendor service. What we used to do in the past was to have this polling system that would actually pull this information from the movie data store, from the deal service, from the talent service, or the vendor service, and then combine them and send them to the derived data source. With Delta, what we have done is simplify this architecture a lot: instead of polling the database itself, the movie data store, we have collector servers that are pulling the mutations from the data store, and then the Delta application itself does the queries to all the services, without really needing to build a polling service. So this architecture, which we also describe in the Delta blog post, has substantially simplified the way we do microservices in some areas of the business, mainly on the content side.
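[Editor’s note: the enrichment step in the movie example above can be sketched as follows. Each mutation captured from the movie data store is enriched by querying the other services (deal, talent, and so on) before it lands in the derived store. The field names, `movie_id` key, and service signatures are hypothetical stand-ins, not Delta’s real API.]

```python
def enrich_event(event, lookup_services):
    """Enrich a captured mutation with attributes from other microservices,
    replacing the old pattern of a separate polling/combining service."""
    enriched = dict(event["after"])  # start from the row's after-image
    for field, lookup in lookup_services.items():
        # Each service is called with the entity id and contributes one field.
        enriched[field] = lookup(event["after"]["movie_id"])
    return enriched

# Usage with stub services standing in for the deal and talent services:
deal_service = lambda movie_id: {"deal": "worldwide"}
talent_service = lambda movie_id: ["actor-7"]
event = {"op": "update", "after": {"movie_id": 42, "title": "Example"}}
derived_row = enrich_event(
    event, {"deal_info": deal_service, "talent_info": talent_service})
```

The simplification Ioannis describes is visible even in this toy: the enrichment logic lives next to the change stream, so no standalone poller has to periodically re-join the services’ data.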
Kostas Pardalis 50:29
Super interesting. I mean, I think we need another episode just to discuss that. And we’ll probably also need a third screen or something, because I think we’re getting a glimpse of the complexity that infrastructure like Netflix’s has. But that’s super, super interesting. So, Ioannis, okay, we’re close to the end of the episode. Two questions for you before the end. One is, you mentioned Bulldozer and Delta a lot. Are any parts of these open source right now? Or do you plan to open source parts of these projects?
Ioannis Papapanagiotou 51:07
I think, you know, the way that most data systems are going to evolve, it’s going to be systems of systems, where you’re going to be using open source components to build the systems. So, as an example, Delta is using underneath Apache Flink and Apache Kafka. And then there’s also the CDC aspect, which we were thinking of actually open sourcing. We haven’t gotten to the stage where we’re ready to open source it, but this is something that we’re seriously considering on the CDC aspect. But, you know, open sourcing the whole platform doesn’t really make sense, because it’s composed of many systems. We wrote a blog post about a month ago about how Bulldozer works, trying to push the lessons out to the community so we can receive feedback. But open sourcing that, again, it’s probably opinionated to how we do things at Netflix, so I’m not confident there would be value in open sourcing it. But whenever we think that something is an entity that can be open sourced, then definitely we’re going to move forward with that.
Kostas Pardalis 52:11
Yeah, makes total sense. And last question: what do you think are the next data platform problems or challenges that the industry is going to spend time and resources on solving? What are some interesting problems that you see out there that haven’t been addressed yet, or that have just, you know, emerged because of the evolution of the industry?
Ioannis Papapanagiotou 52:36
Yeah, I think there are a number of interesting problems that we’re going to see in the near future in the data platform. One of them is data governance, in terms of, you know, looking at data quality, data detection, how do you get insights about the data? How do you look at the data? And what we have seen around the industry is that there are many separate solutions that address each of these problems. But I’m curious to see if there’s a solution that can actually address this in a more holistic way. The other aspect we’re investing heavily in here at Netflix is the notion of the data mesh, where we want to abstract a lot of the aspects of the data platform, and instead of people developing their own pipelines, we want to provide that abstraction layer, so it can be a more centralized pipeline, to some extent, with all the features that users need. So I think some of the data platform problems will start to become more high-level than they are today, which is data warehouses and databases.
Kostas Pardalis 53:39
That’s great. Those are some great insights into what is happening in the industry right now. Ioannis, thank you so much. I mean, we could keep chatting for at least another hour, but I think we have reached the limit of our time for this episode. Thank you so much for spending this time and sharing all this valuable information with us. I’m personally really looking forward to seeing you again in the future and discussing more about whatever interesting stuff you’ll be doing.
Ioannis Papapanagiotou 54:09
Yeah, thank you so much for having me.
Eric Dodds 54:11
Well, that was an absolutely fascinating conversation, and just the types of problems they have to face at Netflix because of the scale. And it was so interesting to hear about things like team structure, where the streaming team deals with the distribution of assets, whereas Ioannis’ team, you know, deals with sort of collecting and storing and making those assets available. At most companies, those are the same teams. It was just so interesting to learn about that. What stuck out to you, Kostas, as sort of the major takeaways?
Kostas Pardalis 54:47
There are a couple of things, actually. I was very impressed with the size of the teams. First of all, we are talking about pretty small teams, if you think about it, but they are taking care of a huge infrastructure, and I mean both the storage infrastructure that they have and also the analytics infrastructure. It was very interesting to see how these small teams can be so agile and so effective. The other thing, which I guess is not only characteristic of Netflix but of other companies of their size, is how many different technologies are involved across the whole organization: pretty much every data product that exists out there, every possible data warehouse technology, from cloud to on-prem. And at the same time, they still have to be, let’s say, at the state of the art and build their own technology to support their needs. So that was very, very interesting. And I think that’s a big benefit of observing what these companies are doing: you can take a glimpse of the future, let’s say, of what problems data engineers will have to deal with in the future. And the other thing is open source. They’re contributing a lot to open source, they have many projects that they maintain out there, and quite interesting projects too. But at the same time, probably their complete data stack is based on open source solutions, and they contribute back; Ioannis, for example, was saying about the contribution to Cassandra that they have quite a few committers in the company committing back to Cassandra. So these are the things that I found extremely interesting. And another thing that I would like to ask everyone to pay some attention to is the concept of CDC. They are really investing a lot in implementing solutions on top of CDC.
And it’s something that I think we will be hearing more and more about in the near future. And of course, it’s something that we discussed also with DeVaris from Meroxa, if you remember; his company and his product are all about CDC. And I think this is a term that we will hear more and more in the near future.
Eric Dodds 56:52
Great. Well, thank you so much again for joining us on The Data Stack Show and we will catch you next time.