Episode 92:

Building a Decentralized Storage System for Media File Collaboration with Tejas Chopra of Netflix

June 22, 2022

This week on The Data Stack Show, Eric and Kostas chat with Tejas Chopra, Senior Software Engineer at Netflix. During the episode, Tejas discusses all things Netflix Drive, from what it is to use cases and technical components.

Notes:

Highlights from this week’s conversation include:

Tejas’ background and career journey (2:49, 43:04)
Digital collaboration with Netflix Drive (7:57)
A formal version control component (23:44)
Centralized store vs. local-first (31:05)
The different skill sets a data engineer needs (37:38)
How to get into data engineering (40:57)
New technologies coming into day-to-day work (44:39)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Today we’re talking with Tejas from Netflix. He is building Netflix Drive, which is a fascinating system that has enabled Netflix artists and employees around the world to collaborate on media files. Super interesting. Kostas, I am really interested to know: working in the cloud is so second nature to most of us, right? When you think about Google Drive, or Dropbox, or whatever, even files that you can easily share on your phone, it’s just so natural. And so I want to know what it was like before they started building Netflix Drive, and I know the pandemic was a catalyst for that. So that’s going to be my question: what was the workflow like before? And then how did they start to undertake migrating that into the cloud? How about you?

Kostas Pardalis 1:12
Yeah, it’s a very good opportunity to discuss with an expert what this whole thing about local-first actually is when it comes to building application experience and product experience. There’s a lot of conversation and noise around that stuff, mostly in the web application space. And we see a lot of that happening in applications like Figma, for example, where you can edit things, collaborate online, be offline, and then continue working from there. There are a bunch of applications like these that are showing this kind of local-first experience. So it would be great to talk with him and see what exactly that means on the back end, and what it means to try and do that not just at the scale of two or three people, but at the scale of Netflix, where you have huge media files and very complicated workflows. So yeah, I’m very, very excited to talk about that stuff with him.

Eric Dodds 2:24
All right, well, let’s do it.

Kostas Pardalis 2:26
Let’s do it.

Eric Dodds 2:27
Tejas, welcome to The Data Stack Show. We are so excited to chat with you.

Tejas Chopra 2:32
Thank you. And it’s a pleasure being here to meet Kostas, and you as well, Eric, so thank you for having me here.

Eric Dodds 2:38
Absolutely. We always start in the same place, so tell us about your background and then how you ended up at Netflix.

Tejas Chopra 2:47
Sure, yeah. So I actually grew up in India and came to the US around 12 years ago, did my master’s at Carnegie Mellon University, and started working in the Bay Area at some smaller companies. My focus has always been on back-end systems, low-level operating systems. That’s where I started working. I was writing GNU debuggers, so a lot of debugging for processor cores. Through acquisitions, I went through several companies, and then I worked at a startup called Datrium. Datrium was trying to revolutionize how we think about storage, about virtual machine storage, and it was the one-stop shop for not just primary but backup use cases as well. There, I worked on file systems. So I helped write a file system and some components of it and some data management primitives, like snapshotting, replication, all of that. And then after Datrium, I got a job at Box. Box is a pioneer in cloud content management. So I was working there, building a lot of the services to power petabytes of data on the cloud, intelligently placing data on the cloud, and leveraging a lot of techniques for on-premise and cloud storage and developing solutions around that. At Netflix, I started working around two years ago, and my focus has been mostly something called Netflix Drive. The way to look at it is it’s a Google Drive. But Google Drive is for your files and folders; Netflix Drive is for media assets. So when you think about Netflix, you think about the great movies that you watch. And we also produce movies; we make movies through Netflix studios. Now, when you make a movie, you have artists that collaborate to work on it: the visual effects, the animation side. Typically, they used to go to the production site and work there. But with the pandemic, in our world today, they work from their homes.
So how do you build solutions that can give them the same experience of collaboration? That is something that Netflix Drive enables, and that’s been my focus at Netflix as well. So that’s how I got into data. That’s how my journey has been so far.

Eric Dodds 4:51
Wonderful. Okay, I want to hear about Netflix cloud, but first (and our audience knows that I always do a little bit of LinkedIn stalking), I noticed that you were a software engineering intern at Apple very early on. I think on LinkedIn it said 2011, which was a really interesting time because the iPhone had come out and widespread worldwide adoption was happening. So I’m super curious, what did you work on there? And what was that like?

Tejas Chopra 5:22
Absolutely. I was a part of the Media IMG group there, which is image and multimedia, if I remember it correctly, and a lot of us were working on applications such as FaceTime, and on a lot of processor cores that were being licensed by Apple, and how to do testing for those cores. So we used to get processor cores from external companies; if I remember correctly, it was Imagination. They also used to provide software. But how do their processor cores fit on the Mac or the iPads? Do they render the image correctly or not? That had to go through a rigorous process of testing and validation. And one of my first jobs was to validate. So I used to write kernel extensions to validate those processor cores on the iPads. That was my job. But some of my peers were working on initial versions of FaceTime at that point. So it was really a fun time. It was right around the time when Steve Jobs was still around, so we did bump into him a couple of times at Apple. Life was very different back then. Technology was there, but Bitcoin wasn’t there (or at least, I didn’t know about it). So Apple was the craze. It still is the case. But it was such a great feeling to be in college and work for Apple. So I was really in a happy space. And that was my first time in California. So when I landed in California, I remember I saw everything golden. And I thought this must be heaven. Like, it’s just so beautiful. It is so beautiful. I remember that feeling very well.

Eric Dodds 6:59
Yeah. Okay. Was Steve Jobs wearing a black turtleneck when you bumped into him?

Tejas Chopra 7:03
Oh, yeah. Oh, yeah.

Eric Dodds 7:05
Perfect. That may be the best thing. That’s so great. Okay, so Netflix cloud, this is what I’m interested in. It’s really interesting for me, and probably some of our listeners, to hear something like Google Drive or collaborating on business files or documents or whatever, code, all that sort of stuff. It’s so second nature now, for anyone who works in and around technology. And so it’s a little bit funny to say like, imagine Google Drive. It’s like, yeah, isn’t that just how people work? So can you explain the switch? What was the infrastructure like? And how did people interact with it before the pandemic, because it sounds like it was working fine but it wasn’t actually similar to how I think a lot of people collaborate in a digital environment day to day.

Tejas Chopra 7:55
Exactly. When you think about a movie-making process, and you’re right, Google Drive today and the way we collaborate and work, it’s second nature to all of us; we don’t even realize the things and services we use. But when it comes to movie making, you have a camera that captures a movie, but the movie when it is captured is very different from the movie that you see on the screen. There are so many things that go on behind the scenes: there are cuts, edits, rendering, and there are so many different variations of the movie based on your device type. So all of that pre-processing and post-processing activity on a recorded image or a recorded movie happens behind the scenes, and you have a lot of camera footage that gets collected, and only one percent actually makes it to the final cut. So you would have to transfer that amount of data. Typically, earlier, artists used to actually go to the production site, just because you can avoid that transfer of data and the time it takes: you can work directly on the footage there, and then you have the final iterations that you can work off. And you can use the cloud. Let’s say you worked on an image and posted it to the cloud using Google Drive, let’s say a small image, like a photo. Then you can share it with some other artist who wants to either add some color or make some other edits to that image. But the problem is, at scale, this doesn’t work. Google Drive has limitations, right? It only allows some tens of thousands of files. When you are an artist and you have a huge corpus of data, you want to have the ability to just work on the assets that you care about. So surfacing the right assets in your purview is very important. So you need that control when it comes to data. It’s not that you just show all the data to everyone.
You need levels of control; you need access levels, authorization, all of that to be built, and those security primitives are lacking in Google Drive, because it’s just imagined to be a file sharing service. So we wanted to take that forward. That was one problem. The other thing is when artists work from their homes, they work from different machines. You have Photoshop on one machine, you’ve configured your brush size and everything, and you’re working on an image. You just close your laptop, you go to another machine, and you want to have the same image with your same settings persisted, right? All of that lives in some files in these applications. A simple way is you take those files, you put them in your email or your Google Drive, and you bootstrap the application on the other machine with those folders. But this can all be made seamless: you can actually run Photoshop off a shared cloud drive that allows you to sync between machines, sync with other artists, and collaborate remotely with other people. That is what the vision was for Netflix Drive. And it’s just one part of the equation; there are so many other things that it can enable. Right now we are talking about your machines, like your macOS, Windows, or Linux boxes, so it has that component where it supports different OS versions. But also, if we move away from media, if you think about any form of sharing, not just media and not just studios, Netflix Drive, if built correctly with the vision, can actually solve all the issues. It’s a superset of Google Drive. We’ve designed it in a way where you can plug in any metadata store and any data store on its back end. So it’s an abstraction layer: we can plug in a cloud database and a cloud data store, we can plug in an on-premise database and an on-premise data store, or we can plug in a hybrid one. And it will work.
We plan to open source it so people can actually use it. Currently, the first version that we’ve built internally works with S3 as the object store and with CockroachDB as the metadata store. So we have a layer on top of CockroachDB and a layer on top of S3, and that takes care of the first version of Netflix Drive. But that’s the vision with Netflix Drive.

Eric Dodds 11:55
Fascinating. Okay, Kostas, I’m going to hand the mic to you. I usually monopolize at this point in the conversation but I’m so interested to hear what you’re going to ask, especially because Tejas just mentioned CockroachDB.

Kostas Pardalis 12:09
Yeah, we had the pleasure of having an episode with them some time ago, so obviously a very interesting database system. I was about to ask you, actually, if you have any plans to open source it, but you answered that already. So, having to find another question: after it goes open source, are you going to start a company?

Tejas Chopra 12:36
I think so far we’re not thinking that far ahead. We have a lot of plans with Netflix Drive: open sourcing is one, along with trying to see the different applications that it can support and building the different abstraction layers. One other thing with Netflix Drive is, when you think about a file system on your local machine, you have reads, writes, all of these calls that happen. Netflix Drive not only does that, it also exposes APIs, so you can actually call APIs on a live file system. These APIs are used to enable workflows that are built on top of Netflix Drive, like I said, to surface the right files, or to hydrate a new machine with just a subset of the files from your older machine. And we are using Netflix Drive actively in different ways. We use it in animation, we use it in rendering, and we use it for users’ home directories, so all of the files on your MacBook can actually run off Netflix Drive, and you can go to your other machine and it will just surface all the files.

Kostas Pardalis 13:37
Yeah, I have a question, because these cloud file systems have been around for a while, right? Dropbox, Box, Drive. And, okay, they are designed for sharing, not so much for really working over the data. So it would be awesome if you can do that. I remember the first time we had Dropbox in my first company, to share data between the founders and the employees, and we were like, oh, it looks like a file system, let’s start editing the same thing at the same time. And then it was a bit of a mess, to be honest. It doesn’t exactly behave like a network file system as you’ve been used to in the past. So is this something that Netflix Drive can do? Is it designed with this in mind?

Tejas Chopra 14:36
Yes, that’s right, because a lot of the things that we work on are creative iterations where artists actually work on drawings, which have strict requirements of latency and experience. So Netflix Drive uses tiered forms of storage. It works with your local storage: it will cache the files in your local storage to give you great performance, but it also allows you to build intermediate storage tiers before cloud. The way to think about this is, let’s say you’re an artist, you work on 100 iterations of something, and then you’re like, aha, this one is what I like. This is the one that I want to share and collaborate on with someone. But if you just build a system like Dropbox or Box, where you have cloud and you have your local machine and you don’t have tiering, all of these 100 iterations will go to cloud, because you don’t have enough space on your machine. So you’re paying for the cost, that is, the storage cost in the cloud, the REST request cost, or whatever cost you pay for cloud. Then you will have to delete files, but you are still paying something. By having these tiered forms of storage, you can actually have the 99 iterations sit in the middle, and only the final cut makes it to cloud, by the use of the API, so that the other artists can collaborate with you on that. Now, this control is unique. They probably may have tiered storage in the background, like Box and Dropbox probably have it, but the user cannot control it. Netflix Drive allows the user to control it so that they can actually build different types of applications on top of it. This is a very, very simple example of iterations.
But let’s say some files are temporary in nature. You want these temporary files to be around for some time, but you don’t want them to go to cloud because you don’t want to pay the cost. You can still use this intermediate storage, these storage pods, and store these files there. Netflix already has storage pods around the globe, so we can bootstrap these different caches and media stores relatively simply, and we can provide the same experience. To take a step back, I’ll explain why we also need this. Many locations do not have great accessibility to cloud. The pipe between cloud and their machine is not that wide, so they don’t get great throughput and bandwidth. But Netflix has its own back channel with great throughput and bandwidth by the use of Open Connect and our CDNs. So we leverage that high network throughput to stand up these intermediate storage locations that are closer to the artists. If artists are working in LA, they do not have to push those files to cloud; they can just work off the media stores or storage locations in the middle, and most of the files can be surfaced from that. So this gives great performance. Again, performance was the main reason why we designed it in a tiered form.
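
Editor’s note: the tiering idea Tejas describes (every iteration lands locally and in an intermediate store, and only an explicit API call promotes a file to cloud) can be sketched roughly as below. This is a toy illustration, not Netflix Drive’s code: the class `TieredStore`, its `save`, `publish`, and `read` methods, and the use of in-memory dictionaries as tiers are all hypothetical names and assumptions made for the example.

```python
class TieredStore:
    """Toy three-tier store: local cache -> intermediate store -> cloud.

    All names here are hypothetical; real tiers would be disks,
    regional storage pods, and an object store rather than dicts.
    """

    def __init__(self):
        self.local = {}         # fastest tier, on the artist's machine
        self.intermediate = {}  # regional tier close to the artist
        self.cloud = {}         # shared tier, the one you pay cloud costs for

    def save(self, name, data):
        # Every iteration is kept locally and in the intermediate tier,
        # but is never pushed to cloud automatically.
        self.local[name] = data
        self.intermediate[name] = data

    def publish(self, name):
        # Explicit, user-driven promotion of a single file to cloud.
        self.cloud[name] = self.intermediate[name]

    def read(self, name):
        # Serve from the closest tier that has the file.
        for tier in (self.local, self.intermediate, self.cloud):
            if name in tier:
                return tier[name]
        raise KeyError(name)


store = TieredStore()
for i in range(1, 101):
    store.save(f"shot_v{i}", f"pixels-{i}".encode())

store.publish("shot_v100")  # only the final cut is shared
print(len(store.intermediate), len(store.cloud))
```

In this sketch the 100 iterations stay in the middle tier and a single object reaches the cloud tier, which is the cost argument made in the conversation.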

Kostas Pardalis 17:30
Okay, what are the trade-offs there when it comes to concurrency? So let’s say we have two editors who want to edit the same file at the same time.

Tejas Chopra 17:41
That’s a great question, because the simplest solution is what we tried to design first, which is last writer wins: whoever wrote the file last wins. But that may actually result in losing work, right? So the way we do that is we allow the user to select what they want to surface. Let’s say that two artists are working on the same asset: they can either overwrite the existing asset with their own final copy, or they can accept the changes. So far, we haven’t designed it to do this, but the vision is that whenever an artist writes to a file, it generates an event, and that event is consumed by the other artists who are collaborating on the same file. When that event is consumed, they can decide whether they want to just overwrite their current copy with what the other artist has worked on, save their current copy to a temporary location and do a git merge, in some ways, on their own, or reject that completely.
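
Editor’s note: Tejas says this event-driven flow is a vision rather than something already built, so the following is only a sketch of the three choices he lists (accept the other artist’s write, stash your own copy and then accept, or reject). The `WriteEvent` type, the `resolve` function, the choice strings, and the `.mine` suffix are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WriteEvent:
    """Hypothetical event emitted when a collaborator writes a file."""
    path: str
    author: str
    data: bytes


def resolve(workspace, event, choice):
    """Apply one collaborator's decision about an incoming write event.

    workspace maps path -> bytes for the local copies.
    """
    if choice == "accept":
        # Overwrite the local copy with the other artist's work.
        workspace[event.path] = event.data
    elif choice == "stash_and_accept":
        # Keep the local copy aside for a manual merge, then take theirs.
        workspace[event.path + ".mine"] = workspace[event.path]
        workspace[event.path] = event.data
    elif choice == "reject":
        # Ignore the incoming write and keep the local copy as it is.
        pass
    else:
        raise ValueError(f"unknown choice: {choice}")
    return workspace


mine = {"scene/frame_001.exr": b"my edit"}
incoming = WriteEvent("scene/frame_001.exr", "other_artist", b"their edit")
resolve(mine, incoming, "stash_and_accept")
print(sorted(mine))
```

The point of the design is that the decision stays with the artist; nothing in this sketch tries to merge the two binary copies automatically.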

Kostas Pardalis 18:45
Can you think of an architecture where you have, let’s say, these events that are emitted when something is happening at the binary file level, and then the clients implement some kind of CRDT so that they can automatically resolve any kind of conflicts and be eventually consistent at the end? CRDTs are used a lot in this kind of environment, and they make total sense. But at the same time, they’re mainly designed for text edits, so I don’t know how they’re going to do on petabytes of binary files, right? So is this how you think about it?

Tejas Chopra 19:27
Yes. So far, for the first version, we are not thinking about CRDTs, because a lot of these files are image files that are compressed differently. So if you do the binary translation, direct editing may not work, because you may not be able to see the image after that. So we rely on the artists to actually open their own copy of the image, open the other copy of the image, see if they want to apply something or change something, and then commit a final copy to cloud. So we’re keeping it simple. We anticipate that in the first few versions we will not have a lot of artists collaborating on the same asset at the same time. We envision it working between artists in different parts of the world who work in different time zones. So it’s like a pipeline at that point, where you have one workflow in which one person makes changes, and then the other workflow in the next stage picks it up and works on it. That’s how we’re thinking of it initially. But yes, those are great points. These are things we’ll learn along the way, and we’ll probably have to implement a CRDT for media files.

Kostas Pardalis 20:29
Yeah, that’s super interesting. Sorry.

Eric Dodds 20:32
Oh, go for it. Go for it, Kostas. I interrupted.

Kostas Pardalis 20:34
I’m very excited. I was, not playing around, but taking a look at what’s going on with CRDTs, because there’s a lot of, not exactly hype, but there are things happening with MVPs. But their use case is extremely focused on collaborative text editing. And maybe the most different thing that you might see out there is what Figma is doing: you have a visual environment, and you track the changes there and try to make them visually consistent. But at the same time, it makes a lot of sense to think about how you can use this kind of functionality with data in general, not just a sequence of edits on a string. So I’m very excited to hear that there is actually a use case that makes sense. But I have no idea about the research that is going on there. To be honest, I don’t know if anything has been done there yet.

Tejas Chopra 21:38
Yeah, I’m not sure either. The way I look at it, we are probably not the folks who are best aware of how to merge two images. That’s where an artist comes in with their creativity, right? And I think technology will never replace creativity. So we still want to keep the true essence of that alive. I think that for smaller merges maybe it’s easier to tackle those, because Netflix Drive is a generic system. It’s not just used for images and media files, but for other files as well. There could be some tracking files and other files, metadata files. So those could definitely be solved by this. But I still believe that there will always be an area where creativity will trump technology.

Kostas Pardalis 22:28
Yeah, 100%. In the end, there are always limitations and you need to find the right trade-offs and when to involve a human to decide that, okay, this is what we keep at the end and what not. Okay, Eric, all yours. I’m sorry. I got too excited.

Eric Dodds 22:46
Oh, no, no. That’s super interesting. Well, one of the questions that flows from this naturally is a formal version control component of the system. So you have last writer wins, and then you have intermediary stores. Have you considered a formal version control mechanism? It’s interesting to think about. The way that generally happens, at least where I’ve seen people work on media, is you just save a working version of the file and append a number to it, right? v1, v2, v3. And so with heavy media you run into a ton of storage bloat and all that sort of stuff. There’s horrible documentation, because it’s just a bunch of files in a folder that are named sequentially, maybe. So has that entered the conversation at all?

Tejas Chopra 23:43
Oh, yeah. For our first version, we are building with S3 as the backend data store, and S3 allows you to have multiple versions of an object. Netflix Drive has multiple variants, but two of them are: you explicitly save a file using APIs, or you automatically save a file in the background. So while you’re working on a file, it’ll automatically save the file, and every such save creates a new version of the file in S3. Now, the way we are thinking about it, and it’s still in the works: when you think about media, it’s a big file, right? Even if you change a small pixel in the file, you don’t want the entire file to be a new version. So the way we do it is we chunk the file into parts, and that allows us to just sync and replicate, or create a new version for, the chunk that has the changes. So 99% of the file doesn’t even need to be streamed to cloud; it will only be the chunk that changed. That’s one. The other is it allows us to deduplicate better in the future. Because if there are two big movie files and 99% of the movie is the same, you can actually deduplicate and reduce your cloud storage. So we do have versioning in the background that S3 takes care of, and we can also surface the correct version. Right now we are imagining a time machine kind of interface, where you can not just look at the current files, but go back and look at the versions of your file by picking the right version from cloud. So that’s how version control can actually help us. And even if two artists collaborate, let’s say they’re both writing to the same folder, and one version is overwritten by the other version, you can always go back to that other person’s version as well.
So that’s the beauty of having that versioning with object stores.
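
Editor’s note: the “only the changed chunk gets a new version” idea can be illustrated with fixed-size chunks and content hashes. This is a minimal sketch under assumptions: the 4-byte `CHUNK_SIZE` is deliberately tiny so the example is readable, SHA-256 is an arbitrary choice of hash, and the helper names `chunk_hashes` and `changed_chunks` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the demo is easy to follow


def chunk_hashes(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def changed_chunks(old, new, size=CHUNK_SIZE):
    """Return indexes of chunks in `new` that differ from `old`.

    Only these chunks would need to be uploaded as new versions.
    """
    old_h = chunk_hashes(old, size)
    new_h = chunk_hashes(new, size)
    return [
        i for i, h in enumerate(new_h)
        if i >= len(old_h) or old_h[i] != h
    ]


v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBCCXCDDDD"  # one byte changed inside the third chunk
print(changed_chunks(v1, v2))  # [2]
```

The same hashes also make deduplication possible: two files that share a chunk produce the same digest for it, so the chunk only needs to be stored once.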

Eric Dodds 25:40
Fascinating. And so just to be clear, because it’s really interesting, that happens without the end user, the artist having to declare anything related to version control, but then they can sort of access the versions as needed.

Tejas Chopra 25:57
Exactly. That auto checkpointing implies that the artist doesn’t even need to click on save; it’s like your Google Doc, where it’ll autosave. But some artists don’t want the auto checkpointing. They know that they will work on their machine, and they only want to save a file, or copy it to the cloud or intermediate data stores, when they really are done with the file. So they will explicitly not use the checkpointing feature; they will hit the Save button, and it’ll automatically overwrite the previous version in the background. They don’t need to rename or anything; it’s just automatically taken care of in the background.

Eric Dodds 26:33
Fascinating. And this is a really specific question, but I’m just really interested: When you chunk the file, like when you break it apart— Well, two questions related to this. So when they download the file to work on, are you actually stitching it back together when they download it to work on or whatever?

Tejas Chopra 26:52
So today, our metadata store is the one that maps a file to its chunks. Given a file, if it has 100 chunks, the metadata store will have that mapping between the file and all its objects, and each chunk becomes an object in S3. So the file-to-object translation happens in the metadata, and that’s where the value is. When we have to download a file, we look at the file and get all the objects that belong to it. As it stands today, we download all the objects for a given file. But in the future, we plan to download just specific chunks of the file if the user requests a specific offset. We do have on-demand prefetch, which means you do not fetch the files from cloud unless you really touch the file locally. You will only fetch the metadata, so your ls and other commands can work. But when you start working on a file, that’s when we will prefetch the file from cloud and get all the objects for the file. So we do have that today, but still at the granularity of a file. Today we don’t have the implementation to get just specific objects or specific chunks from a file. But that’s in the works.
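
Editor’s note: a rough sketch of the behavior described here, where listing touches only metadata and opening a file pulls all of its chunk objects at once. The `LazyDrive` class, its method names, the object keys, and the plain dictionaries standing in for the metadata store and the object store are assumptions for illustration only.

```python
class LazyDrive:
    """Toy model of on-demand fetch at whole-file granularity."""

    def __init__(self, metadata, object_store):
        self.metadata = metadata          # path -> ordered list of object keys
        self.object_store = object_store  # key -> bytes (stand-in for S3)
        self.cache = {}                   # path -> assembled file contents
        self.fetches = 0                  # number of objects downloaded

    def ls(self):
        # Listing needs only metadata; no object is downloaded.
        return sorted(self.metadata)

    def open(self, path):
        # First touch: fetch every chunk object and stitch them in order.
        if path not in self.cache:
            parts = []
            for key in self.metadata[path]:
                parts.append(self.object_store[key])
                self.fetches += 1
            self.cache[path] = b"".join(parts)
        return self.cache[path]


objects = {"obj-1": b"hello ", "obj-2": b"world", "obj-3": b"unused"}
metadata = {"a.txt": ["obj-1", "obj-2"], "b.txt": ["obj-3"]}

drive = LazyDrive(metadata, objects)
print(drive.ls(), drive.fetches)
print(drive.open("a.txt"), drive.fetches)
```

Fetching only the chunk that covers a requested offset, which Tejas describes as future work, would replace the loop in `open` with a lookup of the single key for that offset.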

Eric Dodds 28:00
Got it. And the second question is, how did you decide how big the chunks are?

Tejas Chopra 28:08
That’s a great question. We started this project when S3 did not have support for large file sizes. We typically see movie files that are upwards of five gigabytes, and at that time S3’s maximum size was much less than that, I think 500 MB at that point in time. So we decided we had to take matters into our own hands; we had to chunk. We decided we’d go with 64-megabyte chunks. That’s the number that we chose, and we found that it gives us the maximum benefit. But we recently had a hackathon at Netflix, and one of the projects that my team worked on was variable chunking, which is: don’t choose 64-megabyte chunks, check if you can use variable chunking algorithms like Rabin-Karp fingerprinting to choose variable-size chunks, because that gives you a higher probability of deduplication. We will explore that. That’s one of the features we can add in the future, but so far we have fixed-size chunks.
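
Editor’s note: variable (content-defined) chunking picks chunk boundaries from the bytes themselves, so an insertion early in a file shifts boundaries only locally instead of changing every later fixed-size chunk. The sketch below uses a simple polynomial rolling hash in the Rabin-Karp style; it is not true Rabin fingerprinting and not the hackathon project’s code. The constants `WINDOW`, `BASE`, `MOD`, and `MASK` and the function `cdc_chunks` are assumptions chosen for a small demo.

```python
import random

WINDOW = 16           # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
MASK = (1 << 6) - 1   # cut when the low 6 bits of the hash are zero
POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def cdc_chunks(data):
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + byte) % MOD
        # Cut only once the chunk holds at least WINDOW bytes.
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = b"X" + original  # insert one byte at the very front

a, b = cdc_chunks(original), cdc_chunks(edited)
shared = len(set(a) & set(b))
print(len(a), len(b), shared)
```

With fixed-size chunks, the one-byte insertion would shift every chunk; here, boundaries depend on content, so chunks after the edit tend to line up again and can be deduplicated. The trade-off mentioned in the conversation still applies: the bookkeeping is more complicated than a fixed 64 MB split.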

Eric Dodds 29:01
Got it. That’s interesting. I wasn’t initially thinking about storage capacity limitations. I was more thinking about is there based on the average length of a movie or show file, like, is there a particular slice of that, in terms of the number, how long it is timewise, where it makes sense to have a cut-off or something.

Tejas Chopra 29:35
Today, I think there are a lot of algorithms that try to research what’s the right chunk at which you break a movie, and people have gone into a lot of depth with these algorithms. We haven’t used them. This hackathon project was just to see, if we ever were to go that route, what are the savings we can get? What is the impact we can have? It simplifies some things, but it complicates other things. Really, it’s the cost of how complicated you want your code to be versus the benefits of having a simple solution. And sometimes keeping things simple is really the goal. We still have to do a full analysis: if we were to variably chunk files and store them, what does that translate into in savings? And what does that translate into in performance? So we are looking into that, yes.

Kostas Pardalis 30:30
I have a question that is more about working with data in the scope of, let’s say, data analytics and data science. You are describing an environment where people are working with data again, right? Audio files or video files, but still data. But the approach there is local-first, right? You need to have the file locally to work; you cannot just edit the file remotely on a server or a VM on AWS. Now, when we are talking about data analytics, and in general the more, let’s say, structured data kind of work, we usually assume that everything is centralized, right? We have a data warehouse, or a data lake, or a lakehouse, or whatever we want to call it. But still we are talking about a centralized store. That’s where the data is, and we execute all the queries there. We don’t have a local-first kind of modality there. Do you feel like we are going to see a transition? Or do you see use cases where local-first makes sense for these use cases too?

Tejas Chopra 31:39
It depends. I do believe that it may not be local-first that will make sense for these use cases. But I do believe that decentralized data lakes and data warehouses will be something that will happen in the future. Let me take a step back and explain what I mean here. When you think about a data lake, you have the central place where all the data lives, and you run algorithms on top of it. So you’re taking your code to the data. But now imagine I tell you that there is a way to split this data into pieces, where each data piece can be operated upon by that algorithm, or maybe a subset of the algorithm, and the overall impact of applying these algorithms in parallel on these data pieces is the same as applying an algorithm on that entire data lake. It may seem impossible, it may seem like that may not work, because you need so much information from the entire data. But there are techniques today where you can work on pieces of data and still aggregate them in such a way that the total sum of all of these aggregations is equivalent to operating on a big data lake. I think that is where the world will move, because having a central data lake has a lot of restrictions when it comes to privacy and when it comes to security. It does have privacy and security, but think of a case like the medical industry, right? You have healthcare data, and you’re working on some algorithms and machine learning on that healthcare data. Now you go to a separate company, and you have healthcare data there. You cannot transfer all the data, because that’s compromising user information. You also cannot share a lot of the learnings, because some of these learnings may tell you a lot about the users as well. They can reveal personally identifiable information. So how do you deal with such situations, where you want to apply the learnings from one data set to the other data set?
Right, this becomes a classic case where there are two data lakes and you want to apply some algorithms and take learnings from both. Now take this concept to other places as well. There are techniques in privacy-preserving computation where you can work on decentralized data storage backends and still preserve privacy: you work on encrypted data instead of decrypting the data, and securely get your learnings. That is where I think we will move in the future. As regards the current case of having local storage or not, I think that for some of these applications, latency is not as big a concern as throughput and bandwidth. With local storage, you are usually solving for the latency problem, where you have a user that needs a great user experience. And these are creative artists, right? Or you have a user who’s clicking on an application, and they expect great UX. So you want that to be served locally. With some of these, you run queries over large datasets, all of which may not be surfaced locally. So I think that until we get to the point where we solve the decentralized data lake problem, we will still work with ways in which we run algorithms on top of data rather than taking data to algorithms.
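
The idea of operating on pieces and aggregating the partial results can be illustrated with a minimal sketch. It assumes the simplest possible case, a mean, where each partition only reports a (sum, count) pair and never ships its raw rows; the partition contents and function names are hypothetical, and many real algorithms do not decompose this cleanly.

```python
def partial_stats(partition: list[float]) -> tuple[float, int]:
    """Computed where the data lives: only a (sum, count) pair leaves the partition."""
    return sum(partition), len(partition)


def combine_mean(partials: list[tuple[float, int]]) -> float:
    """Aggregate the per-partition summaries into the global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical partitions, e.g. records held by three separate organizations.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
distributed_mean = combine_mean([partial_stats(p) for p in partitions])

# The same statistic computed on a single centralized copy of the data.
centralized = [x for p in partitions for x in p]
centralized_mean = sum(centralized) / len(centralized)

print(distributed_mean == centralized_mean)
```

Note that even summaries such as these can leak information about small partitions, which is the concern about sharing learnings raised above.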

Kostas Pardalis 34:54
And about these algorithms, these techniques that you talked about: where do we stand today in terms of the state of the art? Do you think that we are ready to build products from that, or is there still research that has to be done before we can even start thinking of productizing this?

Tejas Chopra 35:10
Yeah, there is research going on right now, and there are different ways in which you can work on this data. There is SMPC, which is secure multi-party computation. There’s an entire field of research that allows you to break your data into pieces, operate on each piece individually, and then collect all of the learnings and have the same impact as working on that one humongous piece of data. The problem with these fields is that there’s a lot of message passing between all of these different pieces for them to come to an agreement on what that eventual result should be. That’s the problem of consensus, and it takes time. For every operation that you do on each piece of data, you need to tell all your peers about it, and you have to come to a consensus. That is what is impacting it. That is why it’s not mainstream today. But I imagine there’ll be a lot of research coming out in the future that will try to remove this consensus or figure out a way to make it much better. That’s when this will become more mainstream.
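
One building block often used in secure multi-party computation is additive secret sharing. The following is a minimal sketch of a secure sum under that scheme, assuming honest parties and non-negative integer inputs smaller than the modulus; it is an illustration, not a production protocol, and the names `PRIME`, `share`, and `secure_sum` are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # Modulus for the shares; inputs are assumed to be smaller than this.


def share(value: int, n_parties: int) -> list[int]:
    """Split value into n random-looking shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values: list[int]) -> int:
    """Sum the parties' values without any party seeing another's raw input."""
    n = len(private_values)
    # Each party splits its value and sends one share to every party: n * n messages.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the j-th share it received from each party.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Only the partial sums are published; together they reveal just the total.
    return sum(partial_sums) % PRIME


print(secure_sum([10, 20, 30]))
```

Even this toy version exchanges a number of shares that grows with the square of the number of parties, which gives a feel for the message-passing overhead described above.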

Kostas Pardalis 36:10
That’s pretty cool. All right. I think we’ve talked a lot about Netflix Drive. I can’t wait to see it as open source, to be honest. Do you have any like estimation of when this is going to happen?

Tejas Chopra 36:22
So we are thinking we will try to open source it this year. And if not this year, then maybe early next year. But yeah, that’s the plan.

Kostas Pardalis 36:29
Okay, that’s awesome. I’ll keep an eye on it and see when it’s happening. So I know that you also have other interests; it’s not just Netflix Drive that you’re working on. And you’re also a very experienced engineer yourself. You have seen things changing and happening all these years. So there is something that I’d like to ask you, based on your experience, not only at Netflix but also at the other companies that you have worked in. It’s about the introduction of a relatively new engineering discipline, which is data engineering, and data engineers, and how this is different from an application engineer, or a back end engineer, or a systems engineer, or, I don’t know, all these different flavors of engineers that we have out there. Why is it different? And if in your mind there is a good reason for that, what is different? What are the different skill sets that a data engineer needs?

Tejas Chopra 37:33
That’s a very pertinent and good question. The way I look at it, the one thing that binds all these engineering disciplines together, the common thing between all of them, is curiosity. You have to be curious with regard to any field that you’re specializing in, and that curiosity can have different dimensions. When it comes to a systems engineer, you’re looking at how systems work, trying to squeeze out latency, trying to squeeze out CPU performance, power, all of that, and optimizing for that. So that’s something that you focus on. And that’s something you can work on in a silo: you can look at metrics for your CPU and all of that on your machine, and you can work off it. When it comes to front end or application engineering, and it’s not just front end, an application is back end and front end both, you can work on the full stack. But again, your view is for a particular application, for a particular machine, for a particular environment that it’s written in. When it comes to data, it’s actually far broader. Because you cannot get a lot of learnings from just the data that is produced in these two different streams, you actually have to work on systems that allow you to operate on data at 10x the scale. So you can leverage a lot of tools, such as Spark, Hadoop, and all of that, that can work in parallel to observe data and get learnings from data. These tools only give you benefit when they work at scale. So I think scale is a bigger difference in data compared to these other disciplines. And there are also different things you optimize for in these different fields, right? For an application, you optimize for user experience; for systems, you optimize for system performance; when it comes to data engineering, you optimize for learnings from the data. Now, you want to remove the outliers, and you want to have the least amount of false positives or false negatives.
So you are actually cleaning the data and having the right source of data. How do you optimize the performance of, or parallelize, your operations on data? How do you become more cost-efficient when it comes to data? The other two fields do not have cost efficiency as a big metric, but with a humongous amount of data, how can you apply tiering of storage? How can you apply compute and storage scaling differently to save your costs and reduce the time it takes to run these queries on top of that data? Those are some of the things that differ here. So I think that’s a different mindset. And you have to look at every problem with that mindset to see if you fit what it takes to be a data engineer.

Kostas Pardalis 40:24
I loved what you said about the different things that you optimize for. I think that’s very much to the point. That was great. And based on your experience, because you have seen and worked with many engineers out there, where are the best data engineers coming from? Because no one, okay, starts today out of college and says, “I’m a data engineer,” no. So what is, let’s say, the journey, the path that you see as probably the most successful for someone to get into data engineering at some point?

Tejas Chopra 40:56
Yeah, so I think there are two ways, two things that I’ve seen. One is, if you work at a big company that has a humongous amount of data, you can learn about the state-of-the-art tools that exist today. So if you work at a company such as Facebook or Netflix or the Googles of the world, which has a huge amount of data, you can look at how Spark jobs are optimized, how Hadoop is being used, and how different data lakes are being used for different applications. That gives you a very good idea to get started. But I think that, as an engineer, every time a system becomes 10 times its size, you have to re-architect it, right? That’s a rule of thumb: every time you 10x, you have to throw away the older architecture and re-architect. You want to go through that at least once or twice in your life to understand that what works for a petabyte of data will not work for 10 petabytes of data, and that you have to throw away the existing tools you’re using to analyze that data and see what you can use instead. If you go through that cycle once, you kind of know how to operate at that scale. Then you are well accustomed to bootstrapping a startup, where you have very limited data and then scale it, as well as to working in a bigger organization, where you have lots of data and you just need to optimize the cycles. So I think these are two variants that can help you as a data engineer.

Kostas Pardalis 42:18
Yeah, makes a lot of sense. Makes a lot of sense. Okay, one last question from me, and then I’ll give the microphone and the stage to Eric. If I remember, when we had a quick introduction before we started recording, you said your first job was at Apple, and you were writing extensions to the kernel there to do testing. From that to architecting and building large-scale distributed file systems on the cloud that also operate locally: how was the journey? And what is different between you back then and you today as an engineer?

Tejas Chopra 43:02
Yeah, I think that you always stand on the shoulders of giants, I would say. In my case, when I was right out of college, I was very new to the field of software. My goal was to throw code at a problem, to write a lot of code to solve a problem. With my learnings, and with my mentors, I’ve learned that sometimes you need to reduce code to solve a problem. The more code you write, the more bugs you will have. Learnings such as those have really helped me. Also, I have been exposed to a lot of technology in these different companies, and I have learned to look at it and try to fit the puzzle pieces together across different industries. That has really helped me. And the third thing is the confidence that my peers and my mentors have shown in me. Sometimes you don’t even realize your own potential until you’re faced with a challenge or a problem. I think that in my case, I’ve been very lucky with that. So that’s helped me in my journey, and I hope to give back in the same way. I’m still learning from my mentors, and I hope to mentor other people as well, to keep the cycle of growth going for everyone. But that’s, I would say, the short answer.

Kostas Pardalis 44:13
Awesome. Awesome. Eric, all yours.

Eric Dodds 44:16
All right. Well, we have time, so I’m going to switch the subject matter up just a little bit, because one of your other passions is blockchain, Web3, and all the subjects that surround those. As we were chatting before the show a little bit, we were talking about how those can be very buzzwordy topics. And actually, I was thinking, Kostas, do you remember when we interviewed Peter from Aquarium? He had worked on self-driving cars, and it’s like, man, the media wave on self-driving cars hit way too early. It’s just like, this is awesome. And then it’s like, okay. I think he said that famous quote: the future is here, it’s just not distributed yet. So is that the case with blockchain and Web3? And specifically, because you work so deeply in data infrastructure, what I’d love to hear is: when do you think those sorts of technologies are really going to impact the day-to-day work of your average data engineer? And in a widespread way, what is that going to look like?

Tejas Chopra 45:27
When you step back and look at the world around us, the internet was one of the first decentralized architectures. The internet is made up of a collection of machines, a collection of nodes, in itself. If some nodes go down, you still get routed to the right information. The internet solves the problem of getting to a piece of information. But there are other parts to information, which are storing information and processing information, right? So the internet kind of solves the decentralization problem of not having a central authority when it comes to the transfer of information. But the storage of information and the processing of information are still under the centralized waters, because today you have AWS, Google Cloud, Microsoft, all of them big, humongous entities that own the cloud in some ways. So I feel that what blockchain does, in general, is take these three parts out of the centralized waters, because you no longer have to have storage that is centralized. And processing is nothing but something that takes storage as an input, or takes data as an input, and produces some other data in some other form as an output. So by taking data storage out of the centralized waters, you inherently also take processing out of the centralized waters. That is why blockchain is more exciting these days: it takes us to that vision of having all three corners, the transfer, storing, and processing of data, out of the centralized waters. So that’s one. People use the term blockchain very loosely; people try to retrofit a lot of applications and make them blockchain in some ways, but really, what the blockchain is, is a chain of transactions.
So if you have data that’s just sitting, there is no transaction. You can only use the blockchain if there is some form of transaction. This transaction can be you sharing an image with someone; the transaction can be you giving a file to someone else. That information should be stored on the blockchain. Everything else, even with the existing blockchains, like Filecoin, IPFS, all of that: the metadata is stored on the blockchain, but the data itself lives off the chain, because there is no value in putting data on a chain. And the blockchain itself gets replicated on every node. So if you have even a megabyte or a five-megabyte or a huge file, that’s it, you’ve now exhausted all the nodes in the network, because they will all have to download the entire chain. So the way I feel about what’ll happen is that we’ll move away from the concept of centralized authorities owning the cloud to a decentralized cloud, where compute, storage, and services are not tied to AWS or Google or Microsoft, but are run by a bunch of nodes around the world. And it will be tokenized. Folks like you and me can actually give our spare CPU cycles and spare hard drive space to this decentralized cloud. We’ll have encryption and other niceties take care of storing the files on those nodes, and it will be tokenized and we’ll get rewards for it. That’s where I think the future will go with this. The other thing is, when we think about the physical world around us, we can look at scarcity, right? There’s land, and it’s scarce. You have a bottle, and you can touch it, and it’s scarce, because it’s right there and it’s one. But if you have to take the same scarcity concept to the digital realm, how do you do that? NFTs, non-fungible tokens, in the blockchain enable you to do that: they can take the concept of scarcity that exists in the physical world to a digital realm.
So you can imagine your land sales actually having tokens on the blockchain which represent the land. You will avoid cases where the same land is sold to multiple people. There are many frauds that happen because of that, and you avoid that completely. The other thing: you pay taxes today to your government, right, any government, but you don’t know exactly where the money’s going. By having a blockchain take care of that, you can actually look at how much money the government is spending on different initiatives, and that’s open for everyone to view. So I think that blockchain enables a lot of things that are not possible today because of regulations, because of central authorities. And when it comes to data, I think a decentralized data cloud, or decentralized cloud, or, for lack of a better term, “sky,” would be how blockchain can disrupt data infrastructure today. This ties into the same conversation we were having about decentralized data lakes. So I think that can be enabled using blockchain-like technology.
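
The split between transaction metadata on the chain and file data off the chain can be sketched as a simple hash chain. This is a minimal, single-machine illustration with no networking, consensus, or tokens, and it does not reflect how Filecoin or IPFS are implemented; the class `MetadataChain` and its field names are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class MetadataChain:
    """Chain of transaction records; file contents stay off-chain, only their hash is kept."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.blocks: list[dict] = []

    def add_transaction(self, sender: str, receiver: str, file_bytes: bytes) -> dict:
        record = {
            "sender": sender,
            "receiver": receiver,
            "file_hash": sha256_hex(file_bytes),  # Fingerprint of the off-chain file.
            "prev_hash": self.blocks[-1]["hash"] if self.blocks else self.GENESIS,
        }
        # The block hash covers the record, including the link to the previous block.
        record["hash"] = sha256_hex(json.dumps(record, sort_keys=True).encode())
        self.blocks.append(record)
        return record

    def is_valid(self) -> bool:
        """Recompute every hash and check each block links to its predecessor."""
        prev = self.GENESIS
        for block in self.blocks:
            body = {k: v for k, v in block.items() if k != "hash"}
            expected = sha256_hex(json.dumps(body, sort_keys=True).encode())
            if block["prev_hash"] != prev or block["hash"] != expected:
                return False
            prev = block["hash"]
        return True


chain = MetadataChain()
chain.add_transaction("alice", "bob", b"raw image bytes stored elsewhere")
chain.add_transaction("bob", "carol", b"another file stored elsewhere")
print(chain.is_valid())
```

Each record stays a few hundred bytes regardless of file size, which is why replicating the chain on every node is feasible while replicating the files themselves would not be.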

Eric Dodds 50:13
Fascinating. That is absolutely fascinating. And so this is interesting. It sounds like you’re proposing that the big three, who have these massive businesses built on storage, will see disruption from this decentralization.

Tejas Chopra 50:33
They should see disruption, because there are so many problems, right? There’s vendor lock-in, and you pay a lot of money. You may think that storage is cheap because of the tiers. But if you just peek at the alternatives that you have, Storj and Filecoin, which are the decentralized storage alternatives, they are like one-tenth the cost of Amazon. So you actually can use these decentralized techniques. The only challenge is that it will come down to performance. When you use Amazon, you know that you will be within the latency requirements and performance requirements; Amazon has the SLAs. But when it comes to decentralization, unless you have a big number of nodes, the performance will always be a bit flaky. So it’s like a cold start problem, where you really need to have a lot of participants for it to even make sense. I think that a lot of effort will go in that direction. And I truly believe that owning your data, securing your data, and privacy are table stakes now. Security will not be an afterthought for any data service; it will be built in when designing services. Decentralization enables you to do that. With central authorities, you’re putting your keys in their basket and hoping they comply, which I don’t think will work in the future. So I think that if you look at the world 20 years from now, 30 years from now, and you iterate backward, this is the right time to invest in research on these things.

Eric Dodds 52:03
Yeah, absolutely fascinating. Absolutely fascinating. Yes, I know that we have some listeners who are looking for decentralized storage layer companies to invest in. Very cool.

Well, Tejas, this has been such a fascinating episode. I have learned so much and really appreciate you giving us the time and teaching us so much about Netflix Drive, and also about what the future of blockchain and data is.

Tejas Chopra 52:31
Absolutely, it was a pleasure. Thank you so much for having me here. I hope that I was able to provide value to anyone who is listening. And I hope that people bootstrap ideas that can enable these technologies and make data better for the world.

Eric Dodds 52:45
Wonderful. Well, thanks, again.

Tejas Chopra 52:47
Thank you.

Eric Dodds 52:49
I have two takeaways from this. One is that I really appreciated, when we were talking about the decision around how to chunk the files and how big to make those file sizes or lengths, that they had actually done a lot of research on that and decided that keeping things simple and just doing 64-megabyte chunks was just fine, right? They weren’t going to try to over-optimize that, which seems like a really natural thing to do. He wasn’t opposed to doing that in the future. But we’ve heard this a lot of times on the show: we could do something more complicated, but simple works really, really well, which is great. The other takeaway was that I missed an opportunity. We’ve had three people from Netflix on the show, and we haven’t asked any of them, “Can you go home and watch Netflix without thinking about work?” That was a huge missed opportunity. I’m so mad.

Kostas Pardalis 53:52
Yeah, that’s true. That’s true. I think next time, we should do that. Maybe we should also do like a Netflix reunion or something. Get all of them on the show.

Eric Dodds 54:01
We should. We really should.

Kostas Pardalis 54:02
Yeah, yeah. Yeah, we should do that. I mean, I think Netflix is a very special company for this show. We’ve had some amazing conversations with great people so far, and it keeps being amazing what kind of projects are coming out of this company. And yeah, I’m really looking forward to seeing the next generation of cloud storage companies after Dropbox and Google Drive, because it seems, and that’s what I’m keeping from today, that there’s still a lot of space for disruption there. And by the way, they’re going to open source it at some point.

Eric Dodds 54:41
That is going to be so interesting, so we’ll see. Maybe we’ll have him back on as a founder. Well, thanks again for joining The Data Stack Show, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
