Episode 89:

Solving Microservice Orchestration Issues at Netflix with Viren Baraiya of Orkes

June 1, 2022

This week on The Data Stack Show, Eric and Kostas chat with Viren Baraiya, co-founder and CTO of Orkes. During the episode, Viren discusses microservices, open source projects, building a company, and more.


Notes:


Highlights from this week’s conversation include:

  • Viren’s background and career journey (2:23)
  • Engineering challenges in Netflix transitions (6:05)
  • How Conductor changed the process (9:30)
  • Building a lot more microservices (16:04)
  • Open sourcing Conductor (17:38)
  • Defining “orchestration” (22:05)
  • Using an orchestrator written in Java (31:04)
  • Building a cloud service around microservices (34:59)
  • Differentiating product experiences (37:17)
  • Orchestration platforms in new environments (42:15)
  • Advice for those early on in their career (46:10)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show. We are talking with a fascinating guest today, another guest from Netflix, actually. We talked to someone from Netflix early in the life of the show and had a great conversation. Today we're going to talk with Viren, who actually created a technology at Netflix, open sourced it there, and then came back to commercialize it later in his career, which is a fascinating journey. And it's in the orchestration space, which is super interesting and something we haven't talked a ton about on the show, Kostas. I know you have technical questions. My question is going to be: orchestration tooling is not necessarily something that's new, so I want to know what specific conditions at Netflix were the catalyst for actually building a brand new orchestration tool. That's going to be really interesting to hear, especially from the Netflix perspective. What problems were they facing? Where were they at as a company, etc.? So yeah, that's what I'm gonna ask about. How about you?

Kostas Pardalis 1:25
I think it's a great opportunity to get a little bit deeper into the definition of what orchestration is, because orchestration means many different things to different people in software engineering, and I think this is something that's going to be very useful for our audience, which is primarily data engineers, to hear about. So hopefully we're going to spend a good amount of time talking about the different flavors of orchestration that are out there, and when and how we use them.

Eric Dodds 1:54
Absolutely. Well, let’s dig in and talk with Viren.

Kostas Pardalis 1:57
Let’s do it.

Eric Dodds 1:59
Viren, welcome to The Data Stack Show. We’re so excited to chat today.

Viren Baraiya 2:02
Thank you. Thank you for having me here.

Eric Dodds 2:04
All right, so we always start by just getting a brief background on you. So could you tell us: where did you start your career? How did you get into data and engineering? And then what led you to starting Orkes?

Viren Baraiya 2:19
Yeah. I'll keep it short, but essentially I spent the early days of my career working for firms on Wall Street, most recently Goldman Sachs. One thing that is the case with pretty much all the Wall Street firms is that data is their secret sauce, especially in today's world. At some point I had an itch to go somewhere more technical, so I went to work at Netflix. Those were the early days of Netflix in terms of its pivot from being a pure streaming player, and number one at that point in time, to becoming a studio. I got to work with some really brilliant engineers there. I thought there might be an opportunity to scale myself out further, so I spent some time at Google afterward, building a couple of developer products, Firebase and Google Play, to be more precise. And one thing that I had done while I was at Netflix was help build out this orchestration platform called Conductor and open source it, and we had seen great momentum in the open source community. Even from a timing perspective it felt like the right time. So I decided to take the plunge and start building out a cloud-hosted version of Conductor, and started Orkes with a bunch of my colleagues from Netflix. And yeah, here we are. It's been almost a three-, four-month-old journey now.

Eric Dodds 3:37
Wow, awesome. Well, congratulations. You’re sort of just starting out, but that’s super exciting. Okay, I have to ask this question: What was it like going from Wall Street to Netflix? Just from a data standpoint, but also a cultural standpoint, it seems like that would be a huge shift.

Viren Baraiya 3:58
Yeah, absolutely. If you think about engineering practices on Wall Street, for example: Goldman, to be honest, prides itself on being very forward-thinking and very tech-oriented for Wall Street, and they rely a lot more on open source compared to anybody else. So in some ways the engineering, the tech stack and everything, was similar. But how you think about building things is very different. When you think about companies like Netflix, or any tech company, the pace at which innovation happens is very different. It's very rapid, because there you're always innovating for the future, not for the current problem. That was one thing. Second, in terms of the cultural aspect: tech companies tend to be a lot more open to new ideas and to taking bold risks when it comes to technical investment, and you essentially hire the best engineers and let them do their best, as opposed to managing them top-down. So in terms of being able to do things, there's a lot more freedom, I would say. Also on the problem side: you are no longer in the second or third line when it comes to working with the customers. On Wall Street you rarely work with the customers directly, maybe at times, depending upon the team. And more importantly, when your family or friends ask what you do, you can tell them, I work for Netflix, and this is what I did.

Eric Dodds 5:24
Yeah. Yeah, that’s a lot easier at dinner parties and cocktail parties.

Viren Baraiya 5:29
Absolutely, absolutely. Yes. Yes.

Eric Dodds 5:32
Very cool. Thanks for sharing some of that insight. Okay, so let's go back to Netflix. What a fascinating time to be there, when Netflix was going from a content aggregator and distributor to being a producer. Those are very different kinds of companies. And you said Conductor was kind of born out of that transition. What were the challenges you faced as an engineering team that came with that transition?

Viren Baraiya 6:05
I joined a team whose mission was to build out the platform to support the entire content engineering and studio engineering organization at Netflix. And Netflix, as you know, has historically invested very heavily in microservices; it's almost the champion of microservices, right? One side effect of microservices is that you end up with so many of these little services, each with a single responsibility. But as your business processes get more complex, and this started to become especially true in the studio world, you are not only dealing with the data that you own and the teams that you work with, but also external partners and external teams, and really long-running processes. To give you an example: as you know, it could take months before a show is completed, in terms of its entire production process. And now you are managing these long-running workflows all over the place. That was one of the needs we had. As they say, sometimes it's not that you thought of a cool idea, but rather that there was a problem hitting you directly. The traditional way of solving this problem would be to build an enterprise service bus or a pub/sub system, like SQS, and build everything on top of it. And that's exactly what we were doing. What we realized was that it worked well when your processes were very well defined and simple enough. Now, there were two things happening. One was that the number of processes was exploding. The second was that Netflix is not a traditional Hollywood studio company; it's a tech company, and they think about problems in a very different way.
They want to experiment with processes and see what works and what doesn't, which means you want to be able to rapidly change things and test them out. So that agility was another requirement. One thing that we absolutely did not like at Netflix was building monoliths. But what we realized was that we were building distributed monoliths, because now the workflow was spread all over the place, and for one change I would go and talk to a hundred engineers and beg them to prioritize it. And if a product manager wanted to change something, they would have to go and talk to a hundred engineers just to figure out how the process worked. This is where we thought: this is not going to scale, and we have to build something. So that's where we started thinking about Conductor. We started with very simple use cases, and it evolved very organically over a period of time.

Eric Dodds 8:36
Can you give us an example of just one of the simple use cases, and how it solved that across some specific microservices?

Viren Baraiya 8:42
Absolutely. So the very first use case wasn't really very complex. Basically, you have a bunch of images that you have received from a marketing agency, or that you produce from the video files using ML. We wanted to encode them in different formats, one for the browser, for iOS, for Android, for TV, and then deploy them to a CDN. Very simple, right? You take an image, encode it, deploy it, and then you test and see what works and what doesn't. Does a PNG format work better on iOS versus Android? If not, you do the same thing, the entire process, again. It looked very straightforward, a simple application that we thought would be a very good test. And that was the very first use case that we actually built Conductor for.

Eric Dodds 9:24
And so how did Conductor change? What was the process before and then how did Conductor change it?

Viren Baraiya 9:30
So think about how the process was before. There would be an application that is responsible for publishing images. Now, the person building that application is not necessarily the audio engineer or video engineer, or the engineer working with the images. It's a different team, and they have a microservice. So you call that microservice and say, give me an encode in PNG format, and then you wait. We relied very heavily on spot resources on AWS, and Netflix has done some fantastic work there, so it could take some time. You wait for it to complete, then you go and deploy. What if your deployment fails? You retry, and then the thing works. But then your product manager comes and says, hey, what we realized is that this particular format does not work very well, maybe because of latency issues, maybe because of quality issues, whatever; can we change the encoding? So you go back and change your code and redeploy, or you go and ask that engineering team, hey, what's the API endpoint I can use to encode in a different format, and you add it to the flow. It's a very intensive process. And sometimes we were changing this multiple times during a week, to see what works and what doesn't, getting feedback from the users, from A/B tests and whatnot. Now I want to deploy 20 images instead of 10, because I have one more test to run. So those things were starting to become a little bit unmanageable. And this is a very simple example. If you put something in between to say, okay, depending upon the country where we are launching the show, I want a different format, because some countries probably need a lower-bandwidth image, it starts to get very complicated, as you could imagine.

Eric Dodds 11:08
So interesting. So in the new world, Conductor sits on top and interacts with all of the various microservices to streamline that process.

Viren Baraiya 11:19
Absolutely. In the new world, essentially, what happens is that instead of writing all that code, you say: I have a microservice that can do image encoding, and I have got 10 different ones, each responsible for a different kind of format. Then, as an engineer, you work with your product manager and say, okay, what's the flow that we want to see? If the country code is this, this is the set of images we will produce, and this is the CDN location we will publish to. You actually build a DAG, a graph of what the whole thing is going to look like. When a new show is ready to be published, you call it and it does everything. If you want to change something, you go back and update the workflow, because the microservices are there; they are not changing as much. It's the flow that you are tweaking, fine-tuning, and optimizing. And now, as an engineer, at some point you can hand the whole thing to the product manager and say, why don't you just try it out yourself, and if you find something missing, I can build a microservice and you can plug it in. It becomes a much tighter transition between the engineers and the PMs. Engineers are still responsible for building these things, but your work gets simplified. Now you don't worry about, oh, I have to put in retry logic; that's taken care of by Conductor. Conductor will take care of retries. You basically write everything for the best-case scenario, and Conductor takes care of all the edge cases, the failures, the retries, and everything else.
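To make the DAG-as-data idea concrete, here is a minimal sketch of what such a flow definition could look like, written in the JSON style that Conductor's workflow definitions use. This is a hypothetical illustration: the workflow name, task names, and parameters are invented, not Netflix's actual definitions.

```python
import json

# Hypothetical sketch of a Conductor-style workflow definition for the
# image flow described above. Names and parameters are invented for
# illustration; the real schema lives in the Conductor documentation.
publish_artwork = {
    "name": "publish_artwork",
    "version": 1,
    "tasks": [
        {   # One encode task per target platform/format; independent tasks
            # like these can run in parallel.
            "name": "encode_image",
            "taskReferenceName": "encode_png_ios",
            "type": "SIMPLE",
            "inputParameters": {"format": "png", "platform": "ios"},
        },
        {
            "name": "encode_image",
            "taskReferenceName": "encode_webp_android",
            "type": "SIMPLE",
            "inputParameters": {"format": "webp", "platform": "android"},
        },
        {   # Publish the encoded assets to the CDN once encoding completes.
            "name": "deploy_to_cdn",
            "taskReferenceName": "deploy_cdn",
            "type": "SIMPLE",
            "inputParameters": {"region": "${workflow.input.region}"},
        },
    ],
}

# The definition is plain data, so "changing the flow" means editing and
# re-registering this document; the microservices behind each task are
# untouched, which is the agility Viren describes.
definition_json = json.dumps(publish_artwork, indent=2)
```

Retry counts and timeouts attach to the task definitions rather than to application code, which is why workers can be written for the best-case path only.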

Eric Dodds 12:46
Fascinating. Okay. I have one more question, because I know that Kostas' mind is rapidly exploding with questions. And this may sound funny, but was it an immediate success in terms of adoption? Because sometimes, even if technology makes things better, adoption can be difficult: people say, "Oh, we don't necessarily want to change the way that we do this," or, "it's actually work to migrate our whole thing." How did that happen culturally inside of Netflix?

Viren Baraiya 13:16
Actually, that's a very interesting question. Because Netflix famously has a culture of freedom and responsibility, which means you don't have mandates saying, hey, use this framework. You know what frameworks are there, and you choose what you want to use.

Eric Dodds 13:28
Sure. Yeah. Like self serve with all these options.

Viren Baraiya 13:31
Exactly. Yeah, and nobody's going to tell you why you chose something; it's up to you to decide how to get the job done. So that becomes a very challenging thing: you can't just build something and get a VP or director to send out an email to everyone saying, we've built this fantastic new framework, everybody must use it. No. So we had to go and talk to everyone. One approach we took right from the beginning was that we ourselves were developers, so we understood the pain points developers have, and we made it very much a democratized effort. We did not have a product manager; we made a very conscious decision that we didn't want a PM shepherding the product, but rather, let's talk to engineers and ask what they want. Every feature we built came out of a necessity, a recommendation, or a need from another engineer. That was one thing. The second thing was that we kept development very agile. Rather than trying to build a perfect system from the get-go, we made it functional, and we built in the resiliency and everything along the way, as we were testing with the internal users. That was another way we tried to evangelize within Netflix itself. Of course, there were always skeptics, or people who wanted it a different way, and we tried to keep it as open as possible. A side effect of that is that if you look at the current repository, it's a very flexible system, pretty much plug-and-play; you can plug things in and out like Lego blocks. That was one of the reasons it turned out like that: we wanted to be able to satisfy as many needs as possible. In some ways that increases the complexity and the effort.
But the advantage was that everybody felt they had a stake in the game. The most important thing was to get them invested in the product, and then you have someone who is happy.

Eric Dodds 15:25
Super interesting. Okay, I lied. I have just one more question. I promise, Kostas, because it’s always fun to hear about how these projects sort of form inside of a company like Netflix with such an interesting culture. I know that microservices were a catalyst for building Conductor to sort of make them easier to interact with, but it wouldn’t surprise me if actually microservices proliferated at a higher rate after people started using Conductor because it was easier to manage a lot more. Did you see people building a lot more microservices?

Viren Baraiya 16:03
Absolutely. Because, see, what happens is that if you don't have something like Conductor, then you tend to take the shortest path, especially when you are under time pressure to deliver things. If you were to build a complicated business flow, building five microservices, writing orchestration logic on top of them, and making all of these things work is more time and effort than putting everything into a monolithic block and getting it out. So in some ways Conductor encouraged people to break things down, because there's another side effect: the moment you break it down, you have a lot more composability, the ability to change flows and everything. That was one thing that really inspired people to do it. The other thing was that we built two critical aspects that everyone wants: traceability and controllability. You can actually trace the entire execution visually and see the graph. This turned out to be a sleeper hit for us. I had never thought that this was going to be the killer feature; we thought the killer feature was going to be the distributed scheduler. But no, the killer feature was the UI. People loved it because it could tell them exactly where things went wrong. Because that's the problem you face: otherwise you go and look at the logs everywhere to see what's going on, versus, I have a UI, I just click on it and say, oh, this is what's wrong, go fix the code, redeploy, and yeah, it works. So the UI gives you features you otherwise wouldn't get, and that encouraged people to write more microservices.

Kostas Pardalis 17:32
I have a question about the open source side of the project. How soon after you developed Conductor internally did you open source the project?

Viren Baraiya 17:45
I think it was about a six-to-eight-month journey. We took it to a place where it had enough features that it did not look like a toy project. We also had to decouple everything from the Netflix-internal side of the world. And we wanted to put together some amount of governance process as well. My team did not have any open source product that we were managing ourselves, so we had to figure it out, learning from other teams at Netflix. So yeah, overall it took us about three quarters. And from the day we decided that we would open source it, it took about a couple of months to get everything ready: legal reviews, patent reviews, and all of those things.

Kostas Pardalis 18:27
Yeah, makes sense. And how long did it take after you open sourced the project to start getting engagement and creating a community of adopters around Conductor?

Viren Baraiya 18:39
I would say, what I've seen is that typically you have this initial buzz, right? With open source, people are excited, they want to try it out. So there was this initial bump, and then it starts to taper off, because there's nothing new there. And it kind of stayed there until very recently, I would say. What was happening was that we were doing meetups at Netflix about Conductor, and also talking about Conductor at other meetups and everywhere. As we talked to people, the momentum started to grow. The other thing was: if the community is always just a consumer of the product, the community does not grow well. So we also made it much easier for people to contribute back to Conductor. And once people started contributing back, it started to grow further, because now, again, they have a stake in the game; they have ownership in the product itself, because they have contributed.

Kostas Pardalis 19:33
Yep, of course. Makes a lot of sense. And do you have any use cases that came out of the open source community that surprised you?

Viren Baraiya 19:44
Absolutely. One use case that, if you had asked me before I learned about it, I would never have thought of: workflows for security. People are using Conductor to orchestrate security flows, things like detection. For example, let's say you put a file in an S3 bucket. Typically you want to run some processes and checks to ensure that you didn't upload a secret, that there is nothing sensitive in the file, and that there is no virus in it. So you run a bunch of workflows around it, some automated, some manual. And this is all done by folks who are in the security space, not necessarily writing microservices, but it turned out to be a very good use case for them. So this was one thing that surprised me: there is a strong use case here that I had not thought about.

Kostas Pardalis 20:32
Yeah, that’s so fascinating. I would never think about it.

Viren Baraiya 20:37
Exactly. But the more I think about it, these are long-running flows, right? It might take some time to scan an object if people are putting thousands of objects in an S3 bucket, for example; it may not get real-time treatment, so you have a backlog. And if you find something, maybe somebody has to do a manual intervention to verify it, right? Then you have a human process involved: you send an alert or something and wait for someone to reply. So all these flows together become a pretty good use case. That's what I realized.
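Sketched in the same workflow-definition style, a flow like that might pair automated scans with a pause for human review. This is purely illustrative: the step names are invented, and while Conductor's open-source task vocabulary includes decision-style and wait-style system tasks, the exact type names used here should be checked against the current documentation.

```python
# Hypothetical sketch of the S3 scanning flow described above: automated
# checks first, then a human-review pause only when something is found.
# Step names are invented; only the overall shape mirrors the conversation.
scan_uploaded_object = {
    "name": "scan_uploaded_object",
    "version": 1,
    "tasks": [
        {"name": "scan_for_secrets", "taskReferenceName": "secrets",
         "type": "SIMPLE"},
        {"name": "scan_for_malware", "taskReferenceName": "malware",
         "type": "SIMPLE"},
        {   # Branch on the scan results: clean objects finish immediately,
            # findings route to an alert plus a wait for a human reviewer.
            "name": "route_on_findings",
            "taskReferenceName": "route",
            "type": "SWITCH",
            "decisionCases": {
                "finding": [
                    {"name": "alert_security_team",
                     "taskReferenceName": "alert", "type": "SIMPLE"},
                    {   # The workflow parks here until someone responds,
                        # which is what makes long backlogs cheap to hold.
                        "name": "await_review",
                        "taskReferenceName": "review",
                        "type": "WAIT"},
                ],
            },
        },
    ],
}
```

The orchestrator persists the state of every parked execution, so thousands of objects waiting on a scanner or a reviewer are just rows of state, not threads or queues the security team has to manage.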

Kostas Pardalis 21:07
Yeah, makes sense. Okay, I'll probably get back to more open source and company related questions a little bit later, but I'd like to discuss with you what orchestration means. It's a term that is used a lot in software engineering, and not necessarily in the same way by everyone. We have the orchestration that data engineers talk about, we have orchestration that has to do with microservices, we have workflow orchestration, then we have orchestration on Kubernetes, and probably many other types of orchestration out there. Can you help us define, let's say, a taxonomy of the orchestration tools out there, and understand better the differences between the different tools and the use cases?

Viren Baraiya 22:03
Yeah, absolutely. As you say, orchestration is an overloaded term; it has different meanings for different people and use cases. Having spent some time in this space, what I realized is that, essentially, depending on the persona of whoever is using the word orchestration, it carries a different meaning. Going top to bottom in a company: if you look at people on the business side, business analysts and product managers who are dealing with business processes at a high level, when they think about orchestration they are looking at how various business components are getting orchestrated. In an e-commerce company, this might be: how am I orchestrating between my payment, shipping, delivery, and tracking systems, or fulfillment services and things like that. Again, this is at a business level. And when they think about measuring the output of the orchestration, it's SLAs and whatever other key metrics they might be defining: the time it takes to complete certain activities, mean time between failures, how often things fail, and where the optimizations are that they can make based on those data points. The ones who are actually building the systems, though, are typically the back-end engineers. When you describe the same flow to a back-end engineer, for them you could think about these individual pieces as microservices, or other services which, for lack of a better word, you can just call microservices. For them, orchestration means: I have a whole bunch of microservices and I have to build a flow around them; how do I build that?
But now, what I am looking at as an output from this is: how do I handle certain things? In a distributed world, services are going to fail, and services are going to have different SLAs. How do I handle failures, retries, and different SLAs across them? I want to be able to run some things in parallel so I can optimize the time it takes to complete the entire process, or the resources, and everything. How do I achieve that? And if I'm doing some things in parallel, how do I wait for them to complete? That's how a back-end engineer thinks about orchestration. Then, if you go one level further down, to the platform side, you can zoom into an individual microservice. That microservice is typically getting deployed onto a container these days, or VMs, or even a bare-metal machine somewhere. But you're not deploying just one thing; you are deploying a whole bunch of things, and typically you don't only deploy a service, you deploy a service with some more semantics around it: the networking configuration, databases, and everything. That starts to get into more of the continuous-deployment side of it, which is where container orchestration has become very mainstream with Kubernetes. Argo is another one, where essentially it doesn't matter what you're doing: you have a piece of code that you are deploying, scaling out, and scaling down. That's what you are focusing on. That's another level of orchestration that is happening.
And just to go back to the back-end engineer: there are different flavors of back-end engineers too. There are back-end engineers who are working with product managers to build an application, and you also have data engineers, who are dealing with massive amounts of data and orchestrating it. This is where things like Airflow come in; at Netflix, Spinnaker is also used for similar purposes. You have data sitting in different places, and you are essentially orchestrating it. In a batch world, you are processing data, aggregating it, putting it into a database, maybe running a machine learning model, making inferences, and putting those into a database. The whole thing is basically a flow, and an offline flow is very well orchestrated through something like Conductor or Airflow and similar systems. A slight variation of this is real-time data platforms, where you still have flows. Let's say I click a button on my phone or a website, and it sends a signal, an analytics data point, back to the server. Now this has to go through a certain journey: you are waiting for it to go through streaming aggregation, and once it is aggregated, it goes through maybe a couple of other systems, where either it is used for further aggregation and lands in more of an analytics store, or maybe you are doing real-time monitoring through machine learning. That's the other flavor, where there is no start or end of a workflow; it's a continuously running pipeline, but you have a complicated flow that is built out. I think Kafka or Confluent has some tooling around that, but that seems to be still a very wide-open space right now. I would say it's still an unsolved problem.

Kostas Pardalis 27:01
Makes sense. So just to give an example, because our audience is primarily people who are working with data, data engineers: what's the difference between a system like Airflow and Conductor? And why wouldn't I use, let's say, Conductor, and prefer Airflow instead, to orchestrate my pipelines?

Viren Baraiya 27:26
Absolutely. So like, if you think about Airflow, from its genesis, and the kinds of use cases it solves, it’s mostly about data, right? Like, typically you have data sitting in different buckets, or databases like Hive, and you are processing, right. And these are typically batch jobs that you run an hourly basis, maybe twice a day, three times a day, or daily and things like that, and runs through the data pipeline. The other important part of a data pipeline also is kind of the dependency management, right? Like, you have to run in a specific sequence. Because your data, it didn’t even step depends upon the previous step. Also, reliability in the context of data is very different from a reliability in a microservices world, right? When you think about returning some data, you are essentially running data for that particular date, or a time frame, right. And you’re only processing that data alone, you’re never posting the latest data. So that’s, that’s kind of high-level use cases that I’ve seen Airflow being used for any does well, right. Also, if you think from the users of Airflow, right, these are mostly people dealing with data. And the language of the choice today for that is Python. So Airflow, DAG, is are written in Python, and they tend to be simple in nature, right? Like, you have a sequence of things that you do, sometimes you put an end, you’re done with it like this, pipelines are very stable, very fixed, you don’t change them every day AB testing of this pipeline doesn’t make sense to do that right? Conductor kind of goes on to the extreme, it’s more about flows, which could be running for a very long period of time, say months at end, or very short one where you complete the entire flow in few 100 milliseconds. Yeah, and everything in between, right. But instead of running a few executions a day, you could be running few hundreds of millions of executions per day, or even a billion executions per day, depending upon your use case. 
So the scale side of it is very different. At the same time, a typical workflow is not operating on petabytes of data; a step in the workflow is typically dealing with a finite set of data. Although sometimes you do. For example, one use case that we had at Netflix was processing the video files, and a raw video file could be a petabyte in size. In that case, you're still processing for a longer period of time. The other thing is that Conductor is very general purpose, and it is meant for pretty much the entire spectrum of audiences. So it's very much language agnostic. We had workflows where one step was written in C++, another in Python, another in Java, and so forth. So it allows you to mix and match depending upon what each step in the process needs. So Conductor becomes very useful in these kinds of scenarios where you have a very heterogeneous environment. And the scale is another thing.
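To make the contrast concrete, here is a minimal sketch of what a Conductor workflow definition looks like, written as a Python dict mirroring Conductor's JSON format. The workflow name, task names, and input parameters are hypothetical, invented for illustration; the field names follow Conductor's commonly documented schema, but verify them against the documentation for your Conductor version.

```python
# A minimal, hypothetical Conductor workflow definition expressed as a
# Python dict (Conductor workflows are registered as JSON documents).
# Each SIMPLE task is executed by an external worker, which can be
# written in any language that speaks Conductor's HTTP/gRPC API.
encode_workflow = {
    "name": "encode_video",          # hypothetical workflow name
    "version": 1,
    "schemaVersion": 2,
    "tasks": [
        {
            "name": "transcode",                  # worker task type
            "taskReferenceName": "transcode_ref",
            "type": "SIMPLE",
            # Wires the workflow's input into the first task.
            "inputParameters": {
                "fileLocation": "${workflow.input.fileLocation}",
            },
        },
        {
            "name": "publish",
            "taskReferenceName": "publish_ref",
            "type": "SIMPLE",
            # Depends on the previous step's output, so Conductor runs
            # the tasks in sequence.
            "inputParameters": {
                "encodedFile": "${transcode_ref.output.location}",
            },
        },
    ],
}
```

Unlike an Airflow DAG written as Python code, the definition above is pure data, which is part of what lets Conductor stay language agnostic.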

Kostas Pardalis 30:10
That's very interesting. As you were talking about Airflow and Python, a question came up for me. So, Conductor is written in Java, right? How easy is it (I mean, you gave an answer, but I would like to hear a little bit more) for a team that is primarily using Golang, for example, to create microservices, to adopt an orchestrator written in Java? Because probably you're not going to have a team there that knows Java, right? How does this work? And which team is using it? Is it the platform team? Who is responsible for managing, deploying, and taking care of the orchestrator?

Viren Baraiya 31:02
So let me answer the first question. The way Conductor works, essentially, is it exposes its API through HTTP and gRPC, and that's how it becomes language agnostic, right? So let's say you're a Golang shop: you're writing your microservices in Golang, and you're building your orchestration flow in Conductor. Conductor also provides client APIs. There are two parts to Conductor: the client SDK and the server. The server side is in Java, and the SDKs are written in different languages; I think there are three right now: Java, Python, and Golang. So you use the SDK in that particular language to interact with Conductor. And gRPC is great where, if you want to create bindings for Rust, for example, you can do that using the gRPC compiler. So that's how it works today, and it is why it is language agnostic: the entire model is built that way. The second part: who runs Conductor? It's a very interesting question. I have seen both sides of it, in the sense that there's a model where a platform team is responsible for running Conductor. This was exactly what we were doing at Netflix, and my team was responsible for managing it as a platform for all the teams. But that was the model at Netflix, where you have a platform team, and this tends to be a lot more common in tech companies, where the platform team is responsible for all the components and then everybody else uses them. We've also seen the other side, where you have business teams that own almost the entire stack by themselves, and then they are responsible for running Conductor on their side. So I think, in some ways, it goes back to the culture of the company: how teams are formed, what their usage model is, and how maintaining the products works.
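The worker model Viren describes can be sketched like this: a worker in any language polls Conductor over HTTP for pending tasks, runs its own logic, and posts the result back. This is an illustrative Python sketch against Conductor's REST API; the endpoint paths follow Conductor's documented poll/update routes, but the server URL, task type, and input/output keys are assumptions invented for this example.

```python
import json
import urllib.request

CONDUCTOR_URL = "http://localhost:8080/api"  # assumed server address


def handle_transcode(task_input):
    """Pure task logic: given a task's input dict, return its output dict.
    The 'fileLocation' key is a hypothetical input parameter."""
    location = task_input.get("fileLocation", "")
    return {"location": location + ".encoded"}


def poll_and_execute(task_type="transcode", worker_id="worker-1"):
    """One poll/execute/update cycle against the Conductor REST API."""
    # Poll for a pending task of the given type.
    with urllib.request.urlopen(
        f"{CONDUCTOR_URL}/tasks/poll/{task_type}?workerid={worker_id}"
    ) as resp:
        task = json.load(resp)

    # Run the task logic locally, in whatever language the worker
    # happens to be written in.
    output = handle_transcode(task.get("inputData", {}))

    # Report the result back to the server.
    result = {
        "workflowInstanceId": task["workflowInstanceId"],
        "taskId": task["taskId"],
        "status": "COMPLETED",
        "outputData": output,
    }
    req = urllib.request.Request(
        f"{CONDUCTOR_URL}/tasks",
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

A Golang or Rust worker would do exactly the same dance against the same endpoints, which is what keeps the server's Java implementation invisible to the teams using it.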

Kostas Pardalis 32:41
Yep. Yep. That's interesting, and probably not solved yet. I mean, as you said, it also has to do a little with the culture. I hear a lot about platform teams, but that doesn't mean that every company out there has a platform team, or that you can just wake up one day and say, let's have a platform team now.

Viren Baraiya 33:01
There are two challenges. Building a good platform team is not easy, and hiring for a platform team is even more difficult. Hiring engineers is already difficult; now we're talking about hiring platform engineers, which makes it exponentially harder to build the team. I think, in reality, what really works well is to treat your platform team as a mini cloud team in your organization. Today, for example, if I want to use an RDS database from AWS, I can go to the console, provision one for myself, and start using it, and AWS takes care of everything else for me: provisioning, backups, restores, everything. So if you end up building a platform team that can get to that stage where, for any product, say Conductor, they are able to offer it in a self-service mode, they can focus on building that platform out. But then again, cloud companies are offering more and more of these things, so the line between the internal platform team and a cloud provider becomes thinner and thinner day by day.

Kostas Pardalis 34:01
100%. All right. So let's go back to building a company. You open sourced the project, you started having some traction out there, and at some point you decided to build a company around the core technology. My first question: we are talking about an orchestrator that is going to be interacting with microservices, and as you said, there are use cases where you might want to run millions of interactions and the latency should be super low, right? How do you build a cloud service around that? How do you make sure that the microservices the company is building are, let's say, sharing the same resources or the same networks, and all the stuff that's needed to make sure the latency remains as low as possible?

Viren Baraiya 34:01
The key to that is essentially your deployment model, right? The lower the latency you want, the more colocated you want to be. Essentially, what we have done is build out two different models of deployment. In one deployment, you have Conductor running in a separate VPC and your microservices in a separate VPC, with VPC peering that allows them to communicate with each other, and you try to keep affinity between availability zones so your network traffic does not go through very heavy hops. The second model, where you want really low latency, is to deploy Conductor inside your own network, as close to the microservices as possible, reducing network hops, because now we're talking about a few tenths of a millisecond of latency difference. Or even embedded: the beautiful thing about Conductor is that it can run in a cloud environment and run millions of flows, if not billions, every day, or you can embed it where you are running with a very low memory footprint, pretty much in the customer's edge environment, as small deployments. So you essentially have to make those types of decisions: figure out what your requirements are and how you deploy. And to be honest, this was something that we had to think through and figure out exactly how it was going to work, and come up with a solution there. But it's always an interesting challenge to solve.

Kostas Pardalis 36:31
So how is the product experience different between the two deployment models? The reason I'm asking is because Eric is probably aware of this: at RudderStack, there were multiple different deployment models, although for different reasons; there it wasn't so much about performance, in many cases it had to do with compliance. But building a product that has a consistent experience regardless of the deployment model is super, super hard. So how do you approach this problem?

Viren Baraiya 37:12
I think about the end users of the system. I would say there are two groups. First, there are the engineers who are actually using the product to build the applications. For them, there should be no difference: they are still dealing with the same set of APIs, the same set of constructs, and everything. They go to a URL, look at their workflows in the UI, and manage everything. That experience must be consistent no matter how things are deployed. The second set that actually matters is the people who are responsible for the operational aspect of it, and this could be a platform team, DevOps, SREs. This is where I think the key difference comes in, and it's very similar to running a relational database in a VM that you have provisioned yourself versus running something like RDS. A fully hosted service gives you an experience where essentially you don't even need that team; things are taken care of for you. It scales automatically for you; you don't have to worry about backups, or what kind of database you should be using to get a certain performance, or whatnot. You just specify how much capacity you want to run with, and the system scales for you. The other option, when you are running in your own environment, is that you are making all those decisions yourself: how big the instances should be, when backups should run, and if something goes wrong, how to restore from a backup. You are also responsible for the costs; you can't run a 1,000-node cluster without having that show up on your annual or monthly billing from AWS or GCP. So for those people, the experience becomes slightly different. Ultimately, the goal is still to make it easy in terms of the UI interactions, the console side of the world.
Whether you are dealing with the Conductor console in the cloud to, say, provision a cluster, see where your backups are, and restore everything, it should be as frictionless as possible. Once you say, here's a backup, you download it, run this command, and it's just made available. So that part is, I think, where the challenges are.

Kostas Pardalis 39:21
Makes all sense. You were going through your journey and you talked about the differences between working in the financial sector and then going to a B2C company like Netflix. But if we take it from Netflix onward: you were at Netflix, working with your customers being inside Netflix; then you open sourced the project, so finally you had a much more open platform to experiment with, because people were using it and giving feedback. And now you did another step forward and started the company. So how does it feel, and what's the difference between these steps that you had to go through?

Viren Baraiya 40:13
I think there are some things which are common. For example, you still care about the community: we're still working with the community, trying to build it and grow it. That part does not change much. The product focus, in some sense, also: you have the same amount of focus whether you are internal or external. One key difference is that internally you sometimes have other pressures, like, I need this feature because we have this thing coming up. In a company, your prioritization works differently; it could depend on the customer pipeline, the features, and things like that. The second thing is that when you build a company, you cannot just think about the product alone. You also have to think about everything around it: the company, your investors, your customers. And especially in a startup environment, you are the engineer, you are the customer support person, you are the marketing person, you are the revenue officer; you're wearing pretty much all the hats. So you're probably doing 10x the work.

Kostas Pardalis 41:21
Yep. And you also have to make money, which is also important.

Viren Baraiya 41:27
Absolutely. All right.

Kostas Pardalis 41:29
That's awesome. One last question from me, and then I will give the microphone to Eric. A little bit of a technical question: we talked about orchestration of microservices, right? There is another kind of computation platform, or model, that's becoming more and more popular lately, which has to do with edge computing, where you have these functions that are close to the edge and executed from there, and all that stuff. Do you see any opportunities there for orchestration platforms to work in such an environment? And if yes, how?

Viren Baraiya 42:12
Actually, that's an excellent question. And to be honest, there is a huge opportunity there. The reason is this (and this is, again, my interpretation, so I could be 100 miles off from reality): hardware has become a lot more powerful, so there's a lot more opportunity to push a lot of processing closer to where the customers are. This could be, for example, on the embedded devices side, where you are not running in the cloud; the whole thing runs in a customer environment, sometimes on premise. And if it does one thing, that's fine, but usually you have multiple things that you are coordinating again, right? And the concerns around reliability, fault tolerance, observability, and failure handling do not go away just because you pushed it to the customer environment. But now you're in a position where you have much less visibility and control over this environment, so you want this to be even more reliable and able to handle more failures compared to anything else. So in some sense, that's a huge opportunity. At the same time, there are some constraints: even though hardware has become powerful, you are still constrained on memory, for example, and on the number of components you can run, because the device is also running other things, and that goes for the orchestration too. But to be honest, we have seen use cases for Conductor in this space, and there are some customers using it in that particular area.

Kostas Pardalis 43:37
Oh, that's very interesting. Do you feel like there are also changes that need to happen in how, let's say, the orchestrator is architected in order to work more efficiently in an edge computing environment? Or are we fine with how Conductor was designed and implemented so far?

Viren Baraiya 43:59
I mean, it does need some changes. For example, if you're in the cloud, you can have Cassandra and Elasticsearch and Redis and a few other components working together, and that's completely fine, because you have all these things at your disposal and you can orchestrate them. The moment you put it in an edge environment, you want a lot more self-contained systems. So you are almost going back to the drawing board and asking: what are the bare minimum components that you need? What can be run in an embedded mode? You find the alternative and plug it in there. One advantage that we had with Conductor was that, because it was designed as a modular system from the beginning, it was possible to say, okay, we cannot use Elasticsearch because it's just too expensive and not possible to run in an edge environment; it should be replaced with this embedded database. You implement the interfaces and get it done. That was an advantage, but as you say, it requires changes.
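The modularity Viren describes, swapping Elasticsearch for an embedded store behind a common interface, can be sketched like this. This is illustrative Python, not Conductor's actual Java DAO interfaces; the class and method names are hypothetical, and the Elasticsearch implementation is stubbed out.

```python
from abc import ABC, abstractmethod


class IndexStore(ABC):
    """Hypothetical storage interface. Conductor's real Java code defines
    similar pluggable interfaces that deployments can swap implementations
    behind without touching the orchestration engine."""

    @abstractmethod
    def index(self, workflow_id: str, doc: dict) -> None: ...

    @abstractmethod
    def get(self, workflow_id: str) -> dict: ...


class ElasticsearchStore(IndexStore):
    """Cloud deployment: back the interface with an Elasticsearch cluster.
    (Calls stubbed out with a dict; a real implementation would use an
    Elasticsearch client.)"""

    def __init__(self):
        self._stub = {}

    def index(self, workflow_id, doc):
        self._stub[workflow_id] = doc  # stand-in for an ES index request

    def get(self, workflow_id):
        return self._stub[workflow_id]


class EmbeddedStore(IndexStore):
    """Edge deployment: a small in-process store, no external dependency."""

    def __init__(self):
        self._data = {}

    def index(self, workflow_id, doc):
        self._data[workflow_id] = doc

    def get(self, workflow_id):
        return self._data[workflow_id]


def make_store(edge: bool) -> IndexStore:
    # The orchestrator picks an implementation at startup; the rest of
    # the engine only ever sees the IndexStore interface.
    return EmbeddedStore() if edge else ElasticsearchStore()
```

Because the engine depends only on the interface, going "back to the drawing board" for the edge mostly means writing one new implementation rather than rearchitecting the system.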

Kostas Pardalis 44:55
Yeah. Awesome. That's very interesting. Hopefully, we'll have the opportunity in the future to talk more about that stuff. Eric, all yours.

Eric Dodds 45:04
This has been such a fun conversation. Unfortunately, we're really close to time here, so we only have time for one question, although we know that I always lie about only having one question. I think a lot of us who work in the tech industry dream of being at a company like Netflix, being instrumental in building a technology that solves a major problem and then goes on to be open sourced, and then, for some of us who are entrepreneurial in nature, actually starting a company on top of that. I mean, hey, congratulations, that's really just an incredible journey, and I think it's an aspirational story for a lot of us. Do you have any advice for people who, like me, see that as sort of the pinnacle of the experience of being involved in engineering and cool open source projects and solving problems? I would just love for you to talk to some of our listeners who are early or mid-career and give them some advice that you learned along the way.

Viren Baraiya 46:04
Yeah, sure. I mean, I'm early in my journey, so we'll see how that ends up, but here's what my thought process was. You can keep doing the same thing and keep polishing; you can go from Netflix to Google to somewhere else, like Meta, for example, and keep doing those things. But in the end, the way I think about it is that unless your career progression gives you a kind of step function, it's not worthwhile, and you ought to look for those step functions. And that could be, for example, learning a new technology, coming up with some new frameworks, evangelizing those things. And what's the next thing after that? It's to productize it and see how it works. And it's a very different kind of experience. There's one thing about building a product, where you are dealing with your IDE and compiler and breaking your head over bugs, and then a totally different ballgame compared to that: how do you go about raising money? Actually, even before that: you don't start a company by yourself, you also have to find a co-founder. So first, how do you convince your potential co-founder that, hey, this is a great idea? Once you convince them, you have to go and find an investor, because especially in the enterprise world, you can't just go it alone; you need some outside investment. How do you show the value, that what you're building makes sense, that you have the right skill sets to go and build this out? So that's the story, but how do you tell the story in a compelling way? That's the harder part. And then finally, once you have that, how are you going to go about building this out? Where are you going to hire people? How are you going to scale? How are you going to find customers? What's the go-to-market strategy?
And how do you actually implement it? It starts to get into all of that. So it's a very rewarding experience. When I was thinking about it, what I realized was that no matter what the outcome is, I'm going to come out on top, because that learning is going to be valuable, and it's going to be super useful in my career.

Eric Dodds 48:06
That's such good advice for all of us: no matter the outcome, if you learn, that's the ultimate progress. So thank you for that. And thank you so much for helping educate us on orchestration as a category and all of the differences there. We had a great time on the show. Thanks for joining us.

Viren Baraiya 48:26
Yeah, thank you so much. Thanks for all the insightful questions. Yeah, it was really fun.

Eric Dodds 48:30
I don't know if I have a really insightful, technical, or data-related takeaway from this show, so forgive me, but I just think it's really interesting to think about working on infrastructure at Netflix while they're transforming the company from being a content distributor to being a content producer. It was actually fascinating to hear that problem described through the lens of microservices. I mean, in a Harvard Business Review case study of Netflix's pivot from distributor to studio, they're not going to talk about microservices. But that actually was a real pain point as they were making the transition. So I just really appreciated that perspective. You wouldn't really hear about that particular flavor of technical challenge in the process of a transition like that, so it was really fun to get an insight there.

Kostas Pardalis 49:32
Yeah, and it seems like Netflix is one of these companies that are really seeding the next wave of innovation right now. I mean, there are a couple of different products and companies actually coming from Netflix, which is great. And it's super interesting to see how all these people worked together at Netflix, and now they're out there in the market, building companies and creating new products. So they definitely did something right, and I guess the Harvard Business Review should look into it at some point. But outside of this, and all the very interesting conversation we had about the technical details of orchestration, one thing I'll keep an eye on and would really love to learn more about is edge computing and orchestration. It's still early for this kind of technology, but I think we are going to be hearing more and more about it in the future. So that's another thing that I'm keeping from this conversation.

Eric Dodds 50:41
For sure. And if anyone listening is with the Harvard Business Review, it's a little bit abnormal, but we're happy to do a cover story if you're interested, so definitely hit us up. Reach out to Brooks if you want to talk about that. Lots of great shows coming up. We will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.