Episode: 56

Stream Processing and Observability with Jeff Chao of Stripe

with Jeff Chao

Software Engineer, Stripe

This week’s episode of The Data Stack Show features a conversation with Eric, Kostas, and Stripe software engineer Jeff Chao. Jeff has a background in stream processing, has worked at Salesforce, Heroku, and Netflix, and has been maintaining an open source project called Mantis.

Notes:

Highlights from this week’s conversation include:

  • Jeff’s history with stream processing (2:52)
  • Working with Mantis to address impact of Netflix downtime (4:20)
  • Defining observability as operational insight (6:58)
  • Time series data and the value of data today (18:52)
  • Data integration’s shift from batch to streaming (29:34)
  • The current state of change data capture (32:20)
  • How an engineer thinks of the end user (56:21)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds  00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds  00:27

Welcome back to the show. We have a really special guest who has worked on some unbelievable technology at companies like Heroku and Netflix, really deep in the stack of those companies. So what a treat to get to talk to someone who has that level of experience.

Eric Dodds  00:47

I’m super excited to talk to Jeff. One of the things I’m interested in asking him about is the technology he’s worked on, which tends to be internal infrastructure technology at these companies. So I’m really interested to know, what does he think about the end consumer of the product? Netflix is a great example. He has worked on a lot of infrastructure stuff, but does he think about the end consumer in his work? Because the work product doesn’t necessarily touch the consumer in the sense of the last mile. So if we have time, I’m going to try to sneak that question in. But Kostas, I know you have a ton of technical questions. So what are you interested in?

Kostas Pardalis  01:35

Yeah, absolutely. I mean, Jeff is also one of the maintainers of an open source project called Mantis, which is a stream processing framework. So Jeff is the kind of person that you could call an expert in stream processing, and I have quite a few questions around that. I want to learn more about what stream processing is and why it’s important. It’s also a term that keeps coming up in our conversations, again and again, especially compared to batch processing. So I really want to hear his opinion about what happened: how was stream processing 10 years ago, how is it today, and where are we moving in terms of these technologies? So yeah, I think I’ll focus more on the stream processing side of things. And we’ll see, I’m pretty sure there will be surprises.

Eric Dodds  02:27

There always are. Great. Well, let’s jump in and talk with Jeff.

Kostas Pardalis  02:30

Yeah, let’s do it.

Eric Dodds  02:32

Jeff, we are so excited to have you on. Thanks for taking the time to join us.

Jeff Chao  02:37

Hey, thank you. It’s a pleasure to be here.

Eric Dodds  02:40

You have a really impressive resume, and I’m gonna let you give us a quick background. So where did you come from? And what do you do today?

Jeff Chao  02:52

Hey, great. Thanks there. So yeah, I’m Jeff. I have a stream processing background. I’m currently at Stripe. I have been here for a little over a month. But prior to that I was at Netflix for a little over three years. And at Netflix, I worked on an open source project called Mantis. We can talk about it later, but fundamentally, it’s a platform that enables developers to build real-time, cost-effective stream processing applications, specifically geared towards the observability space. So I’ve been working on that for the entirety of my Netflix tenure. And prior to that I was at Heroku working in the data space as well. I worked on Kafka as a service and Postgres as a service. And then prior to that I was at Salesforce through a series of acquisitions working on stream processors.

Eric Dodds  03:44

Super interesting. And I just have to say again, what a treat. You’ve worked at some of the most famous companies in tech, with all that amazing engineering heritage. So really excited to have you on the show. I think we should just start with talking about Mantis. So it was born inside of Netflix, and you did a ton of work on it. From my understanding, it solved an observability problem at Netflix, and that was sort of the main use case there. But I guess it has since expanded. So tell us about Mantis.

Jeff Chao  04:20

Yeah, definitely. So Mantis was born out of Netflix. It came before I joined, and then I joined and worked on it a bunch, and I’m currently one of the committers. The whole premise of Mantis was originally around reducing downtime, or the mean time to detect, in Netflix playback. So back in the day, 2014 and prior, imagine you are a Netflix member and you’re going home and trying to play your video, and Netflix is down. Back then, for the number of members Netflix had on the service, let’s say if the service was down for two hours, you would have some impact. That same impact today would be on the order of minutes. So as Netflix continued to grow, we realized, hey, we need to have better detection and remediation and time to insight. The systems at the time were very limiting. You had your typical logs, metrics and traces, which typically get sent through some sort of aggregation tier, eventually to be served to you. But the problem is, the longer it takes for you to get the insight, the longer the downtime is, which means more impact to Netflix members trying to watch their favorite films and series on the service. So Mantis is a system that was born to enable highly granular, sub-second or second-level detection. It also allows you to trigger automatic remediations, so we can reduce the ultimate impact on the playback experience at Netflix. Since then, it’s expanded to a wide number of other use cases at Netflix. It’s used for debugging use cases, and it’s used for things like anomaly detection. And at the end of the day, you can sink the events out to some other persistent, durable thing like Kafka, which opened up other use cases in the data and analytics space.

Kostas Pardalis  06:33

Before we continue, you mentioned the word observability. Can you define for us a little bit what observability means for Mantis? Specifically, what was the exact problem that Mantis was going to solve, and how is it described? That’s one question. The second question is, why is observability a good use case for stream processing?

Jeff Chao  06:58

Yeah, that’s a great question. So the first question, how do I define observability? That definition has been going through the community for some time. But for me, buzzwords aside, what I’m really saying is operational insight. I want to be able to ask questions about my system, either ad hoc or not, and get an answer without necessarily having to re-instrument my system. Now, whether I re-instrument or not, I guess it doesn’t matter, because it’s more about what kind of insight I can get into my system. And so specifically for Mantis, one of the questions we would mainly ask is, what does the playback experience look like for a member at Netflix? And that means, what does the playback experience look like for someone in a specific country, watching on a specific device, for which television series, in which episode, in which part of the series? So you can imagine the cardinality of that can get pretty high. So how do we answer those questions while ensuring that answering them doesn’t cost more than the actual system serving the playback experience? So cardinality was definitely a consideration. As for stream processing, it means being able to evaluate events one at a time and leave things like windowing, or any sort of batchy statefulness, up to the developer to define the semantics of how that should work. Because a developer develops their microservice application, and events will stream out of there; they kind of know what the behavior should look like, what windows they should define, and how.

Kostas Pardalis  08:54

That’s great. So we are talking about observability around a service, right? Most of the time when I hear the term observability from people, it’s a very common term, right? It’s where people would like to have, as you said, operational awareness for their servers and more low-level, let’s say, services. Is Mantis also used for these use cases, or is it more commonly deployed and used for higher-level, product-related services in a microservice architecture?

Jeff Chao  09:31

Yeah, that’s a great question. So it’s generally used mostly at the application tier. But more specifically, we have logs, metrics and traces, right? There’s sort of a fourth thing, which is events. Events are just a thing that has a bunch of fields in it. To some people, it looks like structured logs, and it’s very ambiguously defined. But the idea is, it’s just a thing that has a bunch of fields in it. So the developers are free to define whatever fields they want, for whatever use case. So it depends on the instrumentation or the library. At Netflix, we have auto-instrumentation through Spring Boot; we just use gRPC interceptors. So as events come through the request path, things are automatically added for the developers, so they don’t have to worry about adding them. But should they choose to add their own fields or add more context, they can do that. And on the flip side, if they aren’t on the request-response pattern, if they have just any standard microservice or any stateful system, they can use a library to add fields as they want. So long story short, it can be used for lower-level system stuff. It’s up to the developer to explicitly add those in.

Kostas Pardalis  10:58

That’s great. And okay, this is probably a question that you have heard many times before. But what made the team at Netflix build something from scratch instead of using something that already exists out there for instrumenting services and collecting events?

Jeff Chao  11:17

Yeah, that’s a great question. Build versus buy, right? So imagine back in 2013 … Mantis has been in prod at Netflix since 2014, so development was probably a year or two even before that. And back in those days, we had Storm, and I think Trident might have been around, Trident on top of Storm. And then we also had Spark Streaming at some point. So the technologies that existed at the time weren’t satisfactory for the requirements that Netflix had. So really they, being the people that came before me, made Mantis with a few trade-offs in mind, and operating principles on what the architecture should look like. There are three. The first is on-demand data access. You shouldn’t pay the cost to export, serialize, or just move data in general unless you need to, unless somebody is subscribed to it or is listening to that data. When they’re subscribed, they should be able to filter, sample, and project exactly what they’re looking for. So you can get the granularity that you want, you just have to be very intentional about doing so. So number one is on-demand data access. Number two is aggressively reusing the data. Oftentimes, people are subscribing to a stream, and they’re looking for the same data, or a very similar shape of data, or a subset. So if you have that data already on hand, in memory for example, then you shouldn’t have to go all the way upstream to the applications to grab that data again; you should just send it back down to the subscriber. So there’s a sort of bookkeeping mechanism there. And then lastly was cloud native autoscaling. It needed autoscaling native in the platform, at different tiers, to be able to scale resources in and out to generally account for the subscription load and the publishing load. So the systems evaluated at the time, during those years, didn’t fit those three things.
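
Editor's note: the first principle Jeff describes, on-demand data access, can be illustrated with a minimal Python sketch. This is not the Mantis API; the class `OnDemandSource`, its methods, and the event fields are hypothetical names chosen for illustration. The assumption is simply that filtering, projecting, and serializing happen only when at least one subscriber is listening.

```python
import json
from typing import Callable, Dict, List, Sequence, Tuple

Event = Dict[str, object]
Subscription = Tuple[Callable[[Event], bool], Sequence[str], Callable[[str], None]]


class OnDemandSource:
    """Hypothetical event source for illustration; not the Mantis API."""

    def __init__(self) -> None:
        self._subscriptions: List[Subscription] = []
        self.serialized = 0  # counts how often the export cost was paid

    def subscribe(
        self,
        predicate: Callable[[Event], bool],
        fields: Sequence[str],
        sink: Callable[[str], None],
    ) -> None:
        # A subscriber states what it wants: a filter and a projection.
        self._subscriptions.append((predicate, fields, sink))

    def emit(self, event: Event) -> None:
        # With no subscriptions this loop body never runs, so nothing is
        # filtered, projected, or serialized.
        for predicate, fields, sink in self._subscriptions:
            if predicate(event):
                projected = {name: event[name] for name in fields if name in event}
                sink(json.dumps(projected, sort_keys=True))
                self.serialized += 1


source = OnDemandSource()
source.emit({"country": "US", "device": "tv", "error": True})  # nobody listening

received: List[str] = []
source.subscribe(lambda e: e.get("error") is True, ["country", "device"], received.append)
source.emit({"country": "US", "device": "tv", "error": True})
source.emit({"country": "DE", "device": "phone", "error": False})
```

The event emitted before anyone subscribed is simply dropped, which matches the ephemeral behavior described in the conversation. The second and third principles (data reuse and autoscaling) are not modeled here.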

Kostas Pardalis  13:53

Yeah, yeah, I love trade-offs. It’s where engineering starts, actually. So it’s always very interesting to hear about the trade-offs a system was based on, because they pretty much define the context of the software that was built. This is great. And you said that back in 2013 the landscape of available technologies was very different. Today, in 2021, there is a lot of work being done on both streaming platforms and data platforms in general. If you had to choose today, at Netflix, to implement Mantis again, are there solutions out there that are closer to this and can satisfy the three principles we are talking about, based on your experience and knowledge?

Jeff Chao  14:42

Yeah, I haven’t fully looked at the entire landscape of technologies. I mean, from the stream processing side, we have Flink. Flink has its whole async checkpointing thing. Oftentimes, people use it with Kafka, and within the whole Kafka streaming ecosystem we have Kafka Streams, or even Kafka Connect for the integration story. There are other newcomers with a sort of querying type of user experience, like Materialize. Details aside, I think the premise of Mantis was, or is, that if you’re looking for operational insights, insights right now, at least during an incident, are more valuable than insights from a month ago or a few weeks ago. So the main trade-off is, we favor cost effectiveness over, not necessarily the correctness, but the persistence of the data. What that means is effectively a sort of ephemeral, at-most-once delivery mechanism, where if no one subscribes to a stream, then the cost isn’t paid. And so if you want to have that context from weeks ago, you would explicitly and intentionally have to sink the data out to some more durable store, and then later on do a stream-table join, or just join it together with that context. But you did touch on an interesting thing, though, Mantis today. It is open source, but the community isn’t as large as I would hope. And of course, part of that is putting in the rigor and the care that you would for an open source community. It becomes tough, because one of the benefits of open source is that in theory you can broaden your development to people outside your company, you can withstand a lot of organizational pressure, you can withstand a lot of other things that might come through in the development experience, of course with trade-offs. But with that, I think there can still be room in general for existing systems to consider a mechanism where we have to be intentional about persisting the data.

Kostas Pardalis  17:13

Yeah, and I think that’s a very, very interesting topic, which is something I keep thinking about since I started working with data systems. It’s this characteristic of what I call time series data, which is the more generic term, let’s say, for data that has this temporal nature, like events, for example. I think it’s very common that these assumptions make total sense. The data today is much more important than the data that you collected a year ago, right? And okay, outside of probably some infosec-related use cases, where the data is very important to keep because you might have to retrospectively go back and figure out what happened six months ago when someone broke into your system, I think in a majority of the use cases around streaming data, data today is much more valuable than the data from yesterday. And this is a question that I would also like to ask Eric, because Eric comes from being the consumer of this type of data from a marketing perspective. Time series data is also used in his everyday work. The most typical example is instrumentation: let’s say you have the customer, right, and you try to capture all the different events that the customer is generating, rebuild the behavior out of this, and based on that, run some campaigns and do marketing. So what’s your experience, Eric? Does the data in marketing today have the same value as the data that was collected for your customers six months ago?

Eric Dodds  18:52

That’s a great question. And I would say, in my experience, a lot of marketing data and marketing reporting actually tends to be pretty primitive. Because unfortunately, I think marketers are getting more technical. So I don’t mean, I am a marketer. So I guess I can speak badly about my own kind. But the marketing automation tools and CRM are where a lot of the customer profile information lives. And so when you think about time series data, especially in the context of an event based paradigm, I actually still think even though there’s some really good marketing analytics tools out there, I think in terms of like the fundamental building blocks of that, a lot of marketing reporting is really behind. And so they tend to rely on a snapshot that comes from exporting structured data out of relational databases, which is in the form of marketing automation and CRM tools. But I would say, for me, the time series data is actually the most important. And so when, like, I’m just the biggest fan of thinking about data in terms of events, especially when it comes to marketing, because it’s really the most robust way to identify trends over time. And so, to me, that really is the most valuable kind of data. Like if I can look back six months, and look at a metric and the way it changes, and then if I can identify the marketing activities that I’ve executed, or the tactics or campaigns that I’ve run, and line those up on a timeline relative to data that’s displayed as events over time, that’s where you get your most valuable insights. Because you can triangulate pretty accurately within reason. I mean, of course, you run into issues around statistical significance, etc. But that’s where you can kind of say, like, okay, I can start to see that when we execute these sorts of tactics, they have an impact, but the delay is actually like a month or three months. 
And so from a forecasting perspective, when you think about deploying a budget and when you’re going to start seeing results, it’s really helpful, especially in businesses that have a sales cycle or activation cycle that is longer, right? If you think about a multi-month activation or purchase cycle, without time series data it’s very difficult to get insight into when your marketing activities, and especially your budget, are actually going to show up on the bottom line, as it were. So that was maybe a longer answer than you were looking for. But without a doubt, if you can get time series data, and we do that in a warehouse and use some BI tools on top of event-based data, that’s really the holy grail for marketing, I think, especially relative to attribution.

Kostas Pardalis  21:55

Yeah, yeah. Makes total sense. And actually, while you were talking about all that, Eric, I started thinking, because you mentioned the evolution in marketing around data, and how primitive some of the stuff is. At the same time, we are talking with Jeff about how stream processing on real-time data was being implemented natively back in 2013. It’s very interesting to see the difference between roles inside the company in how they are implementing and using technologies. And I wanted to ask you, Jeff: you’ve been in this space and working on streaming data for a long time now. Kafka, I think, as a technology has been around for almost 10 years now, if I’m not mistaken, which is actually a long time in tech time. How have you seen streaming data as a paradigm change over all these years? Where was it when you started working on it, and how is it today?

Jeff Chao  23:04

Yeah, that’s a great question. You mentioned Kafka. 10 years. Boy, it’s been long. I remember Kafka back in the 0.7 days. If I remember correctly, I think the offsets were stored in ZooKeeper. And that was before the high level consumer was introduced. So yeah, we’ve come a long way in the community in the data space, in the streaming data space. I remember back then … first of all, depending on the company, you might not even have that much data. You could serve things out of a standard leader-follower Postgres or some relational database. Mongo and “web scale” was a huge thing during those days as well. In terms of streaming, a lot has changed in the community, in the sense that data integration is a very common, ubiquitous term now: moving data with schema and versioning, exposing that via some sort of catalog, and connecting sources and sinks and simple transforms. A lot of people are trying to solve that problem and really understand what the developer experience, the user experience around that looks like. Personally, I feel like it’s not quite solved yet, and we’ve still got some ways to go.

Jeff Chao  24:27

Another thing is, data quality is a huge thing. As your company grows in employees and in users, etc., there will be more data. The data can come from anywhere, in any form. So how do you make sure that your data upholds some level of quality threshold? There’s a bunch of companies tackling that as well. We have alerting thresholds on systems infrastructure, we have KPIs for the business, but we are still trying to work out what that looks like for data. What does that experience look like for data as well?

Jeff Chao  25:02

And the last bit, started by the Flink folks, is that streaming is a superset of batch. Batch has been around for a while in the Hadoop ecosystem. And there have been some efforts, through a SQL-like interface, to merge the different paradigms, streaming and batch, into a single abstraction. It might have separate underlying infrastructures, but at least there’s a single abstraction for a user to work with. That’s still going, I think. For the larger companies, it’s hard to move because of processes and integrations. So at least from what I’ve seen, a lot of the larger companies still have a lot of batch systems through Spark and whatnot. But with that said, there has been a relatively recent technology, Apache Iceberg, to help with the table format, and then you use that with a columnar file format like Parquet or something. With tight integration with Spark like that, it makes it pleasant to work with. But at the same time, I don’t know what that means for the velocity of the initiative of streaming as a superset of batch.

Kostas Pardalis  26:19

That’s a very interesting point. Actually, I wanted to ask you exactly that. We have had quite a few conversations on this show where what people were trying to say is that batch, in an abstract way at least, is a subset of streaming, right? And that we are going toward a reality where, ideally, everything can be treated as a stream. Do you agree with that? Do you see this happening? Do you think that at some point batch is going to be just, let’s say, an observer pattern of accessing and working with data, and streaming is going to be the de facto way of working with data? Or do you think that we are going to get to a balance point where both coexist, with a single abstraction on top in the end?

Jeff Chao  27:09

Yeah, I’m all for simplifying it under a single abstraction in the future, if we can get the verbs right. Verbs, assuming we stick with the SQL paradigm. In its purest form, right, batch is just fixed windows. Those windows are typically on longer time horizons, and streaming is generally on smaller windows, or, I guess, more dynamic windows on smaller time horizons. With the windows, you might have some state and checkpoint the state, and you checkpoint in batch as well; in one case you’re just saving more often than in the other. So in its purest form, that’s what I believe. But I’m not sure if, practically, we can get there, or how we would define the criteria for success in that case, because there’s been a lot of history, at least for larger companies, with batch systems. And a lot of these larger companies, and even mid-market companies, have ML initiatives, right? And I’m not familiar with how amenable the streaming patterns are to those. So data is accessed in more use cases than before, when the talk of streaming as a superset of batch was starting out. So while I’m all for it, and I champion that fact, I think practically speaking there are a lot of hurdles that we have to get over. And the practical hurdles are there for good reasons, because people are in the business of whatever core competency their companies are trying to deliver for their users.
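
Editor's note: the idea that batch is just fixed windows on a longer time horizon can be sketched in a few lines of Python. This is an illustration only, with a hypothetical function name and made-up events of the form (timestamp in seconds, value); it is not tied to Flink, Spark, or any other engine. The same windowing function yields a streaming-style result with a small window and a batch-style result with a large one.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_sums(
    events: Iterable[Tuple[int, float]], window_seconds: int
) -> Dict[int, float]:
    """Sum event values per fixed (tumbling) window, keyed by window start."""
    sums: Dict[int, float] = defaultdict(float)
    for timestamp, value in events:
        # Align each event to the start of the window it falls into.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)


# Illustrative events: (timestamp in seconds, value).
events = [(0, 1.0), (30, 2.0), (61, 3.0), (3599, 4.0), (3600, 5.0)]

per_minute = tumbling_window_sums(events, 60)    # small windows, streaming-style
per_hour = tumbling_window_sums(events, 3600)    # large windows, batch-style
```

Only the window size changes between the two calls. What this sketch leaves out is exactly what the conversation flags as the hard part: state checkpointing, late or out-of-order events, and the existing batch infrastructure.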

Kostas Pardalis  28:58

Makes total sense. That’s a very interesting insight, that we have many more ways of accessing and working with data today. Okay, a question about data integration. You mentioned the term before. Traditionally, at least if we look at the vendors that were used a couple of years ago, it was a batch business. Most of the data integration was happening with some batch processing systems. Is this changing? Is data integration more of a streaming workload today? Do you think that this is going to happen? And how has it changed?

Jeff Chao  29:34

Yeah, I’m really glad you asked this question. I think it is, at least from all of the startups that I’ve seen in the space, moving towards streaming. And one reason that I see is the latency aspect. I mean, some people might have batch systems for whatever requirements they have for their use cases. Maybe they have to do something once a day or once a week, for whatever reason their business requires. Or maybe it just exists. The processing can follow whatever semantics they want. But getting the data from upstream, people really want their data as quickly as they can, assuming that the trade-off is that you still get the correctness of the data. When I say getting the data quickly, I mean you have applications, REST APIs, gRPC APIs, and they’re basically just generating data. This data is generally persisted in some sort of database, and there are a bunch of data sources and many applications. So downstream of that, it would be really nice to just get that data as soon as you can, and what better way to do that than in the streaming way.

Jeff Chao  30:47

And I think one of the pivotal points was, I mean, Kafka was definitely a contributing factor. It helps that it’s got at-least-once delivery, and then you can get the correctness factor by replaying and de-duping on the other side. So the idea is, you can essentially have a little “stream processor” that reads from these log data structures off of these databases, an event at a time, and then writes them into Kafka. Then consumers downstream can pick up the data at their own leisure, they can process the data at their own leisure, and the canonical term that we use for that today is CDC, or Change Data Capture. Surprise!
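
Editor's note: the combination of at-least-once delivery with de-duplication on the consuming side can be sketched as follows. This is a minimal illustration in plain Python, not Kafka client code; it assumes every message carries a unique identifier (the hypothetical `event_id`), which the conversation does not specify.

```python
from typing import Dict, Iterable, List, Set, Tuple


def deduplicate(messages: Iterable[Tuple[str, Dict[str, object]]]) -> List[Dict[str, object]]:
    """Keep the first payload seen for each event id, preserving arrival order."""
    seen: Set[str] = set()
    unique: List[Dict[str, object]] = []
    for event_id, payload in messages:
        if event_id in seen:
            continue  # a redelivery or replay of something already processed
        seen.add(event_id)
        unique.append(payload)
    return unique


# At-least-once delivery: "e2" and "e1" arrive again after a simulated replay.
delivered = [
    ("e1", {"user": 1, "op": "insert"}),
    ("e2", {"user": 2, "op": "insert"}),
    ("e2", {"user": 2, "op": "insert"}),
    ("e1", {"user": 1, "op": "insert"}),
    ("e3", {"user": 1, "op": "update"}),
]
processed = deduplicate(delivered)
```

The `seen` set grows without bound in this sketch; a long-running consumer would need to bound it somehow, for example by time or by offset, which is outside what this example shows.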

Kostas Pardalis  31:35

Yeah, yeah. Let’s discuss CDC a bit more, because it’s something that you hear a lot about, and I think it’s a very interesting topic. Okay, so this is more of a pattern, right? How do you capture changes to the state of a database system and propagate these changes to other systems? Tell us about it. How have you seen CDC change through time? If I remember correctly, projects like Debezium have been around for a while, right? So it’s not something that came up today. But it seems that today people are much more interested in CDC, to the point where we even see startups out there implementing CDC as a service, whatever that means. So what is the current state of CDC? Where do you think it’s going?

Jeff Chao  32:20

Yeah, CDC is a fascinating thing. So the way I see it, there’s this concept of the stream-table duality, where a table in a database represents an integration over a change stream; it’s a snapshot, at a point in time, of what this stream of changes adds up to. So then, if you take the derivative of that, you get the change stream, and what I mean by that is the operations that happened. So if you’re inserting a record or updating a record, what did the record look like before? What does it look like after? What is the operation, and the timestamp? So then, if you have a stream processor that just reads from the beginning of that stream and applies the operations, you could eventually get to a snapshot. And that’s just a very interesting thing. It’s this log-looking data structure that is traditionally internal to database systems. A lot of them do replication this way, particularly in the leader-follower model, right, the single-leader model. And so it’s like an extension of that, where, hey, what if we expose that in a public, stable API for people or consumers outside of the whole replication ecosystem, so that they can look at that change stream and materialize views after the fact.
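
Editor's note: the stream-table duality described here, where replaying a change stream from the beginning reproduces a table snapshot, can be sketched in Python. The event shape below (`op`, `key`, `before`, `after`, `ts`) follows the fields Jeff lists, but the exact names and the `replay` function are assumptions for illustration, not the format of any particular database or CDC tool.

```python
from typing import Dict, Iterable, Optional, TypedDict


class ChangeEvent(TypedDict):
    op: str                  # "insert", "update", or "delete"
    key: int                 # primary key of the affected row
    before: Optional[dict]   # row state before the change, if any
    after: Optional[dict]    # row state after the change, if any
    ts: int                  # timestamp of the change


def replay(changes: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply change events in order to rebuild the table snapshot."""
    table: Dict[int, dict] = {}
    for change in changes:
        if change["op"] in ("insert", "update"):
            table[change["key"]] = change["after"]
        elif change["op"] == "delete":
            table.pop(change["key"], None)
        else:
            raise ValueError(f"unknown operation: {change['op']}")
    return table


change_stream = [
    {"op": "insert", "key": 1, "before": None, "after": {"name": "ada"}, "ts": 100},
    {"op": "insert", "key": 2, "before": None, "after": {"name": "bob"}, "ts": 101},
    {"op": "update", "key": 1, "before": {"name": "ada"}, "after": {"name": "ada l."}, "ts": 102},
    {"op": "delete", "key": 2, "before": {"name": "bob"}, "after": None, "ts": 103},
]
snapshot = replay(change_stream)
```

Replaying only a prefix of the stream gives the table as it looked at that earlier point in time, which is the "integration" half of the duality.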

Jeff Chao  33:47

So one of the use cases that I see is people reading from these streams, and then they have a bunch of stream processors that materialize different views for different consumers. So let’s say I’m interested in a users table, and I don’t care about the exact user event, I just care about maybe some sort of aggregation or roll-up. Instead of querying that database directly, if I can relax my latency guarantees a little bit, then I can rely on some other thing to read from a change stream and populate a materialized view for me. And then I can just read from that without hammering the main data set. The tricky part is you don’t just have one database system. At least in a larger setting, you might have multiple systems, with their own representation of this log-like data structure, and their own representation of how you get data out of those boxes, over the network, into your systems.
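
Editor's note: a roll-up view maintained from a change stream, so that readers never query the source table, might look like the sketch below. The view here is a count of users per country; the `country` field, the event shape, and the function name are hypothetical and chosen only to make the example concrete.

```python
from collections import Counter
from typing import Optional


def apply_to_view(view: Counter, before: Optional[dict], after: Optional[dict]) -> None:
    """Update a users-per-country roll-up from one change event."""
    if before is not None:
        # The old row state leaves the view (update or delete).
        view[before["country"]] -= 1
        if view[before["country"]] == 0:
            del view[before["country"]]
    if after is not None:
        # The new row state enters the view (insert or update).
        view[after["country"]] += 1


users_per_country: Counter = Counter()
changes = [
    (None, {"id": 1, "country": "US"}),                         # insert
    (None, {"id": 2, "country": "DE"}),                         # insert
    ({"id": 1, "country": "US"}, {"id": 1, "country": "DE"}),   # update
    ({"id": 2, "country": "DE"}, None),                         # delete
]
for before, after in changes:
    apply_to_view(users_per_country, before, after)
```

Consumers read `users_per_country` instead of running an aggregate query against the users table. The view lags the source by however long the change event takes to arrive, which is the relaxed latency guarantee mentioned in the conversation.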

Jeff Chao  34:57

So there are the semantics of doing that. There’s also the shape. So what kind of envelope, if any, is in place for you to move these payloads from different technologies in a coherent way, so that something can transform them into data that eventually gets enriched for downstream consumers? There’s a lot of work being done on that, and it’s a hard problem. The problem just stems from how people are generating lots of data. There are lots of applications, many different data sources. But at the end of the day, downstream consumers aren’t looking at, hey, is this Postgres, MySQL, Cassandra, Mongo, etc. They’re just looking at my user model, my accounts model, my billings model, etc.
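
One way to picture the envelope idea is a pair of adapters that map source-specific payloads into one common shape. The sketch below is hypothetical throughout: the two raw payload layouts and every field name are invented for illustration and do not match the actual log formats of Postgres, MySQL, Cassandra, or MongoDB.

```python
def normalize_relational(raw):
    """Map a made-up row-store change payload to the common envelope."""
    return {
        "entity": raw["table"],
        "op": raw["kind"],
        "before": raw.get("old"),
        "after": raw.get("new"),
        "ts_ms": raw["ts_ms"],
    }


def normalize_document(raw):
    """Map a made-up document-store change payload to the common envelope."""
    op_names = {"i": "insert", "u": "update", "d": "delete"}
    return {
        "entity": raw["collection"],
        "op": op_names[raw["o"]],
        "before": None,              # this invented source carries no pre-image
        "after": raw.get("doc"),
        "ts_ms": raw["time"] * 1000, # seconds -> milliseconds
    }


relational_event = normalize_relational({
    "table": "users", "kind": "insert",
    "old": None, "new": {"id": 1, "name": "Ada"}, "ts_ms": 1700000000000,
})
document_event = normalize_document({
    "collection": "users", "o": "u",
    "doc": {"id": 1, "name": "Ada L."}, "time": 1700000000,
})
```

Downstream code then only deals with the envelope keys and the entity name (the "user model"), not with which database produced the change.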

Eric Dodds  35:48

Jeff, one question. It’s interesting to think about CDC; we’ve talked about it on the show several times. But it’s really interesting that you started out with a use case that is very practical from a business operations standpoint, right? A users table, when we think about that, could apply to product marketing, sales, and customer work. CDC is not a new idea. But is the trend towards moving CDC closer to the end consumer of the data? I’m just interested, even from an organizational standpoint, in how you see the technology being leveraged inside of organizations, and even some of the tooling around it. At a base level, it’s a mechanism to capture useful data. But is it moving closer to the end consumer who may not be as technical?

Jeff Chao  36:51

I think, in terms of moving the data, it’s getting better. In terms of moving towards the end user, I’m not sure, actually. At the end of the day, the user just wants some view of the data. It can be the raw view of whatever the table looks like, all the way upstream, or it can be some materialized view that a stream processor in between has materialized and enriched for them. I think that the tricky thing is CDC has been around for a long time; it’s not a new concept. But the interesting thing is there have been a few enabling technologies that made the developer experience better. So we talked about Debezium and Kafka, and Kafka Connect with this high-level consumer, and the consumer rebalance protocol. There’s just a bunch of things that have made it incrementally easier. And today, I think, technology aside, the concepts have been distilled to a point where if people choose not to use Kafka or that whole ecosystem, they can take those concepts and build their own CDC-like technology. And so exporting the data out of a database, from a Mongo oplog or MySQL binlog, Postgres WAL, etc., that experience has gotten better. Taking that data and moving it in a scalable way has gotten better. And then I think the last piece is, how can you transform that data in an easy way to materialize the view? That is so tricky, because for an end consumer, there are different personas, right? So if I have the persona at a lower layer, like a data engineer, I can say, hey, look, write a Flink job or something that reads from a Kafka stream, make it do a join with this external data source, and produce a view to Redshift, Snowflake, or Iceberg, and then someone downstream of that can use Presto, or whatever tools they use, to get that data. But the tricky thing is you need to have that person in between write that job to take the raw stream of data and then make the view.
So I’m wondering if there’s a way for the person all the way downstream, who traditionally writes Presto and Trino queries or works more on visualizations and dashboards and stuff, to just own that stack.

Eric Dodds  39:28

And sure, yeah, that’s super interesting. Yeah, that’s exactly what I was getting at. And I really appreciate the candid way that you said, the end consumer and in many ways, because this is me, but the end consumer just wants a materialized view. And it’s kind of like, I don’t really care how it gets there.

Jeff Chao  39:48

Yeah, exactly. I imagine in a smaller company, you would just wear multiple hats and figure out how to ETL the data yourself. Or in a larger company you would ping like multiple teams and take forever to get the result that you actually want.

Eric Dodds  40:03

Yeah, yeah. I just think it’s a really valuable insight, saying that the developer experience has gotten way better. And it seems like it’s at least moved more towards the data engineering type persona, where it’s a lot easier for them to leverage CDC in order to produce the result for the end consumer.

Jeff Chao  40:26

Yeah, we’re getting there. We still have some ways to go, because CDC is, I mean, probably the thing that’s been, not gated, but more prevalent in larger companies, because of the resources and the teams, right? And smaller companies, or even mid-market companies, are still generating data. They still need insight into their systems. And everybody wants a materialized view for whatever use case, but they don’t have the resources and teams; they don’t have ingestion people, they don’t have data engineers. So developer experience-wise, we as a community have got some way to go to simplify that and make it more accessible for companies that are smaller or mid-market size.

Eric Dodds  41:12

Sure, Jeff, one question. And this is jumping back a little bit in the conversation. But I just want to connect the dots for me and hopefully, for our listeners, when you were talking about Mantis, and downtime for end users watching video on Netflix, I’m interested to know, when you think about the observability of like, okay, we have downtime, and we need much more robust data around that. When you were looking at that problem, how much were you interfacing with the teams who were dealing with the user interface side of trying to communicate about those problems with the end user? Or is that part of it? And the reason I asked that is because as you kind of think about being a consumer of Netflix, as I’m sure almost all of our listeners are, it’s interesting to know, like how the observability data gets to the engineering team so that they can fix the problem faster. And then what does the loop back to the consumer look like if that makes sense? Or like the end user watching videos?

Jeff Chao  42:23

Yeah, so you kind of have to define what’s interesting to you. And one of the most interesting metrics for the Netflix playback experience is called stream starts per second, or streams per second, or SPS. What that means is, anytime someone hits play on a television show or a film, basically a giant event gets fired off. And so in its purest form, you can just count the number of those. And if it deviates from some threshold, either static or dynamic, then you want to sound the alarm. Sounding the alarm is a tricky thing, because Netflix has hundreds of microservices. And these microservices are generally operated and owned by disjoint teams that might not even talk to each other. They might not know of each other. So it gets tricky with the alert, because you have to have a coherent alert that is ideally consolidated. So it’s not only quick but consolidated, so that you can triage appropriately and page the appropriate teams.
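
In its simplest form, the counting and static-threshold check described above might look like the Python sketch below. The function names, the expected rate, and the tolerance are illustrative assumptions, not how Netflix or Mantis actually computes or alerts on SPS.

```python
from collections import Counter


def starts_per_second(event_timestamps):
    """Count play-start events in each whole second."""
    return Counter(int(ts) for ts in event_timestamps)


def find_alerts(sps, start, end, expected, tolerance):
    """Flag seconds whose count deviates from `expected` by more than
    `tolerance` (a fraction of `expected`). This is a static threshold."""
    alerts = []
    for second in range(start, end):
        observed = sps.get(second, 0)
        if abs(observed - expected) > tolerance * expected:
            alerts.append((second, observed))
    return alerts


# Synthetic timestamps: 10 starts in second 0, 9 in second 1, 3 in second 2.
timestamps = (
    [0 + i * 0.05 for i in range(10)]
    + [1 + i * 0.1 for i in range(9)]
    + [2 + i * 0.2 for i in range(3)]
)
sps = starts_per_second(timestamps)
alerts = find_alerts(sps, start=0, end=3, expected=10, tolerance=0.3)
```

A dynamic threshold would replace the fixed `expected` value with one derived from recent history, for example the same time of day on previous days.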

Eric Dodds  43:39

The context.

Jeff Chao  43:40

Exactly. Exactly. So stream starts per second is like the main metric. What happens is everything goes through a front-end proxy, and then that will distribute through to a bunch of microservices within the Netflix internal infrastructure. Fortunately, at Netflix we use dynamic thresholds, but even without those, a static threshold is sufficient, because Netflix playback is pretty predictable over a 24-hour period. People get home from work and they watch Netflix; during work hours, they probably aren’t watching Netflix, or shouldn’t be watching Netflix. And so you have this sort of sinusoidal pattern throughout the week, day in and day out.

Eric Dodds  44:32

Yeah, that’s super interesting. Wow, how interesting to think about sort of the rhythm of a day within a time zone, creating some level of predictability.

Jeff Chao  44:43

And it’s really interesting too, because there’s not a single alert, right? There’s lots of granularity. You can alert on a dip in global stream starts per second. But that might not even dip, because, if you imagine a long-tail distribution, right, you might have someone watching in a smaller country on an older device. We care about those members too, and their playback experience. So if you’re doing things like aggregating or looking at a less granular view, then you’re going to miss that person’s playback experience if it’s bad. Mantis allows you to zero in on that person. So if someone’s watching Stranger Things, series three, episode one, on a Wii U (which is decommissioned by now, I think) in, like, Russia or something like that, we’ll be able to see if that person is having playback problems.
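
The long-tail point can be made concrete with a small Python sketch: the same events look healthy in a global aggregate and clearly broken once grouped by country and device. The event fields, the placeholder country and device labels, and the numbers are all made up for illustration.

```python
from collections import defaultdict


def error_rates(events, key_fn):
    """Fraction of events with an error, grouped by key_fn(event)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ev in events:
        key = key_fn(ev)
        totals[key] += 1
        if ev["error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# 1000 playback events from a large segment (1% errors) and 5 events
# from a small long-tail segment where every playback fails.
events = (
    [{"country": "AA", "device": "smart_tv", "error": i < 10}
     for i in range(1000)]
    + [{"country": "BB", "device": "old_console", "error": True}
       for _ in range(5)]
)

global_rate = error_rates(events, lambda ev: "global")["global"]
segment_rates = error_rates(events, lambda ev: (ev["country"], ev["device"]))
```

Here the global rate stays around 1.5 percent, which a coarse alert would likely ignore, while the small segment sits at 100 percent.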

Eric Dodds  45:48

So fascinating. So fascinating. I mean, I just think about the scale of Netflix, and the ability to sort of quickly, identify statistically significant problems on a regional level is, is just really cool. I mean, a lot of companies don’t have that at an international scale.

Jeff Chao  46:07

Yeah. The example that I like to give is, suppose you’re having playback problems and you call customer support. They could inspect a stream from an application, target the query to exactly your device, put in 100% sampling, and see all of the events coming through. And then as you click, tap, and swipe through the application, assuming you’re on your phone or something, they can see all of the events for you going through live and help you troubleshoot right then and there. But if you’re trying to store all of those events, and then aggregate and look at them after the fact, that could get quite costly for Netflix.
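
A targeted, sampled tap on a live stream can be sketched as a filter over events as they pass by, with nothing stored. The `tap` function, the `device_id` field, and the sampling parameter below are hypothetical and are not the Mantis query interface.

```python
import random


def tap(events, predicate, sample_rate, rng):
    """Yield events that match `predicate`, keeping each match with
    probability `sample_rate` (1.0 keeps every matching event)."""
    for ev in events:
        if predicate(ev) and rng.random() < sample_rate:
            yield ev


live_events = [
    {"device_id": "dev-1", "action": "tap"},
    {"device_id": "dev-2", "action": "swipe"},
    {"device_id": "dev-1", "action": "play"},
]

# Target exactly one device at 100% sampling, as in the support example.
matched = list(
    tap(live_events, lambda ev: ev["device_id"] == "dev-1", 1.0, random.Random(0))
)
```

Because the filter runs while events are in flight, the cost is tied to what is being watched at the moment rather than to storing every event for later aggregation.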

Kostas Pardalis  46:50

Jeff, I want to take you back to CDC again, if that’s okay. Because I know Eric has more questions that are, let’s say, closer to the customer, let’s say you have all that stuff.

Jeff Chao  47:04

Yeah, we can jump around.

Kostas Pardalis  47:06

Yeah, absolutely. But I have a question that I need to ask you, based on what you mentioned previously about data modeling on top of the CDC streams. Traditionally, data integration is built around the concept of the data warehouse. Let’s assume there is a data warehouse somewhere; we are going to load the data onto this data warehouse, and in the process of doing that, we are also going to integrate that data together. So sure, I might have one service that represents the user in one way and another service that represents the user somehow differently, and we are going to normalize these into one data model that whoever wants to work on that can then use. But so far that has always been done in more of a, let’s say, batch way; we are assuming that batch processes are happening. So assuming CDC, and assuming that we have these streams of data, is this going to change? Is this work on the data modeling side of things going to keep happening on a data warehouse? Do we need the data warehouse? Or are these things that we can also do on the streams?

Jeff Chao  48:21

That’s interesting. So I think, personally, the answer would be where does the tooling lie? And where is the best experience and most familiar experience and like a market leader for tooling? So it’s also sort of a question for Eric, which I’ll pass on to in a minute. But like the downstream consumer, the end user, you know, what tooling do they like? And what are they familiar with? And what do they need to generate whatever they need to generate for their use case and their own customers? So the CDC thing, unfortunately, I think it just gets your data faster. So instead of reading two tables, two snapshots from two different data sources, you’re reading two streams of changes. And then as the events come in, you are joining or enriching those two streams with some extra context and then materializing the view as you go.
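
Joining two change streams as events arrive, rather than joining two snapshots, might look like the Python sketch below. The `(source, key, value)` tuple layout and the `users` and `accounts` stream names are assumptions for this example; a real stream processor would also handle ordering, deletes, and state storage.

```python
def join_streams(events):
    """Consume interleaved changes from two streams and keep an enriched,
    joined view up to date as each event arrives."""
    users, accounts, view = {}, {}, {}
    for source, key, value in events:
        if source == "users":
            users[key] = value
        elif source == "accounts":
            accounts[key] = value
        else:
            raise ValueError(f"unknown stream: {source}")
        # Re-materialize the joined row once both sides have been seen.
        if key in users and key in accounts:
            view[key] = {**users[key], **accounts[key]}
    return view


interleaved = [
    ("users", 1, {"name": "Ada"}),
    ("accounts", 2, {"balance": 5}),
    ("accounts", 1, {"balance": 10}),
    ("users", 1, {"name": "Ada L."}),
]
joined_view = join_streams(interleaved)
```

The resulting rows are the same ones a batch join over two snapshots would produce; the difference is only that they are refreshed as each change arrives.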

Jeff Chao  49:23

So for Eric’s case, it makes his life better, he gets the data faster. But how did he get the data? Like, honestly, I don’t think Eric would care or it matters. I mean, if the batch thing works in a day, if you can stomach like doing it every hour and re-computing everything that might even be fine. Or in the streaming sense if you can get it down to minutes. Like, that’s great for Eric. But the data is still the same at the end of the day.

Kostas Pardalis  49:49

Mm hmm. Absolutely. Yeah. It’s very interesting. So another question, and I think you are probably the perfect person to ask, because of your experience and the companies that you have worked at. You worked at Salesforce, Heroku, then at Netflix. Heroku had a very interesting product, to me at least, called Heroku Connect, right? The whole idea was: we have a Postgres database on Heroku, we have a Salesforce account somewhere, and these two different systems are synchronized one way or another. This product has been around for quite a while. I always found it very interesting, and having a little bit of knowledge of how it worked, do you see CDC as a pattern that also applies to these kinds of use cases? Because here, we don’t have a database system and another database system, right? We have a database system, and, if we want to make it a little bit more technical, a service that is exposed through a REST API or a SOAP API. And the modality is a bit different. Do you think that CDC can be used, or should be considered, as a pattern for these kinds of integrations too?

Jeff Chao  51:04

Yeah, it’s just a pattern, as you said, or a tool or mechanism to move data into the tool that is most comfortable for the end user. So it’s really interesting in the Salesforce case, because you’re moving data out of Heroku Postgres, which might be from a smaller dev team from a new app that somebody in a large company just started or like an acquisition of a larger company. Or maybe they have customers of their company that want to use Salesforce, and they have their own integration. So it’s just a natural fit to take the data from your online systems, which are in Postgres, and move them to a system, another data source or a sink in this case, like Salesforce. And then your actual customers are, in this case, would be most comfortable with the Salesforce ecosystem. There’s lots of tooling and other Salesforce products that they can use and then integrate as well. So if it’s not Salesforce, it could be some other thing. It just depends on which customer are you trying to serve for the CDC mechanism that you happen to have within your infrastructure?

Kostas Pardalis  52:12

Yeah, yeah. So it could be Marketo. So we cannot be very happy, right?

Eric Dodds  52:18

Yeah. I mean, as I think about it, again, I’m not an engineer, and I don’t have formal training as an engineer. But I understand the concept of CDC, and as a non-engineer, it feels incredibly freeing when I think about the stack, because event-based time series data is so valuable to me. In sort of an ideal world (I know there are limitations, and people are still working on the ergonomics), I could theoretically get that from any system, because they all run on a database. And that’s very appealing. That’s super interesting to me. I think it’s a really compelling concept. And at least from an outsider’s perspective, I think the value is there, and as the ease of delivery gets better and better, I think we’re gonna see a lot more of it, just because it’s kind of a way to provide the most valuable type of data to downstream teams, regardless of the system, right? And so you don’t have to force a super high level of conformity in terms of the way that things are captured at the source, if that makes sense.

Jeff Chao  53:36

That’s a really awesome insight, Eric. And while you were speaking, it reminded me of another point, which is that if you have the raw change stream, you’re effectively just getting the raw data. You’re not getting a snapshot of a point in time; you’re getting it as it comes. And so the beauty of that is, ideally, it would be up to the downstream user to interpret that in a way that fits their use case or fits whatever problem they’re trying to solve. So it’s a bit more flexible in that sense, because you’re not working with data that’s rolled up. It’s just that today, you would have to ask someone to interpret that stream and then put it in a way that works for you. But yeah, if you don’t have the raw data, then it might be a little bit more difficult to do what you need to do. And that, I guess, is really one of the fundamental values of CDC.

Eric Dodds  54:35

Yeah, for sure. Well, we have a little bit more time, so I’m going to ask a question that’s less technical. You have worked on some really heavy-duty systems, really deep in the stack, at some really amazing companies. And one thing I was thinking about, just prepping for the show today, was … I love that you describe some of the problems that you’re solving in terms of the user experience. So for example, if someone’s trying to watch Netflix, and they have downtime, right, or as I translate that, well, you interrupted my binge buzz, because I’m trying to hammer through re-runs. That really stuck out to me, because you’re working pretty deep in the stack. And in many cases, it sounds like some of the stuff that you’re doing doesn’t even necessarily touch the end user, if that makes sense. It certainly has an influence, but when we think about observability, you’re trying to get system data to people who can solve a problem with a system, which is, you know, sort of an internal feedback loop where someone’s going to do something to get it back online for the consumer, or the end user. How do you think about that, especially having spanned multiple companies that are very user-centric, but working very deep in the stack? I would just love your perspective, and I think our audience would appreciate it. Is that a concern for you? Do you try to think about the end user, even if what you’re doing is more of an internal feedback loop with data or infrastructure?

Jeff Chao  56:21

Yeah, I love this question. It’s something that really resonates with me, because a lot of times as engineers, we tend to get stuck in the weeds of the deep technical. And it takes a lot of discipline and rigor to remember to get out of that, and actually start with: what problem are you trying to solve? Who are you trying to solve it for? What things have been attempted before? Why are they bad? And how can you improve on them? With that, you’re basically just getting context to help inform the inputs and the requirements for how you should solve your problem and what kind of technology you should be introducing to solve such problems. Because if you start from the inside out, from the technical side, I just feel like the likelihood of building the wrong thing is much higher. And then you’ll end up building a solution in search of a problem instead of actually building the right thing. So it’s a skill or a trait, but definitely something that any engineer should be doing first, or thinking about: what problem are you actually trying to solve? And is it the right problem to be solved? And then there are other tactical things, like incrementally building upon hypotheses: build something, test it out, build something, get feedback. So just things that mitigate the risk of building the wrong thing.

Eric Dodds  57:46

Yeah. And you know, one thing that came to mind, number one, love the perspective, and it sounds like I don’t want to assume things about you. But it sounds like you’ve developed a strong level of discipline in terms of thinking that way, which is really cool. And the other thing that just came to mind around data is that you have the end user, right, so the person watching a video with Netflix, but then there are also end users of the data products inside the company as well, like the first stop end users inside the company. And then the last stop, or the Last Mile end users, the one who’s actually consuming the product. And that’s a really interesting environment to operate in. And in many ways, kind of is complex, because it’s easy to stop at the first end user, and not do the work of thinking about the end end user.

Jeff Chao  58:41

Yeah, yeah. And that’s okay, you can just start somewhere, because as you mentioned, if you’re deep down in the stack, you have to focus. You have to pick some user persona, work with them, get a tight feedback loop going, iterate and ship, and iterate and ship. And then from there, you can either make their experience even better, or you can bring in a different user persona and see how you can uplevel the abstraction in whatever product or service you’re providing. And then ultimately, Netflix is a single product by itself. So in my case, depending on which partner teams I was partnering with, if their use cases would eventually serve that streaming path, the actual series-and-films watching path, then you would have to keep that in mind as well. So there are different levels of customers, if you will, or different levels of partners and user personas. And you can just focus and then incrementally chip away at it.

Eric Dodds  59:43

Yeah, absolutely. And I think I’m far from the expert. But if there’s one thing that I have learned about mature engineers, is that stepping back to ask whether you’re building the right thing, regardless of technology, is absolutely a hallmark of a mature engineer, so love that mindset and a great reminder for us and, and the audience. Well, we are at the end of our time, Jeff, I feel like we could keep talking for another hour or two, which means we’ll have to have you back on the show. But this has been great.

Jeff Chao  1:00:14

Yeah, it’s been awesome talking to you all. Thanks.

Eric Dodds  1:00:17

Awesome. Well, we’ll check back in with you. We’d love to have you back on the show. Best of luck at Stripe. That’s a new adventure. So super exciting. And congrats.

Jeff Chao  1:00:25

Thanks again, I’ll see you guys later.

Eric Dodds  1:00:27

That show felt like it passed in three minutes. But we talked for an hour or so. There was so much in there. I think my big takeaway, we’ve talked about CDC multiple times on the show, but I think I may have had a little bit of a lightbulb moment or sort of an epiphany, thinking about how potentially flexible CDC is for event data. I’m a huge fan of event data. And I don’t hide that, of course. But it just seems really interesting as a technology. And before I thought it was interesting. Now I would say I’m probably bullish on it as sort of a core piece of the stack that we see developing over the next five years. So that’s my takeaway and my prediction.

Kostas Pardalis  1:01:12

Absolutely. I think that CDC is a term that we are going to be hearing more from now on, although, as Jeff also said, it’s not something new, like it’s been around for quite a while. But it has been a little bit more of like an esoteric kind of pattern that like bigger companies always used. But I mean, okay, the overall conversation, the whole conversation that we have observed was like, amazing, and I’m pretty sure that like we could have a separate episode for each one of the things that we discussed with him. But what I would like to mention and remind, and what really impressed me is two things. One, how he was talking about technologies and why they were built, how they were built based on the concept of trade-offs. And what are the trade-offs? And how many times when we asked him about, is this how things are going to look like in the future? Is this like the right direction? Is the other the right direction? His response was, let’s ask Eric, who is going to be using it, right? And these two characteristics are like the characteristics of like, like a very experienced and good engineer. Like that’s at the end how you build technology, because technology has to serve someone, right? Like we don’t build technology for the sake of technology. And so that’s like, what I want to keep at the end of this conversation, outside of all the amazing information that Jeff shared with us about all these amazing companies and technologies that he has worked with so far.

Eric Dodds  1:02:44

Yeah, absolutely. I mean, to hear someone who has worked on a project like Mantis, and has been deeply involved in really cool advancements in tech like CDC, step back and ask the question, am I building the right thing? is really great. I think that’s just a really healthy reminder for all of us. I appreciated that. Great. Well, thanks for joining us. Tons of great episodes coming up; subscribe if you haven’t, and we’ll catch you on the next one.

Eric Dodds  1:03:18

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.