Episode 91:

The Future of Streaming Data with Stripe, Deephaven, Materialize, and Benthos

June 15, 2022

This week on The Data Stack Show, Eric and Kostas chat with Jeff Chao, a staff engineer at Stripe; Pete Goddard, the CEO of Deephaven; Arjun Narayan, co-founder and CEO of Materialize; and Ashley Jeffs, a software engineer at Benthos. Together they discuss batch versus streaming and the transition from traditional data methods, and they define “streaming ETL” as they push for simplicity across the board.

Notes:

Highlights from this week’s conversation include:

How we should think about batch versus streaming (6:02)
Defining “streaming ETL” (9:34)
A brief history of streaming processing platforms (22:07)
The birth and evolution of Benthos (28:41)
What led Jeff to build a new tool (34:29)
Why you shouldn’t share all the data (37:23)
Making streaming technologies approachable to engineers (42:09)
Breaking out of traditional terminology (52:58)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Kostas, I love our panel discussions because they have so many different perspectives. In the live stream today, the panel is full of streaming experts. We have people who have built open source streaming projects, someone who has built streaming infrastructure at Stripe and Netflix, and the founders of several really cool streaming tools, which is super cool.

Here’s the burning question that I want to ask. Streaming is becoming increasingly popular, and the technology around it has grown significantly, right? If you think beyond Kafka, Kafka is kind of the de facto, right? But there are so many new technologies that have emerged. So what I want to ask these people is how we should think about the difference between batch and streaming. Is that even a helpful paradigm as we look out into the future of how data should flow through a system? So that’s what I’m gonna ask. What about you?

Kostas Pardalis 1:20
Yeah, actually, I want to ask what the title of the panel says: what’s the future of streaming? The industry has both used and developed a large number of streaming platforms over the past decade or so. So what’s next? What do we need to build, and what kind of use cases are we addressing that we couldn’t address in the past or in the present? Because all of these guys are building some amazing new tools and bringing some new paradigms in how to do stream processing. And I’d love to learn the why and the how, and how this connects with the previous generation of stream processing platforms.

Eric Dodds 2:03
Well, let’s do it.

Welcome to The Data Stack Show. I have been so excited about having this group of panelists because we had such fun conversations with all of you about streaming. Today we get to talk about all sorts of stuff as it relates to streaming, the future of streaming, etc. So before we get going, let’s just do some quick intros for those people who have not listened to the episodes with all of you. So I’ll just call your name out and go in the order that you are in the zoom box. So Jeff, do you want to kick us off?

Jeff Chao 2:37
Thanks. Hey, everybody. Good morning. Good afternoon. I’m Jeff. I’m an engineer at Stripe, and I work on stream processing systems. Right now I lead Change Data Capture at Stripe. Before Stripe, I worked at Netflix for a number of years, where I also worked on stream processing systems, worked on an open source project called Mantis, and also worked on other infrastructure such as Flink and Kafka.

Eric Dodds 3:00
Awesome, thanks so much. Pete.

Pete Goddard 3:02
Hi, everybody. I’m Pete Goddard, CEO of Deephaven Data Labs. My history is in the capital markets, so think of Wall Street meets systems meets quantitative math and things like that. I run Deephaven, which is an open source query engine built from the ground up to be very good with real-time data, on its own and in conjunction with historical or static stuff. It’s also a series of APIs and integrations and experiences that create a framework, because working with real-time data is pretty young, so we want people to be productive. Nice to be here today.

Eric Dodds 3:38
Thanks so much. Alright, Arjun.

Arjun Narayan 3:40
Hi, everyone. I’m Arjun Narayan. I’m the co-founder and CEO of Materialize. Materialize is a streaming database that allows you to build and deploy streaming applications and analytics with just standard SQL. Materialize is built on top of an open source stream processor called Timely Dataflow. I sort of bring a SQL-partisan perspective to this panel. Before this, I was an engineer at Cockroach Labs working on CockroachDB, which is a scale-out SQL OLTP database, and before that I did a Ph.D. in distributed systems. I’ve been in the distributed systems and database world for quite a while, and I’m excited to be here.

Eric Dodds 4:20
Awesome. Love that you’re getting the bias out upfront. It’s great. SQL partisan.

Arjun Narayan 4:26
I think a few people needed the disclosure early on before I got firing.

Eric Dodds 4:33
Love it. All right, Ashley.

Ashley Jeffs 4:34
Ashley Jeffs. Hi, everyone. I maintain the open source streaming ETL service, I think we’ll call it today, at Benthos, and that includes writing code, pull requests, building emojis, and mascots. And I’ve been doing that for five years and before that my life had no meaning.

Eric Dodds 4:54
The emojis are the most important part of the project.

Ashley Jeffs 5:00
Pretty much. That’s how your project is a success.

Eric Dodds 5:03
That’s exactly right. This bag is actually great. Well, Ashley, you actually used “streaming ETL,” which is a really interesting phrase. So what I want to start off with is a discussion on how we should think about batch versus streaming. There’s a lot of new streaming technology out there, and people are working on all sorts of different data flows. The standard stack has a lot of batch and has maybe implemented some sort of streaming capabilities. But I’d love to hear from each of you your perspective on how to think about batch versus streaming in this new world where we have a lot of streaming tools. And Jeff, I think I’d actually like to start with you, because you’ve built a lot of these systems at large companies, and you’re sort of a practitioner doing the work inside of a company. So what does that look like for you in the work that you’re doing?

Jeff Chao 6:00
Yeah, it’s come a long way but it’s still quite nascent, “it” being stream processing. I like to segment it into two parts. One is the developer experience, and the other is the user experience. In terms of developer experience, people are now sort of converging around SQL-like tooling. There’s also declarative tooling as well: the tooling to get you started on writing stream processing applications, and then deploying them, maintaining them, observing them, etc. That’s still got quite a ways to go, especially compared to batch, given that batch has had more community involvement in terms of bringing it up in the open source and improving on the tooling, though it’s certainly a lot better than it has been in the past. The second part is user experience. From the user’s perspective, they just want to access their data, whether it’s from the streaming or batch world, whether they’re doing streaming or batch computations. The challenge is that the semantics are a bit different, so the output or the behavior of a streaming application might not necessarily be intuitive to the end user. For example, in streaming, things happen in a more real-time fashion, so windowing comes into play, whereas in batch the windows are much larger. That plays into the last part of my thinking, which is use cases, specifically with SLOs as the requirements for those use cases, the SLOs being freshness, correctness, coverage, and throughput. As an example, at Stripe I work in the finance and payments space, so correctness is pretty important. Freshness was less so, originally, but it became more important because expectations only go up. Originally, Stripe had a lot of systems in batch, but we were recomputing the world every day, or every so often, and it just got very expensive.
So then we tried to go the angle of reducing the cost for that use case. After the cost is okay, well, we want things quicker, because expectations only go up. So we want more freshness. Okay, the freshness is there, but now we want things to be correct. And so now we’re doing that. Netflix was a bit different. I worked on a system called Mantis, where we were basically trying to make sure that the streaming systems could stay up: when you hit play, it works, and it works well, and the bitrate is there as it should be. Mantis took a different angle, where it offered a SQL-like user experience but made a certain set of tradeoffs, which we can go into another time or later, to make sure that we could compute all of these streams in a very cost-effective manner without losing the ability to gather insights. And Arjun knows exactly what I’m talking about here with Materialize. So that’s it: developer experience, user experience, and use cases, with use cases led by SLOs, which are throughput, correctness, freshness, and coverage.
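
To make the windowing point concrete, here is a minimal Python sketch (not from the episode) of a tumbling-window count. The event shape `(timestamp_seconds, key)`, the sample events, and the function name are assumptions made for illustration; real stream processors such as Mantis or Flink also handle late data, state, and distribution, none of which is shown. The same function applied with a one-minute window and with a one-day window shows how a streaming-style result and a batch-style result differ only in window size.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs (hypothetical shape).
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: seconds since the start of the day, plus an event type.
events = [(0, "play"), (12, "play"), (61, "pause"), (75, "play"), (130, "play")]

per_minute = tumbling_window_counts(events, 60)      # streaming-style small windows
whole_day = tumbling_window_counts(events, 86_400)   # batch-style single large window
```

With small windows the consumer sees several partial answers over time; with the day-sized window it sees one answer, later, which is the difference in semantics that can surprise end users.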

Eric Dodds 8:53
Yeah, totally. Well, I love the perspective of the user demand, right, and sort of streaming being driven by user demand for data that is more real time, which is interesting, right? And that’s sort of a self-fulfilling cycle where you get data faster, and then you want it faster, and then you want it faster, which is interesting.

Ashley, let’s jump to you. Do you want to explain what you mean by streaming ETL? And I think for a lot of people, those are two separate terms, right, as they just think about their practical day-to-day work. Well, ETL handles— we have ETL jobs that run our batch stuff, and then we have streaming stuff, but you put the two terms together. Can you dig into why you did that?

Ashley Jeffs 9:32
Yeah, so my introduction to data engineering was streaming, because we had data volumes that were too big to run as a batch. We were basically selling that data, and it needed to be dealt with continuously. At the levels we had, to execute that as a batch job we would have needed a stupid amount of machines to run it, so it was kind of a defensive move, which I think is why some people early on resorted to streaming. It was about that time that Kafka sort of arrived, and we started seeing that in the wild. But the main takeaway: you mentioned an ETL job, and a batch job is a process that I would describe as one where you can have humans in the loop. It’s something that runs at some cadence and has an execution time that means it’s realistic to have a human watching it happen, and maybe even able to intervene. Therefore the tooling doesn’t necessarily need to be that reliable; it doesn’t need to be as self-fixing or self-correcting. Whereas when you’re in a space where the data is continuously moving, and it’s moving at a volume that you can’t deal with in a batch context anymore, you don’t have the resources for dealing with a backlog that large. A day’s worth of data is now a massive cost to the business to process offline. Now you’re in a position where you can’t always have a human watching it happen, and if something does occur that is an issue, they’re not going to be able to deal with it in a way that doesn’t cost the business a huge amount of money. So the requirements for the tooling change: it’s much more important that it’s able to keep that process running for as long as possible with as little intervention as possible. And when things do go wrong, it needs to be self-fixing, to the extent that it’s not going to take you a silly amount of time.
And I think what’s happened over time is that there were people like me, in my position, who had to stream, and they basically had to reinvent all these ETL tools, because that’s all we were doing. We were just taking data and transforming it: enrichments, that kind of thing, filtering, maybe a bit of mapping. And then we were just writing it somewhere else. So that is an ETL job. Traditionally you wouldn’t have so much data, so you could just do that once a day and it would run for an hour, that kind of thing. But we were in a position where that wouldn’t work, so we had to reinvent all these tools. And in the process of doing that, which I would imagine my colleagues here are familiar with, you have to make your tool that much better at dealing with various edge cases and problems, because you can’t expect a human to be able to just come in and fix everything for you. I think one of the repercussions of that is, because the tools are becoming so much more autonomous and able to deal with all those problems, people who have batch workloads that don’t necessarily need to do things in a streaming way can look at these tools and go, “Oh, that looks a lot easier. That actually looks like I have to do less stuff in order to have that do my workload.” So I call it streaming ETL because I see a lot of people using Benthos who have a batch workload, right? They don’t have a volume of data that requires continuous real-time flow. They’d be quite happy with just an hour-long job run daily, or even weekly, and they’re choosing to use Benthos just because they know that they can leave it running, and if there’s a problem, it’ll fix itself. It’s not going to require some sort of intervention.
And they don’t need to think, “Oh, have we checked it executed today?” and check the status of that. They just know that they’ll get alerts if something’s gone wrong. And if something has gone wrong, they probably just need to restart it or add more nodes, that kind of thing. So I call it that because I don’t want people to look at streaming projects and think, “Oh, but that’s not for me, because I don’t have real-time payloads or requirements that would necessitate something that complicated.” Because really, at the end of the day, it ends up becoming operationally simpler in a lot of ways than some sort of large batch workflow tool.

Eric Dodds 13:31
Super helpful perspective. Okay, Arjun, I’m gonna jump to you, and I’m gonna add a modification to the question. With that context, where you don’t necessarily have to have a streaming service running something, because the basic parameters, sort of the SLAs around batch, are fine, but it simplifies a lot of things, I’d love for you to speak to that. But also, with the added question: do you think there’s a future where batch kind of goes away because of those techniques, right? The tooling gets way better, and it’s actually just easier to have continuous streams.

Arjun Narayan 14:05
So I think it’s worth thinking about the use case carefully, in that there’s a big difference between a human in the loop and no human in the loop. When there’s a human in the loop, there is a much higher latency that we can tolerate, because there’s only so frequently that a human’s going to look at a dashboard or react to a new data point, right? And humans go home, they sleep, and you can run the batch job. Then there’s a big difference, a phase shift, that comes about when you start doing things in an automated fashion, where powering automated actions off of batch workloads starts to immediately feel too slow. You have the data, you know what the data point is, you’ve collected the events. Say the customer’s on your website and they’ve taken an action, and then it’s going to take you something like 10 hours for the ETL to crank through, for the batch workload to finish, and then for you to take some action. That immediately feels like an unacceptable amount of latency. And that’s oftentimes the big differentiator between streaming and batch: it’s when you’re doing these automated actions. Most companies get introduced to this workflow because they start to do email marketing. They have some data, they have some actions, they segment their users, they decide they want to send email marketing. And that’s actually fine in batch. But you quickly realize the difference between taking action when somebody is on your website versus the next day. I’m sure we’ve all had that experience where you’re searching for a mattress. You go online, you spend about two hours, there’s a two-hour window of trying to find the perfect mattress. Then you find the perfect mattress and you hit checkout. The next day, you come back on the internet, and it’s mattresses, mattresses, mattresses. It’s too late, right?
It’s because all those batch jobs just finished overnight, and then they decided you’re in the market for mattresses. And your perspective is, “I was, and I’m not buying another mattress for another decade, please go away.” Eventually they realize that the moment has moved on. It’s these automated actions, this sort of personalization, this segmenting of the customer, where streaming really delivers outsized returns, because there is no human in the loop.

Pete Goddard 16:15
Yeah, so I love what you said there, Arjun, and I think it really speaks to my perspective on streaming. Deephaven was founded because we thought we saw the future. Our team comes from the capital markets, and unlike most of the people in this space, frankly, people in the capital markets know it is not “a batch world, and now there’s streaming.” We know that the opposite is true. We know that ever since the capital markets went electronic, in the late ’90s in Western Europe and in the US, you make all of your money in streaming. There is no such thing as trading an hour ago. There’s no such thing as “I want to buy a stock yesterday.” I can’t do it. I can only react to things right now. So all of Wall Street has been automated for a couple of decades around real-time data first. Of course you need context, of course you need history, of course you need state, so that real-time streaming technology has been married to historical, static batch data sets as well. I think the change in the last 10 years, and certainly the change that Deephaven is trying to be a part of, is: how do you move that from being bespoke code that is written by very, very sophisticated developers, frankly the elite players in the tech groups across Wall Street? How do you migrate it from being those custom solutions to instead being general-purpose streaming technologies? I think many of the technologies that you guys are mentioning here are relevant. There are the transport buses that obviously have become quite popular, with Kafka-compatible stuff, and ZeroMQ and RabbitMQ and Solace and Chronicle Queue and all of these types of things. All you’re doing is really taking what we’ve been doing on Wall Street for a long time with our custom solutions and making it open source, general form, etc.
And we think that this is the future for many other industries. What all of you are saying is what we agree with, and that is: your real-time data is the most valuable stuff. Whether real time to you means a millisecond or real time to you means 10 minutes, I don’t care. Let’s call that real-time. And what Jeff said about that data being available to a user the same as any historical data is entirely relevant. So what we think is important is that the same methods need to work on streaming and dynamic data as they do on batch and static data. Good technologies will make it so that the user doesn’t have to really care. They can just use the same stuff, and it will work on both, it can all be joined, and you can just get to work on data and stop thinking about streaming versus batch. And that’s the way Deephaven organizes it.

Arjun Narayan 19:05
Pete, if I may jump in, I love the background that you just gave us, because you’re absolutely right that Wall Street was doing streaming before anybody else. If you look back, before Kafka and RabbitMQ there was TIBCO. And before any of these stream processing frameworks, there was kdb+. These were sort of the pioneering ways in which to express computation and move data around. But the flip side is that it required these extremely specialized skill sets, right? You have all these banks where there were these three kdb programmers, and they were writing this almost hieroglyphic programming that nobody could understand, and everything sort of ran through them. What I love about Deephaven is that it’s really bringing in the modern data science, Python, and R communities, the communities that have not had access to streaming and have been living in a pure batch world, and bringing the best of both worlds together. I obviously bring a very different set of backgrounds to streaming. I’ve never, ever worked on Wall Street. But I have the greatest respect, because those guys really were first to implement these technologies and push them to production. Whenever we do interact with them, they bring a wealth of expertise and experience, and they speak with much more technical precision than, I think, we do in Silicon Valley, including a sort of deep theoretical understanding of things like bitemporal and multi-temporal queries and how one can express those computations.

Pete Goddard 20:40
It’s sweet of you to say. I’m not a computer scientist, and I’m not well-versed in the history of how all this has come to be in the last few decades, so the battle between streaming and batch is somewhat lost on me. It seems like semantics to a certain extent, because when we think about architecture, we think of first principles. Fundamentally, we just think that data changes, and your architecture should expect data to change. I know that you, Arjun, very much embrace something similar there. Rather than taking pictures of data, where you then need to compute on a whole new batch set that has come in, have your architecture organized in such a way that when new stuff comes in, you are able to react to it to serve your use cases. That’s the architecture you should put in place. If that means you want to use the word “stream” for that, fantastic, but you can use whatever terminology you want. Fundamentally, we think you should embrace the first principle that your data is going to change, and you should be organized accordingly.

Kostas Pardalis 21:52
So guys, I think that’s a great opportunity to have a little bit of a history lesson, let’s say. It would be great to hear from you how you have experienced the evolution of the stream processing platforms. I remember, for example, at the beginning of the previous decade, we had Spark Streaming at some point, then we had Twitter coming out with Storm, and Samza. There was some kind of explosion, at least in open source projects around stream processing. New architectures were coming out, like the Lambda architecture, for example, where people were trying to put batch together with streaming. Then we had the Kappa architecture with Kafka, trying to merge everything into a streaming platform and also give some primitives that are more native to batch. Obviously, there’s a lot of value delivered through all that stuff. Some technologies survived; some are now obsolete. But we have companies like Confluent going public, and people making money out of all that stuff, which is always good. So what’s happened this past decade, let’s say, and also what do you feel is next? Because all of you, one way or another, are working on building the next iteration of streaming platforms, so it would be great to have this connection. Let’s start with Arjun.

Arjun Narayan 23:24
Thank you. Yeah, I think what Pete articulated as the destination, the place we all want to get to, where batch and streaming are really implementation details, I completely agree with that 100%. I think we’re about 10% of the way there, right? It’s been a multi-decade journey. Of the projects you brought up, I think Storm was sort of the earliest open source project. The way I would articulate it is that the things you could do in streaming were a tiny, tiny, tiny, tiny subset of what you could do in batch back then, right? Precisely, on the implementation level, you couldn’t do any query or any stream processing that required your stream processors to maintain a lot of state. So you couldn’t effectively do lookups to historical data. In the algorithms world, streaming algorithms are precisely those algorithms that don’t maintain very much state, in a big-O sense. And that was really what was enabled by Apache Storm, right? So as a user you had this big trade-off: do I care about low latency, and am I willing to limit what I do computation-wise? Or do I need that complex computation, and am I willing to give up low latency and do it in batch processing? That was the trade-off you had to navigate. In terms of a Venn diagram, the use cases were pretty separated: there’s this giant circle, which is the things you can do in batch, and imagine this tiny, tiny circle off to the side of what you can do with streaming. Over time, as more capable stream processors have become available, those Venn diagrams have started to have a little bit of overlap, right? That is where I would put the Kappa architecture, which I think the Kafka folks articulated. Before that we had the Lambda architecture. So, zooming even further back, people didn’t want to make this trade-off. They were like, I care about low latency.
But I also want the fancy computation. So maybe what I can do, to get the best of both, is run a batch computation side by side with the streaming computation. I’ll do the fancy stuff in batch, and I’ll do some sort of approximation of the computation that I want in streaming. And I’ll keep periodically running the batch computation, because I’m going to diverge from the true computation that I want, since the streaming is an approximation. Then I’ll revert back to the batch, reset, and continue. And this was hideously complex, right? Now you’re running three different systems: you’ve got the batch, the streaming, and then the little thing putting it all together. The Kappa architecture was the simplification, saying: why don’t you just run everything through the stream processor? The problem with this, as Ashley has brought up, is that the developer experience is terrible, right? Everything in streaming is manual, writing lots of code and maintaining tons of infrastructure. Whereas the lived experience that people have in batch is: I just write a little SQL, or in the data science world I write this little Python program, and I get to harness the power of a horizontally scalable, reliable, massively parallel cloud architecture that works on terabytes and terabytes of data for me. You don’t have any of that in streaming, right? Streaming today is like, well, that’s great, now you’re on the hook for your schema changes, you’re on the hook for errors in your data stream, you’re on the hook for expressing the implementation. The nice thing about SQL, and it’s not just SQL, there are other languages for sure, is that you can express computation declaratively. But in streaming, you haven’t had that luxury. You’ve had to build out your own implementation languages.
And I think over time, and certainly our goal at Materialize, is to take the subset of batch computation that is SQL, which we think is a pretty large subset, and make that available to people in streaming, so that they can be streaming with just SQL. This does not cover all of the use cases, right? Just as you have Snowflake and Databricks, there’s a whole world out there that’s not SQL, for data science and machine learning, and we’ll absolutely need solutions on that side of the space as well. But I think we’re maybe 10, maybe 20% of the way there, getting to that promised land. The way I would rephrase what Pete said is: we need to get to the point where streaming is a superset of what you could do in batch, right? So that eventually we think of batch as a subset of the capabilities that we have in streaming. Does this mean the batch processors are gonna go away? No. I think there will absolutely be times where, for the same computation, you can choose: do I want a batch runner, because I have all that infrastructure already set up and maybe it’s a little bit cheaper, or do I want a streaming runner? You should be able to move computation back and forth between underlying infrastructures. I don’t think batch is going away entirely. But I want the batch-versus-streaming debate, decades from now, to be as much of an implementation detail as, “Hey, are you running this on a columnar or a row-oriented execution engine?” Actually, 99% of people do not care what the answer to that question is, right? They just experience it as, “Oh, this is great. I’m having a really pleasant developer experience.”
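
The idea that the same computation could run on either a batch runner or a streaming runner can be illustrated with a minimal Python sketch (not from the episode). It computes per-customer totals two ways: by recomputing over all events, and by updating a maintained result as each event arrives. The event shape `(customer, amount)` and the names `batch_totals` and `IncrementalTotals` are assumptions for illustration only; this is not how Materialize or Timely Dataflow are implemented, and it sidesteps joins, retractions, and distribution entirely.

```python
from collections import defaultdict


def batch_totals(events):
    """Batch runner: recompute every total from the full event history."""
    totals = defaultdict(int)
    for customer, amount in events:
        totals[customer] += amount
    return dict(totals)


class IncrementalTotals:
    """Streaming runner: keep state and fold in one event at a time."""

    def __init__(self):
        self._totals = defaultdict(int)

    def update(self, customer, amount):
        # Only the affected key is touched; nothing is recomputed from scratch.
        self._totals[customer] += amount
        return self._totals[customer]

    def snapshot(self):
        return dict(self._totals)


# Hypothetical payment events.
events = [("alice", 10), ("bob", 5), ("alice", 7)]

view = IncrementalTotals()
for customer, amount in events:
    view.update(customer, amount)

# Both runners agree on the answer; they differ in when it becomes available
# and in how much work each new event costs.
same_answer = view.snapshot() == batch_totals(events)
```

The batch version redoes all the work on every run ("recomputing the world"), while the incremental version has a fresh answer after every event, which is the sense in which batch can be viewed as one way of running a computation that streaming also supports.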

Kostas Pardalis 28:39
Yeah, it makes total sense. So okay, Ashley, I’d like to ask you. You said you were working in a company where you had to work with a lot of data and deliver the data in a real-time fashion. Why didn’t you use, I don’t know, Flink and Kafka or Storm, whatever was available out there back then? You decided to go and build your own solution, which ended up becoming Benthos.

Ashley Jeffs 29:09
It was because of the types of work we wanted to do. One of my early frustrations with stream processing tools was that there was this focus on the really high fruit in the tree, which was the actual queries. I was interested in single message transforms, where we take data and do a small modification to it. You can do that stuff with those tools, but we had latency requirements where it just didn’t work out. The other bit that was missing was being able to orchestrate a network of enrichments. We had lots of different types of data science-y enrichments that we needed to add to these payloads as they were passing through our platform, and there were non-trivial mappings between them, so we ended up with this DAG, a directed acyclic graph, of enrichments that we needed to execute in a streaming context, which meant as quickly as possible with as much parallelism as possible. You could do it with those tools as a framework, but we needed something that was going to execute those things. And what we were looking for was declarative, because we needed to slowly iterate on those enrichments, changing where they were, how they operated, and what kind of data they were giving us back and requiring, because they were being actively worked on, and slowly change the graph as new requirements came in. The idea of having to compile a code base every time that happened just wasn’t realistic. We wanted to offload some of that work to non-engineers, but have it still be owned by engineers. So we went down this declarative path, and that’s where Benthos essentially started: iterating on those key principles. But I mean, all of those tasks are something that you could do with a batch system right now.
And as we were getting these streaming tools that were kind of like an alternative to batch for the analytics part, we just didn’t have any options for the actual pipelining stuff. There are logging tools, because almost all engineers have logging as a streaming problem, and a logging tool is basically a stream processor, but then you don’t get the delivery guarantees and the observability and those kinds of things baked into it. So yeah, for me in my position, that felt like the biggest gap. That’s the bit that I focused in on: integrations with services, enrichments, transformations, that kind of stuff.
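
A minimal sketch of the idea Ashley describes: a declaratively specified DAG of enrichments, where independent enrichments run in parallel and dependent ones wait for their inputs. This is not Benthos configuration or Benthos code. The enrichment functions, the field names, and the `ENRICHMENTS` spec format are all hypothetical, and real enrichments would typically call out to services rather than compute locally.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical enrichments: each reads the message so far and returns new fields.
def geo(msg):
    return {"country": "GB" if msg["ip"].startswith("81.") else "unknown"}

def profile(msg):
    return {"tier": "pro" if msg["user"] == "alice" else "free"}

def score(msg):
    # Needs the fields produced by both geo and profile.
    ok = msg["country"] == "GB" and msg["tier"] == "pro"
    return {"risk": "low" if ok else "review"}

# The declarative part: enrichment name -> (function, names it depends on).
# Changing the graph means editing this mapping, not the executor below.
ENRICHMENTS = {
    "geo": (geo, []),
    "profile": (profile, []),
    "score": (score, ["geo", "profile"]),
}

def enrich(msg, spec=ENRICHMENTS):
    """Run every enrichment in dependency order; ready ones run in parallel."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in spec.items()})
    sorter.prepare()
    result = dict(msg)
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # enrichments whose dependencies are done
            snapshot = dict(result)     # each one sees the same input
            outputs = pool.map(lambda name: spec[name][0](snapshot), ready)
            for name, fields in zip(ready, outputs):
                result.update(fields)
                sorter.done(name)
    return result

enriched = enrich({"user": "alice", "ip": "81.2.69.142"})
```

Here `geo` and `profile` run concurrently and `score` runs once both have finished. This simple executor waits for each whole batch of ready enrichments before starting the next one, which is easier to read than a fully event-driven scheduler but leaves some parallelism unused.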

Kostas Pardalis 31:44
How did the Benthos project evolve, especially from the perspective of the use cases, from when it started inside the company up to today, where you’re maintaining a project that is used by so many people outside the company it was initially intended for? What’s your experience there? How did you see things change, taking a stream processor built for a very specific use case that you had there and seeing it adopted, I would assume, for different use cases?

Ashley Jeffs 32:23
We already had a bit of practice with making something like that generalized, so having a configuration schema for expressing enrichments as integrations with arbitrary things. So you have a language that can communicate interacting with an HTTP service, or a Lambda function, or a database query, and so on. I’d kind of dabbled with it, so I could see where the problems were in generalizing that stuff. You end up just biting off a bigger chunk of how much you’re going to abstract over that stuff while it’s still user friendly. Once I got that, it reached that nice Goldilocks position of being super easy for people to pick up quickly and run with, but also, when somebody comes to you and says, “Hey, I have this thing that it can’t do yet, can you make it do that?” you can really quickly add that in. So you’re not immobilized by it. It’s not so generic and so generalized that you can’t make things easy for people, but it’s still easy and intuitive for its existing use cases. That’s mostly just the config. It’s basically the way that I’ve decided to chop up the concept of batching: what a message is, what a batch is inside the process itself, and what the responsibilities of the various component types are. What is a processor? What is a transform? What is an output? How do you compose those things? But the core concepts haven’t really changed that much since day one, and that’s because I’d already had practice with a few attempts at generalizing that kind of thing. After that point, it literally just becomes a case of five years of people going, “Oh, but what if it could just do this extra bit as well?” And I’m like, “Okay, let’s do it. Let’s put that in.” And never saying no.

Kostas Pardalis 34:06
That’s cool. Okay, Jeff, your experience at Netflix: again, there were all these tools out there, right? Why did you end up building Mantis? What was the reason that made you say, okay, we need something new, whatever tools are out there, they do not fill the gaps that we have here?

Jeff Chao 34:29
So Mantis fundamentally is a stream processor, but the use case that started it all was basically trying to reduce downtime, or MTTD, and, like, mean time to insight: understanding why something is happening, like something negative happening to the service. You have your traditional telemetry stacks, like metrics, logs, and traces. And certainly at Netflix at the time we did use that, but after some point it just got too expensive. If we wanted the most granular insight possible, we would have to log everything all the time, aggregate it, store it, and hopefully TTL it out the other end. We needed something that could get us that cost effectiveness while upholding the granular insights. To give an example, petabytes of data were going through the system, and most people would filter out like 99.9% of it. So one of the great use cases, I’ll give you all an example: Mantis gives you the ability to say, if there’s a playback issue, like someone’s having a problem playing a title, we can get insight into which country they’re coming from, which title, which episode, which device, and the versions of the device, and then you can see exactly what’s going on. That’s a very, very targeted query for a specific thing. And you don’t pay the cost of that query if it’s not in use, basically. So it’s more of a reactive model, like the reactive streams model. Another example: when we were decommissioning a Netflix app for some devices, we could see exactly which users were using it over time, and then maybe tell them, “Hey, we’re going to decommission this app for this device, please go do something else.” So it was more of a real-time use case, like, what’s happening here.
And now another example is when we do regional failovers, which happens quite regularly. You can see the number of people hitting play dip in one region and increase in another region. And then when we fail back, it goes the other way. So it’s used for a lot of tooling. But then the other feedback we’ve gotten was, “Okay, well, what if I don’t want real time? What if I want to look at something in a batch case, or in a snapshot of data, or join that historical context with the streaming set of data to make some sort of decision later on?” Our answer to that was, okay, we would just build a sink, so sink it out somewhere else, and then you would have to do basically a stream-table join later on. But that’s another story. So it all started with observability, basically.

Pete Goddard
Jeff, I’ve heard you speak before, and one of the things that you say that I think is really interesting, and very important, actually, for the stream community is this idea of not sharing all of the data. I think that’s one of the founding principles of Mantis. And I think, for stream processing players that are used to just receiving a firehose queue and then having to do things about it, you’re approaching that a little bit differently. I think that’s an important concept because you can also be moving a lot less data around your system or your mesh, which is a principle that I think Arjun also holds dear. So can you talk to us about how that came about as a principle and how you’ve executed that in your system?

Jeff Chao
Yeah, there are three parts. First, we encourage developers to publish all of the data that they can, so an event might have hundreds of fields, and it might be very large in size. But we also have it so that the infrastructure doesn’t consume all of that data by default. You, being the user, have the ability to do that, but you have to be very intentional in doing so. Certainly, there are some long-running jobs that basically perform a select star with a 100% sample, but maybe that’s a low-volume stream or something like that. Most people would either do a sample, or select some of the fields, or even filter some events out based on a condition. So the first is: publish everything, but be very intentional about what you consume. And then the second is reusing subscriptions, basically. If multiple people are asking for the same set of data, and you already have that in memory, just don’t go all the way up to the source of the data, like these applications. Just send what you have down to the people that are subscribed with the same pattern of data that they’re asking for. So really, you’re getting a lot of reuse, and you’re getting a lot of intentionality: consume only what you need. Because in reality, you don’t want all of the data. And if you do, then you can do that if you want.
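
The two principles Jeff lists can be sketched roughly as follows, assuming a toy in-process source. This is not the Mantis API: the `Source` class, its methods, and the event fields are hypothetical names chosen for illustration. Publishers hand over the full event, each subscription declares the fields and filter it needs, and subscribers with an identical query share one evaluation.

```python
from collections import defaultdict

class Source:
    """Hypothetical event source: full events in, only requested data out."""

    def __init__(self):
        # One entry per distinct query; identical queries share the entry.
        self._subscribers = defaultdict(list)  # (fields, predicate) -> callbacks
        self.evaluations = 0  # counts query evaluations, for illustration only

    def subscribe(self, fields, predicate, callback):
        self._subscribers[(tuple(fields), predicate)].append(callback)

    def publish(self, event):
        # With no subscriptions this loop does nothing: an unused query costs nothing.
        for (fields, predicate), callbacks in self._subscribers.items():
            self.evaluations += 1
            if not predicate(event):
                continue
            projected = {f: event[f] for f in fields}  # only requested fields leave
            for callback in callbacks:  # fan one result out to every subscriber
                callback(projected)

def playback_errors(event):
    return event["type"] == "playback_error"

source = Source()
alice_inbox, bob_inbox = [], []
# Two consumers ask for exactly the same thing, so they share one subscription.
source.subscribe(["country", "title"], playback_errors, alice_inbox.append)
source.subscribe(["country", "title"], playback_errors, bob_inbox.append)

source.publish({"type": "playback_error", "country": "BR", "title": "t1",
                "device": "tv", "payload": "x" * 1000})
source.publish({"type": "heartbeat", "country": "US", "title": "t2",
                "device": "phone", "payload": "y" * 1000})
```

Both inboxes end up with the single projected playback error, and the query was evaluated once per event rather than once per subscriber. Sharing here relies on both subscribers passing the same predicate function object; a real system would need some way to recognize equivalent queries.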

Pete Goddard 39:23
Yeah, we hold the same things dear. We use the phrase, “we move data around the system in lazy fashion.” Our APIs allow the producer to have quite a bit of information from the consumer in terms of what it wants from the table and at what update frequency it wants it, because there are different use cases. Some of them are throughput-driven, some of them are latency-driven, and some of them are consistency-driven. And then we also hold on to what you just talked about in regards to sharing work products. We memoize data, so if you have a few consumers that want the same thing, you’re able to share it, because oftentimes they scale out to a few thousand consumers. And so there’s less work that needs to be done in these types of use cases that are really evolving quickly in the streaming world, we think.

Kostas Pardalis 39:23
So, guys, one of the things that surfaced through the conversations that we have is that something quite important around stream processing is the developer experience. It’s something that still needs a lot of work, and it’s probably one of the things that the previous wave of technology out there didn’t pay as much attention to as it should have. And my question starts a bit with you, because you’re offering a product that exposes an interface to the users that’s, let’s say, very familiar to a very specific group of people that are not necessarily extremely technical, in the sense of the engineering of the systems. So I want to ask how we can take streaming and make it, in the end, more accessible to engineers out there, developers who don’t necessarily know, or have to know, about delivery semantics, about producers and consumers, all these things that you have to understand. When you’re interacting with a database, you don’t think in these terms, right? And a database is what most developers have been exposed to. So how can we bridge all these primitives that the new language of streaming is bringing to the table with what is most commonly known out there to developers? It’s one of the questions that we also got from one of the attendees: okay, you can, let’s say, change the terms, extract to publish, load to consume, but in the end, is this what is needed? Do we need to change our language and educate people, or can we do better?

Pete Goddard 42:01
So I have a bit of a radical answer to that. It’s a little bit self-serving, so I hope you’ll pardon my indulgence here. We actually introduced a new primitive that we think is important. We think event streams and message queues are vital, and we understand that that is the foundational building block of the conversation we’re having today. We could not embrace it more heavily. However, we don’t think it’s the whole story. We think that the most intuitive object for people to work with for data-driven applications, for data science, for data analysis, is the table, the data frame, right? I’m not making this up. This isn’t Pete Goddard’s opinion. Walk around and see how many people use Excel, go survey SQL, look at R data frames, look at Python. pandas is not an accident. They’re all tables or something darn close to a table. So what we’ve done at Deephaven, and we put it out in open source, is we’ve really tried to create a dynamic table. Tables should not be static; they should be things that can change, right? That should be true at the API level, that should be true at the engine level, and that should be true at the user experience level, in the JavaScript APIs that are supporting that. So we think that is a very important proposal to the world, or to the community, in regards to how do I make it easier to work with streams. When you’re using Deephaven, you might integrate with whatever: you’ll copy-paste something we have from our documentation to onboard a Kafka stream or a Redpanda stream, or you’ll integrate Solace or something like that, right? Or our enterprise customers will push binary logs from Java or C++ applications of theirs. But once it hits Deephaven, it really is a table structure. And then we keep track of changes in the table.
And I think making it so that we can do really high-performance compute, because we’re only doing calculations on changes instead of on the actual data, which means less data and faster compute, is powerful. But it really works because we let people think in terms of tables, and we let people use tables. And oh, you want to do machine learning? Well, chances are you’re using tensors, which are arrays that are chunks or columns of tables. This construct of pushing a dynamic table around as a first primitive is really, really helpful for the user experiences you’re talking about. In addition, one of the things that this group is well aware of is what happens as soon as you get out of batch. Think of how much money, meaning market cap, is involved in the batch world, right? Just think of BI: you have Tableau and Looker and Power BI and Qlik and 12 other competitors, right? Give any of them a stream, and your experience is pretty much gonzo. So there needs to be significant investment to support those types of development, exploratory, interactive, and dashboarding experiences with real-time data. We’ve been very involved with that for a number of years, but we obviously will need a community to get it all the way there.
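
To make "calculations on changes instead of on the actual data" concrete, here is a small sketch of an aggregate that stays current by consuming only row-level changes. It is not Deephaven's engine or API. The `SumByKey` class, its `apply` method, and the sample rows are hypothetical, and the sketch assumes changes arrive as lists of added and removed `(key, value)` rows.

```python
from collections import defaultdict

class SumByKey:
    """Keeps sum(value) per key up to date from row-level changes only."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, added=(), removed=()):
        # Work is proportional to the size of the change, not the table.
        for key, value in removed:
            self.totals[key] -= value
        for key, value in added:
            self.totals[key] += value

agg = SumByKey()
agg.apply(added=[("AAPL", 100), ("MSFT", 50)])  # initial rows
agg.apply(added=[("AAPL", 25)])                 # one new row arrives
# A modified row is expressed as removing the old value and adding the new one.
agg.apply(removed=[("MSFT", 50)], added=[("MSFT", 60)])
```

After the three updates the totals are 125 for `AAPL` and 60 for `MSFT`, and no update ever rescanned the rows that did not change.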

Kostas Pardalis 45:30
Yeah. So Jeff, you mentioned that you implemented more of a reactive model there, right? And reactive programming is a paradigm that is, I guess, quite well known among people that are doing even front-end development. So what’s your opinion on how we can make streaming technologies more approachable to more engineers out there?

Jeff Chao 45:56
Yeah, that’s a good question. In my mind, there’s the user, and then there’s the developer. But what I’ve been thinking about a lot lately is that I’d actually written a stream processor very early on in my career. It looks like: take a list, call stream, call flatMap, call reduce, right? That’s an API that’s very standard in a lot of programming languages. Or take another example: you have a list, you have a for loop, you go through it, and you do something with it. And so, kind of like with reactive streams, you have an API that has your flatMaps and all that other stuff. From a developer perspective, that’s kind of just what I want to do. I know how to write a for loop. I know how to write a stream, flatMap, reduce, etc. And if I could just give that to somebody, and they’ll parallelize it, they’ll manage all the state, throw it in RocksDB and all that other stuff, that’s good for me, because I know what a list is and I know what a for loop is, and that works for me. Another example: in some of the stream processors, there are great APIs to do windowing and triggers and all sorts of stuff. But if you want the ultimate flexibility, there are things like just the process function, or basically a block of code that you write and run, and you put all sorts of variables in there. That’s pretty nice. That’s kind of just what I want as a developer.
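
The two styles Jeff mentions, a plain for loop and a flat map followed by a reduce, can be compared on an ordinary in-memory list. The word-count task and all names here are made up for illustration, and nothing below is parallelized or backed by managed state; it only shows the shape of the code a developer would hand to a stream processor.

```python
from functools import reduce
from itertools import chain

lines = ["play pause play", "stop play"]

# Style 1: a for loop over a list.
counts_loop = {}
for line in lines:
    for word in line.split():
        counts_loop[word] = counts_loop.get(word, 0) + 1

# Style 2: flat map, then reduce.
# Flat map: each line becomes several words, flattened into one sequence.
words = chain.from_iterable(line.split() for line in lines)

def add_word(counts, word):
    # Reduce step: fold one word into the running counts.
    return {**counts, word: counts.get(word, 0) + 1}

counts_stream = reduce(add_word, words, {})
```

Both versions produce the same counts. The second is written as independent per-element steps plus a fold, which is the form a runtime can more readily distribute and checkpoint on the developer's behalf.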

Kostas Pardalis 47:28
Yeah, makes sense. Ashley, what about you? You have built a tool from scratch, and you are interacting a lot with developers out there. So what do you think is missing from streaming infrastructure for it to become more approachable?

Ashley Jeffs 47:45
If you take a developer that’s innocent and hasn’t suffered at the hands of stream processing, and you invite them into your world of, “Hey, why don’t you do that with streams?” I think the immediate repulsion is, “Oh, that looks like a lot of hassle. That looks like it’s going to wake me up at night, and I’m going to have to do all kinds of stuff to get that back up.” And answering that anxiety is, for me, well, I wouldn’t say it’s a struggle. It’s more that when people get to me, they’ve already overcome that somehow. So I’m kind of seeing this relieved, “Oh, actually, this isn’t that bad.” But I think the operational side is still a nightmare. There are still a lot of moving parts. There are all the different services you mentioned. With an architecture of stream processing, for a team that has decided they’re going to lean into it heavily and everything’s going to be on the low-latency side, the number of moving parts is vast. Every component wants its own state, it wants its own claim to the disk, it wants its own thing that can fail. And the idea of being woken up at 3 am because one of your disks is corrupt, and you’ve got to sort out, okay, which services in my streaming platform have been suffering, how have they been suffering, how do I recover them from that state? It’s overwhelming, I think, for a lot of people, and it’s very comfortable to just say, “Okay, well, we don’t need that yet. We don’t need that kind of thing.” I think it’s kind of similar to Kubernetes and things like that, where it’s exciting to some people, but to a lot of people it’s too much hassle and feels like it’s not worth the investment. And I think it’s something that is improving over time. There are, obviously, alternatives to Kafka now.
There are companies like Redpanda, and lots of tools are coming along that try to, you know, simplify a lot of this and are nice and simple to use and just kind of slot into existing systems, which is great. So that’s what I’m hopeful for in the future: that a lot of those moving parts become less moving and more simple.

Pete Goddard 50:01
Isn’t the data itself and the use case itself driving this? I’m just unfamiliar, right? I don’t come from the same background as you. So the idea of, “Oh, this is a thing that’s working badly, should we try it in streaming?” doesn’t have much resonance with me. But, “Hey, there’s a new type of data that drives a new type of value to me, I need to keep up with my competition, I need to be responsive in a shorter amount of time,” or there are new datasets coming in where transactionality might not be as important, and therefore I can embrace a new way of doing it. From the people you talk to, is that oftentimes what’s driving the migration to dynamic data, real-time data, streaming technologies? Or is it really what you said at first, which was that you wanted to do batch, but you needed to do it at such scale that you kind of had to do it all the time? What really drives people into this dynamic data stuff?

Ashley Jeffs 51:01
It’s a bit of both. There are some people who have to, and they used to have something that they wrote themselves, and it’s dodgy, and they want something that’s not. And then there are a lot of people who are just using Benthos, which is a stream processor, to basically just read a batch workload, because you can read, like, SFTP files, and they like the fact that they can just run this thing and it’s always on. They only need this data to be refreshed once a day, but they just like the fact that this thing will always run, sorts itself out, and seemingly doesn’t require much effort. It’s got a nice little config that three people can be working on in source control. To them it’s a simpler world. And the job isn’t that complicated. It’s just hitting some endpoints, and maybe it has a fallback with a dead-letter queue or an alert. So there are those two extremes, and then there’s everything in between, as well as people who feel like, “We probably ought to make this more efficient, or streamy, but we don’t have to yet, until we find the thing that suits our particular requirements.” It’s kind of hard to say, but yeah, I see all of them. I see people from all walks of life coming and discovering that, actually, it’s not that bad. That’s the main take: actually, it’s not that bad.

Eric Dodds 52:19
That’s great. Okay, well, we are right at time here, but I want to get in this question really quickly, and we’ll see how many of you can answer. Arjun, we’ll start with you. Chris asked a great question. He said, “Coming from the traditional batch ETL world, I found using new vocabulary to describe things can help with thinking about how to do things in new ways. Examples: extract versus publish or load versus consume. Has there been any terminology that you use or have heard that you think has been helpful in terms of breaking out of the way that we talk about these things traditionally?”

Arjun Narayan 52:54
I love this question, and I’m going to take the completely contrarian point of view, which is, no. We should stop doing this. There’s so much amazing vocabulary and intuition in batch. I mean, it’s the entire reason we named the company Materialize, right? Which is, what are you trying to do? With the vast majority of streaming pipelines, you are trying to build a materialized view that stays up to date over changing data. Relating the difficult newness of streaming to the concepts that people are familiar with from batch, I think, has a tremendous amount of value. I know this is a bit of a contrarian hot take. It is very much a reaction to how needlessly complicated streaming has been for so long, and an attempt to simplify it in the terms that are most relatable to the audience. And we really see the audience as the 100x number of developers who have not tried streaming, have never poked at it, and are maybe even a little bit, rightly so, afraid of poking that bear. So I’m gonna take the contrarian take. Maybe I should have been the last to speak, but…

Eric Dodds 54:09
I love it. Jeff, what say you?

Jeff Chao 54:11
Yeah, no, I’m actually plus one on that. Just keep it simple. I sit in the change data capture space right now. Really, you’re just capturing changes from databases, aka extracting something from a database. Someone wants to transform that, and someone wants to load that somewhere else. So just keep it simple. I think ultimately, for me, it’s: are we talking about the technology, or are we talking about the use case? I think ETL is a use case: you’re going to extract something, you’re going to transform it, you’re gonna load it. As a technology, there are many things other than food that have unlimited definitions.

Eric Dodds 54:42
Yep, that’s great. All right, Ashley, you invented a new term called streaming ETL earlier on the call so what do you think?

Ashley Jeffs 54:49
I did not invent that. I promise you I’ve seen that somewhere else, and I thought, oh, maybe that fits. I’m not good to ask about this stuff; I’m really bad with it. Because, I mean, I didn’t think I was a data engineer for ages. I kind of discovered data engineering way late. I had a data engineering tool and I didn’t know what data engineering was, and all this stuff. I’ve had to kind of learn what they mean when they talk about ETL, because to me, that was just processing. Every program is ETL: it’s reading something, doing something, and then putting something somewhere else. I mean, yeah, I feel like all these terms are kind of super vague and kind of conveniently unspecific anyway, so I just adopted them for whatever I had going on.

Jeff Chao 55:36
I mean, just wait ’till you hear ELT.

Eric Dodds 55:42
The marketers are trying to confuse us. Alright, Pete with the last word.

Pete Goddard 55:47
I think my answer is similar. Mostly, I just don’t feel qualified to dictate language and nomenclature to people. Certainly, I come from a space where Pub/Sub systems feel pretty natural: you’re moving data from one server to another, and there are publishers and subscribers that form that data mesh. I don’t think that’s a sophisticated concept for them to embrace, and it might be an easy way to think of things. As others have said, I think the best thing we can do as a group is make it so that we can talk in English that a sixth grader can understand, right? So: I am moving data here, this thing wants this data, it is going to do that with it, etc. And whatever words you choose to coalesce your team around, that’ll work for us. The only term I would introduce is the one that I mentioned earlier, which is an invention of ours, which is this streaming table. Look, it’s got the same attributes as streams, plus a lot more use cases that you can handle semantically with that primitive, but it’s a table that changes. And probably when I say to a sixth grader, “Hey, it’s a table that changes,” they can generally understand what I mean. And that is exactly what it is.

Eric Dodds 57:10
Yep. Love it. Love the push for simplicity across the board. All right. Well, thank you so much, everyone, for joining. We’re a little over time so thank you to the listeners and to the panelists. We really appreciate your time and have learned so much.

Arjun Narayan 57:25
Thank you.

Pete Goddard 57:26
Thank you very much, everybody.

Ashley Jeffs 57:28
Cheers.

Eric Dodds 57:29
Kostas, we get to talk to a lot of really smart people. I think my takeaway was that through all of the different complexities that we talked through, right, use case complexities, technology complexities, different opinions, at the end of the day, when we asked them how to describe these things with the user question, everyone said, keep it really simple, right? And Ashley even said, “At first I didn’t know what ETL meant. I just called it processing: taking data from here and moving it over here. I was just processing data.” And I really appreciated that, because with all of the new technology and the new ways of thinking about things, I think it’s really healthy for us to step back and say, at the end of the day, we’re moving data. The technology can do some really cool stuff, but the fundamentals are actually not that complicated. And I think the other sub-point, I guess I’m forcing a two-for-one here, is that Jeff made the point that the end user who wants the data couldn’t care less. So those are just really, really good reminders for me. How about you?

Kostas Pardalis 58:45
Yeah, absolutely. I think what I’m going to keep from this conversation is that the experience matters a lot when you’re working with these tools, and there are two levels of experience there. One is the experience of the developer who is going to build whatever on top of these technologies. And then there’s also the user experience, which is that of the consumer of the data that comes out of these systems, right? For both of them, if we want to move forward and increase the adoption of these tools, we need to make sure that they don’t have to learn new terminology, and they don’t have to learn new ways of thinking and designing systems. We need to keep things familiar and simple as much as we can.

Eric Dodds 59:29
Yep. Well, another amazing live stream panel. We have more of these on the schedule, so be sure to keep your eye out. We’ll let you know when they’re coming in. We will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
